Incident report for Visma eAccounting, major outage Saturday 7th July 2018
We now have more insights into the root cause of the incident and want to share details about how we handled it and what actions have been taken to prevent and mitigate similar problems in the future.
- We identified and verified the incident as a major outage in Visma eAccounting at around 09:00 a.m.
- The first information was published on our service health status page, status.visma.com, at 09:22 a.m.
- Developers and infrastructure personnel from the Visma eAccounting team and our Operations team started working on the incident at around 09:45 a.m.
- At 10:30 a.m. we identified that the problem was related to an infrastructure issue. Visma eAccounting had reached the storage limit, and therefore the databases could not reclaim space and were impossible to reach. Users of Visma eAccounting noticed the problem when logging in to the service: instead of being signed in, they were automatically logged out and could not access the service.
- The first action when experiencing problems with storage limits is to scale up the production environment, an operation that we often carry out. We started this procedure at 10:45 a.m. Unfortunately, we were unable to complete it because we received error messages in the Microsoft Azure portal.
- We contacted our infrastructure provider Microsoft Azure for assistance at 11:30 a.m.
- In parallel with troubleshooting together with the Microsoft Azure teams, we initiated the process to move to our secondary data centre. This operation started at 12:00 p.m. Our secondary data centre is an identical copy of the first data centre, which ensures that no customers lose any data.
- At 1:30 p.m. Microsoft confirmed that the failover to our secondary data centre was in progress but that it was taking longer than expected. They also confirmed that they were experiencing problems with scaling up one group of databases, which in turn was blocking the full failover from completing. We needed assistance from engineers at Microsoft to finalise the failover.
- At 3:30 p.m. we could see that certain groups of databases were available again and that some customers were able to access and use Visma eAccounting.
- At 4:30 p.m. Visma eAccounting was up and running again and available for all customers.
- An incident report was created and an incident review meeting was held with key personnel and management at 10:30 a.m. on the 8th of July. During the review meeting, a plan was outlined with both short-term and long-term follow-up tasks.
Root cause:
The Visma eAccounting incident on the 7th of July was related to an infrastructure issue and capacity issues in Microsoft Azure. Visma eAccounting had reached the storage limit. The root cause is related to nightly backup routines that failed to clean up temporary databases, which in turn used up necessary storage space.
We immediately tried to scale up the max data size for this group of databases. Due to capacity issues in Microsoft Azure, the operation failed. At this point we tried to fail over to another data centre. The failover succeeded for all databases except four. These four databases blocked the failover and thus prevented it from completing. Engineers from Microsoft were already working on this issue when we contacted them. When Microsoft managed to unblock the databases, we successfully failed over to the secondary data centre.
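To illustrate the kind of check that could flag this situation earlier, here is a minimal sketch in Python. It is not the monitoring Visma actually runs: the `Database` record, the `temporary` flag, the 80 % and 90 % thresholds and the 24-hour age limit are all assumptions made for the example. It classifies storage usage against a limit and lists temporary databases that a nightly cleanup should already have removed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Database:
    """Hypothetical record describing one database in a storage group."""
    name: str
    size_gb: float
    created: datetime
    temporary: bool  # assumed flag marking temporary copies made by backup routines


def storage_alert(used_gb: float, limit_gb: float,
                  warn: float = 0.80, critical: float = 0.90) -> str:
    """Return 'ok', 'warning' or 'critical' based on the used share of the limit.

    The thresholds are illustrative defaults, not values from the incident.
    """
    if limit_gb <= 0:
        raise ValueError("limit_gb must be positive")
    ratio = used_gb / limit_gb
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


def stale_temporary(dbs: List[Database], now: datetime,
                    max_age: timedelta = timedelta(hours=24)) -> List[Database]:
    """Return temporary databases older than max_age, i.e. missed by cleanup."""
    return [db for db in dbs if db.temporary and now - db.created > max_age]
```

A scheduled job could call `storage_alert` with the current usage of each database group and raise an alert on anything other than `"ok"`, while `stale_temporary` would point to leftover temporary databases before they consume the remaining space.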
Summary:
- Visma have held incident review meetings where a plan was created with both short-term and long-term follow-up tasks. For example, we have improved monitoring, created new alerts and defined several other actions with follow-up tasks to mitigate similar incidents in the future.
- Microsoft have shared their root cause analysis with Visma and made recommendations and suggestions for improvements to mitigate similar incidents in the future.
- No data was lost and we did not experience any hacker attacks.
- We had not made any changes to our infrastructure or code close to the time this problem started to occur.
- Visma is now running Visma eAccounting in our primary production environment again, which we switched back to on the 23rd of July.
We apologise to all customers who were affected by this incident. This was an unfortunate set of circumstances for customers, partners and the development team behind Visma eAccounting.