Investigating an issue in eAccounting

Incident Report for Visma Cloud Services

Postmortem

Incident report for Visma eAccounting, major outage Saturday 7th July 2018

We now have more insights into the root cause of the incident and want to share details about how we handled it and what actions have been taken to prevent and mitigate similar problems in the future.

We identified and verified the incident as a major outage in Visma eAccounting at around 09:00 a.m.
The first information was published on our service health status page, status.visma.com, at 09:22 a.m.
Developers and infrastructure personnel from the Visma eAccounting team and our Operations team started working on the incident at around 09:45 a.m.
At 10:30 a.m. we identified that the problem was related to an infrastructure issue. Visma eAccounting had reached the storage limit, and therefore the databases could not reclaim space and were impossible to reach. For users of Visma eAccounting, the problem occurred when logging in to the service: instead of being signed in, users were automatically logged out and could not access the service.
The first action when experiencing problems with storage limits is to scale up the production environment, and this is an operation that we often carry out. We started this procedure at 10:45 a.m. Unfortunately, we were unable to perform this operation, due to receiving error messages in the Microsoft Azure portal.
We contacted our infrastructure provider Microsoft Azure for assistance at 11:30 a.m.
In parallel with troubleshooting together with the Microsoft Azure teams, we initiated the process to move to our secondary data centre. This operation started at 12:00 p.m. Our secondary data centre is an identical copy of the first data centre to ensure that no customers would lose any data.
At 1:30 p.m. Microsoft confirmed that the failover to our secondary data centre was in progress but that it was taking longer than expected. They also confirmed that they were experiencing problems with scaling up one group of databases, which in turn was blocking the full failover from completing. We needed assistance from engineers at Microsoft to finalise the failover.
At 3:30 p.m. we could see that certain groups of databases were available again and that some customers were able to access and use Visma eAccounting.
At 4:30 p.m. Visma eAccounting was up and running again and available for all customers.
An incident report was created and an incident review meeting was held with key personnel and management at 10:30 a.m. on the 8th of July. During the review meeting, a plan was outlined with both short-term and long-term follow-up tasks.

Root cause:

The Visma eAccounting incident on the 7th of July was related to an infrastructure issue and capacity issues in Microsoft Azure. Visma eAccounting had reached the storage limit. The root cause is related to nightly backup routines that failed to clean up temporary databases, which in turn used up necessary storage space.

We immediately tried to scale up the max data size for this group of databases. Due to capacity issues in Microsoft Azure, the operation failed. At this point we tried to fail over to another data centre. The failover was successful for all databases except four. These four databases blocked the failover and prevented it from completing. Engineers from Microsoft were already working on this issue when we contacted them. When Microsoft managed to unblock the databases, we successfully failed over to the secondary data centre.
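The root cause described above, temporary databases left behind by nightly backups, could in principle be detected with a routine inventory check. The following is a minimal sketch only, not the routine Visma uses: the `Database` record, the `backup_tmp_` name prefix, the 24-hour age limit and the sample inventory are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Database:
    # Hypothetical inventory record; not an Azure API type.
    name: str
    created_at: datetime
    size_gb: float


def find_stale_temp_databases(databases, now,
                              prefix="backup_tmp_",
                              max_age=timedelta(hours=24)):
    """Return temporary databases older than max_age, largest first.

    The prefix and age limit are assumptions made for this sketch.
    """
    stale = [
        db for db in databases
        if db.name.startswith(prefix) and now - db.created_at > max_age
    ]
    return sorted(stale, key=lambda db: db.size_gb, reverse=True)


# Illustrative inventory (made-up names and sizes).
now = datetime(2018, 7, 7, 9, 0)
inventory = [
    Database("customers_001", datetime(2017, 1, 1), 120.0),
    Database("backup_tmp_20180705", datetime(2018, 7, 5, 2, 0), 80.0),
    Database("backup_tmp_20180707", datetime(2018, 7, 7, 2, 0), 75.0),
]

stale = find_stale_temp_databases(inventory, now)
reclaimable_gb = sum(db.size_gb for db in stale)
```

In this sketch only the temporary database that is more than a day old is reported; the copy created during the most recent nightly run is left alone, since the backup may still need it.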

Summary:

Visma have held incident review meetings where a plan was created with both short-term and long-term follow-up tasks. For example, we have improved monitoring, created new alerts and defined several other follow-up tasks to mitigate similar incidents in the future.
Microsoft have shared their root cause analysis with Visma as well as made recommendations and suggestions on improvements to mitigate similar incidents in the future.
No data was lost and we did not experience any hacker attacks.
We had not made any changes to our infrastructure or code close to the time this problem started to occur.
Visma is now running Visma eAccounting in our primary production environment, which we changed back to on the 23rd of July.
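As an illustration of the kind of storage alert mentioned in the summary, the sketch below classifies storage usage against a limit. It is an assumption-laden example rather than the alerting Visma put in place: the function name and the 80% and 90% thresholds are hypothetical.

```python
def storage_alert_level(used_gb, limit_gb, warn_at=0.80, critical_at=0.90):
    """Classify storage usage as "ok", "warning" or "critical".

    The thresholds are illustrative assumptions, not values from the incident.
    """
    if limit_gb <= 0:
        raise ValueError("limit_gb must be positive")
    ratio = used_gb / limit_gb
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"
```

Raising a warning well before the limit is reached leaves time to scale up or clean up during normal working hours, instead of discovering the problem when the databases are already unreachable.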

We apologise to all customers who were affected by this incident. This was an unfortunate set of circumstances for customers, partners and the development team behind Visma eAccounting.

Posted Aug 07, 2018 - 12:54 CEST

Resolved

The incident on Visma eAccounting is now resolved and the service is up and running as normal. We have been monitoring the service for the last few hours and everything looks good. We will post a postmortem with information about the root cause in the coming week.

Posted Jul 07, 2018 - 21:21 CEST

Monitoring

The service is up and running again. This issue should be solved for all customers. We will keep monitoring the service during the coming hours to ensure high quality.

Posted Jul 07, 2018 - 16:38 CEST

Update

We are still working on the issue. We can see that some customers can now log in and access Visma eAccounting, but we continue troubleshooting and working on solving the incident.
The next update will come in about one hour.

Posted Jul 07, 2018 - 15:53 CEST

Update

We are still working hard to solve the incident as soon as possible. We have identified that it is related to problems scaling up our infrastructure environment. Our current hypothesis is that this problem was triggered by an unexpected peak during our nightly backup routines. We are working to solve it together with our infrastructure provider, and a failover to our secondary data centre is in progress, which can take 1-2 hours. We have not lost any data and there has been no hacker attack.

Posted Jul 07, 2018 - 14:22 CEST

Update

We continue to work on the solution and have completed the first part. We estimate that the service will be up in a couple of hours. We are sorry for the inconvenience. We are working as fast as possible to solve the incident, and as soon as we have new information we will share it here.

Posted Jul 07, 2018 - 13:10 CEST

Identified

We have identified the cause of the incident and are currently working on finding a solution. We estimate that a fix will be in place within 1-2 hours. The next update about the progress on this incident will be at 1 p.m. Our developers and infrastructure people are working together on solving this incident.

Posted Jul 07, 2018 - 12:04 CEST

Update

We have identified the cause of the incident and are currently working on finding a solution.

Posted Jul 07, 2018 - 11:08 CEST

Update

We continue to investigate this issue.

Posted Jul 07, 2018 - 10:44 CEST

Update

There are still problems accessing (logging in to) Visma eAccounting. We continue to investigate together with our developers and infrastructure provider. We hope to be up soon.

Posted Jul 07, 2018 - 10:18 CEST

Update

We are continuing to investigate this issue. The problems only affect Visma eAccounting/Visma eEkonomi/Visma ePasseli.

Posted Jul 07, 2018 - 09:43 CEST

Update

We are continuing to investigate this issue. The problems only affect Visma eAccounting/Visma eEkonomi/Visma ePasseli.

Posted Jul 07, 2018 - 09:42 CEST

Investigating

We are currently investigating a service disruption in Visma Online (login page). When users log in to Visma eAccounting, they are caught in a loop and logged out. We will provide more information shortly. This affects common services: for example, Visma eAccounting, Visma Advisor and Visma Website/Webshops cannot be reached.

Posted Jul 07, 2018 - 09:22 CEST

This incident affected: Spiris - Bokföring & Fakturering/Visma eAccounting (Spiris - Bokföring & Fakturering/Visma eAccounting, Spiris - Bokföring & Fakturering Externt API/Visma eAccounting External API, Spiris - Deklaration & Årsredovisning/Årsbokslut), Visma Cloud Platform (Spiris/Visma Online), Spiris - Lön (Spiris - Anställd), and Common Services (Spiris - Website & Webshop/Visma Website & Webshop, Spiris - Kreditupplysning).