Investigating an issue in Visma eAccounting
Incident Report for Visma Cloud Services
Postmortem

Update 9th June: Incident report for Visma eAccounting

1) Incident report for Visma eAccounting, major outage 5th June 2017

We now have more insights into the root cause and want to share details about how we handled the incident and what action has been taken to prevent and mitigate similar problems in the future.

  • We identified and verified the incident as a major outage on Visma eAccounting at around 7:00 a.m. Our monitoring systems detected the issue and alerted our personnel.

  • First information was published on our service health status page, status.visma.com, at 7:35 a.m.

  • We started the incident handling process. Developers and infrastructure personnel from the Visma eAccounting team started working on the incident at around 7:50 a.m.

  • Internal troubleshooting started at 8:10 a.m. We identified that the problem was related to one of Visma eAccounting’s main databases. We were unable to connect to, or perform any actions on, that specific database. We tried several actions to provide the database with more resources, and we also tried to create a copy of the database and connect to that copy (a sketch of how such steps can be scripted follows this list). None of these attempts were successful.

  • We contacted our infrastructure provider, Microsoft Azure, at 9:10 a.m.

  • At 11:15 a.m., Microsoft confirmed that the database was not healthy and assigned an engineering team that worked on the issue throughout the afternoon. The early information we received was encouraging, so we expected the issue to be resolved within an hour. However, the problem had been underestimated, and the estimated resolution time increased during the afternoon.

  • At around 4:00 p.m., we received confirmation that there was a bug in Azure that had been triggered by a transaction in Visma eAccounting that took a very long time to run. Based on the discussions with Microsoft during the day, the issue appeared difficult to solve, and several teams and engineers were involved in troubleshooting it.

  • At 7:00 p.m., Microsoft started to perform recovery and failover procedures to the secondary replica. This resolved the issue in Visma eAccounting.

  • At 8:20 p.m., Visma eAccounting was up and running again.

  • We called in extra personnel on 6th June for troubleshooting and monitoring throughout the day. An incident report was created and an incident review meeting was held with key personnel and management at 9:00 a.m. on 7th June. During the review meeting, a plan was created with follow-up tasks to mitigate similar incidents in the future.
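
For technically interested readers, the resource and copy attempts mentioned in the 8:10 a.m. item can, for an Azure SQL database, be scripted roughly as in the sketch below. This is a minimal illustration and not the exact steps we ran: the connection string, server, database and service-objective names are placeholders, and pyodbc is assumed as the client library. Statements like these only succeed when the service itself accepts them, which was not the case during this incident.

    # Sketch: scale up an Azure SQL database and create a copy of it.
    # Server, database, credentials and the 'P6' service objective are placeholders.
    import pyodbc

    MASTER_CONN = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=example-server.database.windows.net;"
        "DATABASE=master;UID=admin_user;PWD=secret"
    )

    def run_on_master(statement: str) -> None:
        # ALTER DATABASE and CREATE DATABASE ... AS COPY OF cannot run inside a
        # user transaction, so the connection uses autocommit.
        with pyodbc.connect(MASTER_CONN, autocommit=True) as conn:
            conn.execute(statement)

    # 1) Try to give the database more resources by moving it to a larger
    #    service objective.
    run_on_master(
        "ALTER DATABASE [eaccounting_main] "
        "MODIFY (EDITION = 'Premium', SERVICE_OBJECTIVE = 'P6')"
    )

    # 2) Try to create a copy of the database on the same logical server,
    #    which can then be tested with a separate connection.
    run_on_master(
        "CREATE DATABASE [eaccounting_main_copy] AS COPY OF [eaccounting_main]"
    )

In this incident, neither type of operation could complete because the source database itself was unresponsive, which is why the issue had to be escalated to Microsoft.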

2) Incident report for Visma eAccounting, major outage 7th June 2017

We now have more insights into the root cause and want to share details about how we handled the incident and what action has been taken to prevent and mitigate similar problems in the future.

  • We identified and verified the incident as a major outage on Visma eAccounting at around 6:00 a.m. Our monitoring systems detected the issue and alerted our personnel.

  • First information was published on our service health status page, status.visma.com, at 6:19 a.m.

  • We started the incident handling process. Developers and infrastructure personnel from the Visma eAccounting team started working on the incident at around 6:00 a.m.

  • We identified that the problem was related to one of Visma eAccounting’s main databases and was connected to the previous incident. We were unable to connect to, or perform any actions on, that specific database.

  • We contacted our infrastructure provider, Microsoft Azure, at 6:23 a.m. and reported the issue.

  • At 7:06 a.m., we started a parallel track to restore the specific database, to ensure we had a fallback if the issue took more time to solve than estimated (a sketch of how such a restore can be scripted follows this list).

  • At 7:34 a.m., Microsoft confirmed that the database was locked in the same way as on 5th June.

  • At 7:47 a.m., Microsoft started to perform a failover procedure to a replica with higher capacity. This resolved the issue in Visma eAccounting.

  • At 7:47 a.m., Visma eAccounting was up and running again.

  • An incident report was created and an incident review meeting was held with key personnel and management at 9:00 a.m. on 7th June. During review meetings throughout the day, a plan was created with follow-up tasks to mitigate similar incidents in the future. For example, we scaled up the database to maximum capacity and planned an on-duty schedule for monitoring throughout the night.
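
As an illustration of the fallback mentioned in the 7:06 a.m. item, a restore of an Azure SQL database to a new database can be scripted along the lines below. This is a hedged sketch only, assuming the Azure CLI is installed and authenticated; the resource group, server, database names and the point-in-time timestamp are placeholders, not the actual values used during the incident.

    # Sketch: restore an Azure SQL database to a new database from a point in
    # time, via the Azure CLI. All names and the timestamp are placeholders.
    import subprocess

    restore_cmd = [
        "az", "sql", "db", "restore",
        "--resource-group", "eaccounting-rg",        # placeholder
        "--server", "example-server",                # placeholder
        "--name", "eaccounting_main",                # placeholder
        "--dest-name", "eaccounting_main_restored",  # placeholder
        "--time", "2017-06-07T05:00:00",             # restore point (UTC), placeholder
    ]

    result = subprocess.run(restore_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError("Restore failed: " + result.stderr)
    print(result.stdout)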

3) Incident report for Visma eAccounting, major outage 8th June 2017

We now have more insights into the root cause and want to share details about how we handled the incident and what action has been taken to prevent and mitigate similar problems in the future.

  • We identified and verified the incident as a major outage on Visma eAccounting at around 2:30 a.m. Our monitoring systems detected the issue and alerted our personnel.

  • We started the incident handling process. Developers and infrastructure personnel from the Visma eAccounting team started working on the incident at around 2:45 a.m.

  • We identified that the problem was related to one of Visma eAccounting’s main databases and was connected to the incident from 5th June. We were unable to connect to, or perform any actions on, that specific database.

  • At 3:39 a.m., Microsoft confirmed that the database was locked in the same way as on 5th June: the transaction log was full.

  • First information was published on our service health status page, status.visma.com, at 4:04 a.m.

  • At 5:14 a.m., Microsoft updated the database with correct settings for the automatic growth of the transaction log. This fixed the problem in Visma eAccounting and permanently mitigates this particular issue (the sketch after this list shows how such log settings can be inspected).

  • At 5:15 a.m., Visma eAccounting was up and running again.

  • An incident report was created and an incident review meeting was held with key personnel and management at 9:00 a.m. on 8th June.
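
For context on the finding that the transaction log was full: log usage and log file growth settings can be inspected from inside the database with standard SQL Server system views, roughly as in the sketch below. This is illustrative only; the connection string is a placeholder and pyodbc is assumed as the client library. In Azure SQL Database these log settings are managed by Microsoft, which is why the fix had to come from their side.

    # Sketch: check transaction log usage and autogrowth settings for a
    # SQL Server / Azure SQL database. Connection details are placeholders.
    import pyodbc

    CONN = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=example-server.database.windows.net;"
        "DATABASE=eaccounting_main;UID=monitor_user;PWD=secret"
    )

    with pyodbc.connect(CONN) as conn:
        cur = conn.cursor()

        # How full is the transaction log right now?
        cur.execute(
            "SELECT total_log_size_in_bytes, used_log_space_in_percent "
            "FROM sys.dm_db_log_space_usage"
        )
        total_bytes, used_pct = cur.fetchone()
        print(f"log size: {total_bytes / 1024 ** 2:.0f} MB, used: {used_pct:.1f} %")

        # How is the log file allowed to grow? growth = 0 means autogrowth is
        # disabled; max_size = -1 means the file may grow without limit.
        cur.execute(
            "SELECT name, growth, is_percent_growth, max_size "
            "FROM sys.database_files WHERE type_desc = 'LOG'"
        )
        for name, growth, is_percent, max_size in cur.fetchall():
            unit = "%" if is_percent else "x 8 KB pages"
            print(f"{name}: growth = {growth} {unit}, max_size = {max_size}")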

Root cause:

One of the databases used by Visma eAccounting was not configured correctly in Microsoft Azure and several backup routines that should prevent this scenario failed. The issue was triggered by a nightly maintenance job in Visma eAccounting that took an unusually long time to run. Normally, this is not a problem but the incorrect configuration of the database in question prevented the job from using the resources needed, even though they were available. The database is part of a service we buy from Microsoft, and Microsoft had to solve the configuration problem and also make a general change in Azure SQL.

Technical details:

The problem started when a global transaction was unable to complete. This caused the transaction log to grow slowly until the maximum limit was reached. Failovers and alerts at Microsoft should have prevented this from happening, but due to several unfortunate events, that was not the case. The result was a full transaction log, which prevented further writes to the database. Incorrect configuration of the transaction log and its automatic growth settings caused the database to fail a further two times. Microsoft resolved the issue by rebuilding one of the replicas and has also identified several improvements that they are working on.
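
To make the chain of events more concrete: a transaction that never completes prevents the transaction log from being truncated, so the log keeps growing until it reaches its limit and the database stops accepting writes. The sketch below shows one general way to spot such a condition (what is blocking log truncation, and which transactions have been open the longest); it is not the monitoring used by Microsoft or by us, and the connection string is a placeholder.

    # Sketch: find out why the transaction log cannot be truncated and list
    # the oldest open transactions. Connection details are placeholders.
    import pyodbc

    CONN = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=example-server.database.windows.net;"
        "DATABASE=eaccounting_main;UID=monitor_user;PWD=secret"
    )

    with pyodbc.connect(CONN) as conn:
        cur = conn.cursor()

        # Why is log truncation waiting? 'ACTIVE_TRANSACTION' points to a
        # long-running transaction like the one described above.
        cur.execute(
            "SELECT log_reuse_wait_desc FROM sys.databases WHERE name = DB_NAME()"
        )
        print("log reuse wait:", cur.fetchone()[0])

        # List open transactions, oldest first.
        cur.execute(
            "SELECT s.session_id, t.transaction_id, t.transaction_begin_time "
            "FROM sys.dm_tran_active_transactions AS t "
            "JOIN sys.dm_tran_session_transactions AS s "
            "  ON s.transaction_id = t.transaction_id "
            "ORDER BY t.transaction_begin_time"
        )
        for session_id, tran_id, begin_time in cur.fetchall():
            print(f"session {session_id}: transaction {tran_id} open since {begin_time}")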

Summary:

  • No data has been lost.

  • There was no hacker attack.

  • We had not made any changes to our infrastructure or code close to the time this problem started to occur.

  • There was no lack of resources (engineers and infrastructure personnel) since our key personnel were available and involved.

  • Both Visma and Microsoft have held review meetings. Microsoft has fixed several critical issues already and has created plans with follow-up tasks to mitigate similar incidents in the future.

We apologize to all customers impacted by this. This was an unfortunate set of circumstances for customers, partners and the development & operations team behind Visma eAccounting.

Posted Jun 07, 2017 - 08:59 CEST

Resolved
The issue is now resolved and the service is up and running as normal.
Posted Jun 05, 2017 - 21:58 CEST
Monitoring
The service is up and running again! This issue should now be solved for all customers. We will keep monitoring over the coming hours to ensure that everything works as expected. We will post information about the root cause as soon as possible, once we have gathered information from all sub-teams. Thank you for your patience, and we are sorry for the trouble.
Posted Jun 05, 2017 - 20:33 CEST
Update
We are working hard together with our infrastructure provider to solve this issue. This incident has the highest priority and several employees and teams are on the case. Unfortunately, the problem is tricky to solve, but we are on it and aiming to fix it this evening. Sorry for any problems this has caused you.
Posted Jun 05, 2017 - 18:56 CEST
Update
We are still working on solving the issue; unfortunately, we have no more information to provide right now. The next update will be at 18:00.
Posted Jun 05, 2017 - 16:11 CEST
Update
We have no new information to give at the moment. Every available resource is working on solving this together with our service provider. Thank you for your patience, and we are sorry for any trouble this may cause. We will update again when it is solved, or at 16:00 at the latest.
Posted Jun 05, 2017 - 14:23 CEST
Update
We are doing everything we can to get the service back to normal, and it is our highest priority. We are in close contact with our service provider. Unfortunately, we can't say when the issue will be solved. We suspect that the problem relates to the database servers, but we are working on several possible solutions.
Posted Jun 05, 2017 - 12:47 CEST
Update
We are still working on solving the problem and getting the service back up and running again. This issue has the highest priority, and all our resources are working on it. We will update when we have more information to provide.
Posted Jun 05, 2017 - 10:39 CEST
Update
We are working together with our service provider to solve the problem at the moment. We are sorry for any trouble this may cause.
Posted Jun 05, 2017 - 09:48 CEST
Identified
We have identified the cause of the incident and are currently working on finding a solution for this. We will update when we have more information.
Posted Jun 05, 2017 - 08:46 CEST
Update
We are still investigating to find the root cause of the disruption. Thanks for your patience.
Posted Jun 05, 2017 - 08:08 CEST
Investigating
We are currently investigating a service disruption in Visma eAccounting. We will provide more information shortly.
Posted Jun 05, 2017 - 07:39 CEST