Investigating an issue in Visma eAccounting
Incident Report for Visma Cloud Services
Postmortem

Update 9th June: Incident report for Visma eAccounting

1) Incident report for Visma eAccounting, major outage 5th June 2017

We now have more insights into the root cause and want to share details about how we handled the incident and what action has been taken to prevent and mitigate similar problems in the future.

  • We identified and verified the incident as a major outage on Visma eAccounting at around 7:00 a.m. Our monitoring systems detected the issue and alerted our personnel.

  • First information was published on our service health status page, status.visma.com, at 7:35 a.m.

  • We started the incident handling process. Developers and infrastructure personnel from the Visma eAccounting team started working on the incident at around 7:50 a.m.

  • Internal troubleshooting started at 8:10 a.m. We identified that the problem was related to one of Visma eAccounting’s main databases. We were unable to connect to, or perform any actions on, that specific database. We tried several actions to provide the database with more resources. We also tried to create a copy and to make a connection to the copy. None of these attempts were successful.

  • We contacted our infrastructure provider, Microsoft Azure, at 9:10 a.m.

  • At 11:15 a.m., Microsoft confirmed that the database was not healthy and assigned an engineering team, which worked on the issue throughout the afternoon. The early information we received was encouraging, so we predicted that the issue would be resolved within an hour. However, the team underestimated the problem, so the estimated resolution time kept increasing during the afternoon.

  • At around 4:00 p.m., we received confirmation that a bug in Azure had been triggered by a transaction in Visma eAccounting that took a very long time to run. Based on our discussions with Microsoft during the day, the issue appeared difficult to solve, and several teams and engineers were involved in troubleshooting it.

  • At 7:00 p.m., Microsoft started to perform recovery and failover procedures to the secondary replica. This resolved the issue in Visma eAccounting.

  • At 8:20 p.m., Visma eAccounting was up and running again.

  • An incident report was created and an incident review meeting was held with key personnel and management at 9:00 a.m. on 7th June. During the review meeting, a plan was created with follow-up tasks to mitigate similar incidents in the future. We called in extra personnel on 6th June for troubleshooting and monitoring throughout the day.

2) Incident report for Visma eAccounting, major outage 7th June 2017

We now have more insights into the root cause and want to share details about how we handled the incident and what action has been taken to prevent and mitigate similar problems in the future.

  • We identified and verified the incident as a major outage on Visma eAccounting at around 6:00 a.m. Our monitoring systems detected the issue and alerted our personnel.

  • First information was published on our service health status page, status.visma.com, at 6:19 a.m.

  • We started the incident handling process. Developers and infrastructure personnel from the Visma eAccounting team started working on the incident at around 6:00 a.m.

  • We identified that the problem was related to one of Visma eAccounting’s main databases and was connected to the previous incident. We were unable to connect to, or perform any actions on, that specific database.

  • We contacted our infrastructure provider, Microsoft Azure, at 6:23 a.m. and reported the issue.

  • At 7:06 a.m., we started a parallel process to restore the specific database, so that we had a fallback if the issue took longer to solve than estimated.

  • At 7:34 a.m., Microsoft confirmed that the database was locked in a similar way as on 5th June.

  • At 7:47 a.m., Microsoft started to perform a failover procedure to a replica with higher capacity. This resolved the issue in Visma eAccounting.

  • At 7:47 a.m., Visma eAccounting was up and running again.

  • An incident report was created and an incident review meeting was held with key personnel and management at 9:00 a.m. on 7th June. During review meetings throughout the day, a plan was created with follow-up tasks to mitigate similar incidents in the future. For example, we scaled up the database to maximum capacity and planned an on-duty schedule for monitoring throughout the night.

3) Incident report for Visma eAccounting, major outage 8th June 2017

We now have more insights into the root cause and want to share details about how we handled the incident and what action has been taken to prevent and mitigate similar problems in the future.

  • We identified and verified the incident as a major outage on Visma eAccounting at around 2:30 a.m. Our monitoring systems detected the issue and alerted our personnel.

  • We started the incident handling process. Developers and infrastructure personnel from the Visma eAccounting team started working on the incident at around 2:45 a.m.

  • We identified that the problem was related to one of Visma eAccounting’s main databases and was connected to the incident from 5th June. We were unable to connect to, or perform any actions on, that specific database.

  • At 3:39 a.m., Microsoft confirmed that the database was locked in a similar way as on 5th June: the transaction log was full.

  • First information was published on our service health status page, status.visma.com, at 4:04 a.m.

  • At 5:14 a.m., Microsoft updated the database with the correct settings for automatic growth of the transaction log. This change fixed the problem in Visma eAccounting and will permanently mitigate the issue.

  • At 5:15 a.m., Visma eAccounting was up and running again.

  • An incident report was created and an incident review meeting was held with key personnel and management at 9:00 a.m. on 8th June.
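
The downtime for each of the three outages can be worked out from the timelines above. The short sketch below does the arithmetic in Python. It assumes each outage runs from the time the incident was identified to the time Visma eAccounting was reported up again. Several of these times are given as "around", so the results are approximate, and the names `OUTAGES` and `outage_minutes` are only illustrative.

```python
from datetime import datetime

# Identification and recovery times as stated in the three timelines.
# Assumption: the outage window is "identified" -> "up and running again".
OUTAGES = {
    "5th June": ("2017-06-05 07:00", "2017-06-05 20:20"),
    "7th June": ("2017-06-07 06:00", "2017-06-07 07:47"),
    "8th June": ("2017-06-08 02:30", "2017-06-08 05:15"),
}


def outage_minutes(start: str, end: str) -> int:
    """Return the whole minutes between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = {day: outage_minutes(*span) for day, span in OUTAGES.items()}
total = sum(durations.values())

for day, minutes in durations.items():
    print(f"{day}: {minutes // 60} h {minutes % 60} min")
print(f"Total: {total // 60} h {total % 60} min")
```

Under these assumptions, the outages lasted roughly 13 h 20 min, 1 h 47 min and 2 h 45 min, or about 17 h 52 min in total.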

Root cause:

One of the databases used by Visma eAccounting was not configured correctly in Microsoft Azure, and several backup routines that should have prevented this scenario failed. The issue was triggered by a nightly maintenance job in Visma eAccounting that took an unusually long time to run. Normally, this is not a problem, but the incorrect configuration of the database in question prevented the job from using the resources it needed, even though they were available. The database is part of a service we buy from Microsoft, and Microsoft had to solve the configuration problem and also make a general change in Azure SQL.

Technical details:

The problem started when a global transaction was unable to complete. This caused the transaction log to grow slowly until the maximum limit was reached. Failovers and alerts at Microsoft should have prevented this from happening, but due to several unfortunate events, that was not the case. The result was a full transaction log, which prevented further writing to the database. Incorrect configuration of the transaction log and its automatic growth settings caused two further failures. Microsoft resolved the issue by rebuilding one of the replicas and has also identified several improvements that it is working on.
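
To illustrate why one transaction that never completes can fill a transaction log, here is a deliberately simplified toy model in Python. It is not how Azure SQL is implemented. It only assumes the general rule that a log cannot be truncated past the oldest transaction that is still open. The names `ToyTransactionLog` and `writes_until_full`, and the log size of 10 records, are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTransactionLog:
    """Toy log that can only be truncated up to the oldest open transaction."""
    max_size: int
    records: list = field(default_factory=list)    # (sequence number, txn id)
    open_txns: dict = field(default_factory=dict)  # txn id -> first sequence number
    next_seq: int = 0

    def write(self, txn_id: str) -> None:
        # A full log rejects every further write.
        if len(self.records) >= self.max_size:
            raise RuntimeError("transaction log full: writes rejected")
        self.records.append((self.next_seq, txn_id))
        self.next_seq += 1

    def begin(self, txn_id: str) -> None:
        self.open_txns[txn_id] = self.next_seq
        self.write(txn_id)

    def commit(self, txn_id: str) -> None:
        self.write(txn_id)
        del self.open_txns[txn_id]

    def truncate(self) -> None:
        # Records older than the oldest open transaction can be discarded.
        oldest = min(self.open_txns.values(), default=self.next_seq)
        self.records = [r for r in self.records if r[0] >= oldest]


def writes_until_full(max_size: int, stuck: bool) -> Optional[int]:
    """Return how many short transactions succeed before the log fills, or None."""
    log = ToyTransactionLog(max_size=max_size)
    if stuck:
        log.begin("long-running")  # hypothetical transaction that never completes
    for i in range(1000):
        try:
            log.begin(f"t{i}")
            log.commit(f"t{i}")
        except RuntimeError:
            return i
        log.truncate()
    return None


print(writes_until_full(10, stuck=False))  # None: truncation keeps the log small
print(writes_until_full(10, stuck=True))   # 4: the open transaction pins the log
```

In the toy model the log never fills while every transaction commits. Once one transaction stays open, truncation has nothing it can discard, and ordinary writes fail as soon as the size limit is reached. Raising the limit or allowing the log to grow only postpones that point for as long as the transaction stays open.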

Summary:

  • No data has been lost.

  • There was no hacker attack.

  • We had not made any changes to our infrastructure or code close to the time this problem started to occur.

  • There was no lack of resources (engineers and infrastructure personnel) since our key personnel were available and involved.

  • Both Visma and Microsoft have held review meetings. Microsoft has fixed several critical issues already and has created plans with follow-up tasks to mitigate similar incidents in the future.

We apologize to all customers impacted by this. This was an unfortunate set of circumstances for customers, partners and the development & operations team behind Visma eAccounting.

Posted Jun 09, 2017 - 15:46 CEST

Resolved
This incident has been resolved.
Posted Jun 07, 2017 - 07:53 CEST
Update
The service is back up and running again. We will monitor it during the following days and investigate so that this never happens again.
Posted Jun 07, 2017 - 07:53 CEST
Identified
We have identified the cause of the incident, and it is connected to the major outage we had on 5th June. We are now working on several initiatives to get Visma eAccounting up again, both together with our infrastructure provider and internally at Visma. All developers and infrastructure personnel are giving their full attention to solving this incident as soon as possible.
Posted Jun 07, 2017 - 07:19 CEST
Investigating
We are currently investigating a major outage in Visma eAccounting. We will provide more information shortly.
Posted Jun 07, 2017 - 06:19 CEST