Investigating an issue in Visma Severa / Visma.net Project Management

Incident Report for Visma Cloud Services

Postmortem

Since moving to a new hosting provider, our service has suffered several disruptions and periods of degraded performance. We are very sorry for the inconvenience this has caused our customers.

The issues started on the 15th of November, one week after the migration, and they still continue as performance spikes that cause occasional slowness for end users.

One of the goals of the new hosting platform was to upgrade SQL Server from version 2014 to 2017. This database version was tested extensively; for example, our development environment ran on it for over a year without issues. Under production load, however, we discovered that some critical, complex database queries did not perform well when thousands of users used the system simultaneously, and the SQL Optimizer started to make incorrect decisions.

This issue is called "execution plan regression" and requires optimization of the problematic queries. We reverted our databases to SQL Server version 2014, but several queries still had to be optimized and the Optimizer had to be told to behave differently.

A second issue was with SQL Server’s temporary tables and objects, which are used to cache and collect data from queries. SQL Managed Instance did not properly handle queries using temporary objects, which caused all the queries to queue up and hang.

We had to change our queries so that they no longer use SQL Server’s temporary tables and objects. While we were fixing the problematic queries, a bug regarding temporary objects was found in SQL Managed Instance; this bug is the root cause of these issues.

Since the 15th of November we have had several production releases in which most of the problematic queries have been fixed. More fixes are coming to production this week, and we are constantly monitoring our service and fixing poorly behaving queries as we see them occur. Our supplier has promised a fix for the SQL Managed Instance bug during Q1 2020.

Posted Dec 16, 2019 - 12:51 CET

Resolved

The issue is now resolved and the service is up and running as normal.

Posted Nov 20, 2019 - 14:39 CET

Monitoring

The service is up and running again. This issue should be solved for all customers. We will keep monitoring to ensure high quality.

Posted Nov 20, 2019 - 13:20 CET

Identified

At 13:05 CET we will downgrade our SQL database version to 2014, which is the version we were running at Rackspace, our previous hosting provider. We hope this will mitigate the issues with the SQL Optimizer and execution plans that we are now facing in the newer version of the database. This change will not cause any downtime, but it will momentarily slow the system down.

Posted Nov 20, 2019 - 12:33 CET

Investigating

Performance has degraded again. We will update the status at 15:00 CET. We are sorry for the inconvenience this is causing our customers.

Posted Nov 20, 2019 - 12:03 CET

Monitoring

The scheduled maintenance break has been completed, and we are monitoring the results.

Posted Nov 20, 2019 - 06:59 CET

Identified

Performance started to decrease again around 14:30 CET. API queries are failing, and the system is slow to use or returns errors.
We have scheduled a maintenance break for tomorrow morning.
The status will next be updated tomorrow morning, after 7:00 CET.

Posted Nov 19, 2019 - 15:08 CET

Monitoring

A fix has been implemented and the system now works for the majority of customers, but some individual customers might still get errors. Those errors will be fixed during the day.

Posted Nov 19, 2019 - 07:12 CET

Update

Since last Friday we have struggled with performance in our application. For our customers this has been visible as slowness, but also as various errors all over the system.
The issue is caused by the migration from on-premises SQL Server to Azure SQL Managed Instance. After the migration, the SQL hints and optimizer plans suddenly started to behave badly.
Unfortunately we also have issues in our deployment pipeline, which also changed when migrating to Azure, and it has taken us half a day to get part of the fixes out.
We just had a production release and expect to have more later today and tomorrow.
Our whole development team is working on this issue so that the usability of the service returns to the level where it should be for our customers.
A new status will be posted tomorrow morning at 7:00 CET/CEST.

Posted Nov 18, 2019 - 14:57 CET

Update

We have identified the cause of the incident and are currently working on a solution. We will post the next update at 15:00 CEST, or earlier if the issue is fixed before then.

Posted Nov 18, 2019 - 13:36 CET

Update

We have identified the cause of the incident and are currently working on a solution. We will post the next update in two hours, or earlier if the issue is fixed before then.

Posted Nov 18, 2019 - 11:26 CET