Investigating an issue in Connect

Incident Report for Visma Cloud Services

Postmortem

We would like to present a summary of the events and actions we did related to the sign in issues that happened during 11th and 13th of September:

The application was running without any known issues and suddenly on 11th of September at 15:00 it went into an unstable state showing an error related to out of memory. Memory metrics did not show any exceeding of the maximum limit so we did not know the root cause for the error. The application went back to normal without our intervention.
On 12th of September at 08:00 the same unstable state was reached again showing errors related to memory and a lot of logs trying to write to memory. We have started new hardware on our infrastructure provider AWS and the system went back to normal state for a couple of hours. After that the same unstable behavior and same memory related errors were shown. We were in contact with our infrastructure provider AWS to check the hardware and help us identify what can cause the memory issues.
Our infrastructure provider AWS has checked and they did not identify any hardware issues.
In the same time we have started to rewrite the code to no longer use the memory cache, since this was the part that had a lot of errors in our centralized logging system and the error we always have seen was related to memory.
In the late evening of 12th of September we have deployed to production a version of the application where we no longer used the memory cache in order to not get into the memory related issues. When we have deployed the new version, the application was stable and it looked as if we have solved the issue.
In the morning of 13th of September, the application had again same errors and we have decided to change the infrastructure instance type to more powerful hardware: with a lot more CPU and memory.
Unfortunately, after around 30 minutes the application had again same errors and without us knowing what the root cause is. We have decided to roll-back to a version that was running on 22nd of August and since then the application is stable.

Our investigations, after 13th of September and until today, show that the root causes of the incident are:

One change causes CPU increase which triggers the application to be unstable while CPU is reaching the limit. The CPU increase is caused by: .NET Core 2.2 Microsoft HttpClient class created using HTTP ClientFactory. The ClientFactory has a bug as acknowledged by Microsoft in this thread: https://github.com/aspnet/Extensions/issues/2060
This feature passed performance tests in our test environment and it was running in production without incident for a week due to AWS CPU Burst feature (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html).
Lack of monitoring of CPU and Burst Credits (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-monitoring-cpu-credits.html) made our team unaware of the CPU Credits masking the problems.

We apologize for all the inconvenience created by these incidents . The team has been working since then to eliminate the change that has caused the CPU increase and we are adjusting the monitoring metrics to prevent future application misbehavior.

Once again – we are sorry for any inconvenience this might have caused you.

Posted Sep 19, 2019 - 16:26 CEST

Resolved

We have been monitoring during the day. and the service has been running normal.
We will continue investigations during next days and will add a postmortem with our findings.

Posted Sep 13, 2019 - 17:04 CEST

Update

We are continuing to monitor for any further issues.

Posted Sep 13, 2019 - 08:43 CEST

Monitoring

Posted Sep 13, 2019 - 07:59 CEST

Update

We are continuing to investigate the issue.

Posted Sep 13, 2019 - 07:36 CEST

Update

We are continuing to investigate this issue.

Posted Sep 13, 2019 - 06:31 CEST

Investigating

We are currently investigating a service disruption in Visma Connect. We will provide more information shortly. Other services are also affected.

Posted Sep 13, 2019 - 06:26 CEST

This incident affected: Visma Cloud Platform (CP - Visma Connect), Spiris - Bokföring & Fakturering/Visma eAccounting (Spiris - Deklaration & Årsredovisning/Årsbokslut), and Sticos Descartes.