As you may have noticed from our status updates, we suffered a lengthy interruption in service this morning.
To begin with, I want to apologise for the frustration this has caused you. As users of our own product, we know how disruptive issues can be. And as users, we’d also expect to know what went wrong, and what’s being done to prevent it happening again. That’s what this post is all about.
What was affected?
Between 10:32 and 12:09 UTC on Friday 20th March, our tracking API was not functioning correctly. This means that GoSquared stopped tracking new data from all sources during this period.
Additionally, during this period, the APIs powering GoSquared’s web apps were taken out of service in response to failing health checks related to the tracking issues.
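The health checks mentioned above can be pictured as a small sketch. This is an illustrative, hypothetical example (not our actual implementation): an API server reports unhealthy once probes against its tracking backend fail a few times in a row, so the load balancer pulls it out of rotation.

```python
# Hypothetical sketch of a dependency-aware health check.
# Thresholds and names are illustrative, not GoSquared's real code.

class HealthCheck:
    """Reports healthy only while the tracking backend is reachable."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # consecutive failures tolerated
        self.failures = 0

    def record_probe(self, backend_ok):
        """Record the result of one probe against the tracking backend."""
        if backend_ok:
            self.failures = 0  # any success resets the counter
        else:
            self.failures += 1

    def healthy(self):
        # A load balancer polling this would remove the instance from
        # rotation once the failure threshold is crossed.
        return self.failures < self.max_failures
```

Requiring several consecutive failures avoids flapping on a single slow probe, which is why the API servers were removed only once the tracking issue persisted.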
Why did this happen?
We process billions of requests a day. To deliver a fast, real-time service at that scale, we run clusters of many instances on Amazon EC2, the Amazon Web Services platform that hosts our servers.
Unfortunately, one of the EC2 instances completely stopped functioning without any notice or warning signs. All health checks and metrics were normal right up until the fault, otherwise we would have taken action ahead of time to prevent the issue.
The EC2 instance was running a node in the database cluster we use for ingesting data into GoSquared. Losing this node put the cluster into an unhealthy state in which it could no longer handle incoming data reliably. Frustratingly, our failover plan did not prevent an outage.
We tried to recover the node, a process that should take only minutes, but the EC2 instance was entirely unresponsive, refusing even to shut down or restart. Instead, we had to migrate to a new cluster, a non-trivial process that accounted for the length of the outage.
That sounds scary. Was any existing data lost?
No. Only new data tracked during the outage period was missed. All existing data already saved in GoSquared is safe and sound.
Why did the node stop working?
We’re confident the cause was a fault in the underlying EC2 instance itself, for reasons outside of our control.
People sometimes warn about the stability of EC2, but in six years of running on it we have found instance failures like this to be rare. Nothing is 100% reliable, though, and unfortunately this one affected us today.
Fortunately, we already know what to do to prevent this happening again:
- Improve our failover strategy for our tracking database clusters.
- Revisit and improve our API health checks.
- For the worst-case scenario, expedite or automate deploying new database clusters.
Once again, on behalf of the GoSquared team, I am sorry for the problems today. If you have any questions about the incident, the impact on your account or anything else relating to the GoSquared service, please get in touch and we’ll be happy to discuss.