Last night was the Passion Capital Christmas Party, so (naturally) the GoSquared team was in attendance.
This meant that our whole ops team, who make sure that everything stays running, were out at a party. And Sod’s law dictates that this is the exact time at which things are likely to go horribly wrong.
[Embedded tweet from Eileen Burbidge (@eileentso), December 2, 2015, featuring the GoSquared team somewhere among the crowd]
So how do we keep on top of things here at GoSquared? How do we monitor that everything is working as it should, and how do we respond when things go wrong? And how do we manage that with the whole team at a party?
Here are a few of the tools we use:
Monitoring and Alerting
Our two main sources of metrics and alarms are Server Density and Amazon CloudWatch. Together they keep an eye on all of the metrics from our EC2 instances (CPU, memory, disk space, networking), as well as our load balancers (HTTP request count, response codes, latency, healthy/unhealthy instances) and databases (connection count, query rates etc.).
Along with all these metrics we have hundreds of alarms, which will trigger and alert us as soon as something doesn’t look right. If request latency is too high, or a database is running out of disk space, or we’re sending too many 5xx HTTP status codes, an alarm will trigger and we’ll know about it.
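To make the idea of a threshold alarm concrete, here is a minimal sketch in Python. It is not how Server Density or CloudWatch are implemented, and the `Alarm` class, its field names, and the example threshold are all hypothetical. It only illustrates the general pattern of triggering once a metric has breached a limit for several consecutive datapoints, which avoids alerting on a single blip.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Alarm:
    """A hypothetical threshold alarm; the field names are illustrative only."""
    name: str
    threshold: float
    periods: int  # consecutive breaching datapoints required to trigger


def is_triggered(alarm: Alarm, datapoints: Sequence[float]) -> bool:
    """Return True if the most recent `periods` datapoints all exceed the threshold."""
    if alarm.periods <= 0 or len(datapoints) < alarm.periods:
        return False
    recent = datapoints[-alarm.periods:]
    return all(value > alarm.threshold for value in recent)


# Assumed example: alert when latency stays above 0.5 seconds for 3 datapoints.
latency_alarm = Alarm(name="high-request-latency", threshold=0.5, periods=3)
print(is_triggered(latency_alarm, [0.2, 0.6, 0.7, 0.9]))  # latest three all breach
print(is_triggered(latency_alarm, [0.6, 0.7, 0.2, 0.9]))  # a dip breaks the streak
```

Requiring several consecutive breaches is a common trade-off: the alarm fires a little later, but a single noisy datapoint does not wake anyone up.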
Alerts are sent to the team via PagerDuty. One team member is “on call” at any given time, and PagerDuty takes care of alerting that person via push notification, SMS or phone call depending on severity, and of escalating the alert to another team member if the on-call person is unable to respond. We also have a dedicated Slack channel set up with the PagerDuty integration and Hubot, so anyone on the team can see when something’s up.
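The escalation behaviour described above can be sketched roughly as follows. This is not PagerDuty's API or algorithm. The function `who_to_page`, the 15-minute timeout, and the names in the usage example are assumptions made purely for illustration: an unacknowledged alert moves one step down an ordered list each time the timeout elapses.

```python
from typing import Optional, Sequence


def who_to_page(
    on_call_order: Sequence[str],
    minutes_unacknowledged: float,
    escalate_after_minutes: float = 15.0,  # assumed timeout, not a real default
) -> Optional[str]:
    """Pick the person to page, moving one step down the list per timeout.

    Returns None once every level has been tried without an acknowledgement.
    """
    if minutes_unacknowledged < 0 or escalate_after_minutes <= 0:
        raise ValueError("elapsed time must be >= 0 and the timeout must be > 0")
    level = int(minutes_unacknowledged // escalate_after_minutes)
    if level >= len(on_call_order):
        return None
    return on_call_order[level]


# Hypothetical rota: the primary on-call first, then a backup.
rota = ["alice", "bob"]
print(who_to_page(rota, 0))   # the primary is paged immediately
print(who_to_page(rota, 20))  # one timeout has passed, so the backup is paged
```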
It’s all well and good having lots of metrics and monitoring, but if an alarm goes off and you can’t actually deal with the issue, the alert is pretty useless. Fortunately, there’s plenty we can do purely from our phones to address issues.
The AWS Mobile Console App lets us perform simple tasks on our AWS resources very quickly. Whether that’s checking that a particular CloudWatch metric is returning to normal levels, modifying the throughput on one of our DynamoDB tables, rebooting a crashed database instance, or manually scaling up one of our Auto Scaling groups, each of these tasks can be accomplished in seconds from the mobile app.
For more complex issues we can SSH into any of our EC2 instances to check memory usage or whether a process is stuck, and restart services or perform other maintenance as necessary. A VPN gives us access to instances inside our Virtual Private Cloud, and internal DNS lets us address instances without having to memorise or look up IP addresses. Panic’s excellent Prompt app means we can do all of this from our phones, without wasting valuable time grabbing a laptop and finding a Wi-Fi connection.
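As an illustration of the kind of memory check you might run once connected, here is a small Python sketch. It assumes a Linux instance whose `/proc/meminfo` includes the `MemTotal` and `MemAvailable` fields. The function name `memory_used_percent` and the sample figures are hypothetical, and this is not a description of our actual tooling.

```python
def memory_used_percent(meminfo_text: str) -> float:
    """Compute the percentage of memory in use from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Key: value" line
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Made-up sample in the /proc/meminfo layout; on a real instance you would
# read the file itself, e.g. open("/proc/meminfo").read().
sample = "MemTotal:        8000000 kB\nMemAvailable:    2000000 kB\n"
print(memory_used_percent(sample))
```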
Putting it all together
It takes a lot of different components and services to make all this possible. But the combination of effective monitoring, effective alerting, and the ability to respond to and resolve issues quickly and effortlessly means that when something does go wrong (because it always will), we’re able to deal with it and get back to the party.