{"id":5790,"date":"2015-03-20T13:48:13","date_gmt":"2015-03-20T13:48:13","guid":{"rendered":"https:\/\/www.gosquared.com\/blog\/?p=5790"},"modified":"2019-11-28T11:53:03","modified_gmt":"2019-11-28T11:53:03","slug":"outage-2015-03-20","status":"publish","type":"post","link":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20","title":{"rendered":"Outage on Friday 20th March"},"content":{"rendered":"<p>As you may have noticed from our <a href=\"https:\/\/twitter.com\/GoSquaredStatus\" target=\"_blank\" rel=\"noopener noreferrer\">status updates<\/a>, we suffered a lengthy interruption in service this morning.<\/p>\n<p>To begin with, I want to apologise for the frustration this has caused you. As users of our own product, we know how disruptive issues can be. And as users, we&#8217;d also\u00a0expect to know what went wrong, and what&#8217;s being done to prevent it happening again. That&#8217;s what this post is all about.<\/p>\n<h2>What was affected?<\/h2>\n<p>Between 10:32 and 12:09 UTC on Friday 20th March, our tracking API was not functioning correctly. This means that GoSquared stopped tracking new data from all sources during this period.<\/p>\n<p>Additionally, during this period, the APIs powering GoSquared&#8217;s web apps were taken out of service in response to failing health checks related to the tracking issues.<\/p>\n<h2>Why did this happen?<\/h2>\n<p>We process billions of requests a day. To deliver our fast, real-time service at such scale, we run clusters of numerous instances on Amazon EC2, the service we use to run our servers.<\/p>\n<p>Unfortunately, one of the EC2 instances completely stopped functioning without any notice or warning signs. 
All health checks and metrics were normal prior to the fault; otherwise we would have taken evasive action to prevent\u00a0the issue.<\/p>\n<p>The EC2 instance was running a node that was part of the database cluster we use for ingesting data into GoSquared.\u00a0Losing this node put the cluster into an unhealthy state, making it unable to handle data reliably.\u00a0This meant the system could no longer process data. Frustratingly, our failover plan did not prevent an outage.<\/p>\n<p>We tried to recover the node, a process that should only take minutes, but the EC2 instance was entirely unresponsive &#8211; refusing to even shut down or restart. Instead, we had to migrate to a new cluster, a non-trivial process that accounted for the length of the outage.<\/p>\n<h4>That sounds scary. Was any existing data lost?<\/h4>\n<p>No. Only new data tracked during the outage period was missed. All existing data already saved in GoSquared is\u00a0safe and sound.<\/p>\n<h4>Why did the node stop working?<\/h4>\n<p>We&#8217;re confident it&#8217;s due to a faulty EC2 instance for reasons outside of our control.<\/p>\n<h2>Future prevention<\/h2>\n<p>People sometimes warn about the stability of EC2, but in our experience, instance instability like this has been rare in our\u00a06 years of running on EC2. 
Nothing is 100% reliable though, and unfortunately this affected\u00a0us today.<\/p>\n<p>Fortunately, we already know what to do to prevent this happening again:<\/p>\n<ul>\n<li>Improve our failover strategy for our tracking database clusters.<\/li>\n<li>Revisit and improve our API health checks.<\/li>\n<li>For the worst case scenario, expedite\u00a0or automate deploying new database clusters.<\/li>\n<\/ul>\n<p>Once again, on behalf of the GoSquared team, I\u00a0am sorry for the problems today.\u00a0If you have any questions about the incident, the impact on your account or anything else relating to the GoSquared service, please <a href=\"https:\/\/www.gosquared.com\/customer\/portal\/emails\/new\" target=\"_blank\" rel=\"noopener noreferrer\">get in touch<\/a>\u00a0and we&#8217;ll be happy to discuss.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As you may have noticed from our status updates, we suffered a lengthy interruption in service this morning. To begin&#8230;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1586],"tags":[],"class_list":["post-5790","post","type-post","status-publish","format-standard","hentry","category-gosquared-updates"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v18.6 (Yoast SEO v19.0) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Outage on Friday 20th March - GoSquared Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Outage on Friday 20th March\" \/>\n<meta property=\"og:description\" content=\"As you may have noticed from our 
status updates, we suffered a lengthy interruption in service this morning. To begin...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20\" \/>\n<meta property=\"og:site_name\" content=\"GoSquared Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/GoSquared\" \/>\n<meta property=\"article:published_time\" content=\"2015-03-20T13:48:13+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-11-28T11:53:03+00:00\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TheDeveloper\" \/>\n<meta name=\"twitter:site\" content=\"@GoSquared\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Geoff Wagstaff\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.gosquared.com\/blog\/#organization\",\"name\":\"GoSquared\",\"url\":\"https:\/\/www.gosquared.com\/blog\/\",\"sameAs\":[\"https:\/\/instagram.com\/gosquaredteam\",\"https:\/\/www.linkedin.com\/company\/go-squared-ltd.\",\"https:\/\/www.facebook.com\/GoSquared\",\"https:\/\/twitter.com\/GoSquared\"],\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.gosquared.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.gosquared.com\/blog\/wp-content\/uploads\/2015\/07\/gosquared.png\",\"contentUrl\":\"https:\/\/www.gosquared.com\/blog\/wp-content\/uploads\/2015\/07\/gosquared.png\",\"width\":1270,\"height\":250,\"caption\":\"GoSquared\"},\"image\":{\"@id\":\"https:\/\/www.gosquared.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.gosquared.com\/blog\/#website\",\"url\":\"https:\/\/www.gosquared.com\/b
log\/\",\"name\":\"GoSquared Blog\",\"description\":\"Turn visitors into customers.\",\"publisher\":{\"@id\":\"https:\/\/www.gosquared.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.gosquared.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#webpage\",\"url\":\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20\",\"name\":\"Outage on Friday 20th March - GoSquared Blog\",\"isPartOf\":{\"@id\":\"https:\/\/www.gosquared.com\/blog\/#website\"},\"datePublished\":\"2015-03-20T13:48:13+00:00\",\"dateModified\":\"2019-11-28T11:53:03+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.gosquared.com\/blog\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Outage on Friday 20th March\"}]},{\"@type\":\"Article\",\"@id\":\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#webpage\"},\"author\":{\"@id\":\"https:\/\/www.gosquared.com\/blog\/#\/schema\/person\/56a3341790c8a0603f96066fb8d42448\"},\"headline\":\"Outage on Friday 20th 
March\",\"datePublished\":\"2015-03-20T13:48:13+00:00\",\"dateModified\":\"2019-11-28T11:53:03+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#webpage\"},\"wordCount\":486,\"publisher\":{\"@id\":\"https:\/\/www.gosquared.com\/blog\/#organization\"},\"articleSection\":[\"GoSquared Updates\"],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.gosquared.com\/blog\/#\/schema\/person\/56a3341790c8a0603f96066fb8d42448\",\"name\":\"Geoff Wagstaff\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.gosquared.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/66792d2e4d04406697b9a5f322664691590a386bc15b7146d143bbca07aa8889?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/66792d2e4d04406697b9a5f322664691590a386bc15b7146d143bbca07aa8889?s=96&d=mm&r=g\",\"caption\":\"Geoff Wagstaff\"},\"sameAs\":[\"https:\/\/twitter.com\/TheDeveloper\"],\"url\":\"https:\/\/www.gosquared.com\/blog\/author\/echo\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Outage on Friday 20th March - GoSquared Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20","og_locale":"en_US","og_type":"article","og_title":"Outage on Friday 20th March","og_description":"As you may have noticed from our status updates, we suffered a lengthy interruption in service this morning. 
To begin...","og_url":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20","og_site_name":"GoSquared Blog","article_publisher":"https:\/\/www.facebook.com\/GoSquared","article_published_time":"2015-03-20T13:48:13+00:00","article_modified_time":"2019-11-28T11:53:03+00:00","twitter_card":"summary_large_image","twitter_creator":"@TheDeveloper","twitter_site":"@GoSquared","twitter_misc":{"Written by":"Geoff Wagstaff","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Organization","@id":"https:\/\/www.gosquared.com\/blog\/#organization","name":"GoSquared","url":"https:\/\/www.gosquared.com\/blog\/","sameAs":["https:\/\/instagram.com\/gosquaredteam","https:\/\/www.linkedin.com\/company\/go-squared-ltd.","https:\/\/www.facebook.com\/GoSquared","https:\/\/twitter.com\/GoSquared"],"logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.gosquared.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.gosquared.com\/blog\/wp-content\/uploads\/2015\/07\/gosquared.png","contentUrl":"https:\/\/www.gosquared.com\/blog\/wp-content\/uploads\/2015\/07\/gosquared.png","width":1270,"height":250,"caption":"GoSquared"},"image":{"@id":"https:\/\/www.gosquared.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"WebSite","@id":"https:\/\/www.gosquared.com\/blog\/#website","url":"https:\/\/www.gosquared.com\/blog\/","name":"GoSquared Blog","description":"Turn visitors into customers.","publisher":{"@id":"https:\/\/www.gosquared.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.gosquared.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#webpage","url":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20","name":"Outage on Friday 20th March - GoSquared 
Blog","isPartOf":{"@id":"https:\/\/www.gosquared.com\/blog\/#website"},"datePublished":"2015-03-20T13:48:13+00:00","dateModified":"2019-11-28T11:53:03+00:00","breadcrumb":{"@id":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.gosquared.com\/blog\/outage-2015-03-20"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.gosquared.com\/blog"},{"@type":"ListItem","position":2,"name":"Outage on Friday 20th March"}]},{"@type":"Article","@id":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#article","isPartOf":{"@id":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#webpage"},"author":{"@id":"https:\/\/www.gosquared.com\/blog\/#\/schema\/person\/56a3341790c8a0603f96066fb8d42448"},"headline":"Outage on Friday 20th March","datePublished":"2015-03-20T13:48:13+00:00","dateModified":"2019-11-28T11:53:03+00:00","mainEntityOfPage":{"@id":"https:\/\/www.gosquared.com\/blog\/outage-2015-03-20#webpage"},"wordCount":486,"publisher":{"@id":"https:\/\/www.gosquared.com\/blog\/#organization"},"articleSection":["GoSquared Updates"],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.gosquared.com\/blog\/#\/schema\/person\/56a3341790c8a0603f96066fb8d42448","name":"Geoff Wagstaff","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.gosquared.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/66792d2e4d04406697b9a5f322664691590a386bc15b7146d143bbca07aa8889?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/66792d2e4d04406697b9a5f322664691590a386bc15b7146d143bbca07aa8889?s=96&d=mm&r=g","caption":"Geoff Wagstaff"},"sameAs":["https:\/\/twitter.com\/TheDeveloper"],"url":"https:\/\/www.gosquared.com\/blog\/author\/echo"}]}},"wps_subtitle":"An explanation of what 
happened","_links":{"self":[{"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/posts\/5790","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/comments?post=5790"}],"version-history":[{"count":0,"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/posts\/5790\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/media?parent=5790"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/categories?post=5790"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gosquared.com\/blog\/wp-json\/wp\/v2\/tags?post=5790"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}