As many of you have noticed, we’ve had some disruptions to our email testing services today. We’d like to apologize for these disruptions, explain what went wrong, and share what we’re doing to fix the issue.
At 1:41 AM PDT, our primary hosting company, Amazon, began to experience some very serious issues with the part of their system that manages the virtual hard drives that we use to store our (and many other companies’) machines. These issues caused a ripple effect through their hosting hardware, adding more pressure than it could handle, and after a few hours Amazon was unable to start any additional machines.
Litmus automatically scales up our capacity as our demand grows during the day, then scales back down by turning machines off as we slow down for our quieter periods. The issues began to kick in during an especially quiet time for Litmus. As these issues prevented Amazon customers from starting new machines, Litmus was unable to increase capacity to meet the growing demand as European clients started their day using Litmus. We spent the better part of the day running on just 16 testing machines (our minimum capacity during the very early hours of the morning).
Key members of our development team were initially notified when it appeared we were under capacity for that time of the day. As the US east coast woke up and began to add their testing demand, all members of staff were notified that something very serious had happened and that Litmus was no longer able to meet the current demand. We worked hard to find capacity wherever we could. We used all of our physical hardware to help process the rapidly growing queues, scaled up hosting with the other providers we use, and did our best to get new machines starting on Amazon. By 9:30 AM PDT we had managed to start an extra 14 machines on Amazon, bringing the total to 30. At this time of the day, we would typically be running over 250 machines.
Amazon have been in touch to let us know that the problems are almost resolved and that we should start to see our extra capacity come online in the next hour or two. Indeed, we’ve already jumped up to 51 online machines and that number is rising fast.
This service disruption on Amazon’s part is incredibly rare; this is the first time we’ve seen an outage of this size. Because a number of other large websites use Amazon’s hosting services, we’re not alone in experiencing serious issues this morning. Reddit, Foursquare, Quora and many other services are either down or running at heavily reduced capacity.
Please note that our Email Analytics platform, which is designed so that capacity always heavily outpaces demand, has not been affected. All Email Analytics servers are online and crunching data as normal.
We will continue to keep you updated on Twitter at http://www.twitter.com/litmusapp.
Thank you for your patience,
The Litmus Team