Technical details about our upgrade


Following our recent upgrade, Litmus has been running really fast. We’re very pleased. However, quite a few customers asked us for some more technical details about what exactly has been going on. This article should help explain exactly what the problems were and how we solved them.

It’s a little technical, but then that’s the point. Here goes…

The problem

Over the last couple of months it became abundantly clear our architecture was not designed to keep pace with our current growth rate. We were experiencing growing pains in the form of slower tests, and, in some cases, brief outages of the entire system.

Our testing system revolves around clients. These clients (hundreds of them) run your tests and deliver screenshots. The clients are managed by a queue, which deals with incoming tests and allocates them to the appropriate clients.

Our old architecture was designed to receive a request into a database. Our queue server would then find a client that could run that test. It would make a connection to that client and maintain that connection until the test was complete. We'd then go back to that same database and update the request with the results (the screenshots) we received from the client. This system had some notable drawbacks (there's a rough sketch of the old flow after the list):

  • While the connection to the client was open, the client couldn’t process other tests.
  • Under high demand, the queue server would start to lock up, preventing new tests from getting through.
  • The database we read new requests from and wrote results to was one and the same. This often led to database bottlenecks; our results database currently contains over 9 million completed tests.
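Here's that rough sketch of the old flow. It's illustrative Python-style pseudocode rather than our actual code; the objects and method names (fetch_oldest_pending_request, open_connection, update_request_with_results, and so on) are assumptions made for the example, not our real API:

```python
# Rough sketch of the OLD flow (hypothetical names, not our real code):
# one shared database, and one blocking connection per test, held open
# for the full duration of the test.

def process_next_request(db, clients):
    request = db.fetch_oldest_pending_request()        # read from the shared database
    client = next(c for c in clients if c.is_idle())   # find a client that can run it

    # The connection stays open until every screenshot is back, so this
    # client can't pick up any other work in the meantime.
    with client.open_connection() as conn:
        screenshots = conn.run_test(request)            # blocks until the test completes

    # Results go back into the SAME database the queue reads new requests
    # from, alongside millions of completed tests -- hence the bottleneck.
    db.update_request_with_results(request, screenshots)
```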

Our solution

Our solution was to create a new queuing system altogether. New requests would be queued separately from the existing 9 million results. The clients would poll asynchronously, and each client would run multiple tests at the same time. The connections to the clients would not persist.
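For contrast, here's an equally rough sketch of the new client-side loop. Again, this is illustrative Python rather than our real client code, and the queue calls (claim_next_test, submit_results) and the limits are assumptions made for the example:

```python
# Rough sketch of the NEW client loop (hypothetical names, not our real
# code): the client polls the queue for work, runs several tests
# concurrently, and never holds a long-lived connection to the queue.

import concurrent.futures
import time

MAX_CONCURRENT_TESTS = 4        # illustrative per-client limit
POLL_INTERVAL_SECONDS = 2       # illustrative polling interval

def capture_screenshots(test):
    """Placeholder: render the test in the target email/browser client."""
    raise NotImplementedError

def run_and_report(queue_api, test):
    screenshots = capture_screenshots(test)
    queue_api.submit_results(test, screenshots)   # short-lived call, then done

def client_loop(queue_api):
    with concurrent.futures.ThreadPoolExecutor(MAX_CONCURRENT_TESTS) as pool:
        while True:
            # The client asks the queue for work instead of waiting on a
            # connection pushed from the queue server.
            test = queue_api.claim_next_test()
            if test is None:
                time.sleep(POLL_INTERVAL_SECONDS)  # nothing queued; try again shortly
                continue
            # Run the test in the background so the loop can keep claiming
            # more work while earlier tests are still running.
            pool.submit(run_and_report, queue_api, test)
```

Because every call to the queue is short-lived, a slow test no longer ties up a queue-server connection, and a single client can keep several tests running at once.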

Sounds great, right? So what the heck went wrong?

Well, a couple of things, actually. We were stuck between a queue server that was dying a slow and painful death and the rush to implement its replacement. We constantly weighed “do we spend more time putting band-aids on a solution we knew wouldn’t last” vs. “let’s just blow it up and start over”. Blowing it up was definitely the riskier short-term option, but we had to think about the long-term health and viability of the system. In the end, since we had the new queue system ready to go, we rolled it out. Unfortunately, we quickly learned that we had not accurately load-tested the new system. While it was architecturally sound, it couldn’t handle the sheer volume of tests we were seeing.

The good news is that the new solution was very scalable. We just needed a little more time to scale it up: building more servers, getting a beefier database backend, and so on.

Late nights, early mornings, working it out

Once committed to the new queue, we worked day and night, quite literally. We went from 2 load-balanced queue servers to 9, each running dual ASP.NET processes. Effectively, we are now running 18 queue servers! We moved to a new database backend (from MySQL to Microsoft SQL Server), along with 64-bit processors, 8GB of RAM, the works. Now that the dust has settled on all our work, we’re starting to see some of the fastest-running Litmus tests we’ve ever seen.

What does this mean to you, our customer?

This new system is built to last. We can scale to however many queue servers we need. There’s room to grow the database server as well. Now that we have a robust queuing system in place, we can get back to focusing on what matters to you: more email clients, mobile clients, more Mac clients, better spam testing. We won’t have to spend our time putting band-aids on a system that should be transparent to you, the customer.

It’s possible that throughout this period you’ve lost confidence in us and our ability to deliver. That’s a fair sentiment. All we can do is earn your confidence back the same way we earned it the first time: by delivering the best email and browser testing platform possible. Every day.