It all started off with such a simple idea – we had a client, their site was not performing well, so we started load testing. Now we didn’t have LoadRunner or any other expensive load testing solution to hand, so we opted for a web-based system instead.
The system ran really well. In fact it did exactly what we wanted (albeit with a slightly chunky interface), and an intellectual challenge was born – surely it must be easy as pie to write a script that will zombie a bunch of servers in the cloud and point them at a target… an ethical DDoS.
So Loadzen was born – as a Python shell script and some cobbled-together RPC code.
It was only after it actually worked, and proved surprisingly effective, that we thought about taking it to market, and so began the long road to making it market-worthy.
As this is a technical blog for technical people, let’s talk about what it does under the hood…
The whole system runs on a three-tier architecture: the website, the job server and the load generators.
This separation is essentially so that the website acts as a client to the job server, ensuring the machinery that manages and generates tests is fully separated and isolated from the business end. We can bring down the site and the job server will continue running (and retain your results).
The job server spawns generators as needed to meet the load required by the test being run at that moment.
But that’s not all. The system uses a single thread for each “virtual user”, and because we know how many threads any given generator can support, we simply meter them out accordingly. This way we can load-share multiple tests on the same load generators, with some processes running one set of test scenarios while others run a completely different test. This ensures maximum utilisation of the systems running at that moment (they’re expensive!) and means we’re not just spawning a ton of new servers for each test.
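To make the metering idea concrete, here is a minimal sketch of how virtual users might be packed onto generators. It is an illustration rather than Loadzen's actual code: the `Generator` class, the `allocate` function and the 500-thread capacity are all assumptions made up for the example.

```python
class Generator:
    """A load generator with a fixed thread budget (one thread per virtual user)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # hypothetical max threads this box can run
        self.assigned = {}            # test id -> number of virtual users

    @property
    def free(self):
        return self.capacity - sum(self.assigned.values())


def allocate(test_id, users, generators, capacity=500):
    """Place `users` virtual users for `test_id`, filling spare capacity first.

    New generators are only added once every existing one is full.
    """
    remaining = users
    for gen in generators:
        if remaining == 0:
            break
        take = min(gen.free, remaining)
        if take > 0:
            gen.assigned[test_id] = gen.assigned.get(test_id, 0) + take
            remaining -= take
    while remaining > 0:
        gen = Generator("gen-%d" % (len(generators) + 1), capacity)
        take = min(gen.capacity, remaining)
        gen.assigned[test_id] = take
        remaining -= take
        generators.append(gen)
    return generators


pool = []
allocate("test-a", 700, pool)   # needs two generators: 500 + 200
allocate("test-b", 400, pool)   # 300 fit beside test-a, 100 spill onto a third
for gen in pool:
    print(gen.name, gen.assigned, "free:", gen.free)
```

The second test reuses the spare threads on the second generator before anything new is spawned, which is the whole point of sharing generators between tests.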
This is the basic architecture of the system at the highest level, but there are a few cool little tricks in the overall architecture that we’ll get into later as we discuss the feature set.
The standard workflow for a load test is:
- Identify your use cases
- Create scenarios for those use cases
- Determine the ‘mix’ of the use cases (e.g. 20% of visitors will buy something, 50% will bounce and 30% will just browse or search)
- Set up the test and the load maximum
- Run the test
Loadzen does all of the above. The load generators automatically scale out the ‘mix’ of tests based on the growth rate of the test curve, and they act in complete lock-step so that each ‘wave’ of users starts at the same time. They also strive to introduce some realistic behaviour by running the virtual users at various stages of ‘drunkenness’, randomly varying their step-rate through a scenario so that we simulate more realistic user loads.
The load generators then record and average out each wave and report the data back to the job server, which stores it and makes it available to a client.
The load generators and the job server are both written in Python and talk over Pyro RPC. The reason for this choice? Complete object transparency and interoperability between client and server: load generators have access to job server functions, and job servers can pass test objects to load generators with a single function call and no translation layer. This is a little fiddly, but in the end it lets us just code without worrying about data types and formatting errors.
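As a rough sketch of the two ideas above, the snippet below splits one wave of users across a scenario mix and generates randomised pauses between scenario steps. The function names, the `drunkenness` parameter and the way the jitter is calculated are all assumptions for illustration, not how Loadzen necessarily does it.

```python
import random


def split_wave(wave_size, mix):
    """Split one wave of users across scenarios by share.

    `mix` maps scenario name -> share (shares are expected to sum to 1.0).
    Users lost to rounding down go to the largest fractional parts.
    """
    exact = {name: wave_size * share for name, share in mix.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = wave_size - sum(counts.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_fraction[:max(leftover, 0)]:
        counts[name] += 1
    return counts


def step_delays(steps, base_delay, drunkenness, rng):
    """Return a pause (in seconds) before each step of a scenario.

    `drunkenness` runs from 0 to 1: 0 means steady pacing, 1 means each
    pause can land anywhere between 0 and twice the base delay.
    """
    return [base_delay * (1 + drunkenness * rng.uniform(-1, 1))
            for _ in range(steps)]


mix = {"buy": 0.2, "bounce": 0.5, "browse": 0.3}
print(split_wave(25, mix))

rng = random.Random(42)   # seeded only so the example is repeatable
print(step_delays(steps=4, base_delay=2.0, drunkenness=0.5, rng=rng))
```

Every user in the wave is always assigned to exactly one scenario, and a more ‘drunk’ user simply wanders further from the base pace.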
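The record-average-report step could look something like the sketch below. The summary fields and the `JobServerStub` class are hypothetical, and a plain method call stands in for the real RPC hop between generator and job server.

```python
from statistics import mean


def summarise_wave(wave_number, response_times):
    """Collapse one wave's per-user response times (seconds) into a summary."""
    return {
        "wave": wave_number,
        "users": len(response_times),
        "avg_response": mean(response_times),
        "max_response": max(response_times),
    }


class JobServerStub:
    """Stand-in for the job server: it just keeps whatever it is sent."""

    def __init__(self):
        self.results = []

    def report(self, summary):
        self.results.append(summary)


server = JobServerStub()
server.report(summarise_wave(1, [0.25, 0.5, 0.75]))
server.report(summarise_wave(2, [0.5, 1.0, 1.5, 2.0]))
print(server.results)
```

Sending one small summary per wave, rather than every individual timing, keeps the traffic back to the job server light however many virtual users are running.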
Both the job server and the load generators run as instances in Amazon EC2, and are controlled using the rather awesome Boto library for AWS.
The website is written in Django with MySQL and a shiny fat server provided by the good folks at Media Temple.
Probably the most interesting part of the website is the real-time results and control feed that is kicked off every time a test is started. This is actually a real-time push feed from the job server that uses a bastardisation of Socket.IO and an event-driven web server for Python called Tornado (from those good folks at FriendFeed), all backed by an infrastructure queue powered by RabbitMQ and Pika.
The actual infrastructure for the site looks like this when we introduce these systems back in (and to think, all of this effort just so you get some shiny animations and a graph on a screen!):
Can I just say this now: I love RabbitMQ, I have fallen in love with real-time systems thanks to setting this up – it’s amazing how your viewpoint changes when you start thinking in terms of queues and channels and processors. When this feed was set up we briefly considered completely re-tooling the system to have a full-blown RabbitMQ back-end to power ALL of the things.
Pragmatism (thankfully) won out.
By running a separate Tornado server to handle the push feed we again managed to decouple everything: the thing that manages the transport is decoupled from the web client, which is decoupled from the work generator, ensuring we can work on each independently and not have a monolithic code base.
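The decoupling is easier to see in miniature. The sketch below is not the real stack: an in-process `queue.Queue` stands in for RabbitMQ, a thread stands in for the Tornado push server, and all the function names are made up. What it does show is the shape of the thing: the producer only knows about the queue, and the consumer only knows about the queue and its clients.

```python
import json
import queue
import threading

feed = queue.Queue()   # stand-in for a RabbitMQ queue
STOP = object()        # sentinel telling the consumer to shut down


def job_server(results):
    """Producer: publishes each result as a JSON message and moves on."""
    for result in results:
        feed.put(json.dumps(result))
    feed.put(STOP)


def push_server(deliver):
    """Consumer: forwards every message to connected clients via `deliver`."""
    while True:
        message = feed.get()
        if message is STOP:
            break
        deliver(json.loads(message))


received = []   # pretend this is the browser receiving the push feed
consumer = threading.Thread(target=push_server, args=(received.append,))
consumer.start()
job_server([{"wave": 1, "avg_response": 0.5},
            {"wave": 2, "avg_response": 1.25}])
consumer.join()
print(received)
```

Because neither side calls the other directly, either one can be restarted or rewritten without touching the other, so long as the message format stays the same.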
Making it easy to use
The day before launch (I really shouldn’t be admitting this), I wrote the Chrome extension client that made scenario creation MUCH easier than the wizard originally built into the website (although the manual wizard still had to be built to pave the way for other clients).
It’s a massive bit of kit that works but could always be made better. One of the key things learnt from this exercise is deciding when to make something work and when to make something beautiful. We all want to write stunning software with great, well-architected code, but if you want to get something out the door you need to make pragmatic choices about when to say “earmark it for the next build”, then iterate and improve as you go.
At the same time, we learnt to try to identify those bugs that seem niggling now but that you know will turn into cancerous, nasty, evil blobs you have to work around because you were too lazy to tackle the problem head on.
Anyway, I hope you guys enjoy load testing with Loadzen!
Happy coding (and testing),
- 10.01.12 / 10am