A little bit of chaos

How would your system behave if a monkey entered your data center, randomly ripped out cables, and flung things around?

If you’re not familiar with the concept of chaos engineering, you should be.

Often conflated with Chaos Monkey, which is just one high-profile early example, chaos engineering is the practice of intentionally breaking things in production to build confidence in a system’s ability to withstand unexpected conditions.

As explained in the book Chaos Monkeys (the book really has little to do with engineering, but it’s still an entertaining read):

Imagine a monkey entering a ‘data center’, these ‘farms’ of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.

How would your system behave if a monkey entered your data center and did those things?

Wouldn’t it be great if, under the influence of such chaos, your system continued to operate or, at the very least, informed you of the problems in time to fix them?

If you’re hosting your service in the cloud, the big cloud providers give you a really easy* way to introduce a little bit of chaos into your system and save money at the same time.

What’s the trick? Unstable VMs!

Amazon offers EC2 Spot Instances, Google offers [Preemptible VM instances](https://cloud.google.com/compute/docs/instances/preemptible), and Azure has its Spot VMs.

The concept is simple: You get a discount on the normal VM fees in exchange for practically no stability guarantee. In fact, in some configurations, you’re guaranteed that your VM will live at most 24 hours.

Aside from saving money, why would this be helpful?

Well, if you want to be sure that your application can withstand random restarts and crashes, what better way to test that than to host it entirely on Spot/Preemptible VMs, which may force a restart every few hours?
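To make "withstand random restarts" a bit more concrete, here is a minimal Python sketch of one common pattern: save progress after every unit of work, and stop cleanly when asked to. Everything in it is hypothetical and for illustration only (the `CheckpointedWorker` class, the JSON checkpoint file, the `install_sigterm_handler` helper). It also assumes that your platform's shutdown sequence delivers SIGTERM to your process before the VM goes away, which you should verify for your own setup.

```python
import json
import os
import signal
import tempfile


class CheckpointedWorker:
    """Processes items in order and persists progress so a restart can resume."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Usable as a signal handler: only set a flag, so the loop
        # can exit at a safe point between items.
        self.stop_requested = True

    def load_next_index(self):
        # No checkpoint file means this is the first run.
        try:
            with open(self.checkpoint_path) as f:
                return json.load(f)["next_index"]
        except FileNotFoundError:
            return 0

    def save_next_index(self, next_index):
        # Write to a temp file, then rename, so a kill in the middle
        # of a write cannot leave a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.checkpoint_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"next_index": next_index}, f)
        os.replace(tmp_path, self.checkpoint_path)

    def run(self, items, handle):
        # Resume from the last saved position and stop early if asked to.
        index = self.load_next_index()
        while index < len(items) and not self.stop_requested:
            handle(items[index])
            index += 1
            self.save_next_index(index)
        return index


def install_sigterm_handler(worker):
    # Hypothetical helper: ask the worker to stop when SIGTERM arrives.
    signal.signal(signal.SIGTERM, worker.request_stop)
```

If the process is killed after `handle` finishes but before the checkpoint is saved, that item is processed again on the next start, so `handle` needs to be safe to repeat. Note also that the checkpoint has to live somewhere that survives the VM (a persistent disk or a remote store), not on the instance's ephemeral storage.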

To be clear, this is not a full-fledged replacement for proper chaos engineering. It’s a baby step. But it’s an easy* one.


*Easy, you say? Well, setting up Spot/preemptible VMs is easy. You can do that in just a few minutes. The hard part comes when you discover that your application doesn’t work well when it’s being constantly restarted or goes down sometimes. But then again, that is the point of the exercise, right?
