Chaos Monkey Guide for Engineers

Tips, Tutorials, and Training

In 2010, Netflix announced the existence and success of their custom resiliency tool, Chaos Monkey.

What is Chaos Monkey?

In 2010, Netflix decided to move their systems to the cloud. In this new environment, hosts could be terminated and replaced at any time, which meant their services needed to prepare for this constraint. By pseudo-randomly rebooting their own hosts, they could suss out any weaknesses and validate that their automated remediation worked correctly. This also helped find "stateful" services, which rely on host resources (such as a local cache or database), as opposed to stateless services, which store such things on a remote host.

Netflix designed Chaos Monkey to test system stability by forcing failures through the pseudo-random termination of instances and services within Netflix's architecture. Following their migration to the cloud, Netflix's service was newly reliant upon Amazon Web Services and needed a technology that could show them how their system responded when critical components of their production service infrastructure were taken down. Intentionally causing this one kind of failure would expose weaknesses in their systems and guide them towards automated solutions that gracefully handle future failures of this sort.

Chaos Engineering is "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production."

Chaos Monkey helped jumpstart Chaos Engineering as a new engineering practice. Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds to failure conditions, you can find and fix weaknesses before they turn into public-facing outages. Chaos Engineering lets you compare what you think will happen with what actually happens in your systems. By performing the smallest possible experiments you can measure, you're able to "break things on purpose" in order to learn how to build more resilient systems.

In 2011, Netflix announced the evolution of Chaos Monkey with a series of additional tools known as The Simian Army. Inspired by the success of their original Chaos Monkey tool aimed at randomly disabling production instances and services, the engineering team developed additional "simians" built to cause other types of failure and induce abnormal system conditions. For example, the Latency Monkey tool introduces artificial delays in RESTful client-server communication, allowing the team at Netflix to simulate service unavailability without actually taking down said service. This guide will cover all the details of these tools in The Simian Army chapter.
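To make the Latency Monkey idea concrete, here is a minimal sketch of delay injection around a client call. It is not Netflix's implementation; the function name with_injected_latency and all of its parameters are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_injected_latency(
    call: Callable[[], T],
    min_delay_s: float,
    max_delay_s: float,
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, first sleeping for a random delay with the given probability.

    Hypothetical helper: it only illustrates the idea of simulating a slow
    dependency without taking that dependency down.
    """
    rng = rng or random.Random()
    # Decide whether this particular request is affected by the experiment.
    if rng.random() < probability:
        # Pick a delay uniformly between the two bounds and wait.
        sleep(rng.uniform(min_delay_s, max_delay_s))
    return call()
```

You would wrap the code that issues a request, for example with_injected_latency(lambda: fetch_profile(user_id), 0.5, 2.0, probability=0.1), where fetch_profile stands in for your own client function. The caller's timeouts and fallbacks then get exercised while the downstream service stays up.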

What Is This Guide?

The Chaos Monkey Guide for Engineers is a full how-to of Chaos Monkey, including what it is, its origin story, its pros and cons, its relation to the broader topic of Chaos Engineering, and much more. We've also included step-by-step technical tutorials for getting started with Chaos Monkey, along with advanced engineering tips and guides for those looking to go beyond the basics. The Simian Army section explores all the additional tools created after Chaos Monkey.

This guide also includes resources, tutorials, and downloads for engineers seeking to improve their own Chaos Engineering practices. In fact, our alternative technologies chapter goes above and beyond by examining a curated list of the best alternatives to Chaos Monkey -- we dig into everything from Azure and Docker to Kubernetes and VMware!

Who Is This Guide For?

We've created this guide primarily for engineers who want an in-depth resource on Chaos Monkey as a way to get started with Chaos Engineering. We want to help readers see how Chaos Monkey fits into the practice of Chaos Engineering.

Why Did We Create This Guide?

Gremlin's goal is to empower engineering teams to build more resilient systems through thoughtful Chaos Engineering. We're on a constant quest to promote the Chaos Community through frequent conferences & meetups, in-depth talks, detailed tutorials, and the ever-growing list of Chaos Engineering Slack channels.

While Chaos Engineering extends well beyond the scope of one single technique or idea, Chaos Monkey is the best-known tool for running Chaos Experiments and is a common starting place for engineers new to the discipline.

The Pros and Cons of Chaos Monkey

Chaos Monkey is designed to induce one specific type of failure. It randomly shuts down instances in order to simulate random server failure.

Pros of Chaos Monkey

Prepares You for Random Instance Failures
Chaos Monkey allows for planned instance failures when you and your team are best prepared to handle them. You can schedule terminations so they occur based on a configurable mean number of days and during a given time period each day.
Encourages Redundancy
Redundancy is part and parcel of a distributed architecture, and encouraging it is another major benefit of smart Chaos Engineering practices. If a single service or instance is brought down unexpectedly, a redundant backup may save the day.
Built Into Spinnaker
Chaos Monkey Version 2.0 relies on Spinnaker. This is both a pro and a con. It enables cross-cloud compatibility but requires that you use Spinnaker.
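
The scheduling idea mentioned above (a configurable mean number of days between terminations, limited to a daily time window) can be sketched as follows. This is an illustration of the concept under simple assumptions, not Chaos Monkey's actual code; the function schedule_termination and its parameter names are hypothetical.

```python
import random
from datetime import date, datetime, time, timedelta
from typing import Optional


def schedule_termination(
    day: date,
    mean_days_between_kills: float,
    start_hour: int,
    end_hour: int,
    rng: random.Random,
) -> Optional[datetime]:
    """Return a termination time on `day`, or None if nothing is scheduled.

    Assumption for this sketch: each day is an independent coin flip with
    probability 1 / mean_days_between_kills, so terminations happen on
    average once every `mean_days_between_kills` days.
    """
    if mean_days_between_kills < 1:
        raise ValueError("mean_days_between_kills must be at least 1")
    if not 0 <= start_hour < end_hour <= 24:
        raise ValueError("expected 0 <= start_hour < end_hour <= 24")

    # Flip the biased coin for today.
    if rng.random() >= 1.0 / mean_days_between_kills:
        return None

    # Pick a uniformly random second inside the allowed daily window.
    window_seconds = (end_hour - start_hour) * 3600
    offset = rng.randrange(window_seconds)
    return datetime.combine(day, time(hour=start_hour)) + timedelta(seconds=offset)
```

With a mean of 5 days and a 9-to-15 window, roughly one day in five produces a termination time, and it always lands during hours when the team is around to respond.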

Cons of Chaos Monkey

Requires Spinnaker
As discussed in The Origin of Chaos Monkey, Chaos Monkey does not support deployments that are managed by anything other than Spinnaker.
Requires MySQL
Chaos Monkey also requires the use of MySQL 5.X, as discussed in more detail in the Chaos Monkey Tutorial chapter.
Limited Failure Mode
Chaos Monkey's limited scope means it injects only one type of failure: pseudo-random instance termination. Thoughtful Chaos Engineering is about looking at an application's future, toward unknowable and unpredictable failures, beyond those of a single AWS instance. Chaos Monkey only handles one of the "long tail" failures that software will experience during its life cycle. Check out the Chaos Monkey Alternatives chapter for more information.
Lack of Coordination
While Chaos Monkey can terminate instances and cause failures, it offers little in the way of coordination. Since Chaos Monkey is an open-source tool that was built by and for Netflix, it's left to you as the end-user to inject your own system-specific logic. Bringing down an instance is great and all, but knowing how to coordinate and act on that information is critical.
No Recovery Capabilities
A big reason why Chaos Engineering encourages performing the smallest possible experiments is so any repercussions are somewhat contained -- if something goes awry, it's ideal to have a safety net or the ability to abort the experiment. Unfortunately, Chaos Monkey doesn't include such safety features. Many other tools and services do have these capabilities, including Gremlin's Halt All button, which immediately stops all active experiments.
No User Interface
As with many open-source projects, Chaos Monkey is entirely executed through the command line, scripts, and configuration files. If your team wants an interface, it's up to you to build it.
Limited Helper Tools
By itself, Chaos Monkey fails to provide many useful functions such as auditing, outage checking, termination tracking, and so forth. Spinnaker supports a framework for creating your own Chaos Monkey auditing through its Echo events microservice, but you'll generally be required to either integrate with Netflix's existing software or to create your own custom tools in order to get much info out of Chaos Monkey.
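
As a rough illustration of the safety net described under No Recovery Capabilities, the sketch below runs an experiment step by step and stops as soon as a halt is requested or a health check fails. It does not represent Chaos Monkey's or Gremlin's API; the Experiment class, its methods, and the health check callback are hypothetical names for this example.

```python
import threading
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


class Experiment:
    """Hypothetical abortable experiment: a list of named failure-injection steps."""

    def __init__(self, steps: Sequence[Step], is_healthy: Callable[[], bool]) -> None:
        self._steps = list(steps)
        self._is_healthy = is_healthy
        self._halt = threading.Event()

    def halt(self) -> None:
        """Request an abort; safe to call from another thread."""
        self._halt.set()

    def run(self) -> List[str]:
        """Run steps in order and return the names of the steps that ran."""
        completed: List[str] = []
        for name, action in self._steps:
            # Check the abort conditions before every step so the blast
            # radius never grows once something has gone wrong.
            if self._halt.is_set() or not self._is_healthy():
                break
            action()
            completed.append(name)
        return completed
```

The design choice here is to check the abort conditions before each step instead of only at the start, which keeps every experiment as small as the last step that completed.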