Introduction to GameDays

Practice makes perfect

Gamedays are like fire drills -- an opportunity to practice a potentially dangerous scenario in a safer environment. They are the capstone which allows us to measure the resilience of a system. Running a Gameday tests our company -- from engagement to incident resolution, across team boundaries and job titles. It verifies the system at scale, ultimately in production. By proactively testing these events, we can choose the terms of engagement -- the time, the place and the root cause.

The first step to any drill is knowing what to practice. Are you evacuating one of your data centers or cloud provider's regions? Validating that the loss of a key dependency won't bring you down? Planning out the Gameday is a great opportunity to collaborate with other parts of the company, share context, and learn about the system as a whole.

Communicate

When in doubt, overcommunicate, especially when you're going to be breaking things. Let everyone interested or involved know your plan. Have a Command Center -- a chat room, conference bridge, or a conference room -- where anyone can check on the status of the Gameday. Share your success and abort criteria, as well as any dashboards you are watching. Having many eyes on the problem can speed up detection if things go wrong.

What happens if things go wrong?

Always have a rollback plan and a set of abort criteria. A common rule: if there is any customer-facing impact beyond what is expected, roll back the experiment and begin an investigation. Monitoring, alerting, and engagement are key parts of your system to verify.
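One way to make abort criteria concrete is to write them down as thresholds and check them after each injected failure. The sketch below is only an illustration: the AbortCriterion type, the metric names, and the inject, rollback, and read_metrics callables are all hypothetical stand-ins for whatever your own tooling provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortCriterion:
    """Hypothetical threshold on one metric; exceeding it halts the Gameday."""
    metric: str
    max_allowed: float


def breached(criteria, observed):
    """Return the criteria whose metric is above its threshold.

    A metric that is missing from `observed` counts as a breach: if we
    cannot see it, we cannot claim the exercise is safe to continue.
    """
    failed = []
    for criterion in criteria:
        value = observed.get(criterion.metric)
        if value is None or value > criterion.max_allowed:
            failed.append(criterion)
    return failed


def run_step(inject, rollback, read_metrics, criteria):
    """Inject one failure, check the abort criteria, and always roll back.

    Returns the names of the breached metrics (empty list means the step passed).
    """
    inject()
    try:
        failed = breached(criteria, read_metrics())
    finally:
        # Rollback runs even if reading metrics raises.
        rollback()
    return [criterion.metric for criterion in failed]
```

Treating a missing metric as a breach is a deliberate choice here: it also exercises the monitoring you are trying to verify.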

Start small

It is critical to understand and minimize the blast radius of an exercise. Run your Gameday first in a test or staging environment. Start with the smallest blast radius that will teach you something about your system. This may be breaking a single container, degrading a single instance, or injecting failure into a single request. Next, fail the entire service, an entire zone, or a percentage of requests. At each step you either gain confidence in your system or find an issue that needs to be fixed.
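Failing "a percentage of requests" can be sketched as a wrapper around a request handler. This is a minimal illustration in Python, not how any particular tool does it; the names with_failure_injection and InjectedFailure are invented for the example.

```python
import random


class InjectedFailure(Exception):
    """Raised instead of calling the real handler when a request is selected."""


def with_failure_injection(handler, fraction, rng=None):
    """Wrap `handler` so that roughly `fraction` of calls raise InjectedFailure.

    `fraction` is the blast radius: 0.0 fails nothing, 1.0 fails every call.
    `rng` is optional so a seeded random.Random can make runs repeatable.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # random() returns a float in [0.0, 1.0), so 0.0 never fires
        # and 1.0 always fires.
        if rng.random() < fraction:
            raise InjectedFailure("injected by Gameday exercise")
        return handler(*args, **kwargs)

    return wrapped
```

Starting with a small fraction such as 0.01 and raising it only after the previous level passes keeps the exercise inside a known blast radius.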

Dial it up

There is a benefit to starting small and dialing it up -- different scales teach us different characteristics of our system. At small scale we test the functional: Do we handle exceptional cases correctly? Is our system usable in a degraded state? At large scale we learn about resource constraints and cascading failure: Do we protect ourselves if traffic builds up? Are our timeouts set aggressively enough? Only by testing the small and the large scale will we be prepared for what will occur in the real world.
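The progression from small to large can be written as a simple loop that stops at the first scale where something breaks. The sketch below assumes a hypothetical run_stage callable that runs the exercise at a given size and returns True if the system stayed within its abort criteria; the stage sizes are arbitrary examples.

```python
def dial_up(stages, run_stage):
    """Run stages from smallest to largest and stop at the first failure.

    Returns (completed, failed_stage). `failed_stage` is None when every
    stage passed.
    """
    completed = []
    for size in sorted(stages):
        if not run_stage(size):
            # Larger stages are never attempted once a smaller one fails.
            return completed, size
        completed.append(size)
    return completed, None


# Example: a stand-in system that copes until half of its requests are failing.
completed, failed_at = dial_up([0.25, 0.01, 1.0, 0.5, 0.05], lambda size: size < 0.5)
print(completed, failed_at)  # [0.01, 0.05, 0.25] 0.5
```

The stage that fails is the finding: it marks the scale at which to investigate resource limits, timeouts, or cascading behavior before going further.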

Run in Prod

The end goal of any Gameday is to run in Production.

"Fail services regularly. Take down data centers, shut down racks, and power off servers. Regular controlled brown-outs will go a long way to exposing service, system, and network weaknesses. Those unwilling to test in production aren't yet confident that the service will continue operating through failures. And, without production testing, recovery won't work when called upon."
-- James Hamilton, On Designing and Deploying Internet-Scale Services

It's production that matters at the end of the day. That's where customers live and where the money is made. It's production's configuration that counts if things go wrong. Many systems are tuned for the 'happy case' and find themselves woefully unprepared when failure strikes.

Train your team

When failure does strike, there is little time for learning during the event. Don't train your on-call engineers by handing them a pager and wishing them good luck. Teams need to proactively test their reactive skills! By regularly testing important scenarios, your teams will build muscle memory and be able to act quickly and confidently in a crisis. De-mystify and de-stress your incidents by practicing in advance!
