Backcountry.com is one of the largest online specialty retailers of clothing and outdoor recreation gear. They have offices in Park City, Salt Lake City, Portland, Virginia, Costa Rica, and Germany. They rely on software-driven machinery to convert online purchases into ready-to-ship packages with little to no human intervention in their distribution center.
Normally I would not consider destructive testing in production, but Gremlin made it easy and safe.
As part of their normal ecommerce preparation for Black Friday and Cyber Monday, Backcountry goes through a long period of testing to root out bugs that could cause service disruptions. In 2018, Backcountry's engineering team adopted Chaos Engineering to take a more proactive stance toward their Black Friday preparation.
During Black Friday 2017, traffic loads impacted Backcountry's SLOs even though the team had followed preparation best practices. To prevent similar issues in the future, their forward-looking engineering organization, led by Director of Engineering Gustavo Leiva and Principal Software Engineer Jose Esquivel, sought a new way to test for potential Black Friday incidents.
There was a strong business case for Chaos Engineering as a potential solution because it gives teams the ability to test for real world outage scenarios and also gives the business confidence in production system resilience. Despite this business case, Backcountry had no previous experience with Chaos Engineering and would need support to successfully introduce it into their engineering culture.
And because the testing would be in production, Backcountry needed software that they were confident could run safely and securely.
We don't have a shipping warehouse in staging. If we were going to be confident our systems would be stable during peak traffic, we had to test in production.
Backcountry's search for Chaos Engineering software included looking at open source options, building their own solution, and using enterprise tooling. They ultimately elected to use Gremlin's hosted solution, which runs on AWS to keep the platform robust and scalable. Building their own tool would have taken time away from their existing feature roadmap, and they wanted to begin their Chaos Engineering journey as quickly as possible. Their requirements for rigorous safety and security quickly ruled out open source options, because the offerings available at the time lacked security features and support. Gustavo and Jose worked with Gremlin's success team to plan a GameDay that would recreate conditions from Black Friday 2017 and proactively look for other gaps as well. Gremlin was easy to install and configure, which allowed their team to get up and running very quickly. The plan also included specific SLO abort criteria: if any of them were reached, the team would use Gremlin's Halt All Attacks feature to restore warehouse operations to a steady state.
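To make the idea of SLO abort criteria concrete, here is a minimal Python sketch of how a GameDay plan might encode them. It is an illustration only, not Backcountry's or Gremlin's implementation: the metric names, thresholds, and the `halt` callback are all hypothetical, and the callback simply stands in for whatever mechanism stops the experiment (in Backcountry's case, Gremlin's Halt All Attacks feature).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AbortCriterion:
    """One SLO limit that, if crossed, should end the experiment."""
    metric: str                   # hypothetical metric name
    limit: float                  # hypothetical threshold agreed before the GameDay
    higher_is_worse: bool = True  # False for metrics such as throughput


def breached(criteria: List[AbortCriterion], metrics: Dict[str, float]) -> List[str]:
    """Return the names of all criteria violated by the current metrics."""
    hits = []
    for c in criteria:
        value = metrics.get(c.metric)
        if value is None:
            # Missing telemetry counts as a breach: without data there is
            # no evidence the system is still within its SLO.
            hits.append(c.metric)
        elif c.higher_is_worse and value > c.limit:
            hits.append(c.metric)
        elif not c.higher_is_worse and value < c.limit:
            hits.append(c.metric)
    return hits


def check_and_halt(
    criteria: List[AbortCriterion],
    metrics: Dict[str, float],
    halt: Callable[[], None],
) -> List[str]:
    """Call `halt` once if any criterion is breached; return the breaches."""
    hits = breached(criteria, metrics)
    if hits:
        halt()  # assumption: stands in for the real "stop everything" action
    return hits


# Illustrative values only; the source does not state the actual criteria.
CRITERIA = [
    AbortCriterion("p99_latency_ms", 800.0),
    AbortCriterion("orders_packed_per_min", 40.0, higher_is_worse=False),
]
```

In a real GameDay, a loop or monitoring alert would call something like `check_and_halt` at a regular interval while the experiment runs. Treating missing telemetry as a breach is a deliberately conservative choice for tests run in production.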
We considered building our own tooling as well as the available open source tools. Gremlin was the only solution mature enough to make us comfortable running in production.
Gustavo Leiva
Director of Engineering
The SLO disruptions in 2017 lasted 72 hours. We incorporated Chaos Engineering with Gremlin into our Q4 preparation and we had zero incidents in 2018.
Diagnosing the SLO issues in 2017 took hours. We used Chaos Engineering to reduce the system's Time to Diagnose to less than 5 minutes by testing and tuning our logging, monitoring, and traceability.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.