The Comprehensive Chaos Engineering Platform

Everything you need to safely, securely, and simply build reliable software through Chaos Engineering.

Use Gremlin's comprehensive set of failure modes to experiment across your system, including bare metal, any cloud provider, containerized environments, Kubernetes, applications, and serverless.

Resource Gremlins
Throttle CPU, Memory, I/O, and Disk
State Gremlins
Reboot hosts, kill processes, travel in time
Network Gremlins
Introduce latency, blackhole traffic, lose packets, fail DNS

Test for failure in your code
Fail or delay serverless functions
Narrow the impact to a single user, device, or percentage of traffic

Test anywhere.

Virtual Machines

Containers

Kubernetes

Serverless

Bare Metal

Gremlin is designed with redundant failsafes that restore your system to a healthy state at the first sign of trouble.

Halt and roll back all experiments with a single click
Trigger rollbacks based on your monitoring
Status Checks prevent experiments from running when systems are unstable

Gremlin is SOC 2 compliant and follows industry-standard security practices.

Least Permissions

Gremlin runs on default Linux permissions and doesn’t require root access

Ready for Production

Multi-factor authentication, secure single sign-on (SSO), and Role-Based Access Control (RBAC)

Audit Trails

Every action on the platform is tracked for compliance

Third-Party Testing

Gremlin undergoes regular security audits by a third party


Get up and running in 3 lines of code. Manage Gremlin from our intuitive UI or the command line.

echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C81FC2F43A48B25808F9583DBFF170F324D41134 9CDB294B29A5B1E2E00C24C022E8EF3461A50EF6
sudo apt-get update && sudo apt-get install -y gremlin gremlind

Validate that your systems can respond to common failures.

Run pre-configured chaos experiment scenarios based on real-world outages.

Validate autoscaling rules

Increases CPU utilization to test that your autoscaling is properly configured.

Attack: CPU

Based on: Segment AWS autoscale outage

Prepare for host failures

Shuts down an increasing percentage of your hosts so you can prepare for inevitable host failure.

Attack: Shutdown

Based on: Google Compute Engine Persistent Disk issue in europe-west1-b

Handle the unreliable network

Adds 2 seconds of latency to a growing number of hosts so you can validate that clients continue to respond without issue.

Attack: Latency

Based on: GitHub Oct 21 Incident Analysis

Be resilient to unavailable dependencies

Drops increasing amounts of traffic across a service to ensure your system can still function when a dependency is unavailable.

Attack: Packet Loss

Based on: S3 Outage 2017

Prepare for region evacuation and disaster recovery

Blackholes traffic to a region so you can demonstrate disaster preparedness.

Attack: Blackhole

Based on: Netflix project 2013

Withstand DNS outages

Blocks internal or external DNS traffic so you can identify single points of failure.

Attack: DNS

Based on: DynDNS Outage 2016

Simulate real-world scenarios that can impact performance, uptime, and customer experience. Run pre-built scenarios based on actual outages and be sure your system is resilient to common cloud failures.

Verify that your autoscaling works
Prepare for host failure
Handle a slow, unreliable dependency
Perform zone and region evacuations
Validate your capacity plan

Configure scenarios based on common outages.

Chain attacks together
Scale the impact magnitude
Increase the blast radius

Scenarios let you divide your attacks into incremental steps, which reduces the risk of complex experiments.

Dial up the blast radius over time

Increase the magnitude
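
As a rough illustration of dialing up the blast radius in steps, the shell sketch below computes how many hosts each stage of a ramp would target. The function names (stage_targets, ramp_plan) and the example percentages are hypothetical and are not part of Gremlin's CLI or API. The sketch only does the arithmetic and does not run any attack.

```shell
#!/bin/sh
# Hypothetical sketch: plan a staged blast radius.
# Nothing here calls Gremlin; the names are illustrative assumptions.

# stage_targets TOTAL_HOSTS PERCENT
# Prints the number of hosts to target, rounded up, with a minimum of 1.
stage_targets() {
  st_count=$(( ($1 * $2 + 99) / 100 ))
  if [ "$st_count" -lt 1 ]; then
    st_count=1
  fi
  echo "$st_count"
}

# ramp_plan TOTAL_HOSTS PERCENT...
# Prints one line per stage, in the order the percentages are given.
ramp_plan() {
  rp_total=$1
  shift
  for rp_percent in "$@"; do
    echo "stage ${rp_percent}%: $(stage_targets "$rp_total" "$rp_percent") of ${rp_total} hosts"
  done
}

# Example: a fleet of 20 hosts ramped through three stages.
ramp_plan 20 5 25 50
# stage 5%: 1 of 20 hosts
# stage 25%: 5 of 20 hosts
# stage 50%: 10 of 20 hosts
```

Rounding up keeps even the smallest stage from targeting zero hosts, so every step of the ramp produces an observable result before the next one widens the impact.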

Record your hypothesis, observe the experiment, and record the results so you can take action and improve the reliability of your system.

Follow how your experiments perform over time to prevent the drift into failure. Status Checks prevent scheduled experiments from running when the system is in an unsteady state.
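
The gating idea behind Status Checks can be sketched in shell as follows. This is a minimal illustration, assuming the HTTP status code of a health endpoint has already been fetched by your monitoring. The helpers status_ok and run_if_healthy are hypothetical names, not Gremlin commands, and the real feature may evaluate other signals.

```shell
#!/bin/sh
# Hypothetical sketch of a pre-experiment gate.
# The helper names and the 2xx rule are assumptions for illustration.

# status_ok HTTP_CODE
# Succeeds only for 2xx responses.
status_ok() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# run_if_healthy HTTP_CODE COMMAND...
# Runs COMMAND only when the status check passes; otherwise reports a skip.
run_if_healthy() {
  gate_code=$1
  shift
  if status_ok "$gate_code"; then
    "$@"
  else
    echo "skipped: status check returned $gate_code"
    return 1
  fi
}

# A healthy system lets the scheduled step run; an unhealthy one skips it.
run_if_healthy 200 echo "experiment would start here"
run_if_healthy 503 echo "experiment would start here" || true
```

Returning a non-zero status on a skip lets a scheduler distinguish "did not run" from "ran and passed".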

Gain confidence in the reliability of your Kubernetes clusters and train your team.

[Interactive demo: choose a cluster and a namespace, then select the Deployments, StatefulSets, DaemonSets, ReplicaSets, or Pods to target and review the resulting blast radius.]

Filter and control access by cluster and namespace to easily find and harden specific Kubernetes objects
Prevent noisy Pods from bringing down your application
Ensure you can withstand common Kubernetes failure modes including CPU throttling, DNS issues, and Blackholes

Validate your self-healing and orchestration
Be sure your app autoscales as expected
Find out whether your customers are negatively impacted when you unexpectedly lose Pods

Verify your Kubernetes migration is regression free
Identify critical bugs lurking within your clusters before they cause an outage
Share what you learn with the rest of your organization
