Everything you need to safely, securely, and simply build reliable software through Chaos Engineering.
Improve reliability at every level of your stack
Use Gremlin's comprehensive set of failure modes to experiment across your system, including bare metal, any cloud provider, containerized environments, Kubernetes, applications, and serverless.
Build resilient infrastructure
Resource Gremlins
Throttle CPU, Memory, I/O, and Disk
State Gremlins
Reboot hosts, kill processes, travel in time
Network Gremlins
Introduce latency, blackhole traffic, lose packets, fail DNS
Test for application failure
Test for failure in your code
Fail or delay serverless functions
Narrow the impact to a single user, device, or percentage of traffic
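As a rough illustration of narrowing impact to a percentage of traffic, the sketch below hashes a user ID into a stable bucket and injects a failure only for users inside the chosen percentage. This is not Gremlin's API: `in_blast_radius`, `handle_request`, and `InjectedFailure` are hypothetical names, and the hashing scheme is an assumption chosen for illustration.

```python
import hashlib


class InjectedFailure(Exception):
    """Raised in place of a real error when a request is selected for failure."""


def in_blast_radius(user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the targeted percentage.

    Hashing keeps the decision stable: the same user is always in or out
    for a given percentage, so the impact stays confined to one group.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5 selects buckets 0..499


def handle_request(user_id: str, percent: float) -> str:
    """Hypothetical request handler with an injected failure path."""
    if in_blast_radius(user_id, percent):
        raise InjectedFailure(f"injected failure for {user_id}")
    return "ok"
```

Because the bucket depends only on the user ID, raising the percentage widens the affected group without reshuffling who was already in it.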
Run chaos experiments in any environment
Test anywhere.
Virtual Machines
Containers
Kubernetes
Serverless
Bare Metal
Safely test in production
Gremlin is designed with redundant failsafes that restore your system to a healthy state at the first sign of trouble.
Halt and roll back all experiments with a single click
Trigger rollbacks based on your monitoring
Status Checks prevent experiments from running when systems are unstable
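The three failsafes above can be pictured as one control loop. The sketch below is a simplified model, not Gremlin's implementation: `run_with_failsafes` and its `start`, `rollback`, and `is_healthy` callbacks are hypothetical stand-ins for an experiment, its cleanup, and a monitoring query.

```python
import time
from typing import Callable


def run_with_failsafes(
    start: Callable[[], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
    duration_s: float,
    poll_s: float = 1.0,
) -> str:
    """Run an experiment only while the system stays healthy.

    Returns "skipped", "halted", or "completed".
    """
    # Status check: refuse to start if the system is already unstable.
    if not is_healthy():
        return "skipped"

    start()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            # Monitoring-based trigger: stop at the first unhealthy reading.
            if not is_healthy():
                return "halted"
            time.sleep(poll_s)
        return "completed"
    finally:
        # Once the experiment has started, roll back on every exit path.
        rollback()
```

Putting the rollback in a `finally` block means the system is restored whether the experiment completes, is halted, or raises an error partway through.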
Secure from the ground up
Gremlin is SOC 2 compliant and follows industry-standard security practices.
Least Permissions
Gremlin runs on default Linux permissions and doesn’t require root access
Ready for Production
Multi-factor authentication, secure Single Sign-On, and Role-Based Access Control (RBAC)
Audit Trails
Every action on the platform is tracked for compliance
Third-Party Testing
Gremlin undergoes regular security audits by a third party
Simulate real-world scenarios that can impact performance, uptime, and customer experience. Run pre-built scenarios based on actual outages and be sure your system is resilient to common cloud failures.
Verify that your autoscaling works
Prepare for host failure
Handle a slow, unreliable dependency
Perform zone and region evacuations
Validate your capacity plan
Build and share your own Scenarios
Configure scenarios based on common outages.
Chain attacks together
Scale the impact magnitude
Increase the blast radius
Safely scale the impact of your experiments
Scenarios let you divide your attacks into incremental steps, reducing the risk of complex experiments.
Dial up the blast radius over time
Increase the magnitude
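One way to picture incremental steps is as an ordered list in which each step raises either the blast radius or the magnitude, and the run stops as soon as a step fails its check. The sketch below is an assumption-laden model rather than Gremlin's Scenario format: `Step`, `run_scenario`, `run_step`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    blast_radius_pct: int  # share of hosts targeted
    magnitude_ms: int      # e.g. injected latency in milliseconds


def run_scenario(
    steps: List[Step],
    run_step: Callable[[Step], bool],
) -> List[Step]:
    """Run steps in order; stop at the first step that fails its check.

    `run_step` is a hypothetical callback that applies one step and returns
    True if the system stayed healthy. Returns the steps that passed.
    """
    passed = []
    for step in steps:
        if not run_step(step):
            break  # do not escalate past a failing step
        passed.append(step)
    return passed


# Example scenario: widen the blast radius first, then raise the magnitude.
SCENARIO = [
    Step(blast_radius_pct=10, magnitude_ms=100),
    Step(blast_radius_pct=25, magnitude_ms=100),
    Step(blast_radius_pct=25, magnitude_ms=500),
    Step(blast_radius_pct=50, magnitude_ms=500),
]
```

Changing only one dimension per step makes it easier to tell whether a failure came from a wider blast radius or a larger magnitude.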
Hypothesize and observe
Record your hypothesis, observe the experiment, and record the results so you can take action and improve the reliability of your system.
Track, share, and schedule experiments
Follow how your experiments perform over time to prevent the drift into failure. Status Checks prevent scheduled experiments from running when the system is in an unsteady state.
Chaos Engineering on Kubernetes
Gain confidence in the reliability of your Kubernetes clusters and train your team.
Be confident in the reliability of your Kubernetes clusters
Filter and control access by cluster and namespace to easily find and harden specific Kubernetes objects
Prevent noisy Pods from bringing down your application
Ensure you can withstand common Kubernetes failure modes including CPU throttling, DNS issues, and Blackholes
Confidently operate Kubernetes in production and prevent downtime
Validate your self-healing and orchestration
Be sure your app autoscales as expected
Find out whether your customers are negatively impacted when you unexpectedly lose Pods
Develop quickly and safely using Kubernetes
Verify your Kubernetes migration is regression free
Identify critical bugs lurking within your clusters before they cause an outage
Share what you learn with the rest of your organization