Site Reliability Engineering

Site Reliability Engineering (SRE) is the outcome of combining system operations responsibilities with software development.

Tammy Butow (Principal SRE @ Gremlin) and Robert Ross (CEO @ Firehydrant) discuss how SREs can be proactive with Chaos Engineering.

Learn about the rise of Site Reliability Engineering, and how this approach to incident management can not only coexist with, but also strengthen, a DevOps approach to development.


An SRE's primary job is making and keeping a service or application reliable, and this involves a lot of moving pieces! The layers below make up the Service Reliability Hierarchy, according to Google, listed from the top (Product) down to the base (Monitoring). Chaos Engineering can help at each layer.

Product

Development

Capacity planning

Testing + release procedures

Postmortem analysis

Incident response

Monitoring

Site Reliability Engineers have a responsibility to quantify how confident they are in the systems that they maintain. Chaos Engineering is an important discipline for validating reliability: it uses controlled experiments to test attributes of your system at every layer, from Monitoring all the way up to the Product.

SREs measure reliability in the following ways, and there are often SLOs for each.

Availability

How much uptime does your application have (measured in 9s)?

There are two important KPIs for availability:

SLA (defined and agreed to in a contract - e.g., 99.9%)
SLO (internal objective, usually stricter than the SLA - e.g., 99.99%)
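To make the 9s concrete, here is a minimal sketch that converts an availability target into the downtime it permits. The function name and the 30-day period are assumptions chosen for illustration, not part of any SLA standard.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime permitted over the period at this availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


# Compare the example SLA (99.9%) with the example SLO (99.99%).
for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes of downtime per 30 days")
```

The gap between the two numbers is the safety margin: missing the internal SLO is a warning that arrives before the contractual SLA is breached.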

Durability

How resilient is your system to data loss?

This can be measured in 9s as well. You have your primary systems with replicas under them, and then you have your backups. The more layers of backups, the more durable your data. Turtles all the way down!
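As a rough illustration of why extra layers add 9s, the sketch below assumes each copy of the data is lost independently with the same probability over some period, so data is lost only if every copy is lost. Real failures are often correlated (same region, same bug, same operator error), so treat the result as an optimistic upper bound. The function names and the 1% figure are hypothetical.

```python
import math


def durability(loss_probability_per_copy: float, copies: int) -> float:
    """Probability that at least one copy survives, assuming independent losses."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    if not 0.0 <= loss_probability_per_copy <= 1.0:
        raise ValueError("loss_probability_per_copy must be between 0 and 1")
    return 1.0 - loss_probability_per_copy ** copies


def nines(value: float) -> float:
    """Express a probability such as 0.999 as a count of 9s (here 3.0)."""
    if value >= 1.0:
        return math.inf
    return -math.log10(1.0 - value)


# Hypothetical 1% chance of losing any single copy over the period.
for copies in (1, 2, 3):
    d = durability(0.01, copies)
    print(f"{copies} copies -> durability {d:.6f} ({nines(d):.1f} nines)")
```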

Performance

How responsive is your application as measured by:

Traffic
Error Rate
Saturation
Latency
Packet Loss
...to name several.
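Two of these signals, error rate and latency, can be computed directly from request records. The sketch below is one simple way to do it, using a hypothetical `Request` record and the nearest-rank method for percentiles; production monitoring systems typically compute these from histograms instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed (0.0 when there are no requests)."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if not r.ok) / len(requests)


def percentile_latency(requests: list[Request], percentile: float) -> float:
    """Latency at the given percentile, using the nearest-rank method."""
    if not requests:
        raise ValueError("no requests to measure")
    if not 0.0 < percentile <= 100.0:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Ten sample requests with latencies 10, 20, ..., 100 ms; the two slowest fail.
sample = [Request(latency_ms=10.0 * i, ok=i <= 8) for i in range(1, 11)]
print("error rate:", error_rate(sample))
print("p50 latency (ms):", percentile_latency(sample, 50))
print("p95 latency (ms):", percentile_latency(sample, 95))
```

Percentiles are used here rather than the average because a small number of very slow requests can hide behind a healthy-looking mean.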

Capacity & Configuration

It's important to validate autoscaling rules.

In the cloud you may not need to buy new hardware to plan for a launch or big event, but you still need to make sure you're configured to scale when the time comes.
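One way to sanity-check an autoscaling rule before a launch is to model it and ask whether it can absorb the expected peak. The sketch below is a simplified model, not any cloud provider's API: `ScalingRule`, the per-replica capacity, and the load figures are all hypothetical. A Chaos Engineering experiment would then confirm the same behavior against the real system.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingRule:
    target_utilization: float  # e.g. 0.6 means scale to keep replicas 60% busy
    min_replicas: int
    max_replicas: int


def desired_replicas(rule: ScalingRule, load: float, capacity_per_replica: float) -> int:
    """Replicas needed to hold the target utilization, clamped to the rule's limits."""
    needed = math.ceil(load / (capacity_per_replica * rule.target_utilization))
    return min(rule.max_replicas, max(rule.min_replicas, needed))


def can_absorb(rule: ScalingRule, peak_load: float, capacity_per_replica: float) -> bool:
    """True if the replicas the rule allows can serve the peak load at all."""
    replicas = desired_replicas(rule, peak_load, capacity_per_replica)
    return peak_load <= replicas * capacity_per_replica


# Hypothetical service: each replica handles 100 requests per second.
rule = ScalingRule(target_utilization=0.6, min_replicas=2, max_replicas=10)
for peak in (50, 500, 1500):
    print(peak, "rps ->", desired_replicas(rule, peak, 100.0), "replicas,",
          "ok" if can_absorb(rule, peak, 100.0) else "max_replicas too low")
```

In this example the 1,500 rps case fails because the rule caps out at 10 replicas, which is the kind of misconfiguration that is much cheaper to find before the big event than during it.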

Learn how DevOps and SREs can work together to create high performing, reliable sites.

Site Reliability Engineering teams are made up of people from diverse backgrounds who work together toward the common goal of keeping systems and services reliably available.

Free resources and tools that will help you learn the skills you need to become an SRE.

We polled the industry to give you a sense of salary ranges for SREs.

As you're building your SRE team, here are some interview questions to help you find the best candidates, along with some job descriptions you can use.
