Podcast: Break Things on Purpose | Alex Hidalgo, Director of Reliability at Nobl9

Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.

You can subscribe to Break Things on Purpose wherever you get your podcasts.

If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!

In this episode of the Break Things on Purpose podcast, we speak with Alex Hidalgo, Director of SRE at Nobl9.

Episode Highlights

Alex's adventure into the absurd (3:00)
Google's pager list mishaps (9:37)
Crashing NYU's Exchange Server and Hyrum's Law (14:19)
Bartending makes you better (19:16)
Nobl9 (22:37)
What Alex is currently excited about (30:07)

Transcript

Patrick Higgins: Matilda has been an absolutely atrocious puppy. Since we got home, she's just been growling at the door. It's been ...

Alex Hidalgo: Matilda. Oh, you're ... You're like you're wearing headphones. Nevermind. Okay.

Jason Yee: We could have a special podcast guest.

Patrick Higgins: Welcome to Season Two of Break Things On Purpose. My name is Patrick Higgins. I'm a chaos engineer at Gremlin, and I'm joined here today by Jason Yee who is the Director of Advocacy at Gremlin. How are you doing Jason?

Jason Yee: Hey, Pat, I'm doing great.

Patrick Higgins: Jason, uh, happy new year. Uh, we're back for the new year. That's, uh, exciting times. Obviously we're living in chaotic times, but we had a chat late last year with Alex Hidalgo, who is at Nobl9, um, and an author of the SLO book. I wanted to ask you to kind of reminisce and talk about what you enjoyed about that chat, how that went for you.

Jason Yee: Yeah, I really enjoyed it. I mean, I always enjoy chatting with Alex, uh, every time I get the opportunity to hear him. He's got just such a wealth of experience. And one thing that we were talking about earlier that I loved was how he relates that experience to non-technical experiences as well. We all realized that at some point in our lives, we had worked in the service industry as bartenders and chefs and cooks and things. And a lot of what we do, the technical stuff, really comes down to people, processes, and how we deal with customers, and having those experiences of having to serve people really informs that.

Patrick Higgins: Yeah, definitely. He's also an excellent storyteller. Like, he told a bunch of different stories that I really got into and, as you said, he's got so much experience, um, that he's had, like, these different things happen to him that have been really funny and really interesting. I'm really excited for everyone to hear them. So without further ado, let's get into it: our conversation with Alex Hidalgo.

Patrick Higgins: Today, Jason and I are welcoming Alex Hidalgo to the Break Things On Purpose podcast. Alex is a Principal SRE at Nobl9, and he's an author of the SLO book. How are you doing Alex?

Alex Hidalgo: Doing great, thanks so much for having me.

Patrick Higgins: Thanks so much for being here. So on the podcast we like to ask our guests about a specific horrible incident that they've encountered in their career. What happened? How did you discover the issue? How did you go through resolving it? And what was that process like for you personally?

Alex Hidalgo: I mean, I've been doing this for a while, so I have way too many of these stories, unfortunately, some of them sad and some of them honestly hilarious. Especially being on the prodmon team at Google for a long time, right? We're the team responsible for ensuring everyone else gets their alerts and that they can know how their service is going.

So ensuring that was running was definitely a feat. But my favorite version of the story is one that's kind of filled with absurdity and I just love to share it. A long time ago I was working for a managed service provider, like an IT firm. And I was the designated, like, Linux guy and networking guy, and one of our clients, they wanted to upgrade this enterprise software.

I won't name the vendor. They're still around today and doing very well. But it involved like four different components and they all had to talk to each other, you know, almost like early microservices in a way. And this is in the days before everything was just, like, packaged. Like, I had a checklist of like 40 steps I had to take. This was late enough that early virtualization existed, so I spun up three servers and, uh, I followed the checklist, you know, like, very carefully. And, you know, edited each config file exactly how it was supposed to be, placed the license key in its place. And then I started the services in the order that they were supposed to be started.

And the last one, and none of this worked without all four services running, right? The last one just started and then crashed and just dumped the entire heap, dumped the entire heap to standard out, uh, which is problematic enough. And there was just too much data there for me to dig through and figure out what was wrong.

So it's like, okay, cool. I've got these new virtual machines. Let me just blow them away and we'll start from scratch. And so I did it again and followed the checklist very carefully, and I did it a third time, and same result every time. As soon as that last service started, it just crashed. I could not get it up.

And so eventually I realize, oh wait, this is paid-for enterprise software. Let me just contact their support. And so I contact the support. First support engineer can't figure anything out. The second support engineer can't figure anything out. We're on like day two or three or even four by now, you know? And eventually I get escalated to, I believe it was the, you know, like the vice president of engineering or something, but this person wrote most of the original code, right? He was one of the co-founders of this company, and we can't figure it out. We're trying everything, the most minor things, like, well, it runs on Red Hat for everyone else, let's try Scientific Linux, and just every possible thing we could think of. And eventually him and I exchange personal cell phone numbers even, and every morning, like, we wake up, and, like, he's on Colorado time, so I'll wait a bit. But, like, we call each other and we're trying to figure it out.

And I've given him access to the boxes by now, so he can poke around. Still, everything is supposed to be right as far as we can tell, except this one last service just keeps crashing every single time. And we're out of ideas at this point, but I had to get it done. We were being paid as a company to upgrade this software. Luckily there's a whole separate environment, right?

This is like version five of the software, and version four was still running on some other servers somewhere else. It was okay that this was taking a bit, but it couldn't take forever. And I just keep poking around. I keep Googling, but the product wasn't that well known, right?

Like, not that many people used it, and I couldn't find anything. And eventually during all this troubleshooting, I turn on those Vim settings that display every special character. It shows you tabs and spaces and, randomly, on a lark, just 'cause, like, I don't know, I was checking everything, I was rechecking the config files and this and that, I opened the license key file, and I noticed at the end of the file: backslash R, backslash N, because this license key had been copied from a Windows machine. And this caused the entire service to crash. I ran dos2unix, right? I replaced the carriage return plus newline with just a newline character.

Everything came up perfectly, no problems at all after that. A single character in the license key file that otherwise, upon just your eyes looking at it, looked totally perfect caused this entire service to not even be able to start.
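For readers who want to try the check Alex describes, here is a minimal Python sketch, assuming the key is a small file that can be read as raw bytes. The function names are hypothetical and are not part of any vendor's tooling; the replacement only approximates what dos2unix does for CRLF line endings.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """Return True if the data contains a Windows-style CRLF line ending."""
    return b"\r\n" in data


def to_unix_newlines(data: bytes) -> bytes:
    """Replace each CRLF pair with a single LF, leaving other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


def fix_line_endings(path: Path) -> bool:
    """Normalize a file in place; return True only if it had to be changed."""
    raw = path.read_bytes()
    if not has_crlf(raw):
        return False
    path.write_bytes(to_unix_newlines(raw))
    return True
```

Reading bytes rather than text matters here: opening the file in text mode with default newline handling would silently hide the carriage return that caused the crash.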

Patrick Higgins: Like what, what do you take away from a story like that? Like what do you learn from next time?

Alex Hidalgo: I mean, I do tell the story mostly because it was so absurd and it's kind of funny, but on the same token, it teaches you that the problem can lie anywhere and you should never make assumptions. If you encounter an incident, the first thing you should ask yourself is: what is different about reality than I thought, right? What difference is there in my understanding of the world? What has changed to have caused this problem? And yeah, in this case, it was a very kind of outlier one that I may never run into again, but on the same token, um, you know, I'll always be checking the format of license key files for the rest of my life.

Patrick Higgins: How did you go about explaining this to your bosses afterwards? Like, what was that story like?

Alex Hidalgo: Well, my bosses didn't care because we were charging this client by the hour. Um, now, it was also a very small company. Uh, there was like seven, eight of us total and we were all really, yeah, good friends. And there was no, you know, leadership chain I had to worry about placating. And the client, they hired us because they didn't have the technical chops.

Right. They knew how to use this product really well. Uh, they built a very successful business, so I'm not going to call out who they are either, but they were eventually bought by Google. And, you know, they didn't know how to do this. And that's why they hired my company to help them with this upgrade in the first place.

And sure, it took a lot longer than expected, but I don't really remember there being any friction there either. Um, in many situations I could imagine there being, but, uh, I kind of lucked out. Everyone involved was kind of understanding.

Jason Yee: I feel like there's a good story here too, from the software developer perspective, right? Of, like, constantly check your input filters. The fact that it crashed because of that, rather than returning, like, "you have an invalid license key."

Alex Hidalgo: Yeah. I don't think I ever even asked, and if I did, I cannot remember what the details are of exactly how the code couldn't handle this in that catastrophic of a manner. Uh, I remember it was a Java app. It was all proprietary code. Uh, this wasn't anything I could go poke at.

This is not stuff any of us could poke at. Um, but hey, if anyone out there is listening and can think of a way that, uh, a Java program may completely crash, especially back in like 2010, uh, because the input wasn't validated just right, please let me know, 'cause I've always kind of wondered about that.

Here, I'll tell another quick one, 'cause it also relates to, like, input validation in a way, or being able to input the wrong thing. I was on prodmon at Google and I was on call, uh, for the alert manager side of things, right? So the entire infrastructure at Google that delivers pages to people.

So, very important. And, uh, I'd already packed up for the day and I'm 20 feet away from my desk, something like that. And I get a page. It's got very weird text on it, you know, and then I got another and another, and I'm like, okay, these pages don't make sense to me, what they're saying. And I'm in charge of the alerting infrastructure right now, so I better turn around and sit down at my desk. And by the time I get back to, you know, to, like, the seating area, I hear everyone's pagers going off, and I'm like, well, this is going to be fun. And it takes us a little bit to figure out exactly what's going on. And that part isn't interesting really, but turns out what happened was someone went to go share a document, and you can share documents, Google Docs, with mailing lists, and these mailing lists will auto-complete. And at Google you don't need to have the person's address, right? Normally if you're sharing, like, you know, a G Suite doc, uh, you know, often you have to have talked to that person before. In some organizations, they may set it up so you have everyone, but at Google it was set up so everyone had access to a mailing list.

And I guess through some kind of typo or something, this person hit underscore twice and then shared it with a mailing list that was purposely named with two underscores so it wouldn't, like, accidentally auto-complete and things like that. And turns out this was an old mailing list that existed during a migration, an internal mail server migration at Google. And so it contained something like 40,000 email addresses, many of them old pager addresses, including those of people who no longer worked at Google. This was a years and years old document. I think it was, like, a six-year-old mailing list. So people got alerted all over the world, whether you worked at Google or not anymore.

And the problem compounded itself because a lot of people were getting these as emails, right? Because it was ostensibly an email list. Just many of them were plus pager, which would redirect to your pager. And so people were like, reply all, "please stop," or reply all, "I think you shared the wrong document with me," which then of course sent pages to everyone else as well.

And within an hour, we got it all under control. It was one of those, you know, in our incident retrospective, we had great items, like: where did we get lucky? The person in charge of that mailing list in Australia happened to be up and recognized immediately what had happened. That person even went as far as finding the Buganizer, like, Google's internal ticketing system, found the Buganizer bug that was still open that said we need to delete this list. But we also got really fun stuff. Like, you know, there was another engineer in Australia who had forgotten to set their alarm that morning, but they got woken up anyway. Uh, we got to add things in that section of the retrospective too. Um, multiple people reached out and said, hey, it was really nice to hear from Telebot again, it was really nice to hear from Google's paging system, 'cause I've been gone for so long. And all of that because a long-lost mailing list that should have been deleted half a decade before accidentally auto-completed in someone's "share this document."

Patrick Higgins: Wow. It really begs the question: how many of these bigger, older companies have that foot gun just lying around all the time?

Alex Hidalgo: Yeah. I mean, technical debt is difficult, right? Even those of us who care about it the most, we're always leaving it behind, you know, and the engineer who was responsible for that list originally, you know, was one of the best engineers I ever worked with. Like, you know, I knew him personally, and, you know, uh, stuff happens and that's fine and stuff breaks sometimes.

And, uh, sometimes things break in funny and interesting ways.

Patrick Higgins: Yeah, a hundred percent.

Jason Yee: I can see a, uh, a new Gremlin, chaos engineering attack of sending emails to all the email addresses.

Patrick Higgins: Mailing List

Jason Yee: Yeah, that's some good chaos engineering.

Patrick Higgins: Like particularly if it's r and it automatically replies all a couple of times,

Alex Hidalgo: I mean

Patrick Higgins: That would be in order.

Alex Hidalgo: There are so many stories of entire companies' mail servers going down, right? Especially, you know, uh, not quite as frequently now that most people have, like, hosted, you know, email services. But I mean, I think it was just like eight years ago or so that NYU's Exchange server went down because someone accidentally emailed every student at NYU, and being college students, they all, they knew what they were doing.

They were purposely hitting reply all, and this just caused such a, you know, or at least I think it was NYU. If anyone's listening and I'm wrong, it was some college. Uh, but it caused such a feedback loop that the Exchange server just died, you know? I mean, a good way to think about it is, if someone's able to do it, they will, right? Like Hyrum's Law. Like, are you familiar with Hyrum's Law? So Hyrum, uh, an engineer at Google, um, actually just wrote the, uh, software architecture at Google book, I think it's called. But anyway, his observation, his law, is: with a sufficient number of users of an API, it doesn't matter what you promise in the contract. All observable behaviors of your system will be depended upon by someone.

Jason Yee: So I think it's interesting, right? The whole funny thing of having this false alarm because of that email list, but I think it ties back to something that you've written a lot about, and that's SLOs and just, like, generally monitoring and what we should be tracking and alerting on. So I'm curious if you could dive a little bit more into that and, like, let's chat about your thoughts on SLOs. Like, how did you get there? Number one, like, what brought you to SLOs?

Alex Hidalgo: Yeah. So, I mean, in a way, it was just introduced to me naturally because I was an SRE at Google, right? Um, I may have eventually gone on to write the book, but I didn't come up with the concept, um, at least not how it was originally formulated. And, uh, it was just a thing that you did. And, you know, I didn't totally get it at first, to be honest. You know, I'd spent years and years in industry already.

I'm like, what are we doing with this? But then we were forced to, because the product I was working on was also a cloud product, or at least a back to cloud product. And since Google had SLAs, all their GCP services needed an SLO set, you know, at a level below that, so we would know if we were out of error budget before, you know, we might violate our SLA. And that made it make a little bit more sense, but it still didn't resonate with me.

Um, what did eventually, though, is when we deleted all the rest of our alerts, when we moved to a world where we only got alerted on fast burn and got tickets on slow burn, right? So the idea being: our math says we're burning through the error budget at a rate that is not recoverable without likely human intervention.

And when you get to that point, and it's difficult to get there, but you can get to a point where you're reasonably sure you're only catching a page if it's actually going to cause you to violate your SLA, then that's pretty awesome. There's so many false positives that go away. There's just so much general pager load that just disappears.
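The "page on fast burn, ticket on slow burn" idea can be sketched in a few lines of Python. This is only an illustration: the burn rate here is the observed error rate divided by the error rate the SLO allows, and the thresholds `page_at` and `ticket_at` are assumed values chosen for the example, not the ones used at Google.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on schedule.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate


def classify(rate: float, page_at: float = 10.0, ticket_at: float = 1.0) -> str:
    """Map a burn rate to an action; both thresholds are illustrative assumptions."""
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "ok"
```

For example, with a 99.9% target, 20 failures in 1,000 requests is roughly twenty times the allowed error rate and would page, while 2 failures in 1,000 is roughly twice the allowed rate and would only open a ticket.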

And I was like, wow, these things are kind of cool, you know, but I didn't quite understand how to use them for things that didn't have an SLA sitting in front of them. Uh, but then I joined the CRE team, the customer reliability engineering team, uh, which is a group of veteran SREs kind of tasked with teaching Google's largest cloud customers how to SRE.

And when you're trying to have conversations, not just cross team or cross org, but cross company cross industry, right? Like, it's not like Google's cloud customers are all other large tech companies, you know, how do you have the proper vernacular? How do you figure out how to have conversations about things and what the CRE team decided is that was going to be SLOs.

Um, so the idea was basically, you know, uh, if we engage with you, we will help you. We will teach you what we've learned. We will examine your systems. We'll make you more robust. We'll make you more resilient and therefore more reliable, but we need to know how to speak the same language first.

So the idea was basically: we will come on-site with you. We will run an SLO workshop. It will be hands-on. We'll spend a whole week with you. You know, I'd spent a whole week at various different companies' offices. But the goal was, we need you to establish at least starter SLOs. And then once we're measuring your reliability from that standpoint, uh, then we can engage further.

Then we can figure out how to really isolate where the problems are and things like that. And that's when it really clicked with me. That's when I started to understand the potential behind these kinds of approaches. And that's when it clicked with me that this is maybe a new formalization, but it's something that everyone already knows. Nothing's ever perfect, right? Don't shoot for a hundred percent. Humans are actually okay with failure. And I started recognizing this in everything I'd ever done for a living. When I was a bartender, I'd try to greet all my customers within 30 seconds. And I knew I couldn't greet all of them within 30 seconds if it was busy, but I also knew that if I greeted enough of them, I'd still have a good night. But if it was way too busy and too many people were walking out, then it wasn't a good night anymore. Right, and that little story, that's all SLOs really are. When you really get down into the nitty gritty, it's accepting the fact that you're going to have failures. It's accepting the fact that your customers, your users, are actually okay with that. Every human is cool with something breaking every once in a while, as long as it doesn't break too much. That's what SLOs really are.
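The bartending example maps directly onto an SLI, an SLO target, and an error budget. The Python sketch below is a hypothetical illustration: the 30-second threshold comes from the story, while the target values and the sample wait times are invented for the example.

```python
def greeting_sli(wait_seconds: list[float], threshold: float = 30.0) -> float:
    """Fraction of customers greeted within the threshold (the SLI)."""
    if not wait_seconds:
        raise ValueError("need at least one customer to compute an SLI")
    good = sum(1 for wait in wait_seconds if wait <= threshold)
    return good / len(wait_seconds)


def budget_remaining(sli: float, target: float) -> float:
    """Share of the error budget still unspent; negative means overspent."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_failures = 1.0 - target
    actual_failures = 1.0 - sli
    return 1.0 - actual_failures / allowed_failures


# Invented wait times, in seconds, for ten customers on one shift.
waits = [5, 12, 45, 20, 8, 29, 31, 10, 15, 25]
sli = greeting_sli(waits)
good_night = budget_remaining(sli, target=0.75) >= 0
```

With these made-up numbers, eight of ten customers are greeted in time, so the night is still "good" against a 75% target but would have overspent the budget against a 90% target.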

Patrick Higgins: I absolutely love that you use experiences from bartending in how you think about your current work, because I absolutely do that as well. Like, thinking about, uh, any number of things, like, uh, queues and code promotion, like, how to be successful. I always take it back to bartending and, like, the physicality of getting things to people. Yeah, that's awesome. I love it. It's really interesting. You bring up the idea of, like, um, it seems like so much about best practices is, like, establishing this common vernacular with people you're trying to convey concepts to. And really a lot of it's just about getting terminology succinct and correct, and establishing a common agreement with it as well. I think that's really interesting.

Alex Hidalgo: So many problems are just over miscommunication, you know, uh, just because the problem there is that humans are very emotional creatures, right? And we establish definitions of things in our heads, and it's difficult to convince ourselves that maybe this thing that we were convinced meant this actually means this, right?

Like, it's difficult to convince people. If you catch them when they're first learning, you can be like, no, no, no, actually this thing means this, but preconceived notions can be very difficult to dissuade people of. We hold onto them, they help form our reality, right? Like, if we suddenly learn this thing we thought our whole lives is actually wrong, that can be shocking. That can be jarring. And, you know, I think it's true just in the workplace as well, to a lesser extent. But, you know, once you believe something, when you think something, then it's difficult to change that, and that's exactly why establishing a common vernacular is so important, because, uh, people likely know all these words, but they may have entirely separate definitions of them.

And that can make things even worse, right? If someone doesn't know the word, they'll be like, okay, what does that mean? But if you both know the phrases but you have even slightly different definitions, you end up talking past each other without even realizing you are, and that ends up in disaster all the time.

Jason Yee: So we've been chatting about your time at Google. You've recently joined a new organization, so congrats on the new gig, but tell us about, uh, we were saying Nobl9, right? Is the name of it. Uh, tell us a little bit more about what you're doing there.

Alex Hidalgo: Yeah. Before I do, I do want to give a shout out to Squarespace. I was there for two years, between Google and Nobl9, and I absolutely loved it. The only reason I left Squarespace is because I'm so excited about what Nobl9 is doing. Yeah, as you alluded to, like, SLOs are kind of my thing at this point, and, uh, Nobl9 is aiming to build the most comprehensive SLO platform. People often think that SLOs are something that can be simple to do. Um, and this is often because the philosophies are simple.

Let's define what an SLI means, let's define what an SLO means, let's define what an error budget means. And then people will go start to do it, and then they realize, oh wait, my monitoring tool can't actually calculate error budgets, right? Very few do. And even those that do only do so in, like, one way, and there's four or five potential ways that you can calculate error budgets. And then you realize, okay, cool, this is fine, we'll build some tooling. So now you build your own internal service because nothing exists out there to help you do this stuff. And then you realize that, with some of your metrics, when you're talking about SLOs, you're generally talking about high-volume request-response API things. And that's fine, but you don't just want to measure the latency of your API requests, you want to measure a whole user journey.

So now suddenly you have to build tooling to allow you to, you know, actually probe or trace, you know, across many different services. And then after that you run into a service that only has like four data points per hour. And if you have a single error per hour, that almost makes it seem like you're only being 75% reliable, but, you know, that's not actually the case, because you're actually running fine the rest of the time. You just don't have the data points to prove it. So then, you know, okay, cool, I can solve this with stats. And so you go out there and you learn about binomial distributions and ways to normalize this data over time. And then suddenly, you know, you realize you need a whole team to build all this for you because there aren't vendors doing this.

There aren't metric systems that do this. There aren't time series, uh, you know, systems like that, that can actually do this. And the next thing you know, you've spent two years building this tooling, which is basically the story of my time at Squarespace. Um, right. There's so much more to it outside of the generic example of, you know, your web API, and let's make sure that your latency isn't too high, or it's not too high too often.
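To see why four data points per hour is a problem, one option is to put a binomial confidence interval around the measured success ratio. The sketch below uses the Wilson score interval, which is a standard technique but is an assumption here: the conversation does not say which method was actually built at Squarespace or Nobl9. The function name and the 95% z-value are illustrative choices.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# One hour: 4 data points, 1 error. The point estimate is 75%.
low_hour, high_hour = wilson_interval(3, 4)

# A full day at the same sampling rate: 96 data points, still 1 error.
low_day, high_day = wilson_interval(95, 96)
```

With only four samples the interval is very wide, so the 75% figure says little on its own; over a full day of samples the interval tightens considerably around a much higher value, which matches the point that the service was running fine the rest of the time.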

Um, that's not what most people's services look like. It's the easiest to explain. And it's what a lot of Google's look like. And that's why, you know, that's how they defined it. And that's how they, you know, wrote about it in the first few SRE books. Uh, but that's not what everyone else's stuff looks like.

And that makes it really difficult to adopt this approach in any kind of meaningful way. So that's what we're doing at Nobl9. We're looking to build that tooling so people don't have to keep building it themselves. Uh, but beyond that, um, because we're an entire company focused on this, as opposed to you and your side project, um, it's going to be the most comprehensive version of this possibly imaginable.

So we're going to do things like, we'll be able to collect, uh, you know, it's still very early for us, uh, so I have no timelines on any of this. But, I mean, I have a list of 36 data sources we currently hope to integrate with, for example. So we're not just talking about, like, we'll convert your Prometheus metrics for you.

We're talking about, let's talk to your business logic systems. Let's talk to, you know, literally anything that can send us data. Yeah. And in doing so, we'll do the math for you and we'll give you better data to make better decisions.

Patrick Higgins: You're obviously dealing with a varied set of circumstances when it comes to different, uh, potential customer use cases. Um, have you had any edge cases yet where, like, that's kind of happened and you've been like, oh, I did not see that coming? Like, I didn't expect that at all.

Alex Hidalgo: A little bit. You know, we only have a handful of beta customers right now. Um, you know, again, we're still very early, uh, but, uh, yeah, we've already run into situations where queries against certain monitoring vendors are not returning the data we expected. Uh, they were not returning the data we thought, and the data looked totally different once we got it.

Once we grabbed it out of their API versus what the customer thought, it looked like inside that vendor's tool, you know? And you know, so like it's not a super exciting example, but yeah, we're already running into things like we're following the API docs and we thought we were following the query documentation, you know, like the query language documentation, and, uh, it's still operated in a way that we didn't expect the data did not look like how we expected it.

Jason Yee: That's something we encounter a lot, just in chaos engineering, right? Uh, I was chatting with a customer the other day and they were like, we injected some chaos that was supposed to consume all the CPU, and we see it in one graph and we're not seeing it in the other, but it clearly says cluster CPU percentage. Why isn't this working? Yeah. And you dig down through multiple layers of docs and suddenly you find the note that says, oh, this doesn't actually mean that, this means this other thing.

Alex Hidalgo: Yup. And perchance that graph had been looked at for years, uh, thinking it represented something when actually it didn't, right?

Patrick Higgins: That's such a good example of the fact that we're, like, looking at these things, trying to figure out, trying to discover, like, those preconceived notions that we're trying to break out of, and trying to, like, kind of really push the boundaries of what we believe into trying to generate these, like, new models of a whole new world with new beliefs.

Alex Hidalgo: You know, and sometimes you never even figure out what's actually going on. You know, like, I remember at Squarespace, we had a dashboard for the ELK stack, right? The big Elasticsearch log ingestion stuff, and we were running into some problems with something and, you know, I'm trying to dig into it. I can't really figure out what's up.

And I'm like, oh, maybe the network links are saturated. And I go and look at the graphs that we had set up, and they looked fine. Okay, but it really feels like maybe the network links are getting saturated. So I logged onto one of the servers and ran iftop, and it showed a whole different story.

Right, like, pushing 16 times the amount of data that these graphs are showing us. So, like, okay, fine, the graphs are wrong. So I go to the graphs and I look at the query and I look at the metrics, and it was just a StatsD exporter, and it looked right. And I go check the StatsD docs, and it looked right.

And I never once figured out what it was. Like, I never quite figured out what that discrepancy was, but the kernel, via iftop, was telling me something entirely different than what StatsD was. And, you know, I just replaced the graphs with a totally different data source, and, you know, like, that was fine.

Um, but, right, as far as I know, those graphs existed for several years and people had just assumed that they were accurate. And, uh, when it was time to actually examine the data within those graphs, they just simply weren't.

Patrick Higgins: Well, Alex, I would like to ask you about the things that are going on for you at the moment, what you're excited about. Could you plug your pluggables in terms of what you're super excited about right now?

Alex Hidalgo: Yeah. So, um, I think we're at a really interesting time in the industry, because I feel like, for the first time that I've been involved, at least in tech with a capital T, uh, people understand that we need to be looking outside of our own discipline, and we can learn so much from others, and that we shouldn't just be trying to come up with everything from scratch.

I see this in everything from people just discovering that, uh, statistics as a discipline can help you learn about numbers, all the way to the adoption of looking at what safety engineering and resilience engineering can teach us. Um, it just seems like people are finally more open than they ever have been before to: let's learn from others instead of trying to be the best. Uh, software is not different, uh, not different enough to not be able to learn from others.

So that's, in the very large scheme, something I'm incredibly excited about. Um, I'm very happy that we are finally starting to see some people out there truly understand what observability means as opposed to just metrics collection. Two companies I'm not affiliated with, but I love them both very much: Lightstep and Honeycomb are both absolutely phenomenal. Go check them out. I actually love what Gremlin is doing. I love just the general acceptance of let's make sure that we understand our systems by, you know, not necessarily always breaking them. Chaos engineering doesn't have to involve breaking. Let's understand our systems better, right? We cannot make them safe or resilient or robust, and therefore not reliable, uh, and that's my whole thing, right, like, reliability, um, without understanding them better. And to understand them better, we can't just let them sit stagnant. And so yeah, broadly, um, uh, those are some things I'm, like, most excited about, just seeing it spread across the rest of the industry.

I have some qualms. I hope that these things aren't all subsumed by the marketing departments of various companies, like we've seen happen with, you know, like, DevOps. It originally was a philosophy, and now it's a Microsoft Azure product name. I can't even wrap my head around that, much less people with the title DevOps. Sorry, I'm not trying to be insulting to anyone. It's just, like, I'm old. And, you know, that journey that term's taken, um, you know, it's fine, language evolves. I get that. I just hope it doesn't happen with things like observability or chaos engineering or resilience engineering, um, or even just the word reliability itself, right? I see it very often get conflated with availability, and they're very different things. Um, so yeah, I'm excited about a lot of what's going on, uh, slightly pensive, uh, hoping that the, uh, popularity of some of these things doesn't ultimately become their downfall. Uh, but, you know, I think it's actually a really, really cool time to be in the reliability space, to be in the space of, you know, how can we make these global scale distributed multi-component deep systems?

How can we make our complex systems? How can we make them reliable? How can we make them more useful to our users as well as the people that have to maintain them? I'm generally pretty optimistic.

Patrick Higgins: Awesome. Well on that note, because I think that's a beautiful note to end this on. Thanks so much. Thanks for joining us today, Alex.

Alex Hidalgo: Thanks so much. I had a blast being here.
