Podcast: Break Things on Purpose | Mikolaj Pawlikowski, Engineering Lead at Bloomberg

Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.

You can subscribe to Break Things on Purpose wherever you get your podcasts.

If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!

In this episode of the Break Things on Purpose podcast, we speak with Mikolaj Pawlikowski, Engineering Lead at Bloomberg.

Episode Highlights

  • Why Chaos Engineering? (1:29)
  • Miko's Book (6:55)
  • Chaos Engineering for Frontends (10:21)
  • eBPF (12:10)
  • SLOs (16:28)
  • What Miko is currently excited about (21:56)

Transcript

Jason Yee: We may cut this later, but I have a totally random question. So for those who are listening to the podcast, you can't actually see Miko, but he's got this plush penguin head that's sitting next to him. I'm curious if there's a story behind the penguin head.

Mikolaj Pawlikowski: Well, um, it's complicated. Um...

Patrick Higgins: Hello and welcome to today's episode of Break Things On Purpose. My name's Patrick Higgins and I'm a Chaos Engineer at Gremlin.

Jason Yee: And I'm Jason Yee, Director of Advocacy at Gremlin.

Patrick Higgins: Today we speak with Mikolaj Pawlikowski, who is a software engineer project lead at Bloomberg. He's also the author of Chaos Engineering: Site Reliability Through Controlled Disruption. How are you doing today, Miko?

Mikolaj Pawlikowski: Doing great. Thanks for having me.

Patrick Higgins: We're very excited to have you here today. Most recently, your book has come out. That's been something that I think a lot of people in our space have taken a lot of interest in, and it's been really well received, so we've got a bunch of questions about that. We would like to ask you if you've had those experiences in the past where you've really felt the heartache that comes with not thinking about things proactively, and what that's looked like for you.

Mikolaj Pawlikowski: That's a tough question. I think the entire reason why we started with this chaos engineering thing to begin with was because we started working on a brand new project using Kubernetes back in 2016 or so. And all of a sudden we found ourselves with all this new software, all these new pieces, a lot of moving parts, a lot of patches coming in, and there wasn't a good manual that would tell you, "Oh, this is the best way to configure it." You had to kind of figure this out by yourself. So it kind of came naturally as an evolution of just trying to make sure that all the things that we can think of, or all the outages or all the problems that we had before, we would simulate them again to make sure that next time we're not actually, you know, prone to running into the same issues. That later evolved into things like PowerfulSeal, when we open sourced the tools separately, and eventually kind of distilled into writing that book. So, you know, I kind of often joke that this was all like a sleeping aid, to sleep a little bit better at night and not be called that much.

Patrick Higgins: That's fantastic. Have you found that one of the more interesting things, when you make that transition from firefighting into thinking about things ahead of time, is that you can start to look at abstractions like Kubernetes and really think in more logical and curious terms about them, more as a learning approach rather than desperately trying to get things fixed and put out the fires?

Mikolaj Pawlikowski: Sure, I mean, you know, typically the good presentations that you see at conferences are either about some big outage that they had and they fixed, or an outage that was prevented. I think that part of it is that if you start with a new code base, like it was the case for us with Kubernetes, or if you start on a new project, the best way to understand it is to kick the tires and understand how these things work or how they break and all of that.

So this is, like, naturally again going in this direction of discovering those things. But most of the time, the things you find don't necessarily make for particularly exciting presentations. This is the easy stuff, the low-hanging fruit, and I'm sure your experience has probably been similar. The example that I use now to get a gotcha moment when I explain this to people who are a little bit skeptical is this: everybody knows systemd and systemd services, and a lot of people don't know that by default, if you just set it to restart always, it doesn't actually mean it's always going to restart, because the other default parameters mean that if you have, I think it's by default, five crashes in a 10-second period, it's just going to stop, and everybody's going to be like, "Whoa, what's going on?" So when I'm teaching a little bit of chaos engineering during training, internally and externally, that's a good moment when they're like, "Oh wait, well, even simple things like that." It's one line, and if you didn't test it properly, if you didn't actually simulate this kind of thing, you might run into trouble. So you don't really have to go all the way to Kubernetes to prove those things. You can start with something simple. And I don't know, is that also your experience? Do you find that that's how people react?
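The systemd behaviour described here comes from its start rate limiting: with the defaults (`StartLimitBurst=5` within `StartLimitIntervalSec=10s`), a unit set to `Restart=always` that keeps crashing straight away stops being restarted. The Python below is a simplified model of that rule, assuming a fixed counting window, and is not systemd's actual implementation. The names `StartLimiter` and `starts_before_giving_up` are made up for illustration, and the 0.1 second delay mirrors systemd's default `RestartSec` of 100 ms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartLimiter:
    """Simplified model of systemd's start rate limit (assumed fixed window)."""
    interval_sec: float = 10.0   # mirrors the StartLimitIntervalSec default
    burst: int = 5               # mirrors the StartLimitBurst default
    window_start: Optional[float] = None
    count: int = 0

    def allow_start(self, now: float) -> bool:
        # Open a new counting window if none exists or the old one has expired.
        if self.window_start is None or now - self.window_start > self.interval_sec:
            self.window_start = now
            self.count = 1
            return True
        # Inside the window: allow up to `burst` starts, refuse the rest.
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def starts_before_giving_up(crash_after_sec: float,
                            restart_sec: float = 0.1,
                            max_attempts: int = 1000) -> Optional[int]:
    """Count starts of a service that crashes `crash_after_sec` after each start.

    Returns the number of starts before the limiter refuses one, or None if
    the limit was never hit within `max_attempts` attempts.
    """
    limiter = StartLimiter()
    now = 0.0
    starts = 0
    for _ in range(max_attempts):
        if not limiter.allow_start(now):
            return starts
        starts += 1
        now += crash_after_sec + restart_sec
    return None


if __name__ == "__main__":
    print(starts_before_giving_up(0.0))  # crashes immediately on every start
    print(starts_before_giving_up(3.0))  # crashes slowly enough to stay under the limit
```

In this model, a service that dies immediately gets five starts and is then left stopped, while one that survives a few seconds each time keeps being restarted. That is the kind of surprise a small crash-loop experiment would surface.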

Patrick Higgins: I found that it's really like what you're describing is like a really interesting way to look at getting that light switch to flick for people with chaos engineering, where it's like, um, taking a very simple, configuration, default configuration or parameter, or like some kind of thing where, you know, most people think it's one way, but in fact, it's the other way.

And if you can test that and show it to them like face-to-face then they really have the, they kind of got to confront the fact that they're kind of like, Whoa, this changes the way I think about this one thing, how many other things may fall into the bucket? Where I don't actually, where I think I know what it looks like and then once we actually press that button or pull that lever, it isn't exactly as we'd expect.

Mikolaj Pawlikowski: Yeah, exactly. And it's very often simple things and, you know, people have this misconception that it's only for massive distributed systems, and that if you're not at the scale of Netflix, you shouldn't be touching it. My point of view is that you can start with any system, even a single process, and just this mindset typically gives you a lot of value for little effort.

Patrick Higgins: Yeah, absolutely. Yeah.

Jason Yee: These days, even those simple systems are starting to become complex systems, right? When we look at what it takes to just set up a simple blog these days, you're looking at various hosting providers, and you've got a database and a front end, and it's suddenly a much more distributed app than it was five, 10 years ago.

Mikolaj Pawlikowski: Yeah. So what do I need for my blog? Well, I need Kubernetes, I need load balancers in front of it, I need a CDN because, you know, I expect it to blow up in popularity any day now. So yeah, things get complex really quick. That is nice. That's true.

Patrick Higgins: Yeah, I was wondering about that. Your book is kind of broken down into three separate sections, where up front you're dealing with introducing the principles of chaos engineering, you translate that in the second section into thinking about chaos engineering experiments, and then in the third section it's a lot of bringing things together and integrating that into real-world situations, from an organizational perspective or a business perspective, perhaps.

And that second section was really interesting to me because the experiments that you chose were so varied. It wasn't that you were really aiming that book at reliability engineers particularly; this is something that was practical and that application engineers could pick up, or people coming from all different backgrounds. And I was wondering how much of that was intentional, and also what you think that means for the kind of people that are going to look to your book and take value from it.

Mikolaj Pawlikowski: You know, I would like to think it was fully intentional. So the story behind it is kind of funny, because I wasn't suspecting anything when I woke up that day, and I get a call out of the blue from Manning, the publisher. And they're like, "Oh yeah, well, we liked the things that you've been doing with chaos engineering and PowerfulSeal, the presentations. Why don't you write a book?" I was like, "Well, they've already written a book or two about that." And then we were talking about that, and we realized that, sure, there are these books that talk about the mindset, and that is great. But in order to be able to actually apply that, it's easy to fall into the trap of thinking that this is actually only for Netflix and Google and whatever, and that the same methodology can't necessarily be applied elsewhere. So that was basically how they talked me into writing that book in the first place, to show the different layers. And like you said, I was initially trying to actually start entirely from the syscalls and build up all the way to Kubernetes.

We had to switch the order a little bit for, like, marketing reasons. But yeah, the goal was basically to show that this is really not rocket science. Most of that is simple stuff that everybody can do, regardless of whether you're an SRE at Google, which is great, or running your Kubernetes stuff, or even, you know, managing Kubernetes so that you need to understand behind the scenes how those things actually work, all the way to, "Okay, we have this legacy system, we kind of know how it's supposed to work. Are we really sure about that? Okay, I think so."

Um, yeah, the type of content was specifically designed to go through the different stacks and different technology and different languages to show that it's about the breadth here rather than any specific tool. And, I try to throw in tools here and there, obviously where it may make sense and gives you value, but you notice that most of those things are used tools that have been there forever and don't necessarily require, you to do anything brand new or to pay money for, for any of that. So, Yeah, that was, um, you know, I'm, I'm kind of happy that you noticed that trend in the book, because that was one of the big parts of what I was trying to achieve with it.

Patrick Higgins: Yeah. On that point of thinking about chaos engineering and its effect on front ends, I thought it was really cool. As someone who's spent a lot of time in front-end code, I really appreciated it because, taking that perspective, what we're really thinking about is user experience and reliability from a user's perspective as well. So being able to take that really holistic view of the way that this practice can really help not just us, but create less pain for customers as well, I thought was great. I think a lot of people are going to have that really affect the way that they think about practices of chaos engineering, but also just reliability generally: that it's not abstracted away from customers, it's actually fundamentally important to the customer experience.

Mikolaj Pawlikowski: I'm really happy you're bringing this up. You know, the JS crowd actually appreciated that, so that was great. I guess the flip side of that is that it was really challenging to compress enough information about those different stacks, technologies, languages, and whatnot into single chapters without overwhelming people. Because, you know, you could write a book about each of those things that get touched on in a chapter. So I hope I struck a reasonable balance, but yeah, the JS chapter was particularly fun to write, and I hope I didn't offend the JS crowd too much.

Jason Yee: I'm curious. One thing that I wanted to note: earlier on, you had mentioned starting simply, and so I'm curious, is there one favorite chaos engineering experiment that you've run repeatedly on systems and found the most benefit from?

Mikolaj Pawlikowski: Oooh. Deep question. I think that in terms of my personal benefits, before I actually wrote that book I didn't really use things like strace that much on anything more complicated than, you know, a basic thing. And through the process of writing this and finding those examples, I realized that it's much more relevant than we expect, and that understanding, at the lower level, what the thing is actually doing is not that hard. And I think, I'm not there, and you probably have seen those. So this is not really directly answering your question, it's a little bit on the side, but I think you might have seen it in the book: I keep repeating how amazing eBPF is. And that has really been a game changer in the last couple of years for us in terms of what visibility we can achieve and at what cost, because obviously strace is great, but the performance hit means that you can't really touch anything in production with that.

And with eBPF, you can. So for the listeners who don't know, it's the Extended Berkeley Packet Filter, and it's a part of the kernel that allows you to more or less write arbitrary snippets of code that can be attached to various events. It also allows for generating aggregations, so it can generate things like counters and not burst, other maps that are directly executed in the kernel without the penalty. And then you can export that data outside to get your visibility. So we've had an amazing amount of, well, great fun to begin with, but also the extended visibility that we're able to achieve with really small snippets of code, to gain the kind of observability that we didn't really have before at all. So that's really been a game changer for us. And that's something that I keep recommending to everybody because, especially in the context of chaos engineering, when the observability is so important, you can do a lot of stuff without actually modifying any of that code, and most of the time without the application knowing anything about your observability. So it's not really my favorite experiment. It's more like an entire family of observability that opens up in this space. So if you're not using it, you should, so check it out.

Jason Yee: Yeah, I know a lot of monitoring companies have definitely taken advantage of eBPF in order to gain observability, and so most of the SaaS offerings out there, at least the ones I know of, have used it. And that was super exciting. So it does sound like one of those good things that you've started to do with chaos engineering is really just to get in there and actually start to validate your monitoring and things like that.

Mikolaj Pawlikowski: Yeah. It looks a little bit scary at the beginning, but once you get past the fact that you need to occasionally look up some bit of the kernel code, it's not that bad. So, good stuff.

Patrick Higgins: What is like the post game day process for you? How do you go through postmortems or post game day wrap ups what does that look like? And do you have any tips for that as well?

Mikolaj Pawlikowski: That's an interesting one. I've got to say that I don't really do a lot of game days because I find that most of the benefit that I get from that is just to generate the initial buy-in from team members, you know, to make it a little bit more fun, kind of hackathon style, and get them excited about it. But then I don't really want them to just do it, you know, on one particular day every month or whatever. I just want them to think that way to begin with. So I'm not super keen on this kind of thing. Yeah, it is cool, and I see the benefit, but it's not necessarily something that I do very often. So I'm probably the wrong person to ask that question. Nothing wrong with them, though.

Patrick Higgins: No absolutely. In terms of the experiments you do run, what tends to be the kind of models that you go after for introducing chaos experimentation generally to teams?

Mikolaj Pawlikowski: Sure. You know, obviously that depends very much on what kind of team it is. For example, for my team, we typically start once we've actually, you know, materialized on paper the SLOs that we're expecting to hit. The kind of obvious first step is to just run some kind of continuous verification that we satisfy those SLOs even though we have the kind of failure that we expect, and if you're running in any cloud environment, the failure is there all the time. So typically we would just see the things that were breaking and continuously, over time, add them to whatever process we ran. For the Kubernetes stuff, we run a lot of those just as scenarios for PowerfulSeal, because it's easy to do, but it could be anything. So the SLOs are typically a good place to start, because it's the kind of thing that everybody knows they need to have, but in practice, you know, it kind of blurs a bit here and there. On the other hand, we talked about the JS and the front-end stuff. If you're running a team like that, you might not necessarily need to have SLOs in terms of, you know, the performance of your front end; if you do, that's great. So I would expect that on the other side of the spectrum it would be very different, and probably just poking around a little bit, with kind of ad hoc experiments, might already be a good start. You know, the kind of thing like in the book, when you just write a little snippet and you verify that things are going well, and, depending on your pipeline, just make sure that it's automated later on. So, yeah, I don't really have a one-size-fits-all answer for this. Like you said, there are a lot of different scenarios, and different rules apply.
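The idea of continuously verifying an SLO while an expected failure is present can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `probe`, `inject`, and `restore` are hypothetical callables you would supply (for example, an HTTP health check and a pair of functions that start and stop a fault), and the 99.9% target is an arbitrary example, not a figure from the conversation.

```python
from typing import Callable, Iterable


def availability(outcomes: Iterable[bool]) -> float:
    """Fraction of successful observations (True means the request succeeded)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one observation")
    return sum(outcomes) / len(outcomes)


def verify_slo_under_failure(probe: Callable[[], bool],
                             inject: Callable[[], None],
                             restore: Callable[[], None],
                             requests: int = 1000,
                             target: float = 0.999) -> bool:
    """Inject a failure, probe the system, and check the availability target.

    All three callables are hypothetical hooks supplied by the caller.
    The failure is always rolled back, even if a probe raises.
    """
    inject()
    try:
        outcomes = [probe() for _ in range(requests)]
    finally:
        restore()
    return availability(outcomes) >= target
```

Running a check like this on a schedule, or as a pipeline step, is one way to turn an ad hoc experiment into the automated verification mentioned above.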

Patrick Higgins: Absolutely. Yeah.

Jason Yee: So I'm curious along those lines. When you mentioned, you know, SLOs, I think a lot of folks are starting to adopt that, and there's often some confusion around, "Great, so you're saying I should start with my SLOs, but how do I actually start?" Do you have any sort of advice on how to create good SLOs?

Mikolaj Pawlikowski: I think that it's probably true that having a bad SLO might be even worse than having no SLO at all. I think that it's one of those things, again, that a lot of people see as something complicated, something that requires a degree in maths. But most of the time, from my experience, the maths is kind of, you know, a back-of-the-napkin kind of calculation, and that tends to be good enough. A lot of the time you do need to estimate this thing anyway. So, you know, if you can multiply and divide, you're typically doing okay. So like -

Jason Yee: But what if I can't do that?

Mikolaj Pawlikowski: Find someone with a calculator. I think that, my advice would be just to start easy, like start really easy and build up. It's kind of like the same thing with alerting, when teams start alerting things, they typically go like all in and they want alerts and all of that.

And then over time you realize that this one is actually pretty noisy. This one has an arbitrary value in it. Well, that value really isn't that relevant anymore; that threshold should be changed. So getting the right alerts takes a bit of trial and error. I'm kind of surprised I'm saying this on record, but it is a little bit of equal parts art and science.

I'm sure you've had some experiences with that too. And it's the same thing with SLOs. Sometimes you have them coming directly from the business and you don't really have a choice, and you design the entire system to meet a certain SLO, right? But there's a lot of gray area where you need to decide on something that's reasonable, and the definition of reasonable is going to vary from one person to another. So yeah, if I can give you one piece of advice, it's to just not think about this as rocket science, and just start simple and iterate. Unless, obviously, your case is clear-cut and it comes from above, in which case, good luck.
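The multiply-and-divide arithmetic mentioned above can be sketched as follows. This is a rough illustration, not a recommended SLO: the 99.9% target, the 30-day window, and the function names are all assumptions chosen for the example.

```python
def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return (1.0 - target) * period_days * 24 * 60


def error_budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Failures still allowed before a request-based SLO is breached."""
    allowed_failures = (1.0 - target) * total_requests
    return allowed_failures - failed_requests


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43.2 minutes of downtime.
    print(round(allowed_downtime_minutes(0.999), 1))
    # 1,000,000 requests at 99.9% allow about 1,000 failures; 400 used leaves about 600.
    print(round(error_budget_remaining(0.999, 1_000_000, 400)))
```

Starting from numbers like these and adjusting the target after a few weeks of real data is one way to follow the start-simple-and-iterate advice.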

Jason Yee: I like that you say art and science, because for me, it's usually, pain and value or annoyance and value of like, All right. I'm getting super annoyed at this alert. Like it's just time to deal with it. I've known for a week that it's, that it's awful. Cause I've gotten hundreds of alerts. So then you turn that off and you try to fiddle with things as you, as you say, iterate.

Mikolaj Pawlikowski: Yeah. That's another good measure. Like, the threshold of pain eventually is the point when you're actually going to fix it. Yeah.

Patrick Higgins: Is there anything coming up for you in terms of what you're excited about seeing at the moment, in the world of reliability and perhaps in chaos engineering, what is exciting you right now?

Mikolaj Pawlikowski: One of the things that is personally exciting for me is the fact that I am seeing a little bit of a movement in the ecosystem away from people immediately thinking that this is just some kind of gimmick. You know, they remember the slogans from blog posts a couple of years ago, and it's just like, "randomly break things in production."

And that's usually, you know, that means that that's not really going to go anywhere. And I think that with the work that's being done by different companies and, uh, you know, as more materials become available and, uh, trainings and workshops and whatnot, it's becoming a little bit more demystified and, uh people will no longer, you know, have this kind of weird, um, laughter when you say chaos engineering to them with a straight face.

So, there's definitely that. I think another thing is that, like you mentioned before, there seems to be a new startup doing observability every other day now. And, you know, it's kind of interesting on one hand, but I think as these things are happening, and as we mature the observability and the eBPFs of the world, and we leverage all of that to get the observability to a point where it's easy to use and, you know, there are tools that do that for you, I think that at this point the adoption can rise much quicker. Because for a lot of people I speak to, the main roadblock that they mention is, you know, the horrible name of chaos engineering, but typically the second and the third ones are about, like, lack of maturity in their observability. If you don't have a really good way to verify that you didn't break anything, then you probably won't go breaking things because, you know, what's the point, right? And the third thing is typically this training aspect. So, you know, that's why I'm kind of trying to address it a little bit with the book.

So I'm hoping that this combination, this little cocktail of things, is going to get us sooner rather than later to a place where, you know, it's just a normal practice. And maybe at that stage we should change the name from chaos engineering to, like, resilience or something, so that people stop being confused.

Patrick Higgins: Yeah. So something like PR proactive failure mitigate. I don't know if that's a...

Mikolaj Pawlikowski: That sounds corporate ready. I'm

Patrick Higgins: That sounds great. Yeah. But we can sell that to bosses. So that's all right.

Mikolaj Pawlikowski: right about that.

Patrick Higgins: Have you kind of found that by introducing these processes, in terms of the people around you, you've seen less burnout potentially, or just better experiences for the people you work with?

Mikolaj Pawlikowski: Yeah. I mean, the SRE people really react to the idea of being called less at night and, if there is going to be an outage, of generating that outage when they're in the office. So there's definitely a good reception to that. You know, I don't really have statistics on how well that actually works out, but there's just the idea of knowing that, you know, you're not just rolling the dice every time.

And that you did your best to try to detect this is already, I think, from the therapeutic point of view, a good thing. But it also works, you know, with the management. They like hearing that this kind of outage we've got covered, that it's not going to happen again, because we know for sure, because we tried the same problem and it actually doesn't cause anything anymore. You know, I don't have a degree in psychology, but my personal perception is that it does help.

Patrick Higgins: Yeah, that totally checks out. I can imagine. Knowing that something's not going to break the same way for like the 10th time is going to reduce people's anxiety. For sure. Miko, do you have in terms of doing some shameless self-promotion could you, could you give us all the deets what's coming up for you? What's exciting. Uh, the name of your book, of course. All of the great things?

Mikolaj Pawlikowski: Sure. The book is called Chaos Engineering: Site Reliability Through Controlled Disruption. I know that's a mouthful; it's the third attempt at the subtitle. That will hopefully work. It's a bit delayed. I think it's now supposed to hit Amazon for the physical copies in mid-January.

So if you just go to manning.com, you can get an online copy now, or pre-order the physical copy. Otherwise, if you'd like to stay in touch with me, I do run a small newsletter, chaosengineering.news. You can put in your email and hear from me. Otherwise, if you are looking for resources to start with chaos engineering, and that's not a shameless plug, I think one of the best resources is the awesome chaos engineering list on GitHub. There are plenty of different links, different things, and it's fairly up to date, so I tend to recommend this one. And if you want to chat about that, or if you'd like me to give a presentation to your team or a talk, reach out on LinkedIn, and I will figure something out.

Patrick Higgins: Awesome. Thanks Miko. That's great.

Categories
Podcasts, Industry
© 2021 Gremlin Inc. San Jose, CA 95113