Tiny DevOps episode #44 Tod Hansmann — Observability as an engineering enabler
September 27, 2022
In this episode:
- What is observability (o11y)?
- What can observability do for you?
- What metrics should you track?
- How does observability relate to logging, alerting, monitoring, and other practices?
- Who should be responsible for observability?
- How heavily should upper management be involved?
- How does observability relate to culture?
- CI/CD as a prerequisite for observability
- Why metrics are better than logs
- Surprising metrics that can be important
- The relationship between monitoring and automated testing
- Good observability as an enabler for canary deployments, test in production, and other practices
- How to define service level objectives
- How do you define "uptime"
- How to address corner cases
- Why being on call is desirable
Book: Site Reliability Engineering
Automated: Ladies and gentlemen, The Tiny DevOps Guy.
Jonathan Hall: Hello, everybody. Welcome to another episode of The Tiny DevOps Podcast where we like to talk about solving big problems with small teams. I am your host Jonathan Hall, and today I'm excited to have Tod Hansmann with me. Welcome, Tod.
Tod Hansmann: Glad to be here.
Jonathan: Awesome. Why don't you tell us a little bit about yourself? Actually, before you do that, I want to let the audience know: we're talking about observability today, what it is, why you should care, or maybe why you shouldn't care. To do that, I have Tod here. Tod, tell us a little bit about yourself: what you do professionally, and how that maybe relates to observability.
Tod: I call myself a problem solver. I run a small consultancy with some friends of mine. We do software stuff, but it's all based on 25 years of being a dev or IT professional of other sorts, back before the internet. When you learned to dev back then, you learned all the things about computers, ever. That's my background, and why I know all the things on the ops side of the house, traditionally speaking, compared to just the engineering side. That's why I care about observability.
Jonathan: You learned to program before the internet like I did?
Jonathan: I used the public library and a Commodore 64 when I was getting started. [laughs]
Tod: My dad brought home a big old stack of books related to DOS and QBasic.
Jonathan: There you go. Awesome.
Tod: That was what we had in the house.
Jonathan: Sweet. You said you're working in a consultancy. How many of you are there?
Tod: There are six of us at the moment.
Jonathan: What kind of work do you do? Do you do software dev work, or other IT stuff, or what?
Tod: I'm terrible at tooting my own horn. We mostly do software engineering workflow consulting. We'll work on a project basis as well. That's how we work, but the project is not the value. The idea is we can teach you the faster ways of development that we've come up with as an industry.
We can teach that by example. Get some teams going with us as we go and then we create catalyzing change. That's why we're called the Catalyst Squad, catalystsquad.com.
Jonathan: What companies do you work with? Are they big companies? Small companies? All across the board? What is it?
Tod: We're not picky about company size specifically. Typically, the need is going to be in a larger shop, somewhere between 50 and 1,000 engineers. That's typically the range where we're most effective.
Jonathan: The topic for today is observability, which is of course a broad topic. There's a lot of things we could talk about there. What kinds of things do you see? Maybe we'll start with your clients and we'll broaden a little bit, but the companies you work with as clients. What challenges do they have when it comes to observability?
Tod: I think it depends on what stage you're in. It's a lot easier to build observability into a greenfield project upfront, whereas in an existing system it really has to be worked into your entire workflow and your entire ethos. Changing to match that need is very difficult for a lot of shops, especially traditional shops that aren't tech.
They don't know how to manage that kind of change. Even tech companies of yore have tried and failed many times to make that shift: shift everything left, et cetera.
Jonathan: I should've started with this honestly, but let's start with defining what observability is and means. Like I said at the beginning, it could mean a whole bunch of different things maybe to different people. What is your concept of observability? What do you think that means?
Tod: It will change slightly depending on the audience you're talking to. The executive team is going to have a very different opinion about what it means and what they care about in that broad topic than the average engineer, that sort of thing. For me, it's anything related to how your system or systems are behaving.
It's not just alerting and monitoring. The metrics capturing is one thing, but it's also the analysis side. You need tools to do that analysis and that's usually overlooked until there's a problem. I think that side of the house, it gets less press, I guess.
Jonathan: Whose responsibility would you say observability is if you have a tech team? Is it developers? Is it the operations guys? I don't know. Is it on someone else?
Tod: It's 2022, and if you have an ops team and you call them an ops team, you're probably behind on several fronts here.
That's step one is we're going to need to get you up to the point where observability is even a thing we can talk about. I believe that a lot of the responsibility for the actual work being done should be on engineering. Traditionally, the ops side of the house would have been heavily involved and they should be embedded in that world. For me, the actual responsibility, the cultural responsibility is throughout the organization.
You can't have a good observability program without the C-suite being on board. They have to understand: what are the costs? What priority adjustments happen when an alert fires? That whole culture needs to be there.
Jonathan: We hear a lot about culture, especially on a podcast that's called DevOps because DevOps is often said to be culture, right?
Jonathan: Often though, I get a sense-- Culture is also-- It's a confusing term, and it's nebulous, and it's hard to point to. You can't say, "Oh, look they have the right culture or we have the right culture whatever." What are some elements of this as it relates to observability? What are the things that you'd look for to know if you have the right culture or where to improve that culture? I know it's a nebulous question because this is a nebulous topic, but where do you start in that area?
Tod: Capabilities is usually where I begin. There was a good talk at [unintelligible 00:06:38], a keynote somewhere. I don't remember who the speaker was or any of this, so this is a terrible, terrible quote, but somebody was on stage talking about CI/CD, and they said, "If you can't build your entire infrastructure from scratch in less than an hour, you're not ready for CI/CD."
I thought that was a really interesting observation, worth debating in another forum at some point: the idea that having this capability over here means you're ready for seemingly unrelated capabilities over there. You're not ready to do monitoring and gather metrics if you're not automating deployments, and version control, and things like that.
That doesn't make a whole lot of sense until you're on the other side and can see clearly the pattern that relates the two. Usually, that's where I start: asking questions like, "How do you deploy? What does a release look like? What do versions look like for you? How often do you deploy?" Almost the exact questions from the State of DevOps reports over the last six years. Ask those questions, find out where the maturity level is, and work from there.
Jonathan: Basically, it sounds like what you're saying is: don't even worry about a lot of the observability concerns until you have CI/CD in place. Is that right?
Tod: That's an example, but yes, there are several things you probably need to do before you ever get to observability in any meaningful way. You still want to go through some of the motions: install Grafana and Prometheus and all those things. You just can't use them effectively until you have the rest in place.
Jonathan: Let's imagine you're talking to somebody. It could be your client, it could be a listener, it could be someone at a bar and they're like, "Yes, we have this cool tech. We're a start-up whatever," then the topic of observability comes up like, "What's that? Why do I need this? Why would I care? Grafana looks cool, but what does it do for my business?" [laughs]
Tod: This, again, depends on the audience. SRE is a topic that's supposedly defined, but also not defined at all.
The world of observability depends on the org. If you're a five-person team, period, and you're working on some SaaS product, you still want observability. As a matter of fact, I have side projects with a friend where we literally run the whole thing, and we have observability. We have all the metrics going to Prometheus, we have dashboards, we have alerting.
Automating the pain out so you can speed up recovery is a huge deal for anybody. I don't do it because, "Oh, I really want to be sophisticated about these tools"; that's a lot more effort. I want to be lazier about ops things. I don't want to have to troubleshoot things, so I'm going to build some intelligence into my metrics so I can see exactly what's happening on this dashboard. Those things matter even to small groups, but for larger groups, you can provide different dashboards for different audiences.
The thing an executive cares about is: is the business healthy? There's your executive dashboard. Just show a single signal: green, yes; red, no. There are lots of simple ways this really adds value to everything. We lean on dashboards because they're a much more visual thing, but there's also developer happiness. Product managers will love it because, hey, I can see exactly how my feature is doing. I can see when I turned it on, and I can see how that affected all these different, completely unrelated metrics. It's powerful once you get there.
Jonathan: You talk about speeding up recovery. How much of a driver is that for observability in your mind, and how much is it other things?
Tod: I think it's one aspect. It's like restorations driving backups: they're not the reason you do backups the way you do. There were a million other reasons you did the [unintelligible 00:11:08] thing for tape backups back in the day. There are lots of policies and procedures that you follow for other reasons, like making sure things are consistent or that you hit a certain timeframe. I think the same applies here. Recovery is the ultimate holy grail of incident management.
Honestly, if you have a quick recovery, you have good controls in place and good understanding; it doesn't necessarily mean you've done observability well. There's a lot that goes into recovery. Really, observability is just about being able to find out what the problem is, or that there is a problem in the first place, quicker, rather than your customers telling you, which is the worst way to find out you have a problem. I think it's related, but with just a little overlap, maybe.
Jonathan: Let's talk about the different-- I don't know if what's the right words, the different approaches or tools maybe that go into observability, you talked about Grafana which, of course, you can graph anything from memory utilization to how many requests you're serving per second. There's all these things you can monitor there. Maybe we could just create a quick bucket list of what are the things that you can monitor, and then next we'll talk about which ones are the most important. What are the things that you might consider for observability purposes?
Tod: I think the basics are really accessible to people. You always start with CPU, memory, disk space, things like that. Unless you're in a cloud-native world: if you only have a Kubernetes cluster, disk space isn't usually something you're concerned about in the same way. You care about volumes filling up, but you measure those differently, and there are problems there. Other than that, it's just so easy for everyone, and then everybody thinks, "This was kind of a letdown."
You've got to follow that up with more powerful metrics. Requests are really good and can be pulled out of the logs that people are probably already using. That's always nice. You want to avoid logs, and that's a weird thing for people to process when they're beginning their observability journey. Metrics are better than logs. Why? Well, sit down. The biggest thing, to me, for giving someone that magic flash of "observability is powerful" is to get their own metrics into something they're visualizing, something they're exploring: "Oh yes, compare this to how many requests per second you're having," or something.
Those off-the-shelf metrics: add your own. Now suddenly, yes, every time an order is posted for fulfillment, you run out of disk space over here, or something. Those are powerful moments. I think that's the real spark that you want. Those sorts of things are what you measure.
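Tod's "add your own metrics" idea can be sketched in a few lines. This is a minimal, hypothetical illustration, not anything from the episode: the toy registry and all names (`Metrics`, `orders_fulfilled_total`, `fulfill_order`) are made up, and a real service would use a metrics client library such as Prometheus's rather than rolling its own.

```python
# A toy in-process metrics registry; all names here are illustrative.
# The pattern a real client library follows is the same: increment
# counters at business events, and let a scraper read them periodically.
class Metrics:
    def __init__(self):
        self.counters = {}

    def inc(self, name, amount=1):
        # Increment a named counter, e.g. once per fulfilled order.
        self.counters[name] = self.counters.get(name, 0) + amount

    def snapshot(self):
        # What a scraper would read on a fixed interval.
        return dict(self.counters)

metrics = Metrics()

def fulfill_order(order_id):
    # ... real fulfillment logic would go here ...
    metrics.inc("orders_fulfilled_total")

for order_id in range(3):
    fulfill_order(order_id)

print(metrics.snapshot())  # {'orders_fulfilled_total': 3}
```

The point of the sketch is where the increment lives: right next to the business event, so "orders fulfilled" can later be graphed against disk space, requests per second, or anything else.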
Jonathan: You just said that you should avoid logs. I'd like you to talk about that more, because I think that's a counterintuitive point that probably just went over a lot of people's heads. Let's talk about that, and then I'll dig in. [laughs]
Tod: All right. Fair enough. It is a fairly controversial statement, especially for people just starting on their journey. Logs are really useful in the exact moment, and they go back maybe 10 minutes. That's what you typically use logs for: you're trying to find the bug during a debug session or an incident or something like that. Everybody has a very visceral, exact experience with logs, and they're like, "Don't take these away from me," and that's fine.
You don't want to take them away before you have something better; work on the better thing first. When you're going through the actual troubleshooting, it is better to have the right metrics instead, so you know the behavior on the assembly line of processing a request. You know exactly where it breaks down. That's much more relevant information than logs that tell you the same thing, because the logs will take you longer to get there, and they have to be parsed out of the rest of the noise.
It's not about having good, succinct logs. Even if you're logging very, very appropriately, you still have a ton of logs that aren't related to the incident, because those are related to other things you might troubleshoot later. Logs are a lot of noise. They're hard to process. You can't pull the context out. They're really just for a human; no computer can take action on them unless you're doing a lot of log parsing, and that's a lot of time and money spent on log parsing that you should be spending on metrics.
A number is easier for a computer to respond to. It doesn't feel like you should, but you should eschew logs in almost all cases. You want to capture the last 10 minutes or whatever? Sure, that's fine. When you're debugging, they can be a nice signal, especially if you're not monitoring everything under the sun already. If you haven't got metrics for everything that you would look at logs for, keep them on; it's fine. Just don't store them for three years.
Jonathan: Yes. Okay.
Tod: You'll slowly just switch back to metrics as you get more and more going.
Jonathan: Is there still a place for logs when you're trying to debug something? A customer calls and says, "Last night, this weird thing happened that has never happened before." If you had a metric for that, it would be one little tick; I can't imagine that would be useful. Isn't that the place where logs are useful, or does that eventually go away too?
Tod: I think if you go down the rabbit hole, you'll find different flavors of it, at least as far as my experience goes. Some groups will still keep logs, but they just have more sophisticated request tagging or something on those logs, so that they could just do that debugging. The more effective one that I've seen, but takes more engineering upfront is there's audit tools that they build.
It's either just an audit table of all the things that people do, or they'll have a whole audit system that feeds from different services and does some tracking of requests and whatnot. It's a lot of work, I guess, but some people really like that. It's cool to see it. I myself wouldn't do it, because I don't have that many developers to dedicate to such a thing, but if you have it, great. Awesome.
Jonathan: Speaking of not having enough developers, that's going to be a common theme, I think, for many of our listeners, because we focus on small teams. Where does it make sense to start? Should you bypass the logging thing entirely, like, "Okay, we're never going to do logs, we're only going to do monitoring," or do you start with logs and wean yourself off of that later?
If you're a two-person startup and you're just trying to get some money, trying to get customers before your runway runs out, where do you focus your effort most effectively?
Tod: Storage is cheap. If you're in a scenario where you have, like-- I chiefly work in the cloud-native world right now, but even before that, you're typically aggregating logs somewhere, and there are tons of systems to do that easily for whatever you're running. It's okay; spend the extra $5 a month and have some logs. It's not terrible, but start your metrics journey early if you're just starting out.
Start your observability tooling on that side of the house as well, the moment you start getting going. Build that muscle slowly over time and then you'll never really look back and that's fine. Again, it's super cheap. It's nice insurance.
Jonathan: What are some of the more unusual metrics you've seen, ones that maybe surprised you by being useful, or that would surprise an average person? CPU makes sense, but maybe someone was monitoring something strange.
Tod: There are some interesting latency metrics that you wouldn't think matter, but they're very interesting in other fields, or I guess more niche use cases. Your typical web application is not going to benefit too much from this, but it is also super interesting; maybe if you get the metric for free, record it anyway. Latency of a request makes sense to people: from my ingress to service A, what's the latency between those?
More interesting to me was the latency of things like I/O on disk, or "I made a syscall, and it took however long to return." Things like that are really interesting, especially in the VoIP world. We're on a platform where we're doing VoIP and video at the same time, and those latencies are really, really important, especially things like jitter: latency goes up and down as other people share your line, or whatever happens.
That can be measured very, very easily, and you can alert on different things in your system: everybody's showing latency, so something's wrong on our network. Great, let's dive into that. It's a weird one unless you're really deep into networking, which I'm sure somebody on the listening end is going to be. Props to you, you already know this, but a lot of devs don't, and it's worth keeping in mind, especially if you get outside the application-code-only world.
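One common way to capture latencies like these is a cumulative-bucket histogram, in the style of Prometheus's `le` buckets. The sketch below is purely illustrative; the bucket bounds and sample timings are invented for the example, not taken from the episode.

```python
# Cumulative latency buckets, in the style of a Prometheus histogram's
# "le" (less-than-or-equal) buckets. Bounds in milliseconds, chosen
# arbitrarily for this sketch.
BUCKETS = [1, 5, 10, 50, 100, float("inf")]

def observe(hist, latency_ms):
    # A sample counts into every bucket whose upper bound it fits under.
    for bound in BUCKETS:
        if latency_ms <= bound:
            hist[bound] = hist.get(bound, 0) + 1

hist = {}
# Pretend these are measured disk-I/O or syscall round-trip times.
for latency in [0.4, 3.2, 7.7, 42.0, 250.0]:
    observe(hist, latency)

# Jitter shows up as spread across buckets; a slow outlier like the
# 250 ms sample lands only in the +inf bucket.
print(hist)
```

From buckets like these you can derive percentiles (the P99 Tod mentions later) and alert when the distribution shifts, which is exactly what makes jitter visible.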
Jonathan: What do you think is the relationship between good monitoring and automated testing?
Tod: I think they're very related skills. Again, my background is dev before the internet, where people assume, "Oh, you learned to code and that was great." No, no. I learned how to monitor, or how to flip DIP switches on a motherboard so that I could get the exact frequency for this other piece of hardware to respond to my code. We had manuals bigger than any book you ever had in school. It's a thing. In that whole world you just learn everything, and nobody tells you that's difficult or anything. You're just driven by, "I really want to make this gorilla on the screen do a thing."
Tod: Especially as a kid. Today there's a lot of stigma about learning all these things: "Oh, that's not my job," or, "Oh, that's a skill that's going to take a lot longer to learn." Learning how to write tests intimidates some devs. That's fine, I understand why, but it's not actually that hard, and everybody discovers that when they go through that slog. Same thing with observability: having good monitoring and good hygiene for your alerts are things that devs just do once they learn that skill.
It feels like, "Oh, this is a whole sophisticated realm." No, this is a Tuesday afternoon conversation, and then you'll be an expert; it's not too hard. I think good monitoring hygiene isn't just alert management or anything like that; it's a good, healthy relationship of, "I'm going to look at these metrics and make sure I'm not missing anything at a glance on a daily basis, but I'm not relying on looking at them all the time."
I don't have a separate window that's constantly on this dashboard, because I trust my tools to alert me. When you're there, that's a very different feeling from that first rush of "I've got to watch this all the time." No, you don't. The whole point is to not have to worry about it until you have to worry about it. Let the computer do the thinking for you.
Jonathan: I suppose, related to that: if you have good observability, good monitoring, and good alerting in place, it also affords you some additional freedom to do things like testing in production, which maybe you'd be really nervous to do if you weren't going to get an alert as soon as something goes wrong.
Tod: Yes. It allows you to do things like canary deployments, where the testing is automated, that sort of thing. It's amazing how many of the things that are hot buzzwords in the DevOps world are driven by observability. You need metrics to be able to do them.
Jonathan: If you're trying to do a canary deployment, but you don't know if your canary deployment works, then you're dead in the water. [laughs]
Tod: You've got this wonderful deployment operation that doesn't actually matter.
Jonathan: Let's talk about some of the numbers we might look for, maybe a service level objective or something. How do you come to that? Of course, at the high level, when we're talking about the executive dashboard, that's either green or red; it's basically a checkmark: is the system in good health? That's a big business decision. What's our key metric? Are we selling widgets? Are we serving our API requests?
What are we doing as a business, and can we do that to whatever level we define? I know you need to get the business involved, maybe even the CEO, to determine that. When you get lower down to the nitty-gritty, what do you look for to decide which numbers we care about, whether for a canary deployment, or for whether we should send an alert to the guy at 3 o'clock in the morning because things are really bad now? Or maybe they're not that bad and we don't send an alert. How do you approach that whole problem?
Tod: Let's break that into two buckets: the business side, I'll say, and the alert management side. For the business side, there are two inputs to the conversation you must have with the business, whoever the business stakeholders are; usually the person writing the cheque for the level of service should be the business owner in this case. You have an SLO; it's a business objective first and foremost. It's an agreement, a contract with the business that you've negotiated, saying the business is willing to pay for this level of service, and you need to make sure that happens. The higher the level of service, the higher the cost.
The two inputs for that conversation are these. First, what does the business need to pay? What's the dollar amount, or the dollar range, that they're required to pay for a certain level of service? They're going to want to adjust that, so don't just have one number; have different levels and different designs to go through for that whole rigamarole. The second is: what is the service capable of now? If you're talking about an in-place service, that design is static until you can refactor it a lot, and you need to know that.
It's not going to be five nines at all. The P99 on this is going to be terrible, because it's just not that capable. Then bring the price of that and what it would take to get higher. If it's a huge development effort, is the business willing to pay for it? That seems like a pretty straightforward framework for the business side, but it doesn't work at all for triage of alerts. That's going to be on a case-by-case basis, especially if there are multiple alerts happening at once, and cascading failures happen all the time.
That does factor in. Usually: if this doesn't get handled until Monday morning, what's the impact to the business? There are lots of things like, "Oh man, emails won't go out until I fix this." That can happen on Monday morning, unless you're at huge scale and your provider will only allow so many, or whatever. You should know those variables, but ultimately the conversation is about what the business needs in that case as well. You're just thinking of it through the lens of the user experience and trying to make those calls. If you don't have authority for those calls, that's a lack of autonomy, and that's a different problem. At 3:00 AM, you should be able to make all the decisions about whether to handle an alert or not. If you can't, then you're not the person who should be receiving the alert.
Jonathan: If you don't mind, I'd like to talk a little bit more about that first aspect, the business thing, because I think that's going to be the part that trips up a lot of people; the audience is mostly technical. Where do you start? Let's imagine a conversation I've had recently, where your CEO comes and says, "Hey, can we promise our customers 99.99% uptime? We can do that. We can put that into a contract."
I'm like, "No, probably not. That's just a couple of minutes of downtime in a week. That's not very much; we probably broke that SLA yesterday." How do you go from having no idea what your system is really capable of, except maybe a gut feeling, to first measuring what our uptime or availability is, however we define that? Let's start with that, and then I'll follow up. We have a monitor -- I think we're using Uptime, I don't remember the name of the tool; there are a thousand of them -- that just sends a ping to your server. "Oh, I got a 100% response rate, so my server's good," right? But it's not that simple. Where do you go from there?
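Jonathan's 99.99% figure is easy to make concrete with back-of-the-envelope arithmetic. The helper below is purely illustrative (the function name and period choices are made up for the sketch), but the numbers follow directly from the definition of an availability target.

```python
# Downtime a given availability target leaves you, per period.
def downtime_allowed(slo_percent, period_minutes):
    return period_minutes * (1 - slo_percent / 100.0)

WEEK = 7 * 24 * 60      # 10,080 minutes
YEAR = 365 * 24 * 60    # 525,600 minutes

# 99.99% over a week leaves roughly one minute of downtime, so
# "we probably broke that SLA yesterday" is a very plausible reaction.
print(round(downtime_allowed(99.99, WEEK), 2))   # ~1.01 minutes/week
print(round(downtime_allowed(99.9, WEEK), 1))    # ~10.1 minutes/week
print(round(downtime_allowed(99.999, YEAR), 1))  # five nines: ~5.3 minutes/year
```

Each extra nine shrinks the budget tenfold, which is why Tod's point that "the higher the level of service, the higher the cost" is the first conversation to have with the business.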
Tod: I think the critical question is: what are you measuring, and what are you not? Usually the old way of thinking, when you just pinged a server and assumed it was up, was that if the server is running, it's got self-healing mechanisms: supervisor will catch a crash and restart the process, or whatever you've got in place. All of those should just be working, and you'd have that measure. Then the first time that fails, a year and a half in, suddenly everybody's scrambling to figure out what the heck happened to the server, and there's lots of forensics going on, war rooms, et cetera. That's not the place you want to be at all.
The magic happens when you understand you're measuring all of the things you can to piece together an idea of what was happening. You can't actually measure anything instantaneously. You can't ask, "Is it up right now?" and go actively check again, and again, and again, every time you want that answered. You have to gather the metrics: well, it was up for the last five minutes, so it's likely that it's up now; I'll wait five seconds before I check again, or whatever.
Once you start building that sophistication into your measurement, you realize, "Oh, I never really thought about it; I can't just hammer my server to ask, and I will never really, truly know if it's up." Those are hard things to realize, but once you do, you start measuring all the other things, and it's a much better world. Move through that grief as fast as you can. The other side of that coin, which you were talking about, is bringing that to the business: you have to start asking questions about business value, which is not usually the engineer's comfort zone. Go to your PM or something if you have never done this before.
It's totally fine. They'll be happy to talk business all day long with you, but you have to figure out what does the business actually care about? Why? Why do customers care about that sort of thing? Then you get to a point where you're actually bringing data that disproves the current theory.
That's always a good feeling like, "Oh, I as the engineer know more about the user's behavior than you do because I have the data and I will bring it to you and it will be great and we will commune with the user." [laughs] That's a good place to be. You have to discover those metrics of like, "Oh, I'm actually measuring user behavior now not server Uptime." That's a very different world. They're very different conversations. Everything changes.
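Tod's "it was up for the last five minutes, it's likely up now" point can be sketched as a heartbeat-plus-staleness check. This is a hypothetical illustration: the function names, the fake clock values, and the 15-second staleness threshold are all invented for the example.

```python
# Instead of actively pinging on every question, record heartbeats and
# infer liveness from how stale the last one is.
last_heartbeat = None

def record_heartbeat(now):
    # Called whenever the service reports in (or a scrape succeeds).
    global last_heartbeat
    last_heartbeat = now

def probably_up(now, max_staleness=15.0):
    # "It reported in within the last 15 seconds, so it's likely up."
    # Note this is a probabilistic judgment, never a certainty.
    return last_heartbeat is not None and (now - last_heartbeat) <= max_staleness

record_heartbeat(100.0)     # fake clock values for the sketch
print(probably_up(105.0))   # True: last heartbeat was 5 s ago
print(probably_up(200.0))   # False: the heartbeat has gone stale
```

The design choice matches Tod's framing: you never "truly know" the service is up, you only know how recently it gave evidence of being up, and you pick a staleness window you can live with.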
Jonathan: Let's imagine a simple scenario, you have an e-commerce store selling widgets and an obvious thing to do. Of course, we've already decided, okay, just ping the server isn't sufficient. That doesn't actually prove that things are working. Maybe the next step is we monitor have we had any sales in the last 15 minutes. If that ever drops to zero, okay, maybe there's a problem we haven't seen before or things might appear to be working.
Certain services are responding, but nobody's checking out so that's a problem. That's at least in theory a measurement of user behavior. Maybe the users are trying to check out, but they're failing for some reason or maybe our competitor just has a better sale going on right now. [laughs] Is that the kind of thing you're talking about? Measuring number of sales rather than measuring are the servers responding and the CPUs are only at 15% utilization and stuff like that. Is that the kind of thing?
Tod: You can progress there. You start measuring by incrementing a counter every time a sale happens, and then you discover rate monitoring: that number changes if you window it over five seconds or something. I can see how many orders a second, or whatever, we're getting. Then I can draw trends over time: look at the last week; if you're outside of a 20% band, alert me, because that's weird behavior, it's been consistent for months. Then you have your first Black Friday, and, "Oh, I didn't account for this," and alerts go off, and everybody's Thanksgiving is ruined forever, [laughs] but you start discovering those things.
I think those are usually more valuable. Let's pretend that your servers were on fire and they just weren't responding to anything, but the user was still able to buy widgets from your e-commerce site. Do you care about the server on fire? Is that the emergency? Business is still happening. That's a weird thought experiment, but it drives the point home: the server isn't actually the thing that matters.
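The windowed-rate-with-a-band idea Tod describes can be sketched roughly as follows. Everything here is an illustrative assumption: the 20% band, the warmup period, the trailing-average baseline, and the sample data are all invented for the sketch, and the Black Friday spike demonstrates exactly the false alarm Tod jokes about.

```python
# Flag any window whose rate deviates more than `band` from the trailing
# average of all earlier windows. All thresholds are illustrative.
def out_of_band(rates, band=0.20, warmup=3):
    alerts = []
    for i, rate in enumerate(rates):
        if i < warmup:
            continue  # need some history before judging what's "normal"
        baseline = sum(rates[:i]) / i  # trailing average of earlier windows
        if abs(rate - baseline) > band * baseline:
            alerts.append((i, rate, round(baseline, 1)))
    return alerts

# Orders per five-second window: steady traffic, then a Black
# Friday-style spike that this naive band (wrongly) pages on.
rates = [10, 11, 9, 10, 10, 30]
print(out_of_band(rates))  # [(5, 30, 10.0)]
```

A production system would use something like a seasonal baseline rather than a flat trailing average, which is the fix for the ruined-Thanksgiving problem: "normal for Black Friday" is not "normal for a Tuesday."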
Jonathan: Where would you go from there? Let's imagine that we now have our checkout monitoring in place, and we're confident we're getting the correct number of checkouts, but suppose all of our Japanese Yen sales are not happening because there's some conversion problem, or maybe one particular product page is broken; it's just one product of thousands that's not being sold, and we're not seeing that. How would you detect that sort of thing, one of these corner cases?
Tod: Corner cases are interesting. Every company is going to be different because it's going to depend on your architecture. I immediately thought, oh, I would just have that in the payments processing service. I would just be doing metrics with a cardinality of Yen, or whatever the currency is. That makes sense to me. Of course, that makes sense to me because I just designed the perfect system to make this really easy. I've straw-manned the heck out of this. Congratulations, me. [laughs]
The ultimate problem is you have to inject metrics for everything everywhere in your system. Even then it will still not be enough to answer all the questions, but if you have the architecture, just inject measurements at every point you can, including adding a cardinality for payment processing and whatever. You can do that in a monolith. That's fine, but only if that module is called every time. If you have completely different modules for different things and it's hard to coordinate, now you have a different problem.
It will show up in your observability, but it's not actually an observability problem. It's an architecture problem. [laughs] I think from there you just start measuring as much as you can all the time ever. When you discover problems that you're missing metrics for, that's a project to find more metrics. Embrace the failure as they say, and fix it today and then fix it forever. In that order. You don't necessarily do the right fix right now. You do the bandaid and then you heal the system. That sort of thing.
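[Editor's note: a minimal sketch of the per-currency cardinality Tod describes — one counter with label dimensions, so a stalled JPY checkout flow shows up even while overall checkouts look healthy. The metric and label names are made up for illustration; in practice a library like the Prometheus client provides this.]

```python
from collections import defaultdict


class LabeledCounter:
    """A counter keyed by label sets, in the style of Prometheus metrics."""

    def __init__(self, name):
        self.name = name
        self.values = defaultdict(int)

    def inc(self, **labels):
        # sort labels so {"a": 1, "b": 2} and {"b": 2, "a": 1} match
        self.values[tuple(sorted(labels.items()))] += 1

    def get(self, **labels):
        return self.values[tuple(sorted(labels.items()))]


checkouts = LabeledCounter("checkouts_completed_total")


def record_checkout(currency, product_id):
    # one counter, two label dimensions: you can later slice and alert
    # on currency, on product, or on any combination
    checkouts.inc(currency=currency, product=product_id)
```

With this in place, "JPY checkouts dropped to zero" becomes an alert on one label slice rather than a needle hidden inside the overall checkout total.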
Jonathan: How do you do that for something that doesn't happen frequently? Maybe it's an every Sunday morning thing, or maybe it's we sell this model of widget only once every three months. You're not going to get a five-second or even maybe a one-day alert because that thing doesn't happen very often. Do you have to use different tools to address that or what's the answer there? If there is one. [laughs]
Tod: There's other tools out there. In the cloud-native world we're using Sentry a lot. It's getting a lot of healthy press, I think. It's more like the old APM world than traditional metrics. You get stack traces and all that, but you can still alert off of that because it's just data. It's event-driven, but it's data. Those are good for rare things. Transaction processing is a really interesting one, because if you can't actually do a transaction to test end-to-end, you can't get true end-to-end observability. Stripe's really good at that.
The plug for Stripe, I guess. [laughs] They're really good at having a Dev environment. You can hit the Dev API with the Dev API key that you have, et cetera, and do full end-to-end tests up to them, but you can't actually check that you're doing a payment without doing a payment. At the end of the day, something is being mocked and you can't trust the whole thing. That's an impossible problem to solve.
The reality is that you're going to just let those things slip, and you're going to have other pieces of data covering that ground. You have some overlap everywhere. It's like a sprinkler system. You want two heads to cover over each other. This is exactly the same. There's going to be places that this system is going to miss because it's just not designed for that. Just involve more systems. You don't want too many, but [laughs] they specialize. It's fine.
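[Editor's note: for the infrequent events Jonathan asks about, one common pattern — not named in the episode — is absence alerting: track when each event was last seen and alert when the gap grows well beyond its historical norm. This sketch and its names are illustrative.]

```python
class AbsenceAlert:
    """Alert when an expected-but-rare event has been silent too long."""

    def __init__(self, expected_interval, tolerance=2.0):
        self.expected = expected_interval   # typical seconds between events
        self.tolerance = tolerance          # alert at tolerance x expected
        self.last_seen = None

    def record(self, now):
        self.last_seen = now

    def overdue(self, now):
        if self.last_seen is None:
            return False                    # never seen: nothing to compare
        return (now - self.last_seen) > self.expected * self.tolerance
```

The same idea underlies "dead man's switch" alerts in monitoring systems: instead of alerting on something happening, you alert on it failing to happen.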
Jonathan: What other questions should I or the listeners be asking when it comes to this? What else do we need to know?
Tod: I would say the other side of observability, the one people are more uncomfortable with, is that they should be having discussions like, I want to be on call as an engineer. As a software engineer, I want to be on call because there's no other way I will care about a system not dying at 3:00 AM than having it wake me up, and there's no way to circumvent that. None. At some point I need to be responsible for the thing, or I have the wrong incentives.
That is difficult for a lot of Devs: "I've never been on call. I don't want to be woken up at 3:00 AM." That's good. You should be lazy like that. I'm that lazy. It's fine. I've been on call for 10, 15 years, somewhere in there, off and on. It's totally fine to be on call, especially if you have the power to fix it.
That's where the discussion needs to happen. It's not just Devs that need to step up to the plate, the culture needs to be there to support it like I got an alert at 3:00 AM, that's what I'm fixing today. I don't care about your feature anymore. This cannot happen again. That's the culture you need and that's going to take some discussion ahead of time, I think.
Jonathan: That's always a contentious issue. For many reasons some companies don't even want to pay for on-call. [chuckles] Other people, just the company wants to but the engineers don't want to do it. There's a lot of reasons that could be complicated, but good advice. I agree completely. It's not just on call, but the biggest motivator for a stable system is to not be woken up.
That's also the biggest motivator to write solid tests and to make sure that you understand the customer requirements before you write the feature, all these things. When you pay the price for the failure, you're more motivated to do it right.
Jonathan: Cool. Well, I've enjoyed the conversation. I've learned a few things. The idea that logs are a bad idea is a little bit new to me. I mean, I see where you're coming from and I think it makes sense. I'll have to chew on that one a little while.
Tod: That's all right. You don't have to agree with me either.
You'll come around eventually.
Jonathan: It might be partly. I mean, I'm sure it's largely due to the fact that I've not worked in a lot of environments with really solid monitoring. I mean, I've worked in a couple. I worked at Booking.com for a while, and they're a big-- they have a strong emphasis on monitoring. I don't think I ever saw a log when I was there. I could probably get some debug output from something somewhere if I tried, but it wasn't routine by any means to look at logs there.
It probably makes sense. Again, I'll chew on it some more. I appreciate the challenging thought. How can people get in touch with you if they're interested or are you on social media or anything like that?
Tod: I am on social media. I'm pretty much Tod Punk everywhere. It's Tod with one D, P-U-N-K. Catalystsquad.com, you can use the contact us page there if you're interested in that side of the world. I'm always willing to talk on Discord or Twitter or whatever, I'm really accessible. As long as people don't abuse that, that'll probably stay the way that I always operate.
Jonathan: Great. Any resources you can recommend for people, if they're interested in learning more about these topics about how to do observability, is there a good book to read or go to your company's website and hire you? What's the best way to go down this path further than a 30-minute podcast?
Tod: The best intro to, I guess, DevOps and observability, for my flavor, is the SRE book. That's the start of that journey. I don't think that by any means it's the Bible of anything, but it has the right talking points for discussing what that culture looks like at a given organization. From there, I think you just need to build projects. You can jump on open source, you can jump on anything like that.
Find a community that will essentially think this way in a cloud-native observability-driven way, and just work on that. Just lurk and observe observability. [chuckles] That's usually where I say to start. Usually, there's Discord community or something around the language that you like. They'll be doing modern stuff. You're going to be in containers at some point on that journey because that's the thing that everybody does now.
As long as you're comfortable learning all these things, they'll walk you through steps and how to do things on your local environment and things like that. Lots of friendly groups out there, I would start there.
Jonathan: Great. Wonderful advice. Thanks for your time. Thanks for challenging us. We'll be in touch, Tod Punk on LinkedIn, Twitter-
Jonathan: -everywhere. Great, Tod. Thanks so much. Until next time.
[00:44:53] [END OF AUDIO]
Adventures in DevOps 129: The Future of Intelligent Monitoring and Alerting with Ava Naeini
Ava Naeini shares her patent-pending tool that uses ML to determine the health and performance of distributed systems.
How serious is a deployment failure?
A deployment failure should go through standard alert channels, but should it page whoever is on call?
When not to monitor your systems
With too many alerts, you can be paralyzed into inaction.