Tiny DevOps episode #47 Paul Cothenet — Observations on observability

Tiny DevOps Podcast / Continuous Improvement

In this episode…
- How to choose an observability tool/platform
- Why AWS doesn't provide the best observability platform
- Teaching the team to use observability
- How to convince stakeholders that observability is valuable
- What would you miss the most if your observability platform was no longer available?
- The business value of a good observability solution
- Making observability metrics easy for management to use
- What does it all cost?
- Advice for getting started

Resources
Rands Leadership Slack: https://randsinrepose.com/welcome-to-rands-leadership-slack/

Guest
Paul Cothenet
Twitter: @paulcothenet
Company, and jobs: patch.io

Transcript

Voiceover: Ladies and gentlemen, The Tiny DevOps guy.

[music]

Jonathan Hall: Hello, everybody, welcome to another episode of The Tiny DevOps Podcast. I'm your host, Jonathan Hall. Today we're talking about observability and some stories from the field about implementing observability. I'm excited to have Paul Cothenet, welcome. Would you tell us a little bit about yourself? What you do professionally, and then we'll start?

Paul Cothenet: Yes, absolutely. The last name is French, in case you're wondering. Probably no doubts after hearing me talk, but I've lived in the US for the past 14 years. I'm currently a staff software engineer at a company called Patch in the climate tech space. Before that, I had been a founder for seven years at a company called MadKudu and yes, that's about it. [laughs]

Jonathan: Very cool. I reached out on the Rands Leadership Slack for people who had experience with doing observability in small companies, and you responded. You said you've done this at two different companies. How do you want to tackle this? Do you want to go chronologically, the first one first? Is that the best way to go?

Paul: Yes, I can probably tell you a little bit more about the context. For context for everyone, I started my own company around 2014. It was two co-founders and myself, and I was the main technical co-founder, but the fun fact about it is I learned how to code on the spot and I learned DevOps. I was a product manager before, so it's not completely out of the field, but I had not done any DevOps of any kind by myself before. In fact, the first company I worked at as a PM, starting in 2011, was still all on-premise. It was first experience coding, first experience DevOps, first experience cloud at all whatsoever.

I essentially figured things out on my own for a couple of years before I was even able to hire a couple of people to help me. In my current job, it's more like, "Okay, now I have had like seven years of experience doing this," so it was more about taking whatever I learned and setting it up with best practices. It got a lot faster the second time around, et cetera. [laughs]

Jonathan: Good. That's great. We can learn hopefully from your experience there. Well, let's start with the first one then. What kind of company was it? What were you selling or doing?

Paul: The company is still running. I hope I don't say anything stupid.

[laughter]

Paul: Still very good friends with the folks there, but we were doing essentially machine learning for sales. It is analyzing a bunch of marketing data, event analytics, connecting to Salesforce, essentially replicating all of that inside our own data warehouse, and then running machine learning algorithms on top of it. Something that was quite data-engineering heavy; it would probably be called AI in 2022. It wasn't in 2015, but it was relatively data-heavy, with a lot of processing, batches, real-time, and all of that stuff.

If you only think about the architecture at a high level, it's like: ingest data from external sources, massage it into one place, and then at the very end, you want to send it back into Salesforce so that people who use Salesforce can see the intelligence there, essentially.

Jonathan: What prompted you to start the journey into observability at this company? Why did you feel the need?

Paul: I started, I believe it was around '16, '17, seeing a couple of demos, or folks starting to talk about what they were doing with New Relic or Datadog, and being like, "Oh, that sounds interesting." I was definitely on a full AWS stack, so I had a couple of hostings and CloudWatch metrics, and I have a lot of respect for a lot of whatever AWS does. CloudWatch, though, I don't think I've heard anyone say it's their favorite observability platform whatsoever.

I think even because of that, you don't even know what you don't know. It's like, "Oh, that's all I can do. I have those crappy logs somewhere and metrics that I can't really--" and it just never clicked, I think, with that interface and those tools, that I could create my own metrics or have a way to look at it or centralize everything. Seeing a demo, I think at the time of Datadog, I believe it was Segment showing what they were doing with the [unintelligible 00:04:58]. I was just blown away. I was like, "That's the level of visibility I can have in my system, I should absolutely try this."

The main driver, I would say, or the main thing that counted for us was more this: with those out-of-the-box tools, you get decent data, if you know where to look, about the basic metrics. You can find if you have a memory problem or CPU problem. You can look into it. Where it really clicked for me at the time was actually getting into custom metrics. Essentially, StatsD-type stuff, and the driver was that we had those data pipelines, and there was one thing that we realized our customers cared a lot about.

It's like, "I create a new lead in Salesforce and what I care about, the SLA I care about, essentially, is making sure that they're scored within a certain time." All right, and we realized that, if you look at all the existing tools you have, there is nothing in the out-of-the-box observability that will tell you that. It's a metric that is very specific to what your product is doing, but it's not latency, CPU, or memory. [chuckles]

It's like, "How do we do something that can easily measure that SLO?" That's what has come to be called an SLO. "How do we keep track of it and how do we get alerted about it?" It really started with essentially having a little [unintelligible 00:06:45] job that was measuring the fail rate, essentially sending it to Datadog.

Then that was actually magical, because out of the box, we now had an SLO and we could do graphs and alerts and paging when something went wrong, which completely flipped the game for us. We went from hearing from the customer first that something was wrong [laughs] to being notified way before that. That ability to track the metric we really cared about, rather than what came out of the box, was a game changer at the time.
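To make the custom-metric idea concrete, here is a minimal Python sketch of a scheduled job reporting one product-specific number. It is not the code Paul describes: the metric name, the `customer` tag, and the choice to report the age of the oldest unscored lead are all assumptions for illustration. It formats a StatsD gauge with the DogStatsD-style tag suffix and sends it over UDP to a local agent on port 8125, the conventional StatsD port.

```python
import socket
import time


def format_gauge(name, value, tags=None):
    """Build a StatsD gauge datagram; the '|#k:v' suffix is the DogStatsD tag extension."""
    datagram = f"{name}:{value}|g"
    if tags:
        tag_text = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        datagram += f"|#{tag_text}"
    return datagram


def oldest_unscored_age(created_at_times, now):
    """Age in seconds of the oldest lead still waiting for a score (0 if none)."""
    if not created_at_times:
        return 0.0
    return max(0.0, now - min(created_at_times))


def send_datagram(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local StatsD-compatible agent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


if __name__ == "__main__":
    now = time.time()
    # Hypothetical data: creation times of leads that have no score yet.
    pending = [now - 42.0, now - 7.5]
    age = round(oldest_unscored_age(pending, now), 1)
    send_datagram(
        format_gauge("lead_scoring.oldest_unscored_seconds", age, {"customer": "acme"})
    )
```

Once a number like this arrives on a schedule, an alert on "oldest unscored lead is older than the promised scoring time" is a threshold on a single series, which is what makes it possible to be paged before a customer notices.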

Jonathan: You started with-- You were on AWS, just kind of whatever tools came automatically is basically what you're using at first, right?

Paul: Yes, and remember, I was one person, both developer and DevOps. It's always this field where you're like, "Oh." You don't even know what you don't know, in terms of what's possible to track. But what became super interesting after that, and I guess that's the more applicable learning for observability, was really great day, I was [unintelligible 00:08:00] was stability of just tagging everything and you have--

For me, the big revelation there was when you can see like, "Oh, I can have the logs, I can have the machine, then you know a vendor." That's how it started, essentially with StatsD metrics, but then I pulled the thread on what was possible in the field. I guess the marketing for these things has come to call this the three pillars or whatever. [laughs] Couldn't tell you which. It's probably logs, metrics, and traces. I don't know, the marketing may have changed since I last looked at it, but it was pulling that thread and it was kind of magical, realizing that you can have almost three things in one place that all reference each other.

When you can get to the point of, to take my example, something's wrong with that customer, I can immediately figure out on what machine the thing was happening, and then you can pull and figure out which logs were exactly on the same machine, or truly track the log to the trace. I think that, to me, was the moment that really clicked with observability. I was like, "Oh." Compared to what I was doing before, having everything in one place with cross-referencing tags and things just makes you so much faster to debug anything, essentially because you can really find that rope and just pull that thread and really figure out everything in one place. [laughs]

Jonathan: I think I like that the three pillars are fuzzy in your mind, because that gives me a chance to ask a question that I can't ask everybody, and that is: what is observability to you? Because you're not going to give me the buzzword answer. You're going to have to tell me what you think.

[laughter]

Paul: It's a great question. If I think about it now, what immediately comes to mind is essentially everything I need to debug something when something goes wrong. That is one aspect of it. I'm also realizing that over the years I started to use it more proactively, which is: how do I know everything's going right if nothing's firing, and how do I use that to show it across the team and across the company?

I was like, "Hey, look, I have all these latency metrics that you didn't even ask for, and I can't--" That's actually a great question. In my head, observability is almost: I want to measure and track things that I may need later to look into, either a problem or something good, but to really understand what's going on within my system.

I think before I really got into that world, it was like: we had a problem, and we'd literally implement logging for the next time around. I think what's changed for me is, it's like, "Oh, I already have the data to look into what happened," rather than, "Oh, next time we'll know about it because we designed this thing." I don't know if that's a good definition, but essentially being able to know after the fact what was going on in my system without having to reimplement anything has been pretty key.

Jonathan: You started with AWS, didn't like it. You mentioned Datadog and I think you mentioned a few others. What did you settle on at this company, or was it a collection of different tools?

Paul: I had started with New Relic at the time, so again, it was 2015, 2016, they were like the main APM game in town. It was a magical experience at first, getting that APM, because again, going back to that thing, I was like, "Oh, it measures things I didn't even think I would need later." It was pretty basic: the main application, without all the data stuff, was a Node.js Express thing. I remember that. I was like, "Oh, it's like two lines of code and you get all the metrics about things that it would've taken you days to even think of where to track, or to monkey patch your application to track some of those metrics."

That experience was pretty magical. I have to say, and it may have changed, again this was 2016, New Relic's pricing was really brutal when it came to scaling at the time. I never really dug into the user interface. Again, a friend of mine showed me what they were doing with Datadog and I was like, that's cool, but they don't have the APM yet.

I think as soon as they launched the Node.js APM, I was like, oh, this will be nice, because your logging product looks great. Your metrics, I was a user of the metrics product. When they launched APM and Logs, I was like, that is cool, because I'll have everything in one place. I think before that, I can't remember what I was using for centralized logs, but it wasn't New Relic. The logs in one place, the APM in another, and I was starting to dabble with metrics. I think the moment where all of them collapsed together in one place, in Datadog in that case, I was like, this is exactly what I need. I'm just going to get everything in one place. [laughs]

Jonathan: You had APM, you had logs, I'm assuming you had alerts when things reached certain thresholds they broke or whatever.

Paul: Some basic things. It was like, "Oh, if the app stops responding, please let me know."

Jonathan: How many were at the company when you started this journey? Did you say one or two? In engineering, I guess is what I care about.

Paul: It was probably two people in engineering, but really one in DevOps and infrastructure, myself essentially.

Jonathan: Any challenges you faced while you were doing this?

Paul: Yes, it's interesting. I can tie it to my more recent experience. The challenge, honestly-- it's really easy to set up these days. I can't think of when I had to reset a couple of things. I'll talk about it a little bit more, but one challenge with those tools is working in the UI, where you end up doing a lot of things manually, like monitors and stuff where you go into the interface. That second go-around, I've tried to set up as many of those things as possible in Terraform, so that monitoring is scripted like infrastructure as code. That doesn't always live up to what you want to do, I have to say.

There's actually a Datadog Terraform provider, which is pretty cool, which essentially lets you script some of the monitors as code. I think you can do dashboards. It lets you literally write and use version control for some of this stuff, which is really cool. One trap, I think, with all of those tools is that you do it as a one-off in the UI: you create an alert and then you don't know what's happening to it. You don't know if it's been deleted by someone. We can go back over that a little bit, which I think is a challenge with some of these graphical user interfaces. Honestly, one challenge I faced at my current company, Patch, or it's not really a challenge, but it's getting adoption.
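The "monitors as code" idea can be illustrated without any vendor tooling. Real Terraform configuration is written in HCL and this is not the Datadog provider's API; it is a small Python sketch of the underlying principle, with hypothetical monitor names and placeholder query strings. The desired monitors live in version control, and a comparison against what the platform currently reports shows what was deleted or hand-edited in the UI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorSpec:
    name: str
    query: str        # opaque placeholder; real query syntax depends on the vendor
    threshold: float


def plan_changes(desired, live):
    """Compare version-controlled specs with what the platform reports.

    Both arguments map monitor name -> MonitorSpec. Returns the names to
    create, update, and delete so that the live state matches the code.
    """
    to_create = sorted(set(desired) - set(live))
    to_delete = sorted(set(live) - set(desired))
    to_update = sorted(
        name for name in set(desired) & set(live) if desired[name] != live[name]
    )
    return {"create": to_create, "update": to_update, "delete": to_delete}


if __name__ == "__main__":
    desired = {
        "api-5xx-rate": MonitorSpec("api-5xx-rate", "hypothetical query A", 0.01),
        "scoring-delay": MonitorSpec("scoring-delay", "hypothetical query B", 300.0),
    }
    live = {
        # Someone loosened this threshold by hand in the UI.
        "api-5xx-rate": MonitorSpec("api-5xx-rate", "hypothetical query A", 0.05),
        # A one-off alert created in the UI and never recorded anywhere.
        "temp-debug-alert": MonitorSpec("temp-debug-alert", "hypothetical query C", 1.0),
    }
    print(plan_changes(desired, live))
```

In this example run, the plan would recreate `scoring-delay` (missing from the live state, for instance because someone deleted it), reset the hand-edited `api-5xx-rate`, and flag `temp-debug-alert` as unmanaged. That is the visibility a one-off alert created in the UI does not give you.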

I know those tools, [unintelligible 00:15:53] I know how to use them, but when someone starts at the company, they might not even know that you can monitor that or that you can create a new monitor. I love creating a new monitor at the time I release a product. It's almost like adding tracking. I make sure everything is there in one place, but I realized that a lot of other software engineers on the team may not even know which tools exist or where they are.

It's almost, when you onboard, how do I tell people about it? It'll be like, "Hey, have you thought about--" People ask you, "Oh, how do I check my release?" It's like, "Oh," then I do a quick call and show them, "Oh, this is how I would create a dashboard, maybe measure this, maybe add a StatsD metric." I love doing StatsD metrics when you have feature branches; you can see how many calls go into each branch. I've realized that the overall education, I think, hasn't completely reached everyone in the field.
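Counting calls per branch can be as small as one tagged counter per code path. The sketch below assumes "feature branches" means two code paths behind a flag; the function, metric name, and `branch` tag are hypothetical. The `emit` argument stands in for whatever sends the datagram to the agent, so the example runs without a network.

```python
def branch_counter(metric, branch):
    """StatsD counter increment, tagged (DogStatsD-style) with the code path taken."""
    return f"{metric}:1|c|#branch:{branch}"


def checkout_flow(use_new_flow, emit):
    """Hypothetical feature with an old and a new code path behind a flag."""
    if use_new_flow:
        emit(branch_counter("checkout.calls", "new"))
        return "new-flow"
    emit(branch_counter("checkout.calls", "old"))
    return "old-flow"


if __name__ == "__main__":
    sent = []                      # collect datagrams instead of sending them
    checkout_flow(True, sent.append)
    checkout_flow(False, sent.append)
    print(sent)
```

Graphing that counter split by the `branch` tag shows how much traffic each path is actually receiving during a release.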

You still have-- Some people with more DevOps experience know that they can use this, and maybe a chunk of developers still have no idea that those tools exist. A big part of my job, I feel, is educating and showing what I've been doing and what you can do. It's a teaching-people-how-to-fish problem. I don't want to be the bottleneck on every release, where it's like, "Let me set up observability." Getting that full ownership to the rest of the team is still a little bit of a challenge.

Jonathan: How do you approach that challenge? How do you teach the other developers and engineers what tools are there and the mindset to use them?

Paul: I wish I had a standardized answer there. What I would say is I try to be as demonstrative as I can. It's like, hey, when I release my own feature or my team releases their own feature, every time you do the announcement, it's "and here's the dashboard that we're going to use to measure it," and be really demonstrative. You don't want those tools to be something where only people with DevOps experience feel like they can use them. I want everyone on the team to be like, "Oh, I'm going to release this feature. I'm going to see if it's working." I want that to be as democratic as possible.

I want two things. One, show they exist, and then once people know them, it's almost like, someone does a release and they ask you questions, and it's, "Oh, let me show you." I do a lot of on-the-fly demos and just try to say, "Oh, look, in five minutes you can create this dashboard that will essentially show you the health of the release of your new feature," so that it almost becomes muscle memory. It's: you should do this as you release.

It takes a little bit of time. It's definitely a step in the onboarding. It's giving access to everyone and maybe doing a demo. I know I have an old onboarding video of like 25 minutes that I probably need to rerecord because it's a bit old. It's like, here's what you can do, here's what we track, here's how you use it. Probably should refresh that. It's one of those things that I think takes a little bit of time and process to get right, but if you get it right, it really pays off in nice ways in my experience.

Jonathan: Let's move on to the second company where you did this. Let's talk about that, then I'll come back and ask some more general questions. Tell us, what were they selling, what was their situation, and how did that go?

Paul: Absolutely. The next company, which is my current one, is called Patch. We are what we call the infrastructure for the sustainable economy. We offer tools for the carbon removal industry. We are trying to help with the carbon dioxide removal field, which is a very ambitious vision. In practice, I joined that company when it was 10 people. It was essentially a marketplace web app to buy carbon removal and an API to do the same. It's a slightly less operationally complex product: it's a Ruby on Rails monolith with a couple of background jobs. There's a lot of complexity in domain modeling, but not as much so far in DevOps.

It's a couple of pods running a Rails application and a couple of pods that run background jobs. Nothing as complicated as the data pipelines that I had last time around. I also found myself in the same situation, where I think the engineer that set up that infrastructure left like two weeks after I joined, and so I had a crash course as, essentially, the replacement. "Okay, you own the infrastructure now."

Definitely, my first reflex was like do I under [unintelligible 00:21:07] I think with all that five years of experience working with observability, my first reflex was like, where's my data? Where can I see the health of what's going on, and if something goes wrong, where do I look? It was definitely one of my first reflexes, probably in the first one or two months: can I get Datadog on there?

We talked to the CEO. I had a little bit of a pitch to do, because initially it was, "Can't we get that with AWS tooling?" Trust me, it's not good. I remember doing a pitch, and it was a pretty short pitch. I think I did that within my first month. It was essentially: let's get the logs in one place, let's get the metrics of the infrastructure. It was almost a first-90-days task: where are we at? Can we get something as good as what I had before? There were some challenges because there was a Kubernetes infrastructure, which I was not familiar with at the time. I was also bootstrapping my own knowledge of Kubernetes [laughs] at the same time.

Jonathan: That's always its own challenge. [laughs] I love Kubernetes, but it took a little while.

Paul: Side note, I'm not sure that this company needed Kubernetes at the time. I'm glad we have it now, but it was one of those things. Having visibility into that actually let me understand it a lot faster. I love that most observability tools like these will give you some visualization of the different servers you have and everything you have running.

It's actually a really good way to understand even what your system looks like. It's a lot more visual if you can see everything on one screen. That was definitely a one-month thing. Very different experience there, in the sense that I feel the first company was probably three years of me even understanding what I needed to have or what existed and how to use it, versus this one.

Now I know how useful this has been for me in the past. It's almost the first thing I'm doing, getting observability done properly. As I mentioned, it's been a lot more about showing people how to use it as we expanded the size of the team: does everyone know how to use those tools, and can they make the most of them? It's been almost like three years to learn everything and then a month to replicate it the second time around, which was nice.

Jonathan: How did the pitch go when you were trying to convince the CEO that you needed something better than AWS? I can imagine a lot of our listeners might be in that situation. They know that they need something better, but how do they convince management?

Paul: It was interesting as well because I was just coming in. I put a little task to discuss with the CEO, and initially he was like, "I don't think we need this--" Curt answer: "I don't think we need this right now." You've just come into the new company and you're like, "Oh, crap." You've never argued with someone. For me, it was also a part of learning. I realized it was very much a strong-opinions-weakly-held scenario, so I was able to convince him pretty easily. I could probably pull that document out. I might still have it somewhere. I'll see if I can find it for the show notes.

That's a great question. What did I say? My pitch was around: it just makes me a lot faster at debugging when something goes wrong. Having everything in one place was the pitch. If I remember well, I think I spent a lot of time-- I was familiar with the cost structure, and I think it was helpful to explain how much it would cost and how that would scale. We can expand on that. It's not necessarily always easy; the pricing of all-in-one observability platforms tends to have a lot of tentacles and a lot of variability. Having an idea of, like, "This is going to cost us like $300 a month and is going to make me a lot more productive, just knowing what's going on," helped.

I was lucky enough. It was a little bit of pulling the string: "Trust me, I've done this before." I'm joking there, but I think I had some weight: this is a tool I've used for like five years and it's saved my life a countless number of times. I'm lucky enough to have a CEO who is a former software engineer and knows how to trust the senior people [unintelligible 00:26:38], which is great to have. If you're listening to this and you don't have that, I don't know how to solve this. [laughs]

Jonathan: No. Okay.

Paul: I think some visibility on the cost and benefit helps. In the end, what your CEO cares about is: is this going to make me more money, or save me money, or make my people faster? We could go into that longer; I'd be super interested, given your podcast. One thing I've tried to do at this company is to keep [unintelligible 00:27:09] small. That was a little bit of an argument, that shift left. It's after the fact, that's not the pitch I made, but it's almost like we do shift left on security; shift left on observability, I feel, is probably something you want to do.

Honestly, we still don't have a full DevOps or infrastructure person. The team is eight people today, and I feel like that was the right decision. If you can have everyone on the team do a little bit of it and know how to do it, it probably delays when you'll bring that full-time person on. If I had to redo that argument, I'd say it's also a way to have everyone know what's going on, rather than having silos where there's one infrastructure person that really knows what the hell's going on and everyone else is clueless. It's probably a good argument there, if you think about the business argument to management. It actually lets us hire fewer people over time, and that can be a big argument.

Jonathan: What would you miss the most if you couldn't use-- I guess I'll say Datadog. Whatever tool you're using, I think it's Datadog. If you couldn't use Datadog tomorrow, if it was gone, what would you miss the most?

Paul: I think it's that correlation of every tool. It's the fact that you can jump from a log to a trace to a metric with some things. I know my way around CloudWatch, I can find something. The search is pretty horrible, so I would miss the search. I'm sure you can do a lot of things with CloudWatch monitors and logs and things.

Honestly that speed of being able to go through the interface, I've seen some demos of [unintelligible 00:29:01] which seems to be doing really well there as well. It's like the speed of being able to jump from one thing to the other. I haven't been able to replace that with the AWS native tooling. The information is there, but almost the amount of brain cognition it takes you to find it, it's just like orders of magnitude difference. That snappiness of having everything in one place is probably what I would miss the most.

Jonathan: Of course from a business standpoint, that's worth a lot because that's the time that your engineers are spending trying to, if it takes you 20 minutes to find a log or whatever piece of information versus 20 seconds, that's lost money, so that's really important.

Paul: There was a good anecdote. We haven't had a lot of outages in the past year, which has been really good. I think I had a speed record in debugging, or in finding out what was wrong. Now it's almost like a sixth sense where I have a couple of dashboards, a couple of places where I can go to debug, and I had to do one this year, which was the evening before my wedding. We had to [unintelligible 00:30:10] outage. I was on vacation. I got paged anyway, and it took me literally, I think, eight minutes to go in, get on the phone, pull up a couple of dashboards, figure out what was wrong, and then disconnect so that I could finish my speech for my wedding. I was very happy to have the good tooling that night.

Jonathan: It sounds like Datadog needs to hire you as a spokesperson. It's like the perfect story for why you want their product.

[laughter]

Paul: I don't know if you want to say it. There's probably some version of it that's a little less, like, workaholic, but--

[laughter]

Jonathan: Let's talk-- Early on, you mentioned some of the challenges with the graphical user interface. If the biggest selling point that you see here is just the ability to find stuff quickly, do you see room for improvement still? I'm sure it's not perfect, nothing ever is, but--

Paul: Yes.

Jonathan: What are your thoughts in that area?

Paul: The way I think about my own challenges right now, I think the discoverability, to someone who's never done it, that's the downside. There's a lot of information, there are a lot of things you can do. It's left to me as the team member to teach everyone about it, right? I wish there were a way where, as you log in, there's an onboarding experience that would tell you what's important, what matters, and how. There are a lot of features, right? When you think about it, that is a lot to-- that is true of each of those platforms, I mean CloudWatch in the first place.

There's a lot to learn, and I think that still has this challenge of being extremely powerful. If you had shown me this eight years ago, I would probably have run away, overwhelmed by the amount of options, right? So I think there's still a challenge there to find the right balance for someone who comes in. Also, sometimes a product manager asks me, "Hey, how do we know who's using this feature?"

The challenge gets even bigger if you want to start bringing "non-technical" people to it, right? I assume it's even more overwhelming if they start with it, right? So this onboarding question, how do you deal with non-power users the same way you deal with power users, I think is a challenge for these platforms. At least in my opinion.

Jonathan: I haven't used Datadog for years, and so I'm sure they've changed drastically since I last used them. It's probably been five or six years, at least since I last used them. Do they export like a management dashboard? Like, or make it easy for you to show management like a here's our website visits or whatever and the website's up or down now?

Paul: Yes, it's very powerful, so you can create pretty much every dashboard you want to, but I think it's still left to you to figure out what management should pay attention to, which makes sense, right? I think there's so much variety in applications. If it's just, like, average latency, and you have like 50 services, it doesn't matter. But I think I haven't used too many like all application recently as well.

I think the challenge we're facing is that those platforms have no idea, by default, what the business metric is, right? They get the latency of every service; they don't know which ones are mission-critical or not. I think that editorial part is still left up to you, to figure out, "I'm going to create this high-level dashboard." You can definitely create the dashboard. It's up to you to figure out what your management might care about and what you should show them.

Jonathan: What metrics do you track right now? Let's start at the highest level, what would you show to your CEO or the board of directors right now?

Paul: Yes, it's a good question. These days it's like number of outages and errors. As I mentioned, it's fairly simple, at least from the outside: is the website and API up or down, and how many failed requests do we have? I think we care decently about latency, right? So just looking at the latency of the main API endpoints and making sure they perform.

The thing that I've liked as well is that I'm not looking at just one metric. You can look at the internal latency, but what I like is that we have a synthetics test set up, right? So the API and the dashboard are being tested from the outside. That's my go-to, because I've made the mistake myself in the past of, like, "Hey, look, all my internal monitoring is working, the app is responding," and then you have an error that's external to it.

Or not even errors, right? Your server is in Oregon, and your customers in the UK get a 250-millisecond penalty just from distance and the speed of light, and that's not visible, right? And so showing something that's more external, with a synthetics test, is the best way. It's like, this is our response to actual users, and look, it's not me doctoring the test. This is literally something simulating a user, and this is what they see. I think that's the useful thing to show your external stakeholders: "Hey, look, someone in London gets an 800-millisecond response." I'm not making this up. This is not a works-on-my-machine-type scenario.
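The distance penalty is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below rests on stated assumptions: roughly 7,900 km between Oregon and London as an approximate great-circle figure (real cable routes are longer), light travelling at about two thirds of its vacuum speed in optical fiber, and three network round trips for a fresh HTTPS request (TCP handshake, TLS 1.3 handshake, then the request itself).

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBER_FACTOR = 2 / 3  # assumption: signal speed in fiber is roughly two thirds of c


def round_trip_ms(distance_km):
    """Minimum network round-trip time in milliseconds over fiber of this length."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_seconds * 1000


def first_response_ms(distance_km, round_trips, server_ms):
    """Lower bound on time to first response: network round trips plus server time."""
    return round_trips * round_trip_ms(distance_km) + server_ms


if __name__ == "__main__":
    oregon_to_london_km = 7_900  # approximate, see assumptions above
    print(f"one round trip: {round_trip_ms(oregon_to_london_km):.0f} ms")
    print(f"fresh HTTPS request: {first_response_ms(oregon_to_london_km, 3, 0):.0f} ms")
```

Under these assumptions a single round trip comes to roughly 79 ms and a fresh connection to about 237 ms before the server does any work, which is in line with the 250-millisecond penalty mentioned above. None of it shows up in latency measured inside the data center, which is why the outside-in synthetics test is the number to show.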

External visibility is good. This is the uptime we have. We have it plugged into an external status page. That's always super helpful as well. There's actually something very important there if you want to sell your stakeholders on observability: making sure that you understand their point of view, because what matters to them might not be what matters to you, right? You might be interested in how the traffic is spread between your Kubernetes pods or whatever. They ask: well, is this up for users, right? Or is this accessible from everywhere? You can have the same problem with software testing, right?

It's like, I have 99% unit test coverage and your thing still doesn't work, right? I think it's important to have those different layers. When you sell it to your stakeholders-- Internal team productivity is important, but you don't want to fall into the trap where all you're measuring is what matters to you. You want to keep track of the state of things for the end users, right? You don't want to only look for your keys under the light. You want to look for your keys where you lost them, right? [laughs]

Jonathan: Have you had any horror stories trying to implement observability, or any mistakes that you wished you had known better?

Paul: That's a good question. I think there's been a trend in the past five years of letting you send pretty much whatever you want and then set limits on it. I'm pretty sure you can make pretty bad mistakes if you don't set any limits on how much you ingest, or any billing limits, because it's a very cloud-in-2022 thing: it lets you set up everything and then sends you the bill. I'm pretty sure you can make some pretty horrible mistakes there, right?

With Datadog in particular, you get charged by the number of instances on which you are running the agent. Let's say you have a runaway autoscaling process, [chuckles] your bill's going to be probably a lot bigger than you'd like, right? Same with logging and stuff. I tend to start by setting some conservative limits, especially on logging, so that it's not a blank check to the observability company.
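The arithmetic behind a runaway per-host bill is easy to sketch. The numbers below are hypothetical and are not any vendor's actual pricing; the sketch assumes a flat monthly price per host running the agent and a fixed monthly budget.

```python
# Hypothetical numbers for illustration only; not any vendor's actual pricing.
PRICE_PER_HOST_PER_MONTH = 20.0   # assumed flat price per monitored host
MONTHLY_BUDGET = 500.0            # assumed observability budget


def projected_monthly_cost(host_count: int, price_per_host: float) -> float:
    """Cost if `host_count` hosts run the agent for the whole month."""
    return host_count * price_per_host


def max_hosts_within_budget(budget: float, price_per_host: float) -> int:
    """Largest host count whose projected cost stays within the budget."""
    return int(budget // price_per_host)


def budget_report(host_count: int) -> dict:
    """Projected cost for a given host count and whether it breaks the budget."""
    cost = projected_monthly_cost(host_count, PRICE_PER_HOST_PER_MONTH)
    return {
        "hosts": host_count,
        "projected_cost": cost,
        "over_budget": cost > MONTHLY_BUDGET,
    }


normal = budget_report(10)      # steady state
runaway = budget_report(200)    # autoscaler stuck scaling out
host_cap = max_hosts_within_budget(MONTHLY_BUDGET, PRICE_PER_HOST_PER_MONTH)
print(normal, runaway, host_cap)
```

Under these assumptions, a cap of 25 hosts on the autoscaling group would keep the bill within budget, which is the kind of conservative limit Paul describes setting up front.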

Aside from that, runaway billing, I can't think of-- It can get expensive. That's another thing, the drawback of having everything in one place. It's a little bit like the cloud platforms. It does a very good job of making you think, "Oh, I want to add this one more thing to measure because it's there, right? Why wouldn't I want observability in that?" It can balloon a little bit. I have to say that there are a couple of things that I enabled at first and then thought, I don't need that right now. I don't need to pay for it, right?

I think with Datadog or the others, since we use Ruby on Rails, right, you can get Ruby-level-- What's the word? Tracing, not even tracing, but you can get code-level performance traces. That is not a problem for us right now. We're not at that point. We optimize API requests. We don't optimize code-level memory allocation. Paying for tracing for that, if it's something that you look at once every nine months: do you really need to pay for that peace of mind?

I think the billing is there, and I think the challenge with everything DevOps as well is optimizing it for the sake of optimizing it, right? Where you have this perfect observability system, and then no one's using your application. You don't want to make that mistake. That's probably the biggest mistake everyone makes. I don't think there are too many footguns, honestly. The most likely outcome is you don't really get good adoption, or people don't use it. You want to avoid runaway billing, but I don't think you can make any crazy mistakes there.

I might be wrong, if someone has some really horrible horror stories. Yes, I think those things are well-optimized as well. I haven't seen any performance impact from doing this. I think that's something that seems like a solved problem, like the memory footprint of adding agents or tracing. I think there's a lot of effort that's been spent by most companies to make sure that observing doesn't affect the performance of what you're observing too much. [laughs]

Jonathan: Is there any advice you could give to somebody who's getting ready to embark on this journey of implementing observability? How should they start? Should they just use a tool? Which tool, if you want to make a recommendation? Should they start in one specific area, with logging or with tracing? How do you start? I know that's a big question.

Paul: That's a great question. I mean, I would definitely do it earlier rather than later, especially if you have nothing. Obviously, if it wasn't clear, I'm a big Datadog user and I would happily recommend it. I'm sure competing platforms have the same thing. I would give it a try, right? The experience of getting that full view, the snappy UI with everything in one place, is an order of magnitude better than what's offered by default by the cloud platforms. I would give it a shot.

If you're like me, you probably won't go back. I would get started early. I think as you build a team, it's one of those things where everyone builds the right reflexes, like CI/CD. It's almost becoming like CI/CD; in my opinion, it's table stakes. You set it up early, you set the right basis, your entire team knows what to do and how to use it, and it just makes everyone more productive.

Honestly, the only downside I see is cost. I don't spend very much time optimizing it or changing it. You just want to keep your costs in check, especially if you're a small company with not too many resources, so you don't overspend on top of your AWS bill. Getting started is probably the most important.

I think, as I mentioned-- If you're already using something like Terraform, maybe a little less obvious advice is to look from the get-go at which parts of your setup you use infrastructure as code for. The very obvious one is if you have to install an agent. I mean, in this day and age, you probably don't install anything manually on a server anyway. You want to make sure that all the agents are part of however you deploy the software, right? So a sidecar or anything should probably be part of your Terraform or CloudFormation setup.

If you're not using it, that's a completely different topic, but I do strongly recommend infrastructure as code if you're getting started. That's all good, but I would even push it one more level if you start setting business-critical monitors. In the case of Datadog, you can script those with Terraform. If Terraform or whatever is not available for your stack, you can probably look at ways to write a tiny script that uses the API to make sure the alert is created for each business-critical thing.

I think that's super important because things set up by hand can be finicky. You can delete things by mistake, and then one day your mission-critical alert is not there, or is not monitoring what you expect it to monitor. Maybe if there's one non-obvious piece of advice, it's to look at scripting the very critical parts of your observability.
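The "tiny script" idea can be sketched as a reconcile loop: keep the desired monitors in version control, then recreate any that are missing and repair any that have drifted. Everything here is hypothetical: `InMemoryPlatform` stands in for a real API client, its `list_monitors` and `upsert_monitor` methods are not Datadog's API, and the monitor names and query strings are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    name: str
    query: str


# Desired state, kept in version control. Names and queries are made up.
DESIRED = [
    Monitor("api-5xx-rate", "error_rate('api') > 0.01"),
    Monitor("api-synthetic-uptime", "synthetic('api-health') failing"),
]


class InMemoryPlatform:
    """Stand-in for a real observability API client (hypothetical interface)."""

    def __init__(self, monitors=()):
        self._monitors = {m.name: m for m in monitors}

    def list_monitors(self):
        return list(self._monitors.values())

    def upsert_monitor(self, monitor):
        self._monitors[monitor.name] = monitor


def reconcile(platform, desired):
    """Recreate missing monitors and repair drifted ones; return what changed."""
    existing = {m.name: m for m in platform.list_monitors()}
    changes = []
    for monitor in desired:
        current = existing.get(monitor.name)
        if current is None:
            platform.upsert_monitor(monitor)
            changes.append(("created", monitor.name))
        elif current != monitor:
            platform.upsert_monitor(monitor)
            changes.append(("updated", monitor.name))
    return changes


# One monitor has drifted (threshold loosened by hand), one was deleted.
platform = InMemoryPlatform([Monitor("api-5xx-rate", "error_rate('api') > 0.5")])
first_run = reconcile(platform, DESIRED)
second_run = reconcile(platform, DESIRED)
print(first_run)
print(second_run)
```

The second run makes no changes, so a script like this could run on a schedule or in CI without side effects. Where a Terraform provider exists for the platform, it plays the same role with less custom code.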

Jonathan: Good advice. Yes. Great. Is there anything that I or our audience should ask you that I haven't yet?

Paul: No, I think we've gone quite around the loop [inaudible 00:45:27]. I'm realizing that most of what I have to say is that the technical challenges are, obviously, only a small part of the thing, right? There's probably a joke there about how if a tree falls in the forest and it's recorded on your observability platform, but no one looks at it, did it really make a sound?

[laughter]

Paul: You want to make sure that people know how to use it, that you're not just making noise out there, and that it's not the domain of one expert who knows where to look. I think democratizing and shifting that left in your company is probably going to take you more time than all the rest combined. It's probably where a lot of the benefit is, right? It's making sure that everyone knows how to debug their own application.

I mean, it's the extension of the DevOps ethos. You want to make sure it's not the domain of one person, but that everyone cares about it. Everyone knows how to monitor the application. Think about it if you're going to implement it: are you going to teach people how to use it? Are you going to make sure that it's part of the quiver of tools that every developer on your team is going to be able to use? It's probably underinvested when you think about it as an engineer, but it has been very critical.

Jonathan: Paul, if people are interested in reaching out if they have questions after listening, are you available for contact? How can they get a hold of you?

Paul: Yes, absolutely. I have a Twitter account that I use mostly for reading. I pretty much don't post anything, but I can be contacted there. It's @PaulCothenet, my last name. That's pretty much it, or you can find me on GitHub, same place; my contact information is probably in there. Feel free to reach out. If you're interested, the shift I made from my company to the current one was essentially to work as a software engineer on the climate problems that we are facing. If you are interested in making that transition, I'm also very happy to talk about the topic.

Jonathan: Do you have any openings right now, or [crosstalk]?

Paul: Yes, we have a lot of openings for software engineers, distributed in the US for now. US time zones, but any locations in the US.

Jonathan: What URL if people are interested in looking?

Paul: Yes, that would be patch.io. Very simple, not patch.com, that's an old social network thing, but patch.io.

Jonathan: Simple to remember.

Paul: Exactly.

Jonathan: Paul, thank you so much for coming on. Thanks for sharing your story. I hope that some of our listeners are inspired to take on the challenge and the reward of observability after doing this.

Paul: Yes, I hope so, and again, any questions or anything, feel free to reach out.

Jonathan: Wonderful. Thanks, and until next time, talk to you later.

Paul: Thank you very much.

[music]

[00:48:40] [END OF AUDIO]

Tiny DevOps episode #47 Paul Cothenet — Observations on observability

November 8, 2022

