Tiny DevOps Episode #14: Ben Curtis — Incident Response on a Bootstrapped Budget
August 10, 2021
Ship it! A Practical Guide to Successful Software Projects
Site Reliability Engineering: How Google Runs Production Systems (also available to read free online)
The Unicorn Project: A Novel about Developers, Digital Disruption, and Thriving in the Age of Data
Honeybadger exception monitoring
Ben Curtis: Like, okay, we fixed that bottleneck and then, oh, there's another one in another spot. It wasn't always the same problem, but it was these growth problems, they used to call them good problems and they are good problems to have, but they're still problems.
Commercial: Ladies and gentlemen, the Tiny DevOps guy.
Jonathan Hall: Hello, everyone. Welcome to another episode of the Tiny DevOps Podcast. I'm your host, Jonathan Hall. On this show, we like to focus on DevOps for small teams. That's the "tiny" in the name. Today I'm excited to have a special guest, Ben Curtis, who has been working with Honeybadger for a while on a small team there, and he's going to tell us his story about incident response, on-call rotations, and those sorts of challenges on his small team. Ben, welcome to the show.
Thank you so much for joining. Maybe you can start by giving us a little bit of background about you and your time at Honeybadger and what that's been like.
Ben: Sure. Well, thanks for having me on the show. Appreciate it. Honeybadger, for those who aren't familiar, is a monitoring service for web developers. We monitor exceptions and uptime and cron jobs. My role at Honeybadger, as one of the co-founders since the beginning, has been keeping the lights blinking happily. That's what I like to do. In the early days those lights were on actual servers, and these days they're virtual lights on virtual instances, but I really enjoy building fun tech solutions and dealing with the ops side, and that's what keeps me going every day.
Jonathan: Really good. How long have you been with Honeybadger then? Since the beginning, right?
Ben: Yes, since the beginning. We launched about nine years ago.
Jonathan: When you started, I don't know, was it just you and a small set of people? What did it look like say for the first year?
Ben: For the first, well, several years, there were just three co-founders. We had Starr and Josh and me, and we're all Ruby on Rails developers by background, and we'd worked together before doing various kinds of freelancing projects. We all just dove in and started writing whatever code needed to be written, and to me fell all the responsibilities for getting the servers set up and handling all the ops stuff. In the early days, it was really just one server.
We started at a leased facility where we just leased the server, and I remember it was about $75 a month. We weren't sure that the startup was actually going to work. We weren't sure that people were actually going to pay for it, and so we didn't want to get all invested early on in a bunch of resources. We figured, well, we could put one server out there and put the web app, the database, the Redis, all of it on one machine, and that worked great for quite a while actually.
Jonathan: When did it stop working great? What changed?
Ben: Well, over time we found out that the product would work, that the business would work, that people would actually want to pay us money, and of course we got more customers and got more traffic, and we started building out our servers. The first step was to add a dedicated database server. We got another leased instance from the same company and we moved Postgres over there, and then as time went on we got more and more. Actually, early on, in the first iteration of the product, we didn't even have a queue.
Errors would come into our API and we would just write them straight to Postgres. After a while Postgres just couldn't keep up with the write traffic, and so we introduced a queue. We started using Sidekiq, which uses Redis as its store, and we added a couple of servers for the Redis stuff. Basically, we had your typical three-tier architecture at that point: we had our web application servers, we had our database servers, and everything was going fine for quite a while. That lasted us a good while, and we would spin up new servers as the load increased.
Then at some point our hosting facility just couldn't keep up with our needs. It wasn't a matter of hardware. We could get hardware whenever we wanted, but the networking was actually becoming the problem. We would have these intermittent outages, and we would open up tickets with the support staff, and even though they had 24-hour hands-on support, there were times when the response rate wasn't great.
Maybe an hour before they'd get back to us, and it just became an issue where we couldn't rely on their network to stay up enough for us, because we can't go down. People who are having a bad day with their application are sending us a bunch of errors, and we need to be able to track that; downtime just doesn't work well for us. We eventually had to move from that facility, and we decided at that point to move to Amazon, the primary reason being that we had over-provisioned because you have to handle bursts of traffic.
We had more than we needed, but we wanted to get to a point where we could scale up and not have to over-provision. We figured that Amazon, with auto-scaling groups and that sort of thing, would be a great solution for us. We figured if anybody would know how to make sure the network worked, it was them.
Jonathan: What was the timeframe then that you switched to Amazon?
Ben: We switched to Amazon in early 2017. We launched mid-2012, so we got about four years out of that facility.
Jonathan: Just to put the timeframe into perspective, what does your team look like now? How many of you are there? Are you still on Amazon?
Ben: Yes. We still have the three co-founders, we've added a marketing person and another developer. Now we're up to a grand total of five and that surprises people a lot of times because we tend to punch above our weight. We compete with some big providers like Rollbar and Sentry. We do it with many fewer employees than they have, and we are still on Amazon to this day. We're still in US-East-1, and we have a few more instances now than we did back in 2017, and we have a lot more automation in place than we did in 2017 as we've learned lessons along the way.
It's been great. We're happy with Amazon. The only real snag for us with Amazon is just we spend too much money on them, [laughs] but they provide a good service so we're happy to do that.
Jonathan: One question that jumps to mind: you said it may be surprising how few people you have doing this work. Does that mean you're overworked? How do you manage that?
Ben: The way that we've managed that, and I think this is key for your listeners, is that we've tried to minimize the amount of work we have to do. We don't want to be babysitting servers. We want to avoid human intervention as much as possible, and so we've spent a whole lot of time on automating our systems.
Anytime there's a problem that requires human intervention, we do a post-mortem where we analyze what happened, document the issues, and then try to create some automation around making sure that a human doesn't have to get involved in that kind of problem again. As we've done that many, many, many times with all the issues that have popped up over the years, now we can sit back and watch the alerts come in.
Then we can see deploy messages for instances being spun up, for example, to handle too much of a backlog or one issue or another. The automation is really the key: having robots do the work so that we don't have to hire more humans to do it.
Jonathan: I think I know the answer to this, but has it always been that way? Obviously, you had to build that automation, but what did that story look like?
Ben: AWS did exist back when we launched Honeybadger in 2012, and we chose not to go with Amazon when we first launched because we didn't want to take the time to build a resilient system according to Amazon's requirements. For example, if you're running an EC2 instance, there's no guarantee that instance is going to stay up. Amazon tells you to be prepared: the instance could be terminated for whatever reason, at whatever time, and you've got to be able to handle that.
In the early days, we didn't have the time to build the automation to handle that, having an auto-scaling group that would automatically spin up an instance and automatically get the code that it needs, et cetera. We had to depend on the old-school way of just hoping the server doesn't die. We have a long [unintelligible 00:09:08] server; it runs, it runs, it runs, and everything's fine. In the early days we just deployed our stuff, sometimes manually.
If we had a tweak to our Nginx config, let's say for our web server, then we would just hop on the production web servers, edit that file, and reload the service. We used Capistrano to deploy our code, our web app, but automation really wasn't there. Over time, of course, as we launched server number two, and then servers three and four, and so on, we realized, "Oh, we really need to have systems that can replicate our config." We leaned on Ansible, which has been a huge benefit to us. We love it, and it got us to a point where we could get a bare server from our provider and, within a few minutes, have it provisioned for whatever role. Then as we moved to Amazon, we realized, "Okay, now we've got to be able to handle a server going away, an instance going away at any time, so that means every role has to have an auto-scaling group with some automated provisioning." What we ended up doing was building some AMIs, some Amazon Machine Images, like a golden master if you're old school.
Those AMIs were built using Ansible. If it's going to be a web server, say, we provision Nginx and the Ruby app, et cetera. Then when that AMI becomes part of an auto-scaling group, it has a configuration that tells it how to go and pull down the latest code and start running, so basically it can boot and go. There doesn't have to be a human involved in spinning up that instance. In the early days, there was zero automation; then we moved to Ansible to help us manage our configuration in a repeatable way, and then used that as we moved to Amazon to build our auto-scaling groups.
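The "boot and go" idea Ben describes, where an instance launched by the auto-scaling group fetches the latest code and starts serving with no human involved, is typically wired up through instance user data. As a rough sketch only (the repo URL, paths, and service names below are made-up placeholders, not Honeybadger's actual setup), a deploy tool might render that boot script like this:

```python
# Sketch: render "boot and go" user data for a freshly launched instance.
# The AMI already has Nginx/Ruby baked in (by Ansible, in Ben's story);
# at boot the instance only fetches the latest code and starts its service.
# Repo URL, paths, and systemd unit names are hypothetical illustrations.

def render_user_data(role: str, repo_url: str, branch: str = "main") -> str:
    """Build the boot script an auto-scaling group passes to a new instance."""
    return "\n".join([
        "#!/bin/bash",
        "set -euo pipefail",
        "cd /srv/app",
        # First boot clones; if the directory already exists, just pull.
        f"git clone --depth 1 --branch {branch} {repo_url} current "
        "|| (cd current && git pull)",
        f"systemctl start app-{role}",  # e.g. app-web or app-worker
    ])

user_data = render_user_data("web", "https://example.com/acme/app.git")
```

In practice the same effect can be had with a launch template's user-data field or a cloud-init config; the point is that all boot-time steps live in code, not in a human's head.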
Jonathan: As the primary or perhaps only infrastructure person at Honeybadger, are you able to take holidays?
Ben: As the only infrastructure person for the first several years, no, I wasn't actually able to take holidays. I could get away from the computer. I remember vividly if I ever-- I did go to conferences. I did travel. For the plane ride, I would just knock on wood and hope that nothing happened during those few hours but while I was at the conference, I always had my laptop with me. There were conferences where I would miss a session or in the middle of a session, just all of a sudden have to respond to an issue and whip out the laptop and get to work. That was not fun.
If I were to go back and advise myself from nine years ago, I might give myself some advice about spending more money on automation earlier on, so that I wouldn't have had as many years of doing that. These days, since we have so few incidents, yes, I get to take vacations now. I remember, not too long after the Amazon transition, when I took my first extended multiple-day vacation, I didn't even take the laptop. I still had my phone with me so that I could walk somebody else through things.
My co-founders are great, and they're technical. They can understand stuff, so if push comes to shove, I can call them and say, "Hey, do this, run that," that kind of stuff. We do have playbooks, so when things go wrong, we have some documentation that says, "Okay, check this thing and check that thing." I remember vividly this one July 4th holiday, I actually decided, "Yes, I'm going to take this day off."
Josh was on call that day, and poor Josh, of course there was an issue on July 4th, at like 10:00 in the morning, and he and I were texting back and forth. He was reading the runbooks and seeing where there were gaps, of course. These days, I get to take some good vacations, but in the early days, I was pretty much tied to the servers.
I think the key on the vacation thing, if you haven't had the chance yet to write the automation, is this: if you know there are things that can break, or that have broken in the past, and you have documented them, then having great playbooks is the next step you want to get to. That's what lets you take that vacation, because somebody else can cover, maybe a co-founder, maybe an employee, maybe just a friend. I have another entrepreneur friend who also runs a web app business, and he's doing it basically solo.
When COVID hit, he started to make some plans, like, "Well, if I end up in the hospital, who's going to take care of this if something goes wrong?" He and I have been friends for a long time, so he asked if I would dive in. I said, "Sure," happy to do that in an emergency scenario. What he had done was write these great runbooks: if this happens, then go look at X, Y, and Z, and if that happens, go look at A, B, and C. He and I sat down initially and just had a one- or two-hour orientation.
For me, it was, "Make sure I can SSH into the right boxes and that I can see the things I need to see." I think if you're in that situation where you're solo, or maybe the only person responsible for ops, then as you do a thing, maybe you upgrade a package, the typical day-to-day stuff or the problems you need to deal with, write down, step by step, what you did. Then you can hand that to someone else and say, "Hey, here's what you do when you encounter this problem." That helps quite a lot.
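The "if this happens, go look at X, Y, and Z" style of runbook Ben describes can be as simple as a symptom-to-steps mapping that whoever is covering can follow. A minimal sketch, with entirely invented symptoms and steps for illustration:

```python
# Minimal runbook: map a symptom to ordered diagnostic steps.
# Every symptom name and step here is an illustrative placeholder.
RUNBOOK = {
    "queue backlog growing": [
        "Check worker instance count in the auto-scaling group",
        "Check for a stuck job at the head of the queue",
        "Spin up an extra worker if latency keeps climbing",
    ],
    "site returning 502s": [
        "Check the load balancer's healthy-host count",
        "Tail the app server error logs",
        "Roll back the last deploy if errors started right after it",
    ],
}

def steps_for(symptom: str) -> list[str]:
    """Return the documented steps, or a safe default escalation."""
    return RUNBOOK.get(symptom, ["No runbook entry; page the infrastructure lead"])
```

The value isn't the data structure; it's that the knowledge lives in a document anyone can execute, rather than in one person's head.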
Jonathan: This is an off-the-wall question, but did you ever get that adrenaline high when things were crashing and you were required to step out of that conference to do something? Did it make you feel important, sort of like the wild west cowboy coder guy?
Ben: That question went in a different direction than I was expecting. [laughs] Have I felt that anxiety, that stress? Yes, and terribly so. But did I feel important? No, I did not feel important. I felt like the whole business was going to fail and it was all my fault. I felt like the biggest idiot in the world, like anybody else could figure out this problem in five minutes, so why is it taking me five hours, and if I don't fix this, then I have to go find a real job. Those kinds of feelings.
You might then say, "Well, when you were done, when you actually fixed it, did you feel important?" No, I still felt like, "Phew, glad I dodged that bullet. When's the next one going to happen? I somehow amazingly did not destroy the business today. Am I going to destroy the business tomorrow?" I don't know. Maybe that speaks to some personality quirk of mine, but I never felt important.
Jonathan: I don't know, the first thing that comes to my mind is that it speaks to you being a co-owner of the business, rather than an employee who might get kudos from the boss when they fix something, if you know what I mean.

Some people, I think, do get an emotional high when they're on the button trying to fix something, but maybe it's not their butt on the line when that happens.
Ben: Yes, I can see that might be a difference.
Jonathan: You mentioned one of your other co-founders being on call that week. How did you first set up this on-call rotation, and how did you balance it between the three of you, and I suppose more than that now?
Ben: Initially, it was ad hoc. We use Slack for everything because we're 100% remote, so it was hop on Slack and say, "Hey, I want to take off tomorrow, who can be around to watch the stuff?" basically. We would all get alerts. If there was a critical thing, then all of our cell phones would go off, and the expectation was that I would handle it; we would hop onto Slack, and if I didn't handle it, then one of them would jump in.
Eventually, we moved to PagerDuty, which is fantastic, just to be able to have a rotation. When I felt comfortable that we had enough documentation in place that anybody could handle most issues, we set up a rotation in PagerDuty where each of us takes a week, Monday to Monday. We actually have two levels: the first level is whoever has the rotation for the week, and the second level is me. I'm the default escalation, so you can set up rules for how long an issue can go unacknowledged before it escalates.
Our rule is that whoever's on call gets the first notice. If it goes unacknowledged for 30 minutes, then it escalates to the second level, which is me, all the time, and that's worked out really well. Going from everyone getting paged to only one person getting paged was the best quality-of-life improvement, [chuckles] because I could then go and do a thing for a few hours and not have to worry about whether my phone had reception, or whatever.
I knew someone else would at least take a look at it. Maybe I'd have to be called in on something, but more often than not, someone else would be talking about it and handling it before I even had to look at it.
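The two-level policy Ben describes, page the week's on-call person first and escalate to him only after 30 minutes without an acknowledgement, boils down to a simple timing rule. PagerDuty implements this for you; this sketch just makes the behavior concrete (the names are from the conversation, the function is hypothetical):

```python
ESCALATION_DELAY_MIN = 30  # unacknowledged pages escalate after 30 minutes

def who_gets_paged(minutes_since_alert: float, acknowledged: bool,
                   on_call: str, fallback: str = "Ben") -> list[str]:
    """Return who should currently be paged under a two-level policy."""
    if acknowledged:
        return []                      # someone is already on it
    if minutes_since_alert < ESCALATION_DELAY_MIN:
        return [on_call]               # level 1: this week's rotation
    return [on_call, fallback]         # level 2: escalate to the default
```

The quality-of-life win he mentions falls out of the first branch: as soon as the on-call person acknowledges, nobody else's phone goes off.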
Jonathan: How often do you get alerts these days?
Ben: As far as PagerDuty alerts, where someone might get woken up at 4:00 AM, that happens very, very rarely. Maybe once every six months, maybe once a year, something like that. But we do have an alerts channel in Slack for lower-level, non-urgent alerts, things that, if they persist, you probably need to do something about, but it's just an early warning system, so we'll see alerts in there a couple of times a week, depending. For example, our main thing is processing all these jobs from incoming API requests, so we have a queue, and that queue, of course, can get backlogged if there's some slowdown in processing. The two primary alerts that we keep a very close eye on are the depth of the queue, how many jobs are outstanding, and the latency: how long has the oldest job been in the queue?
Early on, if that queue depth went above 1,000, or whatever the number was, it would send a page, and everyone would drop everything and go look at what was happening, because we might have to spin up a new server. These days, we have CloudWatch alarms watching those metrics. If the count goes too high, or if the latency goes too high, a new instance just gets spun up, it does its job, and the alert automatically resolves itself.
Over time, we had to bump up that number, so maybe it was 1,000, then 10,000, then whatever the number is today, and how closely we had to watch it changed too. A few years ago, if we saw that latency number hit the alarm in Slack, we'd think, oh, it's time to take a look. Now it's, it'll be fine; if it goes another 20 minutes, then you might have to take a look. We'll get an alert that the latency is up, and then we'll get an alert that, oh, the latency is fine now.
Once you start getting these runbooks in place and you have an idea of how your system behaves, you start to realize, okay, this is an early indicator of what could become a problem; in our case, the latency or the depth of the queue. First, you have to have metrics for those things. Then you build alarms around those metrics. Then, hopefully, you can have some sort of automation that responds to those alarms. That's the progression that we've made at Honeybadger.
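That metric, alarm, automation progression can be made concrete: publish queue depth and oldest-job age as metrics, alarm on thresholds, and let a breached alarm trigger a scale-out. Here is a sketch of just the decision logic; the thresholds are invented for illustration (Honeybadger's actual numbers changed over the years, as Ben notes):

```python
# Sketch of the alarm decision: thresholds are illustrative, not Honeybadger's.
QUEUE_DEPTH_THRESHOLD = 10_000   # outstanding jobs in the queue
LATENCY_THRESHOLD_SEC = 120      # age of the oldest job in the queue

def should_scale_out(queue_depth: int, oldest_job_age_sec: float) -> bool:
    """Mirror the CloudWatch-style alarm: either metric breaching
    its threshold triggers spinning up another worker instance."""
    return (queue_depth > QUEUE_DEPTH_THRESHOLD
            or oldest_job_age_sec > LATENCY_THRESHOLD_SEC)
```

In an AWS setup this predicate would live in CloudWatch itself, with the alarm wired to an auto-scaling policy, so no code runs on anyone's laptop and no human gets paged for the routine case.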
Jonathan: What kind of volume of alerts were you getting, say, in the first year or two? I'm assuming it was much higher.
Ben: Yes, it was. The funny thing is, we had initially planned on running Honeybadger as a side business. Starr and I worked for a startup in Seattle. We had a nice job that we enjoyed and didn't really want to leave, but we'd find that we'd added a new customer, and all of a sudden we had a bunch of new volume. It'd be 11:00 in the morning, we'd see this alarm, like, oh, this [unintelligible 00:22:03] is crazy, and we'd have to stop everything we were doing at our job and go deal with this side thing. After a while, it just became untenable.
We couldn't, in good conscience, have a job and have this thing that was always interrupting us. In the early days, it was crazy. We'd get a new customer (of course, we weren't watching the customers sign up; we were doing our work at our regular job), and then all of a sudden there were alerts because of all this new traffic, and we'd have to go and handle that. Daily, we would see alerts, because one thing or another would break. In the early days, we were learning what the process was for dealing with stuff.
We were learning where things could break. It might be the queue. Like, okay, we fixed that bottleneck, and then, oh, there's another one in another spot. It wasn't always the same problem, but it was these growth problems. They used to call them good problems, and they are good problems to have, but they're still problems.
Jonathan: How did you learn to manage this basically uncharted territory, where you're facing new problems every day or every week or with every customer? Did you learn any tricks to help with this, or was it just flying by the seat of your pants?
Ben: Every system is different. In our case, we could anticipate where a slowdown would happen based on past alerts. In the early days, one issue that we kept hitting was slowdowns in Postgres queries, because if you've dealt with a lot of data, you know that sometimes you get to a tipping point in your database where X thousand rows is fine but Y thousand rows is not. You go and add an index, and maybe that speeds it up, or you find some denormalized solution, or whatever it is.
We would hit that problem repeatedly as we hit these various levels of data. We got a gut feel for it: oh, the system is doing this, we've seen this kind of behavior in the past, so it's likely this kind of problem. Let's go look at the slow queries and see what they look like. Okay, there's a different query now that's showing up as slow; now we have an idea of how to go and fix that thing. It just takes experience to get a feel for how your system behaves. Over time, you do see these patterns.
Again, it helps if you're documenting as you're fixing stuff. Then you can see, oh yes, we did this, and we did this, and we did it again. I had experience with databases in particular, so I knew things like: you add an index and all of a sudden magic can happen; things can suddenly be performant. It's not quite that easy, but it feels that way. When you finally get it, you're like, yes, now everything's working again. Once you have an idea of how your different components behave, you can get a feel for which part of the system is probably where the problem is.
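The "add an index and magic happens" effect is easy to demonstrate. This sketch uses SQLite from Python's standard library rather than Postgres, and a made-up table, but the principle is the same: the planner switches from a full table scan to an index lookup.

```python
import sqlite3

# A toy errors table, loosely inspired by the conversation; the schema
# and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE errors (id INTEGER PRIMARY KEY, project_id INTEGER, message TEXT)"
)
conn.executemany(
    "INSERT INTO errors (project_id, message) VALUES (?, ?)",
    [(i % 100, "boom") for i in range(10_000)],
)

query = "SELECT COUNT(*) FROM errors WHERE project_id = ?"

# Before the index: the planner has to scan the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchone()[-1]

conn.execute("CREATE INDEX idx_errors_project ON errors (project_id)")

# After: the same query is satisfied via the index.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchone()[-1]
```

Postgres offers the same introspection through `EXPLAIN`, which is exactly the "go look at the slow queries" step Ben describes.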
Jonathan: You've mentioned documentation a couple of times now. I'd like to hear a little about how you handle an incident. You mentioned your post-mortem process. What has that come to look like these days? If you were to have an incident tomorrow, how would you handle it?
Ben: We had an incident a few hours ago, actually, that was caused by a human. It was caused by me. [laughs] These days, a good percentage of the incidents we have are human-caused: some deployment, a new code change or a new configuration goes out to production, and, oh, there's an unintended side effect. As for the process, even while the problem is happening, alerts are going off and everyone who's available is in Slack trying to help out.
There's typically one person, typically me, who's the lead, trying to diagnose the problem. Everyone else is basically watching the alerts, seeing if they can think of anything to do, and if not, waiting for direction from the lead. There's a lot of interaction happening in Slack. Hopefully, by the end of the incident, by the time we've discovered the problem and deployed a fix or a workaround, everyone who's at all interested has some idea of what happened, how it happened, and what the resolution was, because they've been in Slack.
They've been following along as the incident progressed, so we have that shared context already. Then, after the incident's resolved, usually the next day, because you've got to give yourself some time to come down from it. It's an anxiety-inducing experience; you've got a lot of adrenaline going. I don't know about you, but I've had days where problems are like, okay, this is it, the business is done, I've sunk the whole thing because I can't figure out how to fix this. Anyway, we give ourselves a little bit of time, and then the next day we write up a post-mortem.
We keep our long-lived documentation in Basecamp, and it's pretty informal: okay, here's where we were, here's what caused it, here's the change that happened, or the incident, or whatever the trigger was. Here's the research that we did, here are the things that we tried, and here's the thing that eventually worked. Then the most important part of the post-mortem: what's the plan for avoiding this kind of incident in the future?
It might be things like, well, this is a new kind of thing that we hadn't really monitored before, so we need to add a metric for this particular behavior. Then, if we have a metric, we should have an alarm based on some threshold. Then, with that alarm, we should take this action. We'll just write up that plan, whatever the plan might be for avoiding the incident in the future. Maybe if it was a bad upgrade, a bad code push, the resolution plan is, okay, we add some more review time, or we do these other steps in our staging environment, or whatever.
We discuss that plan in the Basecamp thread. Once we agree that there's a good plan for moving ahead, we make those changes, whether it's an Ansible change or a Terraform change (Terraform is where we manage our AWS resources), or maybe it's just a process change. Whatever that change is, we implement it in one of our repos as a PR, or just as a new runbook. It's a pretty informal process. There's not an actual meeting kind of thing.
We do believe, though, in the no-blame approach; you're probably familiar with this. We don't say, oh, it was so-and-so's fault for doing that thing. No, the fault was that something bad happened, and we just need to fix that thing. It's really not a big deal if someone pushes some bad code; what should have happened was that review should have caught it. Let's be better at reviews.
Or if there was a new service we put out there and we didn't have enough monitoring in place: okay, let's make a plan so that at our next launch, when we have a new thing like this, we think harder about the metrics we need to measure, and the alarms we need to have in place, and that sort of thing.
Jonathan: Looking back, is there anything you would have done differently or that you would have changed over the last eight, nine years if you had the chance?
Ben: I've thought about that a few times. One thing that I might have done differently, although I'm not 100% sure, I've thought a number of times, maybe we just should have started on AWS with all the automation work upfront. That would have made my life a lot easier for the first few years. At the same time, that would have delayed our launch by several months, just because it did take a lot of work to get exactly what I wanted. Maybe I'm just too picky, I don't know.
That's the number one thing I keep coming back to: if I had started with all the automation framework in place, obviously we still would have had lessons to learn and things to change over time, but if we had been there from the get-go, we might have been in a better starting point to handle some of those early issues more gracefully.
Jonathan: Looking back, if you had taken the approach of building the automation first, would it have been motivating? Considering that you had a day job, and I'm assuming you were working nights and weekends, and you would not have had customers, would you have been able to maintain the momentum and motivation to build that automation under those circumstances, or was that pressure required to get it done?
Ben: That's an excellent question. There were three of us, we weren't funded, and like you said, it was nights and weekends. Starr was working mostly on the front-end code; Josh was working on our client libraries, the code that sends us data. My job was not only ops but also building the back-end code, the whole processing pipeline. If I'd been spending all my time on the ops stuff, we wouldn't have had a back end for handling the data, so the product would have been delayed by months, I'm sure of it.
That probably would have been a premature optimization, probably not worth it. Maybe the real answer is that we should have done it sooner, but not straight from the start. I don't know. It was good experience, though. By the time we did the migration to AWS, I felt really comfortable, because I'd had many months to practice doing a cutover and doing all the things.
Jonathan: How do you feel now about the way things are with regard to operations at Honeybadger?
Ben: I feel it's a world better than it used to be. I'm much more relaxed. [chuckles] I'm able to take a vacation when I want. I like to ride my bike on the trails nearby, and I can go for an afternoon bike ride for several hours and not have to be concerned that, all of a sudden, I have to go back early. It's night and day, really. These days, I'm excited to be working on this kind of stuff because it just works 99% of the time. When we do have a problem, there's a clear path to how we can avoid having that problem again.
We have a lot of the infrastructure in place to be able to say, "Okay, we can just add this kind of thing or split it this way." It feels a lot better. I guess the good news is, the early period was painful, but the past several years have been very unpainful. There's hope for anyone who's in the painful days: yes, you can get past that point. [laughs]
Jonathan: What does the future look like? What kinds of optimizations do you hope to make in the next 6 to 12 months?
Ben: I'd love to find magic ways to cut our AWS bill. [chuckles] Unfortunately, I've been thinking about it for years now and I don't have a good solution yet, but maybe they'll come up with some great cost-cutting measures on their end and that'll help us. I don't know. One thing we really want to do is be able to deploy a stack to other regions easily, to be able to have region-to-region failover.
Today, if US-East-1 went away, well, half the internet would be dead, but so would we. [chuckles] We don't currently have an automated system to switch over to another region. We can manually recover; we have a playbook for that, we have a plan for that, but we can't just flip a switch, which would be awesome. That's one thing we're working on with our Terraform. We still have a few places in our AWS configuration where things were done in the console and aren't quite in code yet.
One of the efforts that we're doing this year is putting all that in Terraform, going through and creating a whole new stack and giving basically a push-button deploy, where we can say, "Okay, our region went away, let's do a manual recovery, but let's not have to push anything besides this one button, Terraform apply or whatever it is." That's what we're working towards. I think I will sleep better at night with that. That's really how we guide our work at Honeybadger.
Every time we get to a peaceful place, then it's, "Okay, what's keeping Ben up at night? What do I wake up worrying about?" Like, "Oh, if this happened overnight, that would be bad. Okay, how do we handle that?" Kind of thing. That's the next step for us is, like, "Hey, if the region went away, that would be bad, how do we handle that?"
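The push-button recovery Ben describes can be sketched as a Terraform root module parameterized by region, so the same stack can be re-applied into a standby region with a single variable change. This is a minimal illustration, not Honeybadger's actual configuration; the module path and variable names here are hypothetical.

```hcl
# Hypothetical sketch: one stack definition, deployable into any region.
variable "region" {
  description = "Target AWS region for this copy of the stack"
  type        = string
  default     = "us-east-1"
}

provider "aws" {
  region = var.region
}

# The whole application stack lives in one reusable module, so a
# failover becomes re-applying with a different region:
#   terraform apply -var 'region=us-west-2'
module "stack" {
  source = "./modules/app_stack" # hypothetical module path
  region = var.region
}
```

Getting everything out of the console and into code like this is the prerequisite for the "one button" recovery; anything still click-configured has to be rebuilt by hand during a failover.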
Jonathan: Really good. If any listeners are going through some of these same struggles, maybe they're co-founders, maybe they're just working on a small company, do you have any resources you can recommend that could help them through this journey?
Ben: I think preemptively reading up on things is good. There's a great book, well, it's kind of old now, called Ship It!, published by the Pragmatic Programmers. Some of the stuff is outdated, does not apply, but it has good things to think about if you're new to the industry. Google has a great book, their Site Reliability Engineering book is fantastic. Although, you have to temper that with where you are on your life cycle as a company because, obviously, it was written by people that have a lot of resources and a lot of things happening and, probably, you don't have that much happening.
You have to filter out the stuff that doesn't really apply to you, but there's good, I think, insights in there and ways to think about things. The great book The Unicorn Project is just a fun read. Again, it gives you that, like, "Here's all the bad things that happen in big enterprises." I think it can give you some appreciation for maybe the smaller kinds of problems you do deal with; it kind of makes you feel better. [chuckles] That's a fun narrative read. You don't go into that thinking, "Okay, I got to learn some stuff today." No, it's interesting, and you learn some stuff along the way.
Of course, Stack Overflow is a great place to be at any time you have an error. I don't know. I think as far as resources go, I think a lot of just my experience from playing with Linux for a long, long time just comes into play, I get a feel for how stuff works, and, like, "Oh, yes, it's a file descriptor limit kind of problem." You understand what that is when you've played with it long enough. I think probably the best thing to do is just play, learn stuff. If you hear about Ansible you're like, "What the heck is that?"
Then you learn what that is, you're like, "Oh, well, let me go try, let me go play with that." As opposed to just reading about it or reading what someone else did with it, go and set up an instance. I think, even today, one of the best ways I spend my time is I spin up a bare instance in EC2 and I play with stuff, and I try things out. I build a new Terraform thing to see how stuff works. I build a new VPC, and then I tear it down, and I only spend like five pennies on it.
I think more than spending time in communities like Reddit or Discord or whatever, I think spending time just playing with the technology and learning stuff that way is time well spent.
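Ben's "spin up a bare instance and play" workflow stays cheap because everything is created, and then destroyed, through Terraform. A minimal sketch of such a throwaway instance, assuming a small burstable instance type (the AMI is looked up rather than hard-coded, since AMI IDs vary by region and over time):

```hcl
provider "aws" {
  region = "us-east-1"
}

# Look up a current Amazon Linux 2 AMI rather than hard-coding an ID.
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# A throwaway box for experiments; a t3.micro costs pennies per hour.
resource "aws_instance" "playground" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
  tags = { Name = "playground" }
}
```

`terraform apply` brings it up and `terraform destroy` removes everything when you're done, which is how an afternoon of experimenting ends up costing "five pennies."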
Jonathan: Ben, how can people get in touch with you?
Ben: Feel free to ping me on Twitter. I'm Stympy on Twitter, S-T-Y-M-P-Y. Of course, people are welcome to reach out to me at e-mail, email@example.com. I do hang out on Reddit from time to time in the Ruby on Rails subreddit, so I might answer a question here and there. I'm happy to answer DMs or mentions on Twitter, that's probably the best way.
Jonathan: Great. If anybody here is looking for some error monitoring and so on, of course, Honeybadger is available. How do we find the best option there?
Ben: If you're running a web app, you definitely want to check out Honeybadger.io. That is the place where you can get the most awesome exception monitoring on the planet. It's really focused towards developers. We want to have developers have a better day. We feel very strongly that our mission is to have great software for developers, so that they can have a better experience with their work. I think if you try Honeybadger and you see how it helps you deliver a better product and helps make your customers happy, I think you'll agree that it's a great product for that.
Jonathan: Ben, before we sign off, is there anything else you'd like to say?
Ben: Thanks for having me. It's really appreciated. For all those who are struggling with those heavy ops burdens on their shoulders alone, I feel your pain, and do feel free to reach out to me anytime if you need some commiseration.
Jonathan: Thanks so much, Ben. Thanks for listening. I'll talk to you guys next time.
Commercial: This episode is Copyright 2021 by Jonathan Hall. All rights reserved. Find me online at jhall.io. Theme music is performed by Riley Day.
[00:39:24] [END OF AUDIO]