Tiny DevOps episode #45 Dave Mangot — Should you deploy on Fridays?
October 11, 2022
Dave Mangot is a speaker, author, teacher, and Silicon Valley veteran. His focus is helping private equity portfolio companies use their technology organization to maximize growth, and he joins me today to discuss the contentious topic of Friday deployments: why you definitely should do them, and why you definitely should not. Confused?
In this episode
- Mores are not moratoriums
- Shaming is inappropriate, on both sides of the issue
- Every outage is unexpected, nobody knows what might go wrong
- Friday deployment should be an informed choice
- Why small batch deployments are important
- Deploying features vs other changes
- You should be able to deploy at any time, but separate that from choosing to deploy at any time
- Why more QA can be worse than less QA
- If deployment hurts, or causes fear, do it more
- Responding to failures when they do occur
- Building an accurate mental model of your system
Article: Deploy on Fridays, or Don't
Book: Continuous Delivery by Jez Humble and Dave Farley
Talk: How Complex Systems Fail by Richard Cook (Velocity 2012)
Book: Project to Product by Mik Kersten
Book: Out of the Crisis by W. Edwards Deming
Voiceover: Ladies and gentlemen, the Tiny DevOps guy.
Jonathan Hall: Hello, everybody. Welcome to another episode of the Tiny DevOps Podcast. I am your host, Jonathan Hall. Today we're going to talk about Friday deployments. I'm excited to have Dave Mangot with me today. Hi, Dave.
Dave Mangot: Hey, Jonathan.
Jonathan: Glad you're here. Let's talk about Friday deployments and whether it's a good idea or a bad idea, or something in the middle maybe, or maybe there's nuance here that we need to unbury. Before we do that, why don't you introduce yourself a little bit, tell us about your experience, who you are, what you do.
Dave: Sure. My name is Dave Mangot. Like you said, I've been in Silicon Valley doing computer nerd stuff for 20-something years at this point. I was an architect in infrastructure engineering at Salesforce and designed a lot of the way that Salesforce runs, even today. I introduced the concept of infrastructure as code there, which was fun. I went on to run the global engineering SRE organization for SolarWinds, all of their cloud companies, and now I mostly work with private equity portfolio companies on their engineering practices so they can become super valuable and all kinds of stuff for their investors.
I like to say that I have one of the best jobs in the world in that I get win, win, win. I get to go in and help make the lives of the engineers better so they're happier to come to work every day, which makes their executive leadership team happier because their engineers are really happy to come to work and don't have to do lots of manual pushing stuff from here to there, like deploys, we're going to talk about. When the engineers and leadership are happy, then the investors are happy. It turns into a win, win, win, which I just don't think there's that many jobs out there where you can do that kind of a thing.
Jonathan: Awesome. The topic today is Friday deployments, and I think maybe other deployments too will come up. You wrote an article about this a while back. We'll put a link of course in the show notes. A couple of years ago it came out on HackerNoon, and the title is, Deploy on Fridays or Don't. I guess you just don't care if we deploy on Fridays, is that what the point is?
Dave: What it all boils down to I think is that it should be a choice whether you deploy on Fridays or not. I'm a very firm believer that you should be able to deploy on Fridays and I don't have any problem with anyone deploying on Fridays. What I wrote that article about was a response to a lot of people shaming people or trying to make them feel bad or explaining to them how they're doing it all wrong if they decide that they don't want to deploy on Fridays.
To be clear, I'm not talking about moratoriums, things like that, moratoriums mean you are not permitted to deploy on Fridays. My disposition is not that you shouldn't be permitted to deploy on Fridays, but that you should choose whether or not you're going to deploy on Fridays. There's a lot of us who have done this for a long time who make the choice to not so much not deploy on Fridays, but like, if I'm going to deploy on Fridays, it better be for a darn good reason. Sometimes there's a darn good reason, and oftentimes there's not.
There's just all the things that people talk about in the Twitter arguments: that you're giving up 20% of your velocity, that you don't trust your tests, that you just have to know how to deploy better. There's a whole thing about testing in production, which I guess you and I can also jump into as well; that's a very large topic. Testing in production is necessary, and there are ways to test in production, I will say, more safely as opposed to safely. But the whole point of the thing about Fridays is there's no guarantee that the thing you push is not going to break something. That's why it needs to be a choice.
Jonathan: You touched on this in the article, there's a lot of, I guess we could call it stigma about not deploying on Friday. That was part of the reaction. If you're afraid to deploy on a Friday, there's something wrong with you or your team or your tech sucks or you're not a real high performer. Whatever no true Scotsman types of arguments you might want to throw out there.
I think we both agree that that's not healthy. Even if there might be some bad reasons not to deploy on Friday, and maybe "bad" isn't the right word, maybe there are reasons you want to overcome, that's still not a reason to be ashamed, and certainly not a reason to shame other people.
Dave: Most of us, I think, who have done this for a long time know that every single outage that we've ever had was not an outage that we planned. I'm not talking about planned maintenance or planned downtime. Every outage was something that was unexpected. My take is, you can deploy on Fridays or don't. That's the title of the article, but let's recognize that nobody knows when that next outage is going to happen, so if I want to respect my Fridays, maybe I don't deploy. One of the ways I put it in the article is, mores are not moratoriums. Meaning the culture of our company is to not deploy on Fridays, but that's it, that's the culture of the company. That's not a rule. It's not a "you can't do it."
People can deploy on Fridays, that's fine if it makes sense. A lot of places that I've worked, if it's Friday afternoon at two o'clock or three o'clock and everybody's already-- They've got one foot out the door at that point, do you want to make all these people stay in case some untoward event happens? To be clear, when I said earlier make it a choice, if it's 2:30 on a Friday afternoon and you have to get this thing out the door because there's some reason it has to go out the door, then definitely ship on Friday if you have to do it. If it doesn't really matter whether I ship this today or I ship this Monday when I come back, then why would you ship it on Friday?
There's all kinds of things we can do on a Friday. We can [crosstalk] documentation, we can train people on stuff, we can do some research. There's a ton of things that are available for us to do on Fridays, so if I can just say, you know what, I'll just do this Monday, it's no big deal, then let's do it Monday. No big deal.
To your point, I think a lot of the backlash that I see, I think is the thing that's unhealthy, where people are like, "No. You should deploy on Friday." Don't wait till Monday even though you can, because the fact that you are choosing to wait till Monday means that there's something wrong with your deploy procedure. There's something wrong with your test. You don't have enough of this.
I get it. I read Continuous Delivery by Humble and Farley. Part of their thesis is that if you want to do a deploy, then you write tests, and the more tests that you have, the more confidence you should have in your deploy. It's Friday afternoon, you can say you should have more tests because then you'll be more confident in your deploy. I think that's fine but let's recognize there that nobody is saying that there's 100% confidence that nothing bad can happen. It's more confidence, it's not guaranteed 100% confidence. If I can wait till Monday and it doesn't matter, why would I deploy on Friday? It doesn't make any sense to me.
Jonathan: I think you've also struck a nice balance between that and the other extreme, which I maybe hear less often, but I've definitely heard it: if you deploy on Fridays, you're immoral because you don't respect your employees, or you're expecting your employees to work late hours, or you're stupid because things are going to go wrong every time you deploy on Friday, or whatever. Both extremes, I think we can agree, are extremes and not really the right approach. [laughs]
Dave: Yes. The idea that you don't respect your employees if you deploy on Friday, that kind of thing you're talking about, I think is what the "you must deploy on Fridays" people are reacting to. I think the thing that they're reacting to is this idea that you are going to somehow hamstring the company by not doing that kind of thing. That's where we get into this whole, "You're reducing your velocity by 20%, and it's dangerous, and it means that there's something fragile, because otherwise why would you not deploy on Friday unless there was something fragile?"
I'm like, "I don't deploy on Fridays because Saturday and Sunday literally have a special name: the weekend. Nobody ever calls Tuesday and Wednesday the weekend unless they work in retail or something and have crazy hours." We have a special name for those two days, and people make plans with their friends or their families to do things on the weekend. Nobody's like, "Hey, I've got this great Tuesday and Wednesday booked in October to go fishing." Does that happen? Sure, but most people talk about the weekend. Fortunately, or unfortunately, Friday comes before the weekend, so Friday actually is a--
People say in their argument, there's nothing special about Friday. There is something special about Friday: it's the last day before the weekend, and we make choices based on the fact that we're going to be away for 48 hours, not sitting in front of our computers. One of the things I like to point out to people, which sometimes they either didn't know or forgot, is about Netflix and the Chaos Monkey. Everybody's like, "Oh, the Chaos Monkey."
It was an incredible thing, and the launch of chaos engineering, and Nora Jones and Casey Rosenthal and all that other stuff, is super cool, but the Chaos Monkey only ran during business hours. People either don't know or forget that, and they're like, "Oh, you should be resilient to failures at any time." Yes, you absolutely should be resilient to failures at any time, but the Chaos Monkey only ran during business hours.
Jonathan: Nobody does a fire drill at three o'clock in the morning, right? [laughs]
Dave: Yes, well hopefully not.
Jonathan: [laughs] I guess if you're in a hotel, maybe you want to make sure your guests know how to get out too. I don't know.
Dave: For me, it's the same thing. I can do a deploy on Friday, sure, but I also know that for the next two days, no one's going to be in the office, no one's going to want to be in the office, no one's going to be up all weekend figuring out what happened when we dropped something in the database, so let's be cognizant of that. Again, that's being cognizant of it; that doesn't mean you can't do it, because if you need to do it, then do it. If it's Friday morning at 9:00 AM, go ahead, who cares? You've got the whole day, basically. If it's that Friday afternoon time, that's when I suggest people exercise caution.
Jonathan: I have a general rule of thumb that I follow, and I'm curious if you have your own take on this: if I'm deploying code, do I personally have the time to wait for it to deploy, let it run in production a reasonable amount of time, whatever that is given the circumstance, and be confident that it's working before my shift ends? Whether that's a Friday afternoon or a Tuesday, or maybe I'm working on a Saturday afternoon, am I going to be around for the next three hours in case something explodes?
Dave: Yes, you hit the nail on the head. There is that thing about, I have to be around for a while after this is deployed, because hopefully all of us have had outages where you push, everything seems fine, until it's not, and that's not necessarily immediate.
The other thing you said there was, whether it's a Tuesday. Even on a Tuesday, how many, let's call it experienced, people would do this? It's five o'clock on Tuesday, you have to run out the door because you have a jujitsu class at 5:30, and what's the last thing you do before you run out the door? "Oh, I'm just going to push this to production right now, because it's five o'clock and I've got to be somewhere in 30 minutes. Let's push this to production right now, and then I'm--" Who does that?
Jonathan: I know somebody who does that, but no. [laughs] Yes, I think that's critical: recognizing, thinking about what might go wrong, and are you prepared to respond if something does go wrong? And if not you, then somebody on your team. Is somebody prepared? Will there be cover?
Dave: The thing you said that I'll pick a fight with, or whatever, is "what might go wrong." You said, "Think about what might go wrong." I have no idea what might go wrong. I have some ideas about what might go wrong, but again, no outage that I've ever been a part of was something that we planned. Nobody's like, "Oh, if I do this, we're going to have a huge outage." Nobody pushes that.
I don't know if you ever saw Dr. Richard Cook, who just passed away two weeks ago, so [unintelligible 00:15:19] went out for Dr. Cook. His Velocity talk changed my way of looking at all of this stuff. He says an incident, or an outage, or whatever you want to call it, I think the phrase he used was "each necessary, but only jointly sufficient," meaning of the things that cause an outage, there's no one single thing that causes the outage.
It's a combination of all these things that come together, and each of them is necessary, but only jointly sufficient. Only when the combination of all those things happens do we have an incident. Any one of those things happening in isolation may not have caused an outage, but it's the combination of those factors that caused the incident. He didn't talk about this in the talk, but there was that aircraft accident, some plane that was flying from Brazil, and a part on the outside of the plane iced over, and then the pilots didn't have the right readings, and all these things happened, and there was a plane crash.
There wasn't just one thing; there's no single root cause, right? It's not just the fact that that instrument iced over. It's not like if there's ice on the instrument, the plane automatically goes down; that's not how it works. That happened, and then this, and then everything else, and then the plane went down. It's because each of those things was necessary, but only jointly sufficient.
We can't know when we're pushing something to production whether that's going to be the trigger that pushes something over the edge. This is also why we do all the testing: I want to have confidence, because I did all these things, that all those conditions would have to exist in order for me to have the outage.
I give an example in the article. I was building a bunch of servers that were the interface between our front-end tier and the database. Because the front-end tier was PHP, we didn't want PHP just pounding the database into submission, since every hit would be opening up a new database connection. We had a tier in the middle, in Java, with persistent database connections. Those connections are long-lived, they're part of a connection pool, they have timeouts, all the things you want in a proper, resilient setup. I was replacing the tier we had with a shiny new one: new software, new configuration, whatever.
These servers I was building were not in production; I was just building them. They were just there to sit until we said, "Okay, let's start migrating the traffic from the old servers to the new servers, and then we'll turn off the old servers." When you have software that creates a connection pool, it opens up connections to the database, and databases can only handle a certain number of connections.
What I had essentially-- Not essentially, what I had actually done was double the number of connections into the database, which pushed it over the limit of the number of connections the database could handle. It caused an outage, because things in the active tier were trying to open up connections, because they time out and recycle, and they couldn't anymore, which meant we couldn't satisfy production database requests. That was an outage.
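The failure mode Dave describes can be sketched in a few lines. All the numbers here are hypothetical, and the functions are just for illustration; the point is that the new tier's connection pools open as soon as the servers come up, before any traffic is migrated:

```python
# Sketch of the outage Dave describes, with hypothetical numbers:
# each middle-tier server holds a pool of persistent database
# connections, so standing up a duplicate tier doubles demand
# on the database even before any traffic moves over.

DB_MAX_CONNECTIONS = 500       # hypothetical database connection limit
POOL_SIZE_PER_SERVER = 50      # hypothetical pool size per server
ACTIVE_SERVERS = 8             # the old, in-service middle tier

def total_connections(server_count: int, pool_size: int) -> int:
    """Connections the database must hold open for a tier of servers."""
    return server_count * pool_size

# The old tier alone sits comfortably under the limit.
old = total_connections(ACTIVE_SERVERS, POOL_SIZE_PER_SERVER)  # 400

# Building an identical new tier alongside it opens a second set of
# pools immediately, so demand doubles and exceeds the limit.
combined = old + total_connections(ACTIVE_SERVERS, POOL_SIZE_PER_SERVER)  # 800

assert old <= DB_MAX_CONNECTIONS
assert combined > DB_MAX_CONNECTIONS  # the condition that caused the outage
```

As Dave says, each factor here was necessary but only jointly sufficient: the pool sizes, the idle servers, and the database limit were all individually fine.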
Did I know that was going to cause an outage? No. Everybody looks back and goes, of course, you should have known that opening up all those connections would cause that. Of course I should have, but that language right there, "should have," is a counterfactual. I know information now that I didn't know then, and everything is obvious in hindsight because I have information that didn't exist for me at the time. So this idea that we could sit back and think about something hard enough to know whether my production push on a Friday afternoon is going to cause an outage? You can't know that. It's impossible.
That doesn't mean there aren't things that are more risky or less risky. That's a judgment call that you can make. At the same time, I guarantee you, out of your thousands of listeners, somebody pushed a change that they thought was not risky at all, and it caused an outage. You can't know. Each is necessary, but only jointly sufficient.
Jonathan: How do you decide when it's appropriate or not appropriate? It sounds like your default is don't deploy on Friday, but there are times when you'll make an exception. How do you decide when to make that exception?
Dave: My heuristic is, "Can this wait, or can't this wait." That's what it boils down to. If it can wait, then it'll wait because it can.
Jonathan: I can imagine scenarios where "Can it wait? Yes, it could, but it would be nice if it didn't." Something like that. Where do you draw that line?
Dave: That's your risk tolerance. Certainly, like I just said, some things seem simple and then they're not. Replacing the entire tier that talks to the production database? Maybe that's not a good thing for a Friday afternoon. Changing the maximum number of hosts available in an auto-scaling group from five to eight? To me, that seems okay. I don't know.
I can't tell you beyond a shadow of a doubt, because maybe a lot of traffic comes in, and now we have eight hosts hitting the database where we only had five, and the database was already close to the edge because we weren't monitoring that number. Now I know to monitor that number, but I can't know everything. To me, that sounds okay. I already have five things in the auto-scaling group, and I go up to eight. That's something I'd probably feel comfortable with? Yes, I'd probably feel comfortable with that. Again, if I don't have to, does it matter?
Our website gets a lot more traffic on Tuesdays than it does on Saturdays and Sundays. Can increasing the size of our scaling group from five to eight wait till Monday? Probably. What's the chances I'm going to need that capacity on the weekend when I have graphs that show me what our traffic looks like? Does that mean that the New York Times is going to publish a story on us that we didn't know about, and now we're on the front page of the Sunday whatever, and now we got a huge rush of traffic? "Oh, I wish we had increased that auto-scaling group from five to eight." These are all counterfactuals. It's having information that we didn't have at the time.
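The judgment call Dave walks through, "can the database absorb a full scale-out?", can be made explicit as a back-of-the-envelope check. The function and all numbers below are hypothetical, a sketch of the reasoning rather than any real tooling:

```python
# A back-of-the-envelope check before raising an auto-scaling group's
# maximum host count: if the group actually scales all the way out,
# does the database still have connection headroom? Hypothetical numbers.

def db_headroom_ok(max_hosts: int, conns_per_host: int,
                   db_max_conns: int, other_conns: int = 0) -> bool:
    """True if the database can absorb a full scale-out of this group."""
    return max_hosts * conns_per_host + other_conns <= db_max_conns

# Going from 5 to 8 hosts at 40 connections each, against a
# 400-connection limit, still leaves headroom: on this axis, the
# change looks safe.
print(db_headroom_ok(max_hosts=8, conns_per_host=40, db_max_conns=400))  # True

# At 60 connections per host, the same change would not be safe.
print(db_headroom_ok(max_hosts=8, conns_per_host=60, db_max_conns=400))  # False
```

Of course, per Dave's point, passing a check like this is evidence, not a guarantee; it only covers the conditions you thought to model.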
Jonathan: What kinds of responses do you get to this from the clients you work with? You go into a new company and you don't know what their policy is, whether it's always deploy on Fridays or never deploy on Fridays, or-- I imagine at a lot of places they just don't even think about it. Whoever feels like pushing the button pushes the button when they feel like it. When you come to these companies and educate them on this, what responses do you get?
Dave: I think the fun part of that is people who are deploying any time they want, multiple times a day, and all those other kinds of things tend not to be my clients. [laughs] I go in and help companies who are having trouble with that. A lot of what I hear is, "Our customers won't allow us to push every day. They don't have an appetite for that. It takes two weeks to get stuff out into the staging environment, and then another two weeks after that before it goes into production." There's a lot of that.
This is what I do with people: I help get them much more into, we're going to deploy every day and we're going to get fast feedback. The whole idea of small batches, small batch deploys, is the thing that makes things safer. It's actually more dangerous to deploy 10,000 things at once than to deploy one thing at a time, because of the blast radius and the interactions, and again, each necessary, but only jointly sufficient.
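One way to quantify the small-batch argument, a simplification of Dave's blast-radius point, is to ask how hard it is to find the culprit once a deploy misbehaves. Assuming you isolate a bad change by bisecting the batch (re-deploying halves), the cost grows with batch size:

```python
# Why small batches help, as a rough sketch: if a deploy of N changes
# misbehaves, isolating the one bad change by bisection takes about
# log2(N) extra deploys in the worst case. A one-change deploy
# identifies itself. This ignores interactions between changes, which
# only make large batches worse.

import math

def bisection_steps(batch_size: int) -> int:
    """Worst-case deploys needed to isolate one bad change by bisecting."""
    if batch_size <= 1:
        return 0
    return math.ceil(math.log2(batch_size))

print(bisection_steps(1))      # 0  -- the single change is the culprit
print(bisection_steps(10))     # 4
print(bisection_steps(10000))  # 14
```

And that 14 understates the pain: with 10,000 changes in flight, any pair of them can be "jointly sufficient" for an outage that neither causes alone.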
I love that we get to talk about Dr. Cook, but I don't know what the interaction of all these things is going to be that causes a production outage. I want to limit the number of things that go out so that I'm minimizing that. I tend to work with people on that. The other part of it, when people say that whole thing about their clients not allowing them to deploy, is that Dr. Mik Kersten wrote Project to Product, and he talks about the four flow items in there: defects, risks, tech debt, and features.
When people tell me they can't deploy, not just on Fridays, because their customers won't allow it or don't have an appetite for it, they're talking about features. I'm like, "Yes, but there are other things you can deploy that are not features." Let's say it's a risk, a security problem. Are you telling me your customer is going to say, "You have a security hole that will allow some hacker to exfiltrate our data, hold it for ransom, and sell it on the black market, but no, don't deploy. Don't patch that. I don't have a risk tolerance for you to--" That's nonsense. Nobody would ever say that. That's crazy.
It's good that you and I are talking about deploys, because I think that's important too. People need to understand that some of these things we would deploy on a Friday are not features. We're talking about tech debt, we're talking about risks, we're talking about bugs, we're talking about all kinds of stuff. We have to make a determination on that Friday or whatever.
As my old boss used to say, is the juice worth the squeeze? Is this something that should go out, or just could go out? Let's say there's some security hole in, I don't know, making something up, gzip or whatever. I've tested it on my workstation, I've tested it in the integration environment, the staging environment, whatever, and it's a security hole. Yes, I'd probably push that on a Friday, because I don't want to be exposed all weekend. I don't want to be vulnerable. There are other things we can deploy on a Friday, too.
I work with customers, or clients, or whatever you want to call them, on making it safe to deploy. Again, you can hear it in what you and I are talking about: you should be able to deploy on Friday. You should be able to deploy anytime, 24/7, 365, without having to go to a vice president and get approval or anything like that. That doesn't mean that you will, just because you can.
Jonathan: I want the ability to go to the hospital 24 hours a day. That doesn't mean I want to go to the hospital 24 hours a day. [laughs] If my hospital closed down every Friday, I think I'd be a little bit upset. [laughs]
Dave: To extend your metaphor, I think that's what a lot of the people who are talking about, "What's this nonsense about not deploying on Fridays?" That's the hospital being shut down on Friday because they didn't want to risk hurting a patient or something like that. The hospital should be open. If I need to go in for treatment, I should be able to. The deployment window on Friday, let's call it, should be open. I should be able to deploy on Friday, but that doesn't mean that I will necessarily choose to do that. Just out of respect for my colleagues.
Jonathan: I take a similar approach when it comes to things like code review. If a team's doing pull requests, the four-eyes rule, I like the culture of four eyes, but I don't like the rule. I want my developers to be able to merge. If they're trying to solve something in an emergency situation, they shouldn't have to wait until Bob's around to review their code to get their fix into production, if it comes down to that, if you know what I mean.
Dave: That's interesting because there's a lot of compliance stuff around that.
Jonathan: In certain areas, there can be, for sure. In that case, you probably just need to make sure that two people are on call, so that Bob can review your code and get it into production in a timely manner.
Dave: It's interesting that you say that, because I don't know if you've read Deming's Out of the Crisis, but he talks about inspection, and he says that 200% inspection is worse than 100% inspection. You're like, what? No, I've got more people looking at it, therefore my quality will be higher. But it's basically what you said: if someone else is also going to be looking at this thing for quality, or whatever you want to call it, then you're going to think, well, if I don't catch it, they will.
You don't look at it quite as closely as you would otherwise, because there's safety in numbers. It turns out that if you're the only person standing between this thing and something awful happening, you're going to look at it a lot more closely, because you're the last line of defense.
I've had clients who were like, "Oh, we require two reviewers who are experts, and then we also require a security person." There's all this code review that happens before a deploy can go out. It turns out, at least according to Deming, that's actually more unsafe than it is safe, because we're dealing with humans here. A unit test is not the same thing: the computer's going to do the same thing every time, because that's what computers do.
Jonathan: I've definitely seen that happen. I think it's one of the strongest arguments I know for continuous delivery and continuous integration: moving those manual QA regression tests into an automated test suite. Every time you throw that thing over the wall, you think, maybe I'll come back to it later. If there's a problem, someone will find it and let me know.
When I'm coaching teams to adopt continuous delivery and deployment, I'm trying to teach them that when you hit merge, that's final; that goes to the customers. I get a lot of pushback on that because it's scary. One of the things I say is, it should be scary. You need that motivation to be sure that the code you write is good. Speaking for myself, I'm always more careful when I know that that merge button means production instead of some review environment or development branch or something like that. At least to that extent, I can completely agree with Deming's point.
Dave: That's the nice thing about the automated testing. You're not relying on the person on the other side of the wall. That is throwing it over the wall. You're not relying on them to have woken up at the right time in the morning or have had their coffee, or have not had an argument with their partner, or all those other things that distract humans from being able to do the exact same thing in the exact same way every single time. That's what the automated tests come in. That doesn't mean that there's no place for exploratory testing, but exploratory testing is not a blocker for being able to get your code into production.
Jonathan: Exactly. One thing I've said before is, if you are afraid to deploy on a Friday, and I want to emphasize "afraid," because if you choose not to, but not out of fear, that's a different thing. If you're afraid to deploy on Friday, use that fear as motivation to improve your process. Would you agree with that, or am I overstating it?
Dave: I think the sentiment is-- Jez Humble said it a lot, and lots of people have said it: "If it hurts, do it more." If you're afraid, then you should definitely address that. To me, that's the "it hurts." That's a pain. If I feel pain about that, I should do it more; I should do more pushing on a Friday. But to your point, how do I make it less painful? How do I make it less scary? Obviously, now we have to work on not just writing more unit tests for the sake of writing more unit tests, but on what we can do to give us a higher degree of confidence about what we're doing and what we're pushing, which makes us more comfortable.
That makes it so that we feel like there's less risk, because that's the whole point of testing. As we know from Continuous Delivery, it's to reduce risk. If you feel that pain, or if you feel scared to deploy on Friday, then, it's weird to say, you should deploy more on Friday. Really, what you should do is work on eliminating that scariness as much as you possibly can. To your point earlier about pressing the merge button, you won't eliminate it entirely. You shouldn't be like, "Oh, no, I'm 100% confident that when I push this button, nothing could possibly ever go wrong." That's impossible.
That doesn't exist in the real world. You can certainly say, well, I don't feel real comfortable about this section of the code, because I don't really understand it. The person who wrote this code left the company three years ago, and no one's had the time to get in and look at it.
Anyone who has looked at it has been like, "I don't understand how this even works." You should feel uncomfortable about that. What can you do to reduce that discomfort, or to reduce that scariness? Is it writing some tests around the output of that section of the code? Is it sitting down and pair programming on this thing for two days, until at least two of us understand how this code works?
Who knows what the answer is there, but definitely, if you have that fear you're talking about, I think you should try to address it. There's no way we'll be able to address it for everything, especially in a large, complex codebase. We have a decent sense of where we feel uncomfortable. Does that mean that's actually the most dangerous part of the code? Who knows? Maybe it is, maybe it's not, but certainly, in terms of what you're talking about, people feeling uneasy about pushing those deploys, I think they can address that.
Jonathan: You've talked a fair amount about testing. What about the other side of addressing this, and that is the observability side and responding to failures that do happen? One side is preventing failures from reaching production. The other is, once they're there, because they do happen, responding quickly. How do you address that element of things?
Dave: That's awesome. I don't know. John [unintelligible 00:37:30] put out a tweet, I think on Friday, about resilience being the ability to respond to outages, not preventing outages, which I thought was pretty awesome. I've spent a lot of time with John's crew, but that observability and stuff is all a part of your ability to respond to an outage.
I worked for a bunch of metrics and monitoring companies; most of my SolarWinds companies were those kinds of things. One of the things that I teach my clients is, the best monitoring you can have is the ability to find answers to questions that you didn't know you had. Everybody's got their take on what that means in the observability space and whose tool is the best for that, and blah blah blah.
For me, let's say I want to know the size of some queue and we didn't have that instrumented before. How quick and easy is it to find out how many items are in that queue? Can I push a code change, and all of a sudden I have a graph that shows the number of items in that queue? That is the thing that gives you the ability to respond to outages and fix them more quickly.
It's the ability to get that information, because all of our incident response is hypothesis-driven. It's rare that we're like, "Well, this happened, so here's the answer." It's, "This happened. I think this is why. Let's do something about that and see if that fixes it." You might be right the first time, and that's awesome when you are. It's all hypothesis-driven, but in order to test the hypothesis, you need to have data. Sometimes we don't have that data already. We have to have the ability to get that data. That's the observability half of it.
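Dave's queue-size example can be sketched in a few lines. This is a minimal, hypothetical illustration; the queue, the registry, and all the names are invented stand-ins, not anything from the episode, and a real system would publish to Prometheus, Datadog, or similar rather than a dict:

```python
from collections import deque

# Hypothetical stand-ins: a work queue and a bare-bones metrics registry.
work_queue = deque()
metrics = {}

def record_queue_depth():
    """The one-line instrumentation change: once this deploys, a
    'queue_depth' series exists and can be graphed immediately."""
    metrics["queue_depth"] = len(work_queue)

work_queue.extend(["job-1", "job-2", "job-3"])
record_queue_depth()
print(metrics["queue_depth"])  # 3
```

The point is less the mechanics than the turnaround time: how quickly can a question ("how deep is that queue?") become a graph you can test a hypothesis against?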
The other half of it really is people's mental models of how the system works. It turns out, in a group, everybody has a slightly different mental model of how stuff works. Sometimes it's the wake-Jonathan-up-at-three-o'clock-in-the-morning situation, because he's the only one who can fix this, because Jonathan happens to have the most complete mental model of how that system works. Other people are like, "I have no idea how the database gets that information. I just don't know," because their mental models are incomplete.
Richard Cook, I'm just going to come back to him on purpose. There's the whole thing in the STELLA report about above the line and below the line. The below-the-line stuff is all of our monitoring, all of our sending stuff to Prometheus or whatever, all the nitty-gritty things that we sort of think of as the system, but the above-the-line stuff is the actual mental models of how it all actually works.
The systems talking to each other are just a bunch of bits and bytes moving around on a bunch of wires. That's not the actual system itself. It's the construction of those bits and bytes into a thing where this talks to that and that gets information; that's the actual system. The electrons are not the system. Having those mental models gives us the ability to respond to things faster, which is why Google does their DiRT exercises and Amazon does their game days and all the other stuff where they induce these failures and see how people respond, because people's mental models are incomplete.
If you start practicing those things, then you develop a better mental model of how the actual thing works. "Oh, I had no idea that it was that old 486 under Jonathan's desk that kept the entire production website running and was responsible for the billions of dollars transacted on our website every day. Who knew it was the 486 under Jonathan's desk? That was set up so long ago, at the beginning of the company."
I think people literally used to tell me that PayPal ran on some old Perl script, or was built around this ancient Perl script or whatever. You have to have a mental model that understands all those different aspects of it, so that when you have an outage, you can fix it more quickly, because you actually understand how the system works. Then you want to do that with your team, because the more people who understand how the system works, the better you are able to respond and the quicker you can get those things back.
It's funny, because you are never going to stop outages. All this stuff about MTTR, and you and I could jump into how awful the concept of MTTR is, all this stuff is like, we're going to eliminate failures. You won't. The only systems that don't have any faults are the ones that are turned off, have no electricity, and are buried under the ground. Entropy affects all of us. Having the ability to respond is really important.
Jonathan: I think every company I've worked at has had that 486, at least one. It wasn't necessarily a literal 486. I do remember recently I was working at an eCommerce company, and it was probably a little bit newer than a 486, probably a Pentium II or something. One day orders stopped coming through, and we're like, "What's going on?" Then someone discovered, oh, that server crashed, the one we forgot even existed. I don't remember what it did, but it was something critical.
Dave: A lot of companies have that. Everyone calls it the legacy system, the one that, if it were ever to go down, nobody knows how to bring it back up. You put the yellow caution tape around it so nobody would go and touch it. Then what? Inevitably the data center loses power at some point, right? Then to bring that system back up, it turns out it's a strange loop: this other system has to be up in order for that system to come up, and the other system requires that system to be up in order for it to come up. They both have a dependency on each other being up before the other one can come up. Then you're like, "Oh no, this is bad."
That's not even a joke. That's real, right? When I was first at Salesforce, they were trying to fail over the systems that run Salesforce, that literally run the company, from one data center to another. It turned out that the data center that was down, or whatever you want to call it, needed to be up for the other system to come up. They were like, "Oh no, this isn't going to work."
The first time they tried to run that exercise, I think it took like seven hours, and eventually it was just a failure, because they were like, "Hey, it turns out this other thing has to be up, and it's not, because that's the thing we're trying to fail over." Then the next time they ran through that exercise, I think it took like four hours. They kept doing it until they could fail over the entire thing in like 20 minutes.
Part of that is understanding the model of how all this stuff actually works, and then working around those things in order to make it go faster. They just kept getting better, like Gene Kim's Third Way of DevOps, right? Repetition and practice lead to mastery. They were able to do that. That's great.
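The "strange loop" Dave describes, two systems each needing the other to be up first, is a cycle in a startup dependency graph, which is why a cold start can fail even when every individual runbook is correct. A minimal sketch, with invented service names, using Python's standard-library graphlib:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical startup dependencies: each service maps to the set of
# services that must be up before it can start.
deps = {
    "legacy": {"database"},    # the legacy system needs the database up first...
    "database": {"legacy"},    # ...but the database needs the legacy system up.
}

try:
    # A topological sort yields a valid boot order, if one exists.
    boot_order = list(TopologicalSorter(deps).static_order())
    print("boot order:", boot_order)
except CycleError as err:
    boot_order = None
    print("cannot cold-start, dependency cycle:", err.args[1])
```

Running a failover exercise is, in effect, how you discover cycles like this before the data center loses power for real.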
Jonathan: We've talked quite a bit about deployment and related topics. Do you have any recommended resources for anybody who's interested in learning how to do more reliable Friday deployments or anything else we talked about?
Dave: Well, we mentioned the Continuous Delivery book a bunch of times. I still think it's a classic, and I think that's the right word to use, meaning it's still very valuable. There are definitely things that have changed in the industry since that book was written, like the ability to do testing in production, which is what a lot of people are all up in arms about. Although, let's be honest, for the vast majority of people, well over 99%, the idea that we're not going to have any pre-production environments and we're just going to test in production, because that's the only place we can be sure? I think that's nonsense.
I get it when people like Twitter or whatever say they do a lot of testing in production because you can't simulate that load; I get that. But I'm sure even Twitter does testing before production. They don't just go, "Well, I wrote it on my laptop, there are no tests, YOLO, press the button and it's in." There's literally zero chance that that's what really happens.
In the same regard that people get a little bit crazy about the Friday deploys, like, "Oh, you should always be deploying, you don't have enough faith in your testing," I think there's a little bit of overlap between that and testing in production. You should definitely test in production; feature flags, dark launching, all that stuff is really important and great skills to have. That doesn't mean you should throw away all other testing. That doesn't make any sense.
There's a little bit of that religious-zealot kind of energy to some of this. It's like, "Well, Twitter tests in production, so therefore that's what our company should do." It's like, well, you have 400 customers you're running with; you're not Twitter. Or as we always say to everybody, you're not Google. Even Twitter and Google don't do all their testing in production, as if production were the only place where tests are valid. That doesn't make any kind of sense. I definitely think the Continuous Delivery book is a great place to start.
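The feature flags and dark launching Dave mentions boil down to a runtime check that separates deploying code from exposing it. A minimal sketch, with a hypothetical flag name and an in-memory store (real systems use a flag service or config store):

```python
# Hypothetical in-memory flag store.
FLAGS = {
    "new_checkout": {"enabled": True, "allow_users": {"alice"}},
}

def is_enabled(flag: str, user: str) -> bool:
    cfg = FLAGS.get(flag)
    if cfg is None or not cfg["enabled"]:
        return False                      # unknown or disabled flags stay off
    return user in cfg["allow_users"]

def checkout(user: str) -> str:
    if is_enabled("new_checkout", user):
        return "new checkout path"        # dark-launched: live for a tiny cohort
    return "old checkout path"            # everyone else stays on the safe path

print(checkout("alice"))  # new checkout path
print(checkout("bob"))    # old checkout path
```

The new code is deployed for everyone, even on a Friday, but only exposed to a small cohort, which is exactly why this complements, rather than replaces, pre-production testing.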
Jonathan: How can people get in contact with you if they're interested? I know that you work with, how did you put it, companies owned by private equity? That's not quite the right phrasing.
Dave: The reason I work with private equity companies is that they generally have a definite appetite for change and making things better. They want to do it on a time scale that's not geologic, is that the word? Whatever. Private equity firms, at least the ones that I work with, generally have a three-to-five-year investment thesis where they want to improve the company and then sell it off for a profit. They don't have forever to figure this stuff out and make things go.
I like working with those portfolio companies because they want to make a change, they want to get better, and they have a strong driver incentivizing them to go ahead and do that: their investor overlords, or whatever we like to call them, don't want them sitting around in the portfolio for 14 years. That's not part of their thesis. I love to work with companies that are serious about making change and want to get better because, like we said earlier about the win, win, win, I love that I get to go in and help make the lives of engineers better, so that they're happier to come to work, they're happier to push code, they feel safer about it, and all that other fun stuff.
I like working with those kinds of companies because they're willing to make those changes, and I don't have to sit back and convince a bunch of EVPs for six months that this DevOps thing actually has legs. It's a real thing, and it's not just a bunch of hippies saying that if you pray to the Agile gods, you will be blessed with more velocity. It's not a bunch of nonsense; DevOps is a real thing. If people are interested in getting good at delivering software, as I like to say, definitely check me out at Mangoteque, M-A-N-G-O-T-E-Q-U-E. It's a play on the fact that my last name is French and I'm doing tech stuff. Or you can certainly find me on LinkedIn or Twitter, places like that.
Jonathan: Great. Well, thanks again, Dave, for coming on. It's been a pleasure, as always. We've had a couple of these conversations before, the first one for this show, but I appreciate your wisdom and your experience.
Dave: Yes, thank you for having me. It's always a pleasure.
Jonathan: Great. Well, until next time. Cheers, everybody.
[00:51:55] [END OF AUDIO]