Tiny DevOps episode #35 Jonathan Hall — The Butterfly Effect: How a Single Bit Changed My Career

March 8, 2022
This week I share the story of a single bit gone wrong back in 2006, which launched my career on a new trajectory of root-cause analysis, continuous improvement, and DevOps.

Resources
Blog: Joel on Software
Book: Working Effectively with Legacy Code by Michael Feathers
Book: Extreme Programming Explained by Kent Beck
Book: Clean Code by Robert Martin
The Joel Test
Talk: 10+ Deploys Per Day (12:45)
The Jonathan Test
Lean CD Bootcamp
Presentation Slides and notes


Transcript

Jonathan Hall: Ladies and gentlemen, The Tiny DevOps Guy.

[music]

Jonathan: Hi, everyone. Thanks for joining me on another episode of The Tiny DevOps Podcast. I'm your host, Jonathan Hall. On this show, we like to solve big problems with small teams. The guest I had planned for this week fell through at the last minute due to COVID and some travel restrictions. I'll have her back on in a few weeks, but that meant I had an episode to fill this week.

I decided to dig back into my archives and I found a presentation I did at PyCon Australia last year. I thought that might make an interesting episode. It's about how I got started in my career doing software delivery and eventually DevOps. It really fills in some background on how I got involved in this whole industry and this topic, which eventually, of course, led to this very podcast.

Of course, the original presentation did include a slideshow, which you can't see if you're listening to the audio version only. If you're interested in seeing those slides, you can look in the show description; I'll have notes there that link to the YouTube version. If you want to watch this episode, you're welcome to do that. It really shouldn't detract very much from the topic, because it's really just a narrative, a story. I hope you find it interesting, and I hope to see you next time when I have another guest on.

The title of my talk is The Butterfly Effect: How a Single Bit Changed My Career. My agenda for this talk, of course, is to tell you the story of how I started my career in software delivery. I want to encourage you on your own path in that regard, to whatever extent you are involved in software delivery, which I expect most of you are, as you're probably developers of some sort. Then, maybe give you some pointers on how you can improve in some of these areas if that's interesting for you.

Of course, my name is Jonathan Hall. I am The Tiny DevOps Guy. If you Google for that, you'll find me. I have a podcast by that name. I do DevOps consulting primarily for small companies. I'm a lifelong Developer, Operations person. I now do coaching. I'm a Trekkie, a podcast host. I also dance salsa. I'm a father. Many other things I could say about myself, but you're not here to learn about that. I think most of you are more interested in the technical part of my career.

I got into computers actually at the age of eight, when I started programming on my dad's Commodore 64. For many years I just played around, making games and various silly little programs on the computer. 2006 is when I got my first professional programming job. The interview was interesting. The owner of this small company drew three boxes on a whiteboard and said, "These boxes represent the three new servers we just bought. We would like to hire somebody, maybe you, to come in and build these servers out to run our spam filtering software."

Up until that point, the spam filtering software they had written in-house would be installed on an appliance, meaning on a server along with the operating system. They would ship that to the customer's office, and the customer would install this physical server next to their (probably also physical) Exchange server. They had the idea of switching to a cloud-based model, what we now call SaaS or Software-as-a-Service. It was relatively new at that point.

That's what I was hired to do. I was hired to take this existing spam filtering software and turn it into a SaaS that we could sell as a single installation, rather than all these countless installations we had physically installed around the world. The author of this spam filtering software had just left for another job. Up until that point, he had been the only software developer, technical support person, network engineer, all of that, in one person.

When he left, they decided to replace him with two people. They hired a developer to take over the day-to-day, and they hired me to help turn it into a SaaS. Jump ahead just a month after I had started, and it was Thanksgiving Day. For those of you who have not experienced Thanksgiving Day in the United States, let me give a little bit of context here.

Of course, Thanksgiving Day is a national holiday in the United States. Everybody, practically everybody was not working that day, even restaurants closed down. It's almost impossible to buy a stick of gum in the United States on Thanksgiving Day, because everything is closed. What you may also know is that the day after Thanksgiving Day is a big deal because that's known as Black Friday.

That's when everybody rushes out to get the best deals they can on electronics, and on vacuum cleaners, and on whatever else stores might be trying to sell for the holiday shopping season. This particular Black Friday, while the entire country was outside at shopping malls fighting over TV sets, we had a half day of work at the office, and our newest developer, not me, the other guy who joined at the same time, was pushing his first change out to customers.

What could possibly go wrong here, right? Nothing. This is what our phone lines normally looked like; soon they looked more like a Christmas tree. Every single customer, and we had 150-plus appliances installed in the wild, every one of them stopped working. Our phone lines were jammed, and not just that day; this lasted into the next week. Everyone in the company, not just the people working on the spam filtering software, but the sales representatives, the president of the company, even the janitor, was taking phone calls from customers whose mail was not working.

What happened? Well, our developer had made a small, single-bit mistake. He had failed to set the execute bit on a startup script. This prevented, of course, the mail from flowing, but it also prevented future updates from working. We couldn't even create a patch and put it on the central server for the appliances to download, because these appliances wouldn't even start up enough to download a new update. We had to actually log in by SSH to every customer machine and do a chmod +x on this particular script, for every customer.

Now, you might think, "Couldn't you script some of this?" Well, we could, and we did to an extent, but many of these machines were behind firewalls, or the owners had changed the password or deleted SSH keys for security reasons. In some cases, we actually had to call customers over the phone and walk them through the process. If they weren't able to give us access to their network, then we would walk somebody through going into their server room, plugging a monitor into the physical machine, logging in as root on the console, and typing these commands to get things working again.
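If you're curious, the failure mode and the fix are easy to reproduce. This is just a hypothetical sketch (the script name and its contents are made up, not the actual DoubleCheck code): a script without its execute bit can't be run, and a single chmod brings it back to life.

```python
import os
import stat
import subprocess
import tempfile

# Hypothetical stand-in for the appliance's startup script.
script = os.path.join(tempfile.mkdtemp(), "startup.sh")
with open(script, "w") as f:
    f.write("#!/bin/sh\necho 'mail filter started'\n")

# The bug: the execute bit was never set, so nothing could run the script.
print(os.access(script, os.X_OK))  # False

# The one-bit fix we applied over SSH, appliance by appliance (chmod +x):
mode = os.stat(script).st_mode
os.chmod(script, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
print(os.access(script, os.X_OK))  # True
print(subprocess.run([script], capture_output=True, text=True).stdout.strip())
```

One flipped permission bit is the difference between "mail filter started" and 150 jammed phone lines.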

It was quite an ordeal; it took the better part of the following week to get everything fixed. Over the following months, in the aftermath of this incident, we had some other problems. As you might imagine, problems come up all the time in software. One that jumps to mind is that we had some reports that were taking 20 minutes or so to generate from our database (we were using MySQL).

The lead developer (remember, my job was mainly getting the SaaS ready; his job was the core development) and I had a disagreement on how to fix this. I suggested partitioning by date so that we would only query the data we cared about, and he said that would be too complicated and take too long. Well, our mutual boss came to me in private and said, "Jonathan, do you think you can get that partitioning working?"

I threw together a prototype in a few hours and it proved that it would be worth investing in. That upset the lead developer. That wasn't the only thing, of course, but he ended up leaving the company. I became the lead developer of the project. I essentially set out on my own. Of course, we did, over time, hire other people to join the team, and I remained the team lead for several years. During this time, I was reading a lot.
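The idea behind that prototype is simple to sketch. The real fix used MySQL's native table partitioning; the toy Python below (with made-up data) just shows why bucketing rows by date lets a report scan a small fraction of the table instead of all of it.

```python
from collections import defaultdict
from datetime import date

# Made-up message log: two rows per month for 2006.
rows = [(date(2006, m, d), f"msg-{m}-{d}") for m in range(1, 13) for d in (1, 15)]

# "Partition" the table by (year, month), as MySQL partitioning would.
partitions = defaultdict(list)
for when, msg in rows:
    partitions[(when.year, when.month)].append((when, msg))

# Before: a November report scanned every row in the table.
full_scan = [r for r in rows if r[0].month == 11]

# After: it only touches the one relevant partition.
pruned = partitions[(2006, 11)]
print(len(rows), len(pruned))  # 24 2
```

Same answer, a twelfth of the rows scanned; on a table with months of mail logs, that's the difference between a 20-minute report and a fast one.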

Of course, every hero's journey needs a guide and I had no shortage of guides. My guides were people like Joel Spolsky, who you may know as a co-founder of Stack Overflow. He also had a blog, I guess he still does. It's called Joel On Software. He used to blog very frequently, now it's about once a year maybe. I read his blog, front to back, maybe more than once, during this time.

Michael Feathers, he's the author of the famous book, Working Effectively with Legacy Code. I would say more than any other book, this one has changed my career. It is an excellent book, especially if you're working with legacy code, and who isn't? Kent Beck, he's the author of Extreme Programming Explained. He's also the so-called inventor of Test-Driven Development.

He's a very influential person, one of the co-authors of the Agile Manifesto, along with Bob Martin, another co-author. Bob Martin wrote a number of books in the Clean Code series: Clean Code, Clean Architecture. He has some great YouTube talks. All of these people, and many, many others, were my guides during the years that I spent on this project, helping me take it to the next level.

One of the lenses I used to evaluate the project I was working on at the time became the Joel Test, which is a blog post that Joel Spolsky wrote on his Joel on Software blog. I think the post was written in 2000, so I read it six or seven years later. He goes through 12 questions you can ask about your project to gauge its maturity.

Many of the items on this list are timeless; they're still valid. Do you fix bugs before writing new code, for example? Or does your team get distracted with new things before they finish the old things, leaving bugs that sit for months or years or forever? Can you make a build in a single step? This was profound back then, at least. On the other hand, some of the items on his list are a little bit outdated. Do you use source control? That's still valid, and I think practically everybody does. If you don't use source control, I would love to hear why. It's at least assumed that you should be using source control. Do you make daily builds? Maybe not everybody does this, but it's another one of those things where the bar has been raised.

Nowadays, we tend to talk about whether you're doing daily deployments, not daily builds. I felt like it was time for something new. This is why I created The Jonathan Test, which is my 12 steps for better software delivery. It has a little bit different focus than Joel's, but it's also a little more updated, 20 years newer.

As I already mentioned, I feel like everybody should be using source control. That's assumed now; if you're not doing it, get with the game. Now, maybe you're not doing one build per day, but it's not exciting if you are. Let's raise the bar on that one. There's a lot of new technology. It's easier to build things faster, more frequently, than it was in 2000.

Also, we have some new science. Not a lot. There's not a whole lot of science that's gone into the software development process, but we do have some. We have some qualitative and quantitative data that we didn't have in 2000 about what works and what doesn't. For example, we have evidence now that code review is one of the surest ways to reduce defect counts. We didn't know that in 2000; maybe people guessed, but now we have some hard science to back some of that up.

Here's my test. Number one: do devs merge multiple times per day, also known as continuous integration? This is another thing science has shown to be beneficial, in the sense that companies that do continuous integration outperform those that do not, in terms of delivery and revenue.

Is all code tested thoroughly before merging into mainline? When I took on the development project, this was not assumed. Our main branch was broken frequently, and a big part of the software release process was testing the main branch to make sure that it wasn't broken and fixing any bugs that had crept in during the intervening months since the last release. Nowadays, with things like trunk-based development, we can assume that main is always ready to be released.

Continuous deployment. Is your project built and deployed automatically after every merge to mainline? This was almost inconceivable in 2000 when The Joel Test was written. In fact, it was 2009 when the famous talk called 10+ Deploys Per Day was given. That's what birthed the DevOps movement. Back then we were asking, "Is it even possible to deploy that frequently?" Now we know pretty confidently that it is possible. It's more a business question now: should we? I believe you should. [laughs]

Is all code reviewed by another developer before merge? Like I said, this is one of the things that the little science we have has demonstrated pretty conclusively: code review is effective. Do you fix bugs before writing new code? This is the one I borrowed straight from The Joel Test. I believe that you should fix all significant bugs before writing new code. Maybe if your line's off by two pixels, you don't care enough to ever fix it, but if it's something that affects users, you should fix that before doing new work.

Do you have a prioritized backlog of features and bugs? Bugs should be at the top of that list in my view, and then features should be ranked according to whatever criteria your project or product manager has in mind. Do you have a reasonably complete automated test suite? I say reasonably complete, not 100% or 95%, because I think these percentages are almost worthless.

They're so easy to game that you might as well not use those. What's more important is, do you have the confidence when your test suite runs that you haven't introduced a regression? If you don't have that confidence, you should improve it. You should improve your test suite. Don't focus on percentages, focus on confidence.

Does everybody have quiet working conditions? Another one I borrowed from The Joel Test. I think this is really important, especially for knowledge work like we're doing in software development. You should be able to concentrate. Do you use the best tools money can buy? Another one I borrowed. If you need a better IDE, you should be able to get one. If your boss is not willing to pay for it, you might consider getting a new boss. [chuckles]

Get the best IDE for your job. Get the best monitor, the best keyboard. Nobody wants to fight with a mouse that doesn't work correctly when you're trying to code. Get the best tools you can buy, whether that's hardware, software, the best ticket tracking system, the best meeting software, whatever you need to get your job done effectively. The money spent pales in comparison to the productivity lost. Developers are expensive people. Get the best tools you can buy and save your time.

Do you do hallway usability testing? This is one that Joel talked about a lot in his blog. It's one I don't do well enough on. There's a lot you can gain just by having some random person walking down the hallway test your software. I don't mean reading your code; show them a prototype, show them the software.

Look at this screen, what does it say to you? Just simple stuff like that. Just getting a second pair of eyes, somebody who hasn't been looking at that all day to look at your code or to look at your product can go a long way.

No handoffs. This one gets at the heart of the DevOps movement. Do developers control their workflow from start to finish without handoffs? It could be to QA, or to operations, or to any other team. If you have handoffs, you're not doing DevOps. The whole idea of DevOps is to kill those silos.

Can your developers work effectively from an airplane or anywhere else where there's no Wi-Fi connection? This doesn't necessarily mean that you should be able to build an entire working application and run it from your laptop, but you should be able to do a meaningful amount of work without an internet connection.

In my view, it's just common sense. Especially if you have developers who are going to conferences, or have a long commute on a train, or like to work from a cafe somewhere, it should be possible to work when you're offline. That's not just because Wi-Fi doesn't exist everywhere; sometimes GitHub goes down, or your production server goes down, or anything can happen that prevents your entire network from being online and available. You should not cripple your developers in those instances. Let them keep working.

With my 12-step test in mind, let's jump back to when I started with DoubleCheck. What did things look like back then? Did contributors merge their changes into mainline multiple times per day? No, we were not doing this. We did have Subversion for source control, but there was no rule about continuous integration whatsoever.

Code was not thoroughly tested. In fact, as I explained a minute ago, we just assumed that mainline was broken at all times until we were ready to do a release; then we would test mainline. Our project was not built or deployed at all, manually or otherwise, until we were ready to do a release. We failed there.

Was code reviewed by other developers? No, we had no code review policy in place. We did not fix bugs before writing new code. We fixed bugs when they became a top priority. Did we have a prioritized backlog of features? As I recall, when I started there, we did have a backlog, but it was a text file in Subversion that the developer had been keeping. As a request came in, he would add an item to his text file, and when he was done, he would remove it.

That's fine if it works. The problem, of course, is that there's no visibility for the rest of the company. Customer support can't see that. It didn't matter for him because he was a solo operation; he was also technical support. Once you have more than two people, you probably want something a little more centralized than a text file. Of course, if it works, it works.

Did we have a reasonably complete automated test suite? No. I think there were some automated tests, but most of them didn't work correctly. They certainly weren't being used, and it was not a reasonably complete test suite whatsoever. Did everyone have quiet working conditions? Yes, we did. We were a small office. It was relatively quiet. That one is a win.

Did we use the best tools money could buy? Honestly, I don't remember the answer on this one. I'm assuming so. I don't remember any complaints about the tools we had at our disposal, so I'm going to give them a pass. Hallway usability testing? No, we didn't really do that. Did developers control their work from start to finish without handoffs? In a sense, I suppose so, because there was just one developer, or two once I started, but it certainly wasn't a mindset that developers should be able to do all this. It was by accident if anything.

Could our developers work productively from an airplane? I don't remember the details from that time. In 2006, probably not. I'm sure it was not intentional; if it was possible to do meaningful work from an airplane, it was by accident. Final score: 2.5 out of 12. Not so great.

I stayed at DoubleCheck for nine years, leaving in September 2015, almost exactly nine years to the day after I started. Let's see how we scored by the time I left. All of these highlighted in green are things that had improved, and almost everything improved. The only things that did not were, of course, the ability to deploy automatically after every merge (we were not doing continuous deployment), and we still did not do usability testing. That's one I fail on frequently. I've done it at times, but it's not something that's ever become a habit for me.

Could we work productively from an airplane? To some extent we had gotten better about that, but there were still parts of our software that just didn't work without a network. In fact, I remember writing a Perl module to allow us to do DNS-based tests without a network, so we could cache DNS responses. There was effort being made there, but we hadn't reached our goal.
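The original was a Perl module, but the caching idea translates to any language. Here's a minimal Python sketch of the same trick (the hostname and address below are made-up, pre-seeded test data, not real infrastructure): answers already in the cache never touch the network, so the tests still pass on an airplane.

```python
import socket

# Pre-seeded cache so tests can run with no network at all.
# "mail.example.com" -> a hypothetical address from the TEST-NET range.
_dns_cache = {"mail.example.com": "192.0.2.10"}

def resolve(host):
    """Return a cached answer if we have one; only hit the network on a miss."""
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)
    return _dns_cache[host]

print(resolve("mail.example.com"))  # served from cache, no network needed
```

In the real module the cache was warmed from recorded lookups; the point is simply that DNS-dependent tests stop depending on a live resolver.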

Final score: 9.5. We gained seven points over a nine-year period. I guess that's something. Keep in mind I was learning as I went here; this was my first programming job. Over those nine years we had two or three other developers come and go. Not bad.

Let's jump ahead to 2020, last year. In January, I started with a small startup called Lana. They're based in Madrid, Spain; they're a FinTech startup. When I started in January, only the CTO was able to do deployments. Developers were at his mercy. When they wanted something deployed, they had to ask the CTO, and if he had time and agreed that it was something that should be deployed, and it didn't have any external dependencies and so on, he would deploy it within a day or two, probably.

The only one of these 12 items that Lana had was quiet working conditions, and that was mainly because it was mostly a remote company; most people worked from their homes. Final score: one, when I started in January. Less than 12 months later, in December when my contract ended, these are the changes I had been able to implement. Of course I didn't do it alone, but I was a driving force behind most of this.

As you can see, it's a big change. Almost everything. One of the only things that wasn't done was a completely automated test suite. We were still doing a fair amount of manual testing. You never completely get rid of manual testing, but we were doing more of it than we would've wanted. We also did not have the best tools money could buy. I had some developers ask for an IDE and they were denied. I disagreed with that, but I didn't control the budget, so I didn't get to make that call.

We still didn't do much hallway usability testing, and we didn't have a no-handoffs culture yet. We were still only partway through the DevOps mindset transition, in the sense that we did have manual QA, and some of the developers were handing things off to manual QA for testing before deployment.

Well, it's still pretty good progress. I would say a final score of 9 after 12 months. Plus 8 over 12 months; that's pretty encouraging in my view. I would like to invite you to join me in the goal of improving your Jonathan Test score. You don't have to use my test; use whatever test you want, but the idea is: can you improve in these important areas? If you want to see the Jonathan Test, you can go to jhall.io/test and score yourself.

I'd also like to invite you to my free 10-day email course called the Lean CD Bootcamp. It's really designed for teams that aren't doing continuous deployment yet. If you're not doing continuous deployment, it breaks the transition down into 10 simple steps that you can apply. Spoiler alert: you don't have to have automated testing before you do continuous deployment. That's the point of the "Lean" here.

Get your continuous deployment in place first, and then you can start to add the automation steps later. If this is interesting to you, I encourage you to go to leancdbootcamp.com. It's completely free; sign up for the 10-day course. If you need any help with that, or with any of the things I've talked about, I'm always happy to talk to you. You can go to my website, you can contact me by email, or follow me on LinkedIn or Twitter at Tiny DevOps.

[music]

[00:24:41] [END OF AUDIO]


Related Content

10 deploys per day? BORING!

In 2009, many thought 10+ deploys per day was inconceivable. Today it's boring. I call that progress!

Talk Notes: Scaling CD Down

CD Without CI

Conventional wisdom tells us that an automated test pipeline is the necessary first piece to Continuous Deployment. I challenge that thinking.