My worst deployment war story
What’s the worst deployment experience you’ve had?
I recently spoke about a deployment disaster from very early in my career, one that had a profound impact on my career’s trajectory.
But believe it or not, that wasn’t the worst deployment experience I’ve had.
I once worked for a company (which I won’t name, to protect the guilty) where the deployment process worked roughly like this:
- Announce your intention to deploy on the company chat, then wait a few minutes in case there are objections (i.e. someone knows they’ve broken master)
- Execute some manual shell commands to kick off the build and deploy process
- When the build is complete, and if there were no errors, confirm your intention to deploy to the canary servers
- Once deployed to the canary servers, announce that it’s testing time to the company chat
- Read through several pages of manual test scenarios, executing all the tests as described. If done alone, this would typically take 60-90 minutes. If others were helping, you could cut the time down by doing the tests in parallel.
- Do any other tests you feel might be warranted, in light of whatever change you’re pushing.
- Once the team gives the green light on the test results, instruct the deploy script to deploy to the production servers.
- Do some spot testing in production, to make sure things are still running smoothly.
- When you’re confident, instruct the deployment process to finalize the rollout to all production servers.
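To make the tedium concrete, here’s a rough sketch of what that checklist might look like if you scripted the prompts. Every command, flag, and script name here is hypothetical, not the company’s actual tooling; the point is just how many human-in-the-loop steps were involved.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the manual deploy checklist described above.

None of these commands or script names are the company's real tooling;
they only illustrate how many human-in-the-loop steps the process had.
"""
import subprocess
import sys


def confirm(prompt: str) -> None:
    """Stop and wait for a human to type 'yes' before continuing."""
    if input(f"{prompt} [yes/no] ").strip().lower() != "yes":
        sys.exit("Aborting deployment.")


def run(cmd: list[str]) -> None:
    """Run a shell command and abort the deploy if it fails."""
    print(f"$ {' '.join(cmd)}")
    subprocess.run(cmd, check=True)


def main() -> None:
    confirm("Announced in company chat and waited for objections?")

    # Kick off the build (hypothetical command).
    run(["./build.sh", "--target", "release"])

    confirm("Build finished with no errors; deploy to canary servers?")
    run(["./deploy.sh", "--stage", "canary"])

    confirm("Announced canary testing in chat, and finished the 60-90 "
            "minutes of manual test scenarios (plus any extra tests)?")

    confirm("Team gave the green light; deploy to production?")
    run(["./deploy.sh", "--stage", "production"])

    confirm("Spot tests in production look good; finalize the rollout "
            "to all production servers?")
    run(["./deploy.sh", "--stage", "production", "--finalize"])

    print("Deployment complete. Only took 2-3 hours of your day.")


if __name__ == "__main__":
    main()
```

Even written down like this, nearly every step blocks on a human reading chat, running tests by hand, or making a judgment call.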
That probably looks tedious, but at least it’s fairly straightforward, right? On a good day, it would typically take 2-3 hours to deploy the service into production, if you didn’t cut any corners.
That’s on a good day.
The (mostly ignored) policy of the group was to do a deployment every time you made a change. With dozens of developers working on the project in parallel, this was obviously impossible. With 50 developers, even if each developer made only a single change per week, that’s 50 deployments at 2-3 hours apiece: 100-150 hours of deployment time per week. The system would be in perpetual deployment, with no room for failed deployments.
So in practice what happened was that most people would push their changes, then wait a day or two, hoping someone else would feel a greater urgency to go through the pain of deployment on their behalf.
This in turn led to a lot of half-baked features being pushed to master, which would break the deployment in ways that were impossible for the person doing the deployment to debug.
I once calculated the deployment failure rate, based on the deployment logs. It was well over 50%. And that only counts the major failures that prevented a deployment, with no accounting for regressions or other bugs that were introduced but not detected during the deployment process.
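The calculation itself was trivial. It looked something like this sketch, which assumes a purely hypothetical log format with one line per deployment attempt ending in SUCCESS or FAILED:

```python
#!/usr/bin/env python3
"""Sketch of the failure-rate calculation, assuming a hypothetical log
format: one line per deployment attempt, ending in SUCCESS or FAILED."""
import sys

total = failed = 0
for line in sys.stdin:
    # Take the last whitespace-separated token as the attempt's status.
    status = line.rsplit(maxsplit=1)[-1] if line.strip() else ""
    if status in ("SUCCESS", "FAILED"):
        total += 1
        failed += status == "FAILED"

if total:
    print(f"{failed}/{total} deployments failed ({failed / total:.0%})")
```

Run against the logs (e.g. `python3 failure_rate.py < deploy.log`), a number like that is hard to argue with.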
Over the 12 months I worked on this project, there were two instances in which deployments were blocked for more than a week.
What is your worst deployment story? Let me know.