The two big dangers of rolling back
Broken code is inevitable. Despite our best planning, automated tests, code review, and everything else, sooner or later (probably sooner) you’ll find that you’ve deployed a broken feature into production, and users of the system are complaining.
Our natural instinct in such a case is to quickly jump back to the last working version—a process we colloquially call a “rollback”.
Today I offer two reasons not to do this. Tomorrow I’ll talk about an alternative.
Diverging from the established software release cycle is risky.
On a healthy team, releasing software changes into production should be second nature. It should be as natural and automatic as breathing, and it should require just as little thought on a daily basis. I’m assuming here that the team is using continuous deployment.
On such a team, a rollback to a previous version of software on production servers is a divergence from normal. This isn’t always a problem in practice, of course, but it’s a potential problem area. When you roll back a service version, do you need to roll back any other dependencies, too? Are there any steps that are normally automated, that now need to be performed manually? Do dependent services need to be restarted? Etc.
In theory, of course, a rollback process could be automated as completely as a normal release, to alleviate this concern. And if you find yourself depending on rollbacks, you definitely should do this!
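As a rough illustration of what "automated as completely as a normal release" might look like, here is a minimal Python sketch in which a rollback is just another run of the normal deployment pipeline. Everything in it is hypothetical: the `Deployer` class, the step names, and the log messages are stand-ins for whatever tooling your team actually uses.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical pipeline steps. Each one only records what it would do;
# a real pipeline would call out to real deployment tooling here.
def align_dependencies(version: str, log: List[str]) -> None:
    log.append(f"align dependencies with {version}")

def install(version: str, log: List[str]) -> None:
    log.append(f"install {version}")

def restart_dependents(version: str, log: List[str]) -> None:
    log.append(f"restart dependent services for {version}")

# The same ordered steps are used for every deployment, forward or backward.
PIPELINE = [align_dependencies, install, restart_dependents]

@dataclass
class Deployer:
    history: List[str] = field(default_factory=list)  # every deployment, in order
    log: List[str] = field(default_factory=list)      # every step that was run

    def deploy(self, version: str) -> None:
        """Run every pipeline step for the given version, then record it."""
        for step in PIPELINE:
            step(version, self.log)
        self.history.append(version)

    def rollback(self) -> str:
        """Redeploy the version that preceded the current one.

        The rollback goes through deploy(), so no step is skipped
        or left to be performed by hand.
        """
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        previous = self.history[-2]
        self.deploy(previous)
        return previous
```

The point of the design is that `rollback()` contains no steps of its own. If the dependency and restart questions are answered once in the pipeline, they are answered for rollbacks too.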
A rollback blocks other progress
The other serious problem when doing a rollback is often overlooked: It blocks all other development work from progressing. Of course, here I’m assuming the use of real continuous integration.
To illustrate, let’s imagine a rollback scenario:
- Service version 1.2.4 is released to production. A serious bug is discovered.
- The service is rolled back to version 1.2.3 while the bug is investigated and fixed.
- Meanwhile, all updates to the service are put on hold, lest version 1.2.5 be released before the bug is fixed.
- Eventually the bug fix is applied, and 1.2.5 is released. Now any backlogged work can be committed, and 1.2.6, 1.2.7, etc., are likely to follow quickly.
Even on a small project owned by a single team, this can be a problem. On a large project, this can effectively put an entire department’s work on hold for the duration of the bug fix. And then once the bug is fixed, you get a slew of upgrades in quick succession, effectively negating the risk reduction that CD can offer.
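To make the pile-up concrete, here is a toy Python model of the scenario above. It is a sketch under simplifying assumptions of my own: one change lands per listed day, every change is normally released the day it lands, and anything that lands during the freeze is released on the day the freeze ends. The function names and the day numbers are made up for illustration.

```python
from typing import Dict, List, Tuple

def simulate(commit_days: List[int], freeze_start: int,
             freeze_end: int) -> List[Tuple[int, int]]:
    """Return one (commit_day, release_day) pair per change.

    Outside the freeze, a change is released the day it lands.
    A change landing in [freeze_start, freeze_end) waits until freeze_end.
    """
    releases = []
    for day in commit_days:
        if freeze_start <= day < freeze_end:
            releases.append((day, freeze_end))  # held back by the rollback
        else:
            releases.append((day, day))         # normal continuous deployment
    return releases

def releases_per_day(releases: List[Tuple[int, int]]) -> Dict[int, int]:
    """Count how many changes reach production on each day."""
    counts: Dict[int, int] = {}
    for _, release_day in releases:
        counts[release_day] = counts.get(release_day, 0) + 1
    return counts
```

With one change per day on days 1 through 8 and a freeze covering days 3 to 5 (so `freeze_start=3`, `freeze_end=6`), `releases_per_day` reports a single release on days 1, 2, 7, and 8, but four on day 6. That day-6 cluster is the slew of upgrades: four changes reach production together instead of one at a time.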