Optimize for recovery time, not defect count

June 5, 2021
Even if we reduce defects to one per year, if repair takes days, any gain is lost.

Regular readers of mine know that I’m a strong advocate for continuous deployment. But many teams are afraid of implementing CD for fear that more bugs might slip through than with a more manual process.

The problem with this concern is that it’s focused on the wrong thing.

The focus is on reducing the number of defects that get released (in formal terms, increasing the Mean Time Between Failures, or MTBF).

A much more useful goal is to reduce the time it takes to fix defects when they do occur (formally known as Mean Time to Repair, or MTTR).

Why?

Even if we can reduce our defect count to, say, one per year, if repairing that one defect takes days or weeks, the impact will be severe.

On the other hand, if we can cut the time it takes to detect and repair a defect to, say, minutes, then the impact of even hundreds of defects may be minimal.
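To put rough numbers on that comparison, here is a minimal sketch of the arithmetic. The figures are made up for illustration, and it assumes that impact scales linearly with time spent broken and that defects never overlap, which is a simplification.

```python
def annual_impact_hours(defects_per_year: int, repair_hours: float) -> float:
    """Total hours per year spent in a broken state (defects x time to repair)."""
    return defects_per_year * repair_hours


# Hypothetical team A: one defect per year, three days to repair it.
rare_but_slow = annual_impact_hours(defects_per_year=1, repair_hours=3 * 24)

# Hypothetical team B: 200 defects per year, five minutes to repair each.
frequent_but_fast = annual_impact_hours(defects_per_year=200, repair_hours=5 / 60)

print(f"rare but slow:     {rare_but_slow:.1f} hours of impact per year")
print(f"frequent but fast: {frequent_but_fast:.1f} hours of impact per year")
```

With these illustrative inputs, the team shipping 200 defects a year accumulates roughly 17 hours of impact, against 72 hours for the team that ships only one. The multiplication cuts both ways, of course: push the defect count high enough and fast repair stops paying for it, which is why the goal is minutes rather than hours.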
