Don't be so mean
The mean is useless when severe outliers are present, as is virtually always true when calculating MTTR.
Let’s talk about Mean Time to Recovery, or MTTR.
It’s often cited as one of the DORA metrics. And many organizations measure and report this statistic.
Not so fast.
MTTR is actually pretty useless.
Let me make my case.
Let’s say that last year your service had 10 outages. 9 of them lasted one hour, and the tenth lasted 8 hours. That’s 17 hours of outage, over 10 incidents, for an MTTR of 1.7 hours.
Let’s say you then improve things. This year, you have only one outage. And it takes 3 hours to resolve. Woot! Now you have a shiny new MTTR of… 3 hours. Oh crap!
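To make the arithmetic concrete, here is a minimal sketch of that first scenario. The `mttr` helper and the variable names are mine, not part of any standard tooling, and the durations are simply the made-up numbers from above.

```python
# Outage durations in hours, taken from the scenario above.
last_year = [1] * 9 + [8]  # nine 1-hour outages and one 8-hour outage
this_year = [3]            # a single 3-hour outage


def mttr(durations):
    """Mean time to recovery: total downtime divided by incident count."""
    return sum(durations) / len(durations)


for label, durations in (("last year", last_year), ("this year", this_year)):
    print(
        f"{label}: {len(durations)} incidents, "
        f"{sum(durations)} hours down, MTTR {mttr(durations):.1f} hours"
    )
```

Incidents fell from 10 to 1 and total downtime fell from 17 hours to 3, yet the only number that got worse is the one being reported.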
Let’s try another one.
We’ll start with the same 10 outages, with an MTTR of 1.7 hours as in the first scenario.
But this time, in the new year, new management changed things and demanded we push out a bunch of new features. Some of them broke things. Fortunately, they were easy fixes. So this year we had the same 3-hour outage as in the first scenario, plus an additional 50 outages which were each fixed in 5 minutes. That’s 430 minutes of outage over 51 incidents, for an MTTR of 8.4 minutes. Big win! Ehh…
But these are just made-up examples that play to the extremes, right?
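Here is the same kind of sketch for this second scenario, this time in minutes. The `summarize` helper is hypothetical, and the point is only to put the incident count next to the MTTR so the two can be read together.

```python
# Outage durations in minutes, taken from the scenario above.
last_year = [60] * 9 + [480]  # nine 1-hour outages and one 8-hour outage
this_year = [180] + [5] * 50  # one 3-hour outage plus fifty 5-minute outages


def summarize(durations_min):
    """Return incident count, total downtime, and MTTR (all in minutes)."""
    count = len(durations_min)
    total = sum(durations_min)
    return {
        "incidents": count,
        "total_minutes": total,
        "mttr_minutes": round(total / count, 1),
    }


print("last year:", summarize(last_year))
print("this year:", summarize(this_year))
```

The MTTR drops from 102 minutes to about 8.4, while the number of incidents goes from 10 to 51. The mean alone says nothing about how many times users were interrupted.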
Maybe… so let’s try a more realistic thought experiment. Think of your current MTTR. If you know it from measurement, use that. If not, just guess. It’s not important that it’s accurate.
Now imagine that at this time next year, you learn that the MTTR went down by 10%. What would this tell you about the reliability of your service? You might assume it means that each recovery got faster, by 10% on average. And while that is one possible explanation, it’s far from the only one. Maybe you’ve just had 20% more outages, but with easier resolutions.
Or imagine that your MTTR goes up by 10%… what can you know about the reliability of your service from this? Maybe you’ve just had 20% fewer outages, and they were the easier ones to solve.
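Both of those alternative explanations are easy to reproduce. The sketch below uses an entirely hypothetical baseline (eight 66-minute outages and two 36-minute ones, chosen only so the percentages come out round) and shows the MTTR moving 10% in each direction without any individual recovery getting faster or slower.

```python
def mttr(durations):
    """Mean time to recovery: total downtime divided by incident count."""
    return sum(durations) / len(durations)


def pct_change(new, old):
    """Relative change from old to new, as a fraction."""
    return (new - old) / old


# Hypothetical baseline, in minutes: 600 minutes over 10 outages.
baseline = [66] * 8 + [36] * 2

# 20% more outages: the same ten, plus two quick 24-minute ones.
more_outages = baseline + [24, 24]

# 20% fewer outages: the two easy 36-minute ones never happened.
fewer_outages = [66] * 8

for label, durations in (
    ("more outages", more_outages),
    ("fewer outages", fewer_outages),
):
    change = pct_change(mttr(durations), mttr(baseline))
    print(f"{label}: MTTR {mttr(durations):.0f} min ({change:+.0%}), "
          f"total downtime {sum(durations)} min")
```

In the first case the MTTR improves while total downtime rises from 600 to 648 minutes. In the second the MTTR worsens while total downtime falls to 528 minutes.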
Generally speaking, the mean of any data set is of limited usefulness if it has any of the following characteristics (source):
- if severe outliers are present
- if the distribution is multi-modal
- if the distribution has infinite variance
The first of these is exceedingly common in IT systems. So common, I’d say it’s the norm. The other two are also quite common.
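As a quick illustration of the first point, here is how much a single outlier moves the mean, using the ten outages from the earlier example. The median is included purely for comparison; this is a sketch using Python's standard `statistics` module, not a recommendation of any particular statistic.

```python
from statistics import mean, median

# Outage durations in hours: nine 1-hour outages and one 8-hour outlier.
with_outlier = [1] * 9 + [8]
without_outlier = [1] * 9

print("mean with outlier:     ", mean(with_outlier))
print("mean without outlier:  ", mean(without_outlier))
print("median with outlier:   ", median(with_outlier))
print("median without outlier:", median(without_outlier))
```

One incident out of ten accounts for the whole gap between a 1-hour mean and a 1.7-hour mean, while the median does not move at all.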
If you have historical data regarding incident recovery times, change from a mean calculation to a 99th percentile calculation.
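If you want to try this on your own data, here is a minimal sketch. It assumes the nearest-rank definition of a percentile, which is only one of several common definitions, and the `percentile` helper is hypothetical. The sample data is the made-up set of outages from the second scenario above.

```python
import math


def percentile(durations, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it (0 < p <= 100)."""
    if not durations:
        raise ValueError("need at least one duration")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(durations)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Recovery times in minutes: one 3-hour outage plus fifty 5-minute outages.
recovery_minutes = [180] + [5] * 50

print("mean:", round(sum(recovery_minutes) / len(recovery_minutes), 1))
print("p50: ", percentile(recovery_minutes, 50))
print("p99: ", percentile(recovery_minutes, 99))
```

With so few incidents, the 99th percentile is effectively the worst recovery on record, so it keeps the 3-hour outage visible where the mean averages it away.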