99th percentile

Yesterday I talked about some problems surrounding MTTR.

“So what’s the alternative?” many will say.

There’s no single answer… but I’d like to offer one possibility. It’s not perfect, but it could be a simple improvement:

Rather than calculating the mean time to recovery, why not simply calculate the 99th percentile?

This is already standard practice when measuring things like latencies or page load times, because it is so widely recognized that these sorts of measurements have a long tail, exactly as time to recover from an incident does.

Keep in mind, this is no silver bullet. Percentiles are susceptible to gaming too, and they can still misrepresent data sets with long tails. But if you already have a bunch of historical data on incident recovery times, I’d venture that an immediate improvement could be had by switching from a mean calculation to a 99th percentile calculation.
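
To make the switch concrete, here is a minimal sketch in Python. The recovery times are made up for illustration, and `percentile_nearest_rank` is a hypothetical helper using the nearest-rank method, which is only one of several common ways to define a percentile.

```python
import math
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (integer pct, 1-100) by nearest rank."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of the data at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical recovery times in minutes: mostly quick, with one long outage.
recovery_minutes = [12, 15, 9, 20, 18, 11, 14, 16, 13, 480]

mttr = mean(recovery_minutes)
p50 = percentile_nearest_rank(recovery_minutes, 50)
p99 = percentile_nearest_rank(recovery_minutes, 99)

print(f"mean: {mttr:.1f} min, p50: {p50} min, p99: {p99} min")
```

In this made-up data set the mean lands somewhere no incident actually sits, while the 99th percentile reports the long tail directly. Note that with only a handful of incidents, the 99th percentile is effectively just the worst one, so the number gets more meaningful as your history grows.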
