An SLA is a business metric, not a technical one

Is a 99.999% SLA unrealistic? Does it actually matter?

Think about the company or project you’re currently working on.

Should you offer a 99.999% (“5 nines”) SLA to your clients or users?

Think about it for a moment…

I'll wait...

What things did you consider to decide if you can offer a 99.999% SLA?

Here are some of my guesses. How many were close?

  • “Our service is complex. Any small component could fail…”
  • “What does ‘available’ even mean? The web site loads? Users can make a purchase? Users can use every feature?”
  • “Do bugs count as downtime?”
  • “We can’t even do a zero-downtime service upgrade yet…”
  • “How long was that last outage?”
  • “How much downtime did we have last year?”
  • “5 nines? That means only 5 minutes of downtime per year!”
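
If you want to check that last number yourself, here is a minimal Python sketch of the arithmetic. The function name and the 365-day year are my own assumptions for illustration; a real SLA defines its own measurement period.

```python
# Assumption: a 365-day year, ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year allowed at the given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Compare a few common uptime targets, from "2 nines" to "5 nines".
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Five nines works out to a little over 5 minutes per year, which is where the "5 minutes" figure comes from.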

You probably came pretty quickly to the conclusion that 99.999% uptime is unrealistic for your application.

Question answered. Right?

No.

I didn’t ask if 99.999% uptime was realistic. I asked if you should offer a 99.999% SLA to your clients.

These questions appear related, but they’re almost completely unrelated.

If you’re like younger me, and practically every other engineer, CTO, or tech-minded person in the world, you hear “99.999% uptime”, and you start thinking of technical feasibility. And there is a place for that. But that’s not what an SLA is actually for.

An SLA is an agreement, usually contractually enforced, between a service provider and a client. It explains the expected level of service and… now this is the important part… the repercussions for non-compliance.

Focusing on the uptime percentage alone is like falling for the age-old trap of “You name the price, I’ll set the terms”.

Consider the whole package.

Now I haven’t seen an extensive number of SLA contracts, but I’ve seen a few, and from both sides of the bargaining table. In some cases, the penalty may just be a prorated credit for services not rendered. If the service is down for 5 hours during a 30-day month, the client receives a 0.69% discount (5 hours / 720 hours).
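
To see how small that kind of penalty is in practice, here is a rough Python sketch of the prorated-credit math. The function names and the $1,000 monthly fee are hypothetical, chosen only for illustration; an actual contract will spell out its own formula.

```python
def prorated_credit_percent(downtime_hours: float, hours_in_period: float = 720.0) -> float:
    """Percentage of the period's fee credited back for downtime.

    Assumption: a 30-day billing month (720 hours) unless told otherwise.
    """
    if not 0.0 <= downtime_hours <= hours_in_period:
        raise ValueError("downtime_hours must be between 0 and hours_in_period")
    return 100.0 * downtime_hours / hours_in_period


def credit_amount(monthly_fee: float, downtime_hours: float, hours_in_period: float = 720.0) -> float:
    """Credit owed to the client, in the same currency as monthly_fee."""
    return monthly_fee * prorated_credit_percent(downtime_hours, hours_in_period) / 100.0


if __name__ == "__main__":
    fee = 1000.0  # hypothetical monthly fee
    hours_down = 5.0
    print(f"Credit: {prorated_credit_percent(hours_down):.2f}% of the monthly fee")
    print(f"On a ${fee:,.2f} plan, that is ${credit_amount(fee, hours_down):.2f}")
```

On that hypothetical plan, a 5-hour outage costs the provider less than $7.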

In another example, I saw a requirement that, for severe outages, a 100% refund be issued for the month.

Now I can imagine you thinking of new questions…

  • “What are the terms of our current SLA with clients?”
  • “Would my customers be happy with a prorated discount for services missed? That’s pennies…”
  • “Actually, a 100% monthly refund is pennies, too.”
  • “We would absolutely love to take on this huge client, even if we have to give them a free month of service once or twice a year.”
  • “Wait, if SLAs can have such weak penalties, who cares about them?”

Oh, that last one is a great one. I’m glad you asked!

Lawyers.

Compliance officers.

Those types.

SLAs are mostly a CYA tactic. “Nobody got fired for buying from a vendor with a 99.999% SLA.”

Now you shouldn’t take this as carte blanche to write toothless or unethical SLAs. And, of course, you need to keep your customers happy, too. And if they honestly expect 99.999% uptime, and you’re only delivering 98% uptime, they may take their 2 months of free service and leave.

Just consider the entire SLA. Not just the number. And remember it’s a business metric primarily. Not a technical one.