How serious is a deployment failure?

August 14, 2022
A deployment failure should go through standard alert channels, but should it page whoever is on call?

Continuing on the theme of monitoring deployments, another response I received suggested (paraphrased):

Arguably, a production deployment failure should alert through standard production alerting, and go to whoever is on call.

I think this is a good point. But there’s also some nuance to unpack.

First, I agree that deployment failures should go through the standard alerting process in most cases.

However, I’m not sure that should always mean paging whoever is on call. It might mean that. But it might not. It depends on what a failed deployment means for your business.

If you’re using Kubernetes or Amazon ECS, or any other system that defaults to a rolling deployment strategy, a failed deployment usually just means that the service keeps working, but on the older version of the software. This is the case at the company I mentioned at the link above.

In such a case, an on-call alert may not be appropriate. I’d venture a guess that for most companies, a failed deployment that goes unnoticed at 3am can wait until the start of business the next day to be fixed.

On the other hand, if a failed deployment means your service goes down… that obviously changes things significantly. In such a case, you probably do want to page whoever is on call. (And also consider finding a more robust service upgrade procedure 😉.)
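
To make the distinction concrete, here is a minimal sketch of that routing decision in Python. Everything in it is hypothetical and for illustration only: the `DeploymentFailure` type, the `old_version_still_serving` flag, and the two `Route` values are my own names, not part of any real alerting tool. How you'd actually determine whether the old version is still serving depends entirely on your platform.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """Hypothetical alert destinations; adapt to your own alerting setup."""
    PAGE_ON_CALL = "page-on-call"
    TEAM_CHANNEL = "team-channel"  # reviewed at the start of the next business day


@dataclass
class DeploymentFailure:
    """Hypothetical description of a failed deployment."""
    service: str
    # True when the previous version is still handling traffic,
    # e.g. a rolling deployment that never completed.
    old_version_still_serving: bool


def route_alert(failure: DeploymentFailure) -> Route:
    """Decide where a deployment-failure alert goes.

    Assumption: the only input that matters is whether the service
    is still up on the older version. Real policies may weigh more.
    """
    if failure.old_version_still_serving:
        # The service keeps working, so this can usually wait until morning.
        return Route.TEAM_CHANNEL
    # The failed deployment took the service down: wake someone up.
    return Route.PAGE_ON_CALL


if __name__ == "__main__":
    stalled = DeploymentFailure("checkout", old_version_still_serving=True)
    outage = DeploymentFailure("checkout", old_version_still_serving=False)
    print(route_alert(stalled).value)
    print(route_alert(outage).value)
```

Either way the failure goes through the same alerting pipeline. Only the destination changes, based on what the failure means for the business.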
