How serious is a deployment failure?
A deployment failure should go through standard alert channels, but should it page whoever is on call?
Continuing on the theme of monitoring deployments, another response I received suggested (paraphrased):
Arguably, a production deployment failure should alert through standard production alerting, and go to whoever is on call.
I think this is a good point. But there’s also some nuance to unpack.
First, I agree that deployment failures should go through the standard alerting process in most cases.
However, I’m not sure that should always mean paging whoever is on call. It might mean that. But it might not. It depends on what a failed deployment means for your business.
If you’re using Kubernetes or Amazon ECS, or any other system that defaults to a rolling deployment strategy, a failed deployment usually just means that the service keeps working, but on the older version of the software. This is the case at the company I mentioned at the link above.
In such a case, an on-call alert may not be appropriate. I’d venture a guess that for most companies, a failed deployment that goes unnoticed at 3am can wait until the start of business the next day to be fixed.
On the other hand, if a failed deployment means your service goes down… that obviously changes things significantly. In such a case, you probably do want to page whoever is on call. (And also consider finding a more robust service upgrade procedure 😉).
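To make the distinction concrete, here is a minimal sketch of that routing decision in Python. Everything in it is hypothetical and for illustration only: the names `FailedDeployment`, `Route`, and `route_failed_deployment` are mine, and the `still_serving` flag stands in for however your own platform reports that the previous version is still handling traffic.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """Where a failed-deployment alert ends up (hypothetical labels)."""
    PAGE_ON_CALL = "page"          # wake someone up now
    NEXT_BUSINESS_DAY = "ticket"   # standard alerting, but nobody gets paged


@dataclass
class FailedDeployment:
    service: str
    # Assumption: something upstream can tell us whether the old version
    # is still serving traffic, as it typically would after a failed
    # rolling deployment.
    still_serving: bool


def route_failed_deployment(failure: FailedDeployment) -> Route:
    """Page only when the failed deployment took the service down."""
    if failure.still_serving:
        # The service keeps working on the older version; this can wait.
        return Route.NEXT_BUSINESS_DAY
    # The failed deployment means an outage; treat it like one.
    return Route.PAGE_ON_CALL
```

In both cases the failure still goes through the standard alerting process. The only thing that changes is whether it interrupts someone's night.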