When not to monitor your systems
August 13, 2022
With too many alerts, you can be paralyzed into inaction.
I recently asked you, “Do you monitor deployments?”
One response, which I’ll leave anonymous to protect the innocent, said:
What if half of them are already breaking and that’s the norm? Haha… I shouldn’t laugh.
What an honest response. And what a common one!
This exact thing happens all the time, in all sorts of scenarios. You want to start monitoring deployments, but half of them fail… so it’s just noise.
Or maybe you start monitoring for 500 errors from your HTTP service, and you get thousands per day.
Or you implement a linter, and suddenly your CI pipeline lights up like a New Year’s Eve fireworks show.
The problem is that each of these alerts represents a real potential problem. But with so many of them, what can you do?
Honestly, not much. Too many alerts lead to alert fatigue, and people start to ignore them.
If you’re facing one of these situations, it’s often best to find some way to filter your alerts, so that you have a lower volume that you can actually respond to.
If half of your deployments are failing, it may not be possible to ignore any of those failures. But if your HTTP service is producing 10,000 500 errors per day, you probably can ignore many of them, at least for now.
Maybe configure your alerting to monitor only 500 errors on certain critical endpoints. If you’re doing eCommerce, for example, just monitor for 500 errors during the checkout or payment process. Hopefully that’s a more manageable and actionable number of alerts. Once you get that under control, cast your net wider, and start alerting for some other areas of the system.
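To make that concrete, here is a minimal sketch of that kind of filter in Python. It assumes your request logs can be reduced to a path and a status code. The RequestLog type, the should_alert and filter_alerts functions, and the /checkout and /payment prefixes are all made up for illustration. In practice you would more likely express the same rule in whatever alerting tool you already use.

```python
from dataclasses import dataclass

# Hypothetical path prefixes; substitute whatever is critical in your system.
CRITICAL_PREFIXES = ("/checkout", "/payment")


@dataclass
class RequestLog:
    """One logged HTTP request (an assumed, simplified shape)."""
    path: str
    status: int


def should_alert(entry, critical_prefixes=CRITICAL_PREFIXES):
    """Return True only for 500 responses on a critical endpoint."""
    # str.startswith accepts a tuple and matches any of its prefixes.
    return entry.status == 500 and entry.path.startswith(critical_prefixes)


def filter_alerts(entries, critical_prefixes=CRITICAL_PREFIXES):
    """Keep only the log entries that are worth alerting on for now."""
    return [e for e in entries if should_alert(e, critical_prefixes)]
```

When the checkout and payment alerts are under control, widening the net is just a matter of adding another prefix to the tuple.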
Bottom line: Don’t let an overwhelming number of problems or alerts paralyze you and your team. Don’t try to fix or monitor everything at once. Start small, and expand.