Alert (or alarm) fatigue, also known as alert overload, is a real problem for developers and operations people who support always-on services.
If you are lucky, you may not know what it means, so let’s start with a definition.
Alert fatigue is what happens to your team when they frequently receive an overwhelming number of alerts. Getting a lot of alerts for one incident every week or so is not alert fatigue. It has to happen frequently.
Alert fatigue is a big concern for services and for those who are responsible for them. One study showed that for every reminder of the same alert, the attention of the alerted person dropped by 30%. When these unnecessary and often redundant alerts and their notifications become routine, they create three major risks:
- Slow response times;
- Missed or ignored alerts;
- Burnout.
Alert fatigue threatens our service reliability and employee health. That is why the topic deserves attention. In this article, we will cover key aspects of minimizing, and hopefully avoiding, alert fatigue.
You can’t fix what you can’t measure
On-call engineers and team leads are usually the only ones who can see the negative impact of alert fatigue. You may hear them say ‘we get a lot of alerts’, but we can’t approach a problem with just words and good intentions. Solving a challenging problem like alert fatigue requires both technical and people skills. First, we need buy-in from our stakeholders so that we can spend time and money on it. That is why we need data: to detect the symptoms and make a diagnosis.
A good way to start measuring alerting is by building an on-call program. If done right, this will give you a lot of data to work on both on-call schedules and alerts. Tools help with this challenge, but due to the human aspect of it, there is far more to cover. A 100-page ebook I authored called ‘On Call’ is a good start. The book goes deep into topics like on-call scheduling, escalations, roles and responsibilities, and compensation.
Before talking about alerts specifically, one action item deserves a closer look: having a centralized alerting and on-call management system. This helps teams see who spends time on what types of issues, and for how long. Once patterns are identified, it is much easier to start fixing the problem.
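As a rough illustration of the kind of questions such centralized data can answer, here is a minimal Python sketch. The `AlertRecord` fields, the sample records, and the `summarize` helper are all hypothetical; a real on-call tool will expose its own export format.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class AlertRecord:
    # Hypothetical fields; adapt them to whatever your alerting tool exports.
    service: str          # service that raised the alert
    responder: str        # person who was notified
    minutes_spent: float  # time spent handling the alert
    actionable: bool      # did the responder actually have to do something?


def summarize(records):
    """Return alerts per service, minutes per responder, and the non-actionable share."""
    alerts_per_service = Counter(r.service for r in records)
    minutes_per_responder = defaultdict(float)
    for r in records:
        minutes_per_responder[r.responder] += r.minutes_spent
    noise = sum(1 for r in records if not r.actionable)
    noise_ratio = noise / len(records) if records else 0.0
    return alerts_per_service, dict(minutes_per_responder), noise_ratio


# Made-up sample data, only to show the shape of the output.
records = [
    AlertRecord("checkout", "ayse", 30, True),
    AlertRecord("checkout", "ayse", 5, False),
    AlertRecord("search", "bob", 10, False),
    AlertRecord("checkout", "bob", 15, False),
]
per_service, per_responder, noise_ratio = summarize(records)
```

Even a summary this simple shows which service generates the most alerts and how many of them required no action, which is the kind of evidence stakeholders tend to ask for.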
Identify what is important for your services
SRE (Site Reliability Engineering) has been a popular topic recently, mainly because it offers practical solutions to real problems. The original SRE Book by Google covers alerting as a major topic and ties alerts to service level objectives (SLOs), which are useful tools for measuring service health. Even though you are probably not Google, the idea is simple and powerful, and it can be applied in many cases.
Focus on the service level and choose actionable reliability indicators such as HTTP error rate, end-to-end latency, or availability: metrics that your customers, internal or external, care about. Alerts should be tied to these metrics rather than to some random node failure that fires an alert and then fixes itself without any impact.
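To make the idea of tying an alert to an SLO more concrete, here is a minimal Python sketch that pages only when the HTTP error rate consumes the error budget much faster than planned. The function names, the 99.9% target, and the threshold of 10 are assumptions chosen for illustration, not recommended values.

```python
def error_rate(failed, total):
    """Fraction of requests that failed in the observed window."""
    return failed / total if total else 0.0


def burn_rate(failed, total, slo_target):
    """How fast the error budget is being consumed.

    1.0 means errors arrive exactly at the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    budget = 1.0 - slo_target  # e.g. about 0.001 for a 99.9% target
    return error_rate(failed, total) / budget


def should_page(failed, total, slo_target=0.999, threshold=10.0):
    # Both defaults are hypothetical; tune them to your own SLO and window.
    return burn_rate(failed, total, slo_target) >= threshold


# 200 failures in 10,000 requests is a 2% error rate, roughly 20x the budget.
page_now = should_page(failed=200, total=10_000)
# 5 failures in 10,000 requests is within budget, so nobody is woken up.
stay_quiet = not should_page(failed=5, total=10_000)
```

The point of the design is that a single node failing and recovering never shows up here unless customers actually see failed requests.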
Distribute on-call responsibility among developers
One of the reasons on-call people burn out and eventually stop caring about alerts is that they are on-call too often. Ideally, in a team of 4-8 people, everyone should be on-call for one week and rest for at least the remaining three weeks of the month. This may change depending on the team size, experience, and other factors like personal preferences, but the idea is to give people enough space to recover from being on-call.
When there are a lot of alerts, the need to rest becomes obvious. However, even if there is no alert, being on-call creates cognitive overload. A scientific study on extended work availability found that, even without any alerts, it has significant effects on start-of-day mood and the cortisol awakening response. Consider this and give people enough time to recover, so that they can focus on actual work instead of dealing with chores all the time.
The best way to spread on-call duty across more people is to ask developers to take part in on-call schedules. This has become a common practice lately, and teams see significant benefits beyond reduced alert overload, such as better code, better relationships with ops teams, and better uptime.
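The one-week-on, several-weeks-off pattern is a plain round-robin rotation. The sketch below is a minimal Python illustration; the `weekly_rotation` helper, the names, and the start date are hypothetical, and real schedules also need to handle holidays, swaps, and time zones.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Assign one engineer per week in round-robin order.

    Returns a list of (week_start, engineer) pairs.
    """
    if not engineers:
        raise ValueError("need at least one engineer")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def weeks_off_between_shifts(team_size):
    """Full weeks of rest each person gets between two on-call weeks."""
    return team_size - 1


# Hypothetical four-person team: each person rests three weeks between shifts.
team = ["ayse", "bob", "chen", "dana"]
schedule = weekly_rotation(team, start=date(2024, 1, 1), weeks=8)
rest_weeks = weeks_off_between_shifts(len(team))
```

With four people the rotation gives exactly the three weeks of rest mentioned above; every developer added to the schedule adds one more week of recovery for everyone else.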
Reduce dependencies to minimize the blast radius
In complex systems, dependencies are unavoidable. But we can make architectural and organizational decisions to minimize dependencies. Solving alert fatigue starts with focusing on individual services and teams. If we have too many dependencies, change becomes scary.
Most alerts accumulate over time because on-call engineers are scared of removing them. Creating boundaries and using the right technical practices helps minimize the blast radius: the reach that a problem might have. They also enable small and frequent changes. For example, even if your apps are all in one repository, you can use tags to tell their alerts apart. Another approach would be to track dependencies using distributed tracing. That helps with identifying relationships and showing where the problematic parts are. If one service is causing a cascading failure, you can separate the logic and apply the circuit breaker pattern.
There are different approaches. The idea is to create independent services and teams so they can make changes quickly without worrying too much about breaking things. Dependencies make it harder to remove and fix alerts. Autonomy and simplicity are required to have actionable alerts.
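For readers who have not met the circuit breaker pattern, here is a minimal Python sketch of the idea: after a number of consecutive failures, calls to the failing dependency are rejected immediately for a cool-down period instead of piling up. The class name, its parameters, and the default values are assumptions for illustration; production systems usually rely on an established library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    # max_failures and reset_after are hypothetical defaults.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so it can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is failing; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # One more failure will reopen the circuit right away.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

A caller would wrap each request to the dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or fail fast, which keeps one broken service from flooding every downstream team with alerts.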
Create actionable alerts
An actionable alert is an alert that is routed to the right team with useful information and action capabilities. If an alert says ‘server is down’, that is basically a panic notification, nothing more. We should aim higher.
Monitoring tools should add context instead of just stating the problem. Then, alerts become more useful. If there is information related to the incident in other tools, automation should be used to add that info to the alert in order to reduce manual investigation.
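A minimal Python sketch of this kind of enrichment might look like the following. The `enrich` function, the provider functions, and the URL are hypothetical stand-ins for whatever lookups your own tools offer.

```python
def enrich(alert, providers):
    """Return a copy of the alert with context gathered from each provider."""
    enriched = dict(alert)
    context = {}
    for name, provider in providers.items():
        try:
            context[name] = provider(alert)
        except Exception as exc:
            # A failing lookup must never block the alert itself.
            context[name] = f"unavailable: {exc}"
    enriched["context"] = context
    return enriched


# Hypothetical providers; real ones would query a wiki, a deploy tracker, etc.
def runbook_link(alert):
    return f"https://runbooks.example.invalid/{alert['service']}"


def recent_deploys(alert):
    raise TimeoutError("deploy tracker did not answer")


alert = {"service": "checkout", "summary": "HTTP error rate above SLO"}
enriched = enrich(alert, {"runbook": runbook_link, "recent_deploys": recent_deploys})
```

Note that a provider that times out only degrades the context; the alert is still delivered, with a note about what could not be looked up.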
Referring back to the earlier point about SLOs, alerts should be triggered on the right symptoms. SLOs are also useful tools for assigning the right level of urgency.
Another key aspect of actionable alerts is notifying the right people. Always aim to direct alerts to the right team by using alert properties like tags or service names. The on-call person in that team should not have to act like a proxy. Instead, they should be the expert.
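Routing on alert properties can be as simple as a lookup table. The Python sketch below assumes alerts are dictionaries with a `service` name and a list of `tags`; the team names and the routing tables are made up for illustration.

```python
# Hypothetical routing tables; in practice these live in your alerting tool.
SERVICE_ROUTES = {"checkout": "payments-team", "search": "search-team"}
TAG_ROUTES = {"database": "dba-team"}
DEFAULT_TEAM = "platform-team"


def route(alert):
    """Pick the owning team: service name first, then tags, then a fallback."""
    service = alert.get("service")
    if service in SERVICE_ROUTES:
        return SERVICE_ROUTES[service]
    for tag in alert.get("tags", []):
        if tag in TAG_ROUTES:
            return TAG_ROUTES[tag]
    return DEFAULT_TEAM


owner = route({"service": "checkout", "tags": ["database"]})
fallback = route({"service": "unknown-service"})
```

Alerts that land on the fallback team are worth reviewing regularly, because each one points to a service without a clear owner.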
One last reminder
Alert fatigue is a reliability problem. And reliability is everyone’s job. But if something is everyone’s job, it is no one’s responsibility. Create independent teams and ownership around services – then start learning from your experience and data.
Resilience is about continuous learning because change is inevitable. Following the ideas from DevOps, we should create opportunities for people to learn. We should talk about reliability and, in this specific case, about alerting. This may be obvious to some, but most teams don’t regularly spend enough time discussing important operational concerns like this.
Avoiding alert fatigue is key for healthy on-call teams and reliable services. By following the lessons shared in this article, teams should be able to focus on business problems instead of drowning in alerts.