How to cut out alert noise, fight alert fatigue, and automate incident response
I’ve had the pleasure of being on-call more times over more years than I can count. So when given the chance to share some of my opinions about how engineering teams can handle alert management better, I jumped at the opportunity.
In the following guide, my goal is really only to help organizations implement a better alert management system that’s structured, actionable, and (when it makes sense) automated.
Fighting alert fatigue
I don’t think I’m being dramatic when I say that there is an alert fatigue epidemic plaguing on-call engineers right now.
Here’s what’s wrong:
On-call engineers get woken up for unclear, useless alerts with no documentation
Teams either drown in alert noise or run with too few alerts and miss critical issues
Bad alert management is leading to burnout and turnover; engineers don’t want to spend their nights/mornings/weekends fixing issues that are seemingly impossible to diagnose or understand
On top of that, we’re seeing teams get smaller, apply band-aid fixes to problems, create useless alerts, and ultimately make their lives even harder in the months and years that follow.
Most teams don’t even have an alerting strategy. They have a mess of notifications, false positives, and messy monitoring tools. Let’s try to fix that, shall we?
So here’s what you can expect to learn in this guide:
How to cut alert noise by filtering out non-critical alerts
How to ensure every alert is actionable and comes with documentation
How to improve your team's on-call experience and prevent burnout
When to implement automation to reduce toil
Top 3 problems with alerting (and their hidden costs)
[01] Noisy alerts
Like I already mentioned, alert noise leads to alert fatigue leads to engineering burnout leads to high turnover leads to even worse alerts 🥲
I was on a discovery call recently where the prospect shared that his engineers are drowning in hundreds of alerts daily. They have a full-time employee whose only job is to process alerts.
Why does this happen?:
Engineers over-alert because it feels “safe”.
Infrastructure-based alerts trigger unnecessarily (e.g., you get a “server down” alert even when it doesn’t impact users).
Lack of alert management ownership — alerts get created but nobody maintains them.
The cost of over-alerting:
Engineers ignore alerts and miss critical incidents (take the Datadog outage, for example) 😬
Pager fatigue leads to slow response times 😴
On-call engineers suffer from alert anxiety 😥
[Potential] Solution: alert on user impact, not just infrastructure.
Stop alerting on “potential” failures as this often leads to false positives
Tie every alert to a dashboard (this requires more upfront work, but if you’ve got the bandwidth, I promise it’s worth it)
Use severity levels to prioritize (see the sketch after this list). For example:
Warning (e.g., “Disk 80% full”)
Incident (e.g., “Customer databases failing”)
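To make that concrete, here’s a minimal, hypothetical sketch (in Python) of severity-based routing. The severity names and the ticket-vs-page decision are assumptions you’d adapt to your own paging tool, not a prescription:

```python
from dataclasses import dataclass

# Hypothetical severity names and routing actions; adjust to match your paging tool.
SEVERITY_ACTIONS = {
    "warning": "ticket",   # e.g., "Disk 80% full" -> file a ticket for business hours
    "incident": "page",    # e.g., "Customer databases failing" -> page the on-call engineer now
}

@dataclass
class Alert:
    name: str
    severity: str  # "warning" or "incident"

def route(alert: Alert) -> str:
    """Decide whether an alert pages a human or just opens a ticket."""
    # Unknown severities default to a ticket so a typo never wakes anyone up at 2am.
    return SEVERITY_ACTIONS.get(alert.severity, "ticket")

assert route(Alert("disk_80_percent_full", "warning")) == "ticket"
assert route(Alert("customer_db_failing", "incident")) == "page"
```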
Hot take: Noisy alerting is worse than no alerting because it causes engineers to tune out the real problems, costing your business $$$.
[02] Unclear and useless alerts
This is an actual alert I got a few years ago: “🚨THIS SHOULD NEVER HAPPEN🚨”. First of all… what? If I’m going to get woken up by an alert at 2am, it better at least give me a hint as to what’s wrong.
And the person who created that alert had left the company ~8 years ago.
Why does this happen?:
Engineers leave but their alerts stay behind.
There are no runbooks or dashboards linked to the alerts.
Alerts are too vague (or sometimes too technical) without context.
The cost of useless alerts:
On-call engineers waste time reverse-engineering alerts ⏰
New hires have no idea what an alert means (and therefore have no idea what they’re doing)
As a result of the above, escalations increase.
[Potential] Solution: give context to every alert
When creating alerts, require documentation with clear steps to resolution (see the sketch after this list for one way to enforce this).
Attach dashboards for quick diagnosis (this is what GitLab does, and it works very well for them — more on that later).
Audit alerts regularly to remove legacy alerts (I suggest quarterly, but that’s just me).
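If you’d rather enforce the documentation requirement than rely on good intentions, here’s a rough sketch of what a pre-merge check could look like. It assumes alerts are defined as data with hypothetical `runbook_url` and `dashboard_url` fields, so treat it as an illustration rather than any particular tool’s API:

```python
# Illustrative alert definitions; the field names here are hypothetical, not from any specific tool.
alerts = [
    {
        "name": "api_error_rate_high",
        "runbook_url": "https://wiki.example.com/runbooks/api-errors",
        "dashboard_url": "https://grafana.example.com/d/api-overview",
    },
    {"name": "this_should_never_happen"},  # no context -- exactly what we want to catch
]

REQUIRED_FIELDS = ("runbook_url", "dashboard_url")

def missing_context(alert: dict) -> list[str]:
    """Return the documentation fields this alert definition is missing."""
    return [field for field in REQUIRED_FIELDS if not alert.get(field)]

failures = {a["name"]: missing_context(a) for a in alerts if missing_context(a)}
for name, fields in failures.items():
    print(f"REJECTED: {name} is missing {', '.join(fields)}")
if failures:
    raise SystemExit(1)  # fail the check so undocumented alerts never ship
```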
Hot take: If you wouldn’t wake up at 2am to handle it, then maybe it shouldn’t be an alert. Just something to consider 🤷♀️
[03] No defined ownership of alerts
As heard on another discovery call recently: “At [our company], developers create alerts but the ops team is responsible for responding. So they get paged for things they don’t understand.”
Why does this happen?:
Some teams let any engineer create alerts (resulting in a high volume of alerts as well as false positives)
Some teams have centralized company-wide alerting (but then engineers struggle to get the alerts they need added to the mix)
The cost of no clear ownership:
On-call engineers get paged for alerts they didn’t create (and also don’t understand) 😒
Fixing issues takes longer because responders lack critical context 🤔
Alert decay — no one knows which alerts are still relevant and which aren’t 😪
[Potential] Solution: Establish alert ownership (no duh, right?):
Use a hybrid model! Engineers can create their own alerts but they must must must document them and confirm that other team members who are responsible for response understand the documentation.
Assign clear ownership as part of your engineering culture. Hold owners accountable for hygiene and training.
Set up quarterly alert audits to review, update, or remove outdated alerts (I had to say this one twice because that’s how important it is to me 🥹)
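As for catching orphaned alerts during those audits, here’s one hypothetical way to flag them. The team roster and alert metadata below are made up for illustration; in practice they’d come from your own source of truth:

```python
# Illustrative only: owners would come from your team roster and alert metadata
# from whatever config or API your alerting tool provides.
CURRENT_TEAM = {"alice", "bob", "team-payments"}

alerts = [
    {"name": "checkout_latency_high", "owner": "team-payments"},
    {"name": "legacy_batch_job_failed", "owner": "engineer-who-left-in-2017"},
    {"name": "mystery_cron_alert"},  # no owner at all
]

def orphaned(alerts: list[dict], team: set[str]) -> list[str]:
    """Alerts whose owner is missing or gone -- prime candidates for the quarterly audit."""
    return [a["name"] for a in alerts if a.get("owner") not in team]

for name in orphaned(alerts, CURRENT_TEAM):
    print(f"AUDIT: {name} has no active owner -- update it or delete it")
```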
Hot take: Every alert needs an owner; if nobody owns it, it’s just annoying noise and should be removed.
Accounting for organizational differences: how to choose the right alert management strategy
As with most things, the alert management strategy you choose comes with a number of tradeoffs:
| Approach | Pros | Cons |
|---|---|---|
| Every engineer can create alerts | Fast iteration | Leads to noise & undocumented alerts |
| Centralized alerting | High-quality alerts | Engineers have no control, slow setup |
| Hybrid approach (best practice IMO) | Balance of ownership & quality | Requires discipline to enforce |
Rather than preach at you, I’ll point you toward three companies that, in my humble opinion, are doing a decent job with their alert management strategies.
GitLab: alerts for on-call engineers at GitLab are based on user impact, NOT infrastructure failures. When an engineer creates an alert, they’re required to attach appropriate documentation and a single dashboard to help the on-call engineer determine exactly what the alert means and how to address it. Read more about their process here.
Canvas Health: engineers at Canvas Health (at least at some point in the not-so-distant past) were required to build a dashboard definition for the metrics they wanted to track after building a feature. While this process was mainly used for analytics, it also helped on-call engineers monitor those particular aspects of a feature.
Here’s an example to make it more tangible: if I’m an engineer who just created a new reporting interface and I want to know that users are able to generate their reports, I might build a dashboard that tracks query execution times and response codes as part of that feature development. Then I’d create and document alerts based on that dashboard; if response times spike or errors pile up, the on-call engineer can refer to it to troubleshoot the problem.
Aptible: at Aptible, we use a hybrid alert management model where engineers can set up alerts, but they must provide documentation. Unlike many companies, we actually lean toward more alerts rather than fewer — but our situation is a unique one because, as a PaaS, we’re doing incident management on behalf of hundreds of customers. Customer uptime is critical to the success of our business, and we can’t afford to miss critical alerts.
We also do regular audits to remove old or misleading alerts, and we commonly use PagerDuty reporting to track which alerts fire most often so that we can automate common fixes. More on that shortly!
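I won’t walk through PagerDuty’s reporting UI here, but if your tool can export incident history, a tiny script like the following is enough to find the alerts that fire most often, which are usually your best automation candidates. The file name and column name are placeholders for whatever your export actually looks like:

```python
import csv
from collections import Counter

def top_firing_alerts(path: str, column: str = "alert_name", n: int = 10) -> list[tuple[str, int]]:
    """Count how often each alert fired; the top offenders are your best automation candidates."""
    with open(path, newline="") as f:
        counts = Counter(row[column] for row in csv.DictReader(f))
    return counts.most_common(n)

if __name__ == "__main__":
    # "incident_history.csv" and "alert_name" are hypothetical; match them to your tool's export.
    for name, count in top_firing_alerts("incident_history.csv"):
        print(f"{count:5d}  {name}")
```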
When is it time to automate your alerts?
In an ideal world, you’d be automating anything you’re getting paged for repeatedly. But alas, that is much easier said than done. As with determining how much alert noise is appropriate and who should own the setup of alerts, automating vs. not automating comes with its own tradeoffs.
Many of you are likely already familiar with Google’s SRE book; here’s an overview of the section on automation that I think is particularly helpful for understanding when to automate alerts:
| Stage | What it means | Outcome |
|---|---|---|
| No automation | Engineers manually recover all failing instances | Ongoing, time-consuming toil |
| Local script | Engineers run scripts on their machines | Not scalable |
| Runbook automation | Predefined steps trigger an automated fix | Faster response time, though it requires more upfront effort |
| Full automation | System self-recovers and notifies the customer | Zero engineering intervention, but potential for edge cases and missed outages |
Here’s the rub: it’s the smallest engineering teams that need the most automation to save themselves from burnout and turnover… But because automating alerts requires time and investment upfront, the smallest teams can’t afford to do it.
There’s no perfect answer to this problem, and I understand that this section is particularly grim, but FWIW, here’s my two cents: if an alert fires too often, automate the resolution instead of waking up engineers. You’ll thank yourself later.
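For what that could look like in practice, here’s a bare-bones, hypothetical sketch of the “runbook automation” stage from the table above: a handler that runs a predefined fix for known alerts and escalates everything else. The alert names, commands, and payload shape are placeholders, not a recommendation to restart things blindly:

```python
import subprocess

# Hypothetical mapping from alert name to a safe, predefined remediation command.
REMEDIATIONS = {
    "worker_queue_stuck": ["systemctl", "restart", "worker"],
    "disk_80_percent_full": ["/usr/local/bin/rotate-old-logs"],
}

def handle_alert(payload: dict) -> str:
    """Run the predefined fix for a known alert; escalate to a human for anything else."""
    command = REMEDIATIONS.get(payload.get("alert_name"))
    if command is None:
        return "escalate"  # no safe automated fix -- page the on-call engineer
    result = subprocess.run(command, capture_output=True, text=True)
    return "resolved" if result.returncode == 0 else "escalate"

# Example: a stuck worker queue gets restarted instead of waking anyone up.
print(handle_alert({"alert_name": "worker_queue_stuck"}))
```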
On-call and turnover: How to prevent alert fatigue and engineering burnout
I’ve mentioned this several times, but the whole “alert fatigue epidemic” is incredibly painful because it’s a never-ending cycle: alerts cause burnout; burnout causes turnover; high turnover causes bad alerting.
As a victim of both the burnout and the stress of managing a burnt out team, I’m all too familiar with this particular problem. Here’s my advice:
Strive for no engineering turnover (I sort of say this in jest, of course, but… seriously — prioritize it).
Rotate on-call schedules fairly.
Prioritize knowledge transfer before employees leave.
Set up and stick to a stringent alert management and alerting hygiene routine (delete or update old alerts as regularly as you can).
Wrapping up
I know we covered a lot of ground, and a large part of this guide is more philosophical than practical, but that’s the thing about alert management: there’s no perfect way to do it.
At the end of the day, IT alerting shouldn’t be a nightmare; it should be a tool that helps engineers respond faster with less stress. If you skipped everything else and only read this, my advice is to start with small changes:
Audit and clean up your alerts now if you haven’t recently
Require documentation for all your alerts, regardless of who owns them
When possible, automate the alerts that you can — reducing noise even by a small fraction is worth it if it helps to retain your engineering team
Finally, if you want to dive even deeper on how other companies are handling alerting, here are a few alerting resources I’ve found particularly useful: