Great systems are not just built. They are monitored.
MetricFire is the fully managed Graphite and Grafana platform for small teams that don’t want to self-host their monitoring stack. Pre-built dashboards, alerts, and native add-ons for Heroku, AWS, Azure, and GCP. All with dedicated support and no infrastructure to maintain.
When every alert feels like noise, the one that matters gets ignored. Here's how to build a monitoring system your team actually trusts.
Your monitoring is firing alerts. Lots of them. Some fire at 2 AM due to CPU spikes that resolve within seconds. Others fire five times in a row for the same condition. After a few weeks of this, your team stops reading them. That silence, that learned indifference, is alert fatigue, and it's one of the most quietly dangerous failure modes in modern operations.
What Is Alert Fatigue?
Alert fatigue occurs when a monitoring system generates so many notifications that the people receiving them start to ignore or dismiss them reflexively. It doesn't require a bad monitoring setup: it can happen in well-intentioned systems where individual alerts are technically correct but collectively overwhelming.
The consequence isn't just annoyance. When engineers stop trusting their alerts, real incidents get missed. The alert that actually mattered, the one that signaled a genuine outage, gets lost in a sea of notifications nobody reads anymore.
Traditional alerts are simple by design: if a metric crosses a threshold, fire an alert. While that simplicity makes alerts easy to configure, it also leads to alert noise, because single metrics rarely tell the full story.
Why It Happens: The Root Causes
- **Threshold-only alerting.** Setting a single metric threshold like "alert if CPU > 80%" looks reasonable in isolation. But in production, metrics spike constantly for transient, non-actionable reasons. Every spike fires an alert, most of which require no human response.
- **Alerting on symptoms, not causes.** Many setups alert on correlated symptoms simultaneously. When a database slows down, it may trigger CPU, memory, connection count, and query latency alerts all at once. Your on-call engineer receives five alerts about the same root cause.
- **Too much breadth, not enough context.** Platforms that instrument everything automatically can surface hundreds of metrics with default alert rules. Without intentional configuration, this becomes noise by default. More data doesn't mean better signal.
- **No correlation between signals.** A memory spike alone might be normal. A memory spike combined with a surge in router errors is a different story entirely. Without the ability to combine signals, each metric is evaluated in isolation, and most fire false positives.
- **Copy-paste alert configurations.** Alert configs copied from templates or onboarding docs are rarely tuned to the actual behavior of your application. A threshold that's appropriate for one service may be completely wrong for another.
Two Philosophies, Two Alert Experiences
The way a monitoring platform is designed has a direct bearing on how much alert fatigue you experience. Comparing New Relic and Hosted Graphite makes this concrete.
New Relic positions itself as a full-stack observability platform, bringing together infrastructure, APM, logs, distributed tracing, real user monitoring, and synthetics under one roof. The breadth is compelling for consolidation, but it creates a challenge. When everything is instrumented automatically, and alerts are tied to a unified telemetry model, the surface area for noise is enormous. Default alert conditions across dozens of auto-discovered integrations can multiply quickly. Teams that don't actively prune alert configurations tend to end up with a noisy dashboard and a growing instinct to mute notifications.
Hosted Graphite takes the opposite approach. It's a focused metrics platform: explicit configuration, direct metric manipulation via Graphite's query language, and alerts that you intentionally define rather than inherit from auto-discovery. The tradeoff is more setup, but the reward is signal over noise. You alert on what you decide matters.
| Dimension | New Relic | Hosted Graphite |
|---|---|---|
| Alert configuration | Auto-discovered, opinionated defaults (more noise risk) | Explicit, intentional configuration (more control) |
| Signal combination | NRQL-based conditions; multi-signal possible but complex | Composite alerts with AND/OR logical expressions (native support) |
| Query model | Unified NRQL across all telemetry types | Graphite functions for direct metric manipulation (time-series native) |
| Scope | Full stack: metrics, logs, traces, synthetics | Metrics-focused, Grafana dashboards included |
| Alert fatigue risk | Higher without active curation | Lower by design: you define what fires |
Neither approach is wrong. But if reducing alert fatigue is a priority, the explicit-configuration model gives you more leverage from the start.
The Fix: Composite Alerting
The most powerful structural fix for alert fatigue is composite alerting: the ability to combine multiple metric conditions into a single logical expression before firing a notification. Instead of alerting on a symptom in isolation, you alert on a meaningful combination of signals that together indicate a real problem.
Hosted Graphite's composite alerts let you define conditions with AND (&&) and OR (||) operators. Each metric threshold gets assigned a label, and those labels are combined in an expression that determines when the alert fires.
Without Composites
- CPU spike fires alert
- Memory spike fires alert
- Error count fires alert
- 3 alerts, 1 event
- Most are false positives
With Composite Alerts
- CPU spike: no alert
- Memory spike: no alert
- Memory AND errors: fires
- 1 alert, 1 actionable event
- Every alert means something
Expression Logic
- `a && b`: both must be true
- `(a && b) || c`: either case
- `a || b || c`: any threshold
- Labels map to metrics
- Flexible, precise control
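The label-and-expression model above can be sketched in a few lines of Python. This is a hypothetical illustration of the logic, not the platform's actual evaluator: each label maps to whether its metric condition is currently breaching, and the expression decides whether the alert fires.

```python
# Minimal sketch of composite-alert evaluation (illustrative only).
# `states` maps each label to whether its metric condition is breaching.
def evaluate(expression: str, states: dict) -> bool:
    """Evaluate a composite expression like 'a && b' or '(a && b) || c'."""
    # Translate the && / || syntax into Python's boolean operators.
    expr = expression.replace("&&", " and ").replace("||", " or ")
    return bool(eval(expr, {"__builtins__": {}}, states))

# CPU (a) is spiking but errors (b) are quiet: no alert fires.
print(evaluate("a && b", {"a": True, "b": False}))                     # False
# Memory pressure AND router errors together: the alert fires.
print(evaluate("a && b", {"a": True, "b": True}))                      # True
# OR logic: any one breaching label is enough.
print(evaluate("(a && b) || c", {"a": False, "b": False, "c": True}))  # True
```

The point of the sketch: the decision to notify a human moves from each metric individually to the expression as a whole.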
Real-World Example: Heroku Memory Pressure + Router Errors
Consider a Heroku application where you're tracking memory usage and HTTP error rates. Memory spikes frequently during normal traffic surges and don't require intervention on their own. But when memory is under pressure and router errors are climbing simultaneously, that's a genuine signal: your application is likely running out of resources and users are being affected.
```yaml
# Condition a: memory RSS above 85%
metric: heroku.<app>.web.1.memory_rss
threshold: 85
label: a

# Condition b: router errors exceed 50 in the window
metric: heroku.<app>.router.errors
threshold: 50
label: b

# Only fire when BOTH are true
expression: a && b
```
Neither condition alone fires the alert. Only the combination does, which means every notification your team receives is tied to a real, user-impacting event.
PostgreSQL: Connection Pressure AND Disk Read Latency
A database with high connection counts might be under load, or it might just be a traffic spike that self-resolves. But when connection pressure combines with elevated disk read latency, you're looking at a database struggling to keep up. A composite a && b fires only on that meaningful combination, not on either symptom in isolation.
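In the same configuration style as the Heroku example, that combination might look like the sketch below. The metric names and threshold values here are illustrative assumptions, not a fixed schema; substitute the metrics your own database integration actually reports.

```yaml
# Condition a: active connections above 180 (illustrative threshold)
metric: postgres.<db>.connections.active
threshold: 180
label: a

# Condition b: disk read latency above 20 ms (illustrative threshold)
metric: postgres.<db>.disk.read_latency_ms
threshold: 20
label: b

# Fire only when the database is both saturated and slow
expression: a && b
```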
Disk: Capacity OR Inode Exhaustion
Sometimes OR logic is the right tool. A disk can become unusable by hitting capacity limits or by exhausting its inodes: two different failure paths, each catastrophic on its own. An a || b expression catches either condition without requiring separate alert configurations to manage.
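A sketch of that OR composite, again with illustrative metric names and thresholds rather than a prescribed schema:

```yaml
# Condition a: disk usage above 90% of capacity
metric: servers.<host>.disk.percent_used
threshold: 90
label: a

# Condition b: inode usage above 90%
metric: servers.<host>.disk.percent_inodes_used
threshold: 90
label: b

# Either failure path fires the alert
expression: a || b
```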
Composite alerts are created and updated via the Alerts API. Each sub-condition targets a metric with its own threshold and window, gets assigned a label (a, b, c...), and those labels are combined in a logical expression. The alert fires only when that expression evaluates to true. A UI editor for composite alerts is currently in development.
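As a rough sketch of what an API-defined composite alert could look like, the payload below shows the shape of the idea. The field names, endpoint, and channel identifiers here are assumptions for illustration; consult the Alerts API documentation for the actual request schema.

```json
{
  "name": "heroku-memory-and-router-errors",
  "criteria": [
    {"label": "a", "metric": "heroku.<app>.web.1.memory_rss", "threshold": 85},
    {"label": "b", "metric": "heroku.<app>.router.errors", "threshold": 50}
  ],
  "expression": "a && b",
  "notification_channels": ["pagerduty-oncall"]
}
```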
Additional Practices to Reduce Noise
Tune thresholds to your application's baseline
Default thresholds from templates are almost never right for your specific workload. Spend time reviewing historical metric data in Grafana and set thresholds that distinguish meaningful deviations from normal variance. This single step eliminates a large fraction of alert noise for most teams.
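One pragmatic way to do this is to derive the threshold from the metric's own history instead of a template default. The sketch below assumes you have exported a list of recent datapoints (for example, from a Grafana panel or the Graphite render API); the percentile and headroom values are illustrative starting points.

```python
# Sketch: derive a threshold from historical data rather than a
# template default. `history` is a list of recent metric values.
def baseline_threshold(history, percentile=99, headroom=1.2):
    """Return the given percentile of history, with a safety margin."""
    values = sorted(history)
    # Index of the requested percentile (nearest-rank method).
    idx = min(len(values) - 1, int(len(values) * percentile / 100))
    return values[idx] * headroom

# A sample of 5-minute CPU readings hovering around 40-60%:
history = [40, 45, 50, 55, 60, 52, 48, 58, 61, 47]
print(baseline_threshold(history))  # roughly 73: above normal variance
```

A threshold set this way alerts on genuine deviation from the workload's observed baseline, not on an arbitrary round number.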
Use alert windows, not instantaneous triggers
Alerting on a metric that crossed a threshold for one data point fires on transient spikes. Alerting on a metric that has been above a threshold for 5 continuous minutes is far more likely to represent a real problem. All credible alerting systems support evaluation windows — use them.
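The "sustained breach" idea can be expressed directly. This is a simplified client-side sketch of the logic; in practice the alerting system evaluates the window server-side.

```python
# Sketch: fire only on a sustained breach, never a single spike.
def breaching_for_window(datapoints, threshold, window):
    """True only if the last `window` datapoints are ALL above threshold."""
    if len(datapoints) < window:
        return False
    return all(v > threshold for v in datapoints[-window:])

# A transient spike does not fire...
print(breaching_for_window([10, 95, 12, 11, 13], threshold=80, window=5))  # False
# ...but five sustained datapoints above threshold does.
print(breaching_for_window([85, 90, 88, 92, 87], threshold=80, window=5))  # True
```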
Route alerts to the right channels
Not every alert needs to wake someone up. Hosted Graphite supports routing notifications to Email, Slack, PagerDuty, Microsoft Teams, OpsGenie, and webhooks. Map your alert severity to your notification channel. Low-severity alerts go to a Slack channel. High-severity alerts page on-call. This preserves the signal value of your most urgent notifications.
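The severity-to-channel mapping is simple enough to state explicitly. A minimal sketch, with channel names that are purely illustrative:

```python
# Sketch: route by severity so only urgent alerts page a human.
# Channel names are illustrative, not actual integrations.
ROUTES = {
    "info": "slack:#monitoring",
    "warning": "slack:#oncall-heads-up",
    "critical": "pagerduty:primary-oncall",
}

def route(severity: str) -> str:
    # Unknown severities fall back to the low-noise channel.
    return ROUTES.get(severity, ROUTES["info"])

print(route("critical"))  # pagerduty:primary-oncall
print(route("info"))      # slack:#monitoring
```

The fallback matters: an unclassified alert should land in a visible but non-paging channel rather than waking someone up by default.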
Audit and prune regularly
Alert debt compounds. Every alert that hasn't fired a meaningful notification in 30 days is a candidate for review. Either it's set too conservatively to be useful, or the metric it watches isn't relevant anymore. A quarterly alert audit is one of the highest-leverage operations habits a team can develop.
The Underlying Principle
Alert fatigue is not a technology problem at its core: it's a signal design problem. Every alert in your system is a claim that when this condition occurs, a human needs to act. When that claim is wrong most of the time, trust erodes, and the entire monitoring system becomes unreliable.
The combination of intentional alert configuration, composite logic, appropriate thresholds, and deliberate routing turns your alert system from a noise generator into a reliable signal. When your team sees a notification, they know it means something. That trust is the real product of good monitoring.
Stop fighting alert noise.
MetricFire's Hosted Graphite gives you the tools to build a monitoring system your team actually trusts: composite alerts, Grafana dashboards, flexible thresholds, and dedicated support. No infrastructure to manage.
No credit card required · Set up in minutes · Cancel anytime