Great systems are not just built. They are monitored.
MetricFire is the fully managed Graphite and Grafana platform for small teams that don’t want to self-host their monitoring stack. Pre-built dashboards, alerts, and native add-ons for Heroku, AWS, Azure, and GCP. All with dedicated support and no infrastructure to maintain.
When every alert feels like noise, the one that matters gets ignored. Here's how to build a monitoring system your team actually trusts.
Your monitoring is firing alerts. Lots of them. Some fire at 2 AM due to CPU spikes that resolve within seconds. Others fire five times in a row for the same condition. After a few weeks of this, your team stops reading them. That silence, that learned indifference, is alert fatigue, and it's one of the most quietly dangerous failure modes in modern operations.
What Is Alert Fatigue?
Alert fatigue occurs when a monitoring system generates so many notifications that the people receiving them start to ignore or dismiss them reflexively. It doesn't require a bad monitoring setup: it can happen in well-intentioned systems where individual alerts are technically correct but collectively overwhelming.
The consequence isn't just annoyance. When engineers stop trusting their alerts, real incidents get missed. The alert that actually mattered, the one that signaled a genuine outage, gets lost in a sea of notifications nobody reads anymore.
Traditional alerts are simple by design: if a metric crosses a threshold, fire an alert. While that simplicity makes alerts easy to configure, it also leads to alert noise, because single metrics rarely tell the full story.
Why It Happens: The Root Causes
- **Threshold-only alerting.** Setting a single metric threshold like "alert if CPU > 80%" looks reasonable in isolation. But in production, metrics spike constantly for transient, non-actionable reasons. Every spike fires an alert, most of which require no human response.
- **Alerting on symptoms, not causes.** Many setups alert on correlated symptoms simultaneously. When a database slows down, it may trigger CPU, memory, connection count, and query latency alerts all at once. Your on-call engineer receives five alerts about the same root cause.
- **Too much breadth, not enough context.** Platforms that instrument everything automatically can surface hundreds of metrics with default alert rules. Without intentional configuration, this becomes noise by default. More data doesn't mean better signal.
- **No correlation between signals.** A memory spike alone might be normal. A memory spike combined with a surge in router errors is a different story entirely. Without the ability to combine signals, each metric is evaluated in isolation, and most fire false positives.
- **Copy-paste alert configurations.** Alert configs copied from templates or onboarding docs are rarely tuned to the actual behavior of your application. A threshold that's appropriate for one service may be completely wrong for another.
Two Philosophies, Two Alert Experiences
The way a monitoring platform is designed has a direct bearing on how much alert fatigue you experience. Comparing New Relic and Hosted Graphite makes this concrete.
New Relic positions itself as a full-stack observability platform, bringing together infrastructure, APM, logs, distributed tracing, real user monitoring, and synthetics under one roof. The breadth is compelling for consolidation, but it creates a challenge. When everything is instrumented automatically, and alerts are tied to a unified telemetry model, the surface area for noise is enormous. Default alert conditions across dozens of auto-discovered integrations can multiply quickly. Teams that don't actively prune alert configurations tend to end up with a noisy dashboard and a growing instinct to mute notifications.
Hosted Graphite takes the opposite approach. It's a focused metrics platform: explicit configuration, direct metric manipulation via Graphite's query language, and alerts that you intentionally define rather than inherit from auto-discovery. The tradeoff is more setup, but the reward is signal over noise. You alert on what you decide matters.
| Dimension | New Relic | Hosted Graphite |
|---|---|---|
| Alert configuration | Auto-discovered, opinionated defaults (more noise risk) | Explicit, intentional configuration (more control) |
| Signal combination | NRQL-based conditions; multi-signal possible but complex | Composite alerts with AND/OR logical expressions (native support) |
| Query model | Unified NRQL across all telemetry types | Graphite functions for direct metric manipulation (time-series native) |
| Scope | Full stack: metrics, logs, traces, synthetics | Metrics-focused, Grafana dashboards included |
| Alert fatigue risk | Higher without active curation | Lower by design: you define what fires |
Neither approach is wrong. But if reducing alert fatigue is a priority, the explicit-configuration model gives you more leverage from the start.
The Fix: Composite Alerting
The most powerful structural fix for alert fatigue is composite alerting: the ability to combine multiple metric conditions into a single logical expression before firing a notification. Instead of alerting on a symptom in isolation, you alert on a meaningful combination of signals that together indicate a real problem.
Hosted Graphite's composite alerts let you define conditions with AND (&&) and OR (||) operators. Each metric threshold gets assigned a label, and those labels are combined in an expression that determines when the alert fires.
Without Composites
- CPU spike fires alert
- Memory spike fires alert
- Error count fires alert
- 3 alerts, 1 event
- Most are false positives
With Composite Alerts
- CPU spike: no alert
- Memory spike: no alert
- Memory AND errors: fires
- 1 alert, 1 actionable event
- Every alert means something
Expression Logic
- `a && b`: both must be true
- `(a && b) || c`: either case
- `a || b || c`: any threshold
- Labels map to metrics
- Flexible, precise control
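The label-and-expression model above can be sketched in a few lines of Python. This is a hypothetical illustration of the logic, not the platform's actual evaluator: each label maps to whether its metric condition is currently breaching, and the expression decides whether the alert fires.

```python
# Minimal sketch of composite-alert evaluation (illustrative only).
# `states` maps each label to whether its metric condition is breaching.
def evaluate(expression: str, states: dict) -> bool:
    """Evaluate a composite expression like 'a && b' or '(a && b) || c'."""
    # Translate the && / || syntax into Python's boolean operators.
    expr = expression.replace("&&", " and ").replace("||", " or ")
    return bool(eval(expr, {"__builtins__": {}}, states))

# CPU (a) is spiking but errors (b) are quiet: no alert fires.
print(evaluate("a && b", {"a": True, "b": False}))                     # False
# Memory pressure AND router errors together: the alert fires.
print(evaluate("a && b", {"a": True, "b": True}))                      # True
# OR logic: any one breaching label is enough.
print(evaluate("(a && b) || c", {"a": False, "b": False, "c": True}))  # True
```

The point of the sketch: the decision to notify a human moves from each metric individually to the expression as a whole.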
Real-World Example: Heroku Memory Pressure + Router Errors
Consider a Heroku application where you're tracking memory usage and HTTP error rates. Memory spikes frequently during normal traffic surges and don't require intervention on their own. But when memory is under pressure and router errors are climbing simultaneously, that's a genuine signal: your application is likely running out of resources and users are being affected.
```yaml
# Condition a: memory RSS above 85%
metric: heroku.<app>.web.1.memory_rss
threshold: 85
label: a

# Condition b: router errors exceed 50 in the window
metric: heroku.<app>.router.errors
threshold: 50
label: b

# Only fire when BOTH are true
expression: a && b
```
Neither condition alone fires the alert. Only the combination does, which means every notification your team receives is tied to a real, user-impacting event.
PostgreSQL: Connection Pressure AND Disk Read Latency
A database with high connection counts might be under load, or it might just be a traffic spike that self-resolves. But when connection pressure combines with elevated disk read latency, you're looking at a database struggling to keep up. A composite a && b fires only on that meaningful combination, not on either symptom in isolation.
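In the same configuration style as the Heroku example, that combination might look like the sketch below. The metric names and threshold values here are illustrative assumptions, not a fixed schema; substitute the metrics your own database integration actually reports.

```yaml
# Condition a: active connections above 180 (illustrative threshold)
metric: postgres.<db>.connections.active
threshold: 180
label: a

# Condition b: disk read latency above 20 ms (illustrative threshold)
metric: postgres.<db>.disk.read_latency_ms
threshold: 20
label: b

# Fire only when the database is both saturated and slow
expression: a && b
```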
Disk: Capacity OR Inode Exhaustion
Sometimes OR logic is the right tool. A disk can become unusable by hitting capacity limits or by exhausting its inodes: two different failure paths, each catastrophic on its own. An a || b expression catches either condition without requiring separate alert configurations to manage.
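A sketch of that OR composite, again with illustrative metric names and thresholds rather than a prescribed schema:

```yaml
# Condition a: disk usage above 90% of capacity
metric: servers.<host>.disk.percent_used
threshold: 90
label: a

# Condition b: inode usage above 90%
metric: servers.<host>.disk.percent_inodes_used
threshold: 90
label: b

# Either failure path fires the alert
expression: a || b
```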
Composite alerts are created and updated via the Alerts API. Each sub-condition targets a metric with its own threshold and window, gets assigned a label (a, b, c...), and those labels are combined in a logical expression. The alert fires only when that expression evaluates to true. A UI editor for composite alerts is currently in development.
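As a rough sketch of what an API-defined composite alert could look like, the payload below shows the shape of the idea. The field names, endpoint, and channel identifiers here are assumptions for illustration; consult the Alerts API documentation for the actual request schema.

```json
{
  "name": "heroku-memory-and-router-errors",
  "criteria": [
    {"label": "a", "metric": "heroku.<app>.web.1.memory_rss", "threshold": 85},
    {"label": "b", "metric": "heroku.<app>.router.errors", "threshold": 50}
  ],
  "expression": "a && b",
  "notification_channels": ["pagerduty-oncall"]
}
```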
Additional Practices to Reduce Noise
Tune thresholds to your application's baseline
Default thresholds from templates are almost never right for your specific workload. Spend time reviewing historical metric data in Grafana and set thresholds that distinguish meaningful deviations from normal variance. This single step eliminates a large fraction of alert noise for most teams.
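One pragmatic way to do this is to derive the threshold from the metric's own history instead of a template default. The sketch below assumes you have exported a list of recent datapoints (for example, from a Grafana panel or the Graphite render API); the percentile and headroom values are illustrative starting points.

```python
# Sketch: derive a threshold from historical data rather than a
# template default. `history` is a list of recent metric values.
def baseline_threshold(history, percentile=99, headroom=1.2):
    """Return the given percentile of history, with a safety margin."""
    values = sorted(history)
    # Index of the requested percentile (nearest-rank method).
    idx = min(len(values) - 1, int(len(values) * percentile / 100))
    return values[idx] * headroom

# A sample of 5-minute CPU readings hovering around 40-60%:
history = [40, 45, 50, 55, 60, 52, 48, 58, 61, 47]
print(baseline_threshold(history))  # roughly 73: above normal variance
```

A threshold set this way alerts on genuine deviation from the workload's observed baseline, not on an arbitrary round number.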
Use alert windows, not instantaneous triggers
Alerting on a metric that crossed a threshold for one data point fires on transient spikes. Alerting on a metric that has been above a threshold for 5 continuous minutes is far more likely to represent a real problem. All credible alerting systems support evaluation windows — use them.
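The "sustained breach" idea can be expressed directly. This is a simplified client-side sketch of the logic; in practice the alerting system evaluates the window server-side.

```python
# Sketch: fire only on a sustained breach, never a single spike.
def breaching_for_window(datapoints, threshold, window):
    """True only if the last `window` datapoints are ALL above threshold."""
    if len(datapoints) < window:
        return False
    return all(v > threshold for v in datapoints[-window:])

# A transient spike does not fire...
print(breaching_for_window([10, 95, 12, 11, 13], threshold=80, window=5))  # False
# ...but five sustained datapoints above threshold does.
print(breaching_for_window([85, 90, 88, 92, 87], threshold=80, window=5))  # True
```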
Route alerts to the right channels
Not every alert needs to wake someone up. Hosted Graphite supports routing notifications to Email, Slack, PagerDuty, Microsoft Teams, OpsGenie, and webhooks. Map your alert severity to your notification channel. Low-severity alerts go to a Slack channel. High-severity alerts page on-call. This preserves the signal value of your most urgent notifications.
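The severity-to-channel mapping is simple enough to state explicitly. A minimal sketch, with channel names that are purely illustrative:

```python
# Sketch: route by severity so only urgent alerts page a human.
# Channel names are illustrative, not actual integrations.
ROUTES = {
    "info": "slack:#monitoring",
    "warning": "slack:#oncall-heads-up",
    "critical": "pagerduty:primary-oncall",
}

def route(severity: str) -> str:
    # Unknown severities fall back to the low-noise channel.
    return ROUTES.get(severity, ROUTES["info"])

print(route("critical"))  # pagerduty:primary-oncall
print(route("info"))      # slack:#monitoring
```

The fallback matters: an unclassified alert should land in a visible but non-paging channel rather than waking someone up by default.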
Audit and prune regularly
Alert debt compounds. Every alert that hasn't fired a meaningful notification in 30 days is a candidate for review. Either it's set too conservatively to be useful, or the metric it watches isn't relevant anymore. A quarterly alert audit is one of the highest-leverage operations habits a team can develop.
The Underlying Principle
Alert fatigue is not a technology problem at its core: it's a signal design problem. Every alert in your system is a claim that when this condition occurs, a human needs to act. When that claim is wrong most of the time, trust erodes, and the entire monitoring system becomes unreliable.
The combination of intentional alert configuration, composite logic, appropriate thresholds, and deliberate routing turns your alert system from a noise generator into a reliable signal. When your team sees a notification, they know it means something. That trust is the real product of good monitoring.
Stop fighting alert noise.
MetricFire's Hosted Graphite gives you the tools to build a monitoring system your team actually trusts: composite alerts, Grafana dashboards, flexible thresholds, and dedicated support. No infrastructure to manage.
No credit card required · Set up in minutes · Cancel anytime