Reducing Alert Noise: Service-Level Alerting with Graphite Wildcards


Great systems are not just built. They are monitored.

MetricFire runs Graphite and Grafana as a fully managed service for growing engineering teams, taking care of storage, scaling, and version updates so your team doesn't have to. Plans start at $19/month, billed per metric namespace rather than per host, and include engineer-staffed support. Integrations work natively with Heroku, AWS, Azure, and GCP, and data is stored with 3× redundancy in SOC2- and ISO:27001-certified data centres.

Introduction

Most alerting systems start by defining thresholds on individual metrics so that when a value crosses that threshold, an alert is triggered. This approach works fine when systems are small, but it becomes difficult to manage as services grow and expose more signals.

A single service often produces multiple related metrics. Throughput, latency, queue depth, and health could all describe different aspects of the same system. Alerting on each of these components independently leads to duplication, and more importantly, it removes the context that ties them together.

In the previous article, we focused on structuring metrics so that services and their signals are clearly defined within the metric hierarchy.

In this article, we build on that foundation by using Graphite wildcard queries (*) to group related metric signals into meaningful alert definitions. Instead of creating alerts per metric, we define alerts around services and their internal components, and evaluate the specific metrics responsible when those alerts fire.

1: Why Metric-Level Alerts Can Create Noise

Consider the load balancing layer from the previous article. It exposes multiple signals that describe how traffic is being handled.

<host>.lb.forwarder.requests.rate
<host>.lb.worker.requests.rate
<host>.lb.queue.depth.value
<host>.lb.health.status.value
<host>.lb.latency.p95.value

Each of these metrics is valid on its own, but they are all part of the same system. Now imagine a failure scenario where traffic begins to back up. Maybe the queue depth increases, latency rises, and request throughput drops. If alerts are defined per metric, each of these signals can trigger independently. What you end up with is multiple alerts describing a single issue, each lacking the broader context of the service.

This is where alert noise starts to become a problem because the issue is not the signals themselves, but how they are grouped when defining alerts.
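To make the noise concrete, here is a minimal sketch of the failure scenario above, with independent per-metric rules. The host name, threshold values, and observed numbers are illustrative, not taken from a real system:

```python
# Sketch: why per-metric thresholds amplify a single incident.
# Latest observed values for one load balancer host during a backlog event.
observed = {
    "web-01.lb.queue.depth.value": 950,       # queue backing up
    "web-01.lb.latency.p95.value": 2.4,       # seconds, rising
    "web-01.lb.forwarder.requests.rate": 120, # throughput dropping
}

# Independent per-metric rules: (metric, comparator, threshold)
rules = [
    ("web-01.lb.queue.depth.value", "above", 500),
    ("web-01.lb.latency.p95.value", "above", 1.0),
    ("web-01.lb.forwarder.requests.rate", "below", 1000),
]

def fired(value, comparator, threshold):
    return value > threshold if comparator == "above" else value < threshold

alerts = [m for m, c, t in rules if fired(observed[m], c, t)]
print(len(alerts), "separate alerts for one underlying incident")
```

All three rules fire on the same backlog, producing three pages for one problem.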

2: Using Wildcards to Define Meaningful Alert Groupings

Wildcard queries allow you to group related metrics together based on their position in the metric path. The key is to group metrics in a way that reflects how you want to reason about the system. So instead of trying to alert on every signal within a service all at once, it is often more useful to group by signal type.

For example, rather than: *.lb.*.*.*

(which includes metrics with very different units and behaviors), a more practical grouping could be: *.lb.*.requests.rate



This query captures request throughput metrics across all load balancer hosts and components. Because these metrics share the same unit and meaning, they can be evaluated together, which allows you to define a single alert that represents request handling across the service. An example of this alert criteria could be: "Trigger an alert if any load balancer request rate drops below 1000 req/sec." When this alert fires, it represents a degradation in the service's ability to process traffic, rather than an issue with a single metric.
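The evaluation behind such an alert can be sketched against the JSON shape returned by Graphite's render API (`GET /render?target=*.lb.*.requests.rate&format=json`). The series names and datapoint values below are made up for illustration:

```python
# Sketch: "trigger if any matched request rate drops below 1000 req/sec".
# The response shape mirrors Graphite's render API (format=json):
# a list of {"target": ..., "datapoints": [[value, timestamp], ...]}.
response = [
    {"target": "web-01.lb.forwarder.requests.rate",
     "datapoints": [[1450, 1700000000], [1380, 1700000060]]},
    {"target": "web-01.lb.worker.requests.rate",
     "datapoints": [[980, 1700000000], [None, 1700000060]]},
]

THRESHOLD = 1000  # req/sec

def latest_value(datapoints):
    """Most recent non-null datapoint, or None if the series is empty."""
    for value, _ts in reversed(datapoints):
        if value is not None:
            return value
    return None

# Each matched series is evaluated independently; only offenders are returned.
offending = [
    s["target"] for s in response
    if (v := latest_value(s["datapoints"])) is not None and v < THRESHOLD
]
print(offending)  # only the worker series is below threshold
```

One rule covers every host and component the wildcard matches, and the alert output names exactly the series that degraded.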



In most cases, these wildcard alerts evaluate each matched time series independently, which means a single degraded signal can trigger the alert and be returned as an offending metric. This pattern can then be applied across multiple services and signal types:

  • Latency across all hosts/services can be grouped with: *.*.latency.p95.value
  • Memory usage across hosts can be grouped with: telegraf.*.mem.used_percent
  • Health checks across hosts/services can be grouped with: *.*.health.status.value
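The groupings above can be expressed as a small set of rules, one per signal type. The wildcard targets come from this article; the comparators and threshold values are example assumptions you would tune for your own system:

```python
# Sketch: one grouped alert rule per signal type rather than per metric.
# Targets are from the article; thresholds/comparators are illustrative.
alert_rules = {
    "lb-throughput":  {"target": "*.lb.*.requests.rate",
                       "comparator": "below", "threshold": 1000},
    "latency-p95":    {"target": "*.*.latency.p95.value",
                       "comparator": "above", "threshold": 0.5},
    "host-memory":    {"target": "telegraf.*.mem.used_percent",
                       "comparator": "above", "threshold": 90},
    "service-health": {"target": "*.*.health.status.value",
                       "comparator": "below", "threshold": 1},
}

for name, rule in alert_rules.items():
    print(f"{name}: alert when {rule['target']} "
          f"is {rule['comparator']} {rule['threshold']}")
```

Four rules replace what would otherwise be one rule per host per metric.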



Each of these alerting queries defines a meaningful grouping based on how a service behaves, rather than on the individual components running within it.

Why Metric Structure Still Matters

These groupings work well because the metric structure is consistent. Each segment in the metric path has a defined meaning:

<host>.<service>.<signal>.<metric>.<stat>

Because of this, wildcard queries can target specific dimensions such as service, signal, or measurement type without ambiguity.

It's not always possible to keep system metrics at a consistent depth, but doing so where you can makes alert grouping more reliable and alert definitions easier to maintain. For more details on this concept, please see our related article on Metric Naming Best Practices.
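The depth requirement matters because in a Graphite path, `*` matches exactly one dot-delimited segment, never a dot itself. A minimal sketch of that matching behavior (a simplified matcher, not Graphite's actual implementation, which also supports `{a,b}` and `[0-9]` forms):

```python
from fnmatch import fnmatchcase

def graphite_match(pattern, metric):
    """Match a Graphite-style pattern segment by segment,
    so '*' matches one path segment and never crosses a dot."""
    p_segs, m_segs = pattern.split("."), metric.split(".")
    if len(p_segs) != len(m_segs):
        return False
    return all(fnmatchcase(m, p) for p, m in zip(p_segs, m_segs))

# A consistent 5-level hierarchy keeps the grouping unambiguous:
print(graphite_match("*.lb.*.requests.rate",
                     "web-01.lb.worker.requests.rate"))  # True
# A metric emitted at a different depth silently falls out of the grouping:
print(graphite_match("*.lb.*.requests.rate",
                     "web-01.lb.requests.rate"))         # False
```

This is why an inconsistent hierarchy is dangerous: the mismatched metric doesn't cause an error, it just stops being alerted on.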

3: Expanding This Pattern Across Services

The load balancing layer is a good starting point because the signals are easy to reason about, but the same pattern becomes more valuable as you apply it across multiple services.

In a typical system, requests move through several layers. In this example, we defined metrics for a load balancer, an API layer, background job processing, and a database. Each service exposes a different set of signals, but they all follow a similar structure and can be grouped using the same approach.

For the API layer, request throughput and latency become the primary signals:

*.api.*.requests.rate
*.*.latency.p95.value

For background processing, queue depth and job throughput describe system behavior:

*.jobs.queue.depth.value
*.jobs.*.processed.total.rate

For the database, query rate and latency indicate load and performance:

*.db.*.queries.total.rate
*.db.*.latency.p95.value

And across all services, application health metrics and system performance metrics (from agents like Telegraf) provide a consistent view of availability and resource usage:

*.*.health.status.value
telegraf.*.mem.used_percent

Each of these queries groups metrics in a way that reflects how engineers think about the system. Instead of focusing on individual time series, you are evaluating behavior across a service or a signal type. When these alerts are configured, they behave consistently across services. A latency alert does not need to be redefined for each layer. A health alert does not need to be duplicated per host. The same wildcard pattern applies, and the alert output identifies which service and which host is affected.
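The "no need to redefine per layer" point can be sketched with a single latency rule applied across every service. The metric names and values are illustrative; the rule identifies the affected host and service directly from the offending path:

```python
# Sketch: one latency rule spanning all layers; the offending metric
# path itself identifies the affected host and service.
latest = {
    "web-01.lb.latency.p95.value":  0.12,
    "web-01.api.latency.p95.value": 0.87,  # degraded
    "db-01.db.latency.p95.value":   0.05,
}

PATTERN_DEPTH = 5   # <host>.<service>.<signal>.<metric>.<stat>
THRESHOLD = 0.5     # seconds, example value

offenders = []
for metric, value in latest.items():
    parts = metric.split(".")
    # Matches the grouping *.*.latency.p95.value at the expected depth
    if len(parts) == PATTERN_DEPTH and parts[2:] == ["latency", "p95", "value"]:
        if value > THRESHOLD:
            host, service = parts[0], parts[1]
            offenders.append((host, service, value))

print(offenders)  # [('web-01', 'api', 0.87)]
```

Detection and diagnosis collapse into one step: the alert fires once, and its output already says which layer and which host to look at.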

This is where the structure from the previous article becomes important. Because each metric follows the same hierarchy, you can scale this approach across the entire system without introducing additional complexity.

At this point, dashboards and alerts start to align. In Grafana, each service can be visualized with panels for throughput, latency, and capacity. In alerting, those same groupings become the basis for service-level alerts. The result is a monitoring model where:

  • dashboards show how services behave over time
  • alerts indicate when that behavior crosses a threshold
  • returned metrics identify the specific signals responsible

To make this more concrete, the following dashboard example shows how these groupings look in practice, across multiple hosts/services:



This makes it much easier to move from detection to diagnosis without switching context or manually correlating multiple alerts.

4: Applying These Concepts in MetricFire’s Hosted Graphite

In Hosted Graphite, this approach maps directly to how alerts are configured.

Wildcard queries are used to define the scope of the alert, whether that is a specific service or a signal type across multiple services. Alert conditions are applied to those grouped metrics, and when triggered, the alert returns only the metrics that exceed the defined threshold. This reduces the number of alert rules required and makes alerts easier to interpret when they fire. In practice, this means:

  • one alert for LB throughput
  • one alert for latency across all services (api, cache, DB)
  • one alert for memory usage across all hosts
  • one alert for service health

So instead of managing dozens of individual alerts, you define a small number of grouped alerts that reflect how the system actually behaves.



Once defined, these alerts can be routed through notification channels such as email, Slack, PagerDuty, Microsoft Teams, or webhooks. The delivery mechanism remains the same, but the signal quality improves significantly.

At MetricFire, we ran into this exact issue within our own alerting infrastructure, specifically in a routing layer responsible for handling incoming traffic. Multiple components within that layer could fail at the same time and each one generated its own alert, resulting in a cluster of alerts that all pointed to the same underlying issue.

After restructuring our metrics to follow a more consistent service and signal hierarchy, we were able to group those components under a single "service-level" alert. That made it possible to evaluate the health of the service as a whole and then identify which specific signals were responsible when something went wrong.

When working in Hosted Graphite, this structure translates directly into how you build dashboards and alerts. Queries become simpler because they rely on predictable paths, and alert rules can operate across a service rather than being tied to individual metrics. In the next articles, we will detail how to create efficient alerts with targeted wildcard groupings, and use Composite Alerts to add conditional AND/OR logic.

Conclusion

Metric-level alerting becomes noisy because it treats each signal independently, without considering how those signals relate to a service.

By grouping Graphite metrics using wildcard queries, alerts can be defined around meaningful dimensions such as throughput, latency, and health. Returning only the offending metrics adds the context needed to quickly understand what changed. The result is fewer alerts, clearer signals, and faster response.

In the next article, we build on this by introducing composite alerting, where multiple signals are evaluated together using conditional logic (AND/OR) to detect more complex system behavior.

