How Graphite achieves this
Graphite's percentile functions (nPercentile, percentileOfSeries) compute p50/p95/p99 inference latency across all replicas in a single query, with no recording rules needed
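As a sketch of what such an ad-hoc query looks like, the snippet below builds a Graphite `/render` URL wrapping every replica's latency series in `percentileOfSeries`. The base URL and the `inference.replica.*.latency_ms` wildcard path are assumptions for illustration, not defaults shipped by vLLM or TGI:

```python
from urllib.parse import urlencode

def render_url(base, target, frm="-5min", fmt="json"):
    """Build a Graphite /render URL for an ad-hoc query (no recording rules)."""
    return f"{base}/render?{urlencode({'target': target, 'from': frm, 'format': fmt})}"

# p95 across every replica's latency series, computed at query time
# (hypothetical replica path; adjust to your naming scheme)
url = render_url("https://graphite.example.com",
                 "percentileOfSeries(inference.replica.*.latency_ms, 95)")
```

Swapping the `95` for `50` or `99` yields the other percentiles from the same series set.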
vLLM and TGI metrics ship directly over the Graphite plaintext or StatsD protocol, with no exporter chain and no extra infrastructure
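The plaintext protocol is just `path value timestamp\n` over TCP, which is why no exporter is needed. A minimal sketch (host, port, and metric path are placeholders; production setups usually batch many lines per connection):

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp\\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"

def send_metric(host, port, path, value):
    # One TCP write per metric; real emitters batch lines and reuse the socket
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))
```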
Graphite alert expressions fire on composite conditions such as KV-cache usage above 90% AND queue depth rising, yielding precise scaling signals with near-zero false positives
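To make the composite condition concrete, here is an illustrative evaluation in Python over recent datapoints (as you would receive them from the `/render` JSON API). This is a sketch of the logic, not the alerting engine's own syntax:

```python
def should_scale(kv_cache_pct, queue_depths):
    """Composite scaling signal: KV-cache above 90% AND queue depth rising.

    Both arguments are lists of recent datapoints, oldest first.
    """
    cache_hot = kv_cache_pct[-1] > 90
    queue_rising = len(queue_depths) >= 2 and queue_depths[-1] > queue_depths[0]
    return cache_hot and queue_rising
```

Requiring both conditions is what suppresses false positives: a hot cache with a draining queue, or a growing queue with cache headroom, does not trigger scaling.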
Token throughput correlated with GPU cost per hour using Graphite's divideSeries function gives you a live cost-per-token dashboard
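The correlation fits in a single dashboard target. A sketch, assuming a `gpu.cost_usd_per_hour` series exists and `inference.token_throughput` is reported in tokens per second (so `scale()` converts it to tokens per hour before dividing):

```
divideSeries(gpu.cost_usd_per_hour, scale(inference.token_throughput, 3600))
```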
MetricFire includes pre-built Grafana dashboards for LLM inference: token throughput, KV-cache saturation, queue depth, and p99 latency panels are ready on day one, with no dashboard configuration needed
Graphite metrics collected
gpu.{id}.utilization_pct
gpu.{id}.fb_used_mib
inference.token_throughput
inference.kv_cache_usage_pct
inference.queue_depth
inference.p99_latency_ms
inference.requests_per_sec
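The metrics above can also be pushed over StatsD instead of the plaintext protocol. A minimal gauge sender sketch (the host/port defaults and the class itself are illustrative; established StatsD client libraries are the usual choice in production):

```python
import socket

class StatsdGauge:
    """Minimal StatsD gauge sender over UDP (sketch only)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def format(self, path, value):
        # StatsD gauge wire format: "<path>:<value>|g"
        return f"{path}:{value}|g"

    def gauge(self, path, value):
        # Fire-and-forget UDP datagram; no connection or ack
        self.sock.sendto(self.format(path, value).encode("ascii"), self.addr)

line = StatsdGauge().format("inference.kv_cache_usage_pct", 73.5)
# "inference.kv_cache_usage_pct:73.5|g"
```

A real emitter would call `gauge()` on a timer loop for each of the paths listed above.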
Self-hosted pain solved
✕Multi-replica federation gaps cause blind spots across inference nodes → a single Graphite endpoint receives metrics from all replicas simultaneously
✕Complex composite alert conditions are hard to maintain in legacy stacks → Graphite alert expressions are simple path-based functions
✕No native percentile aggregation without recording rules → Graphite computes percentiles on the fly at query time