
LLM Inference at Scale

Token throughput · KV-cache · Queue depth · p99 latency

How Graphite achieves this
Graphite's percentile functions (nPercentile, percentileOfSeries) compute p50/p95/p99 inference latency across all replicas in a single query, with no recording rules needed
vLLM and TGI metrics ship directly over the Graphite plaintext or StatsD protocol: no exporter chain, no extra infrastructure
Graphite alert expressions fire on composite conditions (KV-cache above 90% AND queue depth rising), giving precise scaling signals with near-zero false positives
Token throughput is correlated with GPU $/hr using Graphite's divideSeries function, powering a live cost-per-token dashboard
MetricFire includes pre-built Grafana dashboards for LLM inference: token throughput, KV-cache saturation, queue depth, and p99 latency panels, ready on day one with no dashboard configuration needed
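As a sketch of how the fleet-wide percentile query above could be issued against Graphite's render API (the endpoint and the metric path are illustrative assumptions, not values from the product):

```python
from urllib.parse import urlencode

# Hypothetical Hosted Graphite render endpoint; substitute your own.
GRAPHITE_URL = "https://graphite.example.com/render"

def percentile_target(pattern: str, pct: int) -> str:
    # percentileOfSeries() folds every series matching the wildcard pattern
    # into one percentile series, so no recording rules are needed.
    return f"percentileOfSeries({pattern}, {pct})"

def render_url(target: str, window: str = "-1h") -> str:
    # The render API returns datapoints for any target expression.
    return f"{GRAPHITE_URL}?{urlencode({'target': target, 'from': window, 'format': 'json'})}"

p99 = percentile_target("inference.replica.*.latency_ms", 99)
url = render_url(p99)
```

Swapping 99 for 50 or 95 gives p50/p95 from the same one-line pattern.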
Graphite metrics collected
gpu.{id}.utilization_pct
gpu.{id}.fb_used_mib
inference.token_throughput
inference.kv_cache_usage_pct
inference.queue_depth
inference.p99_latency_ms
inference.requests_per_sec
Self-hosted pain solved
Multi-replica federation gaps cause blind spots across inference nodes → a single Graphite endpoint receives metrics from all replicas simultaneously
Complex composite alert conditions are hard to maintain in legacy stacks → Graphite alert expressions are simple path-based functions
No native percentile aggregation without recording rules → Graphite computes percentiles on the fly at query time
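The composite scaling condition described above (KV-cache above 90% AND queue depth rising) can be sketched as plain logic; the threshold and the definition of "rising" here are illustrative assumptions:

```python
def queue_rising(samples: list[float]) -> bool:
    # "Rising": monotonically non-decreasing with a net increase
    # over the evaluation window.
    return samples[-1] > samples[0] and all(
        later >= earlier for earlier, later in zip(samples, samples[1:])
    )

def should_scale(kv_cache_pct: float, queue_samples: list[float]) -> bool:
    # Both signals must agree before firing, which is what suppresses
    # false-positive scale events from a single noisy metric.
    return kv_cache_pct > 90.0 and queue_rising(queue_samples)
```

For example, `should_scale(93.0, [4, 5, 7, 9])` fires, while a saturated cache with a draining queue, `should_scale(93.0, [9, 7, 5, 4])`, does not.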
Graphite value: Inference teams get SLA-correlated Graphite dashboards in MetricFire's hosted Grafana in under an hour. Composite alerting reduces false-positive scale events, saving meaningful GPU-hours and keeping latency SLOs visible to the whole business.
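The cost-per-token correlation via divideSeries could be expressed as a target like the one built below; the metric paths are illustrative assumptions, while divideSeries, scale, and sumSeries are standard Graphite functions:

```python
def cost_per_token_target(cost_per_hour: str, tokens_per_sec: str) -> str:
    # scale(..., 3600) converts tokens/sec into tokens/hour, so dividing
    # $/hour by tokens/hour yields a live $/token series.
    return (
        f"divideSeries(sumSeries({cost_per_hour}), "
        f"scale(sumSeries({tokens_per_sec}), 3600))"
    )

target = cost_per_token_target("gpu.*.cost_per_hour", "inference.token_throughput")
```

Plotting this one derived series alongside p99 latency is what turns GPU spend into a dashboard-level number.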

GPU Monitoring Use Cases
Explore other use cases

MetricFire's Hosted Graphite covers every GPU workload. See how it fits your team's specific challenge.