
LLM Inference at Scale

Token throughput · KV-cache · Queue depth · p99 latency

How Graphite achieves this
Graphite's percentile functions (nPercentile, percentileOfSeries) compute p50/p95/p99 inference latency across all replicas in a single query, with no recording rules needed
vLLM and TGI metrics ship directly over the Graphite plaintext or StatsD protocol: no exporter chain, no extra infrastructure
Graphite alert expressions fire on composite conditions (KV-cache above 90% AND queue depth rising), giving precise scaling signals with near-zero false positives
Token throughput is correlated with GPU $/hr using Graphite's divideSeries function, powering a live cost-per-token dashboard
MetricFire includes pre-built Grafana dashboards for LLM inference: token throughput, KV-cache saturation, queue depth, and p99 latency panels, ready on day one with no dashboard configuration needed
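As a sketch of how the fleet-wide percentile query above could be issued against Graphite's render API (the endpoint and the metric path are illustrative assumptions, not values from the product):

```python
from urllib.parse import urlencode

# Hypothetical Hosted Graphite render endpoint; substitute your own.
GRAPHITE_URL = "https://graphite.example.com/render"

def percentile_target(pattern: str, pct: int) -> str:
    # percentileOfSeries() folds every series matching the wildcard pattern
    # into one percentile series, so no recording rules are needed.
    return f"percentileOfSeries({pattern}, {pct})"

def render_url(target: str, window: str = "-1h") -> str:
    # The render API returns datapoints for any target expression.
    return f"{GRAPHITE_URL}?{urlencode({'target': target, 'from': window, 'format': 'json'})}"

p99 = percentile_target("inference.replica.*.latency_ms", 99)
url = render_url(p99)
```

Swapping 99 for 50 or 95 gives p50/p95 from the same one-line pattern.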
Graphite metrics collected
gpu.{id}.utilization_pct
gpu.{id}.fb_used_mib
inference.token_throughput
inference.kv_cache_usage_pct
inference.queue_depth
inference.p99_latency_ms
inference.requests_per_sec
Self-hosted pain solved
Multi-replica federation gaps cause blind spots across inference nodes → a single Graphite endpoint receives metrics from all replicas simultaneously
Complex composite alert conditions are hard to maintain in legacy stacks → Graphite alert expressions are simple path-based functions
No native percentile aggregation without recording rules → Graphite computes percentiles on the fly at query time
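The composite scaling condition described above (KV-cache above 90% AND queue depth rising) can be sketched as plain logic; the threshold and the definition of "rising" here are illustrative assumptions:

```python
def queue_rising(samples: list[float]) -> bool:
    # "Rising": monotonically non-decreasing with a net increase
    # over the evaluation window.
    return samples[-1] > samples[0] and all(
        later >= earlier for earlier, later in zip(samples, samples[1:])
    )

def should_scale(kv_cache_pct: float, queue_samples: list[float]) -> bool:
    # Both signals must agree before firing, which is what suppresses
    # false-positive scale events from a single noisy metric.
    return kv_cache_pct > 90.0 and queue_rising(queue_samples)
```

For example, `should_scale(93.0, [4, 5, 7, 9])` fires, while a saturated cache with a draining queue, `should_scale(93.0, [9, 7, 5, 4])`, does not.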
Graphite value: Inference teams get SLA-correlated Graphite dashboards in MetricFire's hosted Grafana in under an hour. Composite alerting reduces false-positive scale events, saving meaningful GPU-hours and keeping latency SLOs visible to the whole business.
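The cost-per-token correlation via divideSeries could be expressed as a target like the one built below; the metric paths are illustrative assumptions, while divideSeries, scale, and sumSeries are standard Graphite functions:

```python
def cost_per_token_target(cost_per_hour: str, tokens_per_sec: str) -> str:
    # scale(..., 3600) converts tokens/sec into tokens/hour, so dividing
    # $/hour by tokens/hour yields a live $/token series.
    return (
        f"divideSeries(sumSeries({cost_per_hour}), "
        f"scale(sumSeries({tokens_per_sec}), 3600))"
    )

target = cost_per_token_target("gpu.*.cost_per_hour", "inference.token_throughput")
```

Plotting this one derived series alongside p99 latency is what turns GPU spend into a dashboard-level number.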

GPU Monitoring Use Cases
Explore other use cases

MetricFire's Hosted Graphite covers every GPU workload. See how it fits your team's specific challenge.