ML Training

Large Model Training Runs


How Graphite achieves this
Graphite's high-resolution Whisper storage captures every GPU counter at 10-second resolution, with no downsampling during the run
Graphite functions such as averageSeries and maxSeries let you aggregate per-GPU utilization across a full multi-node job in a single query
Hierarchical metric paths (gpu.{jobid}.{nodeid}.utilization) give instant per-job, per-node drill-down without complex query syntax
Custom training-loss and tokens-per-second metrics ship via StatsD or the Graphite plaintext protocol and land in the same MetricFire-hosted Grafana dashboard as the hardware counters
MetricFire includes pre-built Grafana dashboards for ML training runs: job-level GPU utilization, memory, and NVLink bandwidth panels, ready on day one with no dashboard configuration needed
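The Graphite plaintext protocol mentioned above is a one-line-per-datapoint TCP format: "metric.path value unix-timestamp". A minimal sketch of shipping a training metric this way, assuming a hypothetical Carbon endpoint (host, port, and metric values are placeholders, not real MetricFire addresses):

```python
import socket
import time

# Assumed endpoint: replace with your own hosted Graphite address.
CARBON_HOST = "carbon.example.com"
CARBON_PORT = 2003  # conventional Graphite plaintext port

def format_metric(path, value, timestamp=None):
    """Build one line of the Graphite plaintext protocol:
    '<metric.path> <value> <unix-timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

def send_metric(path, value):
    """Ship a single datapoint over TCP to the Carbon listener."""
    line = format_metric(path, value)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example datapoint (illustrative values only):
print(format_metric("training.tokens_per_sec", 41250.0, 1700000000))
# -> "training.tokens_per_sec 41250.0 1700000000\n"
```

The same line format works for the per-GPU hardware counters, so a training loop can emit loss and throughput next to the metrics an agent is already collecting.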
Graphite metrics collected
gpu.{id}.utilization_pct
gpu.{id}.fb_used_mib
gpu.{id}.nvlink_bandwidth
gpu.{id}.sm_clock_mhz
gpu.{id}.temp_c
gpu.{id}.power_watts
training.tokens_per_sec
training.loss
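Aggregating these paths across a whole job is a single Graphite render-API query: a wildcard expands to every node/GPU under the job, and averageSeries collapses them into one series. A sketch of building such a query, assuming a hypothetical render endpoint and the gpu.{jobid}.*.utilization_pct naming scheme shown above (both are illustrative assumptions):

```python
from urllib.parse import urlencode

# Assumed endpoint: replace with your hosted Graphite render URL.
RENDER_URL = "https://graphite.example.com/render"

def job_gpu_utilization_target(job_id):
    """Average GPU utilization across every node/GPU in one job;
    the * wildcard matches all GPUs under that job's path prefix."""
    return f"averageSeries(gpu.{job_id}.*.utilization_pct)"

def render_query_url(target, frm="-1h"):
    """URL for the Graphite render API, returning JSON datapoints."""
    params = urlencode({"target": target, "from": frm, "format": "json"})
    return f"{RENDER_URL}?{params}"

print(job_gpu_utilization_target("job42"))
# -> averageSeries(gpu.job42.*.utilization_pct)
```

Swapping averageSeries for maxSeries on gpu.{jobid}.*.temp_c gives the hottest GPU in the job, which is the query behind a thermal-throttling panel.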
Self-hosted pain solved
Per-GPU-per-job label cardinality causes storage explosions in legacy TSDBs → Graphite's hierarchical paths scale without cardinality limits
Short retention windows mean losing a run's trace after 15 days → Graphite stores every training run indefinitely for cross-run efficiency comparisons
No way to correlate hardware metrics with training metrics in one place → Graphite unifies both in a single data store
Graphite value: ML teams using MetricFire typically recover 15–25% of wasted GPU-hours in the first 30 days by catching idle nodes, thermal throttling, and memory misconfigurations that cardinality-limited stacks routinely miss. All of these are surfaced in MetricFire's pre-built Grafana dashboards from day one.

GPU Monitoring Use Cases
Explore other use cases

MetricFire's Hosted Graphite covers every GPU workload. See how it fits your team's specific challenge.