ML Training

Large Model Training Runs


How Graphite achieves this
Graphite's high-resolution Whisper storage captures every GPU counter at 10-second resolution, with no downsampling during the run
Graphite functions such as averageSeries and maxSeries let you aggregate per-GPU utilization across a full multi-node job in a single query
Hierarchical metric paths (gpu.{jobid}.{nodeid}.utilization) give instant per-job, per-node drill-down without complex query syntax
Custom training-loss and tokens-per-second metrics ship via StatsD or the Graphite plaintext protocol and land in the same MetricFire-hosted Grafana dashboard as the hardware counters
MetricFire includes pre-built Grafana dashboards for ML training runs: job-level GPU utilization, memory, and NVLink bandwidth panels, ready on day one with no dashboard configuration needed
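The Graphite plaintext protocol mentioned above is a one-line-per-datapoint TCP format: "metric.path value unix-timestamp". A minimal sketch of shipping a training metric this way, assuming a hypothetical Carbon endpoint (host, port, and metric values are placeholders, not real MetricFire addresses):

```python
import socket
import time

# Assumed endpoint: replace with your own hosted Graphite address.
CARBON_HOST = "carbon.example.com"
CARBON_PORT = 2003  # conventional Graphite plaintext port

def format_metric(path, value, timestamp=None):
    """Build one line of the Graphite plaintext protocol:
    '<metric.path> <value> <unix-timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

def send_metric(path, value):
    """Ship a single datapoint over TCP to the Carbon listener."""
    line = format_metric(path, value)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example datapoint (illustrative values only):
print(format_metric("training.tokens_per_sec", 41250.0, 1700000000))
# -> "training.tokens_per_sec 41250.0 1700000000\n"
```

The same line format works for the per-GPU hardware counters, so a training loop can emit loss and throughput next to the metrics an agent is already collecting.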
Graphite metrics collected
gpu.{id}.utilization_pct
gpu.{id}.fb_used_mib
gpu.{id}.nvlink_bandwidth
gpu.{id}.sm_clock_mhz
gpu.{id}.temp_c
gpu.{id}.power_watts
training.tokens_per_sec
training.loss
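Aggregating these paths across a whole job is a single Graphite render-API query: a wildcard expands to every node/GPU under the job, and averageSeries collapses them into one series. A sketch of building such a query, assuming a hypothetical render endpoint and the gpu.{jobid}.*.utilization_pct naming scheme shown above (both are illustrative assumptions):

```python
from urllib.parse import urlencode

# Assumed endpoint: replace with your hosted Graphite render URL.
RENDER_URL = "https://graphite.example.com/render"

def job_gpu_utilization_target(job_id):
    """Average GPU utilization across every node/GPU in one job;
    the * wildcard matches all GPUs under that job's path prefix."""
    return f"averageSeries(gpu.{job_id}.*.utilization_pct)"

def render_query_url(target, frm="-1h"):
    """URL for the Graphite render API, returning JSON datapoints."""
    params = urlencode({"target": target, "from": frm, "format": "json"})
    return f"{RENDER_URL}?{params}"

print(job_gpu_utilization_target("job42"))
# -> averageSeries(gpu.job42.*.utilization_pct)
```

Swapping averageSeries for maxSeries on gpu.{jobid}.*.temp_c gives the hottest GPU in the job, which is the query behind a thermal-throttling panel.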
Self-hosted pain solved
Per-GPU-per-job label cardinality causes storage explosions in legacy TSDBs → Graphite's hierarchical paths scale without cardinality limits
Short retention windows mean losing a run's trace after 15 days → Graphite stores every training run indefinitely for cross-run efficiency comparisons
No way to correlate hardware metrics with training metrics in one place → Graphite unifies both in a single data store
Graphite value: ML teams using MetricFire typically recover 15–25% of wasted GPU-hours in the first 30 days by catching idle nodes, thermal throttling, and memory misconfigurations that cardinality-limited stacks routinely miss. All of these are surfaced in MetricFire's pre-built Grafana dashboards from day one.

GPU Monitoring Use Cases
Explore other use cases

MetricFire's Hosted Graphite covers every GPU workload. See how it fits your team's specific challenge.