How Graphite achieves this
Graphite's high-resolution Whisper storage captures every GPU counter at 10-second resolution, with no downsampling during the run
Graphite functions (averageSeries, maxSeries) let you aggregate per-GPU utilisation across a full multi-node job in a single query
Hierarchical metric paths (gpu.{jobid}.{nodeid}.utilization) give instant per-job, per-node drill-down without complex query syntax
Custom training-loss and tokens/s metrics ship via StatsD or the Graphite plaintext protocol and land in the same MetricFire-hosted Grafana dashboards as the hardware counters
MetricFire includes pre-built Grafana dashboards for ML training runs: job-level GPU utilisation, memory, and NVLink bandwidth panels, ready on day one with no dashboard configuration needed
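As a concrete illustration of the plaintext protocol mentioned above, here is a minimal Python sketch that formats and batches metric lines. The host, port, and timestamp values are placeholders, not real MetricFire endpoints; substitute your own carbon address.

```python
import socket
import time

# Placeholder endpoint — replace with your MetricFire carbon host/port.
CARBON_HOST = "carbon.example.metricfire.com"
CARBON_PORT = 2003

def format_plaintext(path, value, timestamp=None):
    """Render one metric in Graphite's plaintext protocol:
    '<path> <value> <unix_timestamp>\n'."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"

def send_metrics(lines):
    """Open a TCP connection to carbon and ship a batch of lines."""
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))

# Ship a training step's loss and throughput alongside a GPU counter,
# so hardware and training metrics land in the same store.
batch = [
    format_plaintext("training.loss", 2.41, 1700000000),
    format_plaintext("training.tokens_per_sec", 18500, 1700000000),
    format_plaintext("gpu.0.utilization_pct", 97.3, 1700000000),
]
# send_metrics(batch)  # uncomment once CARBON_HOST points at your endpoint
```

Because each line carries its own timestamp, batching many metrics into one TCP write is safe and keeps per-step overhead negligible.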
Graphite metrics collected
gpu.{id}.utilization_pct
gpu.{id}.fb_used_mib
gpu.{id}.nvlink_bandwidth
gpu.{id}.sm_clock_mhz
gpu.{id}.temp_c
gpu.{id}.power_watts
training.tokens_per_sec
training.loss
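The paths above can be aggregated with the Graphite functions mentioned earlier (averageSeries, maxSeries) through the render API. A hedged sketch, assuming a hypothetical Graphite base URL; the wildcard in `gpu.*` expands across all GPU ids:

```python
from urllib.parse import urlencode

# Hypothetical render endpoint — substitute your hosted Graphite URL.
RENDER_URL = "https://graphite.example.metricfire.com/render"

def render_query(target, frm="-1h", fmt="json"):
    """Build a Graphite render-API URL for one target expression."""
    return f"{RENDER_URL}?{urlencode({'target': target, 'from': frm, 'format': fmt})}"

# Mean utilisation across every GPU in the job, collapsed to one series:
url = render_query("averageSeries(gpu.*.utilization_pct)")

# Hottest card at any instant — useful for spotting straggler GPUs:
worst = render_query("maxSeries(gpu.*.temp_c)")
```

The same pattern extends to per-job drill-down by narrowing the wildcard, e.g. `averageSeries(gpu.{jobid}.*.utilization)` for one job's nodes.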
Self-hosted pain solved
✕ Per-GPU-per-job label cardinality causes storage explosions in legacy TSDBs → Graphite paths scale without cardinality limits
✕ Short retention windows discard a run's metrics after 15 days → Graphite stores every training run indefinitely for cross-run efficiency comparisons
✕ No way to correlate hardware metrics with training metrics in one place → Graphite unifies both in a single data store
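The resolution and retention behavior described above is governed by Whisper's storage-schemas.conf. A sketch of what a matching configuration could look like (section names and retention tiers are illustrative, not a prescribed MetricFire default): raw 10-second points are kept for a window well beyond a typical run before any downsampling applies.

```ini
# storage-schemas.conf — first matching pattern wins
[gpu_metrics]
pattern = ^gpu\.
retentions = 10s:30d,1m:180d,10m:5y

[training_metrics]
pattern = ^training\.
retentions = 10s:30d,1m:5y
```

Each retention entry is `<resolution>:<duration>`; older data rolls up into the coarser archives according to storage-aggregation rules.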