How Graphite achieves this
Graphite's tagged metric paths carry the SLURM job ID, queue, and user, so every GPU counter is automatically attributed to its experiment for as long as the data is retained
Graphite's long-term Whisper storage retains multi-year job history at full resolution, so you can compare simulation efficiency across code versions released months apart
ECC error rates graphed with movingAverage() and holtWintersForecast() drive predictive hardware-failure alerts before jobs are corrupted
Ratio functions over NVLink and PCIe throughput identify memory-bus bottlenecks that starve CUDA kernels, with no custom dashboards required
MetricFire includes pre-built Grafana dashboards for HPC workloads: ECC error tracking, SM occupancy, memory bandwidth, and job attribution panels, ready on day one with no dashboard configuration needed
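The function-based checks above can be sketched as Graphite render API targets. This is a minimal illustration, not MetricFire's own configuration: the endpoint `graphite.example.com` and the helper names are hypothetical, and it assumes the `*_total` and `pcie_tx_bytes` series are monotonically increasing counters, which is why they are converted to rates first.

```python
from urllib.parse import urlencode

# Hypothetical Graphite-compatible endpoint; replace with your own.
GRAPHITE_URL = "https://graphite.example.com"


def ecc_targets(gpu_id, window="1hour"):
    """Build smoothed and forecast targets for single-bit ECC errors.

    Assumes ecc_sbe_total is a cumulative counter, so it is turned
    into a rate with nonNegativeDerivative() before smoothing.
    """
    rate = f"nonNegativeDerivative(gpu.{gpu_id}.ecc_sbe_total)"
    return [
        f'movingAverage({rate},"{window}")',
        f"holtWintersForecast({rate})",
    ]


def bus_ratio_target(gpu_id):
    """Build an NVLink-to-PCIe throughput ratio target.

    Assumes both series are byte counters; perSecond() converts
    them to rates so the ratio compares like with like.
    """
    nvlink = f"perSecond(gpu.{gpu_id}.nvlink_bw_total)"
    pcie = f"perSecond(gpu.{gpu_id}.pcie_tx_bytes)"
    return f"divideSeries({nvlink},{pcie})"


def render_url(targets, since="-7d"):
    """Assemble a /render URL that returns JSON for the given targets."""
    params = [("target", t) for t in targets]
    params += [("from", since), ("format", "json")]
    return f"{GRAPHITE_URL}/render?{urlencode(params)}"


if __name__ == "__main__":
    print(render_url(ecc_targets(0) + [bus_ratio_target(0)]))
```

The resulting URL can be fetched with any HTTP client or pasted into a Grafana panel's query editor as individual targets. Whether a forecast band or a fixed threshold makes the better alert condition depends on how noisy ECC counts are on your hardware.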
Graphite metrics collected
gpu.{id}.ecc_sbe_total
gpu.{id}.ecc_dbe_total
gpu.{id}.nvlink_bw_total
gpu.{id}.mem_copy_util
gpu.{id}.sm_occupancy
gpu.{id}.pcie_tx_bytes
job.{id}.wall_time_sec
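As a rough sketch of how one of these metrics could be tagged with job context, the snippet below formats a data point in Graphite's tagged plaintext format (`path;tag=value value timestamp`). The tag names `job_id`, `queue`, and `user`, the host, and the helper functions are assumptions for illustration; a hosted service may also require an API key or a different ingestion endpoint.

```python
import socket
import time


def format_tagged_metric(path, value, timestamp, tags):
    """Return one line of Graphite's plaintext protocol with tags.

    Tags are appended to the path as ;key=value pairs, sorted by
    key so the same inputs always produce the same line.
    """
    tag_part = "".join(f";{key}={tags[key]}" for key in sorted(tags))
    return f"{path}{tag_part} {value} {timestamp}\n"


def send_lines(lines, host, port=2003, timeout=5.0):
    """Send preformatted lines to a Carbon plaintext listener.

    Port 2003 is Carbon's default plaintext port; the host is
    whatever receiver you run (hypothetical in this sketch).
    """
    payload = "".join(lines).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)


if __name__ == "__main__":
    # Hypothetical SLURM context; in a real job these would come
    # from environment variables such as SLURM_JOB_ID.
    tags = {"job_id": "123456", "queue": "gpu", "user": "alice"}
    line = format_tagged_metric("gpu.0.ecc_sbe_total", 3, int(time.time()), tags)
    print(line, end="")
    # send_lines([line], host="carbon.example.com")  # uncomment with a real host
```

Keeping the job context in tags rather than in the dotted path means the metric names listed above stay stable across jobs, and a single job's series can be selected later by tag.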
Self-hosted pain solved
✕ 15–30-day rolling retention windows wipe experiment traces → Graphite retains every job indefinitely
✕ No job-level attribution without heavy label taxonomy → Graphite metric paths encode job context natively
✕ ECC errors often missed without custom exporters → MetricFire collects all DCGM ECC counters by default