🔬
HPC / Research

HPC & Scientific Computing

memory bandwidth · SM occupancy · ECC errors · NVLink

How Graphite achieves this
Graphite's tagged metric paths carry the SLURM job ID, queue, and user, so every GPU counter is automatically attributed to its experiment, permanently
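As an illustration of what job-scoped paths can look like, here is a minimal Python sketch that reads the SLURM environment and pushes one counter over carbon's plaintext protocol. The hostname, path scheme, and metric value are assumptions for the example, not MetricFire's actual naming convention.

```python
import os
import socket
import time

# Assumed endpoint and carbon's default plaintext port (2003).
CARBON_HOST = "carbon.example.com"
CARBON_PORT = 2003

# Standard SLURM environment variables set inside a job allocation.
job_id = os.environ.get("SLURM_JOB_ID", "unknown")
queue = os.environ.get("SLURM_JOB_PARTITION", "unknown")
user = os.environ.get("USER", "unknown")

def send_metric(path: str, value: float) -> None:
    """Send one 'path value timestamp' line to carbon."""
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
        sock.sendall(line.encode("ascii"))

# Illustrative path scheme: queue, user, and job ID baked into the path,
# e.g. hpc.gpuq.alice.job.1234.gpu.0.sm_occupancy
path = f"hpc.{queue}.{user}.job.{job_id}.gpu.0.sm_occupancy"
send_metric(path, 0.82)  # placeholder occupancy value
```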
Graphite's long-term Whisper storage retains multi-year job history at full resolution, so you can compare simulation efficiency across code versions months apart
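For teams running self-managed Graphite, the equivalent retention rule lives in storage-schemas.conf. This sketch is illustrative only (the pattern and retention tiers are assumptions, and hosted retention policy is set for you): it keeps GPU series at 10-second resolution for 30 days, then at 1-minute resolution for 5 years.

```ini
# storage-schemas.conf sketch (self-managed Graphite).
# Pattern and retention tiers below are illustrative assumptions.
[hpc_gpu]
pattern = ^gpu\.
retentions = 10s:30d,1m:5y
```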
ECC error rates graphed with movingAverage() and holtWintersForecast() drive predictive hardware-failure alerts before a degrading GPU corrupts job output
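A rough sketch of the underlying render-API query, assuming a placeholder endpoint: movingAverage(), perSecond(), and holtWintersForecast() are standard Graphite functions, applied here to the ECC counter paths listed further down.

```python
import requests

RENDER_URL = "https://graphite.example.com/render"  # assumed endpoint

targets = [
    # Per-second ECC double-bit error rate, smoothed over 10 minutes.
    "movingAverage(perSecond(gpu.*.ecc_dbe_total), '10min')",
    # Holt-Winters forecast of the same series; an alert can fire
    # when the actual rate climbs above the forecast.
    "holtWintersForecast(perSecond(gpu.*.ecc_dbe_total))",
]

resp = requests.get(
    RENDER_URL,
    params={"target": targets, "from": "-7d", "format": "json"},
    timeout=30,
)
for series in resp.json():
    print(series["target"], series["datapoints"][-1])
```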
Ratio functions over NVLink and PCIe throughput identify memory-bus bottlenecks that starve CUDA kernels, with no custom dashboards needed
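One way such a ratio might be expressed, assuming both counters are cumulative byte counts: divideSeries() over the per-second rates of the NVLink and PCIe counters.

```python
# Sketch: a single render target graphing the NVLink-to-PCIe throughput
# ratio for one GPU. divideSeries() and perSecond() are standard
# Graphite functions; the paths match the counters listed below.
nvlink_vs_pcie = (
    "divideSeries("
    "perSecond(gpu.0.nvlink_bw_total),"
    "perSecond(gpu.0.pcie_tx_bytes))"
)
# A ratio trending toward zero suggests traffic is falling back to the
# slower PCIe path, i.e. a memory-bus bottleneck starving CUDA kernels.
```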
MetricFire includes pre-built Grafana dashboards for HPC workloads: ECC error tracking, SM occupancy, memory bandwidth, and job attribution panels are ready on day one, with no dashboard configuration needed
Graphite metrics collected
gpu.{id}.ecc_sbe_total
gpu.{id}.ecc_dbe_total
gpu.{id}.nvlink_bw_total
gpu.{id}.mem_copy_util
gpu.{id}.sm_occupancy
gpu.{id}.pcie_tx_bytes
job.{id}.wall_time_sec
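To make those paths concrete, here is a hypothetical collector sketch that polls nvidia-smi for the two ECC aggregate counters and prints carbon plaintext lines under the paths above. NVLink, PCIe, and SM counters would come from DCGM in practice; the polling cadence and output target are assumptions.

```python
import subprocess
import time

# Standard nvidia-smi query fields for aggregate ECC error counts.
QUERY = ("index,"
         "ecc.errors.corrected.aggregate.total,"
         "ecc.errors.uncorrected.aggregate.total")

def poll_ecc() -> None:
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    now = int(time.time())
    for row in out.strip().splitlines():
        gpu_id, sbe, dbe = (field.strip() for field in row.split(","))
        # Carbon plaintext lines: "path value timestamp".
        print(f"gpu.{gpu_id}.ecc_sbe_total {sbe} {now}")
        print(f"gpu.{gpu_id}.ecc_dbe_total {dbe} {now}")

poll_ecc()
```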
Self-hosted pain solved
15–30 day rolling retention windows wipe experiment traces → Graphite retains every job indefinitely
No job-level attribution without heavy label taxonomy → Graphite metric paths encode job context natively
ECC errors often missed without custom exporters → MetricFire collects all DCGM ECC counters by default
Graphite value: HPC teams get experiment-level GPU attribution with years of retention, something short-retention stacks simply cannot sustain. Every job is fully queryable and visible in MetricFire's hosted Grafana with no retention limits. No more "we lost the telemetry for that run."

GPU Monitoring Use Cases
Explore other use cases

MetricFire's Hosted Graphite covers every GPU workload. See how it fits your team's specific challenge.