HPC & Scientific Computing

memory bandwidth, SM occupancy, ECC errors, NVLink

How Graphite achieves this

Graphite's tagged metric paths carry the SLURM job ID, queue, and user, so every GPU counter is automatically and permanently attributed to its experiment.
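As a minimal sketch of what such a tagged path could look like, the snippet below formats one sample using Graphite's tagged-series syntax (`name;tag=value`). The tag names `job_id`, `queue`, and `user`, the sample values, and the helper `tagged_line` are illustrative assumptions, not a documented MetricFire schema.

```python
import time

def tagged_line(name, value, tags, timestamp=None):
    """Format one sample as a Graphite plaintext line with tags.

    Graphite's tagged series syntax is `name;tag1=val1;tag2=val2`.
    The tag names passed in below are hypothetical.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Sort tags so the same tag set always yields the same series name.
    tag_part = "".join(f";{key}={val}" for key, val in sorted(tags.items()))
    return f"{name}{tag_part} {value} {ts}"

line = tagged_line(
    "gpu.0.sm_occupancy",
    0.87,
    {"job_id": "4821", "queue": "gpu-long", "user": "alice"},
    timestamp=1700000000,
)
print(line)
```

Because the job context travels inside the series name itself, a later query can filter on the job tag without a separate lookup table.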

Graphite's long-term Whisper storage retains multi-year job history at full resolution, so you can compare simulation efficiency across code versions released months apart.

ECC error rates are graphed with movingAverage() and holtWintersForecast(), enabling predictive hardware-failure alerts before jobs are corrupted.
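A rough sketch of how these two functions might be combined in a Graphite render request is shown below. The host `graphite.example.com`, the one-hour window, and the use of `perSecond()` to turn the cumulative counter into a rate are assumptions chosen for illustration; the helper names are hypothetical.

```python
from urllib.parse import urlencode

def forecast_targets(counter_path, window="1hour"):
    """Build smoothed and forecast targets for a cumulative error counter."""
    # perSecond() converts an ever-increasing total into a rate (assumed here).
    rate = f"perSecond({counter_path})"
    return [
        f"movingAverage({rate},'{window}')",
        f"holtWintersForecast({rate})",
    ]

def render_url(base_url, targets, since="-7d"):
    """Assemble a render API URL requesting JSON for the given targets."""
    params = [("target", t) for t in targets]
    params += [("from", since), ("format", "json")]
    return f"{base_url}/render?{urlencode(params)}"

targets = forecast_targets("gpu.0.ecc_sbe_total")
# The host name below is a placeholder, not a real endpoint.
url = render_url("https://graphite.example.com", targets)
print(url)
```

An alert rule could then compare the smoothed rate against the forecast band; the thresholds for doing so depend on the hardware and are not specified here.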

Ratio functions over NVLink and PCIe throughput identify memory-bus bottlenecks that starve CUDA kernels, with no custom dashboards required.
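One way such a ratio could be expressed is with Graphite's `divideSeries()` function; the sketch below builds that target and also shows, in plain Python, the pointwise division it corresponds to. The choice of `perSecond()` on both counters, the helper names, and the sample numbers are assumptions for illustration only.

```python
def ratio_target(numerator_path, denominator_path):
    """Build a Graphite target dividing one counter's rate by another's."""
    return (
        f"divideSeries(perSecond({numerator_path}),"
        f"perSecond({denominator_path}))"
    )

def pointwise_ratio(numerator, denominator):
    """Divide two aligned series point by point.

    Missing points and zero divisors yield None rather than raising.
    """
    out = []
    for n, d in zip(numerator, denominator):
        if n is None or d is None or d == 0:
            out.append(None)
        else:
            out.append(n / d)
    return out

target = ratio_target("gpu.0.nvlink_bw_total", "gpu.0.pcie_tx_bytes")
# Made-up sample rates, purely to exercise the function.
ratios = pointwise_ratio([40.0, 10.0, None], [8.0, 0.0, 4.0])
print(target)
print(ratios)
```

What ratio value indicates a starved kernel is workload-specific; this sketch only shows how the series would be derived.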

MetricFire includes pre-built Grafana dashboards for HPC workloads: ECC error tracking, SM occupancy, memory bandwidth, and job attribution panels, ready on day one with no dashboard configuration needed.

Graphite metrics collected

gpu.{id}.ecc_sbe_total
gpu.{id}.ecc_dbe_total
gpu.{id}.nvlink_bw_total
gpu.{id}.mem_copy_util
gpu.{id}.sm_occupancy
gpu.{id}.pcie_tx_bytes
job.{id}.wall_time_sec
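To make the `{id}` placeholders concrete, the following sketch expands the templates above into series names for a node. The two-GPU node and the helper `expand` are hypothetical; only the metric templates come from the list above.

```python
GPU_TEMPLATES = [
    "gpu.{id}.ecc_sbe_total",
    "gpu.{id}.ecc_dbe_total",
    "gpu.{id}.nvlink_bw_total",
    "gpu.{id}.mem_copy_util",
    "gpu.{id}.sm_occupancy",
    "gpu.{id}.pcie_tx_bytes",
]
JOB_TEMPLATES = ["job.{id}.wall_time_sec"]

def expand(templates, ids):
    """Substitute each id into each template, grouped by id."""
    return [t.format(id=i) for i in ids for t in templates]

# Assume a node with two GPUs (ids 0 and 1) running a made-up job id.
gpu_series = expand(GPU_TEMPLATES, range(2))
job_series = expand(JOB_TEMPLATES, ["4821"])
print(gpu_series[0], gpu_series[-1], job_series[0])
```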

Self-hosted pain solved

✕ 15–30-day rolling retention windows wipe experiment traces → Graphite retains every job indefinitely

✕ No job-level attribution without heavy label taxonomy → Graphite metric paths encode job context natively

✕ECC errors often missed without custom exporters → MetricFire collects all DCGM ECC counters by default


GPU Monitoring Use Cases

Explore other use cases

MetricFire's Hosted Graphite covers every GPU workload. See how it fits your team's specific challenge.

🤖 Large Model Training Runs (ML Training)
⚡ LLM Inference at Scale (Inference)
🎮 Cloud Gaming & Video Streaming (Cloud Gaming)
💰 GPU Cost Attribution & Optimization (FinOps)
🛡️ GPU Fleet Health & SRE (Platform Ops / SRE)
🎨 Image & Video Generation APIs (Generative AI)
🚗 Edge AI & Embedded GPU Fleets (Edge AI)