Why GPU Monitoring Matters: Tracking Utilization, Power, and Errors with DCGM

Introduction

GPUs aren't just for graphics anymore; they've become the workhorses of modern computing wherever massive parallelism is needed. Unlike CPUs, which handle a handful of threads very well, GPUs are built to blast through thousands of operations at once, making them ideal for training AI models, running scientific simulations, or processing huge datasets. The trade-off is that all this horsepower comes with higher power draw, more heat, and memory systems that can bottleneck if you're not paying attention. That's why monitoring matters: without it, you can waste resources, hit thermal throttling, or even push your hardware toward failure.

In this guide, we’ll walk through how to set up NVIDIA’s DCGM Exporter to collect GPU metrics, and use Telegraf to scrape/forward them to MetricFire for storage and visualization.

Use MetricFire's Hosted Graphite platform to analyze your system's performance and troubleshoot errors. Book a demo with our team for more detailed information about MetricFire and how to integrate it with your systems, or sign up for a MetricFire free trial to start seeing your GPU's vital signs.

Step 1: Getting Started With the SMI Command Line

Most GPU servers ship with NVIDIA's System Management Interface (nvidia-smi), a command-line tool bundled with the driver for checking performance and health stats. Running nvidia-smi shows you real-time usage details like GPU utilization, memory consumption, temperature, and power draw, and it's the quickest way to confirm your GPU is alive and working.
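
If you only care about a few fields, nvidia-smi can also print them in CSV form and refresh on a loop, which is handy for a quick terminal-side watch (field names can vary slightly between driver versions; nvidia-smi --help-query-gpu lists the ones your driver supports):

nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw --format=csv -l 5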

For our testing, we spun up a Hetzner GEX44 GPU server running Ubuntu 22.04, and used it as a sandbox environment to explore how to capture and visualize GPU metrics in a way that actually helps you keep an eye on long-term performance.

While SMI is great for spot checks, it doesn't give you a continuous stream of metrics for dashboards and alerts. That's why we'll take things a step further with the DCGM collector, which exposes the same low-level telemetry over an HTTP endpoint that can be scraped by Telegraf or OpenTelemetry and sent into a storage backend like Hosted Graphite for long-term monitoring.

Step 2: Installing DCGM and the Exporter

NVIDIA’s Data Center GPU Manager (DCGM) is a low-level toolkit for monitoring and managing GPUs in server environments. It exposes detailed telemetry (utilization, memory, temperature, power, reliability counters) that you wouldn’t get from a simple spot-check tool like nvidia-smi. On its own, DCGM provides the runtime and APIs to access GPU stats, and when paired with the DCGM Exporter, those metrics are made available over HTTP in Prometheus format so they can be scraped by collectors like Telegraf, OpenTelemetry, or even Prometheus itself.


Add the NVIDIA CUDA/DCGM repository. This installs NVIDIA's keyring and registers their official package repository so apt can fetch DCGM packages on Ubuntu 22.04:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

Install the DCGM runtime, which provides the low-level host engine and libraries that expose GPU stats. It's required by dcgm-exporter to talk to the NVIDIA driver and collect metrics:

sudo apt-get install -y datacenter-gpu-manager
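
Before moving on, you can optionally confirm that DCGM can see your GPUs by starting the host engine service and listing the devices it discovers (the systemd unit is usually named nvidia-dcgm, though some older packages register it as dcgm):

sudo systemctl enable --now nvidia-dcgm
dcgmi discovery -l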

Install and start DCGM-Exporter (via Snap). The exporter runs a small HTTP server on port 9400, exposing GPU stats in the Prometheus format so collectors (like Telegraf) can scrape them:

sudo snap install dcgm
sudo snap start dcgm.dcgm-exporter

Verify that the exporter is running by querying the metrics endpoint. You will see GPU temperature, power usage, clock speeds, and more:

curl localhost:9400/metrics | head -20
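
To zero in on specific fields instead of scrolling through the full output, you can grep for the metric names covered in Step 4, for example:

curl -s localhost:9400/metrics | grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_POWER_USAGE|DCGM_FI_DEV_GPU_TEMP"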

Step 3: Use Telegraf to Forward GPU Performance Metrics

Set Up the Telegraf Collector

If you don't already have an instance of Telegraf running on your server, install our handy HG-CLI tool to quickly install and configure Telegraf:

curl -s "https://www.hostedgraphite.com/scripts/hg-cli/installer/" | sudo sh

NOTE: You will need to input your Hosted Graphite API key, and follow the prompts to select which metric sets you want. The CLI tool automatically configures the output to your Hosted Graphite account!


Once it's installed, open the Telegraf configuration file at /etc/telegraf/telegraf.conf and add the following section so Telegraf scrapes the DCGM metrics endpoint:

[[inputs.prometheus]]
  urls = ["http://localhost:9400/metrics"]
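
The two lines above are all that's required, but the prometheus input also accepts a few optional settings if you want more control. Here's a sketch with them commented out, assuming a reasonably recent Telegraf 1.x release:

[[inputs.prometheus]]
  urls = ["http://localhost:9400/metrics"]
  ## Optional: scrape this input less often than the global agent interval
  # interval = "30s"
  ## Optional: use the newer Prometheus parsing mode
  # metric_version = 2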

Ship GPU Metrics to Hosted Graphite

Save your updated conf file and restart the Telegraf service to forward the GPU performance metrics to your Hosted Graphite account. Or run Telegraf manually to inspect the output for potential syntax/permission errors:

telegraf --config /etc/telegraf/telegraf.conf
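
A couple of related commands can save debugging time: the --test flag gathers metrics once, prints them to stdout, and exits without sending anything, and (assuming Telegraf runs as a systemd service) a restart picks up the new input:

telegraf --config /etc/telegraf/telegraf.conf --test
sudo systemctl restart telegraf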

Once these metrics hit your Hosted Graphite account, you can use them to create custom dashboards and alerts!

If you don't already have a Hosted Graphite account, sign up for a free trial here to obtain a Hosted Graphite API key.

Step 4: Interpreting GPU Metrics

Once the DCGM exporter is up and running, you’ll see a wide range of GPU metrics automatically collected and forwarded to the Hosted Graphite backend. These cover utilization, memory, thermal, power, bandwidth, and even hardware health indicators. Below is an overview of the most useful default metrics that will be collected, along with what each represents and its unit of measurement.

  • Correctable Remapped Rows (DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS)
    Count of GPU memory rows that were remapped due to correctable ECC errors. Unit: count (counter)
  • Decoder Utilization (DCGM_FI_DEV_DEC_UTIL)
    Hardware video decoder load (NVDEC). Unit: % utilization (0–100)
  • Encoder Utilization (DCGM_FI_DEV_ENC_UTIL)
    Hardware video encoder load (NVENC). Unit: % utilization (0–100)
  • Framebuffer Free (DCGM_FI_DEV_FB_FREE)
    Amount of unused GPU VRAM. Unit: MiB
  • Framebuffer Used (DCGM_FI_DEV_FB_USED)
    Amount of GPU VRAM currently in use. Unit: MiB
  • GPU Temperature (DCGM_FI_DEV_GPU_TEMP)
    Core GPU die temperature. Unit: °C
  • GPU Utilization (DCGM_FI_DEV_GPU_UTIL)
    Core GPU utilization (compute engine load). Unit: % utilization (0–100)
  • Memory Temperature (DCGM_FI_DEV_MEMORY_TEMP)
    VRAM module temperature. Unit: °C
  • Memory Clock (DCGM_FI_DEV_MEM_CLOCK)
    GPU memory clock frequency. Unit: MHz
  • Memory Copy Utilization (DCGM_FI_DEV_MEM_COPY_UTIL)
    Memory copy engine utilization (data transfer). Unit: % utilization (0–100)
  • NVLink Bandwidth Total (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL)
    Total bandwidth transferred over NVLink (if present). Unit: bytes (counter)
  • PCIe Replay Counter (DCGM_FI_DEV_PCIE_REPLAY_COUNTER)
    Number of PCIe retries caused by transmission errors. Unit: count (counter)
  • Power Usage (DCGM_FI_DEV_POWER_USAGE)
    Current GPU power draw. Unit: watts (W)
  • Row Remap Failures (DCGM_FI_DEV_ROW_REMAP_FAILURE)
    Failed GPU memory row remaps. Unit: count
  • SM Clock (DCGM_FI_DEV_SM_CLOCK)
    Streaming multiprocessor (core) clock frequency. Unit: MHz
  • Total Energy Consumption (DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION)
    Total energy consumed since boot. Unit: millijoules (mJ)
  • Uncorrectable Remapped Rows (DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS)
    Rows remapped due to uncorrectable memory errors. Unit: count (counter)
  • vGPU License Status (DCGM_FI_DEV_VGPU_LICENSE_STATUS)
    vGPU licensing status (0 = unlicensed, 1 = licensed). Unit: binary
  • PCIe RX Bytes (DCGM_FI_PROF_PCIE_RX_BYTES)
    Total PCIe traffic received by the GPU. Unit: bytes (counter)
  • PCIe TX Bytes (DCGM_FI_PROF_PCIE_TX_BYTES)
    Total PCIe traffic transmitted by the GPU. Unit: bytes (counter)
  • Xid Errors (DCGM_FI_DEV_XID_ERRORS)
    Driver-reported fatal error events. Unit: count (gauge)

Reliability Metrics (Remaps, Retries, Errors)

Metrics like remapped rows, PCIe replay counters, and Xid errors are health indicators rather than performance metrics. On a healthy GPU, these values should stay at 0. A nonzero value means the hardware had to correct or retry something it shouldn't have to (for example, memory rows remapped due to ECC faults, or PCIe packets retried after transmission errors). If you see these counters incrementing, it's a sign of underlying instability: failing VRAM, an unreliable PCIe bus, or a deeper hardware/driver fault. In production, these are "red flag" metrics, so any nonzero value warrants investigation.
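
As a quick command-line sanity check (a minimal sketch that reuses the metric names above and assumes the exporter from Step 2 is still listening on localhost:9400), you can flag any nonzero reliability counters directly from the exporter output:

curl -s localhost:9400/metrics \
  | grep -E "^DCGM_FI_DEV_(XID_ERRORS|PCIE_REPLAY_COUNTER|ROW_REMAP_FAILURE|UNCORRECTABLE_REMAPPED_ROWS)" \
  | awk '$NF != 0 {print "WARNING:", $0}'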

GPU Stress Testing

Using gpu-burn

gpu-burn is a CUDA-based stress test that pushes the GPU to 100% utilization. It’s commonly used for burn-in testing and validating stability under heavy load. While running, you’ll see spikes in GPU utilization, power draw, and temperature in your DCGM metrics.


First, install the build tools and CUDA toolkit required for compilation:

sudo apt-get update
sudo apt-get install -y git make gcc nvidia-cuda-toolkit

Then, clone the gpu-burn repository and compile the binary:

git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make

Finally, run gpu-burn for 2 minutes (adjust the duration as needed):

./gpu_burn 120
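
From a second terminal, you can watch utilization, power, and temperature climb in near real time while the burn runs (this assumes the DCGM exporter from Step 2 is still serving on port 9400):

watch -n 5 'curl -s localhost:9400/metrics | grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_POWER_USAGE|DCGM_FI_DEV_GPU_TEMP"'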

Why GPU Servers Consume So Much Power

GPUs are among the most power-hungry components in modern compute infrastructure. Unlike CPUs, which idle down aggressively, GPUs maintain a higher baseline draw just to keep cores and memory ready. Under load, they can spike toward their thermal design power (TDP), often several hundred watts, to sustain thousands of parallel compute cores and high-bandwidth VRAM simultaneously. Even at idle, memory refresh cycles and driver management consume measurable energy. Over time, that adds up to large cumulative energy use, which is exactly what the DCGM energy counter reports.
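
To put that counter into more familiar units, here is a minimal sketch (assuming the exporter endpoint from Step 2) that converts the millijoule counter into kilowatt-hours; 1 kWh is 3.6 billion mJ, so for example 5,400,000,000 mJ works out to 1.5 kWh:

curl -s localhost:9400/metrics | awk '/^DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION/ {printf "GPU energy since boot: %.2f kWh\n", $NF / 3.6e9}'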

Conclusion

Modern GPU servers are the backbone of AI and high-performance computing, enabling workloads that would overwhelm even the fastest CPUs. But their strengths also make them complex pieces of hardware, where utilization, memory health, temperature, and power consumption all directly impact reliability and cost. Monitoring them isn't just about chasing performance charts; it's about ensuring workloads run efficiently, keeping infrastructure stable, and avoiding expensive downtime. With tools like the DCGM Exporter and Telegraf integrated into your observability stack, you can treat GPU servers as transparent, measurable systems. In short, if you depend on GPU acceleration, monitoring is not optional. It is the safeguard that ensures those powerful processors keep delivering when you need them most.


Reach out to MetricFire today and learn how their Hosted Graphite product can satisfy your monitoring requirements and give you full visibility into any environment!

Book a demo with MetricFire experts or sign up for the free trial today to learn more about our features.
