
Monitoring GPU with MetricFire

Gain Visibility into your GPU Environment

To integrate GPU monitoring with MetricFire, please sign up for a free 14-day trial. We want to fully understand your requirements and monitoring goals, so we can advise you on how to obtain better visibility into your infrastructure. Please book a demo with us so we can show you how quick and easy it is to get meaningful data into your MetricFire account, and use that data to build custom dashboards and alerts.



Collect and Forward GPU Metrics Using Telegraf and DCGM

GPUs aren't just for graphics anymore; they've become the workhorses of modern computing whenever massive process parallelism is needed. Unlike CPUs, which handle a few threads really well, GPUs are built to blast through thousands of operations at once, making them perfect for training AI models, running scientific simulations, or processing huge datasets. The trade-off is that all this horsepower comes with higher power draw, more heat, and memory systems that can bottleneck if you're not paying attention.

That's why GPU monitoring matters: without it, you can end up wasting resources, hitting thermal throttling, or even pushing your hardware toward failure.

In this guide, we'll walk through how to set up NVIDIA's DCGM Exporter to collect GPU metrics, and use Telegraf to scrape and forward them to MetricFire for storage and visualization.

Telegraf is a plugin-driven server agent developed by InfluxData (the makers of InfluxDB), and can be used for collecting and sending statistics from servers, databases, processes, devices, and a range of 3rd party technology platforms. It is compatible with most operating systems and has many useful input and output plugins for collecting and forwarding a wide variety of performance metrics.

Telegraf is especially popular on GPU servers, which ship with NVIDIA's System Management Interface (SMI), a built-in command-line tool for checking performance and health stats.

Follow these steps to get started:

  1. See our Telegraf docs for detailed instructions on installing and configuring the agent on your GPU server
  2. Install the NVIDIA DCGM Exporter and configure Telegraf's Prometheus input plugin. See our handy blog article with detailed instructions HERE
  3. Restart Telegraf to see new metrics appear in your MetricFire account's Metrics Search UI
  4. You can use these metrics to create Dashboard panels and Graphite alerts
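To sketch where those steps end up: the HG-CLI tool writes the Hosted Graphite output section for you, and the Prometheus input only needs the exporter's URL. A minimal hand-written telegraf.conf fragment might look like the following (this is an illustrative sketch, not HG-CLI's exact output; YOUR-API-KEY is a placeholder for your Hosted Graphite API key, and the exporter's default port of 9400 is assumed):

```toml
# Scrape GPU metrics from the local DCGM exporter
[[inputs.prometheus]]
  urls = ["http://localhost:9400/metrics"]

# Forward metrics to Hosted Graphite's carbon endpoint,
# using your API key as the metric prefix
[[outputs.graphite]]
  servers = ["carbon.hostedgraphite.com:2003"]
  prefix = "YOUR-API-KEY.telegraf"
```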


Use MetricFire's Hosted Graphite platform to analyze your system's performance and troubleshoot errors. Book a demo with our team for more detailed information about MetricFire and how to integrate it with your system, or sign up for a MetricFire free trial to start seeing your GPU's vital signs.

Step 1: What Is NVIDIA DCGM? (and why it beats spot checks)

Most GPU servers ship with NVIDIA's System Management Interface (SMI), a built-in command-line tool for checking performance and health stats. Popular GPUs that ship with it include the Titan series (Titan RTX, Titan V, Titan Xp) and the RTX 30 series (RTX 3060, RTX 3070, RTX 3080, RTX 3090).

Running nvidia-smi shows you real-time usage details like GPU utilization, memory consumption, temperature, and power draw—it's the quickest way to confirm your GPU is alive and working.

For testing, we spun up a Hetzner GEX44 GPU server (running Ubuntu 22.04) and used it to explore how to capture and visualize GPU metrics for long-term performance tracking.


While SMI is great for spot checks, it doesn't provide a continuous stream of metrics for dashboards and alerts. That's where NVIDIA DCGM (Data Center GPU Manager) comes in—it exposes the same low-level telemetry over an HTTP endpoint that can be scraped by Telegraf, OpenTelemetry, or Prometheus, and stored in MetricFire's Hosted Graphite backend for long-term monitoring.

Step 2: NVIDIA DCGM Monitoring: Quick Start

Install DCGM + dcgm-exporter

NVIDIA's Data Center GPU Manager (DCGM) is a low-level toolkit for monitoring and managing GPUs in server environments. It exposes detailed telemetry—utilization, memory, temperature, power, reliability counters—that you wouldn't get from a simple spot-check tool like nvidia-smi.

On its own, DCGM provides the runtime and APIs to access GPU stats. When paired with the DCGM Exporter, those metrics are made available over HTTP in Prometheus format so they can be scraped by collectors like Telegraf, OpenTelemetry, or Prometheus itself.

Setup commands (Ubuntu 22.04):

Add the NVIDIA CUDA/DCGM repository. This installs NVIDIA's keyring and registers their official package repository so apt can fetch the packages on Ubuntu 22.04:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

Install the DCGM runtime. It provides the low-level host engine and libraries that expose GPU stats, and is required by dcgm-exporter to talk to the NVIDIA driver and collect metrics:

sudo apt-get install -y datacenter-gpu-manager

Install and start DCGM-Exporter (via Snap). The exporter runs a small HTTP server on port 9400, exposing GPU stats in the Prometheus format so collectors (like Telegraf) can scrape them:

sudo snap install dcgm
sudo snap start dcgm.dcgm-exporter

Verify your metrics endpoint:

Confirm that the exporter is running by querying the metrics endpoint. You will see GPU temperature, power usage, clock speeds, and more:

curl localhost:9400/metrics | head -20
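To make that output concrete, here is a sketch of the Prometheus exposition format the exporter emits, and how to pull a single value out of a saved scrape. The sample lines and values below are illustrative, not from a real device:

```shell
# Illustrative sample of dcgm-exporter output in Prometheus exposition format.
# On a live host you would capture this with: curl -s localhost:9400/metrics
cat > /tmp/dcgm-sample.txt <<'EOF'
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-fake-uuid"} 87
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-fake-uuid"} 64
EOF

# Extract the utilization value (the last field of the matching sample line)
grep '^DCGM_FI_DEV_GPU_UTIL{' /tmp/dcgm-sample.txt | awk '{print $NF}'
# -> 87
```

On a live host, pipe curl -s localhost:9400/metrics into the same grep/awk instead of reading the saved file.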

Step 3: Scrape with Telegraf (or Prometheus)

If you don't already have an instance of Telegraf running on your server, install our handy HG-CLI tool to quickly install and configure Telegraf:

curl -s "https://www.hostedgraphite.com/scripts/hg-cli/installer/" | sudo sh

NOTE: You will need to input your Hosted Graphite API key, and follow the prompts to select which metric sets you want. The CLI tool automatically configures the output to your Hosted Graphite account!

Once it's installed, open the Telegraf configuration file at /etc/telegraf/telegraf.conf and add the following section so Telegraf scrapes the metrics endpoint:

[[inputs.prometheus]]
  urls = ["http://localhost:9400/metrics"]

Simply save your updated conf file and restart the Telegraf service (e.g., sudo systemctl restart telegraf) to forward the GPU performance metrics to your HG account. Or run Telegraf manually to inspect the output for potential syntax/permission errors:

telegraf --config /etc/telegraf/telegraf.conf

Once these metrics hit your Hosted Graphite account, you can use them to create custom dashboards and alerts!

Visualize in Grafana (MetricFire dashboards)

Once your metrics are in Hosted Graphite, log in to Grafana and start building GPU performance dashboards.

Explore MetricFire Grafana dashboards

Learn more about Telegraf integration

If you don't already have a Hosted Graphite account, sign up for a free trial here to obtain a Hosted Graphite API key.

Step 4: Key DCGM Metrics to Track (cheat sheet)

Once the DCGM exporter is up and running, you'll see a wide range of GPU metrics automatically collected and forwarded to the Hosted Graphite backend. These cover utilization, memory, thermal, power, bandwidth, and even hardware health indicators. Below is an overview of the most useful default metrics that will be collected, along with what each represents and its unit of measurement.

Metric                       | Description                         | Unit
-----------------------------|-------------------------------------|------
DCGM_FI_DEV_GPU_UTIL         | Core GPU compute engine load        | %
DCGM_FI_DEV_POWER_USAGE      | Current GPU power draw              | W
DCGM_FI_DEV_GPU_TEMP         | GPU core temperature                | °C
DCGM_FI_DEV_MEMORY_TEMP      | VRAM module temperature             | °C
DCGM_FI_DEV_FB_USED          | Framebuffer (VRAM) currently in use | MiB
DCGM_FI_DEV_XID_ERRORS       | Fatal driver-reported error count   | count
DCGM_FI_DEV_*_REMAPPED_ROWS  | Memory health indicators            | count

Reliability Metrics (Remaps, Retries, Errors)

Metrics like remapped rows, PCIe replay counters, and Xid errors are health indicators rather than performance metrics. On a healthy GPU, these values should stay at 0. A nonzero value means the hardware had to correct or retry something it shouldn't have to: memory rows were remapped due to ECC faults, or PCIe packets were retried due to transmission errors. If you see these counters incrementing, it's a sign of underlying instability such as failing VRAM, an unreliable PCIe bus, or a deeper hardware/driver fault. In production, these are "red flag" metrics: any nonzero value warrants investigation.
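Since these counters should sit at zero, a crude watchdog is simply "print any sample whose value is nonzero". Here is a minimal sketch against a saved scrape (the metric names follow DCGM's field-naming convention; the sample values are made up):

```shell
# Flag any nonzero reliability counters in a saved metrics scrape.
# On a live host, capture this file with: curl -s localhost:9400/metrics
cat > /tmp/dcgm-health.txt <<'EOF'
DCGM_FI_DEV_XID_ERRORS{gpu="0"} 0
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0"} 0
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS{gpu="0"} 2
EOF

# Print every sample whose value (last field) is greater than zero
awk '$NF > 0 {print "RED FLAG:", $0}' /tmp/dcgm-health.txt
```

Wired into cron or a Telegraf exec input, the same one-liner becomes a cheap early-warning check.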

Step 5: Alerts That Catch Real Issues

Set alerts to detect early signs of degradation:

  • GPU Utilization <10% for >5 min → underuse or stalled jobs
  • Xid Errors >0 → potential driver/hardware instability
  • Power Usage >90% of TDP → possible thermal throttling
  • ECC Remaps increasing → memory degradation

Step 6: Troubleshooting DCGM Exporter

  • Empty /metrics? Ensure dcgm-exporter is running, and the DCGM runtime matches your driver version.
  • Missing ECC fields? Consumer GPUs don't expose them—expected behavior.
  • Container issues? Use the NVIDIA Container Toolkit when running DCGM in Docker/Kubernetes.

GPU Stress Testing with gpu-burn

gpu-burn is a CUDA-based stress test that pushes the GPU to 100% utilization. It's commonly used for burn-in testing and validating stability under heavy load.

First, install the build tools and CUDA toolkit required for compilation:

sudo apt-get update
sudo apt-get install -y git make gcc nvidia-cuda-toolkit

Then, clone the gpu-burn repository and compile the binary:

git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make

Finally, run gpu-burn for 2 min (adjust duration as needed):

./gpu_burn 120

While running, you'll see spikes in GPU utilization, power draw, and temperature in your DCGM metrics.

Why GPU Servers Consume So Much Power

GPUs are among the most power-hungry components in modern compute infrastructure. Unlike CPUs, which idle down aggressively, GPUs maintain a higher baseline draw just to keep cores and memory ready. Under load, they can spike toward their thermal design power (TDP) of hundreds of watts to sustain thousands of parallel compute cores and high-bandwidth VRAM simultaneously. Even at idle, memory refresh cycles and driver management consume measurable energy. Over time, that translates into large cumulative energy use, which is exactly what your DCGM energy counter is reporting.
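DCGM exposes that cumulative counter as DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, reported in millijoules since boot. Converting it to kilowatt-hours is simple arithmetic; the reading below is an illustrative value, not a real measurement:

```shell
# Convert a DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION reading (mJ since boot) to kWh.
# 1 kWh = 3.6e9 mJ. The sample reading is illustrative.
energy_mj=7200000000000
awk -v mj="$energy_mj" 'BEGIN { printf "%.1f kWh\n", mj / 3.6e9 }'
# prints: 2000.0 kWh
```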

What is MetricFire?

MetricFire is a full-scale platform that provides infrastructure, system, and application monitoring using a suite of open-source tools. We will aggregate and store your data as time series metrics, which can be used to build custom dashboards and alerts. MetricFire takes away the burden of self-hosting your own monitoring solution, allowing you more time and freedom to work on your most important tasks.

 


         

 

MetricFire offers a complete ecosystem of end-to-end infrastructure monitoring, built on open-source Graphite and Grafana. MetricFire handles the aggregation, storage, and backups of your data, and offers alerting, team features, and APIs for easy management of your monitoring environment. You can send server metrics using one of our agents, custom metrics from within your application code, and integration metrics from a variety of popular 3rd party services that we integrate with, like Heroku, AWS, Azure, GCP, and many more!

         

Our Hosted Graphite product has improved upon standard Graphite to add data dimensionality, optimized storage, and offers additional tools and features that provide customers with a robust and well-rounded monitoring solution.

       

Benefits of Using MetricFire:

  • Simple, low-cost pricing
    A structured pricing model based on total unique time series metrics allows you to work within your budget. Predictable, transparent pricing allows you to keep your costs in check and plan for the future. One metric is one metric, regardless of its source or the number of requests sent to and from the metric namespace.
  • Easy-to-use dashboards
    Our Hosted Dashboards provide endless options for customizing your visualizations, and allow you to share dashboards with clients.
  • Responsive alerting
    Receive alert notifications quickly through Email, PagerDuty, Slack, Microsoft Teams, OpsGenie, and custom webhooks.
  • Freedom of customization
    Send custom application metrics through your code. Our Language Guide can help you update your codebase to send event metrics via socket connections.
  • Fantastic customer support
    Fast and friendly support is provided by engineers, for engineers, to get you set up quickly. Start a conversation with us through the chat bubble below!
  • Enterprise-ready
    Dedicated clusters for users that need their own environment with custom parameters. Please reach out to us for more information.
      
“As MetricFire scales effortlessly, we can push and store more metrics than we really need today but might need tomorrow. This increases our depth of understanding of the systems that we run and heads off any future problems.”
Jim Davies, Head of DevOps, MoneySuperMarket.com.

Don't just take our word for it

Why thousands of engineers choose us:

  • Own your data
  • Major savings
  • Nothing hidden
  • Engineering support

Vendor lock-in's not our thing. We’re believers in you still owning your data, so you can request a full export at any time. That means you get all the benefits of an open-source tool, but with the security and stability of a SaaS tool.

“We now have over ten times the amount of metrics we started with, and on different accounts. One of the great things about MetricFire is that scaling to support this increase has been hassle-free, requiring no additional work on our side.”

Maxime Audet, Cloud-Ops Team Lead, Coveo


Save valuable engineering time and stick within your budget. We're known for our predictable and transparent pricing, so it's easy to keep costs under control.

“Building and managing an on-premise installation at this scale would require a lot of engineer time, especially in the first year...we use this engineering time to work on initiatives closer to our core business”

Jim Davies, Head of DevOps, MoneySuperMarket.com


Transparency is at the root of how we operate: from pricing to postmortems we're open about how we do things. We even publish our internal system metrics to our public status page.

“There’s complete transparency with everything MetricFire do which means we can accurately predict what we’ll be spending and comfortably keep within our budget.”

Itai Yaffe, Big Data Developer, Nielsen


Our first-line technical support comes from engineers, for engineers. So you get highly detailed, expert support when you need it.

"Every time I have a question, I get an answer from support after just a couple of hours. Their technical knowledge is excellent.”

Shahar Kobrinsky, VP of Architecture and Scale, Eyeview


Questions?

Don’t see the integration you’re looking for? Need help setting up your monitoring? Get in touch and one of our engineers will help you out.

Contact us