Best Practices for Monitoring Kubernetes with Grafana

There are tons of tools to choose from when it comes to visualizing data, but Grafana has become one of the best ways for organizations to visualize information and get notified about events happening within their infrastructure or data.

According to Kubernetes:

Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

In this article, we will take a look at the best practices for monitoring Kubernetes using Grafana.

To get started, sign up for a MetricFire free trial, where you can quickly and easily send Kubernetes metrics and make Grafana dashboards right on our platform.


Key Takeaways

  1. Grafana is vital for Kubernetes monitoring: Grafana is a top tool for visualizing data and tracking events in Kubernetes infrastructure.
  2. Key Kubernetes metrics to monitor: Monitor pod/container, node, and cluster metrics for better performance management.
  3. Grafana excels in troubleshooting: It helps identify cluster issues, bottlenecks, application problems, and I/O waits using RED metrics.
  4. Easy data sources and dashboard creation: Grafana supports various sources, enabling simple dashboard building with effective visualizations.
  5. Hosted Graphite and Grafana for scalability: MetricFire's hosted services simplify scaling and monitoring large Kubernetes setups.


Why Choose Grafana?

Monitoring strategies designed for conventional containerized applications don't carry over well to Kubernetes. Containers are hard to monitor because they are constantly being terminated and recreated. Add container orchestration, and you have to manage the application's underlying infrastructure while also handling normal operations at scale. This is why it is imperative to have a strategy that combines monitoring dashboards with centralized metrics for Kubernetes applications.

For real-time metrics in a centralized place, Grafana is the answer. It lets you monitor both your infrastructure and your applications, which is critical to getting clear insight from your metrics. Grafana displays the metrics collected from Kubernetes so that you get transparency into the condition of your cluster and can confirm that everything is running as it should.

Here are just a few of the metrics you can monitor with Grafana dashboards:

  • The availability and health of your pods
  • Kubernetes cluster resource utilization (CPU/memory on a cluster, node, pod, and container level)
  • Usage of resources including actual usage as well as requested usage so you have a clear picture of both
  • The Kubernetes cluster node's actual CPU and memory usage
  • Kubernetes nodes' available resources
  • Individual Kubernetes node available resources
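
To compare actual usage with requested usage, both values first have to be in the same unit. The following is a minimal sketch, assuming the values arrive as Kubernetes-style quantity strings such as 250m (millicores) or 128Mi. The function names parse_quantity and usage_vs_request are hypothetical and only cover the common suffixes.

```python
# Binary and decimal suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(text: str) -> float:
    """Convert a quantity such as '250m', '2', or '128Mi' to a plain number."""
    if text.endswith("n"):  # nanocores, e.g. '500000000n' -> 0.5
        return int(text[:-1]) / 1_000_000_000
    if text.endswith("m"):  # millicores, e.g. '250m' -> 0.25
        return int(text[:-1]) / 1000
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text)  # no suffix: whole cores or bytes


def usage_vs_request(usage: str, request: str) -> float:
    """Return actual usage as a fraction of the requested amount."""
    return parse_quantity(usage) / parse_quantity(request)
```

A result well below 1.0 suggests the request is larger than what the workload actually uses, while a result above 1.0 means usage exceeds the request.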



Which Kubernetes Metrics Should Your Organization Monitor?

Kubernetes has two kinds of metrics you can use: application-level metrics and system-level metrics. Application-level metrics are gathered from third-party monitoring integrations or solutions such as Prometheus. Check out our information about Kubernetes monitoring with Prometheus to learn more. System-level metrics come from core Kubernetes sources that are available from the start, such as the Kubernetes API, Metrics Server, and cAdvisor.

Here are three lists that showcase the Kubernetes metrics that are important to monitor.


Pod/Container Metrics

  1. Pods resource allocation
  2. Under-provisioned or over-provisioned pods
  3. Amount of running pods in the cluster
  4. Healthy vs. unhealthy pods in the cluster
  5. Throttled containers percentages
  6. Occurrences of container restarts
  7. Number of persistent volumes in a failed or pending state
  8. Container CPU and memory utilization (the memory requests and limits that utilization is measured against are set in the pod manifest, for example a file such as memory-defaults-pod.yaml)
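
Two items in this list, throttled container percentages and under- or over-provisioned pods, come down to simple ratios. The sketch below is illustrative only: the thresholds of 30% and 90% are assumptions, not recommendations, and the function names are hypothetical. The throttling counts could come from cAdvisor counters such as container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total.

```python
def throttled_percent(throttled_periods: int, total_periods: int) -> float:
    """Percentage of CPU scheduling periods in which the container was throttled."""
    if total_periods == 0:
        return 0.0  # no periods observed yet, so nothing was throttled
    return 100.0 * throttled_periods / total_periods


def provisioning_status(usage: float, request: float,
                        low: float = 0.3, high: float = 0.9) -> str:
    """Classify a pod by how much of its requested resource it actually uses.

    `low` and `high` are assumed thresholds; tune them for your workloads.
    """
    ratio = usage / request
    if ratio < low:
        return "over-provisioned"   # requests far more than it uses
    if ratio > high:
        return "under-provisioned"  # running close to or above its request
    return "ok"
```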

Node Metrics

  1. Health check for master nodes—API server, scheduler, controller, etc.
  2. Degradation of master nodes
  3. Number of nodes available for serving pods
  4. Node CPU utilization
  5. Node memory usage
  6. Node disk space available for placing pods
  7. Node disk I/O usage
  8. Node network traffic (in and out)—receive and transmit 
  9. Node network traffic errors
  10. Node network traffic drop
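
Network traffic, errors, and drops are usually exposed as ever-increasing counters, so a dashboard shows their rate of change instead of the raw value. As a rough sketch of that calculation, assuming two samples of a counter taken at known times (the function name counter_rate is hypothetical):

```python
def counter_rate(prev_value: float, prev_time: float,
                 curr_value: float, curr_time: float) -> float:
    """Per-second rate between two samples of a monotonically increasing counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, which usually means it was reset
        # (for example after a node reboot). Count only growth since the reset.
        delta = curr_value
    return delta / elapsed
```

For example, a receive counter that grows from 1000 to 4000 bytes over 30 seconds corresponds to 100 bytes per second.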

Cluster Metrics

  1. Cluster-level overview of workloads deployed
  2. Cluster CPU usage: used vs. total
  3. Cluster memory usage: used vs. total (default memory requests and limits for a namespace can be set with a manifest such as memory-defaults.yaml, applied to the default-mem-example namespace)
  4. Cluster file system usage: used vs. total
  5. Cluster network I/O pressure
  6. Cluster health (pod status, pod restarts, pod throttling)
  7. Overview of nodes, pods, and containers
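
The "used vs. total" figures at cluster level are aggregates of the per-node values. A minimal sketch of that roll-up, assuming you already have per-node samples in hand (the NodeSample type and its field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    name: str
    cpu_used: float    # cores in use
    cpu_total: float   # allocatable cores
    mem_used: float    # bytes in use
    mem_total: float   # allocatable bytes


def cluster_utilization(nodes: list[NodeSample]) -> dict[str, float]:
    """Aggregate per-node usage into cluster-wide used-vs-total percentages."""
    cpu_total = sum(n.cpu_total for n in nodes)
    mem_total = sum(n.mem_total for n in nodes)
    cpu_used = sum(n.cpu_used for n in nodes)
    mem_used = sum(n.mem_used for n in nodes)
    return {
        "cpu_percent": 100.0 * cpu_used / cpu_total if cpu_total else 0.0,
        "memory_percent": 100.0 * mem_used / mem_total if mem_total else 0.0,
    }
```

Summing before dividing weights each node by its capacity, which avoids the distortion you would get from averaging per-node percentages across nodes of different sizes.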


Kubernetes and Grafana Troubleshooting Guide

Grafana is an ideal tool for gaining insight through data visualization. It helps you pinpoint problems using metrics collected from a variety of sources. Here are a few troubleshooting scenarios where Grafana shines:

  • Cluster Performance Issues and Instability. Compare resource requests against limits to find performance problems and instability caused by poor resource planning.
  • Identify and Monitor. Identify which Kubernetes nodes are bottlenecks and monitor them.
  • Application Issues. Locate issues with the application by visualizing container restarts.
  • Determining a Source of I/O Waits. Correlate spikes in I/O wait with spikes in network or disk activity, using network and disk stats alongside I/O wait.
  • Correlating unhealthy pod states and throttled pods. Compare these against memory spikes on nodes, I/O wait times, or CPU usage.
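
Correlating I/O wait with network or disk activity is normally done by eye on a dashboard, but the same idea can be expressed numerically. The sketch below uses a plain Pearson correlation coefficient as one possible measure. It assumes the series are sampled at the same timestamps, and the function names are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally sampled series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def likely_io_wait_source(io_wait: list[float],
                          candidates: dict[str, list[float]]) -> str:
    """Name of the candidate series that correlates most strongly with I/O wait."""
    return max(candidates, key=lambda name: pearson(io_wait, candidates[name]))
```

A high correlation only points to where to look next. It does not prove that the candidate caused the I/O wait.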

RED metrics consist of request rate, error rate, and duration. They are used to instrument the services running in Kubernetes and, from an application perspective, they are crucial for pinpointing performance issues. They also make it easy to alert your team when business-facing services show breach events, and Grafana's built-in alerting capabilities can be leveraged for this.
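
As a concrete illustration of the three RED values, the sketch below computes them from a batch of request records collected over a time window. The Request type, the choice to count only 5xx responses as errors, and the use of the 95th percentile for duration are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code returned


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute request rate, error rate, and p95 duration for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_rate": 0.0, "p95_duration_s": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile: the smallest value with at least 95% of samples
    # at or below it. Integer arithmetic avoids floating-point rounding.
    rank = (95 * total + 99) // 100
    return {
        "rate": total / window_s,        # requests per second
        "error_rate": errors / total,    # fraction of requests that failed
        "p95_duration_s": durations[rank - 1],
    }
```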


Adding Data Sources in Grafana

Grafana works by fetching time-series data from storage backends, known as data sources, and then displaying it in graphs on the dashboard. It supports a plethora of data sources out of the box, including:

  • Azure Monitor
  • Prometheus Alert Monitoring
  • CloudWatch
  • InfluxDB
  • AWS
  • Elasticsearch
  • MySQL
  • Graphite
  • Microsoft SQL Server
  • OpenTSDB
  • PostgreSQL
  • Proxy
  • Stackdriver (now Google Cloud Monitoring)
  • TestData DB

As you create your dashboard, multiple data sources can be combined into one dashboard. Keep in mind that each panel is tied to a specific data source. With the query editor, you can write queries against your data stores to visualize the metrics. Each panel offers many visualization options, so you can choose whichever presentation is easiest for you to read.
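
Data sources can be added through the Grafana UI, and Grafana also exposes an HTTP API endpoint, POST /api/datasources, for the same purpose. The sketch below only builds such a request and does not send it. The Grafana URL, API token, and data source names are placeholders you would replace, and the helper name build_datasource_request is hypothetical.

```python
import json
import urllib.request


def build_datasource_request(grafana_url: str, api_token: str,
                             name: str, ds_type: str,
                             ds_url: str) -> urllib.request.Request:
    """Build (but do not send) a request that registers a data source in Grafana."""
    payload = {
        "name": name,        # display name shown in Grafana
        "type": ds_type,     # e.g. "prometheus" or "graphite"
        "url": ds_url,       # where Grafana reaches the backend
        "access": "proxy",   # Grafana's server proxies the queries
        "basicAuth": False,
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Whether the call succeeds depends on your Grafana version and the permissions of the token.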


Building Your Grafana Dashboard

Building your dashboard in Grafana isn't difficult. The Grafana dashboard comes with panels, and the default ones include Table, Heatmap, Graph, and Singlestat. Of course, you can add panel plugins as well. Plugins add new visualizations for both time-series and non-time-series data.

Once the panels are added, they can be placed into rows and this is done by simply dragging and dropping to organize them. Along with that, customization is available in a wide range of options. You can do whatever works best for you so that you can have your visualizations in an ideal format.
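
Behind the drag-and-drop editor, a dashboard is stored as a JSON document in which each panel has a gridPos on a grid that is 24 columns wide. The sketch below generates such a layout programmatically. It is a simplified illustration and not a complete dashboard model: the function names are hypothetical, and the "graph" panel type matches the classic Graph panel mentioned above (newer Grafana versions use "timeseries" instead).

```python
GRID_WIDTH = 24  # Grafana's dashboard grid is 24 columns wide


def layout_panels(titles: list[str], per_row: int = 2, height: int = 8) -> list[dict]:
    """Place panels left to right, wrapping to a new row every `per_row` panels."""
    width = GRID_WIDTH // per_row
    panels = []
    for index, title in enumerate(titles):
        row, col = divmod(index, per_row)
        panels.append({
            "id": index + 1,
            "title": title,
            "type": "graph",
            "gridPos": {"x": col * width, "y": row * height, "w": width, "h": height},
        })
    return panels


def build_dashboard(title: str, tags: list[str], panel_titles: list[str]) -> dict:
    """Assemble a minimal dashboard dictionary with tags and laid-out panels."""
    return {"title": title, "tags": list(tags), "panels": layout_panels(panel_titles)}
```

Generating the layout from one function is also a simple way to keep every dashboard on the same grid.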


A Few Tips To Keep in Mind

KISS: Seriously, just keep it simple! If you add too much information to your dashboard, it simply becomes harder to read. Keep your panels limited. Perhaps a single metric per panel is sufficient, such as disk space, CPU, or memory. When all is said and done, you want to be able to easily understand what you are looking at.

Keep It Consistent: Make all of your dashboards the same design to ensure that your metrics are simple to read and you don't have to learn the layout just because you're on a different dashboard. This can be confusing and is not necessary. Instead, keep it all uniform for easier access and visibility.

Add Tags: Tagging your dashboards lets you organize and group them so that you don't lose time looking for the one you need. This becomes crucial once your team starts creating many dashboards.

Keep Your Audience in Mind: The development team will need a detailed dashboard with less aggregation and increased diagnostics for troubleshooting purposes. Management might be interested in an aggregated dashboard that shows a high-level picture of all the services and their SLA/SLI/SLO. Make sure your dashboards are configured to help your staff with their decision-making processes.

Setting up Kubernetes Monitoring using MetricFire

A production-level Kubernetes infrastructure can involve a few hundred nodes and upwards of a few Mbps of network traffic, meaning you would need to scale out both Graphite and Grafana to handle the increasing load.

That’s where Hosted Graphite and Hosted Grafana come into the picture. They allow you to scale for long-term storage and provide redundant storage of data, without you having to go through the arduous process of setting up Graphite and Grafana.

Hosted Graphite and Visualizations through MetricFire allow for the continuous active deployment of new features, as MetricFire’s products all have their foundations in the ever-growing open-source projects. Configuring a Snap Daemon to send Kubernetes metrics to your MetricFire account is simple and just requires configuring your account's API key to be used as the prefix for each metric and the URL Endpoint to be used as the server destination. Check out our article Monitoring Kubernetes with Hosted Graphite to learn how to set up monitoring your Kubernetes infrastructure quickly and easily using our Hosted service.

Sign up for the MetricFire free trial here, and start building Kubernetes dashboards within a few minutes.


In this post, we learned more about the best practices for monitoring Kubernetes using Grafana. If you want, take a look at our favorite Grafana Dashboards, our article about Grafana plugins, and our Grafana Dashboard tutorial.

If you want to know how MetricFire can help with your monitoring needs, book a demo and talk to us directly. And don’t forget you can use our 14-day free trial, and make your own Grafana Dashboards within a few minutes of signing up.
