Best Practices for Monitoring Kubernetes with Grafana

Table of Contents

  1. Why Choose Grafana?
  2. Which Kubernetes Metrics Should Your Organization Monitor?
    1. Pod/Container Metrics
    2. Node Metrics
    3. Cluster Metrics
  3. Kubernetes and Grafana Troubleshooting Guide
  4. Adding Data Sources in Grafana
  5. Building Your Grafana Dashboard
  6. A Few Tips To Keep in Mind
  7. Summary

There are tons of tools to choose from when it comes to visualizing data, but Grafana has become one of the best ways for organizations to visualize information and get notified about events happening within their infrastructure or data.

According to Kubernetes:

Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

In this article, we will take a look at the best practices for monitoring Kubernetes using Grafana.

To get started, log in to the MetricFire free trial, where you can send Kubernetes metrics and build Grafana dashboards right in our platform.

         

Why Choose Grafana?

Monitoring strategies built for regular containerized applications don't carry over to Kubernetes. In today's environments, containers are hard to monitor because they are constantly dying and being rebuilt. Add container orchestration, and the challenge becomes not only managing the application's underlying infrastructure but also handling normal operations at scale. This is why it is imperative to have a strategy that combines monitoring dashboards with centralized metrics for Kubernetes applications.

For real-time metrics in a centralized place, Grafana is the answer. It covers both your infrastructure and your applications, which is critical to getting clear insight into your metrics. Surfacing Kubernetes metrics in Grafana gives you visibility into the condition of your cluster, so you can confirm that everything is running as it should.

Here are just a few of the metrics you can monitor with Grafana dashboards:

  • The availability and health of your pods
  • Kubernetes cluster resource utilization (CPU and memory at the cluster, node, pod, and container level)
  • Resource usage, both actual and requested, so you have a clear picture of each
  • Actual CPU and memory usage of the cluster's nodes
  • Available resources across all Kubernetes nodes
  • Available resources on each individual Kubernetes node
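To make the difference between actual and requested usage concrete, here is a minimal sketch in Python. The pod names and numbers are hypothetical sample data, and the 50% and 90% thresholds are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class PodCpu:
    name: str
    requested_cores: float  # CPU request from the pod spec
    used_cores: float       # observed average usage


def classify(pod: PodCpu, low: float = 0.5, high: float = 0.9) -> str:
    """Label a pod by how much of its CPU request it actually uses."""
    if pod.requested_cores <= 0:
        return "no request set"
    ratio = pod.used_cores / pod.requested_cores
    if ratio < low:
        return "over-provisioned"   # requests far more than it uses
    if ratio > high:
        return "under-provisioned"  # running close to (or above) its request
    return "ok"


# Hypothetical sample data
pods = [
    PodCpu("api", requested_cores=1.0, used_cores=0.2),
    PodCpu("worker", requested_cores=0.5, used_cores=0.49),
    PodCpu("cache", requested_cores=0.5, used_cores=0.35),
]

report = {p.name: classify(p) for p in pods}
print(report)
```

A dashboard panel that plots usage next to requests gives you the same comparison visually.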

   

       

Which Kubernetes Metrics Should Your Organization Monitor?

The Kubernetes platform has two kinds of metrics you can use: application-level metrics and system-level metrics. Application-level metrics are fetched from third-party monitoring integrations or solutions such as Prometheus. To learn more about Kubernetes monitoring with Prometheus, check out this information we provided. System-level metrics come from the core Kubernetes sources that are available from the start, such as the Kubernetes API, Metrics Server, and cAdvisor.
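As an illustration of how metrics are fetched from Prometheus, the sketch below builds a URL for its instant-query HTTP endpoint (/api/v1/query). The host name is a hypothetical placeholder, and the code only constructs the URL without making a network call.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address; "up" is a metric Prometheus records for every scrape target
url = instant_query_url("http://prometheus.example:9090/", "up")
print(url)
```

When Prometheus is added as a Grafana data source, Grafana issues requests of this kind on your behalf.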

Here are three lists that showcase the Kubernetes metrics that are important to monitor.

     

Pod/Container Metrics

  1. Pod resource allocation
  2. Under-provisioned or over-provisioned pods
  3. Number of running pods in the cluster
  4. Healthy vs. unhealthy pods in the cluster
  5. Percentage of throttled containers
  6. Occurrences of container restarts
  7. Number of persistent volumes in a failed or pending state
  8. Container CPU and memory utilization (memory requests and limits can be configured per pod or container in a manifest such as memory-defaults-pod.yaml)
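The throttling percentage in item 5 is a ratio of two counters. A minimal sketch of the arithmetic, assuming cAdvisor-style counters (container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total) sampled twice; the sample values are hypothetical:

```python
def throttled_percent(periods_start: int, periods_end: int,
                      throttled_start: int, throttled_end: int) -> float:
    """Percentage of CFS scheduling periods in which the container was throttled.

    Inputs are two samples of monotonically increasing counters, e.g. cAdvisor's
    container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total.
    """
    periods = periods_end - periods_start
    if periods <= 0:
        return 0.0  # no scheduling periods elapsed between the samples
    return 100.0 * (throttled_end - throttled_start) / periods


# Hypothetical counter samples taken five minutes apart
pct = throttled_percent(periods_start=12_000, periods_end=15_000,
                        throttled_start=300, throttled_end=1_050)
print(f"{pct:.1f}% of periods throttled")
```

A consistently high value usually means the container's CPU limit is too low for its workload.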

Node Metrics

  1. Health check for master nodes—API server, scheduler, controller, etc.
  2. Degradation of master nodes
  3. Number of nodes available for serving pods
  4. Node CPU utilization
  5. Node memory usage
  6. Node disk space available for placing pods
  7. Node disk I/O usage
  8. Node network traffic in and out (receive and transmit)
  9. Node network traffic errors
  10. Node network traffic drops
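Node CPU utilization is usually derived from an idle-time counter rather than measured directly. The sketch below shows that arithmetic, assuming a node_exporter-style counter of idle CPU-seconds (node_cpu_seconds_total with mode="idle"); the sample numbers are hypothetical.

```python
def cpu_utilization(idle_seconds_delta: float, interval_seconds: float, cores: int) -> float:
    """Fraction of a node's CPU capacity in use over the interval (0.0 to 1.0).

    idle_seconds_delta is the increase in idle CPU-seconds, summed over all cores,
    between two samples taken interval_seconds apart.
    """
    capacity = interval_seconds * cores  # total CPU-seconds available in the interval
    if capacity <= 0:
        raise ValueError("interval_seconds and cores must be positive")
    busy = capacity - idle_seconds_delta
    return max(0.0, min(1.0, busy / capacity))


# Hypothetical 4-core node that was idle for 180 of 240 available CPU-seconds
util = cpu_utilization(idle_seconds_delta=180.0, interval_seconds=60.0, cores=4)
print(f"node CPU utilization: {util:.0%}")
```

In a Grafana panel backed by Prometheus, the same idea is commonly expressed with a rate over the idle counter, subtracted from one.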

Cluster Metrics

  1. Cluster-level overview of workloads deployed
  2. Cluster CPU usage: used vs. total
  3. Cluster memory usage: used vs. total (default memory requests and limits for a namespace can be configured in a manifest such as memory-defaults.yaml, applied to the default-mem-example namespace)
  4. Cluster file system usage: used vs. total
  5. Cluster network I/O pressure
  6. Cluster health (pod status, pod restarts, pod throttling)
  7. Overview of nodes, pods, and containers

      

Kubernetes and Grafana Troubleshooting Guide

Grafana is an ideal tool for gaining insight through data visualization. Those insights help you pinpoint problems in metrics collected from a variety of sources. Here are a few troubleshooting scenarios where Grafana shines:

  • Cluster Performance Issues and Instability. Compare resource requests against limits to find performance issues and instability caused by resource planning.
  • Identify and Monitor. Identify the Kubernetes nodes that are bottlenecks and monitor them.
  • Application Issues. Locate issues with an application by visualizing container restarts.
  • Determining a Source of I/O Waits. Correlate I/O wait spikes with network or disk spikes, using network stats and I/O wait metrics.
  • Correlating Unhealthy Pod States and Throttled Pods. Correlate these with memory spikes on nodes, I/O wait times, or CPU usage.
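Correlating I/O wait with network or disk activity is often done by eye on overlaid graphs, but the idea can be sketched numerically. The example below computes a Pearson correlation between series sampled at the same timestamps. All values are hypothetical sample data, and Pearson correlation is just one simple choice of measure.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same six timestamps
iowait_pct = [2, 3, 2, 15, 18, 3]
network_mbps = [10, 12, 11, 80, 95, 12]
disk_iops = [40, 38, 41, 39, 40, 42]

r_net = pearson(iowait_pct, network_mbps)
r_disk = pearson(iowait_pct, disk_iops)
print(f"iowait vs network: {r_net:.2f}, iowait vs disk: {r_disk:.2f}")
```

In this made-up data the I/O wait spikes track the network spikes closely and the disk series hardly at all, which would point the investigation toward the network.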

RED metrics consist of the following: request rate, error rate, and duration. They are used to instrument the services running in Kubernetes, and from an application perspective they are crucial for pinpointing performance issues. This makes it easy to alert your team when business-critical services breach their thresholds, using Grafana's built-in alerting capabilities.
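As a rough sketch of what the three RED numbers look like in practice, the following computes them from a list of request records. The record shape, the choice of HTTP 5xx as the error condition, and the 95th percentile as the duration summary are assumptions made for illustration.

```python
from dataclasses import dataclass
from math import ceil
from typing import Dict, List


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Request rate, error ratio, and p95 duration for one observation window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    p95_index = ceil(0.95 * len(durations)) - 1  # nearest-rank percentile
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[p95_index],
    }


# Hypothetical sample: ten requests observed over a five-second window
sample = [Request(200, d) for d in (20, 25, 30, 35, 40, 45, 50, 55, 60)]
sample.append(Request(503, 500))
summary = red_summary(sample, window_seconds=5)
print(summary)
```

Each of the three values maps naturally onto its own dashboard panel and its own alert threshold.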

     

Adding Data Sources in Grafana

Grafana works by fetching data from its data sources and displaying it in graphs on the dashboard. These data sources are the storage backends that hold your time-series data. Grafana supports a plethora of data sources out of the box, including:

  • Azure Monitor
  • Prometheus
  • CloudWatch
  • InfluxDB
  • AWS
  • Elasticsearch
  • MySQL
  • Graphite
  • Loki
  • Microsoft SQL Server
  • OpenTSDB
  • PostgreSQL
  • Proxy
  • Stackdriver
  • TestData DB

As you create your dashboard, you can combine multiple data sources on one dashboard. Keep in mind that each panel is tied to a specific data source. Using the query editor, you can write queries against your data stores to build visualizations of your metrics. Many visualization options can be applied to each panel, so you can choose whatever is easiest for you to read.
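Data sources can also be registered programmatically. As a rough sketch, the following builds the kind of JSON body that Grafana's HTTP API accepts when creating a data source (POST /api/datasources). The data source name and URL are hypothetical, and nothing is sent over the network here.

```python
import json


def prometheus_datasource(name: str, url: str, is_default: bool = False) -> dict:
    """Build a request body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # Grafana's backend makes the requests, not the browser
        "isDefault": is_default,
    }


# Hypothetical name and address
payload = prometheus_datasource("Prometheus", "http://prometheus.example:9090", is_default=True)
body = json.dumps(payload, indent=2)
print(body)
```

Keeping definitions like this in version control makes it easier to recreate the same data sources across environments.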

     

Building Your Grafana Dashboard

Building your dashboard in Grafana isn't difficult. A Grafana dashboard is made up of panels, and the default panel types include Table, Heatmap, Graph, and Singlestat. You can add panel plugins as well, which provide new visualizations for both time-series and non-time-series data.

Once panels are added, you can arrange them into rows by dragging and dropping. A wide range of customization options is also available, so you can get your visualizations into whatever format works best for you.
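Behind the drag-and-drop editor, a dashboard is stored as a JSON document in which each panel has a position on a 24-column grid. The sketch below generates a simple layout with a fixed number of panels per row. The dashboard title, tags, and queries are hypothetical, the queries assume kube-state-metrics and node_exporter metrics are available, and panel type identifiers vary between Grafana versions.

```python
import json
from typing import List, Tuple

GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def layout(panels: List[Tuple[str, str, str]], per_row: int = 2, height: int = 8) -> List[dict]:
    """Place (title, panel_type, query) tuples left to right, per_row panels per row."""
    width = GRID_COLUMNS // per_row
    result = []
    for i, (title, panel_type, expr) in enumerate(panels):
        result.append({
            "id": i + 1,
            "title": title,
            "type": panel_type,
            "gridPos": {"x": (i % per_row) * width, "y": (i // per_row) * height,
                        "w": width, "h": height},
            "targets": [{"expr": expr}],
        })
    return result


# Hypothetical dashboard definition
dashboard = {
    "title": "Kubernetes overview",
    "tags": ["kubernetes", "cluster"],
    "panels": layout([
        ("Running pods", "singlestat", 'sum(kube_pod_status_phase{phase="Running"})'),
        ("Container restarts (1h)", "graph",
         "sum(increase(kube_pod_container_status_restarts_total[1h]))"),
        ("Node CPU utilization", "graph",
         '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))'),
    ]),
}
print(json.dumps(dashboard, indent=2))
```

Generating layouts this way is one option for keeping many dashboards uniform, which fits the consistency and tagging tips later in this article.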

    

A Few Tips To Keep in Mind

KISS. Seriously, just keep it simple! Adding too much information to a dashboard makes it harder to read. Keep the number of panels limited. Sometimes a single metric, such as disk space, CPU, or memory, is sufficient. When all is said and done, you want to be able to easily understand what you are looking at.

Keep It Consistent. Use the same design for all of your dashboards so that your metrics are simple to read and you don't have to relearn the layout every time you switch dashboards. Keeping everything uniform makes for easier access and visibility.

Add Tags. Tagging your dashboards lets you organize and group them so you don't waste time searching for the one you need. This becomes crucial once your whole team starts creating dashboards.

Keep Your Audience in Mind. The development team will need a detailed dashboard with less aggregation and increased diagnostics for troubleshooting purposes. Management might be interested in an aggregated dashboard that shows a high-level picture of all the services and their SLA/SLI/SLO. Make sure your dashboards are configured to help your staff with their decision-making processes.

     

Summary

In this post, we learned more about the best practices for monitoring Kubernetes using Grafana. If you want, take a look at our favorite Grafana Dashboards, our article about Grafana plugins, and our Grafana Dashboard tutorial.

If you want to know how MetricFire can help with your monitoring needs, book a demo and talk to us directly. And don’t forget you can use our 14-day free trial, and make your own Grafana Dashboards within a few minutes.
