Monitoring Kubernetes with Prometheus

March 12, 2020
  1. Introduction
  2. What’s Broken, and Why?
  3. Monitoring Distributed Systems: The Four Golden Signals
  4. Latency
  5. Traffic
  6. Errors
  7. Saturation
  8. Prometheus and the Four Golden Signals
  9. Conclusion

Introduction

In part I of this blog series, we saw that monitoring a Kubernetes cluster is a challenge we can overcome with the right tools. We also saw that the default Kubernetes dashboard lets us monitor the different resources running inside our cluster, but that it is very basic. We suggested tools and platforms such as cAdvisor, kube-state-metrics, Prometheus, Grafana, Kubewatch, Jaeger, and MetricFire.

In this blog post, we are going to look at the Four Golden Signals of building an observable system, and then see how Prometheus can help us in applying these rules.

To get started, sign up for the MetricFire free trial, where you can try out our Hosted Prometheus with almost no setup.


What’s Broken, and Why?

A big part of a DevOps team's job is to empower development teams to take part in operational responsibility. DevOps is based on cooperation among the various IT players around good practices, so that applications are designed, developed, and deployed more quickly, less expensively, and with higher quality. It aligns the development and operations teams around the famous principle given to us by Werner Vogels, CTO of Amazon: "you build it, you run it". The people who make up the team are, therefore, one of DevOps' main assets.

Taking on the responsibility of running an application also requires the DevOps team to get involved in other subtasks, such as monitoring the application. This is why choosing the right metrics to watch in production is a critical task. What you monitor, and the data you see, will shape your DevOps approach.

Also, involving the team in traditional monitoring tasks is not enough. With monitoring, you can discern what is happening in your production infrastructure: you can determine, for example, whether there is a high volume of activity on a server or a pool of servers. With observability (or white-box monitoring), however, you can detect a problem before it becomes an outage.

"Your monitoring system should address two questions: what's broken, and why? The 'what's broken' indicates the symptom; the 'why' indicates a (possibly intermediate) cause. 'What' versus 'why' is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise." ~ Google SRE Book.

There is no ready-to-use methodology for choosing the right metrics; everything depends on your team's technical and business needs. However, the following approach may inspire you:

We will try to understand some of the essential and most common metrics to watch in a Kubernetes-based production system, based on Google's Four Golden Signals.

Monitoring Distributed Systems: The Four Golden Signals

In chapter 6, "Monitoring Distributed Systems", of the famous Google SRE book, Google defines four main signals to be constantly observed, known as the four golden signals: latency, traffic, errors, and saturation.

These signals matter because they are essential to ensuring high application availability. Let's briefly look at what each one means.

Latency

Latency is the time it takes to send a request and receive a response. It is usually measured on the server side, but it can also be measured on the client side to account for differences in network speed. The operations team has the most control over server-side latency, but client-side latency is more relevant to end users.

The target threshold you choose may vary depending on the application type. You also need to track the latency of successful and failed requests separately, because failed requests often fail fast, without further processing, and can skew the overall picture.
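As a sketch of tracking successful requests separately, assuming your application exposes a request-duration histogram with a "code" label (the metric and label names here are illustrative, matching the examples later in this post), the 95th-percentile latency of successful requests could be queried as:

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{code=~"2.."}[5m])))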

Traffic

Traffic is a measure of the number of requests passing through the network, such as HTTP requests sent to your web or API server, or messages sent to a processing queue. Peak traffic periods can stress your infrastructure and drive it to its limits, which can have downstream consequences. That is why traffic is a key signal: it helps you differentiate between two root causes that produce the same symptoms, capacity problems and inappropriate system configurations, since configuration issues can cause problems even at low traffic.

For distributed systems, particularly Kubernetes, this will help you plan capacity in advance to meet future demand.
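For capacity planning it often helps to break traffic down per service. As a sketch, assuming your request counter carries a "service" label (label names depend on your instrumentation), that could look like:

sum by (service) (rate(http_requests_total[1m]))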

Errors

Errors can point to a bug in your code, an unresolved dependency, or a configuration error in your infrastructure. Take the example of a database failure that generates a spike in the error rate, and compare it with a network error that induces a similar spike: you can't tell the two problems apart by looking at the error rate alone.

Following a change to your Kubernetes deployment, errors may indicate bugs in the code that were not detected during testing, or that only appear in your production system.

The error message itself therefore provides a more accurate report of the problem. Errors can also distort other metrics, for example by artificially lowering latency, and they can trigger retries that end up overwhelming your Kubernetes clusters.


Saturation

Saturation is the load on your server's resources, such as network and CPU. Each resource has a limit beyond which performance degrades or the resource becomes entirely unavailable.

Saturation applies to resources such as disk I/O (read/write operations per second), disk capacity, CPU usage, and memory usage. You need to understand the design of your Kubernetes cluster well enough to know which parts of the service might become saturated first.

Often, saturation metrics are leading indicators, so you can adjust capacity before performance degrades. For example, network saturation can cause packets to be dropped, a saturated CPU can cause delayed responses, and a full disk can cause write failures and data loss.


Prometheus and the Four Golden Signals

Prometheus is an open-source tool for monitoring and alerting. It was developed at SoundCloud and later donated to the CNCF. It integrates natively or indirectly with other applications using metrics exporters. With the Prometheus Operator, installing and managing Prometheus on top of Kubernetes becomes easier than you might expect: it is a convenient way to run Prometheus and Alertmanager (and, via the related kube-prometheus project, Grafana) inside a Kubernetes cluster. So, what are the Prometheus metrics to watch in order to implement the Four Golden Signals?
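As an illustration, one common way to get the Prometheus Operator running is through the community-maintained kube-prometheus-stack Helm chart; the release name and namespace below are arbitrary:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace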

Prometheus collects and stores plenty of metrics; we will look at just a few of them as a demonstration, so the following list is not exhaustive.

First, "http_requests_total" counts the number of HTTP requests served, classified by status code and method. It can be used to observe traffic.

Example:

sum(rate(http_requests_total[1m]))
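The rate() function here converts the ever-increasing counter into a per-second rate over the one-minute window. As a simplified illustration of the idea (ignoring counter resets and the extrapolation Prometheus actually performs), in Python:

```python
def simple_rate(old_sample: float, new_sample: float, window_seconds: float) -> float:
    """Per-second rate of increase between two counter samples,
    as a simplified model of PromQL's rate() (no reset handling)."""
    return (new_sample - old_sample) / window_seconds

# A counter that grew from 1200 to 1500 requests over a 60-second window:
print(simple_rate(1200, 1500, 60))  # 5.0 requests per second
```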

Other metrics can be used to watch traffic, like "node_network_transmit_bytes_total" or "node_network_receive_bytes_total" from the node exporter. Choosing the right metric depends on what you need to measure and on the use case: do you need to monitor HTTP requests? TCP connections? Transmitted and received bytes?
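For example, the node exporter's received-bytes counter (named "node_network_receive_bytes_total" in recent versions, which append a "_total" suffix) can be turned into per-second network throughput with:

rate(node_network_receive_bytes_total[1m])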

Latency, another golden signal, can also be observed using metrics like "http_request_duration_seconds". Using PromQL we can, for instance, get the fraction of requests that complete within 400ms:

sum(rate(http_request_duration_seconds_bucket{le="0.4"}[1m])) / sum(rate(http_request_duration_seconds_count[1m]))
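The same histogram can also answer the reverse question, which latency a given share of requests stays under, using histogram_quantile. For example, the 99th-percentile request duration over the last five minutes:

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))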

The error percentage can be measured in much the same way, using metrics like "http_status_500_total" and "http_responses_total":

rate(http_status_500_total[1m]) / rate(http_requests_total[1m])

or

sum(rate(http_responses_total{code="500"}[1m])) / sum(rate(http_responses_total[1m]))
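Queries like these are exactly what you would feed into Prometheus alerting rules. As a sketch, a rule that fires when more than 5% of requests return 500s (the threshold, durations, and label values here are illustrative):

groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_responses_total{code="500"}[5m])) / sum(rate(http_responses_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: More than 5% of HTTP requests are failing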

To measure saturation, you usually refer to system metrics like memory, disk, or CPU. These metrics are collected directly from the Kubernetes nodes and don't rely on application instrumentation. For instance, to monitor CPU saturation, you can use the node exporter's "node_cpu_seconds_total" counter and compute the fraction of time the CPUs are not idle:

1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))

If you need to apply the same idea to other resources, like disk I/O, you can take the rate of the time spent doing I/O, which approximates disk utilization:

rate(node_disk_io_time_seconds_total[1m])
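Memory saturation can be sketched the same way with node exporter metrics (available on Linux nodes), as the fraction of memory in use:

1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)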

Conclusion

Choosing the right metric comes easily when you understand the design of your Kubernetes cluster and the nature of the services it runs. The four golden signals are helpful for designing an observable system. However, to use the full power of Prometheus, you need to go beyond basic metrics to advanced features like Alertmanager, and then integrate your data and alerts into a Grafana dashboard. To learn more, check out our article on the top 5 AlertManager gotchas. Also, sign up for the MetricFire free trial and experiment with querying and alerting on your Prometheus metrics today.
