Table of Contents
- Introduction
- Introduction to Kubernetes
- Why monitor Kubernetes with Graphite?
- Introduction to Graphite
- Setting Up Kubernetes for Graphite Monitoring
- Viewing Metrics with Hosted Graphite
- Plotting the Metrics on Hosted Grafana
- Benefits of Monitoring Kubernetes Cluster with Hosted Graphite and Grafana
- What metrics are collected by the Snap Daemon?
- Conclusion
Introduction
In this article, we will be looking into Kubernetes monitoring with Graphite and Grafana. Specifically, we will look at how your whole Kubernetes set-up can be centrally monitored through Hosted Graphite and Hosted Grafana dashboards. This will allow Kubernetes Administrators to centrally manage all of their Kubernetes clusters without setting up any additional infrastructure for monitoring.
To follow the steps in this blog, sign up for the MetricFire free trial, where you can use Graphite and Grafana directly on our platform. MetricFire is a full-scale Hosted Graphite and Grafana service, where we do the set-up and management of these open-source tools so you don’t have to.
Introduction to Kubernetes
Kubernetes is the most popular platform for managing containerized applications. Many organizations, big and small, use Kubernetes today to deploy their production services. However, even though Kubernetes greatly simplifies deploying production services, it also brings its own set of complexities and challenges. Thus, it becomes very important to monitor the components running inside Kubernetes pods, services, and daemons to make sure they are behaving as expected.
For more information on what Kubernetes is, check out our beginner's guide here. We also have some great tutorials on Deploying Grafana to Kubernetes, HA Kubernetes monitoring, and more on the MetricFire blog.
Why monitor Kubernetes with Graphite?
Prometheus is the de facto tool for monitoring Kubernetes, with convenient abilities like service discovery and hundreds of pre-built dashboards. However, Prometheus is intended for day-to-day monitoring and is poorly suited to long-term storage of data. If you need to keep your data for a long time, so that you can analyze trends over a year or more, you will have to write data out from your local Prometheus instance to remote storage. This can be challenging to maintain and use, not to mention expensive.
That's where Hosted Graphite by MetricFire comes into play. If you want to monitor your Kubernetes installation completely hassle-free, without installing a local Prometheus instance, then we recommend using Hosted Graphite with a Snap daemon. This snap_k8s daemon acts as an agent, sending data to Hosted Graphite from inside your Kubernetes cluster. This solution is very easy - it requires almost no set-up on your side, and we can take care of everything from the MetricFire platform. We will outline this solution in this article.
Now, let's dive into how we can monitor Kubernetes with Hosted Graphite by MetricFire!
Introduction to Graphite
Graphite is one of the most popular open-source enterprise monitoring tools, and it's used by many enterprises across the world. Graphite provides an entire stack: a database called Whisper for metrics storage, a daemon called Carbon that listens for incoming events, and finally the Graphite-web application, which lets us see the metrics in the browser.
In this solution, we will not be focusing on open-source Graphite, which can be troublesome to set up initially; instead, we'll use Hosted Graphite, which comes with a lot of advantages.
There are some differences between open-source Graphite and Hosted Graphite which will not be covered in this article in detail. In short, Hosted Graphite is an enhanced version of Graphite that provides more granular metrics and more transparent scalability than open-source Graphite.
Setting Up Kubernetes for Graphite Monitoring
In order to transmit Kubernetes metrics to Graphite, we will be using an open-source Snap daemon called snap_k8s, which will run as a DaemonSet on our Kubernetes cluster. A DaemonSet ensures that a copy of a pod runs on every node in the cluster. In this case, we will configure the Snap task inside each pod to collect and publish metrics every 10 seconds.
Before we set up snap_k8s in our Kubernetes cluster, we need to supply it with two key pieces of information:
- The Graphite server to connect to
- The prefix to be sent for each metric sent to Graphite
Both of these pieces of information can be retrieved from Hosted Graphite. Log in to https://www.hostedgraphite.com/app/ and go to the Overview section.
Once logged in, click on the “How do I send metrics?” button to reveal the server information.
As shown below, the API key will be used as the prefix for each metric, and the URL endpoint will be used as the server destination.
With both pieces of information at hand, let's update the prefix and server values in the publish section of the ConfigMap in the manifest below, and then deploy the daemon on the Kubernetes cluster using kubectl create -f snap_ds.yml.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: snap
spec:
  selector:
    matchLabels:
      name: snap
  template:
    metadata:
      name: snap
      labels:
        name: snap
    spec:
      hostPID: true
      hostNetwork: true
      containers:
      - name: snap
        image: raintank/snap_k8s:latest
        volumeMounts:
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /var/run/docker.sock
          name: docker-sock
        - mountPath: /var/lib/docker
          name: fs-stats
        - mountPath: /usr/local/bin/docker
          name: docker
        - mountPath: /proc_host
          name: proc
        - mountPath: /opt/snap/tasks
          name: snap-tasks
        ports:
        - containerPort: 8181
          hostPort: 8181
          name: snap-api
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: PROCFS_MOUNT
          value: /proc_host
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: cgroup
        hostPath:
          path: /sys/fs/cgroup
      - name: docker-sock
        hostPath:
          path: /var/run/docker.sock
      - name: fs-stats
        hostPath:
          path: /var/lib/docker
      - name: docker
        hostPath:
          path: /usr/bin/docker
      - name: proc
        hostPath:
          path: /proc
      - name: snap-tasks
        configMap:
          name: snap-tasks
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: snap-tasks
data:
  core.json: |-
    {
      "version": 1,
      "schedule": {
        "type": "simple",
        "interval": "10s"
      },
      "workflow": {
        "collect": {
          "metrics": {
            "/intel/docker/*": {},
            "/intel/procfs/cpu/*": {},
            "/intel/procfs/meminfo/*": {},
            "/intel/procfs/iface/*": {},
            "/intel/linux/iostat/*": {},
            "/intel/procfs/load/*": {}
          },
          "config": {
            "/intel/procfs": {
              "proc_path": "/proc_host"
            }
          },
          "process": null,
          "publish": [
            {
              "plugin_name": "graphite",
              "config": {
                "prefix": "USER_TOKEN.snap.dev.<%NODE%>",
                "server": "USER_ID.carbon.hostedgraphite.com",
                "port": 2003
              }
            }
          ]
        }
      }
    }
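The only values that need customizing are the prefix and server entries in the ConfigMap's publish config. As a sketch (YOUR_API_KEY and YOUR_USER_ID are hypothetical stand-ins for the values shown on your account's Overview page), the substitution can be scripted rather than done by hand:

```shell
# Write the two publish-config lines from the manifest above to a file,
# then fill in the Hosted Graphite placeholders with sed.
# YOUR_API_KEY and YOUR_USER_ID are stand-ins for your account's values.
cat > publish_config.txt <<'EOF'
"prefix": "USER_TOKEN.snap.dev.<%NODE%>",
"server": "USER_ID.carbon.hostedgraphite.com",
EOF
sed -i 's/USER_TOKEN/YOUR_API_KEY/; s/USER_ID/YOUR_USER_ID/' publish_config.txt
cat publish_config.txt
```

Running the same two substitutions against snap_ds.yml itself, before kubectl create -f snap_ds.yml, avoids editing the manifest manually.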
Now, if we run "kubectl get pod", we will see our snap_k8s pods running on our Kubernetes cluster.
NAME         READY   STATUS    RESTARTS   AGE
snap-67sft   1/1     Running   0          1m
snap-fl7qv   1/1     Running   0          1m
snap-xs47f   1/1     Running   0          1m
In this setup, we are using a Kubernetes cluster running on Amazon Web Services (AWS). However, it doesn’t really matter which cloud provider you use. The same steps would be applicable for a cluster running on Google Cloud / Google Kubernetes Engine (GKE) or Azure Cloud / Azure Kubernetes Service(AKS).
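Beyond listing pods, a couple of standard kubectl checks confirm that the DaemonSet itself is healthy. These commands assume the DaemonSet name and label name snap from the manifest above, and require access to a live cluster:

```shell
# One pod should be scheduled and ready on every node in the cluster
kubectl get daemonset snap

# Tail the logs of the snap pods to confirm the collector started cleanly
kubectl logs -l name=snap --tail=20
```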
Viewing Metrics with Hosted Graphite
At this point, the snap_k8s daemon running inside our Kubernetes cluster should be transmitting metrics to Hosted Graphite. So let’s head over to Hosted Graphite and browse to Metrics -> Account Diagnostics Page which allows us to see the data about incoming metrics. In our case, this is how it looks:
Live Metric Names And Limiting, Datapoint Rates
Total Account Metrics, Metrics Created And Deleted
The following table shows activity on a Hosted Graphite account by protocol. As we can see, the snap_k8s daemon uses TCP and UDP protocols to send its metrics over to Hosted Graphite.
Activity by Protocol
If you are unable to view these metrics, double-check the configuration provided in the snap_k8s daemon and make sure that the outbound port 2003 is not blocked by any firewall configuration. If this still doesn’t help, reach out to Hosted Graphite support through the chat bubble in the Hosted Graphite window.
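One quick way to isolate firewall problems is to hand-send a single datapoint over Carbon's plaintext protocol from one of the cluster nodes. This is a sketch; YOUR_API_KEY and YOUR_USER_ID are stand-ins for your account's values:

```shell
# Carbon's plaintext protocol is one line per datapoint:
#   <metric.path> <value> [timestamp]
# If this metric appears in Hosted Graphite, outbound port 2003 is open.
echo "YOUR_API_KEY.conntest.firewall 1" \
  | nc YOUR_USER_ID.carbon.hostedgraphite.com 2003
```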
Plotting the Metrics on Hosted Grafana
Now for the most exciting part: we will turn this raw metric data into visualizations that users can understand. We will use Grafana, an open-source web application for interactive visualization of data using charts, graphs, and various other data visualization tools. It can connect to multiple data sources such as Graphite, Elasticsearch, and MySQL, and it also provides alerting capabilities.
First, let’s log in to Hosted Graphite at https://www.hostedgraphite.com/app and browse to Dashboards -> Grafana to open the Hosted Grafana application page.
Since Hosted Grafana is part of MetricFire’s offering itself, the data source connection between Grafana and Graphite is automatically done for us.
Grafana comes with a lot of pre-built dashboards as part of its Grafana dashboards library, and MetricFire's support team is happy to install more wherever needed.
We can also download the ready-made dashboards built for Kubernetes instead of creating one ourselves from scratch. And that’s what we are going to do: download a pre-built dashboard from the Grafana dashboards library for our use case.
On the Kubernetes Container Stats page, click the link “Download JSON” and import it into the Hosted Grafana portal. Make sure to choose the Graphite data source appropriately.
Once imported, we are able to visualize the metrics in the Grafana dashboard, as shown below.
We can also choose to have data visualized by each container/host in the Kubernetes cluster.
Here are some more images of the above panel types, zoomed in:
Hosted Grafana is a very powerful visualization tool built on top of the open-source Grafana. It allows you to create your own visualizations, as well as set up alerts whenever a metric value crosses a certain threshold.
Check out our article Grafana Dashboards from Basic to Advanced to learn how to set up Grafana alerts, and build custom dashboards.
You can also create other types of visualizations based on the metrics exposed by Kubernetes. Have a look at the article Our Favorite Grafana Dashboards to create some of the more advanced dashboards.
Benefits of Monitoring Kubernetes Cluster with Hosted Graphite and Grafana
As we saw above, it takes only a few minutes of setting up a Snap daemon inside our Kubernetes cluster to get the Kubernetes monitoring up and running with Hosted Graphite and Hosted Grafana.
There are various advantages of using this solution over a custom monitoring solution:
- The data is securely transmitted and stored in Hosted Graphite. Each and every metric is securely stamped with a private API key.
- Hosted Grafana integrates with Hosted Graphite under the hood. As soon as the Kubernetes cluster starts sending its metrics to Hosted Graphite, they can be instantly visualized in Hosted Grafana with no additional work, because the integration is already set up.
- Unless you know Graphite inside and out, there are a lot of hurdles to overcome in order to set up and maintain a scalable Graphite infrastructure. Check out Klaviyo's struggle here. Whenever possible, it's best to leave this job to the experts.
What metrics are collected by the Snap Daemon?
To see all of the information about this daemon, check out the source repository here. You can also check out the list here:
Load
| Namespace | Description |
| --- | --- |
| /intel/procfs/load/min1 | Number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 1 minute |
| /intel/procfs/load/min5 | Number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 5 minutes |
| /intel/procfs/load/min15 | Number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 15 minutes |
| /intel/procfs/load/min1_rel | Number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 1 minute, per core |
| /intel/procfs/load/min5_rel | Number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 5 minutes, per core |
| /intel/procfs/load/min15_rel | Number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 15 minutes, per core |
| /intel/procfs/load/runnable_scheduling | The number of currently runnable kernel scheduling entities (processes, threads) |
| /intel/procfs/load/existing_scheduling | The number of kernel scheduling entities that currently exist on the system |
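As the descriptions above note, the _rel variants are the load averages per core, which makes them comparable across nodes with different core counts. A quick worked example with illustrative numbers (not real readings):

```shell
# min1_rel = min1 / number of cores
# A hypothetical 8-core node reporting a 1-minute load average of 4.0:
min1="4.0"; cores=8
awk -v l="$min1" -v c="$cores" 'BEGIN { printf "min1_rel = %.2f\n", l / c }'
# prints "min1_rel = 0.50"
```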
Interface
| Namespace | Description |
| --- | --- |
| /intel/procfs/iface/<interface_name>/bytes_recv | The total number of bytes of data received by the interface |
| /intel/procfs/iface/<interface_name>/bytes_sent | The total number of bytes of data transmitted by the interface |
| /intel/procfs/iface/<interface_name>/compressed_recv | The number of compressed packets received by the device driver |
| /intel/procfs/iface/<interface_name>/compressed_sent | The number of compressed packets transmitted by the device driver |
| /intel/procfs/iface/<interface_name>/drop_recv | The total number of packets dropped by the device driver while receiving |
| /intel/procfs/iface/<interface_name>/drop_sent | The total number of packets dropped by the device driver while transmitting |
| /intel/procfs/iface/<interface_name>/errs_recv | The total number of receive errors detected by the device driver |
| /intel/procfs/iface/<interface_name>/errs_sent | The total number of transmit errors detected by the device driver |
| /intel/procfs/iface/<interface_name>/fifo_recv | The number of FIFO buffer errors while receiving |
| /intel/procfs/iface/<interface_name>/fifo_sent | The number of FIFO buffer errors while transmitting |
| /intel/procfs/iface/<interface_name>/frame_recv | The number of packet framing errors while receiving |
| /intel/procfs/iface/<interface_name>/frame_sent | The number of packet framing errors while transmitting |
| /intel/procfs/iface/<interface_name>/multicast_recv | The number of multicast frames received by the device driver |
| /intel/procfs/iface/<interface_name>/multicast_sent | The number of multicast frames transmitted by the device driver |
| /intel/procfs/iface/<interface_name>/packets_recv | The total number of packets of data received by the interface |
| /intel/procfs/iface/<interface_name>/packets_sent | The total number of packets of data transmitted by the interface |
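All of the iface metrics are cumulative counters, so a throughput graph needs the per-interval delta rather than the raw value (in Graphite, the perSecond() function computes this for you). A sketch with illustrative numbers:

```shell
# bytes_recv is cumulative; receive rate = counter delta / interval.
# Illustrative samples taken 10s apart (the task interval configured above).
prev=1000000; curr=1500000; interval=10
echo "rx rate: $(( (curr - prev) / interval )) bytes/s"
# prints "rx rate: 50000 bytes/s"
```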
Memory
| Namespace | Description |
| --- | --- |
| /intel/procfs/meminfo/active | The total amount of buffer or page cache memory, in bytes, that is in active use; this memory has been used more recently and is usually not reclaimed unless absolutely necessary |
| /intel/procfs/meminfo/active_anon | The amount of anonymous memory, in bytes, that has been used more recently and is usually not swapped out |
| /intel/procfs/meminfo/active_file | The amount of pagecache memory, in bytes, that has been used more recently and is usually not reclaimed until needed |
| /intel/procfs/meminfo/anon_huge_pages | The size of non-file-backed huge pages mapped into user-space page tables, in bytes |
| /intel/procfs/meminfo/anon_pages | The size of non-file-backed pages mapped into user-space page tables, in bytes |
| /intel/procfs/meminfo/bounce | The amount of memory used for block device "bounce buffers", in bytes |
| /intel/procfs/meminfo/buffers | The amount of physical RAM, in bytes, used for file buffers |
| /intel/procfs/meminfo/cached | The amount of physical RAM, in bytes, used as cache memory |
| /intel/procfs/meminfo/cma_free | The size of Contiguous Memory Allocator pages, in bytes, which are not used |
| /intel/procfs/meminfo/cma_total | The total size of Contiguous Memory Allocator pages, in bytes |
| /intel/procfs/meminfo/commit_limit | The amount of memory, in bytes, currently available to be allocated on the system based on the overcommit ratio |
| /intel/procfs/meminfo/committed_as | The amount of memory, in bytes, estimated to complete the workload; this value represents the worst-case scenario and also includes swap memory |
| /intel/procfs/meminfo/direct_map1g | The amount of memory, in bytes, being mapped to 1 GB pages |
| /intel/procfs/meminfo/direct_map2m | The amount of memory, in bytes, being mapped to 2 MB pages |
| /intel/procfs/meminfo/direct_map4k | The amount of memory, in bytes, being mapped to standard 4 kB pages |
| /intel/procfs/meminfo/dirty | The total amount of memory, in bytes, waiting to be written back to the disk |
| /intel/procfs/meminfo/hardware_corrupted | The amount of failed memory, in bytes (can only be detected when using ECC RAM) |
| /intel/procfs/meminfo/high_free | The amount of memory, in bytes, that is not directly mapped into kernel space |
| /intel/procfs/meminfo/high_total | The total amount of memory, in bytes, that is not directly mapped into kernel space; high memory is for pagecache and userspace |
| /intel/procfs/meminfo/huge_pages_free | The total number of hugepages available for the system |
| /intel/procfs/meminfo/huge_pages_rsvd | The number of huge pages for which a commitment to allocate from the pool has been made, but no allocation has yet been made |
| /intel/procfs/meminfo/huge_pages_surp | The number of huge pages in the pool above the value in /proc/sys/vm/nr_hugepages |
| /intel/procfs/meminfo/huge_pages_total | The total number of hugepages for the system |
| /intel/procfs/meminfo/hugepagesize | The size of each hugepages unit, in bytes |
| /intel/procfs/meminfo/inactive | The total amount of buffer or page cache memory, in bytes, that is free and available; this memory has not been recently used and can be reclaimed for other purposes |
| /intel/procfs/meminfo/inactive_anon | The amount of anonymous memory, in bytes, that has not been used recently and can be swapped out |
| /intel/procfs/meminfo/inactive_file | The amount of pagecache memory, in bytes, that can be reclaimed without huge performance impact |
| /intel/procfs/meminfo/kernel_stack | The amount of memory allocated to kernel stacks, in bytes |
| /intel/procfs/meminfo/low_free | The amount of memory, in bytes, that is directly mapped into kernel space |
| /intel/procfs/meminfo/low_total | The total amount of memory, in bytes, that is directly mapped into kernel space; it might vary based on the type of kernel used |
| /intel/procfs/meminfo/mapped | The total amount of memory, in bytes, which has been used to map devices, files, or libraries using the mmap command |
| /intel/procfs/meminfo/mem_available | The estimated amount of memory, in bytes, which is available for starting new applications without swapping |
| /intel/procfs/meminfo/mem_free | The amount of physical RAM, in bytes, left unused by the system (the sum of low_free + high_free) |
| /intel/procfs/meminfo/mem_total | The total amount of physical RAM, in bytes |
| /intel/procfs/meminfo/mem_used | The amount of physical RAM, in bytes, which is used; it equals mem_total - (mem_free + buffers + cached + slab) |
| /intel/procfs/meminfo/mlocked | The total amount of memory, in bytes, which is locked from userspace |
| /intel/procfs/meminfo/mmap_copy | The amount of memory, in bytes, which has been used in copying mmap(); note that an MMU is required to see this metric |
| /intel/procfs/meminfo/nfs_unstable | The size of NFS pages, in bytes, which have been sent to the server but not yet committed to stable storage |
| /intel/procfs/meminfo/page_tables | The total amount of memory, in bytes, dedicated to the lowest page table level |
| /intel/procfs/meminfo/quicklists | The amount of memory, in bytes, consumed by quicklists |
| /intel/procfs/meminfo/sreclaimable | The part of Slab, in bytes, that might be reclaimed, such as caches |
| /intel/procfs/meminfo/sunreclaim | The part of Slab, in bytes, that cannot be reclaimed under memory pressure |
| /intel/procfs/meminfo/shmem | The total amount of memory, in bytes, which is shared |
| /intel/procfs/meminfo/slab | The total amount of memory, in bytes, used by the kernel to cache data structures for its own use |
| /intel/procfs/meminfo/swap_cached | The amount of swap, in bytes, used as cache memory |
| /intel/procfs/meminfo/swap_free | The total amount of swap free, in bytes |
| /intel/procfs/meminfo/swap_total | The total amount of swap available, in bytes |
| /intel/procfs/meminfo/unevictable | The amount of memory, in bytes, that cannot be reclaimed (for example, because it is mlocked or used as a RAM disk) |
| /intel/procfs/meminfo/vmalloc_chunk | The largest contiguous block of vmalloc area, in bytes, which is free |
| /intel/procfs/meminfo/vmalloc_total | The total size of the vmalloc memory area, in bytes |
| /intel/procfs/meminfo/vmalloc_used | The amount of vmalloc area, in bytes, which is used |
| /intel/procfs/meminfo/writeback | The total amount of memory, in bytes, actively being written back to the disk |
| /intel/procfs/meminfo/writeback_tmp | The amount of memory, in bytes, used by FUSE for temporary writeback buffers |
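The mem_used metric is derived rather than read straight from /proc/meminfo: per its description above, it subtracts reclaimable memory from the total. A worked example with illustrative byte values (not real readings):

```shell
# mem_used = mem_total - (mem_free + buffers + cached + slab)
# Hypothetical 16 GiB node:
GiB=$((1024 * 1024 * 1024))
mem_total=$((16 * GiB))
mem_free=$((4 * GiB)); buffers=$GiB; cached=$((2 * GiB)); slab=$GiB
mem_used=$(( mem_total - (mem_free + buffers + cached + slab) ))
echo "mem_used = $(( mem_used / GiB )) GiB"
# prints "mem_used = 8 GiB"
```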
CPU
| Namespace | Data Type | Description |
| --- | --- | --- |
| /intel/procfs/cpu/*/user_jiffies | float64 | The amount of time spent in user mode by the CPU with the given identifier |
| /intel/procfs/cpu/*/nice_jiffies | float64 | The amount of time spent in user mode with low priority by the CPU with the given identifier |
| /intel/procfs/cpu/*/system_jiffies | float64 | The amount of time spent in system mode by the CPU with the given identifier |
| /intel/procfs/cpu/*/idle_jiffies | float64 | The amount of time spent in the idle task by the CPU with the given identifier |
| /intel/procfs/cpu/*/iowait_jiffies | float64 | The amount of time spent waiting for I/O to complete by the CPU with the given identifier |
| /intel/procfs/cpu/*/irq_jiffies | float64 | The amount of time spent servicing interrupts by the CPU with the given identifier |
| /intel/procfs/cpu/*/softirq_jiffies | float64 | The amount of time spent servicing softirqs by the CPU with the given identifier |
| /intel/procfs/cpu/*/steal_jiffies | float64 | The amount of stolen time, which is the time spent in other operating systems when running in a virtualized environment, by the CPU with the given identifier |
| /intel/procfs/cpu/*/guest_jiffies | float64 | The amount of time spent running a virtual CPU for guest operating systems under the control of the Linux kernel by the CPU with the given identifier |
| /intel/procfs/cpu/*/guest_nice_jiffies | float64 | The amount of time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel) by the CPU with the given identifier |
| /intel/procfs/cpu/*/active_jiffies | float64 | The amount of time spent in a non-idle state by the CPU with the given identifier |
| /intel/procfs/cpu/*/utilization_jiffies | float64 | The amount of time spent in non-idle and non-iowait states by the CPU with the given identifier |
| /intel/procfs/cpu/*/user_percentage | float64 | The percent of time spent in user mode by the CPU with the given identifier |
| /intel/procfs/cpu/*/nice_percentage | float64 | The percent of time spent in user mode with low priority by the CPU with the given identifier |
| /intel/procfs/cpu/*/system_percentage | float64 | The percent of time spent in system mode by the CPU with the given identifier |
| /intel/procfs/cpu/*/idle_percentage | float64 | The percent of time spent in the idle task by all CPUs |
| /intel/procfs/cpu/*/iowait_percentage | float64 | The percent of time spent waiting for I/O to complete by the CPU with the given identifier |
| /intel/procfs/cpu/*/irq_percentage | float64 | The percent of time spent servicing interrupts by the CPU with the given identifier |
| /intel/procfs/cpu/*/softirq_percentage | float64 | The percent of time spent servicing softirqs by the CPU with the given identifier |
| /intel/procfs/cpu/*/steal_percentage | float64 | The percent of stolen time, which is the time spent in other operating systems when running in a virtualized environment, by the CPU with the given identifier |
| /intel/procfs/cpu/*/guest_percentage | float64 | The percent of time spent running a virtual CPU for guest operating systems under the control of the Linux kernel by all CPUs |
| /intel/procfs/cpu/*/guest_nice_percentage | float64 | The percent of time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel) by the CPU with the given identifier |
| /intel/procfs/cpu/*/active_percentage | float64 | The percent of time spent in a non-idle state by the CPU with the given identifier |
| /intel/procfs/cpu/*/utilization_percentage | float64 | The percent of time spent in non-idle and non-iowait states by the CPU with the given identifier |
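The _jiffies metrics are cumulative counters, and the _percentage metrics are derived from them: the share of the jiffy delta spent in a given state over the collection interval. A worked example with illustrative counter samples (not real readings):

```shell
# utilization % = busy jiffy delta / total jiffy delta * 100,
# where busy excludes idle and iowait time.
idle1=1000; total1=4000   # first sample:  idle+iowait jiffies, all jiffies
idle2=1600; total2=5000   # second sample, 10s later
busy=$(( (total2 - idle2) - (total1 - idle1) ))
total=$(( total2 - total1 ))
echo "cpu utilization: $(( 100 * busy / total ))%"
# prints "cpu utilization: 40%"
```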
Conclusion
In this article, we looked at how Hosted Graphite and Hosted Grafana can help us monitor Kubernetes clusters without much setup.
Sign up here for a free trial of our Hosted Graphite and Grafana offering. Also, if you have any questions about our products, or about how MetricFire can help your company, talk to us directly by booking a demo.