Monitoring Kubernetes with Hosted Graphite by MetricFire


Table of Contents

  1. Introduction
  2. Introduction to Kubernetes
  3. Why monitor Kubernetes with Graphite?
  4. Introduction to Graphite
  5. Setting Up Kubernetes for Graphite Monitoring
  6. Viewing Metrics with Hosted Graphite
  7. Plotting the Metrics on Hosted Grafana
  8. Benefits of Monitoring Kubernetes Cluster with Hosted Graphite and Grafana
  9. What metrics are collected by the Snap Daemon?
    1. Load
    2. Interface
    3. Memory
    4. CPU
  10. Conclusion

Introduction

In this article, we will look into Kubernetes monitoring with Graphite and Grafana. Specifically, we will see how your whole Kubernetes setup can be centrally monitored through Hosted Graphite and Hosted Grafana. This allows Kubernetes administrators to manage all of their Kubernetes clusters centrally, without setting up any additional infrastructure for monitoring.

To follow the steps in this blog, sign up for the MetricFire free trial, where you can use Graphite and Grafana directly in our platform. MetricFire is a hosted Graphite, Grafana, and Prometheus service, where we handle the setup and management of these open-source tools so you don't have to.

     

Introduction to Kubernetes

Kubernetes is the most popular platform for managing containerized applications. Many organizations today, big and small, use Kubernetes to deploy their production services. However, even though it greatly simplifies the deployment of production services, it also brings its own set of complexities and challenges. It is therefore important to monitor the components running inside Kubernetes pods, services, and daemons to make sure they are behaving as expected.

For more information on what Kubernetes is, check out our beginner's guide here. We also have some great tutorials on Deploying Grafana to Kubernetes, HA Kubernetes monitoring, and more on the MetricFire blog.

   

Why monitor Kubernetes with Graphite?

Prometheus is the de facto tool for monitoring Kubernetes, with convenient abilities like service discovery and hundreds of pre-built dashboards. However, Prometheus is intended for day-to-day monitoring and is a poor fit for long-term storage of data. If you need to keep your data for a long time, so that you can do trending and analysis over a year, then you will have to write data out from your local Prometheus instance to remote storage. This can be challenging to maintain and use, not to mention expensive.

That's where Hosted Graphite and Hosted Prometheus by MetricFire come into play. If you want a local Prometheus installation in your Kubernetes cluster that writes out to hosted long-term storage, you can use Hosted Prometheus. MetricFire's Hosted Prometheus is essentially hosted long-term storage for Prometheus data with built-in dashboards, alerting, maintenance, and around-the-clock support.

Our Hosted Prometheus solution does not control the Prometheus instance within your Kubernetes cluster. The reason for this is security. If we try to access your Prometheus instance from MetricFire it could put your entire cluster at risk. It's easier and safer for you to install a Prometheus instance in your Kubernetes cluster, and have us manage the remote storage, dashboards, security, alerting and updates.

If you want to monitor your Kubernetes installation completely hassle free, without installing a local Prometheus instance, then we recommend using Hosted Graphite with a Snap daemon. This Snap daemon acts like an agent and sends data to Hosted Graphite from inside your Kubernetes cluster. This solution is very easy - it requires almost no setup on your side, and we can take care of everything from the MetricFire platform. We will outline this solution in this article.

Now, let's dive into how we can monitor Kubernetes with Hosted Graphite by MetricFire!

           

Introduction to Graphite

Graphite is one of the most popular open source enterprise monitoring tools, used by many enterprises across the world. Graphite provides the entire stack: a database for metrics storage called Whisper, a daemon called Carbon that listens for incoming events, and finally the Graphite web application, which lets us view the metrics in the browser.

In this solution, we will not be focusing on open source Graphite, which can be troublesome to set up initially; instead, we'll use Hosted Graphite, which comes with a lot of advantages.

There are some differences between open source Graphite and Hosted Graphite which we will not cover in detail in this article. In short, Hosted Graphite is a tuned version of Graphite that provides more granular metrics and more transparent scalability than open source Graphite.

            

Setting Up Kubernetes for Graphite Monitoring

In order to transmit Kubernetes metrics to Graphite, we will be using an open source Snap daemon, which will run as a DaemonSet on our Kubernetes cluster. A DaemonSet ensures that a copy of a given pod runs on every node in the cluster, which is exactly what we want for a metrics collector. In this case, we will configure the Snap collection task to run every 10 seconds.

Before we set up Snap in our Kubernetes cluster, it needs to be supplied with two key pieces of information:

  • The Graphite server to connect to
  • The prefix to prepend to each metric sent to Graphite
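To see why these two values matter, here is roughly what the Snap daemon ultimately produces with them: one line per metric in Graphite's plaintext protocol, sent to the Carbon endpoint on port 2003. This is a minimal sketch with placeholder key and hostname, not part of the required setup:

```python
import socket
import time

# Placeholder values -- use the API key and endpoint from your own account.
API_KEY = "your-api-key"
SERVER = "your-server.carbon.hostedgraphite.com"
PORT = 2003  # the standard Carbon plaintext port

def format_metric(prefix, path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol:
    '<prefix>.<metric.path> <value> <timestamp>' plus a newline."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{prefix}.{path} {value} {ts}\n"

def send_metric(line):
    """Ship a single newline-terminated metric line over TCP."""
    with socket.create_connection((SERVER, PORT), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))

line = format_metric(API_KEY, "kubernetes.node1.cpu.user_percentage", 12.5, 1600000000)
# line == "your-api-key.kubernetes.node1.cpu.user_percentage 12.5 1600000000\n"
```

The Snap daemon batches and sends lines like this automatically; the sketch only illustrates the role the prefix and server play.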

          

Both of these pieces of information can be retrieved from Hosted Graphite. Log in at https://www.hostedgraphite.com/app/ and go to the Overview section.

Once logged in, click the “How do I send metrics?” button to reveal the server information.

    

[Screenshot: the “How do I send metrics?” panel in Hosted Graphite]

     

As shown below, the API key will be used as the prefix for each metric, and the URL endpoint will be used as the server destination.

       

[Screenshot: API key and URL endpoint in the Hosted Graphite UI]

         

With both of these values at hand, let’s update the prefix and server fields in the publish section of the configuration manifest below, and then deploy the daemon on the Kubernetes cluster using kubectl create -f snap_ds.yml.

        

apiVersion: apps/v1
kind: DaemonSet
metadata:
 name: snap
spec:
 selector:
   matchLabels:
     name: snap
 template:
   metadata:
     name: snap
     labels:
       name: snap
   spec:
     hostPID: true
     hostNetwork: true
     containers:
     - name: snap
       image: raintank/snap_k8s:latest
       volumeMounts:
         - mountPath: /sys/fs/cgroup
           name: cgroup
         - mountPath: /var/run/docker.sock
           name: docker-sock
         - mountPath: /var/lib/docker
           name: fs-stats
         - mountPath: /usr/local/bin/docker
           name: docker
         - mountPath: /proc_host
           name: proc
         - mountPath: /opt/snap/tasks
           name: snap-tasks
       ports:
       - containerPort: 8181
         hostPort: 8181
         name: snap-api
       imagePullPolicy: IfNotPresent
       securityContext:
         privileged: true
       env:
         - name: PROCFS_MOUNT
           value: /proc_host
     volumes:
       - name: dev
         hostPath:
           path: /dev
       - name: cgroup
         hostPath:
           path: /sys/fs/cgroup
       - name: docker-sock
         hostPath:
           path: /var/run/docker.sock
       - name: fs-stats
         hostPath:
           path: /var/lib/docker
       - name: docker
         hostPath:
           path: /usr/bin/docker
       - name: proc
         hostPath:
           path: /proc
       - name: snap-tasks
         configMap:
           name: snap-tasks
---
apiVersion: v1
kind: ConfigMap
metadata:
 name: snap-tasks
data:
 core.json: |-
   {
       "version": 1,
       "schedule": {
           "type": "simple",
           "interval": "10s"
       },
       "workflow": {
           "collect": {
               "metrics": {
                   "/intel/docker/*":{},
                   "/intel/procfs/cpu/*": {},
                   "/intel/procfs/meminfo/*": {},
                   "/intel/procfs/iface/*": {},
                   "/intel/linux/iostat/*": {},
                   "/intel/procfs/load/*": {}
               },
               "config": {
                   "/intel/procfs": {
                       "proc_path": "/proc_host"
                   }
               },
               "process": null,
               "publish": [
                   {
                       "plugin_name": "graphite",                   
                       "config": {
                           "prefix": "your-api-key-added-here",
                           "server": "your-server.carbon.hostedgraphite.com",
                           "port": 2003
                       }
                   }
               ]
           }
       }
   }

    

Now, if we run “kubectl get pod”, we should see our Snap daemon pods running on the Kubernetes cluster.

NAME         READY   STATUS    RESTARTS   AGE
snap-67sft   1/1     Running   0          1m
snap-fl7qv   1/1     Running   0          1m
snap-xs47f   1/1     Running   0          1m

     

In this setup, we are using a Kubernetes cluster running on Amazon Web Services (AWS). However, it doesn’t really matter which cloud provider you use. The same steps apply to a cluster running on Google Cloud / Google Kubernetes Engine (GKE) or Azure Cloud / Azure Kubernetes Service (AKS).

        

       

Viewing Metrics with Hosted Graphite

At this point, the Snap daemon running inside our Kubernetes cluster should be transmitting metrics to Hosted Graphite. So let’s head over to Hosted Graphite and browse to Metrics -> Account Diagnostics, which shows data about incoming metrics. In our case, this is how it looks:

    

[Screenshot: Live Metric Names And Limiting, Datapoint Rates]

       

[Screenshot: Total Account Metrics, Metrics Created And Deleted]

      

The following table shows activity on a Hosted Graphite account by protocol. As we can see, the Snap daemon uses the TCP and UDP protocols to send its metrics over to Hosted Graphite.

        

[Screenshot: Activity by Protocol]
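Hosted Graphite accepts the same plaintext metric format over UDP on port 2003 as well, trading delivery guarantees for lower overhead. A minimal sketch of a fire-and-forget UDP send (the hostname is a placeholder):

```python
import socket

def send_metric_udp(prefix, path, value, timestamp,
                    server="your-server.carbon.hostedgraphite.com", port=2003):
    """Fire-and-forget one plaintext-protocol metric over UDP.
    Returns the raw datagram payload that was sent."""
    payload = f"{prefix}.{path} {value} {timestamp}\n".encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (server, port))
    return payload
```

Because UDP gives no feedback when packets are dropped, the diagnostics page above is the place to confirm that datapoints actually arrive.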

     

If you are unable to see these metrics, double-check the configuration supplied to the Snap daemon and make sure that outbound port 2003 is not blocked by any firewall. If this still doesn’t help, reach out to Hosted Graphite support through the chat bubble in the Hosted Graphite window.
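A quick way to verify that port 2003 is reachable (for example, from a debug pod inside the cluster) is a plain TCP probe; a sketch, with the hostname as a placeholder:

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. can_reach("your-server.carbon.hostedgraphite.com", 2003)
```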

       

Plotting the Metrics on Hosted Grafana

Now for the most exciting part: plotting the raw metric data as visualizations that users can understand. We will use Grafana, an open source web application for interactive visualization of data using charts, graphs, and various other tools. It can connect to multiple data sources, such as Graphite, Elasticsearch, and MySQL, and it also provides alerting capabilities.

First, let’s log in to Hosted Graphite at https://www.hostedgraphite.com/app and browse to Dashboards -> Grafana to open the Hosted Grafana application.

Since Hosted Grafana is part of MetricFire’s offering, the data source connection between Grafana and Graphite is set up for us automatically.

Grafana comes with a lot of pre-built dashboards as part of its Grafana dashboards library, and MetricFire's support team is happy to install more wherever needed.

We can also download ready-made dashboards built for Kubernetes instead of creating one from scratch. And that’s what we are going to do: download a pre-built dashboard from the Grafana dashboards library for our use case.

On the Kubernetes Container Stats page, click the “Download JSON” link and import the file into the Hosted Grafana portal. Make sure to choose the appropriate Graphite data source.

             

[Screenshot: importing the dashboard JSON into Hosted Grafana]

        

Once imported, we can visualize the metrics in the Grafana dashboard, as shown below.

    

[Screenshot: the imported Kubernetes dashboard in Grafana]
       

We can also choose to have data visualized by each container / host in the Kubernetes cluster.

     

[Screenshot: metrics broken down by container and host]

Here are some of the above panel types, zoomed in:

[Screenshots: zoomed-in views of the dashboard panels]

       

Hosted Grafana is a very powerful visualization tool built on top of the open-source Grafana. It allows you to create your own visualizations, as well as set up alerts whenever a metric value crosses a certain threshold.

Check out our article Grafana Dashboards from Basic to Advanced to learn how to set up Grafana alerts, and build custom dashboards.

You can also create other types of visualizations based on the metrics exposed by Kubernetes. Have a look at the article Our Favorite Grafana Dashboards to create some of the more advanced dashboards.

           

Benefits of Monitoring Kubernetes Cluster with Hosted Graphite and Grafana

As we saw above, it takes only a few minutes to set up a Snap daemon inside our Kubernetes cluster and get Kubernetes monitoring up and running with Hosted Graphite and Hosted Grafana.

This approach has various advantages over a custom monitoring solution: there is no monitoring infrastructure to build or maintain yourself, and long-term storage, dashboards, alerting, and around-the-clock support all come built into the hosted platform.

      

What metrics are collected by the Snap Daemon?

To see all of the information about this daemon, check out the source repository here. You can also browse the list of collected metrics below:

     

Load

   

  • /intel/procfs/load/min1: number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 1 minute
  • /intel/procfs/load/min5: number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 5 minutes
  • /intel/procfs/load/min15: number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 15 minutes
  • /intel/procfs/load/min1_rel: number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 1 minute, per core
  • /intel/procfs/load/min5_rel: number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 5 minutes, per core
  • /intel/procfs/load/min15_rel: number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 15 minutes, per core
  • /intel/procfs/load/runnable_scheduling: the number of currently runnable kernel scheduling entities (processes, threads)
  • /intel/procfs/load/existing_scheduling: the number of kernel scheduling entities that currently exist on the system
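The "_rel" variants are simply the load averages divided by the number of cores, which makes alert thresholds portable across differently sized nodes. The arithmetic, sketched:

```python
def relative_load(load_avg, num_cores):
    """Normalize a load average by core count, as the *_rel metrics do.
    A result above 1.0 means more runnable jobs than cores, on average."""
    return load_avg / num_cores

# A 1-minute load of 8.0 saturates a 4-core node (2.0 per core)
# but leaves a 16-core node half idle (0.5 per core).
assert relative_load(8.0, 4) == 2.0
assert relative_load(8.0, 16) == 0.5
```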

      

Interface

   

  • /intel/procfs/iface/<interface_name>/bytes_recv: the total number of bytes of data received by the interface
  • /intel/procfs/iface/<interface_name>/bytes_sent: the total number of bytes of data transmitted by the interface
  • /intel/procfs/iface/<interface_name>/compressed_recv: the number of compressed packets received by the device driver
  • /intel/procfs/iface/<interface_name>/compressed_sent: the number of compressed packets transmitted by the device driver
  • /intel/procfs/iface/<interface_name>/drop_recv: the total number of packets dropped by the device driver while receiving
  • /intel/procfs/iface/<interface_name>/drop_sent: the total number of packets dropped by the device driver while transmitting
  • /intel/procfs/iface/<interface_name>/errs_recv: the total number of receive errors detected by the device driver
  • /intel/procfs/iface/<interface_name>/errs_sent: the total number of transmit errors detected by the device driver
  • /intel/procfs/iface/<interface_name>/fifo_recv: the number of FIFO buffer errors while receiving
  • /intel/procfs/iface/<interface_name>/fifo_sent: the number of FIFO buffer errors while transmitting
  • /intel/procfs/iface/<interface_name>/frame_recv: the number of packet framing errors while receiving
  • /intel/procfs/iface/<interface_name>/frame_sent: the number of packet framing errors while transmitting
  • /intel/procfs/iface/<interface_name>/multicast_recv: the number of multicast frames received by the device driver
  • /intel/procfs/iface/<interface_name>/multicast_sent: the number of multicast frames transmitted by the device driver
  • /intel/procfs/iface/<interface_name>/packets_recv: the total number of packets of data received by the interface
  • /intel/procfs/iface/<interface_name>/packets_sent: the total number of packets of data transmitted by the interface
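All of the interface metrics above are cumulative counters, so dashboards normally graph their per-second derivative (Graphite's perSecond() function does exactly this). The underlying calculation, sketched:

```python
def per_second(prev_value, prev_time, curr_value, curr_time):
    """Turn two samples of a monotonically increasing counter
    (e.g. bytes_recv) into a per-second rate."""
    dt = curr_time - prev_time
    if dt <= 0 or curr_value < prev_value:
        return None  # bad interval, or the counter reset (e.g. interface bounce)
    return (curr_value - prev_value) / dt

# Two samples of bytes_recv taken 10 seconds apart:
rate = per_second(1_000_000, 100, 6_000_000, 110)  # 500000.0 bytes/sec
```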

     

Memory

   

  • /intel/procfs/meminfo/active: the total amount of buffer or page cache memory, in bytes, that is in active use; this memory has been used more recently and is usually not reclaimed unless absolutely necessary
  • /intel/procfs/meminfo/active_anon: the amount of anonymous memory, in bytes, that has been used more recently and is usually not swapped out
  • /intel/procfs/meminfo/active_file: the amount of pagecache memory, in bytes, that has been used more recently and is usually not reclaimed until needed
  • /intel/procfs/meminfo/anon_huge_pages: the size of non-file-backed huge pages mapped into user-space page tables, in bytes
  • /intel/procfs/meminfo/anon_pages: the size of non-file-backed pages mapped into user-space page tables, in bytes
  • /intel/procfs/meminfo/bounce: the amount of memory used for block device "bounce buffers", in bytes
  • /intel/procfs/meminfo/buffers: the amount of physical RAM, in bytes, used for file buffers
  • /intel/procfs/meminfo/cached: the amount of physical RAM, in bytes, used as cache memory
  • /intel/procfs/meminfo/cma_free: the size of Contiguous Memory Allocator pages, in bytes, which are not used
  • /intel/procfs/meminfo/cma_total: the total size of Contiguous Memory Allocator pages, in bytes
  • /intel/procfs/meminfo/commit_limit: the amount of memory, in bytes, currently available to be allocated on the system based on the overcommit ratio
  • /intel/procfs/meminfo/committed_as: the amount of memory, in bytes, estimated to complete the workload; this is the worst-case value and also includes swap memory
  • /intel/procfs/meminfo/direct_map1g: the amount of memory, in bytes, mapped with 1 GB pages
  • /intel/procfs/meminfo/direct_map2m: the amount of memory, in bytes, mapped with 2 MB pages
  • /intel/procfs/meminfo/direct_map4k: the amount of memory, in bytes, mapped with standard 4 kB pages
  • /intel/procfs/meminfo/dirty: the total amount of memory, in bytes, waiting to be written back to disk
  • /intel/procfs/meminfo/hardware_corrupted: the amount of failed memory, in bytes (can only be detected when using ECC RAM)
  • /intel/procfs/meminfo/high_free: the amount of memory, in bytes, that is not directly mapped into kernel space
  • /intel/procfs/meminfo/high_total: the total amount of memory, in bytes, that is not directly mapped into kernel space; high memory is for the page cache and userspace
  • /intel/procfs/meminfo/huge_pages_free: the total number of hugepages available for the system
  • /intel/procfs/meminfo/huge_pages_rsvd: the number of huge pages for which a commitment to allocate from the pool has been made, but no allocation has yet been made
  • /intel/procfs/meminfo/huge_pages_surp: the number of huge pages in the pool above the value in /proc/sys/vm/nr_hugepages
  • /intel/procfs/meminfo/huge_pages_total: the total number of hugepages for the system
  • /intel/procfs/meminfo/hugepagesize: the size of each hugepage unit, in bytes
  • /intel/procfs/meminfo/inactive: the total amount of buffer or page cache memory, in bytes, that is free and available; this memory has not been used recently and can be reclaimed for other purposes
  • /intel/procfs/meminfo/inactive_anon: the amount of anonymous memory, in bytes, that has not been used recently and can be swapped out
  • /intel/procfs/meminfo/inactive_file: the amount of pagecache memory, in bytes, that can be reclaimed without a huge performance impact
  • /intel/procfs/meminfo/kernel_stack: the amount of memory allocated to kernel stacks, in bytes
  • /intel/procfs/meminfo/low_free: the amount of memory, in bytes, that is directly mapped into kernel space
  • /intel/procfs/meminfo/low_total: the total amount of memory, in bytes, that is directly mapped into kernel space; it may vary based on the type of kernel used
  • /intel/procfs/meminfo/mapped: the total amount of memory, in bytes, that has been used to map devices, files, or libraries using mmap()
  • /intel/procfs/meminfo/mem_available: the estimated amount of memory, in bytes, available for starting new applications without swapping
  • /intel/procfs/meminfo/mem_free: the amount of physical RAM, in bytes, left unused by the system (the sum of low_free + high_free)
  • /intel/procfs/meminfo/mem_total: the total amount of physical RAM, in bytes
  • /intel/procfs/meminfo/mem_used: the amount of physical RAM, in bytes, that is used; it equals mem_total - (mem_free + buffers + cached + slab)
  • /intel/procfs/meminfo/mlocked: the total amount of memory, in bytes, locked from userspace
  • /intel/procfs/meminfo/mmap_copy: the amount of memory, in bytes, used in copying mmap() (only present on certain kernel configurations)
  • /intel/procfs/meminfo/nfs_unstable: the size of NFS pages, in bytes, that have been sent to the server but not yet committed to stable storage
  • /intel/procfs/meminfo/page_tables: the total amount of memory, in bytes, dedicated to the lowest page table level
  • /intel/procfs/meminfo/quicklists: the amount of memory, in bytes, consumed by quicklists
  • /intel/procfs/meminfo/sreclaimable: the part of Slab, in bytes, that might be reclaimed, such as caches
  • /intel/procfs/meminfo/sunreclaim: the part of Slab, in bytes, that cannot be reclaimed under memory pressure
  • /intel/procfs/meminfo/shmem: the total amount of shared memory, in bytes
  • /intel/procfs/meminfo/slab: the total amount of memory, in bytes, used by the kernel to cache data structures for its own use
  • /intel/procfs/meminfo/swap_cached: the amount of swap, in bytes, used as cache memory
  • /intel/procfs/meminfo/swap_free: the total amount of free swap, in bytes
  • /intel/procfs/meminfo/swap_total: the total amount of swap available, in bytes
  • /intel/procfs/meminfo/unevictable: the amount of memory, in bytes, that cannot be reclaimed (for example, because it is mlocked or used as a RAM disk)
  • /intel/procfs/meminfo/vmalloc_chunk: the largest contiguous block of vmalloc area, in bytes, that is free
  • /intel/procfs/meminfo/vmalloc_total: the total size of the vmalloc memory area, in bytes
  • /intel/procfs/meminfo/vmalloc_used: the amount of vmalloc area, in bytes, that is used
  • /intel/procfs/meminfo/writeback: the total amount of memory, in bytes, actively being written back to disk
  • /intel/procfs/meminfo/writeback_tmp: the amount of memory, in bytes, used by FUSE for temporary writeback buffers
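Note that mem_used above is derived from the other meminfo metrics rather than measured directly; the arithmetic, sketched:

```python
def mem_used(mem_total, mem_free, buffers, cached, slab):
    """mem_used as defined above: total RAM minus memory that is free
    or readily reclaimable (file buffers, page cache, slab)."""
    return mem_total - (mem_free + buffers + cached + slab)

# A 16 GiB node with 4 GiB free, 1 GiB buffers, 6 GiB cache, 1 GiB slab
# (all values in bytes):
GiB = 1024 ** 3
used = mem_used(16 * GiB, 4 * GiB, 1 * GiB, 6 * GiB, 1 * GiB)  # 4 GiB used
```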

         

CPU

   

All CPU metrics are of type float64.

  • /intel/procfs/cpu/*/user_jiffies: the amount of time spent in user mode by the CPU with the given identifier
  • /intel/procfs/cpu/*/nice_jiffies: the amount of time spent in user mode with low priority by the CPU with the given identifier
  • /intel/procfs/cpu/*/system_jiffies: the amount of time spent in system mode by the CPU with the given identifier
  • /intel/procfs/cpu/*/idle_jiffies: the amount of time spent in the idle task by the CPU with the given identifier
  • /intel/procfs/cpu/*/iowait_jiffies: the amount of time spent waiting for I/O to complete by the CPU with the given identifier
  • /intel/procfs/cpu/*/irq_jiffies: the amount of time spent servicing interrupts by the CPU with the given identifier
  • /intel/procfs/cpu/*/softirq_jiffies: the amount of time spent servicing softirqs by the CPU with the given identifier
  • /intel/procfs/cpu/*/steal_jiffies: the amount of stolen time (time spent in other operating systems when running in a virtualized environment) by the CPU with the given identifier
  • /intel/procfs/cpu/*/guest_jiffies: the amount of time spent running a virtual CPU for guest operating systems under the control of the Linux kernel by the CPU with the given identifier
  • /intel/procfs/cpu/*/guest_nice_jiffies: the amount of time spent running a niced guest (a virtual CPU for guest operating systems under the control of the Linux kernel) by the CPU with the given identifier
  • /intel/procfs/cpu/*/active_jiffies: the amount of time spent in a non-idle state by the CPU with the given identifier
  • /intel/procfs/cpu/*/utilization_jiffies: the amount of time spent in non-idle and non-iowait states by the CPU with the given identifier
  • /intel/procfs/cpu/*/user_percentage: the percentage of time spent in user mode by the CPU with the given identifier
  • /intel/procfs/cpu/*/nice_percentage: the percentage of time spent in user mode with low priority by the CPU with the given identifier
  • /intel/procfs/cpu/*/system_percentage: the percentage of time spent in system mode by the CPU with the given identifier
  • /intel/procfs/cpu/*/idle_percentage: the percentage of time spent in the idle task by all CPUs
  • /intel/procfs/cpu/*/iowait_percentage: the percentage of time spent waiting for I/O to complete by the CPU with the given identifier
  • /intel/procfs/cpu/*/irq_percentage: the percentage of time spent servicing interrupts by the CPU with the given identifier
  • /intel/procfs/cpu/*/softirq_percentage: the percentage of time spent servicing softirqs by the CPU with the given identifier
  • /intel/procfs/cpu/*/steal_percentage: the percentage of stolen time (time spent in other operating systems when running in a virtualized environment) by the CPU with the given identifier
  • /intel/procfs/cpu/*/guest_percentage: the percentage of time spent running a virtual CPU for guest operating systems under the control of the Linux kernel by all CPUs
  • /intel/procfs/cpu/*/guest_nice_percentage: the percentage of time spent running a niced guest (a virtual CPU for guest operating systems under the control of the Linux kernel) by the CPU with the given identifier
  • /intel/procfs/cpu/*/active_percentage: the percentage of time spent in a non-idle state by the CPU with the given identifier
  • /intel/procfs/cpu/*/utilization_percentage: the percentage of time spent in non-idle and non-iowait states by the CPU with the given identifier
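The percentage metrics are computed from deltas of the corresponding jiffies counters between two samples. For instance, utilization_percentage (time in non-idle, non-iowait states) can be sketched as:

```python
def utilization_percentage(prev, curr):
    """Percent of time between two samples spent in non-idle, non-iowait
    states. prev/curr map state name -> cumulative jiffies for one CPU,
    as read from /proc/stat."""
    total = sum(curr.values()) - sum(prev.values())
    idle = (curr["idle"] - prev["idle"]) + (curr["iowait"] - prev["iowait"])
    return 100.0 * (total - idle) / total

prev = {"user": 100, "system": 50, "idle": 800, "iowait": 50}
curr = {"user": 160, "system": 70, "idle": 940, "iowait": 70}
# 240 total jiffies elapsed, 160 of them idle/iowait -> ~33.3% utilization
```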

      

Conclusion

In this article, we looked at how Hosted Graphite and Hosted Grafana can help us monitor Kubernetes clusters without much setup.

Sign up here for a free trial of our Hosted Graphite and Grafana offering. Also, if you have any questions about our products, or about how MetricFire can help your company, talk to us directly by booking a demo.
