Kafka performance monitoring metrics


Introduction

In this article, we will analyze which metrics matter for monitoring Kafka performance and why it is important to monitor them constantly. We will also look at how to monitor these metrics for Kafka using Hosted Graphite by MetricFire.

            

To learn more about MetricFire, book a demo with the MetricFire team or sign up for the free trial.

                

Key Takeaways

  1. Kafka is an open-source distributed event streaming platform used for storing, processing, and analyzing streaming data.
  2. Kafka offers high throughput, low latency, fault tolerance, durability, scalability, and real-time data processing.
  3. Monitoring Kafka is essential for ensuring the stable operation of applications.
  4. Grafana is an open-source system for visualizing metrics with customizable dashboards.
  5. Graphite is a monitoring tool that stores and processes data, and Grafana can connect to it as a data source.

 

What is Kafka?

Kafka is an open-source distributed event streaming platform used by thousands of users to store, process, and analyze streaming data.

              

How does Kafka work?

Kafka consists of servers and clients that communicate over a high-performance TCP network protocol. Kafka operates as a cluster of one or more servers. Some of these servers store data. Other servers import and export data as streams of events to integrate Kafka with your existing systems. The Kafka cluster is highly scalable and resilient: if one of the servers fails, the others take over its work to ensure continuous operation without data loss. Clients enable you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner.

              

Kafka allows you to create topics and then connect applications that write records to those topics. Records are byte arrays in which you can store any information. A record has four attributes: key, value, timestamp, and headers. Only the value is strictly required; the key, timestamp, and headers are optional.
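
To make the record structure concrete, here is a minimal sketch using the official Java producer client. The broker address, topic name, key, value, and header are all hypothetical; only the value is strictly required, but the example sets a key and a header as well.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerRecordExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key and value carry the data; the timestamp is filled in automatically
            // and headers are optional metadata.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "order-42", "{\"amount\": 19.99}");
            record.headers().add("source", "checkout-service".getBytes());
            producer.send(record);
        }
    }
}
```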

             

A Kafka deployment consists of four main components:

  1. The broker handles all requests from clients and stores data. A cluster can have one or more brokers.
  2. ZooKeeper maintains the state of the cluster.
  3. The producer sends records to the broker.
  4. The consumer receives batches of records from the broker (a minimal consumer sketch follows this list).
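
As a complement to the producer sketch above, here is a minimal consumer sketch using the Java client that polls batches of records from the broker. The broker address, consumer group, and topic name are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "example-group");             // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));           // hypothetical topic
            while (true) {
                // poll() returns the next batch of records fetched from the broker
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s offset=%d%n",
                            record.key(), record.value(), record.offset());
                }
            }
        }
    }
}
```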

                    

                           

Benefits of using Kafka

Let’s take a closer look at the benefits of using Kafka.

  1. High throughput. Kafka sustains throughput of thousands of messages per second and handles large volumes of high-velocity data.
  2. Low latency. Kafka can process messages with latency in the range of milliseconds.
  3. Fault tolerance. One of Kafka's main advantages: the cluster keeps working even when a node or machine fails.
  4. Durability. Kafka replicates messages across brokers, which is one of the reasons messages are not lost.
  5. Scalability. Kafka can be scaled up on the fly by adding additional nodes.
  6. Distributed architecture. Kafka's distributed architecture makes it scalable by leveraging capabilities such as replication and partitioning.
  7. Consumer flexibility. Kafka can integrate with many kinds of consumers written in different programming languages.
  8. Real-time processing. Kafka can process data pipelines in real time.

               

Kafka metrics

To ensure the stable operation of applications that depend on Kafka, you need to monitor its status and efficiency continuously. To do this, track the key metrics of each component in the cluster:

  • Broker metrics.
  • Producer metrics.
  • Consumer metrics.
  • ZooKeeper metrics.

These are the main groups of Kafka performance metrics; the sections below walk through each of them.

Broker metrics

Every message passes through a broker before it is consumed, so brokers play a key role in Kafka. It is important to track their performance characteristics, which can be divided into three main categories:

  • Kafka system metrics.
  • JVM garbage collector metrics.
  • Host metrics.

           

Kafka system metrics

                 

  • UnderReplicatedPartitions: The number of under-replicated partitions across all topics on the broker. A non-zero value is a leading indicator of one or more brokers being unavailable.
  • IsrShrinksPerSec/IsrExpandsPerSec: If a broker goes down, the in-sync replica sets (ISRs) for some of its partitions shrink. When that broker comes back up, the ISRs expand once its replicas are fully caught up.
  • ActiveControllerCount: Indicates whether this broker is the active controller. The value should be 1 on exactly one broker in the cluster, since only one broker acts as the controller at any given time.
  • OfflinePartitionsCount: The number of partitions that don't have an active leader and are therefore neither writable nor readable. A non-zero value indicates that brokers are unavailable.
  • LeaderElectionRateAndTimeMs: The rate and duration of partition leader elections, which happen when ZooKeeper loses contact with the partition leader. This metric may indicate that a broker is unavailable.
  • UncleanLeaderElectionsPerSec: If the broker that leads a partition becomes unavailable, a new leader may be chosen from out-of-sync replicas. This metric can indicate potential message loss.
  • TotalTimeMs: The total time taken to process a request.
  • PurgatorySize: The number of requests waiting in purgatory. Can help identify the main causes of delay.
  • BytesInPerSec/BytesOutPerSec: The rate of data brokers receive from producers and the rate consumers read from brokers. This is an indicator of the overall throughput and workload in the Kafka cluster.
  • RequestsPerSecond: The frequency of requests from producers, consumers, and followers.
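
Most of these broker metrics are exposed over JMX. As a rough sketch, assuming the broker was started with remote JMX enabled on port 9999 (for example via the JMX_PORT environment variable), you can read a metric such as UnderReplicatedPartitions like this:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerMetricReader {
    public static void main(String[] args) throws Exception {
        // Assumes a broker with remote JMX enabled on localhost:9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ObjectName underReplicated = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            Object value = connection.getAttribute(underReplicated, "Value");
            System.out.println("UnderReplicatedPartitions = " + value);
        }
    }
}
```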

                             

JVM garbage collector metrics

                     

  • CollectionCount: The total number of young or old garbage collection processes executed by the JVM.
  • CollectionTime: The total amount of time in milliseconds that the JVM spent executing young or old garbage collection processes.
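
These are the standard values reported by the JVM's GarbageCollector MXBeans. A small sketch that reads them from the local JVM; for a remote broker you would query the same java.lang:type=GarbageCollector MBeans over JMX, as in the previous example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcMetrics {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Cumulative number of collections and total time spent collecting, per collector.
            System.out.printf("%s: CollectionCount=%d CollectionTime=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```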

                     

Host metrics

          

  • Page cache reads ratio: The ratio of reads served from the page cache to reads that go to disk.
  • Disk usage: The amount of used and available disk space.
  • CPU usage: The CPU is rarely the source of performance issues. However, if you see spikes in CPU usage, this metric should be investigated.
  • Network bytes sent/received: The amount of incoming and outgoing network traffic.

                                          

Producer metrics

Producers are processes that publish messages to Kafka topics, from which consumers read them. If producers stop working, consumers will not receive new messages. Let's take a look at the key producer metrics.

                     

  • compression-rate-avg: The average compression rate of sent batches.
  • response-rate: The average number of responses received per second.
  • request-rate: The average number of requests sent per second.
  • request-latency-avg: The average request latency in milliseconds.
  • outgoing-byte-rate: The average number of outgoing bytes per second.
  • io-wait-time-ns-avg: The average length of time the I/O thread spent waiting for a socket, in nanoseconds.
  • batch-size-avg: The average number of bytes sent per partition per request.
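
The Java producer exposes these metrics programmatically through its metrics() method (they are also available over JMX under the kafka.producer domain). A minimal sketch, assuming a broker at localhost:9092, that prints a few of them; the values remain NaN or 0 until records are actually sent.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.Set;

public class ProducerMetricsDump {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        Set<String> wanted = Set.of("compression-rate-avg", "request-latency-avg",
                "outgoing-byte-rate", "batch-size-avg");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Print only the producer metrics discussed above.
            producer.metrics().forEach((metricName, metric) -> {
                if (wanted.contains(metricName.name())) {
                    System.out.printf("%-22s %s%n", metricName.name(), metric.metricValue());
                }
            });
        }
    }
}
```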

                 

Consumer metrics

Monitoring consumer metrics can show how efficiently data is being retrieved by consumers, which can help identify system performance problems. Let’s take a look at the consumer metrics below.

                       

  • records-lag: The number of messages by which the consumer is behind the producer on a given partition.
  • records-lag-max: The maximum record lag. A growing value means the consumer is not keeping up with the producers.
  • bytes-consumed-rate: The average number of bytes consumed per second for a specific topic or across all topics.
  • records-consumed-rate: The average number of records consumed per second for a specific topic or across all topics.
  • fetch-rate: The number of fetch requests per second from the consumer.
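
Consumer lag is the headline metric here. Besides reading records-lag from the consumer's metrics() map or over JMX, you can compute the same lag directly by comparing the broker's latest offsets with the consumer's position. A rough sketch, with the broker address, consumer group, and topic assumed for illustration:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ConsumerLagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "example-group");             // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));           // hypothetical topic
            consumer.poll(Duration.ofSeconds(2));            // join the group and get an assignment

            // Lag per partition = latest offset on the broker minus the consumer's position.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(consumer.assignment());
            for (TopicPartition partition : consumer.assignment()) {
                long lag = endOffsets.get(partition) - consumer.position(partition);
                System.out.printf("%s lag=%d%n", partition, lag);
            }
        }
    }
}
```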

                 

ZooKeeper metrics

ZooKeeper is an essential component of a Kafka deployment: if ZooKeeper goes down, Kafka stops working. ZooKeeper stores information about brokers and Kafka topics, applies quotas to control the rate of traffic through the cluster, and stores information about replicas. Below are the key ZooKeeper metrics.

                          

  • outstanding-requests: The number of requests queued on the server.
  • avg-latency: The average time to respond to a client request, in milliseconds.
  • num-alive-connections: The number of clients connected to ZooKeeper.
  • followers: The number of active followers.
  • pending-syncs: The number of pending syncs from followers.
  • open-file-descriptor-count: The number of file descriptors in use.
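
ZooKeeper exposes these values through its four-letter-word commands, most usefully mntr. A minimal sketch that sends mntr over a plain socket, assuming ZooKeeper runs on localhost:2181 and the command is whitelisted (4lw.commands.whitelist=mntr):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZooKeeperMntr {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 2181)) {
            socket.getOutputStream().write("mntr".getBytes(StandardCharsets.US_ASCII));
            socket.getOutputStream().flush();

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            // Each line is a key/value pair, e.g. zk_avg_latency, zk_outstanding_requests,
            // zk_num_alive_connections, zk_open_file_descriptor_count, ...
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```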

                      

Collecting Kafka metrics

There are several tools for collecting Kafka metrics:

  1. JConsole is a GUI that ships with the JDK. It provides an interface for browsing all of Kafka's JMX metrics.
  2. JMX. Many monitoring tools can collect JMX metrics from Kafka through JMX plugins, through metric reporter plugins, or through connectors that write JMX metrics to Graphite or other systems (a rough sketch of this approach follows the list).
  3. Burrow is a tool for monitoring consumer lag that provides detailed metrics on the efficiency of all consumers.
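
As a sketch of the connector approach in point 2 above, the fragment below reads one broker metric over JMX and forwards it using Graphite's plaintext protocol. The JMX port, Graphite host and port, and the metric path are all assumptions; a hosted Graphite service will typically also require an API key prefix on the metric path.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class JmxToGraphite {
    public static void main(String[] args) throws Exception {
        // 1. Read a broker metric over JMX (assumes JMX enabled on localhost:9999).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        Object bytesInRate;
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ObjectName bytesIn = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec");
            bytesInRate = connection.getAttribute(bytesIn, "OneMinuteRate");
        }

        // 2. Forward it using Graphite's plaintext protocol: "<path> <value> <timestamp>\n"
        //    (assumes a Graphite Carbon listener on localhost:2003).
        long now = System.currentTimeMillis() / 1000;
        try (Socket socket = new Socket("localhost", 2003);
             PrintWriter writer = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8))) {
            writer.printf("kafka.broker1.bytes_in_per_sec %s %d\n", bytesInRate, now);
            writer.flush();
        }
    }
}
```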

                 

What is Graphite and Grafana?

Grafana is an open-source system that provides tools for the graphical visualization of metrics. Grafana has a lot of different customizable dashboards that let you create beautiful graphs and charts. The data source for Grafana can be any place where you store your data.

                

Graphite is a monitoring tool that allows you to store and process data. Grafana can connect to Graphite as a data source and can be used with it to monitor your system’s metrics.

            

Using hosted Grafana and Graphite for monitoring Kafka metrics

To monitor Kafka metrics, use Grafana dashboards. First, choose the type of dashboard that suits you and create it. Then choose a data source; Graphite is an excellent data source for Grafana. All the Kafka metrics you have collected with the tools above first need to be saved in Graphite. Next, create and configure the necessary charts. The finished dashboard can be exported to a JSON file. You can also create an external link to the dashboard or take a screenshot of it.

                

How to integrate Kafka and Grafana via Graphite?

Grafana provides many tools for creating beautiful, customizable dashboards for monitoring Kafka metrics. Store your metrics in Graphite, connect Graphite to Grafana as a data source, and monitor your metrics easily and conveniently.

            

For more information on how to integrate Kafka with Grafana via Graphite, book a demo with the MetricFire team or sign up for MetricFire for free.

             

Benefits of using MetricFire

MetricFire offers hosted Graphite and Grafana, which make monitoring Kafka metrics easier and more convenient. With MetricFire, you can focus solely on your Kafka performance metrics while we take care of setting up and running the monitoring system.

             

Conclusion

In this article, we explored how the Kafka event streaming platform works and the benefits of using it. We also took a closer look at Kafka performance metrics and at tools for monitoring them, such as the hosted Graphite and Grafana offered by MetricFire.

            

To learn more about MetricFire, book a demo with our experts or sign up for the free trial today.
