You've armed yourself with all the knowledge and technical skills to install and run your own Graphite.
Then you suddenly find yourself stuck and you cannot seem to get past a particular hurdle.
What do you do?
Call in a tech guy?
So far, you must have already wasted an hour.
Think how productive you could have been if you had just eliminated that wasted time.
Here is a first-hand experience of how someone tried to run their own Graphite. I assure you that their journey of installing and running Graphite on their own will serve as a good lesson in realizing that some things are best left to the experts!
The first step is always the hardest
Let's first look at how one of our MetricFire community members wound up using Graphite in the first place!
They initially turned to Graphite to generate all of their business-related graphs, charts, tables, and indicators, displaying every type of metric from hundreds of internal and external sources. Since their company operates in many countries, they have tens of thousands of metrics from AWS instances to deal with, like many other international businesses.
They used Datadog as their monitoring backend and CloudWatch for their AWS infrastructure, and they also had experience configuring alerts in Grafana. While trying to figure out how to display their metrics, they ran into the following problems:
- It became impossible for them to import all metrics through a single physical server to Grafana.
- They discovered that their company did not have enough hardware to work efficiently with a sharply increased flow of metrics.
Knowing that their company lacked sufficient hardware capacity was a challenge, and finding the additional budget to explore other options was harder still.
In search of a better solution
Since they could no longer display their metrics using Grafana alone, they needed to find another solution. Then they found it… BigGraphite, which uses Cassandra as its backend.
To keep this article from running too long, I will not explain what Cassandra is, but there are plenty of resources that can help you understand its role. Based on their assessment of the alternatives to BigGraphite, I would also suggest considering Prometheus.
Prometheus is a great fit if you are not looking for a longer retention period with downsampling, but instead want continuous active development and backend storage in a distributed, fault-tolerant, eventually consistent database. Be aware, though, that this whole assessment process took many hours and left them frustrated at times.
Let's get started with BigGraphite
First of all, they needed to decide how many clusters would store the indexes and the data points. For instance, an index workload that relies on SASI indexes can get slower as nodes are added, which means that adding more nodes to handle a growing volume of data points can actually degrade index performance. With that trade-off in mind, you have to decide how many Cassandra nodes you are going to use.
You may be surprised to hear that you might need up to 8 Cassandra nodes (or more) once you handle more than two hundred thousand data points. Whatever configuration-management tools you use (Chef, for example), I bet you will still need substantial computing resources.
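To see where a node count like that comes from, here is a hedged back-of-envelope sizing sketch. The per-node write ceiling and replication factor are assumed figures chosen purely for illustration, not Cassandra benchmarks:

```python
import math

# Hypothetical capacity sketch. The per-node write ceiling (50k writes/s)
# and replication factor (2) are assumed figures, not measured benchmarks.
def cassandra_nodes_needed(points_per_sec: int,
                           per_node_writes_per_sec: int = 50_000,
                           replication_factor: int = 2) -> int:
    """Rough node count: total write load times RF over the per-node ceiling."""
    return math.ceil(points_per_sec * replication_factor / per_node_writes_per_sec)

cassandra_nodes_needed(200_000)  # -> 8 under these assumptions
```

Swap in your own measured per-node throughput and the formula gives you a starting point for cluster sizing, before accounting for headroom and compaction overhead.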
One of the important lessons this team learned from the experience is that BigGraphite is complex, with lots of processes. Even though it is written in Python, you may find that not just any Python interpreter will handle the load.
Experiences with Whisper and Go
The reason they switched storage backends is that they wanted something offering scalability, high performance, and high availability all at once. Looking for those qualities became obvious as the business grew.
When trying Whisper, they hit their biggest constraint: IOPS. They needed hundreds of thousands of IOPS, and their chosen Amazon Elastic Block Store (EBS) volumes could not keep up, so Whisper could not give them the high performance and availability they expected.
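Why does Whisper chew through IOPS so fast? It stores each metric in its own file, so every flush interval costs at least one write operation per active metric. A rough lower bound (an estimate, not a measurement) looks like this:

```python
# Back-of-envelope sketch: Whisper keeps one file per metric, so each
# flush interval costs at least one write I/O per active metric.
def required_write_iops(active_metrics: int, flush_interval_s: int = 60) -> float:
    """Lower bound on sustained write IOPS for a Whisper backend."""
    return active_metrics / flush_interval_s

required_write_iops(10_000_000)  # ~167k writes/s for 10M active series
```

At tens of millions of active series, even generously provisioned EBS volumes fall short of that write rate, which matches the constraint this team ran into.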
Then they turned to Golang. The go-graphite stack includes carbon-relay-ng (a fast relay), go-carbon (for receiving and writing metrics), carbonapi (which answers Grafana read requests), and carbonzipper (which reads from the backends and merges the results).
In their test setup, writes go through an AWS NLB that dispatches them to the relays, and reads go through an AWS ALB that dispatches them to the API servers. You can modify carbon-relay-ng as many times as you need until you reach your goal.
Note that, except for the region that stores the data points, every region runs local relays that forward writes to the main relay. By making sure all points reach at least one relay, and by tweaking carbon-relay-ng again and again, they eventually scaled the go-graphite setup successfully.
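Everything flowing into those relays rides on Graphite's plaintext protocol: one line per data point, `<metric.path> <value> <timestamp>`, sent over TCP (port 2003 by default). A minimal sketch of a sender follows; the relay hostname is a placeholder, not a real endpoint:

```python
import socket
import time

# Graphite plaintext protocol: one "<metric.path> <value> <timestamp>\n"
# line per data point, over TCP (default port 2003).
def format_line(path: str, value: float, timestamp: int) -> str:
    return f"{path} {value} {timestamp}\n"

def send_metric(path: str, value: float,
                host: str = "relay.example.com",  # placeholder relay host
                port: int = 2003) -> None:
    line = format_line(path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

Because the format is just newline-delimited text, any component in the chain (NLB, relay, go-carbon) can be tested with a few hand-written lines and `nc` before you automate anything.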
Depending on the type of cluster you use, you can handle a huge number of data points per second. For example, a cluster of 14 c5d.4xlarge instances can handle 400k data points per second.
But be aware that whatever replication factor you use, IOPS will still be your main enemy. Replication alone is not enough: a node can still go down later and take its history with it, so they still needed to add and replace nodes. Replication works much like the copy-and-paste you use on documents: data is replicated to the next node, so each node holds a copy of the previous node's data.
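The pain of adding and replacing nodes comes from how metrics are assigned to them. Carbon-style setups typically place each metric on a consistent-hash ring, so when the node list changes, some metrics hash to a new owner and their history must be moved. Here is an illustrative sketch of such a ring (similar in spirit, not carbon's actual implementation):

```python
import hashlib
from bisect import bisect

# Illustrative consistent-hash ring, similar in spirit to how carbon picks
# the storage node that owns a metric. A sketch, not carbon's actual code.
class HashRing:
    def __init__(self, nodes, vnodes_per_node=100):
        entries = []
        for node in nodes:
            # Each node gets many virtual points to spread load evenly.
            for i in range(vnodes_per_node):
                digest = hashlib.md5(f"{node}:{i}".encode()).hexdigest()
                entries.append((int(digest, 16), node))
        entries.sort()
        self._keys = [k for k, _ in entries]
        self._nodes = [n for _, n in entries]

    def node_for(self, metric: str) -> str:
        """Walk clockwise from the metric's hash to the next node point."""
        h = int(hashlib.md5(metric.encode()).hexdigest(), 16)
        idx = bisect(self._keys, h) % len(self._keys)
        return self._nodes[idx]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("app.web.latency.p99")  # always maps to the same node
```

Adding a fourth node to the ring changes the owner of only a fraction of metrics, but that fraction still has to be physically copied, which is exactly the rebalancing work described next.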
This rebalancing is handled with carbonate, which helps filter and move data across the nodes. However, when they used carbonate to rebalance metrics while scaling the cluster, there was a high chance of multiple nodes colliding with each other. They had to patch it internally, since carbonate risked missing or deleting active nodes. This is a serious problem they still need to solve.
One more limitation they noticed during the experiment is that Graphite cannot narrow down data on its own. When you issue a query, it looks into every directory, then reads and returns all results. In short, every time you use Graphite, it scans all your data regardless of its size and date.
In two words, tremendously difficult!
After experimenting with all those Graphite implementations for many hours and days, I thought I should have just hired an expert, or a company with years of Graphite experience. What is the point of experimenting with all these options when there are experts already out there to help you with Graphite, Grafana, and similar services?
Did I need to Google every single problem on my own? If I had asked an expert from the start, how many hours would I have saved?
Especially when you are doing business, do you need to experiment with every single technology and software?
Time is money.
Before you finish this article, consider the Hosted Graphite service offered by MetricFire. It has all of the best features of open-source Graphite, and you can sign up for a free trial, book a demo, and talk to one of their experts!