Table of Contents
Graphite is a leading open source time-series metrics monitoring system. First released in 2008, Graphite modernized the way organizations monitored time-series data by introducing an original network-based approach to ingest and handle metrics data coming from external systems, without requiring painful protocols such as SNMP. MetricFire runs a Hosted Graphite service - you can try our free trial today and skip the set up and management of your Graphite installation. This article is part 1 in a two-part series. Take a look at this series to see what exactly Hosted Graphite does behind the scenes. Check out the second article about Graphite installation and setup here.
This section describes how Graphite works and highlights important concepts.
The first architecture described below consists of a single server instance aiming to cover needs for small and medium environments. The second one -- consisting of several Graphite instances -- addresses the needs for large and/or distributed environments.
Single-instance Architecture and Constraints
The basic architecture of a Graphite monitoring environment looks like the following diagram. There is a central server providing a handler (namely carbon-cache) where clients feed data through the network. Graphite uses a push approach of data collection, meaning that each client decides when to push data to the server. This can be seen in contrary to a pull approach (used by Prometheus for example) where it’s the server that decides when to collect data from clients. This push approach has its benefits and its drawbacks as already discussed here. We’ll not dive in again in full detail, but we should consider the following aspects regarding Graphite:
- Each client should be aware of and be able to handle cases where the server can’t receive and ingest the data sent to it.
- A Graphite client can be any piece of software, no matter the language selected, as long as it’s able to feed the server via the network using a protocol and data format supported by Graphite. These protocol and data formats will be discussed later in this section.
- Having each client deciding when to push data to the server implies especially that the server is able to ingest high frequency data coming from many clients at a time (especially in large environments). Otherwise the monitoring would not scale. Considering this aspect, Graphite leverages various features and deployment strategies allowing it to cover, small, medium and large monitoring environments. We’ll discuss some of those capabilities next. When possible, we’ll provide additional links to help readers go further on concepts that will not be discussed.
A basic Graphite architecture with a server and three clients. The server enables a carbon-cache handler to collect and ingest metrics data pushed to it by those clients.
Graphite provides efficient ingesting capabilities such as using in-memory caching of metrics before storage. However, in an environment with a huge number of clients and/or an environment that requires a high frequency of data ingestion, a single server instance may not be enough to support the load. Graphite has advanced deployment capabilities to cope with these situations.
A classical distributed deployment scenario is sketched in the next diagram. In such a scenario, there is a special Graphite instance acting as a load-balancing front end for a set of other Graphite instances. This front end instance enables a special handler called carbon-relay, whose role is to collect data from clients and dispatch them to other instances without any additional processing. It hence acts as a load-balancer that forwards the incoming data to other instances according to rules pre-defined in its configuration. Each of the backend instances enables a carbon-cache handler to collect, ingest and store the metrics it receives. Since the carbon-relay handler does not do complex processing compared to a carbon-cache, this architecture allows us to easily scale a Graphite environment to support a huge number of clients, even if with a high frequency of data feeding.
A Graphite architecture with load-balancing -- There is a front end Graphite server enabling a carbon-relay handler, which collects metrics and dispatches them to other Graphite servers with carbon-cache daemons enabled to collect, ingest and store those metrics.
Graphite handles metrics where the most detailed resolution for time sampling is a second. This means that if several samples of metrics are collected with the same timestamp (in seconds), those samples cannot be stored separately in Graphite. To cope with those kinds of situations, Graphite provides a special handler (namely Carbon Aggregator) which can be configured to collect all samples received in a configured period of time. It aggregates them using functions such as sum and average, to generate single metrics, that can be stored. It's also common in those cases to have StatsD in front of Graphite to aggregate any data with less than a second resolution.
The capabilities of Graphite's aggregators also cover various other use cases, such as:
- When the accuracy needed for metrics is less than the frequency with which samples are received. For example, imagine a scenario where metrics are received every couple of seconds, but we want analytics that just provide average metrics every 5 minutes.
- The aggregation may then be enabled in order to minimize the storage capacity.
- When users would just want to get the sum of samples received in a given interval. As an example, consider a metrics collector that pushes the number of requests received by a certain service every second, but we just want an analytics that shows the amount of metrics received every five minutes.
TCP/UDP Data Feeding
Graphite network handlers can be enabled in TCP or UDP modes. When enabled in TCP mode, data exchanges between each client and the server leverage the reliability of the TCP protocol and guarantee no loss of data. But this implies that, each data exchange induces latency overheads due to the synchronization that the TCP protocol requires to be reliable. Such a reliability can be required for certain use cases, but in some use cases a little data loss may be tolerated. Imagine for example a metrics monitoring environment where a lot of data are generated every couple of seconds and where analytics require less resolution. In those cases, Graphite enables the ability to set up network listeners in UDP mode. This makes the data feeding faster, and would hence be really useful in large environments.
Each metrics sample pushed to a Graphite server shall have the following entries:
- A metric name, should be a free ASCII string identifying the metric.
- A value, shall be an integer or a float point number corresponding to the value of the metrics sample.
- A timestamp, expressed as a number of seconds since epoch.
Metrics names may contain one or more dot (e.g. server1.application1.request_count), and they’re commonly referred to as a metrics path. This dot-based metric name is internally used by Graphite to organize data storage in a way that optimizes the access and retrieval of metrics. Further, as illustrated on the below figure, it can be also used as a hint for a tree-based data exploration and visualization of metrics.
Illustration of a Graphite metrics tree -- Here metrics paths include at the first level, host groups, at the second level, hostnames, and at the third and last level, actual metrics names.
Pushing Metrics to the Server
There are two ways to feed metrics in Graphite, either as a single plaintext metric or as a binary set of metrics using pickle protocol. This latter approach unlocks the benefits of pushing data in bulk, while limiting the size of data transiting across the network, using a compact binary data serialization. For each of these approaches the Graphite server needs to enable appropriate listeners, either on carbon-cache or on carbon-relay, to handle those data. For example, a default installation of Graphite enables carbon-cache listening on port 2003 for single metrics, and on port 2004 for pickle data.
Metrics received through each listener are processed and stored as described in the following section.
Internally Graphite stores metrics in a file-based database (Whisper by default). This database has some essential characteristics:
- There is a base directory from where metrics are stored, each in a separate file. If the metric name is expressed as a path as discussed above, i.e. including one or more dots, those dots are handled as subdirectories from the base location. E.g. A metrics named server1.application1.request_count would be stored in a file located at server1/application1/request count.wsp from the base metrics directory.
- A metrics file has a fixed size, meaning that it can not grow indefinitely. To guarantee that, metrics are defined with two important properties: a retention period and data resolution over time. As an example, we can set it so that metrics will be stored one year (retention), with a resolution of 1 second for the last two weeks, 5 minutes for the last 2 months and 1-hour resolution the rest of the time.
- The retention period and the resolution allow us to determine the size of each file.
Once metrics have been ingested, there are various means to visualize them. The first one is to use Graphite-Web, the native visualization tool provided by the Graphite project. However, it may not be flexible enough for operations visualization. That's why users often opt to use Grafana, which provides better visualization features combined with the fact that it can handle many Graphite instances simultaneously. Finally in some rarer cases, users may also opt for custom-made visualization systems that retrieves metrics in Graphite through Graphite Web or Graphite API.
To keep reading about Graphite, go to the second article in this series about installing and setting up Graphite. If you're interested in using Graphite, but the setup and management of the Graphite installation would be too much, try our Hosted Graphite free trial. Also, feel free to book a demo and talk to us in person about monitoring your infrastructure.