1. Basic Concepts
1.1.1. Single-instance Architecture and Constraints
1.1.2. Multi-instances Architecture
1.2. Metrics Aggregation
1.3. TCP/UDP Data Feeding
2. Metrics Management
2.1. Metrics Format
2.2. Metrics Path
2.3. Pushing Metrics to the Server
2.4. Data Storage
Graphite is a leading open source time-series metrics monitoring system. First released in 2008, Graphite modernized the way organizations monitor time-series data by introducing an original network-based approach to ingesting and handling metrics data coming from external systems, without requiring painful protocols such as SNMP. This article is part 1 of a two-part series; part 2 covers Graphite installation and setup.
This section describes how Graphite works and highlights important concepts.
The first architecture described below consists of a single server instance and aims to cover the needs of small and medium environments. The second one -- consisting of several Graphite instances -- addresses the needs of large and/or distributed environments.
The basic architecture of a Graphite monitoring environment looks like the following diagram. There is a central server providing a handler (namely carbon-cache) to which clients feed data over the network. Graphite uses a push approach to data collection, meaning that each client decides when to push data to the server. This contrasts with a pull approach (used by Prometheus, for example), where the server decides when to collect data from clients. The push approach has benefits and drawbacks that have been discussed elsewhere. We won't go over them again in full detail, but we should consider the following aspects regarding Graphite:
A basic Graphite architecture with a server and three clients. The server enables a carbon-cache handler to collect and ingest metrics data pushed to it by those clients.
Graphite provides efficient ingestion capabilities, such as in-memory caching of metrics before storage. However, in an environment with a huge number of clients and/or one that requires a high frequency of data ingestion, a single server instance may not be enough to support the load. Graphite has advanced deployment capabilities to cope with these situations.
A classical distributed deployment scenario is sketched in the next diagram. In such a scenario, there is a special Graphite instance acting as a load-balancing front end for a set of other Graphite instances. This front end instance enables a special handler called carbon-relay, whose role is to collect data from clients and dispatch it to other instances without any additional processing. It hence acts as a load balancer that forwards the incoming data to other instances according to rules predefined in its configuration. Each of the backend instances enables a carbon-cache handler to collect, ingest and store the metrics it receives. Since the carbon-relay handler does far less processing than a carbon-cache handler, this architecture allows us to easily scale a Graphite environment to support a huge number of clients, even with a high frequency of data feeding.
A Graphite architecture with load-balancing -- There is a front end Graphite server enabling a carbon-relay handler, which collects metrics and dispatches them to other Graphite servers with carbon-cache daemons enabled to collect, ingest and store those metrics.
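The rule-based dispatching described above can be sketched roughly as follows. This is only an illustration of the idea, not carbon-relay's actual code: the rule patterns, the backend addresses (`cache-a:2003` and so on) and the function names `route` and `dispatch` are all hypothetical, and the sketch assumes plaintext lines of the form `<path> <value> <timestamp>`.

```python
import re

# Hypothetical routing rules: (regex on the metric path, backend addresses).
# Hostnames and patterns are illustrative only, not taken from a real setup.
RULES = [
    (re.compile(r"^web\."), ["cache-a:2003"]),
    (re.compile(r"^db\."), ["cache-b:2003"]),
]
DEFAULT_DESTINATIONS = ["cache-c:2003"]


def route(metric_path):
    """Return the backends for a metric path: the first matching rule wins."""
    for pattern, destinations in RULES:
        if pattern.search(metric_path):
            return destinations
    return DEFAULT_DESTINATIONS


def dispatch(lines):
    """Group incoming plaintext lines by destination without modifying them."""
    outbox = {}
    for line in lines:
        # The metric path is the first space-separated field of the line.
        metric_path = line.split(" ", 1)[0]
        for destination in route(metric_path):
            outbox.setdefault(destination, []).append(line)
    return outbox
```

Note that the relay only inspects the metric path to pick a destination; the lines themselves are forwarded untouched, which is why this step stays cheap compared to ingestion and storage.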
Graphite handles metrics whose finest time resolution is one second. This means that if several samples of a metric are collected with the same timestamp (in seconds), those samples cannot be stored separately in Graphite. To cope with such situations, Graphite provides a special handler (namely carbon-aggregator) that can be configured to collect all samples received within a configured period of time. It aggregates them using functions such as sum and average to generate a single metric that can be stored. It's also common in those cases to put StatsD in front of Graphite to aggregate any data with sub-second resolution.
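As a rough illustration of this aggregation step, the sketch below groups samples into fixed time buckets and reduces each bucket to one value. It is a simplified model under stated assumptions, not the aggregator's real implementation: the function name `aggregate`, the `(path, value, timestamp)` tuple layout and the `period` parameter are hypothetical.

```python
from collections import defaultdict


def aggregate(samples, method="sum", period=1):
    """Collapse samples that fall in the same time bucket into one value.

    samples: iterable of (path, value, timestamp) tuples; timestamps may be
    fractional and are truncated to whole seconds.
    period: bucket width in seconds (hypothetical parameter for this sketch).
    """
    buckets = defaultdict(list)
    for path, value, timestamp in samples:
        # Align the timestamp to the start of its bucket.
        bucket_start = int(timestamp) // period * period
        buckets[(path, bucket_start)].append(value)

    reducers = {
        "sum": sum,
        "avg": lambda values: sum(values) / len(values),
    }
    reduce_values = reducers[method]
    return {key: reduce_values(values) for key, values in buckets.items()}
```

With `method="sum"`, two samples of the same metric that arrive within the same second end up as a single stored value equal to their total; with `method="avg"`, they end up as their mean.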
The capabilities of Graphite's aggregators also cover various other use cases, such as:
Graphite network handlers can be enabled in TCP or UDP modes. When enabled in TCP mode, data exchanges between each client and the server benefit from TCP's reliable, ordered delivery, so metrics are not silently lost in transit. However, each exchange incurs latency overhead from the connection setup and acknowledgements that TCP needs to be reliable. Some use cases require that reliability, but others can tolerate a little data loss. Imagine, for example, a monitoring environment where a lot of data is generated every couple of seconds but the analytics require a coarser resolution. For those cases, Graphite can set up network listeners in UDP mode. This makes data feeding faster, which is particularly useful in large environments.
Each metric sample pushed to a Graphite server must contain the following entries:
Metric names may contain one or more dots (e.g. server1.application1.request_count), and they're commonly referred to as metric paths. This dot-based metric name is used internally by Graphite to organize data storage in a way that optimizes the access and retrieval of metrics. Further, as illustrated in the figure below, it can also be used as a hint for tree-based exploration and visualization of metrics.
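From the client's point of view, the difference between the two modes comes down to the kind of socket used. The following is a minimal sketch using Python's standard `socket` module; the helper names `send_tcp` and `send_udp` are hypothetical, and the sketch assumes a newline-terminated plaintext line of the form `<path> <value> <timestamp>`.

```python
import socket


def send_tcp(host, port, line):
    """Send one plaintext line over a TCP connection."""
    # create_connection performs the TCP handshake before any data is sent.
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


def send_udp(host, port, line):
    """Send one plaintext line as a single UDP datagram (no delivery guarantee)."""
    # No handshake and no acknowledgement: the datagram is simply sent.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

A client would call, for instance, `send_udp("graphite.example.com", 2003, "server1.application1.request_count 42 1700000000\n")`, where the hostname is a placeholder and the port must match a listener that is actually enabled on the server.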
Illustration of a Graphite metrics tree -- Here metric paths include host groups at the first level, hostnames at the second level, and actual metric names at the third and last level.
There are two ways to feed metrics into Graphite: either as individual plaintext metrics or as a binary set of metrics using the pickle protocol. The latter approach unlocks the benefits of pushing data in bulk, while limiting the size of the data transiting across the network thanks to a compact binary serialization. For each of these approaches, the Graphite server needs to enable the appropriate listeners, either on carbon-cache or on carbon-relay, to handle the data. For example, a default installation of Graphite enables carbon-cache listening on port 2003 for single metrics, and on port 2004 for pickle data.
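To make the tree idea concrete, here is a small sketch that turns a list of dot-separated metric paths into a nested structure and prints it as an indented outline. The function names `build_tree` and `render` and the sample paths are made up for illustration; this is not how Graphite itself builds its browser tree.

```python
def build_tree(metric_paths):
    """Build a nested dict from dot-separated metric paths."""
    tree = {}
    for path in metric_paths:
        node = tree
        for part in path.split("."):
            # Descend one level per path component, creating nodes as needed.
            node = node.setdefault(part, {})
    return tree


def render(tree, indent=0):
    """Return the tree as indented text lines, sorted by name at each level."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines


if __name__ == "__main__":
    paths = [
        "web.server1.request_count",
        "web.server1.cpu_load",
        "db.server2.cpu_load",
    ]
    print("\n".join(render(build_tree(paths))))
```

Each dot in a path introduces one more level in the tree, so a consistent naming scheme (group, then host, then metric) directly yields a browsable hierarchy.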
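The two message formats can be sketched as follows. The sketch assumes the plaintext line layout `<path> <value> <timestamp>` and, for pickle, a list of `(path, (timestamp, value))` tuples prefixed with a 4-byte big-endian length header; the helper names `plaintext_message` and `pickle_message` are hypothetical and only build the bytes, leaving the sending to the caller.

```python
import pickle
import struct


def plaintext_message(path, value, timestamp):
    """Build one newline-terminated plaintext line for a single metric."""
    return f"{path} {value} {int(timestamp)}\n".encode("ascii")


def pickle_message(samples):
    """Build one pickle-protocol message for a batch of samples.

    samples: iterable of (path, value, timestamp) tuples.
    The payload is a pickled list of (path, (timestamp, value)) tuples,
    prefixed with its length as a 4-byte big-endian unsigned integer.
    """
    batch = [(path, (int(timestamp), value)) for path, value, timestamp in samples]
    payload = pickle.dumps(batch, protocol=2)
    header = struct.pack("!L", len(payload))
    return header + payload
```

A plaintext message would typically go to the single-metric listener (port 2003 in a default installation) and a pickle message to the pickle listener (port 2004), so many samples can travel in one network write.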
Metrics received through each listener are processed and stored as described in the following section.
Internally Graphite stores metrics in a file-based database (Whisper by default). This database has some essential characteristics:
Once metrics have been ingested, there are various means to visualize them. The first one is to use Graphite-Web, the native visualization tool provided by the Graphite project. However, it may not be flexible enough for operations visualization. That's why users often opt for Grafana, which provides richer visualization features and can handle many Graphite instances simultaneously. Finally, in some rarer cases, users may also opt for custom-made visualization systems that retrieve metrics from Graphite through Graphite-Web or Graphite API.
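For the custom-made case, a tool usually starts by building a request to the render endpoint exposed by Graphite-Web. The sketch below only constructs such a URL; the helper name `render_url`, the base URL and the default time window are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def render_url(base_url, target, since="-1h", fmt="json"):
    """Build a render URL asking for one target over a relative time window."""
    query = urlencode({"target": target, "from": since, "format": fmt})
    return f"{base_url.rstrip('/')}/render?{query}"


if __name__ == "__main__":
    # graphite.example.com is a placeholder hostname.
    print(render_url("http://graphite.example.com",
                     "server1.application1.request_count"))
```

The resulting URL can then be fetched with any HTTP client, and the response parsed by the custom visualization layer.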
To keep reading about Graphite, go to the second article in this series about installing and setting up Graphite.