2. Why Integrate Prometheus with Thanos?
3. Thanos Overview
3.1 Thanos architecture
3.2 Thanos Sidecar
3.3 Thanos Store
3.4 Thanos Query
3.5 Thanos Compact
3.6 Thanos Ruler
4. Thanos Configuration
6. Grafana Dashboards
In this article, we will deploy a clustered Prometheus setup integrated with Thanos. The setup is resilient against node failures, ensures appropriate data archiving, and is scalable: it can bring multiple Kubernetes clusters under the same monitoring umbrella. Finally, we will visualize and monitor all of our data in accessible, beautiful Grafana dashboards.
Prometheus is commonly scaled using a federated setup, with each deployment using a persistent volume for its pod. However, not all data can be aggregated through federation, and you often end up needing a separate tool to manage Prometheus configurations. To address these issues we will use Thanos, which lets you run multiple instances of Prometheus, deduplicate their data, and archive data in long-term storage like GCS or S3.
The components of Thanos are sidecar, store, query, compact, and ruler. Let's take a look at what each one does.
A note on run-time deduplication of HA groups: Prometheus is stateful and does not allow replicating its database. Therefore, it is not easy to increase high availability simply by running multiple Prometheus replicas.
Simple load balancing will not work either. Suppose one replica crashes: the other replica may be up, but queries against the replica that was down will show a gap for the period during which it was unavailable. A second replica doesn't fix this on its own, because either replica could be down at any moment -- for example, during a rolling restart. These scenarios show how load balancing can fail.
Thanos Query pulls the data from both replicas, deduplicates the overlapping series, and fills any gaps transparently for the consumer of the Querier.
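For this deduplication to work, each Prometheus replica must carry a unique replica label in its external_labels, and the querier must be told which label distinguishes replicas. A minimal sketch of the relevant querier flag (the label name `replica` is an assumption -- use whatever your setup puts in external_labels):

```yaml
# Container args for Thanos Query (sketch; the label name is an assumption)
args:
  - query
  - --http-address=0.0.0.0:9090
  # Treat series that differ only in the "replica" label as duplicates
  - --query.replica-label=replica
```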
Thanos Ruler evaluates Prometheus recording and alerting rules, just as Prometheus itself does. The difference is that it runs its queries against Thanos Query, so it evaluates rules over the global, deduplicated view and can communicate with the other Thanos components.
Prerequisites: To follow this tutorial completely, you will need:
1. Working knowledge of Kubernetes and kubectl
2. A running Kubernetes cluster with at least 3 nodes (we will use GKE)
3. An ingress controller and ingress objects (we will use the NGINX Ingress Controller); this is not mandatory, but it is highly recommended to reduce the number of external endpoints
4. Credentials for the Thanos components to access the object store (in this case, a GCS bucket):
a. Create 2 GCS buckets and name them prometheus-long-term and thanos-ruler
b. Create a service account with the Storage Object Admin role
c. Download the key file as JSON credentials and name it thanos-gcs-credentials.json
d. Create a Kubernetes secret from the credentials file
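The secret can equivalently be written as a manifest. A minimal sketch (the secret name is an assumption, and the base64 placeholder stands for the encoded contents of the downloaded key file; in practice you would generate this with `kubectl create secret generic ... --from-file`):

```yaml
# Sketch of the Secret holding the GCS service-account key
apiVersion: v1
kind: Secret
metadata:
  name: thanos-gcs-credentials   # name is an assumption
  namespace: monitoring
type: Opaque
data:
  # base64-encoded contents of thanos-gcs-credentials.json (placeholder)
  thanos-gcs-credentials.json: <base64-encoded-key-file>
```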
Deploying Prometheus Service Accounts, ClusterRole and ClusterRoleBinding: The following manifest creates the monitoring namespace, as well as the service accounts, ClusterRole and ClusterRoleBinding needed by Prometheus.
Deploying Prometheus Configuration configmap: The following config map creates the Prometheus configuration-file template. The Thanos sidecar component reads this template and generates the actual configuration file, which is consumed by the Prometheus container running in the same pod. It is extremely important to add the external_labels section to the config file so that the querier can deduplicate data based on it.
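The part of the template that matters most is the external_labels block. A sketch of that fragment (the label names and values are assumptions; the key requirement is that each replica ends up with a unique value for the replica label, e.g. substituted from the pod name by an init container):

```yaml
# Fragment of the Prometheus configuration template (sketch)
global:
  scrape_interval: 15s
  external_labels:
    cluster: prometheus-ha    # identifies this Prometheus HA group
    replica: $(POD_NAME)      # substituted per pod so each replica is unique
```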
Deploying Prometheus Rules configmap: This will create the alert rules, which will be relayed to Alertmanager for delivery.
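A rules file in this config map follows the standard Prometheus format. A sketch with an illustrative rule (the alert name and expression are assumptions, not the article's actual rules):

```yaml
# Sketch of one rules-file entry (illustrative alert)
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0          # target has been unreachable
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```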
Deploying Prometheus Stateful Set
It is important to understand the following about the above manifest:
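The key detail of the stateful set is the Thanos sidecar container running alongside Prometheus in each pod. A sketch of that container spec, assuming the image tag, mount paths and secret path shown (all assumptions):

```yaml
# Thanos sidecar container from the Prometheus StatefulSet (sketch)
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.8.0              # tag is an assumption
  args:
    - sidecar
    - --tsdb.path=/prometheus                      # same volume Prometheus writes to
    - --prometheus.url=http://127.0.0.1:9090       # Prometheus in the same pod
    - --objstore.config-file=/etc/thanos/gcs.yaml  # points at the prometheus-long-term bucket
  env:
    - name: GOOGLE_APPLICATION_CREDENTIALS         # GCS auth via the mounted secret
      value: /etc/secret/thanos-gcs-credentials.json
```

The objstore config file referenced above is a small YAML document of the form `type: GCS` with `config.bucket: prometheus-long-term`.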
Deploying Prometheus Services
We create a separate service for each Prometheus pod in the stateful set. These are not strictly necessary and are created only for debugging purposes. The purpose of the thanos-store-gateway headless service has been explained above. Next, we will expose the Prometheus services using an ingress object.
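The headless service groups every component that speaks the Store API so the querier can discover them all via DNS. A sketch (the selector label is an assumption; every Store API pod must carry it):

```yaml
# Headless service for Store API discovery (sketch)
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: monitoring
spec:
  clusterIP: None            # headless: DNS returns the pod IPs directly
  ports:
    - name: grpc
      port: 10901            # default Thanos gRPC (Store API) port
      targetPort: grpc
  selector:
    thanos-store-api: "true" # label carried by sidecars, store and ruler pods
```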
Deploying Thanos Query: This is one of the main components of the Thanos deployment. Note the following:
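The querier needs to know where to find the Store API endpoints; pointing it at the headless service via DNS service discovery covers all of them at once. A sketch of the container args (the DNS name assumes the monitoring namespace):

```yaml
# Thanos Query container args (sketch)
args:
  - query
  - --http-address=0.0.0.0:9090
  - --query.replica-label=replica   # must match the external_labels replica label
  # Discover every Store API endpoint behind the headless service via DNS SRV lookups
  - --store=dnssrv+thanos-store-gateway.monitoring.svc.cluster.local
```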
Deploying Thanos Store Gateway: this will create the store component which serves metrics from the object storage to the querier.
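The store component only needs a local cache directory and access to the bucket. A sketch of its args (paths are assumptions):

```yaml
# Thanos Store Gateway container args (sketch)
args:
  - store
  - --data-dir=/data                            # local cache for index data
  - --objstore.config-file=/etc/thanos/gcs.yaml # same bucket the sidecars upload to
  - --grpc-address=0.0.0.0:10901                # serves the Store API to the querier
  - --http-address=0.0.0.0:10902
```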
Deploying Thanos Compact
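Compact downsamples the data in the bucket and applies retention; exactly one compactor should run against a bucket. A sketch of its args (the retention values are assumptions -- tune them to your needs):

```yaml
# Thanos Compact container args (sketch)
args:
  - compact
  - --data-dir=/data
  - --objstore.config-file=/etc/thanos/gcs.yaml
  - --wait                            # run continuously instead of one-shot
  # Retention per resolution (illustrative values)
  - --retention.resolution-raw=30d
  - --retention.resolution-5m=120d
  - --retention.resolution-1h=1y
```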
Deploying Thanos Ruler
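The ruler evaluates its rule files through the querier and ships its own blocks to the thanos-ruler bucket created earlier. A sketch of its args (service names, paths and the label value are assumptions):

```yaml
# Thanos Ruler container args (sketch)
args:
  - rule
  - --data-dir=/data
  - --rule-file=/etc/thanos-ruler/*.rules.yaml    # rules mounted from a config map
  - --query=thanos-querier:9090                   # evaluate expressions via Thanos Query
  - --alertmanagers.url=http://alertmanager:9093  # where fired alerts are sent
  - --objstore.config-file=/etc/thanos/gcs.yaml   # points at the thanos-ruler bucket
  - --label=ruler_cluster="prometheus-ha"         # external label for blocks it produces
```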
If you open an interactive shell in the same namespace as our workloads and check which pods the thanos-store-gateway service resolves to, you will see something like this:
The IPs returned above correspond to our Prometheus pods, thanos-store and thanos-ruler. This can be verified as:
Deploying Alertmanager: This will create our Alertmanager deployment, which will deliver all the alerts generated according to the Prometheus rules.
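Alertmanager itself is configured with a routing tree and receivers. A minimal sketch (the receiver and webhook URL are purely illustrative placeholders, not the article's actual configuration):

```yaml
# Minimal Alertmanager configuration (sketch)
global:
  resolve_timeout: 5m
route:
  receiver: default-receiver   # all alerts fall through to this receiver
  group_by: ['alertname']
receivers:
  - name: default-receiver
    webhook_configs:
      - url: http://example.com/alert-hook   # placeholder endpoint
```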
Deploying Kube State Metrics: The kube-state-metrics deployment is needed to relay some important container metrics that are not natively exposed by the kubelet and hence are not directly available to Prometheus.
Deploying Node-Exporter Daemonset: The node-exporter daemonset runs a node-exporter pod on each node and exposes very important node-level metrics that can be pulled by the Prometheus instances.
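The parts of the daemonset spec that make node-level metrics visible are the host namespaces and host mounts. A sketch of the key fields (image tag and mount details are assumptions):

```yaml
# Key parts of the node-exporter DaemonSet pod spec (sketch)
spec:
  hostNetwork: true      # expose the node's own network stack metrics
  hostPID: true          # needed for node-level process statistics
  containers:
    - name: node-exporter
      image: prom/node-exporter:v1.3.1   # tag is an assumption
      args:
        - --path.procfs=/host/proc       # read the host's procfs, not the container's
      ports:
        - containerPort: 9100            # default node-exporter metrics port
          name: metrics
      volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
  volumes:
    - name: proc
      hostPath:
        path: /proc
```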
Deploying Grafana: This will create our Grafana deployment and service, which will be exposed using our ingress object. We should add Thanos Querier as the datasource for our Grafana deployment. To do so:
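Instead of adding the datasource through the UI, it can also be provisioned from a file. A sketch of a Grafana datasource provisioning file (the service name and port are assumptions matching the querier deployed above; Thanos Query speaks the Prometheus HTTP API, hence `type: prometheus`):

```yaml
# Grafana datasource provisioning file (sketch)
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus
    access: proxy
    url: http://thanos-querier.monitoring.svc.cluster.local:9090
    isDefault: true
```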
Deploying the Ingress Object: This is the final piece of the puzzle. It will expose all our services outside the Kubernetes cluster and let us access them.
Make sure you replace <yourdomain> with your own domain name and create a DNS record for it that points to the ingress controller's service.
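A sketch of the ingress with one rule for the querier (add analogous rules for Grafana and Alertmanager; the ingress name, class and backend port are assumptions):

```yaml
# Ingress sketch exposing Thanos Querier via the NGINX ingress controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
    - host: thanos-querier.<yourdomain>.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: thanos-querier
                port:
                  number: 9090
```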
You should now be able to access Thanos Querier at http://thanos-querier.<yourdomain>.com. It will look something like this:
Make sure deduplication is selected.
If you click on Stores, you will be able to see all the active endpoints discovered by thanos-store-gateway.
Finally, add Thanos Querier as the datasource in Grafana and start creating dashboards.
Kubernetes Cluster Monitoring Dashboard:
Kubernetes Node Monitoring Dashboard:
Integrating Thanos with Prometheus allows you to scale Prometheus horizontally. Since Thanos Querier can pull metrics from other querier instances, you can aggregate metrics across clusters and visualize them in Grafana dashboards. Thanos also lets us archive metric data in an object store, giving our monitoring system virtually unlimited storage, and it serves metrics from the object storage itself. A major operating cost of this setup can be attributed to the object storage (S3 or GCS); it can be reduced by applying appropriate retention policies to the buckets.
Today’s setup requires quite a bit of configuration on your part. The manifests provided above have been tested in a production environment and should make the process easy for you. Feel free to reach out should you have any questions around them. If you decide that you don’t want to do the configuration yourself, we have a hosted Prometheus offering where you can offload it to us and we will happily manage it for you.
This article was written by our guest blogger Vaibhav Thakur. If you liked this article, check out his LinkedIn for more.