Prometheus is an increasingly popular tool in the world of SREs and operational monitoring. Based on ideas from Google’s internal monitoring service (Borgmon), and with native support from services like Docker and Kubernetes, Prometheus is designed for a cloud-based, containerised world. As a result, it’s quite different from existing services like Graphite.
Starting out, it can be tricky to know where to begin with the official Prometheus docs and the wave of recent Prometheus content. This article is a high-level overview of how Prometheus works, its strengths and challenges for monitoring, and where Metricfire can help.
Prometheus is an application (written in Go) which can be run directly on a server, in a Docker container, or as part of a larger platform such as a Kubernetes cluster. You tell it where to find metrics by configuring a list of “scrape jobs”. Each job either specifies the endpoints to scrape or configures service discovery to obtain endpoints automatically. For example, a job to scrape Kubernetes would contain the Kubernetes API server endpoint. The Kubernetes API then returns the endpoints to scrape for current nodes or pods.
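To make the idea of scrape jobs concrete, here is a minimal Python sketch that expands a list of jobs into the URLs a scraper would request. The key names (job_name, static_configs, targets, scheme, metrics_path) mirror fields in Prometheus’s scrape configuration, but the hosts, the job names, and the scrape_urls helper are made up for illustration, and service discovery is left out entirely.

```python
from typing import Dict, List

# Hypothetical in-memory mirror of two scrape jobs. The keys follow the
# field names used in Prometheus's scrape configuration; hosts are made up.
SCRAPE_JOBS: List[Dict] = [
    {
        "job_name": "node",
        "static_configs": [{"targets": ["10.0.0.1:9100", "10.0.0.2:9100"]}],
    },
    {
        "job_name": "api",
        "scheme": "https",
        "metrics_path": "/internal/metrics",
        "static_configs": [{"targets": ["api.example.internal:8443"]}],
    },
]


def scrape_urls(jobs: List[Dict]) -> Dict[str, List[str]]:
    """Expand each job into the list of URLs that would be scraped."""
    urls: Dict[str, List[str]] = {}
    for job in jobs:
        scheme = job.get("scheme", "http")          # Prometheus defaults to http
        path = job.get("metrics_path", "/metrics")  # and to the /metrics path
        urls[job["job_name"]] = [
            f"{scheme}://{target}{path}"
            for group in job.get("static_configs", [])
            for target in group["targets"]
        ]
    return urls


if __name__ == "__main__":
    for job_name, job_urls in scrape_urls(SCRAPE_JOBS).items():
        print(job_name, job_urls)
```

In a real deployment this information lives in the Prometheus configuration file rather than in application code; the sketch only shows how a job definition maps onto concrete endpoints.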
Applications can provide these metrics endpoints to Prometheus using client libraries available for various languages. You can also use separate exporters which gather metrics from specific applications and make them available to Prometheus. Each application or exporter endpoint serves up metrics, along with their labels (tags) and appropriate metadata, whenever Prometheus requests them.
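As a rough illustration of what such an endpoint returns, the sketch below renders one counter in the Prometheus text exposition format using only the Python standard library. The render_counter helper and the metric name are hypothetical, and label values are not escaped here; in practice you would normally let an official client library (prometheus_client for Python) produce this output.

```python
from typing import Dict, Tuple

# A label set is an ordered tuple of (label name, label value) pairs.
LabelSet = Tuple[Tuple[str, str], ...]


def render_counter(name: str, help_text: str, samples: Dict[LabelSet, float]) -> str:
    """Render one counter metric as Prometheus text exposition lines.

    Simplification: label values are written as-is, without escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        # Series without labels are written as the bare metric name.
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "app_requests_total",  # hypothetical metric name
        "Requests handled.",
        {(("method", "get"), ("code", "200")): 3},
    ))
```

An application would return this text from the HTTP handler for its metrics path each time Prometheus scrapes it.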
Official and unofficial exporters exist for dozens of services. A popular one is node_exporter, which collects system metrics for Linux and other Unix servers.
Metrics are stored locally on disk, and by default they’re only retained for 15 days, providing a sliding window of data rather than a long-term storage solution. Prometheus can’t store its metrics in more than one location. However, since metrics aren’t consumed when they are scraped, it’s possible to run more than one Prometheus server against the same services for redundancy. Federation also allows one Prometheus server to scrape another for data, consolidating related or aggregated data into one location.
Remote storage is another option: Prometheus can be configured with remote_write and remote_read endpoints. Prometheus will regularly forward its data to the remote_write endpoint. When queried, it will request data via the remote_read endpoint and add it to the local data. This can produce graphs that display a much longer timeframe of metrics. Metricfire provides these remote storage endpoints for your Prometheus installations.
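The read path described above can be pictured as a merge of two sample lists for the same series. The following Python sketch is only a conceptual model: the merge_samples helper is hypothetical, and the choice to prefer the local value when both sides hold the same timestamp is an assumption made for the example, not a statement about Prometheus internals.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def merge_samples(local: List[Sample], remote: List[Sample]) -> List[Sample]:
    """Combine samples for one series into a single time-ordered list.

    Assumption for this sketch: if both sources have a sample at the same
    timestamp, the local value is kept.
    """
    merged: Dict[int, float] = dict(remote)
    merged.update(dict(local))  # local value wins on overlapping timestamps
    return sorted(merged.items())


if __name__ == "__main__":
    recent = [(300, 3.0), (400, 4.0)]              # still within local retention
    older = [(100, 1.0), (200, 2.0), (300, 2.9)]   # served by remote storage
    print(merge_samples(recent, older))
```

The combined list covers a longer time range than local retention alone, which is what makes the longer-timeframe graphs possible.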
Prometheus also serves a frontend UI to let you search the stored metrics, apply functions and preview graphs. Alongside this, it exposes an HTTP API that tools such as Grafana can use as a datasource.
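To give a feel for the HTTP API, this short Python sketch builds a URL for the range query endpoint, /api/v1/query_range, which takes query, start, end and step parameters. The range_query_url helper, the server address and the timestamps are placeholders chosen for the example; the sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode


def range_query_url(base_url: str, expr: str, start: int, end: int, step: str) -> str:
    """Build a URL for Prometheus's range query endpoint.

    start and end are unix timestamps; step is the resolution, e.g. "60s".
    """
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Placeholder server address and time range.
    print(range_query_url("http://localhost:9090", "up", 1700000000, 1700003600, "60s"))
```

A dashboard tool issues requests of this shape on your behalf whenever it draws a graph from a Prometheus datasource.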
Prometheus supports two kinds of rules: recording rules and alerting rules. Recording rules let you specify a PromQL expression that creates new metrics from incoming data by applying transformations and functions. This can be great if, for example, you have a large number of metrics to view at once and they’re taking a long time to retrieve. Instead, a recording rule can precompute a sum() over them as a new metric, so you’ll only need to retrieve one metric in the future.
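The effect of a sum-based recording rule can be sketched in a few lines of Python. The sum_by helper below is hypothetical and works on an in-memory list rather than on Prometheus storage; it mimics what an expression along the lines of sum by (job) (some_metric) computes at each evaluation, with the result then stored under a new metric name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Series = Tuple[Dict[str, str], float]  # (labels, latest value)


def sum_by(series: List[Series], label: str) -> Dict[str, float]:
    """Add up the latest values of all series that share a label value."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in series:
        # Series missing the label are grouped under the empty string.
        totals[labels.get(label, "")] += value
    return dict(totals)


if __name__ == "__main__":
    # Made-up series: three instances spread across two jobs.
    current = [
        ({"job": "api", "instance": "a"}, 2.0),
        ({"job": "api", "instance": "b"}, 3.0),
        ({"job": "db", "instance": "c"}, 1.5),
    ]
    print(sum_by(current, "job"))
```

Because the aggregation is done ahead of time, a dashboard only has to fetch one precomputed series per job instead of every underlying series.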
Alerting rules instruct Prometheus to look at one or more of the metrics being collected and go into an alerting state if specified criteria are breached. You can check the state of alerts on the alerts page in the Prometheus UI, but Prometheus itself can’t send notifications. Alertmanager is a separate service that adds that ability, and it monitors alerts independently in case the platform a Prometheus server is running on has errors.
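The way an alerting rule moves between states can be modelled with a small Python sketch. Prometheus reports alerts as inactive, pending or firing; the alert_states helper, the threshold and the sample data below are hypothetical, and the model assumes a simple “value above threshold for at least N seconds” rule evaluated once per sample.

```python
from typing import List, Optional, Tuple


def alert_states(
    samples: List[Tuple[int, float]], threshold: float, for_seconds: int
) -> List[str]:
    """Return the alert state after each evaluation of (timestamp, value)."""
    states: List[str] = []
    breached_since: Optional[int] = None  # when the condition first became true
    for timestamp, value in samples:
        if value <= threshold:
            # Condition no longer holds: reset and report inactive.
            breached_since = None
            states.append("inactive")
            continue
        if breached_since is None:
            breached_since = timestamp
        held = timestamp - breached_since
        # The alert only fires once the condition has held long enough.
        states.append("firing" if held >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    evaluations = [(0, 0.2), (60, 0.9), (120, 0.95), (180, 0.97), (240, 0.1)]
    print(alert_states(evaluations, threshold=0.8, for_seconds=120))
```

The pending state is what stops a brief spike from being treated as a real alert.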
At Metricfire we provide a hosted version of Prometheus. This includes long-term, scalable storage for Prometheus, in the form of remote_read and remote_write destinations for your existing Prometheus installations. That means off-disk storage, with redundancy built in, and extended retention of up to 2 years.
It comes with a hosted Grafana service, which lets you configure your Prometheus installations as datasources. Alternatively, you can use the Metricfire datasource to view stored data from all your Prometheus servers together, in one place. Each Prometheus server may generate metrics with the same names and will often consider its own hostname to be localhost:9090. You can use the external_labels option in the global configuration to ensure that similar metrics from different Prometheus servers can be differentiated.
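Here is a small Python sketch of why external labels help. Two servers produce a series with identical labels, and attaching a distinct external label to each makes them distinguishable once they are stored together. The with_external_labels helper and the cluster label values are hypothetical, and keeping a label that already exists on the series rather than overwriting it is an assumption made for the example.

```python
from typing import Dict


def with_external_labels(
    series_labels: Dict[str, str], external_labels: Dict[str, str]
) -> Dict[str, str]:
    """Attach a server's external labels to one series' label set."""
    merged = dict(external_labels)
    merged.update(series_labels)  # assumption: labels already on the series are kept
    return merged


if __name__ == "__main__":
    # The same series as seen by two different Prometheus servers.
    series = {"__name__": "up", "instance": "localhost:9090"}
    print(with_external_labels(series, {"cluster": "eu-west"}))
    print(with_external_labels(series, {"cluster": "us-east"}))
```

Without the extra label the two series would be indistinguishable in shared storage; with it, you can filter or group by cluster in queries.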
Added to that, we’re working on a separate alerting service so that you can have one central location to manage all your Prometheus alerts.
Questions, comments, or anything we’ve missed? Get in touch: firstname.lastname@example.org