Prometheus is becoming a popular tool for monitoring Python applications, even though it was originally designed for single-process, multi-threaded applications rather than multi-process ones.
Prometheus was developed at SoundCloud and was inspired by Google’s Borgmon. In its original environment, Borgmon relies on straightforward service discovery, where Borg can easily find all jobs running on a cluster.
Prometheus inherits these assumptions, so it treats each target as a single multi-threaded process. Prometheus’s client libraries also assume that metrics come from various libraries and subsystems, in multiple threads of execution, running in a shared address space.
To get started, sign up for the MetricFire free trial, where you can start using Prometheus in our platform and try out what you learn from this article.
Problems with integrating Prometheus into Python WSGI applications
We start to see the breakdown when we run a Python app under a WSGI application server. With WSGI applications, requests are spread across many different workers rather than handled by a single process. Each of these workers runs as its own process, so the result is a multi-process application.
When this kind of application exports to Prometheus, any one of the workers may answer a given scrape request, and each worker responds with only the value that it knows. This means that Prometheus could scrape a counter metric and see 100, then see 200 on the very next scrape. Each worker is exporting its own value, so the counter reflects whichever worker happened to answer rather than the whole job.
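To make the problem concrete, here is a minimal simulation in plain Python. The `Worker` class and the round-robin choice of which worker answers a scrape are assumptions made for illustration; a real WSGI server and load balancer will behave less predictably.

```python
import itertools


class Worker:
    """Stand-in for one WSGI worker process with its own private counter."""

    def __init__(self, name):
        self.name = name
        self.request_count = 0

    def handle_request(self):
        self.request_count += 1

    def scrape(self):
        # A worker can only report what it has counted itself.
        return self.request_count


workers = [Worker("worker-1"), Worker("worker-2")]
for _ in range(100):
    workers[0].handle_request()
for _ in range(200):
    workers[1].handle_request()

# Whoever sits in front of the workers decides which one answers a scrape;
# round-robin is assumed here so the result is repeatable.
next_worker = itertools.cycle(workers)
scrapes = [next(next_worker).scrape() for _ in range(4)]
true_total = sum(w.request_count for w in workers)

print(scrapes)     # [100, 200, 100, 200]
print(true_total)  # 300
```

No single scrape ever returns 300, the number of requests the job actually served.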
To handle these issues, we have four solutions listed below.
Sum all of the worker nodes
If you give each worker's metrics a unique label, then you can query all of the workers at once, and effectively query the whole job. For example, if you give each worker a label such as worker_name, you can write a query such as:
sum by (instance, http_status) (sum without (worker_name) (rate(request_count[5m])))
This aggregates all of the worker nodes for one job at once. The drawback is an explosion in the number of time series you have to store, because every metric is duplicated once per worker.
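If it helps to see what the `sum without (worker_name)` step does, the sketch below reproduces it in plain Python. The sample values, the instance name, and the `sum_without` helper are all hypothetical; Prometheus performs this aggregation itself at query time.

```python
from collections import defaultdict

# Hypothetical per-second request rates, one sample per worker and status.
samples = [
    ({"instance": "app-1:8000", "worker_name": "w1", "http_status": "200"}, 4.0),
    ({"instance": "app-1:8000", "worker_name": "w2", "http_status": "200"}, 6.0),
    ({"instance": "app-1:8000", "worker_name": "w1", "http_status": "500"}, 0.5),
    ({"instance": "app-1:8000", "worker_name": "w2", "http_status": "500"}, 1.5),
]


def sum_without(samples, dropped_label):
    """Sum samples whose label sets match once dropped_label is removed."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != dropped_label))
        totals[key] += value
    return dict(totals)


per_job = sum_without(samples, "worker_name")
for key, value in sorted(per_job.items()):
    print(dict(key), value)
```

Four stored series collapse into two results, which also shows where the cost comes from: the series count grows with the number of workers even though the query output does not.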
Multi-process mode
This method is our favorite here at MetricFire. We actually use this method to monitor our own application with Prometheus. This method entails using the Prometheus Python Client, which handles multi-process apps on the Gunicorn application server.
You can check our full tutorial on how MetricFire uses the Python Client to monitor our own service. In that tutorial we walk through each step of monitoring a Python web app with Prometheus.
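The idea behind the client's multi-process mode is that every worker writes its values to its own file in a shared directory (the client reads the directory from the `PROMETHEUS_MULTIPROC_DIR` environment variable), and whichever worker receives the scrape merges all of the files. The sketch below imitates that idea with JSON files and the standard library only. It is not the client library's code, and the function names and worker pids are made up for illustration.

```python
import json
import os
import tempfile


def record_request(shared_dir, pid, http_status):
    """Increment this worker's counter in its own file (one file per pid)."""
    path = os.path.join(shared_dir, f"counter_{pid}.json")
    counts = {}
    if os.path.exists(path):
        with open(path) as f:
            counts = json.load(f)
    counts[http_status] = counts.get(http_status, 0) + 1
    with open(path, "w") as f:
        json.dump(counts, f)


def collect(shared_dir):
    """Answer a scrape by summing every worker's file, whoever receives it."""
    totals = {}
    for filename in sorted(os.listdir(shared_dir)):
        with open(os.path.join(shared_dir, filename)) as f:
            for status, count in json.load(f).items():
                totals[status] = totals.get(status, 0) + count
    return totals


shared_dir = tempfile.mkdtemp()
# Simulated worker pids; under a real server each worker would use os.getpid().
for _ in range(100):
    record_request(shared_dir, 101, "200")
for _ in range(200):
    record_request(shared_dir, 102, "200")
record_request(shared_dir, 102, "500")

print(collect(shared_dir))  # {'200': 300, '500': 1}
```

Because every scrape reads all of the files, the answer no longer depends on which worker happens to respond.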
The Django Prometheus Client
This method designates each worker as a completely separate target. The Django Prometheus client sets it up so that each worker is listening for Prometheus’s scrape requests through its own port.
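As a rough picture of the one-port-per-worker idea, the sketch below has each worker claim the first free port in a range and then lists the resulting scrape targets. The `claim_port` helper, the port range, and the loopback address are assumptions for illustration, not the Django Prometheus client's own code.

```python
import socket


def claim_port(port_range, host="127.0.0.1"):
    """Bind the first free port in port_range and return (socket, port)."""
    for port in port_range:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # Port already taken, most likely by another worker.
            continue
        sock.listen()
        return sock, port
    raise RuntimeError("no free port left in range")


port_range = range(8001, 8051)
worker_a, port_a = claim_port(port_range)
worker_b, port_b = claim_port(port_range)

# Each port becomes its own scrape target on the Prometheus side.
targets = [f"127.0.0.1:{port}" for port in (port_a, port_b)]
print(targets)

worker_a.close()
worker_b.close()
```

The trade-off is the same as with per-worker labels: Prometheus sees every worker as a separate target, so queries still need to sum across them to describe the whole job.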
Export metrics through StatsD
This method rejects the idea that Prometheus must scrape our application directly. Instead, export metrics from your app to a locally running StatsD instance, and set up Prometheus to scrape the StatsD instance instead of the application. This gives you more control over what’s counted by each counter.
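On the application side this can be as small as sending one UDP packet per event. The sketch below uses the plain StatsD counter format (`name:value|c`) and the conventional StatsD port 8125; the helper names are ours, and a throwaway local receiver stands in for the StatsD instance so the example runs on its own.

```python
import socket


def format_counter(name, value=1):
    """Encode a counter increment in the plain StatsD line format."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, value=1, address=("127.0.0.1", 8125)):
    """Fire-and-forget UDP send; works the same from every worker process."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), address)


# Stand-in receiver so the sketch runs without a real StatsD instance.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # Port 0 lets the OS pick a free port.
receiver.settimeout(2)

send_counter("request_count", address=receiver.getsockname())
packet, _ = receiver.recvfrom(1024)
receiver.close()

print(packet)  # b'request_count:1|c'
```

Since every worker sends increments to the same place, the totals are accumulated outside the application and the per-worker split disappears before Prometheus ever scrapes.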
Although multi-process applications cannot be natively monitored with Prometheus, these four solutions are great workarounds. They allow us to use Prometheus as the main monitoring tool throughout the corporation, for both IT resources and APM.
For more information about how Prometheus can be used to monitor Python apps, check out our articles on Python Based Exporters, and our series on Developing and Deploying a Python API with Kubernetes.
To try out Prometheus and apply what you've learned from this article, check out our free trial. You can use Prometheus directly in our platform and monitor metrics without any setup. You can also talk to us directly by booking a demo. We’re always happy to talk with you about your company’s monitoring needs.