Nassim Taleb, Averages and Systems

MONITORING

Nov 23, 2022 ∙ 4 min read

MetricFire Blogger

Table of Contents

Melting old ladies
Bell curves, averages, and systems
Percentiles


This article was originally published on March 24, 2015, by Charlie von Metzradt, co-founder of Hosted Graphite, for the Hosted Graphite blog. Since then, Hosted Graphite has become MetricFire, but our goal has stayed the same: monitoring should be accessible. For more information and for updates on new features, book a time with our team!

Nassim Taleb is famous for a few things. As the author of the 2007 blockbuster “The Black Swan”, he examined the impact of rare, high-impact events and how they tend to be explained away in retrospect with pithy or simplistic narratives. He’s also renowned for his reluctance to give interviews, so it’s a pleasure to hear him expound on his theories at length in this recording with James Altucher.

https://nassimtaleb.org/2014/09/podcast-nassim-taleb-james-altucher-show/#.Ws3gcNPwai4

Melting old ladies

Of possible interest to anyone doing monitoring is his simple explanation of how averages can hide what is really happening in the data, particularly where humans are concerned. Take the example of an old woman who likes the temperature to be a comfortable 70 degrees Fahrenheit. If the average is 70 degrees, that sounds reasonable, doesn't it? Let’s examine a possible set of data: if it’s 0 degrees half the time and 140 degrees the other half, we’ve achieved an average of 70 degrees, but our little old lady has unfortunately perished by either freezing or being burned to a crisp. Whoops!
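As a quick illustration of the old lady's problem, here is a minimal Python sketch. The 24 hourly readings and the 60 to 80 degree comfort band are assumptions made up for the example, not figures from Taleb.

```python
from statistics import mean

# Hypothetical data: 0 degrees F half the time, 140 degrees F the other half,
# expressed as 24 hourly readings.
readings_f = [0, 140] * 12

# Assumed comfort band for illustration only.
COMFORT_LOW_F, COMFORT_HIGH_F = 60, 80

average = mean(readings_f)
comfortable = [t for t in readings_f if COMFORT_LOW_F <= t <= COMFORT_HIGH_F]

print(f"average: {average} degrees F")
print(f"min / max: {min(readings_f)} / {max(readings_f)} degrees F")
print(f"readings inside the comfort band: {len(comfortable)} of {len(readings_f)}")
```

The average lands exactly on the target while not a single reading falls anywhere near it, which is why the minimum and maximum are worth looking at alongside the mean.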

[Figure: a bell curve representing normally distributed data]

The bell curve, or Gaussian curve, represents normally distributed data. If your data fits a normal distribution you can do useful things like anomaly detection*, as the normal distribution gives you an idea of how frequently you should see certain values. If we imagine that this somehow represents temperature variation in a different version of our old lady example, we would expect the 0 degree or 140 degree events to fall at either end, in the “very rare” event category. Our old lady would probably be pretty happy within a standard deviation of the mean: the fat bit in the middle of the bell curve, which would roughly cover a range of ~50 to 100 degrees. (She might get a little uncomfortable, but is unlikely to melt like the Nazi in Raiders of the Lost Ark.)
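To make the "how frequently should I see this value" idea concrete, here is a small sketch using Python's standard-library `NormalDist`. The mean of 70 and standard deviation of 20 are assumed purely for illustration, and the three-sigma threshold is one common convention rather than a rule; `z_score` and `is_anomaly` are hypothetical helper names.

```python
from statistics import NormalDist

# Assumed parameters for illustration only: mean 70 degrees F, std dev 20 degrees F.
temps = NormalDist(mu=70, sigma=20)

# Share of readings expected within one standard deviation of the mean (about 68%).
within_one_sigma = temps.cdf(90) - temps.cdf(50)

def z_score(value, dist):
    """Distance of a value from the mean, measured in standard deviations."""
    return (value - dist.mean) / dist.stdev

def is_anomaly(value, dist, threshold=3.0):
    """Flag values at least `threshold` standard deviations from the mean."""
    return abs(z_score(value, dist)) >= threshold

print(f"within one standard deviation: {within_one_sigma:.1%}")
for reading in (0, 85, 140):
    print(reading, round(z_score(reading, temps), 2), is_anomaly(reading, temps))
```

Under these assumed parameters, 0 and 140 degrees both sit 3.5 standard deviations out and get flagged, while 85 degrees is well inside the fat middle of the curve.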

Bell curves, averages, and systems

For a lot of systems applications, though, the bell curve turns out to be a poor model to follow. The data may not be normally distributed at all, but follow something closer to a long-tailed distribution, and the average of such a data set may not represent anything particularly useful. Take the time it takes to process a request on your web server: your average may fall within your SLA criteria even though large numbers of outliers are pulling the data one way or the other.
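A minimal sketch of that situation, using made-up request timings and a hypothetical 1000ms SLA threshold (neither comes from real traffic):

```python
from statistics import mean, median

SLA_MS = 1000  # hypothetical SLA threshold for illustration

# Made-up long-tailed sample: most requests are fast, a few are very slow.
timings_ms = [200] * 95 + [10_000] * 5

average_ms = mean(timings_ms)
median_ms = median(timings_ms)
over_sla = sum(1 for t in timings_ms if t > SLA_MS)

print(f"average: {average_ms}ms (within SLA: {average_ms <= SLA_MS})")
print(f"median:  {median_ms}ms")
print(f"requests over SLA: {over_sla} of {len(timings_ms)}")
```

The average passes the SLA check, yet it describes neither the typical request (200ms) nor the five requests that took ten seconds each.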

Percentiles

In situations like this, percentile data is useful. Let’s take our web request timing example again. Viewing data at the 50th percentile gives you a view of the median user experience: if the 50th percentile (median) response time is 750ms, that means 50% of your transactions complete in 750ms or less, which sounds alright depending on what your site is doing. The 90th percentile view of the same data may be around 1250ms, which means that 90% of your requests complete in that time or less, with the remaining 10% completing slower. At the 98th percentile and above you may see requests that take 1750ms or longer, and these might represent a lesser-used use case such as a reporting function that takes a lot of time.
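For readers who want to see how such figures are derived, here is a sketch of a nearest-rank percentile calculation. The `percentile` helper and the sample timings are hypothetical, chosen so the results line up with the numbers above; monitoring tools may use interpolating definitions that give slightly different values.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Made-up sample of 100 request timings in milliseconds.
timings_ms = (
    [500] * 30 + [750] * 20 + [1000] * 20
    + [1250] * 20 + [1500] * 7 + [1750] * 3
)

for p in (50, 90, 98):
    print(f"p{p}: {percentile(timings_ms, p)}ms")
```

Nearest-rank is used here because it always returns a value that was actually observed, which keeps the example easy to check by hand.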

Using a percentile view of your data, you can see what the typical experience is for your users. A degradation in the 50th percentile from 750ms to 1000ms means that the median request now takes 250ms longer, roughly a 33% increase in response time, and you probably need to start looking into it. This is especially true if you're managing complex systems where aggregating data from multiple sources and databases (something Integrate.io helps teams do through data pipelines) becomes critical to understanding your full system performance.
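When quoting how much slower a percentile has become, it helps to be explicit about what is being measured. The sketch below, with hypothetical helper names, computes the change for a median moving from 750ms to 1000ms both as response time and as speed (requests completed per unit of time).

```python
def latency_change_pct(before_ms, after_ms):
    """Relative change in response time, as a percentage."""
    return (after_ms - before_ms) / before_ms * 100

def speed_change_pct(before_ms, after_ms):
    """Relative change in speed (1 / response time), as a percentage."""
    return (before_ms / after_ms - 1) * 100

latency_change = latency_change_pct(750, 1000)
speed_change = speed_change_pct(750, 1000)

print(f"response time change: {latency_change:+.1f}%")
print(f"speed change: {speed_change:+.1f}%")
```

The same shift is about a third more time per request, or a quarter less speed, so alerts and reports should say which one they mean.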

* = OK, there are methods for anomaly detection on non-Gaussian data too, but a normal distribution makes it a lot easier.
