The reliability and stability of your services directly depend on how well you understand the state of your infrastructure and services. A great monitoring system will allow you to understand your infrastructure better.
Monitoring the performance of your system allows you to not only fix current problems but also make changes to your infrastructure to avoid them in the future. An effective monitoring system collects data, aggregates it, stores it, visualizes metrics, and also alerts you about any problems in your systems.
Metrics are the basic values used to understand historical trends, compare various factors, identify patterns and anomalies, and find errors and problems.
MetricFire is a SaaS product that collects, stores, visualizes and allows you to analyze and use your data. If you’ve got a lot of data at your company, sign up for the MetricFire demo and we’ll help you out. You can also try the free trial, and see what the product can do!
In this article, we will look at what metrics are and what types of metrics exist. We’ll also dive deep into what alerts are and what monitoring is as a whole.
What is monitoring?
Monitoring is the process of collecting, aggregating, and analyzing data to raise awareness of the characteristics and behavior of your components. It transforms streams of raw data into information that is useful to us. With monitoring metrics, we can respond to certain events, evaluate the operation of our systems, and identify patterns and anomalies.
A monitoring system has many functions, including:
Data collection. A good monitoring system can collect thousands of metrics from different sources. It must do this efficiently to avoid data loss and to be able to scale.
Data storage. While metric values representing the current moment in time are useful, it is almost always more useful to analyze these numbers in relation to past values to understand the context around changes and trends. This means that the monitoring system must be able to store and manage data over certain periods of time, and to sample or aggregate older data.
Data aggregation. Data aggregation is the process of gathering data and presenting it in a summarized format. The data may be gathered from multiple data sources with the intent of combining these data sources into a summary for data analysis. Also, raw data can be aggregated over a certain timeframe to provide statistics such as average, minimum, maximum, sum, and count.
Data visualization. Metrics can be monitored by building tables and analyzing their values. But it is more efficient to recognize trends and understand how different components fit together with different graphs and visualizations. Good monitoring tools provide a wide range of visualization capabilities. This allows you to understand the interaction of a large number of variables or changes in the system by taking a look at the display.
Alerting. The monitoring system can notify users when certain events occur or when monitored metrics reach certain values. This means you will not miss an important event even when you are away from your workplace.
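The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring product implements it: the `aggregate` function, the `(timestamp, value)` sample format, and the 60-second window are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def aggregate(samples, window_seconds):
    """Group (timestamp, value) samples into fixed windows and summarize each one."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align every sample to the start of the window it falls into.
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {
        start: {
            "avg": mean(values),
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }

# Hypothetical raw samples: (seconds since start, metric value).
raw_samples = [(0, 10.0), (20, 30.0), (40, 20.0), (60, 50.0), (90, 70.0)]
summary = aggregate(raw_samples, window_seconds=60)
```

Here the five raw samples collapse into two one-minute summaries, which is the kind of reduction that makes long-term storage and graphing practical.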
One of the monitoring tools that provides all of the above functions and even more is MetricFire. You can use our product with minimal configuration to gain in-depth insight into your environments. If you would like to learn more about it, please book a demo with us, or sign up for the free trial today.
Metrics: definition and types
Metrics are raw data that can be collected from various sources. These sources can be hardware, sensors, applications, websites, etc. The data that these sources produce can be, for example, resource usage, performance, or behavior of users. This can be data provided by the operating system, or it can be higher-level data types tied to a specific functionality or component operation, for example, the number of active users on a site, or page load time.
In general, metrics are collected on a periodic basis, for example, once a second, once a minute, or at any other interval, depending on the characteristics of the indicators and the goals of metrics monitoring.
Many sources produce metrics, and collecting them is easy. You can often do it with little additional work and still get significant benefits from even a simple monitoring system.
Operating systems produce a large amount of readily available data, for example, CPU usage, available disk space, or used memory. The challenge, of course, is how to digest all of that data.
Many web servers, database servers, and other software programs produce their own metrics that can be collected as well. Your own applications can be configured in a way that ensures they produce the metrics you need, and that your monitoring system is collecting them.
Depending on the place that the source of metrics occupies in the hierarchy of your infrastructure, your metrics can fall under one of several metric subclasses. These subclasses include:
Host-based metrics
Host-based metrics can include anything related to assessing the health or performance of an individual computer, excluding the services that it runs. These metrics mainly measure the usage or performance of the operating system or hardware. Monitoring host metrics can give you an idea of what factors can affect the ability of one computer to remain stable or perform assigned tasks. Examples of host metrics are:
- CPU metrics
- Disk metrics
- Memory metrics
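As a rough sketch of what collecting host metrics can look like, the following Python snippet takes one sample using only the standard library. The function name `sample_host_metrics` and the metric names are hypothetical. Memory metrics are left out because the standard library has no portable call for them; a third-party library such as psutil is commonly used for that.

```python
import os
import shutil
import time

def sample_host_metrics(path="/"):
    """Take one sample of a few host-level metrics."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    metrics = {
        "timestamp": time.time(),
        "cpu.count": os.cpu_count(),
        "disk.used_percent": 100.0 * usage.used / usage.total,
        "disk.free_bytes": usage.free,
    }
    # Load average is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            metrics["cpu.load_1m"] = os.getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained on this host
    return metrics

host_sample = sample_host_metrics()
```

A real collector would call a function like this on a fixed schedule and ship each sample to the monitoring system.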
Application metrics
The next class of metrics is application metrics. These measure the health, performance, and load of applications, and indicate whether an application is working correctly and efficiently. They include the following metrics:
- Average response time
- Error rates
- Request rate
- Service failures and restarts
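To make the first three of these concrete, here is a minimal sketch that derives them from a window of request records. The `Request` record and the choice to count only 5xx status codes as errors are assumptions for the example; your application may define errors differently.

```python
from dataclasses import dataclass

@dataclass
class Request:
    timestamp: float    # seconds since the start of the window
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code

def application_metrics(requests, window_seconds):
    """Compute request rate, error rate, and average response time for one window."""
    total = len(requests)
    if total == 0:
        return {"request_rate": 0.0, "error_rate": 0.0, "avg_response_ms": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx means error
    return {
        "request_rate": total / window_seconds,  # requests per second
        "error_rate": errors / total,            # fraction of failed requests
        "avg_response_ms": sum(r.duration_ms for r in requests) / total,
    }

# Hypothetical requests observed in a 10-second window.
window = [
    Request(0.0, 120.0, 200),
    Request(1.5, 80.0, 200),
    Request(3.0, 400.0, 500),
    Request(4.2, 200.0, 200),
]
app_metrics = application_metrics(window, window_seconds=10)
```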
Network performance metrics
Network performance metrics are indicators that show how infrastructure and services are performing as part of short- and long-term assessments of network performance. Analyzing these metrics in real-time allows you to identify potential network problems, prioritize resources, and respond based on impact. Over time, network performance metrics provide long-term insight into the needs of end-users and help build a network that meets future business needs.
Examples of network performance metrics are:
- Packet loss
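Packet loss is usually expressed as the percentage of sent packets that never arrived. A minimal sketch, assuming you already have the sent and received counts from a probe (the function name is hypothetical):

```python
def packet_loss_percent(sent, received):
    """Return the share of packets that were sent but not received, in percent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent

loss = packet_loss_percent(sent=200, received=197)
```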
Server pool metrics
Server pool metrics measure the ability of a collection of servers to work and respond appropriately to requests. Monitoring metrics for the pool as a whole, rather than for every individual machine, allows you to scale and manage resources more efficiently. In addition to the aggregated metrics of the servers in the pool, you can track:
- Number of running instances
- Total number of instances
- Scaling-adjustment metrics
External dependencies metrics
Your services often depend on external services, such as third-party APIs, in order to function properly. Monitoring metrics of external systems can help you identify issues with your vendors that may affect your operations. These include the following metrics:
- Service status and availability
- Error rates
- Service response speed
Events
Events are a separate class of metrics. What sets them apart from the rest of the metrics is that they do not have a clearly defined periodicity. While regular metrics are collected at a fixed interval, for example, once a second or once a minute, events are received by the monitoring system when they happen. Another distinguishing feature of events is that they usually carry detailed information about what happened, while conventional metrics are often just data points in a time series. Events usually record what happened, where it happened, and when it happened. They are often the trigger for alerts.
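One way to picture the difference is to model an event as a small record with "what", "where", and "when" fields plus free-form details. The `Event` class, its field names, and the `should_alert` helper below are illustrative assumptions, not a format required by any tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Event:
    what: str   # what happened, e.g. "service_restart"
    where: str  # where it happened, e.g. a host name
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    details: dict = field(default_factory=dict)  # extra context for responders

def should_alert(event, alerting_kinds):
    """Return True if this kind of event is configured to trigger an alert."""
    return event.what in alerting_kinds

# Events are created when something happens, not on a fixed schedule.
event = Event(what="service_restart", where="web-01", details={"exit_code": 137})
triggers_alert = should_alert(event, alerting_kinds={"service_restart", "deploy_failed"})
```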
How to collect the right data
In addition to understanding what types of metrics you can collect and analyze, it is also worth talking about what criteria the metrics must meet to gain reliable and useful information from them.
First of all, the metrics that you collect should be clear to you. When you receive a metric, it should be clear what produces it and what behavior it describes. The correct interpretation of metrics is an important aspect of effective monitoring. Don't complicate the process of defining and collecting metrics. The simpler and more standard the data, the easier it is to interpret.
Use suitable collection frequency and aggregation. You must determine the appropriate data slices for each source depending on the result you want to get from monitoring it. For example, critical peak server loads can be hidden in averaged data over a fairly long period of time. Conversely, excessive granularity may not be necessary for some tasks since it could create an unnecessary load on performance and data storage.
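The point about averages hiding peaks is easy to demonstrate. In this sketch the CPU readings are made-up values sampled every five seconds for one minute, with a single short spike:

```python
from statistics import mean

# Hypothetical CPU utilization (%) sampled every 5 seconds for one minute.
cpu_percent = [20, 22, 19, 21, 98, 20, 23, 21, 20, 22, 19, 21]

def summarize(samples):
    """Return the average and the peak of a series of samples."""
    return {"avg": mean(samples), "max": max(samples)}

minute_summary = summarize(cpu_percent)
```

The one-minute average comes out at roughly 27%, which looks healthy, while the peak of 98% shows the server was briefly saturated. Storing the maximum alongside the average is one way to keep such spikes visible at coarser granularity.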
Use different slices of your metrics. When you have many homogeneous data sources at your disposal, for example, hosts, it makes sense to combine them into groups according to certain characteristics. Then you can get a more informative picture of what is happening in your environment.
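As an illustration of slicing, the sketch below groups the same set of hosts first by role and then by region. The host records, tag names, and the `average_by` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hosts, each tagged with a role and a region.
hosts = [
    {"name": "web-01", "role": "web", "region": "eu", "cpu": 40.0},
    {"name": "web-02", "role": "web", "region": "us", "cpu": 60.0},
    {"name": "db-01", "role": "db", "region": "eu", "cpu": 80.0},
    {"name": "db-02", "role": "db", "region": "us", "cpu": 70.0},
]

def average_by(hosts, tag, metric):
    """Average a metric across hosts that share the same value for a tag."""
    groups = defaultdict(list)
    for host in hosts:
        groups[host[tag]].append(host[metric])
    return {key: mean(values) for key, values in groups.items()}

by_role = average_by(hosts, "role", "cpu")
by_region = average_by(hosts, "region", "cpu")
```

The same data tells two different stories: by role, the database hosts are clearly busier than the web hosts, while by region the load looks almost even.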
Determine the optimal storage time for your data. For some metrics, it makes sense to store them long enough to understand their long-term behavior and to identify patterns, such as seasonality, and anomalies more accurately. To conserve disk space, your monitoring system may aggregate certain data over time. You must understand this and take it into account when building your monitoring strategy.
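The following sketch shows one possible way a system might compact older data: samples newer than a retention cutoff stay at full resolution, and older ones are replaced by per-bucket averages. The function name, parameters, and the choice of averaging are assumptions for the example; real systems differ in how they roll up data.

```python
from collections import defaultdict
from statistics import mean

def downsample_old(samples, now, raw_retention, bucket):
    """Keep recent (timestamp, value) samples raw; average older ones per bucket."""
    cutoff = now - raw_retention
    recent = [(t, v) for t, v in samples if t >= cutoff]
    buckets = defaultdict(list)
    for t, v in samples:
        if t < cutoff:
            buckets[int(t // bucket) * bucket].append(v)
    rolled_up = [(start, mean(values)) for start, values in sorted(buckets.items())]
    return rolled_up + recent

# Hypothetical samples taken every 30 seconds.
history = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0), (150, 11.0)]
compacted = downsample_old(history, now=180, raw_retention=60, bucket=60)
```

After compaction, the four oldest samples are represented by two averaged points, so fine detail in that period can no longer be recovered. This is the trade-off to keep in mind when planning retention.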
What is alerting?
Alerting is the part of a monitoring system that takes action based on changes in metric values. The main task of alerts is to keep users informed of changes and important events even when they are not physically present to watch the metrics on their dashboard.
Alert rules have two components: a condition or threshold based on metrics, and an action to take when the values are out of range. These actions can be notifying responsible individuals or taking automatic actions as a response to certain events.
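The two components of an alert rule map naturally onto two functions. This is a minimal sketch under the assumption that a rule is evaluated against one metric value at a time; the `AlertRule` class and the 90% disk threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlertRule:
    name: str
    condition: Callable[[float], bool]    # returns True when the value is out of range
    action: Callable[[str, float], None]  # what to do when the condition holds

    def evaluate(self, value: float) -> bool:
        """Run the action if the condition is met; report whether the rule fired."""
        if self.condition(value):
            self.action(self.name, value)
            return True
        return False

# For the example, the "action" just records a message instead of notifying anyone.
sent = []
rule = AlertRule(
    name="disk_usage_high",
    condition=lambda used_percent: used_percent > 90.0,
    action=lambda name, value: sent.append(f"{name}: {value:.1f}%"),
)
rule.evaluate(72.0)  # within range, nothing happens
rule.evaluate(95.5)  # out of range, the action runs
```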
The most common alerting action is sending a notification to humans, either to one person or to a group. Depending on how far a value exceeds its threshold, notifications can be sent to different recipients. Good monitoring systems allow you to send detailed information about the problem with the notification, for example, a screenshot of the metric graph showing where the threshold was reached, so that the person in charge can quickly understand what happened and what action needs to be taken. The person responding to the alert can then use the monitoring system and associated tools such as log files to investigate the cause of the problem and implement a mitigation strategy.
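Routing notifications by severity can be as simple as comparing the value against two thresholds. In this sketch the recipient names and threshold values are placeholders, not part of any real notification system.

```python
def route_alert(value, warning, critical):
    """Pick recipients based on how far the value exceeds its thresholds."""
    if value >= critical:
        # Severe breach: page the on-call engineer and inform the team.
        return ["on-call-pager", "team-channel"]
    if value >= warning:
        # Mild breach: a message to the team is enough.
        return ["team-channel"]
    return []

recipients_ok = route_alert(55.0, warning=80.0, critical=95.0)
recipients_warn = route_alert(85.0, warning=80.0, critical=95.0)
recipients_crit = route_alert(97.0, warning=80.0, critical=95.0)
```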
Some automated responses (actions) can also be triggered by threshold violations. They are useful in situations where an action is safe to perform without human intervention. Examples include automatically restarting a service that has encountered a problem, or automatically scaling out an application whose CPU utilization is running high.
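As an illustration of an automated scaling decision, the sketch below uses a simple proportional rule: scale the instance count by the ratio of observed to target CPU utilization, then clamp it to safe bounds. The rule, the function name, and the limits are assumptions for the example; actual autoscalers apply their own policies and safeguards.

```python
import math

def desired_instances(current, cpu_percent, target_percent, minimum=1, maximum=10):
    """Suggest an instance count that would bring average CPU near the target."""
    if current <= 0 or target_percent <= 0:
        raise ValueError("current and target_percent must be positive")
    proposed = math.ceil(current * cpu_percent / target_percent)
    # Clamp so an automated action can never scale to zero or without limit.
    return max(minimum, min(maximum, proposed))

scale_out = desired_instances(current=4, cpu_percent=90.0, target_percent=60.0)
scale_in = desired_instances(current=4, cpu_percent=30.0, target_percent=60.0)
capped = desired_instances(current=4, cpu_percent=200.0, target_percent=60.0)
```

The explicit bounds reflect the condition stated above: an action should only be automated when it is safe to run without a human checking it first.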
Implementing an effective monitoring system is something that the majority of businesses have to do today. Collecting and monitoring metrics gives you the ability to keep an eye on what is happening in your systems, what resources require attention, and what is causing slowdowns or shutdowns. While designing and implementing your monitoring system can be challenging, it is an investment that can help your team prioritize their work, delegate oversight to an automated system, and understand the impact of your infrastructure and software on your stability and performance.
At MetricFire we provide a hosted version of Graphite, which includes two years of data storage, Grafana for data visualization, and much more. You can use our product with minimal configuration to gain in-depth insight into your environments. If you would like to learn more about it, please book a demo with us, or sign up for the free trial today.