Amazon CloudWatch is one of the earliest of the more than 100 services that Amazon Web Services (AWS) provides. It was announced on May 17th, 2009, as the 7th service released, after S3, SQS, SimpleDB, EBS, EC2, and EMR.
AWS CloudWatch is a suite of tools that covers a wide range of operational needs: collecting logs and metrics; monitoring; visualization and alerting; and automated action in response to operational health changes. CloudWatch is an excellent tool because it allows you to go beyond monitoring into observability.
For some time now, observability has had an essential place in the cloud computing and modern software engineering ecosystem. The word is no longer just a simple buzzword, and Amazon has adapted to this by adding the tools and means to do proactive monitoring.
Indeed, you can have a lot of monitoring in place and still not have an observable system.
If you are new to observability, think of it as a measure of how well internal states of a system can be inferred from knowledge of its external outputs. In simpler words, monitoring is about the symptoms of a problem, and observability is about the (possible) root causes of it.
You can also think of observability as "white-box monitoring". In this type of monitoring, logs, metrics, and traces are the pillars of observability.
In this blog post, we will discover the basics of CloudWatch, see some of its use cases, and dive into the core concepts. To do this, we will focus on the four main features that this service offers: alarms, events, logs, and metrics.
You can configure alarms to initiate an action when a condition is satisfied, like reaching a pre-configured threshold. To better understand this, let's create an Elastic Compute Cloud (EC2) machine. We don't need a production-sized instance; a nano or micro instance is enough.
When creating this instance, make sure to enable CloudWatch detailed monitoring, which makes data available in one-minute periods for an additional cost. Basic monitoring is free, but it delivers data to CloudWatch in five-minute periods.
After creating the EC2 instance, you can set up an alarm for it from the EC2 console. First, click ‘Edit’, and then ‘Add alarm’.
At this step, there is an important concept to understand: alarms are managed by AWS CloudWatch, but in most use cases, you will be notified by something like email when your alarm is active. This feature is managed by AWS Simple Notification Service, or SNS. SNS is a low-cost messaging and notification service that lets you decouple publishers from subscribers. In our case, SNS will be used to listen to CloudWatch alarms and send an email when an alarm is active.
We can create an SNS topic from the EC2 alarm configuration window, or use the SNS administration dashboard. Say we want CloudWatch to send us an email when the average CPU utilization of our EC2 instance reaches 50% or more for at least a period of one minute. This can be easily configured from the EC2 console. You can also set up an action that will be triggered once the same conditions are fulfilled. Don't forget to confirm your email subscription to the topic.
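To make the alarm condition concrete, here is a minimal sketch in plain Python of how such a rule could be evaluated. It is a simplified model, not CloudWatch's actual implementation: the `Alarm` class and its fields are hypothetical names, and the real service also handles missing data and "M out of N" datapoint rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alarm:
    threshold: float         # e.g. 50.0 for 50% average CPU utilization
    evaluation_periods: int  # consecutive periods that must breach the threshold

    def state(self, datapoints: List[float]) -> str:
        """Return the alarm state for per-period averages, oldest first."""
        if len(datapoints) < self.evaluation_periods:
            return "INSUFFICIENT_DATA"
        recent = datapoints[-self.evaluation_periods:]
        # "Greater than or equal to" comparison, as in our CPU example.
        breaching = all(value >= self.threshold for value in recent)
        return "ALARM" if breaching else "OK"


# One one-minute period at or above 50% is enough to fire the alarm.
cpu_alarm = Alarm(threshold=50.0, evaluation_periods=1)
states = [cpu_alarm.state(series) for series in ([], [12.5], [12.5, 71.7])]
print(states)  # ['INSUFFICIENT_DATA', 'OK', 'ALARM']
```

Raising `evaluation_periods` is the usual way to avoid being notified about short spikes.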
To test this out, we are going to stress our EC2 CPU, and see the resulting alarm and notification from CloudWatch. The tool used here is called stress. To spawn 500 workers spinning on sqrt() with a timeout of 600 seconds, we can use this command:
stress --cpu 500 --timeout 600
After a few minutes, you will see that the alarm is active, and if you confirmed your subscription to the topic we created above, you will also receive an email like this one:
You are receiving this email because your Amazon CloudWatch Alarm "awsec2-i-073cf4770bed5d313-CPU-Utilization" in the EU (Paris) region has entered the ALARM state, because "Threshold Crossed: 1 datapoint [71.6666666666667 (14/10/19 14:00:00)] was greater than or equal to the threshold (50.0)." at "Monday 14 October, 2019 14:01:56 UTC".
We have seen how to send an email notification when a condition is fulfilled, but you can also configure other actions, such as triggering auto scaling.
If you need a near real-time stream that describes changes in your AWS resources, Events is what you are looking for. With Events, CloudWatch becomes aware of your operational changes as they happen and can respond by taking action.
You can create an alarm on any of the AWS resources you use, and you will receive a notification once a threshold is reached. Events, by contrast, are continuously recorded over time. This continuity is the main difference between events and alarms.
CloudWatch Events is a stream of system events that provides you with a complete picture of your systems. Meanwhile, alarms are generally used when you already know which metrics you want to measure.
Suppose you are running a streaming service like Netflix with millions of viewers worldwide. You will never get a complete view of your system loads and operational changes as they occur if you only use alarms.
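There are three concepts you should be familiar with when setting up a CloudWatch Event stream: events (which indicate a change in your AWS environment), rules (which match incoming events and route them), and targets (which process the events).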
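CloudWatch Events supports many AWS services as targets, including Lambda functions, SNS topics, SQS queues, Kinesis streams, ECS tasks, and Step Functions state machines.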
You can even configure the default event bus of another AWS account as a target.
In the first part of this blog post, we created an EC2 machine. We are now going to use it to demonstrate how events can be streamed continuously while containing data about the instance state. We will also show how events can call a target as soon as a change occurs.
Go to the AWS CloudWatch console, click on ‘Events’, and create a new rule.
We can configure an event pattern to match any state change of an EC2 instance and restrict it to a single instance (using its ID). As the target, we can set up the same SNS topic to which we already subscribed by email. With this rule in place, we will receive an email whenever the instance state changes to "stopped", "terminated", "stopping", or "shutting-down".
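As a rough illustration, the rule described above could use an event pattern like the one built below. The instance ID is a placeholder, and the `matches` function is a deliberately simplified stand-in for the service's matching logic (exact values only), written here just to show how a pattern selects events.

```python
import json

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder, replace with your instance ID

# Event pattern for EC2 state changes of one instance, limited to four states.
event_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {
        "instance-id": [INSTANCE_ID],
        "state": ["stopped", "terminated", "stopping", "shutting-down"],
    },
}


def matches(pattern, event):
    """Simplified matcher: each pattern key must be present in the event,
    and the event's value must be one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


stopping_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": INSTANCE_ID, "state": "stopping"},
}
running_event = {**stopping_event,
                 "detail": {"instance-id": INSTANCE_ID, "state": "running"}}

print(json.dumps(event_pattern, indent=2))
print(matches(event_pattern, stopping_event))  # True
print(matches(event_pattern, running_event))   # False
```

The JSON printed by the first `print` call is the kind of document you would paste into the rule's event pattern editor.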
Let's trigger the event by stopping the instance:
aws ec2 stop-instances --instance-ids <instance_id>
When the instance stops, you should receive two emails: one with the state "stopping", and another when the instance is completely stopped.
You can implement several use cases using the different configurations that CloudWatch offers. For instance, you can add an AWS Lambda function as a target to process, transform, and analyze the data sent when a change occurs. This allows you to specify and trigger custom actions. You can also connect your SNS topic to a Slack channel and publish alarms to the same topic.
Just like metrics, logs are critical if you want to have more control and observability of your systems. You can use CloudWatch to monitor, store, access, query, analyze, and visualize your logs. CloudWatch centralizes the logs from all of the resources and AWS services you use in a scalable service. You can, for example, store your web application access logs and adjust their retention period to 10 years. You can also store your system logs, which is ideal when you don't want to retain logs on the host machine, or when your infrastructure is immutable.
"Treat logs as event streams." This is one of the principles of the twelve-factor app developed by Heroku:
Logs are the stream of aggregated, time-ordered events collected from the output streams of all running processes and backing services. Logs in their raw form are typically a text format with one event per line (though backtraces from exceptions may span multiple lines). Logs have no fixed beginning or end, but flow continuously as long as the app is operating.
The philosophy behind AWS CloudWatch Logs fits this principle well: the service helps you treat logs as event streams.
We want to create a stream of our system logs (syslogs) using CloudWatch Logs, so we need to install and configure an agent on our EC2 machine.
Once the installation is completed, an interactive setup will start:
Make sure the IAM credentials you configure are allowed to perform at least these actions:
Alternatively, attach this policy to the role you will use:
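A minimal sketch of such a policy, generated with Python, is shown below. The four actions are an assumption based on the set commonly documented for the CloudWatch Logs agent, the `logs_agent_policy` helper is a hypothetical name, and the wildcard resource should be narrowed for real deployments.

```python
import json


def logs_agent_policy(resource="arn:aws:logs:*:*:*"):
    """Build an IAM policy document allowing an agent to push logs.

    The action list is an assumption: the set commonly documented for
    the CloudWatch Logs agent. Adjust it to your own needs.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                    "logs:DescribeLogStreams",
                ],
                "Resource": [resource],
            }
        ],
    }


policy = logs_agent_policy()
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON can be pasted into the IAM console as an inline policy for the role attached to the instance.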
A few seconds after finishing the configuration above, you will be able to see your syslogs:
Using the Insights Explorer, you have the ability to query your log stream. These are some useful examples:
1. 25 most recently added log events:
fields @timestamp, @message
| sort @timestamp desc
| limit 25
2. Number of exceptions logged every 5 minutes:
filter @message like /Exception/
| stats count(*) as exceptionCount by bin(5m)
| sort exceptionCount desc
3. List of log events that are not exceptions:
filter @message not like /Exception/
4. View Lambda latency statistics for 5-minute intervals:
filter @type = "REPORT"
| stats avg(@duration), max(@duration), min(@duration) by bin(5m)
5. Top 10 byte transfers by source and destination IP addresses (VPC flow logs):
stats sum(bytes) as bytesTransferred by srcAddr, dstAddr
| sort bytesTransferred desc
| limit 10
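To see what a query like the second one computes, here is a small local sketch in plain Python. It does not call CloudWatch; the sample events and the helper names (`bin_start`, `exception_counts`, `at`) are made up for illustration. It filters messages containing "Exception", groups them into 5-minute bins, and sorts the bins by count.

```python
from collections import Counter
from datetime import datetime, timezone


def bin_start(ts, minutes=5):
    """Round a timestamp down to the start of its N-minute bin."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)


def exception_counts(events):
    """events: iterable of (timestamp, message) pairs.
    Returns (bin_start, count) pairs sorted by count, highest first."""
    counts = Counter(bin_start(ts) for ts, message in events
                     if "Exception" in message)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def at(hour, minute):
    """Helper building a UTC timestamp on an arbitrary sample day."""
    return datetime(2019, 10, 14, hour, minute, tzinfo=timezone.utc)


# Made-up sample events, for illustration only.
events = [
    (at(14, 1), "NullPointerException in request handler"),
    (at(14, 3), "request served in 12 ms"),
    (at(14, 7), "TimeoutException while contacting the database"),
    (at(14, 8), "IOException while reading upload"),
    (at(14, 9), "NullPointerException in request handler"),
]

result = exception_counts(events)
for start, count in result:
    print(start.isoformat(), count)
```

With this sample data, the 14:05 bin comes first with three exceptions, followed by the 14:00 bin with one.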
Subscriptions allow you to deliver your log events to another service, like AWS Lambda. A good use case is when you need to ETL (extract, transform, and load) your log data from AWS CloudWatch to another datastore. You may also need a full-text search engine; in that case, you can use subscriptions to send logs to Amazon Elasticsearch Service (AES).
CloudWatch Metric Filter
Using the CloudWatch console, you can also create a filter to extract custom text from your logs, as shown in the next section.
A metric is a time-ordered set of data points that are published to CloudWatch. We will walk you through creating custom metrics and discuss AWS observable metrics.
Building on the last example, say you want to group all of the log lines containing the word "kernel", or you want to grep the word "memory". Let's do this using the "Create Metric Filter" option mentioned earlier.
Note: this example uses a simple pattern, but AWS lets you use more advanced patterns for complicated use cases. Whether your pattern is complex or simple, you will be able to assign it to a metric and then visualize that metric.
In this practical example, we filter the word "memory" from the syslog stream we set up before. To verify that everything is working well, we are going to load test the memory.
Since we are using a nano machine, the memory will obviously not withstand this kind of stress test, but this is a good exercise to check and visualize the logs. First of all, we can see the memory fault here:
At the same time, with the right configurations, we can visualize the "memory" word count in the log stream and thus monitor the health of the EC2 instance memory.
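Conceptually, the metric filter in this example turns matching log lines into data points. The sketch below mimics that locally in plain Python, assuming a simple case-sensitive term match and a metric value of 1 per matching line, summed per one-minute period. The log lines and the `metric_filter` function are hypothetical.

```python
from collections import defaultdict


def metric_filter(log_lines, term, period_seconds=60):
    """Emit a value of 1 for every line containing `term`, summed per period.

    log_lines: iterable of (epoch_seconds, text) pairs.
    Returns a dict mapping each period start (epoch seconds) to its count.
    """
    datapoints = defaultdict(int)
    for epoch, text in log_lines:
        if term in text:  # simple, case-sensitive term match
            datapoints[epoch - epoch % period_seconds] += 1
    return dict(datapoints)


# Hypothetical syslog lines written during a memory stress test.
syslog = [
    (1571061600, "kernel: Out of memory: Kill process 1234 (stress)"),
    (1571061615, "kernel: Killed process 1234 (stress), freeing memory"),
    (1571061630, "systemd: Started Session 3 of user ec2-user."),
    (1571061675, "kernel: stress invoked oom-killer, low on memory"),
]

memory_metric = metric_filter(syslog, "memory")
print(memory_metric)  # {1571061600: 2, 1571061660: 1}
```

Plotting these per-minute sums over time gives the kind of graph you can build in CloudWatch from the resulting metric.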
AWS services also publish metrics by default. If you visit the Metrics dashboard, you will be able to see the available metrics, grouped under the AWS namespaces. You can create separate namespaces for your custom metrics, which is recommended if you want to keep them isolated: if you manage many applications at the same time, you don't want to see metrics from different apps aggregated in the same feeds.
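As a sketch of how a custom namespace is used in practice, the snippet below builds the request for publishing one data point. The namespace, metric name, and dimension are invented for the example, and `build_metric_request` is a hypothetical helper; the resulting dictionary follows the parameter names of the `put_metric_data` call in boto3, which is left as a comment so that the snippet runs without AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_request(namespace, name, value, unit="Count", dimensions=None):
    """Build keyword arguments for a put_metric_data call."""
    if namespace.startswith("AWS/"):
        # Namespaces beginning with "AWS/" are reserved for AWS services.
        raise ValueError("custom namespaces must not start with 'AWS/'")
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                "Dimensions": [{"Name": key, "Value": val}
                               for key, val in (dimensions or {}).items()],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(value),
                "Unit": unit,
            }
        ],
    }


# Hypothetical application namespace and metric.
request = build_metric_request("MyApp/Checkout", "FailedPayments", 3,
                               dimensions={"Environment": "staging"})

# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**request)
print(request["Namespace"], request["MetricData"][0]["MetricName"])
```

Using one namespace per application keeps each app's metrics in its own group on the Metrics dashboard.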
Given the fact that most AWS services publish metrics, you have a multitude of possibilities to use CloudWatch effectively. These are some common examples:
Alarms, events, logs, and metrics, coupled with other AWS services, give you the flexibility you need to build an efficient monitoring and observability system that works for you. You may have different sources of information and data that should be collected and analyzed to have a complete view of your systems. AWS CloudWatch can help you. Using the built-in features in CloudWatch, you can collect and aggregate as much data as you need, and organize and visualize it using the different tools CloudWatch offers.