Last updated on April 28, 2023
Monitoring is the best friend of every system administrator. It gives you visibility into what is happening in your application so that when an issue occurs, you or someone on your team can be alerted and take the necessary measures to mitigate it. However, monitoring can only be as effective as the metrics you’re keeping track of.
There are plenty of factors that can affect application performance. And while it’s difficult to stay on top of them all, there are certain metrics that are non-negotiable. Examples of these are CPU utilization, memory usage, disk space, disk usage, network throughput, and so on. If you’re using a cloud provider like AWS, these metrics are probably handed to you out of the box.
In this article, we’ll take a look at a specific error that is just as important to look out for as the metrics mentioned above: HTTP 5xx errors. These errors are usually recorded by the web server you’re using, whether that be Nginx, Apache, or something else. So if you run Nginx on EC2, for example, these errors won’t automatically show up in your CloudWatch dashboard. Later, I’ll show you a way to capture log events with 5xx errors. We will also send an alert message to a Slack channel along with helpful information such as the requester’s IP address, the requested resource, and the user agent.
What are 5xx errors?
5xx errors are HTTP response codes returned by a web server to a client whenever the server encounters problems processing a request. Identifying the cause of these errors is often tricky, but common reasons include a bug in your code waiting to be fixed or a resource limitation, such as high CPU usage or memory running out.
Why should you care?
- It could mean that your website is down. While this is not always the case, even occasional 5xx errors are worth digging into.
- It leads to a poor user experience.
- Numerous 5xx errors are bad for your website’s ranking in Google Search. Web pages that keep returning 5xx responses can be dropped from Google’s index. Since SEO determines the level of visibility your website gets, you want to keep your ranking as high as you can.
Overview of the Solution
Prerequisites:
- A CloudWatch agent must be configured on your server to send logs to CloudWatch Logs.
- An incoming webhook for the Slack channel of your choice. You can make one by following this guide. An incoming webhook is basically an endpoint for a Slack channel. To send alert messages to a channel, we simply send a POST request to its webhook.

How it works:
- The CloudWatch agent installed on the EC2 instance sends log events to CloudWatch Logs.
- We’ll set up a subscription filter to pick out 5xx events. A subscription filter is a CloudWatch Logs feature that delivers a real-time feed of log events to other AWS services such as AWS Lambda, Amazon OpenSearch Service, Amazon Kinesis Data Streams, and Amazon Kinesis Data Firehose. In our case, we’ll use AWS Lambda.
- Matching events are sent to a Lambda function. The Lambda function parses the event data, formats it as a Slack message, and sends it to a channel via a webhook.
Steps:
1. Create a Lambda function. Give it a descriptive name. For runtime, choose Python 3.9 and leave the default execution role as is. The default execution role contains permissions for sending logs to CloudWatch Logs, which is helpful for debugging. Finally, click Create function. For the time being, leave the function’s code blank.
2. Go to the CloudWatch Console and click on Log groups. Search and click the log group that you wish to monitor.
3. Under Subscription filters, click Create Lambda subscription filter.
4. For destination, select the Lambda function that you created in Step 1.
5. The pattern’s syntax depends on the log structure. For example, the syntax for a JSON-formatted log is not the same as the syntax for a space-delimited log. In this demo, we work with the latter. The filter pattern for a space-delimited log must be in a list format: [var1, var2, var3, ...], where each variable in the list corresponds to a term in the log, and terms are separated by spaces.
The number of variables that you specify must not exceed the total number of space-delimited terms; otherwise, CloudWatch won’t be able to match the pattern.
Let’s test out the [var1, var2, var3] filter pattern against the following sample log events. Keep in mind that the log format you’re working with might be different from what is shown here:
[25/Apr/2022:12:06:56+0000] 12.13.15.16 GET 200 - GET /courses/aws-certified-cloud-practitioner-practice-exams/ HTTP/1.1 portal.tutorialsdojo.com https://portal.tutorialsdojo.com/courses/aws-certified-solutions-architect-professional-practice-exams/lessons/practice-exams-timed-mode-5/topic/aws-certified-solutions-architect-professional-practice-exam-timed-mode-instructions/ [Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0]
[25/Apr/2022:12:05:32+0000] 49.56.66.5 GET 503 - GET /courses/aws-certified-cloud-practitioner-practice-exams/ HTTP/1.1 portal.tutorialsdojo.com https://portal.tutorialsdojo.com/courses/aws-certified-solutions-architect-professional-practice-exams/lessons/practice-exams-timed-mode-1/topic/aws-certified-solutions-architect-professional-practice-exam-timed-mode-instructions/ [Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0]
[25/Apr/2022:12:01:49+0000] 12.32.32.32 GET 200 - GET /courses/aws-certified-cloud-practitioner-practice-exams/ HTTP/1.1 portal.tutorialsdojo.com https://portal.tutorialsdojo.com/courses/aws-certified-solutions-architect-professional-practice-exams/lessons/practice-exams-timed-mode-5/topic/aws-certified-solutions-architect-professional-practice-exam-timed-mode-instructions/ [Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0]
Results:
Great! As you can see, the pattern matches all events in the sample log. Keep in mind, though, that we haven’t done any filtering yet; we’ve just captured the logged messages from each event and distributed them to var1, var2, and var3. This time, let’s try filtering requests from a certain IP address. Say I’m curious about what the IP address 12.13.15.16 is up to and want to track every request it makes. We can do that by using the [var1, var2=12.13.15.16, var3] filter pattern.
As shown in the results, only the event containing the 12.13.15.16 IP address was matched. Now, the problem with our current pattern is that it would be difficult to filter 5xx events because the status codes are mixed in with other data in var3. The goal is to separate the status code from the other terms. To achieve this, let’s use the pattern below:
[time_stamp, ip_addr, request_method, status=5*, separator, method, resource, request, http_host, http_referer, http_user_agent]
ⓘ Tip: When naming variables, think of names that best describe the information that they represent.
6. Look at the value of status. Instead of targeting an exact string like what we did with the IP address, we use a wildcard this time. 5* means that any term starting with 5 (i.e., 500, 501, 502, …) will be matched. If a 5xx error is detected, CloudWatch sends the matching event to the Lambda function that we created.
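To get a feel for what this pattern does, here is a rough local approximation in Python. It is only a sketch: it assumes that bracketed groups such as the timestamp and the user agent count as single terms, as in the sample log above, and it uses Python's fnmatch module for the 5* wildcard. The helper names split_terms and is_5xx are made up for illustration, the sample paths are shortened, and CloudWatch's own matcher may behave differently in edge cases.

```python
import re
from fnmatch import fnmatchcase

# Variable names taken from the filter pattern in this article.
FIELD_NAMES = [
    "time_stamp", "ip_addr", "request_method", "status", "separator",
    "method", "resource", "request", "http_host", "http_referer",
    "http_user_agent",
]

# A term is either a [bracketed group] or a run of non-space characters.
TERM = re.compile(r"\[[^\]]*\]|\S+")


def split_terms(line):
    """Name each space-delimited term of a log line."""
    return dict(zip(FIELD_NAMES, TERM.findall(line)))


def is_5xx(line):
    """Mimic the status=5* condition with a shell-style wildcard."""
    return fnmatchcase(split_terms(line).get("status", ""), "5*")


# Sample line modeled on the log events above (paths shortened).
sample = (
    "[25/Apr/2022:12:05:32+0000] 49.56.66.5 GET 503 - GET /courses/ HTTP/1.1 "
    "portal.tutorialsdojo.com https://portal.tutorialsdojo.com/ "
    "[Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0]"
)
print(split_terms(sample)["status"], is_5xx(sample))
```

Running this against a few lines from your own access log is a quick way to confirm that the variable names line up with the terms you expect before you apply the pattern in the console.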
7. Apply the filter pattern. Then, test it against your own log events. Provide a descriptive filter name, then click Start streaming.
8. Go to the Lambda Console and select the function that we created in Step 1. Copy the code below and paste it into the Lambda function editor. Click Deploy.
The event data that CloudWatch sends is gzip-compressed and base64-encoded. Therefore, it must be decoded and decompressed first, then converted into a Python object that we can work with. After converting, we parse the data that we need and format it as a Slack message. Finally, we wrap the message as a JSON payload in a POST request and send it to the Slack webhook using the urllib3 module.
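A minimal sketch of what such a function could look like is shown below. It is not the original code from this demo. It assumes that the webhook URL is stored in an environment variable named SLACK_WEBHOOK_URL (a hypothetical name), and that each log event carries an extractedFields map keyed by the variable names in the filter pattern; any field that is missing falls back to "unknown". The helper names are illustrative as well.

```python
import base64
import gzip
import json
import os


def decode_event(event):
    """Decode the base64 + gzip payload that CloudWatch Logs delivers."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def format_slack_message(log_group, log_event):
    """Build the Slack webhook body for one matching log event."""
    # Assumption: extractedFields is keyed by the filter pattern's names.
    fields = log_event.get("extractedFields") or {}
    lines = [
        f"HTTP {fields.get('status', 'unknown')} error in log group {log_group}",
        f"IP address: {fields.get('ip_addr', 'unknown')}",
        f"Resource: {fields.get('resource', 'unknown')}",
        f"User agent: {fields.get('http_user_agent', 'unknown')}",
    ]
    return {"text": "\n".join(lines)}


def send_to_slack(body, webhook_url):
    """POST the message to the Slack incoming webhook."""
    import urllib3  # imported here so the helpers above run without it

    http = urllib3.PoolManager()
    return http.request(
        "POST",
        webhook_url,
        body=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def lambda_handler(event, context):
    payload = decode_event(event)
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical variable name
    for log_event in payload["logEvents"]:
        message = format_slack_message(payload["logGroup"], log_event)
        send_to_slack(message, webhook_url)
    return {"alerts_sent": len(payload["logEvents"])}
```

Keeping the decoding and formatting in separate helpers makes them easy to test locally without calling Slack. If your events do not include extractedFields, you would need to parse the raw message field of each log event instead.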
Verification
To check if the solution is working, simulate 5xx HTTP status codes on your web server. Your Slack channel should receive a message similar to the following screenshot:
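If triggering real 5xx responses is inconvenient, the Lambda function can also be exercised with a hand-built test event. The sketch below wraps log lines in the same base64 and gzip envelope that CloudWatch Logs uses. The function name make_test_event, the log group and stream names, and the sample log line are all placeholders, and real events may contain additional keys that this sketch omits.

```python
import base64
import gzip
import json


def make_test_event(messages, log_group="my-nginx-log-group"):
    """Wrap raw log lines the way CloudWatch Logs delivers them to Lambda."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": log_group,        # placeholder name
        "logStream": "test-stream",   # placeholder name
        "logEvents": [
            {"id": str(i), "timestamp": 0, "message": message}
            for i, message in enumerate(messages)
        ],
    }
    compressed = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(compressed).decode("ascii")}}


if __name__ == "__main__":
    # Made-up log line that follows the space-delimited format used in this demo.
    sample = (
        "[25/Apr/2022:12:05:32+0000] 49.56.66.5 GET 503 - GET /test/ HTTP/1.1 "
        "example.com - [test-agent]"
    )
    print(json.dumps(make_test_event([sample])))
```

The printed JSON can be pasted into a test event in the Lambda console to invoke the function manually.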
Conclusion
Being aware of server-related errors on any business-critical website is important. What I’ve shared with you is a very simple yet effective tool that you can consider using for catching these errors in real time. I think AWS Lambda really shines in this kind of use case because it is easy to set up, requires no server management, and is relatively cheap. While this demo is intended for teams that regularly communicate over Slack, you can try modifying the code to suit your needs. For example, instead of a webhook, you could send the message to email addresses that are subscribed to an SNS topic.
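As a rough sketch of that last idea, the webhook call could be swapped for a publish to an SNS topic using boto3. The function name publish_alert and the subject line are illustrative, the topic ARN is something you would supply, and the function's execution role would also need permission to publish to the topic.

```python
def publish_alert(text, topic_arn, sns_client=None):
    """Publish an alert to an SNS topic instead of posting to Slack."""
    if sns_client is None:
        import boto3  # imported lazily; needed only when no client is passed in

        sns_client = boto3.client("sns")
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject="HTTP 5xx error detected",  # illustrative subject line
        Message=text,
    )
```

Accepting the client as an optional argument keeps the function easy to test with a stand-in object, while the default path creates a regular boto3 client.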