The Danger in the Details — Scalable Cloudwatch Metrics for AWS Lambda

Symphonia
John Chapin
Jan 24, 2017

This is the second article in a series on AWS Lambda monitoring from Symphonia, published on The Symphonium. Symphonia is a Serverless and Cloud Technology consultancy based in NYC.

The first article in the series was “A Love Letter to Lambda Logging”.

Metrics are vital to understanding the performance and operational characteristics of your applications. They track everything from server-level performance characteristics like CPU and memory utilization to high-level business indicators like “number of products purchased”. However, extracting business metrics from AWS Lambda functions is unexpectedly complex, and the consequences of doing it incorrectly range from frustrating to dangerous.

In this article, I describe the dangers inherent in naively publishing metrics from Lambda functions directly to the Cloudwatch Metrics service. I then introduce Symphonia’s new open-source library, lambda-metrics, and tool, lambda-metrics-maven-plugin, which allow Lambda developers to easily and safely collect and publish business metrics to Cloudwatch from their Java-based Lambda functions.

Background

The AWS Lambda platform removes the need to track server-level metrics — the code we control is running within a fairly constrained environment in a fully-managed, ephemeral container. The platform (and other AWS services in general) also provides a set of service-level metrics that track Lambda operational characteristics like duration, throttling, and errors.

However, business metrics (like “number of products purchased”) must still be handled explicitly within an application. A common solution for capturing business metrics within persistent or long-running Java applications is a library like Codahale Metrics. That library provides several common metric types, like Counters, Gauges, Meters, Histograms, and Timers. It also provides some capability for publishing that information via JMX, or to external services like Cloudwatch. Publishing metrics to an external service generally involves using a client library (in the case of Cloudwatch, the AWS Cloudwatch Java SDK) which will efficiently batch messages, manage communication channels (like HTTP), and ensure that publishing metrics isn't an arduous or resource-intensive operation.

When using AWS Lambda though, we aren't able to realize that same efficiency in publishing metrics. Lambdas are ephemeral and relatively short-lived. They can't share in-memory data structures or efficiently reuse communications channels, even between separate invocations of the same Lambda.

“The call is coming from inside the house!”

Despite those limitations, a common approach to publishing metrics from Lambdas is to just proceed as if the Lambda is a normal Java application. For Lambdas that are invoked infrequently (for example, hourly), using the Cloudwatch API directly (or through the SDK library) is usually sufficient.

However, for frequently invoked, highly concurrent Lambda functions (for example, a function that's attached to a Kinesis stream with hundreds or thousands of shards), publishing metrics directly from the Lambda function is a recipe for disaster.

Like most AWS services, Cloudwatch Metrics has account-level limits. One of those limits is the rate at which PutMetricData API calls can be made — by default that limit is 150 calls per second. Because that's an account-level limit, if it's exceeded due to a flood of calls from one system or application, any other application trying to publish metrics to Cloudwatch will be throttled. With an instantaneously scalable service like Lambda, it becomes trivially easy to launch an unintended Denial of Service attack on your own account, and blind yourself in the process.

Raising the account-level PutMetricData limit is one way to deal with the problem of throttling. However, that brings to light another issue inherent in publishing metrics directly from Lambdas: cost. The raw cost of making PutMetricData requests is $0.01 per 1,000 requests. So, if your Lambda function is executed 10 million times (e.g., a month of processing events from a Kinesis stream with a few shards), and makes a single PutMetricData API call per execution, you will pay $100 simply to record metrics. Keep that in mind as we discuss a different approach to extracting metrics from Lambdas.
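To make the arithmetic concrete, here is a minimal sketch that reproduces the figures above. The price and the 150 calls-per-second limit are the values quoted in this article and may have changed since; the class and method names are hypothetical, made up for illustration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class PutMetricDataEstimate {
    // Figures quoted in this article; check current AWS pricing and limits.
    static final BigDecimal PRICE_PER_THOUSAND_CALLS = new BigDecimal("0.01");
    static final int DEFAULT_CALLS_PER_SECOND_LIMIT = 150;

    /** Dollar cost of making the given number of PutMetricData calls. */
    static BigDecimal cost(long calls) {
        return PRICE_PER_THOUSAND_CALLS
                .multiply(BigDecimal.valueOf(calls))
                .divide(BigDecimal.valueOf(1000), 2, RoundingMode.HALF_UP);
    }

    /** True if the combined call rate would exceed the account-level limit. */
    static boolean exceedsLimit(int concurrentInvocations, double callsPerSecondEach) {
        return concurrentInvocations * callsPerSecondEach > DEFAULT_CALLS_PER_SECOND_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println("10 million calls cost $" + cost(10_000_000L));
        System.out.println("1000 invocations at 1 call/s throttled? "
                + exceedsLimit(1000, 1.0));
    }
}
```

Even a modest per-invocation call rate crosses the limit once concurrency reaches the hundreds, which is exactly the Kinesis scenario described earlier.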

The “Right” Way

Rather than directly publish Cloudwatch Metrics from Lambda functions, AWS recommends the use of Cloudwatch Logs Metric Filters.

Metric Filters are used to scan incoming Cloudwatch Logs data for patterns of interest, and to produce and publish Cloudwatch metrics from matching log entries. For example, given a series of log entries that looks like this:

[2017-01-20 14:24:27.857] INFO Processing message
[2017-01-20 14:24:27.857] METRIC inputBytes 3
[2017-01-20 14:24:27.857] INFO Finished processing

We can construct a Metric Filter to identify log entries that have the word “METRIC” as the second term, and publish a Cloudwatch metric named “inputBytes” with the value “3”.
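In Cloudwatch's space-delimited filter syntax, I'd expect such a filter to look roughly like `[timestamp, level=METRIC, name=inputBytes, value]` with `$value` as the metric value, though treat that exact pattern as an assumption and check it against the documentation. The sketch below is not how Cloudwatch is implemented; it is a local emulation of the same matching logic, with a hypothetical class name, so you can see which lines would produce a data point.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetricLineMatcher {
    // Bracketed timestamp, the literal METRIC, a metric name, then a numeric value.
    private static final Pattern METRIC_LINE = Pattern.compile(
            "^\\[([^\\]]+)\\]\\s+METRIC\\s+(\\S+)\\s+(-?\\d+(?:\\.\\d+)?)\\s*$");

    /** Returns the value if the line is a METRIC entry for the given metric name. */
    static Optional<Double> extract(String logLine, String metricName) {
        Matcher m = METRIC_LINE.matcher(logLine);
        if (m.matches() && m.group(2).equals(metricName)) {
            return Optional.of(Double.parseDouble(m.group(3)));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] lines = {
            "[2017-01-20 14:24:27.857] INFO Processing message",
            "[2017-01-20 14:24:27.857] METRIC inputBytes 3",
            "[2017-01-20 14:24:27.857] INFO Finished processing"
        };
        for (String line : lines) {
            // Only the METRIC line yields a value; the INFO lines are ignored.
            extract(line, "inputBytes")
                    .ifPresent(v -> System.out.println("inputBytes = " + v));
        }
    }
}
```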

Here's an example of what that looks like in the Cloudwatch Metric Filters console:

Cloudwatch Logs Metric Filters console

For more specifics, I recommend reading the extensive example in the AWS Cloudwatch documentation.

Metric Filters are the only way to scalably and efficiently extract Cloudwatch Metrics from AWS Lambdas. By using logging output (which is handled transparently by the AWS Lambda runtime), they eliminate the overhead and complexity of directly calling the Cloudwatch Metrics API from your Lambda code. You don't have to worry about Cloudwatch Metrics configuration, throttling or error-handling within your Lambda.

In addition to the scaling and runtime benefits, Metric Filters are a substantially less expensive solution. In our earlier example, a pure Cloudwatch Metrics API solution processing 10 million metric data points per month would cost $100. That same volume of metrics, processed using a Metric Filter based solution, would cost less than $10 per month, due to the comparatively small cost of Cloudwatch Logs data ingestion and the behind-the-scenes batching of PutMetricData calls by our Metric Filters.

Metric Filter Limits

However, using Metric Filters effectively isn't straightforward. In addition to collecting and updating metrics internally, your Lambda must log those metric values out to System.out, and it must do so in a format that is either space-delimited or JSON. There are also limits that must be taken into account: the length of the metric patterns used to match log events and extract values (1 KB, which isn't documented), and the number of Metric Filters per log group (100).
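As a rough illustration of the logging side, here is a minimal sketch that writes one space-delimited metric line to System.out, in the same shape as the earlier log sample. The class name and the validation rule are my own assumptions for this example; the point is only that each field must stay a single token so that a space-delimited filter can pick it apart.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class MetricLogLine {
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /** Formats "[timestamp] METRIC name value" as a single log line. */
    static String format(LocalDateTime when, String name, long value) {
        // A name containing whitespace would split into extra fields and break matching.
        if (name.isEmpty() || name.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("metric name must be a single non-empty token");
        }
        return "[" + TIMESTAMP.format(when) + "] METRIC " + name + " " + value;
    }

    public static void main(String[] args) {
        // In a Lambda function, anything written to System.out ends up in Cloudwatch Logs.
        System.out.println(format(LocalDateTime.now(), "inputBytes", 3));
    }
}
```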

The resulting Cloudwatch Metrics that are produced from Metric Filters should be named (and namespaced) appropriately. It's also worth noting that metrics generated from Metric Filters are considered “custom metrics”, and as such can't have dimensions or units.

None of these individual caveats is a fundamental barrier to using Metric Filters, but taken together they can be burdensome and error-prone to deal with.

Introducing lambda-metrics (and friends)

I'm happy to announce that Symphonia has open-sourced and released two new projects to make it straightforward and safe to generate and collect business metrics from Java-based Lambda functions. Together, these two projects allow Java developers to easily use Codahale Metrics within their Lambda functions, and safely and efficiently publish those metrics using Cloudwatch Logs Metric Filters — the approach recommended by AWS. They also solve the pattern limit problems we just described.

lambda-metrics is a runtime library that allows Java developers to use Codahale Metrics within their Lambda functions, and to easily output those metrics to Cloudwatch Logs in a format consumable by Cloudwatch Metric Filters. It relies on and extends the lambda-logging library.

Here's an example of lambda-metrics in action:

As you can see, using lambda-metrics is straightforward — simply create your own metrics-gathering class that extends the LambdaMetricSet base class, and add annotated metric fields to that class. Interact with those fields as you normally would with Codahale metrics (for example, incrementing counters), and at the end of your Lambda handler, call the report method. Your metrics will be logged out in the correct format to be picked up by Cloudwatch Metric Filters. No need to worry about logging formats, or anything else.
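The general shape of that workflow (collect in memory, report once at the end of the handler) can be sketched with the standard library alone. To be clear, this is a stand-in and not the lambda-metrics API: it does not use LambdaMetricSet, Codahale Metrics, or annotations, and every name in it is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

final class SketchMetricSet {
    // Insertion-ordered so the report lists counters in the order they were first used.
    private final Map<String, LongAdder> counters = new LinkedHashMap<>();

    /** Returns the counter registered under the name, creating it on first use. */
    LongAdder counter(String name) {
        return counters.computeIfAbsent(name, n -> new LongAdder());
    }

    /** Renders one "METRIC name value" line per counter. */
    String report() {
        StringBuilder out = new StringBuilder();
        counters.forEach((name, count) ->
                out.append("METRIC ").append(name).append(' ')
                   .append(count.sum()).append('\n'));
        return out.toString();
    }

    /** Stand-in for a Lambda handler: update counters per event, report once at the end. */
    static String handle(String[] events) {
        SketchMetricSet metrics = new SketchMetricSet();
        for (String event : events) {
            metrics.counter("eventsProcessed").increment();
            metrics.counter("inputBytes").add(event.getBytes(StandardCharsets.UTF_8).length);
        }
        String lines = metrics.report();
        System.out.print(lines); // a single burst of log output per invocation
        return lines;
    }

    public static void main(String[] args) {
        handle(new String[] {"abc", "de"});
    }
}
```

Reporting once per invocation keeps the log volume proportional to the number of metrics rather than the number of events processed.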

However, the real power of Cloudwatch Metric Filters is unlocked by our second project. lambda-metrics-maven-plugin is a tool (delivered as a Maven plugin) that automates the setup of Cloudwatch Metric Filters by inspecting your compiled Lambda function code and using the AWS Java Cloudwatch SDK to create the appropriate Metric Filters based on the metric fields you annotated.

Here's an example of the Metric Filters that were automatically created by lambda-metrics-maven-plugin (based on the previous code sample):

image

And, most exciting of all, we can easily find and use those custom business metrics as we would any other Cloudwatch Metrics:

image

The lambda-metrics-maven-plugin’s __ goals are easily incorporated into your existing Maven workflow. See the README for more information.

In Conclusion

Through a combination of Codahale Metrics and Java annotations, the new lambda-metrics library makes it trivially easy for Lambda developers to build business metrics into their Lambdas, and extract those metrics into Cloudwatch even in the face of massive concurrency and scale. The additional lambda-metrics-maven-plugin is a companion tool for the easy configuration of these metrics.