Learning Lambda — Part 7

Error Handling

Mike Roberts
Nov 1, 2017

This is Part 7 of Learning Lambda, a tutorial series about engineering using AWS Lambda. To see the other articles in this series please visit the series home page. To be alerted about future installments subscribe to our newsletter, follow Symphonia on twitter @symphoniacloud, and our blog at The Symphonium.

Welcome to Part 7 of Learning Lambda! If you didn’t read Part 6 you’ll probably want to do that before continuing here.

So far in this series we’ve only talked about what happens when things go perfectly well. Of course that is unrealistic, and any useful production application and architecture needs to handle the times when errors occur, whether those are errors in our code or in the systems we rely on.

Since AWS Lambda is a Platform it has certain constraints and behavior when it comes to errors, and in this part we’ll dig into what kind of errors can happen, for which contexts, and how we can handle them. As a language note I’ll be using the words ’error’ and ’exception’ interchangeably, without the nuance that comes between the two terms in the Java world.

1 — Classes of error

When using AWS Lambda there are several different classes of error that can occur. The primary ones are as follows, roughly in the order in which they can occur during the processing of an event:

Error initializing the Lambda function (a problem locating the handler, or with the function signature, or perhaps loading code)
Error parsing input into specified function parameters
Error communicating with an external downstream service (database, etc.)
Error generated within the Lambda function (either within its code, or within the immediate environment, like an out-of-memory problem)
Error caused by function timeout

Another way of classifying errors is as follows:

Handled errors
Unhandled errors

For example, let’s consider the case where we communicate with a downstream microservice over HTTP, and it throws an error. In this case we may choose to catch the error within the Lambda function and handle it there (a handled error), or we may let the error propagate out to the environment (an unhandled error).

Alternatively, say we specified an incorrect function name in our Lambda configuration. In this case we are unable to catch the error in the Lambda function code, so this is always an unhandled error.

If we handle the error ourselves, within code, then Lambda really has nothing to do with our particular error handling strategy. We can log to Standard Error if we like, but as we saw earlier Standard Error is treated identically to Standard Output as far as Lambda is concerned.
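To make the distinction concrete, here is a minimal Java sketch of the two choices for a failing downstream call. The class and method names are hypothetical, and the downstream call is passed in as a plain `Supplier` so that the example does not depend on any HTTP library.

```java
import java.util.function.Supplier;

// Hypothetical example class; the names are illustrative, not from this series' code.
class DownstreamCallExample {

    // Handled error: the exception is caught inside the function, logged,
    // and the function returns normally, so the Lambda runtime never sees a failure.
    static String handled(Supplier<String> downstreamCall) {
        try {
            return downstreamCall.get();
        } catch (RuntimeException e) {
            // Standard Error lands in the same log output as Standard Output.
            System.err.println("Downstream call failed: " + e.getMessage());
            return "fallback";
        }
    }

    // Unhandled error: nothing is caught, so the exception propagates out of
    // the function to whatever invoked it (the Lambda runtime, when deployed).
    static String unhandled(Supplier<String> downstreamCall) {
        return downstreamCall.get();
    }
}
```

Which of the two is appropriate depends on the event source, which is the subject of the next section.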

Therefore the nuances that come with handling errors in Lambda are all about unhandled errors — those that bubble out of our code to the Lambda Runtime, or that happen externally to our code. What happens to them? Interestingly this depends significantly on the type of event source that triggers our Lambda function in the first place, as we will now examine.

2 — Lambda Runtime error processing by event source

Lambda divides what it does with errors according to the event source that triggers invocation. Every event source is placed in exactly 1 category:

Non-stream-based, synchronous. This includes direct Lambda invocations using the RequestResponse invocation type (e.g. from the CLI), or from API Gateway.
Non-stream-based, asynchronous. This group includes most types of Lambda event sources, including S3, SNS, and CloudWatch Events (used for scheduled tasks), plus direct Lambda invocations using the Event invocation type.
Stream-based. This covers all the pull-model sources I mentioned in Part 6. At time of writing these are Kinesis Streams and DynamoDB Streams.

As a reminder you can view all of Lambda’s event sources in the official documentation here.

Each of these categories has a different error handling model within Lambda. We’ll look at these next.

Non-stream-based, synchronous invocation

This is the simplest model. For Lambdas invoked in this way the error is propagated back up to the caller, and no automatic retry is performed. How the error is exposed to the client depends on the precise nature of how the Lambda function was called, and may be confusing, so you should likely try forcing errors within your code to see how such problems are exposed.

Non-stream-based, asynchronous invocation

Since this model of invocation is asynchronous, or event-oriented, there is no upstream caller that can do anything useful with an error. Because of this Lambda has a more sophisticated error handling model for this type of invocation.

First of all, if an error is detected in this model of invocation then Lambda will retry processing the event up to two more times (for a total of three attempts), with a delay between such retries (the precise delay is not documented, but we’ll see an example a little later).

If the Lambda function fails for all three attempts then the event will be posted to the function’s Dead Letter Queue if one is configured (more on this later), otherwise the event is discarded and lost.

Stream-based invocation

If an error bubbles up to the Lambda runtime when processing an event from a stream-based / pull-model source then Lambda will keep retrying the event until either (a) the failing event expires or (b) the problem is resolved. This means that the processing of the stream is effectively blocked until the error is resolved.
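The three models above can be summarized as a toy, in-memory simulation. To be clear, this is not AWS code and does not call any AWS API: every name is hypothetical, retry delays are left out, and event expiry in the stream case is approximated by a fixed attempt budget.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of the three behaviors described above. Hypothetical names throughout.
class ErrorModelSketch {

    // Synchronous: one attempt, and any error goes straight back to the caller.
    static void invokeSynchronously(Consumer<String> function, String event) {
        function.accept(event);
    }

    // Asynchronous: up to three attempts, then the dead letter queue if one is
    // configured (non-null here), otherwise the event is dropped.
    // Returns the number of attempts made.
    static int invokeAsynchronously(Consumer<String> function, String event, List<String> deadLetterQueue) {
        final int maxAttempts = 3;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                function.accept(event);
                return attempt;
            } catch (RuntimeException e) {
                // fall through to the next attempt
            }
        }
        if (deadLetterQueue != null) {
            deadLetterQueue.add(event);
        }
        return maxAttempts;
    }

    // Stream-based: the same record is retried until it succeeds or expires,
    // and later records wait behind it.
    // Returns the records that were processed successfully.
    static List<String> pollStream(Consumer<String> function, List<String> records, int attemptsBeforeExpiry) {
        List<String> processed = new ArrayList<>();
        for (String record : records) {
            for (int attempt = 1; attempt <= attemptsBeforeExpiry; attempt++) {
                try {
                    function.accept(record);
                    processed.add(record);
                    break;
                } catch (RuntimeException e) {
                    // retry the same record; nothing behind it is read yet
                }
            }
        }
        return processed;
    }
}
```

The point to notice is where the failure ends up in each case: with the caller, in the DLQ (or nowhere), or as a blocked stream.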

The official documentation page for error handling in Lambda is here, and you may want to look at that for more details.

3 — Looking more at error processing for asynchronous, non-stream-based, invocation

This invocation type is the most common, in terms of number of types of event sources involved, and has the most detail, so let’s take a deeper look.

As an example I’m going to extend the S3 PUT processing Lambda function I created in Part 6, so if you want to follow along with this example in your own environment you should recreate that Lambda.

First of all let’s change our code to force it to throw an exception.

This will cause our Lambda function to throw an error. We’ll also keep track of how many times this instance of the Lambda is invoked.
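A handler along the following lines behaves as described. Treat it as a sketch rather than the exact code from this series: the class name and log wording are assumptions, and the parameter is a plain `Object` instead of the S3 event type from Part 6 so that the example compiles without the AWS libraries.

```java
// Hypothetical reconstruction; class name and log wording are assumptions.
// In a real deployment this would typically be a public class in its own file.
class ErroringS3Lambda {

    // Lives as long as this handler instance, so it counts the invocations
    // handled by this instance of the function.
    private int invocationCount = 0;

    // Part 6's function takes an S3 event; Object is used here only to keep
    // the sketch free of AWS library dependencies.
    public void handler(Object event) {
        invocationCount++;
        System.out.println("This container has processed " + invocationCount + " event(s)");
        // Always fail, so that the exception reaches the Lambda runtime.
        throw new RuntimeException("Deliberate failure while processing: " + event);
    }
}
```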

If we deploy this updated version of the code, and then PUT a file to S3, then Lambda will call our function 3 times (which will error every time). We can see this in two places.

First, if we look at the Monitoring tab for the function in the Lambda web console, we see that the Invocation errors count has risen to 3:

Second, we can look at the log output for the function in CloudWatch Logs (click the View logs in CloudWatch link on the web console). We’ll see the exception generated (three times) in our logs from the code:

We can tell by looking at this that there was a delay of a little over a minute between the first two attempts (18:40:02 to 18:41:09), and a little over two minutes between the second and third attempts.

If you look closely here you’ll see that the “container has processed…” count increases for each event. I’m not going to get into too much detail about why that is here, but suffice to say that the same instance of our Lambda is being executed for each attempt made by the Lambda runtime. We’ll get more into container reuse later in the series.

Without doing any further work this is all that will happen for the failed event — it will be logged, but then will be discarded. In certain situations this may be sufficient, but what if we want to have alternative processing for this failing event later?

This is where Dead Letter Queues, or DLQs, come into effect. For asynchronous, non-stream-based, event sources you can configure your Lambda to use either SNS or SQS (two of Amazon’s messaging products) as a target to store events that failed 3 times. Once the event is in SNS or SQS you can do whatever you want with it, either immediately or manually later. For example you may register a separate Lambda function as an SNS topic listener that posts a copy of the failing event to an operations Slack channel for manual processing.

DLQs are a configuration setting of the Lambda function, and so can be specified in either the Web Console, or through programmatic configuration. For sake of an example let’s create an SQS queue named S3Lambda-DLQ (through the SQS Web Console), and then use the Lambda Web Console to configure it as the target DLQ for our S3 Put Lambda function:

Note that the first time you do this you’ll likely need to extend the permissions on the lambda_basic_execution role so that it is allowed to write to SQS.

Now if we PUT a file to our S3 bucket, and wait for our Lambda to fail 3 times, the event will subsequently be posted to our SQS queue. Using the SQS Web Console we can actually view the messages of the queue, and navigate to the erroring event:

The SQS message body of the DLQ message is precisely the same as the event that was sent to the Lambda function in the first place, with the Message Attributes giving some amount of detail about the error:

Probably the most useful part of the Message Attributes is the RequestID which you can use to search within the Lambda Function’s logs.
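As an illustration, whatever consumes the DLQ could turn each message into a one-line summary for an operations channel. This is a sketch under stated assumptions: the class is hypothetical, the message is represented as a plain map of attributes plus a body string rather than an SQS SDK type, and only the RequestID attribute name comes from the text above. The ErrorCode and ErrorMessage names are my assumption, so check them against the messages you actually receive.

```java
import java.util.Map;

// Hypothetical helper for a DLQ consumer; not an AWS API.
class DeadLetterSummary {

    // Builds a one-line summary from a DLQ message's attributes and body.
    // Attribute names other than RequestID are assumptions; any attribute
    // that is missing shows up as "unknown".
    static String summarize(Map<String, String> messageAttributes, String body) {
        String requestId = messageAttributes.getOrDefault("RequestID", "unknown");
        String errorCode = messageAttributes.getOrDefault("ErrorCode", "unknown");
        String errorMessage = messageAttributes.getOrDefault("ErrorMessage", "unknown");
        // The body is the original event, so it is passed through untouched.
        return "Lambda request " + requestId + " failed (" + errorCode + "): " + errorMessage
                + " | original event: " + body;
    }
}
```

The request ID in the summary is what you would then search for in the function’s logs.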

More detail on Lambda function DLQs is available here.

4 — Monitoring for Errors

So far we’ve examined the kind of errors that can occur, and what the Lambda runtime will do if such errors bubble up outside of our function code. But if such errors do happen, how will we know about them? There are a few options here.

The primary one, common across all types of Lambda invocation, is that the Lambda Runtime will record a metric named Errors within CloudWatch Metrics whenever an unhandled error occurs. This metric is used by the Invocation errors graph we looked at earlier in the Lambda web console. Since this data is in CloudWatch Metrics there is a whole array of options you can use to process it, including using CloudWatch Dashboards to manually observe errors, CloudWatch Alarms to trigger notifications on certain metric conditions, custom Lambda function processing (because it’s Lambda Function Turtles all the way down), and more.

Apart from tracking the Errors metric, your options will depend on the invocation type. For synchronous, non-stream, sources you will likely check for problems at the event caller, or within API Gateway if you’re using it.

For asynchronous, non-stream, sources you can use the DLQ method as we described in the previous section (and you can also monitor DLQ errors, in case that also fails! See Dead Letter Error at this link.)

For stream-based sources you can also use the IteratorAge metric to see if you are falling behind in processing events. See this link for more details.

Of course if you handle errors within your Lambda function code then you can do whatever you’d like to register problems. For instance you may print an ‘Error’ message to Standard Output, and then scan for such messages in your logging system.

5 — Error strategies

In the first section we discussed 2 ways of categorizing Lambda errors — handled errors and unhandled errors.

For unhandled errors we should set up monitoring, as described in the previous section, and when errors occur we will likely need some kind of manual intervention. The urgency of this will depend on the context, and also on the type of the event source. Remember that in the case of stream-based sources all processing is blocked until the error is cleared.

For handled errors though we have an interesting choice. Should we process the error, and re-throw, or should we capture the error and exit the function nominally? Again, this will depend on the context and invocation type, but here are some thoughts.

For synchronous, non-stream sources, you will likely want to return some kind of error to the original caller. You can either do this directly within the Lambda function (e.g. you can generate an HTTP response with the appropriate status code), or in the case of API Gateway fronted functions you can let an exception bubble out of the code and then map this to an error within the API Gateway configuration.
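For the first of those options, a handler can catch its own exceptions and translate them into a response with a status code. The sketch below assumes a function fronted by API Gateway that returns a proxy-style response (a map with `statusCode` and `body`). The class name, the `orderId` field and the error bodies are all hypothetical, and the request is simplified to a flat map for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical handler; names and request shape are illustrative only.
class OrderLookupHandler {

    public Map<String, Object> handler(Map<String, Object> request) {
        try {
            String orderId = (String) request.get("orderId");
            if (orderId == null) {
                throw new IllegalArgumentException("orderId is required");
            }
            return response(200, "{\"orderId\":\"" + orderId + "\"}");
        } catch (IllegalArgumentException e) {
            // The caller sent something we cannot process: report a client error.
            return response(400, "{\"error\":\"bad request\"}");
        } catch (RuntimeException e) {
            // Anything else is our problem: log it and report a server error.
            System.err.println("Unexpected failure: " + e);
            return response(500, "{\"error\":\"internal error\"}");
        }
    }

    private static Map<String, Object> response(int statusCode, String body) {
        Map<String, Object> result = new LinkedHashMap<>();
        result.put("statusCode", statusCode);
        result.put("body", body);
        return result;
    }
}
```

Because every exception is caught, Lambda itself sees a successful invocation in all three cases; the error only exists in the HTTP response.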

For asynchronous, non-stream sources, what you do will largely depend on whether you want to use a DLQ. If you do then there’s often no harm in either letting an error bubble out, or throwing a custom error, and then handling the error in whatever is processing messages from the DLQ. If you don’t use a DLQ then you may want to at least log the failing input event if the error occurs within your code.

For stream-based sources you’ll typically want to handle errors within your code, since otherwise further processing is blocked. A good way to do this is to put a top-level try-catch block in your handler function. Within it you can set up your own retry strategy, or log the failing event and exit the function nominally. In certain situations you really will want to block further event processing until the problem causing the error is resolved, in which case you can throw a new error from the top-level try-catch block and rely on Lambda’s automatic retrying.
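One way to structure such a handler is sketched below, with a try-catch around each record and a small retry loop of its own. Everything here is an assumption for illustration: the names are hypothetical, records are plain strings rather than Kinesis or DynamoDB record types, and the limit of three attempts is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stream handler; names, types and limits are illustrative.
class StreamBatchHandler {

    private static final int MAX_ATTEMPTS = 3;

    // Processes every record in the batch and always returns normally, so the
    // stream is never blocked. Returns the records that had to be skipped.
    static List<String> handleBatch(List<String> records, Consumer<String> processRecord) {
        List<String> skipped = new ArrayList<>();
        for (String record : records) {
            try {
                processWithRetries(record, processRecord);
            } catch (RuntimeException e) {
                // Log the failing record and carry on with the rest of the batch.
                // To block the stream instead, rethrow here.
                System.err.println("Giving up on record " + record + ": " + e.getMessage());
                skipped.add(record);
            }
        }
        return skipped;
    }

    // Tries a record up to MAX_ATTEMPTS times, rethrowing the last failure.
    private static void processWithRetries(String record, Consumer<String> processRecord) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                processRecord.accept(record);
                return;
            } catch (RuntimeException e) {
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```

The trade-off is that skipped records are only visible in the logs (or wherever you choose to send them), so this pattern needs the monitoring described in section 4 to go with it.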

6 — Other topics

There are other aspects to error handling when using Lambda that I’m not going to describe in depth, but that you should be aware of. A grab-bag of them are:

Errors that occur during deployment
Permission errors (e.g. your Lambda’s role not having sufficient permissions.) Sometimes these are picked up during deployment, sometimes during execution. Make sure your Lambda execution role always has at least permissions to post to CloudWatch Logs, otherwise you’ll never know what the problem is!
Resource policy errors — when an upstream service is unable to call Lambda. These are tricky things, and the problem will be exposed in the upstream event source, not Lambda logging.
Throttling errors / Concurrent execution limits.
What happens when Lambda, the service, fails? For this you probably want to have more business-metric-oriented monitoring, outside of individual functions, and be prepared to run your Lambda functions in an alternative region.

Next time

That brings us to the end of Part 7 of Learning Lambda. In this article we’ve explored the various ways that errors can occur in Lambda functions, how Lambda processes them, taken a deep dive into errors thrown during asynchronous invocation from non-stream sources, looked at how to monitor for errors, and covered some of the strategies you should consider when thinking about error handling.

Error handling is one of the ’less fun’ parts of development, at least for many engineers, and yet one that needs to be performed. Another ‘gotcha’ area for Lambda development is that of handling Cold Starts: the times when AWS instantiates a function container, and the latency impact that comes with it. I describe Cold Starts in depth in the next part of this series.



Updates

2018–04–10 — Clarified that error handling for direct invocation depends on the invocation type that was specified by the caller. Thanks Erik Erikson, via the Public Serverless Slack Forum.
