AWS Lambda SnapStart - How

Symphonia
Mike Roberts
Jan 24, 2023

The second of two articles about AWS Lambda SnapStart, this one looks at how developers can update their existing Java Lambda applications to use SnapStart, and what performance improvements they might expect to see. Part 1 described what SnapStart is, what its caveats are, and why you may want to use it.

Hello fearless JVM developer! So you want to know how to use SnapStart? Well, you’ve come to the right place. In this article you’ll see how to take an existing Lambda Function and get it using SnapStart.

The example application, which comes from my book with John Chapin, is a fairly typical design for a Lambda-backed HTTP API, using API Gateway and DynamoDB.

Architecture of the example application

This article assumes that you are already familiar with using Java, Lambda, and the SAM deployment tool. If you aren’t, well, did I mention my book? :) If you’re using an alternative JVM language, or are using CDK, Terraform, or something else for deployment, then this article should provide you with the knowledge of Lambda SnapStart which you can then adapt for your environments.

Similarly, this example does not use any frameworks - it’s “vanilla” Java. If you need specific examples of using SnapStart with Spring Boot, Micronaut, Quarkus, etc., then there are already a few other articles that are google-able. I recommend you read this article first to understand in depth what SnapStart is giving you, and then use those articles for specific help on each framework.

Three Steps to SnapStart Heaven

This article follows three distinct steps, and I recommend you perform the same steps when you modify your own Lambda applications to use SnapStart.

  1. Prepare for SnapStart
  2. Enable SnapStart
  3. Optimize using hooks

The example code is available on GitHub here, with the changes for each step available at different commits.

Prepare

To prepare for SnapStart, I make two updates to the Globals section of the SAM template (GitHub diff here). These change two configuration properties of all the Lambda functions in the application. The new / updated configuration is as follows:

Globals:
  Function:
    Runtime: java11
    AutoPublishAlias: live
    # ...

I updated Runtime to Java 11 instead of Java 8, since only the AWS Java 11 runtime supports SnapStart at the time of writing. Note that I didn’t actually change the Maven configuration to use Java 11 - it’s still targeting Java 8 - but on a “real” project I’d likely update that too. For this example just changing the runtime version is sufficient. Yay for Java backwards compatibility!

The second change relates to one of the caveats I mentioned in Part 1 - we need to update the example to use Lambda Function Versions. When using Function Versions you usually also want a Function Version Alias - API Gateway is configured to integrate with the (constant) Alias, while the Version that the Alias points to changes over time.

Fortunately SAM makes this change extremely simple. Setting AutoPublishAlias does everything we need on both the Lambda and API Gateway resources.

We can now build and deploy the example as follows. Note that you’ll need to replace YOUR_CLOUDFORMATION_BUCKET with the name of the S3 bucket you use for SAM / CloudFormation deployment.

$ mvn package && \
  sam deploy --s3-bucket YOUR_CLOUDFORMATION_BUCKET \
    --stack-name snapstart-example --capabilities CAPABILITY_IAM

In this article I’m going to focus on the GET endpoint to /locations, but to start I’m going to save a couple of test events to the database (using the curl command-line HTTP client). IMPORTANT - for your own tests you’ll need to change the URL here, and in all subsequent commands, to the URL of the API deployed in your own AWS environment.

$ curl -d '{"locationName":"Brooklyn, NY", "temperature":91, "timestamp":1564428897, "latitude": 40.70, "longitude": -73.99}' \
        -H "Content-Type: application/json" \
        -X POST https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/events
$ curl -d '{"locationName":"Oxford, UK", "temperature":64, "timestamp":1564428898, "latitude": 51.75, "longitude": -1.25}' \
        -H "Content-Type: application/json" \
        -X POST https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/events      

With those events saved I can now query - again using curl. I surround the curl command with time so we can see how long the command takes:

$ time curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations

[{"locationName":"Oxford, UK","temperature":64.0,"timestamp":1564428898,"longitude":-1.25,"latitude":51.75},{"locationName":"Brooklyn, NY","temperature":91.0,"timestamp":1564428897,"longitude":-73.99,"latitude":40.7}]

curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations  0.01s user 0.01s system 0% cpu 4.234 total

The request took 4.2 seconds. Let’s run exactly the same thing again:

$ time curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations

[{"locationName":"Oxford, UK","temperature":64.0,"timestamp":1564428898,"longitude":-1.25,"latitude":51.75},{"locationName":"Brooklyn, NY","temperature":91.0,"timestamp":1564428897,"longitude":-73.99,"latitude":40.7}]

curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations  0.01s user 0.01s system 20% cpu 0.107 total

0.1 seconds this time. Why is it a lot quicker? Because the first command corresponded to a cold start, and the second command corresponded to a warm invocation.

The round-trip latency from my computer in New York; to API Gateway, Lambda, DynamoDB in the AWS us-east-1 region; and back is 0.1 seconds - not too shabby! We can also see that the pre-SnapStart cold start is roughly 4 seconds.

Before we move on I want to time how long deployment takes so that we can compare with what happens when SnapStart is enabled. Prefixing the sam deploy command with time I get the following:

$ mvn package && \
  time sam deploy --s3-bucket YOUR_CLOUDFORMATION_BUCKET \
    --stack-name snapstart-java11-example --capabilities CAPABILITY_IAM

...

sam deploy --s3-bucket  --stack-name snapstart-java11-example --capabilities   2.02s user 0.25s system 5% cpu 41.299 total

With this example, it takes about 41 seconds to deploy a new version of the Lambda function without SnapStart enabled.

Turn on SnapStart

Preparation wasn’t too hard; now it’s time to turn on SnapStart.

In the Globals section of the SAM template I add the SnapStart property (GitHub diff here):

Globals:
  Function:
    # ...
    SnapStart:
      ApplyOn: PublishedVersions
    # ...

… and that’s it! Now we can rebuild and redeploy the function.

Let’s call the /locations API again. Here’s the cold call:

$ time curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations

[{"locationName":"Oxford, UK","temperature":64.0,"timestamp":1564428898,"longitude":-1.25,"latitude":51.75},{"locationName":"Brooklyn, NY","temperature":91.0,"timestamp":1564428897,"longitude":-73.99,"latitude":40.7}]

curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations  0.02s user 0.01s system 1% cpu 2.154 total

And the warm call:

$ time curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations

[{"locationName":"Oxford, UK","temperature":64.0,"timestamp":1564428898,"longitude":-1.25,"latitude":51.75},{"locationName":"Brooklyn, NY","temperature":91.0,"timestamp":1564428897,"longitude":-73.99,"latitude":40.7}]

curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations  0.02s user 0.01s system 21% cpu 0.133 total

The first call has come down from 4.2 seconds to 2.1 seconds, and the warm invocation is unchanged at 0.1 seconds. Therefore the cold start has been reduced to about 2 seconds - a 50% improvement.

Finally, before we move on to optimization, let’s look at deploy time. If I make a small code change and redeploy I see the following timings:

sam deploy --s3-bucket  --stack-name snapstart-java11-example --capabilities   6.73s user 0.53s system 4% cpu 2:41.12 total

From this we can see it’s taking about 2 minutes longer to deploy when SnapStart is enabled than before we turned on SnapStart. As I mentioned in Part 1 - this is enough extra time for me to want to only use SnapStart in production and test environments, but not in development environments.

Optimize using hooks

So far all we’ve done is make a tiny configuration change. This is already an improvement, but we’re not into the one-second cold start territory that AWS promised. What’s going on?

Back in Part 1 I had this (very basic!) diagram of what happens during cold start:

Usual Cold Start Process

So far we’ve only optimized the first two steps - “Process startup” and “Handler initialization”. Out of the box, SnapStart performs these tasks during snapshotting so that we don’t need to do them at cold start time. However, just turning on SnapStart doesn’t do anything about the third step: “First invocation”. In other words we’re seeing a delay caused by running the code in the handler method for the first time. How long is that delay? Well, let’s take a look.

I’ve added some extremely simple logging to the code to get an idea of how long each step of the handler is taking (GitHub diff here). I don’t recommend this style of observability for a “real” project, but it’s simple enough to work well for an example!
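
To give an idea of the style, here is a minimal sketch of that kind of marker logging (the class and marker positions are illustrative - the real diff puts seven markers through the actual handler). Each println produces a timestamped line in CloudWatch Logs, so the gap between two markers shows how long the code between them took:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

public class MarkerLoggingSketch {
    private final AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.defaultClient();

    public ScanResult queryWithMarkers(String tableName) {
        System.out.println("** 1 **");                             // entered the method
        final ScanRequest scanRequest = new ScanRequest().withTableName(tableName);
        System.out.println("** 2 **");                             // request built, about to call DynamoDB
        final ScanResult scanResult = dynamoDB.scan(scanRequest);
        System.out.println("** 3 **");                             // DynamoDB call returned
        return scanResult;
    }
}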

When we run the query the first time after deploying, therefore triggering a cold start, we see the following in the logs:

2023-01-23T16:56:37.248-05:00	START RequestId: a738f540-09f2-41e4-b147-25beab680b5f Version: 12
2023-01-23T16:56:37.285-05:00	** 1 **
2023-01-23T16:56:37.285-05:00	** 2 **
2023-01-23T16:56:37.285-05:00	** 3 **
2023-01-23T16:56:37.294-05:00	** 4 **
2023-01-23T16:56:38.517-05:00	** 5 **
2023-01-23T16:56:38.531-05:00	** 6 **
2023-01-23T16:56:38.543-05:00	** 7 **
2023-01-23T16:56:38.559-05:00	END RequestId: a738f540-09f2-41e4-b147-25beab680b5f
2023-01-23T16:56:38.559-05:00   REPORT RequestId: a738f540-09f2-41e4-b147-25beab680b5f	Duration: 1315.48 ms Billed Duration: 1436 ms Memory Size: 1769 MB	Max Memory Used: 140 MB	Restore Duration: 252.75 ms	Billed Restore Duration: 120 ms

What is this telling us? First of all, it’s taking 1.3 seconds to run through the code. My end-to-end external latency on that call was 2.5 seconds, so a rough breakdown of where the time is going is as follows:

  • Usual latency (warm invocation): 0.1 seconds
  • Handler duration: 1.3 seconds
  • Snapshot restore: 1.1 seconds (2.5 - 0.1 - 1.3)

Snapshot restore is the “SnapStart ‘cold start’” time. It’s the theoretical lowest latency we should expect to see with this Lambda Function, even if the actual handler didn’t do anything. About 1 second is typical in what I’ve seen so far across my experiments, but I hear from others that sub-second cold starts are possible.

But what about those 1.3 seconds that are happening in the code - what’s going on there? And what can we do about it?

That 1.3 seconds, in this case, is mostly down to two things: connecting to DynamoDB, and Java JIT (Just-In-Time) compilation. If we look at the timings in the logs carefully we can see that almost all of the time - just over 1.2 seconds - is spent between markers 4 and 5. Here’s the code between those two log statements:

final ScanResult scanResult = dynamoDB.scan(scanRequest);

This looks simple, but there’s a lot going on. The code really does query DynamoDB, but we know from the warm invocations that the query itself takes less than 0.1 seconds. So there must be over a second spent here just setting up the DynamoDB connection and performing JIT compilation.

So what can we do about it? This is the point of SnapStart hooks. Hooks allow us to run any code we want during the snapshot and restore phases. The following describes how to add hooks, and you can see the code diff here.

Hooks Step 1 - Add the CRaC dependency

In order to use hooks we need to update the code - specifically, we need to implement a particular interface. First, though, we need to add a dependency on the CRaC library. In Maven we add the following dependency:

<dependency>
    <groupId>io.github.crac</groupId>
    <artifactId>org-crac</artifactId>
    <version>0.1.3</version>
</dependency>

Hooks Step 2 - Implement the CRaC Resource interface

We are going to write some code that runs at snapshot time, i.e. OUTSIDE of the usual invocation / handler flow. The code we write needs to satisfy the org.crac.Resource interface, which has two methods - beforeCheckpoint and afterRestore.

We can do this by updating the Lambda Handler class code as follows:

// New imports for the CRaC hook types
import org.crac.Context;
import org.crac.Resource;

public class WeatherQueryLambda implements Resource {
    // ... existing fields (dynamoDB, tableName) and the handler method ...

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        System.out.println("In beforeCheckpoint");
        // A tiny scan - enough to set up the DynamoDB connection and JIT the query path
        new ArrayList<>(dynamoDB.scan(new ScanRequest().withTableName(tableName).withLimit(1)).getItems());
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        System.out.println("In afterRestore");
    }
}

beforeCheckpoint is run during snapshotting, which happens at deployment time. The code in this implementation of beforeCheckpoint performs a small scan on the DynamoDB table. Behind the scenes this sets up the connection to DynamoDB and triggers JIT compilation of the relevant code.

We’re not doing anything interesting here in afterRestore - I’ll come back to that.

Hooks Step 3 - Register the Resource

Just making the Lambda Handler class implement the Resource interface is not sufficient for Lambda to run the hooks, though - we also need to register the CRaC resource. We can do this by adding the following constructor to the Lambda Handler class:

    public WeatherQueryLambda() {
        Core.getGlobalContext().register(this);
    }

Core here is a class from the same CRaC library we added earlier.

IMPORTANT - this is the main part of your code that will likely change depending on the application framework your code uses - Spring, Micronaut, etc.

Result

After building and deploying this change we can perform a cold query again to see the difference in latency:

$ time curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations

[{"locationName":"Oxford, UK","temperature":64.0,"timestamp":1564428898,"longitude":-1.25,"latitude":51.75},{"locationName":"Brooklyn, NY","temperature":91.0,"timestamp":1564428897,"longitude":-73.99,"latitude":40.7}]

curl https://3zjesials5.execute-api.us-east-1.amazonaws.com/Prod/locations  0.02s user 0.02s system 2% cpu 1.306 total

1.3 seconds! That’s immediately better. Now let’s look at the logs:

2023-01-23T17:34:54.413-05:00	START RequestId: 58c8430b-96f0-42cd-b5e9-5451d3c508cc Version: 13
2023-01-23T17:34:54.441-05:00	** 1 **
2023-01-23T17:34:54.441-05:00	** 2 **
2023-01-23T17:34:54.441-05:00	** 3 **
2023-01-23T17:34:54.441-05:00	** 4 **
2023-01-23T17:34:54.582-05:00	** 5 **
2023-01-23T17:34:54.584-05:00	** 6 **
2023-01-23T17:34:54.619-05:00	** 7 **
2023-01-23T17:34:54.632-05:00	END RequestId: 58c8430b-96f0-42cd-b5e9-5451d3c508cc
2023-01-23T17:34:54.632-05:00   REPORT RequestId: 58c8430b-96f0-42cd-b5e9-5451d3c508cc	Duration: 222.00 ms	Billed Duration: 379 ms	Memory Size: 1769 MB	Max Memory Used: 138 MB	Restore Duration: 259.78 ms	Billed Restore Duration: 157 ms

Sure enough, it’s now only taking about 0.2 seconds to run the code during the first invocation. Most of that time is still spent on the same line of code, but we’ve shaved another second off the cold start by adding a beforeCheckpoint hook.

For a real application I would go through this process in more detail, looking to shave more time off the “first invocation” step and get closer to that theoretical minimum latency of about one second.
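
As a sketch of what a next pass might look like, some of the remaining time could be first-use JIT of the response-mapping and JSON-serialization code, which we could also prime in beforeCheckpoint. Note that the ObjectMapper field below is an assumption about how a handler like this serializes its response - it is not necessarily what the example application does:

    // A sketch only - extends the beforeCheckpoint shown earlier so it also
    // exercises the response-serialization path. The objectMapper field is a
    // hypothetical addition to the handler class
    // (requires import com.fasterxml.jackson.databind.ObjectMapper).
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        System.out.println("In beforeCheckpoint");
        // Prime the DynamoDB connection and JIT the query path, as before
        final ScanResult scanResult =
                dynamoDB.scan(new ScanRequest().withTableName(tableName).withLimit(1));
        // Also JIT the code that turns items into a JSON string
        objectMapper.writeValueAsString(scanResult.getItems());
    }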

afterRestore

So far I’ve only talked about the beforeCheckpoint hook, which is used to optimize SnapStart cold start time. There is also another hook - afterRestore. This second hook is more about efficiency: because a snapshot may be restored days or weeks after it was created, you may need to reset or refresh state that has gone stale. You could perform such work in the regular handler function, but afterRestore gives you a place to put code that will only be called once for the lifetime of a particular instance of your Lambda Function.
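
Here’s a hedged sketch of what an afterRestore implementation could look like - the cached configuration field and the helper that reloads it are hypothetical, not part of the example application:

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        System.out.println("In afterRestore");
        // Hypothetical: anything captured in the snapshot that may have gone stale -
        // cached configuration, temporary credentials, a seeded random source - can
        // be refreshed here, once per restored instance of the function.
        this.cachedConfig = loadConfigFromParameterStore(); // hypothetical field and helper
    }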

Nuances of hooks

Using hooks - especially beforeCheckpoint - is tricky. First of all, you need to decide what you can safely do in a production environment. For example, here we’ve only added a hook to the read / query Lambda function, and it performs an actual read of the DynamoDB table. But what would we do for the POST function - would it be safe to write to the DynamoDB table? If not, the beforeCheckpoint code would likely need to be more complicated, and we likely won’t be able to “pre-JIT” all of the code.
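
One option in that situation - and this is a sketch of the idea, not something the example application does - is to exercise the DynamoDB client with a read-only call such as DescribeTable, which warms the connection and much of the SDK code without mutating any data:

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // For a function that writes to DynamoDB we may not want a real write at
        // snapshot time. A read-only call still establishes the HTTP connection and
        // JITs much (though not all) of the SDK code that the write path will use.
        dynamoDB.describeTable(tableName);
    }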

Next up is the question of where to put the hook code. In this example we’ve implemented it on the handler class itself, which definitely makes some sense for pre-JITting. But you don’t have to implement it there, and if your application consists of multiple Lambda functions / handlers you may want to centralize that code (see the sketch after this list). If you do, be aware of a few things:

  • While the documentation and AWS examples put the registration call (Core.getGlobalContext().register(...)) within the constructor of the class containing the Lambda handler function, it’s not currently documented that this is where it needs to be. My best guess is that the SnapStart process always instantiates the class containing the Lambda handler during snapshotting, before running any registered CRaC resources, and that performing registration in the constructor (or in code called by the constructor) is therefore the right thing to do. But be aware of this, especially if you’re using an application framework.
  • If the registration command registers anything other than the Lambda handler class itself then be careful with the “weak references” point in the SnapStart documentation.
  • Snapshotting will occur many times per deployment. When I look over the whole Log Group for the Lambda function I see this:
2023-01-23T17:18:29.936-05:00	In beforeCheckpoint	2023/01/23/[13]790072611be84fcaad62ea299330eeea
2023-01-23T17:18:29.958-05:00	In beforeCheckpoint	2023/01/23/[13]bee4abc83429423aaa29c21a487ff331
2023-01-23T17:18:29.962-05:00	In beforeCheckpoint	2023/01/23/[13]17027f77947a45f6bb8d66e55f1288e2
...

In fact this happens many more times - typically about 20 times per deployment. For high-throughput Lambda Functions this extra load will be negligible, but for lower-throughput Functions it might be an unexpected bump in load on downstream resources.
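
To illustrate the centralization idea from the first two bullets, here’s the kind of helper I have in mind - a sketch only, written defensively around the “weak references” caveat by keeping strong references to everything it registers:

import java.util.ArrayList;
import java.util.List;

import org.crac.Core;
import org.crac.Resource;

// Hypothetical helper for applications with several handlers: one place to
// register CRaC resources, keeping strong references so they can't be garbage
// collected before their hooks run.
public final class SnapStartHooks {
    private static final List<Resource> STRONG_REFERENCES = new ArrayList<>();

    public static synchronized void register(Resource resource) {
        STRONG_REFERENCES.add(resource);            // keep a strong reference
        Core.getGlobalContext().register(resource); // hand it to the CRaC context
    }

    private SnapStartHooks() {
    }
}

Each handler’s constructor would then call SnapStartHooks.register(this) rather than calling Core.getGlobalContext().register(this) directly.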

Summary

In this article I’ve shown you that using SnapStart in your Java Lambda Functions consists of three steps:

  1. Prepare - use Java 11, Function Versions / Aliases, and check for any other requirements of SnapStart
  2. Enable - update your infrastructure-as-code tool to turn on SnapStart
  3. Optimize - use the beforeCheckpoint hook to reduce the “first handler invocation” latency, and afterRestore hook for refreshing state created at snapshot time.

Using a simple, but realistic, example we saw the following reductions in Cold Start time:

  • No SnapStart - 4 seconds cold start
  • SnapStart without hook-based optimization - 2 seconds cold start
  • SnapStart with first pass of hook-based optimization - 1.2 seconds cold start

Using the beforeCheckpoint hook is key to getting towards, or below, a one-second cold start for many applications, but requires careful thought. Precise implementation will depend on your JVM language and application framework.

SnapStart adds about 2 minutes to deployment time at time of writing, so you may not want to use it in development environments.

If you’d like to chat with me about this then feel free to email me at mike@symphonia.io, or you can find me on Mastodon.