AWS Lambda SnapStart - What, and Why

Symphonia
Mike Roberts
Jan 11, 2023

At re:Invent 2022 AWS announced “SnapStart” - a new Lambda capability to drastically improve cold starts when using Lambda with Java, or other JVM languages. Most teams using Lambda and Java don't need SnapStart, but it should provide the necessary performance improvement for teams that do. This article will help you decide whether SnapStart is useful for your applications, and in fact whether it enables you to use Lambda at all when you haven't been able to previously.

Nearly eight years ago AWS announced that Lambda - their serverless Functions-as-a-Service platform - supported code written for the Java Virtual Machine. Since then many organizations have had success using Lambda and Java, and I even wrote a book on the subject.

The perennial question about using Lambda and Java has always been “but what about cold starts?” Cold starts are the delays that occur when the Lambda platform starts up a new instance of an application. Such delays can be a problem because (a) Lambda starts and stops processes much more frequently than a traditional platform and (b) it does so using a “just in time” strategy.

My take on cold starts is that they usually aren't a problem under production workloads. Sometimes, however, they are. And even the possibility of cold starts being a problem is often enough to put teams off trying Lambda.

At re:Invent 2022 AWS announced a major new feature for Lambda - SnapStart. SnapStart's goal is to remove the cold start latency problem for most Lambda applications using Java / the JVM.

In this two-part series I dive into SnapStart. In this first article I explain the “What, and Why”. SnapStart isn't entirely a free lunch, and it's worth understanding the pros and cons before turning it on. This first article is useful for people making decisions about where to run code in AWS. I have also written a second article that digs into the “How” - which is useful for developers responsible for using SnapStart with their code.

What is Lambda SnapStart?

To understand SnapStart, and whether you need it, you first need an understanding of how Lambda works in the typical case, so I'll start there.

What happens during a cold start?

Lambda's execution model is that one application process will handle only one event, or request, at any one time. As I mentioned above Lambda will create such processes in a “just in time” manner, which means that Lambda will start a new process only when it's needed. This startup activity is a cold start. Cold starts occur when a new application version is deployed; when Lambda needs to scale up the number of available function processes; after several minutes of inactivity; and on a few other occasions.

In a constantly active production application most events are handled as “warm invocations” by Lambda Function instances that have already been initialized and have finished processing their most recent request or event. Cold starts typically only occur for a small subset of production activity.

Lambda's event-separation model being process-based means a cold start includes the following series of activities:

  1. Starting the Lambda Runtime process, with your code
  2. Instantiating your Lambda “handler” code
  3. Running the handler for the first time

[Figure: Usual Cold Start Process]

The first two activities in this list are known as the “Init” phase of running a Lambda function. They add some of the extra latency of a cold start vs a warm invocation because those activities only run during a cold start.

The third activity - actually running the handler code - will also run for warm invocations, however there is usually some delay running code for the first time in a process. In the case of Java / JVM code this is due to JIT compilation, and in all languages there is often some amount of lazy-loading / lazy initialization of data or network connections.
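
To make the split between the Init phase and the first handler call concrete, here is a minimal sketch in plain Java. The class name, the lookup table, and the counter are all hypothetical, and the class deliberately does not use the real Lambda runtime interfaces; it only shows where work lands when a field is set in the constructor (step 2) versus lazily on the first call (step 3).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical handler class; it does not use the real Lambda runtime interfaces.
class ColdStartPhases {
    // Counts how many times the (pretend) expensive lookup table was built.
    static int tableBuilds = 0;

    // Set in the constructor, i.e. during the "Init" phase (step 2).
    private final String greeting;

    // Built lazily on the first call to handle(), i.e. during step 3.
    private Map<String, Integer> lookupTable;

    ColdStartPhases() {
        this.greeting = "hello";
    }

    String handle(String key) {
        if (lookupTable == null) {
            // Only the first invocation in this process pays this cost.
            lookupTable = buildLookupTable();
        }
        return greeting + ":" + lookupTable.getOrDefault(key, -1);
    }

    private static Map<String, Integer> buildLookupTable() {
        tableBuilds++;
        Map<String, Integer> table = new HashMap<>();
        table.put("a", 1);
        table.put("b", 2);
        return table;
    }

    public static void main(String[] args) {
        ColdStartPhases handler = new ColdStartPhases(); // Init phase
        System.out.println(handler.handle("a"));         // first invocation: builds the table
        System.out.println(handler.handle("b"));         // warm invocation: reuses the table
        System.out.println("table builds: " + tableBuilds);
    }
}
```

In this sketch the table is built once per process, on the first call only, which is the kind of first-invocation delay described above, on top of whatever JIT compilation adds.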

What does SnapStart change?

Simply put, SnapStart creates a pre-initialized snapshot of a Lambda function, and uses that snapshot at cold start instead of the standard cold start sequence described above.

More concretely, when you deploy a Lambda function that has SnapStart enabled the Lambda platform will run the “Init” phase - steps 1 and 2 above - as part of the deployment, i.e. before actually using the new function code for requests. Once the Init phase during SnapStart is complete Lambda will create a snapshot image of the entire function - including memory and local disk - and use that cached image during a cold start rather than go through the usual startup process.

[Figure: SnapStart Process]

SnapStart actually does a little more than that though. I mentioned above that part of the reason for cold starts being slow was also because of step 3 above - running your handler for the first time. SnapStart enables you to have an optional “hook”, named beforeCheckpoint, that allows you to simulate a handler call during the snapshotting process. You can use this to load external data into memory (which will be part of the snapshot image) and/or execute code paths to “pre JIT” your Java code. This means that your first invocation after a snapshot should be faster than during a standard cold start.

SnapStart also provides a second hook function, named afterRestore, that allows you to run code that you want to happen before the first invocation after a restore, but not during later warm invocations.
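
The ordering of the two hooks can be sketched as a small simulation. Everything here is an assumption made for illustration: the SnapshotHooks interface is a local stand-in that only borrows the two hook names from above (the real hooks come from a library and have their own signatures and registration step, covered in the follow-up article), and the main method plays the part of the Lambda platform by calling the hooks by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for the real hook interface; only the two hook names are taken from the article.
interface SnapshotHooks {
    void beforeCheckpoint();
    void afterRestore();
}

// Hypothetical handler that primes itself before the snapshot is taken.
class PrimedHandler implements SnapshotHooks {
    final List<String> events = new ArrayList<>();
    private String referenceData; // loaded before the snapshot, so it is part of the image
    private String connectionId;  // per-instance state, created only after a restore

    @Override
    public void beforeCheckpoint() {
        events.add("beforeCheckpoint");
        // Load data and exercise the handler path so both end up in the snapshot.
        referenceData = "reference-data";
        handle("warm-up");
    }

    @Override
    public void afterRestore() {
        events.add("afterRestore");
        // State that must not be shared between restored instances is set up here.
        connectionId = "conn-" + System.nanoTime();
    }

    String handle(String input) {
        events.add("handle:" + input);
        return referenceData + "/" + input;
    }

    boolean hasConnection() {
        return connectionId != null;
    }

    public static void main(String[] args) {
        PrimedHandler handler = new PrimedHandler();
        handler.beforeCheckpoint(); // at deployment time, just before the snapshot
        handler.afterRestore();     // at cold start, just after the snapshot is restored
        System.out.println(handler.handle("request-1"));
        System.out.println(handler.events);
    }
}
```

The point of the sketch is the sequence: the simulated handler call inside beforeCheckpoint happens once at deployment, while afterRestore runs once per restored instance and never during later warm invocations.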

SnapStart will re-create snapshots over the course of days or weeks that the function code is deployed. This means that any underlying changes to the Lambda platform (e.g. security updates) will be reflected in the cached image.

SnapStart caveats

SnapStart definitely improves cold start performance for JVM Lambda functions. However, there are caveats to using it, which you should be aware of before deciding whether to use it with your own applications. Please note that these are all the case at time of writing - January 2023 - and when you read this some of them may be mitigated, so if in doubt refer to the AWS documentation.

First of all, SnapStart is only for JVM Lambda Functions, and specifically only for JVM Lambda functions using the standard AWS Java 11 Runtime (not Java 8, or custom runtimes.) Further, SnapStart can only be used with x86 architecture, and it does not support use of provisioned concurrency, EFS, X-Ray, or ephemeral storage greater than 512 MB. This is quite a large list!

So those are the blocking concerns, on to some fuzzier issues.

Unsurprisingly, SnapStart increases deployment time. In my experience deployment time was increased by about 2 minutes, even with a small example. Because of this I wouldn't typically want to use SnapStart at development time - this is just too much of a slow down for me. However, that introduces a difference between your production and development environments, which isn't great. As such, if using SnapStart in production but not in development you probably want to also have at least one test environment that uses SnapStart.

SnapStart requires using Function versions. I don't use versions by default, but adding them isn't too onerous, especially when using an “automatic latest” alias (I explain more on how to do this in my follow-up article).

While SnapStart automatically solves a lot of the “Init phase” problems that occur during cold start (steps 1 and 2 above), it doesn't solve the initial handler call latency (step 3 above) without extra development work. To improve the “step 3” latency your code needs to implement the beforeCheckpoint hook, which is not called during regular Lambda invocations. This is a caveat because (a) it will result in more code that needs to be maintained and extended as the rest of your Lambda code changes and (b) it requires some careful thought when wanting the most startup-time speedup possible. For example if you want to exercise code paths that make database calls you're going to need to decide what “fake” calls to make during the snapshot process, and if this is not appropriate for your application then you'll still be subject to cold start slowdown.

One thing you can do during the snapshot process in the beforeCheckpoint hook is to load data and initialize network connections that would otherwise occur during the first invocation. While this is definitely useful, such data and connections may be stale by the time the function image is actually uncached. This is already a concern for Lambda functions, where a function instance may be up to a few hours old when it is run, however it's exacerbated by SnapStart, where the state may be days or weeks old. You can always refresh the state, but that's going to add latency back.
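
One way to handle staleness is to record when the state was loaded and to reload it once it is older than some limit. The sketch below assumes a hypothetical CachedConfig class with a made-up loader; real code would call a database or configuration service instead. The current time is passed in explicitly so that the example can pretend days have passed between the snapshot and the restore.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical cache that reloads its value once it is older than maxAge.
class CachedConfig {
    private final Duration maxAge;
    private String value;
    private Instant loadedAt;
    int loads = 0; // how many times the loader has run

    CachedConfig(Duration maxAge) {
        this.maxAge = maxAge;
    }

    // Made-up loader; stands in for a remote call.
    private String load() {
        loads++;
        return "config-v" + loads;
    }

    // Returns the cached value, reloading first if it is missing or too old.
    String get(Instant now) {
        boolean stale = value == null
                || Duration.between(loadedAt, now).compareTo(maxAge) > 0;
        if (stale) {
            value = load();
            loadedAt = now;
        }
        return value;
    }

    public static void main(String[] args) {
        Instant snapshotTime = Instant.parse("2023-01-11T00:00:00Z");
        CachedConfig config = new CachedConfig(Duration.ofHours(1));

        // Loaded at snapshot time, e.g. from a pre-snapshot hook.
        System.out.println(config.get(snapshotTime));
        // Ten minutes later the cached value is still considered fresh.
        System.out.println(config.get(snapshotTime.plus(Duration.ofMinutes(10))));
        // Three days later, after a restore, the value is reloaded.
        System.out.println(config.get(snapshotTime.plus(Duration.ofDays(3))));
    }
}
```

The reload in the last call is exactly the latency that gets added back, so the maximum age is a trade-off between freshness and cold start time.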

And finally there is the concern of random data seeding, which is especially problematic for encryption. Often an application will use a random number generator for certain data, and some random number generators are based on the state of the process when it first started. Since this state is defined at snapshot-time for SnapStart enabled functions, such functions may have duplicate “random” values across multiple instances. Most modern applications use “secure random” libraries that don't have this problem, but older applications may need updating. For more, see the AWS documentation.
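
The duplicate-values problem can be illustrated with the standard library alone. This is a model rather than a real snapshot: the sketch assumes that two instances restored from one image begin with identical generator state, and imitates that by giving two java.util.Random objects the same seed. The class and method names are hypothetical.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical demonstration class; it models a snapshot by reusing one seed.
class SnapshotRandomness {
    // Two generators with the same starting state, as if both had been
    // restored from a single snapshot, return the same first value.
    static long[] firstValuesFromSameState(long snapshotSeed) {
        Random restoredA = new Random(snapshotSeed);
        Random restoredB = new Random(snapshotSeed);
        return new long[] { restoredA.nextLong(), restoredB.nextLong() };
    }

    // Creates a token from a SecureRandom constructed at call time,
    // rather than from state captured earlier.
    static String freshToken() {
        byte[] bytes = new byte[16];
        new SecureRandom().nextBytes(bytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        long[] values = firstValuesFromSameState(42L);
        System.out.println("same value in both instances: " + (values[0] == values[1]));
        System.out.println("token: " + freshToken());
    }
}
```

The first method shows why a generator created during initialization is risky under SnapStart; the second shows the shape of the usual fix, which is to draw security-sensitive values from a secure source at the time they are needed.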

In summary there are plenty of things to think about if you are choosing whether to use SnapStart.

Only Java?

At time of writing SnapStart is only available for Java Lambda functions. I think it surprised a lot of people when SnapStart - a major new Lambda feature - was announced for a language that many thought was an “also ran” of the serverless world. I don't know specifically why SnapStart was launched with only Java support initially, but here are my best guesses:

  • A lot of companies are running Java code on AWS, even if they don't talk about it that much, and would like to use Lambda more. AWS has heard these customers, and SnapStart is the solution for them.
  • There's no doubt that cold starts impact Java-based applications much more than most other runtimes. SnapStart is therefore going to have the most positive impact for JVM languages.
  • There is already momentum in the larger Java community, that AWS were able to build on, to use snapshot-based approaches to improve startup time. For example SnapStart uses the community CRaC project for the pre-/post- snapshot hook interface.

Why use SnapStart?

The simple answer to “why use SnapStart” is that you have a Lambda + Java application that is suffering from cold start slowdowns which cause production latency problems. What are such problems? For most people it's latency spikes that are more than a second long. Certainly for heavyweight Java functions, using something like Spring Framework, 5 - 10 second cold starts are common.

Initial experimentation, and reports from other people, suggest that SnapStart puts Java squarely in the “less than, or about a second” cold start club, and can, in theory, enable cold starts around half a second. This already compares well with other runtimes, but when you add the programmatic beforeCheckpoint hook you might be able to start faster than any other runtime if you perform a lot of loading of data at startup.

There are a host of underlying reasons why long cold starts can be a problem, so let's look at those, along with previous mitigations which may no longer be necessary.

Lambda-backed APIs, or other synchronous usage

SnapStart is going to be mostly useful where Lambda functions are invoked synchronously, and the most common example of this is when a Lambda function is implementing an HTTP API. That's because with an HTTP API you typically want sub-second response time, and a cold start can add multiple seconds to such responses.

There are many ways that this problem has been solved in the past. The escape hatch has been “don't use Lambda” - for example instead use a container-packaged application running in ECS or Kubernetes. SnapStart may be enough to revisit that decision.

By comparison, SnapStart is not useful for most asynchronous usage. For example, if you have a Lambda Function processing messages from a message queue you are typically going to be less concerned about the kind of latency that cold starts introduce. As such I would usually recommend against turning on SnapStart for asynchronously invoked functions.

Spring, and alternative JVM languages

JVM applications in general aren't typically known for their fast startups. There are various reasons for this:

  • The JVM is a comparatively heavyweight runtime, and takes some time to start
  • The JVM uses JIT (just-in-time) compilation of bytecode, which results in extra work at startup
  • Many teams use frameworks that make extensive use of reflection at startup. This is great for programming, but slows down initialization. By far the most popular of these frameworks are the Spring family of frameworks.
  • While Java is the most popular language for the JVM there are many alternatives - like Scala, Kotlin, and Clojure. Each of these languages adds further startup latency.

Before SnapStart there wasn't much to be done about the first two of these concerns, apart from configuring Lambda functions with enough memory and CPU to speed up startup as much as possible.

However, there was a solution to the latter two points, which, in summary, is “don't use Spring or an alternative language when latency is a concern”. I've known teams to use Scala + Lambda successfully in the past, but it was in a context where latency wasn't a significant problem. But I've also known teams to try to use Spring with Lambda, get frustrated with cold starts, and give up on Lambda.

I think with SnapStart available as a backstop there's now a legitimate case for saying “build your JVM Lambda apps with whatever language and framework you want”. I still think overall it's best to make your Lambda functions as lightweight as possible, but at least teams that are very used to Spring, or use something like Scala or Kotlin, now have a relatively simple option for building low-latency Lambda applications.

When other cold start mitigations aren't enough, or add their own problems

I've written extensively before about working around cold start problems (see here, for example), and I still think some of those methods are better than SnapStart, due to the caveats I listed earlier.

However, other mitigations (like “don't use Spring”) aren't going to work out for some teams. As another example one mitigation is to use something like GraalVM instead of a standard JVM. There are still going to be occasions when this is a better option than SnapStart, but the GraalVM solution is a lot more work than SnapStart, and has a host of its own caveats.

Conclusion

SnapStart is a good new addition to Lambda's capabilities. Most teams, even Java teams, aren't going to need it. But just the fact that SnapStart exists now provides an “anxiety-reducing” aspect to Lambda that will make Java teams more willing to invest their time in using Lambda as their compute platform.

If you choose to use SnapStart, and would like an overview on how to use it, then skip on over to “Part 2” for my SnapStart “how-to guide”.

If you'd like to chat with me about this then feel free to email me at mike@symphonia.io, or you can find me on Mastodon.

Further Reading

AWS provided 3 blog articles when SnapStart was launched - here, here, and here. For further information see the AWS documentation.