Learning Lambda — Part 9

Scaling and State

Symphonia
Mike Roberts
Nov 16, 2017

This is Part 9 of Learning Lambda, a tutorial series about engineering using AWS Lambda. To see the other articles in this series please visit the series home page. To be alerted about future installments subscribe to our newsletter, follow Symphonia on Twitter at @symphoniacloud, and read our blog, The Symphonium.

So far in this series we've only been talking about processing a small number of events with Lambda, one after the other. This is a very common pattern of use for many Lambda applications, especially those that are used to help Technical Operations in some regard. However, one of the joys of Lambda, and Serverless in general, is the automatic and vast scaling capability it gives us.

In this installment I describe and show how Lambda scales, and discuss what happens when we hit scaling limits.

Lambda's scaling model has a large impact on how we consider state, so I also talk about Lambda's instantiation and threading model, followed by a discussion on Application State & Data Storage, and Caching. Note that much of this part is applicable to non-scaled Lambda functions too, but scaling exacerbates the issues here.

Lambda's scaling cannot be considered in a vacuum, and so we'll also see what impacts it has on other services. Finally I mention Lambda's limited, but not trivial, vertical scaling capability.

Scaling

In Part 8 I said one of the times that a Cold Start occurs is “when Lambda needs to scale out because all current containers for the required function are already in the middle of processing events.”

This small comment opens up a world of wonder! Lambda will scale horizontally, to a massive extent, precisely when we need it to. If we never need more than one instance of our function to run at any one time, Lambda will happily use just one instance of our code at a time. However, if hundreds of concurrent instances of our code are necessary to handle load then Lambda will scale up automatically to support that load. No configuration or management: it just happens.

Even better for deployment-architecture-oriented folks — Lambda performs this scaling without any resource planning, allocation or provisioning on our part. It's a completely flexible computation engine, effectively acting as a ‘Massive mainframe in the cloud’.

Lambda will scale down just as easily as it scales up. Unused containers will gradually be retired once they haven't been used for a few minutes. Not that we care, of course, since with Lambda we're only charged for the time our Lambda functions are actually processing an event. In fact our use of Lambda costs exactly the same whether 500 events are processed in parallel by 500 containers, or serially by one container (ignoring any extra time required for initialization.)

So if scaling is so simple, why do we need an entire article to talk about it? It turns out that (spoilers!) Lambda scaling isn't, in fact, infinite. Furthermore the way Lambda scales gives us a set of architectural points to consider when we're building components this way.

But first let's see an example of Lambda's magical auto-scaling.

Observing Lambda Scaling

In Part 8 I used an example of a Lambda function that tracked an identifier across function invocations:

When we invoked this a few times in succession (without changing the function's code or configuration) we saw that the function returned the same container ID every time because the same instance of our Lambda function was getting used. In other words the Lambda platform didn't need to perform any scaling in order to handle the event load.

Let's change this code a little by adding a sleep() statement that delays returning to the caller.

When you build and deploy this code make sure the function's Timeout configuration setting is at least 6 seconds, otherwise you'll see a good example of a Timeout error.
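As a rough illustration, here is a minimal sketch of what such a function might look like. The class name, the UUID-based container ID, and the 5 second delay are assumptions made for this sketch, not the series' original listing. The handler blocks unconditionally so the container stays busy; the second constructor only exists so the delay can be shortened when running locally.

```java
import java.util.UUID;

// Hypothetical handler class; declare it public in its own file when deploying.
class SlowContainerIdLambda {
    // Initialized once when the class is loaded, so once per Lambda container.
    private static final String CONTAINER_ID = UUID.randomUUID().toString();

    private final long sleepMillis;

    // The Lambda runtime instantiates the handler through the no-argument constructor.
    public SlowContainerIdLambda() {
        this(5_000); // assumed 5 second delay, inside a 6 second Timeout setting
    }

    // Second constructor so the delay can be shortened for local experiments.
    public SlowContainerIdLambda(long sleepMillis) {
        this.sleepMillis = sleepMillis;
    }

    // Plain POJO handler method: keeps this container busy, then reports its ID.
    public String handler(Object input) {
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sleeping", e);
        }
        return CONTAINER_ID;
    }
}
```

While one invocation is sleeping, that container cannot accept another event, which is what forces the platform to create a new one.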

Now invoke the function several times in parallel. I do this by running the same command from multiple terminal tabs. Depending on how quick on the draw you are for navigating terminal sessions you'll now see that different container IDs are returned for different invocations.

This behavior is visible because when Lambda receives the second request to invoke your function, the previous container that was used for the first request is still processing that request, and so Lambda creates a new instance, automatically scaling out, to handle the second request. This creation of a new instance happens for the third and fourth request too, if you're fast enough.

This example invokes the Lambda function directly, but we see the same scaling behavior when Lambda is invoked by most event sources, including API Gateway, S3, or SNS, whenever one instance of a Lambda function is not sufficient to keep up with the event load. Magical auto-scaling, without any effort!

Scaling limits

AWS is not an infinite computer (nor an infinite improbability drive, fortunately) and there are limits to Lambda's scaling. Amazon limit the number of concurrent executions across all functions per AWS account, per region. By default this limit is 1000, but that's just at time of writing (it was 100 not too long ago), and you can make a support request to have this increased. Partly this limit exists because of the physical constraints of living in a material universe, and partly so that your AWS bill doesn't explode to astronomical proportions.

If you reach this limit you'll start to experience throttling, and you'll know this because the Throttles CloudWatch metric for your Lambda functions will suddenly have a value greater than zero. This makes it a great metric to set a CloudWatch Alarm for.

When your function is throttled the behavior exhibited by AWS is very similar to the behavior that occurs when your function throws an error (the behavior I described in Part 7) — in other words it depends on the type of event source. In summary:

  • For non-stream based synchronous sources (e.g. API Gateway) throttling is treated as an error and passed back up to the caller
  • For non-stream based asynchronous sources (e.g. S3) Lambda will retry calling your Lambda function for up to 6 hours
  • For stream-based sources (e.g. Kinesis) Lambda will block and retry until successful or the data expires

Stream-based sources, at the current time, also have extra scaling restrictions, based on the number of shards of your stream.

Since the concurrency limit is account-wide, one particularly important aspect to be aware of is that one Lambda function that has scaled particularly wide can impact the performance of every other Lambda function in the same AWS account + region pair. Because of this it is strongly recommended that, at the very least, you use separate AWS accounts for production and testing — deliberately DoS'ing (Denial-of-Servicing) your production application because of a load test against a staging environment is a particularly embarrassing situation to explain!

But beyond the production vs test account separation we also recommend using different accounts for different ‘services’ within your ecosystem, to further isolate yourself from the problems of account-wide limits. There's some effort required to manage all these different accounts, but the combination of AWS Organizations and some scripting will make this significantly easier.

For more detail on concurrent execution limits and throttling I recommend the official AWS documentation on the subject.

Threading and instantiation model

Because of the behavior we saw in our experiment a couple of sections ago we're able to infer some interesting aspects of the Lambda runtime instantiation and threading model:

  1. The Lambda runtime guarantees at most one event will be processed per container instance per Lambda function at any one time. In other words you never need to be concerned about multiple events being processed at the same time within a function's runtime, let alone within a function object instance. Further, unless you create any of your own threads, Lambda programming is entirely thread safe.
  2. The Lambda runtime will keep the same classloader/static environment per container/function instance. Our example code used a static member variable, and we saw that same value was consistently used across multiple invocations of the same container instance.
  3. We didn't demonstrate this, but the Lambda runtime will also keep the same function handler object instance per container instance. In other words the same handler object will be used for multiple invocations of a function, only being instantiated / constructed once per container instance. This has some important impacts on caching, which we'll talk about later in this article.
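To make these three points a little more concrete, here is a small sketch of a handler that counts how often it is constructed and how often it is invoked. The class and field names are made up for illustration. It relies on the behavior described above: plain, unsynchronized fields are enough because only one event at a time reaches a given function instance.

```java
// Hypothetical probe class; the names are illustrative only.
class InstantiationProbeLambda {
    // Point 2: static state lives as long as the container's classloader.
    private static int constructions = 0;

    // Point 3: instance state lives as long as the single handler object.
    private int invocations = 0;

    // Runs once per container instance, not once per event.
    public InstantiationProbeLambda() {
        constructions++;
    }

    // Point 1: no locking, since events never overlap within one instance.
    public String handler(Object input) {
        invocations++;
        return "constructions=" + constructions + ", invocations=" + invocations;
    }
}
```

Invoking the same warm container repeatedly should show the invocation count climbing while the construction count stays put; a freshly created container starts both counts again.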

Application State & Data Storage

The way that Lambda instantiates runtime containers, especially in the way that it scales, has massive implications on architecture. For example, we have absolutely no guarantee that sequential requests, for the same upstream client, will be handled by the same container / function instance. There is no ‘client affinity’ for Lambda functions.

This means that we cannot assume that any state that was available locally (in-memory, or on local disk) in a Lambda function for one request will be available for a subsequent request. This is true whether we scale or not — scaling just underlines the point.

Therefore all state that we want to keep across Lambda function invocations must be externalized. In simple terms this means that any state we want to keep beyond an invocation has to be either stored downstream of our Lambda function — in a database, external file storage, or other downstream service — or it must be returned to the caller in the case of a synchronously called function.
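The difference between local and externalized state can be sketched as follows. CounterStore is a hypothetical abstraction over an external store (a database table, say), and InMemoryCounterStore is only a local stand-in so the sketch runs without AWS; neither is a real AWS API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction over an external store.
interface CounterStore {
    long increment(String clientId);
}

// Local stand-in for the external store; a real one would call a database.
class InMemoryCounterStore implements CounterStore {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public synchronized long increment(String clientId) {
        return counts.merge(clientId, 1L, Long::sum);
    }
}

class VisitCounterLambda {
    private final CounterStore store;
    // Local state: visible only to this function instance, lost when it is retired.
    private final Map<String, Long> localCounts = new HashMap<>();

    VisitCounterLambda(CounterStore store) {
        this.store = store;
    }

    public String handler(String clientId) {
        long local = localCounts.merge(clientId, 1L, Long::sum);
        long external = store.increment(clientId);
        return "local=" + local + ", external=" + external;
    }
}
```

If two VisitCounterLambda objects, standing in for two containers, share one store and each handle a request from the same client, both report a local count of 1, while the external count reaches 2. Only the external number reflects what actually happened.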

This might sound like a massive restriction, but in fact this way of building server-side software is not new. Many of us have been espousing the virtues of 12-factor architecture for years, and this aspect of externalizing state is expressed within the 6th factor of that paradigm.

That being said, this definitely is a constraint, and may require you to significantly re-architect existing applications that you want to move to Lambda. It may also mean that some applications that require particularly low latency (for example gaming servers) are not good candidates for Lambda, and neither are those that require a large data set in memory in order to perform adequately.

A typical way that people build Lambda applications that need state across invocations is to use DynamoDB as their application store. DynamoDB is fast, fairly easy to operate and configure, and has very similar scaling properties to Lambda. I address why this last point is useful later in this article.

Caching

Effective Lambda development is sometimes described as stateless, because of the constraints described above. That word isn't entirely accurate. While it's true that we have no guarantee that one Lambda function instance will be called multiple times, we do know that it probably will be, depending on invocation frequency. Because of this, cache state is a candidate for local storage.

A Lambda function instance can be configured to have anywhere from 128MB to 1.5GB RAM, and it always has 512MB of /tmp local disk storage available. We can use either or both of these as locations for cached data. For example, say that we always need a set of fairly up-to-date reference data from a downstream service in order to process an event, but ‘fairly up-to-date’ is on the order of ‘valid within the last day’. In this case we can load the reference data once, for the first invocation, and then store that data locally in a static or instance member variable. Remember that our handler function instance object will only be instantiated one time per container.
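One way to sketch the reference data case is a small in-memory cache that reloads once its contents are more than a day old. ExpiringCache, ReferenceDataLambda and loadReferenceData() are hypothetical names, and the loader is a placeholder for whatever downstream call you actually make.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical helper: keeps a value in memory and reloads it once it is older than maxAge.
class ExpiringCache<T> {
    private final Supplier<T> loader;
    private final Duration maxAge;
    private T value;
    private Instant loadedAt;

    ExpiringCache(Supplier<T> loader, Duration maxAge) {
        this.loader = loader;
        this.maxAge = maxAge;
    }

    // No locking: at most one event at a time reaches a function instance.
    T get(Instant now) {
        if (loadedAt == null || Duration.between(loadedAt, now).compareTo(maxAge) > 0) {
            value = loader.get(); // slow path: call the downstream service
            loadedAt = now;
        }
        return value;
    }
}

class ReferenceDataLambda {
    // Static, so it survives across invocations handled by the same container.
    private static final ExpiringCache<String> REFERENCE_DATA =
            new ExpiringCache<>(ReferenceDataLambda::loadReferenceData, Duration.ofDays(1));

    // Placeholder for the real downstream call.
    private static String loadReferenceData() {
        return "reference data loaded at " + Instant.now();
    }

    public String handler(Object input) {
        return REFERENCE_DATA.get(Instant.now());
    }
}
```

Passing the current time into get() is a design choice that keeps the expiry logic easy to exercise without waiting a day.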

As another example say that we want to call an external program or library as part of our execution — Lambda gives us a full Linux environment with which to do this. That program / library may be too big to fit in a Lambda code artifact (which is restricted to at most 50MB compressed in size), but instead we can copy the external code from S3 to /tmp the first time we need it for a function instance, and then for subsequent requests for that instance the code will be available locally already.
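The copy-on-first-use idea might look something like the sketch below. LocalArtifactCache is a hypothetical helper, and the download supplier stands in for whatever S3 client call you use; no real AWS API is shown. The file is written under a temporary name and then moved into place, on the assumption that we don't want a later invocation to pick up a half-written file if the first attempt is cut short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Supplier;

// Hypothetical helper for keeping a downloaded artifact on local disk.
class LocalArtifactCache {
    private final Path directory;

    LocalArtifactCache(Path directory) {
        this.directory = directory;
    }

    // Returns the local path, fetching only if the file is not already present.
    Path ensurePresent(String fileName, Supplier<InputStream> download) {
        Path target = directory.resolve(fileName);
        if (Files.exists(target)) {
            return target; // warm path: an earlier invocation already fetched it
        }
        try (InputStream in = download.get()) {
            // Write to a temporary file first, then move it into place.
            Path partial = Files.createTempFile(directory, fileName, ".part");
            Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
            Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a function we would create this once, for example with Paths.get("/tmp") as the directory, and call ensurePresent() at the start of each invocation that needs the external program.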

As a smaller but much more common example, we can define connections to remote services outside of our function handler method, and then those remote service objects will be ready for us next time. Let's look at an example of this:

In this (contrived) example our handler function writes a new item to DynamoDB. We could have created the DynamoDB connection object (dynamoDB) every time in our handler() method. But instead we've created it in the constructor, effectively caching this object instance across invocations.

For good measure we also stored the tableName we want to use in a cross-invocation member variable. tableName comes from an Environment Variable, but from what we learned about Cold Starts and function instances in Part 8, we know that Environment Variables will be constant across the lifetime of a function instance.

This code is fairly obviously not written in a TDD style! For unit testing we could create a second constructor with which we can specify a mock / stub for our DynamoDB object.
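A sketch of that two-constructor shape, under some loudly stated assumptions: ItemWriter is a hypothetical stand-in for the DynamoDB client object (so that the sketch runs without the AWS SDK), and TABLE_NAME is an assumed Environment Variable name.

```java
// Hypothetical stand-in for the DynamoDB client used in the example above.
interface ItemWriter {
    void putItem(String tableName, String item);
}

class WriteItemLambda {
    // Both fields are set once per container and reused across invocations.
    private final ItemWriter writer;
    private final String tableName;

    // Used by the Lambda runtime: runs once per container, on a Cold Start.
    public WriteItemLambda() {
        this(defaultWriter(), System.getenv("TABLE_NAME")); // assumed variable name
    }

    // Used by unit tests to supply a mock or stub.
    public WriteItemLambda(ItemWriter writer, String tableName) {
        this.writer = writer;
        this.tableName = tableName;
    }

    public void handler(String item) {
        writer.putItem(tableName, item);
    }

    private static ItemWriter defaultWriter() {
        // Placeholder: a real function would construct the AWS SDK's DynamoDB client here.
        return (table, item) -> {
            throw new UnsupportedOperationException("wire up a real client");
        };
    }
}
```

A unit test can then pass in a lambda that records calls, and assert on what the handler tried to write, without touching AWS at all.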

It's of course worth noting that in any of these cases the event that causes the cache to be populated will take longer to process than subsequent events, and so you should be aware of that in your performance analysis.

Impact on Downstream Services

Another impact of the way Lambda scales is the effect it can have on downstream services.

As an example let's consider a situation where we have a more traditionally developed & deployed server side application that interacts with a relational database. In this scenario we would typically create a database connection pool to save, and limit, the number of connections we make to the database — a resource to be considered carefully with relational databases. We may set that connection pool size to 10, and we may allow our application to scale out 5 processes wide. With this example we know that we will be making at most 50 connections to the database from our application. This creates a throughput constraint on how much load we put on the database — if necessary under high enough upstream load this backpressure will be passed up through our architecture.

Now let's consider a Lambda version of the same application. Because Lambda will scale to 1000 concurrent executions by default, we may have as many as 1000 connections to our downstream database, even with only one connection per function instance. If we've asked AWS to increase the concurrent execution limit on our account the possible number of database connections will also rise, to that limit.

There is nothing we can do, from a Lambda scaling point of view, to stop this. In other words there is no natural architectural backpressure with Lambda apart from the account-wide concurrency limit.

It's easy to imagine, therefore, that because of Lambda's scaling model it is very easy to overwhelm downstream services that have not been created with Lambda in mind.

So what do we do? There are several approaches but an important point to consider is that this is an area where Serverless shows its infancy — there are no ‘best practice’ patterns for this right now, but there are some techniques that people have found useful.

Downstream impact remediation strategies

First of all, where possible it's encouraged to pair Lambda with downstream services that exhibit a similar scaling behavior. If you're calling an HTTP service from Lambda, consider also implementing that downstream service with API Gateway + Lambda. If your Lambda function needs a database or message service then consider using DynamoDB or SNS, etc., which themselves can autoscale up with load.

Our next approach can be to limit the traffic we send to downstream services, throttling downstream traffic. One way to do this is to use a message bus between the Lambda source and the downstream service. An example is Kinesis, combined with another Lambda as an event listener, which will have a significantly smaller concurrent execution limit (because of shard based scaling that I briefly mentioned earlier.) Obviously this breaks a synchronous call, so re-architecture would be necessary. As another example we may introduce a proxy component that can act as a throttle to the sensitive downstream resource — we may even use API Gateway's Usage Plans for this functionality.

A third strategy is to limit the amount of upstream traffic reaching our Lambda function in the first place. We could do this using either of the two throttling techniques I just described for downstream services. This is a great use case for those API Gateway Usage Plans.

One strategy that we don't have available to us is to simply limit the number of concurrent executions one Lambda function will scale out to: at the time of writing AWS do not give us this ability. Microsoft Azure Functions does have something like this idea, so we can imagine AWS may provide it one day, but for now architecture is our tool, rather than configuration.

Cold Start impact on downstream services

So far in this section I've mostly been referring to the load on downstream systems that comes from continual upstream events. However there's another scenario to consider, and that's the impact that comes from Cold Starts.

In Part 8 I said:

“It's important to note that if your function loads data from a downstream resource at startup it will be doing that every time a Cold Start occurs. You may want to consider this when you're thinking about the impact your Lambda functions have on downstream resources.”

Consider the following scenario. Say that your Lambda function queries a downstream service for reference data on its first container invocation. During fairly high, but level, traffic flow we would expect the downstream service to be called in a somewhat regular fashion as AWS continually destroys and creates containers.

However say now that you get a sudden traffic spike, or say that you deploy a new version of your code and switch all traffic to that new version simultaneously. This will result in a near-spontaneous creation of a fleet of new Lambda containers, and this in turn will result in a spike of requests to your downstream reference data service.

Because of this you may need to apply the remediation strategies described above to services that are called during initialization, and not just to services that are called while handling events.

Vertical Scaling

The way we've described Lambda scaling so far is through horizontal scaling — the Lambda Platform creates new instances to handle concurrent requests. This is in contrast to vertical scaling — the ability to handle more load by increasing the computational capability of an individual node.

Lambda also has a rudimentary vertical scaling option, however, in its memory configuration. As I mentioned earlier Lambda functions can be configured (manually) to have from 128MB to 1.5GB of RAM, and the CPU performance capability of a Lambda function scales ‘roughly proportionally’ with the memory setting. In other words you may see a 12X CPU performance improvement going from 128MB to 1.5GB RAM configuration (1536MB is 12 times 128MB). Network I/O ability also scales with RAM.

However this is the only vertical scaling option you have, and even it is limited. While CPU performance is improved with configuration, this is implemented under the covers through time-slicing, rather than number of available CPU cores. And furthermore, even a ‘maxed out’ 1.5GB Lambda is not a speed-machine. In other words your strategy in scaling with Lambda is very typically going to be to embrace its horizontal scaling capabilities.

Summary, and next time

That brings us to the end of Part 9 of Learning Lambda. This installment was all about Lambda's wonderful scaling capabilities, but also about how we need to rethink state management when using Lambda. We also talked about the impact Lambda's scaling has on how we interact with other services from a Lambda function.

We're still not done with this series! We have a few other areas to cover including a case study with API Gateway, discussion of other gotchas such as Lambda's under-discussed ‘at least once’ execution behavior, and a few other things besides.

