Defining Serverless — Part 5

How Serverless approaches handling of component failures

Mike Roberts
Jun 28, 2017

This is the final part of a series defining Serverless services in terms of five common traits, which John Chapin and I introduced in our free O'Reilly ebook, What is Serverless? If this is the first part you've come to, I advise taking a look at Part 1 first for context. In this part we discuss the fifth trait common to Serverless services.

Trait #5 — Implicit High Availability

High Availability (HA) is a term that we often use about software systems to describe the ability of the system to continue to operate even when one instance of a component fails. HA is typically implemented using some kind of redundancy technique. When we build and operate applications using traditional architectural techniques and long-lived server components, it is frequently our responsibility to implement HA. This might involve setting up a database cluster as opposed to a single database instance, or setting up a farm of web servers.
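To get a feel for why redundancy helps, here is a back-of-the-envelope sketch. It assumes that instances fail independently of each other, which is an idealization that rarely holds exactly in practice, and the function name and the 99% figure are mine, chosen purely for illustration.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of a group that is up as long as at least one instance is up.

    Assumes failures are independent, which is an idealization.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The group is down only when every instance is down at the same time.
    return 1.0 - (1.0 - instance_availability) ** instances


if __name__ == "__main__":
    # Illustrative figure: each instance is up 99% of the time.
    for count in (1, 2, 3):
        print(count, f"{combined_availability(0.99, count):.6f}")
```

Under that independence assumption, going from one 99% instance to two takes the group to about 99.99%, which is the kind of gain a database cluster or web server farm is after.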

As we described in trait #1, with Serverless we are no longer concerned with long-lived server components. Instead we either use fully managed remote services, or implement pieces of code that are ephemeral and event-triggered. In such a world it is not possible to make our own application highly available unless the underlying Serverless services are themselves highly available.

Because of this we assume that Serverless services are highly available by default, or in other words that they offer implicit high availability. For a Serverless database we assume clustering is managed on our behalf; for Serverless Functions-as-a-Service we assume that if a computation node fails then the vendor will instantiate a new one, with our code, and re-route events to the new node.

As another example let's consider Amazon S3. I've mentioned S3 a few times in this series, and I consider it Amazon's oldest Serverless service, even though we didn't necessarily think of it in such terms for most of its lifetime. S3 is, in one view, a large network file system (it's actually an ‘object store’, but that's not important for this brief discussion). When we use S3 we can upload files, and then download them again later. This makes it feel like a network file server. But with S3 we're never concerned about individual file storage servers, or nodes — we assume that if an individual component in S3 fails then our data will still be safe, and that we'll continue to be able to access our data. This is implicit HA.

Failovers of a Serverless service may still be felt by clients to some extent, even though we're not responsible for resolving them, or in fact able to resolve them. For example, if a Lambda compute node fails we may experience instability using that Lambda function until the problem is resolved. The laws of distributed systems still apply, and we still need to include error handling in our code.
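One common form that error handling takes is retrying a failed call with exponential backoff. The sketch below is a minimal illustration rather than any vendor's recommended client: TransientServiceError, call_with_retries and flaky_invoke are hypothetical names, and a real client would need to decide which of its own errors are safe to retry.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientServiceError(Exception):
    """Hypothetical error type standing in for a retryable vendor failure."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServiceError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # The delay doubles on each attempt; jitter avoids synchronized retries.
            delay = base_delay_seconds * (2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.5))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_invoke() -> str:
        # Simulates a function whose compute node fails twice, then recovers.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TransientServiceError("node unavailable")
        return "ok"

    print(call_with_retries(flaky_invoke, sleep=lambda _: None))
```

Retries like this only make sense when the operation is safe to repeat, so it is worth thinking about idempotency before wrapping a call this way.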

An important caveat to this trait is that implicit HA has its limits. When I use the term I'm thinking about the ability of a service to handle the failure of an individual node, and that maps roughly to an individual server (host or process). What I'm not talking about here is the ability of a vendor to offer automatic failover for a failure of an entire service system. In other words Serverless does not offer implicit Disaster Recovery.

Let's consider S3 again. While we don't feel the effects of an individual S3 node failing, we do feel the effects of a complete failure of S3 within a given region (a closely located set of data centers). Such an event occurred on February 28, 2017, when S3 failed in the US-East-1 region. To my mind this was a disaster, and S3 does not promise to handle such a scenario (it is only advertised to offer 99.99% availability, which allows a little under an hour of downtime per year).
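The downtime figure that goes with an availability percentage is simple arithmetic, sketched here. The function name is my own, and the calculation ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down while still meeting the target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

For 99.99% this works out to about 52.6 minutes per year, so a single multi-hour regional outage uses up several years' worth of that budget.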

Because of this, when designing Serverless applications we still need to consider such disaster scenarios in our own architectural planning. The good news is that many vendors, like AWS, offer multiple, highly isolated regions, and so one first approach for Serverless DR is to consider a multi-region deployment strategy. This comes with work, but it is perfectly possible to mitigate many of the effects of a region-wide outage in this way.
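As a rough illustration of the client side of a multi-region strategy, the sketch below tries a list of regions in order and falls back when one is unavailable. It assumes the data has already been replicated to every region listed, which is where most of the real work lies. RegionUnavailableError, read_with_regional_failover and the in-memory replicas dictionary are all hypothetical stand-ins, not part of any AWS SDK.

```python
from typing import Callable, Dict, Sequence


class RegionUnavailableError(Exception):
    """Hypothetical error raised when a whole region cannot serve a request."""


def read_with_regional_failover(
    key: str,
    regions: Sequence[str],
    read_from_region: Callable[[str, str], bytes],
) -> bytes:
    """Try each region in order and return the first successful read."""
    if not regions:
        raise ValueError("at least one region is required")
    last_error = None
    for region in regions:
        try:
            return read_from_region(region, key)
        except RegionUnavailableError as error:
            last_error = error  # Remember the failure and try the next region.
    raise RegionUnavailableError(f"all regions failed for {key!r}") from last_error


if __name__ == "__main__":
    # In-memory stand-in for data replicated to two regions.
    replicas: Dict[str, Dict[str, bytes]] = {
        "us-east-1": {"report.csv": b"a,b\n1,2\n"},
        "us-west-2": {"report.csv": b"a,b\n1,2\n"},
    }
    unavailable = {"us-east-1"}  # Simulate a regional outage.

    def read_from_region(region: str, key: str) -> bytes:
        if region in unavailable:
            raise RegionUnavailableError(region)
        return replicas[region][key]

    data = read_with_regional_failover(
        "report.csv", ["us-east-1", "us-west-2"], read_from_region
    )
    print(data)
```

This only covers reads. Writes during an outage raise harder questions about consistency between regions, which is part of the work mentioned above.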

In summary, Serverless services offer implicit High Availability (HA) since we require them to do so in order for our usage of them to itself be HA. A service that does not offer HA is not, in my mind, Serverless. Serverless services do not, however, offer implicit Disaster Recovery and it is still our responsibility to handle the failure of an entire service system.

This brings us to the end of this series. Serverless applications are ones that are implemented using Serverless services. A Serverless service is one that entirely, or very nearly entirely, exhibits the five traits that I've described in this series, namely:

  1. Requires no management of Server hosts or Server processes
  2. Self auto-scales and auto-provisions, based on load
  3. Offers costs based on precise usage
  4. Has performance capabilities defined in terms other than host size / count
  5. Has implicit High Availability (explained above)

It is, of course, perfectly reasonable for a system to be implemented using both Serverless and non-Serverless services. The badge ‘Serverless’ does not make something inherently good or bad; it merely means that it exhibits these five attributes. We call such mixed architectures ‘hybrid architectures’, and we talk about them on page 44 of our ebook, What is Serverless?

I hope you've enjoyed this series, and have found it educational. If you have questions or thoughts please add them in the comments, or tweet me @mikebroberts.

Need help with Lambda, or other Serverless technologies? We're the experts! Contact us at Symphonia for advice, architectural review, training and on-team development.