
Enterprise Serverless πŸš€ Architecture

Serverless Advocate
Published in The Startup · 8 min read · Jun 28, 2020

This section of the series on Enterprise Serverless covers some of the high-level architecture and design aspects of large serverless applications. You can read the ideas behind the series by clicking on the β€˜Series Introduction’ link below.

The series is split over the following sections:

  1. Series Introduction πŸš€
  2. Tooling πŸš€
  3. Architecture πŸš€
  4. Databases πŸš€
  5. Hints & Tips πŸš€
  6. AWS Limits & Limitations πŸš€
  7. Security πŸš€
  8. Useful Resources πŸš€

Architecture

Thinking serverless

At the start of any new serverless project it is key, in my opinion, that the full team β€˜thinks serverless’, which can be a very different world from the one they have been living in. Some of the key aspects are listed below:

DevOps built into team β€” builds fail (a lot)
On any large-scale serverless project with lots of resources and multiple scrum teams it is key to have DevOps embedded into the team, as even the most well-designed pipelines will throw up errors daily, and builds fail quite frequently for various spurious reasons on AWS! In my experience, having dedicated resource to unblock the development teams is a must, alongside monitoring how the services are running day to day and managing changes. Not securing this resource at the start of the project can have a major knock-on effect on the overall project.

Capacity planning πŸ’­
Before even putting pen to paper on designing the architecture and the services which make up the solution, it is key to gain an understanding of the anticipated load and estimated future growth, especially in a serverless event-driven architecture.

Serverless, in my opinion, is not a silver bullet that removes the need to think about capacity and load: you will still need to plan for reserved concurrency on your lambdas at a minimum. This is especially true around areas such as service-to-service communication, messaging and batch processing, which can fall foul of poor design under load if you are not using patterns such as fan-out and pub/sub, and limiting throughput to downstream services which can’t handle high volume at scale.
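As a hedged sketch of what planning for this can look like in the Serverless Framework (the function, queue ARN and numbers below are purely illustrative), reserved concurrency can be capped per function so a burst of events cannot overwhelm a downstream service:

```yaml
# serverless.yml (illustrative sketch)
# Cap concurrency on a function that calls a downstream service
# which cannot handle unbounded load at scale.
functions:
  processOrders:
    handler: src/orders/handler.process
    reservedConcurrency: 25 # hard cap on concurrent executions
    events:
      - sqs:
          arn: arn:aws:sqs:eu-west-1:123456789012:orders-queue # placeholder ARN
          batchSize: 10 # pull messages through in small batches
```

The SQS queue in front of the function acts as the buffer, while `reservedConcurrency` throttles how fast messages are drained towards the downstream dependency.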

Developers building infrastructure
One of the biggest shifts in mindset and day-to-day work for developers is building infrastructure through code using frameworks such as the Serverless Framework, which with monolithic solutions was previously done by an Ops team. In the new world of serverless, the developers themselves, alongside the architects or lead developers, typically have the best grasp of exactly what they want a small piece of infrastructure to do; so it makes sense for them to manage it through IaC within a PR, alongside the code which logically supports it.
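A minimal sketch of this, with the Serverless Framework (all names and resources are illustrative), is a developer defining a function and the DynamoDB table it depends on in the same serverless.yml, reviewed in the same PR as the handler code:

```yaml
# serverless.yml (illustrative sketch) – code and infrastructure
# owned together by the developer and reviewed in one PR
provider:
  name: aws
  runtime: nodejs12.x

functions:
  createCustomer:
    handler: src/customer/create-customer.handler
    events:
      - http:
          path: customers
          method: post

resources:
  Resources:
    CustomerTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: customers-${opt:stage, 'dev'}
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
```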

Testers need to think differently
There is also a mindset change for test teams when working on an event-driven serverless project, especially when it comes to serverless security, load testing and acceptable response times (taking into account concurrency and cold starts, for example).

Team Checklist

Through experience of working on large-scale cloud projects I have built up a checklist of key areas of focus when working on features. It is fairly self-explanatory, and acts as a prompt for team members, from POs and BAs to developers and testers, when thinking about non-functional requirements, ensuring they are considered at feature inception and not forgotten about:

βœ… Auditing β€” what key actions require auditing?
βœ… Instrumentation/Key Metrics β€” what key functions/services require instrumentation and KPIs?
βœ… Logging β€” which key actions require logging and what should be included?
βœ… Alerting β€” what key actions require alerting for DevOps through CloudWatch/Pager Duty or custom dashboards?
βœ… Chaos Tests/DR β€” what testing do we need to put in place?
βœ… Load Tests β€” what load considerations do we have to consider and how do we test it?
βœ… Authorisation β€” who should have access to this resource?
βœ… Documentation β€” what do we need to document and where?
βœ… Caching/TTL β€” could we benefit from caching in this area to benefit the end user and/or cost?

VPC or no VPC

When working with serverless architectures, one of the key considerations at the start of the project is whether or not you will need a VPC, and if so, what additional hoops you will need to jump through and what the limitations are.

In my experience this is largely dictated by the use of an accompanying AWS service in your solution which needs to reside in a VPC, and more often than not it is a database such as AWS DocumentDB or RDS.

The first question is whether there is an alternative service that can be used: in the AWS DocumentDB example above, AWS DynamoDB would be a first choice. If not, then one of the biggest challenges you will face in a VPC is communication between internal and external services, which will now need VPC NAT Gateways or VPC Endpoints etc. You may also see longer cold start times in a VPC, although these have been massively reduced since around September 2019.

Although this is not insurmountable, it is another thing to manage: it increases the integration code and implementation work, requires specific IAM roles, and can be tricky to get right first time. If at all possible, in my opinion it is beneficial to stay outside the realms of a VPC with a serverless solution, and doing so doesn’t make your solution any less secure. If you do require a VPC, it’s imperative to factor the additional work into planned estimates.
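If you do end up needing one, a hedged sketch of the Serverless Framework configuration involved (the security group and subnet IDs below are placeholders) shows the extra moving parts you take on:

```yaml
# serverless.yml (illustrative sketch) – placing all functions in a VPC.
# IDs are placeholders; real ones come from your VPC setup, and you will
# also need NAT Gateways or VPC Endpoints for outbound/AWS service access.
provider:
  name: aws
  vpc:
    securityGroupIds:
      - sg-0123456789abcdef0
    subnetIds:
      - subnet-0123456789abcdef0
      - subnet-0123456789abcdef1
```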

This article articulates the topic very well: https://lumigo.io/blog/to-vpc-or-not-to-vpc-in-aws-lambda/

No monolithic functions

There are various ongoing arguments in the serverless world around individual lambdas per endpoint vs monolithic lambdas, which essentially use a framework such as serverless-express to host many endpoints, or sometimes the full application API.

The latter is a poor idea in my opinion, as there is no way to scale out or change the reserved or provisioned concurrency of a specific endpoint in the future, and the endpoints are too tightly coupled. From a development perspective, monolithic lambdas mean more risk of cross-contamination of bugs, deployment problems and security issues, as opposed to one isolated piece of functionality per endpoint which does one thing well.
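A small sketch of the single-purpose approach in the Serverless Framework (function names and numbers are illustrative) shows why it is easier to tune: each endpoint is its own function, so concurrency, permissions and memory can be set independently:

```yaml
# serverless.yml (illustrative sketch) – one function per endpoint,
# so each can be tuned, secured and scaled on its own
functions:
  getOrder:
    handler: src/orders/get-order.handler
    events:
      - http:
          path: orders/{id}
          method: get
  createOrder:
    handler: src/orders/create-order.handler
    reservedConcurrency: 50 # tune this endpoint alone
    events:
      - http:
          path: orders
          method: post
```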

Observability

One of the key aspects of a good serverless solution in my opinion is being able to observe how your overall application is performing, from the front-end clients through to the backend data stores, and the communication between them. This is even more important with event-driven architectures, where there are more moving parts and services to observe. When architecting solutions, some of the key services (though by no means the only ones) I build into the solutions are:

AWS X-Ray
AWS X-Ray is an AWS service which allows you to monitor and observe how a distributed system made up of many services is performing.
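In the Serverless Framework, enabling tracing is a small config change; as a hedged sketch, the provider-level flags below turn on X-Ray for API Gateway and Lambda across the service:

```yaml
# serverless.yml (illustrative sketch) – enable X-Ray tracing
# for both API Gateway and Lambda across the whole service
provider:
  tracing:
    apiGateway: true
    lambda: true
```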

Google Analytics
Building in Google Analytics or equivalent service from the start of a project is key in my opinion to understand how, where and when your users are using the system.

πŸ’‘For React applications I typically use react-ga which is easy to plug in and get going.

New Relic Browser
Services such as New Relic Browser allow you to proactively monitor any JavaScript errors which your end users may be experiencing, which is key when you may have millions of customers on your service. This can be broken down into how many page views are seeing a particular error, as well as which browsers it is manifesting on.

Sumo Logic
AWS CloudWatch is great for day-to-day logging, but not great for creating dashboards or searching through multiple log groups via correlation IDs in one go. For this reason I have historically used services such as Sumo Logic, which has a far greater user experience, with the caveat that you need to stream your logs using CloudWatch Events and Lambda, or directly from your lambda code through a logging framework such as Winston with a Sumo transport.

πŸ’‘ When logging to Sumo Logic directly in the lambda code we have measured that this can add an additional 100-300ms on average to your overall lambda invocation duration. This is one to watch out for when speed is key to your consumers.
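One way to mitigate that overhead is to buffer log lines in memory during the invocation and ship them in a single call at the end, rather than paying a network round trip per log statement. The sketch below is purely hypothetical (the `BufferedLogger` name and shape are my own, not a real Sumo Logic or Winston API); the flush function would wrap a single HTTPS POST to your collector in a real implementation:

```javascript
// Hypothetical sketch: buffer log lines during a lambda invocation and
// flush them once at the end, so only one network call is paid per
// invocation rather than one per log statement.
class BufferedLogger {
  constructor(flushFn) {
    this.flushFn = flushFn; // e.g. one HTTPS POST to a Sumo collector endpoint
    this.buffer = [];
  }

  log(level, message) {
    // store each line as structured JSON ready for shipping
    this.buffer.push(JSON.stringify({ level, message, ts: Date.now() }));
  }

  async flush() {
    if (this.buffer.length === 0) return 0;
    const batch = this.buffer.splice(0); // drain the buffer
    await this.flushFn(batch.join('\n')); // ship the whole batch in one call
    return batch.length;
  }
}

module.exports = { BufferedLogger };
```

The trade-off is that logs are lost if the invocation crashes before the flush, so critical errors may still warrant writing through to CloudWatch immediately.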

CloudFlare for Ephemeral Environments

A nice approach to ephemeral environments which I have used in the past is using CloudFlare Workers alongside stages in the Serverless Framework to create short-lived, developer-specific environments, accessed via their own subdomains. An example could be:

https://pr-123-uk.something.com

This approach works well full-stack for both the APIs and clients, as well as allowing routing to be split out if you require both REST and GraphQL on the same domain:

https://pr-123-uk.something.com/api/v1/
https://pr-123-uk.something.com/graphql

The benefit of using CloudFlare Workers in this scenario over an alternative such as AWS CloudFront is the speed with which changes are deployed; in my experience, changes to CloudFront could historically take up to 15 minutes to propagate, whereas CloudFlare is near instant. With ephemeral environments it is key that they are deployed as quickly as possible.
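As a sketch of the routing logic such a worker could apply (the hostnames, internal origins and path rules below are entirely hypothetical), the PR-specific environment name is read from the subdomain and the path decides whether the request goes to the REST or GraphQL origin:

```javascript
// Hypothetical sketch of the routing a CloudFlare Worker could apply
// for ephemeral environments. All hostnames and origins are made up:
//   pr-123-uk.something.com/graphql    -> that PR's GraphQL origin
//   pr-123-uk.something.com/api/v1/... -> that PR's REST API origin
function routeRequest(url) {
  const { hostname, pathname } = new URL(url);
  const [env] = hostname.split('.'); // e.g. 'pr-123-uk'

  if (pathname.startsWith('/graphql')) {
    return `https://${env}-graph.internal.something.com${pathname}`;
  }
  if (pathname.startsWith('/api/')) {
    return `https://${env}-rest.internal.something.com${pathname}`;
  }
  // anything else goes to the static web client for that environment
  return `https://${env}-web.internal.something.com${pathname}`;
}

module.exports = { routeRequest };
```

In an actual worker this function would feed the rewritten URL into a `fetch` against the chosen origin.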

Future proofing

When you expect to have 150–200+ lambdas in your solution over time it is essential to split out your code, project and files correctly to allow you to quickly adapt to future changes where required.

One way I have done this in the past is splitting the various logical layers into separate files: for example, specific lambda handlers for both API Gateway (REST) and GraphQL lambda resolvers (as they will have different event objects), which ultimately call through to a separate, reusable β€˜manager’ file for the main functionality. This means you can share the main business logic between multiple types of service as requirements change (ECS/APIG/AppSync etc.).
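A minimal sketch of that split (the entity, function names and event shapes here are illustrative, and the manager's data access is stubbed out) shows two thin handlers adapting their different event objects to one shared manager:

```javascript
// Illustrative sketch of the handler/manager split.
// The manager holds the business logic and knows nothing about
// which event source invoked it.
function getCustomerManager(customerId) {
  // a real implementation would fetch from the data store here
  return { id: customerId, name: `customer-${customerId}` };
}

// Thin API Gateway (REST) handler – adapts the APIG event shape
function apiGatewayHandler(event) {
  const customer = getCustomerManager(event.pathParameters.id);
  return { statusCode: 200, body: JSON.stringify(customer) };
}

// Thin AppSync (GraphQL) resolver – adapts the AppSync event shape
function appSyncHandler(event) {
  return getCustomerManager(event.arguments.id);
}

module.exports = { getCustomerManager, apiGatewayHandler, appSyncHandler };
```

Because the manager is event-source agnostic, the same business logic can later sit behind an express route on Fargate without change.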

This simple approach means that at some point in the future you can create an index file in the root of that particular β€˜entity’ (for example customer or order), export it as a NodeJS express app, and use it with AWS Fargate instead of lambda; essentially meaning those individual CRUD(L) calls are easily exported as one microservice rather than as individual lambdas.

This has been key on a previous project where the same business logic needed to be exposed via AppSync for our own public facing web clients, as well as through APIG for cloud connected products (desktop applications).

Next section: Databases πŸš€
Previous section: Tooling πŸš€

Wrapping up

Let’s connect on any of the following:

https://www.linkedin.com/in/lee-james-gilmore/
https://twitter.com/LeeJamesGilmore

If you found the articles inspiring or useful please feel free to support me with a virtual coffee https://www.buymeacoffee.com/leegilmore and either way let’s connect and chat! β˜•οΈ

If you enjoyed the posts please follow my profile Lee James Gilmore for further posts/series, and don’t forget to connect and say Hi πŸ‘‹

About me

β€œHi, I’m Lee, an AWS certified technical architect and polyglot software engineer based in the UK, working as a Technical Cloud Architect and Serverless Lead, having worked primarily in full-stack JavaScript on AWS for the past 5 years.

I consider myself a serverless evangelist with a love of all things AWS, innovation, software architecture and technology.”

** The information provided represents my own personal views, and I accept no responsibility for the use of this information.
