"The Stack": Everything you'll need

This will be a series of blog posts where we will build up the perfect infrastructure setup for the majority of use cases, aka “The Stack”. We’ll be building everything on top of AWS.

Before diving in, let’s first establish some goals:

  • Performant: Latency and performance are important, as this will serve end-users.
  • Low cost: We want our base cost to be low, and our costs to scale well with high traffic. Ideally, it should cost nothing if no users are using it.
  • Low operational overhead: It’s >=2023 and nobody wants to nurse servers or services anymore; things should scale up and down without intervention or oversight.
  • High flexibility: Everything should be built with foresight for future scalability, whether organizationally, code-wise, or in the way things fit together.
  • Modular: Pieces of the infrastructure should be opt-out, e.g. if you don’t need Pub/Sub, it can be skipped.

Obviously, this is my personal opinion on it, but I’ll be sharing the thinking behind each of the choices as we go along.

Some technology choices upfront:

  • Everything will be infrastructure as code using AWS CDK.
  • We’ll be using Rust throughout for each service, as it allows us to squeeze out the most performance while still giving us a nice developer experience.
  • Federated GraphQL will be our way of facilitating microservices.

What will we be covering?

You can see a sneak peek of the final setup below. All of this will be covered in parts:

  • Part 0: The introduction and goals (this post)
    • The goals of “The Stack” and architecture overview
  • Part 1: Setting up your AWS Account Structure
    • Setting up Control Tower and all of our AWS Accounts
  • Part 2: Automating Deployments via CI
    • Bootstrapping CDK and deploying to all accounts via CI
  • Part 3: Creating our Frontend
    • Creating an SPA and deploying it to S3 + CloudFront
  • Part 4: A Federated GraphQL API
    • Federated GraphQL in Lambda with three subgraphs
  • Part 5: Adding Databases (DynamoDB)
    • Make our subgraphs use DynamoDB for storage
  • Part 6: Using our API from the Frontend
    • Connecting our Frontend to the API and setting up our GraphQL clients
  • Part 7: Preview Environments in CI
    • Spin up environments in Pull Requests using GitHub Actions
  • Part 8: Asynchronous work and processing
    • Queuing up work with SQS and decoupling services via Pub/Sub using EventBridge
  • Part 9: Notifications and emails
    • Sending emails and Push Notifications
  • Part 10: Monitoring, traces, and debugging
    • XRay traces and CloudWatch Dashboards
  • Part 11 (Bonus): Websocket support for GraphQL
    • Support GraphQL subscriptions via API Gateway’s websocket support
  • Part 12 (Bonus): Video transcoding and image resizing
    • Transcode video files with MediaConvert and resize images on-the-fly
  • Part 13 (Bonus): Mobile App
    • Package the Frontend as a Mobile App
  • Part 14 (Bonus): Stripe integration
    • Integrating Stripe for payments and billing
  • Part 15 (Bonus): Billing breakdown
    • Forming an overview of service costs via Billing Tags and the Cost Explorer

Account Structure and Governance

Reorganizing your AWS Account structure is a pain, so let’s see if we can get this right from the beginning. There are a few things that direct our choices here:

  • Isolation of environments to avoid sharing soft and hard limits between environments (e.g. Lambda concurrency limits)
    • We will be putting each environment in its own AWS Account
  • Security, audit, and control to ease GDPR requirements
    • We’ll use AWS Control Tower to give us full control of sub-accounts in our Organization

Let’s first sketch out the Account Governance structure, before diving into the view of each individual AWS account:

  • Control Tower: This is your central place to control access and policies for all accounts in your organization
  • Production Multi-tenant: Your primary production account for the multi-tenant setup, and most likely where the majority of users will be
  • Production Single-tenant: While it’s desirable to avoid the operational overhead of single-tenant setups, it’s good to plan for them from the get-go
  • Integration Test: This will be the account that IaC deployments get tested on to ensure rollout works
  • Preview: This will be used to spin up Preview Environments later on
  • Individual Developer: Individual developer accounts to allow easy IaC testing and exploration
  • Monitoring: Centralize monitoring and observability into one account, allowing access to insights without access to sensitive logs or infrastructure from the other accounts
  • Logs: Centralized storage of logs, which may require different access considerations than metrics and traces
graph TD
  subgraph ControlTower[AWS: Control Tower]
    AuditLog[Audit Log]
    GuardRails[Guard Rails]
  end
  ControlTower-->AWSProdMultiTenantAccount
  ControlTower-->AWSProdSingleTenantAccount
  ControlTower-->AWSIntegrationTestAccount
  ControlTower-->AWSPreviewAccount
  ControlTower-->AWSIndividualDeveloperAccount
  ControlTower-->AWSMonitoringAccount
  ControlTower-->AWSLogsAccount

  subgraph AWSProdMultiTenantAccount[AWS: Production Multi-tenant]
    AccountFillerProdMultiTenant[...]
  end

  subgraph AWSProdSingleTenantAccount[AWS: Production Single-tenant]
    AccountFillerProdSingleTenant[...]
  end

  subgraph AWSIntegrationTestAccount[AWS: Integration Test]
    AccountFillerIntegrationTest[...]
  end

  subgraph AWSPreviewAccount[AWS: Preview]
    AccountFillerPreview[...]
  end

  subgraph AWSIndividualDeveloperAccount[AWS: Individual Developer]
    AccountFillerIndividualDeveloper[...]
  end

  subgraph AWSMonitoringAccount[AWS: Monitoring]
    direction LR
    CloudWatchDashboards[CloudWatch Dashboards]
    CloudWatchMetrics[CloudWatch Metrics/Alarms]
    XRay[XRay Analytics]
  end

  subgraph AWSLogsAccount[AWS: Logs]
    CloudWatchLogs[CloudWatch Logs]
  end

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class ControlTower,AWSProdMultiTenantAccount,AWSProdSingleTenantAccount,AWSIntegrationTestAccount,AWSPreviewAccount,AWSIndividualDeveloperAccount,AWSMonitoringAccount,AWSLogsAccount container;
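
To make the split between the Monitoring and Logs accounts a bit more tangible, here is a minimal Rust sketch of the layout as plain data. The `Account` and `Telemetry` enums and the `telemetry_destination` function are hypothetical names used only for illustration; the real accounts get created through Control Tower, not through code like this.

```rust
/// The AWS accounts from the governance diagram (illustrative names).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Account {
    ControlTower,
    ProductionMultiTenant,
    ProductionSingleTenant,
    IntegrationTest,
    Preview,
    IndividualDeveloper,
    Monitoring,
    Logs,
}

/// Kinds of telemetry that the other accounts emit.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Telemetry {
    Metrics,
    Traces,
    Logs,
}

/// Which central account each kind of telemetry ends up in.
/// Metrics and traces go to Monitoring, while logs are kept apart
/// because they may need different access considerations.
fn telemetry_destination(kind: Telemetry) -> Account {
    match kind {
        Telemetry::Metrics | Telemetry::Traces => Account::Monitoring,
        Telemetry::Logs => Account::Logs,
    }
}
```

The point of the split is that someone can be given access to the Monitoring account for dashboards and traces without that access also covering the raw logs.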

Service Infrastructure

All of the infrastructure accounts (Production, Integration, Developer) hold the same services and follow the same setup. The infrastructure we will make might seem complex at first, and it is, but as we go through each piece everything will start to make sense.

The diagram gets quite large, so we will split it up into three parts:

  • Client to Frontend
  • Client to API
  • Asynchronous Work and Media

Client to Frontend

Let’s focus first on the Client to Frontend paths:

  • Public Frontend: A simple hosting of static files in S3 with CloudFront as the CDN in front of it.
  • Internal Frontend: Hosting of internal applications, secured behind Cognito, and otherwise similar to the Public Frontend.
graph TD
  Client
  Route53

  Client-->Route53
  Route53-->FrontendCloudFront
  Route53-->InternalCloudFront
  Route53-->CertificateACM

  Frontend-->API
  Internal-->API

  subgraph Certificate[ACM: Certificate]
    CertificateACM
  end

  subgraph Frontend[Frontend: Public]
    FrontendCloudFront[CloudFront]
    FrontendS3App[S3: Static UI Files]

    FrontendCloudFront-->FrontendS3App
  end

  subgraph Internal[Frontend: Internal]
    InternalCloudFront[CloudFront]
    InternalCognito[Cognito]
    InternalS3App[S3: Static UI Files]

    InternalCloudFront-->InternalCognito
    InternalCloudFront-->InternalS3App
  end

  subgraph API
    APIFiller[...]
  end

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class Certificate,Frontend,Internal,API,Media,Database,Notification,Async,Monitoring container;

Client to API

As we can see, the two Frontends need something to talk to, so let’s check out the APIs:

  • API: A Federated GraphQL setup, served using Lambda with API Gateway exposing it to the internet. WAF is added for security and protection, and CloudFront for the possibility of cheaper egress pricing via commitment to a certain volume.
  • Database: DynamoDB tables for storing data.
  • Monitoring: XRay traces and CloudWatch metrics/alarms.
graph TD
  Client
  Route53

  Client-->Route53
  Route53-->APICloudFront

  subgraph API
    APICloudFront[CloudFront]
    APIWAF[WAF]
    APIAPIGateway[API Gateway]
    APILambdaAuthentication[Lambda: Custom Authorizer]
    APILambdaRouter[Lambda: GraphQL Supergraph<br/>Apollo Router]
    APILambdaServiceReviews[Lambda: GraphQL Subgraph<br/>Reviews Service]
    APILambdaServiceUsers[Lambda: GraphQL Subgraph<br/>Users Service]
    APILambdaServiceProducts[Lambda: GraphQL Subgraph<br/>Products Service]

    APICloudFront-->APIWAF-->APIAPIGateway
    APIAPIGateway--Cached-->APILambdaAuthentication
    APIAPIGateway-->APILambdaRouter
    APILambdaRouter-->APILambdaServiceReviews
    APILambdaRouter-->APILambdaServiceUsers
    APILambdaRouter-->APILambdaServiceProducts
  end

  subgraph Database
    DatabaseDynamoDB[DynamoDB]
  end

  %% APILambdaAuthentication-->Database
  APILambdaServiceReviews-->Database
  APILambdaServiceUsers-->Database
  APILambdaServiceProducts-->Database

  subgraph Monitoring[Monitoring]
    MonitoringXray[Xray]
    MonitoringCloudWatch[CloudWatch Metrics/Alarms]
  end

  API-->Monitoring

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class Certificate,Frontend,Internal,API,Media,Database,Notification,Async,Monitoring container;
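
As a rough illustration of what the supergraph router does for the three subgraphs, here is a minimal Rust sketch that groups the top-level fields of a request by the subgraph that owns them. This is a heavily simplified assumption of how routing works, not Apollo Router's actual query planner (which also handles nested fields and entities across subgraphs), and the field names in `owner_of` are hypothetical.

```rust
use std::collections::BTreeMap;

/// The three subgraphs sitting behind the router in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Subgraph {
    Reviews,
    Users,
    Products,
}

/// Hypothetical ownership table: which subgraph resolves which top-level field.
fn owner_of(field: &str) -> Option<Subgraph> {
    match field {
        "reviews" => Some(Subgraph::Reviews),
        "me" | "user" => Some(Subgraph::Users),
        "product" | "products" => Some(Subgraph::Products),
        _ => None,
    }
}

/// Group the requested top-level fields by owning subgraph, so the router
/// knows which subgraphs to call. Fails if no subgraph owns a field.
fn plan<'a>(fields: &[&'a str]) -> Result<BTreeMap<Subgraph, Vec<&'a str>>, String> {
    let mut grouped: BTreeMap<Subgraph, Vec<&'a str>> = BTreeMap::new();
    for &field in fields {
        let subgraph = owner_of(field)
            .ok_or_else(|| format!("no subgraph owns field `{}`", field))?;
        grouped.entry(subgraph).or_default().push(field);
    }
    Ok(grouped)
}
```

A request asking for `me`, `products`, and `user` would end up as two subgraph calls (Users and Products), and the Reviews subgraph would not be invoked at all.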

Asynchronous Work and Media

And finally we can see the Media and Async work (the Database and some APIs reappear here as well):

  • Media: S3 buckets for images and videos as well as MediaConvert for transcoding video for wider device support.
  • Async: SQS for asynchronous work via a queue and EventBridge for a Pub/Sub style architecture. An initial Analytics Lambda service is set up as the consumer of the Pub/Sub events.
  • Notification: SES for emails and SNS for mobile push notifications.
graph TD
  Client
  Route53

  Client-->Route53
  Route53-->APICloudFront

  subgraph API
    APILambdaServiceReviews[Lambda: GraphQL Subgraph<br/>Reviews Service]
    APILambdaServiceProducts[Lambda: GraphQL Subgraph<br/>Products Service]
  end

  subgraph Media
    MediaConvert[MediaConvert]
    MediaS3[S3: Media Files<br/>Image Bucket + Video Bucket]
  end

  %% FrontendCloudFront-->MediaS3
  APILambdaServiceProducts--Create Signed URL-->MediaS3
  Client--Upload via Signed URL-->MediaS3
  APILambdaServiceProducts--Create Job-->MediaConvert

  subgraph Database
    DatabaseDynamoDB[DynamoDB]
  end

  subgraph Notification[Notification]
    NotificationSES[SES: Emails]
    NotificationSNS[SNS: Mobile Notification]
  end

  subgraph Async[Async Work]
    AsyncSQS[SQS]
    AsyncEventBridge[Event Bridge: Pub/Sub]
    AsyncLambdaAnalytics[Lambda: Analytics]
    AsyncLambdaNotification[Lambda: Notification]

    AsyncEventBridge-->AsyncLambdaAnalytics
    AsyncSQS-->AsyncLambdaNotification
  end

  DatabaseDynamoDB--Streams-->AsyncEventBridge
  AsyncLambdaAnalytics-->Database
  APILambdaServiceReviews-->AsyncSQS
  AsyncLambdaNotification-->Notification

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class Certificate,Frontend,Internal,API,Media,Database,Notification,Async,Monitoring container;
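
To show what "Pub/Sub style" means in practice, here is a small in-memory Rust sketch of rule-based fan-out: every rule that matches an event forwards a copy to its target. It is only loosely modeled on EventBridge (real events carry `source` and `detail-type` fields, and real event patterns are much richer), and the `Event`, `Rule`, and `publish` names as well as the "moderation" target used in the note below are hypothetical.

```rust
/// A simplified event with the two fields we match on.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    source: String,
    detail_type: String,
}

/// A rule forwards matching events to a named target.
/// `None` means "match anything" for that field.
struct Rule {
    source: Option<String>,
    detail_type: Option<String>,
    target: String,
}

impl Rule {
    /// True when every field the rule constrains equals the event's value.
    fn matches(&self, event: &Event) -> bool {
        self.source.as_ref().map_or(true, |s| *s == event.source)
            && self
                .detail_type
                .as_ref()
                .map_or(true, |d| *d == event.detail_type)
    }
}

/// Pub/Sub fan-out: returns the targets of all rules matching the event,
/// in the order the rules were defined.
fn publish<'a>(rules: &'a [Rule], event: &Event) -> Vec<&'a str> {
    rules
        .iter()
        .filter(|rule| rule.matches(event))
        .map(|rule| rule.target.as_str())
        .collect()
}
```

With a catch-all rule targeting "analytics" and a second rule that only matches review events and targets "moderation", a review event reaches both targets while any other event reaches only analytics. The publisher never needs to know who is listening, which is what decouples the services; a queue, by contrast, hands each message to a single consumer.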

One diagram to rule them all

If we combine all the individual diagrams, we get:

graph TD
  Client
  Route53

  Client-->Route53
  Route53-->FrontendCloudFront
  Route53-->InternalCloudFront
  Route53-->APICloudFront
  Route53-->CertificateACM

  subgraph Certificate[ACM: Certificate]
    CertificateACM
  end

  subgraph Frontend[Frontend: Public]
    FrontendCloudFront[CloudFront]
    FrontendS3App[S3: Static UI Files]

    FrontendCloudFront-->FrontendS3App
  end

  subgraph Internal[Frontend: Internal]
    InternalCloudFront[CloudFront]
    InternalCognito[Cognito]
    InternalS3App[S3: Static UI Files]

    InternalCloudFront-->InternalCognito
    InternalCloudFront-->InternalS3App
  end

  subgraph API
    APICloudFront[CloudFront]
    APIWAF[WAF]
    APIAPIGateway[API Gateway]
    APILambdaAuthentication[Lambda: Custom Authorizer]
    APILambdaRouter[Lambda: GraphQL Supergraph<br/>Apollo Router]
    APILambdaServiceTodo[Lambda: GraphQL Subgraph<br/>Todo Service]
    APILambdaServiceMedia[Lambda: GraphQL Subgraph<br/>Media Service]

    APICloudFront-->APIWAF-->APIAPIGateway
    APIAPIGateway--Cached-->APILambdaAuthentication
    APIAPIGateway-->APILambdaRouter
    APILambdaRouter-->APILambdaServiceTodo
    APILambdaRouter-->APILambdaServiceMedia
  end

  subgraph Media
    MediaConvert[MediaConvert]
    MediaS3[S3: Media Files<br/>Image Bucket + Video Bucket]
  end

  %% FrontendCloudFront-->MediaS3
  APILambdaServiceMedia--Create Signed URL-->MediaS3
  Client--Upload via Signed URL-->MediaS3
  APILambdaServiceMedia--Create Job-->MediaConvert

  subgraph Database
    DatabaseDynamoDB[DynamoDB]
  end

  %% APILambdaAuthentication-->Database
  APILambdaServiceTodo-->Database
  APILambdaServiceMedia-->Database

  subgraph Notification[Notification]
    NotificationSES[SES: Emails]
    NotificationSNS[SNS: Mobile Notification]
  end

  subgraph Async[Async Work]
    AsyncSQS[SQS]
    AsyncEventBridge[Event Bridge: Pub/Sub]
    AsyncLambdaAnalytics[Lambda: Analytics]
    AsyncLambdaNotification[Lambda: Notification]

    AsyncEventBridge-->AsyncLambdaAnalytics
    AsyncSQS-->AsyncLambdaNotification
  end

  DatabaseDynamoDB--Streams-->AsyncEventBridge
  AsyncLambdaAnalytics-->Database
  APILambdaServiceTodo-->AsyncSQS
  AsyncLambdaNotification-->Notification

  subgraph Monitoring[Monitoring]
    MonitoringXray[Xray]
    MonitoringCloudWatch[CloudWatch Metrics/Alarms]
  end

  API-->Monitoring

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class Certificate,Frontend,Internal,API,Media,Database,Notification,Async,Monitoring container;

Next Steps

Next up is to start building! Follow along in Part 1 of the series here.