Improve Your Cloud Infrastructure with entr, Terraform, and tflint

Working with a cloud of microservices is a fact of life for web and mobile developers. Though some of us are lucky enough to have a team of ops engineers who keep the back end humming, many of us have to do the hard work of envisioning, implementing, and deploying that cloud ourselves.

Tools like AWS CloudFormation and Terraform make managing a large cloud way easier, but even with these power tools, making changes to a cloud can be painfully slow. My team has been working to speed up that process, and today, I’d like to introduce you to a simple technique that can make orchestrating your cloud 10-15% faster and way more pleasant.

Validating

Find syntax errors in centiseconds.

terraform validate quickly checks your Terraform files for obvious errors like unclosed strings or missing commas. It won’t fix the issues for you, but if you run it frequently, it can easily save you a ton of debugging time. If, for instance, you briefly trip into your editor’s command mode and delete a closing quotation mark by mistake, validate has your back. It’s the fastest checker in our lineup, so we run it all the time.

It may take the complex parser in your brain and mine a handful of seconds to find what’s wrong with the following block of code, but it takes the validator on my current project somewhere on the order of 0.05 seconds.

resource "aws_s3_bucket" "static-assets" {
  bucket        = "joe-tf-${terraform.workspace}-static-assets"
  acl           = "private"
  force_destroy = true

  website {
    index_document = "index.html"
    error_document = "index.html
  }

  logging {
    target_bucket = aws_s3_bucket.logs.id
    target_prefix = "log/static-assets/
  }
}

Linting

Find invalid values in deciseconds.

tflint further reduces the cognitive overhead of working with a large cloud service library by checking for provider-specific invalid values in your code. terraform validate won’t save you if you ask for an EC2 instance type that doesn’t exist (say, "t1.2xlarge" when you meant "t2.2xlarge"), but tflint will warn you about it almost immediately. It’s a bit slower than simple validation and much faster than any of the remaining steps, so it comes second in our line of automated assistants. Like any good assistive technology, the linter won’t write code for you, but it can help you outperform an unaided human developer by handling most of the drudge work for you.

On my current project, linting takes about 0.75 seconds.
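
If you want to collect rough numbers like these for your own project, one option is Bash's built-in time keyword. The sketch below is only an illustration: time_step is a hypothetical helper name, and it assumes terraform and tflint are installed and that you run it from an initialized Terraform directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one command and report its wall-clock time on stderr.
# Usage: time_step LABEL COMMAND [ARGS...]
# The label is placed in TIMEFORMAT, so it should not contain a "%".
time_step() {
  local label="$1"
  shift
  local TIMEFORMAT="${label}: %R seconds"  # %R = elapsed real time in seconds
  time "$@"
}

# Example (assumes terraform and tflint are installed):
#   time_step validate terraform validate
#   time_step lint tflint
```

The helper returns the exit status of the command it ran, so it can still be used in a script that stops on the first failure.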

Planning

Find unintended consequences in decaseconds.

Cloud infrastructure is complicated (and complex), and most of the tools that a developer might use to visualize it produce weird, incoherent results. This isn’t all that surprising, when you think about all the complexity that goes into these services. The everything-as-a-service revolution is in an awkward adolescent phase. Like my friends and me during high school, serverless strategies consume an absurd amount of resources, and they outgrow their documentation as fast as it can be written.

This is what makes orchestration platforms such a boon. CloudFormation and Terraform take the crazy landscape of Microsoft, Google, and Amazon services and turn it into code that you can commit, review, and diff. Planning is where that strategy becomes especially powerful.

When I run terraform plan for my current (AWS-based) project, my computer prepares a dependency graph of every resource that I’ve declared in a .tf file. It then reaches out to AWS and asks about the current state of each of those resources. Much like React, it compares the resulting dependency trees and generates a minimal diff, telling me what needs to change in my AWS account to make my infrastructure dreams real.
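
If you want a script to react to the outcome of a plan, terraform plan accepts a -detailed-exitcode flag: it exits with 0 when nothing would change, 1 on error, and 2 when there is a diff to apply. Here is a minimal sketch, where plan_status is a hypothetical helper name and the summary strings are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize the result of `terraform plan`.
# With -detailed-exitcode, terraform exits 0 (no changes), 1 (error),
# or 2 (changes present).
plan_status() {
  local code=0
  # -input=false keeps terraform from prompting for missing variables.
  terraform plan -detailed-exitcode -input=false || code=$?
  case "$code" in
    0) echo "no changes" ;;
    2) echo "changes pending" ;;
    *) echo "plan failed" >&2; return 1 ;;
  esac
}

# Example (run inside an initialized Terraform directory):
#   plan_status
```

Because 2 is a non-zero exit code, a script running under set -e would treat a pending diff as a failure unless the code is captured the way it is here.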

This planning step takes a while because it has to reach out to remote services and do some reasonably complicated graph building locally, but having Terraform handle it for me is way faster than checking all the resources manually. This has the added benefit of pointing out any cycles that I may have created in my dependency graph. Neither my pair nor I noticed the cycle in the following code, but plan picked it up in a few seconds.

resource "aws_lambda_function" "graphql-endpoint" {
  function_name = "joe-tf-${terraform.workspace}-graphql-endpoint"
  runtime       = "nodejs10.x"
  handler       = "lambda.handler"
  s3_bucket     = aws_s3_bucket_object.graphql-endpoint-code-bundle.bucket
  s3_key        = aws_s3_bucket_object.graphql-endpoint-code-bundle.key
  memory_size   = 1024
  timeout       = 120
  role          = aws_iam_role.lambda-execution-role.arn
  environment {
    variables = {
      NODE_ENV = "production"
      API_URL  = aws_api_gateway_deployment.graphql-deployment.invoke_url
    }
  }
}

resource "aws_api_gateway_rest_api" "graphql-api" {
  name                     = "joe-tf-${terraform.workspace}-graphql-api"
  description              = "Proxy to handle requests to our GraphQL API"
  minimum_compression_size = 1000
  binary_media_types       = ["application/octet-stream"]
}

resource "aws_api_gateway_resource" "graphql-api-endpoint" {
  rest_api_id = aws_api_gateway_rest_api.graphql-api.id
  parent_id   = aws_api_gateway_rest_api.graphql-api.root_resource_id
  path_part   = "graphql"
}

resource "aws_api_gateway_method" "graphql-http-method" {
  rest_api_id      = aws_api_gateway_rest_api.graphql-api.id
  resource_id      = aws_api_gateway_resource.graphql-api-endpoint.id
  http_method      = "POST"
  api_key_required = "true"
  request_models = {
    "application/json" = aws_api_gateway_model.graphql-request-schema.name
  }
}

resource "aws_api_gateway_method_response" "graphql-proxy-200" {
  rest_api_id = aws_api_gateway_rest_api.graphql-api.id
  resource_id = aws_api_gateway_resource.graphql-api-endpoint.id
  http_method = aws_api_gateway_method.graphql-http-method.http_method
  status_code = "200"
  response_parameters = {
    "method.response.header.Access-Control-Allow-Origin" = true
  }
  response_models = {
    "application/json" = aws_api_gateway_model.graphql-response-schema.name
  }
}

resource "aws_api_gateway_integration" "graphql-proxy" {
  rest_api_id             = aws_api_gateway_rest_api.graphql-api.id
  resource_id             = aws_api_gateway_method.graphql-http-method.resource_id
  http_method             = aws_api_gateway_method.graphql-http-method.http_method
  integration_http_method = "POST"
  type                    = "AWS_PROXY"
  uri                     = aws_lambda_function.graphql-endpoint.invoke_arn
}

resource "aws_api_gateway_deployment" "graphql-deployment" {
  rest_api_id = aws_api_gateway_rest_api.graphql-api.id
  depends_on  = [aws_api_gateway_integration.graphql-proxy]
}

resource "aws_api_gateway_stage" "graphql-stage" {
  stage_name           = terraform.workspace
  rest_api_id          = aws_api_gateway_rest_api.graphql-api.id
  deployment_id        = aws_api_gateway_deployment.graphql-deployment.id
  xray_tracing_enabled = true
}

resource "aws_api_gateway_usage_plan" "api-usage-plan" {
  name = "joe-tf-${terraform.workspace}-usage-plan"
  api_stages {
    api_id = aws_api_gateway_stage.graphql-stage.rest_api_id
    stage  = aws_api_gateway_stage.graphql-stage.stage_name
  }
  quota_settings {
    limit  = 10000
    period = "WEEK"
  }
  throttle_settings {
    burst_limit = 200
    rate_limit  = 10
  }
}

The cycle is: graphql-endpoint → graphql-deployment → graphql-proxy → graphql-endpoint
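
Finding a cycle like this amounts to attempting a topological sort of the dependency graph. As an illustration only (this is not a description of Terraform's internals), you can feed the three references above to the standard tsort utility, which complains when its input cannot be ordered. The has_cycle helper is hypothetical, and the check assumes your tsort writes a diagnostic to stderr when it finds a loop, as the GNU and BSD versions do.

```shell
#!/usr/bin/env bash
# Each line reads "A B", meaning "A refers to B".
# The edges are taken from the resources in the listing above.
edges='graphql-endpoint graphql-deployment
graphql-deployment graphql-proxy
graphql-proxy graphql-endpoint'

# Hypothetical helper: succeed if tsort reports a loop in the given edge list.
has_cycle() {
  local diagnostics
  # Capture stderr only; the sorted output on stdout is discarded.
  diagnostics=$(printf '%s\n' "$1" | tsort 2>&1 >/dev/null) || true
  [ -n "$diagnostics" ]
}

if has_cycle "$edges"; then
  echo "cycle detected"
else
  echo "no cycle"
fi
```

Removing any one of the three edges, for example by no longer passing the deployment's invoke_url into the Lambda's environment, leaves a graph that tsort can order.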

On my current project, plan takes about 30 seconds.

Applying

Roll out production hardware in hectoseconds.

Even if plan thinks that everything is hunky-dory, I won’t 100% know that I’m done until I’ve actually deployed the new infrastructure. Sometimes, existing resources in AWS conflict with the plan in ways that can’t be predicted.

To avoid that headache, I always run terraform apply in a testing workspace prior to rollout. This gives Terraform one last chance to help me out. Because it spins up actual services and virtual machine instances, apply is the most variable and slowest of all the steps in my pipeline. I try to avoid running it until all of the other steps pass consistently. You could theoretically add it to the end of the automatic process that I set up in the next section, but due to the highly variable timing of deploying AWS services, I recommend against doing so.
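
A trial run like that could be sketched as follows. The apply_in_workspace helper and the "testing" workspace name are assumptions for illustration; only the terraform workspace and terraform apply subcommands come from Terraform itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the current configuration in a scratch workspace.
# "testing" is an assumed workspace name; substitute whatever your team uses.
apply_in_workspace() {
  local workspace="${1:-testing}"
  # Switch to the workspace, creating it first if it does not exist yet.
  terraform workspace select "$workspace" 2>/dev/null ||
    terraform workspace new "$workspace" || return 1
  # apply shows its plan and asks for confirmation before changing anything.
  terraform apply
}

# Example (run inside an initialized Terraform directory):
#   apply_in_workspace testing
```

Since the resource names in this post interpolate terraform.workspace, everything created this way is named after the scratch workspace rather than the production one.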

Tying Them All Together

Running each of these steps is easy enough, but remembering to run them all is boring and repetitive, so I make the computer do it for me! 😀 Terraform doesn’t come with a built-in watcher, but you can easily build your own using a few off-the-shelf Unix tools: fd, entr, and Bash.

fd is really good at finding files quickly, so I use it to find all the Terraform files in my project. fd '\.(tf|tfvars)$' usually works for me. You can accomplish the same thing using find, if you’d rather not install an additional tool.
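
For reference, a find-based equivalent might look like the sketch below. The list_terraform_files name is hypothetical. One difference to keep in mind: fd skips hidden directories by default, while find does not, so this version prunes the .terraform directory where terraform init keeps downloaded modules.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Terraform files under a directory using find.
# The .terraform directory is pruned so that downloaded modules are not listed.
list_terraform_files() {
  find "${1:-.}" -type d -name .terraform -prune -o \
    -type f \( -name '*.tf' -o -name '*.tfvars' \) -print
}

# Example:
#   list_terraform_files .
```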

entr is great at reacting to file changes, so I use it to run each of the Terraform checking steps listed above any time I make changes to any Terraform file. entr takes two inputs:

  1. A list of files to observe (passed via stdin)
  2. A script that should be run when any of them change (passed as positional arguments)

We already have a handy tool for generating Argument 1. For Argument 2, we just need to factor our Terraform checks into a Bash script. I usually run something like this:

fd '\.(tf|tfvars)$' | entr terraform-check

That automatically runs my entire pipeline whenever I hit ⌘S.
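
One thing to be aware of with a pipeline like this is that the file list is generated once, when the watcher starts, so a .tf file created afterwards is not watched. entr has a -d flag for this case: it also tracks the directories of the files it was given and exits with status 2 when a new file appears, which lets a loop rebuild the list. A minimal sketch, where watch_terraform is a hypothetical function name and the -c flag (clear the screen before each run) is optional:

```shell
#!/usr/bin/env bash
# Hypothetical variant of the watcher: re-list files whenever entr exits
# because a new file appeared in a watched directory.
watch_terraform() {
  local code=2
  while [ "$code" -eq 2 ]; do   # entr exits with 2 when -d sees a new file
    code=0
    # -c clears the screen before each run; -d also watches the files' directories.
    fd '\.(tf|tfvars)$' | entr -c -d terraform-check || code=$?
  done
  return "$code"
}

# Example (assumes fd, entr, and a terraform-check script on your $PATH):
#   watch_terraform
```

Any other exit status, such as the 0 that entr returns after Ctrl-C, ends the loop.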

Bash is the standard shell on just about every Unix system (for now), so I use it to tie my other tools together. The syntax can be a bit fiddly, so I use the excellent shellcheck extension for Visual Studio Code (backed by a standalone tool of the same name) to keep my scripts tidy. For my Terraform automation project, I maintain a small terraform-check script that looks like this:

#!/usr/bin/env bash
set -e  # Stop the script if any step fails

terraform validate
tflint
terraform plan

I also captured the steps to set up this pipeline in another script, terraform-watch, which looks something like this:

#!/usr/bin/env bash

fd '\.(tf|tfvars)$' | entr terraform-check

Whenever I want to make infrastructure changes, I just open a new terminal and run terraform-watch (I use direnv to add scripts like this to my $PATH automatically). As I edit infrastructure files, I have a friendly machine dutifully checking all of my work in real time, so I can find errors faster and focus on the interesting parts of building cloud infrastructure for my clients.