
Infrastructure as Code

Managing infrastructure through version-controlled, reproducible configuration.

Overview

Infrastructure as Code (IaC) means that every server, database, network rule, and cloud resource is defined in a file that can be read, reviewed, versioned, and reproduced. A configuration file that creates a database instance is worth more than tribal knowledge about how someone clicked through the AWS console six months ago.

This page covers how we manage infrastructure through code, which tools we use, and the practices that prevent the most common failure modes: configuration drift, undocumented changes, and secrets in the wrong places.


Why It Matters

Manual infrastructure changes are undocumented by default. A change made through a cloud console leaves no record in the code. Nobody knows what changed, when, or why. Nobody can reproduce it in a new environment. Nobody can audit it. IaC is the only way to make infrastructure changes as reviewable as application changes.

IaC enables consistent environments. Staging should mirror production. With IaC, "staging mirrors production" is not an aspiration — it is the direct consequence of both environments being created by the same module with different variable values. Without IaC, drift between environments is inevitable.

Infrastructure changes go through code review. A PR for a Terraform change can be reviewed, questioned, discussed, and reverted like any other code change. A console change bypasses all of that. Code review catches mistakes before they affect production.

Disaster recovery becomes mechanical. If a region goes down or an environment is accidentally deleted, IaC means recovery is a terraform apply away — not a multi-day reconstruction from memory, screenshots, and tribal knowledge.


Standards & Best Practices

All infrastructure lives in version control — no console exceptions

Every resource managed by the team must be defined in IaC. Manual console changes are forbidden in production. There is no "I'll add it to the code later" — later never comes, and now there is drift.

If you need to make an emergency change through the console during an incident, the first commit after the incident closes is the IaC equivalent of what was changed manually. The change is not done until it is in code.

Plan before apply — always

Running terraform apply without reviewing terraform plan output is equivalent to deploying without reading the diff. The plan shows exactly what will be created, modified, or destroyed. Never apply without reviewing it.

For changes that destroy or replace resources — a ~ that becomes a -/+ — slow down. Understand why the resource is being replaced before applying. The pattern is always: generate a saved plan, review it completely, then apply the saved plan — never run apply directly without a preceding review step.
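The saved-plan pattern can be sketched as a small helper script. This is an illustrative sketch, not team policy — the directory path and plan filename are assumptions, and the function is shown undisturbed so each step can be read in order.

```shell
#!/usr/bin/env bash
# Sketch of the saved-plan workflow. Path and filenames are illustrative.
plan_and_apply() {
  cd infra/environments/production || return 1

  # 1. Write the plan to a file instead of applying directly.
  terraform plan -out=tfplan

  # 2. Review the saved plan in full; stop here if any resource is
  #    unexpectedly replaced (a ~ that became a -/+).
  terraform show tfplan

  # 3. Apply exactly the plan that was reviewed — no re-planning in between.
  terraform apply tfplan
}
```

Applying the saved file (rather than re-running `terraform apply`) guarantees the change that ships is the change that was reviewed, even if the remote environment moved in the meantime.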

Step-by-step workflows and team-specific apply conventions will be documented in a future IaC Change SOP.

Remote state with locking

Local state files (terraform.tfstate) committed to git cause concurrent apply conflicts and leak resource metadata into the repository. Use remote state with locking from day one.

# backend.tf
terraform {
  backend "s3" {
    bucket         = "your-org-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

The DynamoDB table provides locking — two engineers cannot apply simultaneously. The S3 bucket stores the state remotely and retains history.

Modules for reusable patterns

Repeating infrastructure blocks inline (two nearly identical ECS service definitions copy-pasted) leads to drift. One gets updated, the other doesn't. Extract reusable infrastructure into modules.

infra/
├── modules/
│   ├── ecs-service/       # Reusable ECS service module
│   ├── rds-postgres/      # Reusable Postgres RDS module
│   └── alb-target-group/  # Reusable ALB target group module
├── environments/
│   ├── staging/
│   │   ├── main.tf        # Calls modules with staging vars
│   │   └── variables.tf
│   └── production/
│       ├── main.tf        # Same modules, production vars
│       └── variables.tf

Staging and production use the same modules with different variable values. They are not separate codebases.

Secrets are managed separately from IaC

IaC provisions the secret slot — the Secrets Manager secret, the Kubernetes Secret, the Parameter Store path. It does not provision the secret value. Secrets in .tf files or in state are visible to anyone with access to the repository or state bucket.

# Correct: provision the slot, not the value
resource "aws_secretsmanager_secret" "db_password" {
  name = "production/myapp/database-password"
}

# The actual value is set out-of-band (CLI, vault, CI secret)

Use a secrets manager (AWS Secrets Manager, HashiCorp Vault, Doppler) and populate values through a controlled path that does not touch the IaC repository.
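As one sketch of such a controlled path, the value can be pushed with the AWS CLI. The secret name matches the slot provisioned above; `DB_PASSWORD` is assumed to come from the operator's shell or a CI secret, never from a file in the repository.

```shell
#!/usr/bin/env bash
# Illustrative out-of-band population of a secret value via the AWS CLI.
set_db_password() {
  # ${DB_PASSWORD:?...} fails loudly if the value was not provided
  # through the environment (operator shell or CI secret store).
  aws secretsmanager put-secret-value \
    --secret-id "production/myapp/database-password" \
    --secret-string "${DB_PASSWORD:?DB_PASSWORD must be set in the environment}"
}
```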

Staging and production use the same modules, different variables

If staging and production use entirely different IaC configs, they will drift. A module update applied to production but not staging means staging no longer mirrors production. The same module with different variable files is the correct structure.
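As a sketch, a staging `main.tf` calling the shared ECS service module might look like this. The module inputs mirror the minimal `ecs-service` module in Tools & Templates; the `api` name and `var.api_image` are illustrative.

```hcl
# environments/staging/main.tf — same module source as production;
# only the variable values differ.
module "api" {
  source        = "../../modules/ecs-service"
  name          = "api-staging"
  image         = var.api_image
  desired_count = 1  # production sets this higher
}
```

Production's `main.tf` contains the same `module` block with its own values, so a module change lands in both environments on the next apply.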


How to Implement

Step 1 — Choose your IaC tool

| Tool | Language | Strengths | Best for |
|---|---|---|---|
| Terraform / OpenTofu | HCL | Mature, large provider ecosystem, plan/apply workflow | Most teams, multi-cloud |
| Pulumi | TypeScript/Python | Code-based (no DSL), strong typing | Teams that prefer programmatic config |
| AWS CDK | TypeScript/Python | Native AWS, generates CloudFormation | AWS-only environments |
| CloudFormation | YAML/JSON | Native AWS, no extra tooling | Simpler AWS-only projects |

For most teams: Terraform (or its open-source fork OpenTofu) is the right default. It has the largest provider ecosystem, the most documentation, and the widest community.

Step 2 — Set up remote state

Create the S3 bucket and DynamoDB table before writing any other Terraform. These backend resources cannot be managed by the Terraform whose state they will store — solve this chicken-and-egg problem by creating them manually (once, with the console or CLI) or with a bootstrap script:

#!/usr/bin/env bash
# scripts/bootstrap-state.sh
REGION="eu-west-1"
BUCKET="your-org-terraform-state"
TABLE="terraform-locks"

aws s3api create-bucket \
  --bucket "${BUCKET}" \
  --region "${REGION}" \
  --create-bucket-configuration LocationConstraint="${REGION}"

aws s3api put-bucket-versioning \
  --bucket "${BUCKET}" \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name "${TABLE}" \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region "${REGION}"

Run this once. After this, all state lives remotely.

Step 3 — Structure the project

infra/
├── modules/
│   └── ecs-service/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── environments/
    ├── staging/
    │   ├── backend.tf
    │   ├── main.tf
    │   ├── variables.tf
    │   └── terraform.tfvars
    └── production/
        ├── backend.tf
        ├── main.tf
        ├── variables.tf
        └── terraform.tfvars
.github/
└── workflows/
    └── terraform.yml      # At the repository root, where GitHub Actions expects it

Each environment directory has its own backend configuration and its own terraform.tfvars. The modules directory contains reusable building blocks.

Step 4 — Add CI: plan on PR, apply on merge

# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    paths:
      - 'infra/**'
  push:
    branches: [main]
    paths:
      - 'infra/**'

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # required for the plan-comment step below
    defaults:
      run:
        working-directory: infra/environments/production
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.2

      - name: Terraform init
        run: terraform init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Terraform plan
        id: plan
        run: terraform plan -no-color
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Post plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '```\n${{ steps.plan.outputs.stdout }}\n```'
            })

  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    defaults:
      run:
        working-directory: infra/environments/production
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.2
      - run: terraform init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - run: terraform apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Plan output is posted as a PR comment so reviewers can see what will change before approving. Apply runs automatically on merge.

Step 5 — Import existing infrastructure before managing it

If the team is adopting IaC for existing infrastructure, import existing resources rather than destroying and recreating them. Destroying and recreating causes downtime, loses data, and is unnecessary.

# Import an existing RDS instance
terraform import aws_db_instance.main my-existing-db-identifier

# Import an existing S3 bucket
terraform import aws_s3_bucket.assets my-existing-bucket-name

Write the Terraform resource block first, run terraform import, then terraform plan to confirm the import matches the actual resource configuration.
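That three-step sequence can be sketched as a script. The resource names come from the snippet above; the `-detailed-exitcode` check is one way to make "the import matches" machine-verifiable (exit 0 means no changes, 2 means a diff remains).

```shell
#!/usr/bin/env bash
# Sketch of the import-then-verify sequence for an existing RDS instance.
import_and_verify() {
  # Precondition: the aws_db_instance.main resource block already
  # exists in the .tf files (import maps the real resource onto it).
  terraform import aws_db_instance.main my-existing-db-identifier

  # Exit 0: plan is empty, the block matches the real resource.
  # Exit 2: a diff remains — adjust the block until plan is clean.
  terraform plan -detailed-exitcode \
    || echo "Plan shows a diff — align the resource block with reality."
}
```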


Tools & Templates

Minimal Terraform module (modules/ecs-service/)

# modules/ecs-service/variables.tf
variable "name" {
  type        = string
  description = "Service name"
}

variable "image" {
  type        = string
  description = "Container image URI"
}

variable "desired_count" {
  type        = number
  description = "Number of running tasks"
  default     = 2
}

# modules/ecs-service/main.tf
# (the task definition and cluster data source referenced below are
# defined elsewhere in the module; omitted here for brevity)
resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = data.aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count

  deployment_controller {
    type = "ECS"
  }
}

S3 remote state backend

# environments/production/backend.tf
terraform {
  backend "s3" {
    bucket         = "your-org-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  required_version = ">= 1.7"
}

Common Pitfalls

Console changes that drift from IaC. "I'll add it to Terraform later" is a promise that is almost never kept. The console change disappears from collective memory, the Terraform state diverges from reality, and the next terraform plan proposes to undo the manual change. The rule is simple: if you make a console change, the IaC commit is part of the same task, not a follow-up.

Secrets hardcoded in .tf files or variable files. database_password = "mysecretpassword" in a .tfvars file is a secret in version control. Use sensitive = true on variables and populate them from a secrets manager, not from files committed to the repository.
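A sketch of the variable-based alternative (the variable name is illustrative):

```hcl
# variables.tf — declare the secret as sensitive; no default, no committed value.
variable "database_password" {
  type      = string
  sensitive = true  # redacted from plan/apply output (note: still present in state)
}

# Supply the value at runtime, e.g. as TF_VAR_database_password
# injected from CI secrets or a secrets manager — never from a
# .tfvars file committed to the repository.
```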

Local state committed to git. terraform.tfstate in the repository means merge conflicts on every apply, secrets in the git history (state contains resource attributes), and no locking. Set up remote state before applying anything.

Applying without reviewing the plan. terraform apply -auto-approve without reading the plan output skips the most important safeguard IaC provides. The plan is the diff — read it before applying.

One giant main.tf instead of modules. A single file with 500 lines of Terraform is harder to reason about than a set of small, named modules. Extract logical units (a service, a database, a network) into modules from the start, not after the file becomes unmanageable.