Deployment Automation
Strategies for reliable, repeatable, and auditable deployments.
Overview
Manual deployments are a liability. Under normal conditions, a careful engineer follows the runbook and things go fine. Under incident conditions — when something is already broken, everyone is watching, and the pressure to move fast is highest — manual steps get skipped, sequences get confused, and "I thought I did that" becomes the post-mortem finding.
Deployment automation removes the human from the deployment path: the same sequence runs every time, regardless of who triggered it, what time it is, or how much pressure is on. This page covers how we automate deployments, which strategies we use, and how we ensure every deployment is reversible.
CI/CD pipelines trigger deployments — see CI/CD Pipeline Best Practices for how the pipeline is structured. This page focuses on what happens inside the deployment step.
Why It Matters
Manual steps fail under pressure. Incidents amplify exactly the conditions that produce mistakes: time pressure, stress, partial information. A 12-step manual deployment runbook will be followed perfectly 90% of the time and skipped or mis-sequenced exactly when it matters most. Automation removes this failure mode.
You can only roll back what was done consistently. Rollback is only reliable if deployments are atomic and reproducible. A manual deployment that partially applied three of five steps has no clean inverse. An automated deployment that runs the same script every time has a known, reversible state.
Consistent regardless of who deploys. Manual deployments accumulate tribal knowledge: "you have to restart the queue worker after migrating, but only in production, not staging." Automation encodes that knowledge so it runs every time, without requiring the one person who remembers it.
Deployment and release are not the same thing. Automated deployment ships code to production. Release — making a feature visible to users — is a separate decision. Feature flags decouple these two events: you can deploy on Tuesday morning, observe the system, and release on Thursday afternoon when the team is ready.
Standards & Best Practices
No manual steps in production deployments
If a step is required for a deployment to succeed, it must be in the automation. A step that is "usually done manually after the deploy" is a step that will be forgotten. Document the gap honestly if automation isn't yet possible — but treat it as a debt to close, not a permanent arrangement.
Choose the right deployment strategy
| Strategy | How it works | Best for | Risk if it fails |
|---|---|---|---|
| Recreate | Stop old version, start new version | Non-critical services, acceptable downtime | Full outage during switch |
| Rolling | Replace instances one at a time | Stateless services with multiple instances | Partial rollout mid-flight |
| Blue-green | Run two environments, cut traffic over | Zero-downtime requirements | Double infrastructure cost |
| Canary | Route a percentage of traffic to the new version | High-traffic services, gradual validation | Failures hit only the canary percentage of users |
Most stateless web services should use rolling or blue-green. Recreate is acceptable for batch workers and background services where brief downtime is tolerable. Canary requires traffic shaping infrastructure (a load balancer that can split by percentage).
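To make the cutover step concrete, here is a minimal blue-green switch for services behind an AWS Application Load Balancer. This is a sketch, not a full implementation; the listener and target group ARNs are placeholders for your environment:

#!/usr/bin/env bash
# Blue-green cutover sketch: point the ALB listener at the "green" target group.
# LISTENER_ARN and GREEN_TG_ARN are placeholders for your environment.
set -euo pipefail
aws elbv2 modify-listener \
  --listener-arn "${LISTENER_ARN}" \
  --default-actions "Type=forward,TargetGroupArn=${GREEN_TG_ARN}"
# Rollback is the same call with the "blue" target group ARN.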
Every production deployment is tagged
Every deploy to production must be tied to a specific git SHA and a human-readable release tag. This answers the question "what is currently running in production?" without requiring anyone to remember or guess. The deploy automation reads the tag, deploys it, and records the deployed SHA in your observability platform.
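As one concrete option for the recording step, a sketch assuming Grafana is the observability platform (the hostname and token are placeholders):

# Sketch: record the deployed SHA as a deploy marker annotation.
# Assumes Grafana; grafana.example.com and GRAFANA_TOKEN are placeholders.
curl -s -X POST "https://grafana.example.com/api/annotations" \
  -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"Deployed ${IMAGE_TAG}\", \"tags\": [\"deploy\", \"production\"]}"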
For tagging conventions, versioning schemes, and how tagging integrates with feature flags and progressive delivery, see Release Management.
Health checks before traffic cutover
Never route traffic to a new deployment before confirming it is healthy. A health check is an HTTP endpoint (typically /health or /readyz) that returns 200 only when the service is fully initialised — database connected, caches warm, dependencies reachable.
# Example: wait for health before proceeding
- name: Wait for health
run: |
for i in {1..30}; do
status=$(curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health)
if [ "$status" = "200" ]; then
echo "Service is healthy"
exit 0
fi
echo "Attempt $i: status $status, retrying..."
sleep 10
done
echo "Health check failed after 5 minutes"
exit 1

If the health check fails, the deployment should be automatically rolled back — not left in a broken state for someone to notice.
Rollback must be faster than rollforward
Before deploying, know how to roll back. The rollback procedure should be a single command or a single pipeline trigger, not a multi-step manual process. Test the rollback in staging before you need it in production.
A common pattern: keep the previous container image tagged and available. Rollback means re-deploying the previous tag — the same automation, different version.
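A minimal sketch of that pattern, assuming the pipeline creates deploy-<sha> tags as in Step 1 below and the tags have been fetched locally:

#!/usr/bin/env bash
# Rollback sketch: find the second-newest deploy tag and re-deploy it.
# Assumes the pipeline tags every deploy as deploy-<sha>.
set -euo pipefail
PREVIOUS_TAG=$(git tag --list 'deploy-*' --sort=-creatordate | sed -n '2p')
echo "Rolling back to ${PREVIOUS_TAG}..."
IMAGE_TAG="${PREVIOUS_TAG#deploy-}" ./scripts/deploy.sh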
Feature flags decouple deployment from release
Ship code dark. Enable it deliberately. This pattern means:
- Deployments happen continuously and are low-risk (no user-visible change)
- Releases are a controlled decision (turn on the flag when the team is ready)
- Rollback a feature means turning off the flag, not reverting a deploy
How to Implement
Step 1 — Automate the deploy job in CI/CD
The deploy job runs automatically when CI passes on main. It never runs on feature branches.
# .github/workflows/deploy.yml
name: Deploy to Production
on:
workflow_run:
workflows: [CI]
types: [completed]
branches: [main]
jobs:
deploy:
if: ${{ github.event.workflow_run.conclusion == 'success' }}
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Deploy
run: ./scripts/deploy.sh
env:
DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
IMAGE_TAG: ${{ github.event.workflow_run.head_sha }}
- name: Wait for health check
run: ./scripts/healthcheck.sh https://api.example.com/health
- name: Create GitHub release
uses: softprops/action-gh-release@v2
with:
tag_name: deploy-${{ github.event.workflow_run.head_sha }}
name: Deploy ${{ github.event.workflow_run.head_sha }}
generate_release_notes: true

Step 2 — Write a deploy script that is idempotent
The deploy script should be safe to run twice. If a deploy fails halfway through and is re-triggered, it should not leave the system in a broken state. Prefer declarative deployment targets (Kubernetes manifests, ECS task definitions) over imperative shell scripts where possible.
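For example, on Kubernetes the deploy step can lean on the declarative API. A sketch, assuming a Deployment named your-app with a container named app:

# Declarative sketch for Kubernetes. Deployment and container names are placeholders.
kubectl set image deployment/your-app app="ghcr.io/your-org/your-app:${IMAGE_TAG}"
kubectl rollout status deployment/your-app --timeout=300s

Running it twice with the same tag converges on the same state. The ECS-based script below takes the imperative route but is likewise safe to re-run: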
#!/usr/bin/env bash
# scripts/deploy.sh
set -euo pipefail
IMAGE="ghcr.io/your-org/your-app:${IMAGE_TAG}"
echo "Deploying ${IMAGE}..."
# Update ECS service (example). Note: --force-new-deployment restarts tasks
# against the current task definition; this assumes the task definition
# references the image by ${IMAGE_TAG} (e.g., a revision registered earlier
# in the pipeline, or a mutable tag the new image was pushed to).
aws ecs update-service \
--cluster production \
--service your-app \
--force-new-deployment \
--region eu-west-1
echo "Waiting for service stability..."
aws ecs wait services-stable \
--cluster production \
--services your-app \
--region eu-west-1
echo "Deploy complete."Step 3 — Implement a health check endpoint
Every service should expose a health endpoint. It should validate that the service is actually ready to handle requests — not just that the process started.
// Example: Express health endpoint. Assumes `app` is an Express instance
// and `db` is a connected database client exposing a query method.
app.get('/health', async (req, res) => {
try {
await db.query('SELECT 1');
res.json({ status: 'ok', timestamp: new Date().toISOString() });
} catch {
res.status(503).json({ status: 'unhealthy', reason: 'database unreachable' });
}
});

Return 200 only when the service is fully ready. Return 503 otherwise. Never return 200 unconditionally — that defeats the purpose.
Step 4 — Tag every production release
Use a consistent tagging scheme tied to the deploy SHA. YYYY.MM.DD-{sha} is useful for teams doing continuous deployment without semantic versioning. vMAJOR.MINOR.PATCH is better for products with versioned releases. Pick one and stick to it. The tagging step belongs in the deploy pipeline so it is automatic and consistent.
For the tagging recipe and team conventions, see Release Management.
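A sketch of the tagging step for the date-based scheme, run inside the deploy job after the deploy succeeds:

# Sketch: create and push a YYYY.MM.DD-<sha> tag for the deployed commit.
# Assumes the deploy job has the repository checked out at the deployed SHA.
TAG="$(date +%Y.%m.%d)-$(git rev-parse --short HEAD)"
git tag -a "${TAG}" -m "Production deploy ${TAG}"
git push origin "${TAG}"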
Step 5 — Document and test the rollback procedure
Before every deployment to production, the rollback procedure should be documented and verified in staging. The rollback should be a single trigger:
# Rollback: re-deploy the previous image tag (deploy.sh reads IMAGE_TAG)
IMAGE_TAG=abc1234 ./scripts/deploy.sh

Or in CI: trigger the deploy workflow manually with the previous tag as input. Do not rely on rollback procedures that have never been tested.
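A sketch of that CI route using the GitHub CLI. This assumes the deploy workflow also declares a workflow_dispatch trigger with a tag input, which the workflow in Step 1 does not yet have:

# Sketch: manually trigger the deploy workflow with a specific tag.
# Assumes deploy.yml declares "workflow_dispatch" with a "tag" input.
gh workflow run deploy.yml -f tag=abc1234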
Tools & Templates
Deployment strategy comparison (extended)
| Strategy | Downtime | Rollback speed | Infrastructure overhead | Traffic routing required |
|---|---|---|---|---|
| Recreate | Yes | Fast | None | No |
| Rolling | No | Medium | Minimal | No |
| Blue-green | No | Instant | 2× during switch | Yes |
| Canary | No | Instant | Minimal | Yes (% split) |
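For the percentage split that canary requires, a sketch using weighted target groups on an AWS Application Load Balancer (the ARNs are placeholders):

# Canary sketch: send 10% of traffic to the new version via weighted target groups.
# LISTENER_ARN, STABLE_TG_ARN, and CANARY_TG_ARN are placeholders.
aws elbv2 modify-listener \
  --listener-arn "${LISTENER_ARN}" \
  --default-actions "Type=forward,ForwardConfig={TargetGroups=[{TargetGroupArn=${STABLE_TG_ARN},Weight=90},{TargetGroupArn=${CANARY_TG_ARN},Weight=10}]}"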
GitHub environment protection rules
Use GitHub Environments to require manual approval for production deployments:
environment: production
# In GitHub settings: Environments → production → Required reviewers

This creates a human gate before automation runs — useful for teams that want to deploy on demand, not on every merge.
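The same rule can be scripted against the GitHub REST API. A sketch; the repository path and reviewer ID are placeholders:

# Sketch: require a reviewer on the production environment via the GitHub API.
# OWNER/REPO and the reviewer id are placeholders.
gh api -X PUT repos/OWNER/REPO/environments/production --input - <<'EOF'
{"reviewers": [{"type": "User", "id": 123456}]}
EOF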
Minimal healthcheck script
#!/usr/bin/env bash
# scripts/healthcheck.sh <url>
set -euo pipefail
URL="${1}"
MAX_ATTEMPTS=30
INTERVAL=10
for i in $(seq 1 $MAX_ATTEMPTS); do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "${URL}")
if [ "${STATUS}" = "200" ]; then
echo "Health check passed (attempt ${i})"
exit 0
fi
echo "Attempt ${i}/${MAX_ATTEMPTS}: got ${STATUS}, retrying in ${INTERVAL}s..."
sleep ${INTERVAL}
done
echo "Health check failed after $((MAX_ATTEMPTS * INTERVAL))s"
exit 1

Common Pitfalls
Hotfixes deployed manually, bypassing automation. The one time you most need the automation to work — during an incident — is the one time it gets bypassed. A hotfix deployed manually is undocumented, unreproducible, and may drift from what is in the repository. The correct fix is to make the automation fast enough that it is never worth bypassing.
No health checks before traffic cutover. A deployment that succeeds at the infrastructure level (container started, process running) may still be broken at the application level (database migration failed, dependency unreachable). Without a health check, traffic hits a broken service. With one, the deployment fails before traffic is cut over.
Rollback has never been tested. A rollback procedure that has only ever been read, never executed, is unreliable. Test the rollback in staging quarterly. Discover the gaps before an incident does.
Inconsistent tagging. Mixing v1.2.3, release-2024-11-14, and deploy-abc1234 in the same repository makes it impossible to answer "what is in production?" Pick one scheme and automate it.
Deployment and release conflated. If turning on a feature requires a deployment, the team waits for a deploy window before releasing. Feature flags eliminate this coupling — code can be deployed any time, and features can be enabled when the team is ready.