Deployment Automation
Strategies for reliable, repeatable, and auditable deployments.
Overview
Manual deployments are a liability. Under normal conditions, a careful engineer follows the runbook and things go fine. Under incident conditions — when something is already broken, everyone is watching, and the pressure to move fast is highest — manual steps get skipped, sequences get confused, and "I thought I did that" becomes the post-mortem finding.
Deployment automation removes the human from the deployment path: the same sequence runs every time, regardless of who triggered it, what time it is, or how much pressure is on. This page covers how we automate deployments, which strategies we use, and how we ensure every deployment is reversible.
CI/CD pipelines trigger deployments — see CI/CD Pipeline Best Practices for how the pipeline is structured. This page focuses on what happens inside the deployment step.
Why It Matters
Manual steps fail under pressure. Incidents amplify exactly the conditions that produce mistakes: time pressure, stress, partial information. A 12-step manual deployment runbook will be followed perfectly 90% of the time and skipped or mis-sequenced exactly when it matters most. Automation removes this failure mode.
You can only roll back what was done consistently. Rollback is only reliable if deployments are atomic and reproducible. A manual deployment that partially applied three of five steps has no clean inverse. An automated deployment that runs the same script every time has a known, reversible state.
Consistent regardless of who deploys. Manual deployments accumulate tribal knowledge: "you have to restart the queue worker after migrating, but only in production, not staging." Automation encodes that knowledge so it runs every time, without requiring the one person who remembers it.
Deployment and release are not the same thing. Automated deployment ships code to production. Release — making a feature visible to users — is a separate decision. Feature flags decouple these two events: you can deploy on Tuesday morning, observe the system, and release on Thursday afternoon when the team is ready.
Standards & Best Practices
No manual steps in production deployments
If a step is required for a deployment to succeed, it must be in the automation. A step that is "usually done manually after the deploy" is a step that will be forgotten. Document the gap honestly if automation isn't yet possible — but treat it as a debt to close, not a permanent arrangement.
Choose the right deployment strategy
| Strategy | How it works | Best for | Risk if it fails |
|---|---|---|---|
| Recreate | Stop old version, start new version | Non-critical services, acceptable downtime | Full outage during switch |
| Rolling | Replace instances one at a time | Stateless services with multiple instances | Partial rollout mid-flight |
| Blue-green | Run two environments, cut traffic over | Zero-downtime requirements | Double infrastructure cost |
| Canary | Route a percentage of traffic to the new version | High-traffic services, gradual validation | Failures hit only the canary percentage of users |
Most stateless web services should use rolling or blue-green. Recreate is acceptable for batch workers and background services where brief downtime is tolerable. Canary requires traffic shaping infrastructure (a load balancer that can split by percentage).
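To make the cutover step concrete, here is a minimal blue-green switch for services behind an AWS Application Load Balancer. This is a sketch, not a full implementation; the listener and target group ARNs are placeholders for your environment:

#!/usr/bin/env bash
# Blue-green cutover sketch: point the ALB listener at the "green" target group.
# LISTENER_ARN and GREEN_TG_ARN are placeholders for your environment.
set -euo pipefail
aws elbv2 modify-listener \
  --listener-arn "${LISTENER_ARN}" \
  --default-actions "Type=forward,TargetGroupArn=${GREEN_TG_ARN}"
# Rollback is the same call with the "blue" target group ARN.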
Every production deployment is tagged
Every deploy to production must be tied to a specific git SHA and a human-readable release tag. This answers the question "what is currently running in production?" without requiring anyone to remember or guess. The deploy automation reads the tag, deploys it, and records the deployed SHA in your observability platform.
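As one concrete option for the recording step, a sketch assuming Grafana is the observability platform (the hostname and token are placeholders):

# Sketch: record the deployed SHA as a deploy marker annotation.
# Assumes Grafana; grafana.example.com and GRAFANA_TOKEN are placeholders.
curl -s -X POST "https://grafana.example.com/api/annotations" \
  -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"Deployed ${IMAGE_TAG}\", \"tags\": [\"deploy\", \"production\"]}"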
For tagging conventions, versioning schemes, and how tagging integrates with feature flags and progressive delivery, see Release Management.
Health checks before traffic cutover
Never route traffic to a new deployment before confirming it is healthy. A health check is an HTTP endpoint (typically /health or /readyz) that returns 200 only when the service is fully initialised — database connected, caches warm, dependencies reachable.
# Example: wait for health before proceeding
- name: Wait for health
run: |
for i in {1..30}; do
status=$(curl -s -o /dev/null -w "%{http_code}" https://api.example.com/health)
if [ "$status" = "200" ]; then
echo "Service is healthy"
exit 0
fi
echo "Attempt $i: status $status, retrying..."
sleep 10
done
echo "Health check failed after 5 minutes"
exit 1

If the health check fails, the deployment should be automatically rolled back — not left in a broken state for someone to notice.
Rollback must be faster than rollforward
Before deploying, know how to roll back. The rollback procedure should be a single command or a single pipeline trigger, not a multi-step manual process. Test the rollback in staging before you need it in production.
A common pattern: keep the previous container image tagged and available. Rollback means re-deploying the previous tag — the same automation, different version.
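A minimal sketch of that pattern, assuming the pipeline creates deploy-<sha> tags as in Step 1 below and the tags have been fetched locally:

#!/usr/bin/env bash
# Rollback sketch: find the second-newest deploy tag and re-deploy it.
# Assumes the pipeline tags every deploy as deploy-<sha>.
set -euo pipefail
PREVIOUS_TAG=$(git tag --list 'deploy-*' --sort=-creatordate | sed -n '2p')
echo "Rolling back to ${PREVIOUS_TAG}..."
IMAGE_TAG="${PREVIOUS_TAG#deploy-}" ./scripts/deploy.sh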
Feature flags decouple deployment from release
Ship code dark. Enable it deliberately. This pattern means:
- Deployments happen continuously and are low-risk (no user-visible change)
- Releases are a controlled decision (turn on the flag when the team is ready)
- Rollback a feature means turning off the flag, not reverting a deploy
How to Implement
Step 1 — Automate the deploy job in CI/CD
The deploy job runs automatically when CI passes on main. It never runs on feature branches.
# .github/workflows/deploy.yml
name: Deploy to Production
on:
workflow_run:
workflows: [CI]
types: [completed]
branches: [main]
jobs:
deploy:
if: ${{ github.event.workflow_run.conclusion == 'success' }}
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Deploy
run: ./scripts/deploy.sh
env:
DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
IMAGE_TAG: ${{ github.event.workflow_run.head_sha }}
- name: Wait for health check
run: ./scripts/healthcheck.sh https://api.example.com/health
- name: Create GitHub release
uses: softprops/action-gh-release@v2
with:
tag_name: deploy-${{ github.event.workflow_run.head_sha }}
name: Deploy ${{ github.event.workflow_run.head_sha }}
generate_release_notes: true

Step 2 — Write a deploy script that is idempotent
The deploy script should be safe to run twice. If a deploy fails halfway through and is re-triggered, it should not leave the system in a broken state. Prefer declarative deployment targets (Kubernetes manifests, ECS task definitions) over imperative shell scripts where possible.
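For example, on Kubernetes the deploy step can lean on the declarative API. A sketch, assuming a Deployment named your-app with a container named app:

# Declarative sketch for Kubernetes. Deployment and container names are placeholders.
kubectl set image deployment/your-app app="ghcr.io/your-org/your-app:${IMAGE_TAG}"
kubectl rollout status deployment/your-app --timeout=300s

Running it twice with the same tag converges on the same state. The ECS-based script below takes the imperative route but is likewise safe to re-run: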
#!/usr/bin/env bash
# scripts/deploy.sh
set -euo pipefail
IMAGE="ghcr.io/your-org/your-app:${IMAGE_TAG}"
echo "Deploying ${IMAGE}..."
# Update ECS service (example). Note: --force-new-deployment restarts tasks
# against the current task definition; this assumes the task definition
# references the image by ${IMAGE_TAG} (e.g., a revision registered earlier
# in the pipeline, or a mutable tag the new image was pushed to).
aws ecs update-service \
--cluster production \
--service your-app \
--force-new-deployment \
--region eu-west-1
echo "Waiting for service stability..."
aws ecs wait services-stable \
--cluster production \
--services your-app \
--region eu-west-1
echo "Deploy complete."Step 3 — Implement a health check endpoint
Every service should expose a health endpoint. It should validate that the service is actually ready to handle requests — not just that the process started.
// Example: Express health endpoint. Assumes `app` is an Express instance
// and `db` is a connected database client exposing a query method.
app.get('/health', async (req, res) => {
try {
await db.query('SELECT 1');
res.json({ status: 'ok', timestamp: new Date().toISOString() });
} catch {
res.status(503).json({ status: 'unhealthy', reason: 'database unreachable' });
}
});

Return 200 only when the service is fully ready. Return 503 otherwise. Never return 200 unconditionally — that defeats the purpose.
Step 4 — Tag every production release
Use a consistent tagging scheme tied to the deploy SHA. YYYY.MM.DD-{sha} is useful for teams doing continuous deployment without semantic versioning. vMAJOR.MINOR.PATCH is better for products with versioned releases. Pick one and stick to it. The tagging step belongs in the deploy pipeline so it is automatic and consistent.
For the tagging recipe and team conventions, see Release Management.
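A sketch of the tagging step for the date-based scheme, run inside the deploy job after the deploy succeeds:

# Sketch: create and push a YYYY.MM.DD-<sha> tag for the deployed commit.
# Assumes the deploy job has the repository checked out at the deployed SHA.
TAG="$(date +%Y.%m.%d)-$(git rev-parse --short HEAD)"
git tag -a "${TAG}" -m "Production deploy ${TAG}"
git push origin "${TAG}"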
Step 5 — Document and test the rollback procedure
Before every deployment to production, the rollback procedure should be documented and verified in staging. The rollback should be a single trigger:
# Rollback: re-deploy the previous image tag (deploy.sh reads IMAGE_TAG)
IMAGE_TAG=abc1234 ./scripts/deploy.sh

Or in CI: trigger the deploy workflow manually with the previous tag as input. Do not rely on rollback procedures that have never been tested.
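A sketch of that CI route using the GitHub CLI. This assumes the deploy workflow also declares a workflow_dispatch trigger with a tag input, which the workflow in Step 1 does not yet have:

# Sketch: manually trigger the deploy workflow with a specific tag.
# Assumes deploy.yml declares "workflow_dispatch" with a "tag" input.
gh workflow run deploy.yml -f tag=abc1234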
Tools & Templates
Deployment strategy comparison (extended)
| Strategy | Downtime | Rollback speed | Infrastructure overhead | Traffic routing required |
|---|---|---|---|---|
| Recreate | Yes | Fast | None | No |
| Rolling | No | Medium | Minimal | No |
| Blue-green | No | Instant | 2× during switch | Yes |
| Canary | No | Instant | Minimal | Yes (% split) |
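For the percentage split that canary requires, a sketch using weighted target groups on an AWS Application Load Balancer (the ARNs are placeholders):

# Canary sketch: send 10% of traffic to the new version via weighted target groups.
# LISTENER_ARN, STABLE_TG_ARN, and CANARY_TG_ARN are placeholders.
aws elbv2 modify-listener \
  --listener-arn "${LISTENER_ARN}" \
  --default-actions "Type=forward,ForwardConfig={TargetGroups=[{TargetGroupArn=${STABLE_TG_ARN},Weight=90},{TargetGroupArn=${CANARY_TG_ARN},Weight=10}]}"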
GitHub environment protection rules
Use GitHub Environments to require manual approval for production deployments:
environment: production
# In GitHub settings: Environments → production → Required reviewers

This creates a human gate before automation runs — useful for teams that want to deploy on demand, not on every merge.
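The same rule can be scripted against the GitHub REST API. A sketch; the repository path and reviewer ID are placeholders:

# Sketch: require a reviewer on the production environment via the GitHub API.
# OWNER/REPO and the reviewer id are placeholders.
gh api -X PUT repos/OWNER/REPO/environments/production --input - <<'EOF'
{"reviewers": [{"type": "User", "id": 123456}]}
EOF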
Minimal healthcheck script
#!/usr/bin/env bash
# scripts/healthcheck.sh <url>
set -euo pipefail
URL="${1}"
MAX_ATTEMPTS=30
INTERVAL=10
for i in $(seq 1 $MAX_ATTEMPTS); do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "${URL}")
if [ "${STATUS}" = "200" ]; then
echo "Health check passed (attempt ${i})"
exit 0
fi
echo "Attempt ${i}/${MAX_ATTEMPTS}: got ${STATUS}, retrying in ${INTERVAL}s..."
sleep ${INTERVAL}
done
echo "Health check failed after $((MAX_ATTEMPTS * INTERVAL))s"
exit 1

Common Pitfalls
Hotfixes deployed manually, bypassing automation. The one time you most need the automation to work — during an incident — is the one time it gets bypassed. A hotfix deployed manually is undocumented, unreproducible, and may drift from what is in the repository. The correct fix is to make the automation fast enough that it is never worth bypassing.
No health checks before traffic cutover. A deployment that succeeds at the infrastructure level (container started, process running) may still be broken at the application level (database migration failed, dependency unreachable). Without a health check, traffic hits a broken service. With one, the deployment fails before traffic is cut over.
Rollback has never been tested. A rollback procedure that has only ever been read, never executed, is unreliable. Test the rollback in staging quarterly. Discover the gaps before an incident does.
Inconsistent tagging. Mixing v1.2.3, release-2024-11-14, and deploy-abc1234 in the same repository makes it impossible to answer "what is in production?" Pick one scheme and automate it.
Deployment and release conflated. If turning on a feature requires a deployment, the team waits for a deploy window before releasing. Feature flags eliminate this coupling — code can be deployed any time, and features can be enabled when the team is ready.