
Incident Response & Postmortems

How we respond to production incidents, and how we learn from them without blame.

Overview

Incidents are not failures of process; they are expected, periodic revelations that the system is more complicated than the team's current model of it. Every non-trivial production system will, eventually, do something its operators did not predict. The measure of a team is not whether incidents happen — it is how fast the team can recognise, respond, recover, and learn from them.

This page covers how we think about the response side (what happens in the minutes and hours after something breaks) and the learning side (what happens in the days after it is over). It is deliberately principle-weighted — the step-by-step of any specific incident belongs in that incident's runbook, not here.

For how incidents are detected in the first place, see Monitoring & Alerting. For release-time risk controls that reduce incident frequency, see Deployment Automation.


Why It Matters

Time is the dominant cost during an incident. Users are affected, revenue is affected, and trust is being spent continuously for every minute the problem persists. Most of the value of incident response comes from being able to act quickly, not from being able to act perfectly.

Coordination failures dominate incident duration. The technical cause (the actual bug) is often resolved faster than the coordination questions (who is running this, who is communicating, who is deciding). A team with a clear incident structure routinely resolves incidents faster than a technically stronger team without one.

The incident is the cheapest teacher the system has. A clear-eyed postmortem turns a painful event into durable improvement — not just a patch for the specific bug, but an improvement to the conditions that allowed the bug to reach production. Teams that skip this step pay the same cost repeatedly.

Blame destroys learning. A team that fears being blamed during a postmortem will suppress the information that makes future incidents preventable. The most important cultural precondition for reliability is a postmortem process that is genuinely focused on the system, not on the humans.


Standards & Best Practices

Severity is defined by impact, not intuition

Severity levels are used to decide response urgency, who to wake up, and how widely to communicate. They must be defined in terms that are observable to someone not yet deep in the problem:

| Severity | Impact | Response |
| -------- | ------ | -------- |
| SEV1 | Core service unavailable; revenue-generating path broken; data at risk | Immediate; incident commander; all hands if needed |
| SEV2 | Major feature degraded; significant subset of users affected; no workaround | Same business hour; incident commander |
| SEV3 | Minor feature broken; small subset affected; workaround exists | Same business day; single on-call engineer sufficient |
| SEV4 | Cosmetic or infrastructure issue with no user impact | Tracked as a normal issue; no incident process |

Ambiguity about severity is resolved by choosing the higher level. Downgrading during an incident is fine; upgrading mid-incident wastes the most expensive minutes.
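
A minimal sketch (the class, field names, and values below are hypothetical, not our actual paging configuration) of how the same mapping can be encoded so that paging automation and humans share one definition, including the rule that ambiguity resolves to the higher level:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    """Response expectations attached to a severity level (illustrative values only)."""
    level: str
    page_immediately: bool
    requires_ic: bool
    response_target: str

# Illustrative encoding of the table above; the table remains the source of truth.
SEVERITY_POLICIES = {
    "SEV1": SeverityPolicy("SEV1", page_immediately=True, requires_ic=True, response_target="immediate"),
    "SEV2": SeverityPolicy("SEV2", page_immediately=True, requires_ic=True, response_target="same business hour"),
    "SEV3": SeverityPolicy("SEV3", page_immediately=False, requires_ic=False, response_target="same business day"),
    "SEV4": SeverityPolicy("SEV4", page_immediately=False, requires_ic=False, response_target="normal issue tracking"),
}

def resolve_severity(candidates: list[str]) -> SeverityPolicy:
    """When severity is ambiguous, pick the highest candidate (lowest SEV number)."""
    return SEVERITY_POLICIES[min(candidates)]  # "SEV1" sorts before "SEV2", etc.
```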

The incident commander role

Every SEV1 and SEV2 has one — and only one — person in charge: the incident commander (IC). The IC is not necessarily the person closest to the technical problem. The IC's job is different:

  • Decide what is being tried, in what order
  • Decide when to escalate (bring in more people, wake more people up)
  • Own the external communication (what customers are told, by whom, when)
  • Call the incident resolved — and only then release the responders

Separating the IC from the engineer debugging the actual problem is not bureaucracy. It is what prevents the person fixing the bug from being pulled into Slack to answer status questions for 45 minutes. The IC shields the responders; the responders stay focused.

The IC is rotated or assigned deliberately — not whoever happens to be on-call, not whoever is most senior. Being an IC is a practiced skill; it is explicitly trained.

Communication cadence is a contract

During an active incident:

  • Internally (engineering/operations channel): a status update at least every 15 minutes even if nothing has changed. "Still investigating, no new findings" is a valid update. Silence makes observers fill in blanks — usually pessimistically.
  • Externally (status page, customer-facing channels): updates at a cadence agreed in advance with support and communications. A published incident with no updates for an hour looks abandoned.
  • Executive/stakeholder: a single point of contact (often the IC or a designated liaison) gives stakeholders a summary; stakeholders do not pull responders away from the incident for details.

The cadence is a promise. If you set a 15-minute cadence, you keep it — even when there is nothing to say.
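
The cadence is easier to keep when something other than memory enforces it. A minimal sketch of a reminder loop, assuming a hypothetical `is_resolved` check and stdout in place of a real chat integration:

```python
import time
from datetime import datetime, timezone

CADENCE_MINUTES = 15  # the promised internal update interval

def remind_until_resolved(is_resolved) -> None:
    """Prompt the IC for a status update every CADENCE_MINUTES until the incident is resolved.

    `is_resolved` is any zero-argument callable; in practice the prompt would go to the
    incident channel rather than stdout.
    """
    while not is_resolved():
        now = datetime.now(timezone.utc).strftime("%H:%M UTC")
        print(f"[{now}] Status update due. 'Still investigating, no new findings' counts.")
        time.sleep(CADENCE_MINUTES * 60)
```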

Stabilise first, diagnose second

The first job during an active incident is to stop the bleeding — not to identify the root cause. If a deploy caused the incident, roll it back before reading the stack trace. If a dependency is failing, fail over before building a reproduction. Root cause analysis is important, but it happens after users are no longer affected.

This is a cultural bias worth stating explicitly because it runs counter to engineering instinct. The engineer who just finished a deploy will often want to prove it was not the deploy before rolling back. During an incident, that instinct costs minutes the team does not have.

Postmortem is mandatory, regardless of outcome

Every SEV1 and SEV2 gets a postmortem. Every near-miss — an incident that was caught before it affected users — gets a postmortem at a lighter weight. The goal is not to report; it is to learn. If the team did not learn something new, the postmortem was not done right.

SEV3s may be postmortemed if they are part of a pattern or if a contributing factor is novel. Single low-impact incidents usually get an issue and a fix.

Postmortems are blameless — by design

A blameless postmortem is a specific technical practice, not a tone. The rules:

  • The document names actions and their outcomes, not the humans who took them
  • "A deploy was performed that introduced X" — not "Person Y deployed X"
  • Questions are about why the action made sense at the time, not why the person did something wrong
  • Findings focus on what signals, tools, or guardrails the system was missing

The reason for blamelessness is operational, not therapeutic. In a blame culture, the full information needed to prevent the next incident is systematically hidden. People do not lie, exactly — they omit. A blameless process extracts the full information and converts it into durable change.

Every postmortem produces actions with owners

A postmortem that ends with "we will be more careful" produces nothing. Useful postmortems end with:

  • A short list of concrete actions (3–7 typically)
  • A named owner for each action — a person, not a team
  • A target completion date
  • An issue filed for each action in the team's tracker

Actions are tracked as seriously as any other prioritised work. A team whose postmortem actions systematically slip has converted its incidents into reading material.
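
A small check like the sketch below (the exported field names and sample actions are assumptions, not the tracker's real schema) keeps slipping actions visible:

```python
from datetime import date

# Assumed shape of actions exported from the tracker; field names and entries are illustrative.
actions = [
    {"title": "Add invariant check on billing writes", "owner": "alice", "due": "2024-07-01", "done": False},
    {"title": "Document the failover runbook", "owner": "bob", "due": "2024-06-15", "done": True},
]

def overdue(actions: list[dict], today: date | None = None) -> list[dict]:
    """Return open postmortem actions whose target date has passed."""
    today = today or date.today()
    return [a for a in actions if not a["done"] and date.fromisoformat(a["due"]) < today]

for action in overdue(actions):
    print(f"OVERDUE: {action['title']} (owner: {action['owner']}, due {action['due']})")
```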

Follow-up actions preferentially improve the system, not the human

The hierarchy of incident prevention, from most durable to least:

  1. Make the failure impossible (design change, fundamentally different approach)
  2. Make the failure detected automatically (alerting, monitoring, invariant checks)
  3. Make the failure easier to recover from (better runbooks, better tools, faster rollbacks)
  4. Make the failure less likely through training (documentation, drills, onboarding)
  5. Ask people to be more careful

Actions should cluster at the top of this list. "Be more careful" is the weakest control known to operations; it is the action of last resort, not first resort.


How to Implement

During an active SEV1/SEV2

The shape of the response, not the steps of it:

  1. Declare the incident explicitly — named, severity-tagged, in a known channel
  2. Assign an IC
  3. Open a dedicated channel or bridge where the response happens
  4. The IC starts a timeline (events, decisions, times) — often a pinned message that is edited
  5. Stabilise first (rollback, failover, feature flag) — even before root cause is known
  6. Communicate internally at the agreed cadence; externally per the communication plan
  7. Declare the incident resolved — and only then disperse
  8. Schedule the postmortem within 48 hours, to be held within one week
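
Step 1 is easier under pressure if the declaration is generated rather than improvised. A sketch that fills the channel-message template from Tools & Templates below (all names and the channel convention are placeholders):

```python
from datetime import datetime, timedelta, timezone

def declare_incident(severity: str, title: str, ic: str, impact: str,
                     cadence_minutes: int = 15) -> str:
    """Render the incident-declaration message; severity is 'SEV1' or 'SEV2'."""
    now = datetime.now(timezone.utc)
    next_update = (now + timedelta(minutes=cadence_minutes)).strftime("%H:%M")
    channel = f"#incident-{title.lower().replace(' ', '-')}-{now:%Y-%m-%d}"
    return (
        f"🚨 INCIDENT DECLARED — {severity} — {title}\n"
        f"IC: @{ic}\n"
        f"Impact: {impact}\n"
        f"Channel: {channel}\n"
        f"Next update: {next_update} UTC"
    )

# Example: print(declare_incident("SEV1", "checkout errors", "alice", "Payments failing for a subset of users"))
```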

Postmortem document shape

A postmortem is not a narrative essay. It is a specific document designed to be read by people who were not there and used to improve the system:

  • Summary — One paragraph: what happened, what was the impact, how long it lasted, how it was resolved
  • Impact — Who was affected, how many, for how long, what they experienced
  • Timeline — Chronological facts: when things happened, who did what, when signals fired
  • What went well — Genuine positives; this is not filler. Things the team did that worked should be reinforced
  • What went poorly — Honest, specific, non-blaming
  • Root cause(s) — What was the actual chain of causes? Single-cause explanations are usually too simple — look for contributing factors
  • Action items — Each with owner and target date
  • Lessons learned — The durable takeaways, the things a different team could read and benefit from

The timeline is the most undervalued section

The timeline is where the real learning lives. Read postmortems from other teams, and the timeline is almost always what teaches you something. Write yours to that standard: precise timestamps, specific actions, specific observations, no paraphrasing.
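
Precise timestamps are easier to get when the scribe captures them mechanically. A minimal sketch (the file path is an assumption) that appends entries in the template's timeline format:

```python
from datetime import datetime, timezone

def log_event(entry: str, path: str = "incident-timeline.md") -> None:
    """Append a UTC-timestamped entry in the postmortem's '- HH:MM — ...' timeline format."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"- {stamp} — {entry}\n")

# Example: log_event("Rolled back the suspect deploy; error rate returning to baseline")
```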

Postmortem review

The postmortem is shared broadly — not just within the team that ran the incident. Reading other teams' postmortems is one of the highest-leverage learning activities in engineering organisations. A monthly or bi-weekly review ritual where postmortems are discussed across teams amplifies the value further.


Tools & Templates

Postmortem template

# Postmortem: [Short descriptive title]

**Date of incident:** YYYY-MM-DD
**Severity:** SEV1 | SEV2
**Duration:** HH:MM (start → resolved)
**Authors:** [Names]
**Status:** Draft | In Review | Finalised

---

## Summary

[One paragraph: what happened, what was the impact, how was it resolved.]

## Impact

- Users affected: [estimate]
- Services affected: [list]
- Revenue/business impact: [if measurable]
- Data integrity impact: [if any]

## Timeline

(All times UTC)

- HH:MM — [Event / action / observation]
- HH:MM — [...]

## Root cause

[What ultimately caused this. Contributing factors. Why the existing controls did not prevent it.]

## What went well

- [Genuine positives — things we want to keep doing]

## What went poorly

- [Honest, specific — no blame language]

## Action items

| Action            | Owner  | Target date | Category                    |
| ----------------- | ------ | ----------- | --------------------------- |
| [Concrete action] | [Name] | YYYY-MM-DD  | prevent / detect / mitigate |

## Lessons learned

[The durable takeaways — what should a different team learn from this?]

Incident declaration template (channel message)

🚨 INCIDENT DECLARED — SEV[1/2] — [short title]

IC: @name
Scribe: @name (optional)
Comms: @name (external updates)
Impact: [one line]
Channel: #incident-[name]-[date]
Status page: [link]

Next update: HH:MM

Common Pitfalls

No IC, or the IC is also debugging. Without a dedicated coordinator, the incident response becomes a swarm of well-meaning engineers each pursuing their own hypothesis. Time evaporates. The person closest to the technical problem is almost always the wrong choice for IC — they are the one who needs protecting from coordination work, not the one to do it.

Root causing before stabilising. "Wait, I want to understand why this happened before we roll back" is the sentence that doubles incident duration. Stabilise, then diagnose.

Postmortem theatre. Filling out the postmortem template as an administrative exercise, with vague action items ("improve monitoring") and no owners. The document is filed; nothing changes; the next incident is the same.

Blame language disguised as technical language. "Person X should have tested more carefully" is blame. So is "the team should have noticed the alert" when no one asks why the alert failed to command attention. Blame often hides behind passive voice and team-level attribution. The test: does the action item require a human to be more careful, or does it change what the system does?

One-cause explanations. Real incidents almost always have multiple contributing factors. A postmortem that names a single root cause is usually a postmortem that stopped looking. Keep asking "and what made that possible?" until the answers stop being informative.

Action item graveyards. Actions are assigned, never tracked, and never completed. The postmortem report becomes a ritual disconnected from work. If the team's tracker shows a long tail of overdue postmortem actions, that tail is also the list of incidents you are likely to repeat.

External comms as an afterthought. Deciding during the incident who says what to customers. The customer-facing communication plan should exist before the incident; the incident should plug into it. A team that has not practiced external communication will be unable to do it well under pressure.

Skipping postmortems for "small" incidents. A SEV3 that happens monthly is the cheapest kind of systemic failure to learn from, and the one most often ignored. Pattern-level postmortems — where the team looks at a cluster of similar SEV3s — often yield the most durable improvements.