Operational Readiness

The criteria a service must meet before it can be trusted in production — covering ownership, observability, incident response, and failure-mode design.

Overview

Operational readiness is the practice of verifying that a service is ready to be owned in production before traffic and users depend on it. It is distinct from functional readiness ("the feature works") and deployment readiness ("the deploy process works"). A service can pass both of those gates and still not be operationally ready: it may have no alerting, no runbook, no assigned on-call owner, and no tested failure path.

The cost of discovering these gaps during the first production incident is high. The cost of discovering them before launch — when there is still time to close them — is low. Operational readiness is the structured form of that discovery.

For the monitoring and alerting that readiness depends on, see Monitoring & Alerting. For what happens when readiness is insufficient and an incident occurs, see Incident Response & Postmortems. For release-time risk controls that run alongside readiness reviews, see Release Management.


Why It Matters

The first incident is the most expensive one. A service that has not been reviewed for operational readiness will have its first incident without monitoring that fires correctly, without a runbook that answers the right questions, and without an owner who knows the system. All three gaps compound the resolution time.

Ownership gaps are always discovered at the worst moment. A service with no clear on-call owner falls to whoever is on shift when it breaks — someone who may have never seen the codebase. Ownership assignment before launch means the person paged during an incident is the person who can act.

Observability is a design property, not a post-launch enhancement. Logs, metrics, and traces that are added after users are depending on a service are added while trying to diagnose a live problem. Observability added before launch is present when it is needed.

Operational readiness gates protect the on-call team. A paging environment where many alerts are low-signal, services lack runbooks, and ownership is unclear cannot be sustained. Operational readiness is how the team protects itself from the operational debt of unreviewed launches.


Standards & Best Practices

Ownership must be assigned before go-live

Every production service has:

  • A named team that owns the service — accountable for its reliability and on-call response
  • A named on-call rotation that includes the service — a pager or alerting destination that resolves to a human being, 24/7
  • A named escalation path for when the on-call engineer cannot resolve the incident alone
  • A service registry entry documenting the owner, dependencies, criticality, and on-call contact

A service whose owner is "the backend team" without a specific rotation is a service with no owner. The on-call rotation is the operationalised form of ownership.
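
In practice the registry entry can be a handful of fields. The sketch below is hypothetical (every name is invented); what matters is that each field resolves to something actionable:

    service:      payments-api
    team:         payments
    on_call:      payments-primary        (24/7 rotation in the paging tool)
    escalation:   payments-lead, then platform-sre
    criticality:  tier-1 (user-facing, revenue-impacting)
    dependencies: postgres-main, auth-service, stripe
    runbook:      link to the team's runbook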

Observability must be present and verified

Observability is not "dashboards exist." Observability means the team can answer "is this service working correctly right now?" without looking at source code or production logs manually.

Minimum observability bar before go-live:

  • Metrics: request rate, error rate, and latency percentiles (p50, p95, p99) are instrumented and visible on a dashboard
  • Logs: structured logs are emitted for all requests, with severity, trace ID, and enough context to diagnose a failure
  • Traces: distributed traces are emitted for cross-service calls, so a failing request can be followed end-to-end
  • Health check: a health endpoint returns a meaningful status, not just "200 OK" regardless of internal state (see the sketch after this list)
  • Dependency health: failures in upstream dependencies are distinguishable from failures in the service itself
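
What this bar means in practice is easiest to see in the health check. Below is a minimal sketch, assuming a Flask service with one hard dependency (a Postgres database reachable over TCP); the host, port, and route are illustrative, not a prescribed layout:

    # Hypothetical sketch: a health endpoint that reports real internal state
    # instead of returning "200 OK" unconditionally. Names are illustrative.
    import socket
    import time

    from flask import Flask, jsonify

    app = Flask(__name__)

    def check_database(host: str = "db.internal", port: int = 5432) -> bool:
        # TCP-level reachability with a short timeout; a real probe would
        # run a cheap query (e.g. SELECT 1) through the actual pool.
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            return False

    @app.route("/healthz")
    def healthz():
        # Per-dependency results let callers distinguish "this service is
        # broken" from "an upstream dependency is down".
        checks = {"database": check_database()}
        healthy = all(checks.values())
        body = {"status": "ok" if healthy else "degraded",
                "checks": checks,
                "checked_at": time.time()}
        # 503 lets load balancers and alerting see "up but broken".
        return jsonify(body), 200 if healthy else 503

The property that matters is the last line: the endpoint can return 503 and name the failing dependency, which makes the "Health check" and "Dependency health" rows verifiable rather than aspirational.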

Alerting must be calibrated before go-live

Alerts that fire without anyone acting on them are noise. Alerts that should fire but do not are a silent failure. Before go-live:

  • At least one alert exists for each failure mode that matters to users
  • Alert thresholds are calibrated to actual traffic, not default values
  • Every alert routes to the correct on-call rotation
  • Every alert has a corresponding runbook entry (or the runbook explicitly says "no runbook; investigate from scratch")
  • Alerts have been tested — either in staging or via synthetic traffic — to verify they fire correctly

An alert that has never been tested may not fire. Testing alerts is not optional.
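
The testing step is the one most often skipped, so here is one way to do it. The sketch below assumes a staging deployment and an error-rate alert whose threshold counts the synthetic failures; the URL and volumes are invented for illustration, and the route should be adapted to whatever reliably produces errors the alert counts:

    # Hypothetical sketch: generate synthetic error traffic against staging
    # so the error-rate alert can be observed firing before go-live.
    import time
    import urllib.error
    import urllib.request

    # Assumed staging route that is known to fail; adapt to however your
    # service produces errors that the alert's metric records.
    STAGING_URL = "https://staging.example.internal/api/orders/does-not-exist"

    def send_synthetic_errors(count: int = 300, interval_s: float = 1.0) -> None:
        # Sustain the error rate above the alert threshold for longer than
        # the alert's evaluation window.
        for _ in range(count):
            try:
                urllib.request.urlopen(STAGING_URL, timeout=2)
            except urllib.error.URLError:
                pass  # errors are the point; the service just needs to record them
            time.sleep(interval_s)

    if __name__ == "__main__":
        send_synthetic_errors()
        # Then confirm in the paging tool that the alert fired within its
        # evaluation window and paged the intended rotation.

The verification is deliberately manual at the end: the thing being tested is the full path from metric to pager, not just the alert rule.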

A runbook covers the common failure scenarios

A runbook is a document that answers "what do I do when this alert fires?" for the most common failure scenarios. It is written by the team who built the service, before launch, not improvised during the first incident.

Runbook contents for each failure scenario:

  • What does this alert mean in plain terms?
  • What is the likely cause?
  • What are the diagnostic steps (in order)?
  • What are the remediation options?
  • What is the escalation path if the runbook does not resolve it?

A runbook does not need to be exhaustive — it needs to cover the scenarios that will occur most often and the ones that are most dangerous. A runbook covering 80% of scenarios is dramatically more valuable than no runbook.
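
As a shape reference only (the alert, causes, and steps below are invented), a single entry can be as short as:

    Alert: HighErrorRate (payments-api)
    Meaning: more than 5% of checkout requests are failing.
    Likely causes: recent deploy; payment-provider outage; database pool exhaustion.
    Diagnosis: check the deploy timeline; check the provider status page;
      check pool saturation on the dashboard.
    Remediation: roll back the most recent deploy; raise the pool limit;
      enable the provider fallback if one exists.
    Escalation: page the team lead if not resolved within 30 minutes.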

Framing note: Detailed step-by-step procedures belong in an SOP-style runbook, not in this playbook. This page establishes the requirement that a runbook exist; the runbook itself lives in the team's operational documentation.

Failure modes must be documented

For a service to be operationally ready, the team must have answered the following questions in writing:

  • What are the dependencies this service relies on?
  • What happens when each dependency is unavailable?
  • What is the blast radius of this service failing? Who is affected?
  • What are the possible degraded states (partial failure), and are they acceptable?
  • What is the rollback procedure if the go-live needs to be reversed?
  • What is the data loss boundary — if the service crashes mid-operation, what state is lost or duplicated?

This does not need to be a long document. A table of dependencies with failure modes and responses is sufficient. What matters is that the team has thought through these questions before users depend on the answers.
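
For illustration (service and dependency names are invented), the minimum viable form looks like:

    Dependency      When it is unavailable                          Response
    postgres-main   writes fail; service returns 503                hard dependency; page immediately
    redis-cache     latency rises; service falls back to database   degraded but acceptable; warn, do not page
    stripe-api      checkout fails; other endpoints unaffected      page; surface a retry message to users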

Capacity is verified for launch traffic

A service that is functionally correct and operationally observable can still fail at launch if it is not sized for the load it will receive.

Before go-live:

  • Load testing at expected peak traffic has been run (not just average traffic)
  • Resource limits (CPU, memory, connections) are set and verified under load
  • Scaling behaviour (auto-scaling triggers, scale-up time) has been tested
  • Database and cache connection pools are sized for concurrent users, not just single-user load (a worked sizing example follows this list)
  • Rate limits on upstream dependencies are understood and within budget at peak load
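
Pool sizing in particular is worth deriving rather than guessing. Little's law gives a first-order estimate: concurrent in-flight requests ≈ arrival rate × time in system. With illustrative numbers, an expected launch peak of 400 requests/s and a mean database query time of 50 ms imply 400 × 0.05 = 20 concurrent queries, so two instances with pools of 10 connections each sit exactly at the theoretical floor, with no headroom for spikes or slow queries. The numbers are hypothetical; the method (derive from peak traffic, then add margin) is the point.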

Compliance and data handling are verified

For services that handle personal data, payment data, or other regulated data:

  • Data classification of what the service stores and processes is documented
  • Retention policy is defined and implemented
  • Encryption at rest and in transit is verified
  • Access logging is enabled and routed to the appropriate audit destination
  • Any required compliance documentation (GDPR DPIA, PCI self-assessment, etc.) is complete

How to Implement

Operational readiness review

The operational readiness review (ORR) is a structured conversation between the team launching the service and someone outside the team (tech lead, SRE, or senior engineer) that verifies the readiness criteria are met.

The ORR is not a bureaucratic gate — it is a forcing function for the team to have answered the questions they would otherwise defer. A team that has prepared for an ORR is a team that has thought through their production failure modes before they occur.

Readiness checklist

Before any new service or major feature goes live:

Ownership

  • Owning team and on-call rotation identified and notified
  • Service registered in the service registry (or equivalent)
  • Escalation path documented

Observability

  • Request rate, error rate, latency metrics instrumented and dashboarded
  • Structured logging emitting with trace ID
  • Distributed tracing instrumented for cross-service calls
  • Health check endpoint present and meaningful

Alerting

  • Alerts for key failure modes are configured and routed to the on-call rotation
  • Alert thresholds calibrated to realistic traffic
  • Alerts tested and verified to fire

Runbook

  • Runbook exists for the most common failure scenarios
  • Runbook linked from the alert configuration

Failure modes

  • Dependency failure modes documented
  • Rollback procedure documented and tested
  • Data loss boundary documented

Capacity

  • Load test run at expected peak traffic
  • Resource limits and auto-scaling configured and verified

Common Pitfalls

Treating go-live as a functional milestone only. "The feature is done" is a different statement from "the service is ready for production." Treating them as the same skips the operational readiness work.

Ownership assigned to a team, not a rotation. "The platform team owns this" means no specific person is paged when it breaks. Ownership must resolve to a pager destination — a rotation with defined human coverage.

Alerts that fire into a void. Alerts routed to a channel no one monitors, or to a rotation that does not include the engineers who know the service, are worse than no alerts — they create noise without action.

Runbooks written the day of the first incident. The runbook written during the first incident is the runbook written under pressure, with incomplete context, while users are affected. Runbooks written before launch are written with full context and no pressure.

Load testing at average, not peak. A service that performs acceptably under average load but falls over under a launch spike is operationally unready. Test at the traffic level the service will see, not the traffic level it currently sees.

Skipping ORR for "small" services. A "small" service that fails and has no observability, no runbook, and no owner is not small to the team debugging it at 2am. Calibrate the ORR depth to the criticality and blast radius of the service, but do not skip it.