
Reliability Patterns

Design patterns for building systems that degrade gracefully and recover predictably under partial failure.

Overview

Every distributed system will eventually face partial failures — a downstream service that times out, a network partition that causes requests to pile up, a dependency returning errors at a rate the system did not budget for. Reliability patterns are the design vocabulary for these failure modes: they specify what a service does when things go wrong, not just when they go right.

Reliability is not added after the fact. A service designed without failure-mode thinking will eventually exhibit one of two pathologies: cascading failures upward (a single degraded dependency takes the whole system offline), or silent latency accumulation until users are blocked. The patterns on this page prevent both.

For defining the reliability targets the system must hit, see SLOs & Error Budgets. For detecting degradation in production, see Monitoring & Alerting. For the broader scalability trade-offs these patterns sit within, see Scalability Patterns.


Why It Matters

Failure in distributed systems is routine, not exceptional. Services restart, databases become briefly unavailable, networks drop packets. A system built assuming perfect conditions will fail at scale; a system designed for partial failure degrades gracefully and recovers predictably.

Cascading failures are non-linear. A single dependency that slows down — rather than fails cleanly — can exhaust the thread pool, connection pool, or queue of every service upstream of it, taking down healthy components alongside the degraded one. The circuit breaker pattern exists specifically to prevent this.

Retry logic written independently by every caller is a liability. Without a shared standard, callers implement retry with different back-off strategies, retry storms become more likely, and idempotency assumptions are inconsistent. These patterns should be first-class design decisions, not implementation details.

Users experience reliability cumulatively. A service that returns a stale response is more useful than one that returns an error — for many purposes, "slightly outdated" is good enough. Designing for graceful degradation means thinking about what the user needs, not just what the system can guarantee.


Standards & Best Practices

Retries require back-off, jitter, and a budget

A retry that fires immediately is a retry storm waiting to happen. Every retry strategy must include:

  • Exponential back-off — a base delay that doubles with each attempt (e.g. 100ms → 200ms → 400ms)
  • Jitter — randomisation added to the back-off so all callers do not retry at identical intervals
  • A maximum retry count or time budget — retries must stop; an unbounded retry loop is a denial-of-service attack against a degraded dependency
  • An idempotency check before retrying — retrying a non-idempotent operation can produce duplicate side effects

Retries are appropriate for transient errors: network timeouts, temporary server-side 500s, brief unavailability. They are not appropriate for client errors (4xx), business logic rejections, or validation failures — retrying without changing the request will never succeed.
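
A minimal sketch of a compliant retry loop, in Python. The TransientError type and the call argument are illustrative stand-ins for whatever the caller's client library raises and invokes:

    import random
    import time

    class TransientError(Exception):
        """Illustrative: the retryable failure class (timeouts, transient 5xx)."""

    BASE_DELAY_S = 0.1   # first back-off step (100ms)
    MAX_ATTEMPTS = 4     # hard budget: retries must stop

    def backoff_delay(attempt: int) -> float:
        """Exponential back-off with full jitter: uniform over [0, base * 2^attempt]."""
        return random.uniform(0, BASE_DELAY_S * (2 ** attempt))

    def call_with_retry(call):
        for attempt in range(MAX_ATTEMPTS):
            try:
                return call()
            except TransientError:
                if attempt == MAX_ATTEMPTS - 1:
                    raise  # budget exhausted: surface the error rather than loop forever
                time.sleep(backoff_delay(attempt))

Note that only TransientError is caught: client errors and validation failures propagate immediately, since retrying them cannot succeed.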

Timeouts are mandatory at every process boundary

Every call that crosses a process boundary (HTTP request, database query, message queue dequeue, cache get) must have an explicit timeout. A call without a timeout can block the caller indefinitely, and blocked callers accumulate until threads, connections, or memory are exhausted.

Timeout configuration standards:

  • Set timeouts based on the calling service's SLO, not the expected latency of the callee
  • The caller's timeout budget must be smaller than the end-user-visible deadline — otherwise the user waits beyond the useful window before getting an error
  • Timeouts must be observable: logged, and alerted on when the timeout rate rises
  • Test the timeout path explicitly — it is rarely exercised in development and almost always encountered in incidents
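
A sketch of deadline-aware timeout budgeting using only the Python standard library; the URL and two-second deadline are placeholders:

    import time
    import urllib.request

    USER_DEADLINE_S = 2.0  # end-user-visible deadline for this request path

    def fetch_within_deadline(url: str, deadline: float) -> bytes:
        """Pass the remaining budget, not a fixed constant, as the socket timeout."""
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded before the call was attempted")
        with urllib.request.urlopen(url, timeout=remaining) as response:
            return response.read()

    deadline = time.monotonic() + USER_DEADLINE_S
    body = fetch_within_deadline("https://example.internal/profile", deadline)

Because each call receives only the time left in the overall budget, a slow first hop automatically shrinks the timeout available to later hops instead of pushing the user past the useful window.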

Circuit breakers prevent cascades

A circuit breaker wraps a dependency call and tracks its error rate. When errors exceed a configured threshold, the circuit "trips" — further calls fail immediately without attempting to reach the dependency. After a recovery window, the circuit enters a half-open state: one probe call is allowed through. If it succeeds, the circuit closes; if it fails, the window resets.

The value: a tripped circuit converts a slow failure (caller blocks waiting for timeout, releases thread, retries) into a fast failure (caller receives error immediately, can fall back). This prevents one degraded dependency from exhausting the caller's resources.

Design decisions:

  • Error threshold and recovery window are service-specific — calibrate to realistic traffic patterns
  • Circuit state must be observable: dashboards, alerts, logs
  • When the circuit is open, callers must degrade gracefully, not simply propagate the error upstream
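
A minimal single-threaded sketch of the state machine described above. For brevity it trips on consecutive failures rather than an error rate, and it omits the locking and metrics a production implementation needs:

    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold: int = 5, recovery_window_s: float = 30.0):
            self.failure_threshold = failure_threshold
            self.recovery_window_s = recovery_window_s
            self.consecutive_failures = 0
            self.opened_at = None  # None means the circuit is closed

        def call(self, fn):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.recovery_window_s:
                    raise RuntimeError("circuit open: failing fast")
                # Recovery window elapsed: half-open, allow one probe call through.
            try:
                result = fn()
            except Exception:
                self.consecutive_failures += 1
                if self.opened_at is not None or self.consecutive_failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip, or reset the window after a failed probe
                raise
            self.consecutive_failures = 0
            self.opened_at = None  # successful call (or probe) closes the circuit
            return result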

Idempotency is a reliability feature

An operation is idempotent if calling it multiple times produces the same result as calling it once. Idempotency is what makes retries safe.

Standards:

  • Read operations must be naturally idempotent — no hidden side effects on read
  • Write operations that modify state accept and deduplicate a client-provided idempotency key
  • Idempotency keys are stored with sufficient TTL to cover the retry window
  • Operations that cannot be made idempotent (e.g. appending to an event log) have their non-idempotency documented prominently

The alternative — non-idempotent writes and retries — produces duplicate records, double-charges, and duplicate order submissions that are expensive to unwind and damaging to trust.
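
A sketch of server-side deduplication on a client-supplied key. The in-process dictionary stands in for a shared store such as Redis, and the check-then-write sequence would need to be atomic under concurrency:

    import time

    IDEMPOTENCY_TTL_S = 86_400  # must cover the longest plausible retry window
    _results: dict[str, tuple[float, object]] = {}

    def handle_write(idempotency_key: str, request, perform_write):
        """perform_write is the real, side-effecting operation (assumed)."""
        cached = _results.get(idempotency_key)
        if cached is not None and time.time() - cached[0] < IDEMPOTENCY_TTL_S:
            return cached[1]  # duplicate retry: return the first result, no new side effect
        result = perform_write(request)
        _results[idempotency_key] = (time.time(), result)
        return result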

Graceful degradation is a product decision, made at design time

Graceful degradation means the system provides some value when a component fails, rather than failing entirely. What constitutes acceptable degradation is a product decision, not an implementation choice — it must be made at design time, not during an incident.

Common degradation patterns:

  • Cached fallback — return a cached result rather than an error when the live source is unavailable
  • Feature disable — non-critical features are hidden when their dependency fails (e.g. recommendations are suppressed if the recommendation service is down)
  • Static response — return a known-safe static response for non-critical paths
  • Read-only mode — accept reads but queue or reject writes when the write path is unavailable

Every fallback path must be observable. If the system degrades silently, degradation becomes the new normal without anyone noticing. Alert when any fallback path is used at a sustained rate.
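
A sketch combining the cached-fallback and feature-disable patterns, with illustrative live_client and cache interfaces; the logging call is the observability hook the paragraph above requires:

    import logging

    log = logging.getLogger("fallbacks")

    def get_recommendations(user_id: str, live_client, cache) -> list:
        try:
            recs = live_client.fetch(user_id)   # the live source (assumed interface)
            cache.set(user_id, recs)
            return recs
        except Exception:
            log.warning("recommendations fallback engaged for user=%s", user_id)
            stale = cache.get(user_id)
            if stale is not None:
                return stale  # cached fallback: slightly outdated is good enough here
            return []         # feature disable: suppress the section rather than error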

Bulkheads isolate failure domains

A bulkhead allocates a fixed resource pool (thread pool, connection pool, semaphore) to each dependency. If dependency A is slow, its calls fill A's pool — not the shared pool used by all dependencies. Calls to dependency B continue unaffected.

Bulkhead implementation:

  • Separate pools per critical dependency, not one shared pool for all external calls
  • Pool sizes are calibrated to realistic traffic, not theoretical maximum throughput
  • Overflow is rejected fast, not queued indefinitely

The name comes from ship design: bulkheads prevent water from flooding the whole hull when a single compartment is breached.
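
A sketch of a semaphore bulkhead in Python. The pool sizes are illustrative; the point is one pool per dependency and fast rejection on overflow:

    import threading

    class Bulkhead:
        def __init__(self, name: str, max_concurrent: int):
            self.name = name
            self.sem = threading.BoundedSemaphore(max_concurrent)

        def call(self, fn):
            if not self.sem.acquire(blocking=False):  # reject fast, never queue indefinitely
                raise RuntimeError(f"bulkhead '{self.name}' full: rejecting call")
            try:
                return fn()
            finally:
                self.sem.release()

    # Separate pools per critical dependency, not one shared pool.
    payments = Bulkhead("payments", max_concurrent=20)
    search = Bulkhead("search", max_concurrent=50)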


How to Implement

Failure mode analysis at design time

For every service boundary the design introduces, document:

  • What happens if the dependency is down?
  • What happens if the dependency is slow (returns, but slowly)?
  • What happens if the dependency returns malformed or unexpected data?
  • Which of these scenarios has a graceful fallback, and what is it?
  • Which scenarios require the caller to fail rather than degrade?

A table of failure modes with documented responses is a first-class design artefact. A service that has not answered these questions is one whose incident response plan is improvised on the day.
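
A hypothetical example of such a table, for an order service calling a payments provider:

Scenario                        | Documented response
Payments is down                | Accept the order, queue the charge, confirm asynchronously
Payments is slow                | 800ms timeout, then fall back to the queued path above
Payments returns malformed data | Fail the request; never guess at payment state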

Testing failure paths explicitly

Reliability patterns are only as good as their test coverage. Three layers:

  1. Unit tests — for retry/circuit-breaker logic in isolation, verifying back-off curves and thresholds
  2. Integration tests — injecting failures (fault-injection middleware, test doubles that simulate timeout and error) and asserting that the fallback behaviour fires correctly
  3. Chaos testing in staging — deliberately failing dependencies and verifying the system degrades as designed, not as assumed

Reliability patterns that are untested in the failure path are reliable in theory only.
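
As an example of the first layer, a unit test pinning the back-off curve; the backoff_delay helper repeats the full-jitter formula from the retry sketch above:

    import random

    def backoff_delay(attempt: int, base: float = 0.1) -> float:
        return random.uniform(0, base * (2 ** attempt))

    def test_backoff_stays_within_jitter_envelope():
        for attempt in range(4):
            for _ in range(200):
                delay = backoff_delay(attempt)
                assert 0.0 <= delay <= 0.1 * (2 ** attempt)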

Observability for each pattern

Each pattern requires its own observability:

Pattern          | What to instrument
Retries          | Retry count per request, retry rate over time, final error rate after retries
Timeouts         | Timeout count per dependency, trending over time
Circuit breakers | Circuit state changes, open-circuit rate, recovery success/failure rate
Fallbacks        | How often each fallback path is used, per endpoint

A circuit breaker that trips silently or a fallback that engages without an alert is a system that is degraded without acknowledgment. Degradation without visibility is indistinguishable from failure.
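
One way to wire these up, sketched with the Python prometheus_client library; the metric names and labels are illustrative, not a prescribed schema:

    from prometheus_client import Counter, Gauge

    RETRIES = Counter("dependency_retries_total", "Retries issued", ["dependency"])
    TIMEOUTS = Counter("dependency_timeouts_total", "Calls timed out", ["dependency"])
    CIRCUIT_OPEN = Gauge("circuit_open", "1 while the circuit is open", ["dependency"])
    FALLBACKS = Counter("fallback_engaged_total", "Fallback responses served", ["endpoint"])

    # For example, inside the retry loop and the fallback path respectively:
    RETRIES.labels(dependency="payments").inc()
    FALLBACKS.labels(endpoint="/recommendations").inc()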


Common Pitfalls

Retrying without back-off. Immediate retries on a degraded service add load to a service that is already struggling. Exponential back-off with jitter is required, not optional.

Retrying non-idempotent operations. A retry of a create-order call that does not check idempotency creates duplicate orders. Retries and idempotency must be designed as a pair.

Timeouts longer than user patience. A 30-second timeout on a user-facing call means the user waits 30 seconds for an error. Timeouts should be shorter than the user's acceptable wait time for the context.

Circuit breakers with no fallback. A tripped circuit breaker that converts slow failure into fast failure is only useful if the caller does something with the fast failure. Without a fallback, it just changes the error latency.

Untested fallback paths. The fallback is the least-exercised code in the codebase and will fail in unexpected ways if not tested explicitly. Test failure paths in staging before they are needed in production.

Retry amplification in deep call chains. If service A makes up to 3 attempts per request and each of those attempts calls service B, which itself makes up to 3 attempts to its own dependency, a single user request can generate 9 calls at the bottom of the chain. In a multi-hop service mesh, retry amplification is a cascade-failure multiplier. Propagate a retry budget via request headers and deduct from it at each hop, as in the sketch below.
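
A sketch of that propagation, assuming a hypothetical X-Retry-Budget header; the header name and the send callable are illustrative, not a standard:

    class TransientError(Exception):
        """Illustrative retryable failure."""

    def call_with_budget(incoming_headers: dict, send):
        # The budget is shared across the whole call chain, not granted per hop.
        budget = int(incoming_headers.get("X-Retry-Budget", "3"))
        for attempt in range(budget + 1):  # one initial attempt plus the remaining budget
            try:
                # Pass down only what is left, so deeper hops cannot re-amplify.
                return send({"X-Retry-Budget": str(budget - attempt)})
            except TransientError:
                if attempt == budget:
                    raise  # chain-wide budget exhausted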

Hardcoded retry and timeout values. Retry counts and timeouts that are buried in code rather than configuration cannot be tuned when load patterns change. Expose them as configuration; tune them based on observed latency distributions, not guesses.