SLOs & Error Budgets
Defining, measuring, and managing service reliability through service level objectives and error budgets.
Overview
Service Level Objectives (SLOs) are the formal expression of "how reliable does this service need to be?" They are the contract between engineering reliability and the business outcomes that depend on it. Error budgets are the operational mechanism that turns that contract into a day-to-day guide for how the team spends its time.
Together, SLOs and error budgets give a team an objective, shared answer to questions that otherwise generate purely subjective disagreement: How much reliability work is enough? Can we deploy a risky change now? Does this incident justify cancelling the sprint to fix it? With an error budget, these are not opinion questions — they have answers grounded in the system's current reliability state.
For the tooling that measures the signals SLOs are built on, see Monitoring & Alerting. For what to do when an SLO is breached, see Incident Response & Postmortems. For how reliability requirements affect system design, see Reliability Patterns.
Why It Matters
"Five nines" is almost always the wrong target. 99.999% availability allows only 5.26 minutes of total downtime per year. Most users cannot perceive a difference between 99.9% and 99.99% (roughly 8.8 hours versus 53 minutes of downtime per year). Over-engineering reliability costs as much as under-engineering it — it crowds out feature work, slows deploys, and adds complexity the system must pay for forever. SLOs set the right target, not the maximum possible.
Without an SLO, reliability is always either too high or too low. Teams without SLOs spend reliability effort based on recent pain rather than current state. After a bad week they over-invest; after a quiet month they under-invest. SLOs make the investment level visible and objective.
Error budgets convert reliability into a decision framework. An error budget that is mostly intact says: we are reliably meeting our target; we can accept more deployment risk, take on riskier experiments, and move faster. An error budget that is nearly exhausted says: we are close to unreliable; slow down, fix things, no risky changes. This decision framework requires no judgment call — the budget is either there or it is not.
Users have an implicit SLO in their head. They may not express it in nines, but users have a threshold below which they stop trusting a service. An SLO is the explicit, measurable form of that implicit threshold. Teams that measure it can see how close they are to the user's threshold; teams that do not are surprised when users churn.
Standards & Best Practices
SLIs are the raw measurements; SLOs are the targets
A Service Level Indicator (SLI) is a quantitative measurement of service behaviour. An SLO is a target for what proportion of SLI measurements should be "good" over a defined window.
SLI categories and typical formulations:
| Category | Example SLI | Example SLO |
|---|---|---|
| Availability | Proportion of requests that return a non-5xx response | 99.9% of requests succeed over a 30-day window |
| Latency | Proportion of requests completing under a threshold | 95% of requests complete in under 500ms over a 30-day window |
| Error rate | Proportion of requests that return an error | Fewer than 0.1% of requests return an error |
| Throughput | Volume processed per time window | 99% of batch jobs complete within 4 hours of scheduled start |
| Freshness | How current data is | 99% of data reads reflect writes within 60 seconds |
SLIs must measure what users actually experience. An SLI that measures infrastructure-layer metrics (CPU, memory, disk) does not directly correlate with user experience. The correct SLI measures the outcomes a user cares about: did my request succeed, and was it fast enough?
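Concretely, most good SLIs reduce to a ratio of good events to total events over the window. A minimal sketch (the function name is illustrative, not from any particular library):

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Availability SLI: the proportion of user requests that succeeded."""
    if total_events == 0:
        return 1.0  # no traffic in the window: treat the target as met
    return good_events / total_events
```

The same shape covers latency ("requests completing under 500ms" as the good events) and freshness ("reads reflecting writes within 60 seconds").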
Error budgets: the SLO's operational arm
If the SLO is 99.9% over 30 days, then 0.1% of requests may be bad — that is the error budget. For a service handling 1 million requests per day, the 30-day error budget is 30,000 requests (0.1% of 30 million).
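The arithmetic is simple enough to script. A sketch, assuming a steady daily request volume (the function name is hypothetical):

```python
def error_budget_requests(slo: float, requests_per_day: int, window_days: int = 30) -> int:
    """Number of requests that may fail in the window while still meeting the SLO."""
    total_requests = requests_per_day * window_days
    return round(total_requests * (1 - slo))
```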
Error budget policy defines what teams do in response to budget state:
| Budget state | Policy |
|---|---|
| > 50% remaining | Normal operations. Risky changes and experiments are acceptable. |
| 25–50% remaining | Review risky deployments before shipping. Increase incident monitoring cadence. |
| < 25% remaining | No risky changes. Prioritise reliability work. |
| Exhausted | Feature work paused. All effort on reliability until budget is restored. |
This policy must be agreed to by engineering and product leadership before an SLO exhaustion event. An error budget policy that is written down but never enforced is decoration.
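A policy like the one above can be encoded so that a deploy gate or CI check consults it mechanically. A minimal sketch of the tier lookup (thresholds copied from the table; names are illustrative):

```python
def budget_policy(budget_remaining: float) -> str:
    """Map the remaining error budget (as a fraction, 0..1) to the agreed policy tier."""
    if budget_remaining <= 0:
        return "exhausted: feature work paused, all effort on reliability"
    if budget_remaining < 0.25:
        return "no risky changes, prioritise reliability work"
    if budget_remaining < 0.50:
        return "review risky deployments before shipping"
    return "normal operations"
```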
SLO windows and burn rates
The measurement window for an SLO (typically 28 or 30 days) determines how quickly the budget is consumed and how fast it is replenished.
Burn rate is how quickly the error budget is being consumed relative to the allowed rate. A burn rate of 1 means the budget is being consumed at exactly the sustainable pace. A burn rate of 10 means the budget would be exhausted in 1/10th of the measurement window if the current error rate continues.
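As a formula, burn rate is the observed error rate divided by the error rate the SLO allows. A sketch:

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO permits.

    1.0 means the sustainable pace; 10.0 means the budget would be
    exhausted in 1/10th of the measurement window at this rate.
    """
    allowed_error_rate = 1 - slo
    observed_error_rate = bad_events / total_events
    return observed_error_rate / allowed_error_rate
```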
Multi-window burn rate alerts are more actionable than threshold alerts:
| Alert | Trigger | Meaning |
|---|---|---|
| Page immediately | Burn rate > 14 over 1 hour | Budget exhausted in ~2 days at this rate |
| Ticket this sprint | Burn rate > 3 over 6 hours | Budget exhausted in ~10 days at this rate |
| Investigate | Burn rate > 1 over 24 hours | Running at above sustainable pace |
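The table above can be evaluated mechanically: compute the burn rate over each window and report every rule that trips. A sketch (severities and thresholds are taken from the table; the data structures are illustrative):

```python
# (severity, window in hours, burn-rate threshold): mirrors the table above
ALERT_RULES = [
    ("page", 1, 14.0),
    ("ticket", 6, 3.0),
    ("investigate", 24, 1.0),
]

def evaluate_alerts(burn_by_window: dict[int, float]) -> list[str]:
    """Given burn rates keyed by window length (hours), return the severities that fire."""
    return [
        severity
        for severity, window_hours, threshold in ALERT_RULES
        if burn_by_window.get(window_hours, 0.0) > threshold
    ]
```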
Burn-rate alerts are the correct SLO-driven alerting pattern. Alerting on the raw metric (e.g. "error rate > 5%") generates too many false alarms; alerting on budget burn rate generates actionable alerts proportional to actual user impact.
SLAs are not SLOs
A Service Level Agreement (SLA) is a contractual commitment to a customer, usually with financial penalties for breach. An SLO is an internal engineering target. SLOs should be set more stringently than SLAs — typically one reliability tier tighter, for example an internal SLO of 99.95% against a contractual SLA of 99.9% — so that the team has early warning before the SLA is breached.
Never expose an SLO directly as an SLA. The SLO is the internal target that gives the team room to act before customers are contractually owed a credit.
Not everything needs a strict SLO
Internal tooling, batch jobs with flexible deadlines, and non-critical reporting dashboards do not need the same SLO rigour as the user-facing checkout flow. Calibrate SLO investment to the user-facing impact of the service.
The triage question: if this service misses its target for an hour, what is the user impact? If the answer is "minor inconvenience or none," the SLO investment level should be lower than for a service whose failure generates lost revenue or lost data.
How to Implement
Setting an initial SLO
A starting point for SLO discovery:
- What does the user care about? List the user-visible outcomes the service delivers (request succeeds, page loads, payment processes).
- Instrument the SLIs. Ensure those outcomes are measured — requests counted, errors counted, latency histograms recorded.
- Observe the baseline. Look at 90 days of historical data. What is the current reliability level?
- Set the SLO slightly below the current baseline. This sets a target that acknowledges reality without committing to a level the service has never achieved. Tighten over time.
- Get product and business sign-off. An SLO is a business commitment, not an engineering choice. The SLO level should reflect what users actually need and what the business can sustain.
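The "set the SLO slightly below the current baseline" step can be made mechanical by snapping the observed baseline down to a standard target. A sketch, assuming a conventional ladder of availability tiers (the ladder itself is an assumption; adjust to your organisation):

```python
# Conventional availability tiers; an assumption, not a standard.
STANDARD_TARGETS = [0.99, 0.995, 0.999, 0.9995, 0.9999]

def initial_slo(observed_baseline: float) -> float:
    """Tightest standard target at or below the 90-day observed baseline."""
    achievable = [t for t in STANDARD_TARGETS if t <= observed_baseline]
    # If the service sits below every tier, start at the loosest and improve first.
    return max(achievable) if achievable else min(STANDARD_TARGETS)
```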
SLO dashboard requirements
An SLO dashboard is a first-class operational tool, not a reporting artifact. It should show:
- Current SLI value (live)
- SLO target
- Remaining error budget (absolute and percentage)
- Error budget burn rate (current and trending)
- Days remaining until budget exhaustion at current burn rate
This should be visible on a team dashboard, not buried in a monitoring tool.
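Of the fields above, the only one that needs real computation is days until exhaustion; the rest are direct readings. A sketch of that projection (names are illustrative):

```python
def days_to_exhaustion(budget_remaining: float, burn: float, window_days: int = 30) -> float:
    """Days until the remaining budget (as a fraction, 0..1) is gone at the current burn rate.

    At burn rate 1.0, a full budget lasts exactly one measurement window.
    """
    if burn <= 0:
        return float("inf")  # budget is not being consumed
    return budget_remaining * window_days / burn
```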
Common Pitfalls
Setting the SLO at 100% or 99.999%. An SLO at 100% is aspirational, not operational — it is immediately breached by the first network blip. An error budget of zero cannot guide any decision. Set achievable, user-meaningful targets.
SLIs that do not measure user experience. CPU and memory are not SLIs. They are signals that may indicate user impact. Measure what the user experiences directly: did the request succeed, and was it fast enough?
Error budget policy that is never enforced. A policy that says "when the budget is exhausted, stop feature work" that is never actually enforced is not a policy. The value of error budgets comes from acting on them. If the policy is enforced once, it sets a precedent; if it is ignored repeatedly, it is useless.
Treating SLOs as a reporting exercise. SLOs that are reviewed monthly in a metrics review and otherwise ignored do not improve reliability. SLOs must be part of the sprint planning, deployment, and incident process to have operational value.
Setting independent SLOs without composition. If service A depends on service B, A's SLO is constrained by B's. An availability SLO of 99.9% for a service that makes ten calls to a dependency with 99.9% availability is mathematically unachievable (0.999^10 ≈ 99%). SLO composition must be reasoned about at design time.
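The composition arithmetic, as a sketch (serial, independent calls assumed; real dependency graphs with retries and fallbacks are more forgiving):

```python
import math

def serial_availability(dependency_availabilities: list[float]) -> float:
    """End-to-end availability when a request must succeed on every serial call."""
    return math.prod(dependency_availabilities)

# Ten serial calls to a 99.9% dependency give roughly 99.0% end to end,
# so a 99.9% SLO for the caller is unachievable without retries or fallbacks.
```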
Never revisiting the SLO. An SLO set at product launch is calibrated to the system at launch. As traffic, complexity, and user expectations change, the SLO should be revisited — tightened if the system has become more reliable, reconsidered if user expectations have shifted.