Monitoring & Alerting
What we track in production and how we get notified before issues become incidents.
Overview
You cannot fix what you cannot see. Monitoring is the system's immune system — it surfaces failure before users find it, provides the context needed to diagnose incidents quickly, and makes reliability a measurable property rather than a feeling.
This page covers what we instrument, how we define alerts, and the practices that separate actionable monitoring from noise. Good monitoring is not about collecting everything — it is about collecting the right things and alerting on the right conditions.
Why It Matters
Know about failures before users do. Reactive support — waiting for a user to report a problem — means every outage has a human cost before it has a technical response. Monitoring closes the gap: when the error rate spikes, the on-call engineer knows before the first support ticket arrives.
Correlate deployments with impact. Without monitoring, deploying and discovering a problem are separated by time and guesswork. With it, deployment events appear on the same timeline as latency and error rate graphs. The correlation is immediate: the deploy went out at 14:32; the error rate rose at 14:33.
SLIs and SLOs require data. Reliability targets are meaningless without measurement. An SLO of "99.9% of requests succeed" requires a counter of total requests and a counter of failed requests, measured continuously. You cannot define a target for something you do not measure.
Incident response needs context. The difference between a 20-minute and a 4-hour incident is usually the quality of the available context. Structured logs that can be queried, distributed traces that show where time was spent, and dashboards that show the state at the time of the failure all dramatically reduce the time from "something is wrong" to "here is what happened and why."
Standards & Best Practices
Alert on symptoms, not causes
Alert on what users experience, not on what the system is doing internally. An alert on "error rate > 1%" is a symptom alert — a user is getting errors. An alert on "database connection pool at 90% capacity" is a cause alert — a potential future problem. Symptom alerts are always worth investigating. Cause alerts produce false positives: the pool was at 90% but recovered naturally, and nobody ever saw an error.
| Alert type | Example | Action required |
|---|---|---|
| Symptom | HTTP 5xx rate > 1% for 5 minutes | Always yes |
| Symptom | P95 latency > 2s for 5 minutes | Always yes |
| Cause | CPU at 80% | Usually no |
| Cause | Database connections at 90% capacity | Probably not yet |
Start with symptom alerts. Add cause alerts only when you have evidence they reliably predict a symptom before it occurs.
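As a sketch of how a symptom alert's firing condition can be evaluated — the bucket shape and function names here are illustrative, not from any particular monitoring tool:

```typescript
// One bucket of request counts per minute (illustrative shape).
interface MinuteBucket {
  total: number;  // requests observed in this minute
  errors: number; // 5xx responses in this minute
}

// "Error rate > 1% for 5 minutes": fires only when EVERY minute in the
// trailing window breaches the threshold, so a single noisy data point
// does not page anyone.
function symptomAlertFires(
  window: MinuteBucket[],
  thresholdRate = 0.01,
  minutes = 5,
): boolean {
  if (window.length < minutes) return false;
  return window
    .slice(-minutes)
    .every((b) => b.total > 0 && b.errors / b.total > thresholdRate);
}
```

Requiring the breach to be sustained is what turns "a threshold was crossed" into "users are experiencing errors".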
Instrument the four golden signals
Every service should expose these four signals. They cover the complete picture of how a service is behaving from a user perspective.
| Signal | What it measures | Example metric |
|---|---|---|
| Latency | How long requests take (especially slow requests) | P50, P95, P99 response time |
| Traffic | How much demand the service is receiving | Requests per second |
| Errors | The rate of requests that fail | HTTP 5xx rate, exception rate |
| Saturation | How full the service is (how close to limit) | CPU %, memory %, queue depth |
P95 and P99 latency matter more than average latency. Slow requests experienced by 5% of users are invisible in the average but visible in the tail percentile.
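A small illustration of the point, using a nearest-rank percentile and made-up numbers:

```typescript
// Nearest-rank percentile over a copy of the samples (illustrative,
// not a production-grade estimator).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// 95 requests at 100 ms and 5 at 3000 ms: the slow 5% barely moves the
// average but dominates the tail percentile.
const latencies = [
  ...Array.from({ length: 95 }, () => 100),
  ...Array.from({ length: 5 }, () => 3000),
];
const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length; // 245 ms
const p99 = percentile(latencies, 99); // 3000 ms
```

An average of 245 ms looks healthy; the P99 of 3000 ms is what the unlucky users actually experience.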
Structured logs only
Log in JSON. Free-text logs can be read by a human; structured logs can be queried by a machine. When an incident is occurring at 2am, you want to query `level = "error" AND service = "checkout" AND user_id = "12345"` — not grep through a wall of text.
```json
{
  "timestamp": "2024-11-14T14:32:01.234Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "checkout",
  "trace_id": "abc123def456",
  "user_id": "usr_7890",
  "order_id": "ord_4567",
  "error": "Stripe API timeout after 30000ms"
}
```

Every log entry should have at minimum: `timestamp`, `level`, `message`, `service`. Add request IDs and trace IDs so individual requests can be followed across service boundaries.
Define alert severity tiers
Not all alerts are equally urgent. An alert that wakes someone up at 3am must be worthy of that interruption. An alert about a metric trending in the wrong direction can wait for business hours.
| Severity | Name | Response time | Escalation | Meaning |
|---|---|---|---|---|
| P1 | Critical | Page immediately | Auto-escalate 15m | Users are actively impacted; data loss risk |
| P2 | High | Page within 15min | On-call engineer | Degraded experience for a subset of users |
| P3 | Medium | Business hours | Ticket created | Non-urgent degradation, no user impact yet |
| P4 | Low | Next sprint | Dashboard only | Informational; no immediate action needed |
Every alert must be assigned a severity when it is created. An alert without a severity defaults to P1 — which is wrong 99% of the time and teaches the team to ignore alerts.
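The assignment can be made mechanical. A sketch, with illustrative question flags an alert author would answer when creating the alert (the tiers mirror the table above):

```typescript
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

// Questions the alert author answers once, when the alert is created.
interface AlertTriage {
  userImpactNow: boolean;         // are users seeing errors right now?
  userImpactSoon: boolean;        // will they, if this continues ~15 minutes?
  couldBecomeUserFacing: boolean; // degradation that could worsen into impact?
}

function classify(a: AlertTriage): Severity {
  if (a.userImpactNow) return 'P1';
  if (a.userImpactSoon) return 'P2';
  if (a.couldBecomeUserFacing) return 'P3';
  return 'P4'; // informational — consider deleting the alert instead
}
```

Making severity an explicit, answerable question at creation time is what prevents the "everything defaults to P1" failure mode.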
Every alert has a runbook
If an alert fires and the on-call engineer doesn't know what to do, the alert is incomplete. Every alert must link to a runbook that describes:
- What this alert means
- How to investigate (what to look at first)
- Common causes and their fixes
- When to escalate
If you cannot write a runbook for an alert, the alert is not ready to be enabled.
Alert only on actionable conditions
An alert that fires and has no defined response is noise. It trains the team to ignore alerts. Before enabling any alert, ask: "If this fires at 3am, is there a specific action the on-call engineer should take?" If the answer is no, the alert should not exist as a P1 or P2. It might exist as a P3 or P4, or it should not exist at all.
How to Implement
Step 1 — Add structured logging
Replace `console.log("Payment failed")` with structured JSON output. In Node.js:
```typescript
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: 'checkout',
    env: process.env.NODE_ENV,
  },
});

// Usage
logger.error(
  { order_id: order.id, user_id: user.id, error: err.message },
  'Payment processing failed',
);
```

Pino outputs newline-delimited JSON, which log aggregators (Datadog, Grafana Loki, CloudWatch Logs) can index and query natively.
Set `LOG_LEVEL=debug` locally, `info` in staging, `warn` in production. Never log at `debug` level in production — the volume is overwhelming.
Step 2 — Instrument metrics
For HTTP services, record at minimum:
```typescript
// Example: Express middleware for request metrics
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    metrics.histogram('http_request_duration_ms', duration, {
      method: req.method,
      route: req.route?.path ?? 'unknown',
      status: String(res.statusCode),
    });
    metrics.increment('http_requests_total', {
      method: req.method,
      status: String(res.statusCode),
    });
  });
  next();
});
```

Label metrics with `method`, `route`, and `status`. This allows you to answer "is the error rate high for all routes, or just `/checkout`?"
Step 3 — Define SLIs and SLOs before setting alerts
An SLI (Service Level Indicator) is a metric. An SLO (Service Level Objective) is a target for that metric. Define both before configuring alerts.
```
Service: Checkout API

SLI: HTTP 5xx rate = (5xx responses / total responses) over a 5-minute window
SLO: 5xx rate < 1% over a 30-day rolling window

SLI: P95 request latency
SLO: P95 < 500ms over a 30-day rolling window
```

Set alerts that fire when the SLO is at risk — not when a single data point crosses a threshold.
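One common way to express "the SLO is at risk" numerically is an error budget and its burn rate. A sketch — the function names are illustrative:

```typescript
// Error budget: the number of failures the SLO allows over the window.
// For "5xx rate < 1%", the SLO success rate is 0.99.
function errorBudget(totalRequests: number, sloSuccessRate: number): number {
  return totalRequests * (1 - sloSuccessRate);
}

// Burn rate: current failure rate relative to the SLO's allowance.
// A burn rate of 1 spends the budget exactly over the window; alerting on
// a sustained burn rate well above 1 fires when the SLO is genuinely at
// risk, not when a single data point crosses a threshold.
function burnRate(currentFailureRate: number, sloFailureRate: number): number {
  return currentFailureRate / sloFailureRate;
}
```

With 10 million requests in the window and a 99% success target, the budget is 100,000 failed requests; a current 5% failure rate against a 1% allowance is a burn rate of 5.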
Step 4 — Write runbooks before enabling alerts
For each alert, create a runbook in your team's documentation before enabling the alert. Link the alert description to the runbook URL.
Runbook template:
```markdown
# Alert: High HTTP 5xx Rate

## Overview
The HTTP 5xx error rate for the checkout service has exceeded 1% for 5 consecutive minutes.
This alert indicates users are experiencing failed requests.

## Symptoms
- Users report payment failures or 500 errors
- Error rate graph shows spike in 5xx responses

## Investigation steps
1. Check recent deployments — did a deploy precede the spike?
   - Deployment dashboard: [link]
2. Check application error logs for the top error message
   - Log query: `service = "checkout" AND level = "error"` in the last 30 minutes
3. Check database connection pool metrics — is the DB healthy?
4. Check Stripe API status page if payment-related errors predominate

## Common causes and fixes
- **Database connection exhausted**: Restart the affected instances
- **Deployment introduced a bug**: Roll back to the previous version
- **Stripe API degradation**: Enable maintenance mode on the checkout page

## Escalation
- If not resolved in 30 minutes: escalate to the platform team
- If data loss is suspected: escalate to the engineering lead

## Resolution
Mark the alert as resolved in PagerDuty once the 5xx rate returns below 0.5% for 10 minutes.
```

Step 5 — Establish a post-deployment monitoring window
After every production deployment, monitor the service's key metrics for at least 30 minutes. This is not a manual process — it is a dashboard tab kept open after deploys, or an automated canary analysis.
Configure deployment markers in your observability platform so that deployments appear as vertical lines on metric graphs. This makes the correlation between deploy and impact immediate.
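A minimal sketch of what automated canary-style analysis around a deployment marker can look like, assuming per-minute error-rate samples; the function name and tolerance are illustrative:

```typescript
// Compare mean error rate before and after the deployment marker.
// `samples` are per-minute error rates; `deployIndex` is the sample at
// which the deploy went out.
function deployRegressed(
  samples: number[],
  deployIndex: number,
  tolerance = 2, // flag if the post-deploy rate more than doubles
): boolean {
  const mean = (xs: number[]) =>
    xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
  const before = mean(samples.slice(0, deployIndex));
  const after = mean(samples.slice(deployIndex));
  return before > 0 ? after / before > tolerance : after > 0;
}
```

Real canary analysis is more sophisticated (statistical tests, multiple metrics), but even a before/after mean comparison catches the "deploy at 14:32, errors at 14:33" case automatically.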
Tools & Templates
Monitoring tool overview
| Tool | Strengths | Best for |
|---|---|---|
| Datadog | Full-stack: metrics, logs, traces, APM, alerts | Teams that want one platform |
| Grafana + Prometheus | Open source, flexible, powerful query language | Cost-sensitive teams, self-hosted |
| Sentry | Error tracking, stack traces, release tracking | Frontend + application error tracking |
| Honeycomb | High-cardinality queries, distributed tracing | Complex distributed systems |
| AWS CloudWatch | Native AWS integration, no extra agent | AWS-native teams |
| Grafana Loki | Log aggregation companion to Grafana + Prometheus | Teams already on Grafana stack |
Most teams benefit from combining two tools: a metrics + alerting platform (Datadog or Grafana + Prometheus) with an error tracking tool (Sentry). These are complementary, not overlapping.
Alert severity decision tree
```
Is a user actively experiencing an error or outage right now?
  → Yes: P1 (page immediately)
Will a user experience an error if this continues for 15 more minutes?
  → Yes: P2 (page within 15 minutes)
Is this a degradation that could become user-facing if it worsens?
  → Yes: P3 (business-hours ticket)
Is this informational with no user impact?
  → P4 (dashboard only) or delete the alert
```

Structured log fields reference
| Field | Type | Description | Required |
|---|---|---|---|
| `timestamp` | string | ISO 8601 timestamp | Yes |
| `level` | string | debug, info, warn, error, fatal | Yes |
| `message` | string | Human-readable description of the event | Yes |
| `service` | string | Service name (checkout, auth, api) | Yes |
| `trace_id` | string | Distributed trace ID for cross-service requests | Recommended |
| `request_id` | string | Unique ID for the HTTP request | Recommended |
| `user_id` | string | ID of the user associated with the event | When applicable |
| `error` | string | Error message (for error-level logs) | When applicable |
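The required fields above can be encoded as a type, so the logger enforces them at compile time — a sketch, with field names taken directly from the table:

```typescript
// Required fields per the reference table; optional fields are
// recommended or situational.
interface LogEntry {
  timestamp: string; // ISO 8601
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;
  service: string;
  trace_id?: string;
  request_id?: string;
  user_id?: string;
  error?: string;
}

// Runtime check for log lines arriving from services not yet on the
// structured-logging standard.
function isValidLogEntry(e: Partial<LogEntry>): boolean {
  return Boolean(e.timestamp && e.level && e.message && e.service);
}
```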
Common Pitfalls
Alerting on causes instead of symptoms. "CPU at 80%" fires constantly and rarely corresponds to a user-facing problem. "Error rate > 1%" always corresponds to a user-facing problem. Start with symptom alerts and add cause alerts only when they have proven predictive value for a specific symptom.
Alert fatigue from too many P1s. When everything is P1, nothing is P1. A team that receives 30 pages per week treats pages as noise. Audit alert severity quarterly. If a P1 fires more than twice a week without corresponding user impact, it is miscategorised.
No runbooks. An alert with no runbook requires the on-call engineer to improvise under pressure. Improvisation during incidents is slower and more error-prone than following a documented procedure. Write the runbook before enabling the alert.
Unstructured log messages. `console.log("Error processing payment for user " + userId)` cannot be efficiently queried, aggregated, or alerted on. Structured logs with discrete fields for `user_id`, `amount`, and `error` can be. The transition from unstructured to structured logs is worth doing across the entire codebase, not just in new code.
No post-deployment monitoring window. Deployments that go out on a Friday afternoon without anyone watching the dashboards become incidents discovered Monday morning. Deploy during business hours. Keep a dashboard open for 30 minutes after every production deploy. When deploying is inconvenient because of monitoring requirements, teams naturally become more thoughtful about when and how frequently they deploy.