
Monitoring & Alerting

What we track in production and how we get notified before issues become incidents.

Overview

You cannot fix what you cannot see. Monitoring is the system's immune system — it surfaces failure before users find it, provides the context needed to diagnose incidents quickly, and makes reliability a measurable property rather than a feeling.

This page covers what we instrument, how we define alerts, and the practices that separate actionable monitoring from noise. Good monitoring is not about collecting everything — it is about collecting the right things and alerting on the right conditions.


Why It Matters

Know about failures before users do. Reactive support — waiting for a user to report a problem — means every outage has a human cost before it has a technical response. Monitoring closes the gap: when the error rate spikes, the on-call engineer knows before the first support ticket arrives.

Correlate deployments with impact. Without monitoring, deploying and discovering a problem are separated by time and guesswork. With it, deployment events appear on the same timeline as latency and error rate graphs. The correlation is immediate: the deploy went out at 14:32; the error rate rose at 14:33.

SLIs and SLOs require data. Reliability targets are meaningless without measurement. An SLO of "99.9% of requests succeed" requires a counter of total requests and a counter of failed requests, measured continuously. You cannot define a target for something you do not measure.
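
To make the arithmetic concrete, here is a back-of-the-envelope sketch (the request counts are invented for illustration):

// Error budget implied by a 99.9% availability SLO over 30 days.
const slo = 0.999;
const totalRequests = 10_000_000; // total-requests counter, 30-day window
const failedRequests = 4_200;     // failed-requests counter, same window

const errorBudget = totalRequests * (1 - slo);        // 10,000 allowed failures
const budgetRemaining = errorBudget - failedRequests; // 5,800 failures left
const successRate = 1 - failedRequests / totalRequests;

console.log(`success rate: ${(successRate * 100).toFixed(3)}%`); // 99.958%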

Incident response needs context. The difference between a 20-minute and a 4-hour incident is usually the quality of the available context. Structured logs that can be queried, distributed traces that show where time was spent, and dashboards that show the state at the time of the failure all dramatically reduce the time from "something is wrong" to "here is what happened and why."


Standards & Best Practices

Alert on symptoms, not causes

Alert on what users experience, not on what the system is doing internally. An alert on "error rate > 1%" is a symptom alert: a user is getting errors. An alert on "database connection pool at 90% capacity" is a cause alert: a potential future problem. Symptom alerts are always worth investigating. Cause alerts produce false positives: the pool hit 90% but recovered on its own, and nobody ever saw an error.

| Alert type | Example | Action required |
| --- | --- | --- |
| Symptom | HTTP 5xx rate > 1% for 5 minutes | Always yes |
| Symptom | P95 latency > 2s for 5 minutes | Always yes |
| Cause | CPU at 80% | Usually no |
| Cause | Database connections at 90% capacity | Probably not yet |

Start with symptom alerts. Add cause alerts only when you have evidence they reliably predict a symptom before it occurs.
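
To make "for 5 minutes" concrete, here is a minimal sketch of the evaluation logic; the data shape and function are illustrative, not taken from any particular alerting platform:

// Fire only when every sample in the window breaches the threshold,
// so one noisy scrape does not page anyone.
interface Sample {
  timestamp: number; // epoch ms
  errorRate: number; // fraction of failed requests, e.g. 0.012
}

function shouldFire(samples: Sample[], threshold = 0.01, windowMs = 5 * 60_000): boolean {
  const cutoff = Date.now() - windowMs;
  const inWindow = samples.filter((s) => s.timestamp >= cutoff);
  return inWindow.length > 0 && inWindow.every((s) => s.errorRate > threshold);
}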

Instrument the four golden signals

Every service should expose these four signals. Together they give a complete picture of how a service is behaving from the user's perspective.

| Signal | What it measures | Example metric |
| --- | --- | --- |
| Latency | How long requests take (especially slow requests) | P50, P95, P99 response time |
| Traffic | How much demand the service is receiving | Requests per second |
| Errors | The rate of requests that fail | HTTP 5xx rate, exception rate |
| Saturation | How full the service is (how close to limit) | CPU %, memory %, queue depth |

P95 and P99 latency matter more than average latency. Slow requests experienced by 5% of users are invisible in the average but visible in the tail percentile.
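
A small worked example shows how far the tail can sit from the average; the percentile function below uses the nearest-rank method, one of several common definitions:

// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// 94 fast requests and 6 slow ones: the average hides the tail.
const latencies = [...Array(94).fill(50), ...Array(6).fill(4000)];
const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;

console.log(avg);                       // 287ms, looks tolerable
console.log(percentile(latencies, 50)); // 50ms, the median is fine
console.log(percentile(latencies, 95)); // 4000ms, the tail is not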

Structured logs only

Log in JSON. Free-text logs can be read by a human; structured logs can be queried by a machine. When an incident is occurring at 2am, you want to query `level = "error" AND service = "checkout" AND user_id = "12345"`, not grep through a wall of text.

{
  "timestamp": "2024-11-14T14:32:01.234Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "checkout",
  "trace_id": "abc123def456",
  "user_id": "usr_7890",
  "order_id": "ord_4567",
  "error": "Stripe API timeout after 30000ms"
}

Every log entry should have at minimum: timestamp, level, message, service. Add request IDs and trace IDs so individual requests can be followed across service boundaries.

Define alert severity tiers

Not all alerts are equally urgent. An alert that wakes someone up at 3am must be worthy of that interruption. An alert about a metric trending in the wrong direction can wait for business hours.

| Severity | Name | Response time | Escalation | Meaning |
| --- | --- | --- | --- | --- |
| P1 | Critical | Page immediately | Auto-escalate after 15 min | Users are actively impacted; data loss risk |
| P2 | High | Page within 15 min | On-call engineer | Degraded experience for a subset of users |
| P3 | Medium | Business hours | Ticket created | Non-urgent degradation, no user impact yet |
| P4 | Low | Next sprint | Dashboard only | Informational; no immediate action needed |

Every alert must be assigned a severity when it is created. An alert without a severity defaults to P1 — which is wrong 99% of the time and teaches the team to ignore alerts.

Every alert has a runbook

If an alert fires and the on-call engineer doesn't know what to do, the alert is incomplete. Every alert must link to a runbook that describes:

  • What this alert means
  • How to investigate (what to look at first)
  • Common causes and their fixes
  • When to escalate

If you cannot write a runbook for an alert, the alert is not ready to be enabled.

Alert only on actionable conditions

An alert that fires and has no defined response is noise. It trains the team to ignore alerts. Before enabling any alert, ask: "If this fires at 3am, is there a specific action the on-call engineer should take?" If the answer is no, the alert should not exist as a P1 or P2. It might exist as a P3 or P4, or it should not exist at all.


How to Implement

Step 1 — Add structured logging

Replace console.log("Payment failed") with structured JSON output. In Node.js:

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: 'checkout',
    env: process.env.NODE_ENV,
  },
});

// Usage
logger.error(
  { order_id: order.id, user_id: user.id, error: err.message },
  'Payment processing failed',
);

Pino outputs newline-delimited JSON, which log aggregators (Datadog, Grafana Loki, CloudWatch Logs) can index and query natively.

Set LOG_LEVEL=debug locally, info in staging, warn in production. Never log at debug level in production — the volume is overwhelming.
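
To get the request and trace IDs recommended earlier into every line, one option is a pino child logger created per request. The sketch below reuses the `logger` from the example above with an Express `app`; the header names are common conventions rather than a standard, and the pino-http package implements this same pattern:

import { randomUUID } from 'node:crypto';

// Create a per-request child logger; every call through it
// automatically carries request_id and trace_id.
app.use((req, res, next) => {
  res.locals.log = logger.child({
    request_id: req.headers['x-request-id'] ?? randomUUID(),
    trace_id: req.headers['x-trace-id'],
  });
  next();
});

// In a handler:
// res.locals.log.error({ order_id: order.id }, 'Payment processing failed');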

Step 2 — Instrument metrics

For HTTP services, record at minimum:

// Example: Express middleware for request metrics.
// `metrics` stands in for your metrics client (StatsD, Prometheus,
// etc.); the histogram/increment call shapes are illustrative.
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    metrics.histogram('http_request_duration_ms', duration, {
      method: req.method,
      route: req.route?.path ?? 'unknown',
      status: String(res.statusCode),
    });
    metrics.increment('http_requests_total', {
      method: req.method,
      status: String(res.statusCode),
    });
  });
  next();
});

Label metrics with method, route, and status. This allows you to answer "is the error rate high for all routes, or just /checkout?"

Step 3 — Define SLIs and SLOs before setting alerts

An SLI (Service Level Indicator) is a metric. An SLO (Service Level Objective) is a target for that metric. Define both before configuring alerts.

Service: Checkout API

SLI: HTTP 5xx rate = (5xx responses / total responses) over a 5-minute window
SLO: 5xx rate < 1% over a 30-day rolling window

SLI: P95 request latency
SLO: P95 < 500ms over a 30-day rolling window

Set alerts that fire when the SLO is at risk — not when a single data point crosses a threshold.
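
One way to express "the SLO is at risk" is a burn-rate check: compare the observed error rate to the rate that would exactly exhaust the budget over the full window. This is a sketch of the idea with invented input values, not a drop-in rule; the 14.4 threshold, popularized by the Google SRE workbook, means the 30-day budget would burn out in roughly two days:

// Burn rate 1.0 = the error budget lasts exactly the 30-day window.
function burnRate(observedErrorRate: number, slo: number): number {
  return observedErrorRate / (1 - slo);
}

const errorRate5m = 0.02;  // failed fraction over the last 5 minutes
const errorRate1h = 0.017; // failed fraction over the last hour

// Require both a short and a long window to burn fast: one noisy
// minute is not enough, but a sustained problem pages early.
const pageNow =
  burnRate(errorRate5m, 0.999) > 14.4 &&
  burnRate(errorRate1h, 0.999) > 14.4;

console.log(pageNow); // true: 20x and 17x burn, page the on-call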

Step 4 — Write runbooks before enabling alerts

For each alert, create a runbook in your team's documentation before enabling the alert. Link the alert description to the runbook URL.

Runbook template:

# Alert: High HTTP 5xx Rate

## Overview

The HTTP 5xx error rate for the checkout service has exceeded 1% for 5 consecutive minutes.
This alert indicates users are experiencing failed requests.

## Symptoms

- Users report payment failures or 500 errors
- Error rate graph shows spike in 5xx responses

## Investigation steps

1. Check recent deployments — did a deploy precede the spike?
   - Deployment dashboard: [link]
2. Check application error logs for the top error message
   - Log query: `service = "checkout" AND level = "error"` in the last 30 minutes
3. Check database connection pool metrics — is the DB healthy?
4. Check Stripe API status page if payment-related errors predominate

## Common causes and fixes

- **Database connection exhausted**: Restart the affected instances
- **Deployment introduced a bug**: Roll back to the previous version
- **Stripe API degradation**: Enable maintenance mode on the checkout page

## Escalation

- If not resolved in 30 minutes: escalate to the platform team
- If data loss is suspected: escalate to the engineering lead

## Resolution

Mark the alert as resolved in PagerDuty once the 5xx rate has stayed below 0.5% for 10 minutes.

Step 5 — Establish a post-deployment monitoring window

After every production deployment, monitor the service's key metrics for at least 30 minutes. This does not require heavy process: a dashboard tab kept open after the deploy, or an automated canary analysis, is enough.

Configure deployment markers in your observability platform so that deployments appear as vertical lines on metric graphs. This makes the correlation between deploy and impact immediate.
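
As a sketch of one option, Grafana's annotations API accepts an event that renders on dashboards (Datadog has an equivalent events API); the environment variables here are placeholders for your setup:

// Post a deployment annotation after a successful deploy.
await fetch(`${process.env.GRAFANA_URL}/api/annotations`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.GRAFANA_TOKEN}`,
  },
  body: JSON.stringify({
    time: Date.now(),
    tags: ['deployment', 'checkout'],
    text: `Deployed checkout ${process.env.GIT_SHA}`,
  }),
});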


Tools & Templates

Monitoring tool overview

| Tool | Strengths | Best for |
| --- | --- | --- |
| Datadog | Full-stack: metrics, logs, traces, APM, alerts | Teams that want one platform |
| Grafana + Prometheus | Open source, flexible, powerful query language | Cost-sensitive teams, self-hosted |
| Sentry | Error tracking, stack traces, release tracking | Frontend + application error tracking |
| Honeycomb | High-cardinality queries, distributed tracing | Complex distributed systems |
| AWS CloudWatch | Native AWS integration, no extra agent | AWS-native teams |
| Grafana Loki | Log aggregation companion to Grafana + Prometheus | Teams already on the Grafana stack |

Most teams benefit from combining two tools: a metrics + alerting platform (Datadog or Grafana + Prometheus) with an error tracking tool (Sentry). These are complementary, not overlapping.

Alert severity decision tree

Is a user actively experiencing an error or outage right now?
  → Yes: P1 (page immediately)

Will a user experience an error if this continues for 15 more minutes?
  → Yes: P2 (page within 15 minutes)

Is this a degradation that could become user-facing if it worsens?
  → Yes: P3 (business-hours ticket)

Is this informational with no user impact?
  → P4 (dashboard only) or delete the alert
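
Teams that want this codified can translate the tree directly into code; the input shape below is invented for illustration:

type Severity = 'P1' | 'P2' | 'P3' | 'P4';

interface AlertContext {
  usersImpactedNow: boolean;       // error or outage right now
  usersImpactedWithin15m: boolean; // impact if this continues
  couldBecomeUserFacing: boolean;  // degradation that may worsen
}

function classify(a: AlertContext): Severity {
  if (a.usersImpactedNow) return 'P1';
  if (a.usersImpactedWithin15m) return 'P2';
  if (a.couldBecomeUserFacing) return 'P3';
  return 'P4';
}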

Structured log fields reference

| Field | Type | Description | Required |
| --- | --- | --- | --- |
| timestamp | string | ISO 8601 timestamp | Yes |
| level | string | debug, info, warn, error, fatal | Yes |
| message | string | Human-readable description of the event | Yes |
| service | string | Service name (checkout, auth, api) | Yes |
| trace_id | string | Distributed trace ID for cross-service requests | Recommended |
| request_id | string | Unique ID for the HTTP request | Recommended |
| user_id | string | ID of the user associated with the event | When applicable |
| error | string | Error message (for error-level logs) | When applicable |
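
For codebases that want to enforce this reference at compile time, it maps to a type like the following sketch (field names match the table):

interface LogEntry {
  timestamp: string;   // ISO 8601
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;
  service: string;
  trace_id?: string;   // recommended
  request_id?: string; // recommended
  user_id?: string;    // when applicable
  error?: string;      // when applicable
}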

Common Pitfalls

Alerting on causes instead of symptoms. "CPU at 80%" fires constantly and rarely corresponds to a user-facing problem. "Error rate > 1%" always corresponds to a user-facing problem. Start with symptom alerts and add cause alerts only when they have proven predictive value for a specific symptom.

Alert fatigue from too many P1s. When everything is P1, nothing is P1. A team that receives 30 pages per week treats pages as noise. Audit alert severity quarterly. If a P1 fires more than twice a week without corresponding user impact, it is miscategorised.

No runbooks. An alert with no runbook requires the on-call engineer to improvise under pressure. Improvisation during incidents is slower and more error-prone than following a documented procedure. Write the runbook before enabling the alert.

Unstructured log messages. console.log("Error processing payment for user " + userId) cannot be efficiently queried, aggregated, or alerted on. Structured logs with discrete fields for user_id, amount, and error can be. The transition from unstructured to structured logs is worth doing across the entire codebase, not just in new code.

No post-deployment monitoring window. Deployments that go out on a Friday afternoon without anyone watching the dashboards become incidents discovered Monday morning. Deploy during business hours. Keep a dashboard open for 30 minutes after every production deploy. When deploying is inconvenient because of monitoring requirements, teams naturally become more thoughtful about when and how frequently they deploy.