Metrics Interpretation

Reading delivery and product metrics accurately without drawing false or misleading conclusions.

Overview

Metrics are the language that product and engineering teams use to communicate with each other and with stakeholders. They compress complex realities into numbers that can be tracked, compared, and acted on. But metrics are also systematically misread — through selection bias, inappropriate aggregation, confounded variables, and the human tendency to find patterns in noise. Understanding how to interpret metrics correctly is as important as knowing which ones to track.

For using metrics to design experiments, see A/B Testing Fundamentals. For connecting metrics to planning, see Outcome-Based Planning.

Why It Matters

Misread metrics produce confident wrong decisions. A team that interprets a correlation as causation, or an average as representative, or a rising metric as progress, can make high-confidence decisions in the wrong direction. The confidence that metrics provide is only valuable if the interpretation is correct.

Vanity metrics look good but drive nothing. Page views, registered users, and "active users" (when defined loosely) are easy to move and easy to report, but rarely connect to business value. Teams that optimise for vanity metrics produce dashboards that look healthy while the product stagnates.

Leading and lagging indicators serve different purposes. A lagging indicator tells you what happened. A leading indicator tells you what is likely to happen. Confusing the two — acting urgently on a lagging indicator, or assuming a leading indicator is conclusive — produces both slow responses and premature decisions.

Statistical significance is not practical significance. A difference between two measurements can be statistically significant (unlikely to be explained by chance alone) and practically insignificant (too small to matter). A 0.1% improvement in conversion that is statistically significant at p<0.01 is probably a real effect, but it is only worth acting on if the gain outweighs the cost of building and maintaining the change.
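The gap between the two kinds of significance can be sketched with a pooled two-proportion z-test. The traffic figures and the MIN_WORTHWHILE_LIFT threshold below are assumptions made for illustration, and the 0.1% is treated as an absolute change of 0.1 percentage points. A real analysis would normally use a statistics library rather than this hand-rolled test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, two-sided p-value) for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p_b - p_a, p_value

# Hypothetical experiment: 5,000,000 users per arm, 10.0% vs 10.1% conversion.
lift, p_value = two_proportion_z_test(500_000, 5_000_000, 505_000, 5_000_000)

# Assumed business threshold: changes below 0.5 percentage points are not worth shipping.
MIN_WORTHWHILE_LIFT = 0.005

statistically_significant = p_value < 0.01
practically_significant = lift >= MIN_WORTHWHILE_LIFT
print(statistically_significant, practically_significant)
```

With a sample this large the test flags the difference as statistically significant, while the lift stays below the assumed threshold. The two questions are answered separately.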

Aggregate metrics hide distribution. An average response time of 200ms looks healthy. An average response time of 200ms with a p95 of 2 seconds is a serious user experience problem for the 5% of requests at the tail. Averages obscure distributions; percentiles tell the real story.
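A minimal sketch of the same effect, using a made-up latency sample (the values are assumptions chosen to keep the arithmetic easy to follow):

```python
import statistics

# Hypothetical sample (ms): 94 fast requests and 6 slow ones.
latencies_ms = [100] * 94 + [2000] * 6

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
# quantiles(n=100) returns the 99 cut points p1..p99, so index 94 is p95.
p95_ms = statistics.quantiles(latencies_ms, n=100)[94]

print(mean_ms, median_ms, p95_ms)
```

The mean sits close to the fast requests, the median is identical to them, and only the p95 exposes the slow tail.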

Standards & Best Practices

North star metric

A north star metric is the single metric that best represents the value your product delivers to users. It should be:

A measure of user value, not business value (though the two should be correlated)
Something the team can influence through product decisions
Specific enough to be meaningfully moved
Lagging enough to reflect actual outcomes, not just activity

Examples by product type:

Productivity tool: tasks completed per user per week
Marketplace: successful transactions per month
Social platform: messages sent per active user per day
Analytics tool: reports run per active user per week

The north star metric focuses the team on user value. Metrics that do not connect to the north star are either supporting metrics or vanity metrics.

Leading vs lagging indicators

Type	Characteristics	Use
Leading indicator	Changes before the outcome; can be influenced in the short term	Early signal; early warning; useful for hypothesis testing
Lagging indicator	Changes after the outcome; reflects what has already happened	Ground truth for evaluating decisions; less actionable in real time

Example: Customer acquisition is a leading indicator; revenue is the lagging indicator it should predict. Increasing customer acquisition is meaningless if the new customers do not convert or are not retained. Do not optimise leading indicators in ways that decouple them from the lagging indicators they should predict.

Vanity metrics to avoid

Metric	Why it is vanity	Replace with
Total registered users	Includes inactive, fake, and churned users	Monthly active users (with a meaningful activity definition)
Page views	Inflated by bots, empty page loads, and visitors who leave immediately	Engagement events per session
"Growth" in absolute numbers	Does not account for churn	Net retention; cohort retention
Features shipped	Output, not outcome	Adoption rate of shipped features
Sprint velocity (as a target)	Easy to game; not a value signal	Predictability ratio; on-time delivery rate
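As one illustration of a replacement metric from the table, cohort retention can be computed from an activity log. The log structure, user IDs, and months below are hypothetical. What counts as a "meaningful action" is left to the metric definition.

```python
# Hypothetical activity log: user -> (signup month, months with a meaningful action).
activity = {
    "u1": ("2024-01", {"2024-01", "2024-02", "2024-03"}),
    "u2": ("2024-01", {"2024-01"}),
    "u3": ("2024-01", {"2024-01", "2024-03"}),
    "u4": ("2024-02", {"2024-02", "2024-03"}),
}

def cohort_retention(activity, cohort, month):
    """Share of users who signed up in `cohort` and were active in `month`."""
    members = [months for signup, months in activity.values() if signup == cohort]
    if not members:
        return None  # no users in this cohort
    return sum(month in months for months in members) / len(members)

total_registered = len(activity)  # the vanity number: it only ever goes up
jan_cohort_in_march = cohort_retention(activity, "2024-01", "2024-03")
print(total_registered, jan_cohort_in_march)
```

Total registered users keeps growing regardless of product quality, whereas the cohort figure drops as soon as users from a given month stop coming back.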

Simpson's Paradox

Simpson's Paradox occurs when a trend appears in several groups of data but reverses when the groups are combined. Example:

Version A converts 40% of mobile users and 20% of desktop users
Version B converts 35% of mobile users and 15% of desktop users
But: B has 80% mobile traffic and A has 20% mobile traffic
Overall: B converts 31% in aggregate against 24% for A, despite performing worse on both segments

The solution is to segment before aggregating. Always check whether your aggregate metric is masking opposite trends in subgroups, especially when the composition of the population is changing (which it usually is during a rollout).
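The reversal is plain weighted-average arithmetic. The sketch below uses illustrative rates and traffic mixes (assumed for this example) in which version A wins every segment yet loses in aggregate:

```python
# Illustrative numbers: segment -> (conversion rate, share of that version's traffic).
versions = {
    "A": {"mobile": (0.40, 0.20), "desktop": (0.20, 0.80)},
    "B": {"mobile": (0.35, 0.80), "desktop": (0.15, 0.20)},
}

def aggregate_rate(segments):
    """Overall conversion is the traffic-weighted average of the segment rates."""
    return sum(rate * share for rate, share in segments.values())

aggregate = {name: aggregate_rate(segments) for name, segments in versions.items()}

a_wins_every_segment = all(
    versions["A"][seg][0] > versions["B"][seg][0] for seg in ("mobile", "desktop")
)
b_wins_aggregate = aggregate["B"] > aggregate["A"]
print(aggregate, a_wins_every_segment, b_wins_aggregate)
```

B's aggregate is higher only because most of its traffic lands in the segment that converts well for both versions. Comparing segment by segment removes that mix effect.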

Correlation vs causation

Two metrics moving together does not mean one causes the other. Common confounders:

Time correlation: two unrelated things growing together over time
User type correlation: features adopted by highly engaged users appear to cause engagement
Seasonal effects: metrics that vary with time of year, day of week, or external events

Before concluding that feature X caused metric Y to move, ask:

Is there a plausible causal mechanism?
Did Y move before or after users adopted X?
Did Y move for users who did not adopt X?
Was there another change at the same time that could explain the movement?
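The third question can be turned into a rough calculation by comparing the change among adopters with the change among non-adopters over the same period. This is a difference-in-differences estimate, a technique brought in here for illustration rather than one the checklist prescribes, and the retention figures are hypothetical. It assumes both groups would have moved in parallel without the feature, which self-selected adopters often violate.

```python
def diff_in_diff(adopters_before, adopters_after, others_before, others_after):
    """Change among adopters minus the change among non-adopters over the same period."""
    return (adopters_after - adopters_before) - (others_after - others_before)

# Hypothetical retention rates before and after feature X launched.
naive_change = 0.58 - 0.50  # what adopters alone would suggest
adjusted_effect = diff_in_diff(
    adopters_before=0.50, adopters_after=0.58,
    others_before=0.40, others_after=0.46,
)
print(round(naive_change, 2), round(adjusted_effect, 2))
```

Most of the apparent lift also shows up among users who never adopted X, so something other than the feature explains the bulk of the movement.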

How to Implement

Building a metrics framework

For each key product area, define:

North star metric — the primary measure of user value
Supporting metrics — metrics that explain movements in the north star (retention rate, activation rate, task completion rate)
Health metrics — metrics that must stay within bounds but are not the goal (error rate, latency, support ticket volume)
Counter-metrics — metrics that should not decrease as a result of optimising the primary metric (e.g., if you optimise for activation, retention should not fall)
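One way to make the four roles concrete is to encode them as data and check them after each change. The class, field names, metric names, and bounds below are hypothetical, a sketch rather than a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class MetricFramework:
    north_star: str
    supporting: list[str] = field(default_factory=list)
    # Health metrics map to an allowed (low, high) range.
    health_bounds: dict[str, tuple[float, float]] = field(default_factory=dict)
    counter_metrics: list[str] = field(default_factory=list)

    def violations(self, before, after):
        """List health metrics out of bounds and counter-metrics that decreased."""
        issues = []
        for name, (low, high) in self.health_bounds.items():
            if not low <= after[name] <= high:
                issues.append(f"health metric out of bounds: {name}")
        for name in self.counter_metrics:
            if after[name] < before[name]:
                issues.append(f"counter-metric decreased: {name}")
        return issues

framework = MetricFramework(
    north_star="tasks_completed_per_user_per_week",
    supporting=["activation_rate"],
    health_bounds={"error_rate": (0.0, 0.01)},
    counter_metrics=["retention_rate"],
)
before = {"error_rate": 0.004, "retention_rate": 0.62}
after = {"error_rate": 0.006, "retention_rate": 0.58}
issues = framework.violations(before, after)
print(issues)
```

Here the error rate stays within bounds but retention fell, so the check flags the counter-metric even if the north star improved.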

Interpreting a metric movement

When a metric moves unexpectedly (up or down):

Check the time period — is this a point-in-time anomaly or a trend?
Segment — does the movement appear in all user segments or only specific ones?
Look for coincident changes — what else changed at the same time? (releases, marketing campaigns, seasonal factors)
Check data pipeline integrity — is the data collection working correctly?
Generate hypotheses before drawing conclusions — what are the three most plausible explanations?
Test the hypotheses — what data would confirm or refute each one?
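Before working through the steps above, it helps to check whether the movement is outside normal variation at all. The sketch below compares the latest value with recent history using a z-score. The threshold of 3 and the sample data are assumptions, and the approach presumes a reasonably stable, non-seasonal baseline.

```python
import statistics

def is_unusual(history, latest, z_threshold=3.0):
    """True if `latest` is more than `z_threshold` standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

# Hypothetical weekly signups for the previous eight weeks.
weekly_signups = [1020, 980, 1005, 995, 1010, 990, 1000, 1015]

print(is_unusual(weekly_signups, 1012))  # within the usual week-to-week wobble
print(is_unusual(weekly_signups, 850))   # far outside it, worth investigating
```

A value that passes this check can still be part of a slow drift, so it complements the segmentation and hypothesis steps rather than replacing them.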

Dashboard hygiene

Metrics dashboards tend to accumulate over time without pruning. Apply these rules:

If nobody reads a metric for 60 days, consider removing it
If a metric cannot be explained in one sentence, it is too complex to drive decisions
If a metric is not connected to either a north star or a health threshold, it should be archived
Review the dashboard composition quarterly
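Two of these rules are mechanical enough to automate as part of the quarterly review. The sketch below assumes view timestamps and a north star or health link are recorded per metric; the class and field names are hypothetical. The one-sentence rule still needs human judgement.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class DashboardMetric:
    name: str
    last_viewed: date
    linked_to: Optional[str]  # "north_star", "health", or None if unconnected

def review(metrics, today, stale_after_days=60):
    """Map each metric name to a suggested action."""
    actions = {}
    for m in metrics:
        if m.linked_to is None:
            actions[m.name] = "archive"
        elif today - m.last_viewed > timedelta(days=stale_after_days):
            actions[m.name] = "consider removing"
        else:
            actions[m.name] = "keep"
    return actions

metrics = [
    DashboardMetric("weekly_tasks_completed", date(2024, 6, 20), "north_star"),
    DashboardMetric("legacy_api_latency", date(2024, 3, 1), "health"),
    DashboardMetric("total_page_views", date(2024, 6, 25), None),
]
actions = review(metrics, today=date(2024, 6, 30))
print(actions)
```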

Tools & Templates

Metric definition card

Document each key metric to prevent misinterpretation:

## Metric: [Name]

**Definition:** [Precise description of what is counted]
**Unit:** [e.g., users, sessions, events, %]
**Calculation:** [Formula or query]
**Numerator:** [What counts in the top]
**Denominator:** [What counts in the bottom]
**Excluded:** [What is explicitly excluded and why]

**Why this matters:** [1–2 sentences connecting to user value]
**North star relationship:** [Leading / lagging / health / counter]
**Expected range:** [Baseline value or healthy range]
**Alert threshold:** [When to investigate]

**Owner:** [Name]
**Last reviewed:** [Date]

Metric movement investigation template

## Metric Investigation — [Metric name]

**Observed change:** [Metric] moved from [X] to [Y] between [date] and [date]
**Expected direction:** [Up/down/flat]
**Significance:** [Is this within normal variation?]

### Hypotheses

1. [Hypothesis 1]
2. [Hypothesis 2]
3. [Hypothesis 3]

### Data checks

- [ ] Confirmed data collection is working correctly
- [ ] Segmented by user type: [findings]
- [ ] Checked for coincident changes: [what else happened]
- [ ] Checked comparison period for seasonality: [findings]

### Conclusion

[What caused the movement, with supporting evidence]

### Action

[What we are doing in response, or "monitoring" with a review date]

Common Pitfalls

Optimising a metric without a counter-metric. Every metric can be improved by degrading something else. A team that optimises session length without tracking task completion rate may be increasing frustration, not engagement. Define counter-metrics before optimising.

Acting on a single data point. A metric that moves unusually in one period might be noise. One data point is not a trend. Set a minimum number of periods before concluding a trend is real (typically 3 consecutive periods for weekly data).
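The minimum-periods rule can be written as a simple guard. This sketch interprets "3 consecutive periods" as three consecutive period-over-period moves in the same direction, which is an assumption about the rule rather than a standard definition.

```python
def is_sustained_trend(values, min_periods=3):
    """True if the last `min_periods` period-over-period changes share one direction."""
    if len(values) < min_periods + 1:
        return False  # not enough history to judge
    recent = values[-(min_periods + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    return all(d > 0 for d in deltas) or all(d < 0 for d in deltas)

print(is_sustained_trend([100, 101, 99, 104]))   # mixed directions: treat as noise
print(is_sustained_trend([100, 102, 105, 109]))  # three rises in a row
```

A direction-only check ignores the size of each move, so it works best alongside a check on whether the values fall outside normal variation.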

Reporting averages for skewed distributions. The mean is a reasonable summary only when data is roughly symmetric. For latency, revenue, session length, and most product metrics, distributions are right-skewed, so a few large values pull the mean well above what a typical user experiences. Report the median and percentiles (p75, p95) instead of the mean.

Mixing causation and correlation in presentations. Stakeholders who hear "users who use feature X have 40% higher retention" conclude that feature X causes retention. The product team knows correlation ≠ causation. Make the distinction explicit in presentations, or the stakeholder will draw the wrong conclusion and invest accordingly.

Changing metric definitions without documenting the change. When the way a metric is calculated changes, historical comparisons become meaningless. Always document definition changes in the metric card and add a note to dashboards at the change date.

Trusting all data equally. Data collection pipelines fail, instruments drift, and event schemas change. Before making a significant decision on a metric, verify that the underlying data is trustworthy. A decision based on broken instrumentation is worse than no decision at all.
