A/B Testing Fundamentals

Designing and running experiments that produce statistically valid, actionable results.

Overview

A/B testing (also called controlled experimentation or split testing) is the practice of comparing two or more variants of a product experience by randomly assigning users to each variant and measuring the effect on a target metric. Done correctly, it is the most reliable way to establish causal relationships in product development. Done incorrectly — which is common — it produces misleading results that lead to confident wrong decisions.

For identifying what to test, see Feature Adoption Tracking. For interpreting the results of tests, see Metrics Interpretation.


Why It Matters

A/B tests establish causation, not just correlation. Most product analytics measure correlation — user behaviour alongside product changes. An A/B test, by randomly assigning users, controls for selection bias and external confounders. It is the only standard method for answering "did this change cause the metric to move?"

A/B tests prevent shipping things that do not work. Product teams with strong intuitions about user behaviour are often wrong. An A/B test on a change that "obviously" improves an experience regularly produces null or negative results. Testing reduces the cost of those wrong intuitions.

A/B tests accumulate knowledge. A culture of experimentation builds an internal database of what works and what does not for your specific users, product, and context. This knowledge compounds over time and is more reliable than imported best practices.

Without statistical validity, A/B test results are anecdote. A test that ends because "it looks like B is winning" after a few days is not a valid A/B test. It is a confirmation bias exercise. Proper statistical design is the difference between knowledge and noise.

A/B tests set a high bar for shipping. A team that requires experiments to demonstrate improvement before shipping changes is less likely to ship neutral or negative changes. This discipline pays dividends in product quality and user trust.


Standards & Best Practices

The hypothesis

Every A/B test starts with a written hypothesis in the form:

We believe that [specific change]
will result in [specific outcome]
for [specific user segment]
because [specific mechanism].

Example:

We believe that showing estimated delivery dates on the product page (instead of just "in stock") will result in a higher add-to-cart rate for first-time visitors because reducing uncertainty about delivery timing is a known purchase hesitation factor.

A hypothesis must name:

  • The change (what is different in the treatment variant)
  • The outcome (which specific metric we expect to move)
  • The population (who this applies to)
  • The mechanism (why we expect this to work)

Hypotheses without a mechanism are untestable hunches. If you cannot articulate why the change should work, the experiment cannot confirm or refute the underlying belief.

Primary metric and guardrail metrics

Every experiment needs:

  • One primary metric — the single metric that determines whether the test is a win or a loss
  • Guardrail metrics — metrics that should not decrease as a result of the change (e.g., improving conversion should not decrease session quality)

Having one primary metric prevents post-hoc metric switching — testing until you find a metric that moved in the right direction and declaring success. That is HARKing (Hypothesising After Results are Known), and it produces false discoveries.

Sample size and statistical power

Before running a test, calculate the minimum sample size needed to detect the effect you expect:

Required factors:
- Baseline conversion rate (p₀): the current value of the primary metric
- Minimum detectable effect (MDE): the smallest improvement worth detecting
- Statistical significance threshold (α): typically 0.05 (5% false positive rate)
- Statistical power (1-β): typically 0.80 (80% probability of detecting a real effect)

Use a sample size calculator (many are freely available online) to find the number of users needed in each variant. Then estimate how long it will take to accumulate that sample given your traffic.
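As a rough illustration of what such a calculator does, the sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes a binary conversion metric, a two-sided test, an even split, and an MDE expressed as a relative lift; the function names and example numbers are illustrative, not taken from any particular tool.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)         # rate the treatment would reach at the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)   # sum of the two Bernoulli variances
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def days_to_reach(n_per_variant, daily_eligible_users, variants=2):
    """Days of traffic needed, assuming an even split and a steady flow of new users."""
    return math.ceil(n_per_variant * variants / daily_eligible_users)


# Illustrative inputs: 10% baseline, +5% relative MDE, 8,000 eligible users per day.
n = sample_size_per_variant(baseline=0.10, relative_mde=0.05)
days = days_to_reach(n, daily_eligible_users=8000)
print(f"{n} users per variant, about {days} days of traffic")
```

Halving the MDE roughly quadruples the required sample, which is why small expected effects so often push a test past a practical duration.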

If the required sample takes more than 4–6 weeks to accumulate, the effect you are trying to detect may be too small to be worth testing, or your traffic is too low for reliable experimentation in this area.

Minimum test duration

Even if the required sample size is reached quickly, run the test for at least 7 days (preferably 14). This ensures:

  • At least one full weekday/weekend cycle is captured (two with a 14-day run)
  • Day-of-week bias is controlled
  • Novelty effects (users behaving differently because something is new) have time to stabilise

Stopping a test early because it looks like a winner or loser produces systematically biased results — the "peeking problem."
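The size of the peeking bias can be seen in a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares a peeker who stops at the first significant interim look against an analyst who only checks once at the end. All parameters (number of tests, looks, arm size, seed) are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of equal size n (pooled variance)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b - conv_a) / (n * se)


def simulate(n_tests=400, users_per_arm=2000, looks=10, rate=0.10, alpha=0.05, seed=7):
    """Return (false positive rate with peeking, false positive rate without)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = users_per_arm // looks
    peeking_hits = fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        ever_significant = False
        for look in range(1, looks + 1):
            # Both arms draw from the same rate: there is no true effect.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = abs(z_stat(conv_a, conv_b, look * step)) > z_crit
            if significant:
                ever_significant = True   # a peeker would stop and declare a winner here
        peeking_hits += ever_significant
        fixed_hits += significant         # only the final look counts in a fixed design
    return peeking_hits / n_tests, fixed_hits / n_tests


peeking_rate, fixed_rate = simulate()
print(f"false positives with peeking: {peeking_rate:.1%}, without: {fixed_rate:.1%}")
```

The fixed-design rate should land near the nominal 5%, while the peeking rate comes out several times higher, even though nothing differs between the arms.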

Significance thresholds

The standard significance threshold is p < 0.05: if results at least this extreme would occur less than 5% of the time when there is no true difference between the variants, the result is called statistically significant.

However:

  • A result at p = 0.049 and a result at p = 0.051 are effectively the same. Statistical significance is a threshold for decision-making, not a measure of effect quality.
  • Statistical significance does not equal practical significance. A 0.2% improvement in conversion with p = 0.001 is a real finding that may not be worth shipping.
  • Multiple testing inflates the false positive rate. If you test 20 metrics at p < 0.05 and none of them has a true effect, you expect about 1 false positive on average. Pre-register your primary metric before the test to avoid this.
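The arithmetic behind the multiple-testing point can be sketched as follows. The calculation assumes the metrics are independent and that none has a true effect. The Bonferroni threshold is included as one common correction when several metrics must be tested; it is an addition here, not something the guidance above prescribes.

```python
def expected_false_positives(n_metrics, alpha=0.05):
    """Average number of metrics that cross the threshold by chance alone."""
    return n_metrics * alpha


def familywise_error_rate(n_metrics, alpha=0.05):
    """Probability of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_threshold(n_metrics, alpha=0.05):
    """Per-metric threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_metrics


print(expected_false_positives(20))          # about 1 false positive on average
print(round(familywise_error_rate(20), 3))   # chance of at least one false positive
print(bonferroni_threshold(20))              # stricter per-metric threshold
```

With 20 metrics the chance of at least one spurious "win" is roughly 64%, which is why a single pre-registered primary metric matters.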

How to Implement

Experiment design checklist

Before launching a test:

  • Hypothesis written (change / outcome / population / mechanism)
  • Primary metric defined
  • Guardrail metrics defined
  • Minimum detectable effect specified
  • Required sample size calculated
  • Expected test duration calculated
  • Randomisation unit decided (user-level vs session-level vs device-level)
  • Control and treatment variants documented
  • Rollback plan for the treatment variant defined
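For the randomisation-unit item, one common way to implement user-level assignment is a deterministic hash of the user ID salted with the experiment name, so a returning user always sees the same variant. The sketch below is a minimal version under that assumption; `assign_variant` and the experiment name are hypothetical and not tied to any specific experimentation platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment' for one experiment."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same variant of a given experiment.
print(assign_variant("user-123", "delivery_date_on_product_page"))
print(assign_variant("user-123", "delivery_date_on_product_page"))
```

Session-level or device-level randomisation would hash a session or device identifier instead. The trade-off is that one person may then see both variants.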

Running the test

  1. Launch with equal split between control and treatment (50/50 unless there is a specific reason to skew)
  2. Do not look at results until minimum sample size and minimum duration are both reached
  3. At analysis time: check for sample ratio mismatch (unexpected imbalance in assignment)
  4. Calculate the primary metric difference and its confidence interval
  5. Check guardrail metrics
  6. Make a ship / no-ship / iterate decision
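Steps 3 and 4 can be sketched with the standard library alone. The example assumes a binary conversion metric and large samples, so normal approximations apply. It uses a chi-square goodness-of-fit test for sample ratio mismatch and a two-proportion z-test with an unpooled confidence interval for the difference. The counts and the 0.001 alert threshold for SRM are illustrative assumptions.

```python
import math
from statistics import NormalDist


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) on assignment counts."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    return math.erfc(math.sqrt(chi2 / 2))   # survival function of chi-square with 1 df


def compare_rates(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Return (treatment - control difference, confidence interval, two-sided p-value)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error for the confidence interval.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Pooled standard error for the test of "no difference".
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_pooled))
    return diff, ci, p_value


# Illustrative counts only.
if srm_p_value(50_000, 50_000) < 0.001:
    print("Sample ratio mismatch: investigate assignment before reading results")
diff, ci, p = compare_rates(conv_c=5_000, n_c=50_000, conv_t=5_300, n_t=50_000)
print(f"difference = {diff:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f}), p = {p:.4f}")
```

If the SRM check fails, the metric comparison should not be trusted until the cause of the imbalance is understood.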

Decision framework

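| Result                                                                   | Action                                                                                        |
| ------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------- |
| Primary metric improved, significant, guardrails safe                    | Ship the treatment                                                                            |
| Primary metric flat (no significant difference, test adequately powered) | No detectable effect; do not ship; revisit hypothesis                                         |
| Primary metric degraded, significant                                     | Do not ship; investigate why                                                                  |
| Inconclusive (did not reach significance)                                | If the test ran to full duration, treat as flat. Consider whether the MDE was set correctly.  |
| Guardrail metric degraded, even if primary improved                      | Do not ship; investigate trade-off                                                            |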
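The decision framework can also be expressed as a small function, which helps keep decisions consistent across experiments. This sketch assumes the primary metric is summarised by a confidence interval for (treatment minus control) where higher is better, and that guardrail status has already been reduced to a single boolean. The function name and return strings are illustrative.

```python
def decide(ci_low: float, ci_high: float, guardrails_safe: bool) -> str:
    """Map the primary-metric confidence interval and guardrail status to an action."""
    if ci_high < 0:
        # The whole interval is below zero: significant degradation.
        return "Do not ship; investigate why"
    if not guardrails_safe:
        # A guardrail degraded, even if the primary metric improved.
        return "Do not ship; investigate trade-off"
    if ci_low > 0:
        # The whole interval is above zero: significant improvement.
        return "Ship the treatment"
    # The interval contains zero: no detectable effect at this sample size.
    return "Treat as flat; do not ship; revisit hypothesis and MDE"


print(decide(0.002, 0.010, guardrails_safe=True))
print(decide(0.002, 0.010, guardrails_safe=False))
print(decide(-0.003, 0.004, guardrails_safe=True))
```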

Tools & Templates

Experiment brief

## Experiment Brief — [Test name]

**Hypothesis:**
We believe that [change] will result in [outcome] for [population] because [mechanism].

**Variants:**

- Control: [Description of current state]
- Treatment: [Description of proposed change]

**Primary metric:** [Metric name and definition]
**Guardrail metrics:** [Metric 1], [Metric 2]

**Minimum detectable effect:** [e.g., +5% relative improvement in conversion]
**Required sample per variant:** [N users]
**Expected duration:** [N days at current traffic levels]

**Launch date:** [Date]
**Analysis date:** [Date; do not analyse before this]

**Rollback plan:** [How to revert the treatment if needed]

Results analysis template

## Experiment Results — [Test name]

**Run period:** [Start] → [End]
**Sample (control):** [N] users
**Sample (treatment):** [N] users
**Sample ratio mismatch:** [Yes / No]

### Primary metric

| Variant   | Value | Relative change | p-value | 95% CI             |
| --------- | ----- | --------------- | ------- | ------------------ |
| Control   | [X]   | —               | —       | —                  |
| Treatment | [Y]   | [+/-Z%]         | [p]     | [[lower], [upper]] |

**Significant:** [Yes / No]

### Guardrail metrics

| Metric     | Control | Treatment | Change  | Status         |
| ---------- | ------- | --------- | ------- | -------------- |
| [Metric 1] | [X]     | [Y]       | [+/-Z%] | Safe / At risk |

### Decision: ☐ Ship treatment ☐ No ship (flat) ☐ No ship (degraded) ☐ Iterate

**Rationale:** [Brief explanation of decision]

Common Pitfalls

Peeking. Checking results before the test reaches its planned duration and stopping early when results look positive. This inflates false positive rates dramatically. A test stopped after 2 days because "B is clearly winning" is not a valid test.

HARKing (Hypothesising After Results are Known). Running a test, finding that the primary metric did not move but a secondary metric did, and declaring that the secondary metric was the hypothesis all along. Pre-register your primary metric before launching. Post-hoc metric selection is not experimentation — it is storytelling.

Underpowered tests. Running a test for a week without calculating whether a week provides enough sample to detect the expected effect. A null result from an underpowered test is uninformative: a real effect may have gone undetected because the test was too small.
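To check whether a planned run is underpowered, the approximate power for a given sample can be computed directly. The sketch below uses the same normal approximation as a typical two-proportion sample size calculation and ignores the negligible chance of a significant result in the wrong direction. The traffic figures are illustrative assumptions.

```python
import math
from statistics import NormalDist


def achieved_power(n_per_variant, baseline, relative_mde, alpha=0.05):
    """Approximate probability of detecting a relative lift of relative_mde."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


# Illustrative: one week of traffic yields 5,000 users per variant.
power = achieved_power(n_per_variant=5_000, baseline=0.10, relative_mde=0.05)
print(f"power to detect a +5% relative lift: {power:.0%}")
```

At that sample size the test would miss a real +5% lift most of the time, so a flat result says very little about whether the change works.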

Multiple simultaneous tests. Running two tests on the same user population at the same time can produce interaction effects. If both tests change the same page or flow, the results of each are confounded by the other. Use a mutual exclusion layer or run tests sequentially in the same area.
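A mutual exclusion layer can be as simple as hashing each user into one of a fixed number of buckets and giving each experiment a disjoint range of those buckets. The sketch below illustrates that idea under stated assumptions: the layer name, experiment names, and 100-bucket split are hypothetical.

```python
import hashlib

# Hypothetical layer: each experiment owns a disjoint range of the 100 buckets.
CHECKOUT_LAYER = {
    "new_payment_form": range(0, 50),
    "one_page_checkout": range(50, 100),
}


def layer_bucket(user_id, layer_name, n_buckets=100):
    """Deterministically map a user to a bucket within a layer."""
    digest = hashlib.sha256(f"{layer_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def experiment_for(user_id, layer_name, layer):
    """Return the single experiment in this layer that the user is eligible for."""
    bucket = layer_bucket(user_id, layer_name)
    for experiment, buckets in layer.items():
        if bucket in buckets:
            return experiment
    return None   # bucket not allocated to any experiment


print(experiment_for("user-123", "checkout", CHECKOUT_LAYER))
```

Within the chosen experiment, the control/treatment split would then use a separate hash salted with the experiment name, so that layer membership and variant assignment stay independent.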

Ignoring novelty effects. Users behave differently with new experiences simply because they are new. A test that runs for only 3 days may be measuring curiosity, not the long-term effect. Run tests for at least one full week to allow novelty to normalise.

Treating A/B tests as the only signal. A/B tests are the gold standard for causal inference but are not always appropriate: they require sufficient traffic, a clearly defined control and treatment, and a measurable primary metric. Qualitative research, session recordings, and user interviews are equally important for understanding why, not just whether.