Testing Standards
What "good tests" look like — the testing pyramid, flakiness policy, coverage targets, and what each level of test is actually for.
Overview
Tests are not a deliverable alongside code; they are part of the code. A change without tests is not a finished change — it is a change whose behaviour is asserted by the author's confidence instead of by the system itself. The question "does this still work?" is one the codebase should answer automatically, not one the team has to re-answer manually every time something changes.
This page describes what we mean by a good test, how we think about the mix of test types, and the rules we apply when tests fail, flake, or grow stale. It is tool-agnostic: the specifics of a test framework change, but the principles of what a test is for do not.
For automated generation of tests using AI, see Automated Test Generation. For how tests fit into review workflows, see Code Review Best Practices.
Why It Matters
Tests are the contract between today's code and tomorrow's change. Without tests, every change is a guess about whether existing behaviour still holds. With tests, a change either preserves the contract or explicitly breaks it — and the team sees which, immediately.
The cost of a missing test is paid later, by someone else. Skipping a test to ship faster moves work from the author to the next engineer who breaks the behaviour and has no way of knowing. That engineer is often the original author, three months later, with no memory of why a line of code existed.
Tests define the design. A component that is hard to test is usually hard to reason about. The friction of writing a test is a signal — it frequently points at coupling, hidden state, or responsibilities that belong elsewhere. Good tests are not just verification; they are design feedback.
A flaky test is worse than no test. A test that fails intermittently teaches the team to ignore failures. Once that habit sets in, even a real failure gets re-run until it passes. Flakiness corrodes trust in the entire suite.
Standards & Best Practices
The test pyramid, as a budget not a target
Think of test types as a budget, prioritised by how much signal they give per second of runtime:
- Unit tests — The base of the pyramid. Fast (sub-millisecond each), isolated (no network, no disk, no shared state), and numerous. Most of the behaviour in a codebase should be verified here.
- Integration tests — Verify that two or more components work together as expected: a service with its database, a handler with its dependencies, a module crossing a boundary it actually crosses in production. Slower and fewer than unit tests.
- End-to-end tests — Exercise the full system from the outside in, usually through the same interface a user or another service would use. Slow, expensive to maintain, and few. Reserved for the workflows whose breakage would be user-visible.
The pyramid shape exists because the value per test decreases and the cost per test increases as you go up. A thousand unit tests can run in seconds; a hundred end-to-end tests take minutes. Inverting the pyramid — many e2e tests, few unit tests — produces a suite that is slow to run, expensive to maintain, and slow to diagnose when it fails.
A good test has one reason to fail
When a test fails, the failure message should tell you what broke, not which of twelve possible things might have broken. A test that asserts twelve things forces you to read the test and the code together to understand the failure. A test that asserts one behaviour is self-describing.
Practical applications:
- Test one behaviour per test, not one function per test
- If a test has `and` in its name, it is probably two tests (a sketch of the split follows this list)
- Shared setup is fine; shared assertions per test are not
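A minimal sketch of the split, using a Vitest-style `describe`/`it`/`expect` API and a hypothetical `applyDiscount` function defined inline so the example is self-contained:

```ts
import { describe, expect, it } from "vitest";

// Hypothetical unit under test, inlined for illustration.
function applyDiscount(subtotal: number, discount: number): number {
  return Math.max(0, subtotal - discount);
}

describe("applyDiscount", () => {
  // One behaviour per test: the name of the failing test is the diagnosis.
  it("subtracts the discount from the subtotal", () => {
    expect(applyDiscount(100, 10)).toBe(90);
  });

  it("never returns a negative total", () => {
    expect(applyDiscount(5, 200)).toBe(0);
  });
});
```

A single test covering both behaviours would have two reasons to fail, and its name would no longer tell you which one fired.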
Tests describe behaviour, not implementation
A test that reads like "when I call method X with argument Y, it calls method Z" is coupled to the implementation. Change the implementation to achieve the same behaviour and the test breaks. That is not a useful failure — it is a false positive that consumes time and erodes trust.
Tests should describe what the system does from the outside: given this input, the observable result is that output. When implementation-coupled tests are unavoidable (usually in tests of glue code or frameworks), isolate them and be explicit about what they are guarding.
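As an illustration (a hypothetical `Cart` class, Vitest-style assertions), the behaviour-focused version asserts only on what is observable from outside:

```ts
import { describe, expect, it } from "vitest";

// Hypothetical unit under test, inlined for illustration.
class Cart {
  private items: { sku: string; qty: number }[] = [];
  add(sku: string, qty: number): void {
    this.items.push({ sku, qty });
  }
  itemCount(): number {
    return this.items.reduce((sum, item) => sum + item.qty, 0);
  }
}

describe("Cart", () => {
  // Behaviour: given these inputs, this result is observable from outside.
  // Swapping the array for a Map changes nothing here.
  it("counts every added item", () => {
    const cart = new Cart();
    cart.add("book", 1);
    cart.add("pen", 2);
    expect(cart.itemCount()).toBe(3);
  });
});
```

An implementation-coupled version would instead spy on `items.push` and assert it was called twice; it passes today and fails the day the array becomes a Map, even though the observable behaviour is identical.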
Favour test clarity over test brevity
A test is read many more times than it is written, and it is often read under pressure when something is broken. Prefer clear, slightly repetitive tests over clever, abstracted tests. Shared helpers are useful when they are about the domain ("a user with an active subscription"); they are a liability when they are about the test infrastructure ("setup everything we might ever need").
The test body should read top-to-bottom as a story: given this state, when this thing happens, then this is observable.
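One shape that reads as that story, with hypothetical domain names and a Vitest-style `it`:

```ts
import { expect, it } from "vitest";

// Hypothetical domain helper and rule, named for the domain rather than the fixture.
const aUserWithExpiredTrial = () => ({ plan: "trial", trialEndsAt: new Date("2020-01-01") });
const canAccessPremium = (user: { plan: string; trialEndsAt: Date }): boolean =>
  user.plan === "premium" || user.trialEndsAt > new Date();

it("denies premium access once the trial has expired", () => {
  // given
  const user = aUserWithExpiredTrial();
  // when / then
  expect(canAccessPremium(user)).toBe(false);
});
```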
Coverage is a floor, not a ceiling
Coverage percentages are useful as a floor — they tell you where there is no testing at all. They are not useful as a ceiling — 100% coverage with weak assertions is worse than 70% coverage with strong ones.
The team sets a minimum line coverage for new code (typically 70–80% for application logic, higher for libraries and lower for UI glue), enforces it in CI, and ignores the absolute number beyond that. Coverage catches absence of tests; it does not measure presence of good tests.
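One way to encode the floor, assuming a Jest setup (Jest's `coverageThreshold` option; other runners have equivalents). The paths and numbers are illustrative and the team's to pick:

```ts
// jest.config.ts: a minimal sketch of a coverage floor enforced in CI.
import type { Config } from "jest";

const config: Config = {
  collectCoverageFrom: ["src/**/*.ts"],
  coverageThreshold: {
    // A floor for new code, not a target to chase: CI fails below it,
    // and nobody celebrates numbers above it.
    global: { lines: 75, branches: 70 },
  },
};

export default config;
```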
Zero-tolerance flakiness policy
A flaky test — one that passes or fails without code changes — is a defect, not an inconvenience. Treat it like one:
- First flaky failure: open an issue, label the test, and allow retries while the issue is investigated
- Still flaky after one week: quarantine (skip) the test rather than deleting it, and assign an owner (a quarantine sketch follows this list)
- Still quarantined after the next iteration: delete the test. A permanently skipped test is false reassurance
The rule is not "no flaky tests"; it is "flaky tests do not stay in the suite." The longer flakiness is tolerated, the faster the suite loses its value.
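What quarantining can look like in practice, using a Vitest/Jest-style `it.skip`. The visible trail is the part that matters; the ticket ID and owner below are hypothetical:

```ts
import { it } from "vitest";

// Quarantined: intermittent timeout, tracked in TEST-1234 (hypothetical ticket).
// Owner: payments team. Delete if still skipped after the next iteration.
it.skip("retries a failed charge exactly once", async () => {
  // original body stays in place so re-enabling the test is a one-line change
});
```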
Tests are part of the PR, not a follow-up
Tests are not a task that comes after the PR is merged. A PR without tests is an incomplete PR, in the same way that a PR that doesn't compile is an incomplete PR. The only exceptions:
- Genuine spikes or prototypes, clearly marked as such, not intended for production
- Emergency hotfixes, which must have a follow-up PR adding the test within the week
- Changes that are purely cosmetic and testable only through manual inspection (rare)
"I'll add tests in a follow-up" is the most common source of permanent test debt. Follow-ups happen much less often than promised.
Test data is first-class
Production data does not belong in tests — it leaks secrets, drifts from the schema, and is impossible to reproduce. Hand-crafted test data, on the other hand, is expensive to maintain across many tests.
The middle path is domain-specific factories: small functions that build valid, in-memory representations of the objects your domain cares about, with reasonable defaults and override-by-field for the parts the test needs. A test reads `aUserWithActiveSubscription({ country: "IN" })` — not twelve lines of object-literal setup.
Factories make tests shorter, more domain-aligned, and more resilient to schema evolution (change the schema once, change all tests at once).
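A minimal factory in that style; the type, field names, and defaults are illustrative:

```ts
// Hypothetical domain type and factory, for illustration.
interface User {
  id: string;
  country: string;
  subscription: { status: "active" | "cancelled"; renewsAt: Date };
}

// Valid by default, overridable per field: tests state only what they care about.
function aUserWithActiveSubscription(overrides: Partial<User> = {}): User {
  return {
    id: "user-1",
    country: "GB",
    subscription: { status: "active", renewsAt: new Date("2030-01-01") },
    ...overrides,
  };
}

// Usage in a test: only the detail under test is spelled out.
const user = aUserWithActiveSubscription({ country: "IN" });
```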
Integration tests must use a real database
Mocking the database in an integration test produces a test that passes when the code works against the mock — which is not what ships to production. The most common migration failures we have seen were silently green against mocked persistence.
Integration tests that touch data should run against a real instance of the database technology in use (containerised, ephemeral, seeded per-test). That instance is part of the test environment, not an optional dependency.
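One common shape for this in a Node/TypeScript codebase is an ephemeral container started per test file. The sketch below assumes the `@testcontainers/postgresql` and `pg` packages and Vitest hooks; adapt it to whatever database technology and runner are actually in use:

```ts
import { afterAll, beforeAll, expect, it } from "vitest";
import { PostgreSqlContainer, type StartedPostgreSqlContainer } from "@testcontainers/postgresql";
import { Client } from "pg";

let container: StartedPostgreSqlContainer;
let db: Client;

beforeAll(async () => {
  // Real, ephemeral Postgres: schema and queries are exercised for real, not against a mock.
  container = await new PostgreSqlContainer().start();
  db = new Client({ connectionString: container.getConnectionUri() });
  await db.connect();
  await db.query("CREATE TABLE users (id serial PRIMARY KEY, email text NOT NULL)");
}, 60_000); // container start can exceed the default hook timeout

afterAll(async () => {
  await db.end();
  await container.stop();
});

it("persists and reads back a user", async () => {
  await db.query("INSERT INTO users (email) VALUES ($1)", ["test@example.com"]);
  const result = await db.query("SELECT email FROM users");
  expect(result.rows[0].email).toBe("test@example.com");
});
```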
Reserve end-to-end tests for critical user journeys
End-to-end tests are the most expensive tests to write, run, and maintain, and they are the most likely to be flaky. They pay for themselves only when they cover workflows whose failure would be directly user-visible: checkout, signup, password reset, primary search, primary publish.
A good heuristic: if this end-to-end test were removed, could a broken user journey plausibly reach production undetected? If yes, keep it. If no, the coverage belongs one layer down.
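A critical-journey test of this kind, sketched with Playwright; the URL, labels, and flow are hypothetical:

```ts
import { expect, test } from "@playwright/test";

// Signup is a journey whose breakage is directly user-visible, so it earns an e2e test.
test("a new user can sign up and reach their dashboard", async ({ page }) => {
  await page.goto("https://staging.example.com/signup");
  await page.getByLabel("Email").fill("new-user@example.com");
  await page.getByLabel("Password").fill("correct-horse-battery-staple");
  await page.getByRole("button", { name: "Create account" }).click();
  await expect(page).toHaveURL(/\/dashboard/);
  await expect(page.getByRole("heading", { name: "Welcome" })).toBeVisible();
});
```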
How to Implement
What to test at each level
| Test type | Tests... | Runs in | Fails when... |
|---|---|---|---|
| Unit | Pure functions, branching logic, domain rules | Seconds | Logic is incorrect; edge cases are unhandled |
| Integration | Modules across a real boundary (DB, queue, file) | Tens of seconds | A contract between components is broken |
| Contract | An API matches its published schema | Seconds | The API drifts from the contract its consumers rely on |
| End-to-end | A user journey from outside the system | Minutes | A critical workflow is broken |
| Smoke | The deployed service is minimally functional | Seconds | Something in the deployed environment is fundamentally wrong |
A change usually needs a test at one or two of these levels. Rarely all of them.
Decision: what kind of test does this change need?
- Is this pure domain logic? → Unit test.
- Does this change how two components interact? → Integration test.
- Does this change an API consumers depend on? → Contract test.
- Does this change a user-visible workflow? → End-to-end test if the workflow is critical.
- Was the change purely internal refactoring? → No new tests, but the existing suite must still pass.
What the team owes the test suite
- Runs on every PR — A suite that runs on a schedule is a suite nobody reads.
- Fails fast — Unit tests before integration tests before end-to-end. No one wants to wait 20 minutes to find out they broke a unit test.
- Is green on `main` — A consistently red `main` is a broken feedback loop. If `main` is red, stop feature work and fix it.
- Deletes dead tests — Tests of removed features are not "harmless"; they are clutter that makes the suite slower to run and harder to maintain.
Common Pitfalls
The test that tests the mock. A test that stubs every dependency, asserts the stubs were called, and verifies no real behaviour. It gives a coverage number and no confidence. If removing the code under test still leaves the test green, the test is not testing the code.
The "snapshot" regression. Asserting that the output of a function matches a blob captured previously. These tests fail on every intentional change, forcing engineers to regenerate the snapshot without reading it. The test no longer encodes intent — it encodes the last observed output, whether that output was correct or not.
The test that is always skipped. A test quarantined for "we'll look at this later" that sits skipped for months. It is not protecting anything. Either fix it or delete it.
The test that tests three things. A test named `it_handles_user_signup` whose body creates a user, charges a card, sends a welcome email, and asserts on all three. When it fails, you have no idea which step broke. Split it.
Treating coverage as the goal. Teams that optimise for 100% coverage often end up with tests that exist to increase the number rather than to protect behaviour. Coverage is a consequence of good testing, not the point of it.
End-to-end tests as the default. Teams new to testing sometimes reach for end-to-end tests first because they feel "more real." They are the most expensive way to test the narrowest subset of behaviours. A balanced suite tests the majority of behaviours at the unit level and reserves e2e for what genuinely needs it.
Tests that depend on each other. Tests that must run in a specific order because one sets up state for the next. This coupling makes the suite fragile, parallelisation impossible, and failure diagnosis much harder. Each test should set up what it needs and tear down what it created.
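Independence usually just means fresh state per test, as in this sketch with a hypothetical in-memory repository and Vitest hooks:

```ts
import { beforeEach, expect, it } from "vitest";

// Hypothetical in-memory repository, inlined for illustration.
class OrderRepo {
  private orders: string[] = [];
  add(id: string): void {
    this.orders.push(id);
  }
  count(): number {
    return this.orders.length;
  }
}

let repo: OrderRepo;

// Each test gets its own repository: nothing relies on what a previous test left behind,
// so the suite can run in any order and in parallel.
beforeEach(() => {
  repo = new OrderRepo();
});

it("starts empty", () => {
  expect(repo.count()).toBe(0);
});

it("records an added order", () => {
  repo.add("order-1");
  expect(repo.count()).toBe(1);
});
```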
Sleeping to make async tests pass. `sleep(500)` in a test is a promise that the machine will always be fast enough. It won't be. Use the test framework's async primitives or an explicit wait-for condition — never a fixed sleep.
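Most frameworks ship a wait-for primitive for this (Testing Library's `waitFor`, Playwright's auto-waiting assertions, and similar); a small self-contained polling helper makes the idea concrete:

```ts
import { expect, it } from "vitest";

// A generic poll-until-true helper: retries the condition instead of guessing a fixed delay.
async function waitForCondition(check: () => boolean, timeoutMs = 2000, intervalMs = 25): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!check()) {
    if (Date.now() > deadline) throw new Error("condition not met within timeout");
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

it("observes the async result without a fixed sleep", async () => {
  let done = false;
  setTimeout(() => {
    done = true;
  }, 50); // stand-in for some asynchronous work

  // Fails fast with a clear error on a slow machine, instead of hoping 500ms was enough.
  await waitForCondition(() => done);
  expect(done).toBe(true);
});
```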