Data Modeling & Schema Design
Principles for designing schemas that are correct at inception and safe to evolve over time — without downtime.
Overview
A schema is a long-term commitment. The design decisions made when a table is created — naming conventions, normalisation level, how deletions are handled, how tenant data is isolated — determine what is easy and what is painful for the next several years. Unlike API design, where a version boundary can isolate consumers from changes, schema changes are often global and immediate. A migration that runs correctly on 10,000 rows can lock the database for minutes on 100 million rows.
This page covers the principles behind designing durable schemas and executing safe migrations. The design side (how to model the data) and the migration side (how to evolve the model) are treated together because a perfect schema applied via a destructive migration is still a production incident.
For how APIs expose data to consumers, see API Design Standards. For system-level reliability during migrations, see Reliability Patterns. For the infrastructure that runs migrations, see CI/CD Pipelines.
Why It Matters
Schema changes are the most dangerous class of production operations. A code deploy can be rolled back in minutes; a migration that drops a column, renames a field, or backfills 50 million rows cannot be undone cleanly without significant effort. The cost of getting this wrong is a production incident — often a long one.
Bad schema design compounds. A table that starts with good naming conventions and appropriate normalisation can support years of product growth. A table that starts as a loosely typed JSON blob, a flat list of nullable columns, or a mis-normalised aggregate will produce a stream of bugs, performance problems, and increasingly painful migrations as the product grows.
Multi-tenancy and deletion semantics are decisions that cannot easily be changed later. Whether tenant data is isolated by row, by schema, or by database; whether deleted records are removed or flagged — these choices, once made and data-laden, require large migrations to reverse. They deserve deliberate design up front.
Zero-downtime deployments require zero-downtime migrations. A deployment strategy that achieves zero downtime at the application layer is negated by a migration that takes an exclusive table lock for ten minutes. Application-level and database-level deployment strategies must be designed together.
Standards & Best Practices
Normalise until it hurts, then denormalise deliberately
Normalisation removes redundancy and establishes a single source of truth for each fact. The general guidance is to normalise to third normal form (3NF) by default — every non-key attribute depends only on the primary key, not on other non-key attributes.
Denormalisation (deliberate redundancy to improve read performance) is occasionally the right trade-off, but it has a cost: every write to the source data must update the denormalised copy, and the two can fall out of sync. Denormalise only when the performance problem is real and measured, not anticipated.
Signs that a schema is under-normalised:
- The same fact (e.g. a user's email address) appears in multiple tables without a canonical source
- Updating one record logically requires updating multiple others
- Columns contain values that are derived from other columns
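As a minimal sketch, assuming a hypothetical orders table that duplicates customer details, normalising to 3NF gives each fact a single home:

```sql
-- Before (under-normalised): every order repeated the customer's details,
-- so one email change meant updating many rows, and copies could drift:
--   orders(id, customer_name, customer_email, total_cents)

-- After (3NF): each customer fact lives in exactly one place.
CREATE TABLE customers (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name  text NOT NULL,
    email text NOT NULL UNIQUE
);

CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL REFERENCES customers (id),
    total_cents bigint NOT NULL
);
```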
Naming conventions are part of the schema contract
Consumers query schemas by name. Inconsistent naming multiplies cognitive load on every engineer who touches the data layer.
Standards:
- Table names are lowercase, plural, and noun-based (`orders`, `line_items`, not `Order` or `tbl_order`)
- Column names are lowercase and snake_case
- Foreign keys follow the pattern `{referenced_table_singular}_id` (e.g. `user_id` in the `orders` table)
- Boolean columns are named in the affirmative (`is_active`, `is_deleted`, not `disabled` or `not_active`)
- Timestamp columns are suffixed with `_at` and are stored as UTC (`created_at`, `updated_at`, `deleted_at`)
- Avoid abbreviations that are not universally understood
These conventions should be automated via a linter where possible. Manual enforcement at review time does not scale.
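As an illustration, a hypothetical line_items table that follows the conventions above (and assumes the orders table from the earlier sketch):

```sql
CREATE TABLE line_items (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id   bigint NOT NULL REFERENCES orders (id), -- {referenced_table_singular}_id
    sku        text NOT NULL,
    quantity   integer NOT NULL,
    is_gift    boolean NOT NULL DEFAULT false,         -- affirmative boolean
    created_at timestamptz NOT NULL DEFAULT now(),     -- UTC, suffixed _at
    updated_at timestamptz NOT NULL DEFAULT now()
);
```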
Soft delete vs hard delete is a business decision
Soft delete marks a record as deleted (typically with a deleted_at timestamp) without removing it from the database. Hard delete removes the row.
Soft delete is appropriate when:
- Data has regulatory or audit retention requirements
- Deleted records need to be recoverable by users (undo, recycle bin)
- References to deleted entities must remain meaningful (e.g. an order belongs to a deleted account)
Hard delete is appropriate when:
- Data is ephemeral with no recovery or audit requirement
- The regulatory requirement is to ensure deletion (GDPR right-to-erasure cases)
- Storage cost of retained records is significant
The common mistake is defaulting to soft delete "just in case." Soft delete introduces significant complexity: every query must filter on deleted_at IS NULL, indexes must include the deleted flag, and any report that does not filter correctly will include deleted records. If the case for soft delete is not clear, default to hard delete and add soft delete only when the requirement materialises.
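Where soft delete is justified, the filter can be centralised rather than repeated in every query. A PostgreSQL sketch, assuming a hypothetical users table with a deleted_at column:

```sql
-- A view applies the filter once; application queries read the view.
CREATE VIEW active_users AS
    SELECT * FROM users WHERE deleted_at IS NULL;

-- A partial unique index enforces constraints on live rows only, so an
-- email can be reused after the original account is soft-deleted.
CREATE UNIQUE INDEX users_email_active_idx
    ON users (email)
    WHERE deleted_at IS NULL;
```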
Multi-tenancy strategy is a schema design decision
How tenant data is isolated affects query performance, security blast radius, migration complexity, and regulatory compliance. The three main strategies:
| Strategy | Description | When appropriate |
|---|---|---|
| Row-level | All tenants share tables; tenant_id column on every row | Simplest to operate; sufficient for most SaaS at early–mid scale |
| Schema-per-tenant | Each tenant has their own schema within one database | Easier to migrate individual tenants; adds complexity for cross-tenant reporting |
| Database-per-tenant | Each tenant has their own database instance | Maximum isolation; required for regulatory segregation; highest operational cost |
Once chosen and data-laden, this strategy is very difficult to change. The decision should be made at design time based on the isolation and compliance requirements of the product, not after the first customer is signed.
With row-level multi-tenancy, every query must filter on tenant_id. The most common bug class in row-level multi-tenancy is a query that accidentally returns data from the wrong tenant. Use row-level security (RLS) at the database layer where supported — it prevents application-layer bypass more reliably than code-level filtering.
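A sketch of RLS-based tenant isolation in PostgreSQL; the orders table and the app.current_tenant_id session setting are hypothetical names:

```sql
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;  -- apply the policy to the table owner too

-- Every statement against orders is implicitly filtered to the current tenant.
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.current_tenant_id')::bigint);
```

The application sets the tenant once per connection or transaction (e.g. `SET app.current_tenant_id = '42';`), and the database filters every query to that tenant's rows, regardless of what the application code forgot.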
Schema changes must be backward compatible
During a deployment, old code and new code run simultaneously. A migration that removes a column required by old code will cause errors before the deploy completes. The safe migration order:
Adding a column:
1. Deploy the migration: add the column as nullable with no default (or with a safe default)
2. Deploy the new code that writes to the column
3. (Optional) Backfill existing rows, then add the NOT NULL constraint
Removing a column:
1. Deploy code that no longer reads or writes the column (while the column still exists)
2. Deploy the migration: remove the column

Never remove the column and the code reading it in the same deploy.
Renaming a column:
1. Add the new column
2. Deploy code that writes to both columns
3. Backfill old column data into the new column
4. Deploy code that reads from the new column only
5. Remove the old column
This expand-contract pattern (also called parallel change) allows column changes without downtime. It requires multiple deploys, but it eliminates lock contention and deploy-time errors.
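A minimal sketch of the rename sequence, assuming a hypothetical orders.state column being renamed to status; each comment marks a separate deploy:

```sql
-- Expand: add the new column alongside the old one.
ALTER TABLE orders ADD COLUMN status text;

-- ...deploy application code that writes both state and status...

-- Backfill rows written before the dual-write deploy (batch this on large
-- tables; see the next section).
UPDATE orders SET status = state WHERE status IS NULL;

-- ...deploy application code that reads and writes only status...

-- Contract: drop the old column once nothing references it.
ALTER TABLE orders DROP COLUMN state;
```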
Large-scale migrations require a specific strategy
A migration that adds a NOT NULL column to a 50-million-row table by backfilling defaults will take an exclusive table lock for the duration of the backfill. This is a production outage. Strategies for large tables:
- Batched backfill — update rows in small batches with sleep intervals, over hours or days, before adding the constraint
- Online schema change tools — tools like `pt-online-schema-change`, `gh-ost` (MySQL) or `pg_repack` (PostgreSQL) rebuild the table in the background without an exclusive lock
- Shadow table pattern — create the new table structure, dual-write to both, migrate reads, backfill, cut over
- Nullable first — add a column as nullable, deploy, backfill, then add the NOT NULL constraint in a separate migration once the backfill is verified
The cost model for migrations should include time-to-execute on production data volumes, not just on development datasets. Test migrations against a recent production snapshot — or at least at order-of-magnitude production scale — before running in production.
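A PostgreSQL sketch of the nullable-first and batched-backfill strategies combined, with hypothetical table and column names:

```sql
-- 1. Add the column as nullable: a metadata-only change on modern engines.
ALTER TABLE orders ADD COLUMN region text;

-- 2. Backfill in small batches from a script that sleeps between runs and
--    stops once the UPDATE reports 0 rows affected.
UPDATE orders
SET    region = 'unknown'
WHERE  id IN (
    SELECT id FROM orders WHERE region IS NULL LIMIT 5000
);

-- 3. After verifying the backfill, add the constraint in its own migration.
--    (On very large tables, a CHECK (...) NOT VALID constraint validated
--    separately avoids a long scan under an exclusive lock.)
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;
```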
Index design is part of the schema
An unindexed query on a large table is a latency spike. An over-indexed table is a write amplifier. Index design decisions are schema decisions and should be made at design time.
Standards:
- Every foreign key column has an index (some databases create this automatically; verify)
- Every column used in a frequent `WHERE`, `JOIN`, or `ORDER BY` is a candidate for indexing
- Composite indexes are ordered by selectivity (most selective column first) and by query pattern
- Unused indexes are technical debt — they consume storage and slow writes without benefiting any query
- Indexes are monitored for usage (most databases expose this in system views)
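In PostgreSQL, for example, index usage is visible in the pg_stat_user_indexes system view; a sketch of a query that surfaces never-scanned indexes, largest first:

```sql
SELECT schemaname,
       relname                      AS table_name,
       indexrelname                 AS index_name,
       pg_relation_size(indexrelid) AS index_bytes
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0                 -- never used since statistics were reset
ORDER  BY pg_relation_size(indexrelid) DESC;
```

The counters are cumulative since the last statistics reset, so confirm an index is genuinely unused over a representative period before dropping it.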
How to Implement
Schema review before migration runs
Every schema change undergoes a review that covers:
- Is the change backward compatible? (Can old and new code run simultaneously without errors?)
- What is the estimated execution time on production data volumes?
- Does the migration take an exclusive lock? If yes, is there an alternative?
- Is the rollback path defined? (What is the reverse migration? Has it been tested?)
- Are all affected queries updated to use new column names / structures?
- Are indexes added for new columns used in query predicates?
Migration safety checklist
Before running any migration in production:
- Tested against a dataset at production scale (or demonstrated safe via tool guarantees)
- Backward-compatible with the code version currently deployed
- Lock duration estimated and within acceptable window (or uses online schema change tooling)
- Rollback migration exists and has been tested
- Monitoring in place to detect query performance regression after migration runs
Common Pitfalls
Migrating and deploying code atomically. Running a destructive migration and a code deploy together, without the expand-contract buffer, is the most common cause of schema-related outages. Separate them.
Assuming development scale predicts production behaviour. A migration that takes 2 seconds on 10,000 rows takes roughly 2,000 seconds (over half an hour) on 10 million rows, and scaling is often worse than linear. Test at realistic scale before running in production.
Soft delete without enforcing the filter. Every query path must filter on deleted_at IS NULL. A single query that forgets the filter leaks deleted data to the user. Use database-level views or row-level security to enforce this automatically rather than relying on every query author to remember it.
Nullable columns as a workaround for failed normalisation. A table with 40 nullable columns, each meaningful only for a specific product type, is a signal that the schema is under-normalised and that product subtypes should be their own tables or types.
Skipping index review. Adding a column used in queries without adding the index. The query performs acceptably in development (small table), degrades at scale, and produces a production incident whose root cause is a missing index.
JSONB / document columns as a schema design escape hatch. JSON columns are useful for genuinely variable schemas. They are often used to avoid schema design — the result is a column that cannot be efficiently indexed, queried, or typed. If the structure of the JSON is consistent, it belongs in typed columns.
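Promoting a consistently-present JSON key to a typed column follows the same expand-contract sequence as any other column change. A sketch with a hypothetical events table and a payload JSONB column:

```sql
-- Add the typed column, backfill it from the JSON (batch on large tables),
-- then move readers and writers over to it.
ALTER TABLE events ADD COLUMN occurred_at timestamptz;

UPDATE events
SET    occurred_at = (payload ->> 'occurred_at')::timestamptz
WHERE  occurred_at IS NULL;

-- The typed column can now be indexed and queried efficiently.
CREATE INDEX events_occurred_at_idx ON events (occurred_at);
```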