Data Modeling & Schema Design
Principles for designing schemas that are correct at inception and safe to evolve over time — without downtime.
Overview
A schema is a long-term commitment. The design decisions made when a table is created — naming conventions, normalisation level, how deletions are handled, how tenant data is isolated — determine what is easy and what is painful for the next several years. Unlike API design, where a version boundary can isolate consumers from changes, schema changes are often global and immediate. A migration that runs correctly on 10,000 rows can lock the database for minutes on 100 million rows.
This page covers the principles behind designing durable schemas and executing safe migrations. The design side (how to model the data) and the migration side (how to evolve the model) are treated together because a perfect schema applied via a destructive migration is still a production incident.
For how APIs expose data to consumers, see API Design Standards. For system-level reliability during migrations, see Reliability Patterns. For the infrastructure that runs migrations, see CI/CD Pipelines.
Why It Matters
Schema changes are the most dangerous class of production operations. A code deploy can be rolled back in minutes; a migration that drops a column, renames a field, or backfills 50 million rows cannot be undone cleanly without significant effort. The cost of getting this wrong is a production incident — often a long one.
Bad schema design compounds. A table that starts with good naming conventions and appropriate normalisation can support years of product growth. A table that starts as a loosely typed JSON blob, a flat list of nullable columns, or a mis-normalised aggregate will produce a stream of bugs, performance problems, and increasingly painful migrations as the product grows.
Multi-tenancy and deletion semantics are decisions that cannot easily be changed later. Whether tenant data is isolated by row, by schema, or by database; whether deleted records are removed or flagged — these choices, once made and data-laden, require large migrations to reverse. They deserve deliberate design up front.
Zero-downtime deployments require zero-downtime migrations. A deployment strategy that achieves zero downtime at the application layer is negated by a migration that takes an exclusive table lock for ten minutes. Application-level and database-level deployment strategies must be designed together.
Standards & Best Practices
Normalise until it hurts, then denormalise deliberately
Normalisation removes redundancy and establishes a single source of truth for each fact. The general guidance is to normalise to third normal form (3NF) by default — every non-key attribute depends only on the primary key, not on other non-key attributes.
Denormalisation (deliberate redundancy to improve read performance) is occasionally the right trade-off, but it has a cost: every write to the source data must update the denormalised copy, and the two can fall out of sync. Denormalise only when the performance problem is real and measured, not anticipated.
Signs that a schema is under-normalised:
- The same fact (e.g. a user's email address) appears in multiple tables without a canonical source
- Updating one record logically requires updating multiple others
- Columns contain values that are derived from other columns
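As a minimal sketch, assuming a hypothetical orders table that duplicates customer details, normalising to 3NF gives each fact a single home:

```sql
-- Before (under-normalised): every order repeated the customer's details,
-- so one email change meant updating many rows, and copies could drift:
--   orders(id, customer_name, customer_email, total_cents)

-- After (3NF): each customer fact lives in exactly one place.
CREATE TABLE customers (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name  text NOT NULL,
    email text NOT NULL UNIQUE
);

CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL REFERENCES customers (id),
    total_cents bigint NOT NULL
);
```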
Naming conventions are part of the schema contract
Consumers query schemas by name. Inconsistent naming multiplies cognitive load on every engineer who touches the data layer.
Standards:
- Table names are lowercase, plural, and noun-based (`orders`, `line_items`, not `Order` or `tbl_order`)
- Column names are lowercase and snake_case
- Foreign keys follow the pattern `{referenced_table_singular}_id` (e.g. `user_id` in the `orders` table)
- Boolean columns are named in the affirmative (`is_active`, `is_deleted`, not `disabled` or `not_active`)
- Timestamp columns are suffixed with `_at` and are stored as UTC (`created_at`, `updated_at`, `deleted_at`)
- Avoid abbreviations that are not universally understood
These conventions should be automated via a linter where possible. Manual enforcement at review time does not scale.
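As an illustration, a hypothetical line_items table that follows the conventions above (and assumes the orders table from the earlier sketch):

```sql
CREATE TABLE line_items (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id   bigint NOT NULL REFERENCES orders (id), -- {referenced_table_singular}_id
    sku        text NOT NULL,
    quantity   integer NOT NULL,
    is_gift    boolean NOT NULL DEFAULT false,         -- affirmative boolean
    created_at timestamptz NOT NULL DEFAULT now(),     -- UTC, suffixed _at
    updated_at timestamptz NOT NULL DEFAULT now()
);
```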
Soft delete vs hard delete is a business decision
Soft delete marks a record as deleted (typically with a deleted_at timestamp) without removing it from the database. Hard delete removes the row.
Soft delete is appropriate when:
- Data has regulatory or audit retention requirements
- Deleted records need to be recoverable by users (undo, recycle bin)
- References to deleted entities must remain meaningful (e.g. an order belongs to a deleted account)
Hard delete is appropriate when:
- Data is ephemeral with no recovery or audit requirement
- The regulatory requirement is to ensure deletion (GDPR right-to-erasure cases)
- Storage cost of retained records is significant
The common mistake is defaulting to soft delete "just in case." Soft delete introduces significant complexity: every query must filter on deleted_at IS NULL, indexes must include the deleted flag, and any report that does not filter correctly will include deleted records. If the case for soft delete is not clear, default to hard delete and add soft delete only when the requirement materialises.
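Where soft delete is justified, the filter can be centralised rather than repeated in every query. A PostgreSQL sketch, assuming a hypothetical users table with a deleted_at column:

```sql
-- A view applies the filter once; application queries read the view.
CREATE VIEW active_users AS
    SELECT * FROM users WHERE deleted_at IS NULL;

-- A partial unique index enforces constraints on live rows only, so an
-- email can be reused after the original account is soft-deleted.
CREATE UNIQUE INDEX users_email_active_idx
    ON users (email)
    WHERE deleted_at IS NULL;
```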
Multi-tenancy strategy is a schema design decision
How tenant data is isolated affects query performance, security blast radius, migration complexity, and regulatory compliance. The three main strategies:
| Strategy | Description | When appropriate |
|---|---|---|
| Row-level | All tenants share tables; tenant_id column on every row | Simplest to operate; sufficient for most SaaS at early–mid scale |
| Schema-per-tenant | Each tenant has their own schema within one database | Easier to migrate individual tenants; adds complexity for cross-tenant reporting |
| Database-per-tenant | Each tenant has their own database instance | Maximum isolation; required for regulatory segregation; highest operational cost |
Once chosen and data-laden, this strategy is very difficult to change. The decision should be made at design time based on the isolation and compliance requirements of the product, not after the first customer is signed.
With row-level multi-tenancy, every query must filter on tenant_id. The most common bug class in row-level multi-tenancy is a query that accidentally returns data from the wrong tenant. Use row-level security (RLS) at the database layer where supported — it prevents application-layer bypass more reliably than code-level filtering.
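A sketch of RLS-based tenant isolation in PostgreSQL; the orders table and the app.current_tenant_id session setting are hypothetical names:

```sql
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;  -- apply the policy to the table owner too

-- Every statement against orders is implicitly filtered to the current tenant.
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.current_tenant_id')::bigint);
```

The application sets the tenant once per connection or transaction (e.g. `SET app.current_tenant_id = '42';`), and the database filters every query to that tenant's rows, regardless of what the application code forgot.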
Schema changes must be backward compatible
During a deployment, old code and new code run simultaneously. A migration that removes a column required by old code will cause errors before the deploy completes. The safe migration order:
Adding a column:
1. Deploy the migration: add the column as nullable with no default (or with a safe default)
2. Deploy the new code that writes to the column
3. (Optional) Backfill existing rows, then add the NOT NULL constraint
Removing a column:
1. Deploy code that no longer reads or writes the column (while the column still exists)
2. Deploy the migration: remove the column

Never remove the column and the code reading it in the same deploy.
Renaming a column:
1. Add the new column
2. Deploy code that writes to both columns
3. Backfill old column data into the new column
4. Deploy code that reads from the new column only
5. Remove the old column
This expand-contract pattern (also called parallel change) allows column changes without downtime. It requires multiple deploys, but it eliminates lock contention and deploy-time errors.
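A minimal sketch of the rename sequence, assuming a hypothetical orders.state column being renamed to status; each comment marks a separate deploy:

```sql
-- Expand: add the new column alongside the old one.
ALTER TABLE orders ADD COLUMN status text;

-- ...deploy application code that writes both state and status...

-- Backfill rows written before the dual-write deploy (batch this on large
-- tables; see the next section).
UPDATE orders SET status = state WHERE status IS NULL;

-- ...deploy application code that reads and writes only status...

-- Contract: drop the old column once nothing references it.
ALTER TABLE orders DROP COLUMN state;
```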
Large-scale migrations require a specific strategy
A migration that adds a NOT NULL column to a 50-million-row table by backfilling defaults will take an exclusive table lock for the duration of the backfill. This is a production outage. Strategies for large tables:
- Batched backfill — update rows in small batches with sleep intervals, over hours or days, before adding the constraint
- Online schema change tools — tools like `pt-online-schema-change`, `gh-ost` (MySQL) or `pg_repack` (PostgreSQL) rebuild the table in the background without an exclusive lock
- Shadow table pattern — create the new table structure, dual-write to both, migrate reads, backfill, cut over
- Nullable first — add a column as nullable, deploy, backfill, then add the NOT NULL constraint in a separate migration once the backfill is verified
The cost model for migrations should include time-to-execute on production data volumes, not just on development datasets. Test migrations against a recent production snapshot — or at least at order-of-magnitude production scale — before running in production.
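A PostgreSQL sketch of the nullable-first and batched-backfill strategies combined, with hypothetical table and column names:

```sql
-- 1. Add the column as nullable: a metadata-only change on modern engines.
ALTER TABLE orders ADD COLUMN region text;

-- 2. Backfill in small batches from a script that sleeps between runs and
--    stops once the UPDATE reports 0 rows affected.
UPDATE orders
SET    region = 'unknown'
WHERE  id IN (
    SELECT id FROM orders WHERE region IS NULL LIMIT 5000
);

-- 3. After verifying the backfill, add the constraint in its own migration.
--    (On very large tables, a CHECK (...) NOT VALID constraint validated
--    separately avoids a long scan under an exclusive lock.)
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;
```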
Index design is part of the schema
An unindexed query on a large table is a latency spike. An over-indexed table is a write amplifier. Index design decisions are schema decisions and should be made at design time.
Standards:
- Every foreign key column has an index (some databases create this automatically; verify)
- Every column used in a frequent `WHERE`, `JOIN`, or `ORDER BY` is a candidate for indexing
- Composite indexes are ordered by selectivity (most selective column first) and by query pattern
- Unused indexes are technical debt — they consume storage and slow writes without benefiting any query
- Indexes are monitored for usage (most databases expose this in system views)
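In PostgreSQL, for example, index usage is visible in the pg_stat_user_indexes system view; a sketch of a query that surfaces never-scanned indexes, largest first:

```sql
SELECT schemaname,
       relname                      AS table_name,
       indexrelname                 AS index_name,
       pg_relation_size(indexrelid) AS index_bytes
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0                 -- never used since statistics were reset
ORDER  BY pg_relation_size(indexrelid) DESC;
```

The counters are cumulative since the last statistics reset, so confirm an index is genuinely unused over a representative period before dropping it.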
How to Implement
Schema review before migration runs
Every schema change undergoes a review that covers:
- Is the change backward compatible? (Can old and new code run simultaneously without errors?)
- What is the estimated execution time on production data volumes?
- Does the migration take an exclusive lock? If yes, is there an alternative?
- Is the rollback path defined? (What is the reverse migration? Has it been tested?)
- Are all affected queries updated to use new column names / structures?
- Are indexes added for new columns used in query predicates?
Migration safety checklist
Before running any migration in production:
- Tested against a dataset at production scale (or demonstrated safe via tool guarantees)
- Backward-compatible with the code version currently deployed
- Lock duration estimated and within acceptable window (or uses online schema change tooling)
- Rollback migration exists and has been tested
- Monitoring in place to detect query performance regression after migration runs
Common Pitfalls
Migrating and deploying code atomically. Running a destructive migration and a code deploy together, without the expand-contract buffer, is the most common cause of schema-related outages. Separate them.
Assuming development scale predicts production behaviour. A migration that takes 2 seconds on 10,000 rows takes roughly 2,000 seconds (over half an hour) on 10 million rows, and scaling is often worse than linear. Test at realistic scale before running in production.
Soft delete without enforcing the filter. Every query path must filter on deleted_at IS NULL. A single query that forgets the filter leaks deleted data to the user. Use database-level views or row-level security to enforce this automatically rather than relying on every query author to remember it.
Nullable columns as a workaround for failed normalisation. A table with 40 nullable columns, each meaningful only for a specific product type, is a signal that the schema is under-normalised and that product subtypes should be their own tables or types.
Skipping index review. Adding a column used in queries without adding the index. The query performs acceptably in development (small table), degrades at scale, and produces a production incident whose root cause is a missing index.
JSONB / document columns as a schema design escape hatch. JSON columns are useful for genuinely variable schemas. They are often used to avoid schema design — the result is a column that cannot be efficiently indexed, queried, or typed. If the structure of the JSON is consistent, it belongs in typed columns.
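Promoting a consistently-present JSON key to a typed column follows the same expand-contract sequence as any other column change. A sketch with a hypothetical events table and a payload JSONB column:

```sql
-- Add the typed column, backfill it from the JSON (batch on large tables),
-- then move readers and writers over to it.
ALTER TABLE events ADD COLUMN occurred_at timestamptz;

UPDATE events
SET    occurred_at = (payload ->> 'occurred_at')::timestamptz
WHERE  occurred_at IS NULL;

-- The typed column can now be indexed and queried efficiently.
CREATE INDEX events_occurred_at_idx ON events (occurred_at);
```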