Scalability Patterns
Proven architectural patterns for building systems that handle growth without requiring rewrites.
Overview
Scalability is the property of a system that allows it to handle increasing load — more users, more data, more requests — without proportional increases in cost, latency, or operational complexity. Systems that are not designed with scalability in mind do not fail suddenly; they degrade gradually. Response times creep up, error rates climb, and operations become increasingly fragile until a rewrite becomes unavoidable.
The goal is not to design for infinite scale on day one. That is premature optimization — it adds complexity before the load exists to justify it. The goal is to know the patterns well enough to apply them at the right moment, and to avoid design choices that make scaling impossible later.
Managing the cost of addressing scalability issues over time is covered in Tech Debt Management. For the process of evaluating scalability trade-offs before building, see Design Review Frameworks.
Why It Matters
The cost of re-architecture grows nonlinearly with usage. Changing a stateful, monolithic service to a stateless, horizontally scalable one at 10,000 users is a weekend project. The same change at 10 million users, with production traffic running continuously, is a multi-quarter migration with real risk of data loss and downtime. Design choices that feel negligible early become very expensive to undo.
Unscalable design spreads its constraints. A service that cannot scale horizontally forces every dependent service to compensate for it — adding retry logic, rate limiting their own calls, buffering traffic. Scalability problems in one component leak into the entire system and constrain every team that depends on it.
Performance under load is different from performance at rest. A system that responds in 50ms under normal conditions may respond in 5,000ms under peak load. These are not the same system. Scalability patterns address how a system behaves under stress, not just under average conditions, and stress is precisely when that behavior matters most.
Rewrites carry hidden costs. The belief that "we'll scale it properly in v2" consistently underestimates the cost of the rewrite: lost edge-case handling accreted over years, the distraction from shipping new features, and the risk that the rewrite itself introduces new scalability problems. Applying patterns incrementally is almost always cheaper than a full rewrite.
Standards & Best Practices
Design for statelessness first
The single most important scalability property is statelessness. A stateless service holds no session-specific data in memory — every request is self-contained and can be handled by any instance. This means:
- Session state lives in the database or a distributed store (Redis), not in application memory
- No local file system writes for anything that must survive a restart
- No in-process caches that diverge across instances
When every instance is equivalent, you can add or remove instances freely. When instances hold local state, you need sticky sessions, data synchronization, or complicated failover logic — all of which undermine the benefits of horizontal scaling.
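As a concrete illustration, here is a minimal sketch of externalized session state, assuming the ioredis-style redis client the later examples in this article import from '@/lib/redis', plus a hypothetical SessionData shape:

```ts
import { redis } from '@/lib/redis';

// Hypothetical session shape, for illustration only
interface SessionData {
  userId: string;
  roles: string[];
}

const SESSION_TTL_SECONDS = 60 * 30; // illustrative 30-minute expiry

// Any instance can read or write the session; no instance holds it in memory
async function getSession(sessionId: string): Promise<SessionData | null> {
  const raw = await redis.get(`session:${sessionId}`);
  return raw ? (JSON.parse(raw) as SessionData) : null;
}

async function saveSession(sessionId: string, data: SessionData): Promise<void> {
  await redis.setex(`session:${sessionId}`, SESSION_TTL_SECONDS, JSON.stringify(data));
}
```

Because the session lives in Redis rather than process memory, any instance can serve any request, and instances can be added, removed, or restarted without losing sessions.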
Cache at the right layer
Caching reduces load on expensive resources (databases, downstream APIs) by serving previously computed results. The error is almost never "we cached too much" — it is "we cached the wrong thing" or "we cached without an invalidation strategy."
Cache hierarchy:
| Layer | What to cache | Technology | Typical TTL |
|---|---|---|---|
| In-process | Immutable config, reference data | In-memory map | Process lifetime |
| Distributed | Session data, computed aggregates, API responses | Redis / Memcached | Seconds to minutes |
| HTTP / CDN | Static assets, public API responses | CDN (Cloudflare, Fastly) | Minutes to days |
| Database query | Expensive reads with predictable inputs | Redis, DB query cache | Seconds to minutes |
What not to cache: anything that must be strongly consistent (balances, permissions, inventory counts), anything with unbounded cardinality (per-user, per-request data at high volume), and anything you cannot invalidate reliably.
Cache invalidation strategy must be defined at design time. A cache with no invalidation plan is a correctness bug waiting to happen. Common strategies: TTL (data expires after a fixed duration), event-driven invalidation (the write path invalidates the cache key), and write-through (cache is always updated on writes).
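The first two strategies appear in the cache wrapper under Tools & Templates below. As a minimal sketch of the write-through variant, assuming the same redis helper plus a hypothetical db client:

```ts
import { redis } from '@/lib/redis';
import { db } from '@/lib/db'; // assumed DB client, for illustration

const PROFILE_TTL_SECONDS = 300;

// Write-through: the write path refreshes the cache instead of deleting it,
// so the next read is a hit and always sees the new value
async function updateUserEmail(userId: string, email: string) {
  const updated = await db.users.update({ where: { id: userId }, data: { email } });
  await redis.setex(
    `user:profile:${userId}`,
    PROFILE_TTL_SECONDS, // TTL kept as a backstop even with write-through
    JSON.stringify(updated),
  );
  return updated;
}
```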
Move non-critical work off the request path
Any work that does not need to complete before the response is returned should be deferred. Sending a confirmation email, generating a PDF, updating a recommendation model, fanning out notifications — none of these belong in the synchronous request handler.
A message queue (BullMQ, SQS, RabbitMQ) accepts the work, returns a job ID, and workers process it asynchronously. Benefits:
- Request latency drops to the time required for the critical path only
- Workers can be scaled independently of the API service
- Failed jobs can be retried without impacting the user experience
- Traffic spikes are absorbed by the queue rather than hitting downstream services directly
Idempotency is required for any queued work. Workers can fail mid-processing and restart from the beginning. If the operation is not idempotent, retries cause duplicate side effects — duplicate emails, double charges, duplicate records. Every queued job must be safe to run twice.
Scale reads before writes
Most production systems are read-heavy. Writes are constrained by durability requirements; reads can be distributed more aggressively.
Read replicas route read-only queries to replica databases, reducing load on the primary. Many ORMs and database clients support replica routing with minimal configuration. Add read replicas before considering horizontal sharding; they are significantly simpler to operate.
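A minimal sketch of explicit read/write routing with node-postgres; the environment variable names and the two-pool layout are illustrative:

```ts
import { Pool } from 'pg';

const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL });

// Reads go to the replica; writes (and reads that must observe a
// just-committed write, since replicas lag the primary) go to the primary
export const readQuery = (text: string, params?: unknown[]) => replica.query(text, params);
export const writeQuery = (text: string, params?: unknown[]) => primary.query(text, params);
```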
Connection pooling is required at any meaningful scale. Database connections are expensive to establish and limited in number. A connection pool reuses established connections across requests. Without pooling, a traffic spike can exhaust database connections and cause cascading failures. Use PgBouncer (PostgreSQL), ProxySQL (MySQL), or the connection pooling built into your ORM.
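On the application side, the same idea in node-postgres; the numbers are illustrative, not recommendations, and should be sized against expected concurrency and database limits:

```ts
import { Pool } from 'pg';

// One pool per process, created at startup and shared by all requests
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 25,                        // ceiling on connections from this process
  idleTimeoutMillis: 30_000,      // close connections idle longer than 30s
  connectionTimeoutMillis: 2_000, // fail fast rather than queueing forever
});

export const query = (text: string, params?: unknown[]) => pool.query(text, params);
```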
Index before you shard. Most query performance problems are indexing problems. A missing index on a high-cardinality column can make a fast query orders of magnitude slower. Profile slow queries, add indexes, and validate with EXPLAIN ANALYZE before reaching for horizontal sharding, which adds significant operational complexity.
Protect downstream services with rate limiting and backpressure
A service that accepts traffic at any volume without limiting it will eventually overwhelm its dependencies. Two complementary mechanisms:
Rate limiting enforces a ceiling on how many requests a client can make in a time window. Implement at the API gateway or in the service itself. Use the token bucket algorithm for burst-tolerant limits; use the fixed window algorithm for strict per-minute quotas.
Backpressure signals to upstream callers that the service is at capacity, so they slow their own rate rather than retrying aggressively. Return 503 Service Unavailable with a Retry-After header when at capacity. Never silently drop requests — a caller that receives no response will retry indefinitely.
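A minimal single-process sketch of that backpressure response, in the route-handler style used elsewhere in this article; the in-flight counter and ceiling are illustrative stand-ins for a real capacity signal such as queue depth or downstream health:

```ts
let inFlight = 0;
const MAX_IN_FLIGHT = 100; // illustrative capacity ceiling

async function handleOrder(req: Request): Promise<Response> {
  // Hypothetical business logic
  return Response.json({ status: 'ok' });
}

export async function POST(req: Request) {
  if (inFlight >= MAX_IN_FLIGHT) {
    // Signal backpressure: tell the caller when to retry, never drop silently
    return new Response('Service at capacity', {
      status: 503,
      headers: { 'Retry-After': '5' }, // seconds
    });
  }
  inFlight += 1;
  try {
    return await handleOrder(req);
  } finally {
    inFlight -= 1;
  }
}
```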
Design write paths for idempotency
Any write operation that might be retried — by a client after a network timeout, by a worker after a crash, by a webhook re-delivery — must be idempotent. Applying the same operation twice must produce the same result as applying it once.
Common approaches:
- Idempotency keys: Clients send a unique key per logical operation. The server stores seen keys and returns the cached result for duplicates (see the sketch after this list).
- Upsert semantics: Insert-or-update instead of unconditional insert prevents duplicate records.
- Conditional writes: `UPDATE ... WHERE version = :expected_version` succeeds only if the record is in the expected state.
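A minimal sketch of the idempotency-key approach, using Redis for brevity (a durable store is the safer choice for payment-grade operations); the key prefix and TTL are illustrative:

```ts
import { redis } from '@/lib/redis';

const IDEMPOTENCY_TTL_SECONDS = 60 * 60 * 24; // keep results for 24 hours

async function withIdempotency<T>(
  idempotencyKey: string,
  operation: () => Promise<T>,
): Promise<T> {
  const redisKey = `idem:${idempotencyKey}`;

  // NX claims the key only if this operation has not been seen before
  const claimed = await redis.set(redisKey, 'pending', 'EX', IDEMPOTENCY_TTL_SECONDS, 'NX');
  if (!claimed) {
    const existing = await redis.get(redisKey);
    if (existing && existing !== 'pending') return JSON.parse(existing) as T;
    // A concurrent attempt is still running; a real implementation also needs
    // to clear stale 'pending' claims left behind by crashed workers
    throw new Error('Operation already in progress');
  }

  const result = await operation();
  await redis.set(redisKey, JSON.stringify(result), 'EX', IDEMPOTENCY_TTL_SECONDS);
  return result;
}
```

The client generates the key once per logical operation (for example, a UUID created when the user clicks Pay) and resends the same key on every retry.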
How to Implement
Choosing the right pattern for the problem
Before applying a pattern, identify what is actually slow or failing:
- Profile first. Instrument your services before drawing conclusions. Response time percentiles (p50, p95, p99), database query times, cache hit rates, and queue depths tell you where the bottleneck actually is — not where you expect it to be.
- Fix indexing and query shape before adding infrastructure. A slow query that takes 2,000ms often becomes a 10ms query with a covering index. Infrastructure changes (read replicas, caching layers) cost significantly more to operate than an index.
- Defer async work before scaling the API service. If the API is slow, check whether the request path includes work that could be offloaded to a queue. Adding more API instances does not help if each instance is blocked on slow downstream calls.
- Add a distributed cache once read volume justifies it. Cache when the same expensive computation is performed repeatedly with the same inputs, and when slight staleness is acceptable.
- Add read replicas when the primary is I/O bound on reads. This is typically visible in DB CPU and connection count metrics.
Async queue worker pattern
```ts
// Producer — in the API handler
import { queue } from '@/lib/queue';

export async function POST(req: Request) {
  const { userId, orderId } = await req.json();
  await queue.add(
    'send-confirmation',
    { userId, orderId },
    {
      attempts: 3,
      backoff: { type: 'exponential', delay: 1000 },
    },
  );
  return Response.json({ status: 'accepted' }, { status: 202 });
}

// Worker — runs in a separate process
queue.process('send-confirmation', async (job) => {
  const { userId, orderId } = job.data;
  await sendConfirmationEmail(userId, orderId); // must be idempotent
});
```

Redis cache wrapper with TTL and invalidation
```ts
import { redis } from '@/lib/redis';
import { db } from '@/lib/db'; // assumed DB client for the invalidation example

async function getCachedOrFetch<T>(
  key: string,
  ttlSeconds: number,
  fetch: () => Promise<T>,
): Promise<T> {
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached) as T;
  const value = await fetch();
  await redis.setex(key, ttlSeconds, JSON.stringify(value));
  return value;
}

// Invalidate on write
async function updateUserProfile(userId: string, data: ProfileUpdate) {
  await db.users.update({ where: { id: userId }, data });
  await redis.del(`user:profile:${userId}`);
}
```

Pre-launch load test checklist
Before a significant release or traffic event:
- Identify the three most-called endpoints and the three slowest
- Confirm all slow queries have appropriate indexes (`EXPLAIN ANALYZE`)
- Verify connection pool size matches expected concurrency
- Confirm async jobs are idempotent and have retry limits
- Set rate limits on public-facing endpoints
- Run a load test at 2× expected peak traffic; observe p95 latency and error rate (a sketch follows this checklist)
- Confirm alerts fire before — not after — SLA thresholds are breached
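A minimal sketch of the load-test item using k6 (one option among many load-testing tools); the endpoint, virtual-user count, and thresholds are illustrative and should be sized against your own 2× peak estimate:

```ts
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 200,        // illustrative: size to 2x expected peak concurrency
  duration: '5m',
  thresholds: {
    http_req_duration: ['p(95)<500'], // fail the run if p95 exceeds 500ms
    http_req_failed: ['rate<0.01'],   // fail the run if error rate exceeds 1%
  },
};

export default function () {
  http.get('https://staging.example.com/api/orders'); // hypothetical endpoint
  sleep(1);
}
```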
Tools & Templates
Identifying slow queries (PostgreSQL)
```sql
-- Queries consuming the most total execution time
-- (pg_stat_statements is cumulative until reset; times are in milliseconds)
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
WHERE total_exec_time > 1000
ORDER BY total_exec_time DESC
LIMIT 20;
```

Connection pool configuration (PgBouncer)
```ini
[databases]
mydb = host=db.internal port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
```

Rate limiter (fixed window, Redis-backed)
```ts
import { redis } from '@/lib/redis';

async function checkRateLimit(key: string, limitPerMinute: number): Promise<boolean> {
  const window = Math.floor(Date.now() / 60_000);
  const redisKey = `ratelimit:${key}:${window}`;
  const count = await redis.incr(redisKey);
  if (count === 1) await redis.expire(redisKey, 60);
  return count <= limitPerMinute;
}
```
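The function above implements a fixed window counter, which suits strict per-minute quotas. For the burst-tolerant token bucket recommended under Standards & Best Practices, here is a minimal sketch; the capacity and refill parameters are illustrative, and the Lua script is one way to keep the read-modify-write atomic on the Redis server:

```ts
import { redis } from '@/lib/redis';

// Atomic token bucket: refill based on elapsed time, then try to take one token
const TOKEN_BUCKET_LUA = `
local capacity = tonumber(ARGV[1])
local refill_per_sec = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1]) or capacity
local ts = tonumber(data[2]) or now
tokens = math.min(capacity, tokens + (now - ts) * refill_per_sec)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'ts', tostring(now))
redis.call('EXPIRE', KEYS[1], 120)
return allowed
`;

async function checkTokenBucket(
  key: string,
  capacity: number,        // maximum burst size
  refillPerSecond: number, // sustained request rate
): Promise<boolean> {
  const allowed = (await redis.eval(
    TOKEN_BUCKET_LUA,
    1,
    `ratelimit:tb:${key}`,
    capacity,
    refillPerSecond,
    Date.now() / 1000,
  )) as number;
  return allowed === 1;
}
```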
Common Pitfalls
Caching without an invalidation strategy. Adding a cache is easy; cache invalidation is famously hard. A cache with no invalidation plan serves stale data indefinitely. Define how the cache gets updated or expired at the same time you add it, not afterward.
Caching before profiling. Adding a cache in front of a slow endpoint before understanding why it is slow often masks the root cause. The endpoint may be slow because of a missing index, an N+1 query, or serialization overhead, all of which should be fixed at the source. Cache the result of an efficient operation, not an inefficient one.
In-process caches in horizontally scaled services. If you have five instances and each holds its own in-process cache, you get five different views of the data and no way to invalidate them consistently. In-process caches are only safe for truly immutable data (e.g. compiled regex, static config loaded at startup).
Adding horizontal scale without fixing stateful design. Spinning up more instances of a service that writes to local disk, holds session state in memory, or relies on a single-node queue does not improve throughput — it introduces split-brain problems. Statelessness is a precondition for horizontal scaling, not an afterthought.
Sharding too early. Horizontal database sharding is one of the most operationally complex things a team can do. It makes cross-shard queries impossible, complicates migrations, and requires careful key design. Most systems that think they need sharding actually need better indexing, read replicas, or a caching layer. Exhaust those options first.
Skipping idempotency on retryable operations. Any operation that is retried on failure — queued jobs, webhook handlers, payment processing, API mutations — must be idempotent. Discovering a double-charge or duplicate notification bug in production is orders of magnitude more expensive to fix than designing for idempotency upfront.
Rate limiting only at the API layer. Applying rate limits at the API gateway protects the API service but not the downstream services it calls. A single user sending 50 requests per second, all within the API's rate limit, can still overwhelm a database or third-party API if internal calls have no limits of their own.