WAL & Write Amplification
Write Amplification: One Write Is Never One Write
In production databases, a single logical write (INSERT/UPDATE/DELETE) often results in multiple physical writes. This effect is called write amplification. If you do not understand it, you will misdiagnose latency spikes, replication lag, and sudden throughput collapses.
The main sources are:
- Index maintenance (each index is another write path)
- Redo logging (WAL / binlog)
- Page splits and fragmentation
- Checkpointing and flushing dirty pages
- Durability guarantees (fsync, sync commits)
The Write Path: Logical Change to Durable State
At a high level, durable databases must ensure that after a crash, committed changes can be recovered. That usually means:
- Write change records to a log (redo log / WAL / binlog)
- Eventually flush modified data pages to disk
- Maintain ordering and durability constraints (fsync)
Most engines follow a “write-ahead” discipline: the log is persisted before data pages are considered durable.
WAL and Binlog: What They Are for
Terminology differs but the idea is consistent:
- WAL (Write-Ahead Log): redo information for crash recovery (common in Postgres).
- Redo log: similar concept (InnoDB).
- Binlog: replication log in MySQL; may be row-based or statement-based.
These logs are write-heavy. Under load, log throughput can become the bottleneck even when CPU is available.
fsync: The Durability Toll Gate
When the database calls fsync, it forces the OS to flush buffered writes to durable storage. This is expensive because it includes device-level guarantees. The moment you need strong durability, you pay latency.
Two important operational realities:
- Fast SSD does not eliminate fsync cost; it reduces it but does not remove it.
- fsync latency variance is a major contributor to tail latency.
Group Commit: How Databases Reduce fsync Cost
If every transaction did its own fsync, throughput would collapse. Most engines use group commit: multiple transactions share a single fsync by batching commits.
This improves throughput but can increase latency slightly. Under bursty load, batching becomes your friend. Under low load, per-transaction commit cost becomes visible.
Checkpoints: Why Latency Spikes Happen
A checkpoint is when the engine ensures that dirty pages are flushed such that recovery time stays bounded. Checkpoints are necessary, but they can be disruptive:
- Checkpoint flush creates heavy IO
- IO competes with user queries
- Latency spikes appear, often at regular intervals
Typical symptom: p95/p99 latency spikes every N minutes, aligned with checkpoint activity.
Indexes Multiply Writes
Every index on a table is additional work per write:
- INSERT writes table page + each index leaf
- UPDATE may write old tuple + new tuple (MVCC) + index maintenance
- DELETE often marks dead and still needs cleanup later
Over-indexing is one of the most common causes of write throughput collapse.
Page Splits and Hotspots
Write amplification increases when page splits occur (B-Tree growth) and when hot pages are contended (monotonic keys). A single “hot leaf page” can become a bottleneck even if the system has capacity elsewhere.
Measuring Write Pressure (Practical)
You want visibility into:
- Log volume growth (WAL/binlog)
- Checkpoint frequency and duration
- fsync latency and variance
- Dirty page ratios and flush rates
-- Postgres: basic WAL activity SELECT * FROM pg_stat_wal;
-- MySQL: binlog files and size (rough visibility) SHOW BINARY LOGS;
Even if you do not have perfect metrics, log growth rate and disk write bandwidth are strong signals.
Durability Knobs: Performance vs Risk
Most databases expose durability-related settings. Changing them can reduce fsync overhead but increases the risk of data loss on crash.
Engineering stance: you must explicitly choose your durability level based on business requirements, and document the blast radius.
Do not “optimize” durability in production without a clear RPO agreement.
Failure Modes in Production
- Checkpoint storms: flush load spikes cause periodic tail latency explosions.
- Replica lag: high WAL/binlog volume overwhelms replicas.
- Disk saturation: writes saturate device bandwidth; reads slow down too.
- Fsync jitter: variable fsync time creates unstable p99 latency.
- Over-indexing: write throughput collapses after adding indexes.
- Bloat-driven amplification: dead tuples/entries increase pages touched per write.
- Log retention blowup: WAL/binlog retention fills disks, causing outages.
Operational Checklist
- Track WAL/binlog growth rate as a first-class metric.
- Correlate p95/p99 latency spikes with checkpoint/flush activity.
- Measure disk write bandwidth utilization; keep headroom for bursts.
- Audit index count on write-heavy tables; remove unused indexes.
- Validate write performance after schema/index changes under realistic load.
- Monitor fsync latency and variance; tail latency often follows it.
- Plan log retention and disk capacity; avoid surprise disk-full incidents.
- Ensure replicas have apply capacity before heavy migrations/rebuilds.
- Do not change durability settings without explicit RPO agreement.
- Run incident drills: simulate checkpoint pressure and observe service behavior.
- Document mitigation levers: throttling maintenance, batching writes, reducing indexes.
Summary
Write amplification is the hidden tax of durability and indexing. Most production incidents that look like “database is slow” are actually “write path is saturated.” Once you understand WAL/binlog, fsync, and checkpoints, you can diagnose tail latency spikes and replication lag with confidence.