Postgres Replication Model
Postgres Streaming Replication: WAL as the Source of Truth
PostgreSQL replication is built around WAL (Write-Ahead Log). The primary does not ship “table changes” directly. It ships WAL records. Replicas replay WAL to reach the same state.
Production consequence: if you understand WAL flow, you can debug most replication and failover incidents.
The Replication Flow: Primary to Standby
Core components:
- WAL generation: primary writes WAL records for changes.
- WAL sender: a process on primary that streams WAL to standbys.
- WAL receiver: a process on standby that receives WAL.
- Replay: standby replays WAL to apply changes.
Standby state is typically described by LSN positions: how far WAL has been received and how far it has been replayed.
LSN: How Postgres Measures Progress
LSN (Log Sequence Number) is a monotonically increasing position in the WAL stream. For replication monitoring, you care about:
- Sent LSN: primary has sent WAL up to this point.
- Write LSN: standby has received WAL and written it to the OS.
- Flush LSN: standby has flushed that WAL to disk.
- Replay LSN: standby applied changes.
Lag can exist at receive, flush, or replay stage. Each stage has different operational implications.
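The stage-by-stage view can be made concrete with a little arithmetic. An LSN is displayed as two hexadecimal numbers separated by a slash (for example 16/B374D848), the high and low 32 bits of a 64-bit byte position, so the gap between two LSNs is a byte count. The sketch below splits one standby's lag into stages. The helper names lsn_to_int and stage_lag are hypothetical, and the sample LSN values are made up for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def stage_lag(current: str, sent: str, write: str, flush: str, replay: str) -> dict:
    """Break total lag (in bytes) into per-stage gaps.

    Arguments are LSN strings: the primary's current WAL position, then the
    sent/write/flush/replay positions reported for one standby.
    """
    c, s, w, f, r = (lsn_to_int(x) for x in (current, sent, write, flush, replay))
    return {
        "not_yet_sent": c - s,  # generated on the primary, not yet streamed
        "in_flight": s - w,     # sent, not yet written by the standby
        "unflushed": w - f,     # written, not yet flushed to disk
        "unreplayed": f - r,    # durable on the standby, not yet applied
        "total": c - r,
    }


# Illustrative values only (not taken from a real system).
lag = stage_lag("0/5000000", "0/5000000", "0/4FFF000", "0/4FFF000", "0/3000000")
print(lag)
```

In this made-up sample almost all of the gap sits in the unreplayed stage, which would point at replay speed on the standby rather than at the network.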
Async vs Sync in Postgres Terms
Postgres streaming replication is asynchronous by default. In async mode:
- Primary commits after local durability.
- Standby catches up later.
In sync replication (synchronous_commit + synchronous_standby_names), primary waits for a standby acknowledgment before commit is acknowledged to the client.
What Does “Synchronous” Acknowledge?
In Postgres, the synchronous_commit setting controls which level of acknowledgment the primary waits for:
- Standby received WAL
- Standby flushed WAL to disk
- Standby replayed WAL
Choosing stronger acknowledgment reduces RPO but increases commit latency and sensitivity to slow standbys.
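As a rough reference, the three levels above correspond to values of the synchronous_commit setting when synchronous_standby_names is set. The sketch below records that mapping as data. The requirement names ("received", "flushed", "replayed") and the helper functions are hypothetical and exist only for this illustration; check the documentation for your PostgreSQL version before relying on the descriptions.

```python
# What the primary waits for before acknowledging a commit, per
# synchronous_commit value, assuming synchronous_standby_names is set.
# Ordered from weakest to strongest guarantee.
SYNC_COMMIT_LEVELS = [
    ("off", "nothing, not even the local WAL flush"),
    ("local", "local WAL flush only"),
    ("remote_write", "standby has written WAL to its OS (not yet flushed)"),
    ("on", "standby has flushed WAL to disk"),
    ("remote_apply", "standby has replayed WAL (visible to queries there)"),
]

# Hypothetical requirement names used only in this sketch.
REQUIREMENT_TO_SETTING = {
    "received": "remote_write",
    "flushed": "on",
    "replayed": "remote_apply",
}


def weakest_setting_for(requirement: str) -> str:
    """Return the weakest synchronous_commit value meeting the requirement."""
    return REQUIREMENT_TO_SETTING[requirement]


def is_at_least(setting: str, minimum: str) -> bool:
    """True if `setting` is at least as strong as `minimum`."""
    order = [name for name, _ in SYNC_COMMIT_LEVELS]
    return order.index(setting) >= order.index(minimum)


print(weakest_setting_for("flushed"))
print(is_at_least("local", "remote_write"))
```

Keeping the levels in one ordered list makes it easy to assert in a deployment check that a cluster's configured value is not weaker than what the RPO target requires.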
Replication Slots: Preventing WAL Loss
A replication slot ensures the primary retains WAL until a standby has consumed it. This prevents a standby from falling so far behind that required WAL is deleted.
Production tradeoff:
- Good: prevents replica from becoming unrecoverable.
- Bad: if a replica is dead or stuck, WAL retention grows until disk fills, unless retention is capped (for example with max_slot_wal_keep_size in PostgreSQL 13 and later).
Slots are a common cause of “disk full” incidents when not monitored.
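A simple way to monitor this is to compare each slot's restart_lsn (the oldest WAL the slot still needs) with the primary's current WAL position. The sketch below assumes the rows have already been fetched from pg_replication_slots into dictionaries; the function name slots_over_budget, the slot names, and the 1 GiB budget are hypothetical choices for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '1/F0000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def slots_over_budget(current_lsn: str, slots: list, max_retained_bytes: int) -> list:
    """Return (slot_name, retained_bytes, active) for slots above the budget,
    largest retention first."""
    now = lsn_to_int(current_lsn)
    flagged = []
    for slot in slots:
        if slot["restart_lsn"] is None:  # slot has not reserved any WAL yet
            continue
        retained = now - lsn_to_int(slot["restart_lsn"])
        if retained > max_retained_bytes:
            flagged.append((slot["slot_name"], retained, slot["active"]))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up rows shaped like a subset of pg_replication_slots columns.
sample_slots = [
    {"slot_name": "replica_a", "restart_lsn": "1/F0000000", "active": True},
    {"slot_name": "replica_b", "restart_lsn": "0/10000000", "active": False},
    {"slot_name": "unused", "restart_lsn": None, "active": False},
]
flagged = slots_over_budget("2/0", sample_slots, max_retained_bytes=1024**3)
print(flagged)
```

An inactive slot with large and growing retention, like replica_b in this sample, is the pattern that usually precedes a disk-full incident.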
Physical vs Logical Replication (Conceptual)
Streaming replication as described here is physical replication: replaying WAL to reproduce the same physical state. Logical replication exists too (publishing row-level changes for selected tables), but the core HA story in Postgres is usually physical streaming replication.
Hot Standby and Read Queries
Standbys can serve read queries (hot standby). But reads on replicas are subject to:
- Replica lag (stale reads)
- Replay conflict (queries can be canceled if they block replay)
Production rule: if you route user traffic to replicas, you must accept staleness or enforce read-your-writes at app layer.
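One common way to enforce read-your-writes at the app layer is to remember the WAL position of a session's last write and only read from a replica whose replay LSN has reached it. The sketch below assumes the application has captured that position on the primary after commit (for example via pg_current_wal_lsn()) and polls each replica's pg_last_wal_replay_lsn(). The function name choose_read_target and the replica names are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/4000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def choose_read_target(session_write_lsn, replica_replay_lsns: dict) -> str:
    """Pick a replica that has replayed the session's last write.

    session_write_lsn: LSN string captured after the session's last commit,
        or None if the session has not written anything.
    replica_replay_lsns: mapping of replica name -> replay LSN string.
    Falls back to "primary" when no replica is caught up.
    """
    if session_write_lsn is None:
        # No writes in this session: any replica will do (staleness accepted).
        return next(iter(replica_replay_lsns), "primary")
    needed = lsn_to_int(session_write_lsn)
    for name, replay_lsn in replica_replay_lsns.items():
        if lsn_to_int(replay_lsn) >= needed:
            return name
    return "primary"


# Made-up replay positions for two replicas.
replicas = {"replica_a": "0/3000000", "replica_b": "0/5000000"}
print(choose_read_target("0/4000000", replicas))
print(choose_read_target("0/6000000", replicas))
```

Falling back to the primary keeps reads correct at the cost of primary load, so lagging replicas shift traffic rather than serve stale data.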
Timeline and Failover Reality
After failover, the promoted standby becomes the new primary and creates a new timeline. The old primary may hold WAL that was never replicated and therefore diverges from the new timeline. Reattaching it requires a deliberate resync (for example pg_rewind or a fresh base backup); you cannot just “point it back” without risk.
Operational consequence: failover is not a reversible toggle. It is a topology change.
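Conceptually, the new timeline forks from the old one at a specific LSN. If the old primary wrote WAL past that fork point before it stopped, it holds changes the new primary never saw. The sketch below illustrates only that decision; the function name rejoin_plan and the sample LSNs are hypothetical, and in practice a tool such as pg_rewind performs the real comparison and repair.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/5000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def rejoin_plan(old_primary_last_lsn: str, timeline_fork_lsn: str) -> str:
    """Decide, conceptually, how an old primary could rejoin after failover.

    old_primary_last_lsn: last WAL position written by the old primary.
    timeline_fork_lsn: LSN at which the promoted standby started its timeline.
    """
    old_end = lsn_to_int(old_primary_last_lsn)
    fork = lsn_to_int(timeline_fork_lsn)
    if old_end <= fork:
        # No WAL beyond the fork point: nothing has diverged.
        return "follow new primary"
    # WAL exists past the fork point: that history is not on the new timeline.
    return "resync required"


# Made-up positions: the old primary wrote 4096 bytes past the fork point.
print(rejoin_plan("0/5001000", "0/5000000"))
print(rejoin_plan("0/5000000", "0/5000000"))
```

Even a few kilobytes of WAL past the fork point is enough to require a resync, which is why a crash under write load rarely leaves the old primary able to rejoin as-is.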
Monitoring: What You Must Watch
In Postgres, you typically monitor replication via system views on primary and standby.
-- Primary: see connected standbys and their progress
SELECT application_name, state, sync_state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

-- Standby: see receive/replay lag from its perspective
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
Track both transport lag (receive/flush) and replay lag separately.
Failure Modes in Production
- Replay lag: standby receives WAL but cannot replay fast enough (CPU/IO bound).
- Network lag: WAL shipping delayed; transport lag increases.
- Slot disk fill: WAL retained indefinitely due to dead replica.
- Read query cancellations: hot standby conflicts cause query aborts.
- Sync stall: synchronous standby slow/unavailable blocks commits.
- Failover divergence: timeline changes; old primary cannot rejoin cleanly.
Operational Checklist
- Decide async vs sync replication based on RPO and latency budget.
- Monitor sent/write/flush/replay LSN gaps per standby.
- Alert on WAL retention growth when using replication slots.
- Capacity plan for worst-case WAL growth during replica outages.
- Test failover and promotion regularly; document runbook.
- Plan for re-sync of old primary after failover (do not assume reversible).
- Understand hot standby conflict behavior if serving reads from replicas.
- Monitor replica hardware resources; replay is often IO bound.
- Log replication state changes for incident correlation.
- After major write spikes, watch replay lag and checkpoint pressure.
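The capacity-planning item in the checklist comes down to simple arithmetic: with replication slots, the WAL the primary must be able to hold is roughly its peak WAL generation rate multiplied by the longest replica outage you intend to survive. The sketch below is a back-of-the-envelope estimate; the function name wal_headroom_bytes, the 1.5 safety factor, and the example figures (20 MiB/s for 4 hours) are assumptions for illustration, not measurements.

```python
def wal_headroom_bytes(peak_wal_bytes_per_sec: float,
                       outage_seconds: float,
                       safety_factor: float = 1.5) -> int:
    """Estimate disk space needed to retain WAL through a replica outage.

    peak_wal_bytes_per_sec: worst-case WAL generation rate on the primary.
    outage_seconds: longest replica outage to tolerate without dropping slots.
    safety_factor: multiplier for bursts above the measured peak (assumed).
    """
    return int(peak_wal_bytes_per_sec * outage_seconds * safety_factor)


# Example figures only: 20 MiB/s of WAL, 4-hour outage.
needed = wal_headroom_bytes(20 * 1024**2, 4 * 3600)
print(f"{needed / 1024**3:.1f} GiB of WAL headroom")
```

If the result exceeds the space available on the WAL volume, the options are more disk, a shorter tolerated outage, or a cap on slot retention.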
Summary
Postgres streaming replication ships WAL and replays it. LSN positions define progress, replication slots prevent WAL loss but can fill disks, and synchronous replication trades latency for lower RPO. Failover creates a new timeline and requires deliberate reconfiguration. Production HA starts with understanding WAL flow end-to-end.