Replication Types: Durability vs Latency Tradeoff
Replication is not only about copying data. The replication mode determines how much acknowledged data you can lose on failover, which must fit within your RPO, and how quickly the primary can acknowledge writes (latency).
The core distinction:
- Synchronous replication
- Asynchronous replication
Asynchronous Replication
In async replication, the primary acknowledges a write after persisting locally. Replicas receive changes afterward.
Flow:
- Client sends write to primary
- Primary writes to WAL/binlog and commits
- Primary responds to client
- Replica receives and applies changes later
Advantages:
- Low write latency
- High throughput
Risk:
- If the primary crashes before a replica has received a change, that write is lost on failover even though the client was told it committed.
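The loss window described above can be made concrete with a small in-memory model. This is a minimal sketch, not any engine's actual behavior: `AsyncPrimary`, its `outbox`, and the `ship` method are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncPrimary:
    """Hypothetical primary that acknowledges before shipping to replicas."""
    log: list = field(default_factory=list)     # writes committed locally
    outbox: list = field(default_factory=list)  # committed but not yet shipped

    def write(self, record: str) -> str:
        self.log.append(record)     # persist locally (stands in for WAL/binlog)
        self.outbox.append(record)  # queue for later shipping
        return "ack"                # client is acknowledged immediately

    def ship(self, replica: list, max_records: int) -> None:
        # Send at most `max_records` queued writes to the replica.
        batch = self.outbox[:max_records]
        self.outbox = self.outbox[max_records:]
        replica.extend(batch)


primary = AsyncPrimary()
replica = []

for i in range(5):
    primary.write(f"txn-{i}")        # all five writes are acknowledged

primary.ship(replica, max_records=3)  # only three reach the replica

# The primary crashes here and the replica is promoted.
lost = primary.log[len(replica):]
print(lost)  # ['txn-3', 'txn-4']
```

Both lost transactions were acknowledged to the client, which is exactly why async replication implies a nonzero RPO.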
Synchronous Replication
In synchronous replication, the primary waits for one or more replicas to confirm receipt (and sometimes durability) before acknowledging commit.
Flow:
- Client sends write to primary
- Primary writes locally and ships the change to the replica
- Replica confirms receipt/durability
- Primary acknowledges commit to the client
Advantages:
- Stronger durability guarantees
- Lower RPO
Cost:
- Higher latency (network round-trip)
- Throughput sensitive to slow replicas
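The synchronous flow and its main cost can be sketched as follows. The `Replica` class, `CommitStalled` exception, and `sync_commit` function are hypothetical; a real engine typically blocks or times out rather than raising immediately, so the exception here only stands in for a stalled commit.

```python
class CommitStalled(Exception):
    """Raised in this sketch when the replica cannot confirm a write."""


class Replica:
    """Hypothetical in-memory replica; healthy=False models an outage."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.log = []

    def receive(self, record):
        if not self.healthy:
            return False
        self.log.append(record)
        return True


def sync_commit(record, primary_log, replica):
    primary_log.append(record)       # 1. primary writes locally
    if not replica.receive(record):  # 2. wait for the replica to confirm
        # The write exists locally but is never acknowledged to the client.
        raise CommitStalled(f"no replica confirmation for {record}")
    return "ack"                     # 3. acknowledge only after confirmation


primary_log = []
replica = Replica()
result = sync_commit("txn-0", primary_log, replica)

replica.healthy = False              # replica outage
try:
    sync_commit("txn-1", primary_log, replica)
    stalled = False
except CommitStalled:
    stalled = True
```

Every acknowledged write is on the replica, but a single unhealthy replica is enough to stop commits.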
Quorum-Based Replication
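Some systems allow quorum acknowledgment: the primary commits once any k of n replicas confirm (e.g., 1 of 2). A single slow or failed replica then no longer blocks writes, which balances durability and availability.
Quorum strategies require careful failure modeling: if two sides of a network partition can each gather enough acknowledgments or promote their own primary, the result is split-brain.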
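To illustrate how a quorum changes commit behavior, the sketch below computes when a commit completes given per-replica acknowledgment times. The function name and the convention that `None` means "replica never answers" are assumptions made for this example.

```python
def quorum_commit_latency_ms(ack_latencies_ms, required_acks):
    """Return the time at which `required_acks` replicas have confirmed.

    `ack_latencies_ms` holds one entry per replica; None means the replica
    never answers. Returns None when the quorum cannot be reached, which
    corresponds to a write stall.
    """
    if required_acks < 1:
        raise ValueError("required_acks must be at least 1")
    answered = sorted(l for l in ack_latencies_ms if l is not None)
    if len(answered) < required_acks:
        return None
    # The commit completes when the k-th fastest replica has answered.
    return answered[required_acks - 1]


one_slow = [2.0, 250.0]   # one fast replica, one slow replica
one_down = [2.0, None]    # one fast replica, one unreachable replica

any_one = quorum_commit_latency_ms(one_slow, required_acks=1)   # 2.0
all_two = quorum_commit_latency_ms(one_slow, required_acks=2)   # 250.0
degraded = quorum_commit_latency_ms(one_down, required_acks=1)  # 2.0
stalled = quorum_commit_latency_ms(one_down, required_acks=2)   # None
```

Requiring 1 of 2 hides the slow or failed replica; requiring 2 of 2 makes the commit as slow as the slowest replica and stalls it when either one is down.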
Durability Semantics Matter
Not all “sync” guarantees are equal:
- Replica received WAL
- Replica flushed to disk
- Replica applied transaction
Each level changes durability and read consistency guarantees.
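One way to reason about these levels is to model them as an ordered enum. The names below are hypothetical, and the strict ordering (applied implies flushed implies received) is an assumption of this sketch; check how your engine defines each level.

```python
from enum import IntEnum


class AckLevel(IntEnum):
    """What the replica has done before the primary acknowledges."""
    RECEIVED = 1  # change is in replica memory / OS buffers
    FLUSHED = 2   # change is on the replica's durable storage
    APPLIED = 3   # change is visible to queries on the replica


def survives_replica_crash(level: AckLevel) -> bool:
    # Only flushed data is guaranteed to outlive a replica restart.
    return level >= AckLevel.FLUSHED


def visible_to_replica_reads(level: AckLevel) -> bool:
    # Reads on the replica see the write only once it has been applied.
    return level >= AckLevel.APPLIED


for level in AckLevel:
    print(level.name, survives_replica_crash(level), visible_to_replica_reads(level))
```

Under these assumptions, a "sync" mode that only waits for receipt protects against a primary crash but not against a simultaneous crash of primary and replica, and none of the levels below applied gives read-your-writes on the replica.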
Replication and RPO
RPO (Recovery Point Objective) defines acceptable data loss.
- Async replication → RPO > 0 possible
- Sync replication → RPO close to 0 (if properly configured)
Business requirements should drive replication mode.
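As a rough illustration, observed replica lag can be compared against the RPO target. This sketch assumes lag is sampled in seconds and treats the worst observed lag as the data-loss exposure; the function names and sample values are made up for the example.

```python
def worst_case_exposure_s(lag_samples_s):
    """Worst observed replica lag, used here as a proxy for potential loss."""
    if not lag_samples_s:
        raise ValueError("need at least one lag sample")
    return max(lag_samples_s)


def meets_rpo(lag_samples_s, rpo_s):
    """True if the worst observed lag stays within the RPO target."""
    return worst_case_exposure_s(lag_samples_s) <= rpo_s


# Hypothetical lag samples in seconds, including one spike.
samples = [0.2, 0.4, 3.5, 0.3]

ok_for_5s = meets_rpo(samples, rpo_s=5.0)  # True
ok_for_1s = meets_rpo(samples, rpo_s=1.0)  # False: the 3.5 s spike exceeds it
```

The spike matters more than the average: a failover during the 3.5 s spike would lose up to 3.5 s of acknowledged writes.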
Replication and RTO
RTO (Recovery Time Objective) depends on:
- Failover automation
- Replica catch-up speed
- Cluster orchestration
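A back-of-the-envelope RTO estimate can combine these factors. The breakdown into detection, replay, and promotion time is a simplifying assumption of this sketch, and all numbers are hypothetical.

```python
def estimate_rto_s(detection_s, backlog_bytes, replay_bytes_per_s, promotion_s):
    """Estimate recovery time as detection + replica catch-up + promotion."""
    if replay_bytes_per_s <= 0:
        raise ValueError("replay rate must be positive")
    catch_up_s = backlog_bytes / replay_bytes_per_s
    return detection_s + catch_up_s + promotion_s


rto = estimate_rto_s(
    detection_s=10,                  # time for automation to declare failure
    backlog_bytes=500_000_000,       # unapplied changes on the replica
    replay_bytes_per_s=50_000_000,   # replica catch-up speed
    promotion_s=5,                   # orchestration: promote and reroute clients
)
print(rto)  # 25.0
```

In this example replica catch-up accounts for 10 of the 25 seconds, so a lagging replica lengthens recovery as well as widening the loss window.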
Network Dependency
Synchronous replication couples write latency to network health. Network jitter or cross-region replication can severely impact p95/p99 latency.
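The effect on tail latency can be illustrated with a simple simulation. This sketch assumes commit latency is a fixed local write time plus one normally distributed network round trip; the millisecond values for the same-zone and cross-region cases are illustrative, not measurements.

```python
import random
import statistics


def commit_latencies_ms(local_ms, rtt_ms, jitter_ms, n=1000, seed=0):
    """Simulate sync commit latency: local write plus one jittery round trip."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [local_ms + max(0.0, rng.gauss(rtt_ms, jitter_ms)) for _ in range(n)]


def p99(samples):
    # statistics.quantiles with n=100 returns 99 cut points; index 98 is p99.
    return statistics.quantiles(samples, n=100)[98]


same_zone = commit_latencies_ms(local_ms=1.0, rtt_ms=0.5, jitter_ms=0.2)
cross_region = commit_latencies_ms(local_ms=1.0, rtt_ms=70.0, jitter_ms=15.0)

print(f"same-zone p99:    {p99(same_zone):.1f} ms")
print(f"cross-region p99: {p99(cross_region):.1f} ms")
```

Because every commit pays the round trip, network jitter shows up directly in the tail percentiles rather than being averaged away.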
Failure Modes in Production
- Replica lag spike: async replica falls behind.
- Write stall: sync replica unavailable → primary blocks commits.
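- Split brain: failover promotes a replica while the old primary still accepts writes.
- Data loss: async primary crashes before an acknowledged write reaches any replica.
- Quorum misconfiguration: more acknowledgments are required than healthy replicas can provide, so writes stall unexpectedly.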
Operational Checklist
- Define RPO and RTO explicitly.
- Choose replication mode based on business risk tolerance.
- Monitor replica lag continuously.
- Test primary crash scenarios in staging.
- Understand what “sync” confirmation actually means in your engine.
- Document failover decision rules.
- Avoid cross-region sync replication unless the latency budget allows it.
- Monitor commit latency impact when enabling sync mode.
- Test network degradation scenarios.
- Have a rollback plan for replication mode changes.
Summary
Replication mode determines your durability-latency tradeoff. Async maximizes throughput but risks data loss. Sync minimizes RPO but increases latency and sensitivity to replica health. Production engineering requires explicit RPO/RTO alignment with replication configuration.