Replication Lag: Causes, Metrics, and Mitigations
Replication Lag Basics: The Hidden Consistency Window
Replication lag is the gap, measured in time or log offset, between what a primary node has committed and what its replicas have applied. In asynchronous or semi-synchronous replication models, lag is the window during which replicas may return stale data.
Replication lag is not inherently a bug. It is a measurable tradeoff. However, when unmonitored or misunderstood, it becomes a source of data inconsistency, failover risk, and production incidents.
What Causes Replication Lag?
Replication lag typically occurs due to one or more of the following factors:
- High write throughput exceeding replica processing capacity
- Disk I/O bottlenecks (slow fsync)
- Network latency or packet loss
- Cross-region replication delay
- Replica CPU saturation
- Large transaction batching
Lag is rarely caused by a single event. It is usually the result of sustained pressure.
Types of Replication Lag Metrics
Time-Based Lag
Measured in seconds or milliseconds behind the primary.
Log Offset Lag
Difference between the primary's and the replica's log sequence numbers (LSNs) or commit indexes.
Transaction Count Lag
Number of unprocessed transactions in the replication queue.
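Time-based lag is intuitive, but offset-based lag gives more precise operational insight: it measures the backlog directly, whereas timestamp comparisons can be skewed by clock differences between nodes.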
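The three metrics can be derived from a single observation of a replica. The sketch below is illustrative only: the `ReplicaSample` fields and the `lag_metrics` helper are hypothetical names, not the API of any particular database, and the timestamps are assumed to come from a common clock.

```python
from dataclasses import dataclass


@dataclass
class ReplicaSample:
    """One observation of a replica (hypothetical field names)."""
    primary_commit_offset: int   # latest log position committed on the primary
    replica_applied_offset: int  # latest log position applied on the replica
    primary_commit_time: float   # commit time (s) of the primary's latest entry
    replica_applied_time: float  # commit time (s) of the replica's latest applied entry
    queued_transactions: int     # transactions received but not yet applied


def lag_metrics(s: ReplicaSample) -> dict:
    """Return time-based, offset-based, and transaction-count lag for one sample."""
    return {
        "time_lag_s": max(0.0, s.primary_commit_time - s.replica_applied_time),
        "offset_lag": max(0, s.primary_commit_offset - s.replica_applied_offset),
        "txn_lag": s.queued_transactions,
    }


sample = ReplicaSample(
    primary_commit_offset=10_500,
    replica_applied_offset=10_000,
    primary_commit_time=100.0,
    replica_applied_time=99.5,
    queued_transactions=12,
)
print(lag_metrics(sample))
```

Clamping at zero keeps a replica that is momentarily reported ahead of the primary (for example, because the two values were sampled at slightly different instants) from producing negative lag.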
Production Scenario: Stale Reads During Peak Traffic
Symptom
Users intermittently observe outdated account balances during high traffic events.
Root Cause
Replication lag increases from 50 ms to 2 seconds under a write surge, and reads are served from the lagging replicas.
Diagnosis
- Replication delay metric spikes during peak hours.
- Replica CPU near saturation.
- Primary stable and responsive.
Resolution
- Scale read replicas horizontally.
- Increase replica hardware performance.
- Route critical reads to the primary during lag spikes.
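The last resolution step, routing critical reads to the primary while lag is elevated, can be sketched as a small routing function. This is a minimal illustration: `choose_read_target`, the two staleness budgets, and the replica names are assumptions made for the example, not part of any specific proxy or driver.

```python
def choose_read_target(replica_lags_ms, critical,
                       critical_budget_ms=100.0, normal_budget_ms=5000.0):
    """Pick where to send a read.

    replica_lags_ms: dict mapping replica name -> current lag in milliseconds.
    critical: True for reads that must not be stale (e.g. account balances).
    The two budgets are hypothetical staleness limits per read class.
    """
    budget = critical_budget_ms if critical else normal_budget_ms
    if not replica_lags_ms:
        return "primary"  # no replicas known, fall back to the primary
    # Consider only the least-lagged replica.
    best = min(replica_lags_ms, key=replica_lags_ms.get)
    if replica_lags_ms[best] <= budget:
        return best
    return "primary"  # every replica exceeds the budget for this read class


# Normal conditions: 50 ms of lag is within the critical budget.
print(choose_read_target({"r1": 50.0, "r2": 80.0}, critical=True))
# Lag spike: critical reads move to the primary, other reads stay on replicas.
print(choose_read_target({"r1": 2000.0, "r2": 2500.0}, critical=True))
print(choose_read_target({"r1": 2000.0, "r2": 2500.0}, critical=False))
```

Separate budgets per read class keep the primary from absorbing all read traffic during a spike; only the reads that cannot tolerate staleness are moved.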
Lag and Failover Risk
Replication lag directly affects failover safety.
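If the primary crashes while replicas are behind:
- Recent writes may not exist on the promoted replica.
- The data loss window equals the replication lag at the moment of the crash.
With asynchronous replication, any write the primary acknowledged but the promoted replica never received is lost on failover, so the potential data loss scales with lag.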
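To make the exposure concrete, the sketch below picks the most caught-up replica for promotion and reports how many log entries would be lost. The function name and the offset dictionary are hypothetical; a real failover tool would also have to consider replica health and topology.

```python
def pick_promotion_candidate(primary_offset, replica_received_offsets):
    """Return (replica_name, entries_lost) for the most caught-up replica.

    primary_offset: last log position the primary committed before crashing.
    replica_received_offsets: dict of replica name -> last log position received.
    """
    if not replica_received_offsets:
        raise ValueError("no replicas available for promotion")
    # The replica with the highest received offset loses the fewest writes.
    name = max(replica_received_offsets, key=replica_received_offsets.get)
    lost = max(0, primary_offset - replica_received_offsets[name])
    return name, lost


candidate, lost = pick_promotion_candidate(1000, {"a": 950, "b": 990})
print(candidate, lost)
```

In this example replica "b" is promoted and the 10 entries it never received are the data loss window, expressed in log entries rather than time.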
Lag Amplification Under Load
Lag tends to increase nonlinearly as replica utilization approaches saturation.
As utilization nears 100%:
- Queue depth increases.
- Apply latency increases.
- Commit offset gap widens rapidly.
This matches queueing-theory behavior: delay grows sharply as a server's utilization approaches its capacity.
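As a rough illustration, assume the replica's apply path behaves like an M/M/1 queue (Poisson arrivals, exponentially distributed apply times, one worker). That is a simplification, not a model of any real replica, but it shows the shape of the curve: the mean time an entry spends in the system is 1 / (service rate - arrival rate). The service rate of 1000 transactions per second is an assumed figure.

```python
def mm1_mean_time_in_system(arrival_rate, service_rate):
    """Mean time in system (seconds) for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return float("inf")  # the queue is unstable and the backlog grows without bound
    return 1.0 / (service_rate - arrival_rate)


service_rate = 1000.0  # hypothetical: the replica applies 1000 transactions per second
for utilization in (0.5, 0.9, 0.99):
    delay = mm1_mean_time_in_system(utilization * service_rate, service_rate)
    print(f"utilization={utilization:.2f} mean_delay_ms={delay * 1000:.1f}")
```

Under these assumptions, moving from 50% to 90% utilization multiplies the mean delay by five, and moving from 90% to 99% multiplies it by ten again.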
Cross-Region Lag
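In multi-region deployments, baseline replication delay is roughly the one-way network latency plus processing overhead; waiting for a replica's acknowledgement, as synchronous replication does, costs a full round trip (RTT).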
Under congestion, cross-region lag can spike unpredictably.
Design decisions must consider whether cross-region replication is synchronous or asynchronous.
Observability Signals
- Replication offset difference
- Replica apply duration
- Replication queue depth
- Replica CPU utilization
- Disk fsync latency
Monitoring only time-based lag is insufficient. Offset and queue depth reveal early pressure signals.
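One way to turn offset lag into an early-warning signal is to watch its growth rate rather than its absolute value. The sketch below fits a least-squares slope to recent (timestamp, offset lag) samples; the function names and the alert threshold are assumptions for illustration.

```python
def backlog_growth_rate(samples):
    """Least-squares slope of offset lag over time.

    samples: list of (timestamp_seconds, offset_lag) pairs.
    Returns log entries per second; positive means the replica is falling behind.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def is_early_warning(samples, max_growth_per_s=5.0):
    """Flag sustained backlog growth above a hypothetical threshold."""
    return backlog_growth_rate(samples) > max_growth_per_s


recent = [(0, 100), (10, 200), (20, 300)]
print(backlog_growth_rate(recent), is_early_warning(recent))
```

A steadily positive slope indicates that the replica is applying more slowly than the primary is writing, which can be visible before time-based lag crosses an alert threshold.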
Mitigation Strategies
- Limit write burst size
- Increase replica parallel apply workers
- Optimize disk I/O
- Use synchronous replication for critical writes
- Temporarily disable replica reads under high lag
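Limiting write burst size is commonly done with a token bucket placed in front of the write path. The sketch below is a minimal version under stated assumptions: `WriteBurstLimiter` is a hypothetical class, the caller passes the current time explicitly so the behavior is deterministic, and rejected writes are expected to be retried or queued by the caller.

```python
class WriteBurstLimiter:
    """Token bucket: allows bursts of up to `capacity` writes, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def allow(self, now, cost=1.0):
        """Return True if a write of the given cost may proceed at time `now`."""
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


limiter = WriteBurstLimiter(rate=10.0, capacity=5)
burst = [limiter.allow(now=0.0) for _ in range(6)]
print(burst)  # the sixth write in the same instant is rejected
```

The capacity bounds how large a burst can reach the replicas at once, while the refill rate bounds sustained write throughput; both would be tuned against measured replica apply capacity.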
Lag as an SLO
Mature systems define acceptable lag windows:
- 99% of replication lag samples must be below 200 ms
- Maximum lag must not exceed 1 second
- Replica divergence above a defined threshold must trigger an automatic alert
This converts consistency drift into measurable operational policy.
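A lag SLO of this kind can be checked mechanically against a window of samples. The sketch below uses a nearest-rank percentile and the example limits from the list above as defaults; the function names and the returned dictionary layout are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_lag_slo(lag_ms_samples, p99_limit_ms=200.0, max_limit_ms=1000.0):
    """Compare a window of lag samples (milliseconds) with the two SLO limits."""
    p99 = percentile(lag_ms_samples, 99)
    worst = max(lag_ms_samples)
    return {
        "p99_ms": p99,
        "max_ms": worst,
        "p99_ok": p99 < p99_limit_ms,
        "max_ok": worst <= max_limit_ms,
    }


# 99 healthy samples and one 1.5 s spike: the p99 target holds, the maximum does not.
window = [50.0] * 99 + [1500.0]
print(evaluate_lag_slo(window))
```

Evaluating both limits separately shows why the SLO lists them separately: a single large spike can leave the percentile target intact while still violating the maximum.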
Failure Injection Test
# Replication lag stress test
1) Apply sustained write load
2) Measure replication offset growth
3) Saturate replica disk artificially
4) Observe lag spike
5) Validate alerting triggers
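Before running this test against real infrastructure, the expected shape of the result can be rehearsed with a toy simulation. The sketch below is not a load-testing tool: all rates, the degradation time, and the alert threshold are made-up parameters, and the replica is modeled as a simple backlog counter.

```python
def simulate_lag_test(write_rate, apply_rate, degraded_apply_rate,
                      degrade_at_s, duration_s, alert_backlog):
    """Simulate backlog growth in one-second steps.

    Returns (backlog_history, alert_time_s); alert_time_s is None if the
    alert threshold is never reached.
    """
    backlog = 0.0
    history = []
    alert_time = None
    for t in range(duration_s):
        # Step 3: the replica slows down once its disk is saturated.
        capacity = apply_rate if t < degrade_at_s else degraded_apply_rate
        # Steps 1-2: writes arrive and the replica applies what it can.
        backlog = max(0.0, backlog + write_rate - capacity)
        history.append(backlog)
        # Steps 4-5: record the first second at which the alert threshold is reached.
        if alert_time is None and backlog >= alert_backlog:
            alert_time = t
    return history, alert_time


history, alert_time = simulate_lag_test(
    write_rate=1000, apply_rate=1200, degraded_apply_rate=600,
    degrade_at_s=10, duration_s=30, alert_backlog=2000,
)
print(alert_time, history[-1])
```

With these parameters the backlog stays at zero until the disk is degraded at second 10, then grows by 400 entries per second, so the alert fires a few seconds after degradation rather than immediately. The same gap between fault and alert is what the real test should measure.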
Operational Checklist
- Is replication lag measured continuously?
- Are alerts tied to user-facing workflows?
- Is failover tested with non-zero lag?
- Are replica reads disabled automatically under extreme lag?
- Is hardware provisioned with headroom?
Key Takeaways
- Replication lag defines the staleness window.
- Lag increases nonlinearly near saturation.
- Asynchronous replication carries a data loss risk that scales with the lag window.
- Monitoring offset and queue depth provides early warning.
- Lag must be treated as a measurable SLO.
Replication lag is not merely a performance metric. It is a correctness boundary. Understanding and monitoring it is fundamental to operating distributed data systems safely at scale.