Replication Lag: Causes, Metrics, and Mitigations

Replication lag measures how far replicas fall behind the primary. This lesson explains how lag occurs, how to measure it correctly, and how it impacts consistency, failover safety, and user-facing correctness in production systems.

Replication Lag Basics: The Hidden Consistency Window

Replication lag is the time or offset difference between a primary node and its replicas. In asynchronous or semi-synchronous replication models, lag represents the window during which replicas may return stale data.

Replication lag is not inherently a bug. It is a measurable tradeoff. However, when unmonitored or misunderstood, it becomes a source of data inconsistency, failover risk, and production incidents.

What Causes Replication Lag?

Replication lag typically occurs due to one or more of the following factors:

  • High write throughput exceeding replica processing capacity
  • Disk I/O bottlenecks (slow fsync)
  • Network latency or packet loss
  • Cross-region replication delay
  • Replica CPU saturation
  • Large transaction batching

Lag is rarely caused by a single event. It is usually the result of sustained pressure.

Types of Replication Lag Metrics

Time-Based Lag

Measured in seconds or milliseconds behind the primary.

Log Offset Lag

The difference between primary and replica log sequence numbers (LSNs) or commit indexes.

Transaction Count Lag

The number of unprocessed transactions in the replication queue.

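Time-based lag is intuitive, but offset-based lag is more precise operationally: it does not depend on clock synchronization between nodes, and it states exactly how much log the replica has yet to apply.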
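
To make the three metrics concrete, here is a minimal Python sketch that derives all of them from a pair of position snapshots. The ReplicationPosition type and its field names are hypothetical and used only for illustration. Real systems expose these values through their own status views.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPosition:
    """Hypothetical snapshot of one node's replication state."""
    log_offset: int         # last applied log position (an LSN or commit index)
    last_commit_ts: float   # primary-side commit time of the last applied transaction, in seconds
    applied_txn_count: int  # total transactions applied so far


def lag_metrics(primary: ReplicationPosition, replica: ReplicationPosition) -> dict:
    """Return time-based, offset-based, and transaction-count lag, each clamped at zero."""
    return {
        "time_lag_s": max(0.0, primary.last_commit_ts - replica.last_commit_ts),
        "offset_lag": max(0, primary.log_offset - replica.log_offset),
        "txn_lag": max(0, primary.applied_txn_count - replica.applied_txn_count),
    }


primary = ReplicationPosition(log_offset=10_500, last_commit_ts=1000.0, applied_txn_count=420)
replica = ReplicationPosition(log_offset=10_100, last_commit_ts=998.5, applied_txn_count=404)
print(lag_metrics(primary, replica))
# {'time_lag_s': 1.5, 'offset_lag': 400, 'txn_lag': 16}
```

Because the sketch compares commit timestamps that both originate on the primary, the time-based figure here does not rely on the replica's clock.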

Production Scenario: Stale Reads During Peak Traffic

Symptom

Users intermittently observe outdated account balances during high traffic events.

Root Cause

Replication lag rises from 50 ms to 2 seconds during a write surge, while reads continue to be served from the lagging replicas.

Diagnosis

  • Replication delay metric spikes during peak hours.
  • Replica CPU near saturation.
  • Primary stable and responsive.

Resolution

  • Scale read replicas horizontally.
  • Increase replica hardware performance.
  • Route critical reads to primary during lag spike.
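
The last resolution step, routing critical reads to the primary during a lag spike, might be sketched as follows. The Replica type, the choose_read_target function, and the lag budgets of 100 ms and 1000 ms are assumptions chosen for illustration, not values taken from any particular system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Replica:
    name: str
    lag_ms: float  # most recently measured lag for this replica


def choose_read_target(replicas: Sequence[Replica], critical: bool,
                       critical_max_lag_ms: float = 100.0,
                       default_max_lag_ms: float = 1000.0) -> str:
    """Return the least-lagged replica within budget, or 'primary' if none qualifies."""
    budget = critical_max_lag_ms if critical else default_max_lag_ms
    eligible = [r for r in replicas if r.lag_ms <= budget]
    if not eligible:
        return "primary"  # fall back to the primary rather than serve stale data
    return min(eligible, key=lambda r: r.lag_ms).name


normal = [Replica("replica-a", 50.0), Replica("replica-b", 80.0)]
spike = [Replica("replica-a", 2000.0), Replica("replica-b", 900.0)]

print(choose_read_target(normal, critical=True))   # replica-a
print(choose_read_target(spike, critical=True))    # primary
print(choose_read_target(spike, critical=False))   # replica-b
```

Falling back to the primary protects correctness but shifts read load onto it, so this kind of routing works best when only a small share of reads is marked critical.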

Lag and Failover Risk

Replication lag directly affects failover safety.

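If the primary crashes while replicas are behind:

  • Recent writes may not exist on the promoted replica.
  • The data loss window is bounded by the replication lag at the moment of failure.

With asynchronous replication, a failover can lose every write that the primary acknowledged but the promoted replica had not yet received, a window roughly equal to the lag at the time of the crash.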
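
As a rough illustration of that exposure, the sketch below picks the most up-to-date replica for promotion and reports how much log would be missing. The function name and the offsets are hypothetical, and the sketch assumes each replica reports the last log offset it has durably received, which may be ahead of what it has applied.

```python
from typing import Dict, Tuple


def pick_promotion_candidate(primary_last_offset: int,
                             replica_received_offsets: Dict[str, int]) -> Tuple[str, int]:
    """Return (replica name, missing log offsets) for the most advanced replica."""
    if not replica_received_offsets:
        raise ValueError("no replicas available for promotion")
    best = max(replica_received_offsets, key=lambda name: replica_received_offsets[name])
    # Anything the primary committed beyond this point is lost on promotion.
    missing = max(0, primary_last_offset - replica_received_offsets[best])
    return best, missing


offsets = {"replica-a": 9_800, "replica-b": 9_950, "replica-c": 9_400}
candidate, lost = pick_promotion_candidate(10_000, offsets)
print(candidate, lost)  # replica-b 50
```

Promoting the most advanced replica minimizes the loss but does not eliminate it. Only a replica that has received everything the primary acknowledged yields a missing count of zero.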

Lag Amplification Under Load

Lag tends to increase nonlinearly as replica utilization approaches saturation.

As utilization nears 100%:

  • Queue depth increases.
  • Apply latency increases.
  • Commit offset gap widens rapidly.

This matches the queueing-theory result that waiting time grows without bound as utilization approaches capacity.
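
One simple way to see the nonlinearity is the textbook M/M/1 queue, where the mean time a job spends in the system is 1 / (service_rate × (1 − utilization)). Treating a replica's apply loop as an M/M/1 queue is a simplifying assumption made here for illustration. Real replicas batch work and may apply in parallel, so the numbers are indicative only.

```python
def mean_apply_delay_s(service_rate_per_s: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: 1 / (mu * (1 - rho))."""
    if service_rate_per_s <= 0:
        raise ValueError("service rate must be positive")
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (service_rate_per_s * (1.0 - utilization))


# Hypothetical replica that can apply 1000 transactions per second.
for rho in (0.50, 0.90, 0.99):
    delay_ms = mean_apply_delay_s(1000.0, rho) * 1000.0
    print(f"utilization {rho:.2f}: mean delay {delay_ms:.1f} ms")
```

Under these assumptions the mean delay is about 2 ms at 50% utilization, about 10 ms at 90%, and about 100 ms at 99%, so the last few percent of utilization cost far more than the first fifty.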

Cross-Region Lag

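In multi-region deployments, the base replication delay is bounded below by the network latency between regions plus processing overhead: roughly the one-way latency for an asynchronous stream, and at least one full round trip (RTT) added to commit latency when replication is synchronous.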

Under congestion, cross-region lag can spike unpredictably.

Design decisions must consider whether cross-region replication is synchronous or asynchronous.

Observability Signals

  • Replication offset difference
  • Replica apply duration
  • Replication queue depth
  • Replica CPU utilization
  • Disk fsync latency

Monitoring only time-based lag is insufficient. Offset and queue depth reveal early pressure signals.
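
A sketch of how offset lag can act as an early warning: estimate how fast the gap is growing and project when it would cross an alert threshold. The function names, the sample values, and the threshold of 1500 are hypothetical. The projection assumes the growth rate stays constant, which is a deliberate simplification.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, offset lag)


def offset_lag_growth_rate(samples: Sequence[Sample]) -> float:
    """Average growth of offset lag per second between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (lag1 - lag0) / (t1 - t0)


def seconds_until_threshold(samples: Sequence[Sample], threshold: int) -> Optional[float]:
    """Projected seconds until lag reaches threshold, or None if lag is not growing."""
    rate = offset_lag_growth_rate(samples)
    current = samples[-1][1]
    if current >= threshold:
        return 0.0
    if rate <= 0:
        return None
    return (threshold - current) / rate


samples = [(0.0, 100), (10.0, 300), (20.0, 500)]
print(offset_lag_growth_rate(samples))         # 20.0
print(seconds_until_threshold(samples, 1500))  # 50.0
```

A positive growth rate sustained over several sampling intervals indicates that the replica is applying more slowly than the primary is writing, even while the absolute lag still looks acceptable.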

Mitigation Strategies

  • Limit write burst size
  • Increase replica parallel apply workers
  • Optimize disk I/O
  • Use synchronous replication for critical writes
  • Temporarily disable replica reads under high lag
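
Limiting write burst size is commonly done with a token bucket, sketched below. The TokenBucket class, its parameters, and the rates are illustrative assumptions. The clock is passed in explicitly so that the behavior is deterministic and easy to test.

```python
class TokenBucket:
    """Admit writes at a steady rate while allowing bursts up to a fixed size."""

    def __init__(self, rate_per_s: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then admit the write if enough tokens remain."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_s=10.0, burst=5)
burst_results = [bucket.try_acquire(now=0.0) for _ in range(8)]
print(burst_results)                # first 5 admitted, last 3 rejected
print(bucket.try_acquire(now=0.5))  # True: 0.5 s at 10 tokens/s refills the bucket
```

Rejected writes can be queued or retried by the caller. Either way the replicas see a smoother stream than the raw client traffic.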

Lag as an SLO

Mature systems define acceptable lag windows:

  • 99% of replication lag samples must be below 200 ms
  • Maximum lag must not exceed 1 second
  • Replica divergence above a defined threshold must trigger an automatic alert

This converts consistency drift into measurable operational policy.
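
Checking lag samples against such a policy could look like the sketch below. The evaluate_lag_slo function and its defaults mirror the example thresholds above (99% under 200 ms, maximum of 1 second) and are illustrative only.

```python
from typing import Sequence


def evaluate_lag_slo(samples_ms: Sequence[float],
                     required_percent: int = 99,
                     percentile_budget_ms: float = 200.0,
                     max_budget_ms: float = 1000.0) -> dict:
    """Report whether the lag samples satisfy both the percentile and the maximum budget."""
    if not samples_ms:
        raise ValueError("no lag samples to evaluate")
    within = sum(1 for s in samples_ms if s < percentile_budget_ms)
    worst = max(samples_ms)
    # Integer comparison avoids floating-point rounding at the 99% boundary.
    percentile_ok = within * 100 >= required_percent * len(samples_ms)
    return {
        "samples_within_budget": within,
        "max_lag_ms": worst,
        "slo_met": percentile_ok and worst <= max_budget_ms,
    }


healthy = [50.0] * 99 + [800.0]
degraded = [50.0] * 98 + [300.0, 1200.0]
print(evaluate_lag_slo(healthy)["slo_met"])   # True
print(evaluate_lag_slo(degraded)["slo_met"])  # False
```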

Failure Injection Test

# Replication lag stress test
1) Apply sustained write load
2) Measure replication offset growth
3) Saturate replica disk artificially
4) Observe lag spike
5) Validate alerting triggers
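
Before running this test against real infrastructure, the expected shape of the result can be explored with a toy simulation. The sketch below is not a real fault-injection tool: the function names, the per-tick rates, and the alert threshold are all hypothetical, and a saturated disk is modeled simply as a temporary drop in the replica's apply rate.

```python
from typing import List, Optional, Sequence


def simulate_backlog(write_rate: int, apply_rates: Sequence[int]) -> List[int]:
    """Track replication backlog per tick for a constant write rate and varying apply rate."""
    backlog = 0
    history = []
    for capacity in apply_rates:
        # Backlog grows by new writes and shrinks by what the replica applied this tick.
        backlog = max(0, backlog + write_rate - capacity)
        history.append(backlog)
    return history


def first_alert_tick(history: Sequence[int], threshold: int) -> Optional[int]:
    """Index of the first tick whose backlog exceeds the threshold, or None."""
    for tick, backlog in enumerate(history):
        if backlog > threshold:
            return tick
    return None


# Steps 1-3: sustained writes, with the replica disk "saturated" during ticks 3-6.
apply_rates = [120] * 3 + [40] * 4 + [120] * 3
history = simulate_backlog(write_rate=100, apply_rates=apply_rates)
print(history)                         # [0, 0, 0, 60, 120, 180, 240, 220, 200, 180]
print(first_alert_tick(history, 150))  # 5
```

In this run the backlog keeps growing for as long as the apply rate stays below the write rate, and it drains slowly afterwards because the replica has only 20 units per tick of spare capacity.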

Operational Checklist

  • Is replication lag measured continuously?
  • Are alerts tied to user-facing workflows?
  • Is failover tested with non-zero lag?
  • Are replica reads disabled automatically under extreme lag?
  • Is hardware provisioned with headroom?

Key Takeaways

  • Replication lag defines the staleness window.
  • Lag increases nonlinearly near saturation.
  • Asynchronous replication introduces data loss risk of up to the lag window at the moment of failure.
  • Monitoring offset and queue depth provides early warning.
  • Lag must be treated as a measurable SLO.

Replication lag is not merely a performance metric. It is a correctness boundary. Understanding and monitoring it is fundamental to operating distributed data systems safely at scale.