Replication Lag Debugging (Root Causes + Fixes)

Replication lag occurs when replicas fall behind the primary, risking stale reads and failover inconsistency. This lesson explains detection metrics, root causes, recovery strategies, and production debugging procedures.

Replication Lag Debugging: When Replicas Fall Behind Reality

Replication lag occurs when follower or replica nodes fall behind the primary node in applying committed updates. In distributed systems, replication lag introduces stale reads, delayed failover readiness, and potential data correctness risks during leader transitions.

Lag is often invisible until it becomes critical.

Why Replication Lag Matters

Read replicas may serve outdated data.
Failover to a lagging replica risks data loss.
Read-after-write consistency may break.
Event-processing pipelines may be delayed.

Lag is both a performance and correctness issue.

Common Causes of Replication Lag

High write throughput on primary.
Network latency or packet loss.
Slow disk I/O on replica.
Large transactions or batch updates.
Resource contention (CPU, memory).
Single-threaded log apply on the replica.

Lag is usually capacity- or infrastructure-related.

Key Metrics to Monitor

Replication delay in seconds.
Log sequence number (LSN) difference.
Replica apply queue depth.
Network throughput between nodes.
Disk write latency on replica.

The LSN gap provides a precise measure of divergence, because it counts unapplied log rather than elapsed time.
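
As a rough illustration of how an LSN gap can be computed, the sketch below assumes PostgreSQL-style LSN strings (two hexadecimal halves separated by a slash). Other databases expose log positions differently, and the helper names `parse_lsn` and `lsn_gap_bytes` are hypothetical, not part of any library.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL-style LSN such as '16/B374D848' to a byte offset."""
    high, low = lsn.split("/")
    # The part before the slash holds the upper 32 bits, the part after the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of log the replica has not yet applied (clamped at zero)."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(replica_lsn))
```

For example, `lsn_gap_bytes("16/B374D848", "16/B3740000")` evaluates to 55368 bytes. PostgreSQL can do the same subtraction server-side with `pg_wal_lsn_diff`.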

Production Scenario: Stale User Profile Data

Symptom

User updates profile but sees old value immediately after.

Architecture

Writes go to primary.
Reads served from replica.

Root Cause

Replica lag of 4 seconds during high write traffic.

Diagnosis

Replication delay metric spike.
High disk I/O wait on replica.
Write-heavy workload during peak hours.

Resolution

Upgrade replica disk performance.
Introduce read-after-write routing for critical paths.
Reduce large batch transaction size.

Step-by-Step Debugging Procedure

1) Measure Replication Delay

# Check replication status
SHOW REPLICA STATUS;

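Observe the seconds-behind-primary value, reported as Seconds_Behind_Source in MySQL 8.0.22 and later (older versions use SHOW SLAVE STATUS and Seconds_Behind_Master).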
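
A monitoring script might interpret one row of that output along the following lines. This is a minimal sketch assuming MySQL 8.0.22+ column names and a row already fetched into a dictionary. The function name `classify_replica` and the 5-second threshold are illustrative assumptions.

```python
from typing import Any, Mapping


def classify_replica(status: Mapping[str, Any], warn_seconds: int = 5) -> str:
    """Classify one SHOW REPLICA STATUS row as 'broken', 'lagging', or 'ok'."""
    io_ok = status.get("Replica_IO_Running") == "Yes"
    sql_ok = status.get("Replica_SQL_Running") == "Yes"
    delay = status.get("Seconds_Behind_Source")
    # A NULL delay (None here) means replication is not running,
    # not that the replica has caught up.
    if not (io_ok and sql_ok) or delay is None:
        return "broken"
    return "lagging" if int(delay) >= warn_seconds else "ok"
```

Treating a missing delay as "broken" rather than "ok" matters: a stopped apply thread would otherwise look like a perfectly healthy replica.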

2) Compare Log Positions

Check primary LSN.
Check replica applied LSN.
Calculate gap.

3) Inspect Resource Utilization on Replica

CPU usage.
Disk write latency.
I/O wait percentage.
Memory pressure.
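
On a Linux replica, I/O wait percentage can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes that environment and takes the two lines as strings so it can be tried without a live host. `iowait_percent` is a hypothetical helper name.

```python
from typing import List


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two /proc/stat 'cpu' lines."""
    def fields(line: str) -> List[int]:
        # Layout: cpu user nice system idle iowait irq softirq steal ...
        return [int(x) for x in line.split()[1:9]]

    deltas = [b - a for a, b in zip(fields(before), fields(after))]
    total = sum(deltas)
    # Index 4 is iowait in the layout above.
    return 100.0 * deltas[4] / total if total > 0 else 0.0
```

A sustained high value on the replica, while the primary stays low, points toward replica storage rather than write volume as the bottleneck.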

4) Analyze Network Conditions

Inter-node latency.
Packet retransmissions.
Bandwidth utilization.
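
Packet retransmissions can be estimated on Linux from the `Tcp:` counters in `/proc/net/snmp`. The following sketch assumes that file format (a header row followed by a value row) and compares two snapshots, since the counters are cumulative. The function names are hypothetical, and the result covers all TCP traffic on the host, not only the replication link.

```python
from typing import Dict


def tcp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the 'Tcp:' header and value rows of /proc/net/snmp into a dict."""
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    header, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def retransmit_ratio(before: str, after: str) -> float:
    """Fraction of segments sent between two snapshots that were retransmissions."""
    b, a = tcp_counters(before), tcp_counters(after)
    sent = a["OutSegs"] - b["OutSegs"]
    retransmitted = a["RetransSegs"] - b["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0
```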

5) Evaluate Transaction Characteristics

Large transactions blocking the apply thread.
Long-running locks.
Bulk updates.
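
One common way to keep bulk updates from monopolizing the apply thread is to split them into smaller transactions. The sketch below shows only the chunking step. `chunked` is a hypothetical helper, and the right batch size depends on the workload.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

A caller would then commit once per batch, for example looping over `chunked(row_ids, 500)` and running one transaction per iteration, so the replica can interleave other work between batches.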

The root cause of lag is often visible in one of these layers.

Replication Lag During Failover

If failover occurs while the replica is behind:

Recent writes may be lost.
Client-visible inconsistency may occur.
Manual reconciliation may be required.

Always verify that the replica has caught up before promotion.
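
A lag-aware promotion check could look roughly like the sketch below. It assumes log positions have already been converted to integer byte offsets and that the primary's last known position is available. Both are assumptions, since a crashed primary may not be reachable. The function name and the `max_gap_bytes` parameter are hypothetical.

```python
from typing import Dict, Optional


def pick_promotion_candidate(primary_last_lsn: int,
                             replica_applied_lsn: Dict[str, int],
                             max_gap_bytes: int = 0) -> Optional[str]:
    """Return the freshest replica if it is within max_gap_bytes, else None."""
    if not replica_applied_lsn:
        return None
    # The freshest replica is the one with the highest applied position.
    name = max(replica_applied_lsn, key=replica_applied_lsn.get)
    gap = primary_last_lsn - replica_applied_lsn[name]
    return name if gap <= max_gap_bytes else None
```

Returning `None` instead of the least-bad replica forces an explicit decision: wait for catch-up, or accept a known amount of data loss.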

Mitigation Strategies

1) Parallel Replication

Enable multi-threaded log apply.

2) Improve Disk I/O

Use faster storage for replicas.

3) Traffic Shaping

Throttle write bursts.
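
A token bucket is one widely used way to throttle bursts. It is shown here as a general technique, not as the mechanism this lesson prescribes. The sketch is single-threaded, and the class name and parameters are illustrative. The clock is injectable so the behavior can be checked without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` operations per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False when the caller should back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A batch writer would call `try_acquire()` before each write and pause or queue the write when it returns `False`, which smooths the load the replica has to absorb.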

4) Read Routing Strategy

Route read-after-write requests to primary.
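
One simple form of this routing pins a user's reads to the primary for a short window after that user writes. The sketch below keeps the state in process memory, which is an assumption. A multi-instance service would need a shared store or a session-level marker. `ReadRouter` and `pin_seconds` are hypothetical names.

```python
import time
from typing import Callable, Dict


class ReadRouter:
    """Send a user's reads to the primary for pin_seconds after that user's last write."""

    def __init__(self, pin_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write: Dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"
```

The window should exceed the lag expected under load. With the 4-second lag from the scenario above, a window of 4 seconds or less would still allow stale reads.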

5) Capacity Headroom

Ensure replica has spare CPU and I/O capacity.

Observability Requirements

Real-time lag dashboards.
Alert on threshold breaches.
LSN gap trend monitoring.
Replica resource usage visibility.
Failover readiness indicator.

Lag detection must be proactive.
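
Trend monitoring can go beyond a static threshold by asking whether the gap is shrinking and, if so, how long catch-up will take. The sketch below is a deliberately simple two-point estimate over `(timestamp_seconds, gap_bytes)` samples. The function name is hypothetical, and a production system would likely smooth over more points.

```python
from typing import Optional, Sequence, Tuple


def catch_up_eta_seconds(samples: Sequence[Tuple[float, int]]) -> Optional[float]:
    """Estimate seconds until the gap reaches zero.

    Returns 0.0 if the replica has already caught up and None if the gap
    is not shrinking (or there is too little data to tell).
    """
    (t0, g0), (t1, g1) = samples[0], samples[-1]
    if g1 <= 0:
        return 0.0
    if t1 <= t0 or g1 >= g0:
        return None
    drain_rate = (g0 - g1) / (t1 - t0)  # bytes of backlog cleared per second
    return g1 / drain_rate
```

A `None` result while the gap is nonzero is a stronger alert signal than any single lag value, because it means the replica is not recovering on its own.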

Failure Injection Test

# Replication lag simulation
1) Generate heavy write workload
2) Monitor replica delay
3) Simulate disk slowdown on replica
4) Observe LSN divergence
5) Attempt failover under lag
6) Validate safe promotion rules

Lag behavior should be validated before real incidents.
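
Before running such a test against real infrastructure, the expected shape of the lag curve can be reasoned about with a toy model. The sketch below is a simplification that assumes fixed per-tick write and apply rates in arbitrary log units. It does not model any real database, and `simulate_lag` is a hypothetical name.

```python
from typing import List


def simulate_lag(write_rate: List[int], apply_rate: List[int]) -> List[int]:
    """Backlog of unapplied log units after each tick, given per-tick rates."""
    backlog = 0
    history = []
    for written, capacity in zip(write_rate, apply_rate):
        backlog += written
        # The replica applies as much as it can, up to its capacity this tick.
        backlog -= min(backlog, capacity)
        history.append(backlog)
    return history
```

With a steady 100 units written per tick and apply capacity of `[120, 120, 40, 40, 120, 120]` (a two-tick disk slowdown), the backlog is `[0, 0, 60, 120, 100, 80]`. Recovery takes far longer than the slowdown itself because only 20 units of headroom are available per tick.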

Common Anti-Patterns

Serving all reads from replicas without lag awareness.
No alerting on replication delay.
Promoting a lagging replica during failover.
Ignoring large transaction impact.
Operating replica near full resource utilization.

Lag grows quietly until visible damage occurs.

Operational Checklist

Is replication delay continuously monitored?
Are failover rules lag-aware?
Are large transactions minimized?
Is replica capacity provisioned adequately?
Are read-after-write paths defined clearly?

Key Takeaways

Replication lag risks stale reads and data loss.
The LSN gap is a precise measure of divergence.
Disk I/O and transaction size commonly cause lag.
Failover must consider replica freshness.
Lag monitoring is mandatory in production systems.

Replication lag debugging requires visibility into storage, networking, and transaction patterns. In production-grade distributed systems, ensuring replica freshness is fundamental to maintaining both performance and correctness guarantees.
