DATABASE-ADVANCED Contents

MySQL Replication Model

Binlog, GTID, replica topology, and how to reason about apply lag and failover.

On this page

MySQL Replication: Binlog as the Replication Stream

MySQL replication is driven by the binary log (binlog). The primary records changes to binlog, and replicas fetch and apply those changes.

Production consequence: replication health depends on two pipelines:

  • Transport: fetching binlog events from primary
  • Apply: executing those events on the replica

Core Components: Primary and Replica Roles

On the primary:

  • Transactions commit
  • Binlog events are generated

On the replica:

  • IO thread fetches binlog events and writes to relay log
  • SQL thread(s) read relay log and apply events

Lag can happen in either stage: fetching or applying.

Binlog Formats: STATEMENT vs ROW vs MIXED

Binlog format changes correctness and performance characteristics.

  • STATEMENT: logs SQL statements. Smaller logs, but can be non-deterministic with certain functions/time.
  • ROW: logs row changes. More deterministic and safer, but can produce much larger binlogs.
  • MIXED: engine chooses per statement.

Production reality: many modern setups prefer ROW for correctness, accepting larger log volume.

GTID: Global Transaction Identifiers

GTID provides a unique identifier for each transaction, simplifying failover and replica reconfiguration.

Without GTID, failover often involves manual binlog file/position management.

With GTID, replicas can automatically resume from the correct transaction set.

Relay Log: The Replica's Local Queue

Replicas do not apply directly from primary binlog. They write events to relay logs first. Relay logs act as a queue. If apply is slow, relay logs grow.

Operational implication: relay log growth is an early warning signal for apply lag.

Why Replicas Lag: Transport vs Apply

There are two major lag classes:

  • Network/transport lag: IO thread cannot fetch fast enough.
  • Apply lag: SQL thread(s) cannot apply fast enough.

Most production lag incidents are apply lag.

Apply Lag: The Real Bottleneck

Apply lag happens when applying changes is slower than generating them. Common causes:

  • Single-threaded applier bottleneck
  • Large transactions (big batch updates)
  • Hotspot contention on replica (same rows updated repeatedly)
  • Expensive secondary index maintenance
  • Disk IO saturation on replica

Multi-Threaded Replication

MySQL supports multi-threaded replica apply, but it is not magic. Parallelism depends on:

  • How transactions are partitioned (schema/db based or logical groupings)
  • Conflicts between transactions (same rows/keys)

If your workload updates a small hotspot set, parallel apply will not help much.

Commit Order and Consistency Constraints

Replication must preserve correctness. Some configurations enforce commit order, reducing parallelism potential. This can be the hidden reason why replicas cannot catch up even with multiple threads.

Measuring Replication Health

The classic tool is replica status output.

-- On replica
SHOW REPLICA STATUS;

Key fields vary by version, but you typically inspect:

  • Seconds behind source (coarse signal)
  • Relay log space growth
  • SQL thread state (applying, waiting, error)
  • Last error and error code

Seconds Behind Is Not a Complete Metric

Seconds-behind is often misleading:

  • It may be NULL if replication is broken.
  • It measures apply delay relative to events, not freshness of read.
  • Under burst writes, it can jump dramatically.

Always correlate with relay log growth and apply throughput.

Failure Modes in Production

  • Replica lag explosion: batch job generates huge binlog faster than apply.
  • Relay log disk fill: relay log grows until disk is full.
  • Non-determinism: statement-based replication diverges.
  • GTID misconfig: failover causes duplicate or missing transactions.
  • Replica apply errors: schema drift causes apply failures.
  • Hotspot contention: parallel apply ineffective due to conflicting keys.

Operational Checklist

  • Choose binlog format intentionally; prefer deterministic behavior for HA.
  • Enable and validate GTID if you plan automated failover.
  • Monitor relay log size and disk usage on replicas.
  • Alert on replication SQL thread errors immediately.
  • Identify workloads that generate huge transactions; chunk them.
  • Capacity plan replica IO; apply is often disk-bound.
  • Validate multi-threaded replication effectiveness under your workload.
  • Do not rely only on seconds-behind; track apply throughput too.
  • Test failover drills and replica re-seeding procedure.
  • Ensure schema migrations are replication-safe and coordinated.

Summary

MySQL replication ships binlog events to replicas via relay logs, then applies them. Most lag incidents are apply bottlenecks, not network. Binlog format affects correctness and volume. GTID simplifies failover but must be configured correctly. Production HA requires monitoring both transport and apply pipelines and controlling large transactions that overwhelm replicas.