Leader Election Concepts

Leader election picks a single node to coordinate writes and membership changes, and keeps that choice safe under failure. This page covers heartbeats, leases, epochs, fencing tokens, and operational pitfalls such as stale leaders, clock drift, and unsafe failover that causes split brain.

Why Leaders Exist

  • Provide a single ordered write path.
  • Centralize coordination decisions such as membership changes.
  • Simplify conflict resolution compared to multi-leader designs.

Leader Election Requirements

  • At most one leader can be active for the same group at a time.
  • Leader must be discoverable by clients and followers.
  • Leadership must transfer safely under failures.

Heartbeats and Failure Detection

  • Followers detect leader failure via missed heartbeats.
  • Timeouts are a heuristic: a slow or paused leader looks the same as a dead one, so they can produce false positives.
  • Production rule: tune timeouts to measured network latency and GC pause behavior, not optimistic assumptions.
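
The trade-off above can be sketched with a tiny timeout-based detector. This is a minimal illustration, not a production failure detector; `HeartbeatMonitor` and `simulate` are hypothetical names, and timestamps are passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Follower-side view of leader liveness (hypothetical helper)."""

    timeout_s: float          # how long silence is tolerated
    last_seen_s: float = 0.0  # timestamp of the most recent heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        # Ignore heartbeats that arrive out of order.
        self.last_seen_s = max(self.last_seen_s, now_s)

    def suspects_leader(self, now_s: float) -> bool:
        # A suspicion, not proof: a paused leader looks the same as a dead one.
        return now_s - self.last_seen_s > self.timeout_s


def simulate(timeout_s: float, pause_s: float) -> bool:
    """Return True if a healthy leader that stalls for pause_s gets suspected."""
    monitor = HeartbeatMonitor(timeout_s=timeout_s)
    monitor.record_heartbeat(now_s=10.0)
    # The leader stalls (for example in a GC pause) and sends nothing.
    return monitor.suspects_leader(now_s=10.0 + pause_s)
```

With a 1 second timeout, a 2.5 second pause triggers a false positive; with a 3 second timeout it does not. In a real process the timestamps would likely come from `time.monotonic()` rather than wall-clock time.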

Leases and Epochs

  • Lease: time-bounded leadership grant.
  • Epoch (term): monotonically increasing leadership generation number.
  • Epochs help clients reject stale leaders and prevent split-brain writes.
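
One way to picture how leases and epochs fit together is the sketch below. It is a single-process stand-in for a coordination service, under the assumption that a renewal keeps the epoch and every new grant increments it; `Lease`, `LeaseTable`, and `try_acquire` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    holder: str
    epoch: int           # leadership generation number
    expires_at_s: float  # time after which the grant is void


class LeaseTable:
    """In-memory stand-in for a coordination service (hypothetical)."""

    def __init__(self) -> None:
        self._current: Optional[Lease] = None
        self._last_epoch = 0

    def try_acquire(self, node: str, now_s: float, ttl_s: float) -> Optional[Lease]:
        cur = self._current
        if cur is not None and now_s < cur.expires_at_s:
            if cur.holder != node:
                return None  # another node holds an unexpired lease
            # Renewal by the current holder extends the lease, same epoch.
            self._current = Lease(node, cur.epoch, now_s + ttl_s)
            return self._current
        # Lease is free or expired: each new grant starts a new epoch.
        self._last_epoch += 1
        self._current = Lease(node, self._last_epoch, now_s + ttl_s)
        return self._current
```

Because the epoch only moves forward, any message stamped with an older epoch can be recognized as coming from a superseded leader.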

Fencing Tokens

  • A fencing token is a monotonically increasing number, issued with each leadership grant and attached to every write.
  • Storage rejects writes carrying a token older than the highest it has seen.
  • This prevents a previously isolated leader from continuing to write after a new leader exists.
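
The storage-side check can be sketched as follows. This is a minimal in-memory illustration, assuming the store remembers the highest token it has accepted; `FencedStore` and `StaleTokenError` are hypothetical names, not part of any particular storage system.

```python
from typing import Dict, Optional


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    """In-memory store that enforces fencing tokens (hypothetical)."""

    def __init__(self) -> None:
        self._highest_token = 0
        self._data: Dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token}"
            )
        # Remember the newest token so older leaders are fenced off.
        self._highest_token = token
        self._data[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._data.get(key)
```

Once a write with token 2 has been accepted, a late write from the old leader with token 1 raises `StaleTokenError` and leaves the stored value untouched.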

Clock Drift Risks

  • Lease safety depends on clock assumptions.
  • Clock drift and process pauses can leave a leader believing its lease is still valid after it has expired.
  • Prefer epoch-based validation at the storage boundary over relying only on time.
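
A conservative holder-side lease check might look like the sketch below. The function name and the parameters `max_drift_rate` and `safety_margin_s` are assumptions made for illustration; the idea is to measure elapsed time from when the lease request was sent, on a local monotonic clock, and to assume the worst about drift.

```python
def safe_to_act(
    request_sent_s: float,
    ttl_s: float,
    now_s: float,
    max_drift_rate: float,
    safety_margin_s: float,
) -> bool:
    """Return True only if the lease is still valid under pessimistic assumptions.

    request_sent_s and now_s are readings of the same local monotonic clock.
    """
    elapsed = now_s - request_sent_s
    # If the local clock runs slow, more real time has passed than measured.
    worst_case_elapsed = elapsed * (1.0 + max_drift_rate)
    # Stop acting as leader a margin before the lease would expire.
    return worst_case_elapsed + safety_margin_s < ttl_s
```

Even this check cannot cover a pause that occurs between the check and the write itself, which is why validation at the storage boundary remains the stronger guarantee.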

Failure Modes

  • Split brain from network partition and unsafe lease logic.
  • Thrashing where leaders change too frequently due to aggressive timeouts.
  • A stale leader keeps serving writes because clients cache the old leader's address.
  • Repeated elections with no bound on failover time leave writes unavailable.

Incident Triage Checklist

  • Are elections frequent? Inspect timeouts, GC pauses, and network loss.
  • Is a stale leader serving? Verify epoch or fencing enforcement at storage.
  • Do clients retry safely and refresh leader discovery promptly?
  • Is membership view consistent across nodes?
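
For the client-side item in the checklist above, a retry loop that refreshes leader discovery on rejection could look roughly like this. `discover`, `send`, `NotLeaderError`, and the `request_id` used for idempotent retries are all hypothetical; a real client would plug in its own discovery and transport.

```python
from typing import Callable, Optional, Tuple

Leader = Tuple[str, int]  # (address, epoch)


class NotLeaderError(Exception):
    """Raised by send() when the contacted node is not the current leader."""


def write_with_refresh(
    discover: Callable[[], Leader],
    send: Callable[[Leader, str, str], str],
    request_id: str,
    payload: str,
    max_attempts: int = 3,
) -> str:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[NotLeaderError] = None
    leader = discover()  # may be a cached, possibly stale, view
    for _ in range(max_attempts):
        try:
            # The same request_id is reused so the server can deduplicate.
            return send(leader, request_id, payload)
        except NotLeaderError as err:
            last_error = err
            leader = discover()  # refresh instead of retrying the old address
    assert last_error is not None
    raise last_error
```

Bounding the attempts keeps a client from spinning during a long election, and reusing the request ID is what makes the retry safe if the first attempt was applied but its reply was lost.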

Production Checklist

  • Leadership epochs are monotonic and validated on every write.
  • Client leader discovery has fast refresh and fallback.
  • Election timeouts are tuned and tested under latency and pause conditions.
  • Split-brain protection is enforced via fencing tokens or an equivalent mechanism.