Consensus Algorithms Overview

Consensus provides agreement on a single value or log despite failures. Learn the problems it solves, its quorum assumptions, the difference between safety and liveness, and what Raft- and Paxos-style systems guarantee in production.

What Consensus Solves

  • Agree on a leader and membership changes.
  • Replicate an ordered log of commands safely.
  • Provide a single source of truth for critical metadata.

Safety vs Liveness

  • Safety: nothing bad ever happens; for example, no two nodes commit conflicting entries at the same log position.
  • Liveness: something good eventually happens, the system makes progress.
  • Production rule: safety is non-negotiable; liveness depends on timing and failure assumptions.

Quorum Basics

  • Consensus typically requires a majority quorum.
  • Majority intersection prevents two different quorums from committing conflicting decisions.
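  • With 2f + 1 nodes, a majority is f + 1, so the system can lose f nodes to crash failures and still make progress. Safety holds even when more than f nodes fail; the cluster simply stops committing.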
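
The arithmetic behind these bullets can be sketched in a few lines of Python. The function names `majority`, `tolerated_failures`, and `all_quorums_intersect` are illustrative and not taken from any library, and the sketch assumes crash failures only (no Byzantine nodes).

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Crash failures an n-node cluster can absorb and still form a majority."""
    return n - majority(n)


def all_quorums_intersect(n: int) -> bool:
    """Brute-force check that every pair of majority quorums shares a node."""
    quorums = [set(q) for q in combinations(range(n), majority(n))]
    return all(a & b for a in quorums for b in quorums)


for n in (3, 4, 5):
    # Columns: cluster size, quorum size, tolerated failures, quorums intersect
    print(n, majority(n), tolerated_failures(n), all_quorums_intersect(n))
```

Going from 3 to 4 nodes raises the quorum size from 2 to 3 but still tolerates only one failure, which is why odd cluster sizes are the usual choice.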

Log Replication Model

  • Client submits command to leader.
  • Leader appends to log and replicates to followers.
  • Once a quorum has acknowledged the entry, it is committed.
  • Committed entries are applied to a deterministic state machine.
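
The commit and apply steps above can be sketched as follows. This is a minimal illustration, not a full protocol: `commit_index` and `apply_committed` are hypothetical names, the state machine is assumed to be a simple key-value dictionary, and log indices are assumed to be 1-based. Real Raft implementations add a further rule (a leader only counts replicas for entries from its own term), which is omitted here.

```python
def commit_index(match_index: dict[str, int], cluster_size: int) -> int:
    """Highest log index stored on a majority of nodes.

    match_index maps each node id (leader included) to the last log index
    known to be durably stored on that node.
    """
    quorum = cluster_size // 2 + 1
    indices = sorted(match_index.values(), reverse=True)
    # Nodes missing from match_index are treated as having nothing stored.
    indices += [0] * (cluster_size - len(indices))
    # The quorum-th highest index is held by at least `quorum` nodes.
    return indices[quorum - 1]


def apply_committed(log, state: dict, last_applied: int, commit: int) -> int:
    """Apply entries in (last_applied, commit] to the state, in log order."""
    for index in range(last_applied + 1, commit + 1):
        key, value = log[index - 1]  # log indices are 1-based
        state[key] = value
    return commit  # new value of last_applied


# Five-node cluster: index 5 is the highest index present on three nodes.
acks = {"leader": 7, "f1": 7, "f2": 4, "f3": 5, "f4": 2}
print(commit_index(acks, cluster_size=5))
```

Because every node applies the same committed entries in the same order to a deterministic state machine, all replicas reach the same state.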

Raft vs Paxos High Level

  • Raft emphasizes understandability with leader election and log replication steps.
  • The Paxos family focuses on provable safety properties under asynchronous networks.
  • In practice, both implement quorum-based agreement with similar guarantees.

Operational Concerns

  • Write latency depends on quorum round trips.
  • Membership changes must be handled safely to avoid losing quorum.
  • Snapshots and log compaction are required for long-running clusters.
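
To make the first bullet concrete, here is a rough model of commit latency. It assumes the leader writes to its own disk in parallel with replication and that each follower's round trip already includes that follower's disk write; `commit_latency_ms` and its parameters are hypothetical names chosen for this sketch.

```python
def commit_latency_ms(follower_rtts_ms: list[float],
                      leader_fsync_ms: float = 0.0) -> float:
    """Estimate the time until an entry is stored on a majority of nodes.

    follower_rtts_ms holds one round-trip time per follower; use
    float("inf") for a follower that is down.
    """
    cluster_size = len(follower_rtts_ms) + 1  # followers plus the leader
    quorum = cluster_size // 2 + 1
    acks_needed = quorum - 1  # the leader counts toward the quorum
    if acks_needed == 0:
        return leader_fsync_ms
    # The commit waits for the acks_needed-th fastest follower.
    kth_fastest = sorted(follower_rtts_ms)[acks_needed - 1]
    return max(leader_fsync_ms, kth_fastest)


# Five nodes: only the two fastest followers matter while all are healthy.
print(commit_latency_ms([2.0, 3.0, 40.0, 80.0]))
# With both fast followers down, the commit waits on the slow ones.
print(commit_latency_ms([float("inf"), float("inf"), 40.0, 80.0]))
```

Under these assumptions, slow followers outside the fastest quorum do not affect write latency until a fast node fails, at which point latency jumps to the next-slowest responder.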

Failure Modes

  • Election storms due to aggressive timeouts and unstable networks.
  • Disk latency spikes cause followers to fall behind and reduce throughput.
  • A misconfigured membership change can reduce fault tolerance or break availability.
  • Split-brain behavior if quorum rules are violated by the implementation or by operators.
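
One common way to reduce election storms is to randomize election timeouts, as Raft does, and to keep them well above normal heartbeat and network delays. The sketch below assumes that approach. The function names are hypothetical, and the safety margin of 3x is an illustrative rule of thumb, not a value from any specific system.

```python
import random


def election_timeout_ms(base_ms: float, rng: random.Random) -> float:
    """Pick a timeout between base_ms and 2 * base_ms.

    Randomization makes it unlikely that several followers time out
    at the same moment and split the vote.
    """
    return rng.uniform(base_ms, 2 * base_ms)


def timeout_looks_safe(base_ms: float, heartbeat_ms: float,
                       worst_rtt_ms: float, worst_pause_ms: float,
                       margin: float = 3.0) -> bool:
    """Hypothetical sanity check: the shortest possible timeout should
    comfortably exceed one heartbeat interval plus the worst observed
    network round trip and process pause (GC, disk stall)."""
    return base_ms >= margin * (heartbeat_ms + worst_rtt_ms + worst_pause_ms)


rng = random.Random(42)  # seeded only to make the example repeatable
print([round(election_timeout_ms(300.0, rng)) for _ in range(3)])
print(timeout_looks_safe(300.0, heartbeat_ms=50, worst_rtt_ms=20, worst_pause_ms=30))
print(timeout_looks_safe(150.0, heartbeat_ms=50, worst_rtt_ms=20, worst_pause_ms=30))
```

A timeout that fails this kind of check is "aggressive" in the sense used above: a single slow heartbeat or pause can trigger an unnecessary election.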

Production Checklist

  • Majority quorum is enforced for commits.
  • Election timeouts are tuned for real network latency and process pauses.
  • Snapshots and compaction are configured and monitored.
  • Node health signals include disk, network, and replication lag.