Consensus Algorithms Overview
What Consensus Solves
- Agree on a leader and membership changes.
- Replicate an ordered log of commands safely.
- Provide a single source of truth for critical metadata.
Safety vs Liveness
- Safety: nothing bad ever happens. For example, no two leaders commit conflicting log entries.
- Liveness: something good eventually happens. For example, the system keeps making progress on client requests.
- Production rule: safety is non-negotiable; liveness depends on timing and failure assumptions.
Quorum Basics
- Consensus typically requires a majority quorum.
- Majority intersection prevents two different quorums from committing conflicting decisions.
- With 2f + 1 nodes, a majority is f + 1, so the system can tolerate f crashed nodes and still commit new entries without compromising safety.
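The quorum arithmetic above can be checked directly. The following is a minimal sketch for illustration only; the helper names `majority`, `tolerated_failures`, and `quorums_always_intersect` are hypothetical and not taken from any particular consensus library.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of crashed nodes an n-node cluster can lose and still form a majority."""
    return (n - 1) // 2


def quorums_always_intersect(n: int) -> bool:
    """Brute-force check that any two minimal majority quorums share a node."""
    size = majority(n)
    quorums = [set(c) for c in combinations(range(n), size)]
    return all(a & b for a in quorums for b in quorums)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n), quorums_always_intersect(n))
```

Note that a 4-node cluster tolerates the same single failure as a 3-node cluster, which is one reason odd cluster sizes are common.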
Log Replication Model
- Client submits command to leader.
- Leader appends to log and replicates to followers.
- Once a quorum acknowledges, entry is committed.
- Committed entries are applied to a deterministic state machine.
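The commit step above can be sketched as follows. This is a simplified, Raft-style illustration under stated assumptions: the function names, the `match_index` map (highest log index known to be stored on each follower), and the key-value state machine are all hypothetical. It also omits terms; a real Raft leader only advances the commit index by counting replicas for entries from its current term.

```python
from typing import Dict, List, Tuple


def commit_index(match_index: Dict[str, int], leader_last_index: int) -> int:
    """Highest log index stored on a majority of the cluster (leader included)."""
    indices = sorted([leader_last_index, *match_index.values()], reverse=True)
    quorum = len(indices) // 2 + 1
    # The quorum-th largest index is held by at least `quorum` nodes.
    return indices[quorum - 1]


def apply_committed(
    log: List[Tuple[str, int]],
    state: Dict[str, int],
    last_applied: int,
    commit: int,
) -> int:
    """Apply committed entries in log order to a deterministic key-value state."""
    for i in range(last_applied + 1, commit + 1):
        key, value = log[i - 1]  # log indices are 1-based in this sketch
        state[key] = value
    return commit


log = [("x", 1), ("y", 2), ("x", 3)]
state: Dict[str, int] = {}
# Five-node cluster: the leader holds all 3 entries, followers lag to varying degrees.
commit = commit_index({"f1": 3, "f2": 1, "f3": 0, "f4": 2}, leader_last_index=3)
last_applied = apply_committed(log, state, last_applied=0, commit=commit)
print(commit, state)
```

Because every node applies the same committed entries in the same order to a deterministic state machine, all nodes converge on the same state.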
Raft vs Paxos High Level
- Raft emphasizes understandability with leader election and log replication steps.
- Paxos family focuses on proving safety properties under asynchronous networks.
- In practice, both implement quorum-based agreement with similar guarantees.
Operational Concerns
- Write latency depends on quorum round trips.
- Membership changes must be handled safely to avoid losing quorum.
- Snapshots and log compaction are required for long-running clusters.
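To illustrate why write latency tracks quorum round trips rather than the slowest node, here is a rough sketch. It assumes the leader counts its own write toward the quorum and ignores disk sync and processing time; the function name `quorum_commit_latency_ms` and the sample round-trip times are hypothetical.

```python
from typing import List


def quorum_commit_latency_ms(follower_rtts_ms: List[float]) -> float:
    """Estimate commit latency as the round trip to the slowest follower
    that is still needed to complete a majority."""
    cluster_size = len(follower_rtts_ms) + 1  # followers plus the leader
    acks_needed = cluster_size // 2  # majority size minus the leader itself
    if acks_needed == 0:
        return 0.0  # single-node cluster: no network round trip needed
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Five-node cluster: two fast followers are enough, so the slow ones do not matter.
print(quorum_commit_latency_ms([2.0, 3.0, 40.0, 90.0]))
# Three-node cluster with one remote follower.
print(quorum_commit_latency_ms([2.0, 80.0]))
```

Under these assumptions, a slow minority of followers does not affect commit latency, but losing a fast follower can push the quorum onto a slower one.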
Failure Modes
- Election storms due to aggressive timeouts and unstable networks.
- Disk latency spikes cause followers to fall behind and reduce throughput.
- A misconfigured membership change can reduce fault tolerance or break availability.
- Split-brain behavior if quorum rules are violated by the implementation or by operators.
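One common mitigation for election storms, used by Raft, is to randomize each node's election timeout so that nodes rarely time out at the same moment. The sketch below is illustrative only: the function name `election_timeout_ms` is hypothetical, and the 150 to 300 ms range is an example value that should be tuned to the real network and pause behavior of the cluster.

```python
import random
from typing import Optional


def election_timeout_ms(
    base_ms: float = 150.0,
    jitter_ms: float = 150.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a timeout uniformly in [base_ms, base_ms + jitter_ms]."""
    source = rng if rng is not None else random
    return base_ms + source.uniform(0.0, jitter_ms)


# Each node draws its own timeout; the node with the shortest one is the most
# likely to start an election first and win before the others time out.
rng = random.Random(42)  # seeded only to make the example repeatable
timeouts = {node: election_timeout_ms(rng=rng) for node in ("a", "b", "c")}
first_candidate = min(timeouts, key=timeouts.get)
print(first_candidate, timeouts)
```

The base timeout should comfortably exceed the heartbeat interval plus typical network and disk latency; otherwise healthy followers may start needless elections.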
Production Checklist
- Majority quorum is enforced for commits.
- Election timeouts are tuned for observed network latency and process pauses.
- Snapshots and compaction are configured and monitored.
- Node health signals include disk, network, and replication lag.