Consensus Basics
What Consensus Solves
Consensus is how a set of nodes agrees on a single decision, or on an ordered log of decisions, even when some nodes fail or messages are delayed. In production, consensus is the foundation for leader election, replicated state machines, and coordination services.
The Problem: Agreement Under Failure
In a single process, state is obvious. In distributed systems, state must be replicated. Replication introduces disagreement when nodes observe events in different orders or lose connectivity. Consensus creates a shared source of truth for critical coordination decisions.
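A minimal sketch of the disagreement problem, assuming a hypothetical key-value state and two replicas that receive the same two writes in different orders. The key name and node names are illustrative only:

```python
def apply_all(ops):
    """Apply (key, value) writes in the given order and return the final state."""
    state = {}
    for key, value in ops:
        state[key] = value
    return state

# Two hypothetical writes to the same key, issued concurrently by different clients.
write_a = ("leader", "node-1")
write_b = ("leader", "node-2")

replica_1 = apply_all([write_a, write_b])  # observes A, then B
replica_2 = apply_all([write_b, write_a])  # observes B, then A

# Same set of writes, different order, different final state.
diverged = replica_1 != replica_2
print(replica_1, replica_2, diverged)
```

Both replicas saw every write, yet they end up disagreeing. Agreeing on one order for those writes is exactly what consensus provides.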
Where You Encounter Consensus
- Choosing a leader for a cluster
- Maintaining a consistent configuration store
- Coordinating membership changes
- Serializing writes into an ordered log
Consensus Is Not a Database Feature
Consensus is a coordination mechanism. Many databases embed consensus-like protocols to replicate their logs, but you should treat consensus as a distinct cost center: it adds latency, requires a majority quorum, and demands careful operational discipline.
Majority Quorum Is the Usual Safety Rule
Most practical consensus systems rely on the fact that any two majorities of the same cluster share at least one node. This overlap prevents two separate groups of nodes from committing conflicting decisions.
Cluster size -> quorum (majority)
3 nodes -> 2
5 nodes -> 3
7 nodes -> 4

If you lose quorum, you can be available OR consistent, not both.
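The quorum sizes above follow from integer arithmetic, and the overlap property can be checked by brute force for small clusters. This is a minimal sketch; the function names are hypothetical and not taken from any particular consensus library:

```python
from itertools import combinations

def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def majorities_always_overlap(n):
    """Brute-force check that any two quorum-sized subsets share a node."""
    subsets = [set(c) for c in combinations(range(n), quorum(n))]
    return all(a & b for a in subsets for b in subsets)

for n in (3, 5, 7):
    print(n, quorum(n), majorities_always_overlap(n))
```

Two quorums of size n // 2 + 1 together contain more than n members, so they cannot be disjoint. The brute-force check only confirms that counting argument for the sizes listed.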
Latency and Availability Implications
Consensus typically requires multiple network round trips. That increases tail latency, especially cross-region. It also reduces availability: if you cannot reach a majority, you cannot safely commit new decisions.
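A rough way to reason about both effects, assuming a leader-based protocol in which the leader counts its own vote and a commit completes once enough followers acknowledge to form a majority. The round-trip times below are made-up illustrative numbers, and the function names are hypothetical:

```python
def commit_latency_ms(follower_rtts_ms):
    """Estimate one commit round: wait for the fastest followers
    needed to form a majority together with the leader."""
    cluster_size = len(follower_rtts_ms) + 1   # followers plus the leader
    acks_needed = cluster_size // 2            # quorum minus the leader's own vote
    if acks_needed == 0:
        return 0.0                             # single-node cluster, no network wait
    return sorted(follower_rtts_ms)[acks_needed - 1]

def can_commit(reachable_nodes, cluster_size):
    """New decisions are safe only while a majority is reachable."""
    return reachable_nodes >= cluster_size // 2 + 1

same_region = [1.0, 1.5, 2.0, 2.5]        # hypothetical RTTs in ms
cross_region = [1.0, 70.0, 80.0, 150.0]   # hypothetical RTTs in ms

print(commit_latency_ms(same_region), commit_latency_ms(cross_region))
print(can_commit(3, 5), can_commit(2, 5))
```

With these numbers, a 5-node cluster waits about 1.5 ms per commit in one region and about 70 ms across regions, because the second-fastest follower sets the pace. With only 2 of 5 nodes reachable, no commit is possible at all.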
Production Smells
- Using consensus for high-volume data writes (expensive and slow)
- Running even-sized clusters (the extra node raises the quorum without tolerating more failures: 4 nodes need a quorum of 3 and survive 1 failure, the same as 3 nodes)
- Cross-region consensus without strict latency budgeting
- Assuming consensus eliminates all failures (it does not)
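The even-sized-cluster smell can be checked with simple arithmetic. A minimal sketch, assuming majority quorums; the helper names are hypothetical:

```python
def majority(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n):
    """Nodes that can fail while a majority still remains."""
    return n - majority(n)

# Map each cluster size to (quorum, tolerated failures).
table = {n: (majority(n), tolerated_failures(n)) for n in range(3, 8)}
for n, (q, f) in table.items():
    print(f"{n} nodes: quorum {q}, tolerates {f} failure(s)")
```

Going from 3 to 4 nodes (or from 5 to 6) raises the quorum by one but leaves the number of tolerated failures unchanged, so the extra node adds cost without adding fault tolerance.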
Production-First Takeaway
Consensus is for coordination, not bulk data. Use it to pick leaders and serialize critical decisions, accept the latency and quorum requirements, and design your system so loss of quorum is a controlled operational state.