Distributed System Fundamentals
On this page
What Makes a System Distributed
- Multiple nodes communicate over a network to provide one service.
- Network and node failures are normal, not exceptional.
- State is replicated or partitioned to scale and survive failures.
Core Properties
- Availability: the system responds to requests.
- Consistency: all observers see compatible state.
- Partition tolerance: system continues despite network splits.
- Latency: network hops dominate performance at scale.
Hard Truths
- The network is unreliable and can drop, reorder, and duplicate messages.
- Clocks are not perfectly synchronized and can drift.
- Failures are partial: one node fails while others continue.
- Retries can cause duplicates. Idempotency is mandatory.
Time and Ordering
- Wall clock time is not a safe ordering mechanism.
- Logical clocks (Lamport, vector clocks) help reason about causality.
- Monotonicity matters: out of order events create correctness bugs.
Common Building Blocks
- Replication: leader based, leaderless, quorum reads and writes.
- Partitioning: consistent hashing, range sharding, rebalancing.
- Coordination: consensus, membership, leases.
- Messaging: at most once, at least once, exactly once as a goal.
Production Failure Modes
- Partition: nodes cannot reach each other, clients see timeouts.
- Split brain: multiple leaders accept writes and diverge state.
- Retry storms: clients amplify an outage with aggressive retries.
- Thundering herd: many nodes recover and hit dependencies at once.
- Backpressure failure: queues grow without control and collapse latency.
Operational Checklist
- Idempotency keys for write operations.
- Timeouts, retry caps, and jittered backoff.
- Load shedding and backpressure paths.
- Clear health checks and readiness gating.
- Observability: tracing, saturation metrics, and error budgets.