Distributed System Fundamentals

Distributed systems trade simplicity for scale and resilience. Learn core properties, time and failure assumptions, and production failure modes like partitions, retries, and split brain.

On this page

What Makes a System Distributed

Multiple nodes communicate over a network to provide one service.
Network and node failures are normal, not exceptional.
State is replicated or partitioned to scale and survive failures.

Core Properties

Availability: the system responds to requests.
Consistency: all observers see compatible state.
Partition tolerance: system continues despite network splits.
Latency: network hops dominate performance at scale.

Hard Truths

The network is unreliable and can drop, reorder, and duplicate messages.
Clocks are not perfectly synchronized and can drift.
Failures are partial: one node fails while others continue.
Retries can cause duplicates. Idempotency is mandatory.

Time and Ordering

Wall clock time is not a safe ordering mechanism.
Logical clocks (Lamport, vector clocks) help reason about causality.
Monotonicity matters: out of order events create correctness bugs.

Common Building Blocks

Replication: leader based, leaderless, quorum reads and writes.
Partitioning: consistent hashing, range sharding, rebalancing.
Coordination: consensus, membership, leases.
Messaging: at most once, at least once, exactly once as a goal.

Production Failure Modes

Partition: nodes cannot reach each other, clients see timeouts.
Split brain: multiple leaders accept writes and diverge state.
Retry storms: clients amplify an outage with aggressive retries.
Thundering herd: many nodes recover and hit dependencies at once.
Backpressure failure: queues grow without control and collapse latency.

Operational Checklist

Idempotency keys for write operations.
Timeouts, retry caps, and jittered backoff.
Load shedding and backpressure paths.
Clear health checks and readiness gating.
Observability: tracing, saturation metrics, and error budgets.

CAP Theorem in Practice →