What Makes a System Distributed?
Most developers think a distributed system simply means “multiple servers.” That definition is dangerously incomplete. A distributed system is not defined by the number of machines, but by the need to coordinate work across failure-prone network boundaries. The defining property is not scale. It is partial failure.
If one component can fail while others continue running, and the system must continue operating correctly, you are dealing with a distributed system. That changes everything about how you design, debug, and reason about correctness.
The Real Defining Constraint: Partial Failure
In a monolithic application, failures are usually binary: the process crashes or it does not. In distributed systems, failures are ambiguous. A service may be slow, unreachable, partitioned, or processing but not responding. From the caller's side, each of these looks the same: no reply arrives in time. You cannot reliably distinguish between a slow node and a dead one.
This uncertainty creates the core complexity. Network communication introduces:
- Unbounded latency
- Packet loss
- Out-of-order delivery
- Temporary partitions
- Clock drift between machines
The moment your architecture depends on a network boundary for correctness, you are in distributed systems territory.
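The ambiguity can be made concrete with a small sketch. It assumes a hypothetical helper, `call_with_timeout`, that runs a remote call in a worker thread and reports only what the caller can actually observe. The `Outcome` names are illustrative and do not come from any particular library.

```python
import concurrent.futures
from enum import Enum


class Outcome(Enum):
    OK = "ok"            # a reply arrived in time
    FAILED = "failed"    # the callee answered with an error
    UNKNOWN = "unknown"  # no answer in time: slow, dead, or partitioned


def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread and classify what the caller can observe."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return Outcome.OK, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The work may still be running and may still succeed later.
        # The caller cannot tell, so it must not assume the call failed.
        return Outcome.UNKNOWN, None
    except Exception as exc:
        return Outcome.FAILED, exc
    finally:
        # Do not block on the worker; a timed-out call keeps running.
        pool.shutdown(wait=False)
```

The important case is `UNKNOWN`. A timeout says nothing about whether the remote side performed the work, which is why retrying a non-idempotent operation after a timeout is risky.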
Production Scenario: The “It Works Locally” Illusion
Symptom
A checkout service times out intermittently under load. CPU usage looks normal. No crashes are visible.
Root Cause
The payment service is deployed in another availability zone. During peak traffic, cross-zone latency increases from 3 ms to 60 ms. Slower calls hit their timeouts and are retried, and each retry adds load to a path that is already slow, so latency amplifies in a cascade.
Diagnosis
Tracing reveals multiple retries per request. Network metrics show transient latency spikes between zones.
Resolution
- Introduce request deadlines.
- Add exponential backoff with jitter.
- Limit retries globally.
- Co-locate latency-sensitive services.
The system did not fail because a machine crashed. It failed because coordination over a network boundary behaved differently under load.
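The first three resolution steps can be sketched together. This is a minimal illustration, not the checkout service's actual code. `RetryBudget`, `CallFailed`, and `call_with_retries` are hypothetical names, and "limit retries globally" is interpreted here as a process-wide budget in which retries may add only a fixed fraction of extra load on top of first attempts.

```python
import random
import threading
import time


class CallFailed(Exception):
    """Raised when no attempt succeeds within the deadline and retry limits."""


class RetryBudget:
    """Shared cap: each first attempt earns `ratio` tokens, each retry costs 1."""

    def __init__(self, ratio=0.1, initial=1.0):
        self.ratio = ratio
        self.tokens = initial
        self._lock = threading.Lock()

    def record_request(self):
        with self._lock:
            self.tokens += self.ratio

    def try_spend(self):
        with self._lock:
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False


def call_with_retries(op, budget, deadline_s, max_attempts=3, base_s=0.05,
                      cap_s=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call op(remaining_seconds) with jittered backoff inside one deadline."""
    start = clock()
    budget.record_request()
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break
        try:
            # The remaining budget becomes this attempt's timeout.
            return op(remaining)
        except Exception as exc:
            last_error = exc
        # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
        backoff = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
        out_of_time = backoff >= deadline_s - (clock() - start)
        # A retry happens only if attempts, time, and shared budget all allow it.
        if attempt + 1 == max_attempts or out_of_time or not budget.try_spend():
            break
        sleep(backoff)
    raise CallFailed("no success within deadline and retry limits") from last_error
```

A single `RetryBudget` instance would be shared by every caller in the process. When the downstream service degrades, the budget drains and most requests fail fast after one attempt instead of multiplying the load.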
Latency Is Not Just Performance — It Is Correctness
Latency in distributed systems is not only about speed. It directly affects correctness. Timeouts, retries, and quorum decisions depend on latency assumptions. When latency shifts, behavior changes.
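Consider a quorum-based system with N=3 replicas and a read quorum of R=2. If one replica becomes slow but not dead, the outcome depends on the rest of the configuration. A read that has to wait on the slow replica inherits its latency, which shows up as a sharp rise in tail latency. And if the write quorum W is small enough that R + W ≤ N, a read can complete using only replicas that missed the latest write, and so return stale data.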
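One way to see when the stale-read case can arise is to model a strict quorum read. The sketch below assumes a write quorum size `w`, which the example above does not specify, and assumes each reply carries a version number. Both function names are hypothetical.

```python
def quorums_overlap(n, r, w):
    """True if every read quorum shares a replica with every write quorum."""
    return r + w > n


def quorum_read(replies, r):
    """Use the first r replies, given in arrival order as (version, value)
    pairs, and return the one with the highest version."""
    if len(replies) < r:
        raise TimeoutError("read quorum not reached")
    return max(replies[:r], key=lambda reply: reply[0])


# N=3, R=2, W=2: the quorums overlap, so any two replies include the newest write.
print(quorums_overlap(3, 2, 2))
# N=3, R=2, W=1: the only replica holding the newest write may be the slow one.
print(quorums_overlap(3, 2, 1))
```

With `w=1`, the two fastest replies can both come from replicas that never saw the latest write, so `quorum_read` returns an old version without any replica having failed.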
This is why distributed systems must be designed around:
- Timeout budgets
- Retry policies
- Failure detection thresholds
- Load shedding mechanisms
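Of the mechanisms listed above, load shedding is perhaps the simplest to sketch. The example assumes a fixed cap on in-flight requests. `LoadShedder` and `Overloaded` are hypothetical names, and a real service would tune the limit from measurements rather than hard-code it.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected instead of being queued."""


class LoadShedder:
    """Reject work immediately once max_in_flight requests are running."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, handler):
        # Non-blocking acquire: fail fast rather than queue behind slow work.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("in-flight limit reached")
        try:
            return handler()
        finally:
            self._slots.release()
```

Rejecting early keeps latency bounded for the requests that are accepted. Queuing them instead would let every caller's timeout budget drain while it waits.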
The Network Is the Computer
One of the most important mindset shifts is accepting that the network is not reliable. The classic “Fallacies of Distributed Computing” list false assumptions that engineers often make, such as that latency is zero or that the network never fails.
Reference: Fallacies of Distributed Computing
In production, violations of these assumptions manifest as retry storms, split-brain scenarios, or cascading failures.
Checklist: Are You in Distributed Systems Land?
- Does your service depend on another over the network?
- Can components fail independently?
- Do you use replication?
- Do you rely on consensus or coordination?
- Do you implement retries or timeouts?
If the answer to any of these is yes, you are operating a distributed system — whether you intended to or not.
Key Takeaways
- Distributed systems are defined by coordination under partial failure.
- Network boundaries introduce ambiguity, not just latency.
- Retries and timeouts are correctness mechanisms, not optimizations.
- Production incidents often emerge from small network deviations.
Understanding this definition is foundational. Everything else in distributed systems — consistency, consensus, replication, observability — builds on this constraint.