What Makes a System Distributed?

A distributed system is not defined by multiple machines, but by coordination under partial failure. This lesson reframes the definition through real production failure modes, latency tradeoffs, and operational complexity.

Most developers think a distributed system simply means “multiple servers.” That definition is dangerously incomplete. A distributed system is not defined by the number of machines, but by the need to coordinate work across failure-prone network boundaries. The defining property is not scale. It is partial failure.

If one component can fail while others continue running, and the system must continue operating correctly, you are dealing with a distributed system. That changes everything about how you design, debug, and reason about correctness.

The Real Defining Constraint: Partial Failure

In a monolithic application, failures are usually binary: the process crashes or it does not. In distributed systems, failures are ambiguous. A service may be slow, unreachable, partitioned, or still processing without responding. From the caller's side all of these look the same: no reply arrives before the timeout. You cannot reliably distinguish a slow node from a dead one.

This uncertainty creates the core complexity. Communicating across machines introduces:

Unbounded latency
Packet loss
Out-of-order delivery
Temporary partitions
Clock drift between machines

The moment your architecture depends on a network boundary for correctness, you are in distributed systems territory.
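The ambiguity can be made concrete with a toy model. The sketch below is illustrative only: `observe` is a hypothetical helper, and using `None` to stand for a node that never replies is an assumption of the model, not a real API.

```python
from typing import Optional


def observe(response_time_s: Optional[float], timeout_s: float) -> str:
    """Return what a caller sees after waiting up to timeout_s.

    response_time_s is how long the remote node would take to reply.
    None models a node that never replies (crashed or partitioned).
    """
    if response_time_s is not None and response_time_s <= timeout_s:
        return "reply"
    # Slow, dead, and partitioned nodes all end up here.
    return "timeout"


slow_node = observe(response_time_s=5.0, timeout_s=1.0)
dead_node = observe(response_time_s=None, timeout_s=1.0)
healthy_node = observe(response_time_s=0.2, timeout_s=1.0)
```

Here `slow_node` and `dead_node` are both `"timeout"`. The caller has to choose its next step (retry, fail over, give up) without knowing which case it is in.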

Production Scenario: The “It Works Locally” Illusion

Symptom

A checkout service times out intermittently under load. CPU usage looks normal. No crashes are visible.

Root Cause

The payment service is deployed in another availability zone. During peak traffic, cross-zone latency increases from 3 ms to 60 ms. Requests that now exceed their timeout are retried, and every retry adds load to an already slow path, so latency climbs further and the slowdown cascades.
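A rough way to see how retries multiply load is to count worst-case attempts. The function below is a hypothetical back-of-the-envelope helper. It assumes every layer in a call chain retries independently and that every attempt fails, which is the pessimistic bound rather than a measurement from this incident.

```python
def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Upper bound on calls reaching the deepest service for one user request.

    Each layer makes up to attempts_per_layer calls to the next layer,
    so the counts multiply across the chain.
    """
    return attempts_per_layer ** layers


# One hop (checkout -> payment) with 3 attempts: up to 3x the normal load.
single_hop = worst_case_attempts(attempts_per_layer=3, layers=1)

# Three layers that each make 3 attempts: up to 27x at the bottom.
three_layers = worst_case_attempts(attempts_per_layer=3, layers=3)
```

The growth is exponential in chain depth, which is why retry limits are usually set with the whole call path in mind rather than per service.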

Diagnosis

Tracing reveals multiple retries per request. Network metrics show transient latency spikes between zones.

Resolution

Introduce request deadlines.
Add exponential backoff with jitter.
Limit retries globally.
Co-locate latency-sensitive services.
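The first three items can be sketched together. This is a minimal illustration, not the fix used in the incident: `call_with_retries`, `backoff_delays`, and `DeadlineExceeded` are hypothetical names, the jitter strategy is the common "full jitter" variant, and catching every `Exception` is a simplification (real code would retry only errors known to be retryable). A global retry budget shared across callers is not shown.

```python
import random
import time


class DeadlineExceeded(Exception):
    """Raised when there is no time left in the request deadline to retry."""


def backoff_delays(base_s, cap_s, max_retries, rng=random):
    """Yield 'full jitter' delays: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    for attempt in range(max_retries):
        yield rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


def call_with_retries(op, deadline_s, base_s=0.05, cap_s=1.0, max_retries=3,
                      clock=time.monotonic, sleep=time.sleep):
    """Call op() until it succeeds, retries run out, or the deadline would be missed."""
    start = clock()
    delays = backoff_delays(base_s, cap_s, max_retries)
    while True:
        try:
            return op()
        except Exception as exc:  # simplification: treats every error as retryable
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            remaining = deadline_s - (clock() - start)
            if delay >= remaining:
                # Waiting would overrun the deadline, so stop instead of adding load.
                raise DeadlineExceeded("no time left for another attempt") from exc
            sleep(delay)
```

The randomized delay spreads retries from many clients over time so they do not arrive in synchronized waves, and the deadline check keeps a request from retrying after its caller has stopped waiting.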

The system did not fail because a machine crashed. It failed because coordination over a network boundary behaved differently under load.

Latency Is Not Just Performance — It Is Correctness

Latency in distributed systems is not only about speed. It directly affects correctness. Timeouts, retries, and quorum decisions depend on latency assumptions. When latency shifts, behavior changes.

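Consider a quorum-based system with N=3 replicas and a read quorum of R=2. If one replica becomes slow but not dead, every read depends on both remaining replicas answering promptly, so a brief stall on either one lands directly in tail latency. If the slow replica also lags on writes and the write quorum is too small to overlap the read quorum (R + W ≤ N), reads can return stale data.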
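The quorum example can be worked through numerically. The sketch below is a simplified model with hypothetical helper names: it assumes the reader contacts all N replicas and returns once R have replied, and it introduces a write quorum W, which the example above does not specify.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def quorum_read_latency(replica_latencies_ms, r: int):
    """Latency of a read that contacts all replicas and waits for the r-th fastest reply."""
    return sorted(replica_latencies_ms)[r - 1]


# N=3, R=2: with W=2 the quorums overlap; with W=1 a read can miss the latest write.
strict = quorums_overlap(n=3, r=2, w=2)   # True
sloppy = quorums_overlap(n=3, r=2, w=1)   # False

# One slow replica (500 ms) is masked while the other two are fast...
masked = quorum_read_latency([3, 500, 4], r=2)      # 4
# ...but a brief stall on either healthy replica now sets the read latency.
exposed = quorum_read_latency([3, 500, 60], r=2)    # 60
```

With one replica slow, the system has no spare capacity left in the read quorum, so latency on the two healthy replicas determines both response time and whether timeouts fire.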

This is why distributed systems must be designed around:

Timeout budgets
Retry policies
Failure detection thresholds
Load shedding mechanisms
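A timeout budget is often implemented as a deadline that travels with the request, so each hop gets only the time that is actually left. The class below is a minimal sketch under that assumption. `Deadline` and `timeout_for_hop` are hypothetical names, and the injectable `clock` parameter exists only to make the behavior easy to test.

```python
import time


class Deadline:
    """Tracks how much of a request's total time budget remains."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() == 0.0

    def timeout_for_hop(self, reserve_s=0.0):
        """Timeout to give a downstream call, keeping reserve_s for local work."""
        return max(0.0, self.remaining() - reserve_s)
```

A caller would create one `Deadline` when the request arrives, pass `timeout_for_hop()` to each downstream call, and reject work early once `expired()` is true, which is also a simple form of load shedding.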

The Network Is Not Reliable

One of the most important mindset shifts is accepting that the network is not reliable. The classic “Fallacies of Distributed Computing” describe incorrect assumptions engineers often make. These include believing that latency is zero or that the network is reliable.

Reference: Fallacies of Distributed Computing

In production, violations of these assumptions manifest as retry storms, split-brain scenarios, or cascading failures.

Checklist: Are You in Distributed Systems Land?

Does your service depend on another over the network?
Can components fail independently?
Do you use replication?
Do you rely on consensus or coordination?
Do you implement retries or timeouts?

If the answer to any of these is yes, you are operating a distributed system — whether you intended to or not.

Key Takeaways

Distributed systems are defined by coordination under partial failure.
Network boundaries introduce ambiguity, not just latency.
Retries and timeouts are correctness mechanisms, not optimizations.
Production incidents often start with small shifts in network behavior, such as a latency increase of a few tens of milliseconds.

Understanding this definition is foundational. Everything else in distributed systems — consistency, consensus, replication, observability — builds on this constraint.
