Distributed Locking

Understand safe coordination across nodes with distributed locks.

What Distributed Locks Are For

Distributed locks coordinate work across multiple processes or machines, for example “only one worker should run this job” or “only one node should mutate this resource at a time”. They are tempting, but in production they are a frequent source of subtle outages: networks drop or delay messages, clocks drift, and processes pause, so a node can believe it holds a lock that it has already lost.

Prefer Avoiding Locks

Production-first design tries to avoid distributed locks when possible by using safer patterns:

  • Idempotency: allow repeated execution safely
  • Single-writer per key: partition work so only one node owns a key range
  • Database constraints: unique indexes and transactions to enforce exclusivity
  • Leases with expiry: bounded ownership rather than indefinite locks
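
As one illustration of the database-constraint pattern above, the following minimal sketch lets a primary key decide which worker owns a job. It uses Python's built-in sqlite3 module with an in-memory database only so that it runs anywhere; the table name job_claims, the function try_claim, and the job and worker names are hypothetical and not part of any particular system.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, job_id: str, worker: str) -> bool:
    """Return True if this worker claimed the job, False if it was already claimed."""
    try:
        with conn:  # commits on success, rolls back if the INSERT fails
            conn.execute(
                "INSERT INTO job_claims (job_id, worker) VALUES (?, ?)",
                (job_id, worker),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists, so another worker owns this job.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_claims (job_id TEXT PRIMARY KEY, worker TEXT NOT NULL)"
)

first = try_claim(conn, "nightly-report", "worker-a")   # succeeds
second = try_claim(conn, "nightly-report", "worker-b")  # rejected by the constraint
```

The exclusivity check lives in the database, which already serializes the two inserts, so no separate lock service is involved.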

Why Locks Fail in Distributed Systems

  • Network partitions make lock ownership ambiguous
  • Clock skew breaks expiry assumptions
  • Process pauses (GC, CPU starvation) delay heartbeats
  • Lock services can become a single point of failure
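
The process-pause failure in the list above comes down to simple arithmetic, sketched here under assumed numbers. The function check_then_write and the timings are hypothetical; the sketch only models a holder that checks its lease, stalls, and then writes.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    check_passed: bool        # holder saw an unexpired lease before the pause
    write_at: float           # time at which the write actually lands
    wrote_after_expiry: bool  # True when a "valid" check led to a stale write


def check_then_write(expires_at: float, check_at: float, pause: float) -> Outcome:
    """Model a holder that checks its lease, stalls for `pause` seconds, then writes."""
    check_passed = check_at < expires_at
    write_at = check_at + pause
    return Outcome(check_passed, write_at, check_passed and write_at >= expires_at)


# The lease expires at t=10. The holder checks at t=9, then a pause costs 5 seconds.
outcome = check_then_write(expires_at=10.0, check_at=9.0, pause=5.0)
```

In this scenario the check passes at t=9 but the write lands at t=14, four seconds after the lease expired, by which time another node may already have acquired it.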

Locks vs Leases

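A lease is a lock with an expiry that must be renewed. Leases are generally safer because they are self-healing: if the holder dies or loses connectivity, the lease expires and another node can take over. They still need care: the holder must renew well before expiry and stop working as soon as a renewal fails, and downstream writes need fencing because a paused holder may not notice that its lease has lapsed.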
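
A minimal single-process sketch of lease semantics follows. The class LeaseTable and its methods are hypothetical stand-ins for a real lease service, and the clock is injected so that expiry can be demonstrated without sleeping; a real service would also need replication and durable storage, which this sketch leaves out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseTable:
    """In-memory stand-in for a lease service (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Lease] = {}

    def acquire(self, name: str, owner: str, ttl: float) -> bool:
        now = self._clock()
        current = self._leases.get(name)
        if current is not None and current.expires_at > now and current.owner != owner:
            return False  # someone else holds an unexpired lease
        self._leases[name] = Lease(owner, now + ttl)
        return True

    def renew(self, name: str, owner: str, ttl: float) -> bool:
        now = self._clock()
        current = self._leases.get(name)
        if current is None or current.owner != owner or current.expires_at <= now:
            return False  # lease lost; the caller must stop working
        current.expires_at = now + ttl
        return True


now = [0.0]  # fake clock, advanced by hand
table = LeaseTable(clock=lambda: now[0])

a_acquired = table.acquire("job", "a", ttl=10)  # granted until t=10
b_blocked = table.acquire("job", "b", ttl=10)   # refused, a still holds it
now[0] = 5.0
a_renewed = table.renew("job", "a", ttl=10)     # extended until t=15
now[0] = 20.0
b_takeover = table.acquire("job", "b", ttl=10)  # a's lease expired at t=15
a_late_renew = table.renew("job", "a", ttl=10)  # refused, b now owns the lease
```

The last two calls show the self-healing property: once the holder stops renewing, ownership transfers without any explicit unlock.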

Fencing Tokens Prevent Stale Owners

Even with leases, a slow process can believe it still owns the lock after another node has taken over. A fencing token is a monotonically increasing number that downstream systems can use to reject actions from stale owners.

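Fencing concept:
- Each lock acquisition returns a token T that is higher than every token issued before it
- Every write to the protected resource includes T
- Storage rejects any write whose token is lower than the highest token it has already accepted

Because the storage layer performs the check, a stale lock holder cannot corrupt state even if it still believes it owns the lock.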
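
The fencing steps above can be sketched as a store that remembers the highest token it has accepted. FencedStore and StaleTokenError are hypothetical names, and itertools.count stands in for whatever the lock service uses to issue increasing tokens; in a real system the comparison and the write would need to be atomic inside the storage layer.

```python
import itertools
from typing import Dict


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class FencedStore:
    def __init__(self) -> None:
        self._highest_token = 0
        self._data: Dict[str, str] = {}

    def write(self, key: str, value: str, token: int) -> None:
        # Reject writes from an owner that has since been superseded.
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} < latest token {self._highest_token}"
            )
        self._highest_token = token
        self._data[key] = value

    def read(self, key: str) -> str:
        return self._data[key]


tokens = itertools.count(1)  # stand-in for the lock service's token counter
token_a = next(tokens)       # A acquires the lock first
token_b = next(tokens)       # A's lease lapses, B acquires with a higher token

store = FencedStore()
store.write("config", "written by B", token_b)

try:
    store.write("config", "written by A", token_a)  # A wakes up and writes late
    rejected = False
except StaleTokenError:
    rejected = True
```

Here the late write from A is refused and the value written by B survives, even though A never learned that it had lost the lock.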

When You Actually Need a Distributed Lock

  • One-time migrations where double execution is dangerous
  • Leader-only jobs where partitioning is not available
  • External side effects that cannot be made idempotent

Production-First Takeaway

Distributed locks are risky. Avoid them when possible via idempotency and partitioned ownership. If you must lock, use leases with renewal, add fencing tokens, and treat the lock service as critical infrastructure with monitoring and failure testing.