CAP Theorem

Understand consistency, availability, and partition tolerance trade-offs in distributed systems.

What CAP Actually Says

CAP is not a slogan, and it is not about permanently picking two of three properties. It is a statement about what happens during a network partition: when nodes cannot reliably communicate, a distributed system must choose between answering every request with potentially inconsistent results (availability) and refusing some requests to preserve correctness (consistency).

Define the Terms Precisely

  • Consistency (C): every read sees the latest acknowledged write (or an error). In practice this means a single, linearizable view of state for the operation in question.
  • Availability (A): every request received by a non-failing node gets a non-error response, though not necessarily one that reflects the latest write.
  • Partition tolerance (P): the system continues operating despite network splits, packet loss, or delayed messages between nodes.
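
To make the definitions concrete, here is a minimal Python sketch of a single replica that has lost contact with its peers. The `Replica` class, its `mode` field, and the `Unavailable` exception are hypothetical names used only for illustration; real systems detect partitions through timeouts and failure detectors rather than a boolean flag.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised when a replica refuses a request to preserve consistency."""


@dataclass
class Replica:
    mode: str                  # "CP" or "AP" (illustrative labels, not a real API)
    value: str = "v1"          # last write this replica has seen
    partitioned: bool = False  # True when the replica cannot reach its peers

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # The replica cannot prove its value is the latest acknowledged
            # write, so it returns an error instead of possibly stale data.
            raise Unavailable("cannot confirm latest write during partition")
        # AP mode, or a healthy network: answer from local state.
        return self.value


# Assume both replicas missed a newer write that landed on the other side.
ap = Replica(mode="AP", partitioned=True)
cp = Replica(mode="CP", partitioned=True)

ap_result = ap.read()          # available, but possibly stale
try:
    cp.read()
    cp_result = "ok"
except Unavailable:
    cp_result = "unavailable"  # consistent, but not available
```

The same partition produces two different outcomes: the AP replica answers with what it has, and the CP replica refuses. Neither is wrong; each gives up a different property.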

Why 'P' Is Not Optional

In real production environments, partitions are not rare edge cases. They appear as packet loss, DNS failures, overloaded networks, cross-region jitter, or partial connectivity. If you build a distributed system, partitions will happen whether or not you plan for them. The decision you control is what the system does when one occurs.

CAP Decisions Are Per Operation

Many systems are not purely CP or AP. They choose based on workflow:

  • CP-like for money movements, permissions, inventory reservations
  • AP-like for feeds, analytics counters, recommendations
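
One way to record these per-operation choices is a small policy table that request handlers consult when coordination fails. The sketch below is an assumption about how such a table might look; the operation names and the `PartitionPolicy` enum are hypothetical, and defaulting unknown operations to fail-fast is a design choice, not a rule from CAP itself.

```python
from enum import Enum


class PartitionPolicy(Enum):
    FAIL_FAST = "fail_fast"      # CP-like: reject when coordination is impossible
    SERVE_STALE = "serve_stale"  # AP-like: answer from local, possibly stale state


# Hypothetical operation names, grouped the same way as the lists above.
POLICIES = {
    "transfer_funds": PartitionPolicy.FAIL_FAST,
    "update_permissions": PartitionPolicy.FAIL_FAST,
    "reserve_inventory": PartitionPolicy.FAIL_FAST,
    "render_feed": PartitionPolicy.SERVE_STALE,
    "increment_view_counter": PartitionPolicy.SERVE_STALE,
    "fetch_recommendations": PartitionPolicy.SERVE_STALE,
}


def policy_for(operation: str) -> PartitionPolicy:
    # Assumption: an unclassified operation is treated as correctness-critical
    # until someone explicitly marks it as safe to serve stale.
    return POLICIES.get(operation, PartitionPolicy.FAIL_FAST)
```

Keeping the table in one place makes the trade-off reviewable: adding an endpoint forces a decision about its behavior under partition.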

Consistency Is Not Free

Stronger consistency requires coordination. Coordination adds latency and can reduce availability during partitions. If your p95 latency budget is tight, strong consistency on the critical path may force you to simplify the design or isolate coordination to fewer operations.
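
The availability cost of coordination can be seen with simple quorum arithmetic. The sketch below assumes a majority-quorum scheme, which is one common way to coordinate but is not the only one; the function names are illustrative.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def can_coordinate(reachable: int, n: int) -> bool:
    """True if this side of a partition still holds a majority quorum."""
    return reachable >= majority(n)


# Five replicas split 3 / 2 by a partition.
n = 5
large_side = can_coordinate(3, n)  # this side can keep accepting writes
small_side = can_coordinate(2, n)  # this side must reject or stall
```

Under this assumption, clients routed to the two-replica side lose write availability until the partition heals. With four replicas split evenly, neither side reaches `majority(4) == 3`, so the whole system stops accepting coordinated writes.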

Availability Without Correctness Has a Cost

Choosing availability during partitions means accepting stale reads or conflicting writes. In production, that pushes complexity to your application layer: conflict resolution, reconciliation jobs, and UX patterns that communicate eventual convergence.
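
As an illustration of application-level reconciliation, the sketch below merges two diverged copies of an event counter using a grow-only counter (a G-Counter, one of the simplest CRDTs). This is one possible strategy for counters specifically, chosen here as an example; it does not generalize to arbitrary conflicting writes, and the node names are hypothetical.

```python
from typing import Dict

GCounter = Dict[str, int]  # node id -> increments recorded by that node


def increment(counter: GCounter, node: str, amount: int = 1) -> GCounter:
    """Return a copy with the given node's own slot increased."""
    updated = dict(counter)
    updated[node] = updated.get(node, 0) + amount
    return updated


def merge(a: GCounter, b: GCounter) -> GCounter:
    """Take the per-node maximum; safe to apply in any order, any number of times."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def total(counter: GCounter) -> int:
    return sum(counter.values())


shared = {"a": 3, "b": 2}           # state both sides agreed on before the partition
side_a = increment(shared, "a", 2)  # node a records 2 more events while isolated
side_b = increment(shared, "b", 2)  # node b records 2 more events while isolated
merged = merge(side_a, side_b)      # reconciliation after connectivity returns
```

Each node only ever increases its own slot, so taking the maximum per slot loses no increments. During the partition each side reports a stale total of 7; after the merge both converge on 9.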

Production Examples

Example 1: Permissions (CP leaning)
- If permission state is uncertain during a partition, deny access or return an error
- Prefer correctness to avoid data leakage
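
A fail-closed permission check along these lines might look like the following sketch. The `PermissionStoreUnreachable` exception and the `lookup` callable are assumptions standing in for whatever client your authoritative permission store provides.

```python
from typing import Callable


class PermissionStoreUnreachable(Exception):
    """Hypothetical error raised when the authoritative store cannot be reached."""


def is_allowed(user: str, resource: str, lookup: Callable[[str, str], bool]) -> bool:
    try:
        return lookup(user, resource)
    except PermissionStoreUnreachable:
        # Fail closed: when the permission state is uncertain, deny access
        # rather than risk exposing data on a stale grant.
        return False


def healthy_lookup(user: str, resource: str) -> bool:
    # Stand-in for a real store that is reachable.
    return (user, resource) == ("alice", "report")


def partitioned_lookup(user: str, resource: str) -> bool:
    # Stand-in for a store on the far side of a partition.
    raise PermissionStoreUnreachable("permission store unreachable")
```

With `healthy_lookup`, `is_allowed("alice", "report", ...)` grants access; with `partitioned_lookup`, the same call is denied even though alice may legitimately hold the permission. That lost availability is the price of the CP choice.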

Example 2: Social feed (AP leaning)
- If some writes are delayed, show a slightly stale feed
- Prefer availability to keep UX responsive
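
A stale-on-error fallback for the feed could be sketched as below. `FeedCache` and the `fetch` callable are hypothetical; the sketch assumes the upstream client raises Python's built-in `ConnectionError` when the feed service is unreachable.

```python
from typing import Callable, List, Tuple


class FeedCache:
    """Keeps the last successfully fetched feed and serves it when fetches fail."""

    def __init__(self) -> None:
        self._items: List[str] = []

    def get(self, fetch: Callable[[], List[str]]) -> Tuple[List[str], bool]:
        """Return (items, is_stale)."""
        try:
            self._items = fetch()
            return self._items, False
        except ConnectionError:
            # Prefer availability: serve the last known feed and flag it as
            # stale so the UI can indicate that newer posts may be missing.
            return self._items, True


def fetch_ok() -> List[str]:
    return ["post-2", "post-1"]


def fetch_partitioned() -> List[str]:
    raise ConnectionError("feed service unreachable")


cache = FeedCache()
fresh = cache.get(fetch_ok)           # normal operation
stale = cache.get(fetch_partitioned)  # same items, flagged as stale
```

Returning the `is_stale` flag alongside the items lets the caller decide how to communicate eventual convergence, for example with a "reconnecting" banner.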

How to Use CAP in Design

  • Write down partition scenarios (region split, replica lag, quorum loss).
  • Mark which operations must be correct vs can be stale.
  • Choose coordination only where correctness is required.
  • Design reconciliation paths for AP-like operations.

Production-First Takeaway

CAP is about behavior under partition. Treat it as an operational decision: which endpoints may return stale data, which must fail fast, and how you recover and reconcile after connectivity returns.