Latency vs Throughput

Latency is time per request; throughput is work per second. Optimizing one can harm the other, so design around the real bottleneck.

Two Metrics You Must Not Confuse

Latency is how long one request takes. Throughput is how many requests you can complete per second. Production issues often happen when teams optimize throughput while tail latency gets worse, or chase lower latency while overall capacity collapses.

Latency

Latency is measured in units of time, usually milliseconds (ms). In production, you care about the distribution, not the average. A system with 50ms average latency can still have 2s p99 latency under load, because queueing and slow dependencies dominate the tail.
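To make the gap between the average and the tail concrete, here is a minimal sketch using made-up sample data. The `percentile` helper and the latency values are assumptions for illustration (a simple nearest-rank percentile, not the method of any particular monitoring tool).

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number percentiles stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 that sat in a queue.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)   # about 50 ms
p50_ms = percentile(latencies_ms, 50)     # the typical request
p99_ms = percentile(latencies_ms, 99)     # the tail

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
```

In this sample the mean sits near 50ms while p99 is 2000ms, so a dashboard showing only the average would hide the requests users actually complain about.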

Throughput

Throughput is measured in requests per second (RPS) or operations per second. Throughput depends on available capacity: CPU, DB connections, network bandwidth, downstream quotas, and concurrency limits.
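As a rough back-of-the-envelope sketch, each limited resource puts a ceiling on throughput: a resource with a fixed number of slots, each held for some time per request, can complete at most slots divided by hold time requests per second. The resource names and numbers below are hypothetical, and real systems rarely reach this upper bound.

```python
def max_throughput_rps(slots, hold_time_ms):
    """Upper bound on completions per second for a resource with `slots`
    concurrent slots, each occupied for `hold_time_ms` per request."""
    if slots <= 0 or hold_time_ms <= 0:
        raise ValueError("slots and hold_time_ms must be positive")
    return slots * 1000 / hold_time_ms


# Hypothetical capacity figures for one service instance.
limits_rps = {
    "db_connections": max_throughput_rps(slots=20, hold_time_ms=50),
    "cpu_workers": max_throughput_rps(slots=8, hold_time_ms=10),
    "downstream_quota": 300.0,  # assumed to be given directly in RPS
}

# The lowest ceiling is the bottleneck that caps overall throughput.
bottleneck = min(limits_rps, key=limits_rps.get)
print(bottleneck, limits_rps[bottleneck])
```

With these assumed numbers the downstream quota is the bottleneck, so adding CPU or DB connections would not raise throughput.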

The Trade-Off: Concurrency and Queues

Increasing concurrency can increase throughput until you hit a bottleneck. After that point, extra concurrency mostly creates queues, memory pressure, and tail latency spikes. Production-first systems set explicit concurrency limits to avoid self-inflicted overload.

Typical pattern:
- Low concurrency: low throughput, low latency
- Medium concurrency: higher throughput, stable latency
- High concurrency beyond capacity: similar throughput, exploding p95/p99 latency
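One way to set an explicit concurrency limit is sketched below with `asyncio.Semaphore` from the Python standard library. The `ConcurrencyLimiter` and `Overloaded` names, and the choice to reject work once a small wait queue is full, are illustrative assumptions rather than a prescribed design.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the limiter rejects work instead of queueing it."""


class ConcurrencyLimiter:
    """Caps in-flight calls and bounds how many callers may wait."""

    def __init__(self, max_in_flight, max_waiting):
        self._slots = asyncio.Semaphore(max_in_flight)
        self._max_waiting = max_waiting
        self._waiting = 0

    async def run(self, make_call):
        # Shed load early: an unbounded queue only moves the latency
        # problem into memory and the tail.
        if self._slots.locked() and self._waiting >= self._max_waiting:
            raise Overloaded("all slots busy and wait queue is full")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            return await make_call()
        finally:
            self._slots.release()


async def demo():
    limiter = ConcurrencyLimiter(max_in_flight=2, max_waiting=1)
    in_flight = 0
    peak = 0

    async def fake_handler():
        # Stand-in for real work; tracks the peak number of concurrent calls.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return "ok"

    results = await asyncio.gather(
        *(limiter.run(fake_handler) for _ in range(5)),
        return_exceptions=True,
    )
    return peak, results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch, five simultaneous calls against a limit of two in flight and one waiting result in three completions and two fast rejections. Rejecting early keeps latency bounded for the requests that are accepted.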

Latency Budgets and Service Composition

Real requests are composed of multiple steps: validation, cache lookups, database queries, and calls to other services. A single slow dependency can dominate total latency. Production design allocates a budget to each step and enforces timeouts so that no step can consume the entire request budget.
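A minimal sketch of a per-request budget, assuming an asyncio-based service. The `Budget` class, the `run_step` helper, the step names, and all durations are hypothetical; the only library call relied on is `asyncio.wait_for`.

```python
import asyncio
import time


class Budget:
    """Tracks the time left for one request using a monotonic clock."""

    def __init__(self, total_s):
        self._deadline = time.monotonic() + total_s

    def remaining(self):
        return max(0.0, self._deadline - time.monotonic())


async def run_step(budget, step_cap_s, make_call):
    """Run one step with the smaller of its own cap and the remaining budget."""
    timeout = min(step_cap_s, budget.remaining())
    if timeout <= 0:
        raise asyncio.TimeoutError("request budget exhausted")
    return await asyncio.wait_for(make_call(), timeout=timeout)


async def handle_request():
    budget = Budget(total_s=0.3)  # whole-request budget (assumed)

    async def cache_lookup():
        await asyncio.sleep(0.001)
        return None  # simulate a cache miss

    async def slow_db_query():
        await asyncio.sleep(1.0)  # simulate a slow dependency
        return {"id": 1}

    cached = await run_step(budget, 0.05, cache_lookup)
    if cached is not None:
        return cached
    try:
        # The database step may use at most 0.1s, even if more budget is left.
        return await run_step(budget, 0.1, slow_db_query)
    except asyncio.TimeoutError:
        return {"error": "db timeout"}


if __name__ == "__main__":
    print(asyncio.run(handle_request()))
```

Here the slow query is cut off at its 0.1s cap instead of holding the request for a full second, and the caller gets a fast, explicit failure it can handle or degrade around.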

Throughput Requires Controlling Work

Throughput is not just “more servers.” It is also controlling per-request work: avoid N+1 queries, stream large payloads, and reduce unnecessary serialization. The cheapest throughput improvement is usually removing wasted work.
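The N+1 pattern can be illustrated with a small sketch using an in-memory SQLite database. The schema, table names, and data are made up for the example; the point is the number of round trips, which is counted by hand here.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with hypothetical data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ada'), (2, 'lin');
        INSERT INTO posts VALUES
            (1, 1, 'queues'), (2, 2, 'caches'), (3, 1, 'timeouts');
        """
    )
    return conn


def titles_with_authors_n_plus_one(conn):
    """One query for the posts, then one more query per post."""
    rows = conn.execute(
        "SELECT author_id, title FROM posts ORDER BY id"
    ).fetchall()
    queries = 1
    result = []
    for author_id, title in rows:
        (name,) = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()
        queries += 1
        result.append((title, name))
    return result, queries


def titles_with_authors_joined(conn):
    """Same result in a single query using a JOIN."""
    rows = conn.execute(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    ).fetchall()
    return rows, 1


if __name__ == "__main__":
    conn = make_db()
    print(titles_with_authors_n_plus_one(conn))
    print(titles_with_authors_joined(conn))
```

Both functions return the same rows, but the first issues one query per post on top of the initial query. Against a networked database, each extra query adds a round trip and holds a connection longer, which costs both latency and throughput.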

Choosing What to Optimize

- If users complain about slowness, optimize tail latency (p95/p99), not averages.
- If the system cannot handle peak traffic, optimize throughput and saturation (CPU, DB, queues).
- If cost is the constraint, reduce work per request and improve cache hit rates.

Production-First Takeaway

Latency and throughput are linked through capacity and queues. Optimize with measurements and explicit limits: without concurrency controls and timeouts, attempts to push throughput higher often make the user experience worse.
