Latency vs Throughput
Two Metrics You Must Not Confuse
Latency is how long one request takes. Throughput is how many requests you can complete per second. Production issues often arise when teams raise throughput at the expense of tail latency, or chase lower latency at the expense of overall capacity.
Latency
Latency is measured in units of time, typically milliseconds (ms). In production, you care about distributions, not averages. A system with 50ms average latency can still have 2s p99 latency under load because queues and slow dependencies dominate the tail.
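To make the gap between average and tail concrete, here is a minimal sketch that computes the mean and nearest-rank percentiles over a made-up sample. The `percentile` helper and the `latencies_ms` values are illustrative assumptions, not measurements from a real system.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 97 fast requests and 3 slow ones (values in ms).
latencies_ms = [20] * 97 + [1000] * 3

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
```

In this sample the mean stays near 50ms while the p99 is 1000ms, so a dashboard that shows only the average would hide the slow requests entirely.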
Throughput
Throughput is measured in requests per second (RPS) or operations per second. Throughput depends on available capacity: CPU, DB connections, network bandwidth, downstream quotas, and concurrency limits.
The Trade-Off: Concurrency and Queues
Increasing concurrency can increase throughput until you hit a bottleneck. After that point, extra concurrency mostly creates queues, memory pressure, and tail latency spikes. Production-first systems set explicit concurrency limits to avoid self-inflicted overload.
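One possible way to set an explicit concurrency limit is a semaphore around the expensive part of request handling. The sketch below uses Python's `asyncio.Semaphore`; the `MAX_IN_FLIGHT` value, the `handle` function, and the simulated work are assumptions for illustration only.

```python
import asyncio

MAX_IN_FLIGHT = 4  # assumed limit; in practice derive it from load tests

async def handle(request_id, limiter, stats):
    # At most MAX_IN_FLIGHT coroutines are inside this block at once.
    async with limiter:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for real work (DB call, RPC, ...)
        stats["in_flight"] -= 1
    return request_id

async def run(num_requests):
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(handle(i, limiter, stats) for i in range(num_requests))
    )
    return results, stats["peak"]

results, peak = asyncio.run(run(20))
print(f"completed={len(results)} peak_in_flight={peak}")
```

Note that a semaphore alone only moves the queue in front of the limit: excess requests still wait. A production setup would likely pair it with a bounded queue or early rejection so waiting time stays bounded too.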
Typical pattern:
- Low concurrency: low throughput, low latency
- Medium concurrency: higher throughput, stable latency
- High concurrency beyond capacity: similar throughput, exploding p95/p99 latency
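The pattern above can be approximated with a deliberately idealized closed-loop model based on Little's Law (requests in flight = throughput × latency). The function name, the worker count, and the service time are assumptions. The model describes mean latency only, so real p95/p99 behavior under overload is usually worse than it suggests.

```python
def closed_loop_model(concurrency, workers, service_time_s):
    """Idealized model: `concurrency` clients each keep exactly one request in flight.

    Throughput is capped by the number of workers; any request beyond that
    waits in a queue, which shows up as added latency.
    """
    throughput_rps = min(concurrency, workers) / service_time_s
    # Little's Law rearranged: latency = in-flight requests / throughput.
    latency_s = concurrency / throughput_rps
    return throughput_rps, latency_s

# Hypothetical service: 8 workers, 50ms of work per request.
for concurrency in (2, 8, 64):
    rps, latency = closed_loop_model(concurrency, workers=8, service_time_s=0.05)
    print(f"concurrency={concurrency:>2} throughput={rps:.0f} rps latency={latency * 1000:.0f} ms")
```

With these assumed numbers, going from 2 to 8 concurrent requests raises throughput from about 40 to about 160 requests per second at a steady 50ms, while going from 8 to 64 leaves throughput unchanged and pushes latency to about 400ms.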
Latency Budgets and Service Composition
Real requests are composed of multiple steps: validation, cache lookups, database queries, and upstream calls. A single slow dependency can dominate total latency. Production design allocates a latency budget to each step and enforces timeouts so that no single step can consume the entire request budget.
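As a rough sketch of budget enforcement, each step below gets a timeout equal to the smaller of its own cap and whatever remains of the overall request deadline. The helper `call_with_budget`, the step functions, and all durations are hypothetical; only `asyncio.wait_for` and `time.monotonic` are standard library calls.

```python
import asyncio
import time

async def call_with_budget(step, deadline, step_cap_s):
    """Run one step with a timeout bounded by its cap and the remaining request budget."""
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise asyncio.TimeoutError("request budget exhausted")
    return await asyncio.wait_for(step(), timeout=min(step_cap_s, remaining))

async def cache_lookup():
    await asyncio.sleep(0.01)  # simulated fast dependency
    return "cache-miss"

async def slow_database_query():
    await asyncio.sleep(1.0)  # simulated slow dependency
    return "row"

async def handle_request(total_budget_s=0.3):
    deadline = time.monotonic() + total_budget_s
    cache = await call_with_budget(cache_lookup, deadline, step_cap_s=0.1)
    try:
        row = await call_with_budget(slow_database_query, deadline, step_cap_s=0.1)
    except asyncio.TimeoutError:
        row = None  # fail fast or serve a degraded response instead of waiting
    return cache, row

started = time.monotonic()
result = asyncio.run(handle_request())
elapsed_s = time.monotonic() - started
print(result, f"elapsed={elapsed_s:.2f}s")
```

The slow query is cut off at its 100ms cap, so the request finishes well within the overall budget instead of waiting the full second.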
Throughput Requires Controlling Work
Throughput is not just “more servers.” It is also controlling per-request work: avoid N+1 queries, stream large payloads, and reduce unnecessary serialization. The cheapest throughput improvement is usually removing wasted work.
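To illustrate one form of wasted work, the sketch below contrasts an N+1 query pattern with a single JOIN, using an in-memory SQLite database. The schema, table names, and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'ada'), (2, 'lin');
INSERT INTO posts VALUES (1, 1, 'queues'), (2, 2, 'caches'), (3, 1, 'timeouts');
""")

def titles_n_plus_one(conn):
    """One query for the posts, then one extra query per post for its author."""
    queries = 1
    posts = conn.execute("SELECT author_id, title FROM posts ORDER BY id").fetchall()
    out = []
    for author_id, title in posts:
        name = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()[0]
        queries += 1
        out.append((title, name))
    return out, queries

def titles_joined(conn):
    """Same result in a single round trip."""
    rows = conn.execute(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    ).fetchall()
    return rows, 1

slow_rows, slow_queries = titles_n_plus_one(conn)
fast_rows, fast_queries = titles_joined(conn)
print(slow_queries, fast_queries, slow_rows == fast_rows)
```

Both functions return the same rows, but the first issues one query per post on top of the initial query, so its cost grows with the size of the result set.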
Choosing What to Optimize
- If users complain about slowness, optimize tail latency (p95/p99), not averages.
- If the system cannot handle peak traffic, optimize throughput and saturation (CPU, DB, queues).
- If cost is the constraint, reduce work per request and improve cache hit rates.
Production-First Takeaway
Latency and throughput are linked through capacity and queues. Optimize with measurements and explicit limits: without concurrency controls and timeouts, attempts to push throughput higher often make the user experience worse.