Latency vs Throughput (Why You Can’t Optimize Both Blindly)

Latency and throughput are not interchangeable performance metrics. In distributed systems, optimizing one often degrades the other, and misunderstanding this tradeoff leads directly to production incidents.

Latency and throughput are often discussed together, but they represent fundamentally different dimensions of system performance. Latency measures how long a single request takes to complete. Throughput measures how many requests the system can process over time. In distributed systems, confusing these two leads to incorrect scaling decisions and cascading failures.

Optimizing for throughput often increases latency. Optimizing for latency often limits throughput. The tension between them is structural, not accidental.

Definitions That Matter in Production

Latency

The time between sending a request and receiving a response. This includes network delay, queueing delay, processing time, and serialization/deserialization overhead.

Throughput

The number of requests processed per unit time. Usually measured in requests per second (RPS), transactions per second (TPS), or messages per second.

In distributed systems, throughput gains usually come from batching, parallelism, and replication. Each of these techniques can increase tail latency if it is not carefully bounded.
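The two definitions above are linked by Little's Law: the average number of in-flight requests equals throughput multiplied by mean latency. The following is a minimal sketch of that arithmetic; the function names and the numbers are illustrative assumptions, not measurements from any real system.

```python
def required_concurrency(throughput_rps: float, mean_latency_s: float) -> float:
    """Little's Law: average in-flight requests = throughput * mean latency."""
    return throughput_rps * mean_latency_s


def max_throughput(concurrency_limit: int, mean_latency_s: float) -> float:
    """Rearranged: with a fixed number of slots, throughput = slots / mean latency."""
    return concurrency_limit / mean_latency_s


# Sustaining 2000 RPS at 50 ms mean latency keeps about 100 requests in flight.
print(required_concurrency(2000, 0.05))

# With those same 100 slots, if latency rises to 200 ms the ceiling drops to about 500 RPS.
print(max_throughput(100, 0.2))
```

Read this way, a latency regression in a dependency is also a throughput regression for any service with a fixed pool of workers or connections.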

The Queueing Reality

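The relationship between latency and throughput is best understood through queueing behavior. As system utilization approaches capacity, latency does not increase linearly. It grows hyperbolically: in simple queueing models, mean response time scales with 1 / (1 − utilization), so each additional point of utilization costs more than the one before it.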

When a service operates at 50% CPU, response times may look stable. At 80%, slight traffic spikes cause queue buildup. At 95%, tail latency explodes.

This is why systems designed purely around maximum throughput collapse under real-world traffic variance.
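One way to see the shape of this curve is the textbook M/M/1 queue (a single server with random arrivals and random service times), where mean response time is the service time divided by (1 − utilization). Real services are not M/M/1 queues, so treat the sketch below as an illustration of the trend rather than a prediction; the 10 ms service time is an assumed value.

```python
def mm1_mean_response_time(service_time_s: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


SERVICE_TIME_S = 0.010  # assumed: 10 ms of work per request

for rho in (0.50, 0.80, 0.95):
    mean_ms = mm1_mean_response_time(SERVICE_TIME_S, rho) * 1000
    print(f"{rho:.0%} utilization -> {mean_ms:.0f} ms mean response time")
```

Under this model, the same 10 ms of work takes about 20 ms at 50% utilization, 50 ms at 80%, and 200 ms at 95%. The last 15 points of utilization cost four times more latency than the first 80.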

Production Scenario: The Autoscaling Illusion

Symptom

A service scales horizontally under load. Throughput increases as expected, but p99 latency becomes unstable and spikes unpredictably.

Root Cause

Autoscaling reacts to average CPU usage. However, traffic arrives in bursts. New instances take time to warm up. During that window, request queues grow, increasing tail latency.

Diagnosis

Monitoring shows average CPU below 70%.
p50 latency is stable.
p99 latency spikes during burst traffic.

This mismatch reveals a system optimized for throughput but not protected against queue buildup during bursts.

Resolution

Introduce headroom (operate at 60–70% target utilization).
Scale based on queue depth or request rate, not just CPU.
Implement load shedding for non-critical traffic.
Use concurrency limits to cap in-flight requests.
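The last two items can be sketched together as an admission check that rejects work instead of queueing it. This is a minimal illustration using `asyncio`, not a production limiter: the class name, the `hard_limit` and `soft_limit` parameters, and the `critical` flag are all hypothetical. Non-critical traffic is shed at the lower threshold, and all traffic is capped at the higher one.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class ConcurrencyLimiter:
    """Caps in-flight requests and sheds load instead of letting it queue."""

    def __init__(self, hard_limit: int, soft_limit: int) -> None:
        if not 0 < soft_limit <= hard_limit:
            raise ValueError("need 0 < soft_limit <= hard_limit")
        self.hard_limit = hard_limit  # cap for all traffic
        self.soft_limit = soft_limit  # lower cap for non-critical traffic
        self.in_flight = 0
        self.shed = 0

    async def run(
        self, handler: Callable[[], Awaitable[str]], critical: bool
    ) -> Optional[str]:
        limit = self.hard_limit if critical else self.soft_limit
        # No await between the check and the increment, so this is
        # race-free within a single asyncio event loop.
        if self.in_flight >= limit:
            self.shed += 1
            return None  # a real server would answer 429 or 503 here
        self.in_flight += 1
        try:
            return await handler()
        finally:
            self.in_flight -= 1


async def demo():
    limiter = ConcurrencyLimiter(hard_limit=4, soft_limit=2)

    async def handler() -> str:
        await asyncio.sleep(0.01)  # stand-in for real work
        return "ok"

    # A burst of 8 simultaneous requests; even-numbered ones are critical.
    results = await asyncio.gather(
        *(limiter.run(handler, critical=(i % 2 == 0)) for i in range(8))
    )
    return results, limiter.shed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Rejected requests fail fast rather than waiting in a queue, which keeps latency bounded for the requests that are admitted.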

Tail Latency Is the Real Metric

Average latency hides instability. Distributed systems are dominated by tail latency (p95, p99, p99.9). A single slow dependency can delay the entire request chain.
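A small synthetic example shows how an average can describe no actual request. The sample data and the `percentile` helper (a nearest-rank implementation) are assumptions for illustration only.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data: 97 fast requests at 20 ms and 3 slow ones at 2000 ms.
latencies_ms = [20] * 97 + [2000] * 3

print("mean:", statistics.mean(latencies_ms), "ms")
print("p50: ", percentile(latencies_ms, 50), "ms")
print("p99: ", percentile(latencies_ms, 99), "ms")
```

The mean here is 79.4 ms, yet no request took anywhere near that long: most finished in 20 ms and three took two seconds. Only the p99 exposes the slow ones.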

In fan-out architectures (one request calling multiple services), tail latency compounds. If each downstream service has a 1% chance of being slow, a request that waits on ten parallel calls has roughly a 10% chance (1 − 0.99^10 ≈ 9.6%, assuming the calls are independent) of being slow overall.
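The fan-out effect can be computed directly. The sketch below assumes downstream calls are slow independently of one another, which is a simplification: correlated slowness (for example, a shared dependency) changes the numbers. The function name is hypothetical.

```python
def prob_slow_fanout(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {prob_slow_fanout(0.01, n):.1%} of requests hit a slow call")
```

With a 1% per-call chance of slowness, a fan-out of 100 means about 63% of requests wait on at least one slow call, so the downstream p99 effectively becomes the caller's typical case.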

Reference: Jeffrey Dean and Luiz André Barroso, "The Tail at Scale," Communications of the ACM, 2013.

Throughput-Oriented Techniques That Increase Latency

Batch processing
Large message sizes
High concurrency pools
Aggressive retry policies
Write-heavy replication strategies

Each of these is typically adopted to raise the amount of work completed per unit time, but each can inflate tail latency under stress.
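Batching is the easiest of these to quantify. A request that arrives first in a batch must wait for the rest of the batch to arrive before any processing starts. The sketch below assumes evenly spaced arrivals and a batch that is only flushed when full (no flush timeout); the function name and the 1000 RPS arrival rate are illustrative.

```python
def batch_fill_wait_s(batch_size: int, arrival_rate_rps: float) -> float:
    """Time the first request in a batch waits for the batch to fill."""
    if batch_size < 1 or arrival_rate_rps <= 0:
        raise ValueError("batch_size must be >= 1 and arrival rate must be > 0")
    return (batch_size - 1) / arrival_rate_rps


ARRIVAL_RATE_RPS = 1000.0  # assumed steady arrival rate

for size in (1, 10, 100):
    wait_ms = batch_fill_wait_s(size, ARRIVAL_RATE_RPS) * 1000
    print(f"batch size {size:>3}: first request waits {wait_ms:.0f} ms before processing")
```

At 1000 RPS a batch of 100 adds up to 99 ms before any work begins, and the wait grows when traffic drops, which is why batchers are usually paired with a maximum flush interval.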

Latency-Oriented Techniques That Limit Throughput

Strict concurrency limits
Synchronous replication
Small batch sizes
Low timeout thresholds
Over-aggressive load shedding

Each is typically adopted to keep response times predictable, but each caps maximum sustainable throughput.

Operational Checklist

Are you measuring p99, not just averages?
Do you know your maximum safe utilization?
Do you have queue depth metrics?
Do retries amplify load under failure?
Do you have headroom for burst traffic?
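The retry question in the checklist has a simple worst-case answer. If every layer of a call chain retries independently and the deepest dependency is failing, attempts multiply at each layer. The sketch below assumes every attempt fails and that there is no retry budget or backoff; the function name is hypothetical.

```python
def worst_case_amplification(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the deepest service per user request when every attempt fails."""
    if attempts_per_layer < 1 or layers < 1:
        raise ValueError("attempts_per_layer and layers must be >= 1")
    return attempts_per_layer ** layers


# Three layers, each making up to 3 attempts (1 original call + 2 retries).
print(worst_case_amplification(3, 3))  # 27
```

A dependency that is already struggling can therefore receive 27 times its normal load, exactly when it is least able to handle it. Retrying at a single layer, or enforcing a retry budget, keeps this factor bounded.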

Key Takeaways

Latency and throughput are competing forces in distributed systems.
Operating near maximum capacity leaves no room to absorb bursts, so latency becomes unstable.
Tail latency determines user experience, not averages.
Capacity planning must include burst tolerance and queue behavior.

In distributed systems engineering, performance is not about maximizing numbers. It is about maintaining stability under uncertainty.
