The Thundering Herd (Cache, Locks, and Stampedes)

The thundering herd problem occurs when many clients simultaneously retry or access the same resource, overwhelming the system. This lesson explains retry storms, cache stampedes, mitigation strategies, and production safeguards.

The Thundering Herd Problem: When Synchronization Turns into a Stampede

The thundering herd problem occurs when a large number of clients or processes simultaneously attempt to access or retry the same resource. In distributed systems, synchronized behavior, especially during failures, can overwhelm infrastructure and amplify outages.

What begins as a small failure can cascade into systemic overload.

Common Causes

  • Simultaneous retry after dependency timeout.
  • Cache expiration at the same timestamp.
  • Leader election after node failure.
  • Batch job scheduled at fixed intervals.
  • Mass reconnect attempts after network outage.

Synchronized timing is the root trigger.

Retry Storm Scenario

Initial Event

A downstream service experiences a 2-second latency spike.

Client Behavior

  • All clients time out at 2 seconds.
  • All clients retry immediately.
  • The retries arrive on top of normal traffic, so load on the dependency roughly doubles at once.

The dependency collapses under retry amplification.

Cache Stampede Example

Assume cached data expires at the same moment for all clients.

Cache TTL = 60 seconds
All keys expire at 12:00:00
Thousands of requests regenerate data simultaneously

Backend service becomes overloaded.

Production Scenario: Redis Outage Amplified

Symptom

A short Redis disruption leads to a full API outage.

Root Cause

All application instances retry cache calls simultaneously with no jitter or backoff.

Impact

  • CPU saturation.
  • Connection pool exhaustion.
  • Cascading timeouts.

Resolution

  • Introduce exponential backoff with jitter.
  • Implement request-level circuit breakers.
  • Stagger cache expiration times.

Mitigation Strategies

1) Exponential Backoff with Jitter

retry_delay = base * 2^attempt + random_jitter

Jitter prevents synchronized retry waves.
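
The formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name, the default values, and the cap on the exponential term are assumptions added for the example.

```python
import random

def retry_delay(attempt, base=0.1, max_jitter=0.1, cap=30.0, rng=random):
    """Return the delay in seconds before retry number `attempt` (0-based).

    Implements retry_delay = base * 2^attempt + random_jitter.
    The cap (an assumption, not part of the formula above) keeps the
    exponential term from growing without bound.
    """
    exponential = min(cap, base * (2 ** attempt))
    # Uniform jitter spreads clients that failed at the same instant.
    return exponential + rng.uniform(0.0, max_jitter)

if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(retry_delay(attempt), 3))
```

Two clients that fail at the same moment now compute different delays, so their retries no longer land on the dependency in the same instant.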

2) Token Bucket or Rate Limiting

Limit the number of concurrent retries or regeneration attempts.
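
One way to enforce such a limit is a token bucket that every retry must draw from. The sketch below is illustrative only: the class name, the parameters, and the injectable clock are assumptions, and it is not thread-safe as written.

```python
import time

class TokenBucket:
    """Allow at most `capacity` retries in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return False when the budget is spent."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check try_acquire() before each retry and fail fast (or serve stale data) when it returns False, rather than adding more load.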

3) Randomized Cache Expiration

Add a random offset to each TTL so that keys written at the same time do not expire at the same time.
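
As a rough sketch, the 60-second TTL from the stampede example could be spread by plus or minus 10 percent. The function name and the spread value are assumptions chosen for illustration.

```python
import random

def jittered_ttl(base_ttl=60, spread=0.1, rng=random):
    """Return base_ttl shifted by a random offset of up to +/- spread * base_ttl."""
    offset = base_ttl * spread
    return base_ttl + rng.uniform(-offset, offset)

if __name__ == "__main__":
    # Each key gets its own TTL, so expirations no longer line up.
    print([round(jittered_ttl(), 1) for _ in range(5)])
```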

4) Request Coalescing

Allow only one request to regenerate shared data while others wait.
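
A minimal in-process sketch of this idea, sometimes called "single flight", is shown below. The class and method names are hypothetical, and coalescing across multiple application instances would need a shared lock, which is outside the scope of this example.

```python
import threading

class SingleFlight:
    """Run fn once per key at a time; concurrent callers share the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> state of the call currently in progress

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = call
        if leader:
            try:
                call["value"] = fn()          # only the leader hits the backend
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call["done"].set()            # wake every waiting follower
        else:
            call["done"].wait()               # followers wait for the leader
        if call["error"] is not None:
            raise call["error"]
        return call["value"]
```

With this in place, a thousand simultaneous cache misses for the same key produce one backend call instead of a thousand.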

5) Circuit Breakers

Stop sending requests, including retries, while a dependency is unhealthy.
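
The sketch below shows the basic state machine under simplifying assumptions: the class name, thresholds, and injectable clock are illustrative, and once the reset timeout elapses it lets every caller probe rather than a single one, which a production breaker would normally restrict.

```python
import time

class CircuitBreaker:
    """Open after consecutive failures; allow probes again after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self):
        """Return True if a request may be sent to the dependency."""
        if self.opened_at is None:
            return True
        # Open: reject until reset_timeout has passed, then permit probes.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()   # (re)open the circuit
```

Callers check allow() before each attempt and report the outcome with record_success() or record_failure().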

Autoscaling Interaction

Autoscaling can worsen thundering herd behavior:

  • New instances start simultaneously.
  • All perform warm-up calls.
  • The downstream service is overloaded.

Warm-up traffic must be controlled.

Leader Election Storm

In consensus systems:

  • Node failure triggers election.
  • Multiple nodes compete simultaneously.
  • Repeated election cycles increase instability.

Randomized election timeouts mitigate this risk.
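
The effect can be illustrated with a small simulation. The 150 to 300 ms range is an assumed example window and the function names are hypothetical; real consensus implementations choose their own values.

```python
import random

def election_timeout_ms(low=150, high=300, rng=random):
    """Pick a per-node election timeout uniformly from [low, high] milliseconds."""
    return rng.uniform(low, high)

def first_candidate(node_ids, rng=random):
    """Return the node whose timeout fires first, plus all drawn timeouts."""
    timeouts = {node: election_timeout_ms(rng=rng) for node in node_ids}
    winner = min(timeouts, key=timeouts.get)
    return winner, timeouts

if __name__ == "__main__":
    winner, timeouts = first_candidate(["n1", "n2", "n3"])
    print(winner, {n: round(t) for n, t in timeouts.items()})
```

Because the timeouts differ, one node usually becomes a candidate before the others and can win the election before a competing round starts.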

Observability Requirements

  • Retry rate metrics.
  • Dependency request burst detection.
  • Cache regeneration frequency.
  • Connection pool saturation.
  • Concurrent request counts.

Retry storms should be visible immediately.

Failure Injection Test

# Thundering herd validation
1) Simulate dependency slowdown
2) Trigger client timeouts
3) Observe retry burst behavior
4) Apply jitter and backoff
5) Confirm retry distribution smooths out
6) Validate system stability under repeated failures
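
Steps 3 to 5 can be approximated offline before running a real failure injection. The simulation below is a simplified model with assumed parameters (client count, 2-second timeout, 100 ms buckets); it only counts when retries would be sent and does not model the dependency itself.

```python
import random
from collections import Counter

def peak_retries_per_bucket(clients, jitter, timeout=2.0, bucket=0.1, rng=random):
    """Return the largest number of retries landing in one time bucket.

    Every client times out at `timeout` seconds and retries after an
    extra random delay in [0, jitter] seconds (no delay if jitter is 0).
    """
    buckets = Counter()
    for _ in range(clients):
        extra = rng.uniform(0.0, jitter) if jitter else 0.0
        retry_at = timeout + extra
        buckets[int(retry_at / bucket)] += 1
    return max(buckets.values())

if __name__ == "__main__":
    print("no jitter:  ", peak_retries_per_bucket(1000, jitter=0.0))
    print("with jitter:", peak_retries_per_bucket(1000, jitter=2.0))
```

Without jitter every retry falls into the same bucket; with jitter the peak drops to a small fraction of the client count, which is the smoothing that step 5 asks you to confirm.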

Common Anti-Patterns

  • Immediate retries without delay.
  • No jitter in backoff calculation.
  • Synchronized TTL expiration.
  • No circuit breaker.
  • No retry budget limit.

Synchronized systems fail together.

Operational Checklist

  • Are retries using exponential backoff with jitter?
  • Are cache TTLs randomized?
  • Is retry budget enforced?
  • Are retry storms detectable in monitoring?
  • Are circuit breakers configured correctly?

Key Takeaways

  • Thundering herd is caused by synchronized behavior.
  • Retries without jitter amplify failures.
  • Cache stampedes overload backends.
  • Backoff and randomness reduce synchronization.
  • Failure amplification must be engineered against.

The thundering herd problem demonstrates how coordinated client behavior can transform minor instability into catastrophic outage. Production-grade distributed systems must deliberately introduce randomness and rate control to prevent synchronized collapse.