The Thundering Herd (Cache, Locks, and Stampedes)
The Thundering Herd Problem: When Synchronization Turns into a Stampede
The thundering herd problem occurs when a large number of clients or processes simultaneously attempt to access or retry the same resource. In distributed systems, synchronized behavior — especially during failures — can overwhelm infrastructure and amplify outages.
What begins as a small failure can cascade into systemic overload.
Common Causes
- Simultaneous retries after a dependency timeout.
- Cache entries expiring at the same timestamp.
- Leader election after a node failure.
- Batch jobs scheduled at fixed intervals.
- Mass reconnect attempts after a network outage.
Synchronized timing is the root trigger.
Retry Storm Scenario
Initial Event
A downstream service experiences a 2-second latency spike.
Client Behavior
- All clients time out at 2 seconds.
- All of them retry immediately.
- The retries double the request load at the moment the dependency is already slow.
The dependency collapses under retry amplification.
Cache Stampede Example
Assume cached data expires at the same moment for all clients.
Cache TTL = 60 seconds
All keys expire at 12:00:00
Thousands of requests regenerate data simultaneously
Backend service becomes overloaded.
Production Scenario: Redis Outage Amplified
Symptom
A short Redis disruption leads to a full API outage.
Root Cause
All application instances retry cache calls simultaneously with no jitter or backoff.
Impact
- CPU saturation.
- Connection pool exhaustion.
- Cascading timeouts.
Resolution
- Introduce exponential backoff with jitter.
- Implement request-level circuit breakers.
- Stagger cache expiration times.
Mitigation Strategies
1) Exponential Backoff with Jitter
retry_delay = base * 2^attempt + random_jitter
Jitter prevents synchronized retry waves.
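The formula above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the names `retry_delay` and `call_with_retries`, the default values, and the `cap` on the exponential term are assumptions added for the example.

```python
import random
import time


def retry_delay(attempt, base=0.1, cap=10.0, max_jitter=0.1, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Follows retry_delay = base * 2^attempt + random_jitter, with an
    assumed cap so the exponential term cannot grow without bound.
    """
    backoff = min(cap, base * (2 ** attempt))
    # Additive jitter spreads out clients that failed at the same instant.
    return backoff + rng.uniform(0, max_jitter)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep):
    """Run `operation`, waiting a jittered backoff between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(retry_delay(attempt))
```

The `sleep` parameter is injectable only so the behavior can be exercised without real waiting. When many clients fail at once, a larger `max_jitter` relative to `base` spreads the retries more widely.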
2) Token Bucket or Rate Limiting
Limit the number of concurrent retries or regeneration attempts.
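One way to enforce such a limit is a token bucket that retries must draw from. The sketch below is illustrative; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import threading
import time


class TokenBucket:
    """Allow at most `capacity` retries in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if a retry may proceed now, False if it should be dropped."""
        with self._lock:
            now = self.clock()
            # Refill in proportion to elapsed time, never above capacity.
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```

A caller would check `try_acquire()` before each retry and fail fast when it returns `False`, so a burst of failures cannot translate into an unbounded burst of retries.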
3) Cache Randomized Expiration
Add random offset to TTL values.
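A minimal sketch of a randomized TTL, assuming a 60-second base and a spread of plus or minus 10 percent. The function name `jittered_ttl` and both defaults are illustrative choices, not values from a specific cache library.

```python
import random


def jittered_ttl(base_ttl=60.0, spread=0.1, rng=random):
    """Return base_ttl shifted by a random offset of up to +/- spread * base_ttl."""
    offset = base_ttl * spread
    return base_ttl + rng.uniform(-offset, offset)


# Keys written at the same moment now expire at slightly different times.
ttls = [jittered_ttl() for _ in range(5)]
```

The returned value would be passed as the expiry when the key is written, so keys populated together no longer expire together.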
4) Request Coalescing
Allow only one request to regenerate shared data while others wait.
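The idea can be sketched with threads as a "single flight" helper: the first caller for a key runs the loader, and concurrent callers for the same key wait for its result. The names `SingleFlight`, `_Call`, and `do` are hypothetical, chosen for this example.

```python
import threading


class _Call:
    """Shared state for one in-progress load."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call currently being loaded

    def do(self, key, fn):
        """Run fn() once per key at a time; concurrent callers share the outcome."""
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = fn()
            except BaseException as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every waiting follower
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

This coalesces only requests that overlap in time within one process. Coalescing across processes or hosts would need a shared lock or similar coordination, which this sketch does not attempt.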
5) Circuit Breakers
Prevent excessive retry attempts when a dependency is unhealthy.
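A simplified circuit breaker might look like the sketch below. The class names, the consecutive-failure threshold, and the reset timeout are assumptions for illustration. The sketch is not thread-safe, and production breakers usually track more state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of adding load to the dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Reset timeout elapsed: half-open, let this trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens it.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` rather than waiting for a timeout, which removes retry pressure from the struggling dependency.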
Autoscaling Interaction
Autoscaling can worsen thundering herd behavior:
- New instances start simultaneously.
- All perform warm-up calls.
- The downstream service is overloaded.
Warm-up traffic must be controlled.
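One way to control warm-up traffic is to delay each new instance by a random amount and pace its warm-up calls. The sketch below assumes a hypothetical `warm_up` helper. The 30-second start window and 0.5-second gap are arbitrary illustrative values.

```python
import random
import time


def warm_up(tasks, max_start_delay=30.0, gap=0.5, sleep=time.sleep, rng=random):
    """Run warm-up callables after a random start delay, pausing between them."""
    # Instances launched together begin warming up at different times.
    sleep(rng.uniform(0, max_start_delay))
    for task in tasks:
        task()
        sleep(gap)  # pace calls so one instance does not burst either
```

Each task would be a callable that primes a cache or opens a connection. The start window should be sized against how quickly new capacity is actually needed.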
Leader Election Storm
In consensus systems:
- Node failure triggers election.
- Multiple nodes compete simultaneously.
- Repeated election cycles increase instability.
Randomized election timeouts mitigate this risk.
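The effect of randomized timeouts can be illustrated with a small sketch: each node draws its own timeout, so one node usually times out first and can request votes before the others become candidates. The 150 to 300 ms range follows the commonly cited Raft suggestion. The function names are hypothetical.

```python
import random


def election_timeout(min_ms=150, max_ms=300, rng=random):
    """Draw a per-node election timeout in milliseconds."""
    return rng.uniform(min_ms, max_ms)


def first_candidate(node_ids, rng=random):
    """Return the node whose timeout fires first, plus all drawn timeouts."""
    timeouts = {node: election_timeout(rng=rng) for node in node_ids}
    winner = min(timeouts, key=timeouts.get)
    return winner, timeouts


winner, timeouts = first_candidate(["n1", "n2", "n3"])
```

With a fixed timeout, every follower would become a candidate at the same moment and split the vote. Drawing a fresh random timeout for each election makes repeated split votes unlikely.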
Observability Requirements
- Retry rate metrics.
- Dependency request burst detection.
- Cache regeneration frequency.
- Connection pool saturation.
- Concurrent request counts.
Retry storms should be visible immediately.
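As a rough illustration of retry-rate detection, the sketch below counts retries in a sliding time window and flags a storm when the count crosses a threshold. The class name `RetryRateMonitor`, the window, and the threshold are assumptions. A real deployment would more likely export a counter to its metrics system and alert there.

```python
import time
from collections import deque


class RetryRateMonitor:
    def __init__(self, window_seconds=10.0, threshold=100, clock=time.monotonic):
        self.window = window_seconds
        self.threshold = threshold
        self.clock = clock
        self.events = deque()  # timestamps of recent retries, oldest first

    def _evict(self, now):
        # Drop timestamps that have fallen out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()

    def record_retry(self):
        now = self.clock()
        self.events.append(now)
        self._evict(now)

    def is_storm(self):
        """True when retries inside the window reach the threshold."""
        self._evict(self.clock())
        return len(self.events) >= self.threshold
```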
Failure Injection Test
# Thundering herd validation
1) Simulate dependency slowdown
2) Trigger client timeouts
3) Observe retry burst behavior
4) Apply jitter and backoff
5) Confirm retry distribution smooths out
6) Validate system stability under repeated failures
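Before running this against real services, steps 3 to 5 can be approximated with a small simulation. The sketch below assumes 1000 clients that all time out at 2 seconds and measures the largest number of retries landing in any 100 ms bucket. The function name and all parameters are illustrative.

```python
import random
from collections import Counter


def peak_retries_per_bucket(num_clients, jitter, bucket=0.1, timeout=2.0, seed=0):
    """Largest number of retries that fall into a single time bucket."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(num_clients):
        # Every client times out together, then waits a random extra delay.
        retry_at = timeout + rng.uniform(0, jitter)
        buckets[int(retry_at / bucket)] += 1
    return max(buckets.values())


no_jitter = peak_retries_per_bucket(1000, jitter=0.0)
with_jitter = peak_retries_per_bucket(1000, jitter=5.0)
```

Without jitter every retry lands in the same bucket, so the peak equals the client count. With retries spread over 5 seconds the peak drops to a small fraction of that, which is the smoothing that step 5 asks to confirm.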
Common Anti-Patterns
- Immediate retries without delay.
- No jitter in backoff calculation.
- Synchronized TTL expiration.
- No circuit breaker.
- No retry budget limit.
Synchronized systems fail together.
Operational Checklist
- Are retries using exponential backoff with jitter?
- Are cache TTLs randomized?
- Is retry budget enforced?
- Are retry storms detectable in monitoring?
- Are circuit breakers configured correctly?
Key Takeaways
- Thundering herd is caused by synchronized behavior.
- Retries without jitter amplify failures.
- Cache stampedes overload backends.
- Backoff and randomness reduce synchronization.
- Failure amplification must be engineered against.
The thundering herd problem demonstrates how coordinated client behavior can transform minor instability into a catastrophic outage. Production-grade distributed systems must deliberately introduce randomness and rate control to prevent synchronized collapse.