Retries and Backoff (When Retries Are Harmful)
Retry Policy
- Retry transient failures (timeouts, 5xx) only.
- Use bounded attempts and exponential backoff.
- Add jitter to avoid synchronized retries.
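The first rule can be made concrete with a small classifier. This is a minimal sketch, not a definitive list: `HTTPStatusError` and `is_transient` are hypothetical names introduced here for illustration, and treating every 5xx status as retryable is an assumption taken directly from the bullet above.

```python
class HTTPStatusError(Exception):
    """Hypothetical exception carrying an HTTP status code (illustration only)."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_transient(exc: Exception) -> bool:
    """Return True only for failures the policy above treats as retryable."""
    if isinstance(exc, TimeoutError):
        return True  # timeouts are transient
    if isinstance(exc, HTTPStatusError):
        return 500 <= exc.status <= 599  # 5xx only; a 4xx will not fix itself
    return False  # anything else: fail fast instead of retrying
```

A retry loop would consult `is_transient(e)` in its `except` branch and re-raise immediately when it returns `False`.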
Simple Backoff with Jitter
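import random
import time


def retry(fn, *, attempts: int, base_backoff: float):
    """Call fn() up to `attempts` times with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            # In real code, catch only transient errors (timeouts, 5xx) here.
            if i == attempts - 1:
                raise  # out of attempts: propagate the last error without sleeping
            sleep = base_backoff * (2 ** i)  # exponential: base, 2x, 4x, ...
            sleep *= 0.5 + random.random()  # jitter: scale by a factor in [0.5, 1.5)
            time.sleep(sleep)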
Operational Checklist
- Every attempt must have its own timeout; otherwise a retry just extends the hang.
- Use a retry budget per request/service to limit amplification.
- Do not retry non-idempotent operations without safeguards.
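One way to read the retry-budget item is as a cap on retries relative to the number of original requests. The sketch below is an assumption-laden illustration, not a reference implementation: `RetryBudget`, `ratio`, and `min_retries` are hypothetical names, and the counters cover the object's whole lifetime, whereas a production budget would usually track a sliding time window.

```python
class RetryBudget:
    """Hypothetical ratio-based budget: retries may not exceed `ratio` of requests."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 1):
        self.ratio = ratio
        self.min_retries = min_retries  # floor so low-traffic callers can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_acquire(self) -> bool:
        """Return True and count the retry if the budget allows one more."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False  # budget exhausted: fail instead of amplifying load
```

A caller would invoke `record_request()` once per incoming request and check `try_acquire()` before each retry, giving up when it returns `False`.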
Failure Modes
- Retry storm: clients pile on a failing dependency.
- Hidden latency: retries can make p99 latency worse even if the success rate improves.
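Both failure modes can be estimated with simple arithmetic. The helpers below are a rough sketch under stated assumptions: `worst_case_latency` and `amplification` are hypothetical names, the backoff is `base_backoff * 2**i` with a jitter factor of at most 1.5, no sleep follows the final attempt, and every layer in a call chain retries independently.

```python
def worst_case_latency(attempts: int, base_backoff: float, timeout: float) -> float:
    """Upper bound, in seconds, on time spent if every attempt times out."""
    # Sleeps happen between attempts, so there are attempts - 1 of them.
    max_sleep = sum(base_backoff * (2 ** i) * 1.5 for i in range(attempts - 1))
    return attempts * timeout + max_sleep


def amplification(attempts_per_layer: int, layers: int) -> int:
    """Worst-case calls reaching the lowest dependency for one top-level request."""
    return attempts_per_layer ** layers
```

For example, 3 attempts at each of 3 layers can turn a single user request into 27 calls against the lowest dependency, which is how a retry storm builds.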