Retry Strategies (Jitter, Caps, Budgets)

Retries can heal transient failures or amplify outages. Production-safe retries require jitter, max attempts, and overall time budgets.

Retries Can Heal or Destroy a System

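Retries can recover from transient failures, but they also multiply load: a client that makes up to four attempts per request can send up to four times its normal traffic to a dependency that is already struggling. Under partial outages, retries are one of the most common causes of cascading failures. Production-safe retry strategies are strict about when to retry, how often, and within what budget.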

Retry Only the Right Failures

  • Good candidates: timeouts, connection resets, 502/503 from transient upstream issues
  • Bad candidates: most 4xx client errors (validation errors, auth failures, “not found”) and business rule rejections, because repeating the same request produces the same result
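
The two lists above can be turned into a small classification function. This is a minimal sketch in Python, not a complete policy: the name is_retryable and the RETRYABLE_STATUS set are hypothetical, and the set contains only the status codes named in the list.

```python
import socket

# Hypothetical set for this sketch: only the statuses named in the list above.
RETRYABLE_STATUS = {502, 503}

# Exception types treated as transient: timeouts and connection resets.
TRANSIENT_ERRORS = (TimeoutError, ConnectionResetError, socket.timeout)


def is_retryable(status=None, exc=None):
    """Return True only for failures that are plausibly transient."""
    if exc is not None:
        return isinstance(exc, TRANSIENT_ERRORS)
    # Anything not explicitly listed (4xx, business rejections) is not retried.
    return status in RETRYABLE_STATUS
```

Defaulting to "do not retry" for anything unlisted keeps the policy conservative: a new error type has to be added deliberately before it generates extra load.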

Idempotency Is Mandatory for Safe Retries

Retrying a write without idempotency can create duplicate side effects (double insert, double charge). If you retry writes, require idempotency keys and store outcomes for safe replay.
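
As an illustration of storing outcomes for replay, here is a minimal sketch in Python. The class IdempotentHandler and its in-memory dictionary are assumptions made for the example; a real service would likely need a durable store and protection against two concurrent requests carrying the same key.

```python
class IdempotentHandler:
    """Runs a write at most once per idempotency key and replays the outcome."""

    def __init__(self, operation):
        self._operation = operation
        # Assumption: in-memory storage, single-threaded use only.
        self._outcomes = {}

    def handle(self, idempotency_key, payload):
        # A retried request with a known key gets the stored outcome back
        # instead of triggering the side effect a second time.
        if idempotency_key in self._outcomes:
            return self._outcomes[idempotency_key]
        result = self._operation(payload)
        self._outcomes[idempotency_key] = result
        return result
```

With this in place, a client that retries a charge with the same key receives the original result, and the charge itself runs once.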

Use Jitter to Avoid Retry Storms

If many clients retry at the same time (for example after a deploy or a brief outage), synchronized retries amplify the problem. Add jitter (randomness) to spread retries over time.

Retry schedule example:
Attempt 1: immediate
Attempt 2: 100ms + jitter
Attempt 3: 200ms + jitter
Attempt 4: 400ms + jitter
Stop (cap attempts)
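
The schedule above can be generated with a few lines of Python. This is a sketch under stated assumptions: the function names, the 100ms base, and the choice of jitter drawn uniformly from zero up to half of the current step (jitter_ratio) are illustrative, not prescribed by the schedule.

```python
import random


def backoff_delay(attempt, base=0.1, jitter_ratio=0.5, rng=random):
    """Delay in seconds before the given attempt (1-based)."""
    if attempt <= 1:
        return 0.0  # the first attempt is immediate
    step = base * 2 ** (attempt - 2)  # 100ms, 200ms, 400ms, ...
    # Assumption: additive jitter, uniform in [0, step * jitter_ratio].
    return step + rng.uniform(0, step * jitter_ratio)


def schedule(max_attempts=4, **kwargs):
    """Delays for attempts 1..max_attempts; the list length is the attempt cap."""
    return [backoff_delay(n, **kwargs) for n in range(1, max_attempts + 1)]
```

Because each client draws its own random jitter, clients that failed at the same instant no longer retry at the same instant.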

Respect Time Budgets

Retries must fit within the request deadline. If the remaining budget is 80ms and the next backoff delay alone is 100ms, it is better to fail fast than to start a retry that has no chance of finishing in time. Budget-aware retries prevent wasted work and protect tail latency.
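
A budget check can sit at the top of the retry loop. The sketch below is one possible shape in Python; call_with_budget, min_attempt_s (a rough estimate of how long one attempt takes), and the injectable clock and sleep parameters are all hypothetical names chosen for the example.

```python
import time


def call_with_budget(operation, budget_s, delays, min_attempt_s=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Retry operation on TimeoutError while the remaining budget allows it."""
    deadline = clock() + budget_s
    last_error = None
    for delay in delays:  # one entry per attempt, so len(delays) caps attempts
        remaining = deadline - clock()
        # Fail fast: the wait plus one attempt would overrun the deadline.
        if remaining < delay + min_attempt_s:
            break
        if delay > 0:
            sleep(delay)
        try:
            return operation()
        except TimeoutError as err:
            last_error = err
    raise TimeoutError("retry budget exhausted") from last_error
```

Passing the clock and sleep functions as parameters is a design choice that makes the loop testable without real waiting.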

Cap Attempts and Add Circuit Breakers

A retry policy without caps is a denial-of-service mechanism against your own dependencies. Cap attempts, and combine retries with circuit breakers so you stop retrying when an upstream is clearly unhealthy.
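
To make the circuit breaker idea concrete, here is a deliberately small sketch in Python. The class, its thresholds (failure_threshold, reset_after_s), and the single trial call after the cool-down are assumptions for illustration; production breakers usually track error rates over a window and handle concurrency.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a call (or retry) may be sent upstream."""
        if self._opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial call through.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial
```

A retry loop would check allow() before every attempt and stop immediately when it returns False, so an unhealthy upstream sees no retry traffic until the cool-down passes.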

Production-First Takeaway

Retries are a tool, not a default. Retry only transient failures, add jitter to delays, cap attempts, respect time budgets, and make writes safe with idempotency.