Retries and Timeouts Basics
Retries and Timeouts Are Dependency Management
In production, most failures are not "your code crashed." They are timeouts, slowdowns, and intermittent errors in dependencies: databases, upstream HTTP services, caches, and queues. A reliable service must bound how long it waits and how often it retries.
Production mindset:
- Timeouts prevent request threads/tasks from getting stuck
- Retries recover from transient failures
- Limits prevent retry storms that amplify outages
Start With Timeouts: Always Bound Waiting
Before you add retries, ensure every external call has a timeout. Without timeouts, retries can make outages worse by stacking stuck tasks.
Sync example with std::time (conceptual):
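use std::time::Instant;

// Conceptual: check the deadline before starting each unit of work.
fn do_work_with_deadline(deadline: Instant) -> Result<(), String> {
    if Instant::now() >= deadline {
        return Err("deadline exceeded".to_string());
    }
    // ... perform the bounded unit of work here ...
    Ok(())
}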
In async Rust (Tokio), use timeouts around awaited operations:
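use tokio::time::{timeout, Duration};
async fn call_upstream() -> Result<String, String> {
    let result = timeout(Duration::from_secs(2), async {
        // pretend this is an HTTP call (placeholder value)
        Ok::<String, String>("response".to_string())
    })
    .await;
    // The outer Err means the timer elapsed before the future finished.
    result.unwrap_or_else(|_| Err("upstream call timed out".to_string()))
}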
Production rule: every network and database operation should be bounded by a timeout or deadline.
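The same rule applies to blocking code that does not run on Tokio. The following is a minimal sketch using only the standard library; the helper name run_with_timeout is hypothetical, and it assumes the work can be moved to a helper thread. It bounds how long the caller waits, not how long the work itself runs.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a helper thread and waits at most `limit` for its result.
/// On timeout the helper thread is NOT cancelled: it keeps running in the
/// background and its result is discarded.
fn run_with_timeout<T, F>(limit: Duration, work: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails if the caller already gave up; that is safe to ignore.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).map_err(|e| match e {
        mpsc::RecvTimeoutError::Timeout => "deadline exceeded".to_string(),
        // The worker dropped its sender without sending, e.g. it panicked.
        mpsc::RecvTimeoutError::Disconnected => "worker failed".to_string(),
    })
}
```

Because the worker is not cancelled, this pattern works best when the underlying call also has its own socket or driver timeout, so abandoned threads cannot pile up.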
Retries: Only for Transient Failures
Retries are appropriate when failure is likely transient:
- Network hiccup
- Temporary upstream overload (5xx, for example 502, 503, or 504)
- Connection reset
- Rate-limited responses (429, honoring the Retry-After header)
Retries are not appropriate for:
- Validation errors (most 4xx responses, such as 400 or 422)
- Authentication failures
- Schema errors
- Deterministic business rule failures
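One way to keep this distinction explicit is to classify failures in a single place before the retry loop sees them. The sketch below is illustrative only: the Failure enum and is_retryable function are hypothetical names, and the exact set of retryable status codes is an assumption that depends on the upstream service.

```rust
/// Hypothetical classification of a failed upstream call.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Failure {
    /// The connection dropped or was reset before a response arrived.
    ConnectionReset,
    /// The upstream answered with this HTTP status code.
    Status(u16),
}

/// Returns true when a retry has a reasonable chance of succeeding.
fn is_retryable(failure: Failure) -> bool {
    match failure {
        Failure::ConnectionReset => true,
        // Assumption: 429 (rate limited) and any 5xx count as transient.
        // A stricter policy might allow only 502, 503, and 504.
        Failure::Status(429 | 500..=599) => true,
        // Validation, authentication, schema, and business rule failures
        // are deterministic: retrying returns the same answer.
        Failure::Status(_) => false,
    }
}
```

With a classifier like this, the retry loop can return immediately on a non-retryable failure instead of spending its attempt budget on it.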
A Minimal Retry Loop with Backoff
Keep retries bounded and add a small backoff. Even a simple exponential backoff reduces thundering herds.
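use std::future::Future;
use tokio::time::{sleep, Duration};

// Calls `f` up to `attempts` times (at least once), sleeping between failures.
async fn retry_simple<F, Fut, T>(attempts: u32, mut f: F) -> Result<T, String>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, String>>,
{
    let mut remaining = attempts.max(1); // treat 0 as 1 to avoid underflow
    let mut backoff_ms: u64 = 100;
    loop {
        match f().await {
            Ok(v) => return Ok(v),
            Err(e) => {
                remaining -= 1;
                if remaining == 0 {
                    return Err(e); // out of attempts: surface the last error
                }
                sleep(Duration::from_millis(backoff_ms)).await;
                // Exponential backoff, capped at 1 second.
                backoff_ms = (backoff_ms * 2).min(1000);
            }
        }
    }
}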
Production note: keep the maximum backoff bounded. Unbounded backoff or unlimited retries can hide incidents and create long-tail latencies.
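To make the schedule easy to inspect and test, the delay calculation can be pulled out into a pure function. This is a small sketch with a hypothetical name, backoff_delay; with a 100 ms base and a 1000 ms cap it yields the same sequence as the loop above (100, 200, 400, 800, 1000 ms).

```rust
use std::time::Duration;

/// Delay to wait before retry number `retry` (0-based): doubles from `base`
/// on each retry and never exceeds `cap`.
fn backoff_delay(retry: u32, base: Duration, cap: Duration) -> Duration {
    // 2^retry, saturating so that very large retry counts cannot overflow.
    let factor = 2u32.checked_pow(retry).unwrap_or(u32::MAX);
    // If the multiplication itself overflows, fall back to the cap.
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}
```

Keeping the arithmetic separate from the sleeping makes it possible to verify the cap without waiting for real time to pass.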
Combine Timeout + Retry Correctly
Each attempt should be bounded by its own timeout, and the whole operation should also have an overall deadline when possible.
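use tokio::time::{timeout, Duration};
async fn call_with_timeout() -> Result<String, String> {
    timeout(Duration::from_secs(2), async {
        Ok::<String, String>("response".to_string()) // placeholder for the real call
    })
    .await
    .unwrap_or_else(|_| Err("attempt timed out".to_string()))
}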
Then retry that bounded attempt:
async fn call_with_retry() -> Result<String, String> {
    retry_simple(3, || async {
        call_with_timeout().await
    })
    .await
}
Production rule: never retry an unbounded operation.
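The overall deadline can be enforced by shrinking each attempt's timeout to whatever time is left. The helper below is a sketch under that assumption; next_attempt_timeout is a hypothetical name, and the current time is passed in explicitly so the logic is deterministic.

```rust
use std::time::{Duration, Instant};

/// Timeout to use for the next attempt: the per-attempt limit, shortened
/// when the overall deadline is closer. Returns None once the deadline has
/// been reached, which signals the caller to stop retrying.
fn next_attempt_timeout(
    now: Instant,
    deadline: Instant,
    per_attempt: Duration,
) -> Option<Duration> {
    // None if `now` is already past `deadline`.
    let remaining = deadline.checked_duration_since(now)?;
    if remaining.is_zero() {
        return None;
    }
    Some(remaining.min(per_attempt))
}
```

A retry loop would call this before each attempt, pass the returned duration to its timeout wrapper, and return a deadline error when it gets None.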
Idempotency: Retries Must Be Safe
Retries can cause duplicate effects if the operation is not idempotent. Reads are usually safe. Writes must be designed carefully.
Examples of safe retry:
- GET /resource (read)
- PUT with the same payload (idempotent update)
- POST with an idempotency key
Production rule: only retry writes if you are confident they are idempotent or protected by idempotency keys.
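To illustrate how an idempotency key protects a write, here is a minimal in-memory sketch of the receiving side. PaymentStore, charge, and the receipt numbering are all hypothetical; a real service would persist the keys and expire them after some retention window.

```rust
use std::collections::HashMap;

/// Hypothetical store that remembers the result for each idempotency key,
/// so a retried write is applied only once.
struct PaymentStore {
    results: HashMap<String, u64>, // idempotency key -> receipt id
    next_receipt: u64,
}

impl PaymentStore {
    fn new() -> Self {
        PaymentStore { results: HashMap::new(), next_receipt: 1 }
    }

    /// Applies the charge once per key; repeated calls with the same key
    /// return the stored receipt without a second side effect.
    fn charge(&mut self, idempotency_key: &str) -> u64 {
        if let Some(&receipt) = self.results.get(idempotency_key) {
            return receipt; // duplicate delivery of an earlier request
        }
        let receipt = self.next_receipt;
        self.next_receipt += 1;
        self.results.insert(idempotency_key.to_string(), receipt);
        receipt
    }
}
```

The client generates the key once per logical operation and reuses it on every retry, so a response lost to a timeout does not turn into a double charge.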
Prevent Retry Storms
When a dependency is down, retries can multiply load and make recovery harder. Basic protections:
- Small max retry count (2-3 attempts)
- Backoff with jitter (optional at this stage)
- Timeouts on every attempt
- Fail fast when the dependency is clearly unhealthy
As you mature, you add circuit breakers and bulkheads, but the minimal baseline is bounded retries + timeouts.
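For the jitter item in the list above, one common variant is "full jitter": instead of sleeping for the exact backoff delay, each client sleeps for a random fraction of it, so clients that failed together do not retry together. The sketch below uses a hypothetical function name, jittered_delay, and takes the random value as a parameter because the standard library has no random number generator; in practice it would come from an RNG crate.

```rust
use std::time::Duration;

/// "Full jitter": scales the capped backoff delay by a random fraction.
/// `unit` is expected to be a random value in [0.0, 1.0) supplied by the
/// caller.
fn jittered_delay(capped: Duration, unit: f64) -> Duration {
    // Guard against NaN or out-of-range input so mul_f64 cannot panic.
    let unit = if unit.is_nan() { 0.0 } else { unit.clamp(0.0, 1.0) };
    capped.mul_f64(unit)
}
```

The jittered value replaces the fixed sleep duration; the cap on the underlying backoff still applies, so the worst-case delay is unchanged.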
Observability Signals to Add
Even in a minimal setup, emit signals that help you detect dependency issues:
- Count retries
- Count timeouts
- Measure call latency
At least log with stable fields:
tracing::warn!(attempt = 2, "upstream call failed, retrying");
Production rule: do not log full error bodies from external services if they may contain sensitive data.
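If no metrics library is wired in yet, the three signals listed above can start as plain in-process counters. This is a rough sketch using only the standard library; CallMetrics and its methods are hypothetical names, and a real service would usually export these values through its metrics system rather than read them directly.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical counters that can be shared across threads or tasks.
#[derive(Default)]
struct CallMetrics {
    retries: AtomicU64,
    timeouts: AtomicU64,
    calls: AtomicU64,
    total_latency_micros: AtomicU64,
}

impl CallMetrics {
    fn record_retry(&self) {
        self.retries.fetch_add(1, Ordering::Relaxed);
    }

    fn record_timeout(&self) {
        self.timeouts.fetch_add(1, Ordering::Relaxed);
    }

    fn record_latency(&self, elapsed: Duration) {
        self.calls.fetch_add(1, Ordering::Relaxed);
        self.total_latency_micros
            .fetch_add(elapsed.as_micros() as u64, Ordering::Relaxed);
    }

    /// Runs `call`, records how long it took, and returns its result.
    fn timed<T>(&self, call: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = call();
        self.record_latency(start.elapsed());
        out
    }
}
```

Dividing total_latency_micros by calls gives only a mean; once a metrics library is available, a histogram is a better fit because tail latency is what timeouts react to.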
Production Checklist
- Timeouts on all external calls
- Retries only for transient failures
- Small bounded retry count (2-3)
- Backoff between attempts
- Retries applied only to idempotent operations (or writes protected by idempotency keys)
- Retry and timeout signals observable (logs/metrics)
Retries and timeouts are the first line of defense against real-world flakiness. They do not guarantee reliability, but without them, production incidents become more frequent and harder to recover from.