timeouts, resilience patterns

Design production-grade resilience in Spring Boot with explicit timeouts, retries, circuit breakers, bulkheads and failure isolation.

Why Missing Timeouts Cause Cascading Failures

In distributed systems, slow is worse than down. If a downstream dependency slows down and you have no timeouts:
- Threads block indefinitely
- Connection pools exhaust
- Request queues grow
- CPU rises due to context switching
- Kubernetes restarts pods
- Traffic shifts to fewer healthy instances
- Entire cluster destabilizes

This is how cascading failure begins.

Incident Scenario: One Slow Dependency Took Down Everything

A payment provider degraded from 150ms to 25 seconds. Your service had no HTTP read timeout. Each request waited. Thread pool of 200 threads filled. New requests queued. Liveness probe failed. Pods restarted. Load concentrated on remaining instances. Full outage in 90 seconds. Root cause: No timeout. No circuit breaker. No containment.

Anti-Pattern: Infinite Wait + Blind Retry

Common production mistakes:
- No connect timeout
- No read timeout
- Retry on every exception
- Unlimited retries
- Retry without backoff

Retries increase load on a failing dependency. That is not resilience. That is amplification.

Explicit Timeouts Everywhere

Every I/O boundary must define timeouts:
- HTTP client connect timeout
- HTTP client read timeout
- Database query timeout
- Messaging poll timeout
- Overall request deadline

Example RestTemplate configuration:

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration
public class HttpClientConfig {

    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder
            .setConnectTimeout(Duration.ofSeconds(2))
            .setReadTimeout(Duration.ofSeconds(3))
            .build();
    }
}

Never rely on defaults. Some clients and drivers default to no timeout at all. For example, a JDBC Statement's query timeout defaults to 0, which means no limit.
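The same rule applies outside RestTemplate. As a minimal sketch, assuming the JDK's built-in java.net.http.HttpClient is used instead, the connect timeout is set on the client and the response timeout on each request. The class name JdkHttpTimeouts and the URL are hypothetical, and the 2 and 3 second values simply mirror the configuration above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper: explicit timeouts with the JDK HTTP client.
final class JdkHttpTimeouts {

    // Connect timeout: upper bound on establishing the connection.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();
    }

    // Request timeout: upper bound on waiting for the response.
    // When it expires, send() fails with java.net.http.HttpTimeoutException.
    static HttpRequest priceRequest(String sku) {
        return HttpRequest.newBuilder()
            .uri(URI.create("https://pricing.example.invalid/prices/" + sku)) // hypothetical URL
            .timeout(Duration.ofSeconds(3))
            .GET()
            .build();
    }
}
```

Setting the timeout per request rather than only per client makes it possible to give different endpoints different limits.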

Circuit Breaker: Fail Fast Instead of Hanging

Circuit breaker behavior:
- Tracks failure rate
- Opens when threshold exceeded
- Immediately rejects new calls
- Periodically tests recovery

This prevents thread exhaustion. Example with Resilience4j:

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class PricingClient {

    @CircuitBreaker(name = "pricing", fallbackMethod = "fallback")
    public Price getPrice(String sku) {
        // call downstream
        return new Price();
    }

    public Price fallback(String sku, Throwable t) {
        return Price.unavailable(sku);
    }
}

Important: Fallback must not hide critical failures silently. Sometimes failing fast is safer than returning degraded data.
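To make the closed, open and half-open transitions concrete, here is a hand-rolled sketch of the state machine. It is not how Resilience4j is implemented: for brevity it counts consecutive failures instead of a failure rate over a sliding window, and it lets any call arriving in the half-open state act as a trial. The class name SimpleCircuitBreaker and its parameters are hypothetical.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative only: a simplified breaker that opens on consecutive failures.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openNanos;
    private final LongSupplier clock; // nanosecond clock, injectable for tests
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, Duration openDuration, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openNanos = openDuration.toNanos();
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!tryAcquire()) {
            return fallback.get(); // open: reject immediately, no downstream call
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean tryAcquire() {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openNanos) {
                return false; // still inside the open window
            }
            state = State.HALF_OPEN; // window elapsed: allow a trial call
        }
        return true;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed trial call, or too many failures in a row, opens the breaker.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

While the breaker is open, the caller's thread returns at once instead of waiting on the degraded dependency, which is what protects the thread pool.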

Retries: Use Carefully

Retry only for transient failures:
- network glitch
- 502, 503, 504
- connection timeout

Never retry:
- 400 validation errors
- authentication failures
- business conflicts

Retry must include:
- small retry count (2–3 max)
- exponential backoff
- jitter
- idempotency analysis

If the operation is not idempotent, a retry can duplicate side effects and corrupt data.
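The retry rules above can be sketched in plain Java. This is an illustration under assumptions, not a replacement for a library such as Resilience4j's retry module: the class Retry, its parameters and the "full jitter" variant (a random delay between zero and the capped exponential ceiling) are choices made for this example.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

// Hypothetical helper: bounded retries with exponential backoff and full jitter.
final class Retry {

    // Delay before the next attempt: random in [0, min(cap, base * 2^attempt)].
    static Duration backoff(int attempt, Duration base, Duration cap) {
        long exponential = base.toMillis() << Math.min(attempt, 20); // bound the shift
        long ceiling = Math.min(cap.toMillis(), exponential);
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(ceiling + 1));
    }

    // Runs the action up to maxAttempts times. Only exceptions accepted by
    // 'retryable' are retried; everything else is rethrown immediately.
    static <T> T call(Callable<T> action, int maxAttempts, Duration base, Duration cap,
                      Predicate<Exception> retryable) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                boolean lastAttempt = attempt + 1 >= maxAttempts;
                if (lastAttempt || !retryable.test(e)) {
                    throw e;
                }
                Thread.sleep(backoff(attempt, base, cap).toMillis());
            }
        }
    }
}
```

A caller would pass a predicate that accepts only transient failures, for example `e -> e instanceof java.io.IOException`, and would wrap only operations already verified to be idempotent.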

Bulkhead: Contain Concurrency

Without a bulkhead, one slow dependency can consume all threads. A bulkhead limits concurrent calls to a risky dependency, so other endpoints remain functional. Failure should degrade a feature, not kill the system.
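A semaphore is one simple way to cap concurrency per dependency. The sketch below assumes a hypothetical SemaphoreBulkhead class that rejects immediately when the limit is reached instead of queueing; Resilience4j provides a production-ready bulkhead with more options.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: at most maxConcurrentCalls run the action at once.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // compartment full: shed load, do not queue
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always give the permit back
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With one bulkhead instance per downstream dependency, a slow dependency can occupy only its own permits, and threads serving other endpoints stay free.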

Timeout Budgeting

If the overall SLA is 2 seconds, split it across the steps of the request (assuming they run one after another):
- DB: 400ms
- Downstream A: 600ms
- Downstream B: 500ms
- Internal processing: 300ms
- Buffer: 200ms

Without budgeting, components compete blindly.
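One way to enforce such a budget is to carry a deadline through the request and cap every step's timeout by the time still remaining. The following is a minimal sketch; the Deadline class and its method names are hypothetical.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Hypothetical helper: tracks how much of the request budget is left.
final class Deadline {
    private final long expiresAtNanos;
    private final LongSupplier clock; // nanosecond clock, injectable for tests

    private Deadline(long expiresAtNanos, LongSupplier clock) {
        this.expiresAtNanos = expiresAtNanos;
        this.clock = clock;
    }

    static Deadline after(Duration budget) {
        return after(budget, System::nanoTime);
    }

    static Deadline after(Duration budget, LongSupplier clock) {
        return new Deadline(clock.getAsLong() + budget.toNanos(), clock);
    }

    // Time left in the overall budget, never negative.
    Duration remaining() {
        long left = expiresAtNanos - clock.getAsLong();
        return left > 0 ? Duration.ofNanos(left) : Duration.ZERO;
    }

    // Timeout for one step: its allocation, capped by what is left overall.
    Duration timeoutFor(Duration allocation) {
        Duration left = remaining();
        return allocation.compareTo(left) < 0 ? allocation : left;
    }
}
```

For example, a request would start with `Deadline.after(Duration.ofSeconds(2))` and use `timeoutFor(Duration.ofMillis(600))` as the timeout for Downstream A. If earlier steps overran, later steps receive a shorter timeout, and a result of zero means the request should fail fast.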

Metrics and Observability

Monitor:
- Timeout count
- Circuit open events
- Retry attempts
- Fallback usage
- Latency percentiles (P95, P99)

If fallback usage increases, the system is degraded even if uptime looks fine.
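As a rough illustration of why fallback usage deserves its own signal, the sketch below keeps named counters and derives a fallback ratio from them. The class ResilienceCounters and the counter names are hypothetical; a real service would more likely publish these values through its metrics library than keep them in a hand-written map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory counters for resilience events.
final class ResilienceCounters {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    long count(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0 : adder.sum();
    }

    // Share of calls answered by a fallback, in [0, 1].
    double fallbackRatio() {
        long total = count("calls.total");
        return total == 0 ? 0.0 : (double) count("calls.fallback") / total;
    }
}
```

An alert on the fallback ratio catches the case where every request still returns successfully but a growing share of responses is degraded.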

Checklist

- Define explicit timeouts on all I/O
- Implement circuit breakers on remote dependencies
- Limit retries and add backoff + jitter
- Verify idempotency before enabling retry
- Use bulkheads to isolate risky calls
- Define timeout budgets per endpoint
- Monitor resilience metrics
- Perform failure injection testing

Systems fail under stress. Resilience determines whether they collapse or degrade gracefully.
