Bulkheads (Stop One Fire From Burning the Ship)

The bulkhead pattern isolates resources so failures in one component do not exhaust the entire system. This lesson explains thread pool isolation, connection pool partitioning, and production design strategies to prevent internal cascading failures.

Bulkhead Pattern: Isolating Failures Inside a Service

The bulkhead pattern isolates resources so that failure in one subsystem does not consume all available capacity. The name comes from ship design, where compartments are separated so that flooding in one area does not sink the entire vessel.

In distributed systems, bulkheads prevent internal cascading failures caused by shared resource exhaustion.

The Core Problem: Shared Resource Collapse

Most services share finite resources:

  • Thread pools
  • Connection pools
  • CPU capacity
  • Memory
  • Async worker queues

If one dependency slows down, calls to that dependency may block threads or saturate connections. Without isolation, unrelated features degrade as well.

Production Scenario: Reporting Endpoint Kills Core API

Symptom

Heavy traffic to the reporting endpoint causes the login and checkout APIs to time out.

Root Cause

All endpoints shared the same thread pool and database connection pool. Slow reporting queries held threads and connections long enough to exhaust both pools, leaving none for login and checkout.

Diagnosis

  • Thread pool saturation at 100 percent.
  • Database connections exhausted.
  • Core endpoints blocked waiting for threads.

Resolution

  • Separate thread pools for reporting vs transactional endpoints.
  • Limit connection pool allocation per feature.
  • Apply request rate limiting to the reporting API.

Types of Bulkhead Isolation

1) Thread Pool Isolation

Allocate separate execution pools per dependency or feature.

core_pool = 50 threads
reporting_pool = 10 threads

If reporting slows down or fails, it can exhaust only its own 10 threads. Core traffic continues on its own pool.
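As a minimal sketch of this split, assuming Python's standard `concurrent.futures` executors, the example below saturates a reporting pool and shows a core request still completing. The handler names (`slow_report`, `login`) and the `release_reports` event are hypothetical stand-ins, not part of any real service.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Pool sizes mirror the sizing above; all names here are illustrative.
core_pool = ThreadPoolExecutor(max_workers=50, thread_name_prefix="core")
reporting_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="reporting")

release_reports = threading.Event()

def slow_report(report_id):
    # Stand-in for a slow reporting query: blocks until released.
    release_reports.wait(timeout=5)
    return f"report-{report_id}"

def login(user):
    # Stand-in for a fast transactional handler.
    return f"ok:{user}"

# Saturate the reporting pool: 10 tasks run, the other 20 wait in its queue.
report_futures = [reporting_pool.submit(slow_report, i) for i in range(30)]

# Core work still completes because it never competes for reporting threads.
login_result = core_pool.submit(login, "alice").result(timeout=1)
print(login_result)

# Unblock the reports and clean up.
release_reports.set()
report_results = [f.result(timeout=5) for f in report_futures]
core_pool.shutdown()
reporting_pool.shutdown()
```

One caveat with this sketch: `ThreadPoolExecutor` queues excess work without a bound, so a saturated pool here delays reporting requests rather than rejecting them. A production bulkhead would usually cap the queue or reject outright.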

2) Connection Pool Partitioning

Divide database or HTTP connection pools per dependency.

This prevents one slow backend from consuming all connections.
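One way to sketch partitioning without a real database driver is to guard checkouts with one semaphore per feature. The `PartitionedPool` class, the partition limits, and the placeholder connection strings below are all assumptions made for illustration. A real pool would hand out driver connections instead.

```python
import threading
from contextlib import contextmanager

class PartitionedPool:
    """Caps how many connections each feature may hold at once (illustrative)."""

    def __init__(self, limits):
        # limits: feature name -> max concurrent connections (hypothetical values).
        self._slots = {
            name: threading.BoundedSemaphore(n) for name, n in limits.items()
        }

    @contextmanager
    def connection(self, feature, timeout=0.05):
        slot = self._slots[feature]
        # Wait briefly for a slot, then fail fast instead of blocking forever.
        if not slot.acquire(timeout=timeout):
            raise TimeoutError(f"no free connection in partition {feature!r}")
        try:
            yield f"{feature}-conn"  # placeholder for a real connection object
        finally:
            slot.release()

pool = PartitionedPool({"core": 3, "reporting": 1})

with pool.connection("reporting"):
    # The reporting partition is now full, so a second checkout is refused...
    try:
        with pool.connection("reporting"):
            reporting_refused = False
    except TimeoutError:
        reporting_refused = True
    # ...while the core partition is unaffected.
    with pool.connection("core") as conn:
        core_conn = conn
```

The short checkout timeout is a design choice: a caller that cannot get a connection quickly learns so immediately, rather than holding a request thread while it waits.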

3) Queue Isolation

Use separate message queues for critical and non-critical workloads.

High-volume background processing should not delay transactional flows.
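The sketch below uses in-process `queue.Queue` objects to stand in for separate message queues. A real deployment would more likely use separate broker queues or topics. The queue sizes, job names, and the `worker` function are hypothetical.

```python
import queue
import threading

# Bounded queues, one per workload class. Sizes are illustrative assumptions.
critical_q = queue.Queue(maxsize=100)
background_q = queue.Queue(maxsize=1000)
processed = {"critical": [], "background": []}

def worker(q, label):
    # Each workload has its own consumer, so backlogs never mix.
    while True:
        item = q.get()
        if item is None:  # sentinel: stop this worker
            q.task_done()
            break
        processed[label].append(item)
        q.task_done()

critical_worker = threading.Thread(
    target=worker, args=(critical_q, "critical"), daemon=True
)
critical_worker.start()

# Flood the background queue before its consumer starts, building a deep backlog.
for i in range(500):
    background_q.put_nowait(("resize-image", i))

# The critical item is handled without waiting behind that backlog.
critical_q.put(("charge-card", 1))
critical_q.join()
backlog_when_critical_done = background_q.qsize()

# Now drain the background work and stop both workers.
background_worker = threading.Thread(
    target=worker, args=(background_q, "background"), daemon=True
)
background_worker.start()
background_q.join()
critical_q.put(None)
background_q.put(None)
critical_worker.join()
background_worker.join()
```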

4) Process-Level Isolation

Run critical and non-critical workloads in separate services or containers.

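This provides the strongest isolation boundary of the four: separate processes share no thread pools, heaps, or connection pools, and CPU and memory limits can be enforced by the operating system or container runtime.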

Bulkheads vs Circuit Breakers

  • Circuit breakers stop traffic to failing dependencies.
  • Bulkheads isolate resource consumption inside a service.

They complement each other but solve different problems.

Capacity Planning Considerations

Isolation introduces tradeoffs:

  • Underutilized resources in one pool cannot be borrowed by another.
  • Misconfigured limits can cause premature throttling.
  • Too many small pools increase operational complexity.

Bulkheads require careful capacity sizing.
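A common starting point for sizing is Little's law: average concurrency equals arrival rate multiplied by average time in the system. The sketch below applies it with a headroom multiplier. The `pool_size` helper, the 1.25 headroom factor, and the traffic figures are hypothetical inputs chosen for illustration, not measurements.

```python
import math

def pool_size(arrival_rate_per_s, latency_ms, headroom=1.25):
    """Estimate pool size from Little's law plus a burst headroom factor.

    Mean concurrency = arrival rate x mean latency. The headroom
    multiplier is an assumption and should be tuned from load tests.
    """
    mean_concurrency = arrival_rate_per_s * latency_ms / 1000
    return math.ceil(mean_concurrency * headroom)

# Hypothetical traffic: fast, frequent core calls and slow, rare reports.
core_threads = pool_size(arrival_rate_per_s=400, latency_ms=100)
reporting_threads = pool_size(arrival_rate_per_s=4, latency_ms=2000)

print(core_threads, reporting_threads)
```

With these made-up inputs the estimate comes out to 50 core threads and 10 reporting threads. Real sizing should use measured latency percentiles rather than averages, because a slow dependency raises latency and therefore the concurrency each pool must absorb.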

Graceful Degradation Strategy

When a bulkhead is saturated:

  • Reject requests with controlled error responses.
  • Degrade non-essential features.
  • Preserve core functionality.

The goal is controlled failure, not total outage.
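The behaviour above can be sketched with a non-blocking semaphore: a saturated bulkhead rejects immediately, and the caller substitutes a degraded response. The `Bulkhead` class, the `BulkheadFull` exception, and the product-page scenario are hypothetical examples rather than a specific library's API.

```python
import threading

class BulkheadFull(Exception):
    """Raised when a bulkhead has no free slot."""

class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        # Reject immediately instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            return fn(*args)
        finally:
            self._slots.release()

recommendations = Bulkhead("recommendations", max_concurrent=1)

def product_page(product_id, fetch_recommendations):
    # Core content is always served; recommendations are non-essential.
    page = {"product": product_id, "status": 200}
    try:
        page["recommendations"] = recommendations.run(
            fetch_recommendations, product_id
        )
    except BulkheadFull:
        page["recommendations"] = []  # degrade the optional feature
        page["degraded"] = True
    return page

def busy_call(_):
    # While this call holds the only slot, a second request arrives.
    return product_page("sku-2", lambda pid: ["never-called"])

degraded_page = recommendations.run(busy_call, None)
normal_page = product_page("sku-1", lambda pid: ["sku-9"])
```

The nested call is only a deterministic way to simulate a second request arriving while the single slot is taken. In a real service the contention would come from concurrent requests.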

Observability Requirements

  • Thread pool utilization per pool
  • Queue depth per workload
  • Connection pool saturation per dependency
  • Rejected request count per bulkhead
  • Latency distribution per feature

Without per-bulkhead metrics, isolation effectiveness cannot be validated.
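As a rough illustration of per-bulkhead metrics, the sketch below keeps utilization, rejection, and latency counters for a single compartment. The `BulkheadMetrics` class and its field names are assumptions. A real service would export these values through its metrics library instead of holding them in memory.

```python
import threading
from dataclasses import dataclass, field

@dataclass
class BulkheadMetrics:
    name: str
    capacity: int
    in_flight: int = 0
    rejected: int = 0
    completed: int = 0
    latencies_ms: list = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def try_enter(self):
        # Admit the request if a slot is free; otherwise count a rejection.
        with self._lock:
            if self.in_flight >= self.capacity:
                self.rejected += 1
                return False
            self.in_flight += 1
            return True

    def leave(self, latency_ms):
        # Record completion and the observed latency for this request.
        with self._lock:
            self.in_flight -= 1
            self.completed += 1
            self.latencies_ms.append(latency_ms)

    def snapshot(self):
        with self._lock:
            return {
                "bulkhead": self.name,
                "utilization": self.in_flight / self.capacity,
                "rejected_total": self.rejected,
                "completed_total": self.completed,
                "max_latency_ms": max(self.latencies_ms, default=0),
            }

metrics = BulkheadMetrics("reporting", capacity=2)
metrics.try_enter()
metrics.try_enter()
admitted = metrics.try_enter()      # third request finds the bulkhead full
saturated = metrics.snapshot()
metrics.leave(latency_ms=120)
after_release = metrics.snapshot()
```

Labelling every value with the bulkhead name is the important part: it lets a dashboard show that the reporting compartment is saturated while the core compartment is not.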

Failure Injection Test

# Bulkhead validation test
1) Generate heavy load on the non-critical (reporting) endpoint
2) Observe reporting pool saturation
3) Confirm core endpoint latency remains stable
4) Verify controlled rejection occurs only in isolated pool
5) Measure resource utilization boundaries
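The steps above can be approximated in-process with a small harness. This is a sketch under stated assumptions: `BoundedPool`, the pool sizes, and the `slow_report` and `fast_login` handlers are hypothetical, and a real test would drive load against a deployed service with a load generator.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class BoundedPool:
    """Thread pool that rejects work once all workers are busy (no hidden queue)."""

    def __init__(self, name, size):
        self.name = name
        self._executor = ThreadPoolExecutor(max_workers=size)
        self._slots = threading.BoundedSemaphore(size)
        self.rejected = 0  # only updated by the submitting thread in this sketch

    def _run(self, fn, args):
        try:
            return fn(*args)
        finally:
            self._slots.release()  # free the slot before the future completes

    def submit(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            self.rejected += 1
            return None  # controlled rejection
        return self._executor.submit(self._run, fn, args)

    def shutdown(self):
        self._executor.shutdown()

core = BoundedPool("core", size=4)
reporting = BoundedPool("reporting", size=2)
gate = threading.Event()

def slow_report():
    gate.wait(timeout=5)  # simulated slow dependency

def fast_login():
    return "ok"

# Steps 1-2: flood the reporting pool and let it saturate.
report_futures = [reporting.submit(slow_report) for _ in range(20)]
accepted = [f for f in report_futures if f is not None]

# Step 3: core requests are served while reporting is saturated.
start = time.perf_counter()
core_results = [core.submit(fast_login).result(timeout=1) for _ in range(10)]
core_elapsed_s = time.perf_counter() - start

# Steps 4-5: rejections should appear only in the isolated pool.
print(f"reporting rejected={reporting.rejected}, core rejected={core.rejected}")
print(f"core batch took {core_elapsed_s:.4f}s")

gate.set()
for f in accepted:
    f.result(timeout=5)
core.shutdown()
reporting.shutdown()
```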

Common Anti-Patterns

  • Single global thread pool for all workloads
  • Unlimited connection pools
  • No rejection policy when pool saturates
  • Over-isolation causing idle resources
  • Ignoring backpressure signals

Operational Checklist

  • Are critical and non-critical workloads isolated?
  • Are resource limits explicitly defined?
  • Are saturation thresholds monitored?
  • Is graceful degradation defined per bulkhead?
  • Are capacity assumptions tested under load?

Key Takeaways

  • Bulkheads isolate resource consumption within a service.
  • They prevent one failing component from degrading unrelated features.
  • Isolation can be applied at thread, connection, queue, or process level.
  • Proper capacity planning is required to avoid over-restriction.
  • Bulkheads enable controlled degradation instead of total collapse.

The bulkhead pattern transforms shared-resource risk into compartmentalized failure. In production systems, this separation is often the difference between partial degradation and full outage.