Bulkheads (Stop One Fire From Burning the Ship)
Bulkhead Pattern: Isolating Failures Inside a Service
The bulkhead pattern isolates resources so that failure in one subsystem does not consume all available capacity. The name comes from ship design, where compartments are separated so that flooding in one area does not sink the entire vessel.
In distributed systems, bulkheads prevent internal cascading failures caused by shared resource exhaustion.
The Core Problem: Shared Resource Collapse
Most services share finite resources:
- Thread pools
- Connection pools
- CPU capacity
- Memory
- Async worker queues
If one dependency slows down, calls to that dependency may block threads or saturate connections. Without isolation, unrelated features degrade as well.
Production Scenario: Reporting Endpoint Kills Core API
Symptom
Heavy traffic to the reporting endpoint causes the login and checkout APIs to time out.
Root Cause
All endpoints shared the same thread pool and the same database connection pool. Slow reporting queries held threads and connections long enough to exhaust both, leaving nothing for login and checkout requests.
Diagnosis
- Thread pool saturation at 100 percent.
- Database connections exhausted.
- Core endpoints blocked waiting for threads.
Resolution
- Separate thread pools for reporting vs transactional endpoints.
- Limit connection pool allocation per feature.
- Apply request rate limiting to the reporting API.
Types of Bulkhead Isolation
1) Thread Pool Isolation
Allocate separate execution pools per dependency or feature.
core_pool = 50 threads
reporting_pool = 10 threads
If reporting slows down or saturates its pool, core traffic continues on its own threads.
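A minimal Python sketch of this split, using the standard library's ThreadPoolExecutor. The pool sizes come from the example above; the function names (slow_report, login, core_latency_under_reporting_load) are hypothetical stand-ins, not part of any real service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Pool sizes mirror the example above; all function names are illustrative.
core_pool = ThreadPoolExecutor(max_workers=50, thread_name_prefix="core")
reporting_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="reporting")

def slow_report(seconds: float) -> str:
    """Stand-in for a slow reporting query."""
    time.sleep(seconds)
    return "report"

def login(user: str) -> str:
    """Stand-in for a fast transactional call."""
    return f"welcome {user}"

def core_latency_under_reporting_load(jobs: int = 30, job_seconds: float = 0.3) -> float:
    """Flood the reporting pool, then time one core request."""
    # Queue more slow reports than the reporting pool has threads.
    reports = [reporting_pool.submit(slow_report, job_seconds) for _ in range(jobs)]
    start = time.monotonic()
    core_pool.submit(login, "alice").result(timeout=1)
    elapsed = time.monotonic() - start
    for report in reports:
        report.cancel()  # drop reports that are still waiting in the queue
    return elapsed
```

Because login is submitted to core_pool, it never waits behind the queued reports. With a single shared pool of 10 threads, the same request would sit behind all 30 reports.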
2) Connection Pool Partitioning
Divide database or HTTP connection pools per dependency.
This prevents one slow backend from consuming all connections.
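One way to sketch partitioning without committing to a specific driver is a semaphore per dependency that caps how many connections each may hold at once. The class name, the dependency names, and the limits below are assumptions chosen for illustration; a real pool would hand out connection objects rather than a name.

```python
import threading
from contextlib import contextmanager

class PartitionedConnections:
    """Caps concurrent connections per dependency with one semaphore each."""

    def __init__(self, limits: dict[str, int]):
        self._slots = {
            name: threading.BoundedSemaphore(limit) for name, limit in limits.items()
        }

    @contextmanager
    def connection(self, dependency: str, timeout: float = 0.1):
        slot = self._slots[dependency]
        # Wait briefly for a free slot, then fail fast instead of blocking forever.
        if not slot.acquire(timeout=timeout):
            raise TimeoutError(f"no free connection slot for {dependency!r}")
        try:
            yield dependency  # placeholder: a real pool would yield a connection
        finally:
            slot.release()

# Hypothetical split: reporting can never hold more than 5 connections.
pools = PartitionedConnections({"orders_db": 20, "reporting_db": 5})
```

A caller would write `with pools.connection("reporting_db"): ...`. Once five reporting connections are in use, the sixth caller gets a TimeoutError while orders_db slots remain available.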
3) Queue Isolation
Use separate message queues for critical and non-critical workloads.
High-volume background processing should not delay transactional flows.
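As an in-process illustration, the sketch below gives each workload its own bounded queue and its own workers. The queue sizes and worker counts are arbitrary assumptions; with a message broker the same idea maps to separate queues or topics with separate consumers.

```python
import queue
import threading
import time

# Bounded queues so each workload has its own explicit backlog limit.
critical_q: queue.Queue = queue.Queue(maxsize=100)
background_q: queue.Queue = queue.Queue(maxsize=1000)

def worker(q: queue.Queue) -> None:
    """Run jobs (plain callables) from a single queue, forever."""
    while True:
        job = q.get()
        try:
            job()
        except Exception:
            pass  # a real worker would log the failure here
        finally:
            q.task_done()

def start_workers(q: queue.Queue, count: int) -> None:
    for _ in range(count):
        threading.Thread(target=worker, args=(q,), daemon=True).start()

def slow_background_job() -> None:
    time.sleep(0.1)  # stand-in for batch processing

# Each queue gets dedicated workers, so a background backlog
# never sits in front of a critical job.
start_workers(critical_q, 4)
start_workers(background_q, 1)
```

Even with a long backlog of slow_background_job entries in background_q, a job placed on critical_q is picked up immediately by one of its own workers.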
4) Process-Level Isolation
Run critical and non-critical workloads in separate services or containers.
This provides the strongest isolation boundary.
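The simplest local approximation of this boundary is to run a non-critical job in its own operating system process, so that a crash or hang there cannot take down the caller. The helper name run_isolated and the convention of returning None on timeout are assumptions made for this sketch.

```python
import subprocess
import sys
from typing import Optional

def run_isolated(code: str, timeout_s: float = 5.0) -> Optional[int]:
    """Run a non-critical job in a separate Python process.

    Returns the child's exit code, or None if it exceeded the time limit
    (subprocess.run kills the child before raising TimeoutExpired).
    """
    try:
        completed = subprocess.run([sys.executable, "-c", code], timeout=timeout_s)
        return completed.returncode
    except subprocess.TimeoutExpired:
        return None
```

For example, `run_isolated("raise MemoryError")` returns a non-zero exit code while the calling process keeps running. Containers and separate services apply the same principle with stronger limits on CPU and memory.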
Bulkheads vs Circuit Breakers
- Circuit breakers stop traffic to failing dependencies.
- Bulkheads isolate resource consumption inside a service.
They complement each other but solve different problems.
Capacity Planning Considerations
Isolation introduces tradeoffs:
- Underutilized resources in one pool cannot be borrowed by another.
- Misconfigured limits can cause premature throttling.
- Too many small pools increase operational complexity.
Bulkheads require careful capacity sizing.
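A common starting point for sizing is Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies it per pool; the headroom factor of 1.5 and the traffic figures in the usage note are illustrative assumptions, not measurements.

```python
import math

def pool_size(requests_per_second: float, latency_seconds: float,
              headroom: float = 1.5) -> int:
    """Estimate pool size from Little's Law: in-flight = rate x latency.

    headroom is a hypothetical safety multiplier for bursts.
    """
    in_flight = requests_per_second * latency_seconds
    return math.ceil(in_flight * headroom)
```

Under these assumptions, a core API at 40 requests per second with 0.25 s latency needs `pool_size(40, 0.25)`, which is 15 threads, and a reporting API at 2 requests per second with 2 s latency needs `pool_size(2, 2.0)`, which is 6. Load testing should confirm or correct such estimates.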
Graceful Degradation Strategy
When a bulkhead is saturated:
- Reject requests with controlled error responses.
- Degrade non-essential features.
- Preserve core functionality.
The goal is controlled failure, not total outage.
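The sketch below shows one possible shape for this behavior: a semaphore-based bulkhead that never queues, and instead returns a caller-supplied fallback as soon as it is full. The class name Bulkhead, its call method, and the fallback convention are assumptions for illustration, not a specific library's API.

```python
import threading
from typing import Any, Callable

class Bulkhead:
    """Limits concurrent calls; rejects immediately when saturated."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self.rejected = 0
        self._slots = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()

    def call(self, fn: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        # Non-blocking acquire: a full bulkhead fails fast instead of queueing.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            return fallback()  # controlled, degraded response
        try:
            return fn()
        finally:
            self._slots.release()
```

A reporting handler could call `reporting.call(run_report, lambda: (503, "reporting is busy, retry later"))`, where run_report is a hypothetical function. Callers receive a fast, explicit rejection while core endpoints stay unaffected.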
Observability Requirements
- Thread pool utilization per pool
- Queue depth per workload
- Connection pool saturation per dependency
- Rejected request count per bulkhead
- Latency distribution per feature
Without per-bulkhead metrics, isolation effectiveness cannot be validated.
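As a rough sketch of what per-bulkhead instrumentation can look like, the class below tracks utilization, rejections, and a nearest-rank p95 latency for a single bulkhead. The class and field names are hypothetical; a production service would export these values through its existing metrics library instead of a dictionary.

```python
import math
import threading
from dataclasses import dataclass, field

@dataclass
class BulkheadMetrics:
    name: str
    capacity: int
    in_flight: int = 0
    rejected: int = 0
    latencies_ms: list = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def try_enter(self) -> bool:
        """Admit a request if there is capacity; otherwise count a rejection."""
        with self._lock:
            if self.in_flight >= self.capacity:
                self.rejected += 1
                return False
            self.in_flight += 1
            return True

    def leave(self, latency_ms: float) -> None:
        with self._lock:
            self.in_flight -= 1
            self.latencies_ms.append(latency_ms)

    def snapshot(self) -> dict:
        with self._lock:
            ordered = sorted(self.latencies_ms)
            # Nearest-rank percentile: smallest value covering 95% of samples.
            p95 = ordered[math.ceil(0.95 * len(ordered)) - 1] if ordered else None
            return {
                "bulkhead": self.name,
                "utilization": self.in_flight / self.capacity,
                "rejected": self.rejected,
                "p95_latency_ms": p95,
            }
```

Keeping one instance per pool or dependency makes it possible to alert on a single saturated compartment instead of a service-wide average.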
Failure Injection Test
# Bulkhead validation test
1) Generate heavy load on non-critical endpoint
2) Observe reporting pool saturation
3) Confirm core endpoint latency remains stable
4) Verify controlled rejection occurs only in isolated pool
5) Measure resource utilization boundaries
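The steps above can be approximated in-process with a small harness. Everything here is an assumption made for illustration: the pool sizes, the admission limit, the 0.2 s simulated report, and the function name run_validation. A real test would drive the deployed service with a load generator.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_validation(reporting_threads: int = 2, queue_limit: int = 4,
                   flood: int = 20) -> dict:
    core_pool = ThreadPoolExecutor(max_workers=4)
    reporting_pool = ThreadPoolExecutor(max_workers=reporting_threads)
    # Admission limit for reporting: running jobs plus a short backlog.
    admitted = threading.Semaphore(reporting_threads + queue_limit)
    rejected = 0

    def report() -> None:
        try:
            time.sleep(0.2)  # simulated slow reporting query
        finally:
            admitted.release()

    # Steps 1 and 2: flood the non-critical endpoint until its pool saturates.
    for _ in range(flood):
        if admitted.acquire(blocking=False):
            reporting_pool.submit(report)
        else:
            rejected += 1  # Step 4: rejection happens only in the reporting bulkhead

    # Step 3: measure core latency while reporting is saturated.
    latencies = []
    for _ in range(10):
        start = time.monotonic()
        core_pool.submit(lambda: "ok").result()
        latencies.append(time.monotonic() - start)

    core_pool.shutdown()
    reporting_pool.shutdown()
    # Step 5: report the observed boundaries.
    return {
        "core_max_latency_s": max(latencies),
        "reporting_admitted": flood - rejected,
        "reporting_rejected": rejected,
    }
```

A passing run shows a small core_max_latency_s, a non-zero reporting_rejected, and an admitted count near the configured limit of reporting_threads plus queue_limit.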
Common Anti-Patterns
- Single global thread pool for all workloads
- Unlimited connection pools
- No rejection policy when pool saturates
- Over-isolation causing idle resources
- Ignoring backpressure signals
Operational Checklist
- Are critical and non-critical workloads isolated?
- Are resource limits explicitly defined?
- Are saturation thresholds monitored?
- Is graceful degradation defined per bulkhead?
- Are capacity assumptions tested under load?
Key Takeaways
- Bulkheads isolate resource consumption within a service.
- They prevent one failing component from degrading unrelated features.
- Isolation can be applied at thread, connection, queue, or process level.
- Proper capacity planning is required to avoid over-restriction.
- Bulkheads enable controlled degradation instead of total collapse.
The bulkhead pattern transforms shared-resource risk into compartmentalized failure. In production systems, this separation is often the difference between partial degradation and full outage.