Saga Pattern
Why Sagas Exist
- Services that own separate data stores can rarely share a safe global ACID transaction.
- Workflows span services: order, payment, inventory, shipping.
- Sagas provide eventual consistency with explicit steps and compensations.
Saga Types
- Choreography: services react to events and trigger next steps.
- Orchestration: a coordinator issues commands and tracks progress.
- Production rule: choose based on observability needs and workflow complexity, not ideology.
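To make the two styles concrete, here is a minimal in-memory sketch of the same order flow written both ways. The `EventBus` class, the event names, and the `run_order_saga` function are hypothetical names chosen for illustration; a real system would use a message broker and durable state.

```python
from collections import defaultdict

# --- Choreography: each service reacts to an event and emits the next one ---
class EventBus:
    """Toy synchronous bus (illustrative only, not a real broker)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every event published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
# Payment reacts to OrderCreated; inventory reacts to PaymentCaptured.
bus.subscribe("OrderCreated", lambda p: bus.publish("PaymentCaptured", p))
bus.subscribe("PaymentCaptured", lambda p: bus.publish("InventoryReserved", p))
bus.publish("OrderCreated", {"order_id": "o-1"})

# --- Orchestration: one coordinator issues commands and tracks progress ---
def capture_payment(order):
    return "PaymentCaptured"

def reserve_inventory(order):
    return "InventoryReserved"

def run_order_saga(order):
    progress = ["OrderCreated"]
    for command in (capture_payment, reserve_inventory):
        progress.append(command(order))  # the coordinator sees every step
    return progress

orchestrated = run_order_saga({"order_id": "o-1"})
```

In the choreographed version the ordering lives in the subscriptions, so nobody holds the whole flow; in the orchestrated version the ordering is one readable loop, at the cost of a central component.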
Core Design Requirements
- Each step must be idempotent.
- Each step must have a compensation or a defined terminal failure policy.
- State must be persisted for recovery after crashes.
- Timeouts are part of correctness and must be explicit.
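The requirements above can be sketched in a few lines. This is a minimal illustration, assuming SQLite as a stand-in for the durable store; the `saga_step` table, `run_step`, and `overdue_steps` are hypothetical names, not part of any particular framework.

```python
import sqlite3
import time

# Assumption: SQLite stands in for whatever durable store the saga uses.
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE saga_step (
           saga_id TEXT, step TEXT, status TEXT, deadline REAL,
           PRIMARY KEY (saga_id, step))"""
)

def run_step(saga_id, step, action, timeout_s, now=time.time):
    """Run `action` unless this (saga_id, step) is already recorded as done."""
    key = (saga_id, step)
    row = db.execute(
        "SELECT status FROM saga_step WHERE saga_id = ? AND step = ?", key
    ).fetchone()
    if row is not None and row[0] == "done":
        return "skipped"  # redelivered message: do not repeat the side effect
    if row is None:
        # Persist intent and an explicit deadline before acting.
        db.execute(
            "INSERT INTO saga_step VALUES (?, ?, 'started', ?)",
            (saga_id, step, now() + timeout_s),
        )
        db.commit()
    # A crash between action() and the UPDATE below can still repeat the
    # action on retry, so the downstream call should also be idempotent.
    action()
    db.execute(
        "UPDATE saga_step SET status = 'done' WHERE saga_id = ? AND step = ?", key
    )
    db.commit()
    return "executed"

def overdue_steps(now=time.time):
    """Steps still 'started' past their deadline: retry or compensate these."""
    return db.execute(
        "SELECT saga_id, step FROM saga_step "
        "WHERE status = 'started' AND deadline < ?",
        (now(),),
    ).fetchall()
```

A recovery job that polls `overdue_steps()` after a restart is one way to turn the persisted deadline into an actual timeout policy.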
Compensation vs Rollback
- Compensation is a new business action that semantically undoes a previous step.
- Compensation may be imperfect: for example, issuing a refund rather than reversing the original authorization.
- Production rule: define acceptable outcomes for partial completion.
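The difference from a database rollback shows up clearly in a small sketch: compensations are new actions appended to history, run in reverse order of the completed steps. The `run_saga` helper and the step names below are hypothetical and exist only for illustration.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation); compensation may be None.

    On failure, runs the compensations of already-completed steps in
    reverse order and reports which step failed.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            for _done_name, comp in reversed(completed):
                if comp is not None:
                    comp()
            return {"status": "compensated", "failed_step": name}
        completed.append((name, compensation))
    return {"status": "completed", "failed_step": None}

# The ledger keeps both the original action and its compensation:
# nothing is erased, unlike a database rollback.
ledger = []

def charge():
    ledger.append(("charge", 50))

def refund():
    ledger.append(("refund", 50))

def reserve():
    ledger.append(("reserve", "sku-1"))

def release():
    ledger.append(("release", "sku-1"))

def ship():
    raise RuntimeError("carrier unavailable")

result = run_saga([
    ("charge", charge, refund),
    ("reserve", reserve, release),
    ("ship", ship, None),  # no compensation: failure here is terminal
])
```

This sketch assumes compensations themselves succeed; a compensation that can fail needs its own retry and escalation policy.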
Choreography Failure Risks
- Hidden coupling through events and implicit ordering assumptions.
- End-to-end visibility is harder without correlation IDs and tracing.
- Poison messages can stall a step and block progress.
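One common way to keep a poison message from blocking a step is a retry cap combined with a dead letter queue. The sketch below uses an in-memory `deque` in place of a broker; `MAX_ATTEMPTS`, `consume`, and the message shape are assumptions made for illustration.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry cap

def consume(queue, handler, dead_letters):
    """Drain `queue`; park messages that keep failing instead of blocking."""
    processed = []
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
            processed.append(message["id"])
        except Exception as exc:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= MAX_ATTEMPTS:
                # Give up: move the message aside for inspection and replay.
                dead_letters.append({**message, "error": str(exc)})
            else:
                # Requeue at the back so healthy messages still make progress.
                queue.append(message)
    return processed

def handle(body):
    if body == "bad":
        raise ValueError("cannot parse payload")

queue = deque([{"id": "m1", "body": "bad"}, {"id": "m2", "body": "ok"}])
dead_letters = []
processed = consume(queue, handle, dead_letters)
```

Requeueing at the back changes delivery order, which is acceptable only if consumers do not rely on implicit ordering.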
Orchestration Failure Risks
- Coordinator becomes a bottleneck or single point of failure if not replicated.
- State machine bugs can cause stuck workflows.
- Complexity moves into the orchestrator codebase.
Operational Patterns
- Use correlation IDs across all commands and events.
- Persist saga state as a state machine with explicit transitions.
- Use the outbox pattern for reliable event publishing.
- Use dead letter queues and replay tooling.
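As an illustration of the outbox pattern with correlation IDs, the sketch below writes the saga state change and the outgoing command in one local transaction, then publishes from the outbox in a separate relay step. SQLite, the table layout, and the `start_saga` and `relay` functions are assumptions for the example; the `publish` callback stands in for a real broker client.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE saga (saga_id TEXT PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        correlation_id TEXT NOT NULL,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)

def start_saga(order_id):
    saga_id = str(uuid.uuid4())  # doubles as the correlation ID
    with db:  # one transaction: both rows commit together or not at all
        db.execute("INSERT INTO saga VALUES (?, 'PAYMENT_PENDING')", (saga_id,))
        db.execute(
            "INSERT INTO outbox (correlation_id, event_type, payload) "
            "VALUES (?, ?, ?)",
            (saga_id, "CapturePayment", json.dumps({"order_id": order_id})),
        )
    return saga_id

def relay(publish):
    """Publish unpublished outbox rows in order; returns how many were sent."""
    rows = db.execute(
        "SELECT id, correlation_id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, correlation_id, event_type, payload in rows:
        publish(correlation_id, event_type, json.loads(payload))
        # A crash before this UPDATE causes a re-publish on the next run:
        # delivery is at-least-once, so consumers must be idempotent.
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Running `relay` on a schedule, or driving it from change data capture, are two typical ways to drain the outbox.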
Failure Modes
- Stuck saga: a step never completes because an event is missing or a consumer is lagging.
- Duplicate step: message redelivery triggers a side effect twice.
- Compensation failure: a refund fails and the workflow remains inconsistent.
- Out-of-order events: the state machine applies transitions incorrectly.
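Two of these failure modes, duplicates and out-of-order events, can be handled in the saga's state machine itself. The sketch below is one possible approach under stated assumptions: an explicit transition table, a set of seen event IDs, and a parking list for events that arrive early. The state names, event names, and the `Saga` class are hypothetical.

```python
# Explicit transition table: (current state, event type) -> next state.
TRANSITIONS = {
    ("PAYMENT_PENDING", "PaymentCaptured"): "INVENTORY_PENDING",
    ("PAYMENT_PENDING", "PaymentFailed"): "FAILED",
    ("INVENTORY_PENDING", "InventoryReserved"): "COMPLETED",
    ("INVENTORY_PENDING", "InventoryRejected"): "COMPENSATING",
    ("COMPENSATING", "PaymentRefunded"): "COMPENSATED",
}

class Saga:
    def __init__(self, saga_id):
        self.saga_id = saga_id
        self.state = "PAYMENT_PENDING"
        self.seen = set()   # event IDs already applied
        self.parked = []    # events that arrived before their turn

    def apply(self, event_id, event_type):
        if event_id in self.seen:
            return "duplicate"  # redelivery: ignore, no second side effect
        next_state = TRANSITIONS.get((self.state, event_type))
        if next_state is None:
            # Not valid in the current state: park it rather than guess.
            self.parked.append((event_id, event_type))
            return "parked"
        self.seen.add(event_id)
        self.state = next_state
        # The new state may unlock events that arrived early.
        parked, self.parked = self.parked, []
        for parked_id, parked_type in parked:
            self.apply(parked_id, parked_type)
        return "applied"

saga = Saga("s-1")
first = saga.apply("e2", "InventoryReserved")   # arrives too early
second = saga.apply("e1", "PaymentCaptured")    # unlocks the parked event
third = saga.apply("e1", "PaymentCaptured")     # redelivery
```

Events that stay parked for a long time are a useful signal for the stuck-saga case: they point to a missing predecessor event.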
Production Checklist
- All steps are idempotent and have retry caps.
- Compensation is defined and tested for each reversible step.
- Saga state is persisted and recoverable after restart.
- Correlation IDs and tracing provide end-to-end visibility.
- Runbooks exist for replay, manual intervention, and compensation.