Sagas (Orchestration vs Choreography)
Saga Pattern Overview: Distributed Transactions Without Global Locking
The Saga pattern is a distributed transaction management strategy that replaces atomic multi-node commits with a sequence of local transactions. Each step commits independently, and if a later step fails, previously completed steps are undone using compensating actions.
Instead of enforcing strict atomicity across services, Sagas embrace eventual consistency and recovery through compensation.
Why Sagas Exist
Two-Phase Commit (2PC) and Three-Phase Commit (3PC) provide strong atomicity guarantees, but 2PC can block participants when the coordinator fails, and both protocols add coordination overhead and reduce availability under network partitions. At internet scale, these tradeoffs are often unacceptable.
Sagas solve the problem differently:
- Break the global transaction into smaller local transactions.
- Commit each step independently.
- If a failure occurs, execute compensating actions to reverse prior work.
This improves availability at the cost of temporary inconsistency.
Basic Saga Flow Example
Consider an order workflow:
1) Create order record.
2) Reserve inventory.
3) Charge payment.
4) Schedule shipment.
If step 3 fails, steps 1 and 2 must be compensated, typically in reverse order of execution:
- Release inventory.
- Cancel order.
Each service performs its own local transaction and exposes a compensating operation.
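The order workflow above can be sketched as a small in-process runner. This is a minimal illustration, not a production implementation: `SagaStep`, `run_saga`, and the step names are hypothetical, the "services" are plain functions that append to a log, and compensations are assumed to run in reverse order of completion.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # the step's local transaction
    compensation: Callable[[], None]  # undoes the action's effect


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("card declined")  # simulated failure at step 3


order_saga = [
    SagaStep("create_order",
             lambda: log.append("order created"),
             lambda: log.append("order cancelled")),
    SagaStep("reserve_inventory",
             lambda: log.append("inventory reserved"),
             lambda: log.append("inventory released")),
    SagaStep("charge_payment",
             fail_payment,
             lambda: log.append("payment refunded")),
    SagaStep("schedule_shipment",
             lambda: log.append("shipment scheduled"),
             lambda: log.append("shipment cancelled")),
]

succeeded = run_saga(order_saga)
```

Because the payment step raises before completing, only the first two steps are compensated, and the shipment step never runs.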
Two Saga Coordination Models
1) Orchestration
A central orchestrator controls the workflow.
- Orchestrator sends commands to services.
- Services reply with success or failure.
- Orchestrator triggers compensations if needed.
Advantages:
- Clear control flow.
- Easier observability.
- Simpler debugging.
Disadvantages:
- Coordination logic is concentrated in one component, which must itself be highly available.
- Orchestrator can become complex.
2) Choreography
No central controller. Services emit events and react to them.
- OrderCreated event triggers InventoryService.
- InventoryReserved event triggers PaymentService.
- PaymentFailed event triggers compensations.
Advantages:
- Looser coupling.
- More decentralized control.
Disadvantages:
- Harder to trace.
- Complex event chains.
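The event chain described above can be sketched with an in-memory publish/subscribe bus. This is only an illustration under stated assumptions: `EventBus`, the handler names, and the `card_ok` payload field are hypothetical, and delivery here is synchronous, whereas a real system would use a message broker with asynchronous, possibly delayed delivery.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
trace: List[str] = []


def inventory_on_order_created(event: Dict) -> None:
    trace.append("inventory reserved")
    bus.publish("InventoryReserved", event)


def payment_on_inventory_reserved(event: Dict) -> None:
    if event["card_ok"]:
        trace.append("payment charged")
        bus.publish("PaymentCharged", event)
    else:
        trace.append("payment failed")
        bus.publish("PaymentFailed", event)


def inventory_on_payment_failed(event: Dict) -> None:
    trace.append("inventory released")  # compensation owned by InventoryService


def order_on_payment_failed(event: Dict) -> None:
    trace.append("order cancelled")     # compensation owned by OrderService


bus.subscribe("OrderCreated", inventory_on_order_created)
bus.subscribe("InventoryReserved", payment_on_inventory_reserved)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)
bus.subscribe("PaymentFailed", order_on_payment_failed)

bus.publish("OrderCreated", {"order_id": "o-1", "card_ok": False})
```

Note that no single function describes the whole workflow: the sequence only emerges from the subscriptions, which is the source of both the loose coupling and the tracing difficulty.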
Production Scenario: Payment Failure After Inventory Reservation
Symptom
Inventory remains reserved after payment fails.
Root Cause
Compensation was never triggered because the PaymentFailed event was delayed in delivery, and nothing else detected the stalled saga.
Diagnosis
- InventoryReserved event processed.
- PaymentFailed event not consumed due to lag.
- No monitoring for stuck saga instances.
Resolution
- Implement saga state tracking store.
- Add timeout-based compensation fallback.
- Monitor incomplete saga instances.
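A timeout-based fallback along these lines could be sketched as a periodic sweep over the saga state store. The record layout, the `sweep_stuck_sagas` name, and the 30-second timeout are assumptions made for illustration; the store is a plain dictionary standing in for a durable database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SagaRecord:
    saga_id: str
    status: str         # "STARTED", "COMPENSATING", "COMPLETED", or "FAILED"
    last_update: float  # timestamp in seconds of the last state change


def sweep_stuck_sagas(
    store: Dict[str, SagaRecord],
    now: float,
    timeout_s: float,
    compensate: Callable[[str], None],
) -> List[str]:
    """Start compensation for sagas that have made no progress within timeout_s."""
    stuck = [
        record for record in store.values()
        if record.status == "STARTED" and now - record.last_update > timeout_s
    ]
    for record in stuck:
        record.status = "COMPENSATING"
        record.last_update = now
        compensate(record.saga_id)
    return [record.saga_id for record in stuck]


store = {
    "s1": SagaRecord("s1", "STARTED", last_update=0.0),    # stalled
    "s2": SagaRecord("s2", "STARTED", last_update=95.0),   # recently active
    "s3": SagaRecord("s3", "COMPLETED", last_update=0.0),  # already finished
}
released: List[str] = []
timed_out = sweep_stuck_sagas(store, now=100.0, timeout_s=30.0,
                              compensate=released.append)
```

Because the sweep may race with a late PaymentFailed event, the compensation it calls needs to be idempotent.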
Designing Compensating Actions
Compensation must:
- Be idempotent.
- Be safe to retry.
- Logically reverse the effect of the original step.
Compensation is not always a perfect reversal. Sometimes it is a logical correction rather than an exact undo: a charged payment is refunded, not erased from the ledger.
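One way to make a compensation idempotent and safe to retry is to key it on the saga identifier and remember what has already been released. The sketch below assumes a hypothetical `InventoryService` that holds state in memory; it also ignores a reservation request that arrives after its own compensation, which is one possible policy for out-of-order delivery rather than the only one.

```python
from typing import Dict, Set, Tuple


class InventoryService:
    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = dict(stock)
        self.reservations: Dict[str, Tuple[str, int]] = {}  # saga_id -> (sku, qty)
        self.released: Set[str] = set()                     # sagas already compensated

    def reserve(self, saga_id: str, sku: str, qty: int) -> None:
        # Ignore duplicates and requests that arrive after compensation.
        if saga_id in self.reservations or saga_id in self.released:
            return
        if self.stock[sku] < qty:
            raise ValueError(f"insufficient stock for {sku}")
        self.stock[sku] -= qty
        self.reservations[saga_id] = (sku, qty)

    def release(self, saga_id: str) -> None:
        """Compensation: calling it repeatedly has the same effect as calling it once."""
        reservation = self.reservations.pop(saga_id, None)
        self.released.add(saga_id)
        if reservation is None:
            return  # already released, or nothing was ever reserved
        sku, qty = reservation
        self.stock[sku] += qty


inventory = InventoryService({"sku-1": 5})
inventory.reserve("saga-1", "sku-1", 2)
inventory.release("saga-1")
inventory.release("saga-1")  # retried compensation does not add stock twice
```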
Saga State Management
Production systems track saga progress explicitly:
- Persistent saga state record.
- Status transitions (STARTED, COMPLETED, FAILED, COMPENSATING).
- Timeout monitoring for stalled steps.
Without state tracking, debugging is extremely difficult.
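The status transitions listed above can be enforced with a small state machine. The allowed transitions below are an assumption for illustration (in particular, FAILED is treated as the terminal state reached once compensation finishes), and `SagaState` keeps its history in memory where a production system would write each transition to durable storage.

```python
from enum import Enum
from typing import Dict, List, Set


class SagaStatus(Enum):
    STARTED = "STARTED"
    COMPLETED = "COMPLETED"
    COMPENSATING = "COMPENSATING"
    FAILED = "FAILED"


# Assumed transition table; terminal states have no outgoing transitions.
ALLOWED: Dict[SagaStatus, Set[SagaStatus]] = {
    SagaStatus.STARTED: {SagaStatus.COMPLETED, SagaStatus.COMPENSATING},
    SagaStatus.COMPENSATING: {SagaStatus.FAILED},
    SagaStatus.COMPLETED: set(),
    SagaStatus.FAILED: set(),
}


class SagaState:
    def __init__(self, saga_id: str) -> None:
        self.saga_id = saga_id
        self.status = SagaStatus.STARTED
        self.history: List[SagaStatus] = [SagaStatus.STARTED]

    def transition(self, new_status: SagaStatus) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"{self.saga_id}: illegal transition "
                f"{self.status.value} -> {new_status.value}"
            )
        self.status = new_status
        self.history.append(new_status)


state = SagaState("saga-42")
state.transition(SagaStatus.COMPENSATING)
state.transition(SagaStatus.FAILED)
```

Rejecting illegal transitions turns silent workflow bugs, such as completing a saga that is already compensating, into immediate errors, and the recorded history gives a debugging trail.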
Tradeoffs Compared to 2PC
- No global locking.
- Higher availability under partition.
- Temporary inconsistency possible.
- Requires compensating logic.
- Operational complexity in workflow tracking.
Sagas trade atomicity for resilience and scalability.
Observability Requirements
- Active saga count
- Failed saga rate
- Compensation execution count
- Saga duration percentiles
- Stuck saga detection alerts
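As a rough sketch, most of these metrics can be derived from the saga state records themselves. `SagaSample`, `summarize`, and the metric names are hypothetical, and the percentile uses the simple nearest-rank method; a real deployment would more likely export counters and histograms to a metrics system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SagaSample:
    status: str                  # "STARTED", "COMPENSATING", "COMPLETED", or "FAILED"
    duration_s: Optional[float]  # None while the saga is still running
    compensations_run: int = 0


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile; p is an integer in 1..100."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def summarize(samples: List[SagaSample]) -> Dict[str, float]:
    finished = [s for s in samples if s.status in ("COMPLETED", "FAILED")]
    durations = sorted(s.duration_s for s in finished if s.duration_s is not None)
    failed = sum(1 for s in finished if s.status == "FAILED")
    return {
        "active_sagas": sum(1 for s in samples
                            if s.status in ("STARTED", "COMPENSATING")),
        "failed_rate": failed / len(finished) if finished else 0.0,
        "compensations": sum(s.compensations_run for s in samples),
        "duration_p50_s": percentile(durations, 50) if durations else 0.0,
        "duration_p95_s": percentile(durations, 95) if durations else 0.0,
    }


samples = [SagaSample("COMPLETED", float(d)) for d in range(1, 10)]
samples.append(SagaSample("FAILED", 10.0, compensations_run=2))
samples.append(SagaSample("STARTED", None))
metrics = summarize(samples)
```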
Saga workflows must be visible end-to-end.
Failure Injection Test
# Saga validation test
1) Execute full multi-step workflow
2) Inject failure at step N
3) Verify compensating actions execute
4) Confirm final state consistency
5) Measure recovery duration
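The first four steps of this test plan can be sketched as a loop that injects a failure at every possible step and checks the outcome. Everything here is a simplified stand-in: `run_with_failure` simulates the workflow in memory using hypothetical step names, and recovery duration is not measured.

```python
from typing import List, Optional, Tuple

STEP_NAMES = ["create_order", "reserve_inventory", "charge_payment", "schedule_shipment"]


def run_with_failure(fail_at: Optional[int]) -> Tuple[bool, List[str], List[str]]:
    """Run the workflow, failing at 1-based step `fail_at` (None means no failure).

    Returns (succeeded, effects still applied, compensations in execution order).
    """
    applied: List[str] = []
    compensated: List[str] = []
    for index, name in enumerate(STEP_NAMES, start=1):
        if index == fail_at:
            # Injected failure: undo completed steps, most recent first.
            while applied:
                compensated.append(applied.pop())
            return False, applied, compensated
        applied.append(name)
    return True, applied, compensated


def test_every_failure_point() -> None:
    for n in range(1, len(STEP_NAMES) + 1):
        ok, applied, compensated = run_with_failure(n)
        assert not ok
        assert applied == []  # final state is consistent: nothing left applied
        assert compensated == list(reversed(STEP_NAMES[:n - 1]))
    ok, applied, compensated = run_with_failure(None)
    assert ok and applied == STEP_NAMES and compensated == []


test_every_failure_point()
```

Iterating over every value of N, rather than a single hand-picked step, is what catches a step whose compensation was never wired up.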
Common Anti-Patterns
- Irreversible actions (for example, sending an email) left without a compensation strategy.
- Ignoring idempotency of compensation steps.
- No timeout handling for stalled steps.
- Hidden choreography logic spread across services.
- No centralized tracing of saga progress.
Operational Checklist
- Is every step paired with a compensation?
- Are compensations idempotent?
- Is saga state persisted durably?
- Are stalled workflows detectable?
- Is end-to-end tracing implemented?
Key Takeaways
- The Saga pattern replaces global atomicity with compensating transactions.
- It improves availability and scalability.
- Temporary inconsistency must be tolerated.
- Compensation logic is critical and must be safe.
- Observability and state tracking are mandatory.
The Saga pattern is a practical approach to distributed transaction management: instead of preventing partial progress, it accepts it and provides structured recovery. In production distributed systems, Sagas are often preferred over blocking commit protocols.