Saga Orchestration (Central Coordinator Done Right)
Saga Pattern (Orchestration): Centralized Control of Distributed Workflows
In orchestration-based Sagas, a central orchestrator service coordinates the entire workflow. Instead of services reacting to events autonomously, the orchestrator explicitly commands each step and decides what to do next based on responses.
This model provides clarity, traceability, and deterministic control at the cost of introducing a coordination component.
Core Idea
The orchestrator:
- Maintains saga state.
- Sends commands to participating services.
- Receives success or failure responses.
- Triggers compensating actions when necessary.
The business workflow becomes explicit in a single place.
Example: Order Processing Workflow
Steps:
- Create order.
- Reserve inventory.
- Charge payment.
- Schedule shipment.
Orchestrator logic:
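start_saga(order_id)
call create_order()
if success:
    call reserve_inventory()
    if success:
        call charge_payment()
        if success:
            call schedule_shipment()
            if success:
                mark_saga_completed()
            else:
                # shipment failed: undo earlier steps in reverse order
                # (refund_payment is an assumed name for the payment compensation)
                call refund_payment()
                call release_inventory()
                call cancel_order()
                mark_saga_failed()
        else:
            call release_inventory()
            call cancel_order()
            mark_saga_failed()
    else:
        call cancel_order()
        mark_saga_failed()
else:
    # nothing to compensate: the order was never created
    mark_saga_failed()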
The entire control flow is centralized.
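The nested conditionals above can also be expressed as a list of steps, each paired with its compensation. The following is a minimal sketch, not a definitive implementation: `Step`, `run_saga`, and the stub helpers are hypothetical names, the services are replaced by in-process functions, and `refund_payment` and `cancel_shipment` are assumed compensation names that do not appear in the workflow above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]      # returns True on success
    compensate: Callable[[], None]  # undoes a previously successful action


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        if step.action():
            completed.append(step)
        else:
            for done in reversed(completed):
                done.compensate()
            return "FAILED"
    return "COMPLETED"


# Stubbed services that only record what was called (assumption for the demo).
log: List[str] = []


def stub(name: str, succeeds: bool = True) -> Callable[[], bool]:
    def action() -> bool:
        log.append(name)
        return succeeds
    return action


def undo(name: str) -> Callable[[], None]:
    def compensate() -> None:
        log.append(name)
    return compensate


steps = [
    Step("create_order", stub("create_order"), undo("cancel_order")),
    Step("reserve_inventory", stub("reserve_inventory"), undo("release_inventory")),
    Step("charge_payment", stub("charge_payment", succeeds=False), undo("refund_payment")),
    Step("schedule_shipment", stub("schedule_shipment"), undo("cancel_shipment")),
]
result = run_saga(steps)
```

With the payment step stubbed to fail, the saga ends as failed after calling `release_inventory` and then `cancel_order`, matching the pseudocode branch for a failed charge.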
Saga State Persistence
The orchestrator must persist its state durably, including:
- Saga ID
- Current step
- Completed steps
- Compensation progress
- Timeout deadlines
If the orchestrator crashes, it must recover and resume the workflow.
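One way to hold the fields listed above is a single row per saga in a relational store. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table name `saga_state`, the column names, and the helper functions are assumptions, and a production system would pick its own store and schema.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS saga_state (
               saga_id           TEXT PRIMARY KEY,
               current_step      TEXT NOT NULL,
               completed_steps   TEXT NOT NULL,
               compensated_steps TEXT NOT NULL,
               deadline          REAL
           )"""
    )
    return conn


def save_state(conn, saga_id, current_step, completed, compensated, deadline):
    # "with conn" commits on success and rolls back on an exception,
    # so a row is never left half-written.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO saga_state VALUES (?, ?, ?, ?, ?)",
            (saga_id, current_step, json.dumps(completed),
             json.dumps(compensated), deadline),
        )


def load_state(conn, saga_id):
    row = conn.execute(
        "SELECT current_step, completed_steps, compensated_steps, deadline "
        "FROM saga_state WHERE saga_id = ?",
        (saga_id,),
    ).fetchone()
    if row is None:
        return None
    return {
        "saga_id": saga_id,
        "current_step": row[0],
        "completed_steps": json.loads(row[1]),
        "compensated_steps": json.loads(row[2]),
        "deadline": row[3],
    }


conn = open_store()
save_state(conn, "saga-1", "reserve_inventory", ["create_order"], [], 1030.0)
save_state(conn, "saga-1", "charge_payment",
           ["create_order", "reserve_inventory"], [], 1060.0)
state = load_state(conn, "saga-1")
```

Keeping the whole saga in one row means each transition is a single atomic write, which is what recovery after a crash relies on.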
Production Scenario: Orchestrator Crash Mid-Workflow
Symptom
Orders remain stuck in an intermediate state after the orchestrator restarts.
Root Cause
Saga state was not persisted atomically. After the crash, the orchestrator could not determine the last completed step.
Diagnosis
- Inconsistent saga state entries.
- Missing compensation markers.
- No idempotency protection on commands.
Resolution
- Persist state transitions before sending commands.
- Use idempotency keys for each step invocation.
- Replay incomplete sagas on startup.
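The three resolution steps can be sketched together as follows. This is an illustrative toy under stated assumptions: a plain dictionary stands in for the durable store, `Orchestrator`, `flaky_send`, and the key format `saga_id:step` are hypothetical, the crash is simulated with an exception, and compensation is omitted for brevity.

```python
from typing import Callable, Dict, List

STEPS = ["create_order", "reserve_inventory", "charge_payment", "schedule_shipment"]


class Orchestrator:
    def __init__(self, store: Dict[str, dict], send: Callable[[str, str, str], bool]):
        self.store = store  # stand-in for a durable store (assumption)
        self.send = send    # send(saga_id, step, idempotency_key) -> success

    def start(self, saga_id: str) -> None:
        # The state is recorded before any command is sent.
        self.store[saga_id] = {"next": 0, "status": "RUNNING"}
        self.resume(saga_id)

    def resume(self, saga_id: str) -> None:
        state = self.store[saga_id]
        while state["status"] == "RUNNING" and state["next"] < len(STEPS):
            step = STEPS[state["next"]]
            # The same key is reused on replay so the receiver can deduplicate.
            key = f"{saga_id}:{step}"
            if not self.send(saga_id, step, key):
                state["status"] = "FAILED"
                return
            state["next"] += 1
        if state["status"] == "RUNNING":
            state["status"] = "COMPLETED"

    def recover(self) -> List[str]:
        """Replay every saga that was still running when the process stopped."""
        resumed = []
        for saga_id, state in list(self.store.items()):
            if state["status"] == "RUNNING":
                self.resume(saga_id)
                resumed.append(saga_id)
        return resumed


sent: List[str] = []
crash_once = {"armed": True}


def flaky_send(saga_id: str, step: str, key: str) -> bool:
    sent.append(key)
    if step == "charge_payment" and crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated orchestrator crash")
    return True


store: Dict[str, dict] = {}
try:
    Orchestrator(store, flaky_send).start("saga-1")
except RuntimeError:
    pass  # the "process" died mid-workflow

resumed = Orchestrator(store, flaky_send).recover()
```

The `charge_payment` command is sent twice with the same key, once before the crash and once during replay, which is why the receiving service must treat the key as a deduplication token.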
Timeout Handling
Each step must define a timeout.
- If a service does not respond within the defined time, trigger compensation.
- The timeout deadline must be tracked in the saga state store.
Timeout logic prevents indefinite workflow blocking.
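A deadline check might look like the sketch below. `TimeoutTracker` and `PendingStep` are hypothetical names, the tracker keeps deadlines in memory only to keep the example short (the text above requires them to live in the saga state store), and the current time is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PendingStep:
    saga_id: str
    step: str
    deadline: float  # absolute time in seconds


class TimeoutTracker:
    def __init__(self) -> None:
        self._pending: Dict[str, PendingStep] = {}

    def command_sent(self, saga_id: str, step: str, now: float, timeout_s: float) -> None:
        self._pending[saga_id] = PendingStep(saga_id, step, now + timeout_s)

    def response_received(self, saga_id: str) -> None:
        self._pending.pop(saga_id, None)

    def expired(self, now: float) -> List[PendingStep]:
        """Return and remove every step whose deadline has passed."""
        late = [p for p in self._pending.values() if p.deadline <= now]
        for p in late:
            del self._pending[p.saga_id]
        return late


tracker = TimeoutTracker()
tracker.command_sent("saga-1", "reserve_inventory", now=100.0, timeout_s=5.0)
tracker.command_sent("saga-2", "charge_payment", now=100.0, timeout_s=30.0)
tracker.command_sent("saga-3", "create_order", now=100.0, timeout_s=5.0)
tracker.response_received("saga-3")  # answered in time

late = tracker.expired(now=106.0)
```

Each entry returned by `expired` would be handed to the compensation path for that saga.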
Idempotency Requirements
All commands must be idempotent.
- Duplicate execution should not corrupt state.
- Compensating actions must be safe to retry.
Orchestrator retries are common under transient failures.
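On the receiving side, idempotency is commonly achieved by remembering the result of each idempotency key. The sketch below is a toy under stated assumptions: `PaymentService` and its fields are hypothetical, and the key-to-result map is held in memory, whereas a real service would store it durably alongside the charge itself.

```python
from typing import Dict


class PaymentService:
    def __init__(self) -> None:
        self.total_charged = 0
        self._results: Dict[str, dict] = {}  # idempotency key -> first result

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A repeated key returns the stored result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.total_charged += amount
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result


payments = PaymentService()
first = payments.charge("saga-1:charge_payment", 50)
retry = payments.charge("saga-1:charge_payment", 50)  # orchestrator retry
other = payments.charge("saga-2:charge_payment", 20)
```

The retried call returns the original result and the customer is charged once, so the orchestrator can retry freely after a timeout or lost response.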
Observability and Tracing
Orchestration simplifies observability:
- Single saga ID tracks entire workflow.
- Central state store enables querying incomplete workflows.
- Metrics per step duration and failure rate.
This is easier than tracing a choreographed saga, where workflow progress has to be reconstructed from events emitted by each participating service.
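Because the state is central, both kinds of question above reduce to simple queries. The following sketch assumes hypothetical record shapes: step events carry `step`, `duration_s`, and `ok`, and saga states carry `status` and `updated_at`. Neither shape comes from the text; they are chosen only to make the example concrete.

```python
from collections import defaultdict
from typing import Dict, List


def step_metrics(events: List[dict]) -> Dict[str, dict]:
    """Average duration and failure rate per step."""
    totals = defaultdict(lambda: {"count": 0, "failures": 0, "duration_s": 0.0})
    for event in events:
        t = totals[event["step"]]
        t["count"] += 1
        t["duration_s"] += event["duration_s"]
        if not event["ok"]:
            t["failures"] += 1
    return {
        step: {
            "avg_duration_s": t["duration_s"] / t["count"],
            "failure_rate": t["failures"] / t["count"],
        }
        for step, t in totals.items()
    }


def incomplete_sagas(states: Dict[str, dict], now: float, max_age_s: float) -> List[str]:
    """IDs of sagas still running that have not progressed for max_age_s seconds."""
    return sorted(
        saga_id
        for saga_id, s in states.items()
        if s["status"] == "RUNNING" and now - s["updated_at"] > max_age_s
    )


events = [
    {"step": "charge_payment", "duration_s": 0.25, "ok": True},
    {"step": "charge_payment", "duration_s": 0.75, "ok": False},
    {"step": "reserve_inventory", "duration_s": 0.5, "ok": True},
]
states = {
    "saga-1": {"status": "RUNNING", "updated_at": 100.0},
    "saga-2": {"status": "COMPLETED", "updated_at": 100.0},
    "saga-3": {"status": "RUNNING", "updated_at": 590.0},
}
metrics = step_metrics(events)
stuck = incomplete_sagas(states, now=600.0, max_age_s=60.0)
```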
Scalability Considerations
The orchestrator itself must be scalable and resilient:
- Stateless execution layer.
- Durable external state store.
- Horizontal scaling across multiple instances.
- Leader election if needed for coordination tasks.
The orchestrator should not become a single point of failure.
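When several stateless orchestrator instances share one state store, they need a way to avoid driving the same saga at once. One common approach, shown here only as a sketch, is a time-limited lease claimed with a conditional update. The `saga_lease` table, its columns, and `try_claim` are assumptions, and `sqlite3` stands in for whatever shared store is actually used.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE saga_lease ("
        "saga_id TEXT PRIMARY KEY, owner TEXT, lease_expires REAL)"
    )
    return conn


def try_claim(conn, saga_id: str, instance_id: str, now: float, lease_s: float) -> bool:
    """Claim the saga if it is unowned, already ours, or its lease has expired."""
    with conn:
        cur = conn.execute(
            "UPDATE saga_lease SET owner = ?, lease_expires = ? "
            "WHERE saga_id = ? "
            "AND (owner IS NULL OR owner = ? OR lease_expires <= ?)",
            (instance_id, now + lease_s, saga_id, instance_id, now),
        )
    # Exactly one row changes only when the condition held for this saga.
    return cur.rowcount == 1


conn = make_store()
with conn:
    conn.execute("INSERT INTO saga_lease VALUES ('saga-1', NULL, NULL)")

a_first = try_claim(conn, "saga-1", "instance-a", now=0.0, lease_s=10.0)
b_early = try_claim(conn, "saga-1", "instance-b", now=5.0, lease_s=10.0)
b_late = try_claim(conn, "saga-1", "instance-b", now=11.0, lease_s=10.0)
```

If instance A stops renewing its lease, for example because it crashed, instance B can take over once the lease expires, so no single instance is a point of failure for that saga.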
Interaction with Messaging Systems
Commands may be sent via:
- Synchronous RPC
- Message broker
- Workflow engine queue
Asynchronous messaging improves decoupling and resilience.
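With a broker, the orchestrator publishes a command and later consumes a reply, rather than blocking on a call. The sketch below imitates that shape with Python's in-process `queue.Queue` as a stand-in for a real broker; the `Command` and `Reply` message types and both functions are hypothetical.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    saga_id: str
    step: str
    idempotency_key: str


@dataclass(frozen=True)
class Reply:
    saga_id: str
    step: str
    success: bool


commands: "queue.Queue[Command]" = queue.Queue()  # orchestrator -> service
replies: "queue.Queue[Reply]" = queue.Queue()     # service -> orchestrator


def orchestrator_send(saga_id: str, step: str) -> None:
    # The orchestrator returns immediately; the reply arrives later.
    commands.put(Command(saga_id, step, f"{saga_id}:{step}"))


def inventory_worker_poll() -> bool:
    """Handle one pending command, if any, and publish a reply."""
    try:
        cmd = commands.get_nowait()
    except queue.Empty:
        return False
    replies.put(Reply(cmd.saga_id, cmd.step, success=True))
    return True


orchestrator_send("saga-1", "reserve_inventory")
handled = inventory_worker_poll()
reply = replies.get_nowait()
idle = inventory_worker_poll()  # nothing left to process
```

The reply carries the saga ID and step name so the orchestrator can correlate it with the persisted saga state.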
Common Anti-Patterns
- In-memory saga state only.
- No replay mechanism on restart.
- No monitoring of stuck workflows.
- Compensation logic not tested.
- Orchestrator tightly coupled to service internals.
Failure Injection Test
# Orchestrated saga validation
1) Start multi-step workflow
2) Crash orchestrator after step N
3) Restart orchestrator
4) Confirm workflow resumes correctly
5) Inject failure at step M
6) Verify compensations execute in correct order
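The validation steps can be automated against a small in-process model before running them against real services. The following sketch is an assumption-laden toy: `run`, `Crash`, and the compensation names `refund_payment` and `cancel_shipment` are hypothetical, a dictionary stands in for the durable state store, and the crash is simulated with an exception.

```python
from typing import Dict, List, Optional

STEPS = [  # (action, compensation)
    ("create_order", "cancel_order"),
    ("reserve_inventory", "release_inventory"),
    ("charge_payment", "refund_payment"),
    ("schedule_shipment", "cancel_shipment"),
]


class Crash(Exception):
    """Simulates the orchestrator process dying."""


def run(store: Dict[str, int], calls: List[str], saga_id: str,
        crash_after: Optional[int] = None, fail_at: Optional[int] = None) -> str:
    """Run or resume a saga; store[saga_id] is the number of completed steps."""
    done = store.setdefault(saga_id, 0)
    while done < len(STEPS):
        action, _ = STEPS[done]
        calls.append(action)
        if fail_at == done:  # 0-based index of the step that fails
            for _, compensation in reversed(STEPS[:done]):
                calls.append(compensation)
            return "FAILED"
        done += 1
        store[saga_id] = done  # persist progress before continuing
        if crash_after == done:  # crash once this many steps are complete
            raise Crash()
    return "COMPLETED"


# Steps 1-4: crash after step 2, restart, confirm the workflow resumes.
store: Dict[str, int] = {}
calls: List[str] = []
try:
    run(store, calls, "saga-1", crash_after=2)
except Crash:
    pass
resumed_status = run(store, calls, "saga-1")

# Steps 5-6: fail at the third step and check compensation order.
failed_calls: List[str] = []
failed_status = run({}, failed_calls, "saga-2", fail_at=2)
```

After the restart, each action appears exactly once in `calls`, and in the failure case the compensations run in reverse order of the completed steps.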
Operational Checklist
- Is saga state persisted durably?
- Are step commands idempotent?
- Are compensations tested under retry?
- Are timeouts enforced and observable?
- Can incomplete sagas be queried and inspected?
Key Takeaways
- Orchestration centralizes saga control logic.
- State persistence is critical for reliability.
- Timeout and compensation logic must be explicit.
- Idempotency is mandatory for safety.
- Observability is simpler than with choreography, but operational discipline is still required.
Orchestration-based Sagas provide strong operational clarity and deterministic workflow control. When implemented carefully, they offer a production-safe alternative to distributed atomic commits without sacrificing availability.