Saga Orchestration (Central Coordinator Done Right)

Orchestration-based Sagas use a central coordinator to manage distributed transaction workflows. This lesson explains workflow engines, state persistence, compensation handling, timeout recovery, and production reliability patterns.

Saga Pattern (Orchestration): Centralized Control of Distributed Workflows

In orchestration-based Sagas, a central orchestrator service coordinates the entire workflow. Instead of services reacting to events autonomously, the orchestrator explicitly commands each step and decides what to do next based on responses.

This model provides clarity, traceability, and deterministic control, at the cost of introducing a coordination component that must itself be operated and kept available.

Core Idea

The orchestrator:

Maintains saga state.
Sends commands to participating services.
Receives success or failure responses.
Triggers compensating actions when necessary.

The business workflow becomes explicit in a single place.

Example: Order Processing Workflow

Steps:

Create order.
Reserve inventory.
Charge payment.
Schedule shipment.

Orchestrator logic:

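start_saga(order_id)

call create_order()
if success:
    call reserve_inventory()
    if success:
        call charge_payment()
        if success:
            call schedule_shipment()
            if success:
                mark_saga_completed()
            else:
                # compensate completed steps in reverse order
                # (refund_payment is an assumed compensation for charge_payment)
                call refund_payment()
                call release_inventory()
                call cancel_order()
                mark_saga_failed()
        else:
            call release_inventory()
            call cancel_order()
            mark_saga_failed()
    else:
        call cancel_order()
        mark_saga_failed()
else:
    mark_saga_failed()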

The entire control flow is centralized.
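
The nested conditionals above grow with every added step. A minimal Python sketch of the same idea treats the workflow as a list of steps, each paired with its compensation, and undoes completed steps in reverse order on failure. The names `Step`, `SagaResult`, `run_saga`, and `demo_order_saga` are hypothetical, the actions are stubs, and nothing is persisted here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]      # returns True on success
    compensate: Callable[[], None]  # undoes a successful action


@dataclass
class SagaResult:
    status: str
    log: List[str] = field(default_factory=list)


def run_saga(steps: List[Step]) -> SagaResult:
    """Run steps in order; on the first failure, compensate in reverse."""
    result = SagaResult(status="running")
    completed: List[Step] = []
    for step in steps:
        if step.action():
            result.log.append(f"done:{step.name}")
            completed.append(step)
            continue
        result.log.append(f"failed:{step.name}")
        for done in reversed(completed):
            done.compensate()
            result.log.append(f"compensated:{done.name}")
        result.status = "failed"
        return result
    result.status = "completed"
    return result


def demo_order_saga(fail_at: str) -> SagaResult:
    """Build the order workflow with stub actions; one step may fail."""
    names = ["create_order", "reserve_inventory",
             "charge_payment", "schedule_shipment"]
    steps = [Step(n, (lambda n=n: n != fail_at), (lambda: None))
             for n in names]
    return run_saga(steps)
```

Calling `demo_order_saga("charge_payment")` records the two completed steps, the failure, and then the compensations for `reserve_inventory` and `create_order`, in that order.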

Saga State Persistence

The orchestrator must persist the following state durably:

Saga ID
Current step
Completed steps
Compensation progress
Timeout deadlines

If the orchestrator crashes, it must recover and resume the workflow.
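
As an illustration, the fields listed above could be stored as one row per saga and written in a single transaction. This is a sketch under assumptions: SQLite stands in for whatever durable store is actually used, and the `SagaState` fields, the table layout, and the helper names are hypothetical.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SagaState:
    saga_id: str
    current_step: Optional[str] = None
    completed_steps: List[str] = field(default_factory=list)
    compensated_steps: List[str] = field(default_factory=list)
    step_deadline: Optional[float] = None  # epoch seconds
    status: str = "running"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sagas ("
        "saga_id TEXT PRIMARY KEY, status TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def save(conn: sqlite3.Connection, state: SagaState) -> None:
    """Write the whole state in one transaction (commit or roll back)."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO sagas (saga_id, status, body) "
            "VALUES (?, ?, ?)",
            (state.saga_id, state.status, json.dumps(asdict(state))),
        )


def load(conn: sqlite3.Connection, saga_id: str) -> Optional[SagaState]:
    row = conn.execute(
        "SELECT body FROM sagas WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    return SagaState(**json.loads(row[0])) if row else None
```

Keeping the full state in one row means a crash leaves either the old state or the new one, never a mix of the two.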

Production Scenario: Orchestrator Crash Mid-Workflow

Symptom

Orders remain stuck in an intermediate state after the orchestrator restarts.

Root Cause

Saga state was not persisted atomically. After the crash, the orchestrator could not determine the last completed step.

Diagnosis

Inconsistent saga state entries.
Missing compensation markers.
No idempotency protection on commands.

Resolution

Persist state transitions before sending commands.
Use idempotency keys for each step invocation.
Replay incomplete sagas on startup.
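
The three fixes work together, which a small simulation can show. In this sketch the dict `store` stands in for the durable state store, and `advance`, `replay_incomplete`, `FakeService`, and `simulate_crash_and_replay` are hypothetical names. The step index is advanced only after a command is acknowledged, so a restart re-sends the in-flight command with the same idempotency key and the service ignores the duplicate.

```python
from typing import Callable, Dict, List, Tuple

STEPS = ["create_order", "reserve_inventory",
         "charge_payment", "schedule_shipment"]


class CrashError(Exception):
    """Simulates the orchestrator process dying."""


def advance(store: Dict[str, dict], saga_id: str,
            send: Callable[[str, str], None]) -> None:
    """Run the remaining steps; safe to call again after a crash."""
    state = store.setdefault(saga_id, {"next": 0, "status": "running"})
    while state["next"] < len(STEPS):
        step = STEPS[state["next"]]
        # state["next"] already names this step before the command is sent.
        send(step, f"{saga_id}:{step}")  # stable idempotency key
        state["next"] += 1               # recorded only after the ack
    state["status"] = "completed"


def replay_incomplete(store: Dict[str, dict],
                      send: Callable[[str, str], None]) -> List[str]:
    """On startup, resume every saga that is still marked running."""
    resumed = []
    for saga_id, state in list(store.items()):
        if state["status"] == "running":
            advance(store, saga_id, send)
            resumed.append(saga_id)
    return resumed


class FakeService:
    def __init__(self) -> None:
        self.seen = set()
        self.effects: List[str] = []

    def handle(self, step: str, key: str) -> None:
        if key in self.seen:  # duplicate delivery, no second effect
            return
        self.seen.add(key)
        self.effects.append(step)


def simulate_crash_and_replay() -> Tuple[str, List[str]]:
    store: Dict[str, dict] = {}
    svc = FakeService()
    crashed = {"done": False}

    def flaky_send(step: str, key: str) -> None:
        svc.handle(step, key)
        if step == "charge_payment" and not crashed["done"]:
            crashed["done"] = True
            raise CrashError("died before recording the ack")

    try:
        advance(store, "saga-1", flaky_send)
    except CrashError:
        pass
    replay_incomplete(store, flaky_send)
    return store["saga-1"]["status"], svc.effects
```

In the simulation the payment command is delivered twice but takes effect once, and the saga still finishes.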

Timeout Handling

Each step must define a timeout.

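If a service does not respond within the defined time, trigger compensation. A timeout leaves the outcome unknown, so the compensation must be safe whether or not the step actually ran.
The deadline must be tracked in the saga state store so that it survives an orchestrator restart.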

Timeout logic prevents indefinite workflow blocking.
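
One way to enforce this is a periodic sweep over the stored deadlines. The sketch below is an assumption-laden example: `PendingStep` and `check_timeouts` are hypothetical names, and retrying a bounded number of times before compensating is a policy choice added here, not something the workflow requires.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PendingStep:
    saga_id: str
    step: str
    deadline: float   # epoch seconds, stored with the saga state
    attempts: int = 1


def check_timeouts(pending: List[PendingStep], now: float,
                   timeout_s: float, max_attempts: int = 3) -> Dict[str, str]:
    """Return a decision for every saga whose current step has expired."""
    decisions: Dict[str, str] = {}
    for p in pending:
        if now < p.deadline:
            continue  # still within its time budget
        if p.attempts < max_attempts:
            p.attempts += 1
            p.deadline = now + timeout_s  # new deadline for the retry
            decisions[p.saga_id] = "retry"
        else:
            decisions[p.saga_id] = "compensate"
    return decisions
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the sweep easy to test.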

Idempotency Requirements

All commands must be idempotent.

Duplicate execution should not corrupt state.
Compensating actions must be safe to retry.

The orchestrator routinely retries commands after transient failures, so duplicate delivery is the normal case rather than an edge case.
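
On the service side, idempotency usually means remembering which keys have already been processed. A minimal sketch, assuming an in-memory `PaymentService` with hypothetical `charge` and `refund` methods (a real service would keep the key table in its own database):

```python
from typing import Dict, Set


class PaymentService:
    def __init__(self) -> None:
        self._receipts: Dict[str, str] = {}  # idempotency key -> receipt
        self._refunded: Set[str] = set()
        self.balance_cents = 0

    def charge(self, key: str, amount_cents: int) -> str:
        """Charge once per key; a repeat returns the original receipt."""
        if key in self._receipts:
            return self._receipts[key]
        self.balance_cents += amount_cents
        receipt = f"receipt-{len(self._receipts) + 1}"
        self._receipts[key] = receipt
        return receipt

    def refund(self, key: str, amount_cents: int) -> None:
        """Compensation: a no-op if never charged or already refunded."""
        if key not in self._receipts or key in self._refunded:
            return
        self.balance_cents -= amount_cents
        self._refunded.add(key)
```

Because `refund` tolerates a charge that never happened, the orchestrator can call it after a timeout without knowing whether the original command arrived.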

Observability and Tracing

Orchestration simplifies observability:

Single saga ID tracks entire workflow.
Central state store enables querying incomplete workflows.
Per-step metrics capture duration and failure rate.

This is easier than tracing a choreographed saga, where the workflow has to be reconstructed from events scattered across services.
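
Because every step passes through the orchestrator, both views can be derived from data it already has. The following sketch assumes a hypothetical `StepEvent` record emitted per step and a map of last-update timestamps per saga; neither is a standard API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class StepEvent(NamedTuple):
    saga_id: str
    step: str
    duration_ms: float
    ok: bool


def step_metrics(events: Iterable[StepEvent]) -> Dict[str, Dict[str, float]]:
    """Aggregate count, average duration, and failure rate per step."""
    grouped: Dict[str, List[StepEvent]] = defaultdict(list)
    for e in events:
        grouped[e.step].append(e)
    return {
        step: {
            "count": len(evs),
            "avg_ms": sum(e.duration_ms for e in evs) / len(evs),
            "failure_rate": sum(1 for e in evs if not e.ok) / len(evs),
        }
        for step, evs in grouped.items()
    }


def stuck_sagas(last_update: Dict[str, float], finished: Set[str],
                now: float, max_idle_s: float) -> List[str]:
    """List unfinished sagas with no state change for max_idle_s seconds."""
    return sorted(
        saga_id for saga_id, ts in last_update.items()
        if saga_id not in finished and now - ts > max_idle_s
    )
```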

Scalability Considerations

The orchestrator itself must be scalable and resilient:

Stateless execution layer.
Durable external state store.
Horizontal scaling across multiple instances.
Leader election if needed for coordination tasks.

The orchestrator should not become a single point of failure.
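
With several stateless instances sharing one store, two instances must not drive the same saga at once. One possible approach, sketched here as an assumption rather than a prescribed design, is a time-limited lease claimed with a conditional update. SQLite stands in for the shared store, and the table layout and `try_claim` are hypothetical.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sagas ("
        "saga_id TEXT PRIMARY KEY, "
        "owner TEXT, "
        "lease_until REAL NOT NULL DEFAULT 0)"
    )
    return conn


def try_claim(conn: sqlite3.Connection, saga_id: str, worker: str,
              now: float, lease_s: float) -> bool:
    """Claim a saga if it is unowned, already ours, or its lease expired."""
    with conn:  # one transaction: the check and the write happen together
        cur = conn.execute(
            "UPDATE sagas SET owner = ?, lease_until = ? "
            "WHERE saga_id = ? "
            "AND (owner IS NULL OR owner = ? OR lease_until <= ?)",
            (worker, now + lease_s, saga_id, worker, now),
        )
    return cur.rowcount == 1
```

If the owning instance dies, its lease simply expires and another instance picks the saga up, which is why the step commands still need to be idempotent.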

Interaction with Messaging Systems

Commands may be sent via:

Synchronous RPC
Message broker
Workflow engine queue

Asynchronous messaging improves decoupling and resilience.
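
To illustrate the asynchronous style, the sketch below uses Python's in-process `queue.Queue` as a stand-in for a message broker. The `Command` and `Reply` envelopes and the worker function are hypothetical; the point is that the orchestrator correlates replies by saga ID and step instead of blocking on a call.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    saga_id: str
    step: str
    idempotency_key: str


@dataclass(frozen=True)
class Reply:
    saga_id: str
    step: str
    ok: bool


def inventory_worker(commands: queue.Queue, replies: queue.Queue) -> None:
    """Drain the command queue and post one reply per command."""
    while True:
        try:
            cmd = commands.get_nowait()
        except queue.Empty:
            return
        replies.put(Reply(cmd.saga_id, cmd.step, ok=True))


def send_and_collect() -> Reply:
    commands: queue.Queue = queue.Queue()
    replies: queue.Queue = queue.Queue()
    commands.put(Command("saga-1", "reserve_inventory",
                         "saga-1:reserve_inventory"))
    inventory_worker(commands, replies)  # would run in another process
    return replies.get_nowait()
```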

Common Anti-Patterns

In-memory saga state only.
No replay mechanism on restart.
No monitoring of stuck workflows.
Compensation logic not tested.
Orchestrator tightly coupled to service internals.

Failure Injection Test

# Orchestrated saga validation
1) Start multi-step workflow
2) Crash orchestrator after step N
3) Restart orchestrator
4) Confirm workflow resumes correctly
5) Inject failure at step M
6) Verify compensations execute in correct order
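
The checklist above can be turned into an executable test against a toy orchestrator. This is a sketch under assumptions: the dict `state` stands in for the durable store, `run` and `scenario` are hypothetical names, and compensations are recorded as `undo_<step>` calls rather than real service requests.

```python
from typing import Dict, List, Optional, Tuple

STEPS = ["create_order", "reserve_inventory",
         "charge_payment", "schedule_shipment"]


class Crash(Exception):
    """Simulates the orchestrator process dying."""


def run(state: Dict, calls: List[str],
        crash_after: Optional[int] = None,
        fail_at: Optional[int] = None) -> None:
    """Advance one saga from its stored position."""
    while state["status"] == "running" and state["done"] < len(STEPS):
        i = state["done"]
        if i == fail_at:                 # injected step failure
            state["status"] = "compensating"
            break
        calls.append(STEPS[i])
        state["done"] = i + 1
        if crash_after == state["done"]:  # injected crash after step N
            raise Crash()
    if state["status"] == "compensating":
        while state["done"] > 0:          # undo in reverse order
            state["done"] -= 1
            calls.append("undo_" + STEPS[state["done"]])
        state["status"] = "failed"
    elif state["status"] == "running":
        state["status"] = "completed"


def scenario(n: int, m: Optional[int]) -> Tuple[str, List[str]]:
    """Crash after n steps, restart, then fail at step index m (if any)."""
    state = {"done": 0, "status": "running"}
    calls: List[str] = []
    try:
        run(state, calls, crash_after=n)
    except Crash:
        pass
    run(state, calls, fail_at=m)
    return state["status"], calls
```

`scenario(2, 3)` crashes after the second step, resumes at `charge_payment`, fails at `schedule_shipment`, and then undoes the three completed steps in reverse order.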

Operational Checklist

Is saga state persisted durably?
Are step commands idempotent?
Are compensations tested under retry?
Are timeouts enforced and observable?
Can incomplete sagas be queried and inspected?

Key Takeaways

Orchestration centralizes saga control logic.
State persistence is critical for reliability.
Timeout and compensation logic must be explicit.
Idempotency is mandatory for safety.
Observability is simpler than with choreography, but operational discipline is still required.

Orchestration-based Sagas provide strong operational clarity and deterministic workflow control. When implemented carefully, they offer a production-safe alternative to distributed atomic commits without sacrificing availability.
