Saga Pattern vs Two-Phase Commit: Distributed Transactions Without The Lies
Two-phase commit is the textbook answer for distributed transactions. It also doesn't survive contact with real systems. The saga pattern — orchestrated or choreographed — is what production systems actually use. Here is the difference, the implementation patterns, and the compensation logic that handles the inevitable failure cases.
The order flow involves four services: payment, inventory, fulfillment, notifications. The textbook answer is “wrap it in a distributed transaction.” The textbook is wrong. Two-phase commit (2PC) requires every participating service to support XA or similar distributed-transaction protocols, which Postgres does, MySQL does poorly, and most third-party services (Stripe, SendGrid, your search backend) do not at all.
Real systems use sagas. A saga is a sequence of local transactions, each of which has a compensating action that undoes it. If step 4 fails, you run the compensations for steps 3, 2, 1 in reverse. The transaction is eventually consistent, not atomic — but it is implementable.
This post is the comparison, the two saga implementation patterns (orchestrated vs choreographed), and the compensation rules that hold up in production.
Why 2PC is the wrong answer
The 2PC protocol:
- Coordinator asks all participants “can you commit?”
- Each replies yes/no. If anyone says no, coordinator broadcasts “abort.”
- If all say yes, coordinator broadcasts “commit.” Everyone commits.
The properties that make 2PC look attractive: atomic across services, no compensation needed. The properties that make 2PC unworkable in practice:
- Blocking on coordinator failure. If the coordinator crashes between phase 1 and phase 2, every participant is stuck holding locks waiting for instructions. Recovery is manual.
- Requires XA or DTC support. Most modern services don’t speak XA. Postgres does (via PREPARE TRANSACTION); your payment provider does not.
- Synchronous. Every service must be available simultaneously. Any timeout breaks the transaction.
- Performance. Holds locks across multiple network round-trips. Cuts throughput dramatically.
For a multi-table commit within a single Postgres database, 2PC works, but a regular transaction already covers that case. For cross-service atomicity, 2PC’s failure modes outweigh its benefits in the vast majority of real systems.
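To make the blocking window concrete, here is a minimal in-memory sketch of the protocol described above. The names `Participant`, `twoPhaseCommit`, and `crashAfterVoting` are hypothetical, and nothing here talks to a real XA resource manager; it only models the state each participant is left in.

```typescript
// Hypothetical in-memory model of two-phase commit; not a real XA client.
type Vote = 'yes' | 'no';
type TwoPhaseOutcome = 'committed' | 'aborted' | 'coordinator-crashed';

class Participant {
  // 'prepared' means locks are held and the participant cannot decide alone.
  status: 'idle' | 'prepared' | 'committed' | 'aborted' = 'idle';
  readonly name: string;
  private readonly canCommit: boolean;

  constructor(name: string, canCommit: boolean) {
    this.name = name;
    this.canCommit = canCommit;
  }

  // Phase 1: vote. A "yes" is a promise to commit if told to.
  prepare(): Vote {
    if (!this.canCommit) {
      this.status = 'aborted';
      return 'no';
    }
    this.status = 'prepared';
    return 'yes';
  }

  // Phase 2: apply the coordinator's decision.
  commit(): void { this.status = 'committed'; }
  abort(): void { this.status = 'aborted'; }
}

// crashAfterVoting simulates the coordinator dying between the two phases.
function twoPhaseCommit(participants: Participant[], crashAfterVoting = false): TwoPhaseOutcome {
  const votes = participants.map((p) => p.prepare());
  if (crashAfterVoting) {
    return 'coordinator-crashed'; // no decision is ever broadcast
  }
  const allYes = votes.every((v) => v === 'yes');
  for (const p of participants) {
    if (allYes) p.commit();
    else p.abort();
  }
  return allYes ? 'committed' : 'aborted';
}
```

With `crashAfterVoting` set, every participant that voted yes stays in `prepared` indefinitely. That is the stuck-holding-locks state from the first bullet.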
What a saga is
A saga is a sequence of local transactions. Each step’s local transaction either succeeds or fails. If a later step fails, you run compensating transactions for previous steps in reverse order to undo their effects.
Forward:    [pay] → [reserve] → [ship] → [notify]
                                  ↓ fails
Compensate: [refund] ← [release] ← (steps 1-2 undone)
The trade: not atomic. There is a window where money is captured but inventory is not reserved. If the system is observed mid-saga, it is in an inconsistent intermediate state. Eventually consistent, after the compensation runs.
In return for that trade: each step is a local transaction in its own service. No XA, no coordinator-blocking, no holding locks across the network. Each service stays loosely coupled.
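The definition above can be sketched as a small generic runner. This is an illustration under assumptions rather than a production implementation: `SagaStep` and `runSaga` are hypothetical names, state is kept in memory only, and compensations are assumed to succeed.

```typescript
// A step pairs a forward local transaction with the compensation that undoes it.
interface SagaStep {
  name: string;
  run: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order. On the first failure, compensates the completed steps
// in reverse order and reports that the saga was rolled back.
async function runSaga(steps: SagaStep[]): Promise<'completed' | 'compensated'> {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch {
      // The failed step made no change, so only earlier steps are undone.
      for (const prev of done.reverse()) {
        await prev.compensate();
      }
      return 'compensated';
    }
  }
  return 'completed';
}
```

If the third of four steps throws, the runner calls the second step's compensation, then the first's, which matches the diagram above.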
Orchestrated saga: one coordinator, many services
A central orchestrator drives the saga forward. It calls service A; on success, calls B; on failure, calls A’s compensation. The state of the saga is a row in the orchestrator’s database, updated as steps complete.
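type SagaState =
  | { phase: 'starting' }
  | { phase: 'paid'; paymentId: string }
  | { phase: 'reserved'; paymentId: string; reservationId: string }
  | { phase: 'shipped'; paymentId: string; reservationId: string; shipmentId: string }
  | { phase: 'completed' }
  | { phase: 'compensating'; reason: string; lastSuccessful: string }
  | { phase: 'failed'; reason: string };

// Assumed service clients and storage; signatures are inferred from the calls below.
declare function charge(orderId: string): Promise<string>;
declare function reserveInventory(orderId: string): Promise<string>;
declare function ship(orderId: string): Promise<string>;
declare function notify(orderId: string): Promise<void>;
declare function releaseInventory(reservationId: string): Promise<void>;
declare function refund(paymentId: string): Promise<void>;
declare function persist(orderId: string, state: SagaState): Promise<void>;

async function runOrderSaga(orderId: string) {
  let state: SagaState = { phase: 'starting' };
  await persist(orderId, state);
  try {
    // Each step is a local transaction; persist the new state before moving on.
    const paymentId = await charge(orderId);
    state = { phase: 'paid', paymentId };
    await persist(orderId, state);

    const reservationId = await reserveInventory(orderId);
    state = { phase: 'reserved', paymentId, reservationId };
    await persist(orderId, state);

    const shipmentId = await ship(orderId);
    state = { phase: 'shipped', paymentId, reservationId, shipmentId };
    await persist(orderId, state);

    await notify(orderId);
    state = { phase: 'completed' };
    await persist(orderId, state);
  } catch (err) {
    // A caught value is not guaranteed to be an Error, so narrow before reading .message.
    const reason = err instanceof Error ? err.message : String(err);
    await compensate(orderId, state, reason);
  }
}

async function compensate(orderId: string, state: SagaState, reason: string) {
  // Record that compensation has started, so a crash here is visible on restart.
  await persist(orderId, { phase: 'compensating', reason, lastSuccessful: state.phase });
  // Undo in reverse of the forward order: inventory first, then payment.
  if (state.phase === 'shipped' || state.phase === 'reserved') {
    await releaseInventory(state.reservationId);
  }
  if (state.phase === 'shipped' || state.phase === 'reserved' || state.phase === 'paid') {
    await refund(state.paymentId);
  }
  await persist(orderId, { phase: 'failed', reason });
}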
Properties:
- Easy to reason about. The flow is in one place; you can read the orchestrator and understand the saga.
- Resumable. If the orchestrator crashes mid-saga, on restart it reads the persisted state and continues from where it left off.
- Centralized logic. Adding a new step requires only orchestrator changes.
Costs:
- The orchestrator is a single point of coupling. All services must be reachable to/from it.
- The orchestrator tends to grow complex: every new step, retry rule, and edge case lands in it.
Tools: Temporal, AWS Step Functions, Camunda Zeebe. Each handles persistence, retries, and timeouts for you.
Choreographed saga: events, no orchestrator
Each service publishes domain events; other services subscribe and react. There is no central coordinator.
[ payment service ] → publishes "OrderPaid" → [ inventory service ]
[ inventory service ] → publishes "InventoryReserved" → [ fulfillment service ]
[ fulfillment service ] → publishes "OrderShipped" → [ notification service ]
Failure path:
[ inventory service ] → publishes "InventoryReservationFailed" → [ payment service ]
→ triggers refund
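The event flow above can be sketched with an in-memory bus standing in for a real broker. `EventBus`, `wireServices`, the `inStock` callback, and the `PaymentRefunded` event are hypothetical names added for illustration. Delivery here is synchronous and in-order, which a real broker does not guarantee.

```typescript
// Hypothetical in-memory event bus; a real system would use a message broker.
type SagaEvent =
  | { type: 'OrderPaid'; orderId: string }
  | { type: 'InventoryReserved'; orderId: string }
  | { type: 'InventoryReservationFailed'; orderId: string }
  | { type: 'OrderShipped'; orderId: string }
  | { type: 'PaymentRefunded'; orderId: string };

type Handler = (event: SagaEvent) => void;

class EventBus {
  private handlers = new Map<SagaEvent['type'], Handler[]>();
  // Published event types in order, kept only so the flow can be inspected.
  readonly history: SagaEvent['type'][] = [];

  subscribe(type: SagaEvent['type'], handler: Handler): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  publish(event: SagaEvent): void {
    this.history.push(event.type);
    for (const handler of this.handlers.get(event.type) ?? []) {
      handler(event);
    }
  }
}

// Each subscription stands in for one service; none of them knows the whole flow.
function wireServices(bus: EventBus, inStock: (orderId: string) => boolean): void {
  // Inventory service: reacts to payment, reports success or failure.
  bus.subscribe('OrderPaid', (e) => {
    if (inStock(e.orderId)) {
      bus.publish({ type: 'InventoryReserved', orderId: e.orderId });
    } else {
      bus.publish({ type: 'InventoryReservationFailed', orderId: e.orderId });
    }
  });
  // Fulfillment service: ships once inventory is reserved.
  bus.subscribe('InventoryReserved', (e) => {
    bus.publish({ type: 'OrderShipped', orderId: e.orderId });
  });
  // Payment service: compensates by refunding when reservation fails.
  bus.subscribe('InventoryReservationFailed', (e) => {
    bus.publish({ type: 'PaymentRefunded', orderId: e.orderId });
  });
}
```

The saga's overall shape exists only as the sum of these subscriptions, which is where both the loose coupling and the tracing difficulty come from.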
Properties:
- Loose coupling. Each service knows only its inputs and outputs.
- No central component to fail.
- Independent scaling. Each service handles its events at its own pace.
Costs:
- Hard to follow. Tracing a saga across logs is genuinely difficult.
- Implicit ordering. It is not always obvious whether two compensating events are guaranteed to arrive in the right order.
- Schema coupling. Services depend on event shapes; a change ripples.
Choreography is appealing in theory and painful in practice for sagas with more than 3-4 steps. Most production sagas I have seen end up orchestrated.
Compensation rules that hold up
Three rules that make compensations correct.
1. Compensations must be idempotent. A retry of refund(paymentId) must not refund twice. Use the payment ID as an idempotency key. Most payment providers support this natively.
2. Compensations must be commutative if possible. If two compensations can run, the order should not matter. In practice, you order them deterministically (reverse of forward order), but defensive programming helps.
3. Compensations cannot fail. If a compensation fails, the saga is in an unrecoverable state and requires human intervention. Design compensations as the simplest, most reliable code in the system. If refund cannot run reliably, escalate — log to a dead-letter queue, alert ops, freeze the saga.
The “cannot fail” rule is harder than it sounds. Some compensations can inherently fail (for example, the third party rejects the refund because more than 90 days have passed). Plan for those with an explicit “manual intervention required” state.
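Rules 1 and 3 can be sketched together: an idempotent refund keyed by payment ID, wrapped in a retry loop that escalates instead of throwing. `FakePaymentProvider`, `compensateWithRetry`, and the outcome type are hypothetical; a real provider enforces idempotency on its side, and real code would add backoff, a dead-letter queue, and alerting where the comment indicates.

```typescript
// Hypothetical provider: a refund repeated with the same key is a no-op.
class FakePaymentProvider {
  private refunds = new Map<string, number>();

  refund(idempotencyKey: string, amountCents: number): void {
    if (this.refunds.has(idempotencyKey)) return; // already applied
    this.refunds.set(idempotencyKey, amountCents);
  }

  totalRefundedCents(): number {
    let total = 0;
    this.refunds.forEach((amount) => { total += amount; });
    return total;
  }
}

type CompensationOutcome =
  | { status: 'compensated' }
  | { status: 'manual_intervention_required'; reason: string };

// Retries a compensation a bounded number of times. It never throws:
// exhausting the attempts produces an explicit escalation state instead.
function compensateWithRetry(action: () => void, maxAttempts = 3): CompensationOutcome {
  let lastError = 'unknown error';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      action();
      return { status: 'compensated' };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  // Assumption: the caller freezes the saga and alerts ops on this outcome.
  return { status: 'manual_intervention_required', reason: lastError };
}
```

Because the refund is idempotent, retrying it after an ambiguous failure (for example, a timeout where the first call actually succeeded) is safe.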
Steps that are not compensable
Some actions cannot be undone. Sending an email. Calling an external service that does not support reversal. Triggering a webhook to a partner.
Two strategies:
- Make non-compensable steps run last. Order: charge → reserve → ship → notify. If notification were first, you’d have to “un-notify”, which is impossible. Putting it last means compensation only happens for the early steps, which are designed to be reversible.
- Pivot transactions. If you must run a non-compensable action mid-saga, design that step as a “pivot”: beyond it, the saga always continues forward and never compensates. Your saga state machine has to know this.
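One way a state machine can "know" about the pivot is sketched here, under assumptions: `PivotAwareStep`, `runSagaWithPivot`, and the result names are hypothetical. Steps after the pivot are assumed to be idempotent so that retrying them is safe.

```typescript
interface PivotAwareStep {
  name: string;
  run: () => Promise<void>;
  compensate?: () => Promise<void>; // absent for non-compensable steps
  pivot?: boolean;                  // once this step succeeds, never roll back
}

type PivotSagaResult = 'completed' | 'compensated' | 'needs_manual_intervention';

async function runSagaWithPivot(
  steps: PivotAwareStep[],
  maxAttempts = 3,
): Promise<PivotSagaResult> {
  const done: PivotAwareStep[] = [];
  let pastPivot = false;

  for (const step of steps) {
    let succeeded = false;
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt++) {
      try {
        await step.run();
        succeeded = true;
      } catch {
        if (!pastPivot) {
          // Before the pivot: roll back completed steps, most recent first.
          for (const prev of done.reverse()) {
            if (prev.compensate) await prev.compensate();
          }
          return 'compensated';
        }
        // After the pivot: never compensate, just retry the step.
      }
    }
    // Only reachable after the pivot, when retries are exhausted.
    if (!succeeded) return 'needs_manual_intervention';
    done.push(step);
    if (step.pivot) pastPivot = true;
  }
  return 'completed';
}
```

A failure of the pivot step itself still triggers compensation, since the irreversible action did not happen. Only failures after it are forced forward.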
Persistence and recovery
The orchestrator’s state must be durable. Steps:
- Persist new saga state before calling the next service.
- Crash → restart → read state → resume from the next step.
This works because each service step is idempotent: when you call it again with the same business key, it returns the existing result if it already ran. Combined with idempotency on the saga side, “did I already do step 3?” is answered by querying the downstream service.
For a Temporal-style orchestrator, the framework handles this for you. For a homegrown orchestrator, the pattern is roughly the same as the outbox pattern — write the next intended step, dispatch it, mark complete.
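A homegrown version of that resume loop might look like the following sketch. `InMemorySagaStore`, `StepFn`, and `resumeSaga` are hypothetical names, and the in-memory map stands in for a durable table. Each step is assumed to be idempotent on `orderId`, because a crash between running a step and saving progress means the step runs again on restart.

```typescript
type StepFn = (orderId: string) => Promise<void>;

// Stand-in for a durable table: orderId -> number of completed steps.
class InMemorySagaStore {
  private completed = new Map<string, number>();

  async load(orderId: string): Promise<number> {
    return this.completed.get(orderId) ?? 0;
  }

  async save(orderId: string, completedSteps: number): Promise<void> {
    this.completed.set(orderId, completedSteps);
  }
}

// Safe to call both for a fresh saga and after a crash: it skips the steps
// already recorded as complete and continues from the next one.
async function resumeSaga(
  orderId: string,
  steps: StepFn[],
  store: InMemorySagaStore,
): Promise<void> {
  let completedSteps = await store.load(orderId);
  for (const step of steps.slice(completedSteps)) {
    await step(orderId); // must be idempotent on orderId
    completedSteps += 1;
    await store.save(orderId, completedSteps); // persist before the next call
  }
}
```

Calling `resumeSaga` again after a failure re-runs only the step that did not get recorded, not the ones before it.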
Observability
A saga in flight is multiple service calls scattered across logs. Tracing is mandatory. Each saga gets a trace ID; every service call carries it; the saga’s progression shows up as a single trace in your tracing tool.
For more business-level visibility, store the saga state itself in a table that users or ops can query: “show me all order sagas in ‘compensating’ state.” This is what saves you when 50 sagas are stuck.
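As a rough sketch of that query, assume each saga row carries its trace ID, current phase, and last-update timestamp. The `SagaRow` shape and `findStuckSagas` are hypothetical; in practice this would be a SQL query against the orchestrator's table rather than an in-memory filter.

```typescript
// Hypothetical row shape for the saga state table.
interface SagaRow {
  sagaId: string;
  traceId: string;   // the same ID every service call carries
  phase: string;     // e.g. 'paid', 'compensating', 'failed'
  updatedAt: number; // epoch milliseconds of the last state change
}

// Returns sagas that have sat in the given phase for longer than olderThanMs.
function findStuckSagas(
  rows: SagaRow[],
  phase: string,
  olderThanMs: number,
  now: number,
): SagaRow[] {
  return rows.filter((row) => row.phase === phase && now - row.updatedAt > olderThanMs);
}
```

The `traceId` on each returned row is the link from the business-level view back to the distributed trace.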
When to use what
A practical decision tree:
- Single-database, multi-table change? Regular transaction. No saga.
- Cross-service, all services controlled by you, low volume? Orchestrated saga.
- Cross-service, includes services not under your control (Stripe, SendGrid)? Orchestrated saga, with compensation logic for what you control and idempotency for what you don’t.
- Very high volume, services own their own logic, simple flow? Choreographed saga via events.
- Strict atomicity required, all participants are Postgres-like? 2PC could work — but reconsider whether you can model the data so it lives in one DB.
The right answer is usually the second or third option: an orchestrated saga. The “we wrote our own choreographed saga across 8 services” case usually ends in tears.
The takeaway
Two-phase commit is a textbook answer that fails in real systems. Sagas — orchestrated or choreographed — are what production uses. Pick orchestration for clarity and choreography for very loose coupling. Make every step idempotent. Design compensations as the most reliable code in the system. Order steps so non-compensable ones come last.
The next time someone says “we need a distributed transaction,” ask “what compensation do we run if step three fails?” The answer is the saga design.