How Distributed Transactions Work — Consistency Across Services

2026-03-23

A transaction on a single database is well-understood: BEGIN, do the work, COMMIT. ACID guarantees are enforced by the database engine. But what if the transaction spans multiple databases, multiple services, or multiple partitions?

A user places an order: the order service creates the order, the payment service charges the card, and the inventory service decrements stock. Each service has its own database. If the payment succeeds but the inventory update fails, the system is inconsistent — the user was charged but the item isn't reserved.

Distributed transactions solve this, but the solutions involve fundamental tradeoffs.

Two-Phase Commit (2PC)

The classic approach. A coordinator manages the transaction across all participating nodes:

Phase 1: Prepare

  1. Coordinator sends "prepare" to all participants.
  2. Each participant writes its changes to its WAL, acquires locks, and replies "yes" (ready to commit) or "no" (cannot commit).

Phase 2: Commit

  3. If all participants voted "yes", the coordinator sends "commit" to all.
  4. Each participant commits and releases locks.
  5. If any participant voted "no", the coordinator sends "abort" to all.
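The two phases can be sketched as a toy in-memory simulation. This is illustrative only — a real implementation would persist votes to a WAL and survive coordinator crashes; the class and method names here are invented for the example:

```python
# Toy two-phase commit: a coordinator collects votes from every
# participant, then commits or aborts based on a unanimous "yes".

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: write changes to the WAL, acquire locks, then vote.
        if self.can_commit:
            self.state = "prepared"   # locks are held from here until phase 2
            return "yes"
        self.state = "aborted"
        return "no"

    def commit(self):
        self.state = "committed"      # apply changes, release locks

    def abort(self):
        self.state = "aborted"        # roll back, release locks


def two_phase_commit(participants):
    # Phase 1: gather votes from all participants.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous "yes"; otherwise abort everyone.
    if all(v == "yes" for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        if p.state == "prepared":
            p.abort()
    return "aborted"
```

Note that the blocking problem lives in the gap between `prepare` returning "yes" and the coordinator's second message: a prepared participant holds its locks and cannot decide its own fate.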

This is atomic: all commit or all abort. But it has serious problems:

Blocking — if the coordinator crashes after Phase 1 but before Phase 2, participants hold locks indefinitely. They've voted "yes" and can't unilaterally decide to commit or abort. They're stuck until the coordinator recovers.

Latency — two sequential network round trips (prepare, then commit) across all participants, each gated on the slowest responder. In a microservices architecture with 5 services, this adds significant latency.

Availability — every participant must be reachable. If one is down, the entire transaction blocks.

2PC is used within databases (coordinating writes across partitions) but rarely across services. PostgreSQL's prepared transactions and MySQL's XA transactions implement 2PC.

Sagas — Long-Running Transactions

A saga replaces a single distributed transaction with a sequence of local transactions, each with a compensating action that undoes it if a later step fails.

1. Order Service: Create order (compensate: cancel order)
2. Payment Service: Charge card (compensate: refund card)
3. Inventory Service: Reserve item (compensate: release reservation)

If step 3 fails:

  • Execute compensation for step 2 (refund the card).
  • Execute compensation for step 1 (cancel the order).
  • The system returns to its original state.
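The forward steps and reverse-order compensations can be captured in a small executor. This is a minimal orchestration-style sketch; the step and compensation functions are hypothetical stand-ins for real service calls:

```python
# Minimal saga executor: run steps in order; on failure, run the
# compensations of the already-completed steps in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) callable pairs."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            # Undo in reverse order. A failed compensation would need
            # retries or manual intervention in a real system.
            for comp in reversed(completed):
                comp()
            return "compensated"
    return "completed"
```

For the order example, the steps list would be `[(create_order, cancel_order), (charge_card, refund_card), (reserve_item, release_reservation)]`; if `reserve_item` raises, the executor refunds the card and then cancels the order.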

Advantages over 2PC:

  • No coordinator holding locks across services.
  • Each step is a local transaction (fast, no distributed coordination).
  • Services don't need to be available simultaneously.

Disadvantages:

  • No isolation — between step 1 and step 3, the order exists but the item isn't reserved. Other transactions might see this intermediate state.
  • Compensation complexity — not everything is easily reversible. A sent email can't be unsent. A notification can't be undelivered.
  • Ordering — compensations must run in reverse order. If compensation fails, manual intervention may be needed.

Two saga execution patterns:

Choreography — each service publishes events when it completes its step. The next service listens and acts. No central coordinator. Simple but hard to trace and debug.

Orchestration — a central orchestrator tells each service what to do and handles failures. Easier to understand and debug but introduces a single point of coordination.
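The choreography pattern can be illustrated with a toy in-process event bus — no central coordinator, each service just reacts to the previous service's event. The event names and handlers here are invented for the example:

```python
# Toy choreography: services subscribe to events on a shared bus and
# publish their own events when done. No orchestrator decides the flow;
# it emerges from the subscriptions.

class EventBus:
    def __init__(self):
        self.handlers = {}
        self.log = []          # record of published events, for tracing

    def subscribe(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def publish(self, event, payload):
        self.log.append(event)
        for handler in self.handlers.get(event, []):
            handler(payload)


bus = EventBus()
# The payment service reacts to new orders...
bus.subscribe("order.created", lambda o: bus.publish("payment.charged", o))
# ...and the inventory service reacts to successful payments.
bus.subscribe("payment.charged", lambda o: bus.publish("item.reserved", o))
```

Publishing `order.created` cascades through the chain. The debugging difficulty is visible even here: the end-to-end flow is not written down anywhere — it has to be reconstructed from the subscriptions.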

The Outbox Pattern

A common problem: a service needs to update its database AND publish a message (to a queue or event bus) atomically. If it updates the database and crashes before publishing, the message is lost. If it publishes and crashes before updating, the database is inconsistent.

The outbox pattern solves this:

  1. Write the database change AND the outgoing message to the same database in a single local transaction.
  2. A separate process reads unpublished messages from the outbox table and publishes them.
  3. After successful publishing, mark the message as sent.

Because step 1 is a local transaction, it's atomic. The message is guaranteed to eventually be published (step 2 retries on failure). The consumer must handle duplicates (the message might be published more than once).
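The three steps above can be sketched with SQLite standing in for the service's database. The table layout, column names, and helper functions are illustrative assumptions, not a prescribed schema:

```python
# Outbox sketch: the state change and the outgoing message are written in
# ONE local transaction; a relay process later publishes pending rows and
# marks them sent.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, sent INTEGER DEFAULT 0);
""")

def place_order(order_id):
    # Step 1: business write and outbox write commit atomically, or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'created')", (order_id,))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"event": "order.created", "id": order_id}),))

def relay(publish):
    # Steps 2-3: publish pending messages, then mark them as sent.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # may deliver duplicates if this retries
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

If the process crashes between publishing and marking the row sent, the relay republishes on its next run — which is exactly why consumers must tolerate duplicates.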

Idempotency — Handling Duplicates

In distributed systems, messages can be delivered more than once (retries, network duplicates). An operation is idempotent if applying it multiple times has the same effect as applying it once.

Idempotent: SET balance = 100 — running it twice leaves balance at 100. Not idempotent: ADD 50 TO balance — running it twice adds 100.

Making operations idempotent:

  • Use unique request IDs. Track which IDs have been processed. Reject duplicates.
  • Use absolute values instead of deltas. SET instead of INCREMENT.
  • Design state transitions that can only happen once. An order in state "pending" can transition to "paid" — a second "pay" request for an already-paid order is a no-op.
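The request-ID approach can be shown in a few lines. This hypothetical payment handler guards a non-idempotent operation (incrementing a balance) by recording processed IDs:

```python
# Idempotency via request IDs: duplicate deliveries become no-ops.

processed_ids = set()
balance = {"amount": 0}

def handle_charge(request_id, amount):
    if request_id in processed_ids:
        return "duplicate"          # already applied: safe to ignore
    balance["amount"] += amount     # the non-idempotent effect, now guarded
    processed_ids.add(request_id)   # in production this set lives in the
                                    # database, written in the same
                                    # transaction as the balance update
    return "applied"
```

The comment in the code is the crucial part: the dedup record and the effect must be persisted atomically, or a crash between them reintroduces the duplicate problem.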

Idempotency is the most practical tool for distributed consistency. It doesn't solve everything, but it makes retries safe — and retries are the fundamental recovery mechanism in distributed systems.

What About Exactly-Once Delivery?

In theory, exactly-once message delivery is impossible in a distributed system (messages can be lost or duplicated). In practice, systems achieve effectively exactly-once by combining at-least-once delivery with idempotent processing.

Kafka achieves this with idempotent producers (the broker deduplicates by producer ID and sequence number) and transactions that let a consume-transform-produce pipeline commit its consumer offsets and produced records atomically.

Choosing an Approach

| Approach             | Consistency   | Latency      | Complexity | Use case                                   |
|----------------------|---------------|--------------|------------|--------------------------------------------|
| 2PC                  | Strong (ACID) | High         | Moderate   | Within a single database across partitions |
| Saga (orchestrated)  | Eventual      | Low per step | High       | Cross-service business processes           |
| Saga (choreographed) | Eventual      | Low per step | High       | Event-driven microservices                 |
| Outbox + idempotency | Eventual      | Low          | Moderate   | Reliable messaging + database updates      |

Most microservices architectures use sagas with the outbox pattern and idempotent consumers. 2PC is reserved for within-database coordination.

Next Steps

This completes the distributed systems learning path.

These problems are why distributed systems are hard — and why understanding the tradeoffs matters more than memorizing the algorithms.