Distributed Transactions & 2PC

Why Single-Node ACID Isn’t Enough

A single PostgreSQL instance gives you full ACID: transactions are atomic, consistent, isolated, and durable. Scale out to multiple services or databases, and those guarantees stop at each database’s boundary.

Consider: a user places an order. Your orders service deducts inventory. Your payments service charges a credit card. These live in two databases, owned by two teams, communicating over a network. How do you guarantee that either both happen or neither happens?

If orders succeeds but payments fails, you’ve shipped a product you won’t be paid for. If payments succeeds but orders fails, you’ve charged a customer for a product that won’t arrive. The naive solution — try one, then the other, rollback if the second fails — fails because the rollback of the first operation is itself a network call that can fail.

This is the distributed transaction problem. There is no perfect solution. Every approach trades some combination of performance, availability, or complexity.

Two-Phase Commit (2PC)

Two-Phase Commit is the classical solution for distributed atomicity. It coordinates multiple resource managers (databases, message queues) through a single coordinator process.

Phase 1: Prepare

The coordinator sends PREPARE to all participants. Each participant:

Executes the transaction’s work (but does not commit).
Writes a WAL record ensuring it can either commit or abort later.
Acquires all necessary locks.
Responds YES (ready to commit) or NO (cannot commit — rollback immediately).

After responding YES, a participant is in the prepared state. It must commit or abort depending on the coordinator’s decision — it cannot unilaterally abort now, even if the coordinator goes silent.

Phase 2: Commit or Abort

If all participants voted YES:

The coordinator writes a commit record to its own durable log.
Sends COMMIT to all participants.
Participants commit and release locks.

If any participant voted NO:

The coordinator sends ABORT to all participants.
Participants undo their work and release locks.

Coordinator                   Orders DB         Payments DB
    │                              │                  │
    │── PREPARE ──────────────────►│                  │
    │── PREPARE ───────────────────────────────────►  │
    │◄─ YES ───────────────────────│                  │
    │◄─ YES ────────────────────────────────────────  │
    │                                                  │
    │  (coordinator writes COMMIT to its own WAL)      │
    │── COMMIT ───────────────────►│                  │
    │── COMMIT ────────────────────────────────────►  │
    │◄─ ACK ───────────────────────│                  │
    │◄─ ACK ────────────────────────────────────────  │
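
The message flow above can be sketched in a few lines of TypeScript. This is a minimal in-memory simulation, not a production coordinator: the Participant interface and the InMemoryParticipant and Coordinator classes are hypothetical names introduced for illustration, and the log is a plain array standing in for durable storage.

```typescript
// Minimal in-memory sketch of two-phase commit. All names are hypothetical.
type Vote = "YES" | "NO";
type Outcome = "COMMITTED" | "ABORTED";

interface Participant {
  name: string;
  prepare(txId: string): Vote; // Phase 1: stage work, take locks, vote
  commit(txId: string): void;  // Phase 2: make staged work permanent
  abort(txId: string): void;   // Phase 2: discard staged work
}

// A toy participant that records its state per transaction.
class InMemoryParticipant implements Participant {
  readonly state = new Map<string, "PREPARED" | "COMMITTED" | "ABORTED">();
  constructor(public name: string, private canCommit: boolean) {}

  prepare(txId: string): Vote {
    if (!this.canCommit) {
      this.state.set(txId, "ABORTED"); // a NO voter rolls back immediately
      return "NO";
    }
    this.state.set(txId, "PREPARED");
    return "YES";
  }
  commit(txId: string): void { this.state.set(txId, "COMMITTED"); }
  abort(txId: string): void { this.state.set(txId, "ABORTED"); }
}

class Coordinator {
  // Stand-in for the coordinator's durable log.
  readonly log: string[] = [];

  run(txId: string, participants: Participant[]): Outcome {
    // Phase 1: collect a vote from every participant.
    const votes = participants.map((p) => p.prepare(txId));
    const allYes = votes.every((v) => v === "YES");

    // The decision is logged before any Phase 2 message is sent.
    this.log.push(`${allYes ? "COMMIT" : "ABORT"} ${txId}`);

    // Phase 2: broadcast the decision.
    for (const p of participants) {
      if (allYes) p.commit(txId);
      else p.abort(txId);
    }
    return allYes ? "COMMITTED" : "ABORTED";
  }
}
```

A transaction in which every participant votes YES returns COMMITTED; a single NO vote makes the coordinator log ABORT and roll every participant back.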

The Blocking Problem: The Fatal Flaw

2PC has a well-known blocking scenario:

Coordinator sends PREPARE to all participants. All respond YES.
Coordinator writes commit record to its WAL.
Coordinator crashes before sending COMMIT.

Now the participants are blocked indefinitely:

They are in the prepared state — holding locks, unable to commit or abort on their own.
They must wait for the coordinator to recover and resend COMMIT.
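If the coordinator’s log is lost (for example, to disk corruption), the participants cannot resolve the transaction themselves; an operator must commit or roll back each prepared transaction by hand.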
Other transactions that need the locked rows are also blocked.

This is called the in-doubt window. In practice, transaction managers implement coordinator crash recovery: the recovered coordinator reads its WAL, finds the commit record, and resends COMMIT to participants. But during the recovery window (minutes to hours in pathological cases), every row locked by a prepared transaction stays locked, and any work that touches those rows is frozen.
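
Coordinator recovery can be sketched as a pure function over the log. The record shapes below (BEGIN, COMMIT, ABORT, END) are assumptions made for this example, not the format of any particular database; the sketch also assumes the coordinator always logs its decision before sending any Phase 2 message.

```typescript
// Hypothetical log record shapes for a 2PC coordinator.
type LogRecord =
  | { kind: "BEGIN"; txId: string; participants: string[] }
  | { kind: "COMMIT"; txId: string }
  | { kind: "ABORT"; txId: string }
  | { kind: "END"; txId: string }; // every participant has acknowledged

type RecoveryAction = { txId: string; send: "COMMIT" | "ABORT"; to: string[] };

// Replay the log after a crash and work out which decisions must be re-sent.
function recover(log: LogRecord[]): RecoveryAction[] {
  const participants = new Map<string, string[]>();
  const decision = new Map<string, "COMMIT" | "ABORT">();
  const finished = new Set<string>();

  for (const rec of log) {
    switch (rec.kind) {
      case "BEGIN":
        participants.set(rec.txId, rec.participants);
        break;
      case "COMMIT":
      case "ABORT":
        decision.set(rec.txId, rec.kind);
        break;
      case "END":
        finished.add(rec.txId);
        break;
    }
  }

  const actions: RecoveryAction[] = [];
  participants.forEach((to, txId) => {
    if (finished.has(txId)) return; // fully acknowledged, nothing to do
    // No decision record means the crash happened before the commit point,
    // so no participant can have been told to commit and aborting is safe.
    const send = decision.get(txId) ?? "ABORT";
    actions.push({ txId, send, to });
  });
  return actions;
}
```

Participants stay blocked until these actions reach them, which is why the time the coordinator takes to restart and replay its log bounds the in-doubt window.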

PostgreSQL 2PC example:

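-- Application acts as coordinator.
-- Requires max_prepared_transactions > 0 on every participating server
-- (the default of 0 disables PREPARE TRANSACTION).

-- On shard 1 (shard 2 runs its own work and prepares in the same way):
BEGIN;
UPDATE inventory SET qty = qty - 10 WHERE item_id = 42; -- illustrative work
PREPARE TRANSACTION 'order_payment_001'; -- Phase 1: prepare

-- If shard 2 also prepares successfully, on each shard:
COMMIT PREPARED 'order_payment_001';    -- Phase 2: commit

-- Or, if any shard failed to prepare, on each shard that did prepare:
ROLLBACK PREPARED 'order_payment_001';

-- Check in-doubt transactions:
SELECT * FROM pg_prepared_xacts;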

Three-Phase Commit (3PC)

3PC adds a PRE-COMMIT phase between Prepare and Commit to break the blocking scenario:

Phase 1 (CanCommit): Coordinator asks “Can you commit?” Participants respond yes/no without locking.
Phase 2 (PreCommit): If all yes, coordinator sends PRE-COMMIT. Participants prepare and acknowledge.
Phase 3 (DoCommit): Coordinator sends COMMIT. Participants commit.

The key property: after receiving PRE-COMMIT, a participant knows the coordinator will send COMMIT. If the coordinator crashes after all participants received PRE-COMMIT, any remaining participant can take over and commit — they know all participants have prepared.

Why 3PC isn’t used in practice: it assumes a synchronous network with no partitions. During a network partition, some participants can receive PRE-COMMIT and go on to commit, while others (partitioned away before PRE-COMMIT reached them) time out and abort. The result: some nodes committed and others aborted, which is split-brain data corruption. Network partitions do happen in production, so 3PC trades 2PC’s blocking for a risk of inconsistency, which is usually the worse failure.
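
The divergence can be reproduced with the participant timeout rule alone. The sketch below is a simplified model under stated assumptions: the state names and the simulatePartition helper are invented for this example, and coordinator election and recovery are left out.

```typescript
// Participant states in 3PC (names are illustrative).
//   WAITING      = voted yes in CanCommit, has not seen PRE-COMMIT
//   PRECOMMITTED = received PRE-COMMIT, has not seen COMMIT
type State = "WAITING" | "PRECOMMITTED" | "COMMITTED" | "ABORTED";

// Timeout rule that makes 3PC non-blocking: a participant that stops
// hearing from the coordinator decides on its own.
function onCoordinatorTimeout(state: State): State {
  switch (state) {
    case "WAITING":
      return "ABORTED"; // assumes nobody else has pre-committed
    case "PRECOMMITTED":
      return "COMMITTED"; // everyone is known to have voted yes
    default:
      return state; // already decided
  }
}

// Simulate a partition that cuts the coordinator off after PRE-COMMIT
// has reached only some of the participants.
function simulatePartition(reachedByPreCommit: boolean[]): State[] {
  const states: State[] = reachedByPreCommit.map((reached) =>
    reached ? "PRECOMMITTED" : "WAITING",
  );
  return states.map((s) => onCoordinatorTimeout(s));
}
```

With simulatePartition([true, false]), the first participant commits and the second aborts: each follows its local rule correctly, yet the transaction is no longer atomic.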

Spanner: Distributed ACID with TrueTime

Google Spanner achieves external consistency across globally distributed Paxos groups using TrueTime — an API that provides bounded-uncertainty timestamps backed by GPS receivers and atomic clocks deployed at every data center.

TrueTime.Now() returns an interval [earliest, latest] rather than a point. The uncertainty ε is typically 1-7ms. Spanner uses this as follows:

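The coordinator picks a commit timestamp no earlier than TrueTime.Now().latest, so the timestamp is at or after the true current time.
Before making the commit visible, the coordinator waits until TrueTime.Now().earliest > commit_timestamp, ensuring the commit timestamp is definitively in the past from every server’s perspective.
This “commit wait” lasts roughly 2ε in the worst case (the uncertainty is paid once when the timestamp is chosen and once while waiting it out) and typically overlaps with Paxos replication. It guarantees that any transaction that starts after this commit is assigned a later timestamp and sees the committed data.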

The result: full external consistency (equivalent to running all transactions serially in real-time order) across a global database. The cost: a commit wait of a few milliseconds on each write transaction and the extraordinary infrastructure of atomic clocks in every datacenter.
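
The commit-wait rule can be illustrated against a simulated clock. This is a sketch of the idea only: the TrueTime interface, SimulatedTrueTime, and commitWithWait are hypothetical names for this example and do not reflect Spanner’s actual API.

```typescript
// Hypothetical TrueTime-like interface; times are in milliseconds.
interface TTInterval { earliest: number; latest: number }
interface TrueTime { now(): TTInterval }

// Simulated clock: true time advances only when tick() is called, and
// every reading carries +/- epsilon of uncertainty.
class SimulatedTrueTime implements TrueTime {
  constructor(private trueNow: number, private epsilon: number) {}
  now(): TTInterval {
    return {
      earliest: this.trueNow - this.epsilon,
      latest: this.trueNow + this.epsilon,
    };
  }
  tick(ms: number): void { this.trueNow += ms; }
}

// Pick a commit timestamp, then wait until it is guaranteed to be in the
// past. `sleep` is injected so the sketch can drive the simulated clock.
function commitWithWait(
  tt: TrueTime,
  sleep: (ms: number) => void,
): { commitTs: number; waitedMs: number } {
  const commitTs = tt.now().latest; // no clock can currently read later than this
  let waitedMs = 0;
  while (tt.now().earliest <= commitTs) { // the commit wait
    sleep(1);
    waitedMs += 1;
  }
  return { commitTs, waitedMs };
}
```

With a true time of 1000 and ε = 4, the chosen timestamp is 1004 and the loop exits after 9 simulated milliseconds, just over 2ε, once even the most pessimistic clock reading has passed the timestamp.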

The Saga Pattern

For most microservices, 2PC is impractical: each service owns its database, there is no shared coordinator, and the blocking failure mode is unacceptable. The Saga pattern is the alternative.

A Saga is a sequence of local transactions, each of which updates a single service’s database and publishes an event or sends a command to trigger the next step. If a step fails, a series of compensating transactions are executed in reverse order to undo the work already done.

Saga vs Database Rollback

Critical distinction: compensating transactions are NOT database rollbacks. They are new, forward-moving transactions that implement the business-level inverse of the original operation.

Original: INSERT INTO orders (status='PLACED') → Compensating: UPDATE orders SET status='CANCELLED'
Original: UPDATE inventory SET qty = qty - 10 → Compensating: UPDATE inventory SET qty = qty + 10
Original: charge_credit_card(amount=100) → Compensating: refund_credit_card(amount=100)

The original transaction’s effects are visible to the outside world between the forward and compensating steps. A customer sees their order placed, then sees it cancelled — with potentially a refund email and a cancellation email. This is a business consequence, not a transparent database rollback. The saga must be designed with this in mind.

Saga Choreography

In choreography, each service publishes events, and other services react to those events without a central coordinator:

OrderService          PaymentService         InventoryService
    │                      │                       │
    │── OrderPlaced ───────►│                       │
    │                       │── PaymentProcessed ──►│
    │                       │                       │── InventoryReserved
    │                       │                       │── OrderCompleted ──►
    │                       │                       │
    │     (on failure)      │                       │
    │◄─ PaymentFailed ───── │                       │
    │── OrderCancelled ──────────────────────────── │

Advantages: Loose coupling; each service is only aware of its events.
Disadvantages: Hard to track the overall state of a saga; difficult to add new steps; debugging failures requires tracing events across services.
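
A choreographed saga can be sketched with a tiny synchronous event bus standing in for Kafka or RabbitMQ. Everything here is an illustrative assumption: EventBus, createShop, the event shapes, and the cardLimit parameter used to force a payment failure. For brevity, a successful inventory reservation is treated as completing the order.

```typescript
type SagaEvent =
  | { type: "OrderPlaced"; orderId: string; amount: number }
  | { type: "PaymentProcessed"; orderId: string }
  | { type: "PaymentFailed"; orderId: string; reason: string }
  | { type: "InventoryReserved"; orderId: string };

type Handler = (event: SagaEvent) => void;

// In-process stand-in for a message broker: delivers to every subscriber.
class EventBus {
  private handlers: Handler[] = [];
  subscribe(handler: Handler): void { this.handlers.push(handler); }
  publish(event: SagaEvent): void {
    for (const h of this.handlers) h(event);
  }
}

function createShop(bus: EventBus, cardLimit: number) {
  const orderStatus = new Map<string, string>(); // OrderService's own state

  // OrderService: reacts to downstream outcomes.
  bus.subscribe((e) => {
    if (e.type === "InventoryReserved") orderStatus.set(e.orderId, "COMPLETED");
    if (e.type === "PaymentFailed") orderStatus.set(e.orderId, "CANCELLED"); // compensation
  });

  // PaymentService: reacts to OrderPlaced only.
  bus.subscribe((e) => {
    if (e.type !== "OrderPlaced") return;
    if (e.amount <= cardLimit) {
      bus.publish({ type: "PaymentProcessed", orderId: e.orderId });
    } else {
      bus.publish({ type: "PaymentFailed", orderId: e.orderId, reason: "card declined" });
    }
  });

  // InventoryService: reacts to PaymentProcessed only.
  bus.subscribe((e) => {
    if (e.type === "PaymentProcessed") {
      bus.publish({ type: "InventoryReserved", orderId: e.orderId });
    }
  });

  const placeOrder = (orderId: string, amount: number): void => {
    orderStatus.set(orderId, "PLACED");
    bus.publish({ type: "OrderPlaced", orderId, amount });
  };
  return { placeOrder, orderStatus };
}
```

No component knows the whole flow; the saga exists only as the chain of reactions, which is why its overall state is hard to observe.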

Saga Orchestration

A central orchestrator (a dedicated service or a workflow engine) explicitly commands each step and handles failures:

OrchestratorService
    │
    │── CreateOrder(orderId) ──────────────────► OrderService
    │◄─ OrderCreated ────────────────────────── OrderService
    │
    │── ProcessPayment(orderId, amount) ────────► PaymentService
    │◄─ PaymentFailed ──────────────────────── PaymentService
    │
    │── CancelOrder(orderId) ─────────────────► OrderService  (compensate)
    │◄─ OrderCancelled ──────────────────────── OrderService
    │
    │── NotifyUser(orderId, "payment failed") ─► NotificationService

Advantages: Single place to view the saga state; easier to add steps; failures are centrally handled.
Disadvantages: Orchestrator is a new dependency; risk of becoming a monolith if business logic leaks into it.

The Dual-Write Problem and Outbox Pattern

The most common distributed transaction bug is the dual-write: updating a database and publishing a message to Kafka/RabbitMQ in the same operation without atomicity.

// BROKEN: dual-write with no atomicity
async function placeOrder(order: Order): Promise<void> {
  await db.query("INSERT INTO orders ...");
  await kafka.publish("order.placed", order); // this can fail!
  // If kafka.publish fails: DB has the order, Kafka doesn't
  // If process crashes between the two: same problem
}

If the publish fails after the DB write, downstream services never see the event. Reversing the order does not help: if you publish first and the DB write then fails (or the process crashes before it runs), the event fires for a non-existent order.

The Outbox Pattern solves this by making the event publish part of the same database transaction:

// CORRECT: outbox pattern
async function placeOrder(order: Order, db: Database): Promise<void> {
  await db.transaction(async (trx) => {
    await trx.query("INSERT INTO orders VALUES (?)", [order]);
    await trx.query(
      "INSERT INTO outbox (event_type, payload, status) VALUES (?, ?, 'pending')",
      ["order.placed", JSON.stringify(order)]
    );
    // Both writes commit atomically — or both roll back
  });
}

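// Separate outbox worker (runs continuously):
async function outboxWorker(db: Database, kafka: KafkaProducer): Promise<void> {
  while (true) {
    // Claim and publish inside one transaction: FOR UPDATE SKIP LOCKED holds
    // its row locks only until the enclosing transaction ends, so without a
    // transaction two workers could pick up the same rows.
    await db.transaction(async (trx) => {
      const events = await trx.query(
        "SELECT * FROM outbox WHERE status = 'pending' ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
      );
      for (const event of events) {
        await kafka.publish(event.event_type, JSON.parse(event.payload));
        await trx.query("UPDATE outbox SET status = 'sent' WHERE id = ?", [event.id]);
      }
    });
    await sleep(100);
  }
}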

The outbox guarantees at-least-once delivery: if the outbox worker crashes after publishing but before marking the row sent, it will re-publish on restart. Consumers must be idempotent (deduplicate by event ID).
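
On the consuming side, deduplication by event ID might look like the sketch below. The OutboxEvent shape and InventoryConsumer class are hypothetical, and the in-memory Set stands in for what would normally be a table with a unique key, written in the same local transaction as the business effect.

```typescript
// Hypothetical event shape: the outbox row id doubles as the event id.
interface OutboxEvent {
  id: string;
  type: string;
  payload: { orderId: string; qty: number };
}

class InventoryConsumer {
  // Assumption: in production this is a durable table keyed by event id,
  // updated atomically with the reservation itself.
  private processed = new Set<string>();
  reserved = 0;

  // Returns true if the event was applied, false if it was a duplicate.
  handle(event: OutboxEvent): boolean {
    if (this.processed.has(event.id)) return false; // redelivery: ignore
    this.reserved += event.payload.qty;
    this.processed.add(event.id);
    return true;
  }
}
```

Delivering the same event twice leaves the reserved quantity unchanged after the first delivery, which is exactly what the at-least-once guarantee requires of consumers.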

CDC: Change Data Capture

Instead of a polling outbox worker, CDC (Change Data Capture) reads the database’s WAL directly and publishes changes to Kafka:

Debezium (open source, by Red Hat): connects to PostgreSQL logical replication, MySQL binlog, or MongoDB oplog and publishes change events to Kafka.
No polling overhead; changes are streamed in real time with sub-second latency.
The outbox table rows appear in the WAL and are streamed as Kafka events automatically.

PostgreSQL WAL ──► Debezium connector ──► Kafka topic "outbox.events"
                                             │
                                             ▼
                                     PaymentService
                                     InventoryService
                                     NotificationService

Eventually Consistent Systems: BASE

Many large-scale systems abandon distributed ACID in favor of BASE:

Basically Available: the system responds to every request, even if some nodes are partitioned.
Soft state: data may be temporarily inconsistent across nodes.
Eventually consistent: given enough time without new writes, all replicas will converge to the same value.

BASE systems (Cassandra, DynamoDB, Couchbase in eventual mode) make different trade-offs than ACID. They achieve higher availability under network partitions (per the CAP theorem) at the cost of requiring applications to handle conflicts, stale reads, and out-of-order events.

Vector Clocks and Causality

In Dynamo-style systems, vector clocks track causal relationships between events:

A vector clock is a map from node ID to logical timestamp: {A: 3, B: 2, C: 1}. When node A receives a message from node B, it takes the component-wise maximum of the two clocks and increments its own counter.

Vector clocks allow the system to determine: did event X happen before event Y, after it, or are they concurrent (causally independent)? Concurrent writes to the same key produce conflicts that must be reconciled — either by last-write-wins (using wall clock or Lamport timestamp), by application-level merge, or by keeping both versions and presenting the conflict to the client (Riak’s approach).
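
The comparison and the receive rule described above fit in two small functions. This is a minimal sketch: the VectorClock type and the function names compare and receive are choices made for this example, and a missing entry is treated as a counter of zero.

```typescript
type VectorClock = Record<string, number>;
type Ordering = "before" | "after" | "equal" | "concurrent";

// Compare two clocks component by component.
function compare(a: VectorClock, b: VectorClock): Ordering {
  let aBehind = false; // some component of a is smaller than b's
  let bBehind = false; // some component of b is smaller than a's
  for (const node of Object.keys(a).concat(Object.keys(b))) {
    const x = a[node] ?? 0;
    const y = b[node] ?? 0;
    if (x < y) aBehind = true;
    if (y < x) bBehind = true;
  }
  if (aBehind && bBehind) return "concurrent"; // neither dominates: conflict
  if (aBehind) return "before";
  if (bBehind) return "after";
  return "equal";
}

// Receive rule: component-wise maximum, then increment the receiver's counter.
function receive(local: VectorClock, remote: VectorClock, self: string): VectorClock {
  const merged: VectorClock = { ...local };
  for (const node of Object.keys(remote)) {
    merged[node] = Math.max(merged[node] ?? 0, remote[node] ?? 0);
  }
  merged[self] = (merged[self] ?? 0) + 1;
  return merged;
}
```

For example, {A: 2, B: 0} and {A: 1, B: 1} compare as concurrent, so a store holding both versions of a key has to reconcile them using one of the strategies listed above.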

When to Use 2PC vs Saga

Criteria	2PC	Saga
Consistency requirement	Strong (atomic)	Eventual (with compensation)
Participant autonomy	All must support 2PC protocol	Each service handles its own DB
Failure tolerance	Blocks while the coordinator is down	Holds no locks across steps; compensates instead
Latency	Higher (2 round trips + locks)	Lower (async steps)
Complexity	Protocol complexity	Compensation logic complexity
Long-running transactions	Impractical (locks held)	Natural fit
Cross-organization systems	Often impossible	Feasible via events
Examples	Same-org microservices with 2PC support	Order fulfillment, travel booking

Rule of thumb: use 2PC when you control all participants and need strict atomicity for short-lived transactions. Use Saga when participants are autonomous services (separate databases, separate teams) or when the transaction is long-running.

TypeScript Pseudocode: Saga Orchestrator with Compensation

type StepResult = { success: boolean; data?: unknown; error?: string };

type SagaStep = {
  name: string;
  execute: (context: SagaContext) => Promise<StepResult>;
  compensate: (context: SagaContext) => Promise<void>;
};

type SagaContext = {
  sagaId: string;
  orderId: string;
  amount: number;
  executedSteps: string[];
  results: Map<string, unknown>;
};

class SagaOrchestrator {
  constructor(private db: { saveSagaState: (ctx: SagaContext) => Promise<void> }) {}

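  async execute(steps: SagaStep[], context: SagaContext): Promise<boolean> {
    for (const step of steps) {
      await this.db.saveSagaState(context); // persist before each step

      // Treat a thrown error (e.g. fetch rejecting on a network failure)
      // the same as an explicit failure result, so compensation still runs.
      let result: StepResult;
      try {
        result = await step.execute(context);
      } catch (err) {
        result = { success: false, error: String(err) };
      }

      if (!result.success) {
        console.error(`Step ${step.name} failed: ${result.error}`);
        await this.compensate(steps, context);
        return false;
      }
      context.executedSteps.push(step.name);
      if (result.data) context.results.set(step.name, result.data);
    }
    await this.db.saveSagaState(context); // persist the final state as well
    return true;
  }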

  private async compensate(steps: SagaStep[], context: SagaContext): Promise<void> {
    // Run compensating transactions in reverse order for executed steps only
    const executed = steps.filter(s => context.executedSteps.includes(s.name));
    for (const step of [...executed].reverse()) {
      try {
        await step.compensate(context);
        console.log(`Compensated: ${step.name}`);
      } catch (err) {
        // Compensation failure: requires human intervention / dead-letter queue
        console.error(`COMPENSATION FAILED for ${step.name}:`, err);
        await this.alertOpsTeam(context.sagaId, step.name, err);
      }
    }
  }

  private async alertOpsTeam(sagaId: string, step: string, err: unknown): Promise<void> {
    // Publish to dead-letter queue for manual remediation
    console.error(`ALERT: Manual intervention required for saga ${sagaId}, step ${step}`);
  }
}

// Usage: Order + Payment saga
const orderSaga: SagaStep[] = [
  {
    name: "CreateOrder",
    execute: async (ctx) => {
      const result = await fetch(`/api/orders`, {
        method: "POST",
        body: JSON.stringify({ id: ctx.orderId, amount: ctx.amount }),
      });
      return { success: result.ok };
    },
    compensate: async (ctx) => {
      await fetch(`/api/orders/${ctx.orderId}/cancel`, { method: "POST" });
    },
  },
  {
    name: "ReserveInventory",
    execute: async (ctx) => {
      const result = await fetch(`/api/inventory/reserve`, {
        method: "POST",
        body: JSON.stringify({ orderId: ctx.orderId }),
      });
      return { success: result.ok };
    },
    compensate: async (ctx) => {
      await fetch(`/api/inventory/release/${ctx.orderId}`, { method: "POST" });
    },
  },
  {
    name: "ProcessPayment",
    execute: async (ctx) => {
      const result = await fetch(`/api/payments/charge`, {
        method: "POST",
        body: JSON.stringify({ orderId: ctx.orderId, amount: ctx.amount }),
      });
      return { success: result.ok };
    },
    compensate: async (ctx) => {
      await fetch(`/api/payments/refund/${ctx.orderId}`, { method: "POST" });
    },
  },
];

// Run the saga
const orchestrator = new SagaOrchestrator({ saveSagaState: async () => {} });
const context: SagaContext = {
  sagaId: "saga-001",
  orderId: "order-123",
  amount: 99.99,
  executedSteps: [],
  results: new Map(),
};
await orchestrator.execute(orderSaga, context);

Interview Deep-Dive: Payment + Inventory Atomicity

Q: How would you handle a payment deduction with inventory update atomically across two services?

A complete answer:

There are two main approaches, and the right choice depends on the consistency requirement and latency budget.

If strong atomicity is required and you control both services: use 2PC with a shared coordinator. Both services expose a two-phase commit API. The coordinator sends PREPARE (each service locks and stages the update), then COMMIT. The risk is the coordinator’s crash leaving participants blocked. This is acceptable in a controlled environment with fast failover.

For most microservices: use the Saga pattern with the Outbox. Each service runs its own database transaction atomically. The flow: (1) OrderService creates the order and writes an InventoryReserve command to its outbox in the same transaction. (2) A CDC connector (Debezium) reads the outbox and publishes to Kafka. (3) InventoryService consumes the event, reserves inventory, and publishes an InventoryReserved event. (4) PaymentService consumes that, charges the card, publishes PaymentProcessed. (5) OrderService marks the order confirmed.

If payment fails, PaymentService publishes a PaymentFailed event. InventoryService compensates by releasing the reservation. OrderService compensates by cancelling the order. The customer sees an order attempt that was cancelled — not a silent failure.

Key requirements: every consumer must be idempotent (handle duplicate events), and compensation logic must be thoroughly tested. Compensation failures (a refund that itself fails) require dead-letter queues and ops tooling for manual remediation.

Idempotency: The Prerequisite for Distributed Correctness

Every distributed transaction pattern — 2PC, Saga, Outbox — relies on the ability to retry operations safely. Without idempotency, retries produce duplicate effects: double charges, double shipments, duplicate order records.

Idempotency means: applying the same operation multiple times has the same effect as applying it once.

Idempotency Key Pattern

// Idempotent payment service
async function chargeCard(
  orderId: string,  // serves as idempotency key
  amount: number,
  db: Database
): Promise<ChargeResult> {
  // Check if we already processed this charge
  const existing = await db.query(
    "SELECT * FROM payments WHERE order_id = ? AND status = 'completed'",
    [orderId]
  );
  if (existing.length > 0) {
    return { success: true, paymentId: existing[0].id, deduplicated: true };
  }

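  // Attempt the charge (payment source/customer parameters omitted for brevity).
  // With the stripe-node client, the idempotency key is a request option
  // passed as the second argument, not a field of the charge parameters.
  const chargeResult = await stripeClient.charges.create(
    {
      amount: Math.round(amount * 100),
      currency: "usd",
    },
    { idempotencyKey: `order_${orderId}` } // Stripe deduplicates on their end too
  );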

  // Persist the result
  await db.query(
    "INSERT INTO payments (order_id, stripe_id, status, amount) VALUES (?, ?, 'completed', ?)",
    [orderId, chargeResult.id, amount]
  );

  return { success: true, paymentId: chargeResult.id, deduplicated: false };
}

Every external API call in a saga step should use a stable idempotency key derived from the saga’s business identifier. This makes retries — from network timeouts, process crashes, or explicit saga restarts — safe.

At-Least-Once vs Exactly-Once

Distributed systems generally provide at-least-once delivery guarantees (messages may be delivered multiple times) rather than exactly-once (impossible in a truly distributed system without idempotent receivers). “Exactly-once processing” is achieved by combining at-least-once delivery with idempotent consumers.

Distributed Transactions in Practice: Common Pitfalls

Failing Open vs Failing Closed

In a saga, when a compensation step fails (e.g., the refund API is down), you must decide: fail open (assume the compensation will eventually succeed, mark saga complete) or fail closed (block the saga, alert ops, require manual intervention).

Always fail closed for financial sagas. A failed refund must go to a dead-letter queue with monitoring and alerting. “Eventually consistent” should not mean “we’ll figure it out later” — it means the system converges to correctness through a well-defined mechanism.
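
A fail-closed compensation step could be sketched as follows. The names compensateFailClosed, DeadLetter, and SagaStatus are hypothetical, the dead-letter queue is modelled as an array, and backoff between attempts is omitted to keep the example short.

```typescript
// Hypothetical shapes for this sketch.
interface DeadLetter { sagaId: string; step: string; error: string; attempts: number }
type SagaStatus = "COMPENSATED" | "BLOCKED_NEEDS_INTERVENTION";

async function compensateFailClosed(
  sagaId: string,
  step: string,
  compensate: () => Promise<void>,
  deadLetters: DeadLetter[], // stand-in for a real dead-letter queue
  maxAttempts = 3,
): Promise<SagaStatus> {
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await compensate();
      return "COMPENSATED";
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // A real implementation would back off here before retrying.
    }
  }
  // Fail closed: never report the saga as complete. Park it where
  // monitoring and a human operator will see it.
  deadLetters.push({ sagaId, step, error: lastError, attempts: maxAttempts });
  return "BLOCKED_NEEDS_INTERVENTION";
}
```

The important property is what the function never does: it has no code path that returns success after the compensation has failed.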

Saga Timeout Budget

Each step in a saga consumes latency budget. For a user-facing operation (checkout), the total saga must complete in under ~5 seconds or the UX degrades. Timeouts per step:

CreateOrder:        200ms  (local DB write)
ReserveInventory:   500ms  (internal service, same DC)
ProcessPayment:    2000ms  (external Stripe API, includes network)
SendConfirmation:   300ms  (email service, async acceptable)
──────────────────────────
Total budget:      ~3000ms for happy path

Set explicit timeout values and circuit breakers on each step. If ProcessPayment consistently exceeds its 2000ms timeout, the saga should fail fast rather than letting the user wait.
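
One way to enforce a per-step budget is to race each step against a timer. The helper below is a sketch: withTimeout and stepBudgetsMs are names introduced here, and the budget values are copied from the table above. Note that the underlying work is not cancelled when the timer wins; real code would also pass an abort signal to the request.

```typescript
// Per-step budgets in milliseconds, taken from the table above.
const stepBudgetsMs: Record<string, number> = {
  CreateOrder: 200,
  ReserveInventory: 500,
  ProcessPayment: 2000,
  SendConfirmation: 300,
};

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(step: string, ms: number, work: Promise<T>): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${step} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, timeout]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (err) => {
      clearTimeout(timer);
      throw err;
    },
  );
}
```

An orchestrator step would then be wrapped as withTimeout("ProcessPayment", stepBudgetsMs["ProcessPayment"] ?? 2000, chargePromise), where chargePromise is whatever promise the step produces, and a rejection would be treated like any other step failure.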

Testing Distributed Transactions

The hardest part of sagas is testing failure and compensation paths. Requirements:

Chaos testing: randomly inject failures at each step and verify compensation produces a consistent end state.
Idempotency testing: deliver each event twice and verify no duplicate effects.
Out-of-order delivery: deliver step N’s event before step N-1’s event and verify the saga handles it.
Saga state persistence: crash the orchestrator mid-saga and verify it resumes from the last known state on restart.

Tools: Testcontainers for integration tests with real databases and queues, Toxiproxy for network failure simulation, AWS Fault Injection Simulator for cloud environments.

Key Takeaways

2PC coordinates distributed atomicity through a coordinator; Phase 1 prepares all participants, Phase 2 commits or aborts.
2PC’s fatal flaw: a coordinator crash after Phase 1 leaves participants blocked indefinitely holding locks.
3PC is non-blocking in theory but fails during network partitions — not used in practice.
Spanner achieves global ACID using TrueTime (GPS + atomic clocks) at the cost of a commit wait of a few milliseconds.
Sagas replace distributed ACID with local transactions + compensating transactions; effects are visible during execution.
Saga compensations are business-level operations (cancel, refund), not database rollbacks — they have user-visible consequences.
The Outbox pattern atomically stores events in the database alongside business data; CDC (Debezium) publishes them to Kafka.
Every saga step must be idempotent — retries from network failures or process crashes must not produce duplicate effects.
Failed compensations must go to a dead-letter queue with alerting — never silently swallow compensation failures in financial flows.
Choose 2PC for controlled environments with strict atomicity needs; choose Saga for autonomous services with long-running flows.