Commands, Events, and Queries: The Semantic Difference
The most important conceptual foundation in event-driven architecture is the distinction between three types of messages. Getting this wrong causes subtle design errors that compound over time.
Command: A request to do something. Imperative. Can be rejected. Has exactly one handler.
- Examples: PlaceOrder, ProcessPayment, CancelShipment
- Names are verb phrases in the imperative
- Carries enough data for the handler to act without additional queries
Event: A statement that something has happened. Past tense. Cannot be rejected — it already happened. Can have zero or many consumers.
- Examples: OrderPlaced, PaymentProcessed, ShipmentCancelled
- Names are verb phrases in the past tense
- Immutable — never edit or delete a published event
Query: A request for data. Has no side effects. Returns a result. Never modifies state.
- Examples: GetOrder, ListActiveShipments, GetCustomerBalance
- Idempotent by definition
// Command: imperative, one handler, can fail
interface PlaceOrderCommand {
type: "PlaceOrder";
customerId: string;
items: Array<{ skuId: string; quantity: number }>;
paymentMethodId: string;
}
// Event: past tense, multiple consumers, immutable fact
interface OrderPlacedEvent {
type: "OrderPlaced";
eventId: string; // globally unique
occurredAt: string; // ISO 8601 timestamp
orderId: string;
customerId: string;
totalCents: number;
items: Array<{ skuId: string; quantity: number; priceCents: number }>;
}
// Query: read-only, no side effects
interface GetOrderQuery {
type: "GetOrder";
orderId: string;
}
This distinction maps directly to the CQRS and Event Sourcing patterns. Commands produce events. Events update read models. Queries hit read models.
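These cardinality rules can be enforced mechanically. A minimal in-memory sketch — the `MessageBus` class and its method names are invented for illustration, not taken from any particular library: a command gets exactly one handler, enforced at registration, while an event fans out to any number of subscribers (including zero).

```typescript
// Minimal in-memory bus enforcing command/event cardinality (illustrative sketch)
type Handler = (payload: unknown) => void;

class MessageBus {
  private commandHandlers = new Map<string, Handler>();
  private eventSubscribers = new Map<string, Handler[]>();

  // Commands: exactly one handler — registering a second is a design error
  registerCommand(type: string, handler: Handler): void {
    if (this.commandHandlers.has(type)) {
      throw new Error(`Command ${type} already has a handler`);
    }
    this.commandHandlers.set(type, handler);
  }

  // Events: zero or many subscribers — fan-out is the point
  subscribe(type: string, handler: Handler): void {
    const list = this.eventSubscribers.get(type) ?? [];
    list.push(handler);
    this.eventSubscribers.set(type, list);
  }

  send(type: string, payload: unknown): void {
    const handler = this.commandHandlers.get(type);
    if (!handler) throw new Error(`No handler for command ${type}`);
    handler(payload);
  }

  publish(type: string, payload: unknown): void {
    // An event with zero subscribers is fine — it is still a fact
    for (const h of this.eventSubscribers.get(type) ?? []) h(payload);
  }
}
```

Note the asymmetry in failure behavior: `send` throws when no handler exists (a command must be acted on), while `publish` silently succeeds with zero subscribers (a fact needs no audience).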
Event-Driven vs. Request-Driven: An Honest Comparison
Request-Driven (synchronous):
Client → Order Service → Inventory Service (sync call)
→ Payment Service (sync call)
→ Notification Service (sync call)
← Response (after all calls complete)
Latency = sum of all downstream latencies
Availability = product of all service availabilities
(0.99 × 0.99 × 0.99 = 0.97 — worse than any individual)
Event-Driven (asynchronous):
Client → Order Service (write to DB + publish OrderPlaced)
← Response immediately (order is PENDING)
[Async, in parallel]
Inventory Service consumes OrderPlaced → reserves stock
Payment Service consumes OrderPlaced → charges card
Notification Service consumes OrderPlaced → sends confirmation
Latency = order service latency only
Availability = each service fails independently, no cascade
Where event-driven wins:
- Long-running workflows (fulfillment, fraud detection, shipping)
- Fan-out patterns (one event, many consumers)
- Cross-team decoupling (teams evolve independently)
- Temporal decoupling (downstream can be down; messages queue)
Where event-driven loses:
- Simple CRUD where you need an immediate consistent response
- Debugging: event chains are harder to trace than call stacks
- Ordering guarantees: messages may arrive out of order
- Eventual consistency: the user sees “Order Pending” not “Order Confirmed”
Event Sourcing: The Append-Only Log as Source of Truth
Traditional systems store current state: the orders table contains one row per order with the latest column values. You lose history — there’s no record of what changed or why.
Event sourcing stores the sequence of events that caused the current state. The current state is derived by replaying events.
Traditional (state-based):
orders table:
┌──────────┬─────────┬──────────────┬───────────┐
│ order_id │ status │ total_cents │ updated_at│
├──────────┼─────────┼──────────────┼───────────┤
│ ord-123 │ SHIPPED │ 4999 │ 2024-01-15│
└──────────┴─────────┴──────────────┴───────────┘
(You cannot know: was it CANCELLED then reinstated? Was the price modified?)
Event-sourced:
order_events table:
┌──────────┬────────────────────┬──────────────────────────────────────────┐
│ order_id │ event_type │ payload │
├──────────┼────────────────────┼──────────────────────────────────────────┤
│ ord-123 │ OrderCreated │ {customerId: "c-1", items: [...]} │
│ ord-123 │ ItemAdded │ {skuId: "sku-9", quantity: 2} │
│ ord-123 │ ItemRemoved │ {skuId: "sku-7"} │
│ ord-123 │ PaymentProcessed │ {chargeId: "ch-456", amountCents: 4999} │
│ ord-123 │ OrderConfirmed │ {confirmedAt: "2024-01-14T10:00:00Z"} │
│ ord-123 │ ShipmentDispatched │ {trackingId: "1Z999...", carrier: "UPS"} │
└──────────┴────────────────────┴──────────────────────────────────────────┘
Rebuilding State from Events
type OrderEvent =
| { type: "OrderCreated"; customerId: string; items: LineItem[] }
| { type: "ItemAdded"; skuId: string; quantity: number; priceCents: number }
| { type: "ItemRemoved"; skuId: string }
| { type: "PaymentProcessed"; chargeId: string; amountCents: number }
| { type: "OrderConfirmed"; confirmedAt: string }
| { type: "OrderCancelled"; reason: string };
interface OrderState {
orderId: string;
customerId: string;
status: "DRAFT" | "CONFIRMED" | "CANCELLED";
items: Map<string, { quantity: number; priceCents: number }>;
totalCents: number;
chargeId?: string;
}
function applyEvent(state: Partial<OrderState>, event: OrderEvent): Partial<OrderState> {
switch (event.type) {
case "OrderCreated":
return {
...state,
customerId: event.customerId,
status: "DRAFT",
      items: new Map(
        event.items.map((i): [string, { quantity: number; priceCents: number }] => [
          i.skuId,
          { quantity: i.quantity, priceCents: i.priceCents },
        ])
      ),
totalCents: event.items.reduce((sum, i) => sum + i.priceCents * i.quantity, 0),
};
case "ItemAdded": {
const items = new Map(state.items);
items.set(event.skuId, { quantity: event.quantity, priceCents: event.priceCents });
return {
...state,
items,
totalCents: [...items.values()].reduce((s, i) => s + i.priceCents * i.quantity, 0),
};
}
case "ItemRemoved": {
const items = new Map(state.items);
items.delete(event.skuId);
return {
...state,
items,
totalCents: [...items.values()].reduce((s, i) => s + i.priceCents * i.quantity, 0),
};
}
case "PaymentProcessed":
return { ...state, chargeId: event.chargeId };
case "OrderConfirmed":
return { ...state, status: "CONFIRMED" };
case "OrderCancelled":
return { ...state, status: "CANCELLED" };
default:
return state;
}
}
async function loadOrder(orderId: string): Promise<OrderState> {
  const events = await eventStore.getEvents(orderId);
  const state = events.reduce(applyEvent, {} as Partial<OrderState>);
  // orderId is the stream key, not carried in any event — attach it here
  return { ...state, orderId } as OrderState;
}
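The replay above can be exercised end to end. Here is a self-contained, trimmed-down version of the same reducer folded over a three-event history (types condensed to just the fields the example needs):

```typescript
// Trimmed-down replay: fold a short event history into current state
type Ev =
  | { type: "OrderCreated"; items: Array<{ skuId: string; quantity: number; priceCents: number }> }
  | { type: "ItemRemoved"; skuId: string }
  | { type: "OrderConfirmed" };

interface State {
  status: string;
  items: Map<string, { quantity: number; priceCents: number }>;
  totalCents: number;
}

function total(items: State["items"]): number {
  return [...items.values()].reduce((s, i) => s + i.priceCents * i.quantity, 0);
}

function apply(state: State, event: Ev): State {
  switch (event.type) {
    case "OrderCreated": {
      const items = new Map(
        event.items.map((i): [string, { quantity: number; priceCents: number }] => [
          i.skuId,
          { quantity: i.quantity, priceCents: i.priceCents },
        ])
      );
      return { status: "DRAFT", items, totalCents: total(items) };
    }
    case "ItemRemoved": {
      const items = new Map(state.items);
      items.delete(event.skuId);
      return { ...state, items, totalCents: total(items) };
    }
    case "OrderConfirmed":
      return { ...state, status: "CONFIRMED" };
  }
}

const history: Ev[] = [
  { type: "OrderCreated", items: [
    { skuId: "sku-9", quantity: 2, priceCents: 1500 },
    { skuId: "sku-7", quantity: 1, priceCents: 2000 },
  ] },
  { type: "ItemRemoved", skuId: "sku-7" },   // 2000 cents drop off the total
  { type: "OrderConfirmed" },
];

const initial: State = { status: "NEW", items: new Map(), totalCents: 0 };
const current = history.reduce(apply, initial);
// current.status === "CONFIRMED", current.totalCents === 3000
```

The key property: state is a pure function of the event list, so the same history always replays to the same state.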
Snapshots: Avoiding Full Replay Cost
Replaying 10,000 events to get the current state of a bank account is expensive. Snapshots solve this by periodically materializing the current state.
async function loadOrderWithSnapshot(orderId: string): Promise<OrderState> {
// Load latest snapshot
const snapshot = await snapshotStore.getLatest(orderId);
// Load only events AFTER the snapshot
const events = await eventStore.getEventsAfter(
orderId,
snapshot?.version ?? 0
);
// Apply events on top of snapshot state
const baseState = snapshot?.state ?? {};
return events.reduce(applyEvent, baseState) as OrderState;
}
// Create snapshot every 100 events
async function maybeSaveSnapshot(orderId: string, version: number): Promise<void> {
if (version % 100 === 0) {
const state = await loadOrderWithSnapshot(orderId);
await snapshotStore.save({ orderId, version, state });
}
}
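The correctness condition for snapshots is that folding the event tail onto a snapshot yields exactly the same state as a full replay from scratch. A toy check of that invariant — the counter aggregate and the snapshot-at-version-200 cadence here are invented for illustration:

```typescript
// Toy check of the snapshot invariant: replay-from-snapshot === full replay
type CounterEvent = { type: "Incremented"; by: number };

const allEvents: CounterEvent[] = Array.from({ length: 250 }, (_, i) => ({
  type: "Incremented",
  by: (i % 3) + 1, // 1, 2, 3, 1, 2, 3, ...
}));

const applyC = (state: number, e: CounterEvent) => state + e.by;

// Full replay: fold all 250 events from the initial state
const fullReplay = allEvents.reduce(applyC, 0);

// Snapshot taken at version 200 (i.e., state after the first 200 events)...
const snapshot = { version: 200, state: allEvents.slice(0, 200).reduce(applyC, 0) };
// ...so a load only replays the 50 events after it
const fromSnapshot = allEvents.slice(snapshot.version).reduce(applyC, snapshot.state);
// fromSnapshot === fullReplay — snapshots change the cost, never the answer
```

Because `applyC` is deterministic, the snapshot is a pure cache: it can be deleted and rebuilt at any time without data loss.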
Event sourcing benefits:
- Complete audit log for free — every state transition recorded
- Temporal queries: “What was the order state at 2PM yesterday?”
- Event replay for debugging, analytics, and backfilling new read models
- Time travel: replay up to any point in history
Event sourcing costs:
- Eventual consistency on reads (read model may lag)
- Schema evolution is harder (old events must remain parseable)
- Performance: rebuilding state requires replay (mitigated by snapshots)
- Conceptually unfamiliar — steep learning curve for teams
CQRS: Command Query Responsibility Segregation
CQRS separates the write model (command side) from the read model (query side). They can be different data structures, different databases, even different services.
WRITE SIDE READ SIDE
(Command Model) (Query Model)
┌──────────┐ ┌──────────┐ ┌────────────────────┐
│ Command │───▶│ Domain │──events──▶│ Projection │
│ Handler │ │ Model │ │ (event consumer) │
└──────────┘ └──────────┘ └─────────┬──────────┘
│ │ writes
│ persist events ▼
▼ ┌────────────────────┐
┌──────────┐ │ Read Database │
│ Event │ │ (denormalized, │
│ Store │ │ query-optimized) │
└──────────┘ └─────────┬──────────┘
│
┌─────────▼──────────┐
│ Query Handler │
│ (returns DTO) │
└────────────────────┘
Why CQRS and Event Sourcing Pair Naturally
The event store is the write-side persistence. Each published event is consumed by one or more projections that maintain denormalized read models optimized for specific query patterns.
// Write side: domain model, normalized, append-only
class OrderCommandHandler {
  async handle(command: PlaceOrderCommand): Promise<void> {
    const orderId = generateOrderId();
    // Price lookups are async, so resolve them before building the event
    // (await is not legal inside a synchronous map callback)
    const items = await Promise.all(
      command.items.map(async (i) => ({
        skuId: i.skuId,
        quantity: i.quantity,
        priceCents: await catalogService.getPrice(i.skuId),
      }))
    );
    const events: OrderEvent[] = [
      { type: "OrderCreated", customerId: command.customerId, items },
    ];
    await eventStore.append(orderId, events, { expectedVersion: -1 }); // optimistic concurrency: stream must be new
    await eventBus.publishAll(events.map((e) => ({ ...e, orderId })));
  }
}
// Read side: projection, denormalized for fast queries
class OrderSummaryProjection {
  // One read model per query pattern; one handler per event type
  // (TypeScript does not allow two method implementations with the same name)
  async onOrderPlaced(event: OrderPlacedEvent): Promise<void> {
    await readDb.upsert("order_summaries", {
      orderId: event.orderId,
      customerId: event.customerId,
      status: "PENDING",
      totalCents: event.totalCents,
      itemCount: event.items.length,
      createdAt: event.occurredAt,
    });
  }
  async onOrderConfirmed(event: OrderConfirmedEvent): Promise<void> {
    await readDb.update("order_summaries",
      { orderId: event.orderId },
      { status: "CONFIRMED" }
    );
  }
}
// Query handler: fast, denormalized read
async function getOrderSummary(orderId: string): Promise<OrderSummaryDTO> {
return readDb.findOne("order_summaries", { orderId });
}
Multiple read models from the same events:
// Same events, different read model shapes for different query patterns
class CustomerOrderHistoryProjection {
// Optimized for: "show all orders for customer X, newest first"
async on(event: OrderPlacedEvent) {
await readDb.upsert("customer_order_history", {
customerId: event.customerId,
orderId: event.orderId,
totalCents: event.totalCents,
placedAt: event.occurredAt,
});
}
}
class InventoryAnalyticsProjection {
// Optimized for: "which SKUs were ordered most in the last 30 days?"
async on(event: OrderPlacedEvent) {
for (const item of event.items) {
await readDb.increment("sku_order_counts", item.skuId, item.quantity);
}
}
}
class FraudDetectionProjection {
// Optimized for: "how many orders has this customer placed in the last hour?"
async on(event: OrderPlacedEvent) {
await readDb.lpush(`fraud:orders:${event.customerId}`, event.occurredAt);
await readDb.expire(`fraud:orders:${event.customerId}`, 3600);
}
}
Choreography vs. Orchestration: Deep Comparison
This is one of the most debated design decisions in event-driven architecture.
Choreography: Services React to Events
Ride-sharing: Choreography approach
1. ride-service publishes: RideRequested { rideId, passengerId, pickup, dropoff }
2. pricing-service subscribes:
- Calculates surge price
- Publishes: PriceCalculated { rideId, estimatedFare }
3. driver-matching-service subscribes to PriceCalculated:
- Finds nearby drivers
- Publishes: DriverAssigned { rideId, driverId }
4. notification-service subscribes to DriverAssigned:
- Sends push notification to passenger
5. payment-service subscribes to RideCompleted:
- Charges the stored payment method
- Publishes: PaymentCollected { rideId, chargeId }
Choreography in code:
// Each service subscribes independently — no central controller
kafkaConsumer.subscribe(["RideRequested"], async (event: RideRequestedEvent) => {
  const fare = await calculateSurgePricedFare(event); // returns { totalCents, surgeMultiplier }
  await kafka.publish("PriceCalculated", {
    rideId: event.rideId,
    estimatedFareCents: fare.totalCents,
    surgeMultiplier: fare.surgeMultiplier,
});
});
kafkaConsumer.subscribe(["PriceCalculated"], async (event: PriceCalculatedEvent) => {
const driver = await matchNearestDriver(event.rideId);
await kafka.publish("DriverAssigned", {
rideId: event.rideId,
driverId: driver.id,
estimatedArrivalSeconds: driver.etaSeconds,
});
});
Orchestration: Central Coordinator Drives the Workflow
class RideSagaOrchestrator {
async handleRideRequest(request: RideRequest): Promise<void> {
const rideId = await rideRepo.create(request);
// Step 1: pricing
const fare = await pricingService.calculate(rideId);
if (!fare.acceptable) {
      await rideRepo.updateStatus(rideId, "CANCELLED_UNACCEPTABLE_FARE"); // distinct status for a pricing failure
return;
}
// Step 2: driver matching
const assignment = await driverMatchingService.assign(rideId, fare);
if (!assignment.success) {
await rideRepo.updateStatus(rideId, "CANCELLED_NO_DRIVERS");
return;
}
// Step 3: notify passenger
await notificationService.notifyPassenger(rideId, assignment);
// Step 4: start ride tracking
await trackingService.startTracking(rideId, assignment.driverId);
}
}
Side-by-Side Comparison
Property          Choreography                   Orchestration
─────────────────────────────────────────────────────────────────────────────
Control flow      Distributed (emergent)         Centralized (explicit)
Coupling          Low (event contracts only)     Higher (orchestrator knows all)
Visibility        Hard (event tracing needed)    Easy (one place to read)
Testing           Hard (requires full event      Easier (test orchestrator
                  chain setup)                   with mocked services)
Failure handling  Each service handles own       Orchestrator handles all
                  failures                       compensations centrally
Adding a new step Add new subscriber             Modify orchestrator
Debugging         Requires distributed           Check orchestrator state
                  tracing tools                  table
Best for          High-scale fan-out,            Long-running workflows,
                  many independent               regulatory compliance,
                  consumers                      complex compensations
Uber’s approach: the Cadence workflow engine (whose creators later founded Temporal) is a sophisticated orchestration approach — workflows are code, state is automatically persisted, and the engine handles retries, timeouts, and compensations durably.
Event Schema Evolution
Events are append-only and immutable — but schemas must evolve as products change. This is one of the hardest operational challenges in event sourcing.
Compatibility Types
Backward compatible (consumers on the new schema can read old events):
- Adding a field with a default — the default fills in when old events lack it
- Removing a field the new code no longer reads
Forward compatible (consumers still on the old schema can read new events):
- Adding fields (old consumers simply ignore them)
- Removing optional fields (the old schema’s defaults cover their absence)
Full compatibility (both directions):
- The sweet spot — in practice, only add or remove fields that carry defaults
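Schema registries aside, compatibility is ultimately reader behavior: a reader that supplies defaults for missing fields can consume events older than its schema, and a reader that ignores unknown fields can consume events newer than its schema. A self-contained sketch, with types and payloads invented for illustration:

```typescript
// Backward: a v2 consumer reads a v1 payload by defaulting the missing field.
// Forward:  a v1 consumer reads a v2 payload by ignoring the extra field.
interface OrderPlacedV1 { orderId: string; totalCents: number; }
interface OrderPlacedV2 extends OrderPlacedV1 { currencyCode: string; }

// v2 consumer tolerating v1 events (backward compatibility)
function readAsV2(raw: Record<string, unknown>): OrderPlacedV2 {
  return {
    orderId: raw.orderId as string,
    totalCents: raw.totalCents as number,
    currencyCode: (raw.currencyCode as string) ?? "USD", // default covers old events
  };
}

// v1 consumer tolerating v2 events (forward compatibility):
// pick the known fields, ignore everything else
function readAsV1(raw: Record<string, unknown>): OrderPlacedV1 {
  return { orderId: raw.orderId as string, totalCents: raw.totalCents as number };
}

const oldEvent = { orderId: "ord-1", totalCents: 4999 };                     // written by a v1 producer
const newEvent = { orderId: "ord-2", totalCents: 100, currencyCode: "EUR" }; // written by a v2 producer

const a = readAsV2(oldEvent); // { orderId: "ord-1", totalCents: 4999, currencyCode: "USD" }
const b = readAsV1(newEvent); // { orderId: "ord-2", totalCents: 100 }
```

Avro and Protobuf bake these behaviors into the serializer; the sketch just shows what the compatibility labels mean in terms of reads.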
Avro Schema Evolution
// Version 1
{
"type": "record",
"name": "OrderPlaced",
"fields": [
{ "name": "orderId", "type": "string" },
{ "name": "customerId", "type": "string" },
{ "name": "totalCents", "type": "long" }
]
}
// Version 2: adds optional field — backward AND forward compatible
{
"type": "record",
"name": "OrderPlaced",
"fields": [
{ "name": "orderId", "type": "string" },
{ "name": "customerId", "type": "string" },
{ "name": "totalCents", "type": "long" },
{ "name": "currencyCode", "type": "string", "default": "USD" }
]
}
Protobuf Schema Evolution
// Version 1
message OrderPlaced {
string order_id = 1;
string customer_id = 2;
int64 total_cents = 3;
}
// Version 2: safe changes
message OrderPlaced {
  string order_id = 1;
  string customer_id = 2;
  int64 total_cents = 3;
  string currency_code = 4; // new field — old consumers ignore it
  // DO NOT reuse field number 3 if you remove total_cents — declare it `reserved 3;` instead
  // DO NOT change the type of field 1, 2, or 3
}
The Upcaster Pattern
When you must make a breaking change, use an upcaster — a function that transforms old event versions to the current version at read time:
interface EventEnvelope {
schemaVersion: number;
payload: unknown;
}
function upcast(envelope: EventEnvelope): OrderPlacedEvent {
if (envelope.schemaVersion === 1) {
const v1 = envelope.payload as OrderPlacedV1;
return {
...v1,
currencyCode: "USD", // default for all v1 events
schemaVersion: 2,
};
}
if (envelope.schemaVersion === 2) {
return envelope.payload as OrderPlacedEvent;
}
throw new Error(`Unknown schema version: ${envelope.schemaVersion}`);
}
The Dual-Write Problem and the Outbox Pattern
The dual-write problem is subtle and devastating: if you write to your database AND publish an event in the same logical operation, either can fail independently, leaving the system in an inconsistent state.
// BROKEN: dual-write — each ordering fails a different way
async function placeOrder(command: PlaceOrderCommand): Promise<void> {
  await orderRepo.save(order);               // success
  await kafka.publish("OrderPlaced", order); // network error → event never published!
  // Swapping the two lines just inverts the failure:
  // await kafka.publish("OrderPlaced", order); // success
  // await orderRepo.save(order);               // DB error → event published but no DB record!
}
The Outbox Pattern (also called Transactional Outbox) uses a single atomic database transaction for both the domain state and the outbox table:
// CORRECT: single transaction, outbox approach
async function placeOrder(command: PlaceOrderCommand): Promise<void> {
await db.transaction(async (tx) => {
// Write domain state
await tx.query(
`INSERT INTO orders (id, customer_id, status, total_cents)
VALUES ($1, $2, 'PENDING', $3)`,
[orderId, command.customerId, totalCents]
);
// Write to outbox IN THE SAME TRANSACTION
await tx.query(
`INSERT INTO outbox (id, aggregate_id, event_type, payload, published_at)
VALUES ($1, $2, 'OrderPlaced', $3, NULL)`,
[eventId, orderId, JSON.stringify(orderPlacedEvent)]
);
// Transaction commits both or neither — atomicity guarantee
});
}
// Outbox relay: a separate process polls and publishes
async function outboxRelay(): Promise<void> {
while (true) {
const events = await db.query(
`SELECT * FROM outbox WHERE published_at IS NULL
ORDER BY created_at LIMIT 100 FOR UPDATE SKIP LOCKED`
);
for (const event of events) {
await kafka.publish(event.event_type, JSON.parse(event.payload));
await db.query(
`UPDATE outbox SET published_at = NOW() WHERE id = $1`,
[event.id]
);
}
await sleep(100); // poll every 100ms
}
}
Change Data Capture (CDC) approach: Instead of polling the outbox table, use a CDC tool like Debezium to read PostgreSQL’s write-ahead log (WAL) and publish changes directly. This is more reliable and lower latency than polling.
PostgreSQL WAL → Debezium → Kafka
Debezium reads the replication stream from PostgreSQL, converts
WAL entries to structured events, and publishes to Kafka topics.
At-least-once delivery guaranteed; idempotent consumers required.
Eventual Consistency in Event-Driven Systems
Event-driven systems are eventually consistent by nature. The read model lags behind the write model by the time it takes to process events. This is a feature, not a bug — but it has implications.
Read-your-writes consistency problem:
User places order → write model updated → event published → ...
User refreshes page → read model queried → event not yet processed → "Order not found!"
Solutions:
- Optimistic UI update: Update the client-side state immediately, don’t wait for the read model
- Version-based polling: After a write, poll until the read model version catches up
- Command-side read: For the immediate response, read from the write model (event store), not the read model
// Version-based consistency: wait for projection to catch up
async function placeOrderAndWait(command: PlaceOrderCommand): Promise<OrderSummary> {
const { orderId, eventVersion } = await commandBus.send(command);
// Poll read model until it's processed this event version
const deadline = Date.now() + 5000; // 5s max wait
while (Date.now() < deadline) {
const summary = await readDb.findOne("order_summaries", { orderId });
if (summary && summary.version >= eventVersion) return summary;
await sleep(50);
}
// Fallback: reconstruct from event store directly
return reconstructFromEvents(orderId);
}
Event Storming: Discovering Events Collaboratively
Event Storming (Alberto Brandolini) is a workshop technique for discovering the events in a domain before writing code. It surfaces implicit business rules, identifies bounded contexts, and aligns business and technical teams.
Event Storming artifacts on a physical or virtual whiteboard:
🟠 Domain Events (orange stickies) — "OrderPlaced", "PaymentFailed"
🔵 Commands (blue stickies) — "PlaceOrder", "CancelOrder"
🟡 Read Models (yellow stickies) — "Current stock level", "Customer order history"
🟣 Aggregates (purple stickies) — "Order", "Customer", "Product"
🔴 Hot spots (red stickies) — Questions, concerns, unknowns
🟢 Policies (green stickies) — "When PaymentFailed → notify customer"
Process flow (left to right, time-ordered):
PlaceOrder → OrderCreated → [Policy: reserve inventory] → InventoryReserved →
ProcessPayment → PaymentProcessed → OrderConfirmed → ShipmentDispatched → OrderDelivered
Real-World Examples
Uber: Event-Sourced Dispatch
Uber’s dispatch system processes millions of ride events per second. The dispatch state machine is event-sourced: every state transition (ride requested, driver matched, ride started, ride completed, payment processed) is an event appended to a per-ride event log.
Key properties:
- Event store is Apache Kafka with a very high retention period
- Projections build real-time read models for driver apps (driver assignment, pickup ETA)
- Replay allows backfilling analytics tables, debugging production incidents, and A/B testing dispatch algorithms on historical data
Netflix: Kafka as the Nervous System
Netflix processes 700 billion events per day through Apache Kafka. Key use cases:
- Activity tracking: Every play, pause, seek, rating → event → recommendation model update
- Operational events: Service health events → automatic failover
- Data pipeline: Events → S3 (raw) → Spark (aggregation) → Hive (analytics)
- Cross-region replication: Events replicated across AWS regions via Kafka MirrorMaker
Netflix’s principle: “The log is the truth.” The Kafka topic is the authoritative source; everything else is a derived projection.
Interview Checklist
Before walking out of an event-driven architecture interview, verify you have covered:
- Commands vs events vs queries: Semantic distinction, naming conventions, cardinality of handlers
- Event sourcing: Append-only log, state reconstruction by replay, snapshots for performance
- CQRS: Separate write model (event store) from read model (denormalized projections), why they pair with event sourcing
- Dual-write problem: Why writing to DB + publishing event is broken, outbox pattern, CDC (Debezium)
- Choreography vs orchestration: Trade-offs — choreography is lower coupling, orchestration is more debuggable
- Schema evolution: Backward/forward compatibility, Avro/Protobuf, upcasters for breaking changes
- Eventual consistency: Read-your-writes problem, optimistic UI, version polling
- At-least-once delivery: Consumers must be idempotent (use eventId as deduplication key)
- Real systems: Kafka (Uber, Netflix), event sourcing in financial systems (Greg Young’s work)
- When NOT to use it: Simple CRUD, small teams, systems requiring strong consistency everywhere
The hardest interview question in this space:
“You’re designing a payment system. A payment event is published to Kafka. The consumer processes it and marks it as complete. Kafka has a brief network issue and redelivers the same event. How do you prevent double charging?”
Answer: Idempotency keys. The payment event carries a globally unique eventId. Before processing, the consumer runs INSERT INTO processed_events (event_id) VALUES ($1) ON CONFLICT DO NOTHING; if the insert affects zero rows, the event was already processed — skip it. The same eventId is passed as the idempotency key on the payment gateway API call (Stripe and Braintree both support idempotency keys natively).
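A sketch of that consumer, with the `processed_events` table modeled as an in-memory `Set` (in production the membership check and insert are one atomic `INSERT ... ON CONFLICT DO NOTHING` in the same transaction as the state change; the names below are illustrative):

```typescript
// Idempotent payment consumer: redeliveries of the same eventId are no-ops
interface PaymentEvent { eventId: string; orderId: string; amountCents: number; }

const processedEvents = new Set<string>(); // stands in for the processed_events table
let chargesIssued = 0;                     // stands in for the payment gateway call

function handlePayment(event: PaymentEvent): void {
  // Atomic claim: the first delivery wins, any redelivery is skipped
  if (processedEvents.has(event.eventId)) return;
  processedEvents.add(event.eventId);
  // The gateway call would also carry event.eventId as its idempotency key,
  // so even a crash between add() and the charge cannot double-bill
  chargesIssued++;
}

const event = { eventId: "evt-1", orderId: "ord-1", amountCents: 4999 };
handlePayment(event);
handlePayment(event); // Kafka redelivery — deduplicated, no double charge
// chargesIssued === 1
```

The layering matters: the dedup table protects against redelivery, and the gateway-side idempotency key protects against crashes between the dedup write and the charge.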