ACID Transactions & Write-Ahead Log — Grind: DSA, System Design, Interview Prep

ACID: Precise Definitions

“ACID” is often taught as four marketing bullets. At the system level, each property has a precise meaning in terms of what a database must guarantee when hardware fails, processes crash, or concurrent transactions interfere.

Atomicity

A transaction’s writes either all appear or none appear — from the perspective of any observer, including a crash recovery routine reading the database after a power failure.

This is not about concurrency. It’s about failure. If a transaction updates 12 rows and the server crashes after row 7, the database must undo the first 7 writes on recovery. The half-committed state must become invisible.

Implementation: the database keeps a log of what it intended to do. On recovery, incomplete transactions are rolled back using the log. PostgreSQL uses WAL undo records. InnoDB uses its undo log.

Consistency

Often called the “weakest” of the ACID properties because it is partly the application’s responsibility. A transaction must bring the database from one valid state to another valid state, where “valid” means satisfying all declared integrity constraints: CHECK, FOREIGN KEY, UNIQUE, NOT NULL.

The database enforces structural consistency (constraint violations are rejected). Application consistency — the business rules that can’t be expressed as constraints — is the developer’s problem.

Isolation

Concurrent transactions must not observe each other’s intermediate state. The formal guarantee is serializability: the outcome must be equivalent to some serial (one-at-a-time) execution.

In practice, databases offer a spectrum of isolation levels (Read Uncommitted → Read Committed → Repeatable Read → Serializable), each trading anomaly exposure for throughput. See the MVCC article for deep detail.

Durability

Once a transaction commits, its writes persist — even if the system crashes immediately after the commit returns to the client. The database must have flushed enough information to non-volatile storage to reconstruct the committed state.

This is the hardest property to implement efficiently. A naive implementation calls fsync() for every committed write — correct but catastrophically slow (a single fsync on a spinning disk takes ~5ms, capping throughput at 200 TPS). The Write-Ahead Log exists almost entirely to serve durability at high throughput.

Write-Ahead Log: The Fundamental Guarantee

The WAL rule is simple and absolute:

A modified data page must never reach disk before the log record describing that modification has been durably written to the log.

This one rule is sufficient to make crash recovery possible. Here is why:

If a data page is written to disk but the log record is not, and the system crashes, recovery has no way to know whether that write happened — or whether it should be undone.
If the log record is written first and the system crashes before the data page hits disk, recovery can redo the write from the log. The data page on disk is simply stale — not corrupted.

The WAL turns random writes (scattering dirty pages across the disk) into sequential writes (appending to the log file). Sequential I/O is 10-100x faster on spinning disks and significantly faster even on SSDs due to reduced write amplification.

 Client       Database             WAL File          Data Files
   │             │                    │                   │
   │── INSERT ──►│                    │                   │
   │             │── append log ─────►│ (sequential)      │
   │             │   record           │                   │
   │             │── fsync ──────────►│ (durability)      │
   │             │                    │                   │
   │◄─ COMMIT ───│                    │                   │
   │             │                    │                   │
   │             │── write dirty ─────────────────────────►
   │             │   pages (async)    │                   │ (background)

The client gets a COMMIT response the moment the log is durable. The actual data page write happens in the background — possibly seconds later. This is safe because the log can reconstruct it.

WAL Record Structure

Each WAL record describes one modification:

WAL Record Layout:
┌────────────────────────────────────────────────────────┐
│ LSN (Log Sequence Number) — 8 bytes, monotonically inc │
│ Transaction ID (XID) — 4 bytes                         │
│ Record type — HEAP_INSERT / HEAP_UPDATE / HEAP_DELETE  │
│   / COMMIT / ABORT / CHECKPOINT / ...                  │
│ Relation ID (table)                                    │
│ Block number (which page)                              │
│ Offset within page                                     │
│ Before image (for undo) — optional                     │
│ After image (for redo) — the new state                 │
│ CRC-32 checksum                                        │
└────────────────────────────────────────────────────────┘

LSN (Log Sequence Number) is the byte offset of the record within the WAL stream. It is the fundamental ordering primitive. Data pages store the LSN of the last WAL record that modified them (pd_lsn in PostgreSQL). During recovery, a page is only redone if the WAL record’s LSN is greater than the page’s stored LSN — preventing double-application.

-- Inspect WAL in PostgreSQL
SELECT pg_current_wal_lsn();           -- current WAL position
SELECT pg_wal_lsn_diff(
  pg_current_wal_lsn(), '0/1000000'); -- bytes written since a point

-- pg_waldump decodes WAL records
-- pg_waldump -p $PGDATA/pg_wal -s 0/1A2B3C00 -n 10

Force and Steal Policies

The Buffer Manager controls when dirty (modified) pages move from the buffer pool to disk. Two binary policy choices define the space of possible implementations:

STEAL: Can the buffer manager write a dirty page to disk before the transaction commits?

No-Steal: Never. Dirty pages stay in RAM until commit. Simple but requires infinite RAM for large transactions.
Steal: Yes. Dirty pages can be evicted anytime. Requires undo logging (old values must be logged so a crash can restore them).

FORCE: Must the buffer manager write all dirty pages to disk at commit time?

Force: Yes. Commit does a full flush. Simple but slow — large transactions flush many pages.
No-Force: No. Commit only writes the log. Pages are written lazily. Requires redo logging (new values must be logged so recovery can redo them).

The four combinations:

	Force	No-Force
No-Steal	No undo, no redo needed. Impractical (RAM limits)	Redo only. Impractical (RAM limits)
Steal	Undo only. Simple but slow commit	Steal + No-Force: UNDO + REDO. Used by all real systems

Every production database — PostgreSQL, MySQL InnoDB, Oracle, SQL Server — uses Steal + No-Force because it allows arbitrary transaction sizes and fast commits. The price is needing both undo and redo in the WAL.

ARIES: The Recovery Algorithm

ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), developed at IBM Research in 1992, is the foundational recovery algorithm. PostgreSQL’s recovery is based on ARIES principles.

Recovery after a crash proceeds in three phases:

Phase 1: Analysis

Scan the WAL forward from the last checkpoint. Build:

The dirty page table (DPT): which pages were dirty at crash time, and the earliest LSN that could have modified them.
The transaction table: which transactions were active at crash time, and their last LSN.

Phase 2: Redo

Scan the WAL forward from the earliest LSN in the DPT. For every log record:

Fetch the page from disk.
If the page’s stored pd_lsn is less than the record’s LSN, apply the change (redo it).
Otherwise, skip — the page was already flushed with this change.

After redo, the database is in the exact state it was at the moment of the crash — including the partial writes of uncommitted transactions.

Phase 3: Undo

Scan the WAL backwards for all transactions that were active at crash time (found in the transaction table from Phase 1). For each, apply undo records to roll back their changes.

WAL Timeline:
──────────────────────────────────────────────────────────►
       │         │                             │
   Checkpoint  T1 writes               CRASH
               T2 writes
               T1 commits
               T2 still in-progress

Analysis: starts at Checkpoint
Redo:     starts at earliest dirty LSN → re-applies T1 and T2 writes
Undo:     rolls back T2 (was uncommitted at crash)

The result: T1’s commit is visible. T2’s partial writes are gone. The database is consistent.

Checkpointing

If recovery always started from the beginning of the WAL, it would be unbounded. Checkpointing periodically records a position before which no redo is needed.

A checkpoint:

Flushes all dirty pages to disk (ensuring redo is not needed before the checkpoint LSN).
Writes a CHECKPOINT record to WAL with the current LSN.
Saves the checkpoint LSN so recovery knows where to start.

Fuzzy checkpoint (used by PostgreSQL): instead of stopping all writes during the checkpoint (which would freeze the database), the checkpoint proceeds in the background. It records the start LSN, then flushes dirty pages identified at that point. Some of those pages may receive additional writes during the checkpoint — that is fine, because the redo phase handles it.

checkpoint_completion_target = 0.9  -- PostgreSQL default
-- Spreads checkpoint I/O over 90% of the checkpoint interval
-- Prevents I/O spikes that would saturate storage

Recovery time is directly bounded by checkpoint frequency:

Checkpoint every 5 minutes → maximum recovery scans 5 minutes of WAL.
max_wal_size in PostgreSQL triggers checkpoints when WAL exceeds this size.
More frequent checkpoints → shorter recovery, higher I/O overhead.

Group Commit: Batching fsync for Throughput

fsync() is expensive. On a typical NVMe SSD, one fsync takes ~100µs. On a spinning disk, ~5ms. A naive system that calls fsync per commit is capped at 10,000 TPS (NVMe) or 200 TPS (spinning disk).

Group commit batches multiple transactions’ WAL writes into a single fsync:

T1 commits → append to WAL buffer, wait...
T2 commits → append to WAL buffer, wait...
T3 commits → append to WAL buffer → fsync() → notify T1, T2, T3: done

PostgreSQL implements group commit automatically. The commit_delay parameter adds a small deliberate delay before fsync to allow more transactions to join the batch. At high load, this is unnecessary — the system is already batching.

PostgreSQL synchronous_commit levels:

Level	Durability	Latency	Data Loss Risk
`off`	None until checkpoint	~0ms	Up to checkpoint interval
`local`	WAL flushed on primary	~1ms	Primary crash after commit
`remote_write`	Replica received in OS buffer	~2ms	Replica crashes
`remote_apply`	Replica applied WAL	~5ms	None (quorum)
`on` (default)	WAL flushed on primary	~1ms	Zero on primary

synchronous_commit = off is a legitimate throughput knob. It risks losing the last few transactions on primary crash — acceptable for session-level ephemeral data, never for financial records.

Durability vs Performance: fsync, O_DIRECT, Write Barriers

fsync(): Forces the kernel to flush its write buffer to the storage device. Without it, the OS may report “written” after only writing to its page cache — data is lost if power fails.

O_DIRECT: Bypasses the kernel page cache entirely. The database manages its own buffer pool. PostgreSQL uses this on Linux for WAL when wal_sync_method = fdatasync (or open_datasync on some platforms). Avoids double-buffering.

Write barriers: Modern SSDs have write caches (DRAM capacitors or supercapacitors for protection). fsync issues a “flush cache” command to the device. Some cloud storage (EBS, etc.) emulates this correctly. Misconfigured storage that lies about fsync is a known source of data corruption — the PostgreSQL 12 release notes documented this explicitly after ext4 in some configurations did not properly honor it.

Real-world WAL write amplification:

A single-row INSERT in PostgreSQL generates:

One WAL record (~100-500 bytes depending on row size)
One WAL page write (8KB, containing many records — shared with concurrent transactions)
Eventually: one dirty heap page write (8KB)
Eventually: one or more dirty index page writes

The WAL record is tiny compared to page writes. WAL write amplification for small transactions is typically 5-20x relative to the logical data size.

Replicas and WAL Streaming

WAL is the mechanism for replication in PostgreSQL. The standby connects to the primary and receives WAL records in real time via streaming replication:

Primary                          Standby
  │                                 │
  │── WAL record (INSERT) ─────────►│ WAL receiver process
  │── WAL record (UPDATE) ─────────►│ writes to local WAL
  │── WAL record (COMMIT) ─────────►│
  │                                 │── startup process
  │                                 │   applies WAL records
  │                                 │   (same as crash recovery)

Physical replication ships WAL byte-for-byte. The standby is a binary copy of the primary. Fast, but standby must be same PostgreSQL version and platform.

Logical replication decodes WAL into logical operations (INSERT/UPDATE/DELETE on specific tables) and ships those. Enables cross-version upgrades and selective table replication.

WAL in Distributed Systems

The insight that “a log is sufficient for consensus” extends beyond single-node databases.

Raft (used by etcd, CockroachDB, Consul) is essentially a distributed WAL. The Raft leader writes entries to its log, replicates to followers, and considers an entry committed once a quorum acknowledges it. This is structurally identical to a WAL with distributed fsync.

LSM tree WAL: RocksDB, Cassandra, and other LSM-based stores also use a WAL. Writes go to the WAL first, then the in-memory MemTable. If the process crashes, the WAL replays into the MemTable on restart. The compaction process that merges SSTables is a separate concern from crash recovery — the WAL handles the latter.

The WAL pattern is universal: append to a durable log first, apply to the data structure second. The data structure can always be rebuilt from the log.

TypeScript Pseudocode: WAL-Based Crash Recovery

type LSN = number;
type XID = number;

type WALRecord = {
  lsn: LSN;
  xid: XID;
  type: "INSERT" | "UPDATE" | "DELETE" | "COMMIT" | "ABORT" | "CHECKPOINT";
  pageId?: number;
  pageLsn?: LSN;    // page's LSN at time of modification
  afterImage?: Uint8Array;
  beforeImage?: Uint8Array;
};

type Page = { id: number; lsn: LSN; data: Uint8Array };
type Transaction = { xid: XID; lastLsn: LSN; status: "active" | "committed" | "aborted" };

function recoverARIES(
  walLog: WALRecord[],
  checkpointLsn: LSN,
  diskPages: Map<number, Page>
): void {
  // Phase 1: Analysis — scan from checkpoint
  const dirtyPageTable = new Map<number, LSN>(); // pageId → recLsn
  const txnTable = new Map<XID, Transaction>();

  for (const rec of walLog.filter(r => r.lsn >= checkpointLsn)) {
    if (rec.type === "COMMIT") {
      txnTable.set(rec.xid, { xid: rec.xid, lastLsn: rec.lsn, status: "committed" });
    } else if (rec.type === "ABORT") {
      txnTable.set(rec.xid, { xid: rec.xid, lastLsn: rec.lsn, status: "aborted" });
    } else if (rec.pageId !== undefined) {
      if (!txnTable.has(rec.xid)) {
        txnTable.set(rec.xid, { xid: rec.xid, lastLsn: rec.lsn, status: "active" });
      }
      if (!dirtyPageTable.has(rec.pageId)) {
        dirtyPageTable.set(rec.pageId, rec.lsn);
      }
    }
  }

  // Phase 2: Redo — re-apply all changes
  const earliestLsn = Math.min(...dirtyPageTable.values());
  for (const rec of walLog.filter(r => r.lsn >= earliestLsn)) {
    if (rec.pageId === undefined || !rec.afterImage) continue;
    const page = diskPages.get(rec.pageId);
    if (!page || page.lsn < rec.lsn) {
      // Page on disk is stale — redo this record
      diskPages.set(rec.pageId, {
        id: rec.pageId,
        lsn: rec.lsn,
        data: rec.afterImage,
      });
    }
  }

  // Phase 3: Undo — roll back active transactions
  const toUndo = [...txnTable.values()].filter(t => t.status === "active");
  for (const txn of toUndo) {
    // Walk WAL backwards applying undo (beforeImage) for this XID
    for (const rec of [...walLog].reverse()) {
      if (rec.xid !== txn.xid) continue;
      if (rec.type === "COMMIT" || rec.type === "ABORT") break;
      if (rec.pageId !== undefined && rec.beforeImage) {
        diskPages.set(rec.pageId, {
          id: rec.pageId,
          lsn: rec.lsn,
          data: rec.beforeImage,
        });
      }
    }
  }
}

Interview Deep-Dive: How Does a Database Survive a Crash?

Q: How does a database survive a crash?

A complete answer:

The database writes a WAL record to a log file — describing every modification — before writing the actual data page. The log is flushed to disk (fsync) at commit time. Data pages are written lazily in the background.

On crash, the recovery routine reads the WAL from the last checkpoint. It first redoes all changes found in the log, restoring the exact pre-crash state — including partial writes of uncommitted transactions. Then it undoes all changes that belonged to transactions that never committed.

The WAL rule (log before data) makes this safe: if a data page write was lost in the crash, the WAL has the record to redo it. If the system crashed mid-transaction, the WAL has before-images to undo it.

The cost of durability is the fsync on commit. Group commit amortizes this across many concurrent transactions. synchronous_commit = off skips the fsync entirely — gaining ~10x throughput but risking the loss of a few transactions on crash.

WAL Segments, Archiving, and PITR

PostgreSQL’s WAL is not a single file — it is a sequence of fixed-size segment files (default 16MB each), named with a 24-character hex number encoding the timeline and segment number:

$PGDATA/pg_wal/
  000000010000000000000001   (timeline 1, segment 1)
  000000010000000000000002
  000000010000000000000003   ← current WAL segment
  ...

When the current segment fills, PostgreSQL creates the next one. Old segments are recycled (renamed and reused) or archived, depending on configuration.

WAL archiving (archive_mode = on) copies completed WAL segments to an external location (S3, NFS, another server) before recycling them. This enables Point-in-Time Recovery (PITR):

# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
# %p = full path of WAL file, %f = filename

# Restore: take a base backup + replay WAL up to a target time
restore_command = 'aws s3 cp s3://my-wal-archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00'

PITR is the gold-standard backup strategy for PostgreSQL. The base backup is a consistent filesystem snapshot; WAL replay brings it to any point in time after that backup. With a 5-minute base backup cadence and continuous WAL archiving, you can recover to within seconds of any point in the last week.

WAL Level Settings

PostgreSQL supports three WAL levels (controlled by wal_level):

Level	Content	Use Case
`minimal`	Enough for crash recovery only	Standalone, no replicas
`replica`	Includes info for streaming replication	Standby replicas
`logical`	Includes decoded logical changes	Logical replication, Debezium CDC

Logical WAL is significantly larger — it includes full before/after images of rows for decoding. replica WAL is the default and appropriate for most setups.

Measuring WAL and I/O Performance

Real numbers matter. Here are the I/O characteristics that govern WAL-based systems:

Latency targets:

NVMe SSD fsync: 50-200µs
SATA SSD fsync: 200-500µs
Spinning disk fsync: 3-10ms
Networked storage (EBS gp3) fsync: 200-600µs

Write throughput with WAL:

At synchronous_commit = on: throughput limited by fsync latency × degree of group commit batching
At 500µs fsync, with 50 concurrent committers group-committing: ~2,000 commits/s per fsync → 100,000 TPS effective
At 5ms fsync (spinning disk): ~200 solo TPS; group commit with 20 batched → ~4,000 TPS effective

WAL write amplification example (single row INSERT):

Logical write:    24 bytes (int PK + varchar(20))
WAL record:      ~180 bytes (header + before/after images)
WAL page:        8,192 bytes (shared with other records — amortized)
Heap page:       8,192 bytes (dirty, written at checkpoint)
Index page:      8,192 bytes × (number of indexes)

For 1 index: ~16KB written per logical 24-byte row = ~666x amplification

This amplification is why SSD endurance (TBW — terabytes written) matters for database servers. A PostgreSQL server doing 10,000 INSERTs/second with one index is writing approximately 160MB/s to storage.

Key Takeaways

ACID’s durability property requires writing to non-volatile storage before acknowledging commit.
WAL achieves this with sequential appends rather than random page writes — 10-100x faster on spinning disk.
Steal + No-Force is the universal production policy: evict pages freely, never force all pages on commit.
ARIES recovery: Analysis (build state), Redo (re-apply everything), Undo (roll back non-committed).
Checkpoints bound recovery time by establishing a WAL start point; checkpoint_completion_target spreads I/O.
Group commit amortizes fsync cost across concurrent transactions; synchronous_commit = off trades durability for throughput.
WAL archiving + base backups enable Point-in-Time Recovery to any second in the retention window.
WAL is also the replication mechanism and the distributed log in Raft-based systems.
Write amplification is real: a single logical row write may produce 666x more bytes written at the storage layer.