ACID: Precise Definitions
“ACID” is often taught as four marketing bullets. At the system level, each property has a precise meaning in terms of what a database must guarantee when hardware fails, processes crash, or concurrent transactions interfere.
Atomicity
A transaction’s writes either all appear or none appear — from the perspective of any observer, including a crash recovery routine reading the database after a power failure.
This is not about concurrency. It’s about failure. If a transaction updates 12 rows and the server crashes after row 7, the database must undo the first 7 writes on recovery. The half-committed state must become invisible.
Implementation: the database keeps a log of what it intended to do. On recovery, incomplete transactions are rolled back using the log. PostgreSQL uses WAL undo records. InnoDB uses its undo log.
Consistency
Often called the “weakest” of the ACID properties because it is partly the application’s responsibility. A transaction must bring the database from one valid state to another valid state, where “valid” means satisfying all declared integrity constraints: CHECK, FOREIGN KEY, UNIQUE, NOT NULL.
The database enforces structural consistency (constraint violations are rejected). Application consistency — the business rules that can’t be expressed as constraints — is the developer’s problem.
Isolation
Concurrent transactions must not observe each other’s intermediate state. The formal guarantee is serializability: the outcome must be equivalent to some serial (one-at-a-time) execution.
In practice, databases offer a spectrum of isolation levels (Read Uncommitted → Read Committed → Repeatable Read → Serializable), each trading anomaly exposure for throughput. See the MVCC article for deep detail.
Durability
Once a transaction commits, its writes persist — even if the system crashes immediately after the commit returns to the client. The database must have flushed enough information to non-volatile storage to reconstruct the committed state.
This is the hardest property to implement efficiently. A naive implementation calls fsync() for every committed write — correct but catastrophically slow (a single fsync on a spinning disk takes ~5ms, capping throughput at 200 TPS). The Write-Ahead Log exists almost entirely to serve durability at high throughput.
Write-Ahead Log: The Fundamental Guarantee
The WAL rule is simple and absolute:
A modified data page must never reach disk before the log record describing that modification has been durably written to the log.
This one rule is sufficient to make crash recovery possible. Here is why:
- If a data page is written to disk but the log record is not, and the system crashes, recovery has no way to know whether that write happened — or whether it should be undone.
- If the log record is written first and the system crashes before the data page hits disk, recovery can redo the write from the log. The data page on disk is simply stale — not corrupted.
The WAL turns random writes (scattering dirty pages across the disk) into sequential writes (appending to the log file). Sequential I/O is 10-100x faster on spinning disks and significantly faster even on SSDs due to reduced write amplification.
Client Database WAL File Data Files
│ │ │ │
│── INSERT ──►│ │ │
│ │── append log ─────►│ (sequential) │
│ │ record │ │
│ │── fsync ──────────►│ (durability) │
│ │ │ │
│◄─ COMMIT ───│ │ │
│ │ │ │
│ │── write dirty ─────────────────────────►
│ │ pages (async) │ │ (background)
The client gets a COMMIT response the moment the log is durable. The actual data page write happens in the background — possibly seconds later. This is safe because the log can reconstruct it.
WAL Record Structure
Each WAL record describes one modification:
WAL Record Layout:
┌────────────────────────────────────────────────────────┐
│ LSN (Log Sequence Number) — 8 bytes, monotonically inc │
│ Transaction ID (XID) — 4 bytes │
│ Record type — HEAP_INSERT / HEAP_UPDATE / HEAP_DELETE │
│ / COMMIT / ABORT / CHECKPOINT / ... │
│ Relation ID (table) │
│ Block number (which page) │
│ Offset within page │
│ Before image (for undo) — optional │
│ After image (for redo) — the new state │
│ CRC-32 checksum │
└────────────────────────────────────────────────────────┘
LSN (Log Sequence Number) is the byte offset of the record within the WAL stream. It is the fundamental ordering primitive. Data pages store the LSN of the last WAL record that modified them (pd_lsn in PostgreSQL). During recovery, a page is only redone if the WAL record’s LSN is greater than the page’s stored LSN — preventing double-application.
-- Inspect WAL in PostgreSQL
SELECT pg_current_wal_lsn(); -- current WAL position
SELECT pg_wal_lsn_diff(
pg_current_wal_lsn(), '0/1000000'); -- bytes written since a point
-- pg_waldump decodes WAL records
-- pg_waldump -p $PGDATA/pg_wal -s 0/1A2B3C00 -n 10
Force and Steal Policies
The Buffer Manager controls when dirty (modified) pages move from the buffer pool to disk. Two binary policy choices define the space of possible implementations:
STEAL: Can the buffer manager write a dirty page to disk before the transaction commits?
- No-Steal: Never. Dirty pages stay in RAM until commit. Simple but requires infinite RAM for large transactions.
- Steal: Yes. Dirty pages can be evicted anytime. Requires undo logging (old values must be logged so a crash can restore them).
FORCE: Must the buffer manager write all dirty pages to disk at commit time?
- Force: Yes. Commit does a full flush. Simple but slow — large transactions flush many pages.
- No-Force: No. Commit only writes the log. Pages are written lazily. Requires redo logging (new values must be logged so recovery can redo them).
The four combinations:
| Force | No-Force | |
|---|---|---|
| No-Steal | No undo, no redo needed. Impractical (RAM limits) | Redo only. Impractical (RAM limits) |
| Steal | Undo only. Simple but slow commit | Steal + No-Force: UNDO + REDO. Used by all real systems |
Every production database — PostgreSQL, MySQL InnoDB, Oracle, SQL Server — uses Steal + No-Force because it allows arbitrary transaction sizes and fast commits. The price is needing both undo and redo in the WAL.
ARIES: The Recovery Algorithm
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), developed at IBM Research in 1992, is the foundational recovery algorithm. PostgreSQL’s recovery is based on ARIES principles.
Recovery after a crash proceeds in three phases:
Phase 1: Analysis
Scan the WAL forward from the last checkpoint. Build:
- The dirty page table (DPT): which pages were dirty at crash time, and the earliest LSN that could have modified them.
- The transaction table: which transactions were active at crash time, and their last LSN.
Phase 2: Redo
Scan the WAL forward from the earliest LSN in the DPT. For every log record:
- Fetch the page from disk.
- If the page’s stored
pd_lsnis less than the record’s LSN, apply the change (redo it). - Otherwise, skip — the page was already flushed with this change.
After redo, the database is in the exact state it was at the moment of the crash — including the partial writes of uncommitted transactions.
Phase 3: Undo
Scan the WAL backwards for all transactions that were active at crash time (found in the transaction table from Phase 1). For each, apply undo records to roll back their changes.
WAL Timeline:
──────────────────────────────────────────────────────────►
│ │ │
Checkpoint T1 writes CRASH
T2 writes
T1 commits
T2 still in-progress
Analysis: starts at Checkpoint
Redo: starts at earliest dirty LSN → re-applies T1 and T2 writes
Undo: rolls back T2 (was uncommitted at crash)
The result: T1’s commit is visible. T2’s partial writes are gone. The database is consistent.
Checkpointing
If recovery always started from the beginning of the WAL, it would be unbounded. Checkpointing periodically records a position before which no redo is needed.
A checkpoint:
- Flushes all dirty pages to disk (ensuring redo is not needed before the checkpoint LSN).
- Writes a CHECKPOINT record to WAL with the current LSN.
- Saves the checkpoint LSN so recovery knows where to start.
Fuzzy checkpoint (used by PostgreSQL): instead of stopping all writes during the checkpoint (which would freeze the database), the checkpoint proceeds in the background. It records the start LSN, then flushes dirty pages identified at that point. Some of those pages may receive additional writes during the checkpoint — that is fine, because the redo phase handles it.
checkpoint_completion_target = 0.9 -- PostgreSQL default
-- Spreads checkpoint I/O over 90% of the checkpoint interval
-- Prevents I/O spikes that would saturate storage
Recovery time is directly bounded by checkpoint frequency:
- Checkpoint every 5 minutes → maximum recovery scans 5 minutes of WAL.
max_wal_sizein PostgreSQL triggers checkpoints when WAL exceeds this size.- More frequent checkpoints → shorter recovery, higher I/O overhead.
Group Commit: Batching fsync for Throughput
fsync() is expensive. On a typical NVMe SSD, one fsync takes ~100µs. On a spinning disk, ~5ms. A naive system that calls fsync per commit is capped at 10,000 TPS (NVMe) or 200 TPS (spinning disk).
Group commit batches multiple transactions’ WAL writes into a single fsync:
T1 commits → append to WAL buffer, wait...
T2 commits → append to WAL buffer, wait...
T3 commits → append to WAL buffer → fsync() → notify T1, T2, T3: done
PostgreSQL implements group commit automatically. The commit_delay parameter adds a small deliberate delay before fsync to allow more transactions to join the batch. At high load, this is unnecessary — the system is already batching.
PostgreSQL synchronous_commit levels:
| Level | Durability | Latency | Data Loss Risk |
|---|---|---|---|
off | None until checkpoint | ~0ms | Up to checkpoint interval |
local | WAL flushed on primary | ~1ms | Primary crash after commit |
remote_write | Replica received in OS buffer | ~2ms | Replica crashes |
remote_apply | Replica applied WAL | ~5ms | None (quorum) |
on (default) | WAL flushed on primary | ~1ms | Zero on primary |
synchronous_commit = off is a legitimate throughput knob. It risks losing the last few transactions on primary crash — acceptable for session-level ephemeral data, never for financial records.
Durability vs Performance: fsync, O_DIRECT, Write Barriers
fsync(): Forces the kernel to flush its write buffer to the storage device. Without it, the OS may report “written” after only writing to its page cache — data is lost if power fails.
O_DIRECT: Bypasses the kernel page cache entirely. The database manages its own buffer pool. PostgreSQL uses this on Linux for WAL when wal_sync_method = fdatasync (or open_datasync on some platforms). Avoids double-buffering.
Write barriers: Modern SSDs have write caches (DRAM capacitors or supercapacitors for protection). fsync issues a “flush cache” command to the device. Some cloud storage (EBS, etc.) emulates this correctly. Misconfigured storage that lies about fsync is a known source of data corruption — the PostgreSQL 12 release notes documented this explicitly after ext4 in some configurations did not properly honor it.
Real-world WAL write amplification:
A single-row INSERT in PostgreSQL generates:
- One WAL record (~100-500 bytes depending on row size)
- One WAL page write (8KB, containing many records — shared with concurrent transactions)
- Eventually: one dirty heap page write (8KB)
- Eventually: one or more dirty index page writes
The WAL record is tiny compared to page writes. WAL write amplification for small transactions is typically 5-20x relative to the logical data size.
Replicas and WAL Streaming
WAL is the mechanism for replication in PostgreSQL. The standby connects to the primary and receives WAL records in real time via streaming replication:
Primary Standby
│ │
│── WAL record (INSERT) ─────────►│ WAL receiver process
│── WAL record (UPDATE) ─────────►│ writes to local WAL
│── WAL record (COMMIT) ─────────►│
│ │── startup process
│ │ applies WAL records
│ │ (same as crash recovery)
Physical replication ships WAL byte-for-byte. The standby is a binary copy of the primary. Fast, but standby must be same PostgreSQL version and platform.
Logical replication decodes WAL into logical operations (INSERT/UPDATE/DELETE on specific tables) and ships those. Enables cross-version upgrades and selective table replication.
WAL in Distributed Systems
The insight that “a log is sufficient for consensus” extends beyond single-node databases.
Raft (used by etcd, CockroachDB, Consul) is essentially a distributed WAL. The Raft leader writes entries to its log, replicates to followers, and considers an entry committed once a quorum acknowledges it. This is structurally identical to a WAL with distributed fsync.
LSM tree WAL: RocksDB, Cassandra, and other LSM-based stores also use a WAL. Writes go to the WAL first, then the in-memory MemTable. If the process crashes, the WAL replays into the MemTable on restart. The compaction process that merges SSTables is a separate concern from crash recovery — the WAL handles the latter.
The WAL pattern is universal: append to a durable log first, apply to the data structure second. The data structure can always be rebuilt from the log.
TypeScript Pseudocode: WAL-Based Crash Recovery
type LSN = number;
type XID = number;
type WALRecord = {
lsn: LSN;
xid: XID;
type: "INSERT" | "UPDATE" | "DELETE" | "COMMIT" | "ABORT" | "CHECKPOINT";
pageId?: number;
pageLsn?: LSN; // page's LSN at time of modification
afterImage?: Uint8Array;
beforeImage?: Uint8Array;
};
type Page = { id: number; lsn: LSN; data: Uint8Array };
type Transaction = { xid: XID; lastLsn: LSN; status: "active" | "committed" | "aborted" };
function recoverARIES(
walLog: WALRecord[],
checkpointLsn: LSN,
diskPages: Map<number, Page>
): void {
// Phase 1: Analysis — scan from checkpoint
const dirtyPageTable = new Map<number, LSN>(); // pageId → recLsn
const txnTable = new Map<XID, Transaction>();
for (const rec of walLog.filter(r => r.lsn >= checkpointLsn)) {
if (rec.type === "COMMIT") {
txnTable.set(rec.xid, { xid: rec.xid, lastLsn: rec.lsn, status: "committed" });
} else if (rec.type === "ABORT") {
txnTable.set(rec.xid, { xid: rec.xid, lastLsn: rec.lsn, status: "aborted" });
} else if (rec.pageId !== undefined) {
if (!txnTable.has(rec.xid)) {
txnTable.set(rec.xid, { xid: rec.xid, lastLsn: rec.lsn, status: "active" });
}
if (!dirtyPageTable.has(rec.pageId)) {
dirtyPageTable.set(rec.pageId, rec.lsn);
}
}
}
// Phase 2: Redo — re-apply all changes
const earliestLsn = Math.min(...dirtyPageTable.values());
for (const rec of walLog.filter(r => r.lsn >= earliestLsn)) {
if (rec.pageId === undefined || !rec.afterImage) continue;
const page = diskPages.get(rec.pageId);
if (!page || page.lsn < rec.lsn) {
// Page on disk is stale — redo this record
diskPages.set(rec.pageId, {
id: rec.pageId,
lsn: rec.lsn,
data: rec.afterImage,
});
}
}
// Phase 3: Undo — roll back active transactions
const toUndo = [...txnTable.values()].filter(t => t.status === "active");
for (const txn of toUndo) {
// Walk WAL backwards applying undo (beforeImage) for this XID
for (const rec of [...walLog].reverse()) {
if (rec.xid !== txn.xid) continue;
if (rec.type === "COMMIT" || rec.type === "ABORT") break;
if (rec.pageId !== undefined && rec.beforeImage) {
diskPages.set(rec.pageId, {
id: rec.pageId,
lsn: rec.lsn,
data: rec.beforeImage,
});
}
}
}
}
Interview Deep-Dive: How Does a Database Survive a Crash?
Q: How does a database survive a crash?
A complete answer:
The database writes a WAL record to a log file — describing every modification — before writing the actual data page. The log is flushed to disk (fsync) at commit time. Data pages are written lazily in the background.
On crash, the recovery routine reads the WAL from the last checkpoint. It first redoes all changes found in the log, restoring the exact pre-crash state — including partial writes of uncommitted transactions. Then it undoes all changes that belonged to transactions that never committed.
The WAL rule (log before data) makes this safe: if a data page write was lost in the crash, the WAL has the record to redo it. If the system crashed mid-transaction, the WAL has before-images to undo it.
The cost of durability is the fsync on commit. Group commit amortizes this across many concurrent transactions. synchronous_commit = off skips the fsync entirely — gaining ~10x throughput but risking the loss of a few transactions on crash.
WAL Segments, Archiving, and PITR
PostgreSQL’s WAL is not a single file — it is a sequence of fixed-size segment files (default 16MB each), named with a 24-character hex number encoding the timeline and segment number:
$PGDATA/pg_wal/
000000010000000000000001 (timeline 1, segment 1)
000000010000000000000002
000000010000000000000003 ← current WAL segment
...
When the current segment fills, PostgreSQL creates the next one. Old segments are recycled (renamed and reused) or archived, depending on configuration.
WAL archiving (archive_mode = on) copies completed WAL segments to an external location (S3, NFS, another server) before recycling them. This enables Point-in-Time Recovery (PITR):
# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
# %p = full path of WAL file, %f = filename
# Restore: take a base backup + replay WAL up to a target time
restore_command = 'aws s3 cp s3://my-wal-archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00'
PITR is the gold-standard backup strategy for PostgreSQL. The base backup is a consistent filesystem snapshot; WAL replay brings it to any point in time after that backup. With a 5-minute base backup cadence and continuous WAL archiving, you can recover to within seconds of any point in the last week.
WAL Level Settings
PostgreSQL supports three WAL levels (controlled by wal_level):
| Level | Content | Use Case |
|---|---|---|
minimal | Enough for crash recovery only | Standalone, no replicas |
replica | Includes info for streaming replication | Standby replicas |
logical | Includes decoded logical changes | Logical replication, Debezium CDC |
Logical WAL is significantly larger — it includes full before/after images of rows for decoding. replica WAL is the default and appropriate for most setups.
Measuring WAL and I/O Performance
Real numbers matter. Here are the I/O characteristics that govern WAL-based systems:
Latency targets:
- NVMe SSD
fsync: 50-200µs - SATA SSD
fsync: 200-500µs - Spinning disk
fsync: 3-10ms - Networked storage (EBS gp3)
fsync: 200-600µs
Write throughput with WAL:
- At
synchronous_commit = on: throughput limited by fsync latency × degree of group commit batching - At 500µs fsync, with 50 concurrent committers group-committing: ~2,000 commits/s per fsync → 100,000 TPS effective
- At 5ms fsync (spinning disk): ~200 solo TPS; group commit with 20 batched → ~4,000 TPS effective
WAL write amplification example (single row INSERT):
Logical write: 24 bytes (int PK + varchar(20))
WAL record: ~180 bytes (header + before/after images)
WAL page: 8,192 bytes (shared with other records — amortized)
Heap page: 8,192 bytes (dirty, written at checkpoint)
Index page: 8,192 bytes × (number of indexes)
For 1 index: ~16KB written per logical 24-byte row = ~666x amplification
This amplification is why SSD endurance (TBW — terabytes written) matters for database servers. A PostgreSQL server doing 10,000 INSERTs/second with one index is writing approximately 160MB/s to storage.
Key Takeaways
- ACID’s durability property requires writing to non-volatile storage before acknowledging commit.
- WAL achieves this with sequential appends rather than random page writes — 10-100x faster on spinning disk.
- Steal + No-Force is the universal production policy: evict pages freely, never force all pages on commit.
- ARIES recovery: Analysis (build state), Redo (re-apply everything), Undo (roll back non-committed).
- Checkpoints bound recovery time by establishing a WAL start point;
checkpoint_completion_targetspreads I/O. - Group commit amortizes fsync cost across concurrent transactions;
synchronous_commit = offtrades durability for throughput. - WAL archiving + base backups enable Point-in-Time Recovery to any second in the retention window.
- WAL is also the replication mechanism and the distributed log in Raft-based systems.
- Write amplification is real: a single logical row write may produce 666x more bytes written at the storage layer.