Core Architecture · Topic 17 of 21

MVCC & Isolation Levels

300 XP

Why Concurrency Control Is Hard

Databases must serve thousands of concurrent transactions. Without coordination, two transactions reading and writing the same data produce results that could never arise from any serial execution — a correctness violation called a non-serializable history.

The naive solution is locking: readers block writers and writers block readers. This works, but it kills throughput. Every SELECT on a hot row blocks every UPDATE. Analytical queries running for seconds stall all OLTP writes.

MVCC (Multi-Version Concurrency Control) breaks this deadlock. The key insight: instead of blocking a reader until a writer finishes, give the reader an older version of the data. Writers never block readers. Readers never block writers. Each transaction sees a consistent snapshot of the database as it existed at a point in time.

This is the architectural foundation of PostgreSQL, Oracle, MySQL InnoDB, CockroachDB, and virtually every serious production database built after 1980.


The Core Idea: Tuple Versioning

In a non-MVCC system, a row is a single location in storage. UPDATE means overwriting it. This forces a read-write conflict.

In MVCC, a row is a chain of versions. UPDATE appends a new version; it does not erase the old one. A transaction picks the version that was current when it started, and reads that — regardless of what newer versions exist.

Timeline:  T=1          T=2          T=3          T=4
           (BEGIN)      (UPDATE)     (COMMIT)     (new reader)

Row "Alice balance=1000"
           [v1: 1000]                             ← reader at T=1 sees this
                        [v2: 800]                 ← writer created this
                                     (v2 visible) ← reader at T=4 sees this

The old version is not immediately deleted. It stays until no active transaction can possibly need it — at that point a garbage collection process (VACUUM in PostgreSQL) reclaims it.


PostgreSQL Heap Page Layout: xmin, xmax, ctid

In PostgreSQL, every row version (called a tuple) in the heap has hidden system columns that implement MVCC:

ColumnTypeMeaning
ctidtidPhysical location: (page number, slot number)
xminxidTransaction ID that created this tuple
xmaxxidTransaction ID that deleted/updated this tuple (0 = still live)
infomaskuint16Flags: committed, aborted, has nulls, heap-only-tuple, etc.
cmincidCommand ID within transaction (for intra-transaction visibility)

When you run UPDATE accounts SET balance = 800 WHERE id = 1:

  1. PostgreSQL finds the old tuple (xmin=5, xmax=0).
  2. It stamps xmax=9 on the old tuple (where 9 is the current transaction ID).
  3. It inserts a new tuple (xmin=9, xmax=0) with balance=800.
  4. The old tuple is now “dead” from transaction 9’s perspective — but still physically present.
-- You can inspect these system columns directly
SELECT ctid, xmin, xmax, balance FROM accounts WHERE id = 1;
-- ctid     | xmin | xmax | balance
-- (0,1)    |  5   |  9   |  1000   ← old version (dead to T9+)
-- (0,2)    |  9   |  0   |  800    ← new version (live)

This means a single UPDATE of a 100-byte row writes a 100-byte tuple plus all the system column overhead, and leaves the old tuple in place until VACUUM runs. Write amplification and table bloat are real costs.


Transaction IDs and the Wraparound Problem

PostgreSQL uses a 32-bit transaction ID (xid). That gives 4,294,967,296 unique values. At 1,000 transactions per second, you exhaust this in roughly 49 days of runtime — except PostgreSQL reuses IDs.

PostgreSQL’s visibility rule uses modular arithmetic: of any two XIDs, the one that is “newer” is defined as the one within 2^31 steps forward in the circular space. This means every tuple’s xmin/xmax must be compared against the current XID using this modular comparison, not a simple integer compare.

The wraparound catastrophe: if VACUUM never runs and a table has tuples with xmin that is 2^31 transactions in the past, PostgreSQL can no longer determine whether those tuples are visible. It would treat ancient data as “in the future.”

PostgreSQL prevents this with:

  • autovacuum_freeze_max_age (default: 200 million transactions) — autovacuum freezes tuples before they reach the danger zone.
  • Freezing: tuples older than vacuum_freeze_min_age have their xmin replaced with a special FrozenTransactionId (XID 2), which is always considered older than every real XID.
  • Emergency shutdown: PostgreSQL will refuse to start new transactions and print a prominent warning if you approach wraparound.
-- Check how close each database is to XID wraparound
SELECT datname,
       age(datfrozenxid) AS xid_age,
       2^31 - age(datfrozenxid) AS xids_remaining
FROM pg_database
ORDER BY xid_age DESC;

A busy OLTP database should keep xid_age well below 200 million. Production incidents from XID wraparound have taken down large sites — it is a known operational hazard.


Snapshots and Visibility Rules

When a transaction begins (or, at read committed level, when each statement begins), PostgreSQL takes a snapshot of the currently active transactions:

Snapshot = {
  xmin: lowest active XID    (all XIDs below this are committed or aborted)
  xmax: next XID to be assigned  (all XIDs >= this don't exist yet)
  xip:  list of in-progress XIDs between xmin and xmax
}

The visibility check for a tuple with (xmin=A, xmax=B) is:

VISIBLE if:
  xmin is committed AND xmin < snapshot.xmax AND xmin NOT IN snapshot.xip
  AND
  (xmax = 0   // not deleted
   OR xmax is aborted
   OR xmax >= snapshot.xmax  // deleted by a future transaction
   OR xmax IN snapshot.xip)  // deleted by a concurrent (not yet committed) transaction

In pseudocode:

function isTupleVisible(tuple: Tuple, snapshot: Snapshot): boolean {
  // Creator must be committed and visible to snapshot
  if (!isCommitted(tuple.xmin)) return false;
  if (tuple.xmin >= snapshot.xmax) return false;
  if (snapshot.xip.has(tuple.xmin)) return false;

  // If deleted, the deleter must be invisible to snapshot
  if (tuple.xmax !== 0 && isCommitted(tuple.xmax)) {
    if (tuple.xmax < snapshot.xmax && !snapshot.xip.has(tuple.xmax)) {
      return false; // deleted by a committed, visible transaction
    }
  }
  return true;
}

The infomask bits cache whether xmin/xmax are committed, so PostgreSQL usually avoids consulting pg_xact (the commit log) on the hot path.


Isolation Levels in MVCC Terms

SQL defines four isolation levels. MVCC implements them differently than 2PL does.

Read Committed

The snapshot is taken per statement, not per transaction. Each query sees the latest committed data at the moment it starts.

-- Session A:
BEGIN;
SELECT balance FROM accounts WHERE id = 1; -- sees 1000

-- Session B (concurrent):
UPDATE accounts SET balance = 800 WHERE id = 1;
COMMIT;

-- Session A continues:
SELECT balance FROM accounts WHERE id = 1; -- now sees 800 (non-repeatable read!)
COMMIT;

Anomaly exposed: non-repeatable reads. The same query returns different results within the same transaction. Fine for many workloads; dangerous for anything that joins data from multiple reads and assumes consistency.

Repeatable Read (Snapshot Isolation)

The snapshot is taken once at transaction start. All reads in the transaction see the same consistent point in time. Non-repeatable reads are eliminated.

PostgreSQL’s “Repeatable Read” is actually Snapshot Isolation (SI), which is stronger than what SQL standard requires but still not fully serializable.

Anomaly exposed: write skew (see below).

Serializable (SSI)

PostgreSQL’s Serializable uses Serializable Snapshot Isolation (SSI), which adds predicate tracking on top of SI. This is the gold standard: the execution is guaranteed to be equivalent to some serial order.


Write Skew and Serialization Anomalies

Write skew is the canonical SI anomaly. It cannot happen in a fully serializable system.

The Doctor On-Call Problem

-- Rule: at least one doctor must be on-call at all times
-- Currently: Alice=on-call, Bob=on-call

-- Transaction 1 (Alice going off-call):
BEGIN;
SELECT COUNT(*) FROM doctors WHERE on_call = true; -- sees 2, safe to proceed
UPDATE doctors SET on_call = false WHERE name = 'Alice';
COMMIT;

-- Transaction 2 (Bob going off-call, concurrent):
BEGIN;
SELECT COUNT(*) FROM doctors WHERE on_call = true; -- also sees 2, safe to proceed
UPDATE doctors SET on_call = false WHERE name = 'Bob';
COMMIT;

-- Result: both committed, zero doctors on-call. Rule violated.

Both transactions read a consistent snapshot, but their writes conflict on a logical predicate that neither individually violated. This is write skew under SI.

The Bank Account Anomaly

-- Constraint: checking + savings >= 0
-- checking = 70, savings = 30

-- T1: withdraw 80 from checking (reads savings=30, total=100, ok)
-- T2: withdraw 80 from savings  (reads checking=70, total=100, ok)
-- Both commit → checking=-10, savings=-50. Constraint violated.

How PostgreSQL SSI Works: SIREAD Locks

PostgreSQL SSI (introduced in 9.1, based on the Cahill et al. paper) tracks read/write conflicts between transactions using lightweight SIREAD locks.

A SIREAD lock records “transaction T read predicate P” without blocking anyone. When T2 writes to a range that T1 has a SIREAD lock on, the system records a rw-conflict (T1 →rw T2). If it detects a dangerous cycle of rw-conflicts (T1 →rw T2 →rw T1, or a three-way cycle), one transaction is aborted with:

ERROR: could not serialize access due to read/write dependencies among transactions

The key insight: SSI is optimistic. It doesn’t block anything. It aborts at commit time if a cycle is detected. Applications must retry.

-- Set the isolation level
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- All reads acquire SIREAD locks automatically
SELECT COUNT(*) FROM doctors WHERE on_call = true;
UPDATE doctors SET on_call = false WHERE name = 'Alice';
COMMIT; -- PostgreSQL checks for cycles here

The overhead of SSI is low for most workloads — SIREAD locks are in-memory and much lighter than row-level locks.


VACUUM and Dead Tuple Cleanup

Because MVCC retains old tuple versions, dead tuples accumulate. Without cleanup:

  1. Table bloat: pages fill with dead tuples; sequential scans slow down.
  2. Index bloat: index entries point to dead heap tuples; index scans waste I/O.
  3. XID wraparound risk (described above).

VACUUM performs:

  1. Scan heap pages for dead tuples (xmax committed, no active snapshot can see them).
  2. Mark dead tuple slots as reusable.
  3. Update visibility maps (tracks pages with all-visible tuples, used by index-only scans).
  4. Update pg_class.relfrozenxid to record the oldest live XID in the table.
  5. Optionally VACUUM FULL compacts the table (requires exclusive lock — avoid in production).

autovacuum runs VACUUM automatically based on thresholds:

trigger when dead_tuples > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples

Default: trigger when 20% of rows are dead. For a 10-million-row table, that’s 2 million dead tuples before autovacuum fires — potentially gigabytes of bloat. High-write tables often need aggressive autovacuum tuning:

-- Per-table autovacuum tuning
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,   -- trigger at 1% dead tuples
  autovacuum_vacuum_cost_delay = 2         -- less aggressive throttling
);

MySQL InnoDB: Undo Log MVCC

InnoDB implements MVCC differently from PostgreSQL. Instead of storing multiple versions in the heap, InnoDB stores only the current row in the B-tree and keeps older versions in a separate undo log.

InnoDB row (current)
    ┌─────────────────────────────────┐
    │ data | trx_id | roll_pointer ───┼──► undo log record (version N-1)
    └─────────────────────────────────┘                        │

                                                    undo log record (version N-2)


                                                            ...

When a reader needs an older version, InnoDB reconstructs it by applying undo records backwards from the current version. This is called a read view.

Comparison:

AspectPostgreSQLMySQL InnoDB
Old versions storedIn heap (same page)In separate undo log
Read path (old version)Find tuple in heap directlyReconstruct from undo chain
Write amplificationHigher (full tuple written)Lower (undo delta only)
VACUUM neededYes (dead tuples in heap)Undo log purged by purge thread
Index updatesHOT avoids index write if possibleAlways updates clustered index
Long-running transactionsBloats tableBloats undo log

Both approaches have real costs with long-running transactions: PostgreSQL can’t VACUUM dead tuples referenced by open snapshots; InnoDB’s undo log grows unbounded.


HOT Updates: Avoiding Index Churn

When PostgreSQL updates a row and the indexed columns do not change, it can use a Heap-Only Tuple (HOT) update:

  1. The new tuple is written on the same heap page as the old one.
  2. The old tuple’s ctid is updated to point to the new tuple (forming a HOT chain).
  3. No index entry is written for the new tuple.

Index entries still point to the old tuple’s ctid, but a HOT chain traversal leads to the current version. This avoids index bloat for updates to non-indexed columns — a significant win for tables with many indexes.

HOT is only possible when the page has enough free space. Fill factor tuning matters:

CREATE TABLE accounts (id int, balance int, updated_at timestamptz)
WITH (fillfactor = 70); -- leave 30% free for HOT updates

MVCC vs. Two-Phase Locking

PropertyMVCC2PL
Readers block writersNoYes
Writers block readersNoYes
Deadlocks possibleNo (readers) / Yes (writers)Yes
Read anomaliesDepends on levelDepends on level
SerializableSSI (optimistic)S2PL (pessimistic)
Throughput (read-heavy)ExcellentPoor
Throughput (write-heavy, conflicts)GoodPoor (deadlocks, waits)
Storage overheadDead tuples / undo logLock table
Predicate lockingSIREAD (SSI)Required for serializable

2PL remains used in some embedded databases and older systems. For general-purpose OLTP, MVCC won decisively by the 1990s.


Interview Deep-Dive: Concurrent Reads and Writes in PostgreSQL

Q: Explain how PostgreSQL handles concurrent reads and writes.

A complete answer:

PostgreSQL uses MVCC. Every row version has xmin (the XID that created it) and xmax (the XID that deleted it). When a transaction begins, it takes a snapshot — recording the current XID watermark and the set of in-progress transactions. Every read uses this snapshot to determine which tuple version is visible: a tuple is visible if its creator committed before the snapshot and its deleter either hasn’t committed or committed after the snapshot.

This means readers never block writers. A long-running analytical query can scan millions of rows while concurrent transactions update those same rows — the reader sees the state as of its snapshot start, and writers produce new tuple versions without interfering.

The cost is storage: dead tuples accumulate until VACUUM reclaims them. VACUUM must not remove any version that an open snapshot could still need — so a long-running transaction keeps dead tuples alive, causing bloat.

At the Serializable isolation level, PostgreSQL adds SSI: SIREAD locks track which predicates each transaction read. At commit time, the system checks for rw-conflict cycles that would indicate a non-serializable execution, aborting one transaction if a cycle exists.


TypeScript Pseudocode: MVCC Visibility Check

type XID = number;
type TupleVersion = {
  xmin: XID;
  xmax: XID; // 0 means not deleted
  data: Record<string, unknown>;
};

type Snapshot = {
  xmin: XID;       // all XIDs below this are committed
  xmax: XID;       // all XIDs >= this don't exist yet
  xip: Set<XID>;   // in-progress XIDs in [xmin, xmax)
};

type CommitLog = Map<XID, "committed" | "aborted" | "in-progress">;

function isXidVisible(xid: XID, snapshot: Snapshot, clog: CommitLog): boolean {
  if (xid >= snapshot.xmax) return false;         // future transaction
  if (snapshot.xip.has(xid)) return false;        // in-progress at snapshot time
  return clog.get(xid) === "committed";           // must be committed
}

function isTupleVisible(
  tuple: TupleVersion,
  snapshot: Snapshot,
  clog: CommitLog
): boolean {
  // Creator must be visible
  if (!isXidVisible(tuple.xmin, snapshot, clog)) return false;

  // If deleted, the deleter must NOT be visible
  if (tuple.xmax !== 0) {
    if (isXidVisible(tuple.xmax, snapshot, clog)) return false;
  }
  return true;
}

function mvccScan(
  versions: TupleVersion[],
  snapshot: Snapshot,
  clog: CommitLog
): TupleVersion[] {
  return versions.filter(t => isTupleVisible(t, snapshot, clog));
}

// Simulate an UPDATE
function mvccUpdate(
  heap: TupleVersion[],
  oldTupleIndex: number,
  newData: Record<string, unknown>,
  currentXid: XID
): void {
  // Stamp xmax on old version
  heap[oldTupleIndex].xmax = currentXid;
  // Append new version
  heap.push({ xmin: currentXid, xmax: 0, data: newData });
}

Phantom Reads: The Isolation Level Gap

Phantom reads are distinct from non-repeatable reads and require special handling.

Non-repeatable read: a row you already read changes value between two reads in the same transaction.

Phantom read: a new row appears (or disappears) between two reads of the same predicate.

-- T1 (Repeatable Read / Snapshot Isolation):
BEGIN;
SELECT * FROM orders WHERE amount > 1000; -- returns 5 rows

-- T2 (concurrent, commits):
INSERT INTO orders (amount) VALUES (5000);
COMMIT;

-- T1 continues:
SELECT * FROM orders WHERE amount > 1000; -- still 5 rows under SI
-- The new row is invisible because T1's snapshot predates T2's commit

Under Snapshot Isolation, phantom reads on set membership are actually prevented — T1 sees a consistent snapshot and the new row doesn’t appear. However, under Read Committed they do appear. More dangerously, a transaction may act on a predicate (e.g., count rows) and then modify data based on that count — creating a write skew even without seeing the new row in a second read.

This is why the “phantom read” column in SQL isolation level tables is subtly different from what most tutorials explain. The real phantom hazard under SI is the write skew on predicates, handled only by SSI’s SIREAD locks.


Oracle MVCC: A Different Approach

Oracle was the first commercial database to implement MVCC (circa 1984). Oracle’s approach differs from both PostgreSQL and InnoDB:

  • Oracle stores only the current version of a row in the data block (like InnoDB).
  • Old versions are stored in a rollback segment (undo tablespace).
  • Each data block has an Interested Transaction List (ITL) — a small header listing which transactions have modified rows in this block, and pointers to their undo entries.
  • When a reader needs an older version, Oracle reconstructs it using the undo chain.

The read consistency unit in Oracle is the query SCN (System Change Number) — Oracle’s equivalent of a snapshot. Each query sees data as of its SCN. By default, Oracle uses Statement-level read consistency (equivalent to Read Committed). Setting SET TRANSACTION ISOLATION LEVEL SERIALIZABLE in Oracle gives snapshot isolation at transaction start.

Oracle has never implemented SSI — it relies on application-level locking (SELECT ... FOR UPDATE) or manual conflict detection for workloads that require serializability.


Observing MVCC Internals in PostgreSQL

PostgreSQL exposes MVCC internals through several system views and functions:

-- See all active snapshots (requires pg_stat_activity access)
SELECT pid, query, xact_start, state, backend_xid, backend_xmin
FROM pg_stat_activity
WHERE state != 'idle';

-- Find the oldest active transaction (limits VACUUM)
SELECT min(backend_xmin) AS oldest_xmin FROM pg_stat_activity;

-- Estimate table bloat from dead tuples
SELECT relname,
       n_dead_tup,
       n_live_tup,
       round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Inspect actual tuple visibility (requires pageinspect extension)
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT t_xmin, t_xmax, t_infomask, t_data
FROM heap_page_items(get_raw_page('accounts', 0));

Key operational insight: a long-running transaction or an idle-in-transaction session holds backend_xmin at its snapshot’s xmin value. This prevents VACUUM from removing any dead tuple newer than that xmin — across the entire database, not just the tables that transaction touched. One abandoned session can cause unbounded bloat.

-- Kill idle-in-transaction sessions older than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '10 minutes';

-- Set a statement timeout and idle-in-transaction timeout in postgresql.conf:
-- statement_timeout = '30s'
-- idle_in_transaction_session_timeout = '5min'

Key Takeaways

  • MVCC stores multiple versions of each row; readers pick the version current at their snapshot time.
  • PostgreSQL encodes versions as xmin/xmax on heap tuples; InnoDB and Oracle use undo logs.
  • XID wraparound is a real operational risk — VACUUM and freezing are mandatory; monitor age(datfrozenxid).
  • Read Committed takes a snapshot per statement; Snapshot Isolation takes one per transaction.
  • Write skew is the fundamental SI anomaly — only SSI (PostgreSQL’s Serializable) prevents it via SIREAD locks.
  • VACUUM reclaims dead tuples; long-running transactions (including idle-in-transaction) prevent this and cause bloat.
  • HOT updates avoid index writes when indexed columns are unchanged; tune fillfactor for write-heavy tables.
  • One idle-in-transaction session can stall VACUUM across the entire database — set idle_in_transaction_session_timeout.