Core Architecture · Topic 14 of 21

Storage Engines & B-Trees

300 XP

The Fundamental Problem of Persistent Storage

Every database must answer the same question: given that disk I/O is orders of magnitude slower than CPU and RAM, and that random access is far slower than sequential access — how do you store and retrieve data efficiently?

The answer depends on your workload. Systems optimized for reads look different from systems optimized for writes. Two data structures dominate: B-trees (read-optimized, in-place updates) and LSM trees (write-optimized, out-of-place append). Understanding their mechanics at the implementation level explains every major database design decision made in the last 40 years.


B-Tree: Structure and Properties

A B-tree (more precisely, a B+ tree in most databases) is a balanced, ordered tree where:

  • All data lives in leaf nodes; internal nodes hold only keys and child pointers.
  • Every node corresponds to one disk page (typically 4KB, 8KB, or 16KB).
  • All leaf nodes are linked in a doubly-linked list, enabling efficient range scans.
  • The tree is always balanced: all leaves are at the same depth.

Branching Factor and Tree Height

For a page size of 8KB and a key size of 8 bytes plus a 6-byte page pointer (14 bytes per entry), one internal node holds:

8192 bytes / 14 bytes ≈ 585 entries per internal node

Height of the tree for N leaf pages:

Height = ceil(log_585(N))

N leaf pages = 1 million  → height ≈ ceil(log_585(1,000,000)) = 4
N leaf pages = 1 billion  → height ≈ ceil(log_585(1,000,000,000)) = 6

A billion-row table requires at most 6 disk seeks to find a row by primary key. That is the magic of B-trees.

Root (1 page)
  ├── Internal (585 pages)
  │     ├── Leaf  (585 × 585 = 342,225 pages)
  │     │      each leaf stores multiple rows
  ...

B-Tree Write Path

  1. Descend from root to the leaf page that should contain the new key (O(log n) page reads).
  2. Insert the key into the leaf. If the leaf is not full (fill factor), done — write one page.
  3. Split on overflow: the leaf is split into two half-full pages. The middle key is promoted to the parent. The parent may also need to split, cascading upward.
  4. Write all modified pages to disk (via WAL first — see the ACID & WAL article).

A routine insert touches 1 page (no split) to O(log n) pages (cascading splits, rare in practice for typical fill factors).

B-Tree Read Path

A point lookup: read root → internal nodes → leaf. Each hop is one page read. O(log n) page I/Os.

A range scan: find the starting leaf, then follow the linked list of leaf nodes. Effectively sequential I/O once the start is found.

Write Amplification in B-Trees

A single logical write produces:

  1. WAL record (sequential write to log)
  2. Dirty heap page (random write, buffered)
  3. One or more dirty index pages (random writes)
  4. All of the above fsynced at checkpoint time

For a table with 5 indexes, a single row INSERT might dirty 6 pages. The actual page size is 8KB. Write amplification: 6 × 8KB = 48KB for a 100-byte logical write — ~480x amplification at the storage level.

This is manageable for OLTP (the buffer pool absorbs many random writes), but it becomes the bottleneck at very high write throughput.


LSM Tree: Log-Structured Merge-Tree

The LSM tree was proposed by Patrick O’Neil et al. in 1996 and implemented in production by Google’s BigTable (2006), which spawned LevelDB, RocksDB, Cassandra, HBase, and countless others.

The core insight: convert random writes into sequential writes by buffering all writes in memory and periodically flushing sorted, immutable files to disk.

LSM Write Path

Write("alice", 1000)


   WAL (sequential, append-only) ────► fsync ────► ACK to client


   MemTable (in-memory sorted structure, usually a skip list or red-black tree)

      │ (when MemTable exceeds threshold, e.g. 64MB)

   Flush to L0 SSTable (Sorted String Table, immutable file on disk)

      │ (background compaction process)

   L1 SSTables → L2 SSTables → ... (larger, fewer, sorted and non-overlapping)

Every write is a sequential append: WAL is sequential, MemTable flush is sequential (writes a sorted file), compaction reads and writes sequentially. There are no random writes in the steady state.

SSTable Structure

An SSTable is an immutable, sorted file on disk:

SSTable: data.sst
┌─────────────────────────────────────────────────────┐
│ Data blocks (sorted key-value pairs, compressed)    │
│   "alice" → 1000                                    │
│   "bob"   → 500                                     │
│   "carol" → 750                                     │
├─────────────────────────────────────────────────────┤
│ Index block (sparse index: key → block offset)      │
│   "alice" → offset 0                                │
│   "carol" → offset 4096                             │
├─────────────────────────────────────────────────────┤
│ Bloom filter (probabilistic set membership test)    │
├─────────────────────────────────────────────────────┤
│ Footer (index block offset, bloom filter offset)    │
└─────────────────────────────────────────────────────┘

LSM Read Path

A point lookup must check every level:

  1. Check the MemTable (in-memory, O(log n)).
  2. Check each L0 SSTable in recency order (L0 files can have overlapping key ranges).
  3. For L1 and beyond: use the level’s index to find which SSTable might have the key, then check that SSTable’s bloom filter, then (if positive) do a binary search within the SSTable.

In the worst case (key doesn’t exist), this touches every level. Read amplification is the LSM’s Achilles heel.

Bloom Filters

A bloom filter is a probabilistic data structure that can definitively say “this key is NOT in this SSTable” (no false negatives) and probabilistically say “this key might be in this SSTable” (false positives possible).

For a bloom filter with m bits, k hash functions, and n inserted elements, the false positive rate (FPR) is:

FPR = (1 - e^(-kn/m))^k

Optimal k for given m and n: k = (m/n) × ln(2) ≈ 0.693 × (m/n)

In practice, 10 bits per key → FPR ≈ 0.8%. For a 100-million-key SSTable, that’s 1.25MB of bloom filter data. At an FPR of 0.8%, 99.2% of unnecessary SSTable lookups are skipped — a massive read optimization.


Compaction Strategies

SSTables accumulate. Without compaction, reads would need to check too many files, and storage would contain too many obsolete versions of keys. Compaction merges SSTables, discards obsolete entries, and enforces the level structure.

Size-Tiered Compaction (Cassandra default)

Group SSTables by size. When a size tier has enough members (typically 4), merge them into one larger SSTable. Simple but creates temporary space amplification (the merged output plus all inputs exist simultaneously during compaction).

Tier 1: [4MB] [4MB] [4MB] [4MB] → merge → [16MB]
Tier 2: [16MB] [16MB] [16MB] [16MB] → merge → [64MB]
...

Leveled Compaction (RocksDB/LevelDB default)

L0 files can have overlapping key ranges. When L0 reaches a threshold (typically 4 files), they are merged into L1. L1 files cover non-overlapping key ranges and have a fixed total size budget (e.g., 10MB). When L1 overflows, one L1 file is merged into the overlapping L2 files. And so on.

This enforces that all levels L1+ have non-overlapping key ranges, which dramatically reduces read amplification (at most one file per level to check for a given key) at the cost of more write amplification (a key may be compacted many times as it moves through levels).

FIFO Compaction

Simply delete old SSTables when total size exceeds a budget. Only appropriate for time-series data where old data naturally expires.


The RUM Conjecture

O’Neil et al.’s insight is formalized as the RUM conjecture:

You cannot simultaneously optimize Read amplification, Update (write) amplification, and Memory (space) amplification. Optimizing any two worsens the third.

StructureRead AmpWrite AmpSpace Amp
B-treeLowMediumLow
LSM (Size-Tiered)HighLowHigh
LSM (Leveled)Low-MediumHighLow-Medium
Hash tableO(1)LowHigh (with chaining)

This is why there is no universally best storage engine. The right choice depends on workload read/write ratio, key size, value size, and access patterns.


B-Tree vs LSM: When Each Wins

ScenarioWinnerReason
OLTP: mostly reads, some updatesB-treeLow read amplification; random writes absorbed by buffer pool
Write-heavy: continuous high-throughput insertsLSMAll writes are sequential; much lower write amp
Time-series, event logs, append-onlyLSMNever updates old data; compaction discards naturally
Analytical queries, range scansB-treeLeaf link list gives sequential scan; predictable I/O
Key-value store with large valuesEitherLSM often separates keys and values (WiscKey/Titan)
Mixed OLTP with many secondary indexesB-treeLSM secondary indexes are complex and expensive

Real databases:

  • PostgreSQL, MySQL: B-tree heap + B-tree indexes
  • RocksDB (embeds in CockroachDB, TiKV, MyRocks, Yugabyte): LSM
  • Cassandra, HBase, ScyllaDB: LSM
  • SQLite: B-tree
  • FoundationDB: custom B-tree with MVCC

Column Stores vs Row Stores

Everything discussed so far stores data row by row (NSM: N-ary Storage Model). Each page contains complete rows.

For OLAP (analytical queries that aggregate millions of rows but touch only a few columns), row stores are wasteful: a query like SELECT AVG(price) FROM orders reads every column of every row into cache, even though it only needs price.

Column stores (DSM: Decomposition Storage Model) store each column as a separate file or segment:

Row store page (8KB):    Column store segment:
[id][name][age][salary]  [id]: 1,2,3,4,5,...
[1][Alice][30][100000]   [name]: Alice,Bob,...
[2][Bob][25][80000]      [age]: 30,25,...
[3][Carol][35][120000]   [salary]: 100000,80000,...  ← only this needed

Advantages for OLAP:

  • Higher cache efficiency: a query touching 1 of 100 columns reads 1/100 of the data.
  • Better compression: a column of the same data type compresses much better than mixed-type rows (run-length encoding, dictionary encoding, bitpacking).
  • SIMD operations: modern CPUs can aggregate a column of int64 values 8 at a time using AVX-512.

Column stores: ClickHouse, Redshift, BigQuery, Snowflake, DuckDB, Apache Parquet (file format).


PostgreSQL Index Types

PostgreSQL is unusual in offering multiple index access methods:

Index TypeStructureBest For
B-treeBalanced treeEquality, range, ORDER BY — the default
HashHash table (on-disk)Equality only, slightly faster for point lookups
BRINMin/max per block rangeVery large tables with physically ordered data (time-series, log tables)
GiSTGeneralized Search Tree (extensible)Geometric data, full-text, nearest-neighbor
GINGeneralized Inverted IndexArray containment, JSONB, full-text search
SP-GiSTSpace-partitioned GiSTNon-overlapping spaces: quadtrees, radix trees

BRIN is the most misunderstood. It stores only (min, max) for each range of consecutive heap pages. For a 1-billion-row time-series table ordered by created_at, a BRIN index of pages_per_range=128 is ~1MB and eliminates 99%+ of pages for a date-range query. The B-tree equivalent would be hundreds of megabytes.


TypeScript Pseudocode: LSM MemTable and SSTable Flush

// MemTable: sorted in-memory buffer (backed by a skip list or red-black tree in production)
class MemTable {
  private data = new Map<string, string | null>(); // null = tombstone (DELETE)
  private sizeBytes = 0;
  readonly maxSizeBytes: number;

  constructor(maxSizeBytes = 64 * 1024 * 1024) { // 64MB default
    this.maxSizeBytes = maxSizeBytes;
  }

  put(key: string, value: string): void {
    this.data.set(key, value);
    this.sizeBytes += key.length + value.length;
  }

  delete(key: string): void {
    this.data.set(key, null); // Write a tombstone
    this.sizeBytes += key.length + 8;
  }

  get(key: string): string | null | undefined {
    return this.data.get(key); // undefined = not present; null = deleted
  }

  isFull(): boolean {
    return this.sizeBytes >= this.maxSizeBytes;
  }

  // Return sorted entries for SSTable flush
  sortedEntries(): [string, string | null][] {
    return [...this.data.entries()].sort(([a], [b]) => a.localeCompare(b));
  }
}

// SSTable: immutable sorted file (simplified in-memory representation)
class SSTable {
  private index: Map<string, number> = new Map(); // key → byte offset
  private readonly entries: [string, string | null][];

  constructor(sortedEntries: [string, string | null][]) {
    this.entries = sortedEntries;
    sortedEntries.forEach(([key], i) => this.index.set(key, i));
  }

  get(key: string): string | null | undefined {
    const idx = this.index.get(key);
    if (idx === undefined) return undefined;
    return this.entries[idx][1]; // null = tombstone
  }

  // In production: bloom filter check before index lookup
  mightContain(_key: string): boolean {
    return true; // simplified; real impl uses bloom filter
  }
}

// LSM Engine: orchestrates MemTable, immutable MemTable, and SSTables
class LSMEngine {
  private memTable = new MemTable();
  private wal: string[] = []; // simplified WAL (would be a file)
  private ssTables: SSTable[] = []; // L0 SSTables, newest first

  put(key: string, value: string): void {
    this.wal.push(`PUT ${key} ${value}`); // WAL first
    this.memTable.put(key, value);
    if (this.memTable.isFull()) this.flush();
  }

  delete(key: string): void {
    this.wal.push(`DELETE ${key}`);
    this.memTable.delete(key);
    if (this.memTable.isFull()) this.flush();
  }

  get(key: string): string | undefined {
    // Check MemTable first (most recent)
    const memResult = this.memTable.get(key);
    if (memResult !== undefined) {
      return memResult === null ? undefined : memResult; // null = tombstone
    }

    // Check SSTables from newest to oldest
    for (const ssTable of this.ssTables) {
      if (!ssTable.mightContain(key)) continue; // bloom filter
      const result = ssTable.get(key);
      if (result !== undefined) {
        return result === null ? undefined : result;
      }
    }
    return undefined;
  }

  private flush(): void {
    const ssTable = new SSTable(this.memTable.sortedEntries());
    this.ssTables.unshift(ssTable); // newest first
    this.memTable = new MemTable();
    this.wal = []; // truncate WAL after successful flush
    // In production: trigger background compaction check here
  }
}

Interview Deep-Dive: Why Does Cassandra Use an LSM Tree?

Q: Why does Cassandra use an LSM tree instead of a B-tree?

A complete answer:

Cassandra is designed for write-heavy, time-series and event workloads at massive scale. Its primary design constraint is: ingest data as fast as possible across many nodes.

An LSM tree converts every write into two sequential operations: append to WAL, insert into MemTable. When the MemTable fills, it’s flushed as an immutable SSTable file. There are no random writes. At 100,000 writes/second, this is far more efficient than a B-tree that would scatter dirty pages across a multi-gigabyte file.

The read trade-off: Cassandra reads must check multiple SSTables. It mitigates this with bloom filters (eliminate 99%+ of unnecessary SSTable reads) and compaction (reduce the number of SSTables over time). But reads are still slower than B-trees for arbitrary key lookups — which is acceptable for Cassandra’s targeted access patterns (always query by partition key).

The operational trade-off: compaction consumes I/O and CPU in the background. A poorly-sized cluster with insufficient compaction throughput will accumulate too many L0 SSTables, degrading reads significantly. This is a known operational challenge with LSM-based systems.


Concrete Performance Numbers

Real storage engine performance at scale (approximate, varies by hardware):

B-tree (PostgreSQL, 8KB pages, NVMe SSD):

  • Point lookup: 3-5 page reads for a 100M row table → 3-5 × 100µs = 300-500µs
  • Write throughput: limited by random write amplification → typically 20,000-50,000 rows/sec sustained on a single writer
  • Read throughput: buffer pool hit rate dominates; 95% hit rate → sub-millisecond reads
  • Index size for 100M rows (8-byte int key): ~3.5GB

LSM (RocksDB, 64MB MemTable, Leveled compaction):

  • Write throughput: 200,000-500,000 rows/sec (all sequential writes)
  • Read latency: MemTable hit → <1µs; L1 SSTable → 100-300µs; L5 SSTable → 1-5ms (bloom filter eliminates most)
  • Compaction write amplification (Leveled): 10-30x the actual data size written
  • Space amplification (Leveled): 1.1-1.3x (much lower than Size-Tiered)

Write amplification comparison on a write-heavy workload (100K writes/sec, 64-byte values):

Storage rate (MB/s):
  B-tree (1 index):   ~400 MB/s written to storage
  LSM (Leveled):      ~640 MB/s written to storage (10x amp × 64B × 100K/s)
  LSM (Size-Tiered):  ~320 MB/s written to storage (5x amp) — but reads suffer

This illustrates that LSM Leveled compaction has higher write amplification than B-tree for hot data that fits in the buffer pool — the advantage is for workloads where data is too large for the buffer pool or where writes arrive faster than B-trees can absorb random I/O.


Crash Recovery in LSM Trees

When an LSM-based process crashes, the MemTable (in-memory) is lost. Recovery uses the WAL:

On startup after crash:
  1. Open WAL file
  2. Replay all records since last MemTable flush
  3. Reconstruct the MemTable
  4. Resume normal operation

Checkpoint optimization:
  RocksDB writes a MANIFEST file tracking which SSTables belong to which level.
  On restart, the MANIFEST tells RocksDB the exact L0-Ln layout without scanning all SSTables.
  Only the WAL since the last MemTable flush needs replaying.

The WAL in LSM systems is conceptually identical to the WAL in B-tree systems — it’s the “log before data” guarantee applied to the MemTable → SSTable transition instead of the buffer pool → data file transition.


Key Takeaways

  • B-trees offer O(log n) reads in 4-6 disk seeks for billion-row tables; writes update pages in place with WAL.
  • B-tree write amplification is 5-20x; acceptable for OLTP but a bottleneck at extreme write throughput.
  • LSM trees buffer writes in memory (MemTable) and flush immutable SSTables; all I/O is sequential.
  • LSM read amplification is mitigated by bloom filters (per-SSTable probabilistic membership test) and compaction.
  • Leveled compaction (RocksDB) minimizes space and read amplification at the cost of higher write amplification vs Size-Tiered.
  • The RUM conjecture: you cannot simultaneously minimize read, write, and space amplification.
  • Column stores win for OLAP by storing each column separately, enabling vectorized operations and better compression.
  • B-tree buffer pool hit rate (~95% for OLTP) is why B-trees win for hot-data workloads; LSM wins when writes exceed I/O capacity.
  • BRIN indexes are dramatically smaller than B-trees for physically ordered data — use them for time-series tables.