Core Architecture · Topic 15 of 21

Indexing & Query Execution

300 XP

Why Indexes Exist

A database table without an index is a heap: an unordered collection of rows. Finding a row requires scanning every row — O(n) cost. With 10 million rows at 100 bytes each = 1GB of data that must be read from disk. At 500MB/s sequential read, that’s 2 seconds for every query.

An index is a separate data structure that maintains a sorted mapping from key values to heap locations. With a B-tree index on the same 10M row table, finding a row costs O(log n) = ~24 comparisons, each hitting a cached B-tree node. The index itself is small enough to fit in the database buffer pool.

Sequential Scan:
  Disk pages: [page1][page2][page3]...[page65536]
  Cost: O(n) — read all pages

B-tree Index Scan:
  Index:      root → [page, key range]
                       └──► internal node
                              └──► leaf node → [heap TID]
  Cost: O(log n) to find leaf + O(k) heap fetches for k matching rows

The crossover point: if a query returns >5-10% of rows, the planner prefers a sequential scan (predictable sequential I/O) over an index scan (random I/O). This is counterintuitive — index scans are not always faster.


B-Tree Index Internals

PostgreSQL (and MySQL InnoDB, SQLite) uses a variant of the B+ tree: all data lives in leaf nodes; internal nodes contain only keys for routing. Leaf nodes are linked in sorted order, enabling efficient range scans.

B+ Tree with order 3 (max 3 keys per node):

Internal:       [15 | 30]
                /    |    \
Leaf:    [5|8|12]  [15|20|25]  [30|40|50]
              ↔         ↔           ↔
         (linked list at leaf level — range scan walks the list)

Properties:

  • Balanced: all leaf nodes at same depth → guaranteed O(log n) worst case
  • Nodes are sized to match a disk page (typically 8KB in PostgreSQL)
  • A page holds ~400 keys at 20 bytes each → log_400(10M) ≈ 3 levels for 10M rows
  • Practical depth: 3-4 levels even for billion-row tables

The left-most prefix rule governs composite index usage. Given CREATE INDEX idx ON orders(customer_id, status, created_at):

Query predicateUses index?Why
WHERE customer_id = 5✅ YesUses first column
WHERE customer_id = 5 AND status = 'pending'✅ YesUses first two columns
WHERE customer_id = 5 AND created_at > '2024-01-01'✅ PartialSkips status, uses customer_id then scans
WHERE status = 'pending'❌ NoSkips first column
WHERE created_at > '2024-01-01'❌ NoSkips first two columns

Composite index key order matters. Put equality predicates first, range predicates last. High-cardinality columns first (reduces the result set faster).


Covering Indexes and Index-Only Scans

A covering index includes all columns needed by a query — the query can be answered entirely from the index without touching the heap (table rows).

-- Without covering index: index scan + heap fetch for every row
CREATE INDEX idx_orders_user ON orders(user_id);
SELECT user_id, status, total FROM orders WHERE user_id = 42;
-- Plan: Index Scan on idx_orders_user → fetch heap page for each row

-- With covering index: index-only scan
CREATE INDEX idx_orders_user_cover ON orders(user_id) INCLUDE (status, total);
SELECT user_id, status, total FROM orders WHERE user_id = 42;
-- Plan: Index Only Scan on idx_orders_user_cover → no heap access!

Index-only scans are dramatically faster when the table has many columns and rows. PostgreSQL must check the visibility map to ensure the covering index has current MVCC data; frequent updates reduce the benefit.

INCLUDE vs key columns: In PostgreSQL 11+, INCLUDE (col) adds columns to the leaf without affecting the sort order. Never put wide columns in the key — they inflate every level of the B-tree.


Hash Indexes

Hash indexes offer O(1) equality lookups by hashing the key to a bucket. PostgreSQL’s hash indexes are in-memory for the hot set and spill to disk.

CREATE INDEX idx_users_email_hash ON users USING HASH (email);

-- Fast: exact equality
SELECT * FROM users WHERE email = 'alice@example.com';

-- Useless: range queries have no concept of order
SELECT * FROM users WHERE email > 'a'; -- cannot use hash index

When to use hash indexes:

  • Only equality lookups, no range scans
  • Very high equality query rate
  • Key is wide (UUID, long string) — B-tree internal nodes must store full key copies; hash reduces to fixed-size hash

In practice, B-tree handles equality nearly as well as hash for most workloads. Hash indexes shine only in specialized high-volume equality lookup scenarios.


GiST and GIN Indexes

GiST (Generalized Search Tree): A framework for building balanced tree indexes for non-scalar types. Used for:

-- Full-text search
CREATE INDEX ON articles USING GIST (to_tsvector('english', body));

-- Geometric types (PostGIS)
CREATE INDEX ON locations USING GIST (geom);
SELECT * FROM locations WHERE ST_DWithin(geom, ST_Point(-73.9, 40.7), 1000);

-- Range types
CREATE INDEX ON reservations USING GIST (during);
SELECT * FROM reservations WHERE during && '[2024-01-10, 2024-01-15)';

GIN (Generalized Inverted Index): Maps each element of a composite value to a list of rows containing it. Used for:

-- JSONB containment
CREATE INDEX ON events USING GIN (payload jsonb_path_ops);
SELECT * FROM events WHERE payload @> '{"type": "click", "button": "buy"}';

-- Array containment
CREATE INDEX ON posts USING GIN (tags);
SELECT * FROM posts WHERE tags @> ARRAY['postgres', 'performance'];

-- Full-text search (faster lookups than GiST, slower to update)
CREATE INDEX ON articles USING GIN (to_tsvector('english', body));

GIN vs GiST trade-offs:

PropertyGINGiST
Lookup speedFasterSlower
Build timeSlowerFaster
Update performanceSlower (inverted index rebuild)Faster
Index sizeLargerSmaller

GIN is preferred for static/rarely-updated data. GiST for frequently-updated data.


Partial and Functional Indexes

Partial index: An index with a WHERE clause — only a subset of rows is indexed.

-- Index only pending orders (assume completed orders are rarely queried)
CREATE INDEX idx_orders_pending ON orders(user_id, created_at)
  WHERE status = 'pending';

-- The planner uses this index only for queries that match the WHERE condition
EXPLAIN SELECT * FROM orders WHERE user_id = 42 AND status = 'pending';
-- → Index Scan on idx_orders_pending

EXPLAIN SELECT * FROM orders WHERE user_id = 42 AND status = 'completed';
-- → Sequential Scan (partial index doesn't apply)

Benefits: smaller index (fits in memory), faster writes (fewer index entries to maintain), better statistics for the planner.

Functional index: Index on an expression rather than a raw column.

-- Case-insensitive email lookups
CREATE INDEX idx_users_lower_email ON users(lower(email));
SELECT * FROM users WHERE lower(email) = lower('Alice@EXAMPLE.COM');
-- → uses the functional index

-- Index on extracted JSONB field
CREATE INDEX idx_events_type ON events((payload->>'type'));
SELECT * FROM events WHERE payload->>'type' = 'purchase';

-- Index on date part
CREATE INDEX idx_orders_month ON orders(date_trunc('month', created_at));

The query’s WHERE clause must match the index expression exactly for the planner to recognize it.


Index Selectivity

Selectivity is the fraction of rows that match a given predicate. Low selectivity = many matching rows = index scan not beneficial.

-- Selectivity of different columns in a users table with 1M rows:
-- gender: 2 distinct values → selectivity = 0.5 (50% of rows match)
-- country: 200 distinct values → selectivity = 0.005 (0.5% match)
-- user_id: 1M distinct values → selectivity = 0.000001 (one row)

-- Index on gender is nearly useless — planner will ignore it
CREATE INDEX idx_users_gender ON users(gender);
EXPLAIN SELECT * FROM users WHERE gender = 'female'; -- Seq Scan!

-- Index on user_id is extremely useful
CREATE INDEX idx_users_id ON users(user_id);
EXPLAIN SELECT * FROM users WHERE user_id = 42; -- Index Scan, 1 row

The planner uses column statistics (pg_stats) to estimate selectivity. ANALYZE updates these statistics by sampling a fraction of the table. Stale statistics cause the planner to choose wrong plans.

-- View statistics for a column
SELECT n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Force statistics update
ANALYZE orders;

-- Increase statistics target for skewed distributions
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500; -- default is 100
ANALYZE orders;

Query Planning: Reading EXPLAIN ANALYZE

The cost-based optimizer (CBO) generates multiple candidate plans and picks the lowest-cost one. Cost is in arbitrary units (not milliseconds) calibrated by seq_page_cost, random_page_cost, cpu_tuple_cost.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, u.email, sum(oi.price)
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at > '2024-01-01'
GROUP BY o.id, u.email;

Sample output and how to read it:

Hash Join  (cost=15234.00..89234.00 rows=12000 width=48)
           (actual time=234.123..891.456 rows=11834 loops=1)
  Buffers: shared hit=4523 read=12341
  ->  Hash Join  (cost=...)
      ...
  ->  Seq Scan on orders  (cost=0.00..45234.00 rows=500000 width=24)
                          (actual time=0.034..234.678 rows=487234 loops=1)
        Filter: (created_at > '2024-01-01')
        Rows Removed by Filter: 512766
        Buffers: shared hit=2341 read=8923

Key signals:

  • cost=X..Y: X = startup cost (before first row), Y = total cost
  • rows=N (planner estimate) vs rows=M (actual): large discrepancy = stale stats
  • Buffers: hit=N read=M: hit = served from buffer pool (fast), read = from disk (slow). High read count with many loops = index not caching well.
  • loops=N on inner node = that node executed N times. Nested loop with loops=100000 is almost always a problem.

Join Algorithms

PostgreSQL chooses among three join algorithms based on table sizes, available memory, and whether inputs are sorted:

Nested Loop Join:

For each row in outer table:
  For each matching row in inner table:
    Emit pair

Cost: O(outer_rows × inner_rows) or O(outer_rows × log(inner_rows)) with index

Best for: small outer table, indexed inner table. The canonical “1 + N” case.

Hash Join:

Build phase: Read all of inner table → build hash table on join key
Probe phase: Read outer table → probe hash table for matches

Cost: O(inner_rows) build + O(outer_rows) probe = O(n + m)

Best for: large tables without useful indexes, unsorted data. Memory-intensive: if the hash table doesn’t fit in work_mem, PostgreSQL spills to disk (a “batch” hash join).

Merge Join:

Require: both inputs sorted on join key (or sortable within the plan)
Merge: walk both sorted lists simultaneously, emit matches

Cost: O(n log n + m log m) for sort + O(n + m) merge

Best for: pre-sorted inputs (from indexes), large tables, range joins. Avoids memory pressure of hash joins.

-- Force specific join type for testing (never do this in production)
SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Only nested loop remains

The N+1 Query Problem

The most common ORM-induced performance disaster:

// TypeScript with a naive ORM (pseudo-code)

// 1 query to fetch all orders
const orders = await db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);

// N queries — one per order!
for (const order of orders) {
  const items = await db.query(
    'SELECT * FROM order_items WHERE order_id = $1',
    [order.id]
  );
  order.items = items;
}
// Total: 1 + N queries. For 1000 orders: 1001 round-trips to the DB.
// Each round-trip: ~1ms network + query overhead = ~1 second total.

Fix: JOIN and map in application code:

// 1 query total
const rows = await db.query(`
  SELECT
    o.id          AS order_id,
    o.total,
    o.created_at,
    oi.id         AS item_id,
    oi.product_id,
    oi.quantity,
    oi.price
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.id
  WHERE o.user_id = $1
  ORDER BY o.id
`, [userId]);

// Group in application memory
const ordersMap = new Map<number, Order>();
for (const row of rows) {
  if (!ordersMap.has(row.order_id)) {
    ordersMap.set(row.order_id, {
      id: row.order_id,
      total: row.total,
      created_at: row.created_at,
      items: [],
    });
  }
  ordersMap.get(row.order_id)!.items.push({
    id: row.item_id,
    product_id: row.product_id,
    quantity: row.quantity,
    price: row.price,
  });
}
const orders = [...ordersMap.values()];

Alternative fix: IN clause batch loading (DataLoader pattern):

// Collect all IDs first, then batch load
const orderIds = orders.map(o => o.id);
const items = await db.query(
  'SELECT * FROM order_items WHERE order_id = ANY($1)',
  [orderIds]
);
// Group items by order_id in memory — 2 queries total regardless of N

Slow Query Patterns to Avoid

-- 1. LIKE with leading wildcard: cannot use B-tree index
SELECT * FROM users WHERE email LIKE '%@example.com';
-- Fix: reverse the string, or use pg_trgm GIN index

-- 2. Implicit type cast: index on integer id, querying with string
SELECT * FROM users WHERE id = '42';  -- id is INTEGER
-- PostgreSQL must cast every row. Fix: use typed parameter.

-- 3. OR on different indexed columns: can't use single index
SELECT * FROM users WHERE email = 'a@b.com' OR phone = '555-1234';
-- Fix: two separate queries + UNION ALL, or a composite index if columns co-occur

-- 4. SELECT *: fetches all columns, prevents index-only scans
SELECT * FROM orders WHERE user_id = 42;
-- Fix: select only needed columns

-- 5. NOT IN with subquery: can't use index, often O(n²)
SELECT * FROM products WHERE id NOT IN (SELECT product_id FROM order_items);
-- Fix: LEFT JOIN ... WHERE oi.product_id IS NULL, or NOT EXISTS

-- 6. Function on indexed column breaks index
SELECT * FROM users WHERE UPPER(name) = 'ALICE';
-- Fix: CREATE INDEX ON users(UPPER(name)) — functional index

-- 7. OFFSET for pagination: scans and discards rows
SELECT * FROM posts ORDER BY created_at DESC OFFSET 50000 LIMIT 20;
-- Fix: keyset pagination — WHERE created_at < $last_seen ORDER BY created_at DESC LIMIT 20

Partitioning

Partitioning splits a large table into smaller physical tables while presenting a unified logical interface. The query planner applies partition pruning to skip irrelevant partitions.

-- Range partitioning on created_at (common for time-series data)
CREATE TABLE orders (
  id         BIGSERIAL,
  user_id    BIGINT,
  total      NUMERIC,
  created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_01 PARTITION OF orders
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Planner only scans orders_2024_01 — partition pruning!
EXPLAIN SELECT * FROM orders WHERE created_at >= '2024-01-15' AND created_at < '2024-02-01';
-- → Append → Seq Scan on orders_2024_01

-- Hash partitioning for even distribution (no natural range)
CREATE TABLE users (id BIGSERIAL, email TEXT)
  PARTITION BY HASH (id);
CREATE TABLE users_0 PARTITION OF users FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE users_1 PARTITION OF users FOR VALUES WITH (modulus 4, remainder 1);
-- ... etc.

Each partition has its own indexes — they are much smaller, fitting more easily in cache. VACUUM, ANALYZE, REINDEX operate per-partition.

Partition limitations: Cross-partition queries (no pruning possible) are still expensive. Foreign keys to/from partitioned tables have restrictions. Unique constraints must include the partition key.


Index Maintenance

Indexes require maintenance. B-tree nodes accumulate dead entries from deleted/updated rows (PostgreSQL’s MVCC leaves old row versions in place until VACUUM removes them).

-- Find bloated indexes
SELECT
  schemaname,
  tablename,
  indexname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

-- Rebuild index online (no table lock!)
CREATE INDEX CONCURRENTLY new_idx ON orders(user_id, created_at);
DROP INDEX CONCURRENTLY old_idx;

-- Regular VACUUM prevents index bloat
VACUUM ANALYZE orders;

-- Check index usage — find unused indexes (candidates for removal)
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

CREATE INDEX CONCURRENTLY builds the index without holding a table lock. It takes longer (two passes) and cannot run inside a transaction. Use this in production — the standard CREATE INDEX takes a full table lock that blocks writes.


Interview: Diagnosing a Slow Query After Data Growth

A query that took 10ms now takes 10 seconds after the table grew from 10K to 10M rows.

Step 1: Get the execution plan.

EXPLAIN (ANALYZE, BUFFERS) <your slow query>;

Step 2: Identify the bottleneck node. Look for the operation with the highest actual time. Is it a Seq Scan on a large table? A nested loop with high loops? A hash join with disk spill (Batches: N where N > 1)?

Step 3: Check planner estimates vs actuals. If rows=100 (estimate) but rows=1000000 (actual), statistics are stale. Run ANALYZE table_name.

Step 4: Check indexes. Is an index missing? Does the query filter on an unindexed column? Is the selectivity low enough that the planner prefers a seq scan?

Step 5: Check for implicit type mismatches. A parameter of type text on an integer column forces a full scan.

Step 6: Check BUFFERS. High shared read (disk fetches) relative to shared hit (buffer pool hits) means the working set no longer fits in shared_buffers. Consider increasing it or adding more RAM.

Step 7: Examine index bloat. If the table has seen heavy delete/update activity, indexes may be bloated. REINDEX CONCURRENTLY.

Step 8: Consider schema changes. Covering index? Partial index? Partitioning if the table is time-series? Denormalization if the join is always the same?

The answer that impresses interviewers: start with EXPLAIN ANALYZE, identify the actual cost node, check statistics, then consider structural changes — in that order.