Why Indexes Exist
A database table without an index is a heap: an unordered collection of rows. Finding a row requires scanning every row — O(n) cost. With 10 million rows at 100 bytes each = 1GB of data that must be read from disk. At 500MB/s sequential read, that’s 2 seconds for every query.
An index is a separate data structure that maintains a sorted mapping from key values to heap locations. With a B-tree index on the same 10M row table, finding a row costs O(log n) = ~24 comparisons, each hitting a cached B-tree node. The index itself is small enough to fit in the database buffer pool.
Sequential Scan:
Disk pages: [page1][page2][page3]...[page65536]
Cost: O(n) — read all pages
B-tree Index Scan:
Index: root → [page, key range]
└──► internal node
└──► leaf node → [heap TID]
Cost: O(log n) to find leaf + O(k) heap fetches for k matching rows
The crossover point: if a query returns >5-10% of rows, the planner prefers a sequential scan (predictable sequential I/O) over an index scan (random I/O). This is counterintuitive — index scans are not always faster.
B-Tree Index Internals
PostgreSQL (and MySQL InnoDB, SQLite) uses a variant of the B+ tree: all data lives in leaf nodes; internal nodes contain only keys for routing. Leaf nodes are linked in sorted order, enabling efficient range scans.
B+ Tree with order 3 (max 3 keys per node):
Internal: [15 | 30]
/ | \
Leaf: [5|8|12] [15|20|25] [30|40|50]
↔ ↔ ↔
(linked list at leaf level — range scan walks the list)
Properties:
- Balanced: all leaf nodes at same depth → guaranteed
O(log n)worst case - Nodes are sized to match a disk page (typically 8KB in PostgreSQL)
- A page holds ~400 keys at 20 bytes each →
log_400(10M) ≈ 3levels for 10M rows - Practical depth: 3-4 levels even for billion-row tables
The left-most prefix rule governs composite index usage. Given CREATE INDEX idx ON orders(customer_id, status, created_at):
| Query predicate | Uses index? | Why |
|---|---|---|
WHERE customer_id = 5 | ✅ Yes | Uses first column |
WHERE customer_id = 5 AND status = 'pending' | ✅ Yes | Uses first two columns |
WHERE customer_id = 5 AND created_at > '2024-01-01' | ✅ Partial | Skips status, uses customer_id then scans |
WHERE status = 'pending' | ❌ No | Skips first column |
WHERE created_at > '2024-01-01' | ❌ No | Skips first two columns |
Composite index key order matters. Put equality predicates first, range predicates last. High-cardinality columns first (reduces the result set faster).
Covering Indexes and Index-Only Scans
A covering index includes all columns needed by a query — the query can be answered entirely from the index without touching the heap (table rows).
-- Without covering index: index scan + heap fetch for every row
CREATE INDEX idx_orders_user ON orders(user_id);
SELECT user_id, status, total FROM orders WHERE user_id = 42;
-- Plan: Index Scan on idx_orders_user → fetch heap page for each row
-- With covering index: index-only scan
CREATE INDEX idx_orders_user_cover ON orders(user_id) INCLUDE (status, total);
SELECT user_id, status, total FROM orders WHERE user_id = 42;
-- Plan: Index Only Scan on idx_orders_user_cover → no heap access!
Index-only scans are dramatically faster when the table has many columns and rows. PostgreSQL must check the visibility map to ensure the covering index has current MVCC data; frequent updates reduce the benefit.
INCLUDE vs key columns: In PostgreSQL 11+, INCLUDE (col) adds columns to the leaf without affecting the sort order. Never put wide columns in the key — they inflate every level of the B-tree.
Hash Indexes
Hash indexes offer O(1) equality lookups by hashing the key to a bucket. PostgreSQL’s hash indexes are in-memory for the hot set and spill to disk.
CREATE INDEX idx_users_email_hash ON users USING HASH (email);
-- Fast: exact equality
SELECT * FROM users WHERE email = 'alice@example.com';
-- Useless: range queries have no concept of order
SELECT * FROM users WHERE email > 'a'; -- cannot use hash index
When to use hash indexes:
- Only equality lookups, no range scans
- Very high equality query rate
- Key is wide (UUID, long string) — B-tree internal nodes must store full key copies; hash reduces to fixed-size hash
In practice, B-tree handles equality nearly as well as hash for most workloads. Hash indexes shine only in specialized high-volume equality lookup scenarios.
GiST and GIN Indexes
GiST (Generalized Search Tree): A framework for building balanced tree indexes for non-scalar types. Used for:
-- Full-text search
CREATE INDEX ON articles USING GIST (to_tsvector('english', body));
-- Geometric types (PostGIS)
CREATE INDEX ON locations USING GIST (geom);
SELECT * FROM locations WHERE ST_DWithin(geom, ST_Point(-73.9, 40.7), 1000);
-- Range types
CREATE INDEX ON reservations USING GIST (during);
SELECT * FROM reservations WHERE during && '[2024-01-10, 2024-01-15)';
GIN (Generalized Inverted Index): Maps each element of a composite value to a list of rows containing it. Used for:
-- JSONB containment
CREATE INDEX ON events USING GIN (payload jsonb_path_ops);
SELECT * FROM events WHERE payload @> '{"type": "click", "button": "buy"}';
-- Array containment
CREATE INDEX ON posts USING GIN (tags);
SELECT * FROM posts WHERE tags @> ARRAY['postgres', 'performance'];
-- Full-text search (faster lookups than GiST, slower to update)
CREATE INDEX ON articles USING GIN (to_tsvector('english', body));
GIN vs GiST trade-offs:
| Property | GIN | GiST |
|---|---|---|
| Lookup speed | Faster | Slower |
| Build time | Slower | Faster |
| Update performance | Slower (inverted index rebuild) | Faster |
| Index size | Larger | Smaller |
GIN is preferred for static/rarely-updated data. GiST for frequently-updated data.
Partial and Functional Indexes
Partial index: An index with a WHERE clause — only a subset of rows is indexed.
-- Index only pending orders (assume completed orders are rarely queried)
CREATE INDEX idx_orders_pending ON orders(user_id, created_at)
WHERE status = 'pending';
-- The planner uses this index only for queries that match the WHERE condition
EXPLAIN SELECT * FROM orders WHERE user_id = 42 AND status = 'pending';
-- → Index Scan on idx_orders_pending
EXPLAIN SELECT * FROM orders WHERE user_id = 42 AND status = 'completed';
-- → Sequential Scan (partial index doesn't apply)
Benefits: smaller index (fits in memory), faster writes (fewer index entries to maintain), better statistics for the planner.
Functional index: Index on an expression rather than a raw column.
-- Case-insensitive email lookups
CREATE INDEX idx_users_lower_email ON users(lower(email));
SELECT * FROM users WHERE lower(email) = lower('Alice@EXAMPLE.COM');
-- → uses the functional index
-- Index on extracted JSONB field
CREATE INDEX idx_events_type ON events((payload->>'type'));
SELECT * FROM events WHERE payload->>'type' = 'purchase';
-- Index on date part
CREATE INDEX idx_orders_month ON orders(date_trunc('month', created_at));
The query’s WHERE clause must match the index expression exactly for the planner to recognize it.
Index Selectivity
Selectivity is the fraction of rows that match a given predicate. Low selectivity = many matching rows = index scan not beneficial.
-- Selectivity of different columns in a users table with 1M rows:
-- gender: 2 distinct values → selectivity = 0.5 (50% of rows match)
-- country: 200 distinct values → selectivity = 0.005 (0.5% match)
-- user_id: 1M distinct values → selectivity = 0.000001 (one row)
-- Index on gender is nearly useless — planner will ignore it
CREATE INDEX idx_users_gender ON users(gender);
EXPLAIN SELECT * FROM users WHERE gender = 'female'; -- Seq Scan!
-- Index on user_id is extremely useful
CREATE INDEX idx_users_id ON users(user_id);
EXPLAIN SELECT * FROM users WHERE user_id = 42; -- Index Scan, 1 row
The planner uses column statistics (pg_stats) to estimate selectivity. ANALYZE updates these statistics by sampling a fraction of the table. Stale statistics cause the planner to choose wrong plans.
-- View statistics for a column
SELECT n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
-- Force statistics update
ANALYZE orders;
-- Increase statistics target for skewed distributions
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500; -- default is 100
ANALYZE orders;
Query Planning: Reading EXPLAIN ANALYZE
The cost-based optimizer (CBO) generates multiple candidate plans and picks the lowest-cost one. Cost is in arbitrary units (not milliseconds) calibrated by seq_page_cost, random_page_cost, cpu_tuple_cost.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, u.email, sum(oi.price)
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at > '2024-01-01'
GROUP BY o.id, u.email;
Sample output and how to read it:
Hash Join (cost=15234.00..89234.00 rows=12000 width=48)
(actual time=234.123..891.456 rows=11834 loops=1)
Buffers: shared hit=4523 read=12341
-> Hash Join (cost=...)
...
-> Seq Scan on orders (cost=0.00..45234.00 rows=500000 width=24)
(actual time=0.034..234.678 rows=487234 loops=1)
Filter: (created_at > '2024-01-01')
Rows Removed by Filter: 512766
Buffers: shared hit=2341 read=8923
Key signals:
cost=X..Y:X= startup cost (before first row),Y= total costrows=N(planner estimate) vsrows=M(actual): large discrepancy = stale statsBuffers: hit=N read=M:hit= served from buffer pool (fast),read= from disk (slow). Highreadcount with manyloops= index not caching well.loops=Non inner node = that node executed N times. Nested loop with loops=100000 is almost always a problem.
Join Algorithms
PostgreSQL chooses among three join algorithms based on table sizes, available memory, and whether inputs are sorted:
Nested Loop Join:
For each row in outer table:
For each matching row in inner table:
Emit pair
Cost: O(outer_rows × inner_rows) or O(outer_rows × log(inner_rows)) with index
Best for: small outer table, indexed inner table. The canonical “1 + N” case.
Hash Join:
Build phase: Read all of inner table → build hash table on join key
Probe phase: Read outer table → probe hash table for matches
Cost: O(inner_rows) build + O(outer_rows) probe = O(n + m)
Best for: large tables without useful indexes, unsorted data. Memory-intensive: if the hash table doesn’t fit in work_mem, PostgreSQL spills to disk (a “batch” hash join).
Merge Join:
Require: both inputs sorted on join key (or sortable within the plan)
Merge: walk both sorted lists simultaneously, emit matches
Cost: O(n log n + m log m) for sort + O(n + m) merge
Best for: pre-sorted inputs (from indexes), large tables, range joins. Avoids memory pressure of hash joins.
-- Force specific join type for testing (never do this in production)
SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Only nested loop remains
The N+1 Query Problem
The most common ORM-induced performance disaster:
// TypeScript with a naive ORM (pseudo-code)
// 1 query to fetch all orders
const orders = await db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
// N queries — one per order!
for (const order of orders) {
const items = await db.query(
'SELECT * FROM order_items WHERE order_id = $1',
[order.id]
);
order.items = items;
}
// Total: 1 + N queries. For 1000 orders: 1001 round-trips to the DB.
// Each round-trip: ~1ms network + query overhead = ~1 second total.
Fix: JOIN and map in application code:
// 1 query total
const rows = await db.query(`
SELECT
o.id AS order_id,
o.total,
o.created_at,
oi.id AS item_id,
oi.product_id,
oi.quantity,
oi.price
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE o.user_id = $1
ORDER BY o.id
`, [userId]);
// Group in application memory
const ordersMap = new Map<number, Order>();
for (const row of rows) {
if (!ordersMap.has(row.order_id)) {
ordersMap.set(row.order_id, {
id: row.order_id,
total: row.total,
created_at: row.created_at,
items: [],
});
}
ordersMap.get(row.order_id)!.items.push({
id: row.item_id,
product_id: row.product_id,
quantity: row.quantity,
price: row.price,
});
}
const orders = [...ordersMap.values()];
Alternative fix: IN clause batch loading (DataLoader pattern):
// Collect all IDs first, then batch load
const orderIds = orders.map(o => o.id);
const items = await db.query(
'SELECT * FROM order_items WHERE order_id = ANY($1)',
[orderIds]
);
// Group items by order_id in memory — 2 queries total regardless of N
Slow Query Patterns to Avoid
-- 1. LIKE with leading wildcard: cannot use B-tree index
SELECT * FROM users WHERE email LIKE '%@example.com';
-- Fix: reverse the string, or use pg_trgm GIN index
-- 2. Implicit type cast: index on integer id, querying with string
SELECT * FROM users WHERE id = '42'; -- id is INTEGER
-- PostgreSQL must cast every row. Fix: use typed parameter.
-- 3. OR on different indexed columns: can't use single index
SELECT * FROM users WHERE email = 'a@b.com' OR phone = '555-1234';
-- Fix: two separate queries + UNION ALL, or a composite index if columns co-occur
-- 4. SELECT *: fetches all columns, prevents index-only scans
SELECT * FROM orders WHERE user_id = 42;
-- Fix: select only needed columns
-- 5. NOT IN with subquery: can't use index, often O(n²)
SELECT * FROM products WHERE id NOT IN (SELECT product_id FROM order_items);
-- Fix: LEFT JOIN ... WHERE oi.product_id IS NULL, or NOT EXISTS
-- 6. Function on indexed column breaks index
SELECT * FROM users WHERE UPPER(name) = 'ALICE';
-- Fix: CREATE INDEX ON users(UPPER(name)) — functional index
-- 7. OFFSET for pagination: scans and discards rows
SELECT * FROM posts ORDER BY created_at DESC OFFSET 50000 LIMIT 20;
-- Fix: keyset pagination — WHERE created_at < $last_seen ORDER BY created_at DESC LIMIT 20
Partitioning
Partitioning splits a large table into smaller physical tables while presenting a unified logical interface. The query planner applies partition pruning to skip irrelevant partitions.
-- Range partitioning on created_at (common for time-series data)
CREATE TABLE orders (
id BIGSERIAL,
user_id BIGINT,
total NUMERIC,
created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2024_01 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Planner only scans orders_2024_01 — partition pruning!
EXPLAIN SELECT * FROM orders WHERE created_at >= '2024-01-15' AND created_at < '2024-02-01';
-- → Append → Seq Scan on orders_2024_01
-- Hash partitioning for even distribution (no natural range)
CREATE TABLE users (id BIGSERIAL, email TEXT)
PARTITION BY HASH (id);
CREATE TABLE users_0 PARTITION OF users FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE users_1 PARTITION OF users FOR VALUES WITH (modulus 4, remainder 1);
-- ... etc.
Each partition has its own indexes — they are much smaller, fitting more easily in cache. VACUUM, ANALYZE, REINDEX operate per-partition.
Partition limitations: Cross-partition queries (no pruning possible) are still expensive. Foreign keys to/from partitioned tables have restrictions. Unique constraints must include the partition key.
Index Maintenance
Indexes require maintenance. B-tree nodes accumulate dead entries from deleted/updated rows (PostgreSQL’s MVCC leaves old row versions in place until VACUUM removes them).
-- Find bloated indexes
SELECT
schemaname,
tablename,
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild index online (no table lock!)
CREATE INDEX CONCURRENTLY new_idx ON orders(user_id, created_at);
DROP INDEX CONCURRENTLY old_idx;
-- Regular VACUUM prevents index bloat
VACUUM ANALYZE orders;
-- Check index usage — find unused indexes (candidates for removal)
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
CREATE INDEX CONCURRENTLY builds the index without holding a table lock. It takes longer (two passes) and cannot run inside a transaction. Use this in production — the standard CREATE INDEX takes a full table lock that blocks writes.
Interview: Diagnosing a Slow Query After Data Growth
A query that took 10ms now takes 10 seconds after the table grew from 10K to 10M rows.
Step 1: Get the execution plan.
EXPLAIN (ANALYZE, BUFFERS) <your slow query>;
Step 2: Identify the bottleneck node. Look for the operation with the highest actual time. Is it a Seq Scan on a large table? A nested loop with high loops? A hash join with disk spill (Batches: N where N > 1)?
Step 3: Check planner estimates vs actuals. If rows=100 (estimate) but rows=1000000 (actual), statistics are stale. Run ANALYZE table_name.
Step 4: Check indexes. Is an index missing? Does the query filter on an unindexed column? Is the selectivity low enough that the planner prefers a seq scan?
Step 5: Check for implicit type mismatches. A parameter of type text on an integer column forces a full scan.
Step 6: Check BUFFERS. High shared read (disk fetches) relative to shared hit (buffer pool hits) means the working set no longer fits in shared_buffers. Consider increasing it or adding more RAM.
Step 7: Examine index bloat. If the table has seen heavy delete/update activity, indexes may be bloated. REINDEX CONCURRENTLY.
Step 8: Consider schema changes. Covering index? Partial index? Partitioning if the table is time-series? Denormalization if the join is always the same?
The answer that impresses interviewers: start with EXPLAIN ANALYZE, identify the actual cost node, check statistics, then consider structural changes — in that order.