The Memory Wall
In 1994, William Wulf and Sally McKee coined the term memory wall: CPU clock speeds were doubling every 18 months (Moore’s Law) but DRAM latency was improving at only 7% per year. The gap has only widened since.
Year CPU Speed DRAM Latency Gap
1980 10 MHz 200ns 2 cycles
1990 100 MHz 100ns 10 cycles
2000 1 GHz 60ns 60 cycles
2010 3 GHz 60ns 180 cycles
2024 5 GHz 60ns 300 cycles
Modern CPUs execute 4+ instructions per cycle. A cache miss to DRAM wastes 200-300 cycles — enough for the CPU to have executed 800-1200 instructions. This is the most important performance bottleneck in modern systems, dwarfing algorithmic complexity for small-to-medium N.
The CPU hierarchy bridges the gap with increasingly small, fast, and expensive memory at each level:
Level Size Latency Bandwidth Shared?
--------- -------- -------- --------- -------
Registers ~1KB 0 cycles - Per core
L1 cache 32-64KB 4 cycles 1 TB/s Per core
L2 cache 256KB-1MB 12 cycles 400 GB/s Per core
L3 cache 8-64MB 40 cycles 200 GB/s Per socket
DRAM 16-256GB 60-100ns 50 GB/s All cores
NVMe SSD 1-8TB 100µs 7 GB/s System
HDD 1-20TB 10ms 200 MB/s System
Network ∞ 100µs-10ms 10 Gb/s Remote
These numbers should be memorized. Every performance optimization decision you make is constrained by this table.
Cache Lines: The Unit of Transfer
Caches do not transfer individual bytes. They transfer cache lines — 64 bytes on x86/ARM, 128 bytes on some newer ARM implementations.
When your code accesses byte at address 0x1000, the CPU loads the entire 64-byte line 0x1000-0x103F into L1 cache. The next access anywhere in that line is free (0 cycles from cache).
Memory: [byte 0][byte 1]...[byte 63][byte 64]...[byte 127]
├────── cache line 0 (64B) ──────┤
├── cache line 1 ──┤
Access byte[4]: → loads cache line 0 → bytes 0-63 now in L1
Access byte[32]: → already in L1 → free!
Access byte[68]: → loads cache line 1 → another 64B fetch
This is why array traversal is fast: each cache line load brings 8 consecutive int64 values, so 1 in 8 accesses actually hits memory.
This is why linked lists are slow: each node pointer typically points to a different cache line (random allocation from the heap). Traversing a 1M-node linked list = 1M cache misses = 1M × 60ns = 60ms of pure memory stall time. The same 1M elements in a sorted array = ~2ms.
Linked List traversal:
node → [data|next*] → [data|next*] → [data|next*]
↕ cache miss ↕ cache miss ↕ cache miss
(each node in a different cache line)
Array traversal:
[e0][e1][e2][e3][e4][e5][e6][e7] | [e8][e9]...
├────── one cache line load ────┘
(8 elements per cache miss)
Spatial and Temporal Locality
The two principles that make caching effective:
Spatial locality: If you access address X, you’ll likely access X+1, X+2, … soon. Arrays exploit spatial locality. Matrices, buffers, and packed structs exploit spatial locality.
Temporal locality: If you access address X, you’ll likely access it again soon. Hot variables in loops, frequently-called functions exploit temporal locality.
The matrix multiplication example (spatial locality failure):
// Row-major access (C/C++/Java layout): FAST
// matrix[row][col] — consecutive cols are adjacent in memory
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j]; // B[k][j] — column traversal of B!
// B is accessed column-by-column: B[0][j], B[1][j], B[2][j]...
// Row size = N doubles = 8N bytes. For N=1024: 8KB per row step.
// Each B[k][j] access is a new cache line = N² cache misses for B.
// Cache-oblivious fix: transpose B first
// Then B_T[j][k] = B[k][j], so inner loop accesses B_T row-by-row
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
C[i][j] += A[i][k] * B_T[j][k]; // now B_T is row-major traversal
Performance difference for N=1024: ~3x on modern hardware. Optimized BLAS (LAPACK/OpenBLAS) use cache blocking (tiling) — subdivide the computation into blocks that fit in L2 cache.
Array of Structs vs Struct of Arrays
This is the most impactful cache optimization for data-intensive code:
// Array of Structs (AoS) — intuitive but cache-inefficient for partial access
struct Particle {
float x, y, z; // position (12 bytes)
float vx, vy, vz; // velocity (12 bytes)
float mass; // (4 bytes)
int id; // (4 bytes)
// Total: 32 bytes per particle
};
Particle particles[1000000];
// Update only positions: access pattern touches x,y,z but loads all 32 bytes
// Each cache line (64B) holds 2 particles; velocity/mass/id are loaded but unused.
// For 1M particles: 32MB read just for positions (half is wasted bandwidth)
for (int i = 0; i < N; i++) {
particles[i].x += particles[i].vx * dt; // loads 32B, uses 8B
}
// Struct of Arrays (SoA) — cache-efficient for partial access
struct Particles {
float* x; // 1M floats = 4MB
float* y;
float* z;
float* vx;
float* vy;
float* vz;
float* mass;
int* id;
};
// Update only positions: only x, y, z, vx, vy, vz arrays loaded (24MB total)
// Each cache line holds 16 floats — perfect for SIMD vectorization
for (int i = 0; i < N; i++) {
px[i] += pvx[i] * dt; // 4MB read, 4MB write — no wasted bandwidth
}
SoA is the default layout in game engines (Unity DOTS, Rust ECS frameworks) and scientific computing. The compiler can also auto-vectorize SoA loops because the data is contiguous and homogeneous.
False Sharing
False sharing is one of the most insidious performance bugs in concurrent systems. Two threads write to different variables that happen to live on the same cache line, causing constant cache coherence traffic between cores.
Core 0 owns cache line: [counter_a | padding | counter_b | padding]
Core 1 owns cache line: [counter_a | padding | counter_b | padding]
Thread 0 increments counter_a → writes to its L1 cache → marks Core 1's copy INVALID
Thread 1 increments counter_b → must re-fetch entire cache line from Core 0
Thread 0 increments counter_a → must re-fetch entire cache line from Core 1
... (cache line bounces between cores continuously)
This can make multi-threaded code slower than single-threaded due to the overhead of cache coherence messages on the interconnect.
The fix: pad to cache line boundaries.
// BAD: counter_a and counter_b on the same cache line
struct Counters {
long counter_a; // 8 bytes
long counter_b; // 8 bytes — same 64-byte cache line as counter_a!
};
// GOOD: explicit padding forces separate cache lines
#define CACHE_LINE_SIZE 64
struct Counters {
long counter_a;
char _pad_a[CACHE_LINE_SIZE - sizeof(long)]; // 56 bytes padding
long counter_b;
char _pad_b[CACHE_LINE_SIZE - sizeof(long)];
};
// Better: use compiler alignment attribute
struct __attribute__((aligned(64))) Counter {
long value;
};
Counter counters[NUM_THREADS]; // each Counter on its own cache line
// Java: @Contended (JDK 8+) — JVM adds padding automatically
import sun.misc.Contended;
public class Counter {
@Contended
public volatile long value;
}
// Enable with JVM flag: -XX:-RestrictContended
Benchmark example: Simple multi-threaded counter increment, 8 threads, 100M iterations each:
Without padding (false sharing): ~4.2 seconds
With padding (cache line aligned): ~0.6 seconds
7x speedup from padding.
MESI Cache Coherence Protocol
MESI ensures all cores see a consistent view of memory. Each cache line is in one of four states:
State Meaning
--------- ----------------------------------------------------------------
Modified(M) This core has the only valid copy; it's been modified (dirty).
Must write back to memory before another core can read it.
Exclusive(E) This core has the only copy; it matches memory (clean).
Can transition to M on write without bus transaction.
Shared(S) Multiple cores have valid copies matching memory (clean).
Must invalidate others before writing.
Invalid(I) This cache line is stale/absent. Must fetch before use.
State transitions:
Read miss: I → S (fetch from memory or another cache in S/M)
Write miss: I → M (fetch + invalidate all S copies)
Upgrade: S → M (invalidate all other S copies — bus broadcast needed)
Write-back: M → E or I (flush dirty line to memory on eviction)
Snoop: S → I (another core upgraded to M — invalidate our S copy)
Why MESI is expensive at scale: Every S→M transition requires broadcasting an invalidation to all other cores, which must acknowledge it. On a 64-core machine with a 4-hop interconnect, this latency adds up. This is why lock-free algorithms using CAS on shared variables are slow at high thread counts — they constantly generate S→M transitions.
MOESI/MESIF extensions: Some architectures add O (Owned — dirty but shared) or F (Forward — designated to supply data to other caches), reducing write-back traffic.
Branch Prediction
Modern CPUs use a branch predictor to speculatively execute code before knowing whether a branch is taken. Misprediction requires flushing the pipeline — ~15-20 cycle penalty.
// Poorly-predicted branch: random array values → ~50% misprediction rate
for (int i = 0; i < N; i++) {
if (data[i] >= 128) { // random → CPU can't predict
sum += data[i];
}
}
// After sorting: data[i] >= 128 is false for first half, true for second half
// CPU predicts "not taken" for first half, "taken" for second half
// → near-perfect prediction, no pipeline stalls
sort(data, data + N);
for (int i = 0; i < N; i++) {
if (data[i] >= 128) {
sum += data[i];
}
}
// Benchmark: unsorted ~6ns/element, sorted ~1.5ns/element (4x difference!)
Branchless programming: Eliminate branches entirely using conditional moves.
// Branch version (misprediction-prone)
if (a > b) max = a;
else max = b;
// Branchless version (cmov instruction on x86)
max = a > b ? a : b; // compiler generates CMOV — no branch!
// Branchless absolute value
int abs_x = (x ^ (x >> 31)) - (x >> 31); // no branch
The branch predictor has ~4096 entries (2-level adaptive predictor in modern CPUs). Each entry tracks the last 2-4 outcomes of a branch (taken/not-taken history). Highly regular branches (always-taken loops, if-statements that rarely change) predict perfectly.
SIMD: Single Instruction Multiple Data
SIMD instructions process multiple data elements simultaneously using wide registers:
SSE2 (2004): 128-bit registers → 4 floats or 2 doubles at once
AVX2 (2013): 256-bit registers → 8 floats or 4 doubles at once
AVX-512 (2017): 512-bit registers → 16 floats or 8 doubles at once
#include <immintrin.h> // Intel intrinsics
// Scalar: add 8 floats
float add_scalar(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}
// AVX2: add 8 floats simultaneously
void add_avx2(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]); // load 8 floats
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb); // 8 additions in 1 instruction
_mm256_storeu_ps(&c[i], vc);
}
}
// With AVX2: ~8x throughput improvement for float arrays
The compiler auto-vectorizes simple loops automatically (-O2 and -march=native). Complex loops with dependencies, function calls, or aliased pointers prevent auto-vectorization. Use __restrict__ to assert no aliasing, and structure loops to make vectorization obvious.
JavaScript TypedArrays and SIMD: V8 and SpiderMonkey can JIT-compile loops over Float32Array / Float64Array to SIMD instructions. Always use typed arrays for numerical computation in JavaScript.
Hardware Prefetching and Software Prefetch Hints
The hardware prefetcher monitors memory access patterns and speculatively fetches cache lines before they’re needed. It detects:
- Sequential strides (array traversal)
- Constant strides (every N bytes)
- Stream patterns (forward/backward sequential)
It cannot detect pointer-chasing patterns (linked lists, trees, hash tables).
Software prefetch hints tell the CPU to start fetching a cache line before you need it:
#include <xmmintrin.h>
// Prefetch for reading (locality: NTA = non-temporal, 0 = keep in L1)
_mm_prefetch((char*)&data[i + 64], _MM_HINT_T0); // prefetch 64 elements ahead
// Typical usage in a loop processing large arrays:
for (int i = 0; i < N; i++) {
// Prefetch a "distance" ahead (tune based on memory latency / throughput)
if (i + 16 < N) {
__builtin_prefetch(&data[i + 16], 0, 1); // GCC prefetch hint
}
process(data[i]);
}
Non-temporal stores: For streaming writes where data won’t be re-read soon, bypass the cache to avoid polluting it with data you won’t use:
// Write directly to memory without caching (write-combining buffer)
_mm256_stream_ps(&output[i], result); // bypasses cache on write
Memory Allocators: jemalloc and mimalloc
malloc is not free. The default ptmalloc2 (glibc) uses a global lock for large allocations, causing contention in multi-threaded servers.
jemalloc (used by Facebook, Firefox, Redis):
- Per-thread caches (tcaches) for common allocation sizes — no lock for small allocs
- Size-class segregation: 8, 16, 32, 48, 64… bytes — reduces internal fragmentation
- Huge pages support — reduces TLB pressure
mimalloc (Microsoft):
- Similar per-thread approach but with tighter memory reuse
- Better performance for allocation-heavy workloads (30-50% faster than jemalloc in some benchmarks)
# Use jemalloc for Node.js / any Linux process
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 node server.js
# Use mimalloc
LD_PRELOAD=/usr/local/lib/libmimalloc.so node server.js
JavaScript: TypedArrays and Cache Behavior
// Float64Array: contiguous 8-byte double values — cache friendly
const N = 10_000_000;
const arr = new Float64Array(N);
// Sequential sum: excellent cache behavior
// Each 64B cache line holds 8 doubles → 1 cache miss per 8 elements
let sum = 0;
for (let i = 0; i < N; i++) sum += arr[i]; // ~50ms
// Random access: terrible cache behavior
const indices = new Int32Array(N).map(() => Math.random() * N | 0);
let sum2 = 0;
for (let i = 0; i < N; i++) sum2 += arr[indices[i]]; // ~800ms (16x slower!)
// Array<number>: each element is a boxed heap object (pointer + double)
// NOT cache-friendly — pointer chasing through the heap
const jsArr = new Array(N).fill(0);
// Sequential access still works but 8x more memory (8B ptr + 8B double + header)
SharedArrayBuffer: The only way to share memory between Worker threads in Node.js. Uses the same physical memory region, accessed via typed arrays.
// Main thread:
const sab = new SharedArrayBuffer(Float64Array.BYTES_PER_ELEMENT * N);
const shared = new Float64Array(sab);
// Worker thread:
const shared2 = new Float64Array(workerData.sab);
// shared and shared2 reference the same physical memory
// Use Atomics for synchronization (covered in concurrency-primitives)
Real Systems: Cache-Aware Design
Redis: Redis stores data in an in-memory hash table. For small hash keys (≤64 bytes, ≤128 fields), it uses a ziplist (contiguous memory block) instead of a proper hash table — all fields and values packed together, fitting in a few cache lines. For small sets, this is faster than a hash table despite O(n) lookup, because cache locality dominates.
RocksDB block cache:
RocksDB caches recently-accessed SSTable blocks in a block cache (typically LRU, 8KB blocks). The block cache is clock-based (approximates LRU without per-block locks). Hot keys stay cache-warm; cold keys cause block reads from NVMe SSD (~100µs).
Database buffer pool (PostgreSQL):
PostgreSQL maintains shared_buffers (default 128MB, should be 25% of RAM). All reads/writes go through the buffer pool. A sequential scan with enable_seqscan=on uses ring buffering (bounded cache usage) to avoid evicting hot pages.
Interview: Row-Major vs Column-Major Traversal
“Why is iterating a 2D array row-by-row faster than column-by-column?”
The answer combines three concepts:
-
Memory layout: In C/C++/Java/JavaScript, 2D arrays are stored row-major —
matrix[0][0], matrix[0][1], ..., matrix[0][N-1], matrix[1][0], .... The elements of each row are contiguous in memory. -
Cache lines: When you access
matrix[0][0], the CPU loads a 64-byte cache line containingmatrix[0][0]throughmatrix[0][7](for 8-byte doubles). Row traversal accessesmatrix[0][1]next — it’s already in cache. -
Column traversal cache misses: When you access
matrix[0][0], thenmatrix[1][0], thenmatrix[2][0]— each of these is in a different cache line (separated byN * 8bytes). For a 1024×1024 matrix, column elements are 8KB apart — each access is a guaranteed cache miss.
Row-major: [r0c0][r0c1][r0c2]...[r0c7] | [r0c8]...[r0c15]
├── one 64B cache line ─────┘
Row traversal: r0c0 → r0c1 → r0c2 → ... all in same line = 1 miss per 8 elements
Column traversal: r0c0 → r1c0 → r2c0 → ... each in different line = 1 miss per element
Performance ratio: 8x for doubles (fits 8 per cache line)
For 1024×1024: row-major ~5ms, column-major ~40ms
NUMA (Non-Uniform Memory Access) adds another dimension: in multi-socket servers, memory banks are local to one socket. Accessing remote NUMA memory is 2-3x slower than local. Thread pinning (taskset, numactl) ensures threads access their local NUMA node’s memory.