Core Architecture · Topic 12 of 21

CPU Architecture & Cache Hierarchy

200 XP

The Memory Wall

In 1994, William Wulf and Sally McKee coined the term memory wall: CPU clock speeds were doubling every 18 months (Moore’s Law) but DRAM latency was improving at only 7% per year. The gap has only widened since.

Year    CPU Speed   DRAM Latency   Gap
1980    10 MHz      200ns          2 cycles
1990    100 MHz     100ns          10 cycles
2000    1 GHz       60ns           60 cycles
2010    3 GHz       60ns           180 cycles
2024    5 GHz       60ns           300 cycles

Modern CPUs execute 4+ instructions per cycle. A cache miss to DRAM wastes 200-300 cycles — enough for the CPU to have executed 800-1200 instructions. This is the most important performance bottleneck in modern systems, dwarfing algorithmic complexity for small-to-medium N.

The CPU hierarchy bridges the gap with increasingly small, fast, and expensive memory at each level:

Level       Size        Latency     Bandwidth   Shared?
---------   --------    --------    ---------   -------
Registers   ~1KB        0 cycles    -           Per core
L1 cache    32-64KB     4 cycles    1 TB/s      Per core
L2 cache    256KB-1MB   12 cycles   400 GB/s    Per core
L3 cache    8-64MB      40 cycles   200 GB/s    Per socket
DRAM        16-256GB    60-100ns    50 GB/s     All cores
NVMe SSD    1-8TB       100µs       7 GB/s      System
HDD         1-20TB      10ms        200 MB/s    System
Network     ∞           100µs-10ms  10 Gb/s     Remote

These numbers should be memorized. Every performance optimization decision you make is constrained by this table.


Cache Lines: The Unit of Transfer

Caches do not transfer individual bytes. They transfer cache lines — 64 bytes on x86/ARM, 128 bytes on some newer ARM implementations.

When your code accesses byte at address 0x1000, the CPU loads the entire 64-byte line 0x1000-0x103F into L1 cache. The next access anywhere in that line is free (0 cycles from cache).

Memory: [byte 0][byte 1]...[byte 63][byte 64]...[byte 127]
                 ├────── cache line 0 (64B) ──────┤
                                                  ├── cache line 1 ──┤

Access byte[4]:  → loads cache line 0 → bytes 0-63 now in L1
Access byte[32]: → already in L1 → free!
Access byte[68]: → loads cache line 1 → another 64B fetch

This is why array traversal is fast: each cache line load brings 8 consecutive int64 values, so 1 in 8 accesses actually hits memory.

This is why linked lists are slow: each node pointer typically points to a different cache line (random allocation from the heap). Traversing a 1M-node linked list = 1M cache misses = 1M × 60ns = 60ms of pure memory stall time. The same 1M elements in a sorted array = ~2ms.

Linked List traversal:
node → [data|next*] → [data|next*] → [data|next*]
         ↕ cache miss    ↕ cache miss    ↕ cache miss
         (each node in a different cache line)

Array traversal:
[e0][e1][e2][e3][e4][e5][e6][e7] | [e8][e9]...
 ├────── one cache line load ────┘
  (8 elements per cache miss)

Spatial and Temporal Locality

The two principles that make caching effective:

Spatial locality: If you access address X, you’ll likely access X+1, X+2, … soon. Arrays exploit spatial locality. Matrices, buffers, and packed structs exploit spatial locality.

Temporal locality: If you access address X, you’ll likely access it again soon. Hot variables in loops, frequently-called functions exploit temporal locality.

The matrix multiplication example (spatial locality failure):

// Row-major access (C/C++/Java layout): FAST
// matrix[row][col] — consecutive cols are adjacent in memory
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];  // B[k][j] — column traversal of B!

// B is accessed column-by-column: B[0][j], B[1][j], B[2][j]...
// Row size = N doubles = 8N bytes. For N=1024: 8KB per row step.
// Each B[k][j] access is a new cache line = N² cache misses for B.

// Cache-oblivious fix: transpose B first
// Then B_T[j][k] = B[k][j], so inner loop accesses B_T row-by-row
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k] * B_T[j][k];  // now B_T is row-major traversal

Performance difference for N=1024: ~3x on modern hardware. Optimized BLAS (LAPACK/OpenBLAS) use cache blocking (tiling) — subdivide the computation into blocks that fit in L2 cache.


Array of Structs vs Struct of Arrays

This is the most impactful cache optimization for data-intensive code:

// Array of Structs (AoS) — intuitive but cache-inefficient for partial access
struct Particle {
    float x, y, z;     // position (12 bytes)
    float vx, vy, vz;  // velocity (12 bytes)
    float mass;         // (4 bytes)
    int   id;           // (4 bytes)
    // Total: 32 bytes per particle
};
Particle particles[1000000];

// Update only positions: access pattern touches x,y,z but loads all 32 bytes
// Each cache line (64B) holds 2 particles; velocity/mass/id are loaded but unused.
// For 1M particles: 32MB read just for positions (half is wasted bandwidth)
for (int i = 0; i < N; i++) {
    particles[i].x += particles[i].vx * dt;  // loads 32B, uses 8B
}

// Struct of Arrays (SoA) — cache-efficient for partial access
struct Particles {
    float* x;    // 1M floats = 4MB
    float* y;
    float* z;
    float* vx;
    float* vy;
    float* vz;
    float* mass;
    int*   id;
};

// Update only positions: only x, y, z, vx, vy, vz arrays loaded (24MB total)
// Each cache line holds 16 floats — perfect for SIMD vectorization
for (int i = 0; i < N; i++) {
    px[i] += pvx[i] * dt;  // 4MB read, 4MB write — no wasted bandwidth
}

SoA is the default layout in game engines (Unity DOTS, Rust ECS frameworks) and scientific computing. The compiler can also auto-vectorize SoA loops because the data is contiguous and homogeneous.


False Sharing

False sharing is one of the most insidious performance bugs in concurrent systems. Two threads write to different variables that happen to live on the same cache line, causing constant cache coherence traffic between cores.

Core 0 owns cache line: [counter_a | padding | counter_b | padding]
Core 1 owns cache line: [counter_a | padding | counter_b | padding]

Thread 0 increments counter_a → writes to its L1 cache → marks Core 1's copy INVALID
Thread 1 increments counter_b → must re-fetch entire cache line from Core 0
Thread 0 increments counter_a → must re-fetch entire cache line from Core 1
... (cache line bounces between cores continuously)

This can make multi-threaded code slower than single-threaded due to the overhead of cache coherence messages on the interconnect.

The fix: pad to cache line boundaries.

// BAD: counter_a and counter_b on the same cache line
struct Counters {
    long counter_a;  // 8 bytes
    long counter_b;  // 8 bytes — same 64-byte cache line as counter_a!
};

// GOOD: explicit padding forces separate cache lines
#define CACHE_LINE_SIZE 64
struct Counters {
    long counter_a;
    char _pad_a[CACHE_LINE_SIZE - sizeof(long)];  // 56 bytes padding
    long counter_b;
    char _pad_b[CACHE_LINE_SIZE - sizeof(long)];
};

// Better: use compiler alignment attribute
struct __attribute__((aligned(64))) Counter {
    long value;
};
Counter counters[NUM_THREADS];  // each Counter on its own cache line
// Java: @Contended (JDK 8+) — JVM adds padding automatically
import sun.misc.Contended;

public class Counter {
    @Contended
    public volatile long value;
}
// Enable with JVM flag: -XX:-RestrictContended

Benchmark example: Simple multi-threaded counter increment, 8 threads, 100M iterations each:

Without padding (false sharing):  ~4.2 seconds
With padding (cache line aligned): ~0.6 seconds

7x speedup from padding.


MESI Cache Coherence Protocol

MESI ensures all cores see a consistent view of memory. Each cache line is in one of four states:

State        Meaning
---------    ----------------------------------------------------------------
Modified(M)  This core has the only valid copy; it's been modified (dirty).
             Must write back to memory before another core can read it.
Exclusive(E) This core has the only copy; it matches memory (clean).
             Can transition to M on write without bus transaction.
Shared(S)    Multiple cores have valid copies matching memory (clean).
             Must invalidate others before writing.
Invalid(I)   This cache line is stale/absent. Must fetch before use.

State transitions:

Read miss:    I → S (fetch from memory or another cache in S/M)
Write miss:   I → M (fetch + invalidate all S copies)
Upgrade:      S → M (invalidate all other S copies — bus broadcast needed)
Write-back:   M → E or I (flush dirty line to memory on eviction)
Snoop:        S → I (another core upgraded to M — invalidate our S copy)

Why MESI is expensive at scale: Every S→M transition requires broadcasting an invalidation to all other cores, which must acknowledge it. On a 64-core machine with a 4-hop interconnect, this latency adds up. This is why lock-free algorithms using CAS on shared variables are slow at high thread counts — they constantly generate S→M transitions.

MOESI/MESIF extensions: Some architectures add O (Owned — dirty but shared) or F (Forward — designated to supply data to other caches), reducing write-back traffic.


Branch Prediction

Modern CPUs use a branch predictor to speculatively execute code before knowing whether a branch is taken. Misprediction requires flushing the pipeline — ~15-20 cycle penalty.

// Poorly-predicted branch: random array values → ~50% misprediction rate
for (int i = 0; i < N; i++) {
    if (data[i] >= 128) {  // random → CPU can't predict
        sum += data[i];
    }
}

// After sorting: data[i] >= 128 is false for first half, true for second half
// CPU predicts "not taken" for first half, "taken" for second half
// → near-perfect prediction, no pipeline stalls
sort(data, data + N);
for (int i = 0; i < N; i++) {
    if (data[i] >= 128) {
        sum += data[i];
    }
}
// Benchmark: unsorted ~6ns/element, sorted ~1.5ns/element (4x difference!)

Branchless programming: Eliminate branches entirely using conditional moves.

// Branch version (misprediction-prone)
if (a > b) max = a;
else        max = b;

// Branchless version (cmov instruction on x86)
max = a > b ? a : b;  // compiler generates CMOV — no branch!

// Branchless absolute value
int abs_x = (x ^ (x >> 31)) - (x >> 31);  // no branch

The branch predictor has ~4096 entries (2-level adaptive predictor in modern CPUs). Each entry tracks the last 2-4 outcomes of a branch (taken/not-taken history). Highly regular branches (always-taken loops, if-statements that rarely change) predict perfectly.


SIMD: Single Instruction Multiple Data

SIMD instructions process multiple data elements simultaneously using wide registers:

SSE2  (2004): 128-bit registers → 4 floats or 2 doubles at once
AVX2  (2013): 256-bit registers → 8 floats or 4 doubles at once
AVX-512 (2017): 512-bit registers → 16 floats or 8 doubles at once
#include <immintrin.h>  // Intel intrinsics

// Scalar: add 8 floats
float add_scalar(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}

// AVX2: add 8 floats simultaneously
void add_avx2(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);  // load 8 floats
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);   // 8 additions in 1 instruction
        _mm256_storeu_ps(&c[i], vc);
    }
}
// With AVX2: ~8x throughput improvement for float arrays

The compiler auto-vectorizes simple loops automatically (-O2 and -march=native). Complex loops with dependencies, function calls, or aliased pointers prevent auto-vectorization. Use __restrict__ to assert no aliasing, and structure loops to make vectorization obvious.

JavaScript TypedArrays and SIMD: V8 and SpiderMonkey can JIT-compile loops over Float32Array / Float64Array to SIMD instructions. Always use typed arrays for numerical computation in JavaScript.


Hardware Prefetching and Software Prefetch Hints

The hardware prefetcher monitors memory access patterns and speculatively fetches cache lines before they’re needed. It detects:

  • Sequential strides (array traversal)
  • Constant strides (every N bytes)
  • Stream patterns (forward/backward sequential)

It cannot detect pointer-chasing patterns (linked lists, trees, hash tables).

Software prefetch hints tell the CPU to start fetching a cache line before you need it:

#include <xmmintrin.h>

// Prefetch for reading (locality: NTA = non-temporal, 0 = keep in L1)
_mm_prefetch((char*)&data[i + 64], _MM_HINT_T0);  // prefetch 64 elements ahead

// Typical usage in a loop processing large arrays:
for (int i = 0; i < N; i++) {
    // Prefetch a "distance" ahead (tune based on memory latency / throughput)
    if (i + 16 < N) {
        __builtin_prefetch(&data[i + 16], 0, 1);  // GCC prefetch hint
    }
    process(data[i]);
}

Non-temporal stores: For streaming writes where data won’t be re-read soon, bypass the cache to avoid polluting it with data you won’t use:

// Write directly to memory without caching (write-combining buffer)
_mm256_stream_ps(&output[i], result);  // bypasses cache on write

Memory Allocators: jemalloc and mimalloc

malloc is not free. The default ptmalloc2 (glibc) uses a global lock for large allocations, causing contention in multi-threaded servers.

jemalloc (used by Facebook, Firefox, Redis):

  • Per-thread caches (tcaches) for common allocation sizes — no lock for small allocs
  • Size-class segregation: 8, 16, 32, 48, 64… bytes — reduces internal fragmentation
  • Huge pages support — reduces TLB pressure

mimalloc (Microsoft):

  • Similar per-thread approach but with tighter memory reuse
  • Better performance for allocation-heavy workloads (30-50% faster than jemalloc in some benchmarks)
# Use jemalloc for Node.js / any Linux process
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 node server.js

# Use mimalloc
LD_PRELOAD=/usr/local/lib/libmimalloc.so node server.js

JavaScript: TypedArrays and Cache Behavior

// Float64Array: contiguous 8-byte double values — cache friendly
const N = 10_000_000;
const arr = new Float64Array(N);

// Sequential sum: excellent cache behavior
// Each 64B cache line holds 8 doubles → 1 cache miss per 8 elements
let sum = 0;
for (let i = 0; i < N; i++) sum += arr[i];  // ~50ms

// Random access: terrible cache behavior
const indices = new Int32Array(N).map(() => Math.random() * N | 0);
let sum2 = 0;
for (let i = 0; i < N; i++) sum2 += arr[indices[i]];  // ~800ms (16x slower!)

// Array<number>: each element is a boxed heap object (pointer + double)
// NOT cache-friendly — pointer chasing through the heap
const jsArr = new Array(N).fill(0);
// Sequential access still works but 8x more memory (8B ptr + 8B double + header)

SharedArrayBuffer: The only way to share memory between Worker threads in Node.js. Uses the same physical memory region, accessed via typed arrays.

// Main thread:
const sab = new SharedArrayBuffer(Float64Array.BYTES_PER_ELEMENT * N);
const shared = new Float64Array(sab);

// Worker thread:
const shared2 = new Float64Array(workerData.sab);
// shared and shared2 reference the same physical memory
// Use Atomics for synchronization (covered in concurrency-primitives)

Real Systems: Cache-Aware Design

Redis: Redis stores data in an in-memory hash table. For small hash keys (≤64 bytes, ≤128 fields), it uses a ziplist (contiguous memory block) instead of a proper hash table — all fields and values packed together, fitting in a few cache lines. For small sets, this is faster than a hash table despite O(n) lookup, because cache locality dominates.

RocksDB block cache: RocksDB caches recently-accessed SSTable blocks in a block cache (typically LRU, 8KB blocks). The block cache is clock-based (approximates LRU without per-block locks). Hot keys stay cache-warm; cold keys cause block reads from NVMe SSD (~100µs).

Database buffer pool (PostgreSQL): PostgreSQL maintains shared_buffers (default 128MB, should be 25% of RAM). All reads/writes go through the buffer pool. A sequential scan with enable_seqscan=on uses ring buffering (bounded cache usage) to avoid evicting hot pages.


Interview: Row-Major vs Column-Major Traversal

“Why is iterating a 2D array row-by-row faster than column-by-column?”

The answer combines three concepts:

  1. Memory layout: In C/C++/Java/JavaScript, 2D arrays are stored row-major — matrix[0][0], matrix[0][1], ..., matrix[0][N-1], matrix[1][0], .... The elements of each row are contiguous in memory.

  2. Cache lines: When you access matrix[0][0], the CPU loads a 64-byte cache line containing matrix[0][0] through matrix[0][7] (for 8-byte doubles). Row traversal accesses matrix[0][1] next — it’s already in cache.

  3. Column traversal cache misses: When you access matrix[0][0], then matrix[1][0], then matrix[2][0] — each of these is in a different cache line (separated by N * 8 bytes). For a 1024×1024 matrix, column elements are 8KB apart — each access is a guaranteed cache miss.

Row-major:    [r0c0][r0c1][r0c2]...[r0c7] | [r0c8]...[r0c15]
              ├── one 64B cache line ─────┘
Row traversal: r0c0 → r0c1 → r0c2 → ... all in same line = 1 miss per 8 elements
Column traversal: r0c0 → r1c0 → r2c0 → ... each in different line = 1 miss per element

Performance ratio: 8x for doubles (fits 8 per cache line)
For 1024×1024: row-major ~5ms, column-major ~40ms

NUMA (Non-Uniform Memory Access) adds another dimension: in multi-socket servers, memory banks are local to one socket. Accessing remote NUMA memory is 2-3x slower than local. Thread pinning (taskset, numactl) ensures threads access their local NUMA node’s memory.