Stack vs Heap: The Fundamental Divide
Every program has two primary regions of memory: the stack and the heap. Misunderstanding their mechanics leads to bugs that are hard to reproduce and catastrophic in production.
The Stack
The call stack is a contiguous region of memory (typically 1-8MB per thread on Linux, 1MB on Windows) that grows downward. Each function call pushes a stack frame (activation record) and each return pops it.
High address
┌─────────────────────────┐
│ main() frame │ ← bottom of call stack
├─────────────────────────┤
│ processOrder() frame │
│ - return address │
│ - saved rbp │
│ - local: orderId │
│ - local: total │
├─────────────────────────┤
│ validatePayment() │ ← current frame, rsp points here
│ - return address │
│ - args: amount │
│ - local: result │
└─────────────────────────┘ ← rsp (stack pointer register)
Low address
Stack allocation is O(1): just decrement the stack pointer register (rsp on x86-64). Deallocation is O(1): increment rsp (the frame is “popped” but not zeroed — it’s simply below the current stack pointer). No garbage collector, no allocator, no heap bookkeeping.
Downsides of stack:
- Fixed, small size. Stack overflow = crash (
SIGSEGVfrom guard page) - LIFO only — you can’t free arbitrary stack variables
- Cannot return pointers to stack-allocated data (dangling pointer on return)
- Not shareable between threads (each thread has its own stack)
The Heap
The heap is a region of memory managed by the allocator. Objects persist until explicitly freed (C/C++) or until the GC collects them. The heap can grow (via mmap / sbrk) as large as virtual address space allows.
Stack Frames and the Calling Convention
The System V AMD64 ABI (used on Linux/macOS for x86-64) defines exactly how functions pass arguments and return values:
Argument registers: rdi, rsi, rdx, rcx, r8, r9 (first 6 integer args)
Return value: rax (integer), xmm0 (float/double)
Callee-saved: rbx, rbp, r12-r15 (callee must preserve these)
Caller-saved: rax, rcx, rdx, rdi, rsi, r8-r11 (callee may clobber)
// C function
long add(long a, long b, long c) {
long result = a + b + c;
return result;
}
; Compiled to (approximately):
add:
; rdi=a, rsi=b, rdx=c (argument registers per SysV ABI)
lea rax, [rdi + rsi] ; rax = a + b
add rax, rdx ; rax += c
ret ; return rax
; No stack frame needed! All in registers.
When the call stack does need to spill to memory:
functionWithLocals:
push rbp ; save caller's base pointer
mov rbp, rsp ; set up our base pointer
sub rsp, 32 ; allocate 32 bytes for locals
; [rbp-8] = local var1
; [rbp-16] = local var2
; ...
mov rsp, rbp ; restore stack pointer (free locals)
pop rbp ; restore caller's base pointer
ret
Red zone (System V AMD64): The 128 bytes below rsp are reserved for temporary storage without adjusting rsp. Signal handlers must not use this area. Leaf functions (no calls) can use it for locals without a sub rsp, N instruction.
Heap Allocators: Internals
Bump allocator: Maintain a single pointer into a large pre-allocated region. Allocation = increment pointer. No deallocation (or dealloc all at once). O(1) allocation, zero fragmentation, but no individual frees.
// Bump allocator (arena)
typedef struct { char* ptr; char* end; } Arena;
void* arena_alloc(Arena* a, size_t size) {
size = (size + 7) & ~7; // align to 8 bytes
if (a->ptr + size > a->end) return NULL; // OOM
void* result = a->ptr;
a->ptr += size;
return result;
}
// Reset all allocations at once: a->ptr = a->start;
Free list allocator: Maintains a linked list of freed blocks. malloc searches the free list for a block large enough (first-fit, best-fit, or segregated lists). free inserts the block back.
Free list:
[block 64B] → [block 128B] → [block 32B] → NULL
malloc(80): split block 128B into 80B (used) + 48B (free)
free list: [block 64B] → [block 48B] → [block 32B] → NULL
Slab allocator: For fixed-size objects (common in OS kernels). Pre-allocated pools of same-sized objects. O(1) allocation and deallocation, no fragmentation for the fixed size. Used in Linux kernel’s kmem_cache, and for small objects in jemalloc.
Buddy allocator: Splits memory into power-of-2 sized blocks. Allocations are rounded up to the nearest power of 2. Merging (coalescing) free buddies is efficient. Used in Linux’s page allocator.
Manual Memory Management in C/C++
C gives you complete control and complete responsibility:
#include <stdlib.h>
#include <string.h>
// malloc: allocate uninitialized memory
int* arr = malloc(100 * sizeof(int));
if (arr == NULL) { /* OOM — always check! */ }
// calloc: allocate zero-initialized memory
int* arr2 = calloc(100, sizeof(int));
// realloc: resize allocation (may move to new address)
arr = realloc(arr, 200 * sizeof(int));
// free: return memory to allocator
free(arr);
arr = NULL; // CRITICAL: set to NULL to avoid use-after-free
// Common bugs:
// 1. Use-after-free: access arr after free → undefined behavior (often crash or security hole)
free(arr);
arr[0] = 42; // UNDEFINED BEHAVIOR
// 2. Double-free: free the same pointer twice → heap corruption
free(arr);
free(arr); // UNDEFINED BEHAVIOR — corrupts allocator's free list
// 3. Buffer overflow: write beyond allocated bounds → corrupts adjacent data
int* buf = malloc(10 * sizeof(int));
buf[10] = 42; // off-by-one → UNDEFINED BEHAVIOR
// 4. Memory leak: lose the pointer without freeing
void* p = malloc(1024);
p = malloc(512); // leaked the first 1024 bytes — no reference to free it
AddressSanitizer (ASan): Compiler instrumentation that detects use-after-free, buffer overflow, double-free at runtime with ~2x slowdown. Essential for C/C++ development.
gcc -fsanitize=address -g -O1 program.c && ./a.out
# Detects: heap-buffer-overflow, stack-buffer-overflow, use-after-free, memory leaks
Smart Pointers in C++
C++11 smart pointers eliminate manual memory management via RAII (Resource Acquisition Is Initialization):
#include <memory>
#include <vector>
// unique_ptr: single owner, move-only, zero overhead vs raw pointer
{
auto p = std::make_unique<std::vector<int>>(100);
// p owns the vector. When p goes out of scope: automatic delete.
auto p2 = std::move(p); // transfer ownership — p is now null
} // p2 destructs here → delete called automatically
// shared_ptr: shared ownership via reference counting
// Header: 16 bytes (ptr + pointer to control block)
// Control block: strong count + weak count + deleter + allocator
auto sp1 = std::make_shared<int>(42);
auto sp2 = sp1; // ref count: 2
{
auto sp3 = sp1; // ref count: 3
} // sp3 out of scope → ref count: 2, no delete yet
sp2.reset(); // ref count: 1
// sp1 destructs → ref count: 0 → delete
// weak_ptr: non-owning reference — avoids cycles
struct Node {
std::shared_ptr<Node> next;
std::weak_ptr<Node> prev; // weak_ptr breaks the cycle
};
// BAD: cycle → memory leak
// A→B→A (both shared_ptrs) → ref count never reaches 0
auto a = std::make_shared<Node>();
auto b = std::make_shared<Node>();
a->next = b;
b->next = a; // ref count cycle — LEAK
// GOOD: weak_ptr for back-reference
b->prev = a; // weak — doesn't increment ref count
unique_ptr compiles to identical assembly as raw pointers. Zero overhead — it is a compile-time abstraction only.
Reference Counting
Python, Swift, and Rust’s Arc<T> (in std::sync) use automatic reference counting (ARC): increment on copy, decrement on drop, free when count reaches 0.
Advantages: Deterministic destruction (object frees immediately when last reference drops), composable with non-GC code, no stop-the-world pauses.
Disadvantages: Atomic increments/decrements on every copy are expensive in multithreaded code. Cycle problem: if A holds a reference to B and B holds a reference to A, both counts stay at 1 forever — a memory leak.
Cycle detection in Python: Python uses a secondary generational GC (alongside ARC) that periodically scans for reference cycles. The GC runs when the number of allocated objects exceeds a threshold per generation.
import gc
# Create a cycle
a = {}
b = {'ref': a}
a['ref'] = b
del a, b # ref count doesn't reach 0 — cycle!
gc.collect() # explicitly trigger cycle collection
print(gc.get_count()) # (gen0, gen1, gen2) object counts
Garbage Collection: Mark and Sweep
The canonical GC algorithm:
Phase 1 — Mark: Starting from roots (stack variables, global variables, CPU registers), follow all object references recursively, marking each reachable object.
Phase 2 — Sweep: Scan the entire heap. Reclaim any unmarked (unreachable) object.
Roots: [A, D]
Heap: A → B → C
D → E
F (unreachable)
Mark phase: A marked, B marked (from A), C marked (from B)
D marked, E marked (from D)
F: never reached → not marked
Sweep phase: F is collected (freed)
Tri-color marking (incremental GC): The basic mark-sweep requires stopping the world for the entire mark phase. Tri-color marking makes it incremental:
- White: Not yet visited (initially all objects)
- Grey: Reachable but children not yet scanned
- Black: Reachable and all children scanned
The GC processes grey objects one at a time, moving them to black. Between GC steps, the mutator (application) runs. A write barrier ensures that if a black object gets a new pointer to a white object, the white object is re-greyed.
JVM Garbage Collectors
The JVM’s generational hypothesis: most objects die young. Divide the heap into generations:
JVM Heap (e.g., 4GB):
┌─────────────────────────────────────────────────────────────────┐
│ Young Generation (1GB) │ Old Generation │
│ ┌──────────────┬────────────┬──────────┐ │ (3GB) │
│ │ Eden (800MB)│ Survivor S0│Survivor S1│ │ (long-lived objs) │
│ └──────────────┴────────────┴──────────┘ │ │
└─────────────────────────────────────────────────────────────────┘
Minor GC (Young Gen): New objects allocated in Eden. When Eden fills, minor GC runs: mark live objects, copy them to Survivor S0, clear Eden. Objects that survive N minor GCs are promoted to Old Gen. Minor GC is fast (10-50ms) because young gen is small.
Major/Full GC (Old Gen): When Old Gen fills. Must collect the entire heap. Pause times historically hundreds of milliseconds to seconds.
Modern JVM GC algorithms:
| Algorithm | Stop-The-World | Throughput | Latency | Use case |
|---|---|---|---|---|
| G1 (default JDK 9+) | Short incremental | High | <200ms | General purpose |
| ZGC (JDK 15+) | Root scan only | Medium | <10ms | Low latency |
| Shenandoah (JDK 12+) | Concurrent compaction | Medium | <10ms | Low latency |
| Serial | Full stop | Highest | Seconds | Single-core/small heap |
GC tuning flags:
java -XX:+UseZGC \
-Xms4g -Xmx4g \ # fixed heap prevents resize GCs
-XX:MaxGCPauseMillis=10 \
-XX:+PrintGCDetails \
-Xlog:gc*:file=gc.log \
-jar app.jar
V8 JavaScript Engine Memory Management
V8 (Node.js, Chrome) uses the Orinoco garbage collector — a generational, mostly-concurrent GC:
Young generation (Scavenger): New objects allocated in semi-space (typically 1-8MB). Scavenge GC is a Cheney’s copying collector: live objects copied to the other semi-space; dead objects abandoned. Scavenge runs frequently (~100ms intervals) and takes <5ms.
Old generation (Major GC): Long-lived objects promoted from young gen. Uses tri-color concurrent marking (runs alongside application) with a final stop-the-world atomic finishing step. Compaction moves objects to reduce fragmentation.
Hidden classes (Shapes): V8’s key optimization for JavaScript objects. When you create an object like \{x: 1\}, V8 creates a hidden class C0 with property x. Adding \{x: 1, y: 2\} creates class C1 extending C0. Objects with the same shape share the hidden class, enabling array-like storage and inline cache hits.
// GOOD: consistent object shape → single hidden class → JIT-optimized
function Point(x, y) { this.x = x; this.y = y; }
const points = Array.from({length: 1e6}, (_, i) => new Point(i, i));
// BAD: different shapes → hidden class transitions → deoptimization
const objs = [];
objs.push({x: 1});
objs.push({x: 1, y: 2}); // different shape!
objs.push({x: 1, y: 2, z: 3}); // another shape!
// V8 falls back to dictionary mode for these — 5-10x slower property access
Pointer compression (V8 v8.0+): V8 uses 32-bit compressed pointers within a 4GB heap cage, even on 64-bit systems. This halves pointer memory usage and improves cache density. The full 64-bit address is reconstructed as base + offset.
Go’s Garbage Collector
Go uses a concurrent tri-color mark-and-sweep GC with the goal of <1ms pause times:
GC cycle:
1. STW (stop-the-world): mark roots, enable write barriers (microseconds)
2. Concurrent mark: goroutines run while GC marks concurrently
3. STW: terminate marking, rescan recently-modified grey objects (microseconds)
4. Concurrent sweep: free unmarked objects while program runs
Write barrier: Go uses a hybrid barrier (Dijkstra + Yuasa). When a goroutine modifies a pointer during concurrent marking, the barrier ensures the GC sees the updated reference graph.
GC tuning:
import "runtime"
import "runtime/debug"
// GOGC env var: set GC trigger as % of heap growth
// GOGC=100 (default): trigger GC when heap doubles
// GOGC=off: disable GC (careful — memory grows unbounded)
// GOGC=50: trigger more aggressively (lower latency, higher CPU)
debug.SetGCPercent(200) // Equivalent to GOGC=200
runtime.GC() // Force immediate GC cycle
// GOMEMLIMIT (Go 1.19+): hard memory limit — prevents OOM
debug.SetMemoryLimit(512 * 1024 * 1024) // 512MB limit
Escape analysis: Go determines at compile time whether a variable can live on the stack or must be heap-allocated (“escapes to heap”). Variables that escape are slower (heap allocation + GC pressure).
// Does NOT escape: lives on stack, no GC overhead
func sum(a, b int) int {
result := a + b // result on stack
return result
}
// Escapes to heap: pointer returned to caller outlives the stack frame
func newInt(v int) *int {
return &v // v must escape — its address lives beyond the function
}
// go build -gcflags='-m' shows escape analysis decisions
Rust: Ownership and Zero-Cost Memory Safety
Rust’s ownership system enforces memory safety at compile time with no runtime overhead:
The three rules of ownership:
- Each value has exactly one owner
- When the owner goes out of scope, the value is dropped (memory freed)
- There can be either multiple immutable borrows OR one mutable borrow at a time — never both
fn main() {
// Stack allocation — freed when out of scope
let x: i64 = 42; // 8 bytes on stack
// Heap allocation via Box<T> (like unique_ptr)
let heap_val: Box<i64> = Box::new(42); // 8 bytes on heap
// When heap_val drops: Box calls free() — no GC needed
// Ownership transfer (move semantics)
let s1 = String::from("hello"); // String owns its heap buffer
let s2 = s1; // s1 MOVED to s2 — s1 is invalid!
// println!("{}", s1); // COMPILE ERROR: use of moved value
// Borrowing: immutable reference — no ownership transfer
let len = calculate_length(&s2); // &s2 = borrow
// Mutable borrow: exclusive access
let mut s = String::from("hello");
let r = &mut s;
r.push_str(", world");
// Cannot have any other borrow of s while r is alive
}
fn calculate_length(s: &String) -> usize {
s.len()
// s goes out of scope but doesn't drop: we only borrowed it
}
Lifetimes: Rust’s borrow checker tracks how long references are valid. Lifetime annotations ('a) are required when the compiler cannot infer relationships:
// This function signature says: the returned reference lives as long as
// the shortest-lived input reference
fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
if x.len() > y.len() { x } else { y }
}
Rust’s memory model compiles to the same machine code as carefully-written C++ with unique_ptr. Zero runtime overhead. No GC. No dangling pointers. No data races (enforced by Send/Sync traits).
Arena Allocators for Request-Scoped Allocation
For request-scoped work (handling an HTTP request, processing a message), arena allocation is dramatically faster than general-purpose heap:
// Using the 'bumpalo' crate for arena allocation
use bumpalo::Bump;
fn handle_request(request: &Request) -> Response {
let arena = Bump::new(); // allocate arena
// All allocations in this request use the arena
let user: &User = arena.alloc(fetch_user(request.user_id));
let items: &[Item] = arena.alloc_slice_copy(&fetch_cart(request.cart_id));
let response = build_response(&arena, user, items);
response // the response is returned (or serialized to bytes before arena drops)
// arena drops here → all allocations freed in one operation
}
// No individual frees. No GC. The entire arena's memory released at once.
Arena allocation is used in:
- Web servers: one arena per request (used in Nginx, Envoy)
- Compilers: one arena per compilation unit (LLVM uses this extensively)
- Game engines: one arena per game frame (cleared at frame boundary)
- Databases: one arena per query execution
Virtual Memory and mmap
Virtual memory allows processes to use more address space than physical RAM. The OS maintains a page table mapping virtual addresses to physical pages. Pages not in RAM are swapped to disk.
mmap is the system call that maps files or anonymous memory into the virtual address space:
#include <sys/mman.h>
// File-backed mmap: map a file directly into virtual address space
// Reading/writing the memory region reads/writes the file
int fd = open("data.bin", O_RDWR);
size_t len = lseek(fd, 0, SEEK_END);
void* data = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
// Anonymous mmap: allocate memory not backed by a file
// (this is what malloc uses internally for large allocations)
void* buf = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
munmap(buf, 1024*1024); // release mapping
Database buffer pool vs mmap: SQLite uses mmap to let the OS page cache act as a buffer pool. PostgreSQL implements its own buffer pool (shared_buffers) and does not mmap data files, for these reasons:
- PostgreSQL controls eviction policy (clock sweep) — more predictable than OS LRU
- PostgreSQL can write WAL before evicting dirty pages —
mmapmakes this hard mmapfault handling on access is unpredictable for latency-sensitive workloads- PostgreSQL can use
O_DIRECTto bypass OS page cache entirely for predictable behavior
Node.js Memory Management
// Buffer vs ArrayBuffer vs TypedArray
// Buffer: Node.js-specific, wraps ArrayBuffer, adds utility methods
const buf = Buffer.alloc(1024); // heap, zero-filled
const buf2 = Buffer.allocUnsafe(1024); // heap, uninitialized (faster)
const buf3 = Buffer.from([1, 2, 3]);
// ArrayBuffer: raw binary data, not directly accessible
const ab = new ArrayBuffer(16);
// TypedArray: view over ArrayBuffer
const view = new Float64Array(ab); // 2 float64s (16 bytes)
// Detecting memory leaks in Node.js
const used = process.memoryUsage();
console.log({
rss: used.rss, // total process memory (includes code, stack)
heapTotal: used.heapTotal,// total heap allocated by V8
heapUsed: used.heapUsed, // heap actually used by live objects
external: used.external, // C++ objects bound to V8 (Buffers)
arrayBuffers: used.arrayBuffers,
});
// Common memory leak patterns in Node.js:
// 1. Event listeners not removed → closures hold references
// Fix: emitter.removeListener() or emitter.once()
// 2. Growing caches without eviction → Map/Set grows without bound
// Fix: LRU cache with size limit
// 3. Global variables accumulating state
// Fix: scope carefully, use WeakMap for object-keyed caches
// 4. Promise rejection not handled → promise chain holds reference
// Fix: always handle rejections, use async/await with try-catch
Heap snapshots for leak detection:
node --inspect app.js
# In Chrome DevTools → Memory → Take Heap Snapshot
# Compare two snapshots: objects that grew between them are suspects
# Or programmatically:
v8.writeHeapSnapshot(); # writes heapdump-<pid>-<timestamp>.heapsnapshot
--max-old-space-size: Sets the maximum old generation heap size in MB.
node --max-old-space-size=4096 server.js # 4GB heap limit
Interview: How Does a Garbage Collector Work?
The answer that demonstrates real depth:
Start with the core problem: Without automatic memory management, programs must manually free every allocation. Mistakes cause use-after-free (security vulnerabilities), double-free (heap corruption), or leaks. GC automates collection at the cost of runtime overhead.
Explain the reachability model: A GC’s fundamental insight — if an object is unreachable from any root (stack, globals, CPU registers), the program can never access it. It is garbage by definition. GC identifies and reclaims unreachable objects.
Mark-and-sweep: In two phases: mark reachable objects (DFS/BFS from roots), sweep the heap to free unmarked objects. Simple but requires stop-the-world pauses.
Generational hypothesis: Most objects die young. Young gen uses a fast copying collector (Scavenge/Cheney). Old gen uses a slower but more thorough collector. This makes GC fast in the common case.
Concurrent and incremental GCs: Modern GCs (ZGC, Shenandoah, Go’s GC) do most work concurrently while the application runs. Write barriers ensure consistency when the mutator modifies the object graph during GC.
”What is a memory leak in Node.js?”
A memory leak in Node.js (or any GC’d language) is when an object remains reachable but is never used again. The GC cannot collect it because it looks live. Common patterns:
- Unbounded caches:
const cache = new Map()that only grows, never evicts - Event listener accumulation:
emitter.on('data', handler)called repeatedly withoutremoveListener - Closure captures: a long-lived closure captures a large object in its scope
- Timers/intervals not cleared:
setIntervalcallback holds references to outer scope
Diagnosis: Use node --inspect + Chrome DevTools heap snapshots, or clinic.js heap to track allocations over time. Look for objects whose count grows monotonically with traffic.