JIT Compilation & VM Runtimes

The Fundamental Problem: Fast Execution of High-Level Code

Programming languages must bridge an enormous semantic gap: human-readable abstractions like map, Promise, or a Python list comprehension must ultimately become sequences of load/store/arithmetic instructions that a CPU can execute. How you cross that gap determines your language’s startup latency, peak throughput, memory footprint, and operational complexity.

There are three canonical strategies, each with distinct tradeoffs:

Strategy          | Startup               | Peak Throughput      | Portability | Example
------------------|-----------------------|----------------------|-------------|--------------------
Pure Interpreter  | Fast                  | Low (5–100× slower)  | High        | CPython, early Ruby
AOT Compiler      | Fastest (slow build)  | Highest              | Low         | C/C++, Rust, Go
JIT Compiler      | Medium (needs warmup) | Near-native          | High        | V8, HotSpot, PyPy

A pure interpreter reads the source (or an intermediate bytecode) and executes it directly, dispatching each operation through a C switch statement or a computed-goto table. Startup is instant, but the interpreter overhead — fetch, decode, dispatch — repeats for every operation, every loop iteration.
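
To make the fetch, decode, dispatch cycle concrete, here is a minimal sketch of a switch-based bytecode interpreter. The opcodes and the stack-machine design are hypothetical and chosen for brevity; they do not correspond to any real VM's instruction set.

```javascript
// Hypothetical opcodes for a toy stack machine (not a real VM's instruction set).
const OP = { PUSH: 0, ADD: 1, MUL: 2, HALT: 3 };

function interpret(code) {
  const stack = [];
  let pc = 0; // program counter: index of the next instruction
  for (;;) {
    const op = code[pc++];        // fetch
    switch (op) {                 // decode + dispatch
      case OP.PUSH:
        stack.push(code[pc++]);   // the operand follows the opcode
        break;
      case OP.ADD: {
        const b = stack.pop(), a = stack.pop();
        stack.push(a + b);
        break;
      }
      case OP.MUL: {
        const b = stack.pop(), a = stack.pop();
        stack.push(a * b);
        break;
      }
      case OP.HALT:
        return stack.pop();
      default:
        throw new Error(`unknown opcode ${op} at pc=${pc - 1}`);
    }
  }
}

// (2 + 3) * 4
const program = [OP.PUSH, 2, OP.PUSH, 3, OP.ADD, OP.PUSH, 4, OP.MUL, OP.HALT];
console.log(interpret(program)); // 20
```

Note that the loop and the switch run once per instruction, even for a multiply that costs a single machine instruction. That repeated bookkeeping is the overhead a JIT removes.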

An AOT (Ahead-Of-Time) compiler (like rustc or gcc) does all analysis at build time: type inference, register allocation, loop vectorisation. The resulting binary runs at near-hardware speed with zero runtime overhead. The tradeoff: you pay compile time up front, you lose runtime type feedback, and the binary is architecture-specific.

A JIT (Just-In-Time) compiler is the pragmatic middle ground. It starts interpreting (fast startup), profiles execution to identify hot code (code run frequently), and compiles that hot code to native machine code while the program runs. It can also exploit runtime type information — information unavailable at AOT compile time — to produce code that’s sometimes faster than equivalent C.
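
The tier-up idea can be sketched in a few lines. This is a toy model under stated assumptions: the postfix "bytecode", the makeTiered helper, and the threshold of 3 are all hypothetical, and the "compiler" emits JavaScript source for the host engine rather than machine code.

```javascript
// Toy "bytecode": a postfix program over the call arguments. Computes (a + b) * 2.
const program = [["arg", 0], ["arg", 1], ["add"], ["const", 2], ["mul"]];

// Tier 0: walk the program on every call.
function interpret(prog, args) {
  const stack = [];
  for (const [op, x] of prog) {
    if (op === "arg") stack.push(args[x]);
    else if (op === "const") stack.push(x);
    else {
      const b = stack.pop(), a = stack.pop();
      stack.push(op === "add" ? a + b : a * b);
    }
  }
  return stack.pop();
}

// Tier 1: translate the program once into JavaScript source,
// then let the host engine compile that source.
function compile(prog) {
  const stack = [];
  for (const [op, x] of prog) {
    if (op === "arg") stack.push(`args[${x}]`);
    else if (op === "const") stack.push(String(x));
    else {
      const b = stack.pop(), a = stack.pop();
      stack.push(`(${a} ${op === "add" ? "+" : "*"} ${b})`);
    }
  }
  return new Function("args", `return ${stack.pop()};`);
}

function makeTiered(prog, hotThreshold = 3) {
  let calls = 0;
  let compiled = null;
  const fn = (...args) => {
    if (compiled) return compiled(args);                    // hot: no dispatch loop
    if (++calls >= hotThreshold) compiled = compile(prog);  // tier up for later calls
    return interpret(prog, args);
  };
  fn.isCompiled = () => compiled !== null;
  return fn;
}

const f = makeTiered(program);
console.log(f(1, 2), f.isCompiled()); // 6 false
f(1, 2); f(1, 2);                     // third call crosses the threshold
console.log(f(3, 4), f.isCompiled()); // 14 true
```

A real JIT counts loop back-edges as well as calls, compiles on a background thread, and keeps the interpreter available as a fallback.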


Bytecode: The Portable Intermediate Layer

Most modern runtimes do not JIT-compile source code directly. They first compile to bytecode — a compact, architecture-neutral instruction set for a virtual machine. Bytecode serves several critical purposes:

1. Portability. JVM .class files run identically on x86-64, ARM, RISC-V, and s390x. The JIT backend handles the architecture-specific translation.

2. Optimisation surface. Bytecode is easier to analyse than raw source text. Control-flow graphs, def-use chains, and loop detection are straightforward to extract from a well-designed bytecode.

3. Startup caching. Bytecode can be serialised to disk. V8’s code cache stores compiled bytecode so require() calls on warm processes skip re-parsing. The JVM loads .class files directly.

4. Security validation. The JVM bytecode verifier statically checks type safety before execution: the operand stack cannot overflow or underflow, values cannot be used at the wrong type, and no method can be called with the wrong receiver type. (Array bounds are still checked at runtime.) This gives a hard security boundary without having to trust, or re-parse, the original source on every load.

                  ┌────────────────────────────────────────────────┐
                  │        Source Code (.js / .java / .py)         │
                  └──────────────────────┬─────────────────────────┘
                                         │ Parse + Compile
                                         ▼
                  ┌────────────────────────────────────────────────┐
                  │      Bytecode / IR (.class / Ignition BC)      │
                  └──────────────────────┬─────────────────────────┘
                                         │ Interpret (fast start)
                                         ▼
                  ┌────────────────────────────────────────────────┐
                  │   Profiling: count invocations, track types    │
                  └──────────────────────┬─────────────────────────┘
                                         │ Hot? → JIT compile
                                         ▼
                  ┌────────────────────────────────────────────────┐
                  │      Native Machine Code (x86-64 / ARM64)      │
                  └────────────────────────────────────────────────┘

V8: JavaScript’s Optimising Runtime

V8 (used in Chrome, Node.js, and Deno) has undergone several architectural generations. The current pipeline is a four-tier system.

Tier 1 — Ignition (Bytecode Interpreter)

When V8 parses a JavaScript function, it immediately compiles the AST to Ignition bytecode — a register-based instruction set with ~200 opcodes. Example:

function add(a, b) { return a + b; }

Compiles roughly to:

// Ignition bytecode (conceptual)
Ldar a          // Load accumulator ← register a
Add  b, [0]     // accumulator += register b  (slot [0] = feedback vector entry)
Return          // return accumulator

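Bytecodes whose behaviour depends on runtime types (arithmetic, property loads and stores, calls) carry an index into the function’s feedback vector. As Ignition executes, it records the types and object shapes it sees into these slots. This type feedback drives the upper tiers.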
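
A rough model of how a feedback slot for a + site might evolve is sketched below. The state names ("smi", "number", "string", "any"), the FeedbackVector class, and the explicit feedback parameter are hypothetical simplifications, not V8's actual data structures.

```javascript
// Classify a value into a coarse feedback kind (simplified; ignores -0, BigInt, etc.).
function classify(v) {
  if (Number.isInteger(v) && v >= -(2 ** 30) && v < 2 ** 30) return "smi";
  if (typeof v === "number") return "number";
  if (typeof v === "string") return "string";
  return "any";
}

// Merging only ever generalises: smi -> number -> any. It never narrows again.
function merge(seen, kind) {
  if (seen === "none" || seen === kind) return kind;
  const numeric = new Set(["smi", "number"]);
  if (numeric.has(seen) && numeric.has(kind)) return "number";
  return "any";
}

class FeedbackVector {
  constructor(slotCount) {
    this.slots = new Array(slotCount).fill("none");
  }
  recordBinaryOp(slot, lhs, rhs) {
    const observed = merge(classify(lhs), classify(rhs));
    this.slots[slot] = merge(this.slots[slot], observed);
  }
}

// The interpreter records into slot 0 every time this `+` executes.
function add(a, b, feedback) {
  feedback.recordBinaryOp(0, a, b);
  return a + b;
}

const fv = new FeedbackVector(1);
add(1, 2, fv);     console.log(fv.slots[0]); // smi
add(1.5, 2, fv);   console.log(fv.slots[0]); // number
add("a", "b", fv); console.log(fv.slots[0]); // any
```

An optimising tier reading "smi" from this slot can emit an integer add with a guard; once the slot says "any", it has to emit the fully generic operation.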

Tier 2 — Sparkplug (Baseline JIT)

Sparkplug is a non-optimising, single-pass compiler that translates Ignition bytecode to native code with essentially no analysis. It exists purely to eliminate interpreter dispatch overhead: it compiles very quickly, and the code it emits is modestly faster than interpreting the same bytecode (the exact gain is workload-dependent).

The trick: Sparkplug mirrors Ignition’s stack-frame and register layout exactly and keeps using the same feedback vector slots. Because Sparkplug makes no speculative assumptions, its code never needs to deoptimise, and switching a running function between the interpreter and baseline code requires almost no state translation.

Tier 3 — Maglev (Mid-Tier Optimising JIT)

Shipped in Chrome 117 and enabled by default in Node.js 22, Maglev is an SSA-based compiler built on a conventional control-flow-graph IR (deliberately not sea-of-nodes) that sits between Sparkplug and TurboFan. It performs type specialisation based on feedback (e.g., treating + as integer add when both operands have always been Smis), eliminates redundant type checks, and inlines small callees, all in far less compilation time than TurboFan.

Maglev targets functions that are hot but not extremely hot: it produces code substantially faster than Sparkplug’s while compiling roughly an order of magnitude faster than TurboFan.

Tier 4 — TurboFan (Optimising JIT)

TurboFan is V8’s full optimising compiler, triggered for functions with high invocation counts and stable type profiles. It builds a sea-of-nodes graph (pioneered by Cliff Click’s JVM work) where nodes are operations and edges encode data and control dependencies simultaneously. This unified representation allows aggressive global optimisations.

Key TurboFan optimisations:

  • Speculative type specialisation: if feedback says x is always a 31-bit integer (a V8 “Smi”), TurboFan emits a direct integer add with a deopt guard instead of a polymorphic + that handles strings, BigInts, and floats.
  • Inlining: callees are merged into the caller’s graph, enabling constant folding across function boundaries.
  • Escape analysis: short-lived objects that don’t escape a function are replaced by their individual fields (scalar replacement), so no heap allocation happens at all.
  • Loop peeling and unrolling: loop-invariant checks are hoisted out of loops.

                  Function called > ~1000 times with stable types
                                         │
                  ┌──────────────────────▼──────────────────────────┐
                  │  TurboFan: Build sea-of-nodes graph from BC      │
                  │  → Type specialisation (guard + fast path)       │
                  │  → Inlining, escape analysis, LICM               │
                  │  → Register allocation (linear scan)             │
                  │  → Native code emission                          │
                  └─────────────────────────────────────────────────┘
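
The "guard + fast path" pattern can be sketched as follows. This is only an illustration of the control flow: makeSpeculativeAdd and the onDeopt callback are hypothetical names, and a real engine swaps machine code rather than flipping a boolean.

```javascript
// Generic path: handles every case JavaScript's `+` allows.
function genericAdd(a, b) {
  return a + b;
}

// Simplified Smi test: a 31-bit signed integer.
const isSmi = (v) => Number.isInteger(v) && v >= -(2 ** 30) && v < 2 ** 30;

// "Optimised" code that is valid only while the speculation
// "both operands are Smis" holds.
function makeSpeculativeAdd(onDeopt) {
  let optimised = true;
  return function add(a, b) {
    if (optimised) {
      if (isSmi(a) && isSmi(b)) return a + b; // guard passed: specialised fast path
      optimised = false;                      // guard failed: throw the speculation away
      onDeopt(a, b);
    }
    return genericAdd(a, b);                  // deoptimised: generic path from now on
  };
}

const deopts = [];
const add = makeSpeculativeAdd((a, b) => deopts.push([a, b]));
console.log(add(1, 2));     // 3
console.log(add("x", 1));   // x1  (guard fails, deopt happens once)
console.log(add(2, 3));     // 5   (still correct, but via the generic path)
console.log(deopts.length); // 1
```

The important property is that a failed guard never produces a wrong answer. It only costs speed, because the generic path is always available as a fallback.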

Hidden Classes and Inline Caches

This is the single most important performance concept for JavaScript developers.

The Problem with Dynamic Property Access

In C++, obj.x compiles to a fixed offset load: mov eax, [rbx + 8]. The compiler knows at compile time where x lives. In JavaScript, obj.x could, in principle, require a hash-table lookup on every access — catastrophically slow.

V8 solves this with hidden classes (also called “maps” internally, or “shapes” in SpiderMonkey).

Hidden Classes

When you write const p = { x: 1, y: 2 }, V8 doesn’t create a hash map. Instead, it creates a hidden class (HC0) that describes the object’s layout:

HC0: { x → offset 0, y → offset 4 }
     Object slots: [1, 2]

If you then add p.z = 3, V8 transitions to a new hidden class:

HC1: { x → offset 0, y → offset 4, z → offset 8 }
     Object slots: [1, 2, 3]

Critically, all objects created with { x, y } in the same code path share HC0. The hidden class is a pointer stored in the object header: a single pointer-sized field that encodes the entire property layout.
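
The transition mechanism can be modelled with a small tree of layouts. The HiddenClass and ToyObject classes below are hypothetical teaching aids and assume nothing about V8's real Map objects beyond what the text describes.

```javascript
class HiddenClass {
  constructor(offsets = new Map()) {
    this.offsets = offsets;        // property name -> slot index
    this.transitions = new Map();  // property name -> next HiddenClass
  }
  // Adding a property follows (or creates) a transition edge, so objects that
  // add the same properties in the same order end up sharing one hidden class.
  withProperty(name) {
    let next = this.transitions.get(name);
    if (!next) {
      const offsets = new Map(this.offsets);
      offsets.set(name, offsets.size);
      next = new HiddenClass(offsets);
      this.transitions.set(name, next);
    }
    return next;
  }
}

const ROOT = new HiddenClass(); // the empty layout every object starts from

class ToyObject {
  constructor() {
    this.hiddenClass = ROOT;
    this.slots = [];
  }
  set(name, value) {
    if (!this.hiddenClass.offsets.has(name)) {
      this.hiddenClass = this.hiddenClass.withProperty(name);
    }
    this.slots[this.hiddenClass.offsets.get(name)] = value;
  }
  get(name) {
    const offset = this.hiddenClass.offsets.get(name);
    return offset === undefined ? undefined : this.slots[offset];
  }
}

const a = new ToyObject(); a.set("x", 1); a.set("y", 2);
const b = new ToyObject(); b.set("x", 5); b.set("y", 6);
const c = new ToyObject(); c.set("y", 0); c.set("x", 1); // different insertion order

console.log(a.hiddenClass === b.hiddenClass); // true
console.log(a.hiddenClass === c.hiddenClass); // false
```

Objects a and b walk the same transition path (x, then y) and therefore share a layout, while c takes a different path and gets its own, even though it has the same property names.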

Monomorphic, Polymorphic, Megamorphic ICs

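An Inline Cache (IC) is a small per-site cache attached to each property access that remembers the hidden class → offset mapping. (Classic implementations patch machine code in place; modern V8 keeps the cached data in the function’s feedback vector.)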

// First call: obj is an instance of HC0
// IC is "uninitialized" → generic lookup → update IC to: "if HC == HC0, load offset 0"

// Subsequent calls with HC0: hidden-class check + direct offset load
// A new object with HC1 arrives: IC becomes "polymorphic" and checks HC0 or HC1
// >4 different hidden classes: IC goes "megamorphic" and falls back to a hash lookup

IC State      | # Shapes Seen | Approx. Cost | Triggered By
--------------|---------------|--------------|----------------------------------
Uninitialized | 0             | Full lookup  | First execution
Monomorphic   | 1             | ~1 ns        | All calls see same shape
Polymorphic   | 2–4           | ~5–10 ns     | A few different shapes
Megamorphic   | >4            | ~50–100 ns   | Many different shapes
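
The state machine in the table can be sketched as a tiny cache object. The InlineCache class and the makeShape / makeObject helpers are hypothetical; the limit of 4 cached shapes simply mirrors the threshold in the table above.

```javascript
const MAX_POLYMORPHIC = 4; // mirrors the ">4 shapes" threshold in the table

class InlineCache {
  constructor(property) {
    this.property = property;
    this.entries = [];       // cached [shape, offset] pairs
    this.megamorphic = false;
  }
  get state() {
    if (this.megamorphic) return "megamorphic";
    if (this.entries.length === 0) return "uninitialized";
    return this.entries.length === 1 ? "monomorphic" : "polymorphic";
  }
  load(obj) {
    if (!this.megamorphic) {
      for (const [shape, offset] of this.entries) {
        if (obj.shape === shape) return obj.slots[offset]; // hit: one pointer compare
      }
    }
    // Miss (or megamorphic): fall back to the generic lookup.
    const offset = obj.shape.offsets.get(this.property);
    if (!this.megamorphic) {
      if (this.entries.length < MAX_POLYMORPHIC) {
        this.entries.push([obj.shape, offset]);
      } else {
        this.megamorphic = true; // too many shapes: stop caching at this site
        this.entries = [];
      }
    }
    return offset === undefined ? undefined : obj.slots[offset];
  }
}

// Hypothetical helpers: a shape maps property names to slot indices.
const makeShape = (names) => ({ offsets: new Map(names.map((n, i) => [n, i])) });
const makeObject = (shape, values) => ({ shape, slots: values });

const ic = new InlineCache("x");
const xy = makeShape(["x", "y"]);
console.log(ic.state);                                     // uninitialized
ic.load(makeObject(xy, [1, 2]));  console.log(ic.state);   // monomorphic
ic.load(makeObject(makeShape(["y", "x"]), [0, 1]));
console.log(ic.state);                                     // polymorphic
for (let i = 0; i < 4; i++) ic.load(makeObject(makeShape(["x", `p${i}`]), [i, 0]));
console.log(ic.state);                                     // megamorphic
```

Once a site goes megamorphic in this sketch it never recovers, which is why one badly behaved call site can stay slow for the lifetime of the function.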

What Destroys Hidden Class Optimisations

// ❌ WRONG: different property insertion orders → different hidden classes
function makePoint(flip) {
  if (flip) return { y: 0, x: 1 };   // HC_A: {y, x}
  else      return { x: 1, y: 0 };   // HC_B: {x, y}
}
// Any IC accessing these goes polymorphic.

// ❌ WRONG: delete causes hidden class transition to a "slow" mode
const p = { x: 1, y: 2 };
delete p.x;  // p falls back to "dictionary mode": a hash table, no IC benefit

// ✅ RIGHT: always add properties in the same order
class Point {
  constructor(x, y) { this.x = x; this.y = y; }
}
// Every Point instance shares the same hidden class.

// ✅ RIGHT: initialise all properties in the constructor
// even if the value is null/undefined — establishes the shape upfront

Deoptimisation: When the JIT Bets Wrong

TurboFan compiles speculative code based on observed types. If the speculation is violated, V8 must deoptimise — discard the compiled code and fall back to Ignition bytecode.

Deoptimisation Triggers

Type changes in hot functions:

function sum(arr) {
  let total = 0;
  for (let i = 0; i < arr.length; i++) total += arr[i];
  return total;
}
sum([1, 2, 3]);        // after many such calls, TurboFan compiles assuming Smi elements
sum([1, 2, "three"]);  // DEOPT: string encountered, execution falls back to bytecode

The arguments object: Leaking arguments out of a function, or mutating it in sloppy mode, can block inlining and other optimisations. Prefer rest parameters (...args).

try-catch in hot loops: V8’s older Crankshaft compiler could not optimise functions containing try-catch. TurboFan, the default since V8 5.9, can, but restructuring hot paths so that exceptions are not thrown routinely is still best practice.

eval: A direct eval can introduce new variables into the enclosing scope, which defeats static scope analysis. Code that uses it is optimised far less effectively, so keep eval out of hot functions.

Detecting Deoptimisations

# Run with deopt logging
node --trace-deopt server.js

# V8 profiler: generate CPU profile
node --prof server.js
node --prof-process isolate-*.log > profile.txt

# Or use clinic.js / 0x for flame graphs
npx 0x server.js

A deoptimisation loop — where TurboFan recompiles, hits the same deopt, and falls back repeatedly — is called a deopt storm and can crater performance by 10–100×.


HotSpot JVM: Tiered Compilation

The JVM’s JIT story predates V8’s by a decade and remains the gold standard for long-running server workloads.

Compilation Tiers

Tier 0: Interpreter (template interpreter — partly JIT-compiled itself)
Tier 1: C1 at level 1  (no profiling)
Tier 2: C1 at level 2  (limited profiling: invocation + backedge counts)
Tier 3: C1 at level 3  (full profiling: type profiles, branch profiles)
Tier 4: C2             (fully optimising — uses tier-3 profile data)

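With tiered compilation disabled, C2 compiles a method after ~10,000 invocations plus loop back-edges (configurable with -XX:CompileThreshold). With tiered compilation enabled, which is the default, promotion is instead controlled by per-tier flags such as -XX:Tier3InvocationThreshold and -XX:Tier4InvocationThreshold. Before reaching tier 4, the method runs in tier 3 (C1 with profiling) to accumulate a rich type profile, analogous to V8’s feedback vectors.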

C1 (Client Compiler): Fast compilation (~1ms), produces decent code. Focuses on inlining, null-check elimination, basic dead-code removal.

C2 (Server Compiler): Slow compilation (10–100ms), produces near-C++ quality code. Uses the sea-of-nodes IR (the same approach TurboFan later adopted). Key C2 optimisations:

  • Inlining: C2 aggressively inlines through deep call chains (the depth limit is -XX:MaxInlineLevel, which defaults to 15 since JDK 14 and was 9 before). This is how stream().filter().map().collect() can approach hand-written loop performance.
  • Escape Analysis: If a new object doesn’t escape the method, C2 eliminates the allocation by keeping its fields in registers or stack slots (scalar replacement).
  • Devirtualisation: Virtual and interface calls are comparatively expensive (a vtable or itable lookup, and they block inlining). If profiling shows only one concrete type at a call site, C2 emits a direct call with a guard, the same speculative trick V8 uses.
  • Loop Unrolling + Vectorisation: C2 can auto-vectorise simple loops using SSE/AVX instructions.
  • Lock Elision + Coarsening: If a lock object doesn’t escape the thread, synchronisation is eliminated entirely; adjacent synchronized regions on the same lock can be merged into one.

// C2 will devirtualise this if only one Comparator impl is ever passed:
list.sort(comparator);
// → if (comparator.getClass() == MyComparator.class) { direct, inlinable call } else { deoptimise or virtual call }

JVM Diagnostic Tools

# Print JIT compilation decisions
java -XX:+PrintCompilation -jar app.jar

# Print inlining decisions (a diagnostic flag, so it must be unlocked first)
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -jar app.jar

# JDK Flight Recorder — production-safe, <2% overhead
java -XX:StartFlightRecording=duration=60s,filename=recording.jfr -jar app.jar
jfr print --events jdk.Compilation recording.jfr

# async-profiler: allocation and CPU flame graphs without safepoint bias
# (the launcher is profiler.sh in async-profiler 2.x and asprof in 3.x)
./profiler.sh -d 30 -f flamegraph.html <pid>

GraalVM: Polyglot and Native Image

GraalVM is Oracle’s polyglot runtime that replaces C2 with a JIT compiler written entirely in Java.

Truffle: Language-Agnostic AST Interpreter

The Truffle framework lets language implementors write an AST interpreter in Java. Truffle then applies partial evaluation — it specialises the interpreter against a specific program’s AST, producing optimised machine code via the Graal JIT. Languages built on Truffle include GraalPy, FastR, TruffleRuby, and GraalJS.

The key insight: a Truffle interpreter + Graal JIT achieves performance competitive with language-specific JITs, with far less engineering effort.

Native Image: AOT for the JVM

native-image performs closed-world analysis at build time: it traces every reachable class, method, and field from the entry point, then compiles everything to a self-contained native binary using the Graal compiler.

native-image -jar myapp.jar myapp
./myapp  # starts in ~50ms with zero JVM warmup, <50MB RSS
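
The closed-world idea boils down to a reachability computation over the call graph. The sketch below uses a hypothetical, hand-written call graph; the real analysis also tracks types, fields, and class initialisers, and it discovers the graph from bytecode rather than taking it as input.

```javascript
// Hypothetical call graph: method name -> methods it may call.
const callGraph = new Map([
  ["Main.main", ["Service.handle", "Logger.info"]],
  ["Service.handle", ["Repo.find"]],
  ["Repo.find", []],
  ["Logger.info", []],
  ["Admin.resetAll", ["Repo.find"]], // never called from Main.main
  ["Plugin.load", []],               // only ever invoked via reflection
]);

// Worklist traversal: everything reachable from the entry points gets compiled,
// everything else is left out of the binary.
function reachableFrom(entryPoints, graph) {
  const seen = new Set(entryPoints);
  const worklist = [...entryPoints];
  while (worklist.length > 0) {
    const method = worklist.pop();
    for (const callee of graph.get(method) ?? []) {
      if (!seen.has(callee)) {
        seen.add(callee);
        worklist.push(callee);
      }
    }
  }
  return seen;
}

const compiled = reachableFrom(["Main.main"], callGraph);
console.log(compiled.has("Repo.find"));      // true
console.log(compiled.has("Admin.resetAll")); // false: dead code is dropped
console.log(compiled.has("Plugin.load"));    // false: invisible to static analysis

// Declaring the reflective target up front adds it as an extra root.
const withReflectionConfig = reachableFrom(["Main.main", "Plugin.load"], callGraph);
console.log(withReflectionConfig.has("Plugin.load")); // true
```

This is why reflection has to be declared: a method that is only looked up by name at runtime has no incoming edge in the static call graph, so the analysis would otherwise discard it.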

Tradeoffs:

  • ✅ Instant startup (~50ms vs ~500ms for JVM)
  • ✅ Low memory footprint (no JIT metadata, no bytecode)
  • ❌ No dynamic class loading; reflection must be declared up front
  • ❌ Peak throughput slightly lower than warmed-up HotSpot C2
  • ❌ Build time is slow (minutes for large apps)

Native image is the foundation of Quarkus and Micronaut’s “serverless-native” story.


CPython vs PyPy: Meta-Tracing JIT

CPython (the reference Python implementation) is an interpreter of Python bytecode. In the general case, attribute access goes through dictionary lookups (__dict__), the + operator dispatches through __add__, and every value is a boxed, reference-counted heap object. (Recent releases add a specialising adaptive interpreter that speeds up common cases, but the basic model is unchanged.) Typical overhead vs optimised C: 50–200×.

PyPy takes a radically different approach: rather than writing a JIT compiler for Python by hand, PyPy’s engineers wrote a Python interpreter in RPython (a restricted, statically-typed subset of Python), then applied a meta-tracing JIT generator to the interpreter itself.

Meta-tracing insight:
  A JIT that traces the *interpreter* interpreting *user code*
  produces specialised machine code for that specific user code path.
  You get a JIT "for free" from any interpreter written in RPython.

PyPy’s JIT:

  1. Identifies hot interpreter loops (loops in the user’s Python code).
  2. Records (traces) the sequence of interpreter operations for one iteration.
  3. Compiles the trace to native code with type guards.
  4. On a guard failure, falls back to the interpreter; if the same guard keeps failing, compiles an additional trace (a “bridge”) for the new path.
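
The record, guard, and side-exit cycle in the steps above can be sketched as follows. Everything here is a hypothetical miniature: the three-op loop body, the step and run functions, and the hot threshold of 2 are illustrative, the "compiled trace" is a list of closures rather than machine code, and no bridge is compiled after a guard failure.

```javascript
// One iteration of the toy loop:  item = items[i]; acc = acc + item; i = i + 1
const loopBody = [
  ["load", "item", "items", "i"],
  ["add", "acc", "acc", "item"],
  ["inc", "i"],
];

// Interpreter: executes one op. If `trace` is an array, it also records a
// specialised step that captures the types observed while recording.
function step(op, env, trace) {
  const [name, dst, a, b] = op;
  if (name === "load") {
    env[dst] = env[a][env[b]];
    trace?.push({ exec: (e) => { e[dst] = e[a][e[b]]; } });
  } else if (name === "inc") {
    env[dst] += 1;
    trace?.push({ exec: (e) => { e[dst] += 1; } });
  } else if (name === "add") {
    const ta = typeof env[a], tb = typeof env[b];
    env[dst] = env[a] + env[b];
    trace?.push({
      guard: (e) => typeof e[a] === ta && typeof e[b] === tb, // types seen while recording
      exec: (e) => { e[dst] = e[a] + e[b]; },
    });
  }
}

// Runs a recorded trace; returns -1 on success or the index of the failing guard.
function runTrace(trace, env) {
  for (let k = 0; k < trace.length; k++) {
    if (trace[k].guard && !trace[k].guard(env)) return k; // side exit at op k
    trace[k].exec(env);
  }
  return -1;
}

function run(items, hotThreshold = 2) {
  const env = { items, i: 0, acc: 0 };
  const stats = { interpreted: 0, traced: 0, sideExits: 0 };
  let trace = null;
  let iterations = 0;
  while (env.i < items.length) {
    let resumeAt = 0;
    if (trace) {
      resumeAt = runTrace(trace, env);
      if (resumeAt === -1) { stats.traced++; continue; }
      stats.sideExits++; // guard failed: finish this iteration in the interpreter
    }
    // Record a trace on the iteration that makes the loop "hot".
    const recording = !trace && ++iterations === hotThreshold ? [] : null;
    for (let k = resumeAt; k < loopBody.length; k++) step(loopBody[k], env, recording);
    if (recording) trace = recording;
    stats.interpreted++;
  }
  return { result: env.acc, stats };
}

console.log(run([1, 2, 3, 4]).result);      // 10
console.log(run([1, 2, 3, 4, "x"]).result); // 10x
```

In the second call the string fails the number guard in the middle of the trace, so the interpreter resumes at exactly the op where the guard failed and the result is still correct.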

Typical PyPy performance: several times faster than CPython on CPU-bound pure-Python workloads, with much larger gains on some tight numeric loops. NumPy-heavy scientific code is the exception: CPython + C extensions remain faster because PyPy’s C extension compatibility layer (cpyext) adds overhead.


LLVM: The Universal Compiler Backend

LLVM is not a compiler — it’s a compiler infrastructure: a set of reusable libraries for building compilers. The LLVM IR (Intermediate Representation) is a typed, SSA-form assembly language that serves as the common target for dozens of language frontends.

Language Frontend    → LLVM IR → Optimisation Passes → Backend → Machine Code
─────────────────────────────────────────────────────────────────────────────
Clang (C/C++)         ↘
Rustc                  → llvm-ir  → mem2reg, instcombine, → x86-64 / ARM64
Swift                  →             licm, gvn, loop-     → RISC-V / WASM
Kotlin/Native          ↗             vectorize, …

LLVM IR Example

; A simple function in LLVM IR (SSA form)
define i32 @add(i32 %a, i32 %b) {
entry:
  %result = add nsw i32 %a, %b   ; nsw = no signed wrap (enables optimisations)
  ret i32 %result
}

LLVM’s pass pipeline (opt) applies hundreds of transformations to this IR:

  • mem2reg: promotes stack allocations to SSA registers
  • instcombine: algebraic simplification (x * 2 → x << 1)
  • licm: loop-invariant code motion
  • gvn: global value numbering (eliminates redundant computations)
  • loop-vectorize: converts scalar loops to SIMD
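
To give a feel for what a pass like instcombine does, here is a toy peephole simplifier over a made-up three-address IR. The instruction format and the simplify function are hypothetical, operands are assumed to be integer-typed (as with LLVM's i32), and real LLVM passes operate on SSA values with many more rules.

```javascript
// Hypothetical IR: { op, dst, lhs, rhs }. Operands are register names (strings)
// or integer constants (numbers).
function simplify(instructions) {
  return instructions.map((inst) => {
    const { op, dst, lhs, rhs } = inst;

    // Constant folding: both operands are known at compile time.
    if (typeof lhs === "number" && typeof rhs === "number") {
      if (op === "add") return { op: "const", dst, value: lhs + rhs };
      if (op === "mul") return { op: "const", dst, value: lhs * rhs };
    }

    // Strength reduction: x * 2^k  ->  x << k
    const isPowerOfTwo = Number.isInteger(rhs) && rhs > 1 && (rhs & (rhs - 1)) === 0;
    if (op === "mul" && isPowerOfTwo) {
      return { op: "shl", dst, lhs, rhs: 31 - Math.clz32(rhs) };
    }

    // Identity: x + 0  ->  x
    if (op === "add" && rhs === 0) return { op: "copy", dst, src: lhs };

    return inst; // nothing to simplify
  });
}

const before = [
  { op: "mul", dst: "t0", lhs: "x", rhs: 2 },
  { op: "add", dst: "t1", lhs: 3, rhs: 4 },
  { op: "add", dst: "t2", lhs: "t0", rhs: 0 },
  { op: "mul", dst: "t3", lhs: "y", rhs: 3 },
];
const after = simplify(before);
console.log(after.map((i) => i.op).join(" ")); // shl const copy mul
```

Each rule is local and cheap, which is why such passes can be run repeatedly between the heavier optimisations.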

Because LLVM handles all of this, language authors only need to write a frontend — the backend quality is world-class for free.


WebAssembly: Safe, Fast, Portable Bytecode

WebAssembly (WASM) is a binary instruction format designed as a compilation target for languages like C, Rust, Go, and C#. It runs in browsers, on servers (Wasmtime, WasmEdge), and is emerging as a container alternative.

Binary Format and Validation

A .wasm file is a typed, structured binary with sections: types, imports, functions, tables, memory, exports, and code. Before execution, the runtime performs a single-pass type check — guaranteed to complete in O(n) — ensuring type safety without a full analysis pass.
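
Validation can be observed directly with the standard WebAssembly JavaScript API. The sketch below hand-assembles a minimal module exporting an add function, plus a deliberately ill-typed variant; the byte arrays were written by hand for this example, so treat the per-section comments as a reading aid rather than a reference.

```javascript
const header    = [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]; // "\0asm", version 1
const typeSec   = [0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f]; // (i32, i32) -> i32
const funcSec   = [0x03, 0x02, 0x01, 0x00];                         // one function, type 0
const exportSec = [0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00]; // export "add"

// Body: local.get 0, local.get 1, i32.add, end
const goodCode = [0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b];
// Body: local.get 0, i32.add, end  (i32.add with only one operand on the stack)
const badCode  = [0x0a, 0x07, 0x01, 0x05, 0x00, 0x20, 0x00, 0x6a, 0x0b];

const goodModule = new Uint8Array([...header, ...typeSec, ...funcSec, ...exportSec, ...goodCode]);
const badModule  = new Uint8Array([...header, ...typeSec, ...funcSec, ...exportSec, ...badCode]);

console.log(WebAssembly.validate(goodModule)); // true
console.log(WebAssembly.validate(badModule));  // false: rejected before any code runs

const instance = new WebAssembly.Instance(new WebAssembly.Module(goodModule));
console.log(instance.exports.add(2, 3)); // 5
```

The ill-typed module is rejected purely from its bytes: the validator tracks the operand stack's types instruction by instruction and sees that i32.add would pop a value that is not there.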

V8’s WASM Pipeline: Liftoff + TurboFan

.wasm download begins
        │
        ▼
  Liftoff (baseline compiler): streaming, single-pass, parallel per function
  → fast compilation (~10ms for 1MB WASM), moderate code quality
        │
        ▼
  TurboFan (optimising compiler): runs in background thread
  → high-quality native code replaces Liftoff code function-by-function

Liftoff enables WASM to start executing in under 50ms while TurboFan tiers up in the background — the same multi-tier strategy as JavaScript.

Wasmtime (Standalone Runtime)

Wasmtime (Bytecode Alliance) uses Cranelift as its JIT backend — a code generator written in Rust, designed for fast compilation and correctness rather than maximum optimisation. For server-side WASM, Wasmtime provides:

  • Capability-based security: WASM modules can only access host resources explicitly granted (WASI’s permission model).
  • Sub-millisecond cold starts: no JVM warmup, no OS process spawn.
  • Language-agnostic sandboxing: run untrusted plugins in-process safely.

AOT vs JIT: When to Use Which

Dimension              | AOT (Rust, Go, C)     | JIT (JVM, V8)          | Native Image
-----------------------|-----------------------|------------------------|-------------------
Cold start             | <1ms                  | JVM: 500ms–2s          | ~50ms
Peak throughput        | Highest (static info) | Near-AOT (with warmup) | Slightly below JIT
Runtime optimisation   | None                  | Profile-guided (live)  | None
Binary portability     | No (per-arch)         | Yes (bytecode)         | No (per-arch)
Memory (idle)          | Minimal               | JVM: ~100MB+           | Minimal
Latency predictability | High                  | GC/JIT pauses          | High
Best for               | Systems, CLIs, FaaS   | Long-running servers   | Serverless, CLIs

For serverless / FaaS workloads (AWS Lambda, Cloudflare Workers), cold start often dominates. This is why Node.js (V8 starts quickly) and AOT-compiled Go or Rust are popular choices on AWS Lambda, and why Native Image is a common pattern for Java functions running on Firecracker microVMs.


Profiling: Finding What to Optimise

V8 Profiling

# Built-in sampling profiler
node --prof server.js
node --prof-process isolate-0x*.log | head -50

# clinic.js suite
npm install -g clinic
clinic flame -- node server.js   # flame graph
clinic doctor -- node server.js  # anomaly detection

# Chrome DevTools: --inspect + Performance tab
node --inspect server.js

JVM Profiling

# JDK Flight Recorder (JFR) — <2% overhead, safe in production
java -XX:StartFlightRecording=duration=60s,filename=app.jfr MyApp
jfr print --events jdk.ExecutionSample app.jfr

# async-profiler — wall-clock + allocation + lock profiling
# avoids safepoint bias that afflicts JVMTI-based profilers
./asprof -d 30 -f out.html <pid>

# PrintCompilation to understand JIT decisions (PrintInlining is a diagnostic flag and must be unlocked)
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining MyApp 2>&1 | grep -v "made not entrant"

Reading Flame Graphs

Wide bars   → function spends a lot of CPU time (direct or via callees)
Tall stacks → deep call chains (often framework overhead)
Flat tops   → leaf functions — the actual CPU consumers
Plateaus    → same stack frame repeated — a hot loop

TypeScript: Compilation Without Runtime Types

TypeScript (tsc) compiles .ts → .js by performing:

  1. Parsing: turning the source into a TypeScript AST (a superset of the JavaScript AST).
  2. Type checking: building a symbol table, resolving types, emitting type errors.
  3. Emit: stripping all type annotations and emitting JavaScript.

Types are completely erased at runtime. Apart from a few constructs that do emit code (enums, namespaces, parameter properties), the generated JavaScript is what you’d write without types, and V8 never sees a TypeScript type. This has important implications:

// TypeScript source
function add(a: number, b: number): number { return a + b; }

// Emitted JavaScript — identical to untyped version
function add(a, b) { return a + b; }

TypeScript’s type system provides zero runtime overhead but also zero runtime guarantees — if data arrives from an API with a different shape than declared, the runtime will happily proceed with undefined properties.

For runtime type checking, use libraries like Zod or io-ts, which generate both TypeScript types and runtime validators from a single schema definition.
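
Since the emitted code carries no types, a boundary check has to be ordinary JavaScript. The sketch below is a hand-rolled validator in the spirit of those libraries; the object, isNumber, and isString helpers and the parseUser schema are hypothetical and far simpler than what Zod or io-ts provide.

```javascript
// Hypothetical field checks.
const isNumber = (v) => typeof v === "number" && !Number.isNaN(v);
const isString = (v) => typeof v === "string";

// Builds a validator from a { field: check } description.
function object(shape) {
  return (value) => {
    if (typeof value !== "object" || value === null) {
      return { ok: false, errors: ["expected an object"] };
    }
    const errors = [];
    for (const [key, check] of Object.entries(shape)) {
      if (!check(value[key])) errors.push(`invalid or missing field "${key}"`);
    }
    return errors.length === 0 ? { ok: true, value } : { ok: false, errors };
  };
}

// Plays the role of `interface User { id: number; name: string }`, but at runtime.
const parseUser = object({ id: isNumber, name: isString });

const good = parseUser(JSON.parse('{"id": 1, "name": "Ada"}'));
console.log(good.ok); // true

const bad = parseUser(JSON.parse('{"id": "1"}'));
console.log(bad.ok);            // false
console.log(bad.errors.length); // 2 (id has the wrong type, name is missing)
```

Running such a check once where data enters the program also helps the JIT: objects that pass it have a predictable set of properties and types.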


Interview Deep Dive: Mixed-Type Array Performance

Q: “Why is a for loop over an array with mixed types slower in JavaScript?”

The complete answer:

A JavaScript array whose elements all have a single kind (all small integers, all doubles, or all object pointers) is stored by V8 in a packed backing store: a contiguous block of same-sized values, much like a C array. The element kind is encoded in the array’s hidden class:

PACKED_SMI_ELEMENTS     → all elements are V8 Smis (31-bit integers)
PACKED_DOUBLE_ELEMENTS  → all elements are doubles (stored unboxed)
PACKED_ELEMENTS         → mixed / object references (pointer array)
HOLEY_*                 → array has holes (missing indices)

Transitions between element kinds are one-way and monotonic: a Smi array that receives a double transitions to double elements — forever. A double array that receives a string transitions to pointer elements — forever. V8 never transitions “back up” to a more specialised kind.

const arr = [1, 2, 3];           // PACKED_SMI_ELEMENTS
arr.push(3.14);                   // → PACKED_DOUBLE_ELEMENTS (downgrade)
arr.push("hello");                // → PACKED_ELEMENTS (downgrade)
// arr stays PACKED_ELEMENTS even if you remove the string

let total = 0;
for (let i = 0; i < arr.length; i++) {
  total += arr[i];  // generic tagged load; + has seen numbers and a string, so TurboFan can't emit a pure numeric add
}

With PACKED_SMI_ELEMENTS, TurboFan emits a tight loop: load the Smi, add integers with an overflow check, and no per-element type dispatch. With PACKED_ELEMENTS, every element load yields a tagged value that needs a type check, possibly a pointer dereference to reach a boxed number, and a potential call to valueOf. This is typically 3–5× slower for arithmetic workloads.

The performance lesson: homogeneous arrays are not just a style preference — they are a JIT contract. In performance-critical paths, maintain type discipline or use Int32Array / Float64Array to make the element kind explicit and immutable.
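
A small sketch of the typed-array option follows. The sumFloat64 function name and the sample values are illustrative; the coercion behaviour shown is standard Float64Array semantics.

```javascript
function sumFloat64(values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i];
  return total;
}

const samples = new Float64Array([1, 2, 3.5]);

samples[0] = "7";                      // coerced to the number 7 on write
console.log(sumFloat64(samples));      // 12.5

samples[1] = "hello";                  // not numeric: stored as NaN, still a double
console.log(Number.isNaN(samples[1])); // true
console.log(samples instanceof Float64Array); // true: the element kind cannot change
```

The trade-off is that bad input shows up as NaN instead of silently changing the array's representation, so validation has to happen before the write.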