The Compiler’s Job: Closing the Semantic Gap
A compiler transforms a program written in a high-level language — designed for human understanding — into a sequence of machine instructions — designed for CPU execution. This transformation is not a single step but a pipeline of phases, each with a precise mathematical definition. Understanding this pipeline is essential for writing performant code, debugging arcane build errors, and reasoning about why tools like SWC, esbuild, and tsc behave as they do.
The canonical pipeline:
Source Text
│
▼ Lexical Analysis (Lexer / Tokeniser)
Token Stream
│
▼ Syntactic Analysis (Parser)
Abstract Syntax Tree (AST)
│
▼ Semantic Analysis (Type Checker, Name Resolver)
Decorated AST / Symbol Table
│
▼ IR Generation (HIR / MIR / LLVM IR)
Intermediate Representation
│
▼ Optimisation Passes (middle-end)
Optimised IR
│
▼ Code Generation (Backend)
Target Code (x86-64 / ARM64 / WASM / JS)
│
▼ Linking
Executable / Library
Each phase has a single responsibility and a well-defined input/output contract. This separation is not bureaucracy — it enables substitutable backends (the same C frontend can emit x86, ARM, or WASM), pluggable optimisation passes, and incremental compilation.
Phase 1: Lexical Analysis (Lexing / Tokenisation)
The lexer reads a stream of characters and emits a stream of tokens — the vocabulary of the language. Each token has a type (keyword, identifier, integer literal, operator) and a value (the matched text).
Lexers are implemented as Deterministic Finite Automata (DFAs), often generated from regular expression specifications using tools like flex or hand-written for performance.
Token Types
INTEGER_LIT 42
FLOAT_LIT 3.14
STRING_LIT "hello"
IDENTIFIER myVar
KEYWORD if | while | return | const
OPERATOR + | - | * | / | === | =>
PUNCTUATION ( | ) | { | } | , | ;
EOF <end of input>
Whitespace and Comments
Most languages treat whitespace as a token separator — it is consumed but not emitted. Comments are similarly dropped. The exception: Python uses indentation as structural syntax, so its lexer emits INDENT and DEDENT tokens. This is why Python’s lexer is more complex than most.
Practical TypeScript Tokeniser
Let’s build a minimal tokeniser for arithmetic expressions in TypeScript:
type TokenType =
| "NUMBER" | "PLUS" | "MINUS" | "STAR" | "SLASH"
| "LPAREN" | "RPAREN" | "EOF";
interface Token {
type: TokenType;
value: string;
pos: number;
}
function tokenize(source: string): Token[] {
const tokens: Token[] = [];
let i = 0;
while (i < source.length) {
const ch = source[i];
// Skip whitespace
if (/\s/.test(ch)) { i++; continue; }
// Number literal (integer or float)
if (/[0-9]/.test(ch)) {
let start = i;
while (i < source.length && /[0-9.]/.test(source[i])) i++;
tokens.push({ type: "NUMBER", value: source.slice(start, i), pos: start });
continue;
}
// Single-character tokens
const single: Record<string, TokenType> = {
"+": "PLUS", "-": "MINUS", "*": "STAR",
"/": "SLASH", "(": "LPAREN", ")": "RPAREN",
};
if (ch in single) {
tokens.push({ type: single[ch], value: ch, pos: i });
i++;
continue;
}
throw new Error(`Unexpected character '${ch}' at position ${i}`);
}
tokens.push({ type: "EOF", value: "", pos: i });
return tokens;
}
Phase 2: Parsing
The parser takes the token stream and builds an Abstract Syntax Tree (AST) — a tree that captures the grammatical structure of the program, discarding syntactically irrelevant details (parentheses that encode precedence, trailing commas).
Grammar Classes
Languages are defined by formal grammars. The most relevant classes for programming languages:
Grammar Class | Parser Type | Examples
------------------|----------------------|-----------------------------------
Regular | DFA / regex | Tokens (lexer level)
LL(k) | Recursive descent | Most hand-written parsers
LR(k) / LALR(1) | Table-driven bottom-up| yacc/bison grammars (C, Java spec)
PEG | Packrat parser | Pest (Rust), PEG.js
LL(k) grammars are parsed top-down — the parser predicts which production rule to apply by looking at the next k tokens. Most hand-written parsers for production compilers (Clang, TypeScript’s tsc) use recursive descent, which implements each grammar rule as a mutually recursive function.
LR(k) grammars are parsed bottom-up — the parser reduces token sequences to grammar symbols, driven by a state machine and a parse table. They handle a larger class of grammars but are impractical to write by hand, hence tools like yacc.
Pratt Parsers: Elegant Expression Parsing
Pratt parsing (Top-Down Operator Precedence) is the standard technique for parsing expressions with operator precedence and associativity. Each token type has a binding power (an integer priority). The parser consumes tokens from left to right, collecting operands until it encounters a token with lower binding power.
// Pratt parser for arithmetic: handles precedence without grammar hacking
interface Expr { }
interface Num extends Expr { type: "Num"; value: number; }
interface BinOp extends Expr {
type: "BinOp"; op: string; left: Expr; right: Expr;
}
const BINDING_POWER: Record<string, [number, number]> = {
"+": [10, 11], // left bp = 10, right bp = 11 (left-associative)
"-": [10, 11],
"*": [20, 21],
"/": [20, 21],
};
class Parser {
private pos = 0;
constructor(private tokens: Token[]) {}
private peek(): Token { return this.tokens[this.pos]; }
private consume(): Token { return this.tokens[this.pos++]; }
parseExpr(minBp = 0): Expr {
let tok = this.consume();
let left: Expr;
if (tok.type === "NUMBER") {
left = { type: "Num", value: parseFloat(tok.value) };
} else if (tok.type === "LPAREN") {
left = this.parseExpr(0);
this.consume(); // RPAREN
} else {
throw new Error(`Unexpected token: ${tok.type}`);
}
while (true) {
const op = this.peek();
if (op.type === "EOF" || op.type === "RPAREN") break;
const [lbp, rbp] = BINDING_POWER[op.value] ?? [0, 0];
if (lbp <= minBp) break;
this.consume();
const right = this.parseExpr(rbp);
left = { type: "BinOp", op: op.value, left, right };
}
return left;
}
}
// "1 + 2 * 3" → BinOp(+, Num(1), BinOp(*, Num(2), Num(3)))
// Multiplication binds tighter because its binding power (20) > addition's (10)
Phase 3: Semantic Analysis
The parser verifies syntactic correctness — is the program grammatically valid? Semantic analysis verifies meaning — is the program logically valid?
Symbol Tables and Scope Analysis
A symbol table maps identifiers to their declarations (type, storage class, scope). The compiler builds one as it walks the AST, with nested scopes forming a stack:
Global scope: { console, Math, fetch }
└── Function scope (main): { x: number, y: number }
└── Block scope (if body): { result: number }
Name resolution walks this scope stack outward from the use site to find each identifier’s declaration. An unresolved name is a compile-time error. Shadowing occurs when a nested scope declares a name that exists in an outer scope.
Type Checking
Type checking traverses the AST bottom-up, propagating types from leaves (literals, identifiers) to internal nodes (expressions). For a binary + node:
- Infer the type of the left child.
- Infer the type of the right child.
- Look up the valid overloads of
+for those types. - If no overload matches → type error.
- Annotate the node with the result type.
TypeScript’s type checker (checker.ts, ~50,000 lines) implements this process for a Turing-complete type language with generics, conditional types, mapped types, and template literal types.
Intermediate Representation (IR)
After semantic analysis, the compiler lowers the AST to an IR — a representation designed not for human readability but for machine transformation and optimisation.
Why Not Go Directly to Machine Code?
Direct code generation from an AST produces slow, unoptimised code and ties the compiler to one architecture. IR decouples the middle-end (optimisations, architecture-independent) from the backend (architecture-specific code generation).
LLVM IR: A Concrete Example
LLVM IR is a typed, infinite-register assembly language in Static Single Assignment (SSA) form:
; Source: int factorial(int n) { return n <= 1 ? 1 : n * factorial(n-1); }
define i32 @factorial(i32 %n) {
entry:
%cond = icmp sle i32 %n, 1 ; n <= 1?
br i1 %cond, label %base, label %rec
base:
ret i32 1
rec:
%n_minus_1 = sub i32 %n, 1 ; n - 1
%rec_result = call i32 @factorial(i32 %n_minus_1)
%result = mul i32 %n, %rec_result ; n * factorial(n-1)
ret i32 %result
}
Every value is defined exactly once (SSA), making data-flow analysis trivial: to find all uses of a definition, follow its use-def chain.
Static Single Assignment (SSA) Form
SSA is the key insight that makes modern optimisers possible. In SSA:
- Every variable is assigned exactly once.
- φ (phi) nodes at control-flow join points select between multiple incoming values.
// Non-SSA
x = 1
if cond:
x = 2
use(x) // Which x? Depends on control flow
// SSA
x₁ = 1
if cond:
x₂ = 2
x₃ = φ(x₁, x₂) // φ selects based on which branch executed
use(x₃)
With SSA, use-def chains are explicit in the graph structure — no alias analysis needed for local variables. Constant propagation becomes: if a variable’s single definition is a constant, substitute that constant everywhere the variable is used.
Optimisation Passes
Optimisation passes transform IR into semantically equivalent but faster IR. They run repeatedly, in sequence, until a fixed point (no more changes) or a pass budget is exhausted.
Constant Folding
Evaluate constant expressions at compile time:
; Before
%a = mul i32 6, 7
%b = add i32 %a, 0
; After constant folding + algebraic simplification
; %b = 42 (the entire expression is constant)
Dead Code Elimination (DCE)
Remove code whose results are never used. With SSA, this is a simple graph reachability problem: if a definition has no uses, delete it. Apply recursively.
// Before DCE
function compute(x) {
const unused = expensiveCalc(); // no uses → eliminated
return x * 2;
}
Common Subexpression Elimination (CSE)
Replace duplicate computations with a single computation + reuse:
// Before
int a = x * y + z;
int b = x * y + w;
// After CSE
int tmp = x * y;
int a = tmp + z;
int b = tmp + w;
Loop-Invariant Code Motion (LICM)
Move computations whose inputs don’t change across iterations outside the loop:
// Before
for (int i = 0; i < n; i++) {
arr[i] = arr[i] * (base * base); // base*base computed every iteration
}
// After LICM
int bb = base * base; // hoisted
for (int i = 0; i < n; i++) {
arr[i] = arr[i] * bb;
}
Function Inlining
Replace a function call with the callee’s body, substituting arguments. Inlining eliminates call overhead (frame setup, argument passing, return) and, crucially, enables further optimisations — the inlined body may expose constant values, dead branches, or common subexpressions that weren’t visible across the call boundary.
Inlining is controlled by heuristics: function size (don’t inline large functions — code bloat hurts instruction cache), call site hotness (only inline frequently executed calls), and recursion (never inline recursive calls naively).
Code Generation: From IR to Machine Code
Instruction Selection
Each IR operation must be mapped to one or more machine instructions. This is non-trivial because modern ISAs have complex addressing modes, fused instructions (fused multiply-add: fma rd, rs1, rs2, rs3 replaces a multiply + add), and SIMD lanes.
Pattern matching on IR trees is the standard approach: define patterns that match IR subtrees and the machine instructions they translate to. LLVM’s SelectionDAG and GlobalISel implement this.
Register Allocation
The IR assumes an infinite set of virtual registers. The machine has ~16 general-purpose registers (x86-64) or ~31 (ARM64). Register allocation assigns virtual registers to physical registers, spilling to the stack when there aren’t enough.
Graph Coloring Allocation: Model the problem as graph coloring. Nodes = virtual registers, edges = simultaneously-live registers (they cannot share a physical register). A k-coloring of the graph assigns physical registers (k = number of available registers). NP-complete in general, but heuristics (Chaitin-Briggs) work well in practice.
Linear Scan Allocation: Simpler algorithm used by JITs (including LLVM in fast mode, Graal): compute live intervals (the range of IR positions where each virtual register is live), then assign physical registers greedily scanning left to right. O(n log n), much faster than graph coloring, slightly worse code quality.
Instruction Scheduling
Modern CPUs execute instructions out-of-order, but the compiler can help by reordering independent instructions to hide memory latency and fill execution unit pipelines.
Linking: Assembling the Final Binary
Static Linking
The linker (ld, lld) takes object files (.o) and archives (.a) produced by the compiler and:
- Symbol resolution: every undefined symbol (
printf,malloc) is matched to a definition in some object file or archive. - Relocation: instructions that reference addresses not yet known (because they depend on where the code will be loaded) are patched with final addresses.
- Section merging: all
.textsections are concatenated, all.datasections concatenated, etc.
Dynamic Linking
Dynamic libraries (.so on Linux, .dylib on macOS, .dll on Windows) are not incorporated into the binary — they are loaded at runtime by the dynamic linker (ld.so / dyld).
PLT (Procedure Linkage Table) and GOT (Global Offset Table): The first call to a dynamically-linked function goes through the PLT, which invokes the dynamic linker to resolve the function address and patches the GOT entry. Subsequent calls use the cached GOT entry — one extra indirection (load + jump) per call.
call printf ; → jmp *GOT[printf_idx]
; first call: GOT[printf_idx] = lazy resolver
; after first call: GOT[printf_idx] = printf's real address
Section Layout
ELF Section | Contents | Permissions
---------------|---------------------------------|------------------
.text | Executable machine code | r-x (read, exec)
.rodata | Read-only data (string literals)| r-- (read only)
.data | Initialised global variables | rw- (read, write)
.bss | Uninitialised globals (zero-fill)| rw- (no file space)
.plt | Procedure Linkage Table | r-x
.got | Global Offset Table | rw-
.debug_info | DWARF debug information | not loaded at runtime
Link-Time Optimisation (LTO)
With LTO, the compiler emits IR (not object code) into .o files. The linker invokes the compiler again with all compilation units visible simultaneously, enabling inlining across module boundaries — the most impactful cross-file optimisation.
# Enable LTO in Rust
[profile.release]
lto = "fat" # "thin" for faster builds with most of the benefit
Runtime Initialisation
Before main() is called, the C runtime initialises the process:
OS loads ELF binary
│
▼
ld.so (dynamic linker) maps shared libraries, resolves symbols
│
▼
_start (crt0.o) — provided by libc
│
▼ Sets up: stack pointer, argc/argv, environment
│ Calls: .init_array (C++ global constructors, __attribute__((constructor)))
│
▼
main()
│
▼ Returns exit code
.fini_array (destructors, atexit handlers)
│
▼
exit() → kernel
C++ global constructors run before main(). This is why global objects with non-trivial constructors can cause subtle initialisation-order bugs across translation units (the “static initialisation order fiasco”).
Debugging Information: DWARF
DWARF (Debug With Arbitrary Record Formats) is the standard debug info format on ELF systems. It maps machine code addresses back to source locations and describes variable types and locations (which register or stack slot holds x at a given instruction).
# View DWARF info
objdump --dwarf=info myapp | head -100
readelf --debug-dump=frames myapp # call frame info for unwinding
# gdb uses DWARF to:
# - map breakpoints (source line → address)
# - display variable values (DWARF location expressions → register/memory)
# - unwind the call stack (DWARF CFI tables)
-g flag in compilers emits DWARF. -O2 -g emits both optimised code and debug info — but some variables may be optimised away or live in registers, causing “value unavailable” in gdb.
Build Systems: Dependency Graphs and Incremental Builds
A build system models the build as a Directed Acyclic Graph (DAG): nodes are files or targets, edges are dependencies. A node is rebuilt if its inputs are newer than its outputs.
Make
The original build system. Correct for simple projects, fragile at scale:
# Makefile
CC = gcc
CFLAGS = -O2 -g
main: main.o utils.o
$(CC) -o main main.o utils.o
main.o: main.c utils.h
$(CC) $(CFLAGS) -c main.c
utils.o: utils.c utils.h
$(CC) $(CFLAGS) -c utils.c
Make re-runs a rule if any input has a newer mtime than the output. It does not track header changes transitively unless you use -MMD to generate dependency files.
Bazel (Used at Google)
Bazel uses content-addressable storage (file hashes, not timestamps) and hermetic sandboxed builds. Every action runs in a sandbox with only declared inputs — undeclared inputs cause build failures, preventing “works on my machine” bugs. Remote caching shares build artifacts across a team.
# BUILD file (Starlark)
cc_binary(
name = "server",
srcs = ["server.cc"],
deps = [":utils", "@abseil//absl/strings"],
)
cc_library(
name = "utils",
srcs = ["utils.cc"],
hdrs = ["utils.h"],
)
Cargo (Rust)
Cargo fingerprints source files, Cargo.toml, and environment variables. It tracks inter-crate dependencies and rebuilds only affected crates. cargo check skips code generation, giving fast type-check feedback.
TypeScript Compiler (tsc) Internals
tsc is a multi-pass compiler implemented entirely in TypeScript (~400,000 lines):
1. Configuration loading tsconfig.json → CompilerOptions
2. Program creation Collect all .ts files via "include", "references"
3. Parsing Per-file: SourceFile AST (reusable across type checks)
4. Binder Build symbol table, assign node IDs, detect scope violations
5. Type checker On-demand: type-check each file, produce diagnostics
6. Emit Walk type-checked AST → write .js + .d.ts + .js.map
The language service (used by VS Code) reuses the same pipeline but adds incremental update support: when a file changes, only that file’s AST is re-parsed, type checking re-runs lazily for changed files, and diagnostics are invalidated only for affected files.
Project References (references in tsconfig.json) allow multi-package monorepos to declare explicit build dependencies. tsc -b builds packages in topological order, caching .d.ts declaration files to avoid re-type-checking unchanged packages.
Babel, SWC, and esbuild: Modern JS Tooling Pipelines
Babel: Transform-Based Transpilation
Babel is a source-to-source transformer (transpiler): it parses JavaScript/TypeScript to an AST, applies a chain of plugins (each transforming one language feature), and regenerates source text. It performs no type checking.
Performance profile: Babel is slow (~1000 files/sec) because it is written in JavaScript (GC overhead, single-threaded, no parallelism) and re-parses every file on each build.
SWC: Rust-Powered Parser + Transformer
SWC replicates Babel’s plugin model but is implemented in Rust. It achieves 70× throughput improvement over Babel by:
- Zero garbage collection pauses (Rust’s ownership model)
- Parallel file processing via Rayon (data-parallel work-stealing)
- SIMD-accelerated lexer using
simdutffor UTF-8 decoding
SWC is used as the JS/TS transform layer in Next.js 12+ (replacing Babel) and Deno.
esbuild: Bundler Architecture
esbuild (Evan Wallace, 2020) is a full JavaScript bundler written in Go, designed from first principles for speed:
Parallel parsing → each file parsed on its own goroutine (Go scheduler)
Single-pass linking → module graph resolved + tree-shaking in one pass
No plugin overhead → built-in support for JS/TS/JSX, no loader chain
esbuild processes 1,000,000 lines/second on a single core — ~100× faster than webpack. The architectural lesson: algorithmic improvements (single-pass, better data structures) outweigh micro-optimisations. esbuild avoids allocating an IR for the entire bundle; instead, it links by byte-range copying with fix-ups, minimising allocations.
Interview Deep Dive: node server.js
Q: “What happens when you run node server.js?”
The complete, layered answer:
1. Shell fork+exec: the OS creates a new process. The ELF binary 'node' is loaded.
ld.so resolves shared library dependencies (libv8.so, libuv.so, libssl.so).
_start initialises the C++ runtime, then calls node::Start().
2. Node.js initialisation:
- Initialise the V8 Isolate (heap, JIT compiler state, GC).
- Initialise libuv's event loop (wraps epoll on Linux, kqueue on macOS).
- Load built-in modules (fs, path, net, http) from embedded bytecode snapshots.
The snapshot is pre-compiled Ignition bytecode for Node's internal JS,
loaded directly — no re-parsing. This is why Node starts in ~50ms.
3. Load server.js:
- Node reads the file from disk (libuv async I/O → readFileSync for the entry point).
- V8 parses the source text → Ignition bytecode (a few ms for typical files).
- The module wrapper is applied:
(function(exports, require, module, __filename, __dirname) { /* server.js */ })
- V8 executes the bytecode in the Ignition interpreter.
4. Require chain:
- Each require('express') triggers module resolution:
Check Module._cache (already loaded? return cached exports).
Locate the file (node_modules resolution algorithm).
Read + parse + execute the module file.
Cache the exports object.
5. HTTP server setup:
require('http').createServer(handler).listen(3000) →
libuv calls bind(2) + listen(2) on a TCP socket →
registers a file-descriptor event with epoll (Linux) / kqueue (macOS).
6. Event loop runs:
The process doesn't exit. Node enters the libuv event loop:
[timers] → [I/O callbacks] → [idle] → [poll] → [check (setImmediate)] → [close]
The poll phase blocks on epoll_wait() until a connection or timer fires.
7. On a request:
- epoll_wait returns → libuv calls the TCP accept handler.
- The HTTP parser (llhttp) processes request bytes.
- The user's handler runs (JS Ignition bytecode).
- If the handler is called frequently, Sparkplug → Maglev → TurboFan tiers it up.
- response.end() schedules a write on the socket's libuv stream.
This mental model — V8 + libuv + the module system — explains every performance characteristic of Node.js: startup time (V8 init + module loading), throughput (JIT tier-up), and I/O scalability (libuv’s event loop).