Why Virtualisation? The Problem Statement
Before virtualisation, a physical server ran one operating system. If you needed five services, you provisioned five servers — each running at 5–15% CPU utilisation. Capital expenditure was wasted on idle hardware, deployment was slow (weeks to provision physical machines), and software environments were brittle (“works on my machine”).
Virtualisation solves all three problems simultaneously:
- Consolidation: Run 20–100 VMs per physical host, raising utilisation to 60–80%.
- Isolation: A bug or security compromise in one VM cannot directly affect another — each has its own kernel, memory space, and virtual hardware.
- Reproducibility: A VM image is an exact, versioned snapshot of an entire OS + software stack. Deploy the same image anywhere.
The fundamental challenge: how do you run an operating system — which expects exclusive control of hardware — as a user-space application on another operating system?
Full Virtualisation: Hypervisor Types
A hypervisor (Virtual Machine Monitor, VMM) is the software layer that multiplexes physical hardware across multiple virtual machines.
Type 1 Hypervisor (Bare-Metal)
A Type 1 hypervisor runs directly on the hardware. There is no host OS; the hypervisor is the OS for the purpose of resource management.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ VM 1 │ │ VM 2 │ │ VM 3 │
│ (Guest OS + │ │ (Guest OS + │ │ (Guest OS + │
│ Workload) │ │ Workload) │ │ Workload) │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
┌──────┴────────────────┴────────────────┴─────────┐
│ Type 1 Hypervisor │
│ (VMware ESXi / KVM / Hyper-V) │
└───────────────────────────────────────────────────┘
│ Physical Hardware │
└───────────────────────────────────────────────────┘
VMware ESXi: Proprietary, widely used in enterprise data centres. Mature live migration (vMotion), storage virtualisation, and management tools.
KVM (Kernel-based Virtual Machine): A Linux kernel module that turns Linux itself into a Type 1 hypervisor. KVM exposes /dev/kvm, a character device that userspace drives through ioctl calls to create and manage VMs. KVM handles CPU and memory virtualisation via Intel VT-x / AMD-V; QEMU provides I/O emulation. Google Compute Engine, AWS EC2 (Nitro-based instances; older generations used Xen), and most OpenStack deployments run on KVM.
Hyper-V: Microsoft’s hypervisor, integral to Windows Server. Used for Azure VMs.
Type 2 Hypervisor (Hosted)
A Type 2 hypervisor runs as a process inside a host OS. The host OS manages hardware; the hypervisor multiplexes CPU time within that process.
┌──────────────┐ ┌──────────────┐
│ VM 1 │ │ VM 2 │
│ (Guest OS) │ │ (Guest OS) │
└──────┬───────┘ └──────┬───────┘
│ │
┌──────┴────────────────┴────────┐
│ Type 2 Hypervisor Process │
│ (VirtualBox / Parallels) │
└────────────────┬───────────────┘
┌───────────────────────────────┐
│ Host OS │
│ (macOS / Windows) │
└───────────────────────────────┘
│ Physical Hardware │
└───────────────────────────────┘
Type 2 hypervisors have higher overhead (host OS scheduling adds jitter) but are convenient for developer workstations. Parallels Desktop and VMware Fusion on macOS are Type 2; they leverage Apple Hypervisor.framework for hardware-assisted execution.
CPU Virtualisation: How Privileged Instructions Are Handled
A guest OS assumes it has ring 0 (kernel mode) privilege. In reality, the hypervisor runs in ring 0 (or VMX root mode) and the guest runs in ring 3 (or VMX non-root mode). When the guest attempts a privileged operation, one of three techniques handles it:
Trap-and-Emulate (Classic)
Privileged instructions executed in ring 3 trigger a General Protection Fault — a CPU exception caught by the hypervisor. The hypervisor inspects the instruction, emulates its effect on the virtual machine’s state, and resumes the guest. Clean but requires the ISA to be “classically virtualizable” (all privileged instructions trap in user mode). x86 is notoriously not classically virtualizable — 17 instructions behave differently in ring 3 without trapping.
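The trap-and-emulate cycle can be sketched as a toy model. Nothing here touches real hardware: the instruction names, the `VirtualCPU` fields, and the `PrivilegeTrap` exception are all invented for illustration, with the exception standing in for the General Protection Fault.

```python
from dataclasses import dataclass


class PrivilegeTrap(Exception):
    """Stands in for the fault raised when deprivileged code runs a privileged op."""

    def __init__(self, op, arg):
        super().__init__(op)
        self.op, self.arg = op, arg


@dataclass
class VirtualCPU:
    # Per-VM state the hypervisor keeps on the guest's behalf (hypothetical fields).
    interrupts_enabled: bool = True
    page_table_base: int = 0
    acc: int = 0
    trap_count: int = 0


PRIVILEGED = {"cli", "sti", "load_cr3"}  # toy instruction names


def execute_deprivileged(vcpu, op, arg):
    """'Hardware': run harmless ops directly, fault on privileged ones."""
    if op in PRIVILEGED:
        raise PrivilegeTrap(op, arg)
    if op == "add":
        vcpu.acc += arg


def emulate(vcpu, trap):
    """Hypervisor: apply the instruction's effect to the virtual CPU only."""
    vcpu.trap_count += 1
    if trap.op == "cli":
        vcpu.interrupts_enabled = False
    elif trap.op == "sti":
        vcpu.interrupts_enabled = True
    elif trap.op == "load_cr3":
        vcpu.page_table_base = trap.arg


def run_guest(program):
    vcpu = VirtualCPU()
    for op, arg in program:
        try:
            execute_deprivileged(vcpu, op, arg)
        except PrivilegeTrap as trap:
            emulate(vcpu, trap)  # then resume at the next instruction
    return vcpu


vcpu = run_guest([("add", 2), ("cli", None), ("load_cr3", 0x1000), ("add", 3)])
print(vcpu.acc, vcpu.trap_count, vcpu.interrupts_enabled)  # 5 2 False
```

The model only works because every privileged operation raises the trap. An instruction that silently did something different when deprivileged would never reach `emulate`, which is the x86 problem described above.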
Binary Translation (Old VMware)
VMware’s original solution: before executing guest kernel code, scan basic blocks and rewrite the 17 problematic x86 instructions into safe sequences that emulate the intended effect or call into the hypervisor. The translated code is cached. This is expensive (translation overhead) and complex, but it needs no hardware virtualisation support.
Hardware-Assisted Virtualisation: Intel VT-x / AMD-V
Modern CPUs add a new pair of operating modes that sit alongside the existing rings: VMX root mode (host) and VMX non-root mode (guest) in Intel’s terms; AMD-V has an equivalent host/guest split. The guest OS runs in ring 0 of non-root mode. It believes it has full ring 0 privilege, but the CPU intercepts sensitive operations and triggers a VM exit to the hypervisor in root mode. No binary translation is needed.
Key VT-x concepts:
- VMCS (VM Control Structure): A hardware data structure storing guest/host register state. VMLAUNCH/VMRESUME instructions context-switch into the guest.
- VM exits: Triggered by I/O port access, CPUID, HLT, page faults, etc. The hypervisor handles the exit and calls VMRESUME.
- VPID (Virtual Processor ID): Tags TLB entries with a VM identifier so TLBs don’t need full flushes on VM context switches.
Guest executes privileged instruction
│
▼ (VM exit — CPU hardware)
Hypervisor VM exit handler runs in VMX root mode
│
▼
Emulate the instruction / handle I/O / update VMCS
│
▼ VMRESUME
Guest continues in VMX non-root mode
Memory Virtualisation: Shadow Pages and Nested Paging
A guest OS manages a guest physical address space (GPA) — it believes these are physical addresses. The hypervisor must translate GPA → host physical address (HPA).
Shadow Page Tables
The hypervisor maintains shadow page tables that map guest virtual addresses (GVA) directly to host physical addresses (HPA), bypassing the guest’s page tables. When the guest modifies its page tables, a VM exit is triggered, and the hypervisor updates the shadow tables. This is correct but expensive — every page-table modification is a VM exit.
Extended Page Tables (EPT) / Nested Page Tables (NPT)
Intel EPT (part of VT-x) and AMD NPT add a second level of hardware page-table walking:
Guest virtual address (GVA)
│
▼ Guest page tables (walk in hardware)
Guest physical address (GPA)
│
▼ EPT tables (second hardware walk)
Host physical address (HPA)
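The CPU hardware performs this two-dimensional walk without hypervisor intervention. A TLB miss on a GVA requires up to 24 memory accesses with 4-level guest tables and 4-level EPT: each of the 4 guest page-table entries sits at a guest physical address, so reading it costs a 4-step EPT walk plus the read itself (4 × 5 = 20), and the final GPA needs one more 4-step EPT walk. The result is cached in the TLB, tagged with the VPID. EPT/NPT is often credited with cutting memory virtualisation overhead by around 80% compared with shadow page tables, although the gain depends on how often the workload modifies its page tables.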
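The worst-case access count can be checked with a few lines of arithmetic. This is a sketch of the counting argument only; the function name is made up here, and it ignores the paging-structure caches that real CPUs use to shorten most walks.

```python
def nested_walk_accesses(guest_levels: int, ept_levels: int) -> int:
    """Worst-case memory accesses for one GVA -> HPA translation on a TLB miss."""
    # Each guest page-table entry lives at a guest physical address, so reading
    # it costs one full EPT walk plus the read of the entry itself.
    per_guest_level = ept_levels + 1
    # The final guest physical address of the data needs one more EPT walk.
    final_translation = ept_levels
    return guest_levels * per_guest_level + final_translation


print(nested_walk_accesses(4, 4))  # 24
print(nested_walk_accesses(5, 5))  # 35
```

The second call shows why 5-level paging makes TLB hit rates, and therefore huge pages, matter even more for virtualised workloads.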
I/O Virtualisation
Emulated Devices
The hypervisor emulates hardware devices (e.g., an Intel e1000 NIC, an IDE disk controller). The guest uses standard kernel drivers unchanged. Every I/O operation causes a VM exit → QEMU emulation. High compatibility, mediocre performance.
Paravirtual Drivers (virtio)
virtio is an open standard for paravirtual I/O. Instead of emulating real hardware, the hypervisor exposes a virtio-net / virtio-blk device. The guest uses a virtio driver that communicates via shared-memory virtqueue ring buffers — no VM exit per packet, just a doorbell kick when the ring is ready.
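Result: virtio-net can sustain 10 Gbps or more, while an emulated e1000 typically tops out around 1 Gbps. Most KVM/QEMU deployments configure virtio for disk and network I/O, although QEMU’s built-in defaults are still emulated devices.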
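The reason virtio needs so few VM exits can be shown with a toy ring. This is a deliberately simplified model, not the virtio specification's descriptor/available/used ring layout; `ToyVirtqueue` and its methods are hypothetical names.

```python
from collections import deque


class ToyVirtqueue:
    """Shared-memory ring, reduced to a queue plus a notification counter."""

    def __init__(self):
        self.ring = deque()
        self.kicks = 0  # each kick would correspond to one VM exit

    def add_buffer(self, packet: bytes) -> None:
        self.ring.append(packet)  # plain memory write, no exit

    def kick(self) -> None:
        self.kicks += 1  # doorbell: tell the host the ring has work

    def host_drain(self) -> list:
        sent = list(self.ring)
        self.ring.clear()
        return sent


def send_batch(packets):
    vq = ToyVirtqueue()
    for p in packets:
        vq.add_buffer(p)
    vq.kick()  # one notification for the whole batch
    return vq.host_drain(), vq.kicks


sent, exits = send_batch([b"a", b"b", b"c", b"d"])
print(len(sent), exits)  # 4 1
```

An emulated NIC would instead take at least one exit per register access, so the cost grows with the number of packets rather than the number of batches.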
SR-IOV (Single Root I/O Virtualisation)
SR-IOV is a PCIe hardware standard that allows a single physical NIC to present multiple Virtual Functions (VFs). Each VF appears as an independent PCIe device that can be assigned to a VM. With an IOMMU mapping guest memory, DMA between the VM and the NIC happens without hypervisor involvement on the data path. AWS EC2’s enhanced networking builds on SR-IOV, which is how instances reach up to 100 Gbps of network throughput.
KVM + QEMU: The Linux VM Stack
Understanding KVM + QEMU is understanding how most cloud VMs work.
KVM (kvm.ko kernel module) handles:
- CPU virtualisation (VMX/SVM, VMCS management)
- Memory virtualisation (EPT/NPT)
- Exposes the /dev/kvm ioctl API
QEMU (userspace) handles:
- I/O device emulation (virtio, USB, display, BIOS/UEFI)
- VM lifecycle management
- Block device backends (qcow2 images, raw files, NBD)
QEMU Process (userspace)
├── vCPU threads: one per virtual CPU
│ Each thread calls ioctl(KVM_RUN) → enters VMX non-root mode
│ On VM exit → QEMU handles I/O, returns to KVM_RUN
├── I/O threads: virtio backend processing
└── Main thread: VM lifecycle, monitor, QMP API
Kernel (kvm.ko)
├── VMCS management
├── EPT page tables
└── KVM_RUN ioctl implementation
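When a vCPU thread calls ioctl(KVM_RUN), the kernel enters the guest (VMLAUNCH on the first entry, VMRESUME afterwards) and the guest runs until a VM exit. The CPU records the exit reason (I/O port access, EPT violation, external interrupt, etc.) in the VMCS. KVM handles what it can in the kernel; for anything else it returns from the ioctl with the reason in the shared kvm_run structure so that QEMU can handle it.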
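The division of labour between KVM and QEMU can be sketched as a loop. This is a toy simulation, not the real ioctl interface: the exit-reason strings and the `IN_KERNEL` split are assumptions chosen for illustration.

```python
IN_KERNEL = {"ept_violation", "external_interrupt"}  # toy split, assumed


def kvm_run(pending_exits, stats):
    """Toy KVM_RUN: resolve in-kernel exits, return the first one userspace must handle."""
    while pending_exits:
        reason = pending_exits.pop(0)
        if reason in IN_KERNEL:
            stats["kernel"] += 1  # resolved without leaving the kernel
            continue
        return reason
    return "shutdown"


def vcpu_thread(guest_exits):
    """Toy QEMU vCPU thread: call KVM_RUN in a loop and emulate what comes back."""
    stats = {"kernel": 0, "userspace": 0}
    pending = list(guest_exits)
    while True:
        reason = kvm_run(pending, stats)
        if reason == "shutdown":
            return stats
        stats["userspace"] += 1  # e.g. device emulation for an I/O port access


stats = vcpu_thread(["io_port", "ept_violation", "io_port", "external_interrupt"])
print(stats)  # {'kernel': 2, 'userspace': 2}
```

Exits that stay in the kernel are much cheaper than those that return to userspace, which is one reason performance-critical device backends tend to move into the kernel or into dedicated I/O threads.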
Containers vs VMs: The Key Architectural Difference
┌──────────────────┐ ┌──────────────────┐
│ VM │ │ Container │
│ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ Guest OS │ │ │ │ App Process │ │
│ │ (full kernel)│ │ │ └──────────────┘ │
│ └──────────────┘ │ │ Namespaces + │
│ Hypervisor │ │ cgroups │
└──────────────────┘ └──────────────────┘
│ Host OS / KVM │ │ Host OS kernel │
└──────────────────┘ └──────────────────┘
VMs virtualise the hardware — each VM runs a complete OS kernel. Strong isolation (separate kernel, separate syscall table). Cold start: 1–30 seconds. Memory overhead: 100MB+ per VM (OS footprint).
Containers virtualise the OS — each container is a process with restricted views of the system via namespaces and resource limits via cgroups. They share the host kernel. Weak isolation (a kernel exploit affects all containers). Cold start: <100ms. Memory overhead: a few MB per container (no OS duplication).
The isolation boundary is the fundamental difference. A VM kernel exploit stays in the VM; a container kernel exploit is a host kernel exploit.
Linux Namespaces: What Each Isolates
Linux namespaces are the kernel mechanism behind container isolation. A process can be placed in a new namespace, giving it a private view of a system resource:
Namespace | Flag            | Isolates
----------|-----------------|----------------------------------------------------
pid       | CLONE_NEWPID    | Process ID numbering (PID 1 inside = init)
net       | CLONE_NEWNET    | Network interfaces, routing tables, iptables rules
mnt       | CLONE_NEWNS     | Filesystem mount points
uts       | CLONE_NEWUTS    | Hostname and NIS domain name
ipc       | CLONE_NEWIPC    | SysV IPC, POSIX message queues
user      | CLONE_NEWUSER   | UID/GID mapping (root inside ≠ root outside)
cgroup    | CLONE_NEWCGROUP | cgroup root (container sees own cgroup hierarchy)
time      | CLONE_NEWTIME   | Offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME (Linux 5.6+)
The unshare(1) command demonstrates this:
# Create a new PID namespace and run bash as PID 1
unshare --pid --fork --mount-proc bash
# Inside: ps aux shows only our bash process as PID 1
# Create a new network namespace (no interfaces except lo)
unshare --net bash
ip addr # only shows lo — completely isolated network stack
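Under the hood, each unshare option maps to one of the CLONE_NEW* flags, and a runtime ORs together the flags for every namespace it wants. The sketch below only computes the bitmask and does not call clone(2) or unshare(2). The numeric values are the ones defined in the Linux UAPI header linux/sched.h; the dictionary and function names are illustrative.

```python
# Flag values as defined in the Linux UAPI header <linux/sched.h>.
CLONE_FLAGS = {
    "time": 0x00000080,
    "mnt": 0x00020000,
    "cgroup": 0x02000000,
    "uts": 0x04000000,
    "ipc": 0x08000000,
    "user": 0x10000000,
    "pid": 0x20000000,
    "net": 0x40000000,
}


def namespace_flags(names):
    """OR together the CLONE_NEW* flags for the requested namespaces."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


flags = namespace_flags(["pid", "net", "mnt", "uts", "ipc"])
print(hex(flags))  # 0x6c020000
```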
cgroups v2: Resource Accounting and Limits
Control groups (cgroups) limit, account for, and isolate the resource usage (CPU, memory, disk I/O, network) of a collection of processes.
cgroups v2 (Linux 4.5+, the default on Ubuntu since 21.10 and therefore in 22.04 LTS) unifies the hierarchy: all controllers (cpu, memory, io) hang off a single unified tree at /sys/fs/cgroup/.
Memory Limits
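# Roughly what `docker run --memory=512m` does under the hood (run as root;
# the memory controller must be enabled in the parent's cgroup.subtree_control):
mkdir /sys/fs/cgroup/mycontainer
echo 536870912 > /sys/fs/cgroup/mycontainer/memory.max # 512 MiB
# CONTAINER_PID is a placeholder for the PID of the container's init process
echo "$CONTAINER_PID" > /sys/fs/cgroup/mycontainer/cgroup.procs
# When the container exceeds the limit and memory cannot be reclaimed:
# the OOM killer is invoked and kills a process in the cgroup, usually the largest memory user
# (configurable: memory.oom.group = 1 kills the entire cgroup)
cat /sys/fs/cgroup/mycontainer/memory.events
# oom 2 ← 2 OOM events occurred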
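Monitoring agents usually read memory.events programmatically rather than with cat. The file uses the cgroup v2 "flat keyed" format (one `key value` pair per line), so a parser is short. The sample text below is made up for illustration; on a real system you would read the file from the cgroup's directory.

```python
def parse_flat_keyed(text: str) -> dict:
    """Parse a cgroup v2 flat-keyed file such as memory.events into a dict of ints."""
    events = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        events[key] = int(value)
    return events


# Illustrative content only, not captured from a real host.
sample = "low 0\nhigh 0\nmax 7\noom 2\noom_kill 2\n"
events = parse_flat_keyed(sample)
print(events["oom"], events["oom_kill"])  # 2 2
```

A rising `max` counter with `oom` still at zero means the cgroup keeps hitting its limit but the kernel is managing to reclaim memory, which is often an early warning worth alerting on.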
CPU Shares and Throttling
# Cap the container at 1 CPU's worth of time (100ms quota per 100ms period),
# i.e. 50% of a 2-CPU host
echo "100000 100000" > /sys/fs/cgroup/mycontainer/cpu.max # "<quota> <period>" in microseconds
# cgroups v2 has no separate cpu.period file; both values live in cpu.max
# docker run --cpus=1.5 sets quota=150000, period=100000
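In cgroups v2 the quota and period are written together to cpu.max as a single `quota period` line, both in microseconds. The conversion from a `--cpus` style value is simple arithmetic; the helper name below is hypothetical and the 100ms default period matches the example above.

```python
def cpu_max_line(cpus: float, period_us: int = 100_000) -> str:
    """Return the 'quota period' string to write to cpu.max for a CPU count."""
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


print(cpu_max_line(1.5))  # 150000 100000
print(cpu_max_line(0.5))  # 50000 100000
```

A quota larger than the period is legal and simply means the cgroup may use more than one CPU's worth of time per period across its threads.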
I/O Throttling
# Throttle writes to /dev/sda to 10 MB/s
echo "8:0 wbps=10485760" > /sys/fs/cgroup/mycontainer/io.max
Union Filesystems: Docker Image Layers
Docker images are built from layers — each RUN, COPY, or ADD instruction in a Dockerfile creates a new layer. Layers are read-only content-addressable blobs stored in /var/lib/docker/overlay2/.
OverlayFS (Linux 3.18+) merges these layers into a single unified filesystem view. An overlay mount involves three kinds of source directory plus the merged mount point:
lower (read-only image layers, stacked)
upper (read-write container layer — all writes go here)
work (OverlayFS internal scratch space)
merged (the unified view presented to the container)
# Inspect an actual overlay mount:
mount | grep overlay
# overlay on / type overlay (rw,lowerdir=/var/lib/docker/overlay2/abc.../diff:
# /var/lib/docker/overlay2/def.../diff,
# upperdir=/var/lib/docker/overlay2/xyz.../diff,
# workdir=/var/lib/docker/overlay2/xyz.../work)
Copy-on-write: When a container writes to a file that exists in a lower layer, OverlayFS copies the file to upper/ first (copy-up), then writes to the copy. Subsequent reads return the upper/ version. This means the first write to a large file in a lower layer incurs a full copy — a latency cliff to be aware of in I/O-heavy containers.
Layer sharing: Multiple containers running the same image share all lower layers. A 1 GB base image shared by 50 containers occupies 1 GB on disk, not 50 GB.
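Lookup order, copy-up, and layer sharing can be modelled with dictionaries. This is a toy sketch of the semantics only: `ToyOverlay` is a made-up class, files are strings, and whiteouts (deletions) and directories are left out.

```python
class ToyOverlay:
    def __init__(self, lowers):
        self.lowers = lowers  # list of dicts, topmost image layer first
        self.upper = {}       # per-container writable layer
        self.copy_ups = 0

    def read(self, path):
        # The upper layer wins; otherwise search the lower layers top-down.
        if path in self.upper:
            return self.upper[path]
        for layer in self.lowers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        if path not in self.upper:
            # Copy-up: the whole file is copied before the first modification.
            self.upper[path] = self.read(path)
            self.copy_ups += 1
        self.upper[path] += data


base = {"/etc/motd": "hello"}  # one shared read-only image layer
a = ToyOverlay([base])
b = ToyOverlay([base])
a.append("/etc/motd", " world")
a.append("/etc/motd", "!")
print(a.read("/etc/motd"), "|", b.read("/etc/motd"), "|", a.copy_ups)
# hello world! | hello | 1
```

Container `a` paid for one copy-up on its first write and none on the second, while container `b` still sees the untouched shared layer.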
Container Networking: veth Pairs and iptables
By default, Docker creates a bridge network (docker0, typically 172.17.0.0/16). Each container gets:
- A veth pair: two virtual Ethernet interfaces linked back-to-back. One end (eth0) lives in the container’s network namespace; the other lives in the host’s default namespace, attached to docker0.
- An IP address within the bridge subnet, assigned by Docker’s embedded IPAM.
Container ns: Host ns:
eth0 ──────────── veth3a2b ─── docker0 (172.17.0.1)
172.17.0.2/16 │
iptables NAT
MASQUERADE → eth0 (host public IP)
Port mapping (-p 8080:80): Docker adds an iptables DNAT rule to its DOCKER chain in the nat table, which the PREROUTING chain jumps to:
# What docker -p 8080:80 creates:
iptables -t nat -A DOCKER -p tcp --dport 8080 \
-j DNAT --to-destination 172.17.0.2:80
Incoming packets on host port 8080 are rewritten to 172.17.0.2:80 before routing, then forwarded through the docker0 bridge into the container’s veth.
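The effect of the DNAT rule on a packet's destination can be modelled as a table lookup. This is a sketch of the rewrite step only, not of netfilter itself; the function name and the host address 203.0.113.10 (a documentation-range IP) are assumptions.

```python
import ipaddress

BRIDGE = ipaddress.ip_network("172.17.0.0/16")
# (protocol, host port) -> (container IP, container port), mirroring the rule above
DNAT_RULES = {("tcp", 8080): ("172.17.0.2", 80)}


def prerouting(proto, dst_ip, dst_port):
    """Return the (possibly rewritten) destination of an incoming packet."""
    target = DNAT_RULES.get((proto, dst_port))
    if target is None:
        return dst_ip, dst_port  # no match: packet is left unchanged
    return target


new_ip, new_port = prerouting("tcp", "203.0.113.10", 8080)
print(new_ip, new_port, ipaddress.ip_address(new_ip) in BRIDGE)  # 172.17.0.2 80 True
```

Because the rewritten destination falls inside the bridge subnet, the host's routing table sends the packet to docker0 rather than delivering it locally.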
Docker Internals: containerd → runc → OCI
The Docker stack is layered:
docker CLI
│ REST API (Unix socket)
▼
dockerd (Docker daemon)
│ gRPC
▼
containerd (container lifecycle manager — CNCF project)
│ ttrpc (containerd shim API)
▼
containerd-shim-runc-v2 (per-container process)
│ exec()
▼
runc (OCI runtime — creates namespaces + cgroups, exec's the container)
│
▼
Container process (PID 1 in new namespaces)
OCI (Open Container Initiative): defines two standards:
- Image spec: how image layers and manifests are structured (the format stored in registries).
- Runtime spec: the config.json that runc reads to create a container (namespaces, mounts, capabilities, seccomp filters).
runc is the minimal reference implementation: reads config.json, calls clone(2) with the appropriate namespace flags, sets up cgroups, applies seccomp/apparmor, then exec(2)s the container’s entrypoint.
# Manually create and run an OCI container without Docker:
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer # runc spec and runc run operate on the bundle in the current directory
# (populate rootfs with a minimal filesystem)
runc spec # generates config.json
runc run mycontainer
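To get a feel for what config.json contains, the sketch below builds a heavily trimmed version. The field names (`ociVersion`, `process`, `root`, `linux.namespaces`) and namespace type strings come from the OCI runtime spec, but the file that `runc spec` generates carries many more fields (mounts, capabilities, rlimits), so treat this as a reading aid rather than a bundle that is guaranteed to run. The helper name is hypothetical.

```python
import json


def minimal_oci_config(args, namespaces=("pid", "network", "mount", "ipc", "uts")):
    """Build a trimmed-down OCI runtime config as a Python dict."""
    return {
        "ociVersion": "1.0.2",
        "process": {"args": list(args), "cwd": "/"},
        "root": {"path": "rootfs", "readonly": True},
        # Each entry asks the runtime to create a new namespace of that type.
        "linux": {"namespaces": [{"type": ns} for ns in namespaces]},
    }


config = minimal_oci_config(["/bin/sh"])
print(json.dumps(config, indent=2))
```

Note that the spec spells the namespace types `network` and `mount`, not `net` and `mnt` as in the kernel's /proc names.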
Kubernetes: Pods and the Container Runtime Interface
Kubernetes orchestrates containers at scale. The core primitive is the Pod — a group of 1+ containers sharing a network namespace (same IP, same localhost) and optionally storage volumes.
Pod Network Namespace
All containers in a pod communicate via localhost. This is implemented by a pause container (infra container): an empty container whose sole job is to hold the network namespace open. Application containers join this existing namespace when they start.
Pod:
pause container (holds net namespace, IP: 10.0.0.5)
├── app container A (joins pause's net ns → localhost:8080)
└── app container B (joins pause's net ns → localhost:3306)
kubelet and CRI
kubelet (the per-node Kubernetes agent) uses the Container Runtime Interface (CRI), a gRPC API, to talk to any compliant container runtime. The two most widely deployed runtimes are:
- containerd (with CRI plugin): the standard, used in GKE, EKS, AKS.
- CRI-O: Red Hat’s minimal runtime, used in OpenShift.
kubelet → (CRI gRPC) → containerd → runc → container
eBPF: Replacing iptables in Container Networking
eBPF (extended Berkeley Packet Filter) allows safely running sandboxed programs inside the Linux kernel — in network, tracing, and security contexts — without modifying kernel source or loading kernel modules.
Cilium uses eBPF to replace kube-proxy’s iptables-based service load balancing:
iptables (kube-proxy):
Every packet to a ClusterIP → iptables NAT chain traversal
O(n) rules, no connection tracking per-service, hard to debug
eBPF (Cilium):
eBPF programs attached to network interfaces + socket layer
XDP (eXpress Data Path): packet processing before kernel stack — O(1) lookup
BPF maps (hash tables) store service endpoint state
Load balancing at socket layer: no DNAT, no conntrack overhead
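Cilium can deliver noticeably better network throughput than kube-proxy at scale (3–5× is sometimes quoted, but the gain depends on the number of services and the workload). In its kube-proxy replacement mode it removes the need for kube-proxy’s iptables rules on nodes.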
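The O(n) versus O(1) difference can be illustrated by counting comparisons. This is a simplified model: real kube-proxy chains are structured per service and a BPF hash map is a kernel data structure, not a Python dict. The service and backend addresses are invented.

```python
def iptables_style(rules, dst):
    """Walk an ordered rule list until one matches; count rules examined."""
    checked = 0
    for match, backend in rules:
        checked += 1
        if match == dst:
            return backend, checked
    return None, checked


def bpf_map_style(service_map, dst):
    """Single hash lookup keyed on (ClusterIP, port)."""
    return service_map.get(dst), 1


# 200 hypothetical services, each with one backend.
services = [((f"10.96.0.{i}", 80), f"10.0.0.{i}") for i in range(1, 201)]
service_map = dict(services)
dst = ("10.96.0.200", 80)
print(iptables_style(services, dst))    # ('10.0.0.200', 200)
print(bpf_map_style(service_map, dst))  # ('10.0.0.200', 1)
```

The linear cost is paid on the first packet of each connection in the iptables model, so the gap widens as the number of services grows.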
gVisor: User-Space Kernel for Stronger Isolation
Containers share the host kernel — a container breakout exploit (e.g., a kernel CVE) compromises the entire host. Google’s gVisor interposes a user-space kernel between the container and the host kernel.
Container process
│ syscall
▼
gVisor (Sentry — user-space kernel written in Go)
├── Implements ~200 Linux syscalls in Go
├── Enforces container isolation in user space
└── Makes only ~50 host syscalls itself; file access is proxied through a separate Gofer process
│
▼
Host Linux kernel (tiny, well-audited syscall surface)
runsc is gVisor’s OCI-compatible runtime. It can stand in for runc on clusters whose nodes have it installed and which define a matching RuntimeClass (named gvisor here):
# Use gVisor for a specific Kubernetes pod:
spec:
  runtimeClassName: gvisor
Tradeoffs:
- ✅ Kernel-level isolation: a container exploit hits the gVisor Sentry, not the host kernel
- ❌ Performance: syscall interception adds ~10–30% overhead for I/O-heavy workloads
- ❌ Compatibility: unusual syscalls or /proc features may not be implemented
Google Cloud Run’s first-generation execution environment runs containers under gVisor.
Firecracker: MicroVMs for Serverless
AWS Lambda needed the isolation of VMs with the startup speed of containers. The result is Firecracker — a microVM hypervisor written in Rust, open-sourced in 2018.
Traditional VM (QEMU/KVM):
- Full BIOS/UEFI boot sequence
- Dozens of emulated devices (USB, PCI bus, AC97 audio...)
- Cold start: 1–30 seconds
Firecracker microVM:
- Stripped to essentials: virtio-net, virtio-block, serial console
- No legacy devices, no BIOS POST
- Custom microVM kernel (optimised for fast boot)
- Cold start: roughly 125ms from the start request to guest user space, kernel boot included, per the Firecracker project’s published figure
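Firecracker uses Linux’s KVM for hardware-assisted CPU and memory virtualisation, but replaces QEMU entirely with its own minimal device model. Each Lambda execution environment runs in its own Firecracker microVM, which may be reused for later invocations of the same function: VM-level isolation with near-container startup time.
Jailer: In production, Firecracker is launched through a companion jailer process that confines it with a chroot, namespaces, and cgroups and drops privileges. Firecracker additionally applies seccomp filters that allow only a few dozen syscalls, so the microVM manager itself has a minimal attack surface.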
WebAssembly as a Container Alternative
WebAssembly (WASM) + WASI is emerging as a third isolation model:
Model | Isolation Unit | Startup | Syscall Interface | Portability
-------------|-----------------|----------|---------------------|------------
Process/VM | OS process/VM | 100ms+ | Full POSIX | None (arch-specific)
Container | Linux namespace | 10–100ms | Full Linux syscalls | Linux only
WASM/WASI | WASM sandbox | <1ms | WASI (capability) | Any OS, any arch
WASI (WebAssembly System Interface) is a capability-based POSIX-like API: a WASM module must be granted access to specific filesystem paths, network sockets, and environment variables at runtime — by default it has access to nothing.
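The deny-by-default idea can be sketched with a toy path check. This models the concept of preopened directories only; it is not the WASI API, and `ToySandbox` and `can_open` are hypothetical names.

```python
from pathlib import PurePosixPath


class ToySandbox:
    """A module may open a path only if it sits under a directory granted at startup."""

    def __init__(self, preopened_dirs=()):
        self.preopened = [PurePosixPath(d) for d in preopened_dirs]

    def can_open(self, path: str) -> bool:
        p = PurePosixPath(path)
        if ".." in p.parts:
            return False  # refuse attempts to climb out of a granted directory
        return any(p == d or d in p.parents for d in self.preopened)


print(ToySandbox().can_open("/data/in.txt"))           # False
print(ToySandbox(["/data"]).can_open("/data/in.txt"))  # True
print(ToySandbox(["/data"]).can_open("/etc/passwd"))   # False
```

Contrast this with a container, where the process starts with access to its whole root filesystem and restrictions are layered on afterwards.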
WASM Component Model defines a standard binary interface for composing WASM modules. Tools like wasmtime, Spin (Fermyon), and Fastly Compute@Edge run WASM modules at the edge with sub-millisecond cold starts.
Docker co-founder Solomon Hykes put it this way in March 2019: “If WASM and WASI existed in 2008, we wouldn’t have needed to create Docker.”
Interview Deep Dives
“What’s the difference between a container and a VM?”
The one-sentence answer: A VM virtualises hardware (each VM runs its own kernel); a container virtualises the OS (containers share the host kernel via namespaces and cgroups).
The complete answer:
| Dimension | Virtual Machine | Container |
|---|---|---|
| Isolation unit | Hardware (CPU, RAM, disk, NIC) | OS (namespaces + cgroups) |
| Kernel | Each VM has its own full kernel | All containers share host kernel |
| Isolation strength | Strong (VM breakout ≠ host kernel) | Weaker (kernel CVE = all containers) |
| Cold start | 1–30 seconds (OS boot) | <100ms (process start) |
| Memory overhead | 100MB–1GB per VM (OS footprint) | <10MB per container |
| Filesystem | Full virtual disk image | OverlayFS image layers |
| Portability | Hypervisor-specific image formats | OCI images — runtime-agnostic |
| Best for | Strong multi-tenant isolation | Microservices, CI, packaging |
“How does Docker networking work?”
The complete answer:
When Docker starts a container without custom networking:
- docker0 bridge (172.17.0.0/16) is created on the host at Docker startup (if absent). It’s a Linux software bridge, like a virtual Ethernet switch.
- veth pair is created: vethXXXX on the host side (attached to docker0), eth0 on the container side (inside the container’s network namespace).
- IP assignment: Docker’s IPAM allocates an IP (e.g., 172.17.0.5) from the bridge subnet and assigns it to eth0 inside the container.
- Default route: 0.0.0.0/0 via 172.17.0.1 (docker0’s IP), so all external traffic goes through the bridge.
- NAT (MASQUERADE): an iptables rule in the POSTROUTING chain rewrites the source IP of packets from the bridge subnet to the address of the host interface they leave through. Return traffic is de-NATed by conntrack.
- Port mapping (-p 8080:80): An iptables DNAT rule (in the DOCKER chain, reached from PREROUTING) rewrites packets arriving on host port 8080 to 172.17.0.5:80.
# Verify this yourself:
docker run -d -p 8080:80 nginx
iptables -t nat -L DOCKER -n --line-numbers
# → DNAT rule: tcp dpt:8080 → 172.17.0.X:80
ip link show docker0
ip link show master docker0 # lists the host-side veth interfaces attached to the bridge
This iptables-based approach is what Cilium replaces with eBPF for better performance at scale.