Virtualization & Containers

Why Virtualisation? The Problem Statement

Before virtualisation, a physical server ran one operating system. If you needed five services, you provisioned five servers — each running at 5–15% CPU utilisation. Capital expenditure was wasted on idle hardware, deployment was slow (weeks to provision physical machines), and software environments were brittle (“works on my machine”).

Virtualisation solves all three problems simultaneously:

  • Consolidation: Run 20–100 VMs per physical host, raising utilisation to 60–80%.
  • Isolation: A bug or security compromise in one VM cannot directly affect another — each has its own kernel, memory space, and virtual hardware.
  • Reproducibility: A VM image is an exact, versioned snapshot of an entire OS + software stack. Deploy the same image anywhere.

The fundamental challenge: how do you run an operating system — which expects exclusive control of hardware — as a user-space application on another operating system?


Full Virtualisation: Hypervisor Types

A hypervisor (Virtual Machine Monitor, VMM) is the software layer that multiplexes physical hardware across multiple virtual machines.

Type 1 Hypervisor (Bare-Metal)

A Type 1 hypervisor runs directly on the hardware. There is no host OS; the hypervisor is the OS for the purpose of resource management.

┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   VM 1       │ │   VM 2       │ │   VM 3       │
│ (Guest OS +  │ │ (Guest OS +  │ │ (Guest OS +  │
│  Workload)   │ │  Workload)   │ │  Workload)   │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       │                │                │
┌──────┴────────────────┴────────────────┴─────────┐
│               Type 1 Hypervisor                   │
│          (VMware ESXi / KVM / Hyper-V)            │
├───────────────────────────────────────────────────┤
│                   Physical Hardware               │
└───────────────────────────────────────────────────┘

VMware ESXi: Proprietary, widely used in enterprise data centres. Mature live migration (vMotion), storage virtualisation, and management tools.

KVM (Kernel-based Virtual Machine): A Linux kernel module that turns Linux itself into a hypervisor; it is usually classed as Type 1 because it runs in the kernel with direct access to the hardware. KVM exposes /dev/kvm, a character device driven through ioctl calls on file descriptors, for creating and managing VMs. KVM handles CPU and memory virtualisation via Intel VT-x / AMD-V; QEMU provides I/O emulation. AWS EC2 (Nitro-based instances), Google Compute Engine, and most OpenStack deployments run on KVM.

Hyper-V: Microsoft’s hypervisor, integral to Windows Server. Used for Azure VMs.

Type 2 Hypervisor (Hosted)

A Type 2 hypervisor runs as a process inside a host OS. The host OS manages hardware; the hypervisor multiplexes CPU time within that process.

┌──────────────┐ ┌──────────────┐
│   VM 1       │ │   VM 2       │
│ (Guest OS)   │ │ (Guest OS)   │
└──────┬───────┘ └──────┬───────┘
       │                │
┌──────┴────────────────┴────────┐
│  Type 2 Hypervisor Process     │
│  (VirtualBox / Parallels)      │
└────────────────┬───────────────┘
┌───────────────────────────────┐
│         Host OS               │
│     (macOS / Windows)         │
├───────────────────────────────┤
│       Physical Hardware       │
└───────────────────────────────┘

Type 2 hypervisors have higher overhead (host OS scheduling adds jitter) but are convenient for developer workstations. Parallels Desktop and VMware Fusion on macOS are Type 2; they leverage Apple Hypervisor.framework for hardware-assisted execution.


CPU Virtualisation: How Privileged Instructions Are Handled

A guest OS assumes it has ring 0 (kernel mode) privilege. In reality, the hypervisor runs in ring 0 (or VMX root mode) and the guest runs in ring 3 (or VMX non-root mode). When the guest attempts a privileged operation, one of three techniques handles it:

Trap-and-Emulate (Classic)

Privileged instructions executed in ring 3 trigger a General Protection Fault — a CPU exception caught by the hypervisor. The hypervisor inspects the instruction, emulates its effect on the virtual machine’s state, and resumes the guest. Clean but requires the ISA to be “classically virtualizable” (all privileged instructions trap in user mode). x86 is notoriously not classically virtualizable — 17 instructions behave differently in ring 3 without trapping.

Binary Translation (Old VMware)

VMware’s original solution: before executing guest code, scan basic blocks and rewrite the 17 problematic x86 instructions into equivalent safe sequences that trap correctly. The translated code is cached. This is expensive (translation overhead) and complex but works on any hardware.

Hardware-Assisted Virtualisation: Intel VT-x / AMD-V

Modern CPUs add a second execution-mode dimension that is orthogonal to the privilege rings: VMX root mode (host) and VMX non-root mode (guest) on Intel, with host and guest modes as the AMD-V equivalent. The guest OS runs in ring 0 of non-root mode. It believes it has full ring 0 privilege, but the CPU automatically intercepts sensitive operations and triggers a VM exit to the hypervisor in root mode. No binary translation is needed.

Key VT-x concepts:

  • VMCS (VM Control Structure): A per-vCPU region of memory, managed by the CPU, that stores guest/host register state and the exit controls. VMLAUNCH (first entry) and VMRESUME (later entries) context-switch into the guest. The AMD-V equivalents are the VMCB and VMRUN.
  • VM exits: Triggered by events the hypervisor has configured the CPU to intercept, such as I/O port access, CPUID, HLT, and EPT violations. The hypervisor handles the exit and calls VMRESUME.
  • VPID (Virtual Processor ID): Tags TLB entries with a VM identifier so TLBs don’t need full flushes on VM context switches.

Guest executes privileged instruction
          │
          ▼ (VM exit, performed by CPU hardware)
Hypervisor VM exit handler runs in VMX root mode
          │
          ▼
Emulate the instruction / handle I/O / update VMCS
          │
          ▼ VMRESUME
Guest continues in VMX non-root mode

Memory Virtualisation: Shadow Pages and Nested Paging

A guest OS manages a guest physical address space (GPA) — it believes these are physical addresses. The hypervisor must translate GPA → host physical address (HPA).

Shadow Page Tables

The hypervisor maintains shadow page tables that map guest virtual addresses (GVA) directly to host physical addresses (HPA), bypassing the guest’s page tables. When the guest modifies its page tables, a VM exit is triggered, and the hypervisor updates the shadow tables. This is correct but expensive — every page-table modification is a VM exit.

Extended Page Tables (EPT) / Nested Page Tables (NPT)

Intel EPT (part of VT-x) and AMD NPT add a second level of hardware page-table walking:

Guest virtual address (GVA)

      ▼ Guest page tables (walk in hardware)
Guest physical address (GPA)

      ▼ EPT tables (second hardware walk)
Host physical address (HPA)

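The CPU hardware performs the two-dimensional walk without hypervisor intervention. The cost is a more expensive TLB miss: every guest page-table pointer is itself a GPA that must be translated through the EPT before it can be read. With a 4-level guest table and a 4-level EPT, a miss needs up to (4 + 1) × (4 + 1) − 1 = 24 memory accesses: five EPT walks of four accesses each (one per guest level plus one for the final GPA) plus the four guest page-table reads. The result is cached in the TLB, tagged with the VPID. Because guest page-table updates no longer cause VM exits, EPT/NPT can cut memory virtualisation overhead by as much as 80% compared with shadow page tables, although the gain depends on the workload.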
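The worst-case count for a nested walk can be checked with a few lines of arithmetic. This is a minimal sketch of the counting argument only; the function name and parameters are illustrative, and it ignores the paging-structure caches that real CPUs use to shorten most walks.

```python
def nested_walk_accesses(guest_levels: int = 4, ept_levels: int = 4) -> int:
    """Worst-case memory references for one TLB miss under nested paging."""
    # Every guest page-table pointer is a GPA, so it needs a full EPT walk
    # before the guest entry can be read. The final data GPA needs one more.
    ept_walks = guest_levels + 1
    ept_refs = ept_walks * ept_levels
    # Reading the guest page-table entries themselves.
    guest_refs = guest_levels
    return ept_refs + guest_refs


print(nested_walk_accesses())      # 4-level guest tables, 4-level EPT
print(nested_walk_accesses(5, 5))  # 5-level paging on both sides
```

With zero EPT levels the function reduces to the native case, where a miss costs one access per page-table level.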


I/O Virtualisation

Emulated Devices

The hypervisor emulates hardware devices (e.g., an Intel e1000 NIC, an IDE disk controller). The guest uses standard kernel drivers unchanged. Every I/O operation causes a VM exit → QEMU emulation. High compatibility, mediocre performance.

Paravirtual Drivers (virtio)

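virtio is an open standard for paravirtual I/O. Instead of emulating real hardware, the hypervisor exposes a virtio-net / virtio-blk device. The guest uses a virtio driver that communicates via shared-memory virtqueue ring buffers. Queuing a buffer is a plain memory write, so there is no VM exit per packet; the guest issues a single doorbell kick (a notification to the host) once a batch of buffers is ready.

Result: virtio-net typically achieves 10 Gbps+ throughput, while an emulated e1000 tops out around 1 Gbps. QEMU's default devices on x86 are still emulated ones, so virtio has to be requested explicitly; libvirt-based tooling and cloud images commonly select it.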
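The difference in notification count can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the virtio ring layout: the class and function names are hypothetical, the emulated path is charged one trap per packet (real emulated NICs usually need several register accesses per packet), and the virtio path is charged one kick per batch.

```python
from collections import deque


class ToyVirtqueue:
    """Toy stand-in for a virtqueue: a shared ring plus a kick counter."""

    def __init__(self):
        self.ring = deque()
        self.kicks = 0  # each kick models one guest-to-host notification

    def add_buffer(self, buf: bytes) -> None:
        # Plain write to shared memory: no exit in this model.
        self.ring.append(buf)

    def kick(self) -> None:
        self.kicks += 1

    def device_drain(self) -> list:
        # Host side: consume everything the guest queued.
        drained = list(self.ring)
        self.ring.clear()
        return drained


def send_emulated(packets) -> int:
    """Assumed cost model: one trap per packet for an emulated device."""
    return len(packets)


def send_virtio(vq: ToyVirtqueue, packets) -> int:
    """Queue the whole batch, then notify the host once."""
    for packet in packets:
        vq.add_buffer(packet)
    vq.kick()
    vq.device_drain()
    return vq.kicks


batch = [b"payload"] * 64
print("emulated traps:", send_emulated(batch))
print("virtio kicks:  ", send_virtio(ToyVirtqueue(), batch))
```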

SR-IOV (Single Root I/O Virtualisation)

SR-IOV is a PCIe hardware standard that allows a single physical NIC to present multiple Virtual Functions (VFs). Each VF appears as an independent PCIe device that can be assigned to a VM. DMA between the VM and the NIC goes through the IOMMU, with no hypervisor involvement on the data path. SR-IOV underpins AWS EC2's enhanced networking, which reaches up to 100 Gbps on some instance types.


KVM + QEMU: The Linux VM Stack

Understanding KVM + QEMU is understanding how most cloud VMs work.

KVM (kvm.ko kernel module) handles:

  • CPU virtualisation (VMX/SVM, VMCS management)
  • Memory virtualisation (EPT/NPT)
  • Exposes /dev/kvm ioctl API

QEMU (userspace) handles:

  • I/O device emulation (virtio, USB, display, BIOS/UEFI)
  • VM lifecycle management
  • Block device backends (qcow2 images, raw files, NBD)

QEMU Process (userspace)
  ├── vCPU threads: one per virtual CPU
  │     Each thread calls ioctl(KVM_RUN) → enters VMX non-root mode
  │     On VM exit → QEMU handles I/O, returns to KVM_RUN
  ├── I/O threads: virtio backend processing
  └── Main thread: VM lifecycle, monitor, QMP API

Kernel (kvm.ko)
  ├── VMCS management
  ├── EPT page tables
  └── KVM_RUN ioctl implementation

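When a vCPU thread calls ioctl(KVM_RUN), the kernel executes VMLAUNCH (first entry) or VMRESUME and the guest runs until a VM exit. The CPU records the exit reason (I/O port, EPT violation, external interrupt, etc.) in the VMCS. KVM handles what it can in the kernel, such as EPT violations and interrupts, and re-enters the guest; for exits that need device emulation, KVM_RUN returns to QEMU with the reason in the shared kvm_run structure.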
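The shape of the userspace side of this loop can be sketched as a small simulation. It does not open /dev/kvm: the list of exits and the handler strings are hypothetical stand-ins for successive returns from KVM_RUN, and only the enum values are intended to mirror the KVM_EXIT_* constants in <linux/kvm.h>.

```python
from enum import IntEnum


class ExitReason(IntEnum):
    # Intended to mirror KVM_EXIT_IO, KVM_EXIT_HLT and KVM_EXIT_MMIO.
    IO = 2
    HLT = 5
    MMIO = 6


def run_vcpu(exits):
    """Simulate one vCPU thread.

    Each (reason, detail) pair stands for one return from ioctl(KVM_RUN).
    """
    log = []
    for reason, detail in exits:
        if reason == ExitReason.IO:
            log.append(f"emulate port I/O at {detail}")
        elif reason == ExitReason.MMIO:
            log.append(f"emulate MMIO at {detail}")
        elif reason == ExitReason.HLT:
            log.append("guest halted")
            break  # stop running this vCPU in the simulation
        # Falling through to the next iteration models calling KVM_RUN again.
    return log


trace = [
    (ExitReason.IO, "0x3f8"),
    (ExitReason.MMIO, "0xfee00000"),
    (ExitReason.HLT, None),
    (ExitReason.IO, "0x60"),  # never reached: the vCPU has halted
]
for line in run_vcpu(trace):
    print(line)
```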


Containers vs VMs: The Key Architectural Difference

┌──────────────────┐   ┌──────────────────┐
│  VM              │   │  Container       │
│ ┌──────────────┐ │   │ ┌──────────────┐ │
│ │ Guest OS     │ │   │ │ App Process  │ │
│ │ (full kernel)│ │   │ └──────────────┘ │
│ └──────────────┘ │   │ Namespaces +     │
│ Hypervisor       │   │ cgroups          │
├──────────────────┤   ├──────────────────┤
│  Host OS / KVM   │   │  Host OS kernel  │
└──────────────────┘   └──────────────────┘

VMs virtualise the hardware — each VM runs a complete OS kernel. Strong isolation (separate kernel, separate syscall table). Cold start: 1–30 seconds. Memory overhead: 100MB+ per VM (OS footprint).

Containers virtualise the OS: each container is a process with restricted views of the system via namespaces and resource limits via cgroups. They share the host kernel. Weaker isolation (a kernel exploit affects all containers). Cold start: <100ms. Memory overhead: a few MB per container (no OS duplication).

The isolation boundary is the fundamental difference. A VM kernel exploit stays in the VM; a container kernel exploit is a host kernel exploit.


Linux Namespaces: What Each Isolates

Linux namespaces are the kernel mechanism behind container isolation. A process can be placed in a new namespace, giving it a private view of a system resource:

Namespace | Flag            | Isolates
----------|-----------------|----------------------------------------------------
pid       | CLONE_NEWPID    | Process ID numbering (PID 1 inside = init)
net       | CLONE_NEWNET    | Network interfaces, routing tables, iptables rules
mnt       | CLONE_NEWNS     | Filesystem mount points
uts       | CLONE_NEWUTS    | Hostname and NIS domain name
ipc       | CLONE_NEWIPC    | SysV IPC, POSIX message queues
user      | CLONE_NEWUSER   | UID/GID mapping (root inside ≠ root outside)
cgroup    | CLONE_NEWCGROUP | cgroup root (container sees own cgroup hierarchy)
time      | CLONE_NEWTIME   | CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets (Linux 5.6+)

The unshare(1) command demonstrates this:

# Create a new PID namespace and run bash as PID 1
unshare --pid --fork --mount-proc bash
# Inside: ps aux shows only our bash process as PID 1

# Create a new network namespace (no interfaces except lo)
unshare --net bash
ip addr    # only shows lo — completely isolated network stack
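To see which namespaces a process currently belongs to, the kernel exposes one symlink per namespace under /proc/<pid>/ns. The following is a minimal sketch that reads them; the function name is illustrative, and it returns an empty dict on systems without that directory.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: identifier} read from /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):
        return result  # not Linux, no such process, or /proc not mounted
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink whose target names the namespace instance.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not readable
    return result


for name, ident in list_namespaces().items():
    print(f"{name:20} {ident}")
```

Running this once on the host and once inside a shell started with unshare should show different identifiers for the namespaces that were unshared and identical ones for the rest.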

cgroups v2: Resource Accounting and Limits

Control groups (cgroups) limit, account for, and isolate the resource usage (CPU, memory, disk I/O, network) of a collection of processes.

cgroups v2 (Linux 4.5+, the default on Ubuntu since 21.10 and therefore on 22.04 LTS) unifies the hierarchy: all controllers (cpu, memory, io) hang off a single unified tree at /sys/fs/cgroup/.

Memory Limits

# What `docker run --memory=512m` does under the hood:
mkdir /sys/fs/cgroup/mycontainer
echo 536870912 > /sys/fs/cgroup/mycontainer/memory.max   # 512 MiB
echo <container_pid> > /sys/fs/cgroup/mycontainer/cgroup.procs

# When the container exceeds the limit:
# the OOM killer is invoked and kills a process in the cgroup, usually the largest memory consumer
# (configurable: memory.oom.group = 1 kills the entire cgroup)
cat /sys/fs/cgroup/mycontainer/memory.events
# oom 2   ← 2 OOM events occurred
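A monitoring script would typically parse memory.events rather than read it by eye. The file uses the cgroup v2 "flat keyed" format, one "key value" pair per line. This is a minimal sketch; the sample text and its counter values are made up for illustration.

```python
def parse_flat_keyed(text: str) -> dict:
    """Parse a cgroup v2 flat-keyed file such as memory.events into {key: int}."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "key <non-negative integer>".
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


# Hypothetical file contents, for illustration only.
sample = """low 0
high 0
max 14
oom 2
oom_kill 2
"""

events = parse_flat_keyed(sample)
if events.get("oom_kill", 0) > 0:
    print(f"{events['oom_kill']} process(es) killed by the OOM killer")
```

On a real system the text would come from reading /sys/fs/cgroup/mycontainer/memory.events.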

CPU Shares and Throttling

# Limit the container to 1 full CPU equivalent (50% of a 2-CPU host).
# cgroups v2 has no separate period file: cpu.max holds "<quota> <period>" in microseconds.
echo "100000 100000" > /sys/fs/cgroup/mycontainer/cpu.max   # 100ms quota per 100ms period

# docker run --cpus=1.5 sets quota=150000, period=100000
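The quota arithmetic is simple enough to write down. In cgroups v2 the cpu.max file holds the quota and the period together, in microseconds. The helper below is a sketch with an illustrative name; the 100 ms default period matches the value used in the example above.

```python
def cpu_max_value(cpus: float, period_us: int = 100_000) -> str:
    """Format the "<quota> <period>" string for cpu.max, given a CPU count."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    # Quota is the CPU time, in microseconds, allowed per period.
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


print(cpu_max_value(1.0))  # one full CPU
print(cpu_max_value(1.5))  # the --cpus=1.5 case
print(cpu_max_value(0.5))  # half a CPU
```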

I/O Throttling

# Throttle writes to /dev/sda to 10 MB/s
echo "8:0 wbps=10485760" > /sys/fs/cgroup/mycontainer/io.max

Union Filesystems: Docker Image Layers

Docker images are built from layers — each RUN, COPY, or ADD instruction in a Dockerfile creates a new layer. Layers are read-only content-addressable blobs stored in /var/lib/docker/overlay2/.

OverlayFS (Linux 3.18+) merges these layers into a single unified filesystem view using three kinds of directory (lower, upper, work) plus the merged mount point:

lower (read-only image layers, stacked)
upper (read-write container layer — all writes go here)
work  (OverlayFS internal scratch space)
merged (the unified view presented to the container)

# Inspect an actual overlay mount:
mount | grep overlay
# overlay on / type overlay (rw,lowerdir=/var/lib/docker/overlay2/abc.../diff:
#   /var/lib/docker/overlay2/def.../diff,
#   upperdir=/var/lib/docker/overlay2/xyz.../diff,
#   workdir=/var/lib/docker/overlay2/xyz.../work)

Copy-on-write: When a container writes to a file that exists in a lower layer, OverlayFS copies the file to upper/ first (copy-up), then writes to the copy. Subsequent reads return the upper/ version. This means the first write to a large file in a lower layer incurs a full copy — a latency cliff to be aware of in I/O-heavy containers.

Layer sharing: Multiple containers running the same image share all lower layers. A 1 GB base image shared by 50 containers occupies 1 GB on disk, not 50 GB.
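The lookup order, copy-up, and layer sharing can be modelled in a few lines. This is a toy model under strong simplifications, not how the kernel implements OverlayFS: layers are plain dicts from path to bytes, whiteouts (deletions) are left out, and all names are hypothetical.

```python
class ToyOverlay:
    """Toy model of OverlayFS lookup and copy-up."""

    def __init__(self, lowers):
        self.lowers = lowers  # read-only image layers, topmost first
        self.upper = {}       # this container's writable layer
        self.copy_ups = 0

    def read(self, path):
        # The upper layer wins; otherwise search lower layers top to bottom.
        if path in self.upper:
            return self.upper[path]
        for layer in self.lowers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        if path not in self.upper:
            try:
                # Copy-up: the whole file is copied before the first write.
                self.upper[path] = self.read(path)
                self.copy_ups += 1
            except FileNotFoundError:
                self.upper[path] = b""  # new file, nothing to copy
        self.upper[path] += data


base = {"/etc/hosts": b"127.0.0.1 localhost\n"}
a = ToyOverlay([base])
b = ToyOverlay([base])  # second container sharing the same image layer

a.append("/etc/hosts", b"10.0.0.5 db\n")
print(a.read("/etc/hosts"))  # container a sees its modified copy
print(b.read("/etc/hosts"))  # container b still sees the shared original
```

Both containers reference the same base dict, which mirrors why shared image layers are stored once on disk regardless of how many containers use them.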


Container Networking: veth Pairs and iptables

By default, Docker creates a bridge network (docker0, typically 172.17.0.0/16). Each container gets:

  1. A veth pair: two virtual Ethernet interfaces linked back-to-back. One end (eth0) lives in the container’s network namespace; the other lives in the host’s default namespace, attached to docker0.

  2. An IP address within the bridge subnet, assigned by Docker’s embedded IPAM.

Container ns:          Host ns:
  eth0 ──────────── veth3a2b ─── docker0 (172.17.0.1)
 172.17.0.2/16                     │
                                 iptables NAT
                                 MASQUERADE → eth0 (host public IP)

Port mapping (-p 8080:80): Docker adds a DNAT rule to its DOCKER chain in the iptables nat table, which the PREROUTING chain jumps to:

# Simplified form of the rule that docker run -p 8080:80 creates:
iptables -t nat -A DOCKER -p tcp --dport 8080 \
  -j DNAT --to-destination 172.17.0.2:80

Incoming packets on host port 8080 are rewritten to 172.17.0.2:80 before routing, then forwarded through the docker0 bridge into the container’s veth.
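The effect of the DNAT rule on a packet's destination can be sketched as a lookup table. This is an illustrative model only, not how netfilter is implemented; the rule table, the function name, and the host address 203.0.113.10 (taken from a documentation-only range) are assumptions.

```python
import ipaddress

BRIDGE_SUBNET = ipaddress.ip_network("172.17.0.0/16")

# (protocol, host port) -> (container IP, container port)
dnat_rules = {("tcp", 8080): ("172.17.0.2", 80)}


def apply_dnat(proto: str, dst_ip: str, dst_port: int):
    """Return the destination after DNAT; unmatched packets pass unchanged."""
    target = dnat_rules.get((proto, dst_port))
    if target is None:
        return dst_ip, dst_port
    container_ip, container_port = target
    # Sanity check: the rewritten address must sit on the bridge subnet.
    if ipaddress.ip_address(container_ip) not in BRIDGE_SUBNET:
        raise ValueError(f"{container_ip} is not on {BRIDGE_SUBNET}")
    return container_ip, container_port


print(apply_dnat("tcp", "203.0.113.10", 8080))  # mapped port: rewritten
print(apply_dnat("tcp", "203.0.113.10", 22))    # unmapped port: unchanged
```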


Docker Internals: containerd → runc → OCI

The Docker stack is layered:

docker CLI
    │ REST API (Unix socket)
    ▼
dockerd (Docker daemon)
    │ gRPC
    ▼
containerd (container lifecycle manager, a CNCF project)
    │ ttrpc (containerd shim v2 protocol)
    ▼
containerd-shim-runc-v2 (per-container process)
    │ exec()
    ▼
runc (OCI runtime: creates namespaces + cgroups, exec's the container)
    │
    ▼
Container process (PID 1 in new namespaces)

OCI (Open Container Initiative): defines two standards:

  • Image spec: how image layers and manifests are structured (the format stored in registries).
  • Runtime spec: the config.json that runc reads to create a container (namespaces, mounts, capabilities, seccomp filters).

runc is the minimal reference implementation: reads config.json, calls clone(2) with the appropriate namespace flags, sets up cgroups, applies seccomp/apparmor, then exec(2)s the container’s entrypoint.

# Manually create and run an OCI container without Docker (as root):
mkdir -p /tmp/mycontainer/rootfs
# (populate rootfs with a minimal filesystem)
cd /tmp/mycontainer   # runc looks for config.json and rootfs/ in the bundle directory
runc spec             # generates config.json
runc run mycontainer
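The link between config.json and the namespace flags can be made concrete. The sketch below takes a pared-down config in the shape of the OCI runtime spec and computes the clone flag bitmask for the namespaces it asks to create. It is an illustration rather than runc's actual code (runc is written in Go); the function name is hypothetical and the flag values are those defined in <linux/sched.h>.

```python
import json

# OCI namespace type -> Linux clone flag value.
CLONE_FLAGS = {
    "mount":   0x00020000,  # CLONE_NEWNS
    "cgroup":  0x02000000,  # CLONE_NEWCGROUP
    "uts":     0x04000000,  # CLONE_NEWUTS
    "ipc":     0x08000000,  # CLONE_NEWIPC
    "user":    0x10000000,  # CLONE_NEWUSER
    "pid":     0x20000000,  # CLONE_NEWPID
    "network": 0x40000000,  # CLONE_NEWNET
    "time":    0x00000080,  # CLONE_NEWTIME
}


def clone_flags_for(config: dict) -> int:
    """OR together the flags for every namespace the config asks to create."""
    flags = 0
    for ns in config.get("linux", {}).get("namespaces", []):
        # An entry with a "path" joins an existing namespace instead of
        # creating one, so it contributes no clone flag.
        if "path" not in ns:
            flags |= CLONE_FLAGS[ns["type"]]
    return flags


# Minimal, hand-written config for illustration; real files are much longer.
config = json.loads("""
{
  "ociVersion": "1.0.2",
  "process": {"args": ["sh"], "cwd": "/"},
  "root": {"path": "rootfs"},
  "linux": {"namespaces": [
    {"type": "pid"}, {"type": "network"}, {"type": "mount"},
    {"type": "ipc"}, {"type": "uts"}
  ]}
}
""")
print(hex(clone_flags_for(config)))
```

The "path" case is what a Kubernetes pod relies on: application containers are given the path of the pause container's network namespace and join it rather than creating their own.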

Kubernetes: Pods and the Container Runtime Interface

Kubernetes orchestrates containers at scale. The core primitive is the Pod — a group of 1+ containers sharing a network namespace (same IP, same localhost) and optionally storage volumes.

Pod Network Namespace

All containers in a pod communicate via localhost. This is implemented by a pause container (infra container): an empty container whose sole job is to hold the network namespace open. Application containers join this existing namespace when they start.

Pod:
  pause container (holds net namespace, IP: 10.0.0.5)
  ├── app container A (joins pause's net ns → localhost:8080)
  └── app container B (joins pause's net ns → localhost:3306)

kubelet and CRI

kubelet (the per-node Kubernetes agent) uses the Container Runtime Interface (CRI), a gRPC API, to talk to any compliant container runtime. The two most widely used production runtimes are:

  • containerd (with CRI plugin): the standard, used in GKE, EKS, AKS.
  • CRI-O: Red Hat’s minimal runtime, used in OpenShift.

kubelet → (CRI gRPC) → containerd → runc → container

eBPF: Replacing iptables in Container Networking

eBPF (extended Berkeley Packet Filter) allows safely running sandboxed programs inside the Linux kernel — in network, tracing, and security contexts — without modifying kernel source or loading kernel modules.

Cilium uses eBPF to replace kube-proxy’s iptables-based service load balancing:

iptables (kube-proxy):
  Every packet to a ClusterIP → iptables NAT chain traversal
  O(n) sequential rule matching, slow rule updates as services grow, hard to debug

eBPF (Cilium):
  eBPF programs attached to network interfaces + socket layer
  XDP (eXpress Data Path): packet processing before kernel stack — O(1) lookup
  BPF maps (hash tables) store service endpoint state
  Load balancing at socket layer: no DNAT, no conntrack overhead

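Cilium can achieve 3–5× better network throughput than kube-proxy at scale, depending on the workload and the number of services, and in its kube-proxy replacement mode it removes kube-proxy's iptables rules from nodes.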
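The O(n) versus O(1) contrast can be shown with a toy comparison between a sequential rule list and a hash-table lookup. This is a simplified sketch, not kube-proxy or Cilium code: real iptables rules are spread across several chains, and all service addresses and names below are made up.

```python
def lookup_linear(rules, cluster_ip, port):
    """iptables style: test rules one by one until one matches."""
    for checked, (ip, p, backends) in enumerate(rules, start=1):
        if ip == cluster_ip and p == port:
            return backends, checked
    return None, len(rules)


def lookup_hashed(table, cluster_ip, port):
    """BPF-map style: a single hash lookup keyed on (IP, port)."""
    return table.get((cluster_ip, port))


# 5000 hypothetical services, each with a single backend.
rules = [
    (f"10.96.{i // 256}.{i % 256}", 80, [f"10.0.{i // 256}.{i % 256}:8080"])
    for i in range(5000)
]
table = {(ip, port): backends for ip, port, backends in rules}

last_ip = rules[-1][0]
backends, checked = lookup_linear(rules, last_ip, 80)
print(f"linear scan examined {checked} rules")
print("hash lookup found the same backends:",
      lookup_hashed(table, last_ip, 80) == backends)
```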


gVisor: User-Space Kernel for Stronger Isolation

Containers share the host kernel — a container breakout exploit (e.g., a kernel CVE) compromises the entire host. Google’s gVisor interposes a user-space kernel between the container and the host kernel.

Container process
      │ syscall
      ▼
gVisor (Sentry: user-space kernel written in Go)
  ├── Implements ~200 Linux syscalls in Go
  ├── Enforces container isolation in user space
  └── Only ~50 host syscalls reach the actual kernel; file access is proxied through the separate Gofer process
      │
      ▼
Host Linux kernel (tiny, well-audited syscall surface)

runsc is gVisor’s OCI-compatible runtime. It can replace runc on any Kubernetes node whose container runtime has been configured with a runsc handler:

# Use gVisor for a specific Kubernetes pod (assumes a RuntimeClass named gvisor exists):
spec:
  runtimeClassName: gvisor

Tradeoffs:

  • ✅ Kernel-level isolation: a container exploit hits the gVisor Sentry, not the host kernel
  • ❌ Performance: syscall interception adds ~10–30% overhead for I/O-heavy workloads
  • ❌ Compatibility: unusual syscalls or /proc features may not be implemented

Google Cloud Run uses gVisor in its first-generation execution environment.


Firecracker: MicroVMs for Serverless

AWS Lambda needed the isolation of VMs with the startup speed of containers. The result is Firecracker — a microVM hypervisor written in Rust, open-sourced in 2018.

Traditional VM (QEMU/KVM):
  - Full BIOS/UEFI boot sequence
  - Dozens of emulated devices (USB, PCI bus, AC97 audio...)
  - Cold start: 1–30 seconds

Firecracker microVM:
  - Stripped to essentials: virtio-net, virtio-block, serial console
  - No legacy devices, no BIOS POST
  - Custom microVM kernel (optimised for fast boot)
  - Cold start: as little as ~125ms from the start request to guest user space (Firecracker's published target)

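Firecracker uses Linux’s KVM for hardware-assisted CPU and memory virtualisation, but replaces QEMU entirely with its own minimal device model. Each Lambda execution environment runs in its own Firecracker microVM, which is reused for later invocations of the same function but never shared between functions: VM-level isolation with near-container startup time.

Jailer: A companion process that launches Firecracker inside a chroot with its own namespaces and cgroups and with privileges dropped. Firecracker then applies seccomp filters to itself that allow only around two dozen syscalls, so the microVM manager has a minimal attack surface.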


WebAssembly as a Container Alternative

WebAssembly (WASM) + WASI is emerging as a third isolation model:

Model        | Isolation Unit  | Startup  | Syscall Interface   | Portability
-------------|-----------------|----------|---------------------|---------------------
Process/VM   | OS process/VM   | 100ms+   | Full POSIX          | None (arch-specific)
Container    | Linux namespace | 10–100ms | Full Linux syscalls | Linux only
WASM/WASI    | WASM sandbox    | <1ms     | WASI (capability)   | Any OS, any arch

WASI (WebAssembly System Interface) is a capability-based POSIX-like API: a WASM module must be granted access to specific filesystem paths, network sockets, and environment variables at runtime — by default it has access to nothing.
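The deny-by-default idea can be illustrated with a toy access check. This is a sketch of the capability principle only: real WASI runtimes resolve paths relative to pre-opened directory handles rather than comparing strings, and the class and method names here are hypothetical.

```python
import posixpath


class ToyWasiSandbox:
    """Toy capability check: only explicitly granted directories are reachable."""

    def __init__(self, preopened_dirs=()):
        self.preopened = [posixpath.normpath(d) for d in preopened_dirs]

    def may_open(self, path: str) -> bool:
        # Normalise first so "../" cannot be used to escape a granted directory.
        target = posixpath.normpath(path)
        for granted in self.preopened:
            prefix = granted.rstrip("/") + "/"
            if target == granted or target.startswith(prefix):
                return True
        return False


sandbox = ToyWasiSandbox(["/data"])
print(sandbox.may_open("/data/input.txt"))        # inside the granted directory
print(sandbox.may_open("/data/../etc/passwd"))    # escapes it after normalisation
print(ToyWasiSandbox().may_open("/data/a.txt"))   # nothing granted at all
```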

WASM Component Model defines a standard binary interface for composing WASM modules. Tools like wasmtime, Spin (Fermyon), and Fastly Compute@Edge run WASM modules at the edge with sub-millisecond cold starts.

Docker co-founder Solomon Hykes put it this way in March 2019: “If WASM and WASI existed in 2008, we wouldn’t have needed to create Docker.”


Interview Deep Dives

“What’s the difference between a container and a VM?”

The one-sentence answer: A VM virtualises hardware (each VM runs its own kernel); a container virtualises the OS (containers share the host kernel via namespaces and cgroups).

The complete answer:

Dimension          | Virtual Machine                    | Container
-------------------|------------------------------------|------------------------------------
Isolation unit     | Hardware (CPU, RAM, disk, NIC)     | OS (namespaces + cgroups)
Kernel             | Each VM has its own full kernel    | All containers share host kernel
Isolation strength | Strong (VM breakout ≠ host kernel) | Weaker (kernel CVE = all containers)
Cold start         | 1–30 seconds (OS boot)             | <100ms (process start)
Memory overhead    | 100MB–1GB per VM (OS footprint)    | <10MB per container
Filesystem         | Full virtual disk image            | OverlayFS image layers
Portability        | Hypervisor-specific image formats  | OCI images, runtime-agnostic
Best for           | Strong multi-tenant isolation      | Microservices, CI, packaging

“How does Docker networking work?”

The complete answer:

When Docker starts a container without custom networking:

  1. docker0 bridge (172.17.0.0/16) is created on the host at Docker startup (if absent). It’s a Linux software bridge — like a virtual Ethernet switch.

  2. veth pair is created: vethXXXX on the host side (attached to docker0), eth0 on the container side (inside the container’s network namespace).

  3. IP assignment: Docker’s IPAM allocates an IP (e.g., 172.17.0.5) from the bridge subnet, assigns it to eth0 inside the container.

  4. Default route: 0.0.0.0/0 via 172.17.0.1 (docker0’s IP) — all external traffic goes through the bridge.

  5. NAT (MASQUERADE): iptables POSTROUTING chain rewrites the source IP of packets leaving docker0 to the host’s public IP. Return traffic is de-NATed by conntrack.

  6. Port mapping (-p 8080:80): An iptables DNAT rule in the nat table's DOCKER chain (reached from PREROUTING) rewrites packets arriving on host port 8080 to 172.17.0.5:80.

# Verify this yourself (as root):
docker run -d -p 8080:80 nginx
iptables -t nat -L DOCKER -n --line-numbers
# → DNAT rule: tcp dpt:8080 → 172.17.0.X:80
ip link show docker0
ip link show master docker0   # lists the host-side veth interfaces attached to the bridge

This iptables-based approach is what Cilium replaces with eBPF for better performance at scale.