User Space vs Kernel Space
The CPU operates at different privilege levels, called rings (x86) or exception levels (ARM). User programs run at ring 3 (least privileged). The kernel runs at ring 0.
Ring 3 code cannot directly access hardware, modify page tables, or execute privileged instructions. If it tries, the CPU raises a fault and hands control to the kernel. This is how the OS enforces isolation: a user process cannot corrupt kernel memory or other processes’ memory.
Ring 0 (kernel): unrestricted hardware access, manages all resources
Ring 3 (user): runs your application — sandboxed, no direct hardware
Rings 1 and 2 exist but are unused by modern OSes. Hardware virtualization adds a more privileged mode beneath ring 0, informally called ring -1 (VMX root mode on Intel), in which hypervisors such as VMware and KVM run.
How a System Call Works
A system call is the controlled gate between ring 3 and ring 0. On x86-64 Linux:
- The program loads a syscall number into rax (e.g. read = 0, write = 1, open = 2).
- Arguments go into rdi, rsi, rdx, r10, r8, r9.
- The syscall instruction triggers the mode switch: the CPU saves the user return address in rcx and the flags in r11, then jumps to the kernel's syscall entry point, which switches to the kernel stack and saves the remaining user registers.
- The kernel validates arguments, performs the operation, and writes the return value into rax.
- sysret returns to user space.
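This mode switch costs roughly 100–300 ns on modern hardware, and more when CPU vulnerability mitigations such as kernel page-table isolation are enabled. That is not free, but not catastrophic. The real cost is often what the syscall does (wait for disk, allocate a page, etc.).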
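The calling convention above can be exercised without writing assembly. The following is a minimal sketch, assuming Linux on x86-64 or aarch64 and a libc that exports the generic syscall() function; the names raw_write and demo_raw_write are made up for this illustration. It invokes write by number and lets libc place the number and arguments in the registers listed above.

```python
import ctypes
import os
import platform

# Syscall numbers are per-architecture. Only these two are known to this sketch.
SYS_WRITE = {"x86_64": 1, "aarch64": 64}


def raw_write(fd: int, data: bytes) -> int:
    """Invoke write by syscall number through libc's generic syscall() function."""
    machine = platform.machine()
    if machine not in SYS_WRITE:
        raise NotImplementedError(f"no syscall number known for {machine}")
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # On x86-64: number -> rax, fd -> rdi, buffer pointer -> rsi, length -> rdx.
    written = libc.syscall(
        ctypes.c_long(SYS_WRITE[machine]),
        ctypes.c_long(fd),
        ctypes.c_char_p(data),
        ctypes.c_size_t(len(data)),
    )
    if written < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return written


def demo_raw_write() -> bytes:
    """Write through the raw syscall into a pipe and read the bytes back."""
    read_end, write_end = os.pipe()
    try:
        raw_write(write_end, b"hello")
        return os.read(read_end, 5)
    finally:
        os.close(read_end)
        os.close(write_end)
```

Calling demo_raw_write() should return the bytes that were written, showing that the numbered call behaves like an ordinary write.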
The C Library Wrapper
You almost never issue the syscall instruction yourself. The C standard library (glibc, musl) provides thin wrappers: read(), write(), open(). These set up arguments and issue the syscall instruction. On Linux, Go’s runtime makes syscalls directly without libc.
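One thing the wrappers do beyond loading registers is translate errors: on Linux the kernel returns a negative error code, and the libc wrapper turns it into a return value of -1 plus errno. The sketch below, assuming a POSIX system where ctypes can load libc, compares the C-level view with the exception Python raises for the same failure. The function names are illustrative only.

```python
import ctypes
import errno
import os


def libc_read_bad_fd():
    """Call libc's read() wrapper on an invalid descriptor; return (result, errno)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.read.restype = ctypes.c_ssize_t
    libc.read.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]
    buf = ctypes.create_string_buffer(1)
    result = libc.read(-1, buf, 1)  # fd -1 is never a valid descriptor
    return result, ctypes.get_errno()


def python_read_bad_fd():
    """Same failure through Python's os.read, which raises instead of returning -1."""
    try:
        os.read(-1, 1)
    except OSError as exc:
        return exc.errno
    return None
```

Both paths report EBADF; only the way the error is surfaced differs.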
vDSO — Fast Syscalls Without Mode Switch
Certain syscalls that only read kernel state (like gettimeofday and clock_gettime) are so frequent that a full mode switch is wasteful. Linux maps a small kernel-provided shared object into every process’s address space: the vDSO (virtual dynamic shared object). The process can call clock_gettime at near-function-call speed because the vDSO code reads a kernel-maintained shared memory region without entering ring 0.
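The mapping is visible from inside any process. As a small sketch, assuming a Linux-style /proc/self/maps (the helper names are invented here), the code below looks for the [vdso] entry and then takes a series of clock readings. Whether those readings are actually served by the vDSO depends on the libc, the architecture, and the clock source, so treat that part as an assumption.

```python
import time


def find_vdso_mapping(maps_path="/proc/self/maps"):
    """Return (start, end) of the [vdso] mapping, or None if it cannot be found."""
    try:
        with open(maps_path) as maps:
            for line in maps:
                if line.rstrip().endswith("[vdso]"):
                    start, end = line.split()[0].split("-")
                    return int(start, 16), int(end, 16)
    except OSError:
        pass  # no /proc here (for example, not Linux)
    return None


def sample_monotonic_clock(samples=1000):
    """Take repeated CLOCK_MONOTONIC readings; cheap when the vDSO serves them."""
    return [time.clock_gettime(time.CLOCK_MONOTONIC) for _ in range(samples)]
```

Running the same loop under strace is one way to check the assumption: when the vDSO handles the call, no clock_gettime lines appear in the trace.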
Seccomp — Filtering Syscalls
seccomp (secure computing mode) lets a process install a BPF filter that restricts which syscalls it can make. Docker containers and Chrome’s renderer processes use seccomp to limit the attack surface: if the process is compromised, any syscall the filter denies fails or kills the process. Which calls are denied depends on the policy; a strict filter can block ptrace, fork, execve, etc.
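Writing a BPF filter by hand is verbose, so the sketch below swaps in seccomp's older strict mode, which needs no filter program: after the prctl call, only read, write, exit, and sigreturn are permitted, and anything else gets the process killed with SIGKILL. It assumes Linux with seccomp enabled, runs the restricted code in a forked child so the caller survives, and uses constants copied from the kernel headers. The function name and the returned strings are made up for this example.

```python
import ctypes
import os
import signal

PR_SET_SECCOMP = 22      # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1  # from <linux/seccomp.h>


def run_under_strict_seccomp() -> str:
    """Fork a child, lock it down with strict seccomp, and report what happened."""
    libc = ctypes.CDLL(None, use_errno=True)
    prctl = getattr(libc, "prctl", None)
    if prctl is None:
        return "unsupported"  # no prctl symbol, so not Linux

    pid = os.fork()
    if pid == 0:
        # Child: from here on, always leave through os._exit or a kernel kill.
        try:
            rc = prctl(
                PR_SET_SECCOMP,
                ctypes.c_ulong(SECCOMP_MODE_STRICT),
                ctypes.c_ulong(0),
                ctypes.c_ulong(0),
                ctypes.c_ulong(0),
            )
            if rc != 0:
                os._exit(2)  # kernel refused: seccomp unavailable
            os.getpid()      # getpid is not permitted in strict mode
        finally:
            os._exit(0)      # exit_group is not permitted either

    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
        return "killed"
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 2:
        return "unsupported"
    return "survived"
```

Filter mode works the same way in principle, except that the policy is a BPF program deciding per syscall whether to allow, fail with an error, or kill.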
Kernel vs User Space for Performance
The classic advice “avoid syscalls in hot loops” is true but nuanced:
- read()/write() per byte: terrible. Buffered I/O (libc’s fread/fwrite or Go’s bufio) batches many small operations into a few large kernel calls.
- mmap(): eliminates read() syscalls for file access. Once a page is mapped, address translation happens in hardware; the first touch of each page still enters the kernel through a page fault.
- io_uring: submits many I/O operations in one syscall via ring buffers shared with the kernel.
- eBPF: runs verified user-provided programs inside the kernel (at ring 0), eliminating the mode switch entirely for tracing/filtering workloads.
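The effect of buffering can be seen by counting how many writes reach the lowest layer. The sketch below is a model, not a measurement: CountingRaw is a made-up stand-in for a file descriptor in which each write() call represents one write syscall, and the 4096-byte buffer size is an arbitrary choice.

```python
import io


class CountingRaw(io.RawIOBase):
    """Stand-in for a file descriptor: each write() models one write syscall."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.nbytes = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.nbytes += len(b)
        return len(b)  # pretend the kernel accepted every byte


def count_writes(payload: bytes, buffered: bool) -> int:
    """Write payload one byte at a time; return how many raw writes resulted."""
    raw = CountingRaw()
    sink = io.BufferedWriter(raw, buffer_size=4096) if buffered else raw
    for i in range(len(payload)):
        sink.write(payload[i:i + 1])
    sink.flush()
    return raw.calls
```

With a 10,000-byte payload, the unbuffered path makes one raw write per byte, while the buffered path makes only a handful, roughly one per filled buffer.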
What to Know for Interviews
- Every file I/O, network operation, request for memory from the OS (mmap/brk), and process management operation crosses the syscall boundary.
- strace on Linux traces every syscall a process makes, which is invaluable for debugging “what is this program actually doing?”
- The OS scheduler can preempt a thread mid-syscall if a higher-priority thread becomes runnable (on Linux, subject to the kernel's preemption configuration).