System Calls & Kernel Space

User Space vs Kernel Space

The CPU operates at different privilege levels, called rings (x86) or exception levels (ARM). User programs run at ring 3, the least privileged level (EL0 on ARM). The kernel runs at ring 0, the most privileged level an OS normally uses (EL1 on ARM).

Ring 3 code cannot directly access hardware, modify page tables, or execute privileged instructions. If it tries, the CPU raises a fault and hands control to the kernel, which typically terminates the offending process. This is how the OS enforces isolation: a user process cannot corrupt kernel memory or other processes’ memory.

Ring 0 (kernel): unrestricted hardware access, manages all resources
Ring 3 (user):   runs your application — sandboxed, no direct hardware

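Rings 1 and 2 exist on x86 but are unused by mainstream modern OSes. Hardware virtualization adds a more privileged mode beneath ring 0, informally called ring -1 (VMX root mode on Intel), which is where hypervisors such as VMware and KVM run.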

How a System Call Works

A system call is the controlled gate between ring 3 and ring 0. On x86-64 Linux:

  1. The program loads a syscall number into rax (e.g. read = 0, write = 1, open = 2).
  2. Arguments go into rdi, rsi, rdx, r10, r8, r9.
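  3. The syscall instruction switches the CPU to ring 0 and jumps to the kernel’s syscall entry point, saving the user return address in rcx and the flags in r11. The kernel’s entry code then switches to the kernel stack and saves the remaining user registers.
  4. The kernel validates arguments, performs the operation, and writes the return value into rax (a negative error code on failure).
  5. sysret (or iret on some return paths) switches back to ring 3 and resumes the program right after the syscall instruction.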
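The steps above can be exercised from a high-level language. The following is a minimal sketch, assuming Linux on x86-64 or arm64: it calls getpid by number through libc's generic syscall() function, which loads the number and arguments into registers and issues the trap. The names GETPID_NUMBER and raw_getpid are illustrative, and the two syscall numbers are copied from the Linux syscall tables for those architectures.

```python
import ctypes
import os
import platform
import sys

# Syscall numbers are per-architecture. Assumed values from the Linux
# syscall tables: getpid is 39 on x86-64 and 172 on arm64.
GETPID_NUMBER = {"x86_64": 39, "aarch64": 172}


def raw_getpid():
    """Invoke getpid by number via libc's generic syscall() function."""
    if not sys.platform.startswith("linux"):
        raise NotImplementedError("the syscall numbers above are Linux-specific")
    number = GETPID_NUMBER.get(platform.machine())
    if number is None:
        raise NotImplementedError("no syscall number listed for this CPU")

    # CDLL(None) exposes the symbols already loaded in this process, libc included.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long

    # syscall() puts the number in the syscall-number register (rax on x86-64),
    # traps into the kernel, and hands back the kernel's return value.
    result = libc.syscall(ctypes.c_long(number))
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result
```

Calling raw_getpid() should return the same value as os.getpid(), since both end up in the same kernel handler.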

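This mode switch costs roughly 100–300ns on modern hardware, and more when CPU vulnerability mitigations such as kernel page-table isolation are active. That is not free, but not catastrophic. The real cost is often what the syscall does (wait for disk, allocate a page, etc.).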
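A rough way to see this cost is to time a syscall that does almost no work in the kernel. The sketch below is only an approximation: it compares os.getppid(), which enters the kernel on every call, against int(), which stays in user space and stands in for the interpreter's own call overhead. The helper names are illustrative, and the numbers will vary with the machine, kernel configuration, and Python version.

```python
import os
import time


def per_call_ns(fn, calls=200_000):
    """Average wall-clock nanoseconds per call of fn over `calls` iterations."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        fn()
    return (time.perf_counter_ns() - start) / calls


def estimate_syscall_overhead_ns(calls=200_000):
    """Return (ns per os.getppid() call, ns per user-space baseline call)."""
    # os.getppid() crosses into the kernel each time and returns immediately.
    with_syscall = per_call_ns(os.getppid, calls)
    # int() never leaves user space; it approximates pure interpreter overhead.
    baseline = per_call_ns(int, calls)
    return with_syscall, baseline
```

The difference between the two returned values gives a ballpark figure for one user-to-kernel round trip on the machine at hand.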

The C Library Wrapper

You almost never issue the syscall instruction yourself. The C standard library (glibc, musl) provides thin wrappers: read(), write(), open(). These set up arguments, issue the syscall instruction, and translate a kernel error code into a return value of -1 with errno set. On Linux, Go’s runtime makes syscalls directly without libc.
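To make the wrapper's role concrete, here is a small sketch that calls libc's write() through ctypes and compares it with Python's os.write(), which sits on top of the same wrapper. It assumes a Unix-like system where ctypes.CDLL(None) exposes libc; the function names write_via_libc and write_via_python are illustrative.

```python
import ctypes
import errno
import os


def write_via_libc(fd, data):
    """Call libc's write() wrapper directly; return (result, errno value)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.write.restype = ctypes.c_ssize_t
    libc.write.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_size_t]
    ctypes.set_errno(0)
    # On failure the wrapper returns -1 and stores the error code in errno.
    result = libc.write(fd, data, len(data))
    return result, ctypes.get_errno()


def write_via_python(fd, data):
    """os.write() uses the same wrapper and converts errno into OSError."""
    try:
        return os.write(fd, data), 0
    except OSError as exc:
        return -1, exc.errno
```

Writing to an invalid descriptor such as -1 should report errno.EBADF through both paths, which shows that the error convention comes from the wrapper layer rather than from Python.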

vDSO — Fast Syscalls Without Mode Switch

Certain syscalls that only read kernel state (like gettimeofday and clock_gettime) are so frequent that a full mode switch is wasteful. Linux maps a small kernel-provided shared object into every process’s address space: the vDSO (virtual dynamic shared object). Its code runs in user mode and reads kernel-maintained data from a shared read-only region, so a call to clock_gettime completes at near-function-call speed without entering ring 0.
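The vDSO mapping is visible from inside any process. The sketch below, assuming a Linux system with procfs mounted, lists the [vdso] entries in /proc/self/maps and reads the monotonic clock a few times; on typical Linux configurations those reads are served by the vDSO, though that depends on the kernel and clock source. The function names are illustrative.

```python
import time


def vdso_mappings(maps_path="/proc/self/maps"):
    """Return the lines of this process's memory map that belong to the vDSO."""
    try:
        with open(maps_path) as maps:
            return [line.rstrip("\n") for line in maps if "[vdso]" in line]
    except OSError:
        return []  # no procfs here (non-Linux system or restricted sandbox)


def monotonic_readings(count=5):
    """Read CLOCK_MONOTONIC several times via clock_gettime."""
    return [time.clock_gettime_ns(time.CLOCK_MONOTONIC) for _ in range(count)]
```

Running the program under strace is one way to check the effect: if the clock reads go through the vDSO, no clock_gettime syscalls should appear in the trace.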

Seccomp — Filtering Syscalls

seccomp (secure computing mode) lets a process install a BPF filter that restricts which syscalls it and its descendants can make. Docker containers and Chrome’s renderer processes use seccomp to limit the attack surface: if the process is compromised, it can make only the syscalls its filter allows. What is blocked depends on the profile; a strict one can deny calls such as ptrace, fork, or execve.
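Writing a BPF filter by hand is verbose, so the sketch below swaps in seccomp's older strict mode to show the enforcement effect. Strict mode is not a BPF filter: it allows only read, write, _exit, and sigreturn, and the kernel kills the process with SIGKILL on any other syscall. The example forks a child so the parent survives to report the result. The constants are taken from the Linux headers, the function name is illustrative, and the outcome is "unsupported" where seccomp is unavailable.

```python
import ctypes
import os
import signal
import sys

PR_SET_SECCOMP = 22       # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1   # from <linux/seccomp.h>


def child_outcome_under_strict_seccomp():
    """Fork a child, enable strict seccomp in it, then make a forbidden syscall."""
    if not sys.platform.startswith("linux"):
        return "unsupported"
    pid = os.fork()
    if pid == 0:  # child process
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            zero = ctypes.c_ulong(0)
            rc = libc.prctl(PR_SET_SECCOMP, ctypes.c_ulong(SECCOMP_MODE_STRICT),
                            zero, zero, zero)
            if rc != 0:
                os._exit(2)   # seccomp could not be enabled in this environment
            os.getppid()      # getppid is not on the strict-mode allow list
        finally:
            os._exit(3)       # only reached if something unexpected happened
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
        return "killed"
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 2:
        return "unsupported"
    return "unexpected"
```

On a kernel with seccomp available the function should return "killed": the child never gets to finish the forbidden call.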

Kernel vs User Space for Performance

The classic advice “avoid syscalls in hot loops” is true but nuanced:

  • read()/write() per byte: terrible. Buffered I/O (libc’s fread/fwrite or Go’s bufio) batches into large kernel calls.
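  • mmap(): maps a file into the address space so that file access becomes ordinary loads and stores instead of read() syscalls. The first touch of each page still enters the kernel through a page fault; after that, address translation happens in hardware.
  • io_uring: submits many I/O operations in one syscall via ring buffers shared between user space and the kernel.
  • eBPF: runs verified user-provided programs inside the kernel (at ring 0), so tracing/filtering logic handles each event without a switch to user space.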
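The buffering point from the list above can be checked by counting syscalls. The following sketch writes a payload one byte at a time to the null device, once directly and once through io.BufferedWriter, and counts how many times os.write() is actually invoked. CountingFile and count_write_syscalls are illustrative names, and the 4096-byte buffer is an arbitrary choice.

```python
import io
import os


class CountingFile(io.RawIOBase):
    """Raw file object that counts how many write() syscalls it issues."""

    def __init__(self, path):
        super().__init__()
        self._fd = os.open(path, os.O_WRONLY)
        self.syscalls = 0

    def writable(self):
        return True

    def write(self, data):
        self.syscalls += 1
        return os.write(self._fd, data)  # exactly one write() syscall per call

    def close(self):
        if not self.closed:
            os.close(self._fd)
        super().close()


def count_write_syscalls(payload, buffered, buffer_size=4096):
    """Write payload byte by byte and return the number of write() syscalls."""
    raw = CountingFile(os.devnull)
    sink = io.BufferedWriter(raw, buffer_size=buffer_size) if buffered else raw
    for i in range(len(payload)):
        sink.write(payload[i:i + 1])  # the program itself writes one byte at a time
    sink.flush()                      # push any bytes still held in the buffer
    count = raw.syscalls
    sink.close()
    return count
```

With a 10,000-byte payload the unbuffered path issues one syscall per byte, while the buffered path needs only a handful, because the buffer is handed to the kernel in large chunks.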

What to Know for Interviews

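  • Every file I/O, network operation, request for memory from the OS (mmap/brk), and process management operation crosses the syscall boundary. malloc() itself usually serves small requests from memory it already holds and only calls into the kernel when it needs more.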
  • strace on Linux traces every syscall a process makes — invaluable for debugging “what is this program actually doing?”
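  • A thread can lose the CPU in the middle of a syscall: it may block while waiting for I/O, and on kernels built with preemption enabled the scheduler can switch to a higher-priority thread that becomes runnable.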