File Systems & I/O

The VFS Layer

Linux’s Virtual File System (VFS) is a kernel abstraction that presents a uniform interface (open, read, write, stat) regardless of whether the underlying filesystem is ext4, XFS, NTFS, tmpfs, or a network filesystem like NFS. Every filesystem registers its own implementations of these operations with the VFS.

This is why you can mount a USB drive (vfat), a network share (CIFS), and a RAM disk (tmpfs) in the same directory tree and work with all of them through the same system calls, even though features such as Unix permissions vary by filesystem.
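
The registration idea can be pictured as a table of operations per mounted filesystem. The sketch below is a toy model in Python, not kernel code: `FileOps`, `ToyVFS`, and the two fake filesystems are hypothetical names invented for illustration, and the real VFS operation tables carry many more entries.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FileOps:
    """Hypothetical per-filesystem operation table (only `read` is modeled)."""
    read: Callable[[str], bytes]


class ToyVFS:
    """Toy dispatcher: routes a path to the operations of its mount point."""

    def __init__(self) -> None:
        self._mounts: Dict[str, FileOps] = {}

    def mount(self, mount_point: str, ops: FileOps) -> None:
        # Normalize "/mnt/usb/" to "/mnt/usb"; keep "/" as the root.
        self._mounts[mount_point.rstrip("/") or "/"] = ops

    def read(self, path: str) -> bytes:
        # Collect every mount point that is a prefix of the path.
        candidates = [
            m for m in self._mounts
            if path == m or path.startswith(m.rstrip("/") + "/")
        ]
        if not candidates:
            raise FileNotFoundError(path)
        best = max(candidates, key=len)  # the longest (most specific) mount wins
        return self._mounts[best].read(path)


vfs = ToyVFS()
vfs.mount("/", FileOps(read=lambda p: b"ext4:" + p.encode()))
vfs.mount("/mnt/usb", FileOps(read=lambda p: b"vfat:" + p.encode()))
```

The caller uses `vfs.read(path)` for every path and never needs to know which implementation handles it, which is the property the VFS gives to user space.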

Inodes

An inode (index node) stores file metadata: permissions, owner, timestamps, size, link count, and, most importantly, the location of the file’s data blocks on disk (block pointers in ext2/ext3, extents in ext4). Critically, the inode does not store the file name. Names live in directory entries, which map name → inode number.

Directory entry:  "report.pdf" → inode 14237
Inode 14237:
  size:        2,456,832 bytes
  permissions: 0644
  created:     2026-03-15T09:12:00Z
  blocks:      [4821, 4822, 4823, ..., indirect block at 9001]

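This is why hard links are cheap: a hard link is just a second directory entry pointing to the same inode. No data is copied. The inode’s link count goes up by one, and the data is freed only when the count drops to zero and no process still has the file open.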
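
A quick way to see this from user space is to create a hard link and compare the inode numbers that stat reports. This is a minimal sketch using Python's standard `os` module; the function name `hard_link_demo` and the file names are made up for the example, and it assumes the temporary directory lives on a filesystem that supports hard links.

```python
import os
import tempfile
from typing import Tuple


def hard_link_demo() -> Tuple[bool, int]:
    """Create a file plus a hard link; return (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "report.pdf")
        alias = os.path.join(d, "report-link.pdf")
        with open(original, "wb") as f:
            f.write(b"hello")
        os.link(original, alias)  # adds a directory entry; copies no data
        a, b = os.stat(original), os.stat(alias)
        return a.st_ino == b.st_ino, a.st_nlink
```

On a typical Linux filesystem both names report the same inode number and a link count of 2.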

Block Groups and Locality

ext4 divides the disk into block groups (128 MiB each with the default 4 KiB block size). Each block group has its own inode table and block bitmap. The filesystem tries to allocate an inode and its data blocks within the same group, which keeps related data physically close and reduces seek time on spinning disks.
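
The 128 MiB figure follows from the bitmap: a block group's block bitmap occupies one block, and each bit tracks one block, so a group spans 8 × block_size blocks. A small sketch of that arithmetic (the function name is illustrative, and it ignores features such as flex_bg that change how groups are laid out):

```python
def block_group_bytes(block_size: int = 4096) -> int:
    """Size in bytes of one block group for a given block size."""
    blocks_per_group = 8 * block_size  # one bitmap block holds 8 bits per byte
    return blocks_per_group * block_size
```

With 4 KiB blocks this gives 32,768 blocks per group, or 128 MiB; with 1 KiB blocks a group is only 8 MiB.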

Buffered vs Direct I/O

By default, read() and write() go through the page cache. The kernel caches disk pages in RAM, so repeated reads of the same file don’t hit the disk as long as the pages stay cached. Writes are batched and flushed asynchronously (write-back caching).

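Direct I/O (the O_DIRECT flag to open()) bypasses the page cache. Data moves directly between user-space buffers and the storage device, which requires the buffer address, file offset, and transfer size to be suitably aligned (typically to the logical block size). Databases use this to manage their own cache (the buffer pool) without double-buffering.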

Buffered I/O:  userspace → page cache → disk
Direct I/O:    userspace → disk (aligned to block size, no cache)

fsync() blocks until the file’s dirty pages and metadata have been flushed to the storage device. Without it, a “successful” write() may still exist only in the page cache when a crash occurs. Databases call fsync() on their log before acknowledging a committed transaction.
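
The write-then-fsync pattern looks roughly like this from user space. It is a minimal sketch using Python's `os` wrappers around the same system calls; `durable_write` is a hypothetical helper name. A production version would usually also fsync the containing directory after creating a new file.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync() reports success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # block until this file's dirty pages are flushed
    finally:
        os.close(fd)
```

Until `os.fsync` returns, the data may exist only in the page cache, which is exactly the window the paragraph above warns about.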

Journaling

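Updating a file requires multiple disk writes (update inode, update directory, write data). A crash mid-sequence can leave the filesystem inconsistent. Journaling filesystems (ext4, XFS, NTFS) write the intended changes to a circular log first and only then apply them in place. On crash recovery, the journal is replayed: transactions with a commit record are applied, and incomplete ones are discarded. By default these filesystems journal metadata only; ext4’s default data=ordered mode writes file data to disk before committing the related metadata.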
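
The replay rule can be sketched with a toy in-memory journal. This is an illustration of the idea only, not the ext4/JBD2 on-disk format: the record layout (`("write", txid, key, value)` and `("commit", txid)`) and the `replay` function are assumptions made for the example.

```python
from typing import Dict, List


def replay(journal: List[tuple]) -> Dict[str, str]:
    """Rebuild state by applying only writes from committed transactions."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    state: Dict[str, str] = {}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, value = rec
            state[key] = value
    return state


journal = [
    ("write", 1, "inode:14237.size", "2456832"),
    ("write", 1, "dir:report.pdf", "14237"),
    ("commit", 1),
    ("write", 2, "inode:14238.size", "10"),  # crash happened before commit 2
]
```

Replaying this journal applies both writes of transaction 1 and ignores transaction 2, so the inode and directory updates either both appear or neither does.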

io_uring

Linux 5.1 introduced io_uring: a pair of ring buffers shared between user space and the kernel. Instead of issuing one syscall per I/O, you place a batch of requests on the submission queue and reap results from the completion queue. A single io_uring_enter() call can submit many requests, and with submission-queue polling (IORING_SETUP_SQPOLL) the hot path can avoid syscalls altogether. Databases and high-performance servers are adopting it; PostgreSQL, for example, added an optional io_uring I/O method in version 18.
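
Python's standard library has no io_uring binding, so the following is only a toy model of the queue mechanics, not the real API. `Ring` and `toy_kernel` are hypothetical names, the "kernel" is an ordinary function, and the "files" are a dictionary. What it does show is the head/tail indexing of a power-of-two ring and the batch-submit, batch-complete flow.

```python
class Ring:
    """Fixed-size single-producer/single-consumer ring with head/tail counters."""

    def __init__(self, entries: int) -> None:
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # next slot the consumer reads
        self.tail = 0  # next slot the producer writes

    def push(self, item) -> bool:
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def toy_kernel(sq: Ring, cq: Ring, files: dict) -> int:
    """Stand-in for the kernel: drain submissions and post completions."""
    handled = 0
    while (sqe := sq.pop()) is not None:
        user_data, name = sqe
        cq.push((user_data, files[name]))  # completion carries the caller's tag
        handled += 1
    return handled
```

In this model the application pushes several `(user_data, name)` requests, calls `toy_kernel` once, and then pops completions tagged with the same `user_data`. The completion ring is assumed to be sized at least as large as the submission ring so completions are not dropped.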

What to Know for Interviews

  • open() returns a file descriptor — an index into the process’s file descriptor table. The table entry points to a kernel file object, which points to the inode.
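  • mmap() on a file can be faster than read() for random access: the mapping exposes page-cache pages directly in the process’s address space, so there is no syscall per access and no copy into a user buffer. The cost moves to page faults, so measure before assuming a win.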
  • Inode count limits matter — a filesystem can run out of inodes before running out of disk space if you store millions of small files (common with node_modules).
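
Two of these points are easy to check from user space. The sketch below uses Python's `os` module on a Unix-like system; the function names are made up for the example. The first function shows that a duplicated descriptor shares the same kernel file object (and therefore the same offset), and the second reads inode totals from statvfs.

```python
import os
import tempfile
from typing import Tuple


def shared_offset_demo() -> int:
    """Seek through one descriptor and read the offset through its duplicate."""
    with tempfile.TemporaryFile() as f:
        f.write(b"abcdef")
        f.flush()
        fd1 = f.fileno()
        fd2 = os.dup(fd1)  # new table entry, same kernel file object
        try:
            os.lseek(fd1, 2, os.SEEK_SET)         # move the offset via fd1
            return os.lseek(fd2, 0, os.SEEK_CUR)  # observe it via fd2
        finally:
            os.close(fd2)


def inode_usage(path: str = "/") -> Tuple[int, int]:
    """Return (total_inodes, free_inodes) for the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree
```

`shared_offset_demo()` returns 2 because both descriptors point at the same file object. Some filesystems allocate inodes dynamically and may report 0 for the totals, so treat the `inode_usage` numbers as filesystem-dependent.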