The VFS Layer
Linux’s Virtual File System (VFS) is a kernel abstraction that presents a uniform interface (open, read, write, stat) regardless of whether the underlying filesystem is ext4, XFS, NTFS, tmpfs, or a network filesystem like NFS. Every filesystem registers its own implementations of these operations with the VFS.
This is why you can mount a USB drive (vfat), a network share (CIFS), and a RAM disk (tmpfs) in the same directory tree and interact with all of them identically.
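To make the uniform interface concrete, here is a minimal sketch in Python. The helper name `read_head` is made up for this illustration; it only wraps the standard `os.open`, `os.read`, and `os.close` calls, which map onto the corresponding syscalls. The point is that nothing in the function depends on which filesystem backs the path.

```python
import os
import tempfile


def read_head(path, n=64):
    """Read up to n bytes from path (hypothetical helper for illustration)."""
    # open(2): the VFS resolves the path and finds the owning filesystem.
    fd = os.open(path, os.O_RDONLY)
    try:
        # read(2): dispatched to that filesystem's own read implementation.
        return os.read(fd, n)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # A regular file on whatever filesystem backs the temp directory.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello from a regular file\n")
    print(read_head(tmp.name))
    os.unlink(tmp.name)

    # On Linux, /proc is a virtual filesystem (procfs); the call is unchanged.
    if os.path.exists("/proc/version"):
        print(read_head("/proc/version"))
```

The same function should work on a vfat USB stick, a CIFS share, or a tmpfs mount, assuming the path is readable.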
Inodes
An inode (index node) stores file metadata: permissions, owner, timestamps, size, and, most importantly, the location of the file's data on disk (block pointers in older filesystems, extent trees in ext4). The inode does not store the file name. Names live in directory entries, which map name → inode number.
Directory entry: "report.pdf" → inode 14237
Inode 14237:
  size: 2,456,832 bytes
  permissions: 0644
  created: 2026-03-15T09:12:00Z
  blocks: [4821, 4822, 4823, ..., indirect block at 9001]
This is why hard links are cheap: a hard link is just a second directory entry pointing to the same inode. No data is copied, and the data is only freed once the last link is removed.
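A short Python sketch can confirm the name → inode mapping. The function name `hard_link_demo` and the file names are placeholders for this example; `os.link` and `os.stat` are the standard library calls. It assumes the temp directory sits on a filesystem that supports hard links.

```python
import os
import tempfile


def hard_link_demo():
    """Create a file plus a hard link; return (inode_a, inode_b, link_count, data)."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "report.pdf")
        alias = os.path.join(d, "report-link.pdf")
        with open(original, "wb") as f:
            f.write(b"x" * 1024)

        # Adds a second directory entry for the same inode; no data is copied.
        os.link(original, alias)
        a, b = os.stat(original), os.stat(alias)

        # Removing one name leaves the inode and its data reachable via the other.
        os.unlink(original)
        with open(alias, "rb") as f:
            data = f.read()
        return a.st_ino, b.st_ino, a.st_nlink, data


if __name__ == "__main__":
    ino_a, ino_b, nlink, data = hard_link_demo()
    print(ino_a == ino_b, nlink, len(data))
```

Both names report the same `st_ino`, and `st_nlink` counts how many directory entries point at the inode.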
Block Groups and Locality
ext4 divides the disk into block groups (128 MiB each with the default 4 KiB block size). Each block group has its own inode table and block bitmap. The filesystem tries to allocate an inode and its data blocks within the same group, which keeps related data physically close and reduces seek time on spinning disks.
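The 128 MiB figure follows from the bitmap: a group's block bitmap occupies one block, and each bit tracks one block, so a group holds 8 × block_size blocks. A small sketch of that arithmetic (the function names are made up for this example, and it ignores ext4 features such as flex_bg that change on-disk layout but not the group size):

```python
def block_group_bytes(block_size=4096):
    """Size of one block group in bytes for a given block size."""
    # One bitmap block has block_size * 8 bits, one bit per block in the group.
    blocks_per_group = 8 * block_size
    return blocks_per_group * block_size


def group_count(device_bytes, block_size=4096):
    """Number of block groups needed to cover a device (rounded up)."""
    group = block_group_bytes(block_size)
    return -(-device_bytes // group)  # ceiling division


if __name__ == "__main__":
    print(block_group_bytes(4096) // 2**20, "MiB per group")
    print(group_count(2**40), "groups on a 1 TiB device")
```

With 1 KiB blocks the same formula gives 8 MiB per group, which is why the group size depends on how the filesystem was created.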
Buffered vs Direct I/O
By default, read() and write() go through the page cache. The kernel caches disk pages in RAM; repeated reads of the same file don’t hit the disk. Writes are batched and flushed asynchronously (write-back caching).
Direct I/O (O_DIRECT flag) bypasses the page cache entirely. The data goes straight between user space buffers and the disk. Databases use this to manage their own caching (buffer pool) without double-buffering.
Buffered I/O: userspace → page cache → disk
Direct I/O: userspace → disk (aligned to block size, no cache)
fsync() forces dirty pages to disk before returning. Without it, a “successful” write() may still be in the page cache when a crash occurs. Databases call fsync after committing a transaction.
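As a rough illustration of the write-then-fsync pattern, here is a minimal Python sketch. The helper name `durable_write` is an assumption for this example; `os.write` and `os.fsync` are thin wrappers over the syscalls of the same name.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path and return only after it has been flushed to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # write(2) lands in the page cache and may write fewer bytes than asked.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the kernel has pushed this file's dirty pages to storage.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "commit.log")
        durable_write(target, b"txn 42 committed\n")
        print(os.path.getsize(target))
```

For a newly created file, the directory entry itself is also a pending change; code that needs the file to survive a crash typically fsyncs the parent directory as well.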
Journaling
Updating a file can require several separate disk writes (update inode, update directory, write data). A crash partway through the sequence can leave the filesystem inconsistent. Journaling filesystems (ext4, XFS, NTFS) first write the intended changes to a circular log, then apply them in place. On crash recovery the journal is replayed, and only transactions with a commit record are applied; incomplete ones are discarded.
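The replay rule can be sketched with a toy model. This is not ext4's or XFS's on-disk journal format; the record layout (`"write"` and `"commit"` tuples), the keys, and the `replay` function are invented for illustration only.

```python
def replay(journal, disk):
    """Apply journaled writes to disk, skipping transactions that never committed."""
    # First pass: find which transaction IDs have a commit record.
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    # Second pass: apply writes in log order, but only for committed transactions.
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _txid, key, value = rec
            disk[key] = value
    return disk


if __name__ == "__main__":
    journal = [
        ("write", 1, "inode:14237.size", 2456832),
        ("write", 1, "dir:report.pdf", 14237),
        ("commit", 1),
        ("write", 2, "inode:14238.size", 100),
        # Crash here: transaction 2 has no commit record.
    ]
    print(replay(journal, {}))
```

Transaction 1 is applied in full and transaction 2 is ignored, so the "disk" never ends up with half of an update.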
io_uring
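Linux 5.1 introduced io_uring: a pair of ring buffers (a submission queue and a completion queue) shared between userspace and the kernel. Instead of issuing one syscall per I/O operation, you place a batch of requests on the submission queue and reap results from the completion queue. A single io_uring_enter() call can submit many operations at once, and with submission-queue polling (IORING_SETUP_SQPOLL) the hot path can avoid syscalls altogether. Databases and high-performance servers are adopting it; PostgreSQL 18, for example, added io_uring as an optional asynchronous I/O method.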
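Python's standard library has no io_uring binding, so the following is only a toy model of the submit/complete pattern, not real io_uring. The class `ToyRing` and its methods are invented for this sketch, and the "kernel" side is simulated by slicing an in-memory buffer. What it tries to show is that many operations cross the boundary in one step.

```python
from collections import deque


class ToyRing:
    """Toy model of a submission queue and completion queue (not real io_uring)."""

    def __init__(self):
        self.sq = deque()      # submission queue: requests waiting for the "kernel"
        self.cq = deque()      # completion queue: finished results
        self.enter_calls = 0   # how many times we crossed into the "kernel"

    def prepare_read(self, source, offset, length, user_data):
        # Queueing a request is just a memory write; no boundary crossing yet.
        self.sq.append((source, offset, length, user_data))

    def submit(self):
        # One crossing hands over the whole batch.
        self.enter_calls += 1
        submitted = len(self.sq)
        while self.sq:
            source, offset, length, user_data = self.sq.popleft()
            self.cq.append((user_data, source[offset:offset + length]))
        return submitted

    def reap(self):
        # Drain completions; user_data tells the caller which request finished.
        while self.cq:
            yield self.cq.popleft()


if __name__ == "__main__":
    ring = ToyRing()
    blob = b"0123456789abcdef"
    for i in range(4):
        ring.prepare_read(blob, i * 4, 4, user_data=i)
    ring.submit()
    print(dict(ring.reap()), "enter calls:", ring.enter_calls)
```

Real code would typically go through liburing in C or a language binding; the `user_data` tag mirrors how completions are matched back to their requests.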
What to Know for Interviews
- open() returns a file descriptor: an index into the process’s file descriptor table. The table entry points to a kernel file object, which points to the inode.
- mmap() on a file is often faster than read() for random access: the page cache is shared, and you avoid a copy into user space.
- Inode count limits matter: a filesystem can run out of inodes before running out of disk space if you store millions of small files (common with node_modules).
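The three points above can each be poked at from Python. This is a sketch under the assumption of a Unix-like system (`os.statvfs` is not available on Windows); the function names are made up for the example, while `os.dup`, `os.lseek`, `mmap.mmap`, and `os.statvfs` are standard library calls.

```python
import mmap
import os
import tempfile


def shared_offset_demo(path):
    """Two descriptors from dup() share one kernel file object, hence one offset."""
    fd1 = os.open(path, os.O_RDONLY)
    fd2 = os.dup(fd1)  # new table entry pointing at the same file object
    try:
        os.read(fd1, 4)                        # advance the offset through fd1
        return os.lseek(fd2, 0, os.SEEK_CUR)   # observe the offset through fd2
    finally:
        os.close(fd1)
        os.close(fd2)


def mmap_slice(path, offset, length):
    """Random access through a read-only mapping instead of seek + read."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]


def inode_usage(path="/"):
    """Return (total_inodes, free_inodes) for the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
    print(shared_offset_demo(tmp.name))
    print(mmap_slice(tmp.name, 6, 3))
    print(inode_usage(os.path.dirname(tmp.name)))
    os.unlink(tmp.name)
```

Some filesystems allocate inodes dynamically and may report a total of 0, so treat the `inode_usage` numbers as filesystem-dependent.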