
How File Systems Work — Organizing Data on Disk
A disk is an array of blocks — typically 4 KB each. Without a file system, you'd address storage by block number: "write these bytes to blocks 17,342 through 17,345." File systems provide the abstraction we actually want: named files organized in directories, with metadata like permissions, timestamps, and ownership.
The kernel implements file systems. Every `open()`, `read()`, `write()`, and `close()` goes through the kernel's VFS (Virtual File System) layer, which dispatches to the appropriate file system driver.
What Is an Inode?
Every file on a Unix file system has an inode (index node) — a data structure that stores the file's metadata and the location of its data blocks.
An inode contains:
| Field | What it stores |
|---|---|
| File type | Regular file, directory, symlink, device |
| Size | In bytes |
| Permissions | Owner/group/other read/write/execute |
| Ownership | User ID, group ID |
| Timestamps | Last modified (mtime), last accessed (atime), inode changed (ctime) |
| Link count | Number of directory entries pointing to this inode |
| Block pointers | Where the actual data lives on disk |
The inode does not contain the file name. Names live in directories. A directory is itself a file — its data is a list of (name, inode number) pairs. This is why a file can have multiple names (hard links): two directory entries pointing to the same inode.
When you run `ls -li`, the first column is the inode number. `stat filename` shows all the inode metadata.
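The name/inode split can be demonstrated from Python's `os` module (a small sketch; the file names here are arbitrary):

```python
import os
import tempfile

# Create a file and a hard link to it, then compare inode metadata.
with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    alias = os.path.join(d, "alias.txt")

    with open(original, "w") as f:
        f.write("hello")

    os.link(original, alias)  # second directory entry, same inode

    # Both names resolve to the same inode number...
    assert os.stat(original).st_ino == os.stat(alias).st_ino
    # ...and the link count counts both directory entries.
    assert os.stat(original).st_nlink == 2

    # Removing one name just drops one directory entry; the inode
    # (and the data) survive until the link count reaches zero.
    os.unlink(original)
    assert os.stat(alias).st_nlink == 1
    with open(alias) as f:
        assert f.read() == "hello"
```

Deleting a file with `rm` is really `unlink`: it removes a directory entry, and the file system frees the inode and its blocks only when the link count hits zero and no process holds the file open.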
How Does the Kernel Find a File?
Opening /home/user/project/README.md triggers a path traversal:
- Start at root — the root directory `/` has a known inode (typically inode 2).
- Read root directory — find the entry `home` → inode 131073.
- Read `home` directory — find `user` → inode 262145.
- Read `user` directory — find `project` → inode 393217.
- Read `project` directory — find `README.md` → inode 524289.
- Read inode 524289 — get the file's metadata and block pointers.
Each step is a disk read (unless cached). A deep path like /a/b/c/d/e/f/file requires 7 directory lookups. The kernel caches recently accessed directory entries in the dentry cache and inode metadata in the inode cache, so repeated accesses are fast.
How Is Data Stored on Disk?
Small files fit in a few contiguous blocks. Large files are scattered across the disk. The inode needs to track where every block is.
Direct block pointers — the inode stores pointers to the first 12 data blocks directly. For 4 KB blocks, this handles files up to 48 KB.
Indirect blocks — for larger files:
- Single indirect — one block of pointers to data blocks. With 4-byte pointers and 4 KB blocks, that's 1,024 more blocks (4 MB).
- Double indirect — a block of pointers to indirect blocks. 1,024 × 1,024 = 1 million more blocks (4 GB).
- Triple indirect — for truly enormous files. 1,024³ blocks.
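The capacity figures above follow directly from the block and pointer sizes. A quick check (assuming classic ext2/ext3-style 4 KB blocks and 4-byte block pointers):

```python
BLOCK = 4096                     # bytes per block
PTR = 4                          # bytes per block pointer
PTRS_PER_BLOCK = BLOCK // PTR    # 1,024 pointers fit in one block

direct = 12 * BLOCK                   # 12 direct pointers
single = PTRS_PER_BLOCK * BLOCK       # one block of pointers
double = PTRS_PER_BLOCK**2 * BLOCK    # pointers to pointer blocks
triple = PTRS_PER_BLOCK**3 * BLOCK    # three levels deep

print(f"direct:          {direct // 2**10} KB")   # 48 KB
print(f"single indirect: {single // 2**20} MB")   # 4 MB
print(f"double indirect: {double // 2**30} GB")   # 4 GB
print(f"triple indirect: {triple // 2**40} TB")   # 4 TB
```

So the triple-indirect level dominates: the maximum file size under this scheme is a little over 4 TB, which is why larger block sizes (or extents) are needed for bigger files.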
Modern file systems like ext4 use extents instead of individual block pointers. An extent says "this file uses blocks 10,000 through 10,500" — one record instead of 500 pointers. This is more compact and faster for sequential reads.
What Is Journaling?
Power fails. The system crashes. If the file system was in the middle of writing — updating the inode, adding blocks, modifying the directory — the on-disk state may be inconsistent. A file might point to blocks that belong to another file. A directory might reference an inode that was freed.
Journaling solves this by writing a log (the journal) before modifying the file system:
- Write intent to journal — "I'm going to update inode X and blocks Y-Z."
- Write data to journal — the actual new content.
- Commit — mark the journal entry as complete.
- Write to disk — apply the changes to the real file system structures.
- Delete journal entry — the transaction is done.
If the system crashes at any point:
- Before commit → discard the journal entry. Nothing happened.
- After commit, before disk write → replay the journal. Apply the changes.
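The protocol and its recovery rules can be sketched as a toy in-memory model (purely illustrative, not any real file system's journal format):

```python
# Toy journal: transactions are staged, committed, then checkpointed.
# Recovery replays committed-but-unapplied entries and discards the rest.

class ToyJournal:
    def __init__(self):
        self.disk = {}      # "real" file system blocks: block_no -> data
        self.entries = []   # the journal: staged transaction records

    def begin(self, changes):
        """Steps 1-2: write intent and data to the journal."""
        self.entries.append({"changes": changes, "committed": False})
        return len(self.entries) - 1

    def commit(self, txn):
        """Step 3: mark the journal entry as complete."""
        self.entries[txn]["committed"] = True

    def checkpoint(self, txn):
        """Steps 4-5: apply the changes for real, drop the entry."""
        self.disk.update(self.entries[txn]["changes"])
        self.entries[txn] = None

    def recover(self):
        """After a crash: replay committed entries, discard the rest."""
        for entry in self.entries:
            if entry and entry["committed"]:
                self.disk.update(entry["changes"])
        self.entries = []

j = ToyJournal()
t1 = j.begin({10: "new inode"})   # crashed before commit: discarded
t2 = j.begin({20: "new data"})
j.commit(t2)                      # crashed after commit: replayed
j.recover()
print(j.disk)                     # only the committed transaction survives
```

The key invariant is that `disk` is only ever updated from a complete, committed record, so it can never hold half a transaction.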
This guarantees atomic file system operations — either all changes apply or none do. ext4 journals metadata by default (full data journaling is optional and slower). NTFS journals metadata. APFS uses a different mechanism (copy-on-write, discussed below).
What Is Copy-on-Write?
Traditional file systems modify data in place: overwriting block 10,000 replaces the old data. If the system crashes mid-write, the block may be half-old, half-new — corrupt.
Copy-on-write (CoW) file systems never overwrite existing data. Instead:
- Write new data to a new block.
- Update the pointer to reference the new block.
- Free the old block (once nothing else references it).
Because the old data is never touched, there's no corruption risk. If the system crashes before the pointer is updated, the old data is still intact.
CoW enables powerful features:
- Snapshots — a snapshot is just a saved set of block pointers. Creating a snapshot is instant because no data is copied — it just preserves the current pointers.
- Clones — a copy of a file that shares all blocks with the original. Only blocks that are modified get duplicated.
ZFS, Btrfs, and APFS are copy-on-write file systems. They trade slightly more complex writes for crash safety, snapshots, and efficient cloning.
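The pointer-swap idea behind CoW and snapshots can be shown with a toy block store (purely illustrative; real CoW file systems track references with trees and reference counts):

```python
# Toy copy-on-write file: a file is a tuple of block numbers. Writes
# allocate new blocks instead of overwriting; a snapshot is just a
# saved copy of the pointer tuple, so no data is copied.

class CowStore:
    def __init__(self):
        self.blocks = {}      # block_no -> data
        self.next_block = 0

    def alloc(self, data):
        """Write data into a fresh, never-used block."""
        self.blocks[self.next_block] = data
        self.next_block += 1
        return self.next_block - 1

store = CowStore()
file_ptrs = tuple(store.alloc(d) for d in ["AAAA", "BBBB", "CCCC"])

snapshot = file_ptrs          # instant: just copies the pointers

# "Overwrite" logical block 1: write a NEW physical block, then
# swap the pointer. The old block is never touched.
new = store.alloc("bbbb")
file_ptrs = (file_ptrs[0], new, file_ptrs[2])

print([store.blocks[b] for b in file_ptrs])   # ['AAAA', 'bbbb', 'CCCC']
print([store.blocks[b] for b in snapshot])    # ['AAAA', 'BBBB', 'CCCC']
```

Note that the file and the snapshot still share two of their three blocks: only the modified block was duplicated, which is exactly how clones stay cheap.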
Why Does fsync Matter?
When you call `write()`, the data goes to a kernel buffer — not directly to disk. The kernel writes buffers to disk later (typically within 5-30 seconds). This is write caching, and it makes writes fast.
But if the system crashes before the buffer is flushed, the data is lost. `fsync(fd)` forces the kernel to write all buffered data for that file to disk and wait until the disk confirms it's written.
This matters critically for databases. A database that writes a transaction to the WAL (Write-Ahead Log) and considers it committed must call `fsync` — otherwise a crash could lose committed transactions. PostgreSQL calls `fsync` after every WAL write. SQLite calls `fsync` in its default journal mode.
`fsync` is expensive — it forces a disk write and waits for it. On spinning disks, that's 5-10 ms. On SSDs, it's 0.1-1 ms. Database performance is often limited by `fsync` latency.
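A durable write needs both flushes: user-space buffer to kernel, then kernel to device. A minimal sketch using Python's `os` module (the path here is arbitrary):

```python
import os

def durable_write(path, data):
    """Write data and don't return until the disk confirms it.
    flush() pushes Python's user-space buffer into the kernel's
    page cache; os.fsync() pushes the page cache to the device."""
    with open(path, "w") as f:
        f.write(data)         # data in Python's user-space buffer
        f.flush()             # ...now in the kernel page cache
        os.fsync(f.fileno())  # ...now on disk (device acknowledged)

durable_write("/tmp/wal-demo.log", "txn 42: committed\n")
```

Skipping either step leaves a crash window: without `flush()` the data may still be in the process, and without `fsync()` it may still be in the page cache.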
How Do File Systems Compare?
| File system | OS | Design | Key feature |
|---|---|---|---|
| ext4 | Linux | Journaling, extents | Reliable, well-tested, default on most Linux |
| XFS | Linux | Journaling, extents | High performance for large files |
| Btrfs | Linux | Copy-on-write | Snapshots, checksums, built-in RAID |
| ZFS | Linux/FreeBSD | Copy-on-write | Enterprise: snapshots, compression, dedup |
| APFS | macOS/iOS | Copy-on-write | Encryption, snapshots, space sharing |
| NTFS | Windows | Journaling | ACLs, compression, Windows default |
| overlayfs | Linux | Layered | Union mount — used by containers |
The choice of file system affects performance, reliability, and available features. ext4 is the safe default for Linux. APFS is the only option on macOS. ZFS and Btrfs offer advanced features at the cost of complexity.
Next Steps
File systems organize data. Containers build on file systems to create isolated environments:
- How Containers Work — namespaces, cgroups, and overlayfs.
- How the Kernel Works — the VFS layer that dispatches to file system drivers.
- How Memory Works — the page cache that makes file I/O fast.