How Write-Ahead Logging Works — Crash Recovery and Durability

2026-03-23

A database modifies data in memory for speed. But memory is volatile — a crash, a power failure, or a kernel panic erases everything. The data on disk is the only durable copy. So how does a database guarantee that committed transactions survive crashes?

Write-ahead logging (WAL): before modifying any data page, write a description of the change to a sequential log on disk. If the database crashes before the data pages are written, the log is replayed on startup to reconstruct the changes.

The rule is simple and absolute: the log record must be on disk before the data page is modified on disk. Log first, data second. This is the "write-ahead" guarantee.

Why Not Just Write Data Pages Directly?

Data pages are scattered across the disk. Updating a row means writing to the page containing that row — a random write. Random writes are slow (the disk head must seek to the right position, or the SSD must erase and rewrite a block).

Worse: a data page is typically 8 KB, while disks only guarantee atomic writes at sector granularity. If the system crashes during a page write, the page may be partially written — the first 4 KB is the new version, the last 4 KB is the old version. This is a "torn page": corrupt, and without extra protection, unrecoverable.

WAL solves both problems:

Sequential writes — log records are appended to the end of the log file. Sequential writes are the fastest possible disk access pattern — 10-100x faster than random writes.

Crash safety — log records are small (typically 50-200 bytes) and aligned to disk sectors. A crash during a log write either completes the record (safe) or doesn't (the incomplete record is discarded on recovery). No partial states.
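The crash-safety property can be made concrete with record framing. A minimal sketch (a hypothetical format, not any real engine's): each record carries its length and a CRC, so a record truncated by a crash fails the checksum and is simply discarded on recovery.

```python
import struct
import zlib

def encode_record(payload: bytes) -> bytes:
    """Frame a WAL record as [length][crc32][payload]."""
    header = struct.pack("<II", len(payload), zlib.crc32(payload))
    return header + payload

def decode_records(log: bytes):
    """Yield complete, intact records; stop at the first torn one."""
    offset = 0
    while offset + 8 <= len(log):
        length, crc = struct.unpack_from("<II", log, offset)
        payload = log[offset + 8 : offset + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # truncated or corrupt tail: recovery stops here
        yield payload
        offset += 8 + length

# A crash mid-append leaves a truncated record; recovery drops it.
log = encode_record(b"set x=1") + encode_record(b"set y=2")[:5]
assert list(decode_records(log)) == [b"set x=1"]
```

The torn second record is invisible to recovery — the log ends at the last intact record, with no partial state to reason about.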

How Does WAL Work?

During normal operation:

  1. Transaction modifies a row. The change is applied to the data page in memory (in the buffer pool).
  2. A WAL record describing the change is written to the WAL buffer.
  3. On COMMIT, the WAL buffer is flushed to disk (fsync). The transaction is now durable.
  4. The modified data page stays in memory. It will be written to disk later — minutes or hours later — by a background process during a checkpoint.

The key insight: the data pages don't need to be on disk at commit time. Only the log needs to be on disk. This is why commits are fast — one sequential fsync of the log, not random writes to dozens of data pages.

During recovery (after a crash):

  1. Read the WAL from the last checkpoint forward.
  2. For each log record, check whether the corresponding data page on disk already reflects the change (typically by comparing the page's stored log sequence number, or LSN, with the record's).
  3. If not (the page was only modified in memory and lost in the crash), redo the change by applying the log record to the page.
  4. For uncommitted transactions (started but not committed before the crash), undo their changes.

After recovery, the database is in a consistent state: all committed transactions are present, all uncommitted transactions are rolled back. This is how atomicity and durability are implemented.
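The redo/undo logic above can be sketched as a two-pass replay. This is a simplification (the log format and transaction IDs here are invented; real engines interleave redo and undo phases and use LSNs rather than rescanning): pass one finds which transactions reached COMMIT, pass two reapplies only their changes, which implicitly discards the uncommitted ones.

```python
def recover(log_lines):
    """Replay a toy WAL: redo committed changes, drop uncommitted ones."""
    committed = set()
    pages = {}
    # Pass 1: find which transactions committed before the crash.
    for line in log_lines:
        op, *rest = line.split()
        if op == "COMMIT":
            committed.add(rest[0])
    # Pass 2: redo changes belonging to committed transactions only.
    for line in log_lines:
        op, *rest = line.split()
        if op == "UPDATE":
            txid, page_id, value = rest
            if txid in committed:
                pages[page_id] = value  # redo the logged change
    return pages

log = [
    "UPDATE t1 p1 apple",
    "COMMIT t1",
    "UPDATE t2 p2 banana",  # t2 never committed -> rolled back
]
assert recover(log) == {"p1": "apple"}
```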

What Is a Checkpoint?

The WAL grows continuously. Without bounds, it would consume unlimited disk space, and recovery would take hours (replaying days of log records).

A checkpoint solves this:

  1. Write all dirty pages (modified in memory but not yet on disk) to their data files.
  2. Record the checkpoint position in the WAL.
  3. WAL records before the checkpoint are no longer needed for recovery — they can be recycled.

Checkpoints happen periodically (every few minutes or after a certain amount of WAL data). The tradeoff: frequent checkpoints keep the WAL small and recovery fast, but each checkpoint causes a burst of random I/O (writing dirty pages). PostgreSQL's checkpoint_timeout and max_wal_size control this.
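The three-step cycle can be sketched over toy in-memory structures (the class and field names are hypothetical; a real engine also fsyncs the data files and handles concurrent writes during the checkpoint):

```python
class Checkpointer:
    """Sketch of the checkpoint cycle: flush dirty pages, mark the
    position, recycle older WAL."""

    def __init__(self):
        self.wal = []          # list of (lsn, record)
        self.next_lsn = 0
        self.buffer_pool = {}  # page_id -> bytes (in memory)
        self.dirty = set()
        self.data_files = {}   # page_id -> bytes (simulated disk)

    def log(self, record):
        self.wal.append((self.next_lsn, record))
        self.next_lsn += 1

    def checkpoint(self):
        # 1. Write every dirty page to its data file.
        for page_id in self.dirty:
            self.data_files[page_id] = self.buffer_pool[page_id]
        self.dirty.clear()
        # 2. Record the checkpoint position in the WAL.
        self.log(f"CHECKPOINT at lsn={self.next_lsn}")
        # 3. Older records are never needed again: recycle them.
        self.wal = self.wal[-1:]
```

After `checkpoint()` returns, recovery only ever needs to replay from the checkpoint record forward — which is exactly why the WAL before it can be reclaimed.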

What Is fsync and Why Does It Matter?

When you call write(), the data goes to a kernel buffer, not to disk. The kernel writes buffers to disk later. fsync() forces the kernel to flush the buffer to disk and wait for the disk to confirm the write.

WAL commits MUST use fsync. Without it, a committed transaction's log record might be in a kernel buffer when the system crashes — the record is lost, and the transaction's durability guarantee is broken.
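The distinction is visible at the system-call level. A minimal demo of a durable append (using a temporary file as a stand-in for the log):

```python
import os
import tempfile

# os.write() hands the bytes to a kernel buffer; os.fsync() blocks
# until the storage device confirms they are on stable media.
fd, path = tempfile.mkstemp()
os.write(fd, b"COMMIT txn 42\n")  # in the kernel buffer -- NOT yet durable
os.fsync(fd)                      # on stable storage -- now durable
os.close(fd)
```

Without the `os.fsync` line, a crash at the wrong moment loses the record even though `write()` returned successfully.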

fsync costs 0.1-1ms on SSDs, 5-10ms on spinning disks. This is the primary bottleneck for transaction throughput. Techniques to amortize the cost:

Group commit — multiple transactions' log records are batched into a single fsync. If 100 transactions commit within 1ms, one fsync flushes all 100. PostgreSQL does this automatically.

WAL compression — compress log records before writing. Less data to fsync.

Async commit — trade durability for speed. The transaction returns before fsync completes. If the system crashes in the window between commit and fsync, the transaction is lost. PostgreSQL offers this via synchronous_commit = off.
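Group commit is the subtlest of the three, so here is a leader/follower sketch (a simplification — the class is invented for illustration, and real engines also bound how long followers wait): the first committer to arrive performs the flush, and any transaction that arrives while that flush is in flight is covered by the next one.

```python
import threading

class GroupCommit:
    """Leader/follower group commit (sketch). Commits that arrive while
    a flush is in flight share the next flush, so N concurrent commits
    can require far fewer than N fsyncs."""

    def __init__(self, flush):
        self.flush = flush             # e.g. lambda: os.fsync(log_fd)
        self.cond = threading.Condition()
        self.pending = 0               # commits requested so far
        self.flushed = 0               # commits made durable so far
        self.flushing = False          # is a leader mid-flush?

    def commit(self):
        with self.cond:
            my_seq = self.pending = self.pending + 1
            while self.flushed < my_seq:
                if self.flushing:
                    self.cond.wait()       # follower: ride the next fsync
                else:
                    self.flushing = True
                    batch = self.pending   # everything requested so far
                    self.cond.release()
                    try:
                        self.flush()       # ONE fsync covers the batch
                    finally:
                        self.cond.acquire()
                    self.flushing = False
                    self.flushed = batch
                    self.cond.notify_all()
```

Every `commit()` call still returns only after its record is durable — the batching changes how many fsyncs happen, not the durability guarantee.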

WAL Beyond Traditional Databases

The WAL concept appears throughout systems:

SQLite — offers WAL as an opt-in journal mode (PRAGMA journal_mode=WAL; the default is a rollback journal). Readers don't block writers because they read from the main database file while writers append to the WAL.

Kafka — is essentially a distributed WAL. Producers append records. Consumers read sequentially. The log IS the database.

Event sourcing — store every state change as an immutable event (the log). The current state is derived by replaying events. The event log IS the WAL.

File system journaling — ext4's journal is a WAL for file system metadata. Same principle: write intent to the journal, then modify the actual data.

Next Steps