How Storage Engines Work — Organizing Data on Disk

2026-03-23

A database stores data on disk and retrieves it on demand. The storage engine is the component that decides how data is physically organized — how it's written, how it's read, how it's updated, and how it survives crashes. Everything else in the database (queries, transactions, replication) sits on top of the storage engine.

The choice of storage engine determines whether your database is fast at reads or writes, whether it handles analytical queries or transactional workloads, and how much disk space it uses.

What Is a Page?

Databases don't read individual rows from disk. Disk I/O reads in blocks — typically 4 KB or 8 KB. A database page is the unit of I/O: one page = one disk read.

A page contains multiple rows (in a row store) or multiple column values (in a column store). When you read one row, the entire page containing that row is loaded into memory. If the next row you need is on the same page, it's already in memory — no disk I/O needed.

This is why sequential access is fast (consecutive rows are on the same pages) and random access is slow (each row might be on a different page). The goal of storage engine design is to put data that's accessed together on the same pages.
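The page-locality argument can be sketched with a little arithmetic. The page size, row size, and dense packing below are illustrative assumptions, not any particular engine's on-disk format:

```python
# Sketch: why row placement matters. Page and row sizes are
# illustrative assumptions, not any specific engine's format.
PAGE_SIZE = 8192        # bytes per page (8 KB)
ROW_SIZE = 128          # fixed-size rows, for simplicity
ROWS_PER_PAGE = PAGE_SIZE // ROW_SIZE  # 64 rows fit on one page

def page_of(row_number: int) -> int:
    """Which page holds a given row in a densely packed file."""
    return row_number // ROWS_PER_PAGE

def pages_touched(row_numbers) -> int:
    """How many distinct pages (disk reads) a set of rows requires."""
    return len({page_of(r) for r in row_numbers})

# Sequential access: 64 consecutive rows live on 1 page -> 1 read.
print(pages_touched(range(0, 64)))           # 1
# Scattered access: 64 rows, each on a different page -> 64 reads.
print(pages_touched(range(0, 64 * 64, 64)))  # 64
```

Same number of rows, a 64x difference in disk reads — that ratio is the whole case for clustering related data.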

How Does a Heap Store Work?

The simplest storage layout: rows are appended to the end of the file in insertion order. A heap is an unordered collection of pages.

  - Insert: fast — append to the last page. O(1).
  - Read by ID: requires an index — without one, you'd scan every page. O(n).
  - Update: find the row (via an index), modify it in place. If the row grows larger than the available space on its page, it's moved and a forwarding pointer is left.
  - Delete: mark the row as deleted. The space is reused later (vacuuming).

PostgreSQL uses a heap store. Every table is a heap. Indexes point to row locations within the heap. This is the most common layout for row-oriented databases.
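A minimal in-memory sketch of the heap layout, assuming a toy page size and made-up names — real heap files track free space per page and store rows as bytes:

```python
# Minimal heap-store sketch: unordered pages, O(1) append, O(n) scan
# without an index. Structure and names are illustrative only.
class HeapFile:
    def __init__(self, rows_per_page: int = 4):
        self.rows_per_page = rows_per_page
        self.pages = [[]]          # list of pages, each a list of rows

    def insert(self, row) -> tuple:
        """Append to the last page; returns (page_no, slot) — the row's
        physical address, which is what an index entry would store."""
        if len(self.pages[-1]) >= self.rows_per_page:
            self.pages.append([])  # last page is full: start a new one
        self.pages[-1].append(row)
        return (len(self.pages) - 1, len(self.pages[-1]) - 1)

    def fetch(self, addr):
        """Direct read given a physical address (what an index enables)."""
        page_no, slot = addr
        return self.pages[page_no][slot]

    def scan_for(self, predicate):
        """Without an index: visit every page, every row. O(n)."""
        for page in self.pages:
            for row in page:
                if predicate(row):
                    return row
        return None

heap = HeapFile()
addr = heap.insert({"id": 5, "name": "alice"})
print(heap.fetch(addr))                       # direct lookup via address
print(heap.scan_for(lambda r: r["id"] == 5))  # full scan, no index
```

The `(page_no, slot)` pair plays the role of PostgreSQL's tuple identifier: indexes don't store rows, they store addresses into the heap.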

B-Tree Storage Engines

Some databases store data directly in a B-tree — the table itself is the index, sorted by primary key. This is called a clustered index or index-organized table.

  - Insert: find the correct position in the sorted tree, insert. O(log n). May require page splits if the target page is full.
  - Read by primary key: traverse the B-tree. O(log n) — typically 3-4 disk reads even for billions of rows.
  - Range scan: find the start key, then walk the leaf pages sequentially. Extremely fast because data is sorted.
  - Update: find the row, modify in place.

MySQL InnoDB uses this approach — the primary key IS the B-tree that stores the data. Secondary indexes point to the primary key, which is then looked up in the clustered index.

The tradeoff: inserts are slower than a heap (must maintain sort order), but range queries on the primary key are faster (data is already sorted).
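The clustered-index idea can be sketched with a sorted array instead of a real B-tree — the access patterns (binary-search lookups, contiguous range scans, insert-time shifting) are the same, just without the paging:

```python
import bisect

# Sketch of a clustered index: rows kept sorted by primary key.
# A real engine uses a B-tree of pages; a sorted list shows the idea.
class ClusteredTable:
    def __init__(self):
        self.keys = []   # sorted primary keys
        self.rows = []   # rows, kept in the same order as keys

    def insert(self, key, row):
        i = bisect.bisect_left(self.keys, key)  # find sorted position
        self.keys.insert(i, key)                # shifting neighbors is the
        self.rows.insert(i, row)                # analogue of a page split

    def get(self, key):
        """Point lookup: binary search, O(log n)."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i]
        return None

    def range_scan(self, lo, hi):
        """All rows with lo <= key < hi: one contiguous slice."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return self.rows[i:j]

t = ClusteredTable()
for k in (30, 10, 20):
    t.insert(k, {"id": k})
print(t.range_scan(10, 30))   # [{'id': 10}, {'id': 20}]
```

Note that the range scan never examines keys outside the slice — that's the payoff of keeping the data itself sorted, versus a heap where matching rows could be anywhere.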

LSM Trees — Optimized for Writes

Log-Structured Merge trees (LSM trees) take a different approach: optimize for write throughput at the cost of read latency.

Write path:

  1. Write to an in-memory buffer (the memtable — typically a sorted tree).
  2. When the memtable is full, flush it to disk as a sorted, immutable file (an SSTable — Sorted String Table).
  3. Periodically merge and compact SSTables to remove duplicates and deleted entries.

Read path:

  1. Check the memtable (in memory — fast).
  2. Check each SSTable on disk, newest first. Use Bloom filters to skip SSTables that definitely don't contain the key.
  3. Return the most recent version found.

Writes are sequential (append-only), which is the fastest possible disk access pattern. Reads may need to check multiple SSTables, which is slower than a single B-tree lookup.
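The write and read paths above can be sketched in a few dozen lines. This is a deliberately tiny model — no write-ahead log, no Bloom filters, no compaction levels — with invented names throughout:

```python
# Minimal LSM sketch: in-memory memtable, flush to immutable sorted
# runs ("SSTables"), reads check newest data first. Illustrative only.
class TinyLSM:
    TOMBSTONE = object()   # marker recorded for deleted keys

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []   # flushed runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value          # write lands in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                   # full: spill to "disk"

    def delete(self, key):
        self.put(key, self.TOMBSTONE)       # deletes are writes too

    def _flush(self):
        # Freeze the memtable as an immutable, sorted run.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                   # 1. memtable first
            val = self.memtable[key]
            return None if val is self.TOMBSTONE else val
        for table in reversed(self.sstables):      # 2. newest SSTable first
            if key in table:
                val = table[key]
                return None if val is self.TOMBSTONE else val
        return None                                # 3. key doesn't exist

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)     # memtable full: flushed to an SSTable
db.put("a", 9)     # newer version shadows the flushed one
print(db.get("a"))   # 9
print(db.get("b"))   # 2
```

Two details carry over to real engines: deletes are recorded as tombstones rather than removed in place (the older copy lives in an immutable file), and "newest first" ordering is what makes the shadowing correct until compaction merges the duplicates away.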

| | B-tree engine | LSM-tree engine |
|---|---|---|
| Write pattern | Random (update in place) | Sequential (append, merge later) |
| Write speed | Moderate | Fast |
| Read speed | Fast (one tree traversal) | Moderate (multiple SSTables) |
| Space amplification | Low | Higher (multiple versions until compaction) |
| Used by | PostgreSQL, MySQL InnoDB, SQLite | RocksDB, Cassandra, LevelDB, ScyllaDB |

Row Store vs Column Store

Row stores keep all columns of a row together on the same page. Reading SELECT * FROM users WHERE id = 5 loads one page containing the entire row. Good for transactional workloads that read and write complete rows.

Column stores keep all values of a single column together. Reading SELECT AVG(age) FROM users loads only the age column — no need to read names, emails, or other columns. Good for analytical queries that aggregate over many rows but few columns.

| | Row store | Column store |
|---|---|---|
| Layout | All columns per row on one page | All values per column together |
| Good for | OLTP (transactions, lookups) | OLAP (analytics, aggregations) |
| Read one row | Fast (one page) | Slow (one page per column) |
| Aggregate one column | Slow (reads all columns) | Fast (reads only that column) |
| Compression | Moderate | Excellent (similar values compress well) |
| Examples | PostgreSQL, MySQL, SQLite | ClickHouse, DuckDB, Parquet, BigQuery |

Column stores compress dramatically — a column of country codes might have only 200 distinct values across millions of rows. Dictionary encoding, run-length encoding, and bit-packing reduce storage by 10-100x.
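Two of those techniques are simple enough to sketch directly. The example below is a toy version of dictionary encoding and run-length encoding over a column of country codes (the data is made up):

```python
# Sketch: why columnar data compresses well. Toy versions of
# dictionary encoding and run-length encoding.
def dictionary_encode(column):
    """Replace each value with a small integer code."""
    dictionary = {}
    codes = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse runs of identical codes into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([code, 1])  # start a new run
    return runs

country = ["US"] * 1000 + ["DE"] * 500 + ["US"] * 250
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)
print(dictionary)   # {'US': 0, 'DE': 1}
print(runs)         # [[0, 1000], [1, 500], [0, 250]]
```

1,750 string values collapse to a 2-entry dictionary plus 3 runs. In a row store the same country codes are interleaved with names and emails, so neither trick applies nearly as well.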

What Is the Buffer Pool?

Reading from disk on every query would be impossibly slow. The buffer pool (or page cache) keeps frequently accessed pages in memory.

When a query needs a page:

  1. Check the buffer pool. If the page is there (a cache hit), return it immediately — no disk I/O.
  2. If not (a cache miss), read the page from disk into the buffer pool, evicting the least-recently-used page if the pool is full.

A well-tuned database with enough memory serves most queries entirely from the buffer pool. PostgreSQL's shared_buffers, MySQL's innodb_buffer_pool_size, and SQLite's page cache all serve this purpose.

The buffer pool is why "add more RAM" is often the most effective database optimization — more pages fit in memory, fewer disk reads, faster queries.
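The hit/miss logic is essentially an LRU cache keyed by page number. A minimal sketch, where `read_from_disk` stands in for real I/O and all names are invented (real pools also pin pages and track dirty pages for write-back):

```python
from collections import OrderedDict

# Minimal buffer-pool sketch with least-recently-used eviction.
class BufferPool:
    def __init__(self, capacity: int, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk  # stand-in for disk I/O
        self.pages = OrderedDict()            # page_id -> page, LRU first
        self.hits = self.misses = 0

    def get_page(self, page_id):
        if page_id in self.pages:             # cache hit: no disk I/O
            self.pages.move_to_end(page_id)   # mark most recently used
            self.hits += 1
            return self.pages[page_id]
        self.misses += 1                      # cache miss: go to disk
        page = self.read_from_disk(page_id)
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict least recently used
        return page

pool = BufferPool(capacity=2, read_from_disk=lambda pid: f"page-{pid}")
pool.get_page(1); pool.get_page(2)   # two misses fill the pool
pool.get_page(1)                     # hit: served from memory
pool.get_page(3)                     # miss: evicts page 2, the LRU entry
print(pool.hits, pool.misses)        # 1 3
```

Raising `capacity` is the code-level analogue of "add more RAM": the same access pattern produces more hits and fewer trips to `read_from_disk`.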

Next Steps