
B-Tree vs LSM Tree — When to Use Which
Every database must decide: how should data be organized on disk? The two dominant approaches — B-trees and LSM trees — make fundamentally different tradeoffs. B-trees update data in place, optimizing for reads. LSM trees append data sequentially, optimizing for writes.
This isn't a theoretical distinction. It determines whether your database handles 10,000 or 100,000 writes per second, whether reads take 1 disk seek or 5, and whether your storage costs are 1x or 3x.
The B-Tree Approach
A B-tree stores data in sorted order across a tree of pages. Each page is a disk block (typically 8-16 KB). To update a row, the database finds the page containing it, modifies the page in memory, and writes it back to the same location on disk.
Write path: find the page → modify in place → write to WAL → eventually write the page back.
This is a random write — the disk head must seek to the right position. Random writes are slow on spinning disks (5-10 ms per seek) and moderately fast on SSDs (~0.1 ms per operation, though they still trigger write amplification inside the SSD's own garbage collector).
Read path: traverse the tree from root to leaf. Reaching any key in a table of billions of rows takes 3-4 page reads, each of which may be a cache hit (buffer pool) or a disk read.
PostgreSQL, MySQL InnoDB, and SQLite all use B-trees. The read performance is excellent: O(log n) with a branching factor of 100-500 means 3-4 levels for any practical table size.
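The root-to-leaf traversal can be sketched with toy pages. The `Page` class and `btree_lookup` are illustrative names, and the branching factor is tiny for readability — real pages pack hundreds of keys into an 8-16 KB block:

```python
import bisect

# Hypothetical page layout: internal pages hold sorted separator keys and
# child pointers; leaf pages hold sorted keys and row values.
class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys within this page
        self.children = children  # child pages (internal node) or None
        self.values = values      # row values (leaf node) or None

def btree_lookup(page, key):
    """Descend from root to leaf; each hop costs one page read."""
    while page.children is not None:
        # Pick the child whose key range covers `key`.
        i = bisect.bisect_right(page.keys, key)
        page = page.children[i]
    # Binary search within the leaf page.
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# Two-level tree: the root separates three leaves at keys 10 and 20.
leaf1 = Page(keys=[1, 5], values=["a", "b"])
leaf2 = Page(keys=[10, 15], values=["c", "d"])
leaf3 = Page(keys=[20, 25], values=["e", "f"])
root = Page(keys=[10, 20], children=[leaf1, leaf2, leaf3])

print(btree_lookup(root, 15))  # "d" — two page reads: root, then leaf2
print(btree_lookup(root, 7))   # None — key absent
```

With a realistic branching factor of 100-500, the same loop bottoms out in 3-4 hops even for billions of keys, which is where the O(log n) read guarantee comes from.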
The LSM Tree Approach
An LSM tree never modifies data in place. Instead:
- Write to memtable — an in-memory sorted structure (usually a red-black tree or skip list).
- Flush to SSTable — when the memtable is full (~64 MB), write it to disk as a sorted, immutable file.
- Compaction — periodically merge SSTables to remove duplicates and reclaim space.
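The three steps above can be sketched in a few lines. `LSMTree`, `FLUSH_THRESHOLD`, and the dict-backed memtable are illustrative stand-ins, not any real engine's API:

```python
FLUSH_THRESHOLD = 4  # entries; real engines flush around ~64 MB

class LSMTree:
    def __init__(self):
        self.memtable = {}   # stands in for a red-black tree / skip list
        self.sstables = []   # newest first; each is an immutable sorted run

    def put(self, key, value):
        # Step 1: write to the in-memory memtable.
        self.memtable[key] = value
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self._flush()

    def _flush(self):
        # Step 2: write the memtable out as a sorted, immutable SSTable.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Step 3: merge all runs; the newest version of each key wins.
        merged = {}
        for run in reversed(self.sstables):  # oldest first
            merged.update(run)               # newer runs overwrite older
        self.sstables = [sorted(merged.items())]
```

Note that `put` never touches existing data on disk — it only appends, which is why the write path stays sequential.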
Write path: append to memtable (memory) → flush sequentially to disk. Sequential writes are the fastest possible disk access pattern — 10-100x faster than random writes.
Read path: check memtable → check each SSTable level (newest first) → use Bloom filters to skip irrelevant SSTables. Reads may need to check multiple files, making them slower than a single B-tree traversal.
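The read path above can be sketched as follows — memtable first, then SSTables newest to oldest, consulting a per-SSTable Bloom filter to skip files that definitely lack the key. The `BloomFilter` and `lsm_get` names and sizes are assumptions for the sketch:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

def lsm_get(memtable, sstables, key):
    """sstables: list of (bloom_filter, sorted_data) pairs, newest first."""
    if key in memtable:               # newest data wins
        return memtable[key]
    for bloom, data in sstables:
        if not bloom.might_contain(key):
            continue                  # "definitely absent": skip the disk read
        if key in data:
            return data[key]
    return None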
RocksDB, LevelDB, Cassandra, ScyllaDB, and CockroachDB all use LSM trees.
The Tradeoffs
| | B-Tree | LSM Tree |
|---|---|---|
| Write pattern | Random (update in place) | Sequential (append only) |
| Write throughput | Moderate | High (2-10x faster) |
| Read latency | Low (one tree traversal) | Higher (multiple SSTable checks) |
| Space amplification | Low (one copy of each row) | Higher (multiple versions until compaction) |
| Write amplification | Moderate (WAL + page write) | Higher (compaction rewrites data) |
| Compaction cost | None (no compaction) | Significant (background CPU and I/O) |
| Predictable latency | Yes | No (compaction can cause spikes) |
Write amplification is how many times data is written to disk for each logical write. A B-tree writes the WAL entry and the data page = ~2x. An LSM tree writes the WAL, the memtable flush, and multiple compaction passes = 10-30x. More disk writes mean more SSD wear and more I/O bandwidth consumed.
Space amplification is how much disk space is used relative to the actual data size. B-trees store one copy of each row (~1x plus some internal fragmentation). LSM trees may have multiple versions of the same row across SSTable levels until compaction merges them (~1.1-2x).
Read amplification is how many disk reads a lookup requires. B-trees: 1 (the right page, usually cached). LSM trees: potentially many (one per SSTable level), though Bloom filters eliminate most checks.
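The write-amplification arithmetic above can be checked with a quick back-of-envelope calculation. The specific level count and per-level rewrite factor are illustrative assumptions, not measurements of any particular engine:

```python
def btree_write_amplification():
    # One WAL record plus one page write-back per logical row write.
    return 1 + 1  # ~2x

def lsm_write_amplification(levels, rewrites_per_level):
    # WAL write + memtable flush + compaction rewriting the row as it
    # migrates down each level. rewrites_per_level depends on the fanout
    # and compaction style; a few per level is typical for leveled LSMs.
    return 1 + 1 + levels * rewrites_per_level

print(btree_write_amplification())       # 2
print(lsm_write_amplification(5, 4))     # 22 — inside the 10-30x range
```

The takeaway: the LSM's fast logical write path is paid for later, in background I/O, which is exactly the bandwidth and SSD-wear cost the table above summarizes.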
When to Use B-Trees
OLTP with mixed read/write — web applications, APIs, SaaS. Most queries are point lookups and short range scans. Write volume is moderate. PostgreSQL and MySQL handle this workload perfectly.
Predictable latency — B-tree reads are consistently fast. No compaction spikes. Critical for latency-sensitive applications (payment processing, real-time APIs).
Small to medium datasets — if the dataset fits in the buffer pool, B-tree reads are memory-only. No disk I/O at all.
When to Use LSM Trees
Write-heavy workloads — logging, metrics, time-series data, event ingestion. When you're writing millions of rows per second, LSM trees' sequential writes dominate.
Large datasets on SSDs — LSM trees' sequential write pattern is friendlier to SSDs (less write amplification at the SSD level, despite more at the database level). For multi-terabyte datasets, this matters for SSD lifespan.
Append-mostly data — if rows are rarely updated or deleted (logs, audit trails, event stores), LSM trees' weakness (compaction to handle updates) doesn't apply.
The Hybrid Reality
Modern databases blur the line:
CockroachDB uses an LSM storage engine (originally RocksDB, later replaced by its own Go rewrite, Pebble) but provides SQL and ACID transactions on top. The LSM tree handles the write throughput of a distributed database where every write is replicated to multiple nodes.
TiDB uses TiKV (RocksDB-based) for the storage layer. Same pattern — LSM for write throughput, SQL for the query interface.
SQLite uses B-trees but adds a WAL mode that gives it some LSM-like write characteristics (sequential WAL writes, deferred page updates).
The PostgreSQL ecosystem is exploring columnar storage through extensions (cstore_fdw, Hydra) for analytical queries while keeping B-trees for transactional workloads.
The trend: B-trees for the primary data structure, LSM trees for high-write auxiliary stores (write-ahead logs, replication streams, caching layers).
The Decision Framework
- What's your read/write ratio? Mostly reads → B-tree. Mostly writes → LSM.
- Do you need predictable latency? Yes → B-tree. Tolerant of occasional spikes → LSM is fine.
- How large is the dataset? Fits in memory → B-tree (reads are free). Exceeds memory significantly → LSM's sequential I/O helps.
- Are rows updated frequently? Yes → B-tree (in-place update). No (append-only) → LSM excels.
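The four questions above can be folded into one illustrative helper. The function name, inputs, and the simple majority-vote threshold are assumptions for the sketch, not hard rules:

```python
def choose_engine(read_heavy, needs_predictable_latency,
                  fits_in_memory, frequent_updates):
    """Each True answer is a vote for a B-tree, per the four questions."""
    btree_votes = sum([read_heavy, needs_predictable_latency,
                       fits_in_memory, frequent_updates])
    return "B-tree" if btree_votes >= 2 else "LSM tree"

# A typical web app: mostly reads, latency-sensitive, hot set in memory.
print(choose_engine(True, True, True, True))      # B-tree
# Metrics ingestion: write-heavy, spike-tolerant, far larger than RAM,
# append-only.
print(choose_engine(False, False, False, False))  # LSM tree
```

In practice the questions are rarely this clean — a real decision weighs them against each other — but the skeleton captures which way each answer pushes.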
For most applications — web apps, APIs, SaaS — PostgreSQL (B-tree) is the right choice. For write-heavy infrastructure — logging, metrics, event streaming — RocksDB or Cassandra (LSM) is the right choice.
The storage engine is the most consequential architectural decision in a database. Understanding the tradeoff is understanding why your database performs the way it does.