B-Tree vs LSM Tree — When to Use Which

2026-03-23

Every database must decide: how should data be organized on disk? The two dominant approaches — B-trees and LSM trees — make fundamentally different tradeoffs. B-trees update data in place, optimizing for reads. LSM trees append data sequentially, optimizing for writes.

This isn't a theoretical distinction. It determines whether your database handles 10,000 or 100,000 writes per second, whether reads take 1 disk seek or 5, and whether your storage costs are 1x or 3x.

The B-Tree Approach

A B-tree stores data in sorted order across a tree of pages. Each page is a disk block (typically 8-16 KB). To update a row, the database finds the page containing it, modifies the page in memory, and writes it back to the same location on disk.

Write path: find the page → append the change to the WAL → modify the page in memory → eventually write the page back to its location on disk.

Writing the page back is a random write — the disk head must seek to the right position. Random writes are slow on spinning disks (5-10 ms per seek) and moderately fast on SSDs (~0.1 ms per operation, though the SSD's internal garbage collection adds its own write amplification).

Read path: traverse the tree from root to leaf. Three or four page reads reach any key, even in a table of billions of rows. Each page read may be a cache hit (buffer pool) or a disk read.

PostgreSQL, MySQL InnoDB, and SQLite all use B-trees. The read performance is excellent: O(log n) with a branching factor of 100-500 means 3-4 levels for any practical table size.
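The root-to-leaf traversal can be sketched in a few lines. This is an illustrative toy, not PostgreSQL's implementation: each `Page` stands in for one disk block, and the lookup binary-searches each page for the child to descend into.

```python
# Illustrative B-tree lookup sketch (not any real engine's code): descend
# from the root, binary-searching each page for the child subtree to follow.
import bisect

class Page:
    """One fixed-size node: sorted keys plus either children or values."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted separator keys (internal) or row keys (leaf)
        self.children = children  # child pages for an internal node, else None
        self.values = values      # row values for a leaf node, else None

def btree_search(page, key):
    """Root-to-leaf traversal: O(log n) page reads with a large fanout."""
    while page.children is not None:              # internal node
        i = bisect.bisect_right(page.keys, key)   # pick the child subtree
        page = page.children[i]
    i = bisect.bisect_left(page.keys, key)        # leaf: exact match?
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# A tiny two-level tree: the root separator 20 splits two leaf pages.
leaf1 = Page(keys=[5, 10], values=["a", "b"])
leaf2 = Page(keys=[20, 30], values=["c", "d"])
root = Page(keys=[20], children=[leaf1, leaf2])
print(btree_search(root, 30))  # "d"
```

With a real branching factor of 100-500 instead of 2, the same loop runs only 3-4 times for billions of keys — which is where the read performance comes from.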

The LSM Tree Approach

An LSM tree never modifies data in place. Instead:

  1. Write to memtable — an in-memory sorted structure (usually a red-black tree or skip list).
  2. Flush to SSTable — when the memtable is full (~64 MB), write it to disk as a sorted, immutable file.
  3. Compaction — periodically merge SSTables to remove duplicates and reclaim space.

Write path: append to memtable (memory) → flush sequentially to disk. Sequential writes are the fastest possible disk access pattern — 10-100x faster than random writes.

Read path: check memtable → check each SSTable level (newest first) → use Bloom filters to skip irrelevant SSTables. Reads may need to check multiple files, making them slower than a single B-tree traversal.
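Both paths fit in a small sketch. The structure below is a deliberately tiny stand-in (a plain dict for the memtable, lists for SSTables, a made-up `MEMTABLE_LIMIT` of 4 entries rather than ~64 MB); real engines use skip lists, binary-searchable file formats, and Bloom filters, but the write-then-flush and newest-first-read logic is the same.

```python
# Toy LSM tree: writes hit an in-memory table; a full memtable is flushed
# as an immutable sorted run; reads check the newest data first.
MEMTABLE_LIMIT = 4          # hypothetical flush threshold (real: ~64 MB)

class TinyLSM:
    def __init__(self):
        self.memtable = {}  # stand-in for a skip list / red-black tree
        self.sstables = []  # newest last; each is an immutable sorted list

    def put(self, key, value):
        self.memtable[key] = value            # in-memory write: fast
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Sequential write: dump the memtable as one sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # 1. memtable first
            return self.memtable[key]
        for table in reversed(self.sstables): # 2. SSTables, newest first
            for k, v in table:                # (real engines binary-search)
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(6):
    db.put(i, f"v{i}")
db.put(2, "updated")          # newer version shadows the flushed one
print(db.get(2))              # "updated"
```

Note that the update to key 2 never touched the flushed copy on "disk" — the stale version simply loses to the newer one at read time, until compaction removes it.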

RocksDB, LevelDB, Cassandra, ScyllaDB, and CockroachDB all use LSM trees.
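Compaction itself is, at its core, a sorted merge that keeps only the newest version of each key. Real engines like RocksDB do this incrementally, level by level; this sketch shows just the core idea under the assumption that runs arrive oldest first.

```python
# Compaction sketch: merge sorted SSTable runs into one duplicate-free run,
# letting newer versions of a key overwrite older ones.
def compact(sstables):
    """sstables: list of sorted (key, value) runs, oldest first."""
    merged = {}
    for table in sstables:          # later (newer) tables overwrite
        for key, value in table:    # earlier ones, so the newest wins
            merged[key] = value
    return sorted(merged.items())   # one sorted, duplicate-free run

old = [(1, "a"), (2, "b")]
new = [(2, "b2"), (3, "c")]
print(compact([old, new]))  # [(1, 'a'), (2, 'b2'), (3, 'c')]
```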

The Tradeoffs

| Dimension | B-Tree | LSM Tree |
| --- | --- | --- |
| Write pattern | Random (update in place) | Sequential (append only) |
| Write throughput | Moderate | High (2-10x faster) |
| Read latency | Low (one tree traversal) | Higher (multiple SSTable checks) |
| Space amplification | Low (one copy of each row) | Higher (multiple versions until compaction) |
| Write amplification | Moderate (WAL + page write) | Higher (compaction rewrites data) |
| CPU during compaction | None | Significant (background merging) |
| Predictable latency | Yes | No (compaction can cause spikes) |

Write amplification is how many times data is written to disk for each logical write. A B-tree writes the WAL entry and the data page = ~2x. An LSM tree writes the WAL, the memtable flush, and multiple compaction passes = 10-30x. More disk writes mean more SSD wear and more I/O bandwidth consumed.
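The arithmetic is simple enough to spell out. The pass counts below are assumptions chosen for the example, not measurements from any real engine:

```python
# Illustrative write-amplification arithmetic; the per-pass counts are
# assumptions for the example, not measured values.
LOGICAL_WRITES = 1.0

# B-tree: one WAL append plus (eventually) one page write back.
btree_wa = (1 + 1) / LOGICAL_WRITES                    # ~2x

# LSM: WAL append + memtable flush + N compaction passes, each of which
# rewrites the data once as it migrates down the levels.
compaction_passes = 6                                  # hypothetical setup
lsm_wa = (1 + 1 + compaction_passes) / LOGICAL_WRITES  # 8x here; 10-30x in practice

print(f"B-tree ~{btree_wa:.0f}x, LSM ~{lsm_wa:.0f}x")
```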

Space amplification is how much disk space is used relative to the actual data size. B-trees store one copy of each row (~1x plus some internal fragmentation). LSM trees may have multiple versions of the same row across SSTable levels until compaction merges them (~1.1-2x).

Read amplification is how many disk reads a lookup requires. B-trees: one tree traversal, and since the upper levels are almost always cached, typically at most one disk read. LSM trees: potentially many (one per SSTable level), though Bloom filters eliminate most checks.
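The Bloom filter trick deserves a sketch: before touching an SSTable file, the engine tests its filter, and a negative answer is definitive, so most files need no disk read at all. The hashing scheme below is a toy (double-hashing via SHA-256), not what any particular engine uses.

```python
# Minimal Bloom filter sketch: a bit array plus k hash positions per key.
# A False answer means "definitely not in this SSTable"; True means "maybe".
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False -> key definitely absent; True -> probably present
        return all(self.bits[pos] for pos in self._positions(key))

f = BloomFilter()
f.add("user:42")
print(f.might_contain("user:42"))   # True
print(f.might_contain("user:99"))   # almost certainly False
```

With a well-sized filter, the false-positive rate stays around 1%, so an LSM read usually pays for exactly one SSTable lookup despite the many levels.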

When to Use B-Trees

OLTP with mixed read/write — web applications, APIs, SaaS. Most queries are point lookups and short range scans. Write volume is moderate. PostgreSQL and MySQL handle this workload perfectly.

Predictable latency — B-tree reads are consistently fast. No compaction spikes. Critical for latency-sensitive applications (payment processing, real-time APIs).

Small to medium datasets — if the dataset fits in the buffer pool, B-tree reads are memory-only. No disk I/O at all.

When to Use LSM Trees

Write-heavy workloads — logging, metrics, time-series data, event ingestion. When you're writing millions of rows per second, LSM trees' sequential writes dominate.

Large datasets on SSDs — LSM trees' sequential write pattern is friendlier to SSDs (less write amplification at the SSD level, despite more at the database level). For multi-terabyte datasets, this matters for SSD lifespan.

Append-mostly data — if rows are rarely updated or deleted (logs, audit trails, event stores), LSM trees' weakness (compaction to handle updates) doesn't apply.

The Hybrid Reality

Modern databases blur the line:

CockroachDB uses an LSM storage engine (originally RocksDB, later its Go-based successor Pebble) but provides SQL and ACID transactions on top. The LSM tree handles the write throughput of a distributed database where every write is replicated to multiple nodes.

TiDB uses TiKV (RocksDB-based) for the storage layer. Same pattern — LSM for write throughput, SQL for the query interface.

SQLite uses B-trees but adds a WAL mode that gives it some LSM-like write characteristics (sequential WAL writes, deferred page updates).

PostgreSQL is exploring a columnar storage engine (cstore_fdw, Hydra) for analytical queries while keeping B-trees for transactional workloads.

The trend: B-trees for the primary data structure, LSM trees for high-write auxiliary stores (write-ahead logs, replication streams, caching layers).

The Decision Framework

  1. What's your read/write ratio? Mostly reads → B-tree. Mostly writes → LSM.
  2. Do you need predictable latency? Yes → B-tree. Tolerant of occasional spikes → LSM is fine.
  3. How large is the dataset? Fits in memory → B-tree (reads are free). Exceeds memory significantly → LSM's sequential I/O helps.
  4. Are rows updated frequently? Yes → B-tree (in-place update). No (append-only) → LSM excels.
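Those four questions can be condensed into a toy heuristic. The thresholds and priority order below are invented for illustration, not tuning advice:

```python
# Toy decision heuristic encoding the four questions above; the 0.5
# write-fraction threshold is an invented illustration, not a benchmark.
def pick_engine(write_fraction, needs_predictable_latency,
                fits_in_memory, append_mostly):
    if needs_predictable_latency:
        return "B-tree"                  # no compaction latency spikes
    if fits_in_memory and write_fraction < 0.5:
        return "B-tree"                  # reads are memory-only
    if write_fraction > 0.5 or append_mostly:
        return "LSM"                     # sequential writes dominate
    return "B-tree"                      # default for mixed OLTP

print(pick_engine(0.9, False, False, True))   # "LSM"
print(pick_engine(0.2, True, True, False))    # "B-tree"
```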

For most applications — web apps, APIs, SaaS — PostgreSQL (B-tree) is the right choice. For write-heavy infrastructure — logging, metrics, event streaming — RocksDB or Cassandra (LSM) is the right choice.

The storage engine is the most consequential architectural decision in a database. Understanding the tradeoff is understanding why your database performs the way it does.