B-Tree — Professional Level¶
Audience: Engineers building storage engines, query optimizers, or high-throughput indexed systems. Prerequisite:
senior.md.
This document is concerned with what happens after you have already implemented a correct B-tree and now you need it to be fast: cache-conscious node layout, vectorized scanning inside a node, the buffer-managed vs mmap'd trade-off, the LSM-tree alternative, adaptive radix trees as hot-page replacements, write-ahead logging interaction, index-organized tables, and notes on modern storage engines.
1. Cache-Conscious Node Layout¶
The B-tree's headline win comes from packing many keys per page. Inside that page, the secondary win comes from packing many keys per CPU cache line.
The numbers¶
Modern x86-64 CPUs use 64-byte cache lines. A 4 KB page holds 64 cache lines. If a node has 256 8-byte keys, that's 2 KB of key data — 32 cache lines just for keys, plus another 32 for child pointers if you store them adjacently. A linear scan touches each cache line once; a binary search touches log₂(256) = 8, but at unpredictable offsets — each comparison is a cache miss until the page is fully warm.
Structure-of-arrays vs array-of-structs¶
// AoS (naive):
struct Entry { uint64_t key; uint64_t child_ptr; };
struct Node { Entry entries[2 * T - 1]; uint16_t n; bool leaf; };
// SoA (cache-friendly):
struct Node {
uint64_t keys[2 * T - 1];
uint64_t children[2 * T];
uint16_t n;
bool leaf;
};
Binary-searching the SoA version touches only the keys array — half the cache lines. The children are not pulled in until you have decided which subtree to enter. 2× speedup on a hot in-memory B-tree is typical for this change alone.
"Eytzinger" layout for in-RAM search trees¶
For purely in-memory B-trees that never spill to disk, Khuong & Morin (2017, "Array Layouts for Comparison-Based Searching") showed that an Eytzinger / level-order layout — store the implicit binary tree of a sorted array breadth-first — gives 2-4× faster lookups than std::lower_bound due to perfect cache prefetcher prediction. The idea generalizes to B-trees: instead of storing children by pointer, store the whole subtree's keys in a contiguous level-order block. This is what Algolia's instantsearch uses for autocompletion indexes; what Roaring Bitmaps uses internally; and what some modern OLAP engines use for their dictionary indexes.
Slotted page layout (the database convention)¶
Real database pages are not homogeneous arrays. They use a slotted page layout:
+--------------------------------------------------+
| header (page LSN, flags, checksum) |
+----+----+----+--------------------------+--------+
| s0 | s1 | s2 | ... free space | k2|k1|k0|
+----+----+----+--------------------------+--------+
^ ^ ^ ^ ^ ^
| | +-- offset of key 2 (variable) +--+--+-- keys grow right-to-left
| +-- offset of key 1 (variable size)
+-- offset of key 0
Slot pointers grow from the start of the page; variable-length records grow from the end. Inserting and deleting only rewrites the slot directory and the affected record; no global shift. Used by PostgreSQL, InnoDB, SQL Server, Oracle, SQLite, BerkeleyDB — every production B-tree.
2. SIMD Scanning Within a Node¶
For nodes with hundreds of keys, binary search has 8 cache misses but ~8 comparisons. Linear search with SIMD has 0 cache misses (sequential prefetcher loves it) and ~256/16 = 16 SIMD comparisons that retire 4 per cycle. On modern hardware, linear SIMD search beats binary search up to ~512 keys per node.
Polychroniou, Raghavan & Ross (2015, "Rethinking SIMD Vectorization for In-Memory Databases") showed AVX2 / AVX-512 scans of sorted 32-bit keys at 30+ GB/s. The same technique is used in:
- DuckDB's
lower_boundfor sorted columns. - ClickHouse's skip indexes and dictionary lookups.
- VictoriaMetrics's index scan.
- Apache Arrow's vectorized kernel set.
Pseudocode (AVX2, 8x 32-bit keys at a time):
__m256i vtarget = _mm256_set1_epi32(target);
for (int i = 0; i < n; i += 8) {
__m256i v = _mm256_loadu_si256((__m256i*)(keys + i));
__m256i cmp = _mm256_cmpgt_epi32(v, vtarget);
uint32_t mask = _mm256_movemask_epi8(cmp);
if (mask) {
// first set bit's byte index / 4 is the relative key offset
return i + (__builtin_ctz(mask) >> 2);
}
}
return n;
For variable-length keys (strings), you typically use a fixed-prefix array in the node header and only fall through to the full key on prefix tie.
3. Buffer-Managed vs mmap'd B-Trees¶
There are two schools for relating B-tree pages to disk pages:
Buffer-managed (PostgreSQL, InnoDB, SQL Server, Oracle, BerkeleyDB)¶
The engine maintains its own buffer pool of fixed-size frames. Page reads go through a hash table from page-id to frame; cache replacement is LRU-2 or clock-with-frequency. The engine fully controls eviction and write-back.
Pros: - Predictable eviction policy tuned to database access patterns. - Can be larger than RAM via spilling (rarely done). - Direct I/O (O_DIRECT) bypasses double-caching. - Easy to instrument: cache hit rates, dirty-page counts, etc.
Cons: - ~5,000 lines of complex C++ buffer manager code. - Doubles memory if buffered I/O is used (OS page cache + buffer pool).
mmap'd (LMDB, bbolt, GreenplumDB's append-only tables historically)¶
The whole file is mmap'd. Page reads are pointer dereferences; the kernel handles paging via the OS page cache. No buffer pool code.
Pros: - Tiny code (LMDB is 6,000 lines total, including everything). - Zero-copy reads — the index page in the OS cache is the buffer. - No double caching.
Cons: - No control over eviction policy. - Page faults can stall threads unpredictably; tail latencies can spike. - TLB pressure on huge databases (mitigated by huge pages). - Andy Pavlo's "Are You Sure You Want to Use MMAP in Your Database Management System?" (CIDR 2022) is a famous broadside arguing mmap is incompatible with serious crash safety and concurrency for general-purpose DBMSes. LMDB / bbolt sidestep most of his arguments via single-writer CoW design.
When to pick which¶
- Single-writer, mostly-read, low operational complexity: mmap. LMDB and bbolt have proven this at scale (etcd, Monero).
- Many concurrent writers, online schema changes, write-ahead logging, partial recovery: buffer pool. PostgreSQL and InnoDB are the canonical examples.
4. LSM Trees vs B-Trees¶
The LSM (Log-Structured Merge) tree, popularized by Google's Bigtable (2006) and incarnated in LevelDB / RocksDB / Cassandra / HBase / ScyllaDB / Apache Pinot, is the B-tree's main rival for write-heavy workloads. The contrast:
| B-tree | LSM tree | |
|---|---|---|
| Write path | random in-place updates to leaf pages | append-only sorted run; flush MemTable to immutable SSTable |
| Read path | one traversal, ~1 leaf read | check MemTable, then check each level's SSTable (until found) → read amplification |
| Space | ~70% leaf utilization | many copies of the same key across levels until compaction |
| Compaction | none required | continuous background merge across levels — high write amplification |
| Range scan | sequential leaf walk | merge-iterate across MemTable + each level |
| Best for | mixed read/write, smaller datasets, OLTP | write-heavy, append-mostly, time-series, log ingestion |
The compaction trade-off¶
LSM trees buffer writes in memory and write whole sorted SSTables in batches. Background compaction merges levels, removing duplicates. The cost: each key may be rewritten 10-20 times on its journey from level 0 to the bottom — write amplification of 10x is common. The benefit: writes are sequential I/O, ~30x faster on HDD and 5-10x faster on SSD vs random B-tree writes.
Read amplification trade-off¶
A read in an LSM may need to consult every level until it finds the key (or a tombstone). Bloom filters per SSTable cut this to ~1 random read in expectation. Still slower than a B-tree's single traversal.
The hybrid: WiredTiger¶
MongoDB's default engine offers both modes per collection. B-tree by default; LSM available for ingest-heavy collections. Most users stick with B-tree.
Compaction strategies¶
- Tiered (Cassandra default): group N similar-sized SSTables into one larger SSTable. Lower write amp, higher space amp.
- Leveled (RocksDB/LevelDB default): each level is N× larger than the previous; compaction merges from one level to the next. Higher write amp, lower space amp.
- Universal (RocksDB): combines both.
- Tiered + Leveled (FoundationDB, ScyllaDB): different policies at different levels.
The classic LSM analysis paper is Sears & Ramakrishnan (2012), "bLSM: A General Purpose Log Structured Merge Tree". Modern surveys: Luo & Carey (2020), "LSM-based Storage Techniques: A Survey".
5. WiredTiger's Leaf Split Policy¶
WiredTiger (MongoDB's default engine) makes a non-obvious design choice: leaves split at insert time, internal nodes do not split synchronously. Internal nodes accumulate "pending" inserts in their in-memory image; at the next eviction, they reconcile into multiple smaller on-disk pages.
This decouples the synchronous insert path (only touches a leaf and updates its in-memory image) from the asynchronous reconciliation (rebuilds parent representations during eviction). Throughput goes up; tail latency on inserts goes down. The trade-off is more memory used for in-memory page images and more CPU during reconciliation.
The full design is documented at https://source.wiredtiger.com/develop/arch-index.html.
6. Adaptive Radix Trees as Hot-Page Replacements¶
For pages or sub-indexes that are entirely in-memory and constantly hit, a B-tree may not be the right shape. The Adaptive Radix Tree (ART) of Leis, Kemper & Neumann (ICDE 2013, "The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases") provides:
- O(k) lookup where k is key length in bytes (not log n).
- 4-way / 16-way / 48-way / 256-way nodes that switch type as they fill up — keeping memory low for sparse subtrees.
- Prefix compression so common key prefixes are stored once.
Performance: ART beats a tuned in-memory B-tree by ~2x on uniform integer keys, ~5x on string keys with shared prefixes. Used by HyPer, DuckDB, CedarDB, Umbra.
For systems that need persistence, persistent ART variants exist but are less common than B+ trees because ART's lookups are not as cache-friendly when the tree spills to disk (each node is just one cache line, not a full page).
7. Write-Ahead Logging Interaction¶
A B-tree modification cannot be "atomic on disk" — splitting a node requires writing two pages, updating the parent, and possibly cascading upward. If a crash occurs mid-split, the on-disk tree is left inconsistent.
The standard fix is write-ahead logging (WAL). Before any page modification:
- Construct a log record describing the change (typically: redo info = new page content; undo info = old page content).
- Force the log record to durable storage (fsync).
- Then modify the page in the buffer pool. The dirty page is flushed lazily.
After a crash, recovery: - Redo: Replay all log records since the last checkpoint to bring pages forward. - Undo: Roll back any uncommitted transactions.
The reference recovery algorithm is ARIES (Mohan et al., 1992, "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging"). InnoDB, SQL Server, Oracle, and DB2 all implement ARIES or variants. PostgreSQL uses a simpler scheme based on full-page writes after each checkpoint.
Log record per B-tree operation¶
A B-tree insert generates 1–4 log records: - INSERT-KEY-IN-LEAF (1 record). - SPLIT-LEAF (1 record describing both the new sibling and the parent update). - ...possibly cascading up.
Compaction tip: many engines log only the difference (the new key and the split point) rather than the full new page content, saving WAL bandwidth.
CoW trees: no log¶
Copy-on-write B-trees skip WAL entirely. The atomic root pointer swap is the commit. This is why bbolt and LMDB are dramatically simpler than InnoDB. Trade-off: every commit costs writing the whole modified path (~log n pages).
8. Index-Organized Tables (Clustered Indexes)¶
When the B+ tree leaf contains the full row data, not a row pointer, the table is called an Index-Organized Table (Oracle) or clustered index (Microsoft/MySQL).
Benefits¶
- One I/O for the entire row on a primary-key lookup. A heap-based table needs index lookup + heap fetch.
- Range scans on the primary key are sequential on disk because rows are physically ordered by PK.
- Secondary index entries are smaller because they don't carry full TIDs — they carry the PK (which for InnoDB doubles as the row's "location").
Costs¶
- Secondary index lookup is two B+ tree traversals: one for PK, one for row.
- PK update is expensive: the row physically moves to its new location in the clustered index, and every secondary index pointing at that PK must be invalidated and rewritten.
- Large variable-length rows distort fanout: a clustered B+ tree page that contains a 4 KB BLOB inline holds far fewer rows. InnoDB pushes such columns to overflow pages and stores only a pointer in the leaf.
Who uses what¶
| Engine | Default |
|---|---|
| InnoDB | Always clustered on PK |
| Oracle | Heap by default; IOT optional |
| SQL Server | Heap by default; clustered if explicit |
| PostgreSQL | Always heap; no clustered indexes |
| SQLite | Always clustered on ROWID |
PostgreSQL's CLUSTER command is a one-time physical reordering, not a maintained clustered index — order degrades immediately on subsequent inserts.
9. Modern Storage Engines (a Selective Survey)¶
FoundationDB¶
Distributed key-value store; Apple acquired & open-sourced (2018). Storage engine is RocksDB-backed (LSM), but its transaction layer is a B-tree-like Merkle tree over keys for verification and conflict detection.
Aerospike¶
Real-time KV store. Two-layer architecture: in-memory primary index is a red-black tree (= 2-3-4 B-tree), per-node sharded; on-disk records sit in append-only "stripes" with the in-memory index pointing into them.
ClickHouse¶
Column-store OLAP. Each table is sorted by its primary key, but the "index" is a sparse skip index — one entry per data granule (typically 8192 rows). Not a true B-tree. Each granule is found via binary search of this sparse array.
CockroachDB / TiDB / YugabyteDB¶
Distributed SQL. Each range is stored as RocksDB (LSM) on each node. The "B-tree" character lives at the SQL-engine layer in the form of secondary indexes, also stored in RocksDB.
Umbra / HyPer¶
Academic in-memory databases. ART for primary indexes. B-tree for some secondary indexes when range scans dominate.
BumbleDB¶
Specialized B-tree variants for hybrid memory/storage hierarchies — research projects out of CMU's database group focusing on persistent memory (PMEM / Optane).
Velox / DuckDB¶
Embedded OLAP. DuckDB uses ART for hash joins and a B+ tree per table for primary key indexes; column data stored in compressed segments separate from the index.
10. Tail-Latency and Operational Tips¶
- Checkpoint pacing. A naive engine forces all dirty pages at checkpoint time, spiking write latency. Modern engines (PostgreSQL
checkpoint_completion_target, InnoDBinnodb_io_capacity) spread the writes across the checkpoint interval. - Buffer pool warm-up. Cold cache lookups can be 1000x slower than warm. PostgreSQL has
pg_prewarm; InnoDB serializes the buffer pool to disk for fast restart. - Read-ahead. Sequential scans should trigger prefetch of the next several leaf pages; almost all engines do this via the OS readahead mechanism + their own scan-detection heuristic.
- Auto-vacuum tuning. PostgreSQL's autovacuum doesn't reorganize B-tree pages; rebuild periodically with
REINDEX CONCURRENTLYon hot tables. - Index bloat measurement. Use
pgstattuple(PostgreSQL) orSHOW TABLE STATUS(InnoDB) to see the fill ratio. Below 50% means a rebuild will help. - Compression. InnoDB row-format COMPRESSED applies per-page zlib; halves disk space but doubles CPU. PostgreSQL TOAST compresses individual oversized columns.
Reading list¶
- Petrov, A. Database Internals, Part I (storage), 2019. The current best single source.
- Hellerstein, J. M., Stonebraker, M., & Hamilton, J. (2007). Architecture of a Database System. The reference architecture.
- Mohan, C., et al. (1992). ARIES: A Transaction Recovery Method... The WAL design paper.
- Leis, V., Kemper, A., & Neumann, T. (2013). The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases. ICDE.
- Polychroniou, O., Raghavan, A., & Ross, K. A. (2015). Rethinking SIMD Vectorization for In-Memory Databases. SIGMOD.
- Khuong, P.-V., & Morin, P. (2017). Array Layouts for Comparison-Based Searching. JEA.
- Pavlo, A., et al. (2022). Are You Sure You Want to Use MMAP in Your Database Management System? CIDR. The mmap critique.
- Sears, R., & Ramakrishnan, R. (2012). bLSM: A General Purpose Log Structured Merge Tree. SIGMOD.
- Luo, C., & Carey, M. J. (2020). LSM-based Storage Techniques: A Survey. VLDB Journal.
- Graefe, G. (2011). Modern B-Tree Techniques. Foundations and Trends in Databases 3(4). 200 pages of optimizations from 1970 to now.