B-Tree — Senior Level¶
Audience: Engineers building or operating production systems where the B-tree is the substrate. Prerequisite:
middle.md.
This document is a field guide to B-trees in real software — which database, filesystem, or embedded engine uses which variant, with what page size, what locking, and what crash-recovery scheme. The point is to give you the exact vocabulary you need when reading source code, tuning a workload, or designing a new storage component.
1. Relational Databases¶
1.1 PostgreSQL — nbtree (Lehman-Yao B+ tree)¶
- Index access method:
btreeis the default and has been since release 6.4 (1998). Source undersrc/backend/access/nbtree/. - Page size: 8 KB by default (
BLCKSZcompile-time constant; configurable). - Variant: B+ tree with Lehman-Yao right-links for concurrency. Internal nodes hold separator keys + child pointers; leaves hold full index tuples with a pointer (TID) into the heap.
- Locking: No latches on the descent path — search reads pages with only a buffer pin, no exclusive lock. Inserts use page-level latches acquired in descent order, released as soon as the next is taken. The Lehman-Yao right-link rescues searches that race with concurrent splits.
- MVCC: Index tuples are not versioned; the heap is. Index lookups must visit the heap to check visibility. This is why PostgreSQL has higher index-lookup cost than InnoDB (which embeds versioning in the index).
- Crash recovery: WAL records every page-level modification (insert, split, vacuum). Recovery replays WAL against the on-disk pages.
FILLFACTOR: Per-index tunable (10–100, default 90 for B-tree). Controls how full pages start; leaves room for in-place updates and slows split rate.- HOT updates keep some updates in the heap without rewriting the index, avoiding B-tree dirty I/O when index keys don't change.
1.2 MySQL InnoDB — clustered B+ tree¶
- Source:
storage/innobase/btr/(the B-tree code is several thousand lines of C++). - Page size: 16 KB by default (configurable per tablespace to 4 / 8 / 32 / 64 KB).
- Variant: Clustered B+ tree on the primary key. The leaves contain the row data itself — not pointers to a separate heap. This is called an index-organized table (Oracle terminology) or clustered index (Microsoft / MySQL terminology).
- Secondary indexes are themselves B+ trees, but their leaf entries store the primary key value, not a row pointer. A secondary lookup is therefore two B+ tree traversals: first to find the PK, then to find the row in the clustered index.
- Locking: Page-latches via
mtr(mini-transaction) framework. Index tree latch can be S or X; combined with page latches in a release-as-you-descend pattern. - MVCC: Row versions stored in rollback segments (undo log). Each row carries a hidden transaction ID and a rollback pointer; visibility is computed against the reader's snapshot.
- Adaptive Hash Index (AHI): InnoDB watches B-tree access patterns and, for hot leaf pages, builds an in-memory hash index pointing directly to records — short-circuiting the tree walk. Tunable via
innodb_adaptive_hash_index.
1.3 SQL Server — clustered & non-clustered B+ trees¶
- Page size: 8 KB, fixed (never configurable). Eight pages form a 64 KB extent.
- Variant: B+ tree. Clustered indexes contain the data; non-clustered indexes store the row's clustering key (or RID if heap-organized).
- Locking: Page locks (
PAGELATCH_*) and key-range locks for SERIALIZABLE. - MVCC: Optional via "Read Committed Snapshot Isolation" — uses tempdb to hold row versions. Tree itself is not versioned.
- Page split direction tuning: SQL Server supports right-pad inserts (sequential workload tuning) via the GUID NEWSEQUENTIALID() function, since insert order matters for split frequency.
1.4 Oracle — B-tree indexes¶
- Default index type: B-tree (since Oracle 7). Source is proprietary.
- Page size ("block size"): 8 KB default; tablespace can use 2/4/8/16/32 KB.
- Variant: B+ tree. Internal nodes are "branch blocks"; leaves are "leaf blocks". Leaves are linked into a doubly-linked list for ascending and descending range scans.
- Index-Organized Tables (IOT): Oracle's name for clustered tables. The full row sits in the B+ tree leaf.
- Reverse-key indexes: A B-tree variant where keys are byte-reversed before indexing. Used to spread monotonically-increasing keys (timestamps, sequence values) across leaf pages to reduce hot-spot contention in RAC clusters.
- Bitmap indexes: A different index type, useful for low-cardinality columns; not B-tree. Mentioned only because Oracle docs lump them together.
1.5 SQLite — embedded B-tree¶
- Source:
src/btree.candsrc/btreeInt.h— about 10,000 lines of C, beautifully commented. - Page size: 4 KB default; configurable 512 B to 64 KB at database creation.
- Variant: SQLite stores two kinds of B-trees per database:
- Table B-tree: every table is a B+ tree keyed by a 64-bit ROWID, leaves hold full row payload.
- Index B-tree: every secondary index is a B-tree (not B+; intermediate nodes also store the key) of (index-key, ROWID) pairs.
- The file format spec (
https://www.sqlite.org/fileformat.html, section 1.3 ff.) is the canonical reference for the on-disk layout. Worth reading in full as a model of clarity. - Locking: Database-level (no concurrent writes). Concurrent readers OK during the WAL journaling mode.
- Crash recovery: Two modes: rollback journal (older) or WAL (default since 3.7.0). WAL mode also gives reader-writer concurrency.
2. Embedded Key-Value Stores¶
2.1 BoltDB / bbolt — CoW B+ tree¶
- License & origin: BoltDB (Ben Johnson, 2014; archived 2018), forked as bbolt by etcd-io (2018-).
- Page size: 4 KB.
- Variant: Copy-on-write B+ tree, single-writer / many-reader, fully mmap'd. Based on LMDB's design (Howard Chu's design influence is explicit in the README).
- Used by: etcd (so transitively by Kubernetes), Consul (via bbolt as Raft log store), InfluxDB v1.
- Crash recovery: None needed — CoW writes are atomic at the meta-page swap.
- Concurrency: One writer at a time; readers see consistent snapshots via the old meta page.
2.2 LMDB — Lightning Memory-Mapped Database¶
- Source:
mdb.cin OpenLDAP's repo; ~6,000 lines of C. - Page size: 4 KB (matches OS page size on x86_64).
- Variant: mmap'd copy-on-write B+ tree, single-writer.
- Used by: OpenLDAP slapd, Monero, libmdbx (a hardened fork), some Ethereum clients, Cuckoo Cycle miners.
- Recovery: Crash-safe by the same atomic meta-page swap; ~5x faster than file-based engines for read-heavy workloads precisely because there is no buffer pool — the OS page cache is the buffer pool.
- Limitations: Single writer (every transaction takes the global write lock); cannot expand beyond mmap'd size without remapping.
2.3 BerkeleyDB — B-tree, hash, queue, recno¶
- Originated: University of Berkeley CSRG, then Sleepycat → Oracle (acquired 2006).
- Page size: 512 B to 64 KB; default 4 or 8 KB.
- Variant: Standard B+ tree with lock manager for fine-grained concurrency, full WAL, and ARIES-style recovery.
- Used by: Subversion, Bogofilter, Postfix (queue), older OpenLDAP back-end (replaced by LMDB), Citus's hstore predecessor.
2.4 WiredTiger — MongoDB default since 3.2¶
- Source:
https://github.com/wiredtiger/wiredtiger. - Variant: B-tree (configurable page size, default 32 KB internal / 32 KB leaf) with copy-on-write writes batched into checkpoints. Also supports an LSM mode (rarely used since 3.2).
- Concurrency: Hazard pointers for lock-free reads; per-page exclusive latches for writes.
- MVCC: Each row carries timestamps; readers see a consistent snapshot at their read timestamp.
3. Filesystems¶
3.1 HFS+ — macOS catalog B-tree¶
- HFS+ stores the directory tree, allocation map, and file extents as B*-trees (the high-fill Knuth variant, see
middle.md). Catalog node size 8 KB; allocation block typically 4 KB. - The "Catalog B*-Tree" maps
(parentDirID, name)→ file metadata. Lookup, listing, and creation all go through this tree. - HFS+ has no journaling for catalog updates by default — relies on transactions in the journal file. Replaced by APFS in macOS 10.13 (2017).
3.2 APFS — B-tree of B-trees¶
- APFS keeps multiple B+ trees per volume: an object map, an extent reference tree, a snapshot metadata tree, a fsroot per volume.
- CoW everywhere; updates write new nodes upward to a new root that becomes visible atomically.
- Snapshots are O(1) — they just retain an old root pointer. Same model as Btrfs and LMDB.
3.3 NTFS — Master File Table B+ tree¶
- Each NTFS volume's MFT (Master File Table) is a sequence of 1 KB records, but directory entries are indexed by B+ trees stored as
$INDEX_ROOTand$INDEX_ALLOCATIONattributes on the directory's MFT record. - Page (index buffer) size: 4 KB on modern volumes (used to be 1 KB).
- Resilient: every B+ tree change is fenced by the NTFS log file's checkpoint structure.
3.4 XFS — B+ trees everywhere¶
- XFS is famously "all B+ trees". Per allocation group, separate B+ trees for:
- Free space by block number (location).
- Free space by extent size (size).
- Inode allocation (bitmap-based B+ tree, "iallocag").
- Reverse mapping (which file owns which extent — added with rmap-enabled XFS).
- Page (block) size: 4 KB by default. Originally designed for SGI IRIX in 1993 — one of the earliest filesystems built around B+ trees end-to-end.
3.5 Btrfs — B-tree of B-trees, CoW throughout¶
- Btrfs is literally "B-tree filesystem"; every metadata structure is a CoW B-tree.
- Separate B-trees for: subvolume root, file extents, checksums, free space, device tree, quota tree, UUID tree.
- 4 KB nodes (default); CoW writes new pages, atomically swaps a superblock pointer at commit.
- Snapshots, send/receive, and clones all leverage the CoW tree structure for cheap shared subtrees.
3.6 Ext4 — HTree directory index¶
- Ext4 directories use an "htree" — a bounded-depth B-tree (max 2 or 3 levels) keyed by the hash of the filename. The htree was added in ext3 to make large directories (>10K entries) tolerable.
- Not a full B-tree: hashed keys defeat range queries (
readdirreturns hash-order garbage, not alphabetical), but constant-bounded depth is enough for ext4's use case.
3.7 ZFS — Merkle-tree DMU¶
- ZFS's Data Management Unit (DMU) uses CoW indirect blocks arranged in a Merkle B-tree: every block carries a 256-bit checksum stored in its parent, so any corruption is detected when reading. Combined with redundancy (mirrors / raidz), corruption is automatically repaired.
- Block sizes are variable, up to 1 MB (
recordsizeproperty); a typical filesystem record is 128 KB.
4. Linux Kernel¶
drivers/md/dm-btree*— Device Mapper's persistent B-tree, used bydm-thin(thin provisioning) anddm-cache.fs/btrfs/ctree.c— Btrfs's CoW B-tree implementation, ~5,000 lines of C.lib/btree.c— A simple in-memory B+ tree library used by a handful of subsystems (e.g., DRM memory managers).- Kernel
radix_tree— superseded byxarray— is a radix tree, not a B-tree, but it solves similar in-RAM-indexing problems where a B-tree would also fit.
5. Operational Considerations¶
5.1 Page/buffer pool alignment¶
A B-tree node should be exactly one disk page. If the engine uses direct I/O (O_DIRECT), the buffer must be aligned to the device's logical block size (usually 512 B or 4 KB). If it uses buffered I/O, alignment is less critical but still helps with mmap and partial-page reads.
5.2 The page cache interaction¶
Most B-tree engines have their own buffer pool that competes with the OS page cache. PostgreSQL's shared_buffers (default 128 MB, tune to 25% of RAM) is the most famous example. The conventional wisdom for double caching: keep the database's buffer pool small enough that OS page cache can still hold "secondary" pages, or use mmap and let the OS do all the caching (LMDB / bbolt approach). InnoDB's innodb_buffer_pool_size should typically be 50-70% of RAM on dedicated database servers — InnoDB does not rely on OS page cache.
5.3 Sequential-insert hot spots¶
Monotonically-increasing primary keys (timestamp, UUID v1, auto-increment) cause every new key to land in the rightmost leaf. That leaf becomes a hot spot: every writer queues for it, every split affects only the right edge of the tree. Mitigations:
- Hash-keyed indexes (Oracle reverse-key indexes).
- UUID v4 instead of v1 (random distribution).
- Page split at 100% instead of 50% so right-edge inserts don't waste left-side capacity (
FILLFACTOR 100for append-only tables in PostgreSQL). - Partitioning the table so the hot-spot is split across N partitions.
5.4 Index bloat and rebuild¶
Repeated insert/delete cycles leave nodes underfull. PostgreSQL's VACUUM reclaims dead tuples but does not compact the tree itself; REINDEX rebuilds. InnoDB compacts on the fly during page splits but can still get bloated under random-delete patterns. Periodically rebuilding indexes on long-lived OLTP tables is normal hygiene.
5.5 Online schema changes¶
CREATE INDEX CONCURRENTLY in PostgreSQL builds a B-tree in two passes while accepting writes. InnoDB similarly supports online ALTER TABLE ... ADD INDEX since 5.6. Both use a side-log of writes that arrive during the build, then replay them at the end with appropriate locks.
6. Quick reference table¶
| System | Variant | Page size | Concurrency | CoW | Where keys live |
|---|---|---|---|---|---|
| PostgreSQL btree | B+ Lehman-Yao | 8 KB | page latches + right-link | no | leaves only |
| InnoDB | B+ clustered | 16 KB | mtr latches | no | leaves; clustered = data |
| SQL Server | B+ | 8 KB | page latches | no | leaves only |
| Oracle | B+ | 8 KB | latches + KCBR | no | leaves only |
| SQLite | Table B+ / Index B | 4 KB | DB-level | no | leaves (table) / both (index) |
| BoltDB / bbolt | B+ CoW | 4 KB | 1 writer | yes | leaves only |
| LMDB | B+ CoW mmap | 4 KB | 1 writer | yes | leaves only |
| BerkeleyDB | B+ | 4-8 KB | lock mgr | no | leaves only |
| WiredTiger | B-tree | 32 KB | hazard ptr | yes | both |
| HFS+ catalog | B* | 4-8 KB | volume lock | no | both |
| Btrfs | B-tree CoW | 4 KB+ | rwsem per tree | yes | both |
| XFS | B+ | 4 KB | per-AG latches | no | leaves only |
| NTFS index | B+ | 4 KB | reserved range | no | leaves only |
| Ext4 htree | Hashed B-tree | 4 KB | dir mutex | no | leaves only |
7. Reading list¶
- Stonebraker, M., & Hellerstein, J. M. (2007). Architecture of a Database System. Foundations and Trends in Databases. Chapter 4 covers buffer management and indexes.
- Ramakrishnan, R., & Gehrke, J. Database Management Systems, 3rd ed., Chapters 10-12. The textbook on indexes and storage.
- Hellerstein, J. M., Stonebraker, M., & Hamilton, J. (2007). Architecture of a Database System. The buffer pool chapter is essential.
- Petrov, A. Database Internals, Chapters 2-3 (B-tree basics), 6 (variants), 13 (anti-entropy).
- PostgreSQL source:
src/backend/access/nbtree/README— the maintainers' own explanation of the Lehman-Yao implementation. Required reading. - SQLite file format spec:
https://www.sqlite.org/fileformat.html. The clearest statement of B-tree on-disk layout you will find anywhere. - MySQL InnoDB internals: Jeremy Cole's blog series at
https://blog.jcole.us/innodb/. - LMDB design paper by Howard Chu (2011), available from the OpenLDAP site.
- Bender, M. A., et al. Don't Thrash: How to Cache Your Hash on Flash. VLDB 2012. SSDs vs B-trees vs LSM.