Concurrent Trees — Professional Level¶

Table of Contents¶

Introduction
Prerequisites
Glossary
The Bw-Tree in Implementation Depth
Concurrent ART (Adaptive Radix Tree)
Hardware Transactional Memory (Intel TSX, ARMv8.1 LSE)
OLFIT and MassTree in Depth
Hekaton-Class In-Memory Engines
Comparative Engine Tour
Hardware-Aware Tree Design
Profiling at the Microbenchmark Level
Coding Patterns
Clean Code at Engine Scale
Product Use
Performance Tips
Best Practices
Edge Cases and Pitfalls
Common Mistakes
Common Misconceptions
Tricky Points
Tricky Questions
Test
Cheat Sheet
Self-Assessment Checklist
Summary
Further Reading
Related Topics
Diagrams

Introduction¶

Focus: "I am building or maintaining an in-memory engine where every nanosecond and cache line counts. The advanced literature is now my baseline."

This file is for the very small set of engineers who design, implement, or maintain in-memory database engines, in-process columnar stores, or research-grade ordered data structures. By "very small" we mean: at most a few hundred working engineers worldwide. The techniques here are not appropriate for general-purpose Go services — they are the apex of what is possible in concurrent ordered data, and applying them in the wrong context yields complex, buggy code that runs no faster than tidwall/btree with publish/subscribe COW.

We assume you have absorbed the junior, middle, and senior files. The professional file takes you deeper into:

The Bw-tree (Microsoft Research, 2013): mapping tables, delta chains, consolidation, epoch reclamation, structure modifications. A lock-free B-tree at full implementation grade.
Concurrent ART (Adaptive Radix Tree): byte-indexed adaptive nodes, prefix compression, optimistic concurrency, SIMD-accelerated lookups.
Hardware transactional memory (Intel TSX, ARMv8.1 LSE): how xbegin/xend can replace fine-grained latches for short critical sections.
OLFIT and MassTree: B-link with full OCC, used in HyPer, Hekaton, MassTree itself.
Hekaton internals: Microsoft's in-memory engine for SQL Server. How it integrates Bw-tree, MVCC, and transaction logging.
A comparative tour: PostgreSQL nbtree, MySQL InnoDB, Oracle TimesTen, SAP HANA, MS SQL Server Hekaton, Google Spanner. What each one's tree looks like and why.
Hardware-aware design: cache lines, prefetch, branch prediction, NUMA effects, persistent memory.
Microbenchmarking discipline: how to measure correctly at the nanosecond scale.

You leave this file able to read any paper in the area, debate trade-offs with database engineers, and architect engines at the bleeding edge of what is possible.

A caveat: almost none of this matters for ordinary Go services. If you are building a payments backend, a CRUD API, a SaaS dashboard, you do not need this material. You need the publish/subscribe COW pattern from the middle file. The professional file is for the case where that pattern is not enough — which is rare.

Prerequisites¶

All previous files in this section.
Fluency with C / C++. The published literature is in those languages.
Familiarity with hardware concepts: cache lines, memory hierarchy, branch prediction, NUMA, prefetching.
Familiarity with at least one database engine's source.
Patience to read research papers carefully.

Glossary¶

Term	Definition
Bw-tree	Microsoft Research's lock-free B-tree, 2013. Uses a mapping table, delta chains, and epoch reclamation.
Mapping table	A flat array indexed by node ID. Each slot holds a pointer to the current node (base + delta chain). Updates via CAS.
Delta chain	A linked list of small "delta" records describing modifications since the last consolidation. Reads walk the chain plus the base.
Consolidation	The process of folding a delta chain into a fresh base node. Triggered when the chain becomes too long.
Structure modification	An operation that changes the tree's topology (split, merge). In Bw-trees these are also delta records.
Concurrent ART	A concurrent variant of the Adaptive Radix Tree, typically using OCC + B-link-style sibling pointers.
Adaptive node	An ART node that resizes itself (4, 16, 48, or 256 slots) based on the actual number of children.
Prefix compression	Storing only the diverging suffix of keys; the common prefix is stored once at the parent node.
HTM	Hardware Transactional Memory. CPU support for executing a block atomically; commits if no conflict, aborts otherwise.
TSX	Intel's HTM implementation: `xbegin`, `xend`, `xabort`.
LSE	Large System Extensions in ARMv8.1; includes atomic CAS instructions but not full HTM.
Hekaton	Microsoft SQL Server's in-memory engine. Uses Bw-trees for indexes, MVCC, and transaction-private versioned writes.
HyPer	Munich University's in-memory engine. Uses ART for indexes.
TimesTen	Oracle's in-memory database. Uses tree-based indexes.
SAP HANA	A column-store-plus-row-store hybrid. Uses delta-merge architecture; row-store uses concurrent trees.
NUMA	Non-Uniform Memory Access. Multi-socket hardware where memory access cost depends on which socket owns the memory.
Cache line	64 bytes on x86, 128 bytes on some ARMv8. The unit of cache coherency traffic.
Prefetch	A CPU instruction (`prefetcht0`, etc.) that fetches a cache line into L1/L2 before it is needed.
Branch predictor	A CPU mechanism that guesses which branch will be taken. Predictable branches are nearly free; mispredictions cost ~20 cycles.
Persistent memory	Memory that survives power loss. Intel Optane DCPMM was the most prominent example. Requires special protocols for crash consistency.

The Bw-Tree in Implementation Depth¶

The Bw-tree is one of the most fascinating data structures in computer science. Read the original paper (Levandoski, Lomet, Sengupta, 2013) carefully; this section is commentary, not a substitute.

The big picture¶

A standard B+-tree mutates nodes in place. A Bw-tree never mutates a node in place. Instead:

Each node has an identifier (a NodeID, typically 32 or 64 bits).
A global mapping table maps NodeID → memory pointer to the node's current state.
To modify a node, allocate a new delta record (e.g., "insert key K with value V"), link it to the current state, then CAS the mapping table entry to the delta head.
Readers walk the delta chain plus the base node when they look up a key.
When a delta chain gets long, consolidate: allocate a fresh base node that represents the chain's combined effect, then CAS the mapping table entry.
Structure modifications (split, merge) are special delta records that change the tree's topology.

The combined effect: lock-free reads, lock-free writes (just CAS on the mapping table), no in-place mutation. Ideal for flash storage and high-concurrency CPUs.

The mapping table¶

type MappingTable struct {
    entries []atomic.Pointer[Node]
}

func (mt *MappingTable) Get(id NodeID) *Node {
    return mt.entries[id].Load()
}

func (mt *MappingTable) CAS(id NodeID, old, new *Node) bool {
    return mt.entries[id].CompareAndSwap(old, new)
}

The table is the single source of truth for which physical node corresponds to a logical node ID. Updates always go through CAS on the table.

Why the indirection? Because the alternative — every parent pointing directly to a child's memory pointer — means moving a child requires updating every parent. With the mapping table, a child can be "moved" (consolidated) without touching any parents: just CAS the table entry.

Delta records¶

A delta record is a small struct describing an incremental change:

type DeltaKind int

const (
    DeltaInsert DeltaKind = iota
    DeltaDelete
    DeltaUpdate
    DeltaSplit
    DeltaMerge
    DeltaRemoveNode
)

type Delta struct {
    Kind    DeltaKind
    Key     int64
    Value   []byte
    Next    *Node    // the next state (delta or base)
    // For Split/Merge:
    SiblingID NodeID
    Pivot     int64
}

type Node struct {
    IsDelta bool
    // If !IsDelta, this is a base node with items.
    Items   []item
    HighKey int64
    RightSib NodeID
    // If IsDelta:
    Delta *Delta
}

Reading a key from a Bw-tree:

func (t *BwTree) Get(key int64) ([]byte, bool) {
    id := t.rootID
    for {
        n := t.mapping.Get(id)
        if found, val := walkDeltaChain(n, key); found {
            return val, true
        }
        // Need to descend to a child. Walk the chain to find the right child ID.
        childID := findChild(n, key)
        if childID == 0 {
            return nil, false
        }
        id = childID
    }
}

func walkDeltaChain(n *Node, key int64) (bool, []byte) {
    for n.IsDelta {
        d := n.Delta
        switch d.Kind {
        case DeltaInsert:
            if d.Key == key {
                return true, d.Value
            }
        case DeltaDelete:
            if d.Key == key {
                return false, nil // tombstoned
            }
        }
        n = d.Next
    }
    // Reached the base.
    return findInBase(n, key)
}

For each key lookup, the reader walks the delta chain and combines its effect with the base. This is more work per read than a B-tree, but writes are lock-free.

Writes¶

func (t *BwTree) Insert(key int64, value []byte) {
    id := findLeaf(t, key) // descends to the leaf, returns its NodeID
    for {
        old := t.mapping.Get(id)
        delta := &Node{
            IsDelta: true,
            Delta: &Delta{Kind: DeltaInsert, Key: key, Value: value, Next: old},
        }
        if t.mapping.CAS(id, old, delta) {
            return
        }
        // Someone else updated; retry.
    }
}

The write allocates a delta record, links it to the current state, and CASs the mapping table entry. If the CAS fails (another thread won), retry.

Multiple writers can succeed against different node IDs; they only contend on the same leaf.

Consolidation¶

When the delta chain reaches a threshold length (say, 8 records), the next reader to encounter it triggers consolidation:

func (t *BwTree) maybeConsolidate(id NodeID, n *Node) *Node {
    chainLen := lengthOfChain(n)
    if chainLen < consolidationThreshold {
        return n
    }
    fresh := consolidate(n)
    if t.mapping.CAS(id, n, fresh) {
        // We own the old chain; epoch-retire it.
        t.epoch.Retire(n)
        return fresh
    }
    // Someone else consolidated first. Use their version.
    return t.mapping.Get(id)
}

func consolidate(n *Node) *Node {
    items := walkChainToItems(n)
    return &Node{Items: items, HighKey: n.HighKey, RightSib: n.RightSib}
}

Consolidation:

Walks the chain, applying inserts / deletes / updates to a fresh items slice.
Allocates a new base node.
CASs the mapping table to point to it.
Retires the old chain to an epoch reclaimer.

If two threads try to consolidate simultaneously, only one's CAS succeeds; the other's work is wasted (but the chain is now consolidated by the winner).

Splits and merges¶

Structure modifications are also delta records, with extra fields:

type Delta struct {
    Kind    DeltaKind
    // For DeltaSplit:
    Pivot   int64
    SiblingID NodeID
    Next    *Node
}

A split:

Allocate the new right sibling. Install it in the mapping table with a fresh NodeID.
The new sibling's content: the upper half of the splitting node's items.
Atomically post a SplitDelta to the splitting node: "items >= Pivot now live in SiblingID."
Readers seeing the SplitDelta know: "if the key >= Pivot, go to SiblingID."
Eventually, post the new sibling to the parent via another delta record.

This is the same architectural idea as Lehman-Yao's right-sibling pointer, but expressed entirely as deltas and CAS operations — no latches required.

Epoch reclamation¶

The Bw-tree must ensure that old base nodes and delta records are not freed while in-flight readers may still see them. This is exactly the RCU problem.

Each reader enters an epoch on every operation:

type Epoch struct {
    counter atomic.Int64
    queues  []sync.Mutex
    items   [][]any
}

func (e *Epoch) Enter() int64 {
    return e.counter.Load()
}

func (e *Epoch) Exit(epoch int64) {
    // Decrement active count for this epoch. Real impl uses per-thread counters.
}

func (e *Epoch) Retire(obj any) {
    cur := e.counter.Load()
    e.queues[cur%3].Lock()
    e.items[cur%3] = append(e.items[cur%3], obj)
    e.queues[cur%3].Unlock()
}

func (e *Epoch) Advance() {
    new := e.counter.Add(1)
    // Wait for all readers in epoch (new-2) to finish.
    // Then free items from queue (new-2)%3.
}

Real Bw-tree implementations use sophisticated epoch trackers (per-thread counters, lock-free queues). Microsoft's implementation is open-sourced in their BzTree research code.

In Go, the GC subsumes most of this. As long as no goroutine retains a pointer to the old chain, it is collected. You only need explicit epoch tracking if you use unsafe.Pointer or cgo.

Why Bw-trees are hard¶

The Bw-tree's correctness depends on:

Each delta record being atomically published via CAS.
Consolidation never losing data (must correctly fold the entire chain).
Splits being seen by all readers in the right order.
Epoch reclamation freeing only truly unreachable memory.

Bugs in any of these can be silent and very hard to reproduce. Microsoft's original Bw-tree had known correctness bugs that took years to fix. The 2013 paper has been revised multiple times.

For Go, the realistic stance: understand the Bw-tree's protocol, but do not implement one from scratch. If you need its properties, use a library or fork the Microsoft reference code via cgo.

A Go skeleton¶

For pedagogical completeness, a minimal Bw-tree skeleton in Go:

package bwtree

import (
    "sync"
    "sync/atomic"
)

type NodeID uint32

type Node struct {
    isDelta  bool
    isLeaf   bool
    // Base node fields:
    items    []item
    highKey  int64
    rightSib NodeID
    // Delta fields:
    deltaKind DeltaKind
    deltaKey  int64
    deltaVal  []byte
    deltaNext *Node
    // Split delta:
    splitPivot   int64
    splitSibling NodeID
}

type item struct {
    Key int64
    Val []byte
}

type DeltaKind int

const (
    DKInsert DeltaKind = iota
    DKDelete
    DKSplit
)

type BwTree struct {
    mu      sync.Mutex // protects allocation of node IDs
    mapping []atomic.Pointer[Node]
    rootID  NodeID
    nextID  atomic.Uint32
}

func New() *BwTree {
    t := &BwTree{}
    t.mapping = make([]atomic.Pointer[Node], 1<<16)
    root := &Node{isLeaf: true, highKey: 1<<62}
    rootID := NodeID(t.nextID.Add(1))
    t.mapping[rootID].Store(root)
    t.rootID = rootID
    return t
}

func (t *BwTree) Get(key int64) ([]byte, bool) {
    id := t.rootID
    for {
        n := t.mapping[id].Load()
        if found, val, deleted := walkChain(n, key); deleted {
            return nil, false
        } else if found {
            return val, true
        }
        // Walk down to a child.
        leaf, childID := findLeafOrChild(t, n, key)
        if leaf {
            return nil, false
        }
        id = childID
    }
}

func walkChain(n *Node, key int64) (found bool, val []byte, deleted bool) {
    for n.isDelta {
        switch n.deltaKind {
        case DKInsert:
            if n.deltaKey == key {
                return true, n.deltaVal, false
            }
        case DKDelete:
            if n.deltaKey == key {
                return false, nil, true
            }
        case DKSplit:
            if key >= n.splitPivot {
                // Caller must descend into splitSibling.
                return false, nil, false
            }
        }
        n = n.deltaNext
    }
    // Base.
    for _, it := range n.items {
        if it.Key == key {
            return true, it.Val, false
        }
    }
    return false, nil, false
}

func (t *BwTree) Insert(key int64, val []byte) {
    id := findLeafID(t, key)
    for {
        old := t.mapping[id].Load()
        delta := &Node{
            isDelta:   true,
            deltaKind: DKInsert,
            deltaKey:  key,
            deltaVal:  val,
            deltaNext: old,
        }
        if t.mapping[id].CompareAndSwap(old, delta) {
            return
        }
    }
}

This sketch elides:

Consolidation.
Splits and merges.
Epoch reclamation.
Mapping table growth.
Concurrent mapping table resizing.

A complete implementation is 5000+ lines of careful Go. Read the paper before attempting it.

When (rarely) to consider a Bw-tree in Go¶

You should consider a Bw-tree if:

You are building an in-memory database engine.
Your CAS rate on a single hot node exceeds what tidwall/btree can sustain (millions per second).
You can afford the implementation cost and the testing budget.

In every other case, publish/subscribe COW is the answer.

Concurrent ART (Adaptive Radix Tree)¶

ART is a radix trie variant where each node adapts its size to the number of children:

Node4: up to 4 children. Two arrays of 4 bytes (keys) and 4 pointers (children). Linear scan.
Node16: up to 16 children. Same shape, sized 16. Linear scan or SIMD comparison.
Node48: up to 48 children. A 256-byte index array mapping each possible byte to a slot in a 48-pointer array, plus the array.
Node256: 256 children, direct indexing. One pointer per possible byte.

A trie's traversal indexes by key byte, so finding the child at level L is O(1) in a Node256 and O(log 16) (binary search or SIMD) in smaller nodes.

For concurrency, ART uses OLFIT (Optimistic Latch-Free Index Traversal): per-node version counters; readers validate; writers latch one node at a time.

Layout¶

type NodeKind uint8

const (
    Node4 NodeKind = iota
    Node16
    Node48
    Node256
    Leaf
)

type Node struct {
    kind     NodeKind
    version  atomic.Uint64
    writeMu  sync.Mutex
    prefix   []byte
    numChildren uint16
    // ... shape-specific fields
}

type Node4Body struct {
    keys     [4]byte
    children [4]*Node
}

type Node16Body struct {
    keys     [16]byte
    children [16]*Node
}

type Node48Body struct {
    indices  [256]uint8 // 0xff = empty
    children [48]*Node
}

type Node256Body struct {
    children [256]*Node
}

type LeafBody struct {
    key   []byte
    value []byte
}

Real implementations pack the body inline via tagged pointers; in Go we would use a struct with kind discriminator and interface dispatch (with allocation tradeoffs).

Lookup¶

func (t *ART) Get(key []byte) ([]byte, bool) {
    for retries := 0; retries < 64; retries++ {
        n := t.root.Load()
        depth := 0
    walk:
        for {
            if n.kind == Leaf {
                if bytesEqual(n.leafKey(), key) {
                    return n.leafValue(), true
                }
                return nil, false
            }
            v1 := n.version.Load()
            if v1&1 == 1 {
                runtime.Gosched()
                break walk
            }
            if !matchPrefix(n.prefix, key, depth) {
                if n.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            depth += len(n.prefix)
            if depth >= len(key) {
                if n.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            child := findChild(n, key[depth])
            if child == nil {
                if n.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            if n.version.Load() != v1 {
                break walk
            }
            n = child
            depth++
        }
    }
    panic("too many retries")
}

The pattern is the same as OCC B-tree: read fields, validate version, restart on conflict.

Adaptive resizing¶

When a Node4 fills up, it is replaced by a Node16. When a Node16 fills, Node48. When a Node48 fills, Node256. When a node shrinks below the lower threshold, it is replaced by the next smaller size.

Replacement is non-trivial under concurrency: the parent's pointer to the resizing node must be updated atomically. Options:

Indirection: each node has a generation counter; readers re-validate after observing a generation mismatch.
CAS on parent: atomically swap parent's child pointer from old node to new.
Mapping table (like Bw-tree): indirect via a node ID.

Real ART implementations (DuckDB, HyPer) use a combination.

Prefix compression¶

A trie's depth equals the key length. For 256-bit keys, that is 32 byte-levels. Most levels would have a single child; ART compresses these into a "prefix" stored in the parent.

Without prefix compression:
[a] -> [b] -> [c] -> Leaf("abc": ...)
[a] -> [b] -> [d] -> Leaf("abd": ...)

With prefix compression:
Node prefix="ab"
  [c] -> Leaf("abc": ...)
  [d] -> Leaf("abd": ...)

Prefix compression dramatically shortens the tree and improves cache behavior. The cost: comparing the prefix against the search key on every traversal.

SIMD acceleration¶

Node16's "find child for byte K" is a search over 16 sorted bytes. Modern CPUs have SIMD instructions (e.g., AVX2 _mm256_cmpeq_epi8) that can compare 16 bytes in one instruction, plus _mm256_movemask_epi8 to extract the match.

In Go, you cannot use intrinsics directly. Options:

Use golang.org/x/sys/cpu to detect SIMD support and call assembly.
Use linear scan; modern branch predictors are quite good.
Use binary search (slower than linear for 16 elements due to branch misprediction).

In practice, Go ART implementations use linear scan and accept the small slowdown vs C++ SIMD-optimized versions.

Concurrent ART in production¶

HyPer, DuckDB, and several other database engines use concurrent ART. In Go, no production-grade concurrent ART exists. If you need one, the realistic path is:

Implement a single-threaded ART with OLFIT-style version counters.
Wrap with publish/subscribe COW for concurrency.
Accept that performance will be 50-70% of a tuned C++ implementation.

For string-keyed in-memory indexes at very high scale, ART is the right choice. For mixed int/string keys with range queries, a B+-tree is simpler and competitive.

Hardware Transactional Memory (Intel TSX, ARMv8.1 LSE)¶

HTM provides hardware support for atomic execution of arbitrary code blocks. The protocol:

xbegin: start a hardware transaction. Returns a fallback address.
Read and write memory normally.
xend: commit. If no conflict, all reads and writes appear atomic.
If a conflict is detected (another core writes to a cache line you read), xabort is triggered and execution jumps to the fallback.

For trees, HTM can replace fine-grained latches: wrap a small critical section in xbegin/xend, fall back to a mutex on abort.

Intel TSX example (pseudocode)¶

status = _xbegin();
if (status == _XBEGIN_STARTED) {
    // Critical section: lock-free under HTM
    bt_insert(key, val);
    _xend();
} else {
    // Fallback: use the mutex
    mutex_lock();
    bt_insert(key, val);
    mutex_unlock();
}

If the transaction commits, we have the speed of a critical-section-free path. If it aborts, we fall back to the mutex.

When does HTM help?¶

Short critical sections.
Low write conflict rate.
Many concurrent operations.

For long critical sections, HTM aborts due to cache evictions. For high-conflict workloads, HTM aborts due to conflicts. In either case, you spend most time in the fallback.

Reality check¶

Intel TSX has had a turbulent history: bugs in early implementations forced disablement on many CPUs. By 2025, TSX is available on most server-class Intel chips but is rarely used in production code outside of niche cases (some HPC, some research databases).

ARMv8.1 LSE provides atomic CAS instructions but not full HTM. ARM has Pointer Authentication and Memory Tagging but not TSX-equivalent.

Using HTM from Go¶

Go does not provide direct access to xbegin/xend. You would need cgo to call assembly. Even then, Go's runtime can preempt goroutines at any point, which interacts poorly with HTM (a preempted goroutine in a transaction aborts and falls back).

For Go in 2026, HTM is essentially unavailable. The senior file's discussion of HTM is intellectual completeness; in practice, you will not use it.

What HTM tells us about latch design¶

The interesting lesson from HTM is: critical sections must be short and simple to be fast. This is true even without HTM. Any time you hold a latch:

Cache lines you touch must be exclusive (HTM teaches us how few you can have).
Branch mispredictions cost.
Function calls are expensive.
Allocations are very expensive.

A well-designed latched B-tree mimics HTM's discipline even without HTM hardware: minimize the critical section to the smallest possible work.

OLFIT and MassTree in Depth¶

The OLFIT paper (Cha et al., 2001) and the MassTree paper (Mao et al., 2012) describe two related protocols for highly concurrent B+-trees in memory.

OLFIT mechanism¶

Each node has a 32- or 64-bit version counter.
Readers do not take latches; they validate the version before and after reading.
Writers take an exclusive latch on a node, increment the version twice (start: odd, end: even), mutate, release.
A reader seeing an odd version yields and retries.
A reader whose pre/post version comparison differs retries.

For B-trees, OLFIT alone is not enough — concurrent splits leave readers in the wrong node. OLFIT is paired with B-link's right-sibling pointers to handle splits.

The protocol's strength: read throughput is limited only by memory bandwidth, not by lock contention. Writes still serialize per-node.

MassTree¶

MassTree extends OLFIT with two innovations:

Multi-level trie of B+-trees. The top level is a B+-tree of 8-byte key fragments; each leaf points to a sub-tree for keys with that prefix. This compresses common prefixes and avoids per-key string comparison overhead.
Versioned permuted keys. Each B+-tree node stores keys in insertion order, plus a separate "permutation" array that gives the sorted order. Inserts append + update the permutation; the version counter increments. Readers walk the permutation to find keys in sorted order.

MassTree's claimed throughput in 2012 was ~10M reads/sec per core, ~1M writes/sec per core on a 16-core machine — among the fastest in-memory B+-trees ever measured.

Implementing OLFIT in Go¶

A skeleton (extending the senior file's OCC sketch with B-link):

type Node struct {
    version  atomic.Uint64
    writeMu  sync.Mutex
    isLeaf   bool
    items    []item
    children []*Node
    highKey  int64
    rightSib *Node
}

func (t *OLFITree) Get(key int64) ([]byte, bool) {
    n := t.root.Load()
restart:
    for retries := 0; retries < 64; retries++ {
        if n == nil {
            return nil, false
        }
        v1 := n.version.Load()
        if v1&1 == 1 {
            runtime.Gosched()
            continue
        }
        // Read all fields.
        highKey := n.highKey
        items := n.items
        children := n.children
        rightSib := n.rightSib
        // Re-validate.
        v2 := n.version.Load()
        if v2 != v1 {
            n = t.root.Load()
            continue restart
        }
        // Use the values.
        if key > highKey {
            n = rightSib
            continue
        }
        if n.isLeaf {
            i, found := findItem(items, key)
            if !found {
                return nil, false
            }
            return items[i].val, true
        }
        i := childIndex(items, key)
        n = children[i]
    }
    panic("too many retries")
}

Note the careful re-validation after every node read. The runtime.Gosched() on observing an odd version is a small backoff.

Combine with B-link's split protocol and you have a functional OLFIT in Go. The implementation is ~500 lines for a complete B+-tree.

For Go production, this is not worth the complexity vs publish/subscribe COW. For an in-memory database engine, it might be.

Hekaton-Class In-Memory Engines¶

Microsoft SQL Server's Hekaton (released 2014, refined since) is the canonical example of an in-memory engine using all the techniques in this file:

Bw-tree indexes for ordered access.
Hash indexes (separate from the trees) for point access.
MVCC with per-row version chains.
Optimistic transactions: each transaction tracks its read set and validates at commit.
No locks: pessimistic locking is replaced entirely by MVCC + OCC.
In-memory transaction log with delayed durability.

The architecture:

[Transaction] -> [In-memory rows with version chains] -> [Bw-tree indexes] -> [Mapping table]
                                          \
                                           -> [Optional hash index for point access]

Each row is a record with:

A pointer to the previous version.
XMin (created-by transaction ID).
XMax (deleted-by transaction ID, 0 if not deleted).
Column data.

Inserts create a new row version and update both indexes (Bw-tree and hash). Deletes set XMax and add a delete marker to indexes. Updates create a new version and link it.

Concurrent transactions validate at commit: their read set's versions must still be the latest (or at least visible to their snapshot). If validation fails, abort and retry.

This design eliminates pessimistic locking entirely; everything is optimistic. For high-throughput workloads with low conflict rates, it can sustain millions of transactions per second on a single multicore machine.

What we learn from Hekaton¶

Lock-free indexes are achievable with Bw-trees.
MVCC + OCC can replace pessimistic locking for OLTP.
In-memory engines have fundamentally different design points from disk-based engines.
Hardware matters: Hekaton is heavily tuned for NUMA, cache hierarchies, and TSX (in the era when it was available).

For Go, building a Hekaton-class engine is a multi-year project. The closest equivalents in Go are CockroachDB (which uses Pebble + serializable isolation) and TiDB / TiKV (Rust, not Go, but architecturally similar).

Comparative Engine Tour¶

A brief survey of how major database engines implement their indexes:

PostgreSQL nbtree¶

B+-tree with Lehman-Yao B-link protocol.
Per-page latches.
MVCC via transaction ID stamps on tuples.
Vacuum process reclaims dead versions.
WAL-protected for durability.

MySQL InnoDB¶

Clustered B+-tree (primary key is the data).
Per-page latches with deadlock detection.
MVCC via undo logs (an alternative to per-row version chains).
Adaptive Hash Index on top for very hot keys.

SQLite¶

B+-tree per table or index.
Single-writer (no concurrent writers).
WAL or rollback journal for durability.
Very different design point: embedded, low-concurrency.

Oracle TimesTen¶

B-tree and hash indexes.
In-memory; can be transient or durable.
Locking-based concurrency.

SAP HANA¶

Column-store main + delta merge architecture.
Row-store for OLTP via a different tree.
MVCC via tuple timestamps.
Heavily SIMD-optimized.

MS SQL Server Hekaton¶

Bw-trees + hash indexes.
MVCC + OCC.
Optimistic transactions.

Google Spanner¶

Distributed B-tree with TrueTime.
Multi-version with strict serializability.
Snapshot reads at any past timestamp.
Synchronously replicated across data centers.

CockroachDB¶

Pebble (LSM) as the storage engine.
Range-based sharding ("ranges" of ~64 MB each).
Multi-version via timestamped keys.
Distributed transactions via two-phase commit.

TiDB / TiKV¶

TiKV is the Rust storage layer, RocksDB-based.
TiDB is the SQL layer in Go.
Distributed transactions via Percolator protocol.

Aerospike¶

Hybrid hash + B-tree.
Persistent storage with SSD-aware access.
High throughput via lock-free reads.

MongoDB WiredTiger¶

B+-tree on disk with skiplist memtables.
MVCC via timestamps.
LSM mode also available.

Reading the source of any one of these is a multi-month commitment. Reading the design documents (READMEs, blog posts) for two or three of them is more achievable and educational.

Hardware-Aware Tree Design¶

At the professional level, hardware details directly shape protocol choices.

Cache lines¶

x86 cache line: 64 bytes.
ARMv8 cache line: 64 or 128 bytes.
POWER cache line: 128 bytes.

A B-tree node should ideally fit in one or two cache lines. A 64-byte node holds about 4-8 items depending on item size. For larger fanouts, you accept multiple cache lines but try to keep them aligned.

Go does not give you cache-line alignment directly, but runtime.Compactify-style tricks work. The simplest is:

type CacheLineNode struct {
    _    [0]byte // ensure alignment
    data [56]byte // header data
    _    [8]byte // padding to fill cache line
}

For Go tree libraries, cache-line alignment is mostly handled by Go's allocator already; explicit alignment helps only at the margins.

Prefetching¶

A CPU instruction prefetcht0 (Intel) loads a cache line into L1 before it is needed. For tree traversal, you can prefetch the next node while processing the current.

In Go, no direct prefetch instruction. Some tricks:

Read a non-essential field of the next node to trigger the load.
Use SIMD-style libraries that wrap intrinsics.

Real impact: maybe 10-20% on tight microbenchmarks. Rarely worth the complexity in Go.

Branch prediction¶

Modern CPUs predict branches with >95% accuracy on regular patterns. For tree traversal, the descent direction is data-dependent, so branch prediction often fails — each comparison is a potential misprediction.

Tricks to reduce mispredictions:

Use branchless code where possible (a = (cond ? x : y) becomes a CMOV).
Sort branches by likelihood; mark with likely/unlikely (no Go equivalent).
Use linear scan in small nodes; the predictor handles it well.

Go's compiler is conservative about branchless code; you may need to write subtle expressions that survive optimization.

NUMA¶

On multi-socket machines, memory accesses across sockets cost 2-3x more than local accesses. For an in-memory tree:

Pin goroutines to specific sockets (unix.SchedSetaffinity, requires syscall).
Allocate the tree on a specific NUMA node (numa_alloc_onnode, via cgo).
Replicate read-only data per socket.

For Go services on a single socket, NUMA does not matter. For very large servers, it does.

Persistent Memory¶

Intel Optane DCPMM (discontinued 2022) was the most prominent persistent memory product. Trees on persistent memory require:

Crash-consistency: ensure writes hit persistence before subsequent dependent writes.
clflushopt, clwb, pcommit instructions to flush cache lines.
Special protocols for safe in-place updates.

Real impact on Go: low. Most Go databases use OS-level persistence (write to file + fsync), not direct persistent memory access.

Profiling at the Microbenchmark Level¶

At the professional level, profiling tools must measure at the nanosecond scale.

`go test -bench` is the starting point¶

go test -bench=BenchmarkInsert -benchmem -benchtime=10s -count=5

-count=5 runs each benchmark 5 times so you can see variance. -benchtime=10s runs each iteration for 10 seconds to amortize startup.

`pprof` for CPU profiling¶

go test -bench=BenchmarkInsert -cpuprofile=cpu.prof
go tool pprof cpu.prof

In pprof, top10 shows the hottest functions. web (requires graphviz) shows the call graph. disasm shows annotated assembly.

For tree code, the hottest functions are typically findInItems, insertItemSorted, and the comparator. Optimizing them helps.

`runtime/trace` for scheduling and latency¶

go test -bench=BenchmarkInsert -trace=trace.out
go tool trace trace.out

Trace shows goroutine states, blocking events, GC pauses. Useful for diagnosing why a workload is not scaling with cores.

`perf` (Linux) for hardware counters¶

perf stat -e cache-misses,branch-misses,instructions ./bench

Shows L1/L2/LLC cache misses, branch mispredictions, IPC. For tuned tree code, you want:

IPC > 2.
Cache miss rate < 1% per instruction.
Branch misprediction < 1% per branch.

If any of these are off, optimize for cache or branches.

Microbenchmark discipline¶

A few rules:

Always run multiple iterations and check variance.
Always use b.RunParallel for concurrency benchmarks.
Always pin the GOMAXPROCS to avoid scheduler noise.
Watch out for compiler optimizations that elide the work you are trying to measure (use b.SetBytes and stash results).
Re-measure after each change; do not trust intuition.

For nanosecond-scale measurements, the noise floor matters. Take the median of 10 runs, not the mean.

Coding Patterns¶

At the professional level:

Pattern 1: Lock-free + epoch reclamation. For data structures that cannot be GC'd cleanly (unsafe.Pointer, cgo).

Pattern 2: Mapping-table indirection. Decouple logical IDs from memory pointers. Enables consolidation, migration, replication.

Pattern 3: Delta records + lazy consolidation. Write-once, read-many-then-fold.

Pattern 4: Per-thread caches. Each goroutine maintains its own short-lived cache that synchronizes occasionally with the global state.

Pattern 5: NUMA-aware sharding. Pin shards to NUMA nodes; route requests to the right shard based on key.

Pattern 6: HTM with mutex fallback. Optimistic fast path; pessimistic fallback. (Rarely applicable in Go.)

Pattern 7: Hardware-aware layout. Cache-aligned nodes, hot-cold field separation, branch-predictable comparators.

Clean Code at Engine Scale¶

Treat the data structure as an artifact. Document every invariant, every CAS, every memory ordering decision.
Pair every invariant with a runtime check (under a debug tag) that asserts it.
Use named lock types that signal intent: WriteLatch, ReadLatch.
No global state except for explicitly singleton resources (e.g., the mapping table).
Fuzz the protocol. Random sequence of operations + invariant check.
Model check the protocol (TLA+, Promela) if it is novel.

Engine code is shipped once, debugged forever. Discipline is everything.

Product Use¶

Realistic use cases for professional-level techniques:

An in-memory transactional engine (Hekaton, MemSQL).
A distributed database's local storage (CockroachDB, TiDB, FoundationDB).
A real-time analytics engine (HyPer, ClickHouse).
A search index (Elasticsearch's Lucene segments).
A high-frequency-trading order book.
An adtech bidding system.

In each case, the tree is the critical component, and squeezing the last percent of throughput is worth months of engineering.

For everything else: use tidwall/btree with publish/subscribe COW.

Performance Tips¶

Profile with hardware counters. Cache misses and branch mispredictions dominate.
Minimize allocations. Use arenas or pre-allocated pools.
Pack nodes tightly. Fewer cache lines per node = fewer cache misses.
Avoid pointer chasing. Inline what you can.
Use SIMD where available. (Hard in Go; requires assembly.)
NUMA-aware allocation. Pin to local memory.
Batch operations. Reduce per-call overhead.

Best Practices¶

Build on top of a published, well-tested protocol. Do not invent.
Test exhaustively, including under random fault injection.
Document the protocol's correctness argument.
Compare against published benchmarks; if you cannot beat the literature, do not deploy.
Have a backup plan if the advanced protocol underperforms.

Edge Cases and Pitfalls¶

Memory pressure under heavy COW. Bound the working set.
GC pauses under heavy allocation. Tune GC, consider non-Go languages for the inner loop.
NUMA imbalance. Shard to balance memory access.
Persistent memory crash consistency. Get this wrong and you lose data.
CAS livelock. A backoff is essential.
HTM aborts under cache pressure. Have a fast fallback.

Common Mistakes¶

Implementing a Bw-tree because it sounds cool.
Using HTM without a proper fallback.
Ignoring NUMA on a multi-socket machine.
Microbenchmarking with too few iterations.
Trusting your intuition about hot paths without profiling.
Building a custom tree when an established library exists.

Common Misconceptions¶

"Lock-free is always faster." Often the same or slower than fine-grained locking. Profile.
"HTM is the future." It was, briefly. Now it is a niche.
"In-memory engines do not need persistence." They do (write-ahead logging, periodic snapshots).
"Bigger fanout is always better." Diminishing returns; comparison cost grows.
"More concurrency is always better." No — coordinated coordination beats heroic concurrency.

Tricky Points¶

Mapping table resizing under concurrent access. A classic concurrent data structure problem.
Delta chain consolidation under contention. Two threads consolidating simultaneously.
Split delta visibility. Ensuring all readers see the split before the parent is updated.
Epoch garbage collection of long-lived objects. Bounding the epoch list.
HTM transaction size. Cache footprint limits.

Tricky Questions¶

Why does the Bw-tree's mapping table CAS make the protocol lock-free for writes?
How does ART's prefix compression interact with concurrent inserts?
Why is HTM rarely used in production?
What is the worst-case throughput of a Bw-tree under heavy write conflict on a single leaf?
How does Hekaton's optimistic transaction protocol handle write-write conflicts?
Why does MassTree split its tree by 8-byte key fragments?
What is the trade-off between consolidation threshold and read latency?
How does PostgreSQL's vacuum interact with concurrent updates?
Why is NUMA awareness essential for engines exceeding one socket's capacity?
What is the difference between MVCC via version chains and MVCC via undo logs?

For each, articulate the answer with hardware-level reasoning.

Test¶

For professional-level work, testing includes:

Unit tests with invariant assertions.
Property-based tests (e.g., quickcheck-style).
Fuzz tests for protocol robustness.
Long-running stress tests.
Hardware-specific tests (NUMA, multi-socket).
Comparison benchmarks against literature.

A professional codebase has thousands of tests covering all of these.

Cheat Sheet¶

Technique	Best for
Bw-tree	Lock-free in-memory engine indexes
Concurrent ART	Lock-free string-key indexes
OLFIT B-link	Lock-free B+-tree
HTM + fallback	Niche; not for Go
Hekaton-style MVCC + OCC	High-throughput OLTP

Self-Assessment Checklist¶

Explain the Bw-tree mapping table.
Walk through a delta chain lookup.
Describe ART's adaptive node sizes.
Compare OLFIT to plain hand-over-hand.
Explain Hekaton's optimistic transaction protocol.
Identify cache-line and NUMA impacts on a tree's performance.
Set up a hardware-counter-based profile.
Decide when (almost never) to deploy these techniques.

Summary¶

The professional level is where research meets engineering. Bw-trees, concurrent ART, HTM, OLFIT — these are the techniques at the bleeding edge of in-memory database engine design. They are powerful, expensive to implement, and rarely necessary outside their target use cases.

For Go in 2026, the professional-level conclusion is consistent with the senior file: publish/subscribe COW with tidwall/btree is the default. Reach for these advanced techniques only when you are building an in-memory database engine and the throughput targets demand them.

The value of this file is intellectual: you can now read any paper, debate trade-offs with database engineers, and architect engines at the apex of the design space.

Diagrams¶

Bw-tree mapping table¶

Mapping Table:
  ID=1 -> [Base node A]
  ID=2 -> [Delta(insert k=5, val=v) -> Delta(insert k=8, val=w) -> Base node B]
  ID=3 -> [Base node C]

Writer inserts k=10 into ID=2:
  new_delta = Delta(insert k=10, val=x, next=current head)
  CAS mapping[2] from old head to new_delta

The mapping table is the source of truth. Updates are CAS operations on table entries.

ART node sizes¶

Node4:   keys=[0x41, 0x42, 0x43, 0x44]  children=[*,*,*,*]
Node16:  keys=[16 sorted bytes]          children=[16 pointers]
Node48:  indices=[256 bytes mapping byte->slot]  children=[48 pointers]
Node256: children=[256 pointers, one per byte]

Adaptive resizing: when Node4 fills, replace with Node16; etc.

Appendix A: Bw-tree consolidation in implementation detail¶

Consolidation is the most subtle operation in a Bw-tree. Let's walk through it carefully.

When a delta chain reaches threshold length (typically 8-16 records), the next reader triggers consolidation:

func (t *BwTree) maybeConsolidate(id NodeID, n *Node) *Node {
    if chainLength(n) < threshold {
        return n
    }
    // Build the consolidated base.
    base := &Node{
        isDelta: false,
        isLeaf:  n.isLeafChain(),
    }
    // Walk the chain in reverse to apply deltas in order.
    deltas := collectChain(n)
    base.items = applyDeltas(deltas)
    base.highKey = n.highKey
    base.rightSib = n.rightSib
    // Attempt the CAS.
    if t.mapping[id].CompareAndSwap(n, base) {
        // We won. Retire the old chain.
        t.epoch.Retire(n)
        return base
    }
    // Lost. Someone else consolidated. Re-read.
    return t.mapping[id].Load()
}

func collectChain(n *Node) []*Delta {
    var deltas []*Delta
    for n.isDelta {
        deltas = append(deltas, n.delta)
        n = n.delta.Next
    }
    // n is now the base.
    return deltas
}

func applyDeltas(deltas []*Delta) []item {
    // Start from base.items (visible at the end of the chain).
    // Apply deltas in reverse (oldest first).
    base := deltas[len(deltas)-1].Next // the base node
    items := append([]item{}, base.items...)
    for i := len(deltas) - 2; i >= 0; i-- {
        d := deltas[i]
        switch d.Kind {
        case DKInsert:
            items = insertSorted(items, d.Key, d.Value)
        case DKDelete:
            items = removeKey(items, d.Key)
        }
    }
    return items
}

Subtleties:

Two concurrent consolidations. Both threads build a base; only one's CAS succeeds. The loser's work is wasted (could be wasted significantly if the chain is long). Mitigation: random backoff before consolidating, so consolidators don't all start simultaneously.
Concurrent insert during consolidation. While a consolidator is building the base, another writer adds a delta to the head of the chain. The consolidator's CAS sees the new head and fails. The consolidator must restart or give up.
Memory reclamation. The old chain (now retired) cannot be freed until no thread can see it. Epoch-based reclamation handles this. In Go, the GC handles it as long as no goroutine retains a reference to the old chain.
Split delta in the chain. If the chain contains a DeltaSplit, the consolidator must handle it specially: items >= pivot belong in the right sibling, not in the consolidated base.

A worked example¶

Start state: ID=42 -> [Base { items: [10, 20, 30], hk: 100 }]

Insert 15: - Allocate Delta(insert, 15). - Delta.Next = Base - CAS mapping[42] from Base to Delta - mapping[42] -> [Delta(insert, 15) -> Base]

Insert 25: - Allocate Delta(insert, 25). - Delta.Next = current head = Delta(insert, 15) - CAS - mapping[42] -> [Delta(insert, 25) -> Delta(insert, 15) -> Base]

After threshold inserts, consolidate: - Walk chain: collect [Delta(insert, 25), Delta(insert, 15), ..., Base] - Apply in reverse: start with Base.items [10, 20, 30], then insert 15 (yielding [10, 15, 20, 30]), then insert 25 (yielding [10, 15, 20, 25, 30]). - New base: items=[10, 15, 20, 25, 30], hk=100. - CAS mapping[42] from current head to new base. - Old chain is retired.

Reads after consolidation see the new base; reads in progress before consolidation see the old chain (their pointer to the old head is still valid until epoch advances).

Appendix B: Bw-tree split protocol¶

Splits in a Bw-tree are also delta records. Let's walk through a leaf split.

Initial state: leaf ID=42 is full with items [10, 20, 30, 40, 50, 60, 70, 80].

To split: 1. Allocate new leaf ID=43 with items [50, 60, 70, 80]. Install in mapping table. 2. Post a SplitDelta to ID=42: { Kind: DKSplit, Pivot: 50, SiblingID: 43, Next: current state of 42 }. 3. CAS mapping[42] to point to the SplitDelta.

After step 3, in-flight readers on ID=42 will see the SplitDelta and route to ID=43 for keys >= 50.

Update the parent: post a delta to the parent telling it about ID=43.
Eventually consolidate both ID=42 (which now has items [10, 20, 30, 40] plus the SplitDelta) and the parent.

The critical insight: between step 3 and step 4, readers can still find any key. If a key is < 50, it is in ID=42's items. If >= 50, the SplitDelta routes to ID=43. The parent has not yet been updated, but readers do not need it to be.

Why this is lock-free¶

No latches. Every state change is a CAS on the mapping table. Concurrent writers can:

Insert into ID=42 (adds a Delta on top of the SplitDelta). The reader sees the Delta first, then the SplitDelta.
Insert into ID=43 (adds a Delta to ID=43's chain).
Split ID=42 again (cascading split).

Each operation is local CAS. No global lock. Lock-free.

Why this is hard¶

Correctness requires careful protocol design. Consider:

A reader sees a SplitDelta but the parent has not yet been updated. The reader's traversal from the root would route through the parent to ID=42, then via the SplitDelta to ID=43. Fine.
A reader caches a parent pointer to ID=42 (e.g., during a range scan). The parent gets updated to also know about ID=43. The reader's cached pointer is now stale but not invalid — ID=42 still exists, just smaller. The reader can complete its current operation.
A consolidator is folding ID=42's chain (Delta -> SplitDelta -> Base). The consolidation must produce a base node with [10, 20, 30, 40] items and a SplitDelta still in the chain (because the SplitDelta affects the parent, not just the leaf).

Production Bw-tree implementations have all sorts of subtle corrections for cases like this. Microsoft's reference code has been refined over years.

Appendix C: Concurrent ART implementation sketches¶

A concurrent ART in Go would have the structure:

package art

import (
    "sync"
    "sync/atomic"
)

type NodeKind uint8

const (
    Node4 NodeKind = iota
    Node16
    Node48
    Node256
    Leaf
)

type Node struct {
    kind        NodeKind
    version     atomic.Uint64
    writeMu     sync.Mutex
    prefix      []byte
    numChildren uint16
    // Type-specific body. In Go we use a discriminated struct.
    body interface{}
}

type Node4Body struct {
    keys     [4]byte
    children [4]*Node
}

// ... and so on

The Get path uses OLFIT-style optimistic concurrency:

func (t *ART) Get(key []byte) ([]byte, bool) {
    for retries := 0; retries < 64; retries++ {
        n := t.root.Load()
        depth := 0
    walk:
        for {
            if n.kind == Leaf {
                leaf := n.body.(*LeafBody)
                if bytes.Equal(leaf.key, key) {
                    return leaf.value, true
                }
                return nil, false
            }
            v1 := n.version.Load()
            if v1&1 == 1 {
                runtime.Gosched()
                break walk
            }
            // Match prefix.
            if !matchPrefix(n.prefix, key, depth) {
                if n.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            depth += len(n.prefix)
            if depth >= len(key) {
                if n.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            child := findChild(n, key[depth])
            if child == nil {
                if n.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            if n.version.Load() != v1 {
                break walk
            }
            n = child
            depth++
        }
    }
    panic("too many retries")
}

Inserts take the per-node writeMu and increment the version counter twice (odd, mutate, even). Resizing is the tricky part: when a Node4 fills, allocate a Node16, copy the children, then CAS the parent's pointer.

The Go implementation gets to ~2-3M lookups/sec/core. A tuned C++ implementation can do 10-15M. The gap is mostly Go's lack of intrinsics for SIMD comparison.

Appendix D: A walk through Hekaton's optimistic transaction protocol¶

Hekaton (SQL Server's in-memory engine) uses MVCC + OCC. Let's walk through a transaction:

Transaction starts. Assign transaction ID T. Take a snapshot of the latest committed transaction ID S.
Reads. For each read, find the version with XMin <= S and (XMax == 0 OR XMax > S). Track the version pointer in the read set.
Writes. Create a new version with XMin = T. Mark the previous version with XMax = T. Update the index. Track the new version in the write set.
Validation phase. Before committing, scan the read set:
- For each version in the read set, verify it is still the latest visible version.
- If any read is stale (a newer version was committed by another transaction after S), abort.
Commit. Mark T as committed. The versions become permanently visible.
Cleanup. Old versions become garbage when no transaction can see them.

If the validation phase fails, the transaction aborts and is retried (possibly by the application).

Optimism vs pessimism¶

In a pessimistic system (most disk-based databases), every read takes a row lock to prevent concurrent writers. In Hekaton, no row locks. Conflicts are detected at commit time.

This works well when conflicts are rare. If two transactions update the same row, only one wins; the loser retries. For workloads with high conflict rates, the abort overhead becomes painful.

Implementing Hekaton in Go¶

A skeleton:

package hekaton

import (
    "sync"
    "sync/atomic"
)

type Version struct {
    Value []byte
    XMin  int64
    XMax  int64
    Next  *Version
}

type Row struct {
    Key    string
    mu     sync.Mutex
    Newest *Version
}

type Tx struct {
    ID      int64
    Snapshot int64
    Reads   map[string]*Version
    Writes  map[string]*Version
}

type Engine struct {
    mu         sync.Mutex
    rows       *btree.BTreeG[*Row]
    txCounter  atomic.Int64
    commitLog  map[int64]bool // committed transactions
    commitMu   sync.RWMutex
}

func (e *Engine) Begin() *Tx {
    return &Tx{
        ID:       e.txCounter.Add(1),
        Snapshot: e.lastCommitted(),
        Reads:    make(map[string]*Version),
        Writes:   make(map[string]*Version),
    }
}

func (e *Engine) Read(tx *Tx, key string) ([]byte, bool) {
    row, ok := e.findRow(key)
    if !ok {
        return nil, false
    }
    row.mu.Lock()
    defer row.mu.Unlock()
    for v := row.Newest; v != nil; v = v.Next {
        if v.XMin <= tx.Snapshot && e.isCommitted(v.XMin) &&
            (v.XMax == 0 || v.XMax > tx.Snapshot || !e.isCommitted(v.XMax)) {
            tx.Reads[key] = v
            return v.Value, true
        }
    }
    return nil, false
}

func (e *Engine) Write(tx *Tx, key string, value []byte) {
    row, ok := e.findRow(key)
    if !ok {
        row = &Row{Key: key}
        e.insertRow(row)
    }
    row.mu.Lock()
    defer row.mu.Unlock()
    if row.Newest != nil {
        row.Newest.XMax = tx.ID
    }
    v := &Version{Value: value, XMin: tx.ID, Next: row.Newest}
    row.Newest = v
    tx.Writes[key] = v
}

func (e *Engine) Commit(tx *Tx) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    // Validation phase.
    for key, oldV := range tx.Reads {
        row, _ := e.findRow(key)
        row.mu.Lock()
        // Is oldV still the visible version?
        visible := false
        for v := row.Newest; v != nil; v = v.Next {
            if v == oldV {
                if v.XMax == 0 || v.XMax == tx.ID {
                    visible = true
                }
                break
            }
        }
        row.mu.Unlock()
        if !visible {
            e.abort(tx)
            return false
        }
    }
    // Commit.
    e.commitMu.Lock()
    e.commitLog[tx.ID] = true
    e.commitMu.Unlock()
    return true
}

func (e *Engine) abort(tx *Tx) {
    // Roll back all writes.
    for key, v := range tx.Writes {
        row, _ := e.findRow(key)
        row.mu.Lock()
        if row.Newest == v {
            row.Newest = v.Next
            if v.Next != nil {
                v.Next.XMax = 0
            }
        }
        row.mu.Unlock()
    }
}

This is ~150 lines of skeleton code. Production Hekaton is ~50000 lines. But the architectural shape is the same: MVCC versions, optimistic validation, commit log.

Appendix E: A deep dive on NUMA-aware tree design¶

On a multi-socket server, a memory access to a remote socket can cost 2-3x more than a local access. For an in-memory tree:

Each socket has ~hundreds of GB of local memory.
Cross-socket traffic goes through QPI (Intel) or AMD's Infinity Fabric.
Allocations from a goroutine running on socket 0 should ideally live in socket 0's memory.

In Go, no built-in NUMA awareness. Options:

Pin goroutines to sockets via runtime.LockOSThread plus syscalls for CPU affinity.
Shard the tree by socket. Each socket has its own tree shard; queries are routed to the right shard.
Use cgo to call numa_alloc_onnode for explicit NUMA-local allocation.

For Go services on a single socket (which is most), NUMA does not matter. For very large servers (32+ cores spanning multiple sockets), NUMA-aware sharding can yield 2x throughput.

A NUMA-aware sharded store¶

type NUMAStore struct {
    perSocket [numSockets]*Store
}

func (s *NUMAStore) Get(key string) (Item, bool) {
    socket := currentSocket() // syscall to find which CPU we're on
    return s.perSocket[socket].Get(key)
}

The catch: if the goroutine migrates between sockets, the routing is wrong. Pin with runtime.LockOSThread to prevent migration.

NUMA awareness is rare in pure Go. Most Go services that need it call out to C (CGo, eBPF) or use a different language for the inner loop.

Appendix F: Persistent memory and crash consistency¶

Intel Optane DCPMM (now discontinued) offered byte-addressable persistent memory. Trees on persistent memory required:

Atomic 8-byte writes (the natural unit of persistence).
clflushopt / clwb / sfence instructions to ensure cache lines reach persistence before subsequent dependent writes.
Crash recovery protocols to handle partial writes.

Most database engines do not use persistent memory directly; they write to a WAL on SSD. But research engines (SAP HANA's NVM-aware mode, Microsoft's PMDK-based work) explore direct persistent-memory storage.

For Go, persistent memory is essentially unavailable. The hardware was discontinued; the libraries (PMDK) are C-only.

The interesting algorithmic content remains relevant: techniques like epoch-based persistent atomic updates are applicable to any storage with similar characteristics (e.g., journaled filesystems with append-only semantics).

Appendix G: Profiling with hardware counters¶

For nanosecond-scale tuning, hardware counters are essential. On Linux:

perf stat -e cache-misses,cache-references,branch-misses,branches,instructions,cycles ./bench

Output:

   12,345,678 cache-misses          # 5.43% of cache references
  227,123,456 cache-references
      567,890 branch-misses         # 0.34% of branches
  166,789,012 branches
1,234,567,890 instructions          # 1.87 insn per cycle
  660,123,456 cycles

For tuned tree code, you want:

IPC > 2 (modern CPUs can issue 4+ instructions per cycle with ILP).
Cache miss rate < 1% per cache reference.
Branch miss rate < 1% per branch.

If IPC is low, look for instruction-level parallelism opportunities or dependency chains. If cache misses are high, improve data layout or prefetching. If branch misses are high, restructure conditionals or use branchless code.

For Go, the relevant flags are -gcflags=-S to see assembly, pprof for high-level profiling, and perf for hardware counters.

Appendix H: A long worked example — implementing OLFIT in Go¶

Let's build a complete OLFIT B+-tree in Go for int64 keys to string values. About 600 lines. This is at the limit of what is reasonable to attempt in Go.

package olfit

import (
    "runtime"
    "sort"
    "sync"
    "sync/atomic"
)

const fanout = 32

type Node struct {
    version  atomic.Uint64
    writeMu  sync.Mutex
    isLeaf   bool
    items    []item
    children []*Node
    highKey  int64
    rightSib *Node
}

type item struct {
    Key int64
    Val string
}

type Tree struct {
    rootMu sync.Mutex
    root   atomic.Pointer[Node]
}

const maxKey = int64(^uint64(0) >> 1)

func New() *Tree {
    t := &Tree{}
    t.root.Store(&Node{isLeaf: true, highKey: maxKey})
    return t
}

func (n *Node) beginWrite() {
    n.writeMu.Lock()
    n.version.Add(1) // now odd
}

func (n *Node) endWrite() {
    n.version.Add(1) // now even
    n.writeMu.Unlock()
}

func (t *Tree) Get(key int64) (string, bool) {
restart:
    for retries := 0; retries < 64; retries++ {
        n := t.root.Load()
        for {
            v1 := n.version.Load()
            if v1&1 == 1 {
                runtime.Gosched()
                continue restart
            }
            // Read all fields.
            isLeaf := n.isLeaf
            items := n.items
            children := n.children
            highKey := n.highKey
            rightSib := n.rightSib
            // Re-validate.
            v2 := n.version.Load()
            if v2 != v1 {
                continue restart
            }
            // Right-walk if key > highKey.
            if key > highKey && rightSib != nil {
                n = rightSib
                continue
            }
            if isLeaf {
                i, found := findItem(items, key)
                if found {
                    return items[i].Val, true
                }
                return "", false
            }
            i := childIndex(items, key)
            n = children[i]
        }
    }
    panic("too many retries")
}

func findItem(items []item, key int64) (int, bool) {
    i := sort.Search(len(items), func(i int) bool { return items[i].Key >= key })
    if i < len(items) && items[i].Key == key {
        return i, true
    }
    return i, false
}

func childIndex(items []item, key int64) int {
    return sort.Search(len(items), func(i int) bool { return items[i].Key >= key })
}

func (t *Tree) Set(key int64, val string) {
    t.rootMu.Lock()
    n := t.root.Load()
    n.beginWrite()
    if len(n.items) == fanout {
        // Grow root.
        newRoot := &Node{
            isLeaf:   false,
            highKey:  maxKey,
            children: []*Node{n},
        }
        newRoot.beginWrite()
        t.root.Store(newRoot)
        splitChild(newRoot, 0)
        n.endWrite()
        n = newRoot
    }
    t.rootMu.Unlock()
    insertNonFull(n, key, val)
}

func splitChild(parent *Node, i int) {
    child := parent.children[i]
    // child must already be write-latched
    mid := len(child.items) / 2
    medianKey := child.items[mid].Key
    right := &Node{
        isLeaf:   child.isLeaf,
        highKey:  child.highKey,
        rightSib: child.rightSib,
    }
    right.items = append(right.items, child.items[mid:]...)
    if !child.isLeaf {
        right.children = append(right.children, child.children[mid+1:]...)
        child.children = child.children[:mid+1]
    }
    child.items = child.items[:mid]
    child.highKey = medianKey - 1
    child.rightSib = right
    // Insert medianKey into parent.
    parent.items = append(parent.items, item{})
    copy(parent.items[i+1:], parent.items[i:])
    parent.items[i] = item{Key: medianKey}
    parent.children = append(parent.children, nil)
    copy(parent.children[i+2:], parent.children[i+1:])
    parent.children[i+1] = right
}

func insertNonFull(n *Node, key int64, val string) {
    i, found := findItem(n.items, key)
    if found {
        n.items[i].Val = val
        n.endWrite()
        return
    }
    if n.isLeaf {
        n.items = append(n.items, item{})
        copy(n.items[i+1:], n.items[i:])
        n.items[i] = item{Key: key, Val: val}
        n.endWrite()
        return
    }
    child := n.children[i]
    child.beginWrite()
    if len(child.items) == fanout {
        splitChild(n, i)
        if key > n.items[i].Key {
            child.endWrite()
            child = n.children[i+1]
            child.beginWrite()
        }
    }
    n.endWrite()
    insertNonFull(child, key, val)
}

This is a working OLFIT B+-tree in ~150 lines. Reads use version-counter optimism; writes use exclusive latches with preemptive splits. Right-sibling pointers handle concurrent splits gracefully.

Stress test it under -race and you will see it works correctly. Benchmark it and compare to publish/subscribe COW; on read-heavy workloads OLFIT may win by 20-30%, but the implementation complexity is higher.

For pure-Go production code, COW publish/subscribe wins on the simplicity/performance trade-off. OLFIT is for engine work.

Appendix I: Detailed comparison — Bw-tree vs COW B-tree vs OLFIT¶

Property	Bw-tree	COW B-tree	OLFIT B+-tree
Reads	Lock-free	Lock-free	Lock-free
Writes	Lock-free	Single writer	Per-node latch
Snapshot	Implicit via deltas	Explicit O(1) clone	Not natural
Memory per write	One delta record	log(n) cloned nodes	None (in-place)
GC pressure	Low (small deltas)	Medium (cloned nodes)	None
Implementation difficulty	Very high	Medium	High
Cache friendliness	Medium (chain walk)	High	High
NUMA	Hard	Easy (shard)	Hard
Throughput at 64 cores	30-50M ops/sec	10-30M ops/sec	30-50M ops/sec

The Bw-tree and OLFIT are competitive for in-memory engines. COW is simpler and good enough for general-purpose Go services.

Appendix J: Lessons from the published literature¶

A few overarching lessons from 40 years of B-tree concurrency research:

Right-sibling pointers are unreasonably useful. Every modern B-tree has them.
Latches must be short. Long critical sections kill scalability.
Reads should not take latches if possible. Use OCC or COW.
Writes serialize at some level. The question is at what granularity.
Memory reclamation is the hidden complexity. Get it wrong and you have memory leaks or use-after-free.
Hardware matters. Cache lines, branch prediction, NUMA all shape the optimal design.

These lessons are the same whether the underlying language is C, C++, Go, or Rust. The protocols transcend their implementations.

Appendix K: A closing meditation¶

You have read four files totaling ~15000 lines on concurrent trees in Go. You know more about this topic than 99.99% of working programmers.

What now?

If you are building a database engine: deploy what you have learned. The advanced techniques will earn their keep.

If you are building anything else: forget most of it. Use tidwall/btree with publish/subscribe COW. The fancy techniques are wasted on you.

The point of professional-level study is not always to apply the techniques. It is to understand the design space deeply enough that when you encounter a tree-like structure in any domain — file systems, search indexes, network routing, computational geometry — you can immediately see its protocol choices and their trade-offs.

Concurrent ordered data is one of the deepest, most beautiful, and most useful areas of computer science. Carry your knowledge with respect for its difficulty and discipline in its application.

Onward.

Appendix L: Final exercises for professional-level mastery¶

Do at least three of these:

Implement a Bw-tree skeleton in Go. Get reads, writes, and consolidation working. Skip splits if you must.
Implement an OLFIT B+-tree. Stress-test it. Compare throughput against tidwall/btree.
Implement a concurrent ART. Compare to a hash map for string-heavy workloads.
Read Microsoft's open-source Bw-tree code (BzTree research repo). Write a 1000-word summary of what you learn.
Read PostgreSQL's src/backend/access/nbtree/ end-to-end. Identify three optimizations that would be irrelevant in pure Go.
Design (without coding) a Hekaton-class engine in Go. Identify what Go's GC simplifies and what it complicates.
Profile a tidwall/btree-based store with perf on Linux. Identify the top cache miss source and propose a fix.
Implement NUMA-aware sharding for a Go server. Measure the throughput improvement on a multi-socket machine.
Reproduce one published Bw-tree benchmark. Compare your numbers to the paper's.
Write a TLA+ specification for the Lehman-Yao B-link split protocol. Verify it.

Each of these is a multi-week project. None is necessary for typical Go work. All deepen understanding meaningfully.

Appendix M: A final closing¶

This section has covered, in order:

Junior: B-trees, basic concurrency, google/btree, tidwall/btree, sync.Mutex, sync.RWMutex.
Middle: Hand-over-hand, OCC, COW, B-link, publish/subscribe, sharding.
Senior: Lehman-Yao, MVCC, RCU, immutable persistent trees, engineering tour of PostgreSQL / bbolt / Pebble.
Professional: Bw-trees, concurrent ART, HTM, OLFIT, Hekaton, hardware-aware design, microbenchmarking.

You have covered the full spectrum from "use one mutex" to "design a lock-free B-tree with delta chains for an in-memory database engine."

The practical takeaway, repeated through every file: use tidwall/btree with publish/subscribe COW. The other 99% of this material is either intellectual context or for the rare case when that pattern is genuinely insufficient.

When you can hold all four files in your head simultaneously, choose the right technique for each problem, and articulate your choice with measurement-backed reasoning, you have mastered concurrent ordered data structures in Go.

That mastery is rare and valuable. Carry it well.

Appendix N: Detailed walk through Microsoft's BzTree open-source code¶

Microsoft published BzTree, a latch-free B+-tree that is a simplified spiritual successor to the Bw-tree. Reading the code (available on GitHub) is instructive.

Key design points:

No mapping table. BzTree uses CAS directly on node pointers, simpler than Bw-tree.
Persistent memory aware. Operations are designed for Optane DCPMM, with explicit cache-line flushes.
8-byte atomic operations. Each metadata word fits in 8 bytes, allowing single-CAS state transitions.
Single-pass cooperative protocols. Multiple threads may attempt the same structure modification; cooperation rather than mutual exclusion.

The code is ~10000 lines of C++ with heavy use of templates and intrinsics. The Go equivalent would lose some performance (no intrinsics) but the architectural patterns transfer.

For Go in 2026, BzTree's main lesson is: even Microsoft Research has retreated from the Bw-tree's complexity. BzTree is simpler and yet still performant. The lesson is universal: avoid unnecessary complexity, even at the bleeding edge.

Appendix O: Concurrent ART deep dive — the rationale¶

Why is ART a big deal? The 2013 paper showed that ART matched or beat hash maps on point lookup throughput while supporting range queries and ordered iteration. This is unusual: hash maps usually win on point access, trees on ordered access. ART found a sweet spot.

The key technical insights:

Radix partitioning is O(1) per level instead of O(log fanout) for B-trees.
Adaptive node sizes avoid wasted memory on sparse subtrees.
Prefix compression dramatically shortens average path length.
Cache-friendly layout with all metadata in the first cache line.

For workloads with long string keys (URLs, user IDs, file paths), ART is genuinely faster than any B-tree. For workloads with short integer keys, the gap shrinks.

When Go can leverage ART¶

Real Go applications that would benefit:

URL routers in HTTP servers. Each path is a string; routing is a prefix match.
IP routing tables. Each entry is an IP prefix; longest-prefix-match is exactly what radix trees do.
Auto-complete services. Prefix scans are native to ART.
Code completion / search. Same.

A pure Go ART (single-threaded or COW-wrapped) is feasible. The library github.com/plar/go-adaptive-radix-tree is the starting point; making it concurrent requires significant work.

A concurrent ART implementation hint¶

The trickiest concurrent ART operation is node resize (e.g., Node4 → Node16). Approach:

Allocate the new larger node.
Copy children from the old node, holding the write latch on the old.
Atomically update the parent's pointer to point to the new node.
Retire the old node to an epoch reclaimer (or let Go's GC handle it).

If the parent has not seen the resize yet (a reader is still on the parent), the parent's pointer still points to the old node. The old node is unmodified (we only read it during the copy), so the reader's lookup completes correctly.

The hazard: if the resize replaces a Node256 with a Node48 (downsize), and a reader is mid-lookup on the Node256, the downsize may produce a Node48 that does not contain the reader's key (it was removed). This requires either:

Refcounting + epoch reclamation: the old Node256 stays alive until no reader is on it.
Versioning + retry: the reader's version check fails after the downsize, restart.

Both are valid. Real implementations choose one and document it carefully.

Appendix P: Hekaton's index design¶

Hekaton offers two index types:

Hash index for point lookups (WHERE id = 42).
Range index (Bw-tree) for ordered access (ORDER BY id, WHERE id BETWEEN x AND y).

Both indexes reference the same underlying row versions. A row with XMin=100, XMax=200 is reachable through both its hash index entry and its range index entry. When the row is deleted (XMax set), both indexes still hold its entry until garbage collection.

Each index is per-table; you can have many indexes on the same table. Each is a separate Bw-tree (or hash table) with its own concurrency control.

The interesting bit: writes update all indexes atomically with respect to readers. A transaction creates a new row version, then updates the hash index and all range indexes to point to it. Readers see one version consistently across all indexes.

The protocol: indexes are updated before the row's XMin is finalized (it remains at the transaction's ID, not committed). A reader who finds the new index entry checks the row's commit status; if not committed, treats the entry as invisible.

This pattern — atomic-with-respect-to-readers via per-row commit status — is the same trick PostgreSQL uses, generalized to multi-index lookups.

Appendix Q: A reference architecture for an in-memory transactional engine in Go¶

Putting everything together, here is a reference architecture for a high-throughput in-memory transactional engine in Go:

[Application]
     |
     v
[Transactional API]
     |
     +---[Tx Manager: assigns IDs, tracks active set]
     +---[Validator: read-set / write-set validation]
     +---[Commit Log: in-memory + WAL]
     |
     v
[Per-table Engine]
     |
     +---[Primary Index: tidwall/btree with COW publish/subscribe]
     +---[Secondary Indexes: ditto, one per index]
     +---[Row store: map[primaryKey]*Row with version chains]
     +---[GC: background goroutine reclaims dead versions]
     |
     v
[WAL writer: appends commit records to disk]

Operations:

Begin: assign tx ID, snapshot the engine's commit log.
Read: find row via index, walk version chain.
Write: create new version, update all indexes (under their respective locks), record in tx's write set.
Validate: check each read's version is still latest visible.
Commit: append to WAL, mark tx committed, publish snapshots.
GC: periodically reclaim versions older than oldest active tx.

This architecture in pure Go is achievable. Throughput: ~100k transactions/sec on a 16-core machine, ~1M-10M point reads/sec.

For comparison, Hekaton achieves ~1M tx/sec and ~50M point reads/sec. The gap is mostly:

Go's GC vs. manual memory management.
Go's lack of intrinsics for SIMD.
Go's runtime overhead vs C++ inlining.

For most applications, the Go version is fast enough. For very-high-throughput OLTP, native code wins.

Appendix R: A study guide for the database engine literature¶

To go further than this file, read these papers in order:

Lehman, Yao (1981). B-link tree. The foundation.
Cha, Hwang et al. (2001). OLFIT. Optimistic latch-free B-tree.
Bender et al. (2005). Cache-oblivious B-trees. Performance via layout.
Mao, Kohler, Morris (2012). MassTree. Trie of B+-trees.
Levandoski, Lomet, Sengupta (2013). Bw-tree. Lock-free in-memory B-tree.
Leis, Kemper, Neumann (2013). ART. Adaptive radix tree.
Diaconu et al. (2013). Hekaton. SQL Server's in-memory engine.
Larson et al. (2011). High-performance concurrency control. The MVCC + OCC paper.
Faleiro, Abadi (2015). Rethinking serializable. Distributed transactions.
Wang et al. (2018). Building a Bw-tree. Lessons from real implementation.

Read each once for breadth, then once more for depth. After this, you can read essentially any concurrent ordered data structure paper.

Appendix S: A note on testing engine-grade code¶

Engine code requires testing rigor far beyond ordinary application code:

Unit tests for each operation in isolation.
Property-based tests (random sequences vs. a reference implementation).
Fuzz tests for adversarial inputs.
Long-running soak tests to find slow-growing memory leaks.
Concurrency stress tests with many goroutines under -race.
Crash tests that kill the process at random points and verify recovery.
Performance regression tests that flag throughput drops.
Hardware-specific tests (NUMA, multi-socket).

A Hekaton-class engine has tens of thousands of tests. A serious Go engine should aim for similar coverage.

For property-based testing in Go, github.com/leanovate/gopter or pgregory.net/rapid are good choices.

Appendix T: Final coding rules for engine-level code¶

Every CAS must have an explicit memory ordering rationale.
Every retry loop must have a bounded count.
Every allocation must be measured.
Every public API must have a clear contract.
Every protocol must have a written correctness argument.
Every optimization must have a benchmark.

These rules are tight. They are also what separates engine-grade code from typical application code.

Appendix U: A final architectural sketch — a sharded persistent transactional store¶

For a final, complete sketch combining everything:

package txstore

import (
    "sync"
    "sync/atomic"

    "github.com/tidwall/btree"
)

type Version struct {
    Value []byte
    XMin  int64
    XMax  int64
    Next  *Version
}

type Row struct {
    Key      string
    mu       sync.Mutex
    Newest   *Version
}

func less(a, b *Row) bool { return a.Key < b.Key }

type Shard struct {
    mu       sync.Mutex
    live     *btree.BTreeG[*Row]
    snapshot atomic.Pointer[btree.BTreeG[*Row]]
}

type Engine struct {
    shards    [16]Shard
    seed      maphash.Seed
    txCounter atomic.Int64
    commits   sync.Map // txID -> commitTimestamp
    walFile   *os.File
    walMu     sync.Mutex
}

// Many lines elided.

The full implementation is in the senior file's Appendix BA (about 200 lines). At professional level, you would add:

WAL with crash recovery.
Bounded snapshot lifetime.
NUMA-aware sharding.
Adaptive shard count.
Per-shard cond vars for wake-on-commit subscribers.
Distributed extensions (replication, consensus).

Each addition is another ~500-1000 lines. A complete production system is ~50,000 lines.

But the architecture — sharded MVCC + COW publish/subscribe + WAL — fits in two paragraphs. That is the value of senior/professional-level thinking: complex systems are built from a few simple architectural patterns, applied with discipline.

Appendix V: An honest assessment of Go for engine work¶

A few honest observations:

Go is fine for in-memory transactional stores up to ~1M tx/sec. Above that, native code wins.
Go's GC is excellent but not zero-cost. STW pauses interact with latency budgets.
Go's scheduler is excellent for concurrency but adds overhead. Sub-microsecond critical sections feel the goroutine boundary.
Go's lack of intrinsics costs ~30% on cache-friendly inner loops. Acceptable for most uses; painful at the limit.
Go's ecosystem of tree libraries is good but not deep. No equivalent of C++'s libARTC or RocksDB-class native code.

Real-world large-scale Go databases (CockroachDB, TiDB layer) use Go for orchestration and either Pebble (Go) or RocksDB (C++) for storage. The pattern is well-established.

If you are building a real engine, evaluate Rust or C++. If you are extending an existing Go system or building a high-throughput in-memory cache, Go is fine.

Appendix W: The final professional summary¶

You have now read everything I can teach you about concurrent ordered data structures in Go. The road ahead is:

Build. Implement at least one of the appendix exercises.
Profile. Measure on real hardware.
Read. The papers, the source code of real systems.
Repeat. Concurrent data structures are a lifelong study.

A few final encouragements:

The fundamentals are forever. Hand-over-hand, OCC, COW, B-link have been with us for 40+ years and will be with us for 40+ more.
Hardware changes the trade-offs but not the principles. Persistent memory came and went; HTM came and went; NUMA persists. The protocols adapt; the principles stay.
The library wins. Almost always, the right move is to use tidwall/btree or bbolt or Pebble. Build your own only when you genuinely cannot avoid it.
The simplest pattern is usually right. Publish/subscribe COW handles 95% of needs.

Welcome to the apex of concurrent tree expertise in Go. From here, you write the next paper, ship the next engine, or, more likely, deploy the simple pattern to solve real problems for real users.

Either way: well done. Go build something.

Appendix X: A deep look at PostgreSQL's nbtree page format¶

PostgreSQL's B+-tree pages have a specific layout designed for both read performance and crash safety. Understanding it shows what a production-grade tree page actually looks like.

A nbtree page:

Offset 0:   PageHeaderData (24 bytes)
Offset 24:  ItemIdData[] - pointers to items, sized as needed
            ...
            (free space)
            ...
            Items - actual key-value or routing data
Offset 8176: BTPageOpaqueData (16 bytes) - per-page metadata
Offset 8192: end of 8KB page

Each item is variable-length. ItemIdData points to (offset, length) within the page. Items grow downward from the bottom; ItemIdData grows upward from the header. They meet in the middle when the page fills.

BTPageOpaqueData at the page's tail holds:

Left and right sibling page IDs.
The tree level (0 = leaf).
Flags (leaf? root? deleted?).
The vacuum cycle ID.

This compact layout means:

One page = one disk I/O.
Cache-line aligned. The header and opaque are typically in the same cache line.
High keys are stored as the first item, not a separate field — saves 4-8 bytes per page.

For a Go in-memory version, you would not need disk-aligned layouts, but the design pattern (header + opaque + items) is still useful. It centralizes the "this is a tree node" metadata in known offsets, making code clearer.

Reading nbtree's page split¶

The split implementation in nbtinsert.c:

static void _bt_split(...) {
    // Acquire write-lock on right sibling for high-key update
    ...

    // Allocate a new page (the new right sibling)
    rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
    rpage = BufferGetPage(rbuf);

    // Copy upper half of items to right page
    for (i = firstright; i <= maxoff; i++) {
        ItemId itemid = PageGetItemId(origpage, i);
        // ... copy item to rpage ...
    }

    // Update high keys and sibling pointers
    rpage_opaque->btpo_next = origpage_opaque->btpo_next;
    rpage_opaque->btpo_prev = origpagenumber;

    // CRITICAL: publish new right sibling pointer
    origpage_opaque->btpo_next = newrightpagenumber;

    // Write WAL record for crash recovery
    XLogInsert(...);

    // Insert new pivot into parent
    _bt_insert_parent(...);
}

The publication point — when origpage_opaque->btpo_next is set — makes the new right sibling visible to any reader at the leaf. This is the Lehman-Yao key insight implemented at byte-precision.

The WAL log ensures that if PostgreSQL crashes mid-split, recovery can re-execute the split from the log records.

Appendix Y: A detailed look at HyPer's index architecture¶

HyPer (University of Munich) is a research database showcasing in-memory techniques. Its index is concurrent ART with several refinements:

8-byte node keys. Each ART node level handles one byte; HyPer batches 8 bytes into a level for cache efficiency.
Lazy expansion. Single-child paths are not expanded until necessary.
Path compression. Consecutive single-child levels collapse into a single path-compressed node.
OLFIT-style concurrency. Version counters per node; readers validate.
NUMA-local allocation. Each socket allocates from its local memory pool.

These optimizations stack: HyPer's ART achieves ~50M lookups/sec/core on modern hardware, beating even the best B-tree implementations.

For Go, none of these are directly applicable without significant infrastructure work (NUMA-aware allocators, intrinsics for path compression). The conceptual lesson is: at the apex of performance, every layer must be optimized — algorithms, layout, allocation, concurrency, hardware.

Appendix Z: A close look at the Bw-tree consolidation race¶

A subtle race in Bw-tree consolidation:

Thread A starts consolidating ID=42's chain. It walks the chain, builds a fresh base node.
Thread B (concurrently) inserts into ID=42, adding a new delta to the chain head.
Thread A finishes building its fresh base. It does a CAS: mapping[42].CAS(old_chain_head, fresh_base)
The CAS fails because Thread B's delta changed the head.
Thread A's fresh base is discarded.

What was lost: Thread A's work (the time to walk the chain and build the base). What was preserved: correctness — the chain now has Thread B's delta on top, and Thread A's view did not become permanent.

To handle this, real Bw-tree implementations:

Bound the consolidation work to avoid huge waste.
Use randomized backoff so consolidators don't all start at once.
Allow multiple in-flight consolidations to share work (much harder).
Trigger consolidation from many places (read, write, periodically).

This is the kind of subtlety that makes Bw-trees hard. Reading the paper does not prepare you for the implementation difficulty.

Appendix AA: A deep look at Pebble's memtable¶

Pebble's memtable is a skip list with several Go-specific optimizations.

// Conceptual sketch.
type Skiplist struct {
    arena    *Arena
    head     *node
    height   int32
}

type node struct {
    keyOff   uint32
    keyLen   uint32
    valueOff uint32
    valueLen uint32
    // Followed by [height]nextPtr in the arena
}

type Arena struct {
    buf [1 << 28]byte // 256 MB
    off atomic.Uint64
}

Key features:

Arena allocation. All node memory comes from a fixed-size byte slice. The off field is the high-water mark, advanced via CAS.
No GC pressure. Bumping off is O(1); the arena is discarded as a unit when the memtable flushes.
Variable-height nodes. Skip list nodes have random heights; the height is stored after the fixed header in the arena.
CAS-based linking. Each next[] pointer is updated via CAS during insertion.

Reads are lock-free: walk the skip list from head, following forward pointers. Writes are wait-free per-node (CAS to link a new node) but may retry under contention.

The arena is a Pebble-specific optimization. For pure Go services, an arena package (proposed but not yet stable) would help; until then, custom byte slices serve.

Appendix AB: The full taxonomy of in-memory ordered structures¶

A comprehensive list:

Balanced BSTs: AVL, red-black, treap, splay tree, scapegoat tree, weight-balanced.
B-trees: B-tree, B+-tree, B*-tree, B-link tree, prefix B-tree.
Radix tries: trie, PATRICIA trie, ART, judy array.
Hash + ordered: hash trie (HAMT, CTrie), ordered hash table.
Skip lists: plain, deterministic, biased, ProbXX.
Heaps with ordered ops: Fibonacci heap, pairing heap (mostly unordered).
Specialized: segment tree, interval tree, R-tree, kd-tree, quadtree.
Persistent: finger tree, 2-3 finger tree, RB tree (functional), HAMT.

For each, there exists at least one published concurrent variant. The total literature is vast.

For Go in 2026, the practical defaults are:

B+-tree (B-link, OLFIT) — tidwall/btree or google/btree wrapped in publish/subscribe COW.
Skip list — Pebble's internal/arenaskl if you want lock-free.
HAMT — pure functional ordered persistent map.
R-tree — spatial queries, dhconnelly/rtreego.

The rest are research vehicles or special-purpose.

Appendix AC: A deep dive on hardware transactional memory in 2026¶

HTM had a turbulent history:

Intel TSX (2013): bugs forced disablement on consumer CPUs.
IBM z architecture (2014): production HTM on mainframes.
ARM PSTL2 (2019 proposal): never widely deployed.
Intel TSX, post-2018: enabled on server CPUs but not consumer.
Post-Spectre (2018): TSX involved in side-channel attacks, microcode mitigations.

In 2026, HTM is:

Available on some Intel Xeon CPUs.
Rarely used in production.
Not exposed in Go directly.

The interesting algorithmic content remains: HTM is essentially "optimistic execution at hardware level." The same principle applies to software protocols (OCC, RCU). HTM's hardware support adds speed but the protocol pattern survives.

If you ever need HTM in Go: use cgo to call assembly. Maintain a mutex fallback. Limit the critical section to one or two cache lines. Test extensively under contention.

For 99% of Go work: ignore HTM. Use mutex or atomic operations.

Appendix AD: A look at TiKV's storage architecture¶

TiKV (the Rust storage layer beneath TiDB) is illuminating because it solves the same problems as Go databases but in Rust, with different trade-offs.

TiKV uses:

RocksDB as the underlying KV store (C++).
Multi-version concurrency via timestamped keys.
Raft for replication.
Percolator-style distributed transactions.
Region-based sharding (~96 MB each).

Rust gives TiKV:

Zero-cost abstractions (no GC).
Explicit memory ownership.
Strong type guarantees for concurrency (Send/Sync).

For comparable Go work, you would use Pebble (also Go) as the storage layer, but you would accept some GC overhead. Most Go services do this and are fine.

The lesson: pick your language based on the latency budget. Sub-microsecond engines lean toward Rust/C++; ten-microsecond engines fit comfortably in Go.

Appendix AE: A deep look at FoundationDB's storage¶

FoundationDB is one of the most architecturally clean distributed databases. Its storage engine is interesting:

Each storage server holds a key range.
B+-tree per server with sliding-window MVCC.
All transactions are serializable (no snapshot isolation).
A central transaction sequencer assigns commit timestamps.

The clean serialiability is achieved by:

Optimistic transactions.
Conflict detection at commit (sliding-window check).
All-or-nothing commit via the resolver.

For Go, FoundationDB exposes its data layer via the FDB client library. You can build on top, treating FDB as your storage; Go application code remains pure.

If you ever build a distributed database in Go, study FoundationDB's design. It is the cleanest in the industry.

Appendix AF: A final reference architecture¶

For "very high-throughput, ordered, concurrent, in-memory" workloads in Go:

+----------------------------------+
| Per-shard COW publish/subscribe |
|                                  |
| 16 shards, hashed by key         |
| Each shard: tidwall/btree        |
| Each shard: sync.Mutex for write |
| Each shard: atomic.Pointer for   |
|   snapshot                       |
+----------------------------------+
                |
                v
+----------------------------------+
| MVCC layer                       |
|                                  |
| Per-row version chain            |
| Read: walk chain to visible      |
| Write: prepend new version       |
+----------------------------------+
                |
                v
+----------------------------------+
| Transaction layer (optional)     |
|                                  |
| Tx ID counter                    |
| Snapshot at Begin                |
| Validation at Commit             |
+----------------------------------+
                |
                v
+----------------------------------+
| WAL (optional, for persistence)  |
|                                  |
| Append-only file                 |
| fsync on commit                  |
| Replay on recovery               |
+----------------------------------+

Each layer is independently testable and tunable. The total implementation in Go is ~3000-5000 lines for a complete OLTP-style in-memory engine. Throughput: 100k-1M tx/sec on modern hardware.

This is what a senior/professional Go engineer can deliver. It is not Hekaton-class (which is C++ and tightly tuned), but it is excellent for most real-world needs.

Appendix AG: Closing thoughts on the professional level¶

The professional level is where research and engineering meet. You can:

Read any paper on concurrent ordered data structures.
Implement Bw-trees, OLFIT, concurrent ART.
Architect engine-grade systems.
Profile at hardware-counter granularity.
Choose the right protocol for any workload.

You should:

Resist deploying these techniques unless measurement demands.
Stay with publish/subscribe COW for typical Go work.
Read at least one paper per quarter to stay current.
Build incrementally; ship the simple thing first.

You are at the apex of concurrent ordered data expertise. Carry that responsibly.

Appendix AH: A final exercise — design exercise¶

Take 30 minutes and design an in-memory transactional ordered key-value store in Go for a fictional workload:

10M keys.
100k writes/sec.
1M reads/sec.
Snapshot isolation required.
P99 read latency < 100 μs.
P99 write latency < 1 ms.

What architecture do you choose? What libraries? What protocol? What sharding?

Walk through your design with another engineer. Defend each choice. Identify weaknesses.

When you can do this exercise in 30 minutes and produce a defensible architecture, you have reached professional-grade fluency.

Appendix AI: A final closing¶

I close with the same observation that has appeared throughout this section: for most Go work, publish/subscribe COW with tidwall/btree is the right answer. The hundreds of pages we have spent on Bw-trees, ART, Hekaton, and HTM are intellectual context for the rare case where the simple pattern fails.

If you walk away from this section having internalized one thing, let it be that. Then, in the rare case you need more, you have the depth to engage with the literature, choose the right protocol, and ship it correctly.

Concurrent ordered data structures are one of the deepest topics in computer science. You have studied them at four levels of depth across roughly 15,000 lines of writing. You know enough now to know what you don't know, and that is the beginning of expertise.

Go forth. Build something real. Profile it. Iterate. Welcome to the apex of concurrent trees in Go.

Appendix AJ: A very detailed implementation guide for a Bw-tree skeleton¶

For those determined to attempt a Bw-tree in Go, here is a more complete skeleton with consolidation:

package bwtree

import (
    "sort"
    "sync"
    "sync/atomic"
)

type NodeID uint32

const (
    invalidID NodeID = 0
    consolidationThreshold = 8
    initialMappingSize     = 1 << 20
)

type DeltaKind uint8

const (
    DKBase DeltaKind = iota
    DKInsert
    DKDelete
    DKSplit
    DKMerge
)

type item struct {
    Key int64
    Val []byte
}

type Node struct {
    kind      DeltaKind
    chainLen  int32 // for tracking when to consolidate
    // Base fields (when kind == DKBase):
    isLeaf    bool
    items     []item
    highKey   int64
    rightSib  NodeID
    // Delta fields:
    deltaKey  int64
    deltaVal  []byte
    next      *Node
    // Split delta:
    splitPivot int64
    splitSibling NodeID
}

type BwTree struct {
    mapping      []atomic.Pointer[Node]
    nextID       atomic.Uint32
    rootID       NodeID
    parentTable  sync.Map // childID -> parentID, for finding parents (simplification)
}

func New() *BwTree {
    t := &BwTree{
        mapping: make([]atomic.Pointer[Node], initialMappingSize),
    }
    t.nextID.Store(1) // 0 is invalid
    rootID := t.allocate()
    root := &Node{
        kind:    DKBase,
        isLeaf:  true,
        highKey: int64(^uint64(0) >> 1),
    }
    t.mapping[rootID].Store(root)
    t.rootID = rootID
    return t
}

func (t *BwTree) allocate() NodeID {
    return NodeID(t.nextID.Add(1))
}

func (t *BwTree) Get(key int64) ([]byte, bool) {
    id := t.rootID
    for {
        n := t.mapping[id].Load()
        leaf, childID, found, val, deleted := walkChain(n, key)
        if deleted {
            return nil, false
        }
        if found {
            return val, true
        }
        if leaf {
            return nil, false
        }
        id = childID
    }
}

func walkChain(n *Node, key int64) (isLeaf bool, childID NodeID, found bool, val []byte, deleted bool) {
    for {
        switch n.kind {
        case DKInsert:
            if n.deltaKey == key {
                return false, 0, true, n.deltaVal, false
            }
            n = n.next
        case DKDelete:
            if n.deltaKey == key {
                return false, 0, false, nil, true
            }
            n = n.next
        case DKSplit:
            if key >= n.splitPivot {
                return false, n.splitSibling, false, nil, false
            }
            n = n.next
        case DKBase:
            if n.isLeaf {
                // Binary search.
                i := sort.Search(len(n.items), func(i int) bool { return n.items[i].Key >= key })
                if i < len(n.items) && n.items[i].Key == key {
                    return true, 0, true, n.items[i].Val, false
                }
                return true, 0, false, nil, false
            }
            // Internal: find child.
            return false, findInternalChild(n, key), false, nil, false
        }
    }
}

func findInternalChild(n *Node, key int64) NodeID {
    // Items in internal node store (separator key, child ID) pairs.
    // For simplicity, assume Val is the encoded NodeID.
    i := sort.Search(len(n.items), func(i int) bool { return n.items[i].Key > key })
    if i == 0 {
        return 0 // out of range; should not happen
    }
    return NodeID(int64(n.items[i-1].Val[0]) | int64(n.items[i-1].Val[1])<<8 | ...)
    // (simplification; real code encodes the NodeID in 4 bytes)
}

func (t *BwTree) Insert(key int64, val []byte) {
    for {
        id := t.findLeafID(key)
        old := t.mapping[id].Load()
        delta := &Node{
            kind:     DKInsert,
            chainLen: old.chainLen + 1,
            deltaKey: key,
            deltaVal: val,
            next:     old,
        }
        if t.mapping[id].CompareAndSwap(old, delta) {
            if delta.chainLen > consolidationThreshold {
                go t.maybeConsolidate(id)
            }
            return
        }
        // CAS failed; retry.
    }
}

func (t *BwTree) findLeafID(key int64) NodeID {
    id := t.rootID
    for {
        n := t.mapping[id].Load()
        leaf, childID, _, _, _ := walkChain(n, key)
        if leaf {
            return id
        }
        id = childID
    }
}

func (t *BwTree) maybeConsolidate(id NodeID) {
    old := t.mapping[id].Load()
    if old.chainLen < consolidationThreshold {
        return
    }
    fresh := consolidate(old)
    if !t.mapping[id].CompareAndSwap(old, fresh) {
        return // someone else did it; abandon
    }
}

func consolidate(n *Node) *Node {
    // Collect all deltas, find the base, apply in reverse.
    var deltas []*Node
    cur := n
    for cur.kind != DKBase {
        deltas = append(deltas, cur)
        cur = cur.next
    }
    base := cur
    items := append([]item{}, base.items...)
    // Apply oldest first (i.e., reverse the deltas slice since we collected newest first).
    for i := len(deltas) - 1; i >= 0; i-- {
        d := deltas[i]
        switch d.kind {
        case DKInsert:
            items = insertSortedItem(items, d.deltaKey, d.deltaVal)
        case DKDelete:
            items = removeKeyItem(items, d.deltaKey)
        }
    }
    return &Node{
        kind:    DKBase,
        isLeaf:  base.isLeaf,
        items:   items,
        highKey: base.highKey,
        rightSib: base.rightSib,
    }
}

func insertSortedItem(items []item, k int64, v []byte) []item {
    i := sort.Search(len(items), func(i int) bool { return items[i].Key >= k })
    if i < len(items) && items[i].Key == k {
        items[i].Val = v
        return items
    }
    items = append(items, item{})
    copy(items[i+1:], items[i:])
    items[i] = item{Key: k, Val: v}
    return items
}

func removeKeyItem(items []item, k int64) []item {
    i := sort.Search(len(items), func(i int) bool { return items[i].Key >= k })
    if i < len(items) && items[i].Key == k {
        return append(items[:i], items[i+1:]...)
    }
    return items
}

This is ~200 lines, far from a complete Bw-tree but enough to see the shape:

Mapping table is []atomic.Pointer[Node].
Reads walk the chain, then descend or return.
Writes allocate a delta and CAS.
Consolidation is triggered when the chain grows long.

What is missing:

Splits and merges (the hardest part).
Mapping table growth.
Proper consolidation under concurrent inserts.
Epoch reclamation (Go's GC almost suffices).
Parent table for finding parents on split.
Range scans.

A full implementation is 3000-5000 lines of careful Go. Pursue this only if you genuinely need its properties.

Appendix AK: ART implementation skeleton¶

A more complete concurrent ART skeleton:

package art

import (
    "bytes"
    "runtime"
    "sync"
    "sync/atomic"
)

type NodeKind uint8

const (
    Node4 NodeKind = iota
    Node16
    Node48
    Node256
    LeafKind
)

type Node interface {
    Kind() NodeKind
    Version() *atomic.Uint64
    WriteLock() *sync.Mutex
    Prefix() []byte
    FindChild(b byte) *atomic.Pointer[Node]
    NumChildren() int
}

// Concrete implementations:
type Node4Impl struct {
    version atomic.Uint64
    writeMu sync.Mutex
    prefix  []byte
    numChildren uint8
    keys    [4]byte
    children [4]atomic.Pointer[Node]
}

func (n *Node4Impl) Kind() NodeKind { return Node4 }
func (n *Node4Impl) Version() *atomic.Uint64 { return &n.version }
func (n *Node4Impl) WriteLock() *sync.Mutex { return &n.writeMu }
func (n *Node4Impl) Prefix() []byte { return n.prefix }
func (n *Node4Impl) NumChildren() int { return int(n.numChildren) }

func (n *Node4Impl) FindChild(b byte) *atomic.Pointer[Node] {
    for i := 0; i < int(n.numChildren); i++ {
        if n.keys[i] == b {
            return &n.children[i]
        }
    }
    return nil
}

// Similar implementations for Node16, Node48, Node256, LeafImpl.

type ART struct {
    root atomic.Pointer[Node]
}

func (t *ART) Get(key []byte) ([]byte, bool) {
    for retries := 0; retries < 64; retries++ {
        n := t.root.Load()
        depth := 0
    walk:
        for {
            if n == nil {
                return nil, false
            }
            if leaf, ok := n.(*LeafImpl); ok {
                if bytes.Equal(leaf.key, key) {
                    return leaf.value, true
                }
                return nil, false
            }
            v := n.Version()
            v1 := v.Load()
            if v1&1 == 1 {
                runtime.Gosched()
                break walk
            }
            prefix := n.Prefix()
            if !matchPrefix(prefix, key, depth) {
                if v.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            depth += len(prefix)
            if depth >= len(key) {
                if v.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            childPtr := n.FindChild(key[depth])
            if childPtr == nil {
                if v.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            child := childPtr.Load()
            if v.Load() != v1 {
                break walk
            }
            n = child
            depth++
        }
    }
    panic("too many retries")
}

func matchPrefix(prefix, key []byte, depth int) bool {
    for i := 0; i < len(prefix); i++ {
        if depth+i >= len(key) || prefix[i] != key[depth+i] {
            return false
        }
    }
    return true
}

Notice the interface dispatch — Go's lack of intrinsics means we cannot pack different node types into one struct as cleanly as C++. Each kind is a separate struct implementing the Node interface.

The OLFIT-style version checks are present. Writes would take the writeMu, increment the version, mutate (possibly resize), and release.

A complete implementation is ~2000 lines. The Go version will be 30-50% slower than a tuned C++ ART, but it works.

Appendix AL: Working through Hekaton's optimistic transaction in detail¶

Let's trace a Hekaton-style transaction in detail.

Setup: A users table with rows {id: 1, name: "Alice"} and {id: 2, name: "Bob"}. The latest transaction ID is 100.

Transaction T1 begins. Snapshot = 100. T1's ID = 101.

T1 reads WHERE id = 1. - Walk the version chain for row id=1. - Find version {name: "Alice", XMin: 50, XMax: 0}. Visible at snapshot 100 (50 <= 100 and XMax == 0). - T1 records this read: Reads[1] = pointer to this version.

Transaction T2 begins concurrently. Snapshot = 100. T2's ID = 102.

T2 updates id = 1 SET name = 'Alicia'. - Find row id=1. - Lock row. Set the current version's XMax to 102. Create new version {name: "Alicia", XMin: 102, XMax: 0}. Update the index. - T2 records this write: Writes[1] = new version.

T2 commits. - Validation: T2's reads (none in this example). Pass. - Write commit log entry: T2 committed at timestamp 103. - New version is now visible to snapshots > 102.

T1 commits. - Validation: For each read in T1's set, check whether it is still the latest visible. - Reads[1] = old version with XMin=50, XMax=0 (at the time of T1's read). - Now, that same version has XMax=102 (set by T2). - But T1's snapshot is 100, and 102 > 100, so this version would still be visible to T1. - BUT: a newer version (XMin=102) now exists. If T1's read set requires "this is the latest version at commit time," then T1 fails.

Hekaton's exact protocol: Serializable isolation requires write-set validation against concurrent updates. For Snapshot Isolation (Hekaton's default), only write-write conflicts are detected:

T1 has no writes, so no conflict. T1 commits.
T2 wrote to row 1; if T1 had also written to row 1, the second-committer would fail.

For serializable, you also check read-write conflicts (called "read validation"): if any version T1 read has been updated by a concurrently committed transaction, T1 fails.

Hekaton's actual implementation: optimistic for SI, optimistic + read validation for serializable.

Appendix AM: A study guide to optimistic concurrency control¶

OCC has three phases:

Read phase. Read whatever you want; track which versions you saw.
Validation phase. At commit, check that your reads are still valid.
Write phase. Apply your writes.

The hard part is validation. Common approaches:

Timestamp validation. Compare your read set's versions to current versions; abort if any have been superseded.
Backward validation. Compare your read set to other transactions' write sets that committed during your execution.
Forward validation. Compare your write set to other active transactions' read sets; abort or wait.

Hekaton uses timestamp validation (cheap, correct for SI). FoundationDB uses forward validation (necessary for serializable in a distributed setting).

For Go, you would implement timestamp validation for an in-memory engine. For a distributed engine, forward validation via a central resolver is more common.

Appendix AN: A close look at GC pressure in COW trees¶

A subtle topic: how much GC pressure does a COW tree generate?

For each write: - One new node per level of the tree (path copy). - For a B+-tree of fanout 32, height ~6 for 1B entries → 6 nodes per write. - Each node is ~1 KB → 6 KB per write of fresh allocation.

At 10k writes/sec, that is 60 MB/sec of allocation. Go's GC can handle this comfortably (it sweeps at GB/sec). At 100k writes/sec, 600 MB/sec — starting to put pressure on the GC.

Mitigation:

Batch writes. Multiple writes accumulate in live, one snapshot per batch.
Larger fanout. Reduces height, fewer nodes per write.
Inline items. A B+-tree where leaves hold actual data (not pointers) reduces dereferences.
sync.Pool for transient nodes (carefully — see earlier appendices).

For very write-heavy workloads, COW pays its price in GC. Hand-over-hand or in-place latching may win.

Appendix AO: A close look at NUMA effects on tree throughput¶

On a 2-socket machine with 16 cores per socket:

Local memory access: ~80 ns.
Remote memory access: ~150 ns.
Cache coherency cross-socket: more.

For a B-tree of 100M entries spread across both sockets:

Random key access has 50% chance of remote access.
Average latency: ~115 ns vs ~80 ns local.
Throughput: ~30% lower than NUMA-local.

NUMA-aware sharding (per-socket trees) closes the gap. But it requires:

Pinning goroutines to sockets.
Routing queries to the right socket.
Handling cross-socket queries (rare but unavoidable).

For Go, this is doable but not trivial. The runtime.LockOSThread plus syscalls for CPU affinity get you part way. Real NUMA allocation requires cgo.

For most Go services on a single socket: NUMA does not matter. For 32+ core machines: it can matter.

Appendix AP: A close look at allocator interactions¶

Go's allocator (runtime/malloc.go) is excellent but has costs:

Each allocation requires acquiring a per-P (processor) cache.
Large allocations go through the global heap with a global lock.
The GC scans every allocation periodically.

For tree node allocation:

Small nodes (~1 KB) hit the per-P cache. Fast.
Large nodes (~32 KB) go through the global heap. Slower.
Many small allocations stress the GC scan.

Tuning options:

Pre-allocate node slabs and manage them manually (custom allocator).
Use sync.Pool for short-lived nodes (carefully).
Adjust GOGC to reduce GC frequency at the cost of memory.

For an engine-grade tree, custom allocation is worth doing. For a typical Go service, Go's allocator is fine.

Appendix AQ: A taxonomy of when each technique pays off¶

Workload	Best technique
< 10k ops/sec, ordered	Single Mutex
10k-100k ops/sec, mostly reads	RWMutex
10k-100k ops/sec, mixed	COW publish/subscribe
100k-1M ops/sec, point-heavy	Sharded COW
100k-1M ops/sec, range-heavy	Single COW (snapshot publishing)
1M+ ops/sec, in-memory engine	OLFIT or Bw-tree
Distributed, ordered	LSM (Pebble-like) + replication

These are guides, not rules. Profile your workload to confirm.

Appendix AR: A final example — a complete engine in Go¶

For a final exercise, let's outline a complete engine. About 5000 lines. The architecture:

Engine
├── Tx Manager
│   ├── Tx counter (atomic.Int64)
│   ├── Commit log (sync.Map[txID]committedAt)
│   └── Snapshot tracker (for GC horizon)
├── Per-shard MVCC stores (16 shards)
│   ├── Shard 0
│   │   ├── btree of *Row (tidwall/btree)
│   │   ├── snapshot (atomic.Pointer[btree])
│   │   ├── shard mutex (sync.Mutex)
│   │   └── per-row version chains
│   ├── Shard 1
│   ...
├── Indexes (secondary)
│   ├── Per-table list of (column, btree)
│   ├── Each btree maps indexed value to []*Row
│   ├── Updated under shard mutex
├── WAL writer
│   ├── Append-only file
│   ├── Sync on commit
│   ├── Replay on recovery
├── GC worker
│   ├── Background goroutine
│   ├── Reclaims dead versions
│   ├── Watches snapshot horizon
└── Subscription manager
    ├── Per-shard cond vars
    ├── Notify on commit
    └── Used by change-data-capture

Each component is testable in isolation. The total in Go: ~5000 lines. Throughput: ~100k tx/sec, ~10M point reads/sec.

This is a complete, real, deployable in-memory transactional engine in Go. It is not Hekaton-class but it is excellent for many production needs.

When you can design and build this in your head, you are at the apex of Go professional-level concurrent tree mastery.

Appendix AS: A list of "do not do this" warnings¶

A few things to avoid at the professional level:

Don't roll your own Bw-tree. It is harder than you think.
Don't reach for hardware transactional memory. It is not available in Go.
Don't optimize without profiling. Always measure first.
Don't reinvent tidwall/btree. It is excellent.
Don't reinvent bbolt. Also excellent.
Don't avoid Go for engine work without measuring. It may be fast enough.
Don't ignore GC pauses. They affect tail latency.
Don't deploy to production without stress tests. Concurrency bugs are insidious.

These are the lessons of many engineers who learned them the hard way. Take them.

Appendix AT: Final closing thoughts¶

The professional file has covered:

The Bw-tree at implementation grade.
Concurrent ART.
Hardware transactional memory and its decline.
OLFIT, MassTree, and the lineage of in-memory B+-trees.
Hekaton-class engines.
A comparative engine tour.
Hardware-aware design.
Profiling at the microbenchmark level.

For most Go work, this material is intellectual context. For the rare case where it is needed, it provides the vocabulary and framework to engage with the literature, choose protocols, and build systems.

The road from this file is: read papers, write code, profile, measure, ship, repeat. Concurrent data structures are a lifelong study.

Welcome to the apex. Build well.

Appendix AU: Detailed reading of PostgreSQL's `_bt_findsplitloc`¶

The choice of split point in nbtree is interesting. PostgreSQL does not split a full page in the middle; it chooses a split point that:

Keeps the existing items balanced.
Minimizes the "high key" overhead (the larger keys, the more space taken).
Allows for asymmetric splits in some cases (e.g., rightmost insert pattern).

The relevant function is _bt_findsplitloc in nbtsplitloc.c. About 1000 lines. The algorithm:

Compute the total size of all items.
For each candidate split offset, compute the resulting left/right page sizes.
Score each candidate by:
Balance (preferring equal sizes).
Compatibility with the insertion key (if we are mid-insert, prefer the side that absorbs the new key).
Right edge bias (for monotonic key sequences, prefer slight right-bias).
Pick the best candidate.

The right-edge bias is critical for sequential inserts (timestamps, IDs). Without it, a sequential workload would create perfectly balanced pages, leading to ~50% page utilization after splits. With it, the left page stays full while the right page absorbs new inserts, leading to ~90% utilization.

This is a level of optimization that pure-Go in-memory trees rarely pursue. tidwall/btree and google/btree use simpler "split at middle" approaches. The difference is ~10% page utilization, which matters for disk-based stores but rarely for in-memory.

If you ever build an on-disk B+-tree in Go, study _bt_findsplitloc carefully. It is years of accumulated wisdom.

Appendix AV: A close look at concurrent skip lists vs B-trees¶

For ordered in-memory data, the choice between concurrent skip list (e.g., Pebble's arenaskl) and concurrent B-tree (e.g., tidwall/btree with COW) is genuinely close.

Skip list pros:

Lock-free reads and writes via CAS on forward pointers.
Simpler implementation (no rebalancing).
Naturally suits arena allocation.
Predictable height distribution (geometric).

Skip list cons:

More pointer chasing per operation than B-tree.
Worse cache behavior on average (random pointer destinations).
Memory overhead for forward pointers at each level.

B-tree pros:

Better cache behavior (multiple items per node).
Tighter memory packing.
Faster point lookups on average.

B-tree cons:

Rebalancing on splits/merges.
Harder to make lock-free (Bw-tree complexity).

For Pebble's memtable: skip list wins because it is short-lived (flushed to disk on full) and the arena allocator suits skip list nicely.

For tidwall/btree's use case: B-tree wins because it has snapshots, ordered iteration, and longer lifetimes.

In Go specifically: skip lists are easier to make lock-free; B-trees are easier to make COW. Both have a place.

Appendix AW: Distributed concurrent trees¶

For distributed databases, the concurrent tree problem extends across nodes:

CockroachDB: key range sharding ("ranges" of ~64 MB). Each range is a Pebble store. Cross-range queries scatter-gather.
Spanner: key range sharding with TrueTime for transaction ordering.
TiDB / TiKV: key range sharding with Raft replication per range.

These systems use:

A central coordinator (Spanner's lock service, CockroachDB's range descriptor).
Per-range storage (Pebble or RocksDB).
A transaction protocol (Percolator, Calvin, Spanner's two-phase commit with TrueTime).

For Go: CockroachDB is the canonical example. Its source is available on GitHub. Reading the kvserver package gives you a real picture of distributed concurrent ordered storage.

Appendix AX: A meditation on simplicity vs sophistication¶

A theme runs through this section: simplicity beats sophistication when both work. Publish/subscribe COW is simpler than Bw-tree, and for 99% of workloads it is also sufficient.

But sophistication exists for the 1%:

Hekaton's optimistic transactions beat any simpler design at 1M tx/sec.
ART beats B-trees on long-string-key workloads.
Lehman-Yao B-link beats single-mutex B-trees under heavy concurrent writes.

The professional engineer's job is to know which 1% applies to their workload, deploy the right tool, and avoid deploying sophistication where simplicity would suffice.

In Go, the bar for "deploy sophistication" is very high. Go's GC, runtime, and lack of intrinsics mean that the simplest patterns are often more competitive than they would be in C++. This is a feature: Go's productivity comes from defaulting to simplicity.

Appendix AY: A study guide for the next decade¶

What is next in concurrent trees research:

Learned indexes (2017-): use machine learning to predict the position of a key, replacing tree traversal with model inference. Very fast for read-only workloads.
Persistent memory aware trees (still relevant despite Optane's demise): designs assume byte-addressable persistent storage.
GPU-accelerated trees: for very large in-memory datasets.
Quantum-resistant data structures: preparing for the quantum computing era.

These are research vehicles in 2026 but may become production-grade in 2030.

For Go in 2030: probably still using tidwall/btree with publish/subscribe COW. The simple patterns endure.

Appendix AZ: A close look at PostgreSQL's deduplication¶

A 2020 optimization in PostgreSQL nbtree: leaf items can be deduplicated. Instead of storing duplicate key entries as separate items, store one entry with a posting list of TIDs (tuple identifiers).

Before deduplication:
  [key=42, tid=1] [key=42, tid=2] [key=42, tid=3]

After:
  [key=42, posting_list=[1, 2, 3]]

For tables with many duplicate index entries (e.g., indexes on low-cardinality columns), deduplication dramatically reduces index size.

For Go in-memory trees: applicable to indexes on low-cardinality columns. tidwall/btree items can hold a slice of values:

type Entry struct {
    Key      string
    Values   []int64
}

Updates merge / split the values slice. Not common in general-purpose Go usage but available if you need it.

Appendix BA: A close look at hot-cold separation¶

Production trees often suffer from "hot keys" — a few keys that take 90% of the traffic. A general tree treats all keys uniformly, paying cache miss penalties on the hot ones.

Optimizations:

Front-loading. Store hot items at the root of the tree (e.g., a cache layer on top).
Bloom filters. Quickly determine if a key is in a node without searching.
Tiered caches. Hot keys in a small in-memory cache; cold keys in the main tree.

For Go, a sketch:

type HotCold struct {
    hot  sync.Map // map[K]V for the top-10000 hot keys
    cold *btree.BTreeG[Item]
    mu   sync.Mutex // for cold mutations
}

func (s *HotCold) Get(k K) (V, bool) {
    if v, ok := s.hot.Load(k); ok {
        return v.(V), true
    }
    return s.cold.Get(Item{Key: k})
}

A background process periodically promotes/demotes between hot and cold based on access frequency.

This pattern is common in distributed caches (Twitter's Pelikan, Facebook's TAO). Implement only if profiling shows hot-key skew.

Appendix BB: A close look at write-ahead logging in Go¶

For persistence, a WAL is essential. Pattern:

type WAL struct {
    mu   sync.Mutex
    file *os.File
    buf  *bufio.Writer
}

func (w *WAL) Append(record []byte) error {
    w.mu.Lock()
    defer w.mu.Unlock()
    if _, err := w.buf.Write(record); err != nil {
        return err
    }
    return nil
}

func (w *WAL) Sync() error {
    w.mu.Lock()
    defer w.mu.Unlock()
    if err := w.buf.Flush(); err != nil {
        return err
    }
    return w.file.Sync()
}

For high throughput, group commit:

Many goroutines call Append to add records.
A single goroutine periodically calls Sync and notifies waiters.

type GroupWAL struct {
    mu      sync.Mutex
    buf     *bufio.Writer
    file    *os.File
    pending []chan error
}

func (w *GroupWAL) Commit(record []byte) error {
    w.mu.Lock()
    w.buf.Write(record)
    done := make(chan error, 1)
    w.pending = append(w.pending, done)
    w.mu.Unlock()
    return <-done
}

func (w *GroupWAL) syncLoop() {
    ticker := time.NewTicker(1 * time.Millisecond)
    for range ticker.C {
        w.mu.Lock()
        if len(w.pending) == 0 {
            w.mu.Unlock()
            continue
        }
        w.buf.Flush()
        err := w.file.Sync()
        for _, ch := range w.pending {
            ch <- err
        }
        w.pending = w.pending[:0]
        w.mu.Unlock()
    }
}

Group commit amortizes fsync cost (typically ~1 ms) across many transactions. Throughput: ~10k transactions/sec per sync (so with group commit at 1 kHz tick, ~10M tx/sec sync-bound).

For pure in-memory trees, no WAL needed. For durable trees, group commit is the right pattern.

Appendix BC: Reading list summary¶

To complete the professional file, here is a reading list ordered by importance:

Tier 1 (read these):

Lehman, Yao (1981), B-link tree.
Cha et al. (2001), OLFIT.
PostgreSQL's nbtree/README.
bbolt source.

Tier 2 (read at least one):

Mao et al. (2012), MassTree.
Levandoski et al. (2013), Bw-tree.
Leis et al. (2013), ART.
Diaconu et al. (2013), Hekaton.

Tier 3 (for the truly committed):

Wang et al. (2018), Building a Bw-tree.
Larson et al. (2011), Hekaton MVCC.
Faleiro, Abadi (2015), Rethinking serializable.
The CockroachDB source.

After Tier 1, you can read Tier 2 or 3 papers easily. After Tier 2, you can write Tier 3 papers.

Appendix BD: A senior+ designer's checklist¶

When you architect a concurrent ordered store in Go:

For each answer, the choice of protocol and library follows. A 1k QPS, 1ms latency, 100 MB memory workload: single Mutex. A 10M QPS, 1 μs latency, 100 GB memory workload: hardware-aware engine.

The skill is mapping answers to architecture.

Appendix BE: A note on Go's evolving ecosystem¶

Go's concurrency primitives have evolved:

2009: sync.Mutex, sync.RWMutex.
2014: sync.Pool.
2017: sync.Map.
2019: sync/atomic types (atomic.Value).
2021: atomic.Pointer[T], atomic.Int64, etc.

By 2026, the ecosystem is mature. The standard library is sufficient for almost all concurrent tree needs.

Future possibilities (speculative):

arena package for region-based allocation (proposed but not stable).
Better intrinsics support.
Improved GC for very large heaps.
NUMA awareness in the runtime.

If any of these land, concurrent tree libraries will likely take advantage. Stay current.

Appendix BF: A final epilogue¶

Sixteen thousand lines of writing across four files. Concurrent trees in Go from "use one mutex" to "design a lock-free Bw-tree."

What has it given you?

Vocabulary: hand-over-hand, OLFIT, B-link, COW, MVCC, RCU, Bw-tree, ART.
Patterns: publish/subscribe, sharded, snapshot publishing, MVCC version chains.
Libraries: tidwall/btree, google/btree, bbolt, Pebble, BadgerDB.
Architectures: in-memory transactional engines, sharded stores, LSM-backed databases.
Skills: profiling, benchmarking, race detection, invariant checking.
Judgment: when to use which technique, when to use none.

The single most important takeaway: start simple. Almost every Go service that needs ordered concurrent data is best served by publish/subscribe COW with tidwall/btree. The rest of this material is for the rare cases where that pattern is genuinely insufficient.

When you reach those cases, you have the depth to engage with them. When you do not reach them — which is most of the time — you have the wisdom to deploy simplicity.

Welcome to the apex.

Appendix BG: One more closing¶

To the engineer reading this in 2030, 2035, 2040: the principles in this section are likely still relevant. Hardware changes, languages change, ecosystems change, but the fundamental challenges of concurrent ordered data remain:

How do many threads read and write the same ordered structure without corruption?
How do we make readers and writers not block each other?
How do we handle concurrent structural modifications?
How do we reclaim memory safely?

The answers are eternal: latching, optimism, copy-on-write, version counters, sibling pointers, epoch reclamation. The implementations vary; the principles do not.

If you are reading this in the future, know that you are the latest in a long line of engineers grappling with these problems. The literature is rich; the libraries are mature; the patterns are well understood. Use them wisely.

Build well. Profile honestly. Ship reliably. Welcome to concurrent trees in Go.

Appendix BH: A long worked benchmark — comparing seven protocols¶

Let's set up a benchmark that compares seven protocols on the same workload:

Single sync.Mutex.
sync.RWMutex.
Sharded sync.Mutex (16 shards).
Sharded sync.RWMutex (16 shards).
COW publish/subscribe.
Sharded COW publish/subscribe.
OLFIT (Appendix H from senior).

Workload: 10M-entry tree of int64 keys; mixed 95% reads / 5% writes; 64 concurrent goroutines.

package compare_test

import (
    "math/rand"
    "sync"
    "sync/atomic"
    "testing"

    "github.com/tidwall/btree"
)

// Each protocol exposes Set(k, v int64) and Get(k int64) (int64, bool).

// 1. Single Mutex
type mutexStore struct {
    mu sync.Mutex
    bt *btree.BTreeG[item]
}

func newMutexStore() *mutexStore {
    return &mutexStore{bt: btree.NewBTreeG[item](less)}
}

func (s *mutexStore) Set(k, v int64) {
    s.mu.Lock()
    s.bt.Set(item{k, v})
    s.mu.Unlock()
}

func (s *mutexStore) Get(k int64) (int64, bool) {
    s.mu.Lock()
    defer s.mu.Unlock()
    it, ok := s.bt.Get(item{Key: k})
    return it.Val, ok
}

// 2. RWMutex
type rwStore struct {
    mu sync.RWMutex
    bt *btree.BTreeG[item]
}

func newRWStore() *rwStore {
    return &rwStore{bt: btree.NewBTreeG[item](less)}
}

func (s *rwStore) Set(k, v int64) {
    s.mu.Lock()
    s.bt.Set(item{k, v})
    s.mu.Unlock()
}

func (s *rwStore) Get(k int64) (int64, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    it, ok := s.bt.Get(item{Key: k})
    return it.Val, ok
}

// 3. Sharded Mutex
type shardedMutex struct {
    shards [16]struct {
        mu sync.Mutex
        bt *btree.BTreeG[item]
    }
}

func newShardedMutex() *shardedMutex {
    var s shardedMutex
    for i := range s.shards {
        s.shards[i].bt = btree.NewBTreeG[item](less)
    }
    return &s
}

func (s *shardedMutex) shardFor(k int64) *struct {
    mu sync.Mutex
    bt *btree.BTreeG[item]
} {
    return &s.shards[uint(k)%16]
}

func (s *shardedMutex) Set(k, v int64) {
    sh := s.shardFor(k)
    sh.mu.Lock()
    sh.bt.Set(item{k, v})
    sh.mu.Unlock()
}

func (s *shardedMutex) Get(k int64) (int64, bool) {
    sh := s.shardFor(k)
    sh.mu.Lock()
    defer sh.mu.Unlock()
    it, ok := sh.bt.Get(item{Key: k})
    return it.Val, ok
}

// 4. Sharded RWMutex
type shardedRW struct {
    shards [16]struct {
        mu sync.RWMutex
        bt *btree.BTreeG[item]
    }
}

func newShardedRW() *shardedRW {
    var s shardedRW
    for i := range s.shards {
        s.shards[i].bt = btree.NewBTreeG[item](less)
    }
    return &s
}

func (s *shardedRW) Set(k, v int64) {
    sh := &s.shards[uint(k)%16]
    sh.mu.Lock()
    sh.bt.Set(item{k, v})
    sh.mu.Unlock()
}

func (s *shardedRW) Get(k int64) (int64, bool) {
    sh := &s.shards[uint(k)%16]
    sh.mu.RLock()
    defer sh.mu.RUnlock()
    it, ok := sh.bt.Get(item{Key: k})
    return it.Val, ok
}

// 5. COW publish/subscribe
type cowStore struct {
    mu       sync.Mutex
    live     *btree.BTreeG[item]
    snapshot atomic.Pointer[btree.BTreeG[item]]
}

func newCOWStore() *cowStore {
    s := &cowStore{live: btree.NewBTreeG[item](less)}
    s.snapshot.Store(s.live.Copy())
    return s
}

func (s *cowStore) Set(k, v int64) {
    s.mu.Lock()
    s.live.Set(item{k, v})
    s.snapshot.Store(s.live.Copy())
    s.mu.Unlock()
}

func (s *cowStore) Get(k int64) (int64, bool) {
    it, ok := s.snapshot.Load().Get(item{Key: k})
    return it.Val, ok
}

// 6. Sharded COW
type shardedCOW struct {
    shards [16]struct {
        mu       sync.Mutex
        live     *btree.BTreeG[item]
        snapshot atomic.Pointer[btree.BTreeG[item]]
    }
}

func newShardedCOW() *shardedCOW {
    var s shardedCOW
    for i := range s.shards {
        s.shards[i].live = btree.NewBTreeG[item](less)
        s.shards[i].snapshot.Store(s.shards[i].live.Copy())
    }
    return &s
}

func (s *shardedCOW) Set(k, v int64) {
    sh := &s.shards[uint(k)%16]
    sh.mu.Lock()
    sh.live.Set(item{k, v})
    sh.snapshot.Store(sh.live.Copy())
    sh.mu.Unlock()
}

func (s *shardedCOW) Get(k int64) (int64, bool) {
    sh := &s.shards[uint(k)%16]
    it, ok := sh.snapshot.Load().Get(item{Key: k})
    return it.Val, ok
}

// Shared types
type item struct {
    Key int64
    Val int64
}

func less(a, b item) bool { return a.Key < b.Key }

// Benchmark driver
func benchmarkStore(b *testing.B, set func(k, v int64), get func(k int64)) {
    // Pre-populate.
    for i := int64(0); i < 10_000_000; i++ {
        set(i, i*2)
    }
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        r := rand.New(rand.NewSource(rand.Int63()))
        for pb.Next() {
            if r.Intn(20) == 0 {
                k := r.Int63n(10_000_000)
                set(k, k*2)
            } else {
                k := r.Int63n(10_000_000)
                get(k)
            }
        }
    })
}

func BenchmarkMutex(b *testing.B) {
    s := newMutexStore()
    benchmarkStore(b, s.Set, func(k int64) { s.Get(k) })
}

// ... and similar for the other 6 protocols ...

Run with go test -bench=. -cpu=1,4,16,64 -benchtime=5s. On a 2026 64-core server:

BenchmarkMutex-1        500k ops/sec
BenchmarkMutex-64       500k ops/sec    (no scaling; serial bottleneck)

BenchmarkRW-1           500k ops/sec
BenchmarkRW-64           5M ops/sec     (scaling for reads only)

BenchmarkShardedMutex-1   500k ops/sec
BenchmarkShardedMutex-64   10M ops/sec   (linear scaling)

BenchmarkShardedRW-1    500k ops/sec
BenchmarkShardedRW-64    15M ops/sec    (better but writers still serialize per shard)

BenchmarkCOW-1            1M ops/sec
BenchmarkCOW-64          20M ops/sec    (writes limit; reads scale)

BenchmarkShardedCOW-1     1M ops/sec
BenchmarkShardedCOW-64    50M ops/sec   (both reads and writes scale)

Numbers are illustrative; actual results vary by hardware and workload.

Conclusion: sharded COW is the winner for high-throughput mixed workloads. Plain COW is excellent for read-heavy. RWMutex is the fallback when sharding is not feasible (e.g., heavy range queries).

Appendix BI: A study of CockroachDB's MVCC implementation¶

CockroachDB uses Pebble (LSM-tree) for storage, with MVCC layered on top. The key encoding:

mvcc_key = user_key | walltime (12 bytes) | logical (4 bytes)

The MVCC suffix is sorted descending (newer first), so the largest entry for a given user_key is the latest version. Reads scan downward from (user_key, snapshot_timestamp).

This is elegant because:

The tree's natural ordering provides MVCC visibility.
No separate version chain.
Range scans within a timestamp are natural.
GC removes old MVCC entries by scanning the tree (no separate version tracking).

For Go in-memory MVCC, you could emulate this:

type mvccKey struct {
    UserKey   string
    Timestamp int64
}

func less(a, b mvccKey) bool {
    if a.UserKey != b.UserKey {
        return a.UserKey < b.UserKey
    }
    return a.Timestamp > b.Timestamp // descending
}

type Engine struct {
    bt *btree.BTreeG[Entry]
}

type Entry struct {
    Key   mvccKey
    Value []byte
}

Read(userKey, snapshot) does an Ascend(mvccKey{userKey, snapshot}, ...) and returns the first matching entry. Write(userKey, value, txid) does a Set(Entry{mvccKey{userKey, txid}, value}).

GC removes entries with timestamps older than the oldest active reader.

This pattern is cleaner than per-row version chains and is what CockroachDB uses in production. For pure Go in-memory MVCC, consider it.

Appendix BJ: A close look at FoundationDB's resolver¶

FoundationDB uses a centralized resolver for transaction validation:

Each transaction sends its read set and write set to the resolver.
The resolver checks for conflicts against committed transactions in a sliding window.
If no conflict, the transaction commits.
If conflict, the transaction aborts.

This central resolver is the bottleneck (~100k tx/sec on a single node), but it provides serializable isolation across a globally distributed database.

For comparison, Spanner uses two-phase commit + TrueTime; CockroachDB uses range-based MVCC + intent resolution; TiDB uses Percolator-style locks.

For Go applications using FoundationDB: you do not implement the resolver. You use the client library and trust the resolver to handle correctness.

For Go applications building their own distributed transactional store: study FoundationDB's resolver. It is the cleanest published design.

Appendix BK: A meditation on tree fanout choices¶

Fanout choices are often unexamined. Let's think through them:

Fanout 2 (binary tree).

Height: log2(n). For 1B entries, ~30 levels.
Memory: 2 child pointers per node + the key. ~24 bytes per entry.
Cache: each level is 1 cache line; chasing 30 pointers = potential 30 cache misses.
Use: rare; only when you need rotations.

Fanout 16.

Height: log16(n). For 1B entries, ~8 levels.
Memory: 16 child pointers + 15 keys = 248 bytes per node.
Cache: each node is 4 cache lines.
Use: small/medium trees where every comparison matters.

Fanout 32.

Height: log32(n). For 1B entries, ~6 levels.
Memory: 32 child pointers + 31 keys = 504 bytes per node.
Cache: each node is 8 cache lines.
Use: in-memory B-trees in Go. google/btree and tidwall/btree default to ~32.

Fanout 64.

Height: log64(n). For 1B entries, ~5 levels.
Memory: 64 child pointers + 63 keys = 1 KB per node.
Cache: each node is 16 cache lines.
Use: very large in-memory trees; some on-disk trees.

Fanout 128-256.

Height: log128(n). For 1B entries, ~4 levels.
Memory: 2-4 KB per node.
Use: on-disk trees with 4 KB pages; database B+-trees.

Fanout 512+.

Memory: 8 KB+ per node.
Use: on-disk B+-trees with 8 KB pages.

For Go in-memory trees, fanout 32-64 is the sweet spot. tidwall/btree uses 32 internally; you can override via options.

Appendix BL: A meditation on item size¶

A tree's performance depends heavily on item size:

Small items (< 32 bytes): inline in the node; excellent cache behavior.
Medium items (32-256 bytes): mix of inline and indirection; OK cache behavior.
Large items (> 256 bytes): must indirect; bad cache behavior.

For Go, the design choice:

type SmallItem struct {
    Key int64
    Val int64
} // 16 bytes; inline in the node

type MediumItem struct {
    Key int64
    Val string // 16 bytes header + indirected
} // 24 bytes header + indirected

type LargeItem struct {
    Key int64
    Val []byte // 24 bytes header + indirected
} // 32 bytes header + indirected

For tree throughput, prefer items < 64 bytes that fit in a cache line. If your values are large, store a pointer or an ID and look up the actual data separately.

Real production Go trees store small items (e.g., (key, valueID, version)) inline and look up large values in a separate map.

Appendix BM: Final summary and a note¶

We have covered:

The Bw-tree at full implementation detail.
Concurrent ART.
OLFIT, MassTree, Lehman-Yao.
Hekaton-class engines.
A comparative tour of production databases.
Hardware-aware design (cache, NUMA, persistent memory, HTM).
Microbenchmarking discipline.
A complete reference architecture for an in-memory transactional engine.

For Go in 2026, the message remains: the simple patterns win. Publish/subscribe COW with tidwall/btree is the right default. Sharding extends it for high throughput. MVCC layers on top when transactions are needed.

The exotic techniques are for the rare case where the simple patterns are genuinely insufficient — which, in pure-Go production code, is almost never.

You have completed the journey. Now build something real.

Appendix BN: Truly final closing¶

This file is over 5000 lines. The entire section is over 16000 lines. You have completed one of the deepest treatments of concurrent ordered data structures in Go available anywhere.

The road ahead:

Build the reference architecture from Appendix AR. Profile it. Iterate.
Read PostgreSQL nbtree's source end-to-end. It is enlightening.
Read at least three of the Tier 1 papers from Appendix BC.
Talk to other engineers about concurrent data structures. Defend your design choices.
Contribute back. Open-source a Go concurrent tree library or contribute to an existing one.

The field is alive. New papers appear yearly. New libraries emerge. Stay curious.

And whenever you find yourself reaching for an exotic technique, ask: could tidwall/btree with publish/subscribe COW handle this? If yes, use that. If no — and only if no — reach for the more sophisticated tool.

Welcome to mastery. Carry it forward. Build well.

Appendix BO: Implementation deep dive — concurrent ART with full code¶

For completeness, a more thorough concurrent ART implementation in Go:

package art

import (
    "bytes"
    "runtime"
    "sync"
    "sync/atomic"
)

const (
    Node4Max   = 4
    Node16Max  = 16
    Node48Max  = 48
    Node256Max = 256
    MaxPrefix  = 8
)

type nodeBase struct {
    version     atomic.Uint64
    writeMu     sync.Mutex
    prefix      [MaxPrefix]byte
    prefixLen   uint8
    numChildren uint16
}

type artNode interface {
    base() *nodeBase
    isLeaf() bool
    findChild(b byte) *artNode
    addChild(b byte, child artNode) artNode // returns possibly resized node
    removeChild(b byte) artNode
}

type node4 struct {
    nodeBase
    keys     [Node4Max]byte
    children [Node4Max]artNode
}

func (n *node4) base() *nodeBase { return &n.nodeBase }
func (n *node4) isLeaf() bool    { return false }

func (n *node4) findChild(b byte) *artNode {
    for i := uint16(0); i < n.numChildren; i++ {
        if n.keys[i] == b {
            return &n.children[i]
        }
    }
    return nil
}

func (n *node4) addChild(b byte, child artNode) artNode {
    if n.numChildren < Node4Max {
        i := uint16(0)
        for ; i < n.numChildren; i++ {
            if n.keys[i] > b {
                break
            }
        }
        // Shift right
        copy(n.keys[i+1:], n.keys[i:n.numChildren])
        copy(n.children[i+1:], n.children[i:n.numChildren])
        n.keys[i] = b
        n.children[i] = child
        n.numChildren++
        return n
    }
    // Resize to Node16
    bigger := &node16{}
    bigger.nodeBase = n.nodeBase
    for i := uint16(0); i < n.numChildren; i++ {
        bigger.keys[i] = n.keys[i]
        bigger.children[i] = n.children[i]
    }
    return bigger.addChild(b, child)
}

type node16 struct {
    nodeBase
    keys     [Node16Max]byte
    children [Node16Max]artNode
}

func (n *node16) base() *nodeBase { return &n.nodeBase }
func (n *node16) isLeaf() bool    { return false }

func (n *node16) findChild(b byte) *artNode {
    // Linear scan; could use SIMD in C++.
    for i := uint16(0); i < n.numChildren; i++ {
        if n.keys[i] == b {
            return &n.children[i]
        }
    }
    return nil
}

func (n *node16) addChild(b byte, child artNode) artNode {
    if n.numChildren < Node16Max {
        i := uint16(0)
        for ; i < n.numChildren; i++ {
            if n.keys[i] > b {
                break
            }
        }
        copy(n.keys[i+1:], n.keys[i:n.numChildren])
        copy(n.children[i+1:], n.children[i:n.numChildren])
        n.keys[i] = b
        n.children[i] = child
        n.numChildren++
        return n
    }
    // Resize to Node48
    bigger := &node48{}
    bigger.nodeBase = n.nodeBase
    for i := uint16(0); i < n.numChildren; i++ {
        bigger.indices[n.keys[i]] = uint8(i + 1) // 0 = empty
        bigger.children[i] = n.children[i]
    }
    return bigger.addChild(b, child)
}

type node48 struct {
    nodeBase
    indices  [256]uint8 // 0 = empty; else 1-based index into children
    children [Node48Max]artNode
}

func (n *node48) base() *nodeBase { return &n.nodeBase }
func (n *node48) isLeaf() bool    { return false }

func (n *node48) findChild(b byte) *artNode {
    idx := n.indices[b]
    if idx == 0 {
        return nil
    }
    return &n.children[idx-1]
}

func (n *node48) addChild(b byte, child artNode) artNode {
    if n.numChildren < Node48Max {
        // Find first empty slot.
        for i := uint16(0); i < Node48Max; i++ {
            if n.children[i] == nil {
                n.children[i] = child
                n.indices[b] = uint8(i + 1)
                n.numChildren++
                return n
            }
        }
    }
    // Resize to Node256
    bigger := &node256{}
    bigger.nodeBase = n.nodeBase
    for b := 0; b < 256; b++ {
        if n.indices[b] != 0 {
            bigger.children[b] = n.children[n.indices[b]-1]
        }
    }
    return bigger.addChild(b, child)
}

type node256 struct {
    nodeBase
    children [256]artNode
}

func (n *node256) base() *nodeBase { return &n.nodeBase }
func (n *node256) isLeaf() bool    { return false }

func (n *node256) findChild(b byte) *artNode {
    if n.children[b] != nil {
        return &n.children[b]
    }
    return nil
}

func (n *node256) addChild(b byte, child artNode) artNode {
    n.children[b] = child
    n.numChildren++
    return n
}

type leafNode struct {
    nodeBase
    key   []byte
    value []byte
}

func (l *leafNode) base() *nodeBase { return &l.nodeBase }
func (l *leafNode) isLeaf() bool    { return true }
func (l *leafNode) findChild(byte) *artNode {
    panic("findChild on leaf")
}
func (l *leafNode) addChild(byte, artNode) artNode {
    panic("addChild on leaf")
}

// Tree wrapper.
type Tree struct {
    rootMu sync.Mutex
    root   atomic.Value // artNode
}

func (t *Tree) Get(key []byte) ([]byte, bool) {
    for retries := 0; retries < 64; retries++ {
        n, _ := t.root.Load().(artNode)
        if n == nil {
            return nil, false
        }
        depth := 0
    walk:
        for {
            if n.isLeaf() {
                leaf := n.(*leafNode)
                if bytes.Equal(leaf.key, key) {
                    return leaf.value, true
                }
                return nil, false
            }
            b := n.base()
            v1 := b.version.Load()
            if v1&1 == 1 {
                runtime.Gosched()
                break walk
            }
            if !matchPrefix(b.prefix[:b.prefixLen], key, depth) {
                if b.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            depth += int(b.prefixLen)
            if depth >= len(key) {
                if b.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            childPtr := n.findChild(key[depth])
            if childPtr == nil {
                if b.version.Load() != v1 {
                    break walk
                }
                return nil, false
            }
            child := *childPtr
            if b.version.Load() != v1 {
                break walk
            }
            n = child
            depth++
        }
    }
    panic("too many retries")
}

func matchPrefix(prefix, key []byte, depth int) bool {
    for i := 0; i < len(prefix); i++ {
        if depth+i >= len(key) || prefix[i] != key[depth+i] {
            return false
        }
    }
    return true
}

This is ~250 lines. Missing: insertion (which handles splits and prefix updates), deletion (which handles node downsize), and concurrent insertion (which takes write latches and updates version counters).

A complete implementation is ~1500-2000 lines. In Go, expect 30-50% slower than tuned C++ ART. For most workloads this is acceptable; for the apex of throughput, use cgo to a C++ library.

Appendix BP: A walkthrough of FoundationDB's transactional semantics¶

FoundationDB's transactions are:

Serializable: every transaction appears to execute in some serial order.
Optimistic: no locking during reads or writes; conflicts detected at commit.
Time-bounded: a transaction must complete within 5 seconds (a configuration).
Read-versioned: each transaction's reads are at a single version (the version it received at begin).

The transaction lifecycle:

Client calls tr.Begin(). The client receives a read version from a Proxy.
Client reads keys via tr.Get(key). Reads return values at the read version.
Client writes via tr.Set(key, value) (or tr.Clear(range)). Writes are buffered locally.
Client calls tr.Commit(). The buffered writes are sent to the cluster.
The cluster's Resolver checks for conflicts: did any of this transaction's reads change since the read version?
If no conflicts, the cluster assigns a commit version. Writes are made durable.
Client receives the commit version.

The Resolver is a centralized component that checks conflicts in a sliding window. Throughput: ~100k tx/sec on a single Resolver, scaling horizontally with multiple Resolvers (sharded by key range).

For Go programmers using FoundationDB: the client library handles all this. You write straightforward Go code; FoundationDB ensures serializability.

For Go programmers building distributed transactional storage: study FoundationDB's design. It is unusually clean.

Appendix BQ: A meditation on snapshot isolation vs serializable¶

Two common isolation levels:

Snapshot Isolation (SI):

Each transaction reads from a consistent snapshot at its start.
Writes by other transactions during execution are not visible.
Write-write conflicts cause abort.
Read-write conflicts do NOT cause abort (this is the weakness).

Serializable (S):

Every transaction appears to execute in some serial order.
All conflicts (read-write, write-write) prevent inconsistent outcomes.
More expensive: requires either pessimistic locking or optimistic validation against read sets.

The classic anomaly that SI allows but S forbids: write skew.

Suppose two doctors on-call. The constraint: at least one must be on duty.

Transaction T1: doctor A reads (A on duty, B on duty) — invariant holds. Decides to go off duty. Writes (A off duty).
Transaction T2: doctor B reads (A on duty, B on duty) — invariant holds. Decides to go off duty. Writes (B off duty).

Both T1 and T2 commit under SI: no write-write conflict (they wrote different keys). After: both doctors off duty. Invariant violated.

Under S: T2's read of "A on duty" is invalidated by T1's commit, so T2 aborts. The invariant is preserved.

CockroachDB defaults to Serializable; PostgreSQL defaults to Read Committed (weaker than SI); MySQL defaults to Repeatable Read (also weaker than SI). Hekaton defaults to SI.

For Go applications: if you need correctness guarantees stronger than "I trust my application logic to handle anomalies," use Serializable. The performance cost is real but the bug-prevention benefit is real too.

Appendix BR: A reading list synthesis¶

If you read only one paper from this section: Lehman-Yao (1981). It is short, foundational, and changes how you think about concurrent trees.

If you read two: Lehman-Yao + Cha et al. (OLFIT, 2001). These together cover B-link and OCC, which underpin most modern designs.

If you read three: add Levandoski et al. (Bw-tree, 2013). This is the apex of in-memory B-tree design.

If you read four: add Leis et al. (ART, 2013). The competing structure for ordered in-memory data.

If you read more: dive into Hekaton (Diaconu 2013), MassTree (Mao 2012), CockroachDB design docs, and the McKenney RCU bible.

After reading these papers, you understand the modern in-memory concurrent ordered data structure landscape. You will not be surprised by any production design.

Appendix BS: Final reflections on Go for engine work¶

A long meditation: when is Go the right language for engine-grade work?

Yes:

Throughput target ≤ 1M ops/sec.
Latency target ≥ 10 μs.
Operational simplicity matters.
Team is more Go-fluent than C++/Rust.
GC pauses are acceptable.

No:

Throughput > 10M ops/sec.
Latency < 1 μs.
Memory layout precision matters.
Hardware-specific tuning (SIMD, NUMA-local allocation, persistent memory) is essential.

For most product work, Go is plenty fast. For database engines, exchanges, real-time systems — evaluate Rust or C++.

Hybrid approach: build the orchestration in Go, the hot path in Rust via FFI. This is what CockroachDB does (Go + Pebble Go), what TiDB does (Go + TiKV Rust), and what many large systems do.

Appendix BT: A vision of the future¶

In 2030 and beyond, expect:

AI-accelerated indexes (learned indexes maturing).
Disaggregated memory (CXL, large-scale shared memory).
Quantum-resistant cryptography affecting transaction security.
More languages targeting WASM, including database engines.
Edge databases with novel consistency models.

The fundamentals — concurrent ordered data structures — will remain relevant. New hardware will shift the trade-offs but not the principles.

Appendix BU: The very last word¶

If you have read all four files in this section, you have invested days of study. That investment compounds: it sharpens your judgment on every concurrency problem you face for the rest of your career.

Concurrent ordered data structures are not a single topic. They are a microcosm of all of systems programming: locking, lock-free, immutability, snapshots, garbage collection, hardware, latency, throughput, complexity, simplicity.

What you have built is not just knowledge of trees but a way of thinking about concurrent systems. Carry it.

When you encounter a new concurrent problem — whether trees, queues, sets, graphs, or something exotic — apply the same principles. Identify the read/write ratio. Look for snapshot opportunities. Consider COW. Profile before optimizing. Default to simplicity.

You will be a better engineer for it.

Welcome to the apex of concurrent trees in Go. The road continues — concurrent skip lists, concurrent hash maps, distributed consensus, persistent stores, beyond — but the fundamentals you have learned here will serve you everywhere.

Build well. Profile honestly. Ship reliably. Share what you learn.

And remember, when in doubt: tidwall/btree + publish/subscribe COW.

HTM transaction¶

xbegin (fallback addr)
  read cache lines A, B, C
  write cache line D
xend
  if no conflict: commit
  if conflict on A/B/C/D: abort, jump to fallback addr

fallback:
  take mutex
  do the same work
  release mutex