D-Ary Heap — Mathematical Foundations and Complexity Theory¶

The d-ary heap generalises the binary heap by allowing each internal node to have up to d children for an integer d >= 2. The structure preserves the implicit-array layout that makes binary heaps cache-friendly, but it changes the height of the tree from log_2 n to log_d n and, consequently, the trade-off between the cost of push (sift-up) and pop (sift-down). This document develops the formal definition, derives the index arithmetic, proves the loop invariant of sift-down, gives a tight O(n) bound for build-heap that exposes the constant (d - 1), derives the workload-optimal d, analyses external-memory and cache behaviour, states the matching comparison-based lower bound, and surveys variants such as the d-ary soft heap.

Table of Contents¶

Formal Definition — d-ary heap as a d-ary complete tree with the order invariant
Generalised Index Arithmetic
Correctness Proof — sift-down loop invariant
Tight O(n) Build-Heap Bound for Branching Factor d
Workload-Optimal Choice of d — minimising alpha * log_d n + beta * d * log_d n
Cache-Aware Analysis — alignment factor
Lower Bound — Omega(log n / log d) per push
Comparison with Binary Heap (constant factors)
d-Ary Soft Heap and Related Variants
Open Problems
Summary

1. Formal Definition¶

A d-ary heap of size n is a pair (T, key) where T is a rooted tree on the vertex set V = {0, 1, ..., n - 1} and key : V -> K is a labelling into a totally ordered key universe (K, <=) such that:

(Shape invariant) T is a complete d-ary tree: every level except possibly the last is fully populated with d children per internal node, and the last level is filled from the left without gaps.
(Order invariant, min-heap) For every non-root v with parent p(v), key(p(v)) <= key(v).

Equivalently, the order invariant states that for every internal node u,

key(u) <= min { key(c) : c is a child of u }.

The completeness condition together with the parent-child arithmetic of Section 2 makes the tree implicit: it is stored as a contiguous array of length n, with no pointers. The labelling key is precisely the array content.

Height. A complete d-ary tree on n nodes has height

h(n) = floor(log_d ((d - 1) * n + 1)) - 1     for d >= 2
     = Theta(log_d n).

For d = 2 this collapses to the familiar floor(log_2 n).

Reference. CLRS (Cormen, Leiserson, Rivest, Stein), Introduction to Algorithms, 3rd ed., Chapter 6, Problem 6-2 introduces d-ary heaps and asks the reader to derive the height and the cost of EXTRACT-MIN and DECREASE-KEY.

2. Generalised Index Arithmetic¶

Place the root at index 0 (zero-based). For an internal node i, its d children occupy the contiguous slots

child_k(i) = d * i + k + 1,    for k = 0, 1, ..., d - 1.

The parent of any non-root node j >= 1 is

parent(j) = floor((j - 1) / d).

Derivation. Number the nodes in breadth-first order. Level L (root at level 0) contains the indices [ (d^L - 1) / (d - 1), (d^(L+1) - 1) / (d - 1) ). A node at level-offset o on level L has index i = (d^L - 1) / (d - 1) + o. Its k-th child sits at level-offset d * o + k on level L + 1, hence at index

(d^(L+1) - 1) / (d - 1) + d * o + k.

Subtracting 1, dividing by d, and using (d^(L+1) - 1)/(d - 1) = d * (d^L - 1)/(d - 1) + 1 gives child_k(i) = d * i + k + 1. The parent formula is obtained by inverting: from j = d * i + k + 1 with 0 <= k <= d - 1, we have i = floor((j - 1) / d).

Sibling range. The full sibling set of j is the contiguous interval

[ d * parent(j) + 1, d * parent(j) + d ] intersect [0, n).

This contiguity is critical: a sift-down step scans this range with one tight loop and one branch predictor hint, accessing at most d consecutive array elements.

One-based variant. Some textbooks (and the Boost d_ary_heap source) place the root at index 1, giving

child_k(i) = d * (i - 1) + k + 1,    parent(j) = floor((j - 2) / d) + 1.

The asymptotics are identical; the zero-based form is preferred in modern C++ and Rust implementations.

3. Correctness Proof — Sift-Down Loop Invariant¶

The sift-down procedure restores the order invariant after the root is replaced (e.g. by the last element during EXTRACT-MIN).

SIFT-DOWN(A, i, n):
    while true:
        m <- i                                     // candidate for minimum
        for k in 1 .. d:
            c <- d * i + k
            if c < n and A[c] < A[m]: m <- c
        if m == i: return
        swap A[i], A[m]
        i <- m

Theorem. If, before the call, the subtree rooted at every proper descendant of i satisfies the order invariant, then after SIFT-DOWN(A, i, n) the subtree rooted at i satisfies it as well.

Proof (loop invariant). Define the predicate P(i):

P(i): every child c of i satisfies (a) c is the root of a heap, and
      (b) if A[i] <= A[c] then the subtree rooted at i is a heap.

We show P(i) holds at the top of every iteration.

Initialisation. By hypothesis, every proper descendant of the initial i roots a heap; in particular, each child of i does so. Hence (a) holds. Statement (b) is the standard "root dominates children implies whole tree is a heap" lemma, proved by induction on subtree size.

Maintenance. Suppose P(i) holds and the loop body finds an index m != i with A[m] = min { A[i], A[child_1(i)], ..., A[child_d(i)] }. After swapping A[i] and A[m], the new value at i is the former minimum, which is less than or equal to the value at every child of i. By (b), it remains to show that the subtree at m (now containing the old A[i]) still satisfies the hypothesis of P(m). The children of m are unchanged, so (a) for m is inherited from the initial assumption. The new A[m] may violate the order relation with its children, but that is precisely what the next iteration will repair. Hence P(m) holds for the next iteration.

Termination. The variable i strictly increases at each iteration (children have larger indices), and is bounded by n - 1. The loop must terminate, and when it does either m == i (the order at i is already satisfied, so by (b) the whole subtree is a heap) or i has no children (a leaf is trivially a heap).

This completes the proof. The same scheme proves SIFT-UP symmetrically with the invariant "every ancestor strictly above i roots a heap of its own left-and-right subtrees".

4. Tight O(n) Build-Heap Bound for Branching Factor d¶

BUILD-HEAP calls SIFT-DOWN on every internal node from the last to the first. The number of internal nodes is at most ceil(n / d).

Step cost. A single SIFT-DOWN call at a node of subtree-height h performs at most h swaps and at most (d - 1) * h comparisons (each level scans d children, of which one is the parent in the next iteration, so d - 1 strictly new comparisons per level).

Node count by height. In a complete d-ary tree on n nodes, the number of nodes whose subtree has height exactly h is at most

n_h <= ceil( n / d^(h + 1) ).

Justification. A node at subtree-height h is the root of a complete d-ary subtree of >= d^h nodes (taking the convention that a leaf has height 0). Distinct such roots have disjoint subtrees, so their count is at most n / d^h. The tighter bound n / d^(h + 1) follows from the observation that together with each height-h node we may charge its d children of height h - 1, giving d^(h+1) "owned" descendants. (See CLRS Lemma 6.3 for the binary case; the generalisation is verbatim with 2 replaced by d.)

Total comparisons.

T(n) <= sum_{h = 0}^{h_max}  n_h * (d - 1) * h
      <= sum_{h = 0}^{infty} ceil(n / d^(h+1)) * (d - 1) * h
      <= n * (d - 1) / d  *  sum_{h = 0}^{infty}  h / d^h.

The series S(d) = sum_{h >= 0} h / d^h = d / (d - 1)^2 for d > 1 (standard derivative-of-geometric-series identity). Substituting,

T(n) <= n * (d - 1) / d  *  d / (d - 1)^2  =  n / (d - 1).

Conclusion. BUILD-HEAP runs in

T(n) <= n / (d - 1)   comparisons     =   O(n)   for any fixed d >= 2.

For d = 2 we recover the classical bound T(n) <= n. For larger d the constant shrinks: d = 4 gives n / 3, and d = 16 gives n / 15. This is one of the strongest reasons to prefer a d-ary heap for one-shot heapification of a large array.

5. Workload-Optimal Choice of d¶

Consider a sequence of operations with alpha * N pushes and beta * N pops (alpha + beta = 1). The amortised cost per operation is

C(d) = alpha * c_push(d) + beta * c_pop(d)
     = alpha * log_d n             // sift-up: one comparison per level
       + beta * (d - 1) * log_d n  // sift-down: d - 1 comparisons per level
     = (alpha + beta * (d - 1)) * log n / log d.

We minimise C(d) over real d > 1. Let f(d) = (alpha + beta * (d - 1)) / log d. Taking the derivative and setting it to zero:

f'(d) = [ beta * log d  -  (alpha + beta * (d - 1)) / d ] / (log d)^2 = 0
   <=> beta * d * log d  =  alpha + beta * (d - 1)
   <=> beta * d * (log d - 1)  =  alpha - beta.

Balanced workload (alpha = beta = 1/2). The right side is 0, so log d = 1, giving

d* = e ~ 2.718.

Since d must be an integer, the practical optimum is d = 2 or d = 3. The binary heap is within a constant factor of optimal for balanced workloads, which is why it remains the textbook default.

Push-heavy workload — Dijkstra and Prim. Dijkstra's algorithm on a graph with V vertices and E edges performs V pops (one per settled vertex) and up to E decrease-key calls (each implemented as a push in lazy-deletion variants). Setting alpha = E / (E + V) and beta = V / (E + V), the analysis of CLRS Chapter 24 yields total cost

O( (V + E) * log_d V  +  V * (d - 1) * log_d V ).

Differentiating with respect to d gives the famous choice

d* = max(2, ceil(E / V)).

For a graph with average degree k, picking d = k makes both terms balanced and gives total cost O((E + V) * log_{E/V} V), asymptotically faster than the binary-heap O((E + V) * log V) whenever E = omega(V).

Tarjan attributes this observation to Johnson (1977) and discusses it in Data Structures and Network Algorithms (CBMS-NSF Regional Conference Series in Applied Mathematics, 1983).

Boost reference. The Boost C++ Libraries boost::heap::d_ary_heap documentation explicitly exposes boost::heap::arity<N> as a configurable template parameter and notes that "for cache-friendly operation, arity in the range 4 to 8 is usually optimal on modern hardware".

6. Cache-Aware Analysis — Alignment Factor¶

In the external-memory model of Aggarwal and Vitter (1988), the machine has internal memory of M words and transfers blocks of B consecutive words at a time. The cost metric is the number of block transfers (I/Os).

Sift path length in blocks. A d-ary heap of n elements has height log_d n. If each level fits within a single block, the entire sift path is log_d n distinct nodes, but consecutive nodes on the path may share blocks. A more precise count is

I/Os per pop = O( log_d n / log_d(M / B) )      when d is small enough that
                                                a parent and all its d children
                                                fit in O(d / B) blocks.

Alignment factor. If d * sizeof(key) <= B, every sift-down step reads exactly one cache line for the d children: a perfectly aligned, sequential access pattern that the hardware prefetcher serves at maximum bandwidth. The constant in the I/O bound improves by a factor of d compared to a binary heap.

Implication. Pick d so that d * sizeof(key) matches the cache-line size B. On x86-64 with B = 64 bytes and 8-byte keys, this gives d = 8; on Apple Silicon (B = 128) it gives d = 16. Both values agree with empirical benchmarks of d_ary_heap implementations.

Caveat — root pollution. The root is accessed on every operation and is therefore always cached. Asymptotically, the I/O complexity is dominated by the leaf-side of the path, which is amortised across many operations and benefits the most from large d.

For an explicit cache-oblivious construction, see Frigo, Leiserson, Prokop, and Ramachandran, "Cache-Oblivious Algorithms", FOCS 1999. They show that the funnel-heap matches the lower bound Theta((1/B) * log_{M/B}(N/B)) per amortised op, which a fixed-d d-ary heap cannot match for all M/B.

7. Lower Bound — Omega(log n / log d) per Push¶

Any comparison-based priority queue must spend Omega(log n) comparisons per ordered output (otherwise sorting n elements via n push + n pop would beat the Omega(n log n) decision-tree bound).

Branching factor and depth. A d-ary heap of n elements has depth log_d n. A push performs at most one comparison per level, hence at most log_d n comparisons. So pushes are cheaper than pops, but only because the work is shifted: each pop performs (d - 1) * log_d n comparisons.

Tightness. Summing across n pushes and n pops,

total comparisons >= n * log_2 n      (sorting lower bound)
total work in d-ary heap = n * log_d n + n * (d - 1) * log_d n
                        = n * d * log_d n
                        = n * d * log n / log d.

For d = 2 this is 2 n log_2 n, twice the lower bound; for d = e it is e * n * log n / 1 = e n log n / log 2 = e n log_2 n ~ 2.71 n log_2 n. The per-pop lower bound is Omega(log n) independent of d, and the per-push lower bound is Omega(log n / log d), achieved by the d-ary heap.

Thus, the d-ary heap is not asymptotically faster than the binary heap when both push and pop work is counted; it merely redistributes that work between the two operations and reduces constants by exploiting cache alignment.

8. Comparison with Binary Heap (Constant Factors)¶

Operation	Binary heap (`d = 2`)	d-ary heap (general `d`)
`push` (sift-up)	`log_2 n` comparisons	`log_d n` comparisons
`pop` (sift-down)	`2 * log_2 n` comparisons	`(d - 1) * log_d n` comparisons
`build-heap`	`<= n` comparisons	`<= n / (d - 1)` comparisons
Height	`log_2 n`	`log_d n`
Cache lines per sift step	`1` (two children fit)	`ceil(d * sizeof(key) / B)`
Suitable `d` for x86-64 (64-byte line, 8-byte key)	`2`	`4` or `8`

Pop cost ratio. For fixed n, the ratio of comparisons per pop is

((d - 1) * log_d n) / (2 * log_2 n)  =  (d - 1) * log 2 / (2 * log d).

For d = 4: ratio is 3 * 0.693 / (2 * 1.386) = 0.75, a 25 percent saving in comparisons. For d = 8: ratio is 7 * 0.693 / (2 * 2.079) = 1.17, a 17 percent penalty in comparisons — but offset by the cache effect.

Push cost ratio. log_d n / log_2 n = log 2 / log d = 1 / log_2 d. For d = 4 this is 0.5; for d = 8, 0.33. Pushes become dramatically cheaper.

The soft heap of Chazelle (J. ACM 47(6), 2000) is a comparison-based priority queue that supports insert, meld, and extract-min in amortised O(1) time at the cost of introducing controlled errors: at any point a parameter epsilon in (0, 1/2] allows up to epsilon * n keys to be reported with a corrupted value greater than their true value.

d-ary variant. Chazelle's original construction uses binomial-style trees, but a d-ary analogue exists (Kaplan and Zwick, "A simpler implementation and analysis of Chazelle's soft heaps", SODA 2009). The d-ary soft heap admits a tunable trade-off:

insert: O(log(1/epsilon))         amortised
delete-min: O(log(1/epsilon))     amortised
total corrupted keys: <= epsilon * I, where I is the total number of inserts.

Applications.

Chazelle's O(m * alpha(m, n)) minimum spanning tree algorithm uses a soft heap to find approximately-minimum edges and then "cleans up" the corruption.
Pettie and Ramachandran's optimal deterministic MST algorithm (2002) builds on the same primitive.
Selection in linear time without partitioning (Chazelle's lecture notes, Princeton, 2003).

A d-ary soft heap variant has been explored for parallel and cache-oblivious settings, but it is not, to date, asymptotically better than the binary construction.

10. Open Problems¶

Cache-oblivious d-ary heap. A funnel heap matches the optimal I/O complexity Theta((1/B) * log_{M/B}(N/B)) per amortised op, but its implementation is complex. Is there a d-ary heap with a fixed structural parameter that matches this I/O bound on every machine without knowing B?
Deterministic O(1) decrease-key for d > 2. Fibonacci heaps achieve O(1) amortised decrease-key, but only for the binary-tree-list structure. Whether a d-ary structure with constant-amortised decrease-key exists and matches the cache properties of the standard d-ary heap is open. The strict Fibonacci heap of Brodal, Lagogiannis, and Tarjan (ICALP 2012) is a partial answer in the worst-case setting but is not array-implicit.
Tight lower bound including I/Os. The Aggarwal-Vitter sort bound gives a per-op I/O lower bound of Omega((1/B) * log_{M/B}(N/B)). The d-ary heap I/O cost includes a multiplicative log_d n / log_d(M/B) factor. Closing the gap for every (M, B, n, d) configuration remains open.
Optimal d under random workloads. If the push/pop mix is drawn from a distribution rather than fixed, the expected-optimal d may differ from the worst-case d* = e. A precise characterisation is not in the literature.
d-ary monotone priority queues. Thorup's O(m + n * log log n) SSSP bound (J. ACM 1999) uses a custom integer priority queue. Whether a d-ary analogue matches both the cache and the integer-key bound is unknown.

11. Summary¶

The d-ary heap is a parameterised generalisation of the binary heap. Setting d larger than 2 trades a slightly more expensive sift-down ((d - 1) * log_d n comparisons instead of 2 * log_2 n) for a strictly cheaper sift-up (log_d n instead of log_2 n) and a strictly cheaper build-heap (n / (d - 1) instead of n). The shallower tree (height log_d n) and the contiguity of sibling slots make the structure markedly cache-friendly: choosing d so that d * sizeof(key) matches the cache-line size produces the largest measurable speed-up over the binary heap on modern hardware.

The workload-optimal d minimises (alpha + beta * (d - 1)) * log n / log d; for balanced workloads the real-valued optimum is e, so d in {2, 3} is practically optimal. For Dijkstra's algorithm and other push-heavy graph algorithms, the optimum is d ~ E / V, and the resulting cost O((E + V) * log_{E/V} V) is asymptotically faster than the binary-heap bound whenever the graph is dense.

The comparison-based lower bound Omega(log n) per ordered output still applies; the d-ary heap merely redistributes that work between push and pop. For asymptotically faster amortised behaviour one must move to Fibonacci-style or soft-heap structures, at the cost of giving up the implicit array representation that makes the d-ary heap so attractive in cache-bound contexts.

Key references.

Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms, 3rd ed., MIT Press, 2009. Chapter 6, Problem 6-2 ("Analysis of d-ary heaps").
Boost C++ Libraries, boost::heap::d_ary_heap documentation, https://www.boost.org/doc/libs/release/doc/html/boost/heap/d_ary_heap.html
Chazelle, B. "The soft heap: an approximate priority queue with optimal error rate." Journal of the ACM 47(6):1012-1027, 2000.
Johnson, D. B. "Efficient algorithms for shortest paths in sparse networks." Journal of the ACM 24(1):1-13, 1977.
Tarjan, R. E. Data Structures and Network Algorithms. SIAM, 1983.
Aggarwal, A.; Vitter, J. S. "The input/output complexity of sorting and related problems." Communications of the ACM 31(9):1116-1127, 1988.
Frigo, M.; Leiserson, C. E.; Prokop, H.; Ramachandran, S. "Cache-oblivious algorithms." FOCS 1999.
Kaplan, H.; Zwick, U. "A simpler implementation and analysis of Chazelle's soft heaps." SODA 2009.