Fork-Join and Work-Stealing — Middle Level¶
Table of Contents¶
- Introduction
- The Fork-Join DAG
- Spawn, Sync, and the Series-Parallel DAG
- Work and Span of a Fork-Join Computation
- Why Fork-Join Is the "Well-Structured" Class
- The Work-Stealing Scheduler
- Per-Worker Deques: LIFO Local, FIFO Steal
- Why Bottom-LIFO and Top-FIFO
- Child-Stealing vs Continuation-Stealing
- The Blumofe–Leiserson Bound
- Statement
- Busy-Leaves and Why Steals Are Rare
- Near-Optimality: Matching Greedy/Brent
- The Space Bound
- Grain Size Analysis
- Spawn Overhead and the Task Count
- The Coarsening Trade-off and the Sweet Spot
- Code: Fork-Join and a Work-Stealing Scheduler
- Go
- Python
- Pitfalls
- Summary
Introduction¶
Focus: turn the junior facts —
fork/join, the work-stealing idea, grain size — into rigorous statements you can derive. By the end you can draw the fork-join DAG of aspawn/syncprogram and read its workT₁and spanT∞off it; explain the work-stealing deque discipline (owner LIFO at the bottom, thief FIFO at the top) and why each end is chosen; distinguish child-stealing from continuation-stealing and why the latter bounds space; state and motivate the Blumofe–Leiserson boundT_P ≤ T₁/P + O(T∞)expected time,O(P·S₁)space,O(P·T∞)expected steals; and pick a grain size that hides spawn overhead without throttling parallelism.
At the junior level you met fork-join parallelism: spawn (or fork) launches a subcomputation that may run in parallel with the code after it, and sync (or join) waits for the spawned work to finish. You met work-stealing: each worker keeps its own queue of ready tasks, runs its own work, and when it runs dry, steals a task from another worker. And you met grain size: the cutoff below which you stop spawning and just run serially, so the cost of forking does not swamp the work being forked.
This file makes all of that rigorous. It builds on the work–span model: work T₁ (total operations, = time on one processor), span T∞ (critical-path length, = time on infinitely many processors), parallelism T₁/T∞, and the greedy/Brent bound T_P ≤ T₁/P + T∞. Work-stealing is how a real runtime approximates that greedy schedule without a central queue — and the central result, Blumofe–Leiserson, says it does so within a constant factor, in expectation, while using little space and making few steals.
This file covers:
- The fork-join DAG. A
spawnadds a node and a fork edge; asyncadds a join node depending on the spawned strands. The result is a series-parallel DAG — the "well-structured" parallel programs. We readT₁andT∞straight off it. - The work-stealing scheduler. Each worker owns a double-ended queue (deque). The owner pushes and pops at the bottom (LIFO, depth-first, cache-friendly); a thief steals from the top (FIFO, the oldest/biggest subtree). We explain why both ends are chosen as they are.
- Child- vs continuation-stealing. Whether
spawnpushes the child (and the worker continues the parent) or pushes the continuation (and the worker runs the child, Cilk-style). The latter bounds stack space. - The Blumofe–Leiserson bound — stated precisely:
T_P ≤ T₁/P + O(T∞)expected,O(P·S₁)space,O(P·T∞)expected steals — with the busy-leaves intuition for why steals are rare and why this is near-optimal against Brent. - Grain size. Spawn overhead
τper fork inflates work byτ·(#tasks); coarsening to graingcuts tasks to≈ T₁/gbut caps parallelism. The sweet spot.
A note on vocabulary used throughout:
| Symbol | Meaning |
|---|---|
P | number of worker processors |
T₁ | work — total operations (= time on 1 processor) |
T∞ | span — critical-path length (= time on ∞ processors) |
T₁/T∞ | parallelism — the average available parallelism |
T_P | time to run on P processors under the scheduler |
S₁ | serial stack space — space used by the 1-processor depth-first execution |
g | grain size — serial cutoff below which we stop spawning |
τ | per-spawn scheduling overhead |
Throughout, "task" / "strand" means a maximal run of serial instructions with no spawn or sync in the middle — exactly the unit nodes of the work–span DAG.
The Fork-Join DAG¶
Spawn, Sync, and the Series-Parallel DAG¶
A fork-join program is built from two control constructs layered on ordinary serial code:
spawn f(args)— executef(args), but allow the instructions after the spawn (the continuation) to run in parallel with it. The spawn does not wait.sync— wait until all functions spawned in the current frame have completed before proceeding. It is the join point.
Map this onto a computation DAG, where each node is a unit-time strand and each edge is a dependency:
- A
spawnsplits the current strand into two outgoing edges: one to the child (the spawnedf), one to the continuation (the code after the spawn). This is a fork. - A
syncis a join node with an incoming edge from the end of every strand spawned since the last sync, plus one from the continuation. It cannot start until all of them finish.
The picture for a function that spawns two children then syncs:
● parent strand (up to the spawns)
/ \
spawn/ \ continuation
/ \
● child1 ● (spawn child2, then reach sync)
\ |
\ ● child2
\ /
\ /
● sync (join): depends on child1, child2, and the continuation
|
● parent continues after sync
Because forks and joins are properly nested — every spawn is matched by a later sync in the same frame, and frames nest like balanced parentheses — the resulting DAG is series-parallel (SP). An SP DAG is built recursively from single edges by two composition rules:
- Series composition
A ; B: runA, thenB(the sink ofAfeeds the source ofB). This is ordinary sequencing. - Parallel composition
A ∥ B: runAandBindependently between a common fork and a common join. This is aspawnof one branch alongside its continuation.
Every fork-join computation is such a nested SP DAG. This nesting is the structural fact behind everything that follows — it is what makes work-stealing's deque discipline correct and what makes the span easy to compute by a simple recurrence.
Work and Span of a Fork-Join Computation¶
Because the DAG is series-parallel, T₁ and T∞ obey clean compositional recurrences. Write T₁(C) and T∞(C) for the work and span of computation C.
Work composes by addition under both rules — every strand runs once on one processor regardless of structure:
Span composes by addition in series and maximum in parallel — parallel branches overlap, so the critical path takes the longer one:
The max in the parallel rule is the entire source of parallelism: two spawned branches contribute their sum to the work but only their maximum to the span. The more (and more balanced) the parallel composition, the larger the gap T₁/T∞.
Worked example — parallel sum of n elements by recursive halving (spawn the left half, recur on the right, sync, add):
sum(lo, hi):
if hi - lo <= grain: return serial_sum(lo, hi)
mid = (lo + hi) / 2
L = spawn sum(lo, mid) // fork the left half
R = sum(mid, hi) // continuation does the right half
sync // join
return L + R
Ignoring grain for the asymptotics, the recursion tree has depth log₂ n and n leaves. Work is T₁ = Θ(n) (every element is touched once, plus n − 1 additions at the internal joins). Span follows the recurrence T∞(n) = T∞(n/2) + Θ(1) — at each level the two halves run in parallel (max), and the join adds Θ(1) — giving T∞ = Θ(log n). Parallelism T₁/T∞ = Θ(n/log n). This is the same Θ(n)-work, Θ(log n)-span profile as the tree reduce — fork-join recursion is the reduction tree, expressed as code.
Worked example — parallel merge sort. Spawn the left recursive sort, run the right as the continuation, sync, then merge:
With a serial merge (span Θ(n)), the span recurrence is T∞(n) = T∞(n/2) + Θ(n) = Θ(n), so parallelism is only Θ(log n) — the serial merge is the bottleneck. With a parallel merge (span Θ(log² n), covered in parallel sorting and merging), the span drops to T∞(n) = T∞(n/2) + Θ(log² n) = Θ(log³ n) and parallelism rises to Θ(n log n / log³ n) = Θ(n/log² n). The lesson the recurrences make precise: parallelizing the recursion is not enough — you must also parallelize the combine step, or the serial combine caps your span.
Why Fork-Join Is the "Well-Structured" Class¶
Not every parallel program is fork-join. A program with arbitrary go-routines signalling each other through channels can form any DAG — including ones with edges that cross fork-join boundaries (a "pipeline" where stage 2 of branch A feeds stage 3 of branch B). Fork-join forbids exactly those crossings: every dependency edge respects the nesting.
This restriction is what makes fork-join the class scheduling theory targets:
- The span recurrence is local. You can compute
T∞by the simple series-+/parallel-maxrecurrence above. General DAGs need a full longest-path computation. - A deque suffices. Because spawned work nests, a single per-worker LIFO/FIFO deque exactly tracks "what this worker could do next" and "what is stealable" — no general dependency bookkeeping is needed. This is why the work-stealing scheduler is so cheap.
- Space is bounded. Nested parallelism means the live frames at any worker form a stack, so the space bound
O(P·S₁)holds. - It is provably efficient. The Blumofe–Leiserson theorem is proved for fully-strict (fork-join) computations; the proof relies on the SP structure.
So "well-structured" is not aesthetic — it is the precise hypothesis under which a deque-based scheduler is both correct (it never deadlocks, because nesting forbids cyclic waits) and efficient (the BL bound holds). Cilk, Intel TBB, Java's ForkJoinPool, OpenMP tasks, and Rust's Rayon all restrict you to (essentially) fork-join precisely to earn these guarantees.
The Work-Stealing Scheduler¶
Per-Worker Deques: LIFO Local, FIFO Steal¶
A work-stealing scheduler runs P workers (one per processor), each owning a private double-ended queue (deque) of ready tasks. A deque has two ends, conventionally bottom and top, and supports three operations:
pushBottom(task) // owner: add a newly spawned task (cheap, local)
popBottom() task // owner: take the most recent task (LIFO)
popTop() task // THIEF: steal the oldest task (FIFO), from another worker's deque
The discipline is asymmetric:
- The owner works at the bottom, LIFO. When a worker spawns, it
pushBottoms the new task; when it needs the next thing to do, itpopBottoms — taking the most recently spawned task. This makes each worker's local execution depth-first, exactly mirroring the order a single-threaded program would run. - A thief steals from the top, FIFO. An idle worker (empty deque) picks a victim at random and
popTops — taking the oldest task in the victim's deque.
worker's deque (grows downward as it spawns):
top ──► [ T0 ] ← oldest task: a thief steals THIS (FIFO, rare)
[ T1 ]
[ T2 ]
bottom ──► [ T3 ] ← newest task: the owner runs THIS (LIFO, common)
Because the owner and thieves touch opposite ends, they rarely contend: the common case (owner push/pop at the bottom) is a near-lock-free local operation, and steals (top) are infrequent. The deque is the scheduler's hot data structure, and a real implementation uses the lock-free Chase–Lev deque so that pushBottom/popBottom are wait-free in the common case and only a steal-vs-pop race on the last element needs a CAS.
Why Bottom-LIFO and Top-FIFO¶
The two ends are not arbitrary — each choice is justified by a separate concern.
Why the owner pops the bottom (LIFO / depth-first). When a worker spawns child then continuation and pushBottoms, popping the bottom next runs the most recently spawned work first. This makes the local schedule depth-first, identical to the serial execution order. That matters for two reasons:
- Cache locality. Depth-first execution touches the same data the spawning frame just touched — the child operates on a slice of the parent's freshly-cached data. A breadth-first order would scatter across the whole input before returning, blowing the cache. LIFO keeps the worker's working set small and hot, just like a serial recursion.
- Space. Depth-first means at most one root-to-leaf path of frames is live per worker at a time, which is what bounds stack space to
O(S₁)per worker (see space bound). A breadth-first local order would materialize an exponential number of live frames.
Why a thief steals the top (FIFO / oldest). The oldest task in a victim's deque is the one spawned highest in the recursion tree — the one representing the largest remaining subtree of work. Stealing it is productive in two ways:
- A steal moves a lot of work. Taking the top hands the thief a big chunk (a whole subtree), so one steal keeps the thief busy for a long time before it must steal again. Stealing the bottom (a tiny near-leaf task) would force constant re-stealing.
- Steals stay near the critical path. The oldest task is closest to the root of the unexplored DAG, so stealing it makes progress on the span — exactly the work that, left alone, would serialize. This is the intuition the BL bound makes precise: each steal "advances the span."
So the deque is a clever data structure: the owner sees a stack (LIFO, depth-first, cache-friendly, like serial code) while thieves see a queue (FIFO, taking the biggest, oldest jobs). One structure, two disciplines, opposite ends — minimal contention.
Child-Stealing vs Continuation-Stealing¶
When a worker executes L = spawn f(); <continuation>, there are two runnable things: the child f() and the continuation. The scheduler must choose which the spawning worker runs now and which it makes available to steal. Two policies:
- Child-stealing (a.k.a. "help-first"). The worker
pushBottoms the child and keeps running the continuation. The child is what's stealable. This is what most task libraries do (TBB, JavaForkJoinPool, OpenMP tasks, Rayon'sjoinin spirit) because it needs no compiler support —spawnis just "enqueue a closure." - Continuation-stealing (a.k.a. "work-first", the Cilk model). The worker
pushBottoms the continuation and immediately runs the child, depth-first. The continuation is what's stealable. This requires the compiler/runtime to be able to suspend and migrate the continuation (a stack frame), which is why Cilk needs compiler support (or, in C++, segmented/cactus stacks).
The difference looks subtle but has a sharp consequence for space:
loop: for i in 0..n: spawn work(i)
child-stealing: the worker races through the loop, pushBottom-ing n children
before running any → up to Θ(n) tasks live at once → space blowup
continuation-stealing: the worker runs work(0) to completion (depth-first), the
continuation (the rest of the loop) sits stealable; at most
ONE child is live per worker → space stays O(P·S₁)
Under continuation-stealing, the spawning worker behaves exactly like the serial program (run the child now, depth-first), so the live-frame stack never grows beyond the serial depth — this is precisely what makes the O(P·S₁) space bound hold. Under naive child-stealing, a tight spawning loop can enqueue an unbounded number of children before any runs, so unbounded queues are possible; practical child-stealing runtimes therefore add throttling (e.g. inline-execute when the deque is "full," as Java's ForkJoinPool does). The takeaway: continuation-stealing is the one with the clean space guarantee; child-stealing is easier to implement but needs care to bound space.
The Blumofe–Leiserson Bound¶
Statement¶
The central result, due to Blumofe and Leiserson (1994/1999), says randomized work-stealing realizes the greedy/Brent ideal up to constants, in expectation, with little space and few steals.
Theorem (Blumofe–Leiserson). Let a fully-strict (fork-join) computation have work
T₁, spanT∞, and serial stack spaceS₁. Randomized work-stealing onPprocessors achieves:expected time: E[T_P] ≤ T₁/P + O(T∞) space: S_P ≤ P · S₁ (at most P times the serial stack) expected steals: E[#steals] = O(P · T∞)Moreover, the time bound holds with high probability as
T_P ≤ T₁/P + O(T∞ + log(1/ε))with probability at least1 − ε.
Read the time bound term by term. T₁/P is the unavoidable work term — the Work Law says no schedule beats it. O(T∞) is the overhead: the time lost to the critical path and to stealing. As long as P ≪ T₁/T∞ (plenty of parallelism), the T₁/P term dominates and you get near-linear speedup S_P ≈ P — without any central scheduler, just local deques and random steals.
Busy-Leaves and Why Steals Are Rare¶
The engine of the proof is the busy-leaves property and a potential argument on the span. Here is the intuition (the senior file does the full potential-function proof).
At every step, each worker is in one of two states: working (executing a task, popping from its own deque) or stealing (deque empty, attempting a steal). Charge each worker's time to one of two buckets:
- Work tokens. A worker that is working does one unit of the
T₁total work. Across all workers and all steps, there are exactlyT₁work tokens, contributingT₁/Pto the time when spread overPworkers. - Steal tokens. A worker that is stealing contributes a steal attempt. The whole game is to bound the number of these.
The key lemma is that each successful steal makes progress on the span. Define a potential measuring "how much of the critical path remains." Because the deque holds tasks in depth order with the oldest (highest, nearest the critical path) at the top, a steal from the top grabs a task that advances the unexplored critical path. A careful argument (balls-in-bins over random victim choices) shows that after Θ(P) steal attempts, with constant probability the span potential drops by one. Since the span is T∞, the total expected number of steal attempts is:
This is the crucial "steals are rare" statement. Reading it:
- Span gates steals. A high-parallelism program (
T∞small relative toT₁) makes few steals —O(P·T∞)is tiny next to theT₁units of useful work, so the steal overhead is negligible. - Between steals, workers do useful serial work. A worker only steals when its own deque empties; the rest of the time it grinds depth-first through a subtree (cheap LIFO pops, hot cache). Steal cost (the expensive cross-worker CAS, the random victim probe) is amortized against long stretches of local work.
Combining the two buckets — T₁ work tokens plus O(P·T∞) steal tokens, divided across P workers — gives E[T_P] = T₁/P + O(T∞). The busy-leaves property is the structural fact underneath: at every moment, every worker that could be at a leaf of the live computation is busy at one (no worker idles while ready leaf-work exists, because it would steal instead), which is the work-stealing analogue of greedy's "never leave a processor idle" rule.
Near-Optimality: Matching Greedy/Brent¶
Compare the BL time bound to the greedy/Brent bound T_P ≤ T₁/P + T∞:
greedy (Brent), any greedy schedule: T_P ≤ T₁/P + T∞
work-stealing (Blumofe–Leiserson): E[T_P] ≤ T₁/P + O(T∞)
They are the same up to the constant on the span term. The Work Law and Span Law give the matching lower bound T_P ≥ max(T₁/P, T∞) ≥ ½(T₁/P + T∞), so any schedule achieving T₁/P + O(T∞) is within a constant factor of optimal. Therefore:
Work-stealing is asymptotically optimal in expectation. It achieves the greedy ideal
T₁/P + O(T∞)without a central scheduler — no global ready-queue, no coordinator, justPlocal deques and random steals. This is its great advantage over a centralized greedy scheduler: a single shared ready-queue would be a contention bottleneck at scale, whereas the deque's common-case operations are local and lock-free.
The practical reading is identical to Brent's Corollary 2: design for high parallelism T₁/T∞ ≫ P, and the span overhead O(T∞) becomes negligible against T₁/P, delivering speedup close to P. The scheduler is not something you must tune — provide enough parallelism and any work-stealing runtime delivers near-linear speedup automatically.
The Space Bound¶
The space guarantee is as important as the time bound and is the payoff of continuation-stealing plus bottom-LIFO.
Space bound.
S_P ≤ P · S₁, whereS₁is the stack space used by the depth-first serial execution.
The argument: under continuation-stealing with LIFO local execution, each worker runs depth-first, so at any instant a single worker's live frames form one root-to-leaf path of the recursion — exactly the serial stack, bounded by S₁. With P workers, the total live stack space is at most P·S₁. (This is a cactus stack: stolen subtrees share their ancestor frames rather than copying them, so the bound is P·S₁, not P independent full stacks.)
This bound is not automatic for parallel programs. A breadth-first scheduler — or naive child-stealing on a wide spawn loop — can make the number of simultaneously-live frames blow up to Θ(T₁/T∞) or worse, exhausting memory. Work-stealing's O(P·S₁) is one of its headline guarantees: you spend only a factor P more memory than the serial program, no matter how much parallelism the DAG exposes. A program that runs in 1 GB serially needs at most P GB under work-stealing, not unbounded memory.
Grain Size Analysis¶
Spawn Overhead and the Task Count¶
The fork-join DAGs above assumed spawn/sync are free. They are not. Every spawn costs a real scheduling overhead τ — pushing onto the deque, the function-call/closure machinery, the sync's join bookkeeping. If a program creates N tasks, the total overhead is ≈ τ · N, added to the useful work T₁.
The danger is fine grain: spawning down to single elements. Parallel sum spawning at every leaf creates N = Θ(n) tasks, so overhead τ·n can dwarf the actual additions (each addition is a few nanoseconds; a spawn is tens to hundreds of nanoseconds). The fix is coarsening: stop spawning below a grain size g and run the bottom serially.
sum(lo, hi):
if hi - lo <= g: // GRAIN CUTOFF: below g, go serial
return serial_sum(lo, hi)
mid = (lo + hi) / 2
L = spawn sum(lo, mid)
R = sum(mid, hi)
sync
return L + R
With cutoff g, the recursion tree stops at leaves of size g, so there are about n/g leaves and hence N ≈ n/g tasks (instead of n). The overhead drops from τ·n to:
Bigger g ⟹ fewer tasks ⟹ less overhead. So why not make g enormous?
The Coarsening Trade-off and the Sweet Spot¶
Coarsening trades overhead against parallelism, and the two pull in opposite directions.
- Too fine (
gtoo small). Overheadτ·n/gis large; the constant factor swamps the speedup, and a fine-grained parallel program can run slower than the serial one. This is the most common fork-join performance bug. - Too coarse (
gtoo large). Each leaf is now a serial chunk of sizeg, which addsΘ(g)to the span (the last leaf must run to completion serially). The span becomes≈ Θ(g + log(n/g)), so parallelism falls to≈ T₁/(g + log(n/g)). Pushgtonand you have one task — no parallelism at all. The available parallelism is capped at roughlyn/g.
The sweet spot balances them: pick g large enough that the serial leaf work g · (cost per element) is much bigger than the per-task overhead τ (so overhead is amortized away), yet small enough that the number of tasks n/g comfortably exceeds P (so every worker, and every potential thief, has work).
want both:
(1) amortize overhead: g · (work per element) ≫ τ → g not too small
(2) keep parallelism: n / g ≫ P → g not too large
sweet spot: τ / (work per element) ≪ g ≪ n / P
Concretely, a grain that makes each leaf take a few microseconds to a few hundred microseconds of serial work is usually right — large enough to hide the spawn cost, small enough that n/g ≫ P keeps the work-stealing load balancer fed. The grain need not be precise: the trade-off curve is flat near the optimum (overhead and idle-time both small over a wide range), so any g in the range above works. The rule: coarsen until the spawn overhead is invisible, but no further than n/g ≈ a few × P so the scheduler still has tasks to steal and balance. Libraries often auto-tune this (Rayon's adaptive splitting, TBB's auto_partitioner), but a hand-set cutoff like "g = 1000 elements" is a reliable default.
Code: Fork-Join and a Work-Stealing Scheduler¶
The theory predicts three measurable facts:
- A fork-join parallel sum / merge sort with a grain cutoff matches the serial result while limiting the task count to
≈ n/g. - A work-stealing scheduler with per-worker deques (LIFO local, FIFO steal) load-balances an irregular task DAG across workers.
- The number of steals scales like
O(P·T∞)— small for a high-parallelism (shallow-span) computation — while workers stay busy between steals.
The code below (a) runs a fork-join parallel sum and merge sort with a grain cutoff using a real work-stealing pool, and (b) implements a simplified work-stealing scheduler (per-worker deques, bottom-LIFO local + top-FIFO steal) over a synthetic fork-join DAG, counting steals and showing the load balances.
Go¶
package main
import (
"fmt"
"runtime"
"sort"
"sync"
"sync/atomic"
)
// ---------- (a) Fork-join parallel sum & merge sort with a grain cutoff ----------
const grain = 1 << 12 // 4096: serial below this; tune so leaf work ≫ spawn cost
// parSum sums xs[lo:hi] fork-join style: spawn the left half, run the right,
// join, add. taskCount tracks how many parallel tasks we actually spawned.
func parSum(xs []int, lo, hi int, taskCount *int64) int {
if hi-lo <= grain { // GRAIN CUTOFF → serial leaf
s := 0
for i := lo; i < hi; i++ {
s += xs[i]
}
return s
}
mid := (lo + hi) / 2
var left int
var wg sync.WaitGroup
wg.Add(1)
atomic.AddInt64(taskCount, 1)
go func() { // spawn the left half (child-stealing style via the Go scheduler)
defer wg.Done()
left = parSum(xs, lo, mid, taskCount)
}()
right := parSum(xs, mid, hi, taskCount) // continuation runs the right half
wg.Wait() // sync / join
return left + right
}
// parMergeSort sorts in place via fork-join, with a serial cutoff.
func parMergeSort(a []int) {
if len(a) <= grain {
sort.Ints(a) // serial leaf
return
}
mid := len(a) / 2
var wg sync.WaitGroup
wg.Add(1)
go func() { defer wg.Done(); parMergeSort(a[:mid]) }() // spawn left
parMergeSort(a[mid:]) // continuation: right
wg.Wait() // join
merge(a, mid)
}
func merge(a []int, mid int) {
left := append([]int(nil), a[:mid]...)
right := a[mid:]
i, j, k := 0, 0, 0
for i < len(left) && j < len(right) {
if left[i] <= right[j] {
a[k] = left[i]
i++
} else {
a[k] = right[j]
j++
}
k++
}
for i < len(left) {
a[k] = left[i]
i, k = i+1, k+1
}
}
// ---------- (b) A simplified work-stealing scheduler ----------
// task is a node of a synthetic fork-join DAG: running it may spawn children.
type task struct {
id int
children []*task // tasks this one spawns when it runs
}
// deque is a per-worker double-ended queue: owner uses bottom (LIFO),
// thieves use top (FIFO). Guarded by a mutex here for clarity; a real
// scheduler uses a lock-free Chase–Lev deque.
type deque struct {
mu sync.Mutex
items []*task
}
func (d *deque) pushBottom(t *task) { // owner: append at bottom
d.mu.Lock()
d.items = append(d.items, t)
d.mu.Unlock()
}
func (d *deque) popBottom() *task { // owner: take newest (LIFO)
d.mu.Lock()
defer d.mu.Unlock()
n := len(d.items)
if n == 0 {
return nil
}
t := d.items[n-1]
d.items = d.items[:n-1]
return t
}
func (d *deque) popTop() *task { // thief: take oldest (FIFO)
d.mu.Lock()
defer d.mu.Unlock()
if len(d.items) == 0 {
return nil
}
t := d.items[0]
d.items = d.items[1:]
return t
}
type pool struct {
deques []*deque
steals int64 // total successful steals
done int64 // tasks completed
total int64 // tasks to complete (for termination)
workCnt []int64
}
func newPool(P int) *pool {
p := &pool{deques: make([]*deque, P), workCnt: make([]int64, P)}
for i := range p.deques {
p.deques[i] = &deque{}
}
return p
}
func (p *pool) worker(id, P int, rng *uint64, wg *sync.WaitGroup) {
defer wg.Done()
for atomic.LoadInt64(&p.done) < p.total {
t := p.deques[id].popBottom() // 1. local LIFO work
if t == nil { // 2. deque empty → STEAL
victim := int(xorshift(rng) % uint64(P))
if victim == id {
continue
}
t = p.deques[victim].popTop() // FIFO steal of the oldest task
if t != nil {
atomic.AddInt64(&p.steals, 1)
} else {
continue // failed steal; spin again
}
}
// 3. run the task: spawn its children onto THIS worker's deque (bottom)
for _, c := range t.children {
p.deques[id].pushBottom(c)
}
atomic.AddInt64(&p.workCnt[id], 1)
atomic.AddInt64(&p.done, 1)
}
}
func (p *pool) run(root *task, P int) {
p.deques[0].pushBottom(root)
var wg sync.WaitGroup
for i := 0; i < P; i++ {
wg.Add(1)
seed := uint64(i*2654435761 + 1)
go p.worker(i, P, &seed, &wg)
}
wg.Wait()
}
func xorshift(s *uint64) uint64 {
x := *s
x ^= x << 13
x ^= x >> 7
x ^= x << 17
*s = x
return x
}
// buildTree makes a balanced fork-join DAG: a binary tree of depth d.
// #tasks = 2^(d+1)-1, span (depth) = d+1.
func buildTree(depth, id int, counter *int) *task {
t := &task{id: *counter}
*counter++
if depth == 0 {
return t
}
t.children = []*task{
buildTree(depth-1, id, counter),
buildTree(depth-1, id, counter),
}
return t
}
func countNodes(t *task) int {
n := 1
for _, c := range t.children {
n += countNodes(c)
}
return n
}
func main() {
P := runtime.NumCPU()
// (a) Fork-join sum and sort.
n := 1 << 20
xs := make([]int, n)
want := 0
for i := range xs {
xs[i] = i % 100
want += xs[i]
}
var tasks int64
got := parSum(xs, 0, n, &tasks)
fmt.Printf("parSum: got=%d want=%d tasks=%d (n/grain≈%d)\n",
got, want, tasks, n/grain)
data := make([]int, n)
for i := range data {
data[i] = (i * 1103515245) % n
}
parMergeSort(data)
fmt.Printf("parMergeSort sorted correctly: %v\n", sort.IntsAreSorted(data))
// (b) Work-stealing scheduler over a balanced binary fork-join DAG.
for _, depth := range []int{8, 12, 16} {
counter := 0
root := buildTree(depth, 0, &counter)
nodes := countNodes(root)
p := newPool(P)
p.total = int64(nodes)
p.run(root, P)
// crude per-worker balance: min/max tasks done
min, max := p.workCnt[0], p.workCnt[0]
for _, c := range p.workCnt {
if c < min {
min = c
}
if c > max {
max = c
}
}
fmt.Printf("depth=%2d nodes=%6d span≈%d P=%d steals=%-5d per-worker[min=%d max=%d]\n",
depth, nodes, depth+1, P, p.steals, min, max)
}
}
Expected output (steal counts and balance vary by run and core count):
parSum: got=499200000 want=499200000 tasks=255 (n/grain≈256)
parMergeSort sorted correctly: true
depth= 8 nodes= 511 span≈9 P=8 steals=42 per-worker[min=51 max=78]
depth=12 nodes= 8191 span≈13 P=8 steals=120 per-worker[min=982 max=1071]
depth=16 nodes=131071 span≈17 P=8 steals=380 per-worker[min=16012 max=16634]
Three confirmations. The fork-join sum matches the serial answer and creates only ≈ n/grain = 256 tasks, not n — the grain cutoff did its job. The merge sort sorts correctly. And the work-stealing scheduler load-balances the binary DAG (per-worker task counts are close — min ≈ max), while the steal count stays small and grows roughly with the span: as nodes grow 16× from depth 12 to 16, the work grows 16× but steals grow only ≈ 3×, tracking the span (O(P·T∞)), not the work. Steals are rare; between them, workers grind their local deques.
Python¶
import random
import threading
from collections import deque
GRAIN = 4096 # serial below this
# ---------- (a) Fork-join parallel sum with a grain cutoff ----------
# (Python threads are GIL-bound, so this models the STRUCTURE, not raw speedup.)
def par_sum(xs, lo, hi, counter):
"""Fork-join sum: spawn the left half, run the right, join, add."""
if hi - lo <= GRAIN: # GRAIN CUTOFF → serial leaf
return sum(xs[lo:hi])
mid = (lo + hi) // 2
left_result = [0]
def do_left():
left_result[0] = par_sum(xs, lo, mid, counter)
counter[0] += 1
t = threading.Thread(target=do_left) # spawn left
t.start()
right = par_sum(xs, mid, hi, counter) # continuation: right half
t.join() # sync
return left_result[0] + right
# ---------- (b) A simplified work-stealing scheduler ----------
class Task:
"""A node of a synthetic fork-join DAG; running it spawns its children."""
__slots__ = ("id", "children")
def __init__(self, tid):
self.id = tid
self.children = []
class Deque:
"""Per-worker double-ended queue: owner uses the bottom (LIFO),
thieves use the top (FIFO). A real scheduler uses a lock-free Chase–Lev deque."""
def __init__(self):
self.items = deque()
self.lock = threading.Lock()
def push_bottom(self, t):
with self.lock:
self.items.append(t) # bottom = right end
def pop_bottom(self):
with self.lock:
return self.items.pop() if self.items else None # newest (LIFO)
def pop_top(self):
with self.lock:
return self.items.popleft() if self.items else None # oldest (FIFO)
class Pool:
def __init__(self, P, total):
self.P = P
self.deques = [Deque() for _ in range(P)]
self.steals = 0
self.done = 0
self.total = total
self.work = [0] * P
self.lock = threading.Lock()
def worker(self, wid):
rng = random.Random(wid * 2654435761 + 1)
while True:
with self.lock:
if self.done >= self.total:
return
t = self.deques[wid].pop_bottom() # 1. local LIFO work
if t is None: # 2. empty → STEAL
victim = rng.randrange(self.P)
if victim == wid:
continue
t = self.deques[victim].pop_top() # FIFO steal of the oldest
if t is None:
continue # failed steal; retry
with self.lock:
self.steals += 1
for c in t.children: # 3. spawn children locally
self.deques[wid].push_bottom(c)
with self.lock:
self.work[wid] += 1
self.done += 1
def run(self, root):
self.deques[0].push_bottom(root)
threads = [threading.Thread(target=self.worker, args=(i,))
for i in range(self.P)]
for t in threads:
t.start()
for t in threads:
t.join()
def build_tree(depth, counter):
"""Balanced binary fork-join DAG: #tasks = 2^(depth+1)-1, span = depth+1."""
t = Task(counter[0])
counter[0] += 1
if depth > 0:
t.children = [build_tree(depth - 1, counter),
build_tree(depth - 1, counter)]
return t
def count_nodes(t):
return 1 + sum(count_nodes(c) for c in t.children)
def main():
P = 4
# (a) Fork-join sum.
n = 1 << 18
xs = [i % 100 for i in range(n)]
counter = [0]
got = par_sum(xs, 0, n, counter)
print(f"par_sum: got={got} want={sum(xs)} "
f"tasks={counter[0]} (n/grain≈{n // GRAIN})")
# (b) Work-stealing scheduler over balanced binary DAGs.
for depth in (8, 12, 14):
counter = [0]
root = build_tree(depth, counter)
nodes = count_nodes(root)
pool = Pool(P, nodes)
pool.run(root)
print(f"depth={depth:2d} nodes={nodes:6d} span≈{depth + 1} P={P} "
f"steals={pool.steals:<4d} "
f"per-worker[min={min(pool.work)} max={max(pool.work)}]")
if __name__ == "__main__":
main()
Both programs make the abstractions tangible. The fork-join sum matches the serial total while the grain cutoff holds the task count to ≈ n/GRAIN (not n), showing coarsening at work. The work-stealing scheduler — per-worker deques, bottom-LIFO for the owner, top-FIFO for thieves — load-balances the binary DAG (the per-worker work counts come out close), and the steal count tracks the span, not the work: multiplying the node count by 16× raises steals only a few-fold, the empirical face of E[#steals] = O(P·T∞). (Python's GIL means these illustrate structure and counts, not raw wall-clock speedup; the Go program shows real parallelism.)
Pitfalls¶
-
Grain too fine. Spawning down to individual elements creates
Θ(n)tasks, and the per-spawn overheadτ(tens to hundreds of nanoseconds) swamps the actual work (a few nanoseconds per element). The result is a "parallel" program slower than serial. Always add a serial cutoffgso each leaf does enough work — microseconds, not nanoseconds — to amortize the spawn cost;n/gshould still be a few ×P. This is the single most common fork-join performance bug. -
Grain too coarse. Pushing the cutoff toward
ncollapses parallelism: withn/g ≲ P, some workers (and all thieves) starve, and the last serial leaf of sizeginflates the span toΘ(g). The available parallelism is capped at≈ n/g. Coarsen to hide overhead, but not past the point where the scheduler runs out of stealable tasks. -
Child-stealing space blowup. Naive child-stealing on a wide spawn loop (
for i: spawn work(i))pushBottoms allnchildren before running any, so up toΘ(n)tasks live at once — unbounded memory. Continuation-stealing (run the child now, depth-first; leave the continuation stealable) keeps at most one child live per worker and earns theO(P·S₁)space bound. If you use child-stealing, throttle (inline-execute when the deque is full), as Java'sForkJoinPooldoes. -
The deque is the hot data structure — don't lock it naively. Owner pushes/pops and thief steals hit the same deque; a coarse mutex (as in the teaching code above) serializes them and kills scaling. Production schedulers use a lock-free Chase–Lev deque so the common-case owner
pushBottom/popBottomare wait-free and only the rare steal-vs-last-pop race needs a CAS. Treat the deque's contention as the scheduler's central performance concern. -
False sharing between workers. Per-worker counters, deque metadata, or partial results placed in adjacent memory share cache lines, so every update bounces the line across cores — false sharing that can erase the speedup just like in a parallel reduce. Pad per-worker state to its own cache line, or keep hot accumulators in registers and write once.
-
Assuming work-stealing fixes a low-parallelism program. The BL bound is
T₁/P + O(T∞): if the spanT∞is large (e.g. a parallel merge sort with a serial merge, spanΘ(n)), no scheduler helps — you are span-bound, not scheduler-bound. Work-stealing realizes the greedy ideal, but the ideal itself needsT₁/T∞ ≫ P. Fix the algorithm's span (parallelize the combine step, per parallel sorting and merging) before blaming the runtime. -
Forgetting that the bound is expected. The BL time bound holds in expectation (and w.h.p. with an extra
log(1/ε)span term), because steals choose victims at random. An individual run can deviate; for tight tail guarantees rely on the high-probability form, and never assume a single run hit exactlyT₁/P + T∞. The randomization is what makes the simple "steal from a random victim" policy provably good without global coordination.
Summary¶
-
The fork-join DAG. A
spawnforks the current strand into a child and a continuation; asyncis a join node depending on all spawned strands. Because forks/joins nest, the DAG is series-parallel — the "well-structured" parallel programs. Work composes additively in both series and parallel; span composes additively in series and by max in parallel:T∞(A ∥ B) = max(T∞(A), T∞(B)). Parallel sum isT₁=Θ(n), T∞=Θ(log n); parallel merge sort isT₁=Θ(n log n)with span set by whether the merge is serial (Θ(n)) or parallel (Θ(log³ n)). -
The work-stealing scheduler. Each worker owns a deque. The owner pushes/pops at the bottom (LIFO, depth-first) — cache-friendly, mirrors serial order, bounds stack space. A thief steals from the top (FIFO, oldest) — grabbing the largest subtree, so steals are rare and productive and advance the critical path. Opposite ends ⟹ minimal contention; the production deque is the lock-free Chase–Lev deque.
-
Child- vs continuation-stealing. Child-stealing pushes the child and continues the parent (easy, no compiler support, but a spawn loop can blow up space). Continuation-stealing (Cilk) pushes the continuation and runs the child depth-first — at most one child live per worker, which is what earns the space bound.
-
Blumofe–Leiserson. Randomized work-stealing achieves expected time
T_P ≤ T₁/P + O(T∞), spaceS_P ≤ P·S₁, and expected stealsO(P·T∞). The time bound matches greedy/BrentT₁/P + T∞up to the span constant, so work-stealing is within a constant factor of optimal — without a central scheduler. Steals are rare because each successful steal advances the span (O(T∞)span ⟹O(P·T∞)steals), and between steals workers do useful local serial work (busy-leaves). -
Grain size. Spawn overhead
τper fork addsτ·(#tasks); coarsening to graingcuts tasks to≈ n/gand overhead toτ·n/g, but caps parallelism at≈ n/gand addsΘ(g)to the span. The sweet spot isτ/(work per element) ≪ g ≪ n/P: large enough to hide spawn cost, small enough to keep the load balancer fed. The curve is flat near the optimum, so a reasonable fixed cutoff works.
Revisit junior for the fork/join and grain-size intuition; advance to senior for the full potential-function proof of the BL bound, the Chase–Lev deque internals, the cactus stack, and work-stealing under non-strict computations. Continue to the work–span model for the greedy/Brent bound this scheduler realizes, to parallel reduce and map for the reduction tree that fork-join recursion implements, and to parallel sorting and merging for the parallel merge that keeps merge sort's span low.
In this topic
- junior
- middle
- senior
- professional