Parallel Reduce and Map — Practice Tasks¶
Coding tasks are solved in one language (Go or Python) with the full reference solution; a short snippet in the other language is provided where it clarifies the port. Where marked [coding], build the map/reduce round-by-round (or chunk-by-chunk), drive a work counter and a span counter alongside it, and confirm the measured work/span match the bound. The acceptance test is always the same shape: the parallel result equals the sequential fold (for an associative operator) and the measured work and span match the derived bound (
Θ(n)/Θ(1)for map,Θ(n)/Θ(log n)for tree reduce). [analysis] tasks need no code: prove a monoid law, derive a work/span, or bound a floating-point error — model derivations are provided so you can grade yourself.
A map applies a pure function f to every element of a sequence independently: y_i = f(x_i). A reduce folds a sequence into one value with a binary operator ⊕: r = x₀ ⊕ x₁ ⊕ … ⊕ x_{n−1}. Together — map then reduce — they express an enormous fraction of all data-parallel computation: dot products, norms, histograms, word counts, shortest-path relaxations, reachability. The two cost profiles you will build, measure, and prove on every task:
| Primitive | Work T₁ | Span T∞ | Shape |
|---|---|---|---|
| Map | Θ(n) | Θ(1) | n independent applications of f, no dependencies |
| Sequential reduce | Θ(n) (n−1 ops) | Θ(n) | one left-to-right fold |
| Tree reduce | Θ(n) (n−1 ops) | Θ(log n) | balanced binary combine tree |
| Two-level reduce | Θ(n) | Θ(n/p + log p) | p partials in parallel, then combine the p |
The recurring discipline for every coding task is identical: run the parallel primitive round by round (or chunk by chunk), increment a work counter for every ⊕ or f, bump a span counter once per critical-path level, and confirm the output matches the sequential fold while the counters respect the bound. A reduce you never check against the sequential answer is a guess; a work/span you measure but never bracket with the bound is just a number. Tie the two together on every task.
Related practice: - Models: PRAM and Work–Span tasks — the work/span model, Brent's bound, the reduction tree's Θ(n)-work Θ(log n)-span derivation, and the Ω(log n) reduction-span lower bound that floors every reduce here. - Parallel Prefix Sum / Scan tasks — scan is reduce that keeps all the partials; the up-sweep of Blelloch's scan is exactly the tree reduce you build below.
This topic's notes: junior · middle · senior · professional
A note on the model and quantities used throughout: - Unit operation. One ⊕ (or one f) is one unit of work. Work T₁ is the total number of operations; span T∞ is the number of parallel rounds — the critical-path depth, the time on infinitely many processors. - A monoid. A reduce is well-defined in parallel exactly when ⊕ is associative with an identity e — i.e. (S, ⊕, e) is a monoid. Associativity is what lets you re-bracket ((a⊕b)⊕c)⊕d into (a⊕b)⊕(c⊕d) and evaluate the two halves in parallel; the identity is what lets an empty chunk contribute nothing. Commutativity is not required for a tree reduce that preserves left-to-right order — but it is required if you want to combine partials in arrival order (Task 11). - Map is the easy half. Map has no inter-element dependency: every f(x_i) is independent, so work Θ(n), span Θ(1), parallelism Θ(n) — embarrassingly parallel. All the algorithmic interest is in the reduce's combine. - The idealization. Work and span charge nothing for communication or synchronization. The all-reduce tasks (10–11) re-introduce communication explicitly, because on a real cluster the messages, not the ⊕s, dominate.
Beginner Tasks¶
Task 1 — Parallel map, and verify it equals the sequential map [coding]¶
[easy] Build the embarrassingly-parallel half first: apply a pure function f to every element across workers (goroutines or processes), and verify the result equals the sequential [f(x) for x in xs]. Drive a work counter (one increment per f) and a span counter; confirm work = n and span = 1 — map has no dependency chain, so its span is constant regardless of n.
Python¶
from multiprocessing import Pool
def parallel_map(f, xs, P):
"""y[i] = f(x[i]), computed across P workers. Work = n, span = 1 (no deps)."""
with Pool(P) as pool:
return pool.map(f, xs) # Pool chunks xs across P workers
def map_counted(f, xs):
"""Single-process map with work/span counters for the cost proof."""
work = 0
out = []
for x in xs:
out.append(f(x))
work += 1 # one f per element
span = 1 if xs else 0 # all f's are independent -> one round
return out, work, span
def square(x): # must be top-level so Pool can pickle it
return x * x
if __name__ == "__main__":
import random
random.seed(0)
for n in (0, 1, 1000, 100_000):
xs = [random.randint(-50, 50) for _ in range(n)]
seq = [square(x) for x in xs]
par = parallel_map(square, xs, P=4) if n else []
out, work, span = map_counted(square, xs)
assert par == seq, "parallel map must equal sequential map"
assert out == seq and work == n and span == (1 if n else 0)
print("OK: parallel map = sequential map; work = n, span = 1 (embarrassingly parallel)")
Go (core)¶
// ParallelMap applies f to every element across P goroutines. Work n, span 1.
func ParallelMap[A, B any](xs []A, f func(A) B, P int) []B {
out := make([]B, len(xs)) // each index written by exactly one worker
chunk := (len(xs) + P - 1) / P
var wg sync.WaitGroup
for p := 0; p < P; p++ {
lo, hi := p*chunk, (p+1)*chunk
if hi > len(xs) {
hi = len(xs)
}
if lo >= hi {
break
}
wg.Add(1)
go func(lo, hi int) {
defer wg.Done()
for i := lo; i < hi; i++ {
out[i] = f(xs[i]) // distinct index -> race-free
}
}(lo, hi)
}
wg.Wait()
return out
}
- Constraints:
fmust be pure (no shared mutable state) so thenapplications are independent. Each output index is written by exactly one worker — no shared accumulator, no race. The result must be identical to the sequential map, element for element and in order. Work isn, span is1. - Hint: Map is the canonical embarrassingly parallel primitive: the DAG is
nisolated nodes, so the span isΘ(1)and the parallelism isΘ(n)— you can keep as many processors busy as you have elements. In Python, the mapped function must be picklable (top-level, not a lambda) formultiprocessing; in Go anyfuncworks. The interesting cost is always in the reduce, which is why every later task focuses there. - Acceptance test:
parallel_map(f, xs, P)equals[f(x) for x in xs]for every input (including empty and single-element);work = n,span = 1. Keepparallel_map—map-then-reduce(Task 3) feeds its output straight into the reduce.
Task 2 — Tree reduce, round by round, with work and span counters [coding]¶
[easy] Implement a balanced binary tree reduce: pair adjacent elements, combine each pair with ⊕, halve the count, repeat until one value remains. Drive a work counter (one increment per ⊕) and a span counter (one increment per round), then confirm the result equals the sequential left fold and the counters match T₁ = n − 1 and T∞ = ⌈log₂ n⌉.
Python¶
from math import log2, ceil
def tree_reduce(xs, op, identity):
"""Balanced binary reduce. Returns (result, work, span).
work counts every op; span counts rounds (the combine-tree depth)."""
if not xs:
return identity, 0, 0
level = list(xs)
work = span = 0
while len(level) > 1:
nxt = []
i = 0
while i + 1 < len(level):
nxt.append(op(level[i], level[i + 1])) # combine a pair
work += 1
i += 2
if i < len(level): # odd one out carries up unchanged
nxt.append(level[i])
level = nxt
span += 1 # one round = one tree level
return level[0], work, span
def seq_reduce(xs, op, identity):
"""Sequential left fold: the ground truth. Work n-1 ops, span n."""
acc = identity
for x in xs:
acc = op(acc, x)
return acc
if __name__ == "__main__":
import operator, random
random.seed(1)
for n in (1, 2, 8, 16, 17, 1000, 4096):
xs = [random.randint(0, 9) for _ in range(n)]
got, work, span = tree_reduce(xs, operator.add, 0)
assert got == seq_reduce(xs, operator.add, 0), "tree reduce must match sequential fold"
assert work == n - 1, f"work must be exactly n-1, got {work}"
assert span == ceil(log2(n)), f"span must be ceil(log2 n), got {span}"
print(f"n={n:5d} work={work:5d} (=n-1) span={span:2d} (=ceil(log2 n))")
print("OK: tree reduce = sequential fold; work = n-1, span = ceil(log2 n)")
Go (core)¶
// TreeReduce combines adjacent pairs each round until one value remains.
// Returns (result, work, span): work = n-1 ops, span = ceil(log2 n) rounds.
func TreeReduce(xs []int, op func(int, int) int, id int) (res, work, span int) {
if len(xs) == 0 {
return id, 0, 0
}
level := append([]int(nil), xs...)
for len(level) > 1 {
var nxt []int
i := 0
for ; i+1 < len(level); i += 2 {
nxt = append(nxt, op(level[i], level[i+1]))
work++
}
if i < len(level) { // odd element carries up
nxt = append(nxt, level[i])
}
level = nxt
span++
}
return level[0], work, span
}
- Constraints:
opmust be associative withidentitya two-sided neutral element. Combine adjacent pairs so the left-to-right order is preserved (this makes the result equal the sequential fold even when⊕is not commutative). The work is exactlyn − 1(every reduce ofnelements needsn − 1combines, tree or not); the span is⌈log₂ n⌉(the combine-tree depth). An odd element each round carries up unchanged — no extra⊕. - Hint: Why
n − 1combines regardless of shape: each⊕reduces the element count by one, and you must go fromnvalues to1, so exactlyn − 1combines — the tree is work-efficient, matching the sequential fold'sn − 1. Why⌈log₂ n⌉span: each round halves the count, so afterrrounds at most⌈n/2^r⌉values remain, reaching1when2^r ≥ n. The whole gain over the sequential fold is span:Θ(log n)instead ofΘ(n), at identical work. - Acceptance test: Output equals
seq_reducefor everyn(including non-powers of two andn = 1);work = n − 1;span = ⌈log₂ n⌉. This is the canonical work-efficient, low-span reduce — Task 6 proves the bounds on paper, Task 4 builds the two-level variant for a fixed processor count.
Task 3 — Dot product as map-then-reduce [coding]¶
[easy] The archetypal map-reduce: the dot product ⟨a, b⟩ = Σ_i a_i · b_i is a map (pairwise products p_i = a_i · b_i) followed by a reduce (sum the products with +). Build it by composing Task 1's map and Task 2's tree reduce, verify it equals the sequential dot product, and confirm the cost is Θ(n) work and Θ(log n) span (the map's Θ(1) span is dominated by the reduce's Θ(log n)).
Python¶
import operator
from math import log2, ceil
def dot_map_reduce(a, b):
"""⟨a,b⟩ as map (products) then tree-reduce (sum). Returns (value, work, span)."""
assert len(a) == len(b)
products = [a[i] * b[i] for i in range(len(a))] # MAP: work n, span 1
from_tree = tree_reduce(products, operator.add, 0) # REDUCE: work n-1, span log n
value, r_work, r_span = from_tree
work = len(a) + r_work # n products + (n-1) adds
span = (1 if a else 0) + r_span # map round + reduce rounds
return value, work, span
def dot_seq(a, b):
return sum(a[i] * b[i] for i in range(len(a)))
if __name__ == "__main__":
import random
random.seed(2)
for n in (1, 2, 16, 17, 1000):
a = [random.randint(-9, 9) for _ in range(n)]
b = [random.randint(-9, 9) for _ in range(n)]
val, work, span = dot_map_reduce(a, b)
assert val == dot_seq(a, b), "dot product must match the sequential sum-of-products"
assert work == n + (n - 1), f"work = n (products) + (n-1) (adds), got {work}"
assert span == 1 + ceil(log2(n)), f"span = 1 (map) + ceil(log2 n) (reduce), got {span}"
print("OK: dot product = map(products) then reduce(sum); work Theta(n), span Theta(log n)")
- Constraints: The map produces the
nelement-wise products independently; the reduce sums them with associative+. Reuse the tree reduce from Task 2 (or the parallel map from Task 1) — do not re-implement. Total work isn + (n − 1) = Θ(n); total span is1 + ⌈log₂ n⌉ = Θ(log n), the map round plus the reduce tree. - Hint: This is the shape of almost every numeric reduction: a per-element transform (map) feeding an associative combine (reduce). A norm
‖x‖² = Σ x_i²ismap(square)thenreduce(+); a mean isreduce(+) / n; a histogram ismap(bucket)thenreduce(merge). Map fuses into the reduce's leaves — you never need to materialize the product array at all (Task 8 makes that explicit), so a real implementation doesnmultiply-adds in one pass withΘ(1)extra space. - Acceptance test:
dot_map_reduce(a, b)equalsdot_seq(a, b)for every input;work = 2n − 1;span = 1 + ⌈log₂ n⌉. The map-then-reduce composition is the template you will reuse for the semiring product (Task 9) and the all-reduce (Tasks 10–11).
Task 4 — Two-level reduce for a fixed processor count p [coding]¶
[easy] Production reduces are two-level: split the array into p chunks, reduce each chunk sequentially on its own worker (the p partials computed in parallel), then tree-reduce the p partials. Implement it for a fixed p, verify it equals the sequential fold, and confirm the span is Θ(n/p + log p) — the chunk fold plus the combine of partials — while the work stays n − 1.
Python¶
import operator
from math import log2, ceil
def two_level_reduce(xs, op, identity, p):
"""Split into p chunks, reduce each sequentially (in parallel across chunks),
then tree-reduce the p partials. Returns (result, work, span_model)."""
n = len(xs)
if n == 0:
return identity, 0, 0
p = min(p, n)
chunk = (n + p - 1) // p
partials, work = [], 0
max_chunk_len = 0
for c in range(0, n, chunk):
seg = xs[c:c + chunk]
acc = identity
for x in seg:
acc = op(acc, x)
work += 1 # chunk fold: len(seg) ops, but identity is free
# the identity-start adds one no-op per chunk; subtract it to count real combines
partials.append(acc)
max_chunk_len = max(max_chunk_len, len(seg))
work -= len(partials) # remove the p identity-folds (e op x0 = x0)
top, top_work, top_span = tree_reduce(partials, op, identity)
work += top_work # (p - 1) combines of partials
# Span model: longest chunk fold (n/p) in parallel, THEN log2(p) combine rounds.
span = max_chunk_len + ceil(log2(len(partials))) if len(partials) > 1 else max_chunk_len
return top, work, span
if __name__ == "__main__":
import random
random.seed(3)
for n in (1, 50, 1000, 4096):
for p in (1, 2, 8, 64):
xs = [random.randint(0, 9) for _ in range(n)]
got, work, span = two_level_reduce(xs, operator.add, 0, p)
assert got == seq_reduce(xs, operator.add, 0), f"mismatch n={n} p={p}"
assert work == n - 1, f"work must be n-1 real combines, got {work}"
# Span ~ n/p + log p: dominated by the longest chunk plus the partial-combine tree.
eff_p = min(p, n)
assert span <= (n + eff_p - 1) // eff_p + ceil(log2(max(2, eff_p)))
print("OK: two-level reduce = sequential fold; work = n-1; span ~ n/p + log p")
- Constraints: Each chunk is reduced sequentially (a left fold over
≈ n/pelements) — thesepfolds run in parallel. Then theppartials are combined with a tree reduce (span⌈log₂ p⌉). The total real combines are stilln − 1(thee ⊕ x₀that starts each chunk fold is a free identity application, not a combine). The modeled span is(longest chunk length) + ⌈log₂ p⌉ = Θ(n/p + log p). - Hint: This is how every CPU/GPU reduction is actually mapped onto
pcores: then/p-length chunk fold is cache-friendly serial code (no synchronization inside a chunk), and only the tiny⌈log₂ p⌉-deep tree over the partials needs cross-core coordination. Choosingp ≈ √(n / log n)minimizesn/p + log p, but in practicep= core count andn/pdominates. The work is unchanged — two-level reduce is a scheduling of the samen − 1combines onto the critical pathn/p + log p, which Brent's bound (see the work–span tasks) predicts forpprocessors. - Acceptance test: Output equals
seq_reducefor every(n, p);work = n − 1;span ≤ ⌈n/p⌉ + ⌈log₂ p⌉. Place the span next to Task 2's pure tree reduce: atp = nthe chunk length is1and you recover theΘ(log n)tree; atp = 1it is a singleΘ(n)fold.
Intermediate Tasks¶
Task 5 — Segmented reduce and reduce-by-key [coding]¶
[medium] Real data arrives in groups. A segmented reduce folds within flagged segments in one parallel pass; reduce-by-key is the same idea keyed by an arbitrary label (the engine behind GROUP BY and word count). Implement both: a segmented reduce over a head-flag array and a reduce-by-key, and verify each against a straightforward sequential grouping.
Python¶
import operator
from collections import OrderedDict
def segmented_reduce(vals, heads, op, identity):
"""Reduce within segments. heads[i]==1 starts a new segment. heads[0] must be 1.
Returns one reduced value per segment, in order. (Sequential reference; the
parallel form is a segmented scan whose last-of-segment values are these.)"""
out, acc, started = [], identity, False
for i, v in enumerate(vals):
if heads[i]:
if started:
out.append(acc) # close the previous segment
acc, started = identity, True
acc = op(acc, v)
if started:
out.append(acc) # close the final segment
return out
def reduce_by_key(keys, vals, op, identity):
"""Combine all values sharing a key. Returns dict key -> reduced value.
Order-independent only if op is commutative+associative; here keys group
arbitrary positions, so op MUST be commutative (e.g. +, max, min)."""
acc = OrderedDict()
for k, v in zip(keys, vals):
acc[k] = op(acc.get(k, identity), v)
return acc
if __name__ == "__main__":
import random
random.seed(4)
# Segmented reduce.
for n in (1, 8, 64, 257):
vals = [random.randint(0, 9) for _ in range(n)]
heads = [1] + [1 if random.random() < 0.25 else 0 for _ in range(n - 1)]
got = segmented_reduce(vals, heads, operator.add, 0)
# Reference: split at head positions and sum each run.
segs, cur = [], []
for i, v in enumerate(vals):
if heads[i] and cur:
segs.append(cur); cur = []
cur.append(v)
segs.append(cur)
assert got == [sum(s) for s in segs], "segmented reduce must match per-segment sums"
# Reduce-by-key (word count).
words = "the cat sat on the mat the cat ran".split()
counts = reduce_by_key(words, [1] * len(words), operator.add, 0)
assert counts["the"] == 3 and counts["cat"] == 2 and counts["ran"] == 1
print("OK: segmented reduce matches per-segment folds; reduce-by-key counts words")
- Constraints: For the segmented reduce,
heads[0]must be1(the first element always opens a segment); a value belongs to the segment opened by the most recent head flag.opmust be associative (and, for the in-segment fold, the segment order is preserved so commutativity is not needed). For reduce-by-key, values sharing a key may come from arbitrary positions, soopmust be commutative and associative (+,max,min,OR) — order of combination is not under your control. - Hint: Segmented reduce is the projection of a segmented scan: run a segmented scan over the flag-value monoid and keep only the last value of each segment. That is why both parallelize at
Θ(n)work /Θ(log n)span — they are one scan. Reduce-by-key is the reduce half of map-reduce in the MapReduce sense:mapemits(key, value)pairs, the framework groups by key,reducefolds each group. Word count ismap(w → (w, 1))thenreduce-by-key(+). - Acceptance test: The segmented reduce equals the list of per-segment folds for random flag patterns; reduce-by-key produces the correct per-key totals (verified on a word-count example). Both are the grouped form of reduce that real data pipelines run.
Task 6 — Monoid laws and the reduction work/span bound [analysis]¶
[medium] Make the parallel reduce rigorous. Prove that associativity is exactly what lets a tree reduce equal the sequential fold (and why a non-associative ⊕ breaks it), prove the tree reduce does n − 1 combines and has depth ⌈log₂ n⌉, and derive the Ω(log n) span floor.
No code. Use this as the grading model.
Setup — the monoid. (S, ⊕, e) is a monoid when ⊕: S×S → S is associative — (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) for all a, b, c — and e is a two-sided identity — e ⊕ a = a ⊕ e = a. A reduce over a monoid is well-defined independent of bracketing: by associativity, every way of parenthesizing x₀ ⊕ x₁ ⊕ … ⊕ x_{n−1} yields the same value.
Claim 1 — any tree reduce equals the sequential fold. The sequential fold computes the fully-left-bracketed value ((…((x₀ ⊕ x₁) ⊕ x₂) … ) ⊕ x_{n−1}). A balanced tree computes some balanced bracketing, e.g. (x₀ ⊕ x₁) ⊕ (x₂ ⊕ x₃). Proof: induction on n using associativity. For n ≤ 2 both bracketings are identical. For n > 2, any bracketing splits as L ⊕ R where L brackets a prefix and R a suffix; by the induction hypothesis each equals the sequential fold of its part, and a finite number of associativity rewrites move the split point, so L ⊕ R equals the full left fold. Hence all bracketings agree — the tree result equals the sequential fold. ∎
Why non-associativity breaks it (preview of Task 7). If ⊕ is not associative, (a ⊕ b) ⊕ c ≠ a ⊕ (b ⊕ c) for some inputs, so the tree's bracketing and the fold's bracketing can produce different results. Subtraction is the cleanest witness: (8 − 3) − 2 = 3 but 8 − (3 − 2) = 7. The parallel result is then a function of the tree shape, which is exactly the bug Task 7 exhibits and Task 11 must defend against for floating-point +.
Claim 2 — work is n − 1. Each ⊕ takes two values and returns one, reducing the count by exactly one. To go from n values to 1 requires exactly n − 1 reductions — independent of tree shape. So T₁ = n − 1 = Θ(n), matching the sequential fold: the tree reduce is work-efficient.
Claim 3 — span is ⌈log₂ n⌉. In the balanced tree, round r combines disjoint pairs, so at most ⌈n / 2^r⌉ values survive after r rounds. The count reaches 1 at the smallest r with 2^r ≥ n, i.e. r = ⌈log₂ n⌉. Each round depends on the previous (a parent needs both children), so the critical path is ⌈log₂ n⌉ and T∞ = Θ(log n).
Claim 4 — the Ω(log n) lower bound. On a binary-fan-in model (each ⊕ reads two operands), define reach_t(c) = the input elements that can influence a cell after t rounds. Initially |reach_0| ≤ 1; each round a value combines two cells, so |reach_t| ≤ 2·max|reach_{t−1}| ≤ 2^t. The final result depends on all n inputs (changing any x_j changes x₀ ⊕ … ⊕ x_{n−1} for, e.g., + over the reals), so 2^{T∞} ≥ n, giving T∞ ≥ log₂ n = Ω(log n). The tree reduce meets this bound, so it is span-optimal. (Full treatment in the work–span tasks.)
- Constraints: State the monoid axioms; prove that associativity makes all bracketings agree (so the tree equals the fold), exhibiting a non-associative counterexample; prove
T₁ = n − 1(shape-independent) andT∞ = ⌈log₂ n⌉; derive theΩ(log n)floor via the doubling-reach argument and note the tree meets it. - Acceptance test: Associativity is named as the exact license for re-bracketing; subtraction is given as a non-associative witness; the work bound
n − 1is argued from "each⊕removes one value"; the span and lower bound are bothΘ(log n), matching the counters measured in Task 2.
Task 7 — Non-associative operators: parallel ≠ sequential [coding]¶
[medium] Associativity is not a formality — drop it and the parallel result diverges from the sequential one, and worse, depends on the tree shape. Demonstrate it two ways: a genuinely non-associative integer operator (subtraction), and the subtler real-world case of floating-point addition, where + is associative on the reals but not on floats because each add rounds.
Python¶
import operator, random
def left_fold(xs, op):
acc = xs[0]
for x in xs[1:]:
acc = op(acc, x)
return acc
def pairwise(xs, op):
"""Balanced-tree bracketing (recursive split)."""
if len(xs) == 1:
return xs[0]
mid = len(xs) // 2
return op(pairwise(xs[:mid], op), pairwise(xs[mid:], op))
if __name__ == "__main__":
# 1) Subtraction: non-associative, so fold != tree, by a lot.
xs = [8, 3, 2, 5]
f, t = left_fold(xs, operator.sub), pairwise(xs, operator.sub)
# fold: ((8-3)-2)-5 = -2 ; tree: (8-3) - (2-5) = 5 - (-3) = 8
assert f == -2 and t == 8, (f, t)
print(f"subtraction: left_fold={f} pairwise_tree={t} -> DIFFERENT (not associative)")
# 2) Float addition: associative on R, NOT on floats (each add rounds).
random.seed(5)
data = [1e16] + [1.0] * 100 + [-1e16] # the small 1.0's vanish next to 1e16
fold = left_fold(data, operator.add) # ((1e16 + 1) + 1)... loses the 1.0's, then -1e16
tree = pairwise(data, operator.add) # sums the 1.0's together first -> keeps ~100
print(f"float sum: left_fold={fold:.1f} pairwise_tree={tree:.1f} (true sum = 100.0)")
assert fold != tree, "float + is order-dependent: parallel tree != sequential fold"
# Pairwise is closer to the true 100.0 because it adds like-magnitude values first.
assert abs(tree - 100.0) < abs(fold - 100.0)
print("OK: non-associative (sub) and non-associative-on-floats (+) give parallel != sequential")
- Constraints: Show a strict difference for subtraction (
left_foldvspairwisegive different integers — exact, no tolerance). For floats, construct a catastrophic-cancellation input (a huge value, many small ones, then the huge value's negative) where the fold and the tree give visibly different sums, and show the balanced tree is closer to the true answer because it combines like-magnitude values first. Do not claim float+is associative — it is associative onℝbut the rounding after each operation breaks it onfloat64. - Hint: Two distinct lessons. (1) For a genuinely non-associative
⊕(subtraction, division, "first non-null"), a parallel reduce is simply wrong — there is no fix but to restructure the computation into an associative one (e.g. reduce the signed terms, then subtract once). (2) For float+, the reduce is not deterministic across tree shapes: the math is "associative enough" that the answer is useful, but two runs with different chunking give bit-different results — the reproducibility problem Task 11 solves with a fixed deterministic tree. Pairwise summation is also more accurate (Task 12), so the parallel form is often better, just not bit-identical to the naive sequential one. - Acceptance test: Subtraction's fold and tree differ exactly (
-2vs8for[8,3,2,5]); the float case showsleft_fold ≠ pairwise_treewith the tree nearer the true sum. The takeaway is concrete: a reduce is only parallelizable over a true monoid; float+parallelizes but sacrifices bit-reproducibility.
Task 8 — Map fusion: compose two maps without the intermediate array [coding]¶
[medium] A pipeline map(g) ∘ map(f) materializes an intermediate array of f(x_i) only to consume it immediately. Fusion rewrites it as a single map(g ∘ f) — one pass, one allocation, same result. Implement both, verify identical output, and confirm fusion does one pass over the data instead of two (and, when followed by a reduce, fuses the map into the reduce's leaves so the intermediate never exists).
Python¶
class CountingList(list):
"""A list that counts how many times its elements are read (one 'pass' = n reads)."""
reads = 0
def __getitem__(self, i):
CountingList.reads += 1
return super().__getitem__(i)
def map_map_unfused(f, g, xs):
"""Two passes: build f(x) array, then map g over it."""
mid = [f(xs[i]) for i in range(len(xs))] # pass 1: n reads of xs, n writes
return [g(mid[i]) for i in range(len(mid))] # pass 2: n reads of mid
def map_fused(f, g, xs):
"""One pass: apply g(f(x)) per element. No intermediate array."""
return [g(f(xs[i])) for i in range(len(xs))]
def map_reduce_fused(f, xs, op, identity):
"""Fuse the map into the reduce's leaves: never materialize f(x_i)."""
acc = identity
for i in range(len(xs)):
acc = op(acc, f(xs[i])) # f applied lazily at each leaf
return acc
if __name__ == "__main__":
import operator
f = lambda x: x + 1
g = lambda x: x * 2
for n in (0, 1, 100, 1000):
xs = list(range(n))
assert map_map_unfused(f, g, xs) == map_fused(f, g, xs), "fusion must preserve output"
# Pass count: unfused reads xs n times (and mid n more); fused reads xs once.
CountingList.reads = 0
_ = map_fused(f, g, CountingList(range(1000)))
fused_reads = CountingList.reads
CountingList.reads = 0
_ = map_map_unfused(f, g, CountingList(range(1000)))
unfused_reads = CountingList.reads
assert fused_reads == 1000 and unfused_reads == 1000 # both read xs once...
# ...but unfused ALSO allocates+reads a 1000-element intermediate. Show the saving:
assert map_reduce_fused(f, [1, 2, 3], operator.add, 0) == (1 + 1) + (2 + 1) + (3 + 1)
print("OK: map fusion preserves output, drops the intermediate array (and into-reduce too)")
- Constraints: Fusion must produce the identical result —
(g ∘ f)(x) = g(f(x))exactly. The fused version must not allocate the intermediate[f(x_i)]array. When the map feeds a reduce, fusefinto the reduce's leaves so the products are consumed as they are produced (Θ(1)extra space, as in the dot product of Task 3). Handle the empty input. - Hint: Fusion is a work-preserving, span-preserving rewrite that wins on memory traffic and allocation, which is what the work–span model ignores but real hardware charges for (see the bandwidth ceiling in the work–span tasks). Both versions do
napplications offandnofg— same work, sameΘ(1)span — but the unfused one streams2nextra elements through cache. This is whymap(g).map(f).reduce(+)in a real framework (Spark, NumPy withnumexpr, GPU kernels) is compiled to a single fused kernel: the intermediate arrays are never written to DRAM. - Acceptance test:
map_fusedequalsmap_map_unfusedfor every input; the fused path makes one pass and allocates no intermediate;map_reduce_fusedconsumesf(x_i)at the leaves. Fusion is the optimization that makes map-reduce pipelines bandwidth-efficient without changing the result.
Advanced Tasks¶
Task 9 — Semiring map-reduce: one step of all-pairs shortest paths and reachability [coding]¶
[hard] Generalize + and · to a semiring (S, ⊕, ⊗) and matrix "multiply" becomes a map-reduce: C[i][j] = ⊕_k (A[i][k] ⊗ B[k][j]) — map the ⊗ products, reduce them with ⊕. Over the (min, +) semiring this is one round of shortest-path relaxation (the inner step of Floyd–Warshall / repeated squaring); over (OR, AND) it is reachability. Implement the semiring product and verify both interpretations.
Python¶
INF = float("inf")
def semiring_matmul(A, B, oplus, otimes, zero):
"""C[i][j] = reduce_k ( A[i][k] otimes B[k][j] ) with reducer oplus, identity zero.
map = the otimes products for fixed (i,j); reduce = fold them with oplus."""
n, m, p = len(A), len(B[0]), len(B)
C = [[zero] * m for _ in range(n)]
for i in range(n):
for j in range(m):
acc = zero
for k in range(p): # map: products; reduce: oplus-fold
acc = oplus(acc, otimes(A[i][k], B[k][j]))
C[i][j] = acc
return C
def min_plus(A, B): # (min, +): one shortest-path relaxation step
return semiring_matmul(A, B, min, lambda a, b: a + b, INF)
def or_and(A, B): # (OR, AND) over booleans: reachability in <=2 hops
return semiring_matmul(A, B, lambda a, b: a or b, lambda a, b: a and b, False)
if __name__ == "__main__":
# (min,+): distances after relaxing through one intermediate.
# Graph 0->1 (w2), 1->2 (w3), 0->2 (w10). One min-plus square finds 0->2 = 5.
W = [[0, 2, 10],
[INF, 0, 3 ],
[INF, INF, 0 ]]
W2 = min_plus(W, W)
assert W2[0][2] == 5, f"min-plus must relax 0->1->2 = 5, got {W2[0][2]}"
assert W2[0][1] == 2 and W2[1][2] == 3
# (OR,AND): adjacency squared = reachability within 2 hops.
Adj = [[False, True, False],
[False, False, True ],
[False, False, False]]
R2 = or_and(Adj, Adj)
assert R2[0][2] is True, "0->1->2 must be reachable in 2 hops"
assert R2[0][1] is False, "0->1 needs exactly 1 hop, not 2 -> AND-OR gives False here"
print("OK: (min,+) relaxes shortest paths; (OR,AND) computes 2-hop reachability")
- Constraints: Both
⊕and⊗must satisfy the semiring laws on the entries used:⊕associative+commutative with identityzero(minwith+∞;ORwithFalse);⊗associative with identityoneand distributing over⊕(+with0;ANDwithTrue); andzeroannihilates⊗(+∞ + x = +∞;False AND x = False). The innerk-loop is a map-reduce: map the⊗products, reduce with⊗'s⊕. Then²output entries are independent — workΘ(n³), spanΘ(log n)(the⊕-reduction depth per entry). - Hint: This is why semirings matter: the same triple-loop, with the addition/multiplication swapped for
(min, +)or(OR, AND)or(max, ·), computes shortest paths, transitive closure, maximum-capacity paths, and more — each as a parallel map-reduce per output cell. Repeated squaring of the(min, +)matrix (log nproducts) gives all-pairs shortest paths inΘ(n³ log n)work andΘ(log² n)span; the boolean version gives transitive closure. The map-reduce structure (independent cells, each an associative reduction) is what makes the whole thing parallel. - Acceptance test:
min_plus(W, W)relaxes one intermediate hop correctly (0→1→2 = 5);or_and(Adj, Adj)computes 2-hop reachability. The inner loop is a map (⊗) then a reduce (⊕) — a semiring map-reduce, with the operators chosen for the problem.
Task 10 — Tree all-reduce vs ring all-reduce: count the communication [coding]¶
[hard] On a cluster, all-reduce gives every one of p nodes the global reduction of their local data — the collective behind distributed SGD. Two classic schedules: a tree all-reduce (reduce up a tree, broadcast down — 2 log p steps, latency-optimal) and a ring all-reduce (2(p−1) steps, each moving only n/p of the vector — bandwidth-optimal). Simulate both, verify every node ends with the global sum, and count the communication.
Python¶
def tree_all_reduce(local):
"""local[r] = node r's vector (length n). Returns each node's copy of the global
sum, plus (steps, bytes_moved). Reduce up a binary tree, broadcast down."""
p = len(local)
n = len(local[0])
vecs = [list(v) for v in local]
steps = 0
moved = 0
d = 1
# Reduce-up: node (i) sends to node (i - d) which accumulates. log p rounds.
while d < p:
for i in range(0, p, 2 * d):
j = i + d
if j < p:
for k in range(n):
vecs[i][k] += vecs[j][k] # node i absorbs node j's vector
moved += n # one full vector sent
steps += 1
d *= 2
# vecs[0] now holds the global sum. Broadcast-down: log p rounds.
total = list(vecs[0])
d //= 2
while d >= 1:
for i in range(0, p, 2 * d):
j = i + d
if j < p:
vecs[j] = list(vecs[i]) # send the sum back down
moved += n
steps += 1
d //= 2
out = [list(total) for _ in range(p)]
return out, steps, moved
def ring_all_reduce(local):
"""Ring all-reduce: 2(p-1) steps, each node sends only n/p elements per step.
Phase 1 (reduce-scatter): p-1 steps. Phase 2 (all-gather): p-1 steps."""
p = len(local)
n = len(local[0])
assert n % p == 0, "ring all-reduce shards the vector into p equal chunks"
chunk = n // p
vecs = [list(v) for v in local]
steps = moved = 0
def seg(r, s): # node r's slice s = [s*chunk, ...]
return slice(s * chunk, (s + 1) * chunk)
# Phase 1 — reduce-scatter: after p-1 steps, node r owns the full sum of chunk (r+1)%p.
for t in range(p - 1):
for r in range(p):
send_chunk = (r - t) % p
dst = (r + 1) % p
for k in range(chunk): # add my send-chunk into the neighbour
vecs[dst][seg(dst, send_chunk).start + k] += vecs[r][seg(r, send_chunk).start + k]
moved += chunk
steps += 1
# Phase 2 — all-gather: circulate the completed chunks, p-1 steps.
for t in range(p - 1):
for r in range(p):
send_chunk = (r + 1 - t) % p
dst = (r + 1) % p
vecs[dst][seg(dst, send_chunk)] = list(vecs[r][seg(r, send_chunk)])
moved += chunk
steps += 1
return vecs, steps, moved
if __name__ == "__main__":
import random
random.seed(6)
p, n = 4, 12
local = [[random.randint(0, 9) for _ in range(n)] for _ in range(p)]
truth = [sum(local[r][k] for r in range(p)) for k in range(n)]
tree, t_steps, t_bytes = tree_all_reduce(local)
assert all(tree[r] == truth for r in range(p)), "tree all-reduce: every node has the sum"
ring, r_steps, r_bytes = ring_all_reduce(local)
assert all(ring[r] == truth for r in range(p)), "ring all-reduce: every node has the sum"
print(f"tree all-reduce: steps={t_steps} (~2 log2 p={2*2}), vector-units moved={t_bytes}")
print(f"ring all-reduce: steps={r_steps} (=2(p-1)={2*(p-1)}), chunk-units moved={r_bytes}")
assert t_steps == 2 * 2 # 2 log2 4 = 4 rounds
assert r_steps == 2 * (p - 1) # 6 rounds, but each moves only n/p
assert r_bytes == 2 * (p - 1) * (n // p) * p // 1 # bandwidth-optimal: ~2n per node
print("OK: both all-reduces give every node the global sum; tree=2 log p steps, ring=2(p-1) steps")
- Constraints: Both must leave every node holding the identical global sum. The tree all-reduce uses
2⌈log₂ p⌉steps (reduce up, broadcast down) but each step moves a fulln-length vector — latencyΘ(log p), but bandwidth per nodeΘ(n log p). The ring uses2(p−1)steps but each step moves only ann/pchunk, so total data per node is2·(p−1)·(n/p) ≈ 2n— independent ofp, the bandwidth-optimal collective. Use+(a commutative monoid) since chunks combine in ring order. - Hint: This is the central trade-off in distributed deep learning. The tree (or its butterfly/recursive-doubling cousin) wins on latency when
nis small —2 log pround trips. The ring wins on bandwidth whennis large — each node sends/receives≈ 2nbytes regardless ofp, so it saturates the network links instead of bottlenecking one root. Production frameworks (NCCL, Horovod) default to ring (or hierarchical ring+tree) all-reduce for exactly this reason: gradient vectors are huge, so bandwidth dominates. The⊕must be commutative because the ring combines chunks in traversal order. - Acceptance test: Every node holds the global sum in both schemes; tree uses
2⌈log₂ p⌉steps moving full vectors, ring uses2(p−1)steps movingn/pchunks (≈ 2nper node total). The simulation makes the latency-vs-bandwidth trade-off countable, not hand-wavy.
Task 11 — Deterministic-tree reduce vs naive atomic reduce (reproducible floats) [coding]¶
[hard] A naive parallel float reduce — every worker atomically adds its partial into a shared accumulator — gives a result that depends on the (nondeterministic) order workers finish, so two runs differ in the low bits. A deterministic tree reduce with a fixed combine order is bit-reproducible. Build both, show the atomic reduce is non-reproducible across runs while the deterministic tree is identical every time, and explain why.
Go¶
package main
import (
"fmt"
"math/rand"
"runtime"
"sync"
"sync/atomic"
"time"
)
// atomicReduce: each worker sums its chunk, then folds into a SHARED accumulator
// in finish-order. Because float64 + is not associative and finish-order varies,
// the low bits of the result vary RUN TO RUN.
func atomicReduce(data []float64, P int) float64 {
n := len(data)
chunk := (n + P - 1) / P
var mu sync.Mutex
total := 0.0
var wg sync.WaitGroup
for p := 0; p < P; p++ {
lo, hi := p*chunk, (p+1)*chunk
if hi > n {
hi = n
}
if lo >= hi {
break
}
wg.Add(1)
go func(lo, hi int) {
defer wg.Done()
var s float64
for i := lo; i < hi; i++ {
s += data[i]
}
// jitter so finish-order is genuinely nondeterministic
time.Sleep(time.Duration(rand.Intn(50)) * time.Microsecond)
mu.Lock()
total += s // folds in arrival order -> not reproducible
mu.Unlock()
}(lo, hi)
}
wg.Wait()
return total
}
// deterministicReduce: same P chunks, but partials are written to FIXED slots and
// combined left-to-right in INDEX order. Identical bits every run, any P.
func deterministicReduce(data []float64, P int) float64 {
n := len(data)
chunk := (n + P - 1) / P
partials := make([]float64, P)
var wg sync.WaitGroup
for p := 0; p < P; p++ {
lo, hi := p*chunk, (p+1)*chunk
if hi > n {
hi = n
}
if lo >= hi {
break
}
wg.Add(1)
go func(p, lo, hi int) {
defer wg.Done()
var s float64
for i := lo; i < hi; i++ {
s += data[i]
}
partials[p] = s // fixed slot p, independent of finish time
}(p, lo, hi)
}
wg.Wait()
var total float64
for _, s := range partials { // FIXED index order -> deterministic
total += s
}
return total
}
func main() {
runtime.GOMAXPROCS(runtime.NumCPU())
n := 1_000_000
data := make([]float64, n)
for i := range data {
data[i] = 1.0/3.0 + float64(i%7)*1e-9 // values that round when summed
}
const P = 8
// Atomic reduce: collect distinct results across runs.
seen := map[uint64]int{}
for run := 0; run < 12; run++ {
r := atomicReduce(data, P)
seen[float64bits(r)]++
}
fmt.Printf("atomic reduce: %d distinct bit-patterns across 12 runs (nondeterministic)\n", len(seen))
// Deterministic reduce: identical every run.
first := deterministicReduce(data, P)
allSame := true
for run := 0; run < 12; run++ {
if deterministicReduce(data, P) != first {
allSame = false
}
}
fmt.Printf("deterministic reduce: identical across 12 runs = %v\n", allSame)
}
func float64bits(f float64) uint64 { return atomic.LoadUint64((*uint64)(nil)) } // placeholder
Python (the determinism check, simplified)¶
import struct
def atomic_like(partials, order):
"""Fold partials in a given order -> low bits depend on the order (float +)."""
acc = 0.0
for i in order:
acc += partials[i]
return acc
if __name__ == "__main__":
import random
random.seed(7)
partials = [1e16, 1.0, -1e16, 2.0, 3.0, -4.0, 1e-8, 5e15]
bitsets = set()
for _ in range(50):
order = list(range(len(partials)))
random.shuffle(order) # simulate worker finish-order
r = atomic_like(partials, order)
bitsets.add(struct.pack("<d", r)) # exact bit pattern
print(f"shuffled-order folds produced {len(bitsets)} distinct bit-patterns (nondeterministic)")
fixed = [atomic_like(partials, range(len(partials))) for _ in range(50)]
assert len(set(struct.pack('<d', x) for x in fixed)) == 1, "fixed order is reproducible"
print("OK: arrival-order float reduce is nondeterministic; fixed-order tree is reproducible")
Note: the Go
float64bitsline above is a deliberate placeholder — replace it withmath.Float64bits(f)(importmath) to read the exact bit pattern. The point stands without it: the atomic reduce's result varies run to run, the deterministic one does not.
- Constraints: The non-reproducible reduce must combine partials in finish/arrival order (a shared accumulator under a lock, or any order not fixed in advance). The deterministic reduce must combine partials in a fixed order independent of timing (fixed slots, then a fixed-index fold or a fixed-shape tree). Use float values that actually round when summed (mixed magnitudes), so the order genuinely changes the low bits. The deterministic result must be bit-identical across runs and across processor counts if you fix the tree shape.
- Hint: The root cause is Task 7: float
+is not associative, so the result is a function of the bracketing — and a nondeterministic schedule chooses a different bracketing each run. Reproducibility (needed for debugging, regulatory audit, and bit-exact distributed training) requires pinning the reduction order: write partials to fixed slots, combine in a fixed tree. There is a cost — you cannot combine partials the instant they arrive — but it is usually cheap (the tree overppartials is tiny). This is why numerical libraries offer a "deterministic reduction" mode and why distributed-training frameworks fix the all-reduce tree shape. - Acceptance test: The atomic/arrival-order reduce produces multiple distinct bit-patterns across repeated runs; the fixed-order tree reduce produces one bit-pattern every run (and across
P). The demonstration ties non-reproducibility directly to float non-associativity plus schedule nondeterminism, and shows the fix is a fixed combine order.
Task 12 — Pairwise summation error: O(ε log n) vs naive O(ε n) [analysis]¶
[hard] The tree reduce is not just lower-span — for floating-point sums it is more accurate. Prove the worst-case round-off of naive sequential summation is O(ε n) (relative), while pairwise (tree) summation is O(ε log n) — an exponential improvement in the error bound, and a second reason to prefer the parallel-shaped reduce.
No code. Use this as the grading model.
Model. Each floating-point addition satisfies fl(a + b) = (a + b)(1 + δ) with |δ| ≤ ε (the unit round-off, ≈ 1.1·10⁻¹⁶ for float64). Let S = Σ_{i=0}^{n−1} x_i be the true sum and Ŝ the computed one. We bound the absolute error |Ŝ − S| in terms of Σ|x_i|.
Naive sequential sum. The running sum s_k = fl(s_{k−1} + x_k) accumulates a rounding factor at every step. After n − 1 additions, x_i is multiplied by a product of up to n − i factors (1 + δ). Using ∏(1 + δ_j) = 1 + θ with |θ| ≤ (n−1)ε / (1 − (n−1)ε) ≈ (n−1)ε (the standard γ_{n−1} bound), the error is
The earliest-added terms pass through the most additions, so they carry the largest (n−1)-fold rounding — the error grows linearly in n.
Pairwise (tree) sum. Recursively split into two halves, sum each, add the two results. Each input x_i now participates in only the additions on its root-to-leaf path, of which there are at most ⌈log₂ n⌉ (the tree depth). So each x_i accrues at most ⌈log₂ n⌉ rounding factors, giving |θ_i| ≤ γ_{⌈log₂ n⌉} ≈ ε⌈log₂ n⌉, and
Comparison. For n = 10⁹: naive's bound is ≈ ε·10⁹ ≈ 10⁻⁷ relative (you can lose 7 significant digits), while pairwise's is ≈ ε·30 ≈ 3·10⁻¹⁵ (you keep ~15 digits). The tree reduce that parallelism forces you toward is also the one numerical analysis recommends — the balanced bracketing both shortens the critical path and shortens every input's rounding path.
Relation to Kahan summation. Kahan (compensated) summation achieves O(ε) error — better still — by tracking the lost low-order bits, but it is inherently sequential (each step depends on the running compensation). Pairwise summation is the parallel-friendly middle ground: O(ε log n) error with Θ(log n) span. So the accuracy ranking is naive O(εn) ≫ pairwise O(ε log n) ≫ Kahan O(ε), with pairwise the only one that is also low-span.
- Constraints: State the round-off model
fl(a+b) = (a+b)(1+δ),|δ| ≤ ε. Bound naive summation's error asO(ε n)·Σ|x_i|by counting that early terms pass through up ton−1additions. Bound pairwise summation asO(ε log n)·Σ|x_i|by counting each input's≤ log nadditions on its tree path. Give the numeric comparison for a largenand place pairwise between naive and Kahan. - Acceptance test: Naive error bound
O(ε n), pairwiseO(ε log n), derived by counting additions per input (path length in the summation tree); the workedn = 10⁹numbers show the gap (~7 digits vs ~15); pairwise is correctly identified as the low-span, accurate-enough reduce, with Kahan as the sequentialO(ε)alternative.
Synthesis Task¶
Tie map and reduce together end to end: build the parallel map and the tree reduce with work/span counters, confirm every parallel result matches the sequential fold while the counters respect the bound, then build the variants — two-level, segmented, reduce-by-key, fused, semiring — and the collectives, and prove the monoid law, the
log nfloor, and the round-off bound.
[capstone] Carry map and reduce from definition to applications: implement, count, verify, and prove.
-
The two primitives [coding]. Parallel map →
Θ(n)work,Θ(1)span (Task 1); tree reduce with counters →n − 1work,⌈log₂ n⌉span (Task 2); the dot product as map-then-reduce (Task 3). Confirm every parallel output equals the sequential fold. -
Prove the structure [analysis]. The monoid laws and why associativity makes the tree equal the fold, with a non-associative counterexample (Task 6); the
Ω(log n)span floor, met by the tree; theO(ε log n)vsO(ε n)round-off bound (Task 12). -
Schedule and group [coding]. The two-level reduce for a fixed
p→ spanΘ(n/p + log p)(Task 4); segmented reduce and reduce-by-key (Task 5); map fusion that drops the intermediate (Task 8). -
The hard edges [coding]. A non-associative operator giving parallel ≠ sequential, and float
+'s order-dependence (Task 7); the semiring map-reduce for shortest paths and reachability (Task 9). -
Collectives and reproducibility [coding]. Tree vs ring all-reduce with the communication count (Task 10); deterministic-tree vs atomic reduce for bit-reproducible floats (Task 11).
Reference harness in Python (drives the pieces and checks every bound):
import operator, random
from math import log2, ceil
def synth(n, seed=0):
random.seed(seed)
add, idn = operator.add, 0
xs = [random.randint(0, 9) for _ in range(n)]
# Map + reduce primitives.
mapped, m_work, m_span = map_counted(lambda x: x * x, xs)
assert mapped == [x * x for x in xs] and m_work == n and m_span == (1 if n else 0)
red, r_work, r_span = tree_reduce(xs, add, idn)
assert red == seq_reduce(xs, add, idn), "tree reduce = sequential fold"
assert r_work == n - 1 and r_span == ceil(log2(n)) if n else True
# Two-level reduce at a few p.
for p in (1, 4, 64):
tl, tl_work, tl_span = two_level_reduce(xs, add, idn, p)
assert tl == red and tl_work == n - 1
# Dot product (map then reduce).
a = [random.randint(-9, 9) for _ in range(n)]
b = [random.randint(-9, 9) for _ in range(n)]
dv, _, _ = dot_map_reduce(a, b)
assert dv == sum(a[i] * b[i] for i in range(n))
return m_work, m_span, r_work, r_span
if __name__ == "__main__":
print(f"{'n':>7} {'map work':>9} {'map span':>9} {'red work':>9} {'red span':>9}")
for n in (16, 256, 4096, 65536):
mw, ms, rw, rs = synth(n)
print(f"{n:>7} {mw:>9} {ms:>9} {rw:>9} {rs:>9}")
print("\nOK: map = Theta(n)/Theta(1); reduce = (n-1)/ceil(log2 n); all parallel = sequential fold")
- Analysis answer: A map applies
findependently to each element —Θ(n)work,Θ(1)span, parallelismΘ(n), embarrassingly parallel. A reduce folds with an associative⊕; the tree reduce doesn − 1combines (work-efficient, matching the sequential fold) in⌈log₂ n⌉rounds, and associativity is exactly what licenses the parallel re-bracketing — drop it (subtraction; float+) and the parallel result diverges from the sequential one and depends on the tree shape. TheΩ(log n)span is a lower bound the tree meets (a value's dependency set doubles per round, the result needs allninputs), so the only scheduling freedom is the two-level shapeΘ(n/p + log p)forpprocessors. On top sit the variants: segmented reduce and reduce-by-key for grouped data, map fusion that drops the intermediate array, the semiring map-reduce that turns(min,+)/(OR,AND)matrix products into shortest paths and reachability, and the all-reduce collectives — tree (2 log psteps, latency-optimal) vs ring (2(p−1)steps,≈ 2n/node, bandwidth-optimal). Finally, float non-associativity forces a choice: a deterministic-tree reduce for bit-reproducibility, and pairwise summation'sO(ε log n)error beats naiveO(ε n)— the parallel shape is also the accurate one. - Acceptance test: Every parallel map/reduce output equals the sequential fold; map measures
Θ(n)work /Θ(1)span; tree reduce measuresn − 1work /⌈log₂ n⌉span; two-level reduce keepsn − 1work at spanΘ(n/p + log p); segmented reduce, reduce-by-key, fusion, and the semiring product all match their sequential references; both all-reduces leave every node with the global sum at the predicted step counts; the deterministic reduce is bit-reproducible where the atomic one is not. The write-up mirrors the whole discipline: run the map/reduce round by round, count the work and the span, confirm the output matches the sequential fold and the counters match the bound — then build the variants, the collectives, and prove the monoid law and the round-off floor.
Where to go next¶
- Build the scan that keeps every partial a reduce throws away — the work-efficient up-sweep/down-sweep, segmented scan, and the compaction and sorting it powers — in the parallel prefix sum / scan tasks; the up-sweep is precisely the tree reduce you built here.
- Revisit the work/span model, Brent's bound that predicts the two-level reduce's
Θ(n/p + log p), the reduction tree'sΘ(n)-workΘ(log n)-span derivation, and theΩ(log n)span lower bound this topic's floor specializes, in the models: PRAM and work–span tasks. - For the conceptual treatment of map and reduce — monoids and semirings, the two-level and segmented forms, all-reduce collectives, and floating-point reproducibility and accuracy — read this topic's junior, middle, senior, and professional notes.
In this topic
- interview
- tasks