Models of Parallel Computation: PRAM and Work–Span — Practice Tasks¶
Coding tasks are solved in one language (Go or Python) with the full reference solution; a short snippet in the other language is provided where it clarifies the port. Where marked [coding], build the work–span analyzer / greedy scheduler / parallel benchmark, run it, and confirm the measured numbers respect the laws. The acceptance test is always the same shape: the measured or derived
T_Prespects the Work LawT_P ≥ T₁/P, the Span LawT_P ≥ T∞, and the Brent boundT_P ≤ T₁/P + T∞. [analysis] tasks need no code: derive a work/span, apply Brent, or work an Amdahl/Gustafson number from first principles — model derivations are provided so you can grade yourself.
A parallel computation is modeled as a DAG of unit tasks. Two numbers summarize it and everything else follows from them:
T₁— the work: the total number of unit tasks, equivalently the time on one processor.T∞— the span (depth, critical-path length): the number of tasks on the longest directed path, equivalently the time on infinitely many processors.
Two derived quantities recur and must become reflexes:
parallelism = T₁/T∞— the average width of the DAG; the maximum speedup any schedule can ever achieve, and the processor count beyond which workers must idle.S_P = T₁/T_P,E_P = S_P/P— the speedup and efficiency onPprocessors.
The bounds you will measure, derive, and verify on every task:
| Bound | Statement | Why |
|---|---|---|
| Work Law | T_P ≥ T₁/P | P workers do at most P·T_P tasks; all T₁ must be done |
| Span Law | T_P ≥ T∞ | the critical path serializes; even ∞ processors cannot beat T∞ |
| Brent bound | T_P ≤ T₁/P + T∞ | a greedy schedule has ≤ T₁/P complete and ≤ T∞ incomplete steps |
| Amdahl | S_P = 1/(s + (1−s)/P) → 1/s | serial fraction s is a hard ceiling (fixed problem) |
| Gustafson | S_P = s + (1−s)·P | scaled problem; linear, unbounded |
The recurring discipline for every coding task is identical: build the DAG (or the parallel program), compute T₁ and T∞, run a schedule (or measure wall-clock), and confirm the achieved T_P lands between max(T₁/P, T∞) and T₁/P + T∞. A work/span you never schedule against is just a pair of numbers; a measured speedup you cannot bracket with the laws is just noise. Tie the two together on every task.
Related practice: - Fork–Join and Work-Stealing tasks — the runtime that realizes the greedy schedule Brent's theorem assumes, randomized work-stealing analysis, and grain-size control in practice. - Parallel Prefix Sum / Scan tasks — the work-efficient Θ(n)-work Θ(log n)-span scan you analyze here, built and benchmarked in full.
This topic's notes: junior · middle · senior · professional
A note on the model and quantities used throughout: - Unit task. One DAG node = one strand of serial work taking one time step. Edges are dependencies: a node is ready only when all its predecessors have finished. - Greedy schedule. At each step, if ≥ P tasks are ready, run any P (a complete step); else run all ready tasks (an incomplete step). Never idle a processor next to a ready task. - PRAM vs work–span. The PRAM fixes P and steps in lockstep (EREW ⊆ CREW ⊆ CRCW); the work–span model is P-independent and Brent's principle compiles it to any P. We use work–span for scheduling and PRAM only where the variant matters. - The idealization. Both models charge nothing for communication, memory contention, or synchronization. T_P = T₁/P + T∞ measures inherent parallelism, not wall-clock — which is exactly why the coding tasks that do measure wall-clock reveal where the model and the hardware part ways (the memory-bandwidth ceiling, scheduling overhead, grain size).
Beginner Tasks¶
Task 1 — Compute T₁, T∞, and parallelism of a task DAG by longest path [coding]¶
[easy] Build the core instrument used by every later task: given a DAG (each node a unit task, each edge a dependency), compute the work T₁ (node count), the span T∞ (longest path in nodes), and the parallelism T₁/T∞. Process nodes in topological order; the longest-path-to-a-node recurrence is level[v] = 1 + max over predecessors u of level[u], and T∞ = max_v level[v].
Then run it on a balanced binary reduction tree over 8 leaves (15 nodes, span 4) and confirm T₁ = 15, T∞ = 4, parallelism = 3.75.
Python¶
def work_span(deps):
"""deps[v] = list of predecessors of node v, input in topological order.
Returns (T1, Tinf, level): work = node count, span = longest path (nodes),
level[v] = longest distance from a source to v (1-based)."""
n = len(deps)
level = [0] * n
for v in range(n):
best = max((level[u] for u in deps[v]), default=0)
level[v] = best + 1 # one step below its deepest predecessor
return n, max(level), level
if __name__ == "__main__":
# Balanced binary reduction tree over 8 leaves.
# 0..7 leaves; 8..11 level-1; 12,13 level-2; 14 root.
deps = [
[], [], [], [], [], [], [], [], # 0..7 leaves
[0, 1], [2, 3], [4, 5], [6, 7], # 8..11
[8, 9], [10, 11], # 12,13
[12, 13], # 14 root
]
T1, Tinf, level = work_span(deps)
print(f"reduction DAG: T1={T1} Tinf={Tinf} parallelism={T1/Tinf:.2f}")
assert T1 == 15 and Tinf == 4
assert abs(T1 / Tinf - 3.75) < 1e-9
print("OK: T1 = node count, Tinf = longest path, parallelism = T1/Tinf")
Go (core)¶
// WorkSpan returns T1 (= #nodes), Tinf (= longest path in nodes), and level[].
// deps[v] = predecessors of v; nodes given in topological order 0..n-1.
func WorkSpan(deps [][]int) (T1, Tinf int, level []int) {
level = make([]int, len(deps))
for v := range deps {
best := 0
for _, u := range deps[v] {
if level[u] > best {
best = level[u]
}
}
level[v] = best + 1
if level[v] > Tinf {
Tinf = level[v]
}
}
return len(deps), Tinf, level
}
- Constraints: Input must be in topological order so every predecessor is processed before the node (otherwise
level[u]is not yet final).T∞is the genuine longest path, not the number of levels you drew — use themax-over-predecessors recurrence. A node with no predecessors haslevel = 1. - Hint: The longest path to
vis one more than the longest path to its deepest predecessor — a one-pass DP over a topological order,O(V + E). The reduction tree has 4 levels (8 leaves → 4 → 2 → 1), soT∞ = 4; it has8 + 4 + 2 + 1 = 15nodes, soT₁ = 15. - Acceptance test:
T₁ = 15,T∞ = 4, parallelism= 3.75for the reduction tree. Keep thiswork_spanfunction — every later task drives it (or the greedy scheduler built on top of it).
Task 2 — Work and span of summing n numbers (the reduction tree) [analysis]¶
[easy] Before scheduling anything, derive the work and span of the canonical parallel primitive: summing n numbers with a balanced binary reduction tree (pair up adjacent elements, combine, halve the count, repeat).
No code. Use this as the grading model.
Model derivation.
Work. Level k (counting from the leaves, k = 0) performs n/2^{k+1} additions. Summing over all levels:
Every value is touched a constant number of times, and the sequential sum also does n − 1 additions — so the reduction is work-efficient (T₁ = Θ(T_seq)).
Span. The tree has ⌈log₂ n⌉ levels, and each level depends on the one below it (you cannot combine a partial sum before its inputs exist). The critical path runs leaf → … → root:
Parallelism.
Worked numbers. For n = 1,048,576 = 2²⁰:
T₁ = n − 1 = 1,048,575 (additions)
T∞ = log₂ n = 20 (tree height)
parallelism = T₁/T∞ ≈ 52,428 (you could keep ~52k processors busy on average)
So a million-element sum has enormous parallelism: any realistic core count P (say 64) sits far below 52,428, putting you deep in the linear-speedup regime where T∞ = 20 is a negligible additive term against T₁/P ≈ 16,384.
Why span cannot be O(1). Combining n values into one on an EREW/CREW PRAM requires Ω(log n) steps — a cell's set of input-dependencies at most doubles per step, so reaching all n inputs needs ≥ log₂ n steps. The Θ(log n) reduction span is forced, not an artifact of the tree shape. (Only a CRCW machine's concurrent-write collapses OR/AND-style reductions to O(1).)
- Constraints: Derive
T₁ = n − 1 = Θ(n),T∞ = ⌈log₂ n⌉ = Θ(log n), parallelism= Θ(n/log n). Plug in a concreten(e.g.2²⁰) and report all three. State why the algorithm is work-efficient and whyΘ(log n)span is unavoidable on EREW. - Acceptance test: Work
Θ(n)(matching sequential), spanΘ(log n), parallelismΘ(n/log n); the worked numbers forn = 2²⁰giveT₁ ≈ 10⁶,T∞ = 20, parallelism≈ 5.2·10⁴. This is the work/span you will measure in Task 3 and schedule in Task 4.
Task 3 — Write a parallel reduction and MEASURE speedup vs P [coding]¶
[easy] Leave the abstract DAG for a moment and run a real parallel reduction. Split an array of n numbers into P chunks, sum each chunk on its own worker (goroutine or process), then combine the P partial sums. Time it for P = 1, 2, 4, 8, … and report the measured speedup S_P = T_1 / T_P against the ideal P.
This is the work–efficient reduction of Task 2 realized: work Θ(n), span Θ(log n) (here the final combine of P partials), and parallelism Θ(n/log n) — so for any small P you should see speedup close to P until overhead or the memory-bandwidth ceiling bites (Task 11).
Go¶
package main
import (
"fmt"
"runtime"
"sync"
"time"
)
// parallelSum splits data into P contiguous chunks, sums each on its own
// goroutine, then combines the P partial sums. Work Theta(n); span is the
// chunk-sum (n/P) plus the final combine (~log P), i.e. Theta(n/P + log P).
func parallelSum(data []float64, P int) float64 {
n := len(data)
partials := make([]float64, P)
var wg sync.WaitGroup
chunk := (n + P - 1) / P
for p := 0; p < P; p++ {
lo := p * chunk
hi := lo + chunk
if hi > n {
hi = n
}
if lo >= n {
break
}
wg.Add(1)
go func(p, lo, hi int) {
defer wg.Done()
var s float64
for i := lo; i < hi; i++ {
s += data[i]
}
partials[p] = s // each worker writes its own slot: no contention
}(p, lo, hi)
}
wg.Wait()
var total float64
for _, s := range partials { // final combine of P partials
total += s
}
return total
}
func main() {
runtime.GOMAXPROCS(runtime.NumCPU())
const n = 200_000_000
data := make([]float64, n)
for i := range data {
data[i] = 1.0 // known answer: sum == n
}
// Baseline T_1: single goroutine.
t0 := time.Now()
base := parallelSum(data, 1)
t1 := time.Since(t0).Seconds()
if base != float64(n) {
panic("wrong sum")
}
fmt.Printf("P=1 time=%.4fs speedup=1.00 (baseline)\n", t1)
for _, P := range []int{2, 4, 8, 16} {
t0 = time.Now()
got := parallelSum(data, P)
tp := time.Since(t0).Seconds()
if got != float64(n) {
panic("wrong sum")
}
fmt.Printf("P=%-3d time=%.4fs speedup=%.2f efficiency=%.2f\n",
P, tp, t1/tp, (t1/tp)/float64(P))
}
}
Python (multiprocessing core)¶
from multiprocessing import Pool
def chunk_sum(args):
data, lo, hi = args
return sum(data[lo:hi])
def parallel_sum(data, P):
n = len(data)
chunk = (n + P - 1) // P
tasks = [(data, p * chunk, min((p + 1) * chunk, n)) for p in range(P)]
with Pool(P) as pool:
return sum(pool.map(chunk_sum, tasks)) # P partials, then combine
- Constraints: Each worker writes its own partial slot (no shared accumulator → no contention or race). Use
nlarge enough that the per-chunk work dwarfs spawn overhead (hundreds of millions of cheap ops, or a heavier per-element op in Python where interpreter overhead is large). Pin the data to a known value (all1.0) so the answer isnand you can assert correctness — a wrong sum invalidates the timing. - Hint: Speedup will be near-linear for small
P(you are far below the parallelismn/log n), then flatten. The flattening is not a violation of Brent — it is the memory-bandwidth wall and scheduling overhead the work–span model omits (Task 11 isolates it). Go shows cleaner scaling than Python'smultiprocessing, which pays pickling/IPC overhead per chunk. - Acceptance test: The sum is correct for every
P; measuredS_Pincreases withPand stays≤ P(the Work Law in wall-clock form — you can never beat linear speedup in the model). ReportS_PandE_P; near-linear speedup at smallPconfirms the reduction's high parallelism.
Task 4 — Apply Brent's bound to predict T_P; compare to a greedy schedule [coding + analysis]¶
[easy] Brent's theorem says a greedy schedule of a DAG runs in T_P ≤ T₁/P + T∞. Predict T_P for the reduction DAG of Task 1 at several P, then simulate a greedy schedule and confirm the achieved T_P lands between the lower bound max(T₁/P, T∞) and the upper bound T₁/P + T∞.
The greedy simulator is the second core instrument: maintain a ready set, and each step pop up to P ready tasks, mark them done, and release any successor whose last predecessor just finished.
Python¶
from collections import deque
def greedy(deps, P):
"""Simulate a greedy schedule on P processors; return T_P (steps).
A node becomes ready when all its predecessors have completed."""
n = len(deps)
remaining = [len(deps[v]) for v in range(n)] # unfinished predecessors
succ = [[] for _ in range(n)]
for v in range(n):
for u in deps[v]:
succ[u].append(v)
ready = deque(v for v in range(n) if remaining[v] == 0)
steps = 0
while ready:
steps += 1
take = min(P, len(ready)) # incomplete step runs ALL ready tasks
batch = [ready.popleft() for _ in range(take)]
for u in batch:
for w in succ[u]:
remaining[w] -= 1
if remaining[w] == 0:
ready.append(w)
return_check = None # (placeholder; loop continues)
return steps
if __name__ == "__main__":
from math import ceil
deps = [
[], [], [], [], [], [], [], [],
[0, 1], [2, 3], [4, 5], [6, 7],
[8, 9], [10, 11],
[12, 13],
]
T1, Tinf = len(deps), 4
print(f"reduction DAG: T1={T1} Tinf={Tinf} parallelism={T1/Tinf:.2f}")
for P in (1, 2, 4, 8, 16):
tp = greedy(deps, P)
lower = max(ceil(T1 / P), Tinf) # max(Work Law, Span Law)
brent = ceil(T1 / P) + Tinf # Brent upper bound
print(f" P={P:2d}: T_P={tp:2d} lower=max(T1/P,Tinf)={lower:2d} "
f"Brent=T1/P+Tinf={brent:2d} speedup={T1/tp:.2f}")
assert lower <= tp <= brent, "T_P must satisfy the Work/Span laws and Brent"
print("OK: max(T1/P, Tinf) <= T_P <= T1/P + Tinf for every P")
Bug to fix first (intentional): the
return_check = Noneline and a missing re-loop make this snippet terminate after one step. Remove the stray line and ensure thewhile ready:loop iterates untilreadyis empty — the structure above (pop a batch, release successors, repeat) is correct once the spurious early statement is deleted. Verifying the fix is part of the task: a scheduler that stops early reports a falsely smallT_Pthat violates the Span Law.
- Analysis to write: For
P = 8on the 15-node reduction tree,T₁/P = ⌈15/8⌉ = 2andT∞ = 4, so Brent predictsT_P ≤ 6; the lower bound ismax(2, 4) = 4. A correct greedy schedule achievesT_P = 4(it is span-bound: 8 leaves run in step 1, then 4, 2, 1 partials over the next 3 levels). AtP = 8 ≥ parallelism = 3.75, more processors cannot help — the span dominates, and speedup pins atT₁/T_P = 15/4 = 3.75, exactly the parallelism. - Constraints: The greedy rule is exact: run up to
Pready tasks per step, never fewer when more are ready. The achievedT_Pmust satisfy both lower-bound laws and the Brent upper bound for everyP. Fix the seeded bug before trusting any number. - Acceptance test:
max(⌈T₁/P⌉, T∞) ≤ T_P ≤ ⌈T₁/P⌉ + T∞forP ∈ {1, 2, 4, 8, 16}; atP ≥ parallelismthe speedup saturates at exactlyT₁/T∞ = 3.75. The simulator confirms Brent's theorem on a real DAG.
Intermediate Tasks¶
Task 5 — Verify the Brent bound across many random DAGs [coding]¶
[medium] One DAG proves nothing. Generate many random DAGs (random layered or random-edge graphs in topological order), compute T₁ and T∞ for each, run the greedy scheduler for several P, and assert that every run satisfies max(T₁/P, T∞) ≤ T_P ≤ T₁/P + T∞. This stress-tests both your analyzer and Brent's theorem.
Python¶
import random
from collections import deque
from math import ceil
def random_dag(n, edge_prob):
"""Random DAG on n nodes in topological order: edge u->v only if u < v."""
deps = [[] for _ in range(n)]
for v in range(1, n):
for u in range(v):
if random.random() < edge_prob:
deps[v].append(u)
return deps
def work_span(deps):
level = [0] * len(deps)
for v in range(len(deps)):
level[v] = 1 + max((level[u] for u in deps[v]), default=0)
return len(deps), max(level)
def greedy(deps, P):
n = len(deps)
remaining = [len(deps[v]) for v in range(n)]
succ = [[] for _ in range(n)]
for v in range(n):
for u in deps[v]:
succ[u].append(v)
ready = deque(v for v in range(n) if remaining[v] == 0)
steps = 0
while ready:
steps += 1
for _ in range(min(P, len(ready))):
u = ready.popleft()
for w in succ[u]:
remaining[w] -= 1
if remaining[w] == 0:
ready.append(w)
return steps
if __name__ == "__main__":
random.seed(0)
trials, violations = 0, 0
for _ in range(2000):
n = random.randint(5, 60)
deps = random_dag(n, edge_prob=random.uniform(0.05, 0.4))
T1, Tinf = work_span(deps)
for P in (1, 2, 3, 5, 8):
tp = greedy(deps, P)
lower = max(ceil(T1 / P), Tinf)
brent = ceil(T1 / P) + Tinf
trials += 1
if not (lower <= tp <= brent):
violations += 1
print(f"ran {trials} (DAG, P) trials; bound violations = {violations}")
assert violations == 0, "Brent and the Work/Span laws must hold on every DAG"
print("OK: max(T1/P, Tinf) <= T_P <= T1/P + Tinf on all random DAGs")
- Constraints: Generate DAGs in topological order (edge
u → vonly whenu < v) so acyclicity is automatic. Varyn, edge density, andP. The assertion must hold on every trial — a single violation means a bug inwork_span(likely a non-topological order) or ingreedy(likely an incomplete step that idles a processor). - Hint: Sparse DAGs (low edge probability) have small span and high parallelism, so they are usually Work-Law-bound (
T_P ≈ T₁/P); dense, chain-like DAGs have large span and are Span-Law-bound (T_P ≈ T∞). Both extremes — and everything between — must fall inside[max(T₁/P, T∞), T₁/P + T∞]. The bound is never violated; greedy can only ever be within2×of the lower bound. - Acceptance test: Zero bound violations across thousands of
(DAG, P)pairs. This is Brent's theorem and the two laws verified empirically — the same instrument that, applied to a fork–join runtime's DAG, certifies its scheduler.
Task 6 — Plot Amdahl's law (speedup vs P for several serial fractions) [coding]¶
[medium] Amdahl's law S_P = 1/(s + (1−s)/P) says a serial fraction s caps speedup at 1/s regardless of P. Compute and plot the family of curves for several s, showing each flatten toward its 1/s ceiling. Even an ASCII table or a matplotlib chart makes the diminishing returns vivid.
Python¶
def amdahl(s, P):
return 1.0 / (s + (1.0 - s) / P)
if __name__ == "__main__":
serial_fractions = [0.0, 0.01, 0.05, 0.10, 0.25, 0.50]
procs = [1, 2, 4, 8, 16, 64, 256, 1024, 1 << 16]
header = " s \\ P " + "".join(f"{P:>9}" for P in procs)
print(header)
for s in serial_fractions:
row = f" {s:>5.2f} "
for P in procs:
row += f"{amdahl(s, P):>9.2f}"
ceiling = float('inf') if s == 0 else 1.0 / s
print(row + f" ceiling 1/s = {ceiling:.1f}")
# The curve must be monotone in P and bounded by 1/s.
prev = 0.0
for P in procs:
sp = amdahl(s, P)
assert sp >= prev - 1e-9, "speedup is monotone non-decreasing in P"
if s > 0:
assert sp <= 1.0 / s + 1e-9, "speedup never exceeds 1/s"
prev = sp
print("\nOK: each Amdahl curve rises monotonically toward its 1/s ceiling")
# Optional matplotlib plot:
try:
import matplotlib.pyplot as plt
Ps = list(range(1, 1025))
for s in [0.01, 0.05, 0.10, 0.25, 0.50]:
plt.plot(Ps, [amdahl(s, P) for P in Ps], label=f"s={s}")
plt.axhline(1.0 / s, ls="--", lw=0.6)
plt.xscale("log", base=2); plt.xlabel("P"); plt.ylabel("speedup S_P")
plt.title("Amdahl's law: speedup vs P"); plt.legend()
plt.savefig("amdahl.png", dpi=120)
print("wrote amdahl.png")
except ImportError:
pass
- Constraints:
S_P = 1/(s + (1−s)/P). Each curve must be monotone non-decreasing inPand bounded above by1/s(fors > 0);s = 0gives the idealS_P = P. Show at least the rowss = 0.05(ceiling20×) ands = 0.25(ceiling4×) so the ceiling's harshness is visible. - Hint: The headline numbers:
s = 0.05caps at20×, yet atP = 1024you already have≈ 19.6×— going from1024to65536cores buys essentially nothing.s = 0.25caps at4×. This is why attacking the serial fraction itself (I/O, initialization, a sequential reduction) beats buying more cores onceP ≫ 1/s. - Acceptance test: Every curve is monotone and respects its
1/sceiling; thes = 0.05curve is within5%of20×byP = 256and essentially flat thereafter. The plot/table shows the ceiling eachsimposes — the quantitative face of "fixed-problem-size parallelism has a wall."
Task 7 — Gustafson's law next to Amdahl's: scaled vs fixed problem [coding + analysis]¶
[medium] Amdahl fixes the problem and scales the machine (ceiling 1/s); Gustafson scales both — a bigger machine runs a bigger problem, and S_P = s + (1−s)·P is linear and unbounded. Compute both for the same s and show the contrast, then explain which question each answers.
Python¶
def amdahl(s, P):
return 1.0 / (s + (1.0 - s) / P)
def gustafson(s, P):
return s + (1.0 - s) * P # = P - s*(P - 1)
if __name__ == "__main__":
s = 0.05
print(f"serial fraction s = {s} (Amdahl ceiling 1/s = {1/s:.0f})")
print(f"{'P':>8} {'Amdahl':>10} {'Gustafson':>12}")
for P in (1, 4, 16, 64, 256, 1024):
a, g = amdahl(s, P), gustafson(s, P)
print(f"{P:>8} {a:>10.2f} {g:>12.2f}")
assert a <= 1 / s + 1e-9, "Amdahl bounded by 1/s"
assert abs(g - (P - s * (P - 1))) < 1e-9, "Gustafson is linear in P"
# Gustafson slope is (1 - s): doubling P nearly doubles scaled speedup.
slope = (gustafson(s, 1024) - gustafson(s, 512)) / 512
print(f"\nGustafson slope ~ (1 - s) = {1 - s:.2f} (measured {slope:.2f})")
assert abs(slope - (1 - s)) < 1e-9
print("OK: Amdahl -> 1/s (bounded); Gustafson -> ~(1-s)*P (linear, unbounded)")
- Analysis to write: The two laws answer different questions. Amdahl (strong scaling): the problem size is fixed; you want the same computation done faster. The serial part
sdoes not shrink, soS_P → 1/s— a hard wall. Gustafson (weak scaling): the problem grows with the machine; you want a bigger computation in the same time. The parallel work grows withPwhile the serial part stays roughly constant, so the scaled speedups + (1−s)·Pis linear with slope1 − s. They do not contradict each other — Amdahl measures the same job on more cores, Gustafson measures more job on more cores. The classic error is quoting Amdahl's1/spessimism for a workload that actually scales with the machine (most scientific computing): nobody buys a supercomputer to solve last year's problem20×faster; they solve a1000×bigger one. - Constraints: Same
sfor both. Amdahl must stay≤ 1/s; Gustafson must be linear with slope1 − s. Tabulate both side by side so the bounded-vs-unbounded contrast is unmistakable. - Acceptance test: Amdahl saturates near
1/s = 20; Gustafson reaches≈ 0.95·P(e.g.≈ 973atP = 1024). The write-up names which scaling regime each law describes and why reporting the wrong one mis-states parallel value.
Task 8 — Analyze the work/span of prefix-sum and matrix multiply [analysis]¶
[medium] Two algorithms beyond reduction, by hand. These are the canonical cases where work-efficiency (not just low span) decides the winner.
No code. Use this as the grading model.
Prefix sum (scan): naive vs work-efficient.
Naive (Hillis–Steele). For d = 1, 2, 4, …, n/2, every element adds the element d positions to its left. There are log₂ n rounds, each doing Θ(n) independent additions:
The span is great (Θ(log n)), but the work is Θ(n log n) — work-inefficient, since sequential scan is Θ(n). That extra log n factor of work is paid by every processor on every run.
Work-efficient (Blelloch). An up-sweep (a reduction tree computing partial sums) followed by a down-sweep (pushing prefixes back down), each Θ(log n) levels with Θ(n) total work across the whole sweep:
This matches the sequential Θ(n) work and keeps Θ(log n) span — the best of both, which is why it is the production algorithm. (Full derivation in the scan tasks.)
Matrix multiply (definitional C[i][j] = Σ_k A[i][k]·B[k][j]).
There are n² output entries, each a dot product of length n. The n products for one entry are independent (compute in parallel); summing them is a reduction of n terms (span Θ(log n)). All n² entries are mutually independent:
Work Θ(n³) is work-efficient against the classical sequential algorithm (not against Strassen's Θ(n^{2.807})); span Θ(log n) is just the depth of the dot-product reductions. The parallelism Θ(n³/log n) is gigantic — for any realistic P you sit far below it, in the linear-speedup regime.
Worked comparison. For n = 10⁶, the naive scan does ≈ 10⁶ · 20 = 2·10⁷ work versus the efficient scan's 10⁶ — a 20× work blow-up. By the Work Law, speedup over sequential is capped at P · (T_seq/T₁) = P · (10⁶ / 2·10⁷) = P/20: the naive scan throws away a factor of 20 in processors before it starts, even though its span is identical to the efficient one's.
- Constraints: Derive
T₁,T∞, parallelism for all three (naive scan, Blelloch scan, matrix multiply). State which are work-efficient. Show numerically how the naive scan's work blow-up caps its speedup atP/log n. - Acceptance test: Naive scan
T₁ = Θ(n log n)(inefficient), Blelloch scan and matrix multiply work-efficient; matrix multiply parallelismΘ(n³/log n). The comparison shows a low span is worthless if bought with inflated work — the central lesson of work-efficiency.
Task 9 — Show a work-inefficient algorithm wastes processors [coding + analysis]¶
[medium] Make Task 8's point measurable. Take the same problem (prefix sum) two ways — the work-efficient Θ(n) algorithm and a work-inefficient Θ(n log n) one — and compare their effective speedup over the sequential baseline at a fixed P. The inefficient version, despite identical span, delivers less wall-clock speedup because every processor does redundant work.
Model it through the scheduler: build the DAG of each algorithm (or just its T₁, T∞), apply Brent to predict T_P, and compare S_P = T_seq / T_P for both. Since T_seq = Θ(n) is the same sequential baseline, the efficient algorithm's S_P can reach Θ(P) while the inefficient one's caps at Θ(P/log n).
Python¶
import math
def predict_TP(T1, Tinf, P):
"""Brent upper bound on T_P."""
return math.ceil(T1 / P) + Tinf
def compare(n, P):
T_seq = n # sequential scan: Theta(n)
# Work-efficient (Blelloch): work Theta(n), span Theta(log n).
eff_T1, eff_Tinf = 2 * n, 2 * math.ceil(math.log2(n)) # ~2n up+down sweep
# Work-inefficient (Hillis-Steele): work Theta(n log n), span Theta(log n).
inef_T1, inef_Tinf = n * math.ceil(math.log2(n)), math.ceil(math.log2(n))
eff_TP = predict_TP(eff_T1, eff_Tinf, P)
inef_TP = predict_TP(inef_T1, inef_Tinf, P)
return {
"T_seq": T_seq,
"eff": (eff_T1, eff_Tinf, eff_TP, T_seq / eff_TP),
"inef": (inef_T1, inef_Tinf, inef_TP, T_seq / inef_TP),
}
if __name__ == "__main__":
P = 64
print(f"P={P} speedup over sequential = T_seq / T_P")
print(f"{'n':>10} {'efficient S_P':>15} {'inefficient S_P':>17} {'ratio':>8}")
for n in (2**12, 2**16, 2**20, 2**24):
r = compare(n, P)
eff_sp = r["eff"][3]
inef_sp = r["inef"][3]
print(f"{n:>10} {eff_sp:>15.2f} {inef_sp:>17.2f} {eff_sp/inef_sp:>8.2f}")
# Efficient speedup approaches P; inefficient is ~log n times worse.
assert eff_sp > inef_sp, "work-efficient must out-speed work-inefficient"
assert eff_sp / inef_sp >= 0.5 * math.log2(n) / 2 # gap grows ~ log n
print("\nOK: inefficient algo wastes ~log n factor of processors (same span!)")
- Analysis to write: Both algorithms have span
Θ(log n), so a naive "minimize the span" view rates them equal. But speedup over sequential isT_seq / T_P, and by the Work LawT_P ≥ T₁/P, soS_P ≤ P · (T_seq/T₁). The efficient algorithm hasT_seq/T₁ = Θ(1), ceilingΘ(P); the inefficient one hasT_seq/T₁ = Θ(1/log n), ceilingΘ(P/log n). The inefficient algorithm spendsPprocessors re-doing work the sequential machine never did — the redundantlog nfactor is paid on every run by every processor. Span buys latency; work buys throughput, and throughput is what speedup-over-sequential measures. - Constraints: Use the same
T_seq = Θ(n)baseline for both. PredictT_Pvia Brent. Show the speedup gap grows withn(it is≈ log n), proving the waste is not a constant overhead but an asymptotic processor loss. - Acceptance test: Efficient
S_PapproachesP; inefficientS_Pis≈ log ntimes smaller, and the ratio widens withn. The numbers show a work-inefficient algorithm with a beautiful span can be slower in practice than a work-efficient one — work-efficiency is the first thing to check.
Advanced Tasks¶
Task 10 — Classify problems: embarrassingly parallel / work-efficient parallel / inherently sequential [analysis]¶
[hard] Sort a list of problems into three buckets by their inherent parallelism, with justification grounded in work and span. This is the conceptual payoff of the whole model: knowing where a problem sits before you write a line of parallel code.
No code. Use this as the grading model.
The three classes.
-
Embarrassingly parallel —
T∞ = Θ(1)or tiny; the work splits into independent pieces with no (or a trivial) combine. Parallelism≈ T₁. Examples: applying a pure function to every element of an array (map); Monte Carlo trials; rendering independent pixels; brute-force search over a partitioned space. Justification: no dependency edges between the pieces, so the DAG isPindependent chains; the only span is a finalO(log P)combine if any. -
Work-efficient parallel (poly-log span) —
T∞ = Θ(polylog n)andT₁ = Θ(T_seq). The problem has a logarithmic-depth dependency structure you can exploit without inflating work. Examples: reduction (Θ(n)work,Θ(log n)span); work-efficient prefix sum (Blelloch); merge sort and other divide-and-conquer with balanced combines; connected components via pointer jumping. Justification: a recursion or sweep ofΘ(log n)depth, each levelΘ(n)total work — high parallelismΘ(n/log n)with no wasted work. -
Inherently sequential (P-complete-ish) — believed to have no poly-log-span work-efficient algorithm; the dependency chain is intrinsic. Examples: the Circuit Value Problem (evaluate a Boolean circuit —
P-complete, the canonical "hard to parallelize"); lexicographically-first DFS / greedy maximal independent set by lowest-index rule; linear-programming (P-complete); a strict left-fold with a non-associative operator (each step needs the previous result). Justification: the answer at stepkrequires the answer at stepk − 1through a chain of lengthΘ(n), soT∞ = Θ(n)and parallelism isΘ(1)— adding processors cannot help.
Borderline cases to discuss. Quicksort with a sequential partition is Θ(n)-span (the partition is a left-to-right scan) but with a parallel partition (a scan-based filter) becomes Θ(log² n)-span — the class depends on the algorithm, not just the problem. General DFS order is sequential, but reachability (which the DFS computes) is parallelizable via BFS-style frontier expansion. Always ask: is the sequential bottleneck intrinsic to the problem or an artifact of the chosen algorithm?
- Constraints: Place at least 8 named problems across the three classes. For each, state
T₁,T∞(or the reason span isΘ(n)), and the parallelism. Identify at least two borderline cases where the classification flips with the algorithm. Name why the P-complete class resists poly-log span (the chain is intrinsic — formalized as "no NC algorithm unless NC = P"). - Acceptance test: Each problem is justified by its span: embarrassingly parallel ⇒
T∞ = Θ(1)–Θ(log P); work-efficient parallel ⇒T∞ = Θ(polylog n)withT₁ = Θ(T_seq); inherently sequential ⇒T∞ = Θ(n), parallelismΘ(1). The borderline discussion correctly separates problem-intrinsic from algorithm-induced serialization.
Task 11 — Grain-size cutoff: measure the overhead/speedup tradeoff [coding]¶
[hard] The work–span model charges nothing for spawning a task. Reality does: spawning a goroutine/process/future has fixed overhead, so splitting work into too-fine grains drowns the computation in scheduling cost. Implement a parallel divide-and-conquer with a grain-size cutoff — below a threshold G, run serially; above it, fork — and measure how total time varies with G. There is a sweet spot: too large and you lose parallelism; too small and overhead dominates.
Go¶
package main
import (
"fmt"
"runtime"
"sync"
"time"
)
// parSum sums data[lo:hi]. If the segment is smaller than the grain cutoff,
// sum it serially; otherwise split and fork. The cutoff bounds the number of
// spawned tasks: ~ (hi-lo)/grain leaves, each with one fork's overhead.
func parSum(data []float64, lo, hi, grain int) float64 {
if hi-lo <= grain {
var s float64
for i := lo; i < hi; i++ {
s += data[i]
}
return s
}
mid := (lo + hi) / 2
var left float64
var wg sync.WaitGroup
wg.Add(1)
go func() { defer wg.Done(); left = parSum(data, lo, mid, grain) }()
right := parSum(data, mid, hi, grain) // stay on this goroutine for the right half
wg.Wait()
return left + right
}
func main() {
runtime.GOMAXPROCS(runtime.NumCPU())
const n = 100_000_000
data := make([]float64, n)
for i := range data {
data[i] = 1.0
}
fmt.Printf("n=%d cores=%d\n", n, runtime.NumCPU())
fmt.Printf("%12s %12s %14s\n", "grain", "time(ms)", "~#tasks")
var best time.Duration = 1 << 62
var bestGrain int
for _, grain := range []int{64, 1_000, 100_000, 1_000_000, 10_000_000, n} {
t0 := time.Now()
got := parSum(data, 0, n, grain)
dt := time.Since(t0)
if got != float64(n) {
panic("wrong sum")
}
fmt.Printf("%12d %12.2f %14d\n", grain, float64(dt.Microseconds())/1e3, n/grain)
if dt < best {
best, bestGrain = dt, grain
}
}
fmt.Printf("\nbest grain = %d (too small: fork overhead; too large: serial)\n", bestGrain)
}
- Constraints: The cutoff
Gmust switch cleanly between serial base case and parallel fork. SweepGacross several orders of magnitude — fromG = 64(millions of tiny tasks, overhead-dominated) toG = n(one task, no parallelism). Pin the data so the answer is checkable. Report time and the approximate task countn/Gper grain. - Hint: Tiny grains create
≈ n/Gtasks; each fork costs a fixedc, so spawn overhead is≈ c·n/G— explosive asG → 0. Huge grains (G ≈ n) give one serial task —T_P ≈ T₁, no speedup. The optimum is where per-task useful work≈ Gcomfortably exceeds the spawn costc, typicallyGin the thousands-to-millions for cheap ops. The work–span model'sT∞ignores the spawn cost entirely; this task measures the term it omits. - Acceptance test: Time is U-shaped in
G: high at both extremes, minimal in the middle. The best grain gives speedup approaching the core count, whileG = 64is slower than the serial sum (overhead> work) andG = nshows no speedup. The measured curve exposes the spawn-overhead term the pure work–span model omits — and motivates the work-stealing runtimes of the fork–join tasks.
Task 12 — Demonstrate the memory-bandwidth ceiling: a parallel sum that stops scaling [coding + analysis]¶
[hard] The PRAM/work–span model gives unit-cost memory access, so a reduction should speed up linearly to the parallelism n/log n. Real machines share a finite memory bandwidth: once the cores collectively saturate it, adding more cores cannot deliver more bytes/second, and speedup flattens far below P. Demonstrate it: run a bandwidth-bound reduction (sum over a large array — almost no arithmetic per byte) and a compute-bound reduction (an expensive per-element function), and show only the latter keeps scaling.
Go¶
package main
import (
"fmt"
"math"
"runtime"
"sync"
"time"
)
func parReduce(data []float64, P int, f func(float64) float64) float64 {
n := len(data)
partials := make([]float64, P)
chunk := (n + P - 1) / P
var wg sync.WaitGroup
for p := 0; p < P; p++ {
lo, hi := p*chunk, (p+1)*chunk
if hi > n {
hi = n
}
if lo >= n {
break
}
wg.Add(1)
go func(p, lo, hi int) {
defer wg.Done()
var s float64
for i := lo; i < hi; i++ {
s += f(data[i])
}
partials[p] = s
}(p, lo, hi)
}
wg.Wait()
var t float64
for _, s := range partials {
t += s
}
return t
}
func bench(name string, data []float64, f func(float64) float64) {
t0 := time.Now()
parReduce(data, 1, f)
t1 := time.Since(t0).Seconds()
fmt.Printf("\n%s (baseline P=1: %.3fs)\n", name, t1)
for _, P := range []int{2, 4, 8, 16} {
t0 = time.Now()
parReduce(data, P, f)
tp := time.Since(t0).Seconds()
fmt.Printf(" P=%-3d time=%.3fs speedup=%.2f\n", P, tp, t1/tp)
}
}
func main() {
runtime.GOMAXPROCS(runtime.NumCPU())
const n = 200_000_000
data := make([]float64, n)
for i := range data {
data[i] = float64(i%7) + 1
}
// Bandwidth-bound: trivial work per element -> limited by memory throughput.
bench("BANDWIDTH-BOUND (s += x)", data, func(x float64) float64 { return x })
// Compute-bound: heavy work per element -> limited by arithmetic, scales.
bench("COMPUTE-BOUND (s += sqrt(sin^2+cos^2...))", data, func(x float64) float64 {
y := x
for k := 0; k < 50; k++ {
y = math.Sqrt(math.Sin(y)*math.Sin(y) + math.Cos(y)*math.Cos(y) + 1)
}
return y
})
}
- Analysis to write: Both reductions have identical work–span:
T₁ = Θ(n),T∞ = Θ(log n), parallelismΘ(n/log n). The model predicts near-linear speedup for both. Yet the bandwidth-bound sum flattens early — say at≈ 3–4×regardless of how many cores you add — while the compute-bound version keeps scaling towardP. The reason is outside the model: the bandwidth-bound loop's bottleneck is bytes/second from DRAM, a shared, fixed resource; once a handful of cores saturate it, extra cores starve. The compute-bound loop doesO(1)memory traffic perO(50)flops, so it is limited by arithmetic throughput, which does scale with cores. Arithmetic intensity (flops per byte) decides whether the work–span prediction holds. This is the single most important way the idealized model diverges from hardware — and why "high parallelism" does not guarantee "high speedup." - Constraints: Keep the two kernels structurally identical (same loop, same chunking) so only the per-element work differs. The bandwidth kernel must be near-trivial (a bare add); the compute kernel must do enough arithmetic per element to be limited by the ALU, not memory. Use an array far larger than last-level cache so DRAM traffic is real.
- Acceptance test: The bandwidth-bound reduction's speedup saturates well below
P(and below its parallelism), while the compute-bound reduction's keeps climbing towardP. Both obey the work–span bounds (T_Pis never belowmax(T₁/P, T∞)), but only the compute-bound one approaches the optimisticT₁/P. The write-up attributes the gap to shared memory bandwidth — the cost the PRAM model sets to zero.
Task 13 — Work-stealing vs static partition on an imbalanced workload [coding + analysis]¶
[hard] Greedy scheduling assumes a scheduler that never idles a processor next to a ready task. A static partition (split the work into P equal index ranges up front) violates this when the per-task cost is imbalanced: a worker that drew the cheap tasks finishes early and sits idle while one worker grinds through the expensive ones. Work-stealing (idle workers steal ready tasks from busy ones) approximates the greedy schedule and recovers the lost time. Build both on a deliberately imbalanced workload and measure.
Go¶
package main
import (
"fmt"
"runtime"
"sync"
"sync/atomic"
"time"
)
// cost(i): heavily skewed — a few indices are very expensive, most are cheap.
// Static range partition will give one worker most of the expensive indices.
func cost(i int) int {
if i%97 == 0 {
return 20000 // rare, very expensive task
}
return 10 // common, cheap task
}
func busy(work int) float64 { // simulate `work` units of computation
var s float64
for k := 0; k < work; k++ {
s += float64(k) * 1.0000001
}
return s
}
// staticPartition: contiguous index ranges. Imbalance -> idle workers.
func staticPartition(n, P int) time.Duration {
t0 := time.Now()
chunk := (n + P - 1) / P
var wg sync.WaitGroup
for p := 0; p < P; p++ {
lo, hi := p*chunk, (p+1)*chunk
if hi > n {
hi = n
}
wg.Add(1)
go func(lo, hi int) {
defer wg.Done()
for i := lo; i < hi; i++ {
busy(cost(i))
}
}(lo, hi)
}
wg.Wait()
return time.Since(t0)
}
// workStealing: a shared atomic counter as a global task queue — every idle
// worker grabs the next index, so no worker idles while tasks remain.
func workStealing(n, P int) time.Duration {
t0 := time.Now()
var next int64 = 0
var wg sync.WaitGroup
for p := 0; p < P; p++ {
wg.Add(1)
go func() {
defer wg.Done()
for {
i := int(atomic.AddInt64(&next, 1) - 1)
if i >= n {
return
}
busy(cost(i))
}
}()
}
wg.Wait()
return time.Since(t0)
}
func main() {
runtime.GOMAXPROCS(runtime.NumCPU())
const n = 2_000_000
fmt.Printf("imbalanced workload, n=%d, cores=%d\n", n, runtime.NumCPU())
fmt.Printf("%5s %14s %16s %10s\n", "P", "static(ms)", "work-steal(ms)", "speedup")
for _, P := range []int{2, 4, 8, 16} {
st := staticPartition(n, P)
ws := workStealing(n, P)
fmt.Printf("%5d %14.1f %16.1f %10.2fx\n",
P, float64(st.Microseconds())/1e3, float64(ws.Microseconds())/1e3,
float64(st)/float64(ws))
}
}
- Analysis to write: With a static index partition, the makespan is
max over workers of (sum of that worker's task costs), not the average — one unlucky worker holding the expensive tasks sets the time, and the rest idle. This is the greedy rule violated: idle processors next to ready (stealable) tasks. Work-stealing keeps every worker busy until the global queue drains, so its makespan approaches the balancedT₁/P— exactly the greedy/Brent ideal. The achieved gap between the two is the load-imbalance penalty, and it grows with the skew of the cost distribution. (A finer-grained static partition — many small chunks round-robined — narrows the gap but never closes it the way dynamic stealing does; this is why production fork–join runtimes steal rather than pre-partition.) - Constraints: Use a genuinely skewed cost function (a few very expensive tasks among many cheap ones) so a contiguous partition is unlucky for some worker. Keep both schedulers' total work identical (same
cost(i)evaluated once peri). The work-stealing version must let any idle worker grab the next task (a shared atomic index is a simple, correct global queue). - Acceptance test: Work-stealing finishes faster than the static partition on the imbalanced workload, with the gap widening as
Pgrows (more workers ⇒ more idle ones under static partition). Work-stealing'sT_PapproachesT₁/P(the greedy ideal); the static partition's is pinned nearmax-worker-load. The measurement confirms why real runtimes realize Brent's greedy schedule via stealing — continued in the fork–join and work-stealing tasks.
Synthesis Task¶
Tie the model together end to end: build the analyzer and the greedy scheduler, drive real DAGs through them, write and measure parallel code, and confirm at every step that the numbers respect the Work Law, the Span Law, and the Brent bound — then connect the idealized model to where hardware diverges.
[capstone] Carry the work–span model from DAG to silicon: analyze, schedule, measure, and explain the gaps.
-
Analyzer + scheduler [coding]. Build
work_span(Task 1) andgreedy(Task 4); confirmmax(⌈T₁/P⌉, T∞) ≤ T_P ≤ ⌈T₁/P⌉ + T∞on the reduction DAG and on thousands of random DAGs (Task 5) — zero violations. -
Derive the canonical bounds [analysis]. Work/span of reduction (Task 2), prefix sum, and matrix multiply (Task 8); classify a slate of problems as embarrassingly parallel / work-efficient parallel / inherently sequential (Task 10).
-
Scaling laws [coding + analysis]. Plot Amdahl's
1/(s + (1−s)/P) → 1/snext to Gustafson'ss + (1−s)·P(Tasks 6–7); name which scaling regime each answers. -
Measure real speedup [coding]. Write a parallel reduction and measure
S_PvsP(Task 3); confirmS_P ≤ Pand near-linear at smallP. Show a work-inefficient algorithm wastes alog nfactor of processors (Task 9). -
Where the model breaks [coding + analysis]. Find the grain-size sweet spot (Task 11), demonstrate the memory-bandwidth ceiling (Task 12), and beat a static partition with work-stealing on an imbalanced load (Task 13).
Reference harness in Python (combines the analysis pieces):
import math
def report(T1, Tinf, P, T_seq, s):
parallelism = T1 / Tinf
brent = math.ceil(T1 / P) + Tinf # Brent upper bound on T_P
lower = max(math.ceil(T1 / P), Tinf) # Work + Span lower bound
speedup_ceiling = P * (T_seq / T1) # capped by work-efficiency
amdahl = 1.0 / (s + (1.0 - s) / P)
gustafson = s + (1.0 - s) * P
return parallelism, lower, brent, speedup_ceiling, amdahl, gustafson
if __name__ == "__main__":
P, s = 64, 0.05
cases = [
# (name, T1, Tinf, T_seq)
("reduction", 10**6, 20, 10**6),
("Blelloch scan", 2*10**6, 40, 10**6),
("Hillis-Steele", 20*10**6, 20, 10**6), # work-inefficient
("matrix multiply", 10**9, 30, 10**9),
]
print(f"P={P} serial fraction s={s}")
print(f"{'algorithm':>16} {'parallelism':>12} {'lower<=T_P<=Brent':>20} "
f"{'S_P ceiling':>12}")
for name, T1, Tinf, T_seq in cases:
par, lo, br, ceil_sp, _, _ = report(T1, Tinf, P, T_seq, s)
print(f"{name:>16} {par:>12.0f} {f'{lo} <= T_P <= {br}':>20} "
f"{ceil_sp:>12.2f}")
assert lo <= br # lower bound <= upper bound, always
assert ceil_sp <= P + 1e-9 # speedup ceiling never exceeds P
a = 1.0 / (s + (1 - s) / P)
g = s + (1 - s) * P
print(f"\nAmdahl(s={s}, P={P}) = {a:.2f} (ceiling 1/s = {1/s:.0f}) "
f"Gustafson = {g:.2f}")
assert a <= 1 / s + 1e-9
print("OK: T_P bracketed by the laws; S_P <= P; Amdahl <= 1/s; Gustafson linear")
- Analysis answer: Work
T₁is the 1-processor time; spanT∞is the∞-processor time; parallelismT₁/T∞is the speedup ceiling and the processor count past which workers idle. The Work LawT_P ≥ T₁/Pand Span LawT_P ≥ T∞boundT_Pfrom below; BrentT_P ≤ T₁/P + T∞bounds it from above, so greedy is within2×of optimal. Speedup over sequential is capped atP·(T_seq/T₁), so work-efficiency (T₁ = Θ(T_seq)) matters more than a small span — alog nwork blow-up throws away alog nfactor of processors. Amdahl (→ 1/s) governs the fixed-size problem, Gustafson (s + (1−s)·P) the scaled one. And the model's silences — spawn overhead (grain size), shared memory bandwidth (arithmetic intensity), and load imbalance (static vs work-stealing) — are exactly where measured wall-clock departs fromT₁/P + T∞. - Acceptance test: Every analyzed and measured
T_Psatisfiesmax(⌈T₁/P⌉, T∞) ≤ T_P ≤ ⌈T₁/P⌉ + T∞;S_P ≤ Palways; the work-inefficient algorithm's speedup ceiling is≈ P/log n; Amdahl respects1/sand Gustafson is linear; the grain-size curve is U-shaped, the bandwidth-bound reduction saturates belowP, and work-stealing beats static partition on imbalance. The write-up places the model's predictions next to the hardware's behavior — mirroring the whole discipline: build the DAG, compute work and span, schedule it, measure the speedup, and confirm the numbers respect the laws — then explain every gap with the cost the model omits.
Where to go next¶
- Realize the greedy schedule Brent's theorem assumes — randomized work-stealing, the
T_P = T₁/P + O(T∞)bound it achieves, deques and steal protocols, and grain-size control in production runtimes — in the fork–join and work-stealing tasks. - Build and benchmark the work-efficient
Θ(n)-workΘ(log n)-span scan you analyzed here — up-sweep/down-sweep, segmented scan, and its role as the workhorse behind compaction, sorting, and graph algorithms — in the parallel prefix sum / scan tasks. - For the conceptual treatment of PRAM variants, the Work/Span laws, Brent's proof, work-efficiency, and the Amdahl/Gustafson derivations, read this topic's junior, middle, senior, and professional notes.
In this topic
- interview
- tasks