Models of Parallel Computation: PRAM and Work–Span — Middle Level¶
Table of Contents¶
- Introduction
- The PRAM, Restated for Proofs
- The Machine and the Cost Measure
- The Variant Hierarchy: EREW ⊆ CREW ⊆ CRCW
- CRCW Write-Resolution Rules
- The Power Gap: CRCW Is Strictly Stronger
- OR of n Bits in O(1) on CRCW
- The Ω(log n) Lower Bound on EREW
- Simulation Cost Between Variants
- The Work–Span (DAG) Model
- The Computation DAG
- The Work Law and the Span Law
- Brent's Theorem and Greedy Scheduling
- The Greedy Schedule
- Proof of the Brent Bound
- Corollaries: 2-Optimality and Linear Speedup
- Work-Efficiency
- Why Work Matters More Than Span
- Brent's Scheduling Principle: From Work–Span to PRAM
- Speedup, Efficiency, Cost, and the Two Laws of Scaling
- Speedup, Efficiency, Cost
- Amdahl's Law
- Gustafson's Law
- Worked Examples
- Reduction
- Prefix Sum / Scan
- Matrix Multiply
- Code: Work–Span Analyzer and Speedup Measurement
- Go
- Python
- Pitfalls
- Summary
Introduction¶
Focus: turn the junior facts — PRAM variants, work
T₁, spanT∞, parallelismT₁/T∞, the Brent bound, Amdahl intuition — into rigorous statements you can derive. By the end you can proveT_P ≤ T₁/P + T∞from greedy scheduling, place the EREW/CREW/CRCW variants in their power hierarchy (and prove the gap with the OR-of-n-bits separation), convert any work–span algorithm into aP-processor PRAM algorithm via Brent's principle, and derive both Amdahl's and Gustafson's laws.
At the junior level you met two idealized models of parallel computation. The PRAM (Parallel Random Access Machine) is P synchronous processors sharing one memory, stepping in lockstep; it comes in variants — EREW, CREW, CRCW — distinguished by whether concurrent reads and writes to the same cell are allowed. The work–span model abstracts a computation as a DAG of unit tasks, with work T₁ (total tasks) and span T∞ (critical-path length); their ratio T₁/T∞ is the parallelism. You saw the Brent bound T_P ≤ T₁/P + T∞ quoted, and the Amdahl intuition that a serial fraction s caps speedup at 1/s.
This file makes all of that rigorous:
- PRAM formally. We fix the cost measure, prove the power hierarchy
EREW ⊆ CREW ⊆ CRCW, catalog the four CRCW write-resolution rules (COMMON, ARBITRARY, PRIORITY, COMBINING), and quantify the simulation cost: one CRCW step runs on an EREW machine inO(log P)steps. We prove the separation is strict — OR ofnbits isO(1)on CRCW butΩ(log n)on EREW. - The work–span model formally. The computation DAG;
T₁= number of nodes;T∞= longest path. The Work LawT_P ≥ T₁/Pand the Span LawT_P ≥ T∞. - Brent's theorem. A greedy schedule achieves
T_P ≤ T₁/P + T∞. We prove it by counting complete versus incomplete steps. Corollary: greedy is within2×of optimal, and you get linear speedup wheneverP ≪ T₁/T∞. - Work-efficiency. Why
T₁ = Θ(sequential time)matters more than a small span for real-world speedup, and Brent's scheduling principle for compiling a work–span algorithm down to aP-processor PRAM. - Speedup, efficiency, cost; Amdahl vs Gustafson — both derived, with the regime each one describes.
- Worked: reduction, prefix sum (forward link to parallel prefix sum / scan), matrix multiply.
A note on vocabulary used throughout:
| Symbol | Meaning |
|---|---|
P | number of processors |
T₁ | work — total number of unit tasks (= time on 1 processor) |
T∞ | span (depth, critical-path length) — time on ∞ processors |
T_P | time to run on exactly P processors |
T₁/T∞ | parallelism — the average available parallelism |
S_P = T₁/T_P | speedup on P processors |
E_P = S_P/P | efficiency (utilization) |
C_P = P·T_P | cost (processor-time product) |
Throughout, the PRAM is synchronous (a global clock; every processor executes one instruction per step) and idealized: shared-memory access is unit cost, and there is no communication, contention, or synchronization charge. That idealization is the model's whole point and also its central caveat — we return to it in the pitfalls.
The PRAM, Restated for Proofs¶
The Machine and the Cost Measure¶
A PRAM consists of P processors p₀, …, p_{P−1}, each a RAM with private local registers, all attached to one unbounded shared memory. Execution is synchronous: in one step, every active processor simultaneously (i) reads from shared memory, (ii) does a local computation, and (iii) writes to shared memory. The time T_P of a PRAM algorithm on an input is the number of steps until all processors halt; the work is the total number of processor-steps performed (Σ over steps of the number of active processors).
Two cost numbers describe an algorithm:
- Time
T_P— the step count usingPprocessors. - Cost
C_P = P · T_P— the processor-time product, i.e. the total resource the machine reserves (idle processors still count, because the machine is committed toPof them forT_Psteps).
The model charges nothing for memory access beyond one step, nothing for communication, and nothing for synchronization — every processor is magically in lockstep. This is exactly what makes PRAM a clean theoretical model and a poor predictive one for real hardware; it isolates the question "how much parallelism does this problem inherently have?" from every hardware artifact.
The Variant Hierarchy: EREW ⊆ CREW ⊆ CRCW¶
The variants differ only in what happens when several processors touch the same shared-memory cell in the same step:
| Variant | Concurrent Read | Concurrent Write |
|---|---|---|
| EREW (Exclusive Read, Exclusive Write) | forbidden | forbidden |
| CREW (Concurrent Read, Exclusive Write) | allowed | forbidden |
| CRCW (Concurrent Read, Concurrent Write) | allowed | allowed |
"Forbidden" means the algorithm is only legal if it never causes a same-cell collision in a step; "allowed" means the hardware tolerates the collision and resolves it by a rule (for writes, see below).
These form a power hierarchy:
The containments are immediate and worth stating as a lemma, because they justify "an EREW algorithm runs unchanged on a CRCW machine":
Lemma (containment). Any algorithm legal on a weaker model is legal on a stronger one with the same time and processor bounds. An EREW algorithm never issues a concurrent access, so it is a fortiori a legal CREW algorithm; a CREW algorithm never issues a concurrent write, so it is a legal CRCW algorithm.
So an upper bound proved for EREW transfers up the hierarchy for free, and — the contrapositive that does the real work — a lower bound proved for the strongest model (CRCW) is the strongest possible statement. The interesting questions are the reverse directions: how much does a stronger model actually buy you, and at what simulation cost can a weaker model emulate it? Those are the next two subsections' subjects.
CRCW Write-Resolution Rules¶
When CRCW permits several processors to write the same cell in one step, it must say which value lands. There are four standard rules, in increasing order of strength:
- COMMON. All concurrent writers to a cell must write the same value; it is the algorithm's responsibility to guarantee this, and the common value is stored. (Writing differing values is illegal.)
- ARBITRARY. Any one of the competing writes succeeds, chosen adversarially; the algorithm must be correct no matter which writer wins.
- PRIORITY. The writer with the smallest processor index (or smallest priority tag) wins. The outcome is deterministic and the algorithm may rely on it.
- COMBINING. All written values are combined by a fixed associative operator — sum, max, OR, etc. — and the combined result is stored. (This is the strongest and least hardware-realistic.)
They form their own sub-hierarchy COMMON ⊆ ARBITRARY ⊆ PRIORITY ⊆ COMBINING: an algorithm correct under COMMON is correct under ARBITRARY (it never relies on which writer wins, since they all agree), and so on up. Combining-CRCW is strongest because it can, for example, sum n values into one cell in a single step — a primitive the others cannot match in O(1). Unless stated otherwise, "CRCW" in lower-bound discussions means the strongest relevant rule, so the bound is the most pessimistic (hardest to beat).
The Power Gap: CRCW Is Strictly Stronger¶
The hierarchy EREW ⊆ CREW ⊆ CRCW would be uninteresting if the containments were equalities. They are not: CRCW is strictly more powerful, and the cleanest witness is the OR of n bits.
OR of n Bits in O(1) on CRCW¶
Claim. On a COMMON-CRCW PRAM with
nprocessors, the OR ofnbitsx₀, …, x_{n−1}can be computed inO(1)time.
Algorithm. Use a single shared result cell R.
step 1: R ← 0 (one processor initializes)
step 2: for each i in parallel:
if x_i = 1 then R ← 1 (all writers write the SAME value, 1)
In step 2, every processor i with x_i = 1 writes 1 to R simultaneously. Because they all write the identical value 1, this is a legal COMMON write, resolved to 1. If at least one bit is set, R becomes 1; if none is set, R stays 0. That is exactly the OR — in two steps, i.e. O(1), independent of n. The same construction computes AND (write 0 when x_i = 0), and with PRIORITY or COMBINING you get max, min, and "find any/first index where x_i = 1" in O(1) as well.
The magic is the concurrent write: n processors collapse n values into one cell in a single step.
The Ω(log n) Lower Bound on EREW¶
Claim. On an EREW (or CREW) PRAM, computing the OR of
nbits requiresΩ(log n)time, no matter how many processors are available.
Proof (the "knowledge" / fan-in argument). Consider the cell that holds the final answer. Its value must depend on all n input bits — flipping any single x_i from 0 to 1 can change the OR from 0 to 1, so the output cell's content is a function of every input. Define, for a memory cell or register at time t, the set of inputs it depends on. Initially each cell depends on at most one input (the one stored there).
In one EREW step, a processor reads at most one cell and writes at most one cell (exclusive access forbids it from reading two different processors' cells into the same target in a way that pools their dependencies faster). More carefully: a write to a cell makes that cell depend on the union of (its old dependencies) ∪ (the dependencies of the single cell the processor read) ∪ (the processor's local state). Because reads and writes are exclusive, the dependency set of any cell can at most double per step — it merges with at most one other set's worth of information.
Let D(t) be the maximum size of any cell's dependency set after t steps. Then D(0) = 1 and D(t) ≤ 2·D(t−1), so D(t) ≤ 2^t. For the answer cell to depend on all n inputs we need D(t) ≥ n, hence 2^t ≥ n, i.e.
So EREW (and CREW, by the same doubling argument — concurrent reads still let a cell absorb only one other cell's dependencies per write) needs Ω(log n) steps. ∎
Putting the two together:
a Θ(log n) separation. CRCW is strictly more powerful than EREW. The concurrent-write primitive lets information fan in to one cell in a single step; without it, information can only combine at a binary-tree rate, costing log n levels. This same Ω(log n) barrier is exactly why EREW reduction and scan are Θ(log n)-span, not O(1) — see the worked examples.
Simulation Cost Between Variants¶
The strict gap above means a CRCW step cannot always be done in one EREW step. But it can be done in O(log P) EREW steps — the variants are equivalent up to a logarithmic factor:
Theorem (CRCW → EREW simulation). One step of a
P-processor PRIORITY-CRCW PRAM can be simulated by aP-processor EREW PRAM inO(log P)steps.
Sketch. The only obstacles are concurrent reads and concurrent writes; each is resolved by sorting. To simulate a concurrent step, each processor i that wants to access address a_i forms a record (a_i, i, value_i) in its own (exclusive) scratch cell. Sort these P records by address using an EREW sorting network (O(log P) time with P processors, e.g. via Cole's merge sort or AKS). After sorting, all accesses to the same address are contiguous. For a concurrent read, the first processor in each address-group reads the cell, then the value is broadcast to the rest of its group by a parallel prefix/segmented-copy in O(log P) time. For a concurrent write under PRIORITY, the first record in each group (smallest index, since the sort can be made stable on index) is the winner and performs the single exclusive write. Each sub-step — sort, segmented broadcast, group-leader write — is O(log P) on EREW, so the whole simulation is O(log P) per CRCW step. ∎
The consequence is a clean equivalence: any T-step CRCW algorithm runs on EREW in O(T log P) steps. So the variants are the same up to a log factor — which is why theorists often design on the convenient CRCW model and then note the (at most) log P blow-up to land on the realistic EREW. The strict OR separation says the log factor is sometimes necessary; the simulation says it is always sufficient.
The Work–Span (DAG) Model¶
The PRAM fixes P up front and steps in lockstep. The work–span model (also called the DAG model, or multithreaded computation model) is more flexible: it describes the computation independently of P, then asks how it runs for any P. It is the model behind fork–join and work-stealing runtimes, and the one in which Brent's theorem lives.
The Computation DAG¶
Model a parallel computation as a directed acyclic graph G = (V, E):
- Each node
v ∈ Vis a unit-time task (one strand of serial work, taking one time step). - Each edge
(u, v) ∈ Eis a dependency: taskvcannot start until taskuhas finished.
A node is ready when all its predecessors have completed. The DAG is generated dynamically as the program runs (a spawn/fork creates new ready nodes; a sync/join adds dependency edges), but for analysis we treat the whole DAG as a static object. Two numbers summarize it:
Work
T₁ = |V|, the total number of nodes — equivalently, the time to execute the DAG on one processor (which must do every task, one after another).Span
T∞, the number of nodes on the longest directed path inG(the critical path) — equivalently, the time on infinitely many processors, where the only thing that can delay you is a dependency chain.
● T₁ = 8 (eight nodes total)
/ \ T∞ = 4 (longest path: top→…→bottom has 4 nodes)
● ● parallelism = T₁/T∞ = 8/4 = 2
/| |\
● ● ● ●
\ /
●
The parallelism T₁/T∞ is the average width of the DAG: the maximum speedup you could ever hope for, the number of processors beyond which you cannot keep everyone busy. If P > T₁/T∞, some processors must idle on average, because the critical path serializes the work no faster than T∞.
The Work Law and the Span Law¶
Two unconditional lower bounds on T_P follow directly from the definitions. They hold for any schedule on P processors.
The Work Law.
T_P ≥ T₁ / P.In
T_Psteps,Pprocessors perform at mostP · T_Punit-tasks. AllT₁tasks must be done, soP · T_P ≥ T₁, i.e.T_P ≥ T₁/P. You cannot doT₁work faster than dividing it perfectly amongPworkers.The Span Law.
T_P ≥ T∞.The critical path is a chain of
T∞tasks, each depending on the previous; they must execute one after another regardless of how many processors you have. Even infinitely many processors cannot beatT∞. Since adding processors never hurts,T_P ≥ T_∞ ≥ ...is monotone:T_∞ ≤ T_P ≤ T_1.
Combining, for every P:
These two laws bound T_P from below. Brent's theorem, next, bounds it from above by their sum — and the gap between max(a,b) and a+b is at most a factor of 2, which is precisely why greedy scheduling is within 2× of optimal.
Brent's Theorem and Greedy Scheduling¶
The Greedy Schedule¶
A scheduler assigns ready tasks to processors at each step. A greedy (work-conserving) scheduler obeys one rule:
Greedy rule. At every step, if at least
Ptasks are ready, run anyPof them (a complete step). If fewer thanPare ready, run all of them (an incomplete step). Never leave a processor idle while a ready task exists.
The point of greed is that no processor is ever idle when there is work it could do. This is exactly what a work-stealing runtime approximates in practice — an idle processor steals a ready task rather than waiting.
Proof of the Brent Bound¶
Brent's Theorem (greedy scheduling bound). Any greedy schedule of a DAG with work
T₁and spanT∞onPprocessors runs in time
Proof. Classify every step of the greedy schedule as either complete or incomplete.
Complete steps. A complete step runs exactly P tasks. Let there be c complete steps. They account for c · P tasks, and since the total number of tasks is T₁, we have c · P ≤ T₁, hence
Incomplete steps. An incomplete step runs fewer than P tasks — which, by the greedy rule, means it ran every ready task (otherwise a processor sat idle next to a ready task, contradicting greed). We claim each incomplete step reduces the span of the remaining DAG by exactly 1.
Here is the key argument. Let G' be the sub-DAG of not-yet-executed tasks at the start of an incomplete step, and let ℓ be its span (longest path). Every task with no unexecuted predecessor is ready; on an incomplete step greed executes all ready tasks, i.e. all tasks at "depth 0" of G'. Removing the entire set of depth-0 tasks from G' removes the first node of every longest path, so the span of what remains is ℓ − 1. Thus each incomplete step strictly shortens the remaining critical path by one. The remaining span starts at T∞ and cannot go below 0, so there are at most T∞ incomplete steps:
Total. Every step is complete or incomplete, so
The two halves of the bound are exactly the two lower-bound laws made into an additive upper bound: complete steps charge against the Work Law (T₁/P), incomplete steps charge against the Span Law (T∞).
Corollaries: 2-Optimality and Linear Speedup¶
Corollary 1 (greedy is within 2× of optimal). Let
T_P*be the optimal schedule's time onPprocessors. By the lower bounds,T_P* ≥ max(T₁/P, T∞). The greedy schedule givesT_P ≤ T₁/P + T∞ ≤ 2·max(T₁/P, T∞) ≤ 2·T_P*. So any greedy schedule is at most twice the optimal — no scheduler can do more than twice as well, which is why simple greedy/work-stealing schedulers are good enough in practice.Corollary 2 (linear speedup while
P ≪ T₁/T∞). SupposeP ≤ (T₁/T∞)/cfor somec ≥ 1, i.e. the parallelism slackness(T₁/T∞)/P ≥ c. ThenT∞ ≤ T₁/(cP), soThe speedup is
S_P = T₁/T_P ≥ P / (1 + 1/c) = P·c/(c+1), which→ Pas the slacknessc → ∞. As long asPis comfortably below the parallelismT₁/T∞, you get near-linear speedup; the termT∞is a negligible additive overhead. The parallelismT₁/T∞is thus the processor count up to which the algorithm scales — beyond it, the span term dominates and adding processors stops helping.
This is the practical reading of work–span: design for high parallelism (T₁/T∞ ≫ P) and the scheduler — any greedy one — does the rest, delivering speedup close to P.
Work-Efficiency¶
Why Work Matters More Than Span¶
A parallel algorithm is work-efficient if its work matches the best sequential algorithm:
Definition. A parallel algorithm is work-efficient if
T₁ = Θ(T_seq), whereT_seqis the running time of the best known sequential algorithm for the problem.
Work-efficiency is the property beginners undervalue and practitioners obsess over. The reason is the Work Law. Speedup over the sequential baseline is T_seq / T_P, and by the Work Law T_P ≥ T₁/P, so
If the algorithm is work-efficient, T_seq/T₁ = Θ(1) and the speedup ceiling is Θ(P) — you can actually approach linear speedup. But if the algorithm is work-inefficient — say T₁ = Θ(T_seq · log n) — then the ceiling drops to Θ(P / log n): you are throwing away a log n factor of your processors before you even start. A work-inefficient algorithm with a beautiful O(log n) span can be slower in wall-clock time than a work-efficient one with O(log² n) span, because the inefficient one wastes processors doing redundant work that the sequential machine never did.
work-efficient (T₁ = Θ(T_seq)): speedup can reach Θ(P).
work-inefficient (T₁ = Θ(T_seq·log n)): speedup capped at Θ(P/log n) — wasted processors.
The slogan: span buys you latency, work buys you throughput. A short span is worthless if you bought it by inflating the work, because the inflated work is paid by every processor on every run. The gold standard is work-efficient and low-span — the work-efficient scan (work Θ(n), span Θ(log n)) is the canonical example, contrasted with the naive scan that is Θ(n log n) work and therefore wasteful.
Brent's Scheduling Principle: From Work–Span to PRAM¶
How do you implement a work–span algorithm on a fixed-P PRAM? Brent's theorem gives the recipe, often called Brent's scheduling principle (or the work–time framework):
Brent's Principle. An algorithm described as a sequence of parallel steps with total work
T₁and spanT∞can be executed on aP-processor PRAM in time
The idea is to take the work–time description — at each of the T∞ levels, some number of independent operations w_t (with Σ_t w_t = T₁) is available — and have the P processors simulate each level by dividing its w_t operations among them in ⌈w_t / P⌉ rounds. Summing over levels:
This is Brent's theorem in algorithm-design clothes: it tells you that you only ever need to specify the work and the span, and any P-processor machine realizes the T₁/P + T∞ bound. The practical caveats are (1) the per-level work counts w_t must be computable so processors know their assignments (often itself needing a scan), and (2) the simulation must respect the PRAM variant's read/write rules. But the headline is liberating: design in the P-independent work–span model, and Brent's principle compiles it to every P.
Speedup, Efficiency, Cost, and the Two Laws of Scaling¶
Speedup, Efficiency, Cost¶
Three derived quantities measure how well a parallel algorithm uses its P processors:
| Quantity | Definition | Meaning | Range |
|---|---|---|---|
| Speedup | S_P = T₁ / T_P | how many times faster than 1 processor | 1 ≤ S_P ≤ P |
| Efficiency | E_P = S_P / P = T₁/(P·T_P) | fraction of processors usefully busy | 0 < E_P ≤ 1 |
| Cost | C_P = P · T_P | total processor-time reserved | ≥ T₁ |
By the Work Law, T_P ≥ T₁/P, so S_P ≤ P (linear speedup is the ceiling) and E_P ≤ 1. An algorithm is cost-optimal when C_P = P·T_P = Θ(T₁) = Θ(T_seq) — it wastes no more than a constant factor of processor-time over the sequential work, equivalently E_P = Θ(1). Cost-optimality and work-efficiency-at-efficiency-Θ(1) are the same idea seen from two angles: no processor-time is squandered.
Superlinear speedup (S_P > P) is occasionally observed on real machines — caused by P processors' combined caches holding data that overflowed one processor's cache — but it is an artifact of the memory hierarchy, not of the PRAM/work–span model, where S_P ≤ P is a theorem.
Amdahl's Law¶
Suppose a fraction s of the computation is inherently serial (cannot be parallelized) and the remaining 1 − s is perfectly parallelizable. Normalize the 1-processor time to 1. On P processors the serial part still takes s, while the parallel part takes (1−s)/P:
The speedup is therefore
Amdahl's Law.
Take the limit P → ∞:
The speedup is capped at
1/s, no matter how many processors you add. A program that is5%serial (s = 0.05) can never run more than20×faster, even on a million cores. This is Amdahl's sobering verdict on fixed-problem-size parallelism: the serial fractionsis a hard ceiling, and the marginal value of each processor collapses asPgrows.
Connecting to work–span: the serial fraction acts like a mandatory span. If s of the work is on the critical path, then T∞ ≥ s·T₁, so the parallelism T₁/T∞ ≤ 1/s — Amdahl's 1/s ceiling is exactly the parallelism ceiling of a DAG whose critical path carries an s-fraction of the work.
Gustafson's Law¶
Amdahl fixes the problem size and scales the machine. Gustafson's law observes that in practice we scale both — when you get a bigger machine, you run a bigger problem (finer mesh, more data, higher resolution), and the parallel work grows while the serial part stays roughly constant.
Fix the parallel run time and ask how long the same work would take on one processor. Let s be the serial fraction of the parallel execution and 1 − s the parallel fraction. The work that took time 1 on P processors would take s + (1 − s)·P on one processor (the parallel part, done by P workers in parallel time 1−s, is (1−s)·P of serial work). The scaled speedup is:
Gustafson's Law.
This is linear in P (slope 1 − s), with no fixed ceiling. The contrast is the whole point:
Amdahl (fixed problem): S_P = 1/(s + (1−s)/P) → 1/s (bounded)
Gustafson (scaled problem): S_P = s + (1−s)·P → ∞ as P→∞ (linear)
Neither law contradicts the other — they answer different questions. Amdahl applies when the problem size is fixed (you want the same computation done faster — strong scaling). Gustafson applies when the problem grows with the machine (you want a bigger computation in the same time — weak scaling). Most large-scale scientific computing lives in Gustafson's world: nobody buys a supercomputer to solve last year's problem 20× faster; they solve a 1000× bigger problem. Quoting Amdahl's pessimistic 1/s as if it were the final word on parallelism is a classic mistake — it is the right answer to only one of two questions.
Worked Examples¶
Three canonical algorithms, each analyzed by work and span. All assume an EREW PRAM (or the equivalent DAG).
Reduction¶
Problem. Combine n values with an associative operator (sum, max, …) into one result.
Algorithm. A balanced binary tree of combinations: pair up adjacent elements and combine, halving the count each level, for log₂ n levels.
level 0: x0 x1 x2 x3 x4 x5 x6 x7 (n leaves)
level 1: (x0+x1) (x2+x3) (x4+x5) (x6+x7)
level 2: (x0..x3) (x4..x7)
level 3: (x0..x7) (the root = result)
Analysis. Level k does n/2^{k+1} combinations; summing over levels, total combinations = n/2 + n/4 + … + 1 = n − 1. So work T₁ = Θ(n) — work-efficient, since sequential reduction is also Θ(n). The tree has log₂ n levels, each depending on the one below, so span T∞ = Θ(log n). Parallelism = Θ(n / log n). The Θ(log n) span is forced by the EREW OR lower bound: combining n values into one cannot beat log n levels without concurrent writes. Full treatment in parallel reduce and map.
Prefix Sum / Scan¶
Problem. Given x₀, …, x_{n−1}, compute all prefix sums y_k = x₀ + x₁ + … + x_k (an inclusive scan).
Naive parallel scan. The Hillis–Steele algorithm: for d = 1, 2, 4, …, every element adds the element d positions to its left. This has span Θ(log n) but does Θ(n) work per level over log n levels — work T₁ = Θ(n log n), which is work-inefficient (sequential scan is Θ(n)).
Work-efficient scan. The Blelloch algorithm runs an up-sweep (a reduction tree, computing partial sums) then a down-sweep (pushing prefixes back down), each Θ(log n) levels with Θ(n) total work across the whole sweep:
This matches the sequential Θ(n) work and keeps the Θ(log n) span — the best of both, and the reason it is the production algorithm. The full up-sweep/down-sweep derivation, with the work-efficiency proof, is in parallel prefix sum / scan. Scan is the workhorse primitive behind compaction, sorting, and graph algorithms, which is why getting it work-efficient matters so much.
Matrix Multiply¶
Problem. C = A · B for n × n matrices, by the standard triple-loop definition C[i][j] = Σ_k A[i][k]·B[k][j].
Analysis. There are n² output entries, each a dot product of length n. The n products for one entry are independent (compute in parallel), and summing them is a reduction of n terms — span Θ(log n). All n² entries are independent of each other. So:
Work Θ(n³) (the n³ multiply-adds, work-efficient against the classical sequential algorithm — though not against Strassen's Θ(n^{2.807})), and span Θ(log n) (the depth of the dot-product reductions; the n² entries and the n products within each impose no extra depth). The enormous parallelism Θ(n³/log n) is why dense matrix multiply is the poster child of parallel computing — for any realistic P, you sit far below the parallelism, in the linear-speedup regime of Corollary 2.
A summary of the three:
| Algorithm | Work T₁ | Span T∞ | Parallelism | Work-efficient? |
|---|---|---|---|---|
| Reduction | Θ(n) | Θ(log n) | Θ(n/log n) | yes |
| Naive scan (Hillis–Steele) | Θ(n log n) | Θ(log n) | Θ(n) | no |
| Work-efficient scan (Blelloch) | Θ(n) | Θ(log n) | Θ(n/log n) | yes |
| Matrix multiply | Θ(n³) | Θ(log n) | Θ(n³/log n) | yes (vs classical) |
Code: Work–Span Analyzer and Speedup Measurement¶
The theory predicts three measurable facts:
- For a task DAG,
T₁is the node count andT∞is the longest-path length; parallelism is their ratio. - A greedy schedule of that DAG on
Pprocessors finishes in≤ T₁/P + T∞steps (Brent's bound), and the achieved speedup approachesPwhileP ≪ T₁/T∞. - Amdahl's curve
S_P = 1/(s + (1−s)/P)flattens toward1/s.
The code below (a) computes T₁, T∞, parallelism from a DAG by topological longest-path; (b) simulates a greedy schedule and reports the achieved T_P against the Brent bound; (c) prints the Amdahl curve.
Go¶
package main
import (
"fmt"
)
// DAG is a task graph: node i depends on every node in deps[i].
type DAG struct {
n int
deps [][]int // deps[i] = predecessors of node i
}
// workSpan returns T1 (work = #nodes), Tinf (span = longest path in nodes),
// and the level of each node (longest distance from a source, 1-based).
func (g *DAG) workSpan() (T1, Tinf int, level []int) {
level = make([]int, g.n)
// Process nodes in an order where all predecessors precede a node.
// Assume input is given in topological order 0..n-1.
for v := 0; v < g.n; v++ {
best := 0
for _, u := range g.deps[v] {
if level[u] > best {
best = level[u]
}
}
level[v] = best + 1 // this node sits one below its deepest predecessor
if level[v] > Tinf {
Tinf = level[v]
}
}
return g.n, Tinf, level
}
// greedy simulates a greedy schedule on P processors and returns T_P (steps).
// A node becomes ready when all its predecessors have completed.
func (g *DAG) greedy(P int) int {
done := make([]bool, g.n)
remaining := make([]int, g.n) // count of unfinished predecessors
ready := []int{}
for v := 0; v < g.n; v++ {
remaining[v] = len(g.deps[v])
if remaining[v] == 0 {
ready = append(ready, v)
}
}
// successors[u] = nodes that depend on u
succ := make([][]int, g.n)
for v := 0; v < g.n; v++ {
for _, u := range g.deps[v] {
succ[u] = append(succ[u], v)
}
}
steps := 0
for len(ready) > 0 {
steps++
take := P
if len(ready) < P {
take = len(ready) // incomplete step: run all ready
}
batch := ready[:take]
ready = ready[take:]
var newlyReady []int
for _, u := range batch {
done[u] = true
for _, w := range succ[u] {
remaining[w]--
if remaining[w] == 0 {
newlyReady = append(newlyReady, w)
}
}
}
ready = append(ready, newlyReady...)
}
return steps
}
func amdahl(s float64, P int) float64 {
return 1.0 / (s + (1.0-s)/float64(P))
}
func main() {
// A balanced binary reduction tree over 8 leaves: 15 nodes, span 4.
// Nodes 0..7 = leaves; 8..11 = level-1; 12,13 = level-2; 14 = root.
g := &DAG{
n: 15,
deps: [][]int{
{}, {}, {}, {}, {}, {}, {}, {}, // 0..7 leaves
{0, 1}, {2, 3}, {4, 5}, {6, 7}, // 8..11
{8, 9}, {10, 11}, // 12,13
{12, 13}, // 14 root
},
}
T1, Tinf, _ := g.workSpan()
fmt.Printf("reduction DAG: T1=%d Tinf=%d parallelism=%.2f\n",
T1, Tinf, float64(T1)/float64(Tinf))
for _, P := range []int{1, 2, 4, 8} {
tp := g.greedy(P)
brent := (T1+P-1)/P + Tinf // ceil(T1/P) + Tinf
fmt.Printf(" P=%d: T_P=%d Brent bound T1/P+Tinf=%d speedup=%.2f\n",
P, tp, brent, float64(T1)/float64(tp))
}
fmt.Println("Amdahl's law (serial fraction s = 0.05):")
for _, P := range []int{1, 2, 4, 16, 64, 1024, 1 << 20} {
fmt.Printf(" P=%-8d speedup=%.2f (ceiling 1/s = %.1f)\n",
P, amdahl(0.05, P), 1.0/0.05)
}
}
Expected output:
reduction DAG: T1=15 Tinf=4 parallelism=3.75
P=1: T_P=15 Brent bound T1/P+Tinf=19 speedup=1.00
P=2: T_P=8 Brent bound T1/P+Tinf=12 speedup=1.88
P=4: T_P=6 Brent bound T1/P+Tinf=8 speedup=2.50
P=8: T_P=4 Brent bound T1/P+Tinf=6 speedup=3.75
Amdahl's law (serial fraction s = 0.05):
P=1 speedup=1.00 (ceiling 1/s = 20.0)
P=2 speedup=1.90 (ceiling 1/s = 20.0)
P=4 speedup=3.48 (ceiling 1/s = 20.0)
P=16 speedup=9.14 (ceiling 1/s = 20.0)
P=64 speedup=15.42 (ceiling 1/s = 20.0)
P=1024 speedup=19.62 (ceiling 1/s = 20.0)
P=1048576 speedup=20.00 (ceiling 1/s = 20.0)
Three confirmations: the DAG's T₁ = 15, T∞ = 4 give parallelism 3.75; for every P the simulated greedy T_P is comfortably below the Brent bound T₁/P + T∞, and at P = 8 ≥ parallelism the speedup pins at exactly the parallelism 3.75 (more processors cannot help — the span dominates); and Amdahl's speedup climbs but is strangled by the 1/s = 20 ceiling, gaining almost nothing from P = 1024 to P = 2²⁰.
Python¶
from collections import deque
def work_span(deps):
"""T1 = #nodes, Tinf = longest path (in nodes). Input in topological order."""
n = len(deps)
level = [0] * n
for v in range(n):
best = max((level[u] for u in deps[v]), default=0)
level[v] = best + 1
return n, max(level), level
def greedy(deps, P):
"""Simulate a greedy schedule on P processors; return T_P (steps)."""
n = len(deps)
remaining = [len(deps[v]) for v in range(n)]
succ = [[] for _ in range(n)]
for v in range(n):
for u in deps[v]:
succ[u].append(v)
ready = deque(v for v in range(n) if remaining[v] == 0)
steps = 0
while ready:
steps += 1
take = min(P, len(ready)) # incomplete step runs all ready
batch = [ready.popleft() for _ in range(take)]
for u in batch:
for w in succ[u]:
remaining[w] -= 1
if remaining[w] == 0:
ready.append(w)
return steps
def amdahl(s, P):
return 1.0 / (s + (1.0 - s) / P)
def gustafson(s, P):
return s + (1.0 - s) * P
def main():
# Balanced binary reduction tree over 8 leaves: 15 nodes, span 4.
deps = [
[], [], [], [], [], [], [], [], # 0..7 leaves
[0, 1], [2, 3], [4, 5], [6, 7], # 8..11
[8, 9], [10, 11], # 12,13
[12, 13], # 14 root
]
T1, Tinf, _ = work_span(deps)
print(f"reduction DAG: T1={T1} Tinf={Tinf} "
f"parallelism={T1 / Tinf:.2f}")
for P in (1, 2, 4, 8):
tp = greedy(deps, P)
brent = -(-T1 // P) + Tinf # ceil(T1/P) + Tinf
print(f" P={P}: T_P={tp} Brent bound={brent} "
f"speedup={T1 / tp:.2f}")
print("Amdahl vs Gustafson (serial fraction s = 0.05):")
for P in (1, 4, 16, 64, 1024):
print(f" P={P:<6} Amdahl={amdahl(0.05, P):6.2f} "
f"Gustafson={gustafson(0.05, P):8.2f}")
if __name__ == "__main__":
main()
Both programs make the abstractions tangible: work_span reads T₁ and T∞ straight off the DAG; greedy shows the achieved time always satisfies Brent's T₁/P + T∞ and saturates at the parallelism; and the scaling functions put Amdahl's bounded 1/s curve next to Gustafson's unbounded linear one — the same s = 0.05 gives a 20× ceiling under Amdahl but a ≈ 0.95·P slope under Gustafson, the numerical face of "fixed problem vs scaled problem."
Pitfalls¶
-
Counting span wrong. Span is the longest path through the DAG, not the number of levels you happen to have drawn, and not
T₁/Pfor anyP. A common slip is to forget that async/joinadds dependency edges that lengthen the critical path: two parallel branches that both feed a join contributemax(span of branch)plus the join, not the sum. ComputeT∞as a genuine longest-path (thelevel[v] = 1 + max over predecessorsrecurrence), or you will overstate the parallelism and overpromise speedup. -
Chasing low span while ignoring work. A
Θ(log n)-span algorithm that doesΘ(n log n)work (the naive Hillis–Steele scan) can be slower in practice than anΘ(log² n)-span algorithm that doesΘ(n)work, because the extralog nfactor of work is paid by every processor on every run. Work-efficiencyT₁ = Θ(T_seq)is the first thing to check; a small span is only valuable on top of efficient work. Speedup over sequential is capped atP · (T_seq/T₁), so alog nwork blow-up throws away alog nfactor of processors before you start. -
Forgetting Amdahl's serial fraction. Optimizing the parallel part while a fixed serial fraction
sremains is the canonical waste of effort: ifs = 0.1, no amount of parallelism beats10×, so doubling the core count past a point buys nothing. Find and attack the serial fraction itself (often I/O, initialization, or a sequential reduction step), not just the already-parallel region. Conversely, do not quote Amdahl's pessimism when the problem grows with the machine — that is Gustafson's regime, where speedup is linear. -
Confusing Amdahl and Gustafson. They answer different questions. Amdahl: fixed problem size, how much faster (strong scaling, ceiling
1/s). Gustafson: problem grows withP, how much more work in the same time (weak scaling, linear). Reporting an Amdahl ceiling for a workload that actually scales with the machine understates parallel value; reporting Gustafson linearity for a fixed-size job overstates it. State which scaling regime you are in. -
Treating PRAM as a hardware model. PRAM charges nothing for communication, memory contention, or synchronization — every processor accesses shared memory in unit time, in lockstep. Real machines pay for cache coherence, NUMA distance, network latency, and barrier synchronization, none of which appear in
T_P = T₁/P + T∞. PRAM and work–span tell you the inherent parallelism of a problem (a real and useful thing); they do not predict wall-clock time. A high-parallelism algorithm can still be slow if its memory-access pattern thrashes the cache — see the I/O and cache models for the cost the PRAM ignores. -
Assuming
S_Pcan exceedP. In the PRAM/work–span modelS_P = T₁/T_P ≤ Pis a theorem (it is the Work Law). Observed superlinear speedup on real hardware is a memory-hierarchy artifact (more aggregate cache), not a contradiction — it lives outside the model. If your model analysis predictsS_P > P, you have a bug, almost always an undercount ofT₁or an overcount of parallelism. -
Using
M, not the right CRCW rule, in a separation. When you claim a problem isO(1)on CRCW, name the write-resolution rule (COMMON suffices for OR; you need COMBINING for sum-in-one-step). And when you claim aloglower bound, prove it for the weakest model you target (EREW), since the bound does not automatically transfer downward from CRCW. Mixing the rules silently changes the power and invalidates the bound.
Summary¶
-
The PRAM is
Psynchronous processors on one shared memory, with unit-cost access and no communication/contention/synchronization charge. Variants form a power hierarchyEREW ⊆ CREW ⊆ CRCW; CRCW resolves concurrent writes by COMMON ⊆ ARBITRARY ⊆ PRIORITY ⊆ COMBINING. An upper bound flows up the hierarchy for free; a lower bound proved on CRCW is the strongest. -
CRCW is strictly stronger. OR of
nbits isΘ(1)on COMMON-CRCW (all set bits write1to one cell at once) butΩ(log n)on EREW (a cell's dependency set at most doubles per step, so reaching allninputs needs≥ log₂ nsteps). One CRCW step simulates on EREW inO(log P)via sort-then-broadcast — the variants are equivalent up to alogfactor. -
The work–span model describes a computation as a DAG: work
T₁= node count (1-processor time), spanT∞= longest path (∞-processor time), parallelismT₁/T∞. The Work LawT_P ≥ T₁/Pand Span LawT_P ≥ T∞are unconditional lower bounds. -
Brent's theorem: a greedy schedule achieves
T_P ≤ T₁/P + T∞. Proof: complete steps number≤ T₁/P(each usesPof theT₁tasks); incomplete steps number≤ T∞(each shortens the remaining critical path by 1). Corollary: greedy is within2×of optimal, and speedup is near-linear whileP ≪ T₁/T∞. -
Work-efficiency (
T₁ = Θ(T_seq)) matters more than span: speedup over sequential is capped atP·(T_seq/T₁), so a work blow-up wastes processors. Brent's scheduling principle compiles any work–span algorithm to aP-processor PRAM inO(T₁/P + T∞). -
Speedup
S_P = T₁/T_P ≤ P, efficiencyE_P = S_P/P ≤ 1, costC_P = P·T_P ≥ T₁. Amdahl (fixed problem):S_P = 1/(s + (1−s)/P) → 1/s— a hard ceiling set by the serial fraction. Gustafson (scaled problem):S_P = s + (1−s)·P— unbounded, linear. They describe strong vs weak scaling; name which you mean. -
Worked bounds: reduction
T₁=Θ(n), T∞=Θ(log n); work-efficient scan same (vs the wastefulΘ(n log n)-work naive scan); matrix multiplyT₁=Θ(n³), T∞=Θ(log n). High parallelism + work-efficiency is the goal.
Revisit junior for the intuition behind these models; advance to senior for the deeper PRAM lower bounds, the BSP/LogP communication-aware models, and randomized work-stealing analysis. Continue to parallel prefix sum / scan for the work-efficient scan in full, parallel reduce and map for the building-block primitives, and fork–join and work-stealing for how a real runtime realizes the greedy schedule Brent's theorem assumes.
In this topic
- junior
- middle
- senior
- professional