Parallel Prefix Sum (Scan) — Interview Questions¶
Table of Contents¶
- Conceptual Questions
- The Two Algorithms: Hillis–Steele vs Blelloch
- Applications
- Lower Bound & Recurrences
- Gotcha / Trick Questions
- Rapid-Fire Q&A
- Common Mistakes
- Tips & Summary
Conceptual Questions¶
Q1: "Prefix sum looks inherently sequential — out[i] = out[i-1] + a[i]. How do you parallelize it?" (Easy)¶
Answer: The recurrence looks serial, but the operation ⊕ is associative, and associativity is exactly the property that lets you re-parenthesize a long chain into a balanced tree. a₀ ⊕ a₁ ⊕ a₂ ⊕ a₃ can be computed as (a₀ ⊕ a₁) ⊕ (a₂ ⊕ a₃) — the two halves are independent and run in parallel. Recursing this gives a tree of depth log n.
So scan is not serial: it has Θ(n) work and Θ(log n) span, hence parallelism Θ(n/log n). The trick is to compute the partial results once, in tree order, and reuse them — not to recompute each prefix from scratch. The one requirement is associativity; commutativity is not needed (order is preserved), which is why scan works for non-commutative monoids like matrix products and affine maps.
Q2: Define inclusive scan vs exclusive scan. (Easy)¶
Answer: Given input a₀, a₁, …, a_{n-1} and associative ⊕:
- Inclusive scan:
out[i] = a₀ ⊕ a₁ ⊕ … ⊕ a_i— elementiis included in its own prefix. - Exclusive scan (a.k.a. prescan):
out[i] = a₀ ⊕ … ⊕ a_{i-1}, without[0] = identity(e.g.0for+) — elementiis excluded.
Conversion is O(1) extra: exclusive[i] = inclusive[i] ⊖ a[i], or shift-right inclusive by one and drop the last. Exclusive scan is the one you usually want for indexing — it gives each element the offset of where it starts, which is precisely what stream compaction and radix-sort splits need (Q6–Q7).
Q3: Reduction and scan have the same (work, span) — so what's the difference? (Medium)¶
Answer: Both are (Θ(n) work, Θ(log n) span), both built on a balanced tree of ⊕. The difference is what they return:
- Reduction returns a single value — the total
a₀ ⊕ … ⊕ a_{n-1}. Only the up-sweep is needed. - Scan returns all
nprefixes. It needs an up-sweep and a down-sweep to distribute partial totals back to every position.
So scan ≈ "reduction that remembers and redistributes its partial results." Knowing they share the same asymptotics — and that scan costs a second tree pass — is the canonical framing. See ../01-models-pram-work-span/interview.md.
The Two Algorithms: Hillis–Steele vs Blelloch¶
Q4: Describe Hillis–Steele scan and give its work and span. (Medium)¶
Answer: Hillis–Steele (the naive or step-efficient scan) runs log n rounds. In round d (for d = 1, 2, 4, … < n), every element adds the element d positions to its left:
After round with stride d, each x[i] holds the sum of the 2d elements ending at i; after log n rounds it holds the full inclusive prefix.
- Span:
Θ(log n)— there arelog nrounds. - Work:
Θ(n log n)— each oflog nrounds touchesΘ(n)elements.
It is not work-efficient: Θ(n log n) total operations versus the serial Θ(n). Its virtues are simplicity and being an in-place, depth-minimal pattern with no second pass — attractive on a GPU within a single warp/block where extra work is cheaper than synchronization.
Q5: Describe Blelloch scan (up-sweep / down-sweep) and give its work and span. (Hard)¶
Answer: Blelloch is the work-efficient scan: two passes over a conceptual balanced binary tree.
Up-sweep (reduce) — leaves to root, building partial sums in place. Stride doubles each level:
for d = 1, 2, 4, ... n/2:
for k = 0, d*2, 2*d*2, ... in parallel:
x[k + 2d - 1] = x[k + d - 1] ⊕ x[k + 2d - 1]
Down-sweep — set the root to the identity, then walk root to leaves; at each node, the left child receives the parent's value and the parent receives parent ⊕ (old left child):
x[n-1] = identity
for d = n/2, ... 4, 2, 1:
for k = 0, d*2, ... in parallel:
t = x[k + d - 1]
x[k + d - 1] = x[k + 2d - 1]
x[k + 2d - 1] = x[k + 2d - 1] ⊕ t
This produces an exclusive scan.
- Span:
Θ(log n)—log nup-sweep levels +log ndown-sweep levels. - Work:
Θ(n)— the levels don/2 + n/4 + … = noperations each pass; both passes sum toΘ(n).
So Blelloch matches the serial work Θ(n) while keeping Θ(log n) span — work-efficient, parallelism Θ(n/log n).
Q6: Why does work-efficiency matter — isn't the Θ(log n) span all that counts? (Hard)¶
Answer: No. By the work law T_P ≥ T₁/P, wall-clock time is floored by total work divided by the processor count P. On any real machine P is bounded (thousands of cores, not n). Hillis–Steele's extra log n factor of work means every processor does log n× more operations, so for large n on bounded P it can be slower than the serial loop despite its beautiful span.
Blelloch wastes no processors: Θ(n) work spread over P cores gives Θ(n/P + log n) time — near-linear speedup until P ≈ n/log n. Rule: be work-efficient first, low-span second. This is the textbook illustration of why span alone is a trap. See ../01-models-pram-work-span/interview.md.
Applications¶
Q7: How does scan implement a parallel FILTER / stream compaction? (Medium)¶
Answer: Stream compaction = "keep the elements that pass a predicate, packed contiguously." The serial version is a single loop with a running write-cursor — apparently serial because each output index depends on how many earlier elements passed. Scan turns that running cursor into a parallel computation:
- Flags: in parallel, set
flag[i] = 1ifa[i]passes the predicate, else0. - Exclusive scan of the flags →
offset[i]= the number of kept elements beforei= exactly the destination index fora[i]if it is kept. - Scatter: in parallel,
if flag[i]: out[offset[i]] = a[i].
a: [d, a, c, b, e] (keep vowels: a, e)
flag: [0, 1, 0, 0, 1]
offset: [0, 0, 1, 1, 1] (exclusive scan)
scatter: a → out[0], e → out[1] => out = [a, e]
The exclusive scan is the whole trick: it computes every output position in parallel, with Θ(n) work and Θ(log n) span. This is the foundational pattern behind GPU filtering, partitioning, and quicksort partition steps.
Q8: How is scan used in radix sort? (Hard)¶
Answer: Radix sort processes keys one digit (or bit-group) at a time; each pass must stably scatter keys into buckets by the current digit. The destination of each key is "the start offset of its bucket plus how many same-digit keys came before it" — again a running count, again a scan.
Binary (1-bit) radix split: for the current bit,
- Compute
flag[i] = bit of a[i]. The0-keys go to the front, the1-keys after them. - Exclusive scan of
(1 - flag)counts the0-keys before each position → offsets for the0-bucket. - The total count of
0s gives the base offset for the1-bucket; a second scan overflagplaces the1-keys. - Scatter every key to its computed index — stable and fully parallel.
For multi-bit digits you compute a per-bucket histogram and an exclusive scan over the histogram to get each bucket's starting offset, then a per-element scan within buckets. Scan is what makes the scatter offsets computable in parallel — it is the heart of every GPU radix sort. See ../03-parallel-sorting-and-merging/interview.md.
Q9: What is segmented scan, and what is it good for? (Hard)¶
Answer: Segmented scan runs independent scans over contiguous segments of one array in a single parallel pass, marked by a head-flag bit per element (1 = start of a new segment, which resets the accumulation). It is implemented by scanning a modified monoid on (flag, value) pairs whose operator resets when it crosses a segment boundary — so it keeps the same Θ(n) work, Θ(log n) span.
Why it matters: it parallelizes irregular, nested computations as one flat scan, avoiding load imbalance from ragged inner loops. Key uses:
- SpMV (sparse matrix–vector multiply): each matrix row is a segment; segmented scan sums the per-row products in parallel regardless of how the nonzeros are distributed across rows.
- Parallel quicksort / partition: sub-arrays at one recursion level are segments scanned together.
- Any "do a scan/reduction per group" over jagged data without launching one kernel per group.
Lower Bound & Recurrences¶
Q10: Why does scan need Ω(log n) span? (Medium)¶
Answer: The last output, out[n-1] = a₀ ⊕ … ⊕ a_{n-1}, depends on all n inputs. Each ⊕ gate (or PRAM step) has bounded fan-in — it combines at most a constant number of values. To funnel information from n inputs into one output through constant-fan-in gates, the data must pass through a tree of depth at least log_c n = Ω(log n). No fewer levels can gather n items. Hence any scan (indeed any reduction) has span Ω(log n), and Blelloch's Θ(log n) is asymptotically optimal. The Θ(log n) is not an artifact of a particular algorithm — it is an information-theoretic floor.
Q11: Why can a linear recurrence x_i = a_i · x_{i-1} + b_i be computed as a scan? (Hard)¶
Answer: Because each step is an affine map f_i(x) = a_i · x + b_i, and affine maps are closed under composition — they form an associative monoid (with identity f(x) = x, i.e. (a, b) = (1, 0)). Composing two affine maps:
so the pair (a, b) composes by (a₂, b₂) ⊕ (a₁, b₁) = (a₂ a₁, a₂ b₁ + b₂), an associative (non-commutative) operator. Running an inclusive scan with this ⊕ over the (a_i, b_i) pairs yields, at position i, the composed map from the start up to i; applying it to x₀ gives x_i. The "serial-looking" recurrence is thus a scan over the affine-composition monoid — Θ(n) work, Θ(log n) span. This generalizes: any recurrence whose update is an associative map over a fixed-dimension state (e.g. tridiagonal solves, IIR filters, finite-state-machine runs) parallelizes by scan.
Gotcha / Trick Questions¶
Q12: "Is a floating-point parallel scan deterministic — same result every run?" (Hard)¶
Answer: No. Floating-point addition is not associative ((a + b) + c ≠ a + (b + c) due to rounding), yet every parallel scan re-associates the sum into a tree whose shape can depend on the thread count, block size, or scheduling. Different groupings round differently, so a parallel float scan can give a different result than the serial scan, and different results across runs if the reduction tree varies.
Consequences and fixes: don't expect bit-identical reproducibility from a parallel float scan; pin the reduction-tree shape if you need determinism; or use higher-precision/compensated (Kahan) accumulation to bound the error. With integers, + is associative, so integer scan is deterministic. This is a frequent real-world bug — "my GPU sum doesn't match the CPU" — and the answer is non-associativity, not a code error.
Q13: "Hillis–Steele or Blelloch — which is better?" (Medium)¶
Answer: "It depends" is the correct answer; anyone who says one is universally better is wrong.
- Blelloch is work-efficient (
Θ(n)work) → better for large arrays and boundedP, where wasted work dominates. It needs two passes and more index arithmetic, and reads/writes the array twice. - Hillis–Steele does
Θ(n log n)work but is simpler, branch-light, in-place, and needs only one pass with no down-sweep. Within a single GPU warp/block (smalln, e.g. ≤ 1024, abundant idle lanes), the extra work is essentially free and the simpler synchronization wins.
Production GPU scans (e.g. CUB, Thrust) are hybrid: Hillis–Steele or a shuffle-based scan within a warp/block where steps are cheap, Blelloch-style work-efficient reduction across blocks where work matters. Span is Θ(log n) for both; the trade is work vs simplicity. See ./middle.md.
Q14: "Scan needs the operator to be commutative, right?" (Medium)¶
Answer: No — only associative. Scan preserves input order; it never swaps operands, only re-parenthesizes. So commutativity is not required, which is exactly why scan works for non-commutative monoids: matrix multiplication, affine-map composition (Q11), string concatenation, min/max-plus. Confusing the two requirements would wrongly rule out these uses. (A parallel reduction can additionally exploit commutativity to reorder freely, but even reduction only strictly needs associativity.)
Rapid-Fire Q&A¶
| # | Question | Answer |
|---|---|---|
| 1 | Required property of ⊕? | Associative (commutativity not needed) |
| 2 | Inclusive scan out[i]? | a₀ ⊕ … ⊕ a_i (self included) |
| 3 | Exclusive scan out[i]? | a₀ ⊕ … ⊕ a_{i-1}; out[0] = identity |
| 4 | Scan work & span (best)? | Θ(n) work, Θ(log n) span (Blelloch) |
| 5 | Hillis–Steele work/span? | Θ(n log n) work, Θ(log n) span |
| 6 | Blelloch work/span? | Θ(n) work, Θ(log n) span |
| 7 | Is Hillis–Steele work-efficient? | No (n log n vs serial n) |
| 8 | Blelloch passes? | Up-sweep (reduce) + down-sweep |
| 9 | Down-sweep starts by setting root to? | Identity |
| 10 | Scan span lower bound? | Ω(log n) (bounded fan-in over n inputs) |
| 11 | Stream compaction recipe? | flags → exclusive scan → scatter |
| 12 | Radix-sort use of scan? | Per-digit split / bucket offsets |
| 13 | Segmented scan use? | SpMV, per-group scans, quicksort |
| 14 | Linear recurrence as scan via? | Affine-composition monoid (a,b) |
| 15 | Parallel float scan deterministic? | No — + non-associative in FP |
| 16 | Parallelism of scan? | Θ(n / log n) |
| 17 | Reduction vs scan? | One total vs all n prefixes |
Common Mistakes¶
- Calling scan "inherently serial." Associativity → balanced tree →
Θ(log n)span. It is one of the most parallelizable primitives. - Requiring commutativity. Scan needs only associativity; it preserves order and works for matrix/affine monoids.
- Quoting Hillis–Steele as work-efficient. It does
Θ(n log n)work — alog nfactor worse than serial; Blelloch is the work-efficient one. - Forgetting to seed the down-sweep with the identity. Blelloch sets the root to identity before descending; skipping this corrupts every prefix.
- Mixing up inclusive and exclusive. Indexing/compaction wants exclusive (offset of where each element starts); off-by-one bugs come from using inclusive here.
- Expecting bit-identical parallel float scans. Re-association + non-associative FP rounding → nondeterminism. Integers are fine.
- Treating low span as enough. Bounded
Pmeans work-efficiency dominates; prefer Blelloch for largen.
Tips & Summary¶
- Open with the one-liner: "Scan looks serial but
⊕is associative, so re-parenthesize into a tree —Θ(n)work,Θ(log n)span." That single sentence reframes the whole problem and signals you know the model. - Contrast the two algorithms by trade-off, not winner: Hillis–Steele = simpler, one pass,
Θ(n log n)work; Blelloch = up/down-sweep, work-efficientΘ(n). BothΘ(log n)span. Real GPU scans are hybrid. - Lead with work-efficiency when asked "which is better": bounded
P+ work lawT_P ≥ T₁/Pmeans wasted work hurts — prefer Blelloch for large arrays, Hillis–Steele within a warp. - Have the application triad ready: stream compaction (flags → exclusive scan → scatter), radix-sort split (bucket offsets), segmented scan (SpMV). These prove you know why scan is the workhorse primitive of parallel computing.
- State the
Ω(log n)lower bound from fan-in: the last output depends on allninputs through constant-fan-in gates → depthΩ(log n)→ Blelloch is optimal. - Name the FP gotcha: parallel float scan is nondeterministic because
+isn't associative; integer scan is deterministic. This separates senior candidates from juniors.
Related: Models, PRAM & Work–Span — Interview · Parallel Sorting & Merging — Interview · Parallel Prefix Sum (Scan) — Middle
In this topic
- interview
- tasks