Parallel Graph BFS — Interview Questions¶
Table of Contents¶
- Conceptual Questions
- The Frontier: Races, Dedup & Compaction
- Direction Optimization
- Beyond BFS: Connected Components & Linear Algebra
- Hardness: What Doesn't Parallelize
- Gotcha / Trick Questions
- Rapid-Fire Q&A
- Common Mistakes
- Tips & Summary
Conceptual Questions¶
Q1: "How do you parallelize BFS?" (Medium)¶
Answer: Run it level-synchronously. Serial BFS processes one vertex at a time off a FIFO queue; the parallel version processes an entire BFS level (the frontier) at once. All vertices at distance d from the source are independent of each other, so you expand them in parallel:
frontier = {source}; dist[source] = 0; level = 0
while frontier not empty:
next = ∅ # built in parallel
for each u in frontier in parallel:
for each neighbor v of u:
if claim(v): # atomic CAS on dist[v]
dist[v] = level + 1
add v to next # via scan/compaction
frontier = next; level += 1
- Work:
Θ(V + E)— every vertex and every edge is touched once across all levels. Work-optimal, same as serial BFS. - Span:
Θ(D · log n), whereDis the diameter (number of BFS levels). There areDsequential level-barriers, and each level's frontier expansion + compaction costsΘ(log n)span.
The two hard parts are (1) deduplicating neighbor discovery across racing threads (Q5) and (2) building the next frontier in parallel via a scan (Q6).
Q2: Why does the span depend on the diameter D? (Medium)¶
Answer: Because BFS is inherently level-by-level — you cannot discover level d+1 until level d is fully expanded (a vertex's distance is 1 + the distance of its discoverer). That imposes a sequential chain of D global barriers, one per level. Within a level, expansion is fully parallel (Θ(log n) span for the parallel-for + frontier compaction), but the D barriers lie on the critical path and cannot be overlapped.
So span = D × Θ(log n) = Θ(D · log n). The number of vertices doesn't bound the span — the number of levels does. This is why graph shape matters more than graph size for BFS parallelism (Q14): a low-diameter graph (D = O(log n), like a social network) gives span Θ(log² n) and enormous parallelism, while a long path (D = n−1) gives span Θ(n log n) — essentially serial.
The Frontier: Races, Dedup & Compaction¶
Q5: Two frontier vertices discover the same neighbor — how do you avoid processing it twice? (Hard)¶
Answer: With an atomic claim on the visited/distance array. When threads expanding two different frontier vertices both reach a common neighbor v, exactly one must "win" the right to set dist[v] and push v into the next frontier; the others must drop it. The standard primitive is an atomic compare-and-swap (CAS):
CAS atomically checks dist[v] == UNVISITED and, if so, sets it — in one indivisible step. Only the winner gets true and adds v exactly once; losers get false and skip it. This guarantees each vertex enters the next frontier at most once and dist[v] is set to the correct BFS distance.
Without the atomic, two threads both read UNVISITED, both write, and both push v → duplicates (re-expanded, wasted work) and possibly a torn distance. A plain non-atomic check-then-set is the classic read–modify–write race. Note correctness of the distance is fine even with a benign race (all discoverers are at the same level, so they'd all write level+1), but you still need the atomic to dedup the frontier.
Q6: How is the next frontier built in parallel without a lock? (Hard)¶
Answer: With a parallel prefix-sum (scan) + compaction, not a shared lock-protected queue. A global mutex-guarded queue serializes every insert and kills scalability. Instead:
- Each thread (or each frontier vertex's expansion) collects the neighbors it won via CAS into a small thread-local buffer, and counts them:
c_i. - Exclusive scan over the per-thread counts gives each thread a unique, contiguous offset
off_i = c_0 + … + c_{i−1}into the output array. - Each thread scatters its won vertices to
next[off_i …]— no two threads touch the same slot, so writes are race-free with no locking.
This is the same scan-based stream compaction used for filtering. It costs Θ(log n) span (the scan) and Θ(frontier size) work, keeping the whole level expansion work-efficient. See Parallel Reduce and Map — Interview and Models, PRAM & Work–Span — Interview for the scan/compaction primitive.
Direction Optimization¶
Q7: What is top-down vs bottom-up BFS, and when does the hybrid switch? (Hard)¶
Answer: They are two ways to advance one level:
- Top-down (push) — the classic. For each vertex
uin the current frontier, scanu's out-edges and try to claim each unvisited neighbor. Cost ∝ edges leaving the frontier. - Bottom-up (pull) — for each unvisited vertex
v, scanv's in-edges looking for any neighbor already in the frontier; if found,vjoins the next level and stops early (no need to check the rest of its edges). Cost ∝ edges of unvisited vertices, but with early termination.
Direction optimization (Beamer, Asanović, Patterson, 2012) picks per level: when the frontier is huge — the "middle" levels of a low-diameter, scale-free graph, where most vertices are in the frontier at once — top-down redundantly checks countless edges to already-visited vertices, while bottom-up finds a frontier parent almost immediately and bails. The hybrid switches top-down → bottom-up when the frontier (or its edge count m_f) exceeds a threshold (heuristic: m_f > m / α), and switches back when the frontier shrinks again. This is the standard trick that wins Graph500; on a Kronecker/social graph it can cut edge inspections by an order of magnitude.
Beyond BFS: Connected Components & Linear Algebra¶
Q8: How do you compute connected components in parallel? (Hard)¶
Answer: Two classic approaches:
-
Label propagation — initialize each vertex's label to its own id, then iteratively set each vertex's label to the minimum label among itself and its neighbors, in parallel, until no label changes (a fixpoint). Simple and fully data-parallel, but it can take up to
Θ(D)rounds to converge (a label crawls one hop per round), so span is diameter-bound like BFS — bad for long chains. -
Shiloach–Vishkin (1982) — a
Θ(log n)-span PRAM algorithm. It maintains a forest of "stars" (rooted trees) and alternates two operations: hooking (point one tree's root at a neighboring tree's root, breaking ties by id to avoid cycles) and pointer jumping / shortcutting (parent[v] ← parent[parent[v]], which halves tree height each round). Pointer jumping is the key: it collapses any tree to depth 1 inΘ(log n)rounds regardless of diameter, so the whole algorithm finishes inΘ(log n)rounds → spanΘ(log n), independent ofD. That diameter-independence is exactly why it beats label propagation and plain BFS-per-component for high-diameter graphs.
Q9: How is BFS a sparse matrix–vector product over a semiring? (Hard)¶
Answer: Represent the frontier as a boolean vector f (f[v] = 1 iff v is in the current frontier) and the graph as its adjacency matrix A. One level of BFS expansion is the sparse matrix–vector product (SpMV):
i.e. f_next[v] = OR over u of (A[u][v] AND f[u]) — vertex v is in the next frontier iff any frontier vertex u has an edge to it. The AND is "is there an edge and is u active," the OR reduces over all such u. Mask out already-visited vertices each step to get the dedup'd frontier.
This is the GraphBLAS view: BFS, shortest paths ((min, +) tropical semiring), and PageRank ((+, ×)) are all the same SpMV skeleton, you just swap the (⊕, ⊗) semiring. The payoff: decades of optimized parallel sparse-linear-algebra kernels (load balancing, cache blocking, distribution) apply directly to graph traversal. See the semiring discussion in Parallel Reduce and Map — Interview.
Hardness: What Doesn't Parallelize¶
Q10: Which graph problems don't parallelize well, and why? (Hard)¶
Answer: Problems that are P-complete are the canonical "inherently sequential" ones: they are in P, but are not believed to be in NC (no polylog-span, poly-work algorithm) unless NC = P. Being P-complete is, informally, evidence a problem resists parallelization. Two graph examples:
- Lexicographically-first DFS — computing the DFS tree/ordering that always takes the lowest-numbered unvisited neighbor. The forced tie-breaking makes each step depend on the entire prior traversal — a long dependency chain. (Plain reachability/BFS is parallel; it's the lex-first DFS order that's hard.)
- Maximum flow (and lex-first max flow / the classic max-flow as a P-complete problem). The augmenting-path structure is a deep sequential dependency.
Contrast with BFS reachability, connectivity, and MST, which are in NC. The interview point: not every problem with a fast serial algorithm has a fast parallel one — P ⊇ NC, and P-complete problems are the suspected boundary. So the right answer to "can we parallelize everything?" is no, modulo NC ≠ P.
Gotcha / Trick Questions¶
Q11: "Is parallel BFS work-efficient?" (Medium)¶
Answer: Yes — Θ(V + E) work, matching serial BFS. Each vertex is claimed once (CAS dedup) and each edge is inspected once across all levels, so the total work is linear in the graph size; the scan-based frontier compaction adds only Θ(V) more. The parallelism comes "for free" on the span axis (Θ(D · log n)), not by doing extra work. (Caveat: direction-optimizing bottom-up can re-scan in-edges of unvisited vertices, and top-down on a huge frontier inspects edges to already-visited vertices — so a naive single-direction BFS can waste edge inspections; the hybrid in Q7 is what keeps total edge work near Θ(E).)
Q12: "More cores always means faster BFS — right?" (Hard)¶
Answer: No. Three hard limits:
- Diameter caps the span. Span is
Θ(D · log n)withDsequential level barriers. Adding cores cannot remove a barrier — on a high-diameter graph (a path, a road network) BFS is nearly serial no matter how many cores you throw at it (Amdahl on the level structure). - Memory-latency bound. BFS is the textbook irregular, pointer-chasing, low-arithmetic-intensity workload: random access into the adjacency structure and the visited array, almost no compute per byte. It's limited by memory latency/bandwidth and atomic-contention on
dist[], not by core count. - Load imbalance from skewed degrees. Scale-free graphs have huge-degree hubs; if you partition the frontier by vertex, one thread expanding a million-edge hub stalls everyone at the level barrier. You must balance by edges, not vertices (e.g. CSR + edge-based partitioning) — otherwise extra cores sit idle.
So speedup is bounded by min(P, work/span) and by memory/contention/imbalance in practice. On a diameter-D graph, work/span = Θ((V+E)/(D log n)) — that, not P, is often the ceiling.
Q13: "Doesn't the visited-array race make BFS distances wrong?" (Medium)¶
Answer: The distance itself is benign-racy but correct, if you're careful: every vertex that discovers v in a given round is at the same level d, so they would all write dist[v] = d+1 — concurrent writes of the same value don't corrupt it. What you do need the atomic for is deduplication: ensuring v is added to the next frontier exactly once (Q5). Skip the CAS and you re-expand v from every discoverer — an explosion of redundant work and possible duplicate frontier entries, not a wrong distance. (This benign-race property breaks for weighted shortest paths, where discoverers can offer different distances — there you genuinely need an atomic-min, which is why Δ-stepping/Dijkstra are harder than BFS.)
Rapid-Fire Q&A¶
| # | Question | Answer |
|---|---|---|
| 1 | Parallel BFS strategy? | Level-synchronous frontier expansion |
| 2 | BFS work? | Θ(V + E) — work-optimal |
| 3 | BFS span? | Θ(D · log n), D = diameter |
| 4 | Why span ∝ diameter? | D sequential level barriers, un-overlappable |
| 5 | Dedup racing neighbor discovery? | Atomic CAS claim on dist[]/visited |
| 6 | Build next frontier in parallel? | Scan + compaction (no lock) |
| 7 | Top-down BFS cost ∝ ? | Edges leaving the frontier (push) |
| 8 | Bottom-up BFS cost ∝ ? | Unvisited vertices' in-edges, early-exit (pull) |
| 9 | When switch to bottom-up? | Huge frontier (m_f > m/α) — Beamer / Graph500 |
| 10 | Label propagation rounds? | Up to Θ(D) — diameter-bound |
| 11 | Shiloach–Vishkin span? | Θ(log n) via hooking + pointer jumping |
| 12 | Why SV beats BFS-CC? | Diameter-independent (log n, not D) |
| 13 | BFS as linear algebra? | SpMV f_next = Aᵀf over (OR, AND) semiring |
| 14 | Shortest-path semiring? | (min, +) tropical |
| 15 | P-complete = ? | In P, not believed in NC — resists parallelizing |
| 16 | P-complete graph examples? | Lex-first DFS, max flow |
| 17 | Worst graph shape for BFS span? | High-diameter (path, road network) |
| 18 | Best graph shape? | Low-diameter scale-free (D = O(log n)) |
| 19 | Partition frontier by ? | Edges, not vertices (degree skew) |
| 20 | BFS bottleneck on real HW? | Memory latency + atomic contention, not compute |
Common Mistakes¶
- Quoting BFS span as
Θ(log n). It'sΘ(D · log n)— the diameter dominates; only within a level is itlog n. - Skipping the atomic claim. A plain check-then-set is a read–modify–write race → duplicate frontier entries and re-expansion. Use CAS.
- Building the frontier with a global locked queue. Serializes inserts. Use a scan + compaction for race-free, lock-free appends.
- Always going top-down. On the giant middle frontier of a scale-free graph, top-down wastes edge checks; switch to bottom-up (direction optimization).
- Partitioning the frontier by vertex. A high-degree hub stalls one thread at the level barrier — balance by edges.
- Claiming "more cores → faster" unconditionally. Diameter caps span; BFS is memory-latency-bound and imbalance-prone.
- Thinking everything in P parallelizes. P-complete problems (lex-DFS, max flow) resist NC unless
NC = P. - Confusing the benign distance race with a correctness bug. Distance is correct (same-level writers); the atomic is for dedup, not the value — but weighted SSSP genuinely needs atomic-min.
Tips & Summary¶
- Open with the two numbers: "Level-synchronous frontier BFS —
Θ(V+E)work,Θ(D · log n)span." Then immediately explain theD: it's the count of sequential level barriers, which is why graph shape drives parallelism. - Name the two mechanisms: atomic CAS to claim/dedup neighbors, and scan-based compaction to build the next frontier lock-free. These are the implementation crux of every question.
- Have direction optimization ready: top-down (push, cost ∝ frontier out-edges) vs bottom-up (pull, cost ∝ unvisited in-edges with early-exit); switch on a huge frontier. Cite Beamer / Graph500 — it signals you know the real-world kernel.
- Know one fast CC algorithm: Shiloach–Vishkin,
Θ(log n)span via pointer jumping, diameter-independent — the standard answer to "components in parallel." - Show the linear-algebra lens: BFS is SpMV over (OR, AND); swap the semiring for shortest paths
(min,+)or PageRank(+,×). That's the GraphBLAS unification. - Land the hardness point: not everything in P is in NC — P-complete problems (lex-first DFS, max flow) are the suspected wall — and high-diameter graphs make even easy-to-parallelize BFS span-bound. See ./middle.md and Models, PRAM & Work–Span — Interview.
Related: Models, PRAM & Work–Span — Interview · Parallel Reduce and Map — Interview · Parallel Graph BFS — Middle
In this topic
- interview
- tasks