Dataflow & Stream Programming — Interview Q&A¶

Roadmap: Programming Paradigms → Dataflow & Stream

Dataflow models a program as a directed graph: nodes are operations, edges carry data, and a node fires when its inputs are available — data availability drives execution, not statement order. The interview tests whether you can tell that apart from control flow, reason about push/pull and backpressure, and — at the senior+ end — talk fluently about determinism (Kahn networks), event-time, windowing, and exactly-once.

A bank of 40+ questions spanning definitions, the execution model, concurrency mechanics, distributed stream processing, and code/graph-reading. Each answer models the reasoning a strong candidate gives, including the trade-offs underneath. Use the <details> toggles to self-quiz: read the question, answer out loud, then expand.

Examples are in shell, Python, and Go, with Flink/Beam-style pseudocode where streaming semantics need it.

Table of Contents¶

Fundamentals / Junior
Execution Model / Middle
Concurrency & Pipelines / Middle–Senior
Trade-offs & Determinism / Senior
Distributed Stream Processing / Senior–Staff
Code & Graph Reading
Curveballs
Rapid-Fire / One-Liners
How to Talk About Dataflow in Interviews
Summary
Related Topics

Fundamentals / Junior¶

Definitions, the core distinction from control flow, and the everyday examples.

Q1. What is dataflow programming, in one sentence?

Answer

It's a paradigm where a program is a **directed graph**: nodes are operations, edges carry data between them, and a node runs **when data is available on its inputs** — so the *flow of data*, not a sequence of statements, drives execution. The classic everyday example is a Unix pipe (`cat | grep | sort`) and a spreadsheet (a cell recomputes when its inputs change).

Q2. What's the core difference between dataflow and control flow?

Answer

The **trigger** for work. In **control flow** (imperative), a line runs because the *previous line finished* — you write the order, and a program counter marches down it. In **dataflow**, a node runs because *data arrived on its input* — there's no central program counter; data availability is the schedule. Put differently: control flow asks "what happens next?"; dataflow asks "where does this data go next?" One consequence: independent dataflow nodes can run concurrently for free, because the only thing constraining order is the data dependency expressed by the edges.

Q3. Why is a Unix pipe a dataflow program?

Answer

`cat f | grep X | sort` is a graph of three nodes connected by two edges (the pipes). Each stage reads a stream of lines, transforms it, writes a stream of lines — knowing nothing about the others; the only contract is the data on the pipe. The shell starts all three concurrently, and `grep` processes line 1 while `cat` reads line 2 — data flows through like water. You program by *wiring stages*, not by writing control flow. That's dataflow: small single-purpose nodes composed by the flow of data between them.

Q4. Why is a spreadsheet the purest mainstream example of dataflow?

Answer

Cells are nodes; formulas declare edges (`C1 = A1 + B1` makes `A1, B1 → C1`). The engine maintains a dependency graph and, when you change `A1`, automatically recomputes exactly the cells downstream of `A1` — you didn't call a function or write a loop. That's **data-driven execution**: a node recomputes because its inputs changed. It's also declarative (you state relationships, the engine schedules) — which is why a spreadsheet feels effortless: it's a tuned dataflow runtime.

Q5. What does "a node fires when its inputs are available" actually mean?

Answer

A node has a **firing rule** — typically "there's a data item (token) on each required input edge." When that condition holds, the node consumes those inputs, computes, and emits results on its output edges, possibly waking downstream nodes. An adder with inputs `a` and `b` fires only when *both* have a token. There's no external scheduler telling it "go now" — the presence of data *is* the go signal. Different nodes have different firing rules (a `merge` fires when *any* input has a token), and the firing rule is essentially the node's personality.

Q6. How do generators relate to dataflow?

Answer

A generator is a node in a *lazy, pull-based* dataflow pipeline. It produces values one at a time on demand (`yield`), pausing between them. Composing generators (`parse(filter(read(path)))`) wires a graph, but no work happens until you *pull* from the end — then one item flows all the way through the chain. This gives you flat memory (one item in flight, not the whole dataset), short-circuiting (`take(5)` stops upstream early), and composability. Generators are how you build Unix-style pipelines *inside* a program.

Q7. Give three real systems built on the dataflow paradigm.

Answer

- **Unix pipes** — the canonical pipeline. - **Build systems** (`make`, Bazel) and **CI/CD** — targets/jobs are nodes, dependencies are edges, rebuild a node when its inputs change. - **Stream processors** (Flink, Kafka Streams, Spark, Beam) — operator graphs over billions of events. - Also: **spreadsheets**, **audio/video effect chains** (DSP), **TensorFlow/PyTorch computation graphs**, and React's "UI = f(state)". The thread: each is a graph of operations over a flow of data.

Execution Model / Middle¶

Push vs pull, laziness, backpressure, and bounded buffers.

Q8. What's the difference between push and pull pipelines?

Answer

It's about *who initiates the data transfer*. In **pull** (demand-driven, lazy), the *consumer* drives: the end asks for a value, and the request propagates upstream — generators, iterators, Java `Stream`, DB cursors. In **push** (data-driven), the *producer* drives: a source emits and hands the value downstream — Rx Observables, Kafka, event streams, `onNext(value)`. The crucial consequence: **pull pipelines get backpressure for free** (a slow consumer simply pulls slowly, so the producer idles), while **push pipelines must engineer backpressure** or a fast producer overwhelms a slow consumer.

Q9. What is backpressure, and why does it matter?

Answer

Backpressure is the mechanism by which a slow consumer signals upstream "slow down, I'm full," throttling the producer to the consumer's rate. It matters because without it, a producer faster than its consumer piles data into a buffer that grows until the process OOM-crashes — running *too well* starves the system. With backpressure, throughput settles at the rate of the **slowest stage**, which is correct: you can't durably go faster than your slowest component; pretending otherwise just buffers the difference until you run out of memory.

Q10. How are bounded buffers used to implement backpressure?

Answer

Each edge is a fixed-capacity queue. When the producer tries to add to a **full** buffer, the operation *blocks* until the consumer removes an item — that blocking *is* the backpressure signal, propagated automatically upstream. When the consumer tries to take from an **empty** buffer, it blocks until the producer adds one (preventing busy-waiting and starvation). Go channels are exactly this: `make(chan T, N)` is a bounded buffer; send-on-full and receive-on-empty both block. An *unbounded* buffer removes the backpressure and just delays the OOM under load.

Q11. What does the buffer size trade off?

Answer

**Latency vs throughput**, plus burst absorption. A **small** buffer (even size 0 / unbuffered) means tight coupling, low memory, low latency, but stages stall on each other often and can't absorb bursts. A **large** buffer absorbs spikes and improves average throughput, but adds latency (an item can sit in a long queue) and memory, and it *hides* a chronically slow consumer until the buffer finally fills and the system falls over abruptly. Tune it from the burst size and latency budget, and monitor queue depth as a leading indicator of a slow consumer.

Q12. Lazy vs eager streams — what's the difference and when does it matter?

Answer

**Eager** stages fully materialize each step (read whole file → filter whole list → parse whole list) before the next starts. **Lazy** stages do work only on demand, item by item. It matters enormously for large/infinite data: lazy gives flat memory (one item in flight), short-circuiting (`take(5)` reads ~5 items end-to-end instead of processing everything), and fusion potential. The cost of lazy: side effects become deferred and order-sensitive (a lazy `map` with a `print` won't run until consumed), and stack traces are harder. Most real pipelines are lazy through transforms with a deliberate eager "collect here" boundary.

Q13. What's the relationship between dataflow and reactive programming?

Answer

They're the same core idea — data-dependency-driven computation — viewed differently. **Reactive** ([05](../05-reactive-programming/)) emphasizes *values that change over time and the propagation of those changes* (UIs, event handling): `total = price.map(p => p*qty)` recomputes `total` when `price` emits — exactly a spreadsheet generalized to event streams. **Dataflow** (this topic) emphasizes *the graph of operators and the data moving through it*, often for throughput/transformation (ETL, media, big data). RxJava, Flink, and a spreadsheet all feel related because they are. Backpressure is the shared hard problem once data *pushes*.

Concurrency & Pipelines / Middle–Senior¶

Go channels, fan-out/fan-in, and the parallelism story.

Q14. Build a two-stage pipeline with Go channels and explain the dataflow properties.

Answer

func gen(nums ...int) <-chan int {
    out := make(chan int)
    go func() { defer close(out); for _, n := range nums { out <- n } }()
    return out
}
func square(in <-chan int) <-chan int {
    out := make(chan int)
    go func() { defer close(out); for n := range in { out <- n * n } }()
    return out
}
// wire: gen ──► square ──► main
for r := range square(gen(1, 2, 3, 4)) { fmt.Println(r) }  // 1 4 9 16

Channels are the **edges**; goroutines are the concurrent **nodes** (gen produces `3` while square handles `2` while main prints `1`). `close(out)` is the **end-of-stream token** — `for range` ends when the channel closes. **Backpressure is automatic**: a full channel blocks the sender. This is CSP wearing a dataflow hat.

Q15. What is fan-out / fan-in, and what problem does it solve?

Answer

A straight pipeline is capped by its slowest stage. **Fan-out** runs N identical copies of that stage reading the *same* input (split the work); **fan-in** merges their outputs back into one stream (collect results). It parallelizes one stage *by wiring, not rewriting* — the node's logic is unchanged; you just run three copies. In Go, fan-out is starting several goroutines on one input channel (the runtime load-balances); fan-in is a `merge` that forwards from all of them and closes the output only after all inputs drain.

Q16. Does fan-out preserve ordering?

Answer

No. Parallel workers finish in nondeterministic order, and fan-in interleaves whichever result is ready first — output order depends on timing. If you need the original order, you must tag items with sequence numbers and reorder downstream (a reorder buffer), or partition by key and preserve order *within* each key while parallelizing *across* keys. Ordering is **not** free across a parallel fan-out — that's the recurring tension between parallelism and determinism.

Q17. Why is parallelism "almost free" in dataflow — and what's the asterisk?

Answer

Free because two nodes with no data dependency *can* run concurrently, and the runtime knows this *from the graph* — you get pipeline, task, and data parallelism with no locks, because nodes share no mutable state (they communicate only through edges). The asterisks: (1) the **slowest stage caps** total throughput; (2) **edges cost** coordination and serialization, so too many tiny nodes hurt; (3) **stateful nodes** can't be trivially parallelized — you must *partition state by key*; (4) **ordering is sacrificed** under parallel merges. So dataflow makes parallelism free to *express* and kills the lock-race bug class, but doesn't repeal the slowest-stage law.

Q18. What's the difference between synchronous (SDF) and dynamic dataflow?

Answer

In **Synchronous Dataflow (SDF)**, every node consumes/produces a *fixed, known* number of tokens per firing. Because rates are constant, the runtime can compute a **static schedule**, size every buffer exactly, and prove no deadlock and bounded memory *before running* — ideal for DSP and hard real-time (the price is no data-dependent routing). In **dynamic** dataflow, token rates depend on the data (a filter emits 0 or 1; a switch routes by value), so the schedule is decided at runtime — far more expressive (Turing-complete) but you lose the static guarantees. Most general stream processors (Flink, Spark) are dynamic.

Trade-offs & Determinism / Senior¶

The reasoning tax, Kahn networks, deadlock, and when to choose dataflow.

Q19. What's the central trade-off of the dataflow paradigm?

Answer

**Local clarity for global opacity.** Because execution order is implicit in the graph (not explicit in text), each node is a small, testable, near-pure function you can reason about in isolation — a huge testability and composability win, and what lets the runtime parallelize. The flip side: *no single place tells you how the whole thing behaves*. To answer "why did this record end up in the dead-letter sink?" you trace a path through a concurrent graph, not read a function top to bottom. Control flow that imperative code writes down is, in dataflow, emergent from topology + data + timing — harder to reason about, debug, and safely modify.

Q20. What is a Kahn Process Network, and why does it guarantee determinism?

Answer

A KPN is processes connected by **unbounded FIFO channels**, where each process communicates *only* through channels, **blocks** on an empty read, and **cannot test** whether a channel is empty (no peek, no "read whichever's ready"). Kahn proved that under these rules, the network's output depends *only* on inputs and topology — **never on timing or scheduling**. Same inputs ⇒ same outputs on one core or a thousand, any interleaving. The crux is the no-peek blocking read: a process can't decide based on *which* input arrived first, so timing can't leak into the result. It's determinism by construction.

Q21. What breaks KPN determinism, and when do you want that?

Answer

A **nondeterministic merge** — a node that fires on *whichever input arrives first* (reads any ready channel) — violates the no-peek rule and makes that subgraph timing-dependent: the merged order now depends on which token won the race. You *need* it for real systems (combining live feeds, "first response wins," load-balanced fan-in), so it's a deliberate escape hatch. The senior discipline: keep the deterministic KPN core as large as possible and **isolate nondeterminism to explicit, named merge points**. An accidentally nondeterministic pipeline is untestable and unreproducible.

Q22. How can a dataflow graph deadlock?

Answer

Two ways. (1) **Cycle deadlock**: a feedback edge where A needs B's output and B needs A's, with no initial token — neither can fire first. Fix with an **initial token** (a delay element seeded with a value) on the cycle, like a base case for a recurrence. (2) **Buffer deadlock** (sneakier): even an acyclic graph deadlocks if **bounded buffers** form a circular *wait* — A blocks writing to a full channel while B blocks writing to another full channel A must drain. This depends on buffer sizes and data rates, so it often appears only under load and not in small-input tests. Prefer DAGs; seed real cycles; test with realistic rates.

Q23. How do you debug a dataflow graph?

Answer

Not like a call stack — a stack trace tells you *where* a node failed, not *how the data got there*. The effective techniques: (1) **tap edges** — log/sample records crossing key edges, carrying a **correlation ID** on each record so you can trace "what happened to record X" across all stages (the dataflow analogue of a stack trace = data lineage). (2) **Per-node/edge counters** — records-in/out, errors, queue depth; a stage where in ≫ out is dropping, a rising queue depth localizes the bottleneck. (3) To reproduce a flaky bug, **serialize the suspect region** (force single-threaded/fixed merge order) — if it vanishes, it's ordering/timing-dependent. Also: under backpressure, "slow" looks like "stuck," so check queue depths before concluding a hang.

Q24. When does dataflow win, and when is it the wrong tool?

Answer

**Wins** when the problem genuinely *is* a graph of transformations over flowing data: ETL/data pipelines, signal/media processing, build systems (incremental rebuild = recompute affected downstream), ML pipelines, and stream analytics. **Wrong** when the logic is inherently sequential and control-heavy (many interdependent branches, early returns, loop-backs — a state machine is clearer than graph routing + feedback edges), when state is highly shared/global (the node-as-local-function model fights you), or when the transformation is trivial (ceremony exceeds payoff). One line: choose dataflow for *"data flowing through transformations,"* and imperative/state-machine code for *"a sequence of decisions over shared state."*

Distributed Stream Processing / Senior–Staff¶

Event-time, windowing, watermarks, state, and exactly-once.

Q25. Event-time vs processing-time — what's the difference and why does it matter?

Answer

**Event-time** is when the event *actually happened* (a timestamp from the source); **processing-time** is when it *arrives at the operator* (wall clock on the machine). They diverge because events arrive out-of-order and delayed across a network (offline mobile clients, GC pauses, lagging partitions). It's decisive for windowed aggregates: **processing-time** windows count whatever *arrived* in each wall-clock minute — simple, low-latency, but *wrong and non-reproducible* (a network blip shifts events into wrong windows; a faster cluster gives different results). **Event-time** windows count by *when events happened* — reproducible and correct regardless of arrival. Use event-time for correctness-critical work (billing, fraud, metrics).

Q26. What is windowing, and what are the main window types?

Answer

Windowing slices an *unbounded* stream into finite chunks so you can aggregate. **Fixed/tumbling**: adjacent, non-overlapping, equal size — each event in one window ("count per minute"). **Sliding**: fixed size overlapping by a slide interval — each event in multiple windows ("5-min average, updated every minute"), more state. **Session**: boundaries defined by *gaps* in activity — the window grows with activity and closes after N idle ("a user's browsing session"); its size is **data-dependent**, i.e., dynamic dataflow made concrete. Windows are assigned by *event-time*, and windowing + trigger together answer "what set of events do I aggregate, and when do I emit?"

Q27. What is a watermark?

Answer

A watermark is the engine's assertion *"I believe I've now seen all events with event-time ≤ T"* — a moving estimate of event-time progress, derived from the timestamps flowing through (e.g., max-seen-timestamp minus an allowed-lateness bound). When the watermark passes the *end* of a window, the engine concludes that window is complete and **fires** it (emits the result). It's a deliberate **latency vs completeness** knob: a conservative watermark waits longer (more correct, higher latency); an aggressive one emits sooner (lower latency, but more events arrive after their window closed). It's how event-time windows know when to stop waiting for stragglers.

Q28. How is late data handled?

Answer

Late data = events arriving after the watermark already passed their window. Policy options: **drop** it (simplest, with a dropped-late-events metric to watch); keep window state for an **allowed-lateness** period and **re-fire** with an updated result when stragglers arrive; or route late events to a **side output** for separate handling. **Triggers** generalize emission timing: fire at the watermark (standard), fire *early* (speculative low-latency results before the window closes), and/or fire *late* (refinements as stragglers arrive). Triggers plus accumulation mode (replace vs accumulate) tune the latency/correctness/completeness trade-off per pipeline.

Q29. How do stateful operators parallelize across a cluster without locks?

Answer

By **keyed state**: the stream is partitioned by key (e.g., `userId`), and all events for a given key route to the *same* task instance, which owns that key's state exclusively. So `Count.perKey()` over a billion users runs on 500 machines, each owning a disjoint key-range — no shared state, no locks; **the partitioning is the parallelization strategy**. State survives failure via periodic **checkpoints** to durable storage; on a crash the job restores state and replays input from the matching source offset. The state backend (in-memory vs RocksDB-on-disk) is itself a choice — RocksDB lets state exceed RAM at a latency cost.

Q30. What does "exactly-once" really mean, and how is it achieved?

Answer

It rarely means each event is *physically* processed once — machines fail and replay *will* re-process events. It means **exactly-once *effect*** on state and output: duplicates are reprocessed but their effects are reconciled so the end result is correct. Two mechanisms: (1) **consistent checkpoints** — Flink injects **checkpoint barriers** that flow through the operator graph (Chandy–Lamport distributed snapshot); when a barrier passes an operator it snapshots state, giving a globally consistent snapshot at a precise stream position to restore-and-replay from. (2) **Transactional/idempotent sinks** — output is committed only when the checkpoint commits, or writes are idempotent (upserts). It's exactly-once **end-to-end only if** the source is replayable AND checkpoints are consistent AND the sink is transactional/idempotent — a chain, not a checkbox.

Q31. What's a checkpoint barrier, and why is it elegant?

Answer

A checkpoint barrier is a special marker the engine injects into the data stream that flows through the operator graph *like a regular token*. When a barrier reaches an operator, that operator snapshots its state; when the barrier reaches all sinks, the entire graph has a globally consistent snapshot at one logical stream position — without stopping the world. It's elegant because it reuses the dataflow paradigm itself: the consistency mechanism is *just another thing riding the edges*. That's the distributed-snapshot algorithm (Chandy–Lamport) applied to a dataflow graph, and it's what makes exactly-once feasible at high throughput.

Q32. What is the Beam/Dataflow model's key contribution?

Answer

It unified **batch and streaming** (a bounded dataset is just a stream that ends, so the same pipeline runs over a file or an infinite topic) and factored every streaming computation into four questions: **What** are you computing (the operator graph), **Where** in event-time (windowing), **When** in processing-time do you emit (triggers/watermarks), and **How** do refinements relate (accumulation mode). It also separated the **pipeline definition** from the **runner** (Flink/Spark/Cloud Dataflow) — write the graph once, choose where it runs. That portability is the dataflow paradigm's dividend: because the program is a graph of pure-ish operators, any capable engine can schedule it.

Q33. In a distributed runtime, what is "the graph as deployment topology"?

Answer

The logical graph you wrote *is* the execution plan: each operator expands into **N parallel tasks** across machines; each edge becomes either a **forward** (in-memory, cheap) or a **shuffle** (repartition by key across the network — the dominant cost; `keyBy`/`GroupByKey` forces it); compatible adjacent operators are **chained/fused** into one task to avoid serialization; and backpressure is monitored per network edge via credit-based flow control. So operational reasoning *is* graph reasoning: throughput is gated by the slowest task, a hot key causes **data skew** (one task overloaded), a shuffle edge concentrates network cost, and recovery time scales with per-task state size.

Code & Graph Reading¶

You're shown a snippet or a graph; explain the behavior.

Q34. Python — what does this print, and how much work runs?

def nums():
    n = 0
    while True:
        yield n; n += 1            # infinite generator

evens = (n for n in nums() if n % 2 == 0)
squares = (n * n for n in evens)
print([next(squares) for _ in range(3)])

Answer

`[0, 4, 16]`. Despite `nums()` being **infinite**, this terminates and uses flat memory because the pipeline is **lazy/pull-based**: each `next(squares)` pulls one value end-to-end — `squares` asks `evens`, which asks `nums()` for just enough numbers to find the next even, squares it, done. Only ~5 numbers are ever generated to produce 3 squares. This is the dataflow superpower: composing infinite streams safely because nothing runs until pulled, and only as much as demanded.

Q35. Shell — what's the dataflow, and why does it work on a 100 GB file?

cat huge.log | grep "5xx" | awk '{print $4}' | sort | uniq -c | sort -rn | head

Answer

Seven nodes, six edges; each stage streams lines through, transforming them (filter 5xx → extract field 4 → sort → count → rank → top). It works on 100 GB because most stages are **streaming** — they process line-by-line without holding the file (`cat`, `grep`, `awk` stream; the OS pipe buffers a little and backpressures `cat` when downstream is slow). The exception is `sort`, which must see all input — it spills to disk for large inputs rather than holding everything in RAM. So it's a dataflow pipeline with backpressure, and only the genuinely-blocking stage (`sort`) buffers, and it does so to disk.

Q36. Go — what's wrong with this pipeline stage?

func square(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in { out <- n * n }   // note: no close(out)
    }()
    return out
}

Answer

It never **closes `out`**. The `for range in` loop exits when `in` is closed, the goroutine returns — but `out` is left open. Any downstream consumer doing `for v := range out` will block **forever** after the last value, waiting for an end-of-stream signal that never comes (a goroutine leak and a hang). The fix is `defer close(out)` at the top of the goroutine. The lesson: in hand-built dataflow, **end-of-stream is a real token you must propagate** — closing the output channel is how "the stream ended" flows downstream so consumers know to stop.

Q37. Read this graph — what does the validate node's two output edges replace?

   raw ──►[ parse ]──►[ validate ]──┬─ valid ─►[ enrich ]──►[ write ]
                                    └─ invalid ─►[ log+drop ]

Answer

The two output edges replace an `if/else` written as **control flow**. Instead of imperative branching, the `validate` node emits each record onto the correct *output edge*, and the data flows where the wires lead — the routing is encoded in the **graph topology**, not in statements. This is Flow-Based Programming: behavior changes by *rewiring* (insert a "dedupe" node on the valid path; add a second writer) rather than editing control flow. The picture *is* the program — and a `validate` that emits 0-or-1 tokens per output is **dynamic** dataflow (data-dependent routing).

Q38. Beam-style pseudocode — explain what each line controls.

events
  .apply(Window.into(FixedWindows.of(1 min)))   // (a)
  .apply(WithKeys.of(e -> e.userId))            // (b)
  .apply(Count.perKey())                        // (c)
  .apply(Trigger.atWatermark().withLateFirings()) // (d)

Answer

(a) **Windowing** — slice the unbounded stream into 1-minute event-time buckets ("where in event-time"). (b) **Keying** — partition by `userId` so each key's state lives on one task; this is what makes the aggregate parallelizable without locks. (c) **Stateful aggregate** — count per key per window; this operator holds keyed state. (d) **Trigger** — emit each window's result when the *watermark* passes the window end (standard), and re-fire with refinements on *late* data. Together they answer Beam's what/where/when: *count per user per minute, by event-time, emitted at the watermark with late corrections.*

Curveballs¶

The questions designed to catch glib answers.

Q39. Is dataflow the same as concurrency or parallelism?

Answer

No — that's the most common conflation. Dataflow is defined by **data-dependency-driven execution**: a node fires when its inputs are available. Concurrency/parallelism is a *benefit dataflow enables* (independent nodes can run simultaneously because the graph makes their independence explicit), not its definition. A single-threaded generator pipeline is fully dataflow with zero parallelism. Conversely, you can write parallel code that isn't dataflow at all (raw threads sharing locked state). Dataflow *makes parallelism easy and lock-free* by structuring communication as edges — but the paradigm is the data-driven model, and the parallelism is a consequence.

Q40. "Unbounded buffers make pipelines safer because they never block." True?

Answer

False — it's backwards. An unbounded buffer removes **backpressure**: a fast producer feeding a slow consumer now piles data into an ever-growing queue, so instead of the producer *politely blocking*, the process grows memory until it **OOM-crashes** under load. The unbounded buffer doesn't prevent the failure; it *delays and worsens* it — turning a graceful slowdown into a sudden crash, usually in production at peak. Bounded buffers with blocking are the *safe* default precisely because the blocking *is* the backpressure that throttles the producer to a sustainable rate.

Q41. Does event-time processing make results slower than processing-time?

Answer

Often yes for *individual window emission latency* — because event-time windows must wait for the **watermark** (an estimate that enough late events have arrived) before firing, whereas processing-time fires on the wall clock immediately. But that's a latency/correctness trade, not a throughput penalty, and you can tune it: early triggers emit speculative results before the watermark for low-latency dashboards, then late firings refine them. The point of event-time isn't speed — it's **correctness and reproducibility** under out-of-order delivery. You accept some completeness latency to get answers that don't depend on infrastructure timing.

Q42. Can an acyclic dataflow graph deadlock?

Answer

Yes — that surprises people. Topology alone (a DAG) can't deadlock (data flows downhill), but add **bounded buffers** and you can get a circular *wait* even without a graph cycle: node A blocks writing to a full channel while node B blocks writing to another full channel that A must drain. Each waits for the other to make room — the dining-philosophers deadlock in pipeline form. It depends on buffer sizes and data rates, so it can pass small-input tests and only manifest under production load at certain throughputs. Acyclic ≠ deadlock-free once buffers are finite.

Q43. "Exactly-once means each event is processed exactly one time." Correct?

Answer

Not literally. Under failures, machines crash and replay, so events *are* physically reprocessed. "Exactly-once" means exactly-once **effect** on state and output — duplicates from replay are reconciled (via consistent checkpoints) so the *result* is as if each event counted once. And it only holds **end-to-end** if the source is replayable, checkpoints are consistent, *and* the sink is transactional or idempotent. Saying "each event runs once" is the glib answer; the precise answer is "exactly-once effect, achieved by snapshot-and-replay plus an exactly-once sink, as a chain."

Q44. Why does a keyBy / GroupByKey often dominate the cost of a streaming job?

Answer

Because it forces a **network shuffle**: to route all records with the same key to the same task, the engine must repartition the stream across all machines, serializing and sending records over the network — far more expensive than the in-memory "forward" edges between co-located operators. It's the dataflow analogue of a sort/shuffle in MapReduce. Mitigations: **pre-aggregate** before the shuffle (combine locally so fewer records cross the network), avoid unnecessary re-keying, and watch for **data skew** (a hot key sends a disproportionate share to one task, capping the whole job). Reading a job graph, the shuffle edges are where you look first for cost.

Rapid-Fire / One-Liners¶

Crisp answers; what an interviewer wants in a sentence or two.

Q45. Dataflow in one line?

Answer

A program as a directed graph where nodes are operations, edges carry data, and a node fires when its inputs are available — data drives execution, not statement order.

Q46. What triggers a node to run?

Answer

The availability of data on its required input edges (its firing rule) — not the previous statement finishing.

Q47. Pull vs push, one line each?

Answer

Pull = consumer asks (lazy; backpressure free). Push = producer emits (eager-ish; backpressure must be engineered).

Q48. Why do pull pipelines get backpressure for free?

Answer

A slow consumer simply pulls slowly, so the producer is only ever asked for what's needed and naturally idles.

Q49. Event-time vs processing-time?

Answer

Event-time = when it happened (reproducible, correct); processing-time = when it arrived (simple, timing-dependent).

Q50. Three window types?

Answer

Fixed/tumbling (non-overlapping), sliding (overlapping), session (gap-defined, data-dependent size).

Q51. What's a watermark, in one line?

Answer

An assertion that all events with event-time ≤ T have probably arrived — it triggers window emission, trading latency for completeness.

Q52. What guarantees determinism in dataflow?

Answer

A Kahn Process Network — blocking reads with no peeking — makes output depend only on inputs and topology, never timing.

Q53. Does fan-out preserve order?

Answer

No — parallel workers finish nondeterministically; reorder with sequence numbers or partition by key.

Q54. One-line fix for a Go pipeline stage that hangs downstream?

Answer

`defer close(out)` — propagate end-of-stream by closing the output channel.

How to Talk About Dataflow in Interviews¶

A few habits separate a strong answer from a textbook recital:

Lead with the trigger. "In dataflow, a node runs because data arrived, not because the previous line finished." That one sentence proves you understand the paradigm versus naming examples.
Anchor in the familiar. Unix pipes and spreadsheets are dataflow everyone has used — start there, then scale up to Flink. It shows the through-line from cat | grep to a thousand-node job.
Keep "dataflow vs concurrency" straight. Concurrency is a benefit the model enables, not the definition. Conflating them is a junior tell.
Name backpressure unprompted. Push pipelines need it; pull pipelines get it free; bounded buffers implement it; unbounded buffers just delay the OOM. This is the most practically important concept and a quick credibility win.
Get determinism right. Kahn networks (blocking, no peek) give determinism by construction; a nondeterministic merge is the deliberate escape hatch. This signals real depth.
For senior+, talk time and state. Event-time vs processing-time, watermarks, windowing, keyed state, and exactly-once-as-a-chain (replayable source + consistent checkpoints + idempotent/transactional sink) are the distributed-streaming vocabulary that separates levels.
Name the trade-off. Local clarity for global opacity; free-to-express parallelism capped by the slowest stage; latency vs throughput on the buffer knob. "It depends, and here's on what" beats absolutism.

Summary¶

Dataflow = program as a directed graph: nodes are operations, edges carry data, a node fires when its inputs are available. The defining contrast with control flow is the trigger — data arrival, not statement order — which is why independent nodes parallelize for free. Unix pipes and spreadsheets are the everyday examples; generators are lazy pull-based dataflow in code.
The execution-model bar is push-vs-pull, lazy-vs-eager, and backpressure via bounded buffers (pull gets it free; push must engineer it; unbounded buffers just delay the OOM). The concurrency bar is Go-channel pipelines, fan-out/fan-in (parallelize by wiring, lose order), and SDF-vs-dynamic.
The senior bar is the trade-off (local clarity for global opacity), Kahn-network determinism vs nondeterministic merge, deadlock (cycles need initial tokens; bounded buffers can deadlock acyclic graphs under load), graph debugging by data lineage, and when dataflow wins.
The staff bar is distributed stream processing: event-time vs processing-time, windowing, watermarks/triggers/late data, keyed state + checkpoints, exactly-once as an end-to-end chain (snapshot-and-replay + transactional sink), and the graph as deployment topology (shuffles, skew, chaining). The strongest answers lead with mechanism and trade-off, anchor in the familiar, and keep dataflow distinct from "just concurrency."

junior.md — the pipeline intuition, the graph model, generators, data-driven execution.
middle.md — firing rules, push/pull, backpressure, Go channels, fan-out/fan-in, SDF.
senior.md — trade-offs, Kahn networks, deadlock, graph debugging, when dataflow wins.
professional.md — Flink/Kafka Streams/Beam, windowing, event-time, exactly-once, computation graphs.
05 — Reactive Programming — the time-varying-values cousin and Rx backpressure.
07 — Actor Model & CSP — the message-passing model behind Go channels.
11 — Event-Driven Programming — events and handlers; the source side of many streams.
System Design → Data Streaming — Kafka, stream processing, and exactly-once at distributed scale.

Dataflow & Stream Programming — Interview Q&A¶

Table of Contents¶

Fundamentals / Junior¶

Execution Model / Middle¶

Concurrency & Pipelines / Middle–Senior¶

Trade-offs & Determinism / Senior¶

Distributed Stream Processing / Senior–Staff¶

Code & Graph Reading¶

Curveballs¶

Rapid-Fire / One-Liners¶

How to Talk About Dataflow in Interviews¶

Summary¶

Related Topics¶