Map / Filter / Reduce — Professional Level¶
Roadmap: Functional Programming → Map / Filter / Reduce
The trio is semantically trivial. Its cost is entirely in where the intermediate values live and how many times the data is walked. This file is about turning a chain of allocations into a single pass — and measuring the difference instead of asserting it.
Table of Contents¶
- Introduction
- Prerequisites
- The Intermediate-Allocation Problem
- Loop Fusion / Deforestation
- Eager vs Lazy: Memory and Throughput
- Java Stream Internals & Parallelism
- Boxing, Branch Cost & Measurement
reduceas a Catamorphism- Common Mistakes
- Test Yourself
- Cheat Sheet
- Summary
- Further Reading
- Related Topics
Introduction¶
Focus: theory + runtime. Three questions decide the performance of every
map/filter/reducechain you will ever write — how many collections does it allocate, how many times does it walk the data, and what does each element cost to box and to branch on? Everything below is an answer to one of those three.
The earlier levels taught the semantics: map transforms each element, filter keeps a subset, reduce/fold collapses the sequence to one value. That is correct and complete as semantics. As machine behavior it is dangerously incomplete, because the obvious implementation — run map to completion, then filter its result, then reduce that — allocates a fresh collection at every arrow and walks the data once per stage. On large inputs the trio's cost is dominated not by the transform functions but by the collections nobody named.
This is the professional distinction. A junior writes xs.map(f).filter(p).sum() and ships it. A professional knows that line has two possible compilations that differ by an order of magnitude in allocation and throughput — an eager chain of three passes with two throwaway collections, or a fused single pass with none — and knows which one their language gives them by default, how to force the other, and how to prove the gap with a benchmark rather than a hunch.
Two disciplines define this level:
- Count the intermediates. Before reasoning about speed, ask how many collections a chain allocates. In eager languages (Python lists, Go slices) the answer is "one per stage." In lazy ones (Haskell, Java
Stream, Python generators) the answer can be zero. That count, not the big-O of the transform, is usually what the profiler blames. - Never argue throughput from intuition. Every claim here pairs with the instrument that proves it on your data: JMH for the JVM,
testing.B+benchstatfor Go,timeit/pyperffor Python. The illustrative numbers in this file are labeled as such; your job is to generate the real ones.
The mental model: a
map/filter/reducechain is a description of a computation, not a sequence of collection-building steps. Whether that description compiles to N passes with N−1 intermediates or to one pass with none depends on laziness, fusion, and which API you reached for. The whole game is collapsing the description into one walk.
Prerequisites¶
- Required: Fluent with
senior.md— you can refactor imperative loops into declarative pipelines and back, and you knowmap/filter/reducecold. - Required: Working model of a managed heap: allocation cost, a tracing GC's mark/sweep, why short-lived garbage still costs (young-gen churn, allocation-rate-driven GC frequency).
- Required: You can read a JMH table, a
benchstatdiff, and atimeitresult and tell signal from noise. - Helpful: Lazy evaluation and stream mechanics — see Laziness & Streams, which this file leans on heavily.
- Helpful: big-o-analysis, profiling-techniques, memory-leak-detection skills for the measurement vocabulary.
The Intermediate-Allocation Problem¶
Take the canonical chain — "double every number, keep the big ones, sum them":
Written eagerly, stage by stage, on a Python list:
# EAGER — each stage materializes a full list. Three passes, two throwaway lists.
doubled = [x * 2 for x in xs] # pass 1: allocates a list of len(xs)
big = [y for y in doubled if y > 100] # pass 2: allocates a second list
result = sum(big) # pass 3: walks the second list
For len(xs) == 10_000_000, this allocates two intermediate lists of ~10M elements each that exist only to be consumed by the next stage and immediately become garbage. The transform functions (* 2, > 100) are trivial; the cost is the two collections and the three separate walks over (mostly) the same data.
The same shape, the same trap, in Go and Java with eager helpers:
// Go — generic helpers (hand-rolled or x/exp/slices-style) are EAGER.
doubled := Map(xs, func(x int) int { return x * 2 }) // new []int, len(xs)
big := Filter(doubled, func(y int) bool { return y > 100 }) // another new []int
result := Reduce(big, 0, func(a, b int) int { return a + b })
// Two heap slices allocated, two GC-tracked, two extra passes.
// Java — Collectors materialize. This is an eager chain wearing a Stream costume.
List<Integer> doubled = xs.stream().map(x -> x * 2).collect(toList()); // list 1
List<Integer> big = doubled.stream().filter(y -> y > 100).collect(toList()); // list 2
int result = big.stream().mapToInt(i -> i).sum();
The costs, all of which a profiler will surface and none of which the source code names:
- Allocation pressure. Two large arrays allocated and discarded per call. Allocation is cheap per object but the rate drives GC frequency; a hot endpoint doing this per request churns the young generation.
- Memory footprint. Peak resident memory is
O(input)for each live intermediate — you hold the whole doubled list in RAM before filtering starts. For inputs near memory limits this is the difference between running and OOM. - Cache traffic. Three passes touch ~30M elements of memory traffic where one pass would touch ~10M. The data is read from and written to RAM three times instead of streamed once.
- No short-circuit. If the real goal were
.first()or.any(), the eager chain still builds the entire doubled and filtered lists before the consumer looks at element zero.
Diagnose it: Python
tracemalloc(peak + per-line allocation) ormemray; Go-memprofile/pprof -alloc_spaceand theB/op+allocs/opcolumns frombenchstat; Java JFR allocation events or-Xlog:gc*showing young-gen churn. The signature of this anti-pattern is high allocation rate with low algorithmic work — the GC is busy collecting collections you never meant to keep.
Loop Fusion / Deforestation¶
The fix is not to write a worse loop — it is to make the same declarative chain compile to a single pass with no intermediate collections. This transformation has a name in the literature: deforestation (Wadler, 1990) — eliminating the intermediate data structures ("trees") that a composition of transformations would otherwise build. The runtime-engineering term for the same idea is loop fusion: collapsing map ∘ filter ∘ reduce into one loop body.
The hand-fused version makes the target explicit:
// FUSED by hand — one pass, zero intermediate slices, zero extra GC pressure.
sum := 0
for _, x := range xs {
y := x * 2 // map, inlined
if y > 100 { // filter, inlined
sum += y // reduce, inlined
}
}
This is what every fusion mechanism is trying to produce automatically so you don't have to surrender the declarative form. The mechanisms differ by language:
1. Lazy streams / pull-based iteration (Java Stream, Python generators, Rust iterators). A lazy pipeline doesn't compute anything until a terminal operation pulls. Each pulled element flows through map then filter then into the reduce before the next element is pulled — so no stage ever materializes a full collection. This is fusion by architecture: the intermediates never exist because evaluation is element-at-a-time.
# FUSED via generators — element-at-a-time, no intermediate lists.
result = sum(x * 2 for x in xs if x * 2 > 100)
# └─ generator expression: map+filter fused, sum pulls one at a time
# Peak extra memory: O(1), not O(n). One logical pass.
// FUSED via Stream — lazy by design. map/filter are intermediate ops that build
// no collection; sum() is the terminal op that drives one pass.
int result = xs.stream()
.mapToInt(x -> x * 2) // primitive stream — also dodges boxing (see below)
.filter(y -> y > 100)
.sum(); // single traversal, zero intermediate lists
2. Transducers (Clojure, and the pattern ported to many languages). A transducer is a composable transformation decoupled from the source and the sink. (comp (map f) (filter p)) builds one combined reducing function; running it over a source applies f and p per element in a single reduce with no intermediate sequences. Fusion becomes a first-class, reusable value rather than a compiler trick.
3. Compiler rewrite rules (GHC / Haskell). Haskell's libraries ship RULES pragmas (the classic foldr/build short-cut fusion, and the stream-fusion framework) that let the compiler rewrite foldr k z (map f (filter p xs)) into a single recursive loop at compile time, allocating no intermediate list. The programmer writes the maximally-declarative form; GHC deforests it.
-- Haskell: written as a three-stage pipeline...
result = sum (map (*2) (filter (>50) xs))
-- ...but GHC's list-fusion RULES rewrite this to a single foldr/loop with
-- NO intermediate list ever allocated. Declarative source, fused machine code.
Before / After — labeled illustrative numbers¶
A microbenchmark of "double, keep >100, sum" over 10M ints, eager-chain vs fused, is the canonical demonstration. These numbers are illustrative — reproduce them on your hardware with the harnesses in the next section.
| Form | Passes | Intermediate collections | Allocations | Throughput (rel.) |
|---|---|---|---|---|
| Eager chain (3 stages, materialized) | 3 | 2 × O(n) | ~2 × n objects/refs | 1.0× (baseline) |
| Fused (lazy stream / generator) | 1 | 0 | ~0 (plus pipeline scaffolding) | ~2.5–4× |
| Hand-written loop | 1 | 0 | 0 | ~3–5× (often == fused, sometimes a touch faster) |
# Go benchstat sketch (illustrative):
│ eager-chain │ fused-loop │
│ sec/op │ sec/op vs base │
DoubleFilterSum │ 42.1m ± 2% │ 11.3m ± 1% -73.2% │
│ B/op │ B/op vs base │
DoubleFilterSum │ 160.0Mi ± 0% │ 0.0 ± 0% -100.00% │ ← intermediates gone
│ allocs/op │ allocs/op vs base │
DoubleFilterSum │ 2.00 ± 0% │ 0.00 ± 0% -100.00% │
The headline is the B/op and allocs/op going to zero, not just the wall-clock win. The throughput improvement is a consequence of removing the allocation and the extra passes; the allocation columns are the proof of why.
The catch — fusion isn't free or guaranteed. Lazy fusion adds per-element overhead (a virtual call / closure invocation per stage) that a fully-inlined hand loop avoids; on small inputs the eager or hand-loop form can win because the pipeline scaffolding dominates. GHC's RULES fire only when the code matches the pattern (a
let-bound intermediate or an unfortunateseqcan block them). The professional move is to know your default (is this language eager or lazy here?) and measure the specific chain, not to assume fusion always happens.
Eager vs Lazy: Memory and Throughput¶
The same trio behaves oppositely depending on whether the language evaluates the chain eagerly or lazily. This is the single most important runtime distinction in this topic.
| Property | Eager (Python list, Go slice helpers, Java Collectors.toList) | Lazy (Java Stream, Python generator, Haskell, Rust Iterator) |
|---|---|---|
| Intermediate collections | One per stage | Zero (fused element-at-a-time) |
| Peak extra memory | O(n) per live stage | O(1) |
| Passes over data | One per stage | One total |
Short-circuit (first/any/take) | No — full chain runs first | Yes — stops at the first hit |
| Per-element overhead | Low (tight loop per stage) | Higher (closure/virtual call per stage) |
| Reusable after consumption | Yes (it's a materialized collection) | No — a lazy stream/iterator is single-use |
| Debuggability | Easy — inspect intermediates | Harder — nothing exists until pulled |
| Best when | Small data, need the intermediate, reuse it | Large data, long chains, early termination, streaming |
Eager's hidden virtue. Eager isn't simply "worse." When the input is small, a tight per-stage loop with no closure indirection can beat a lazy pipeline whose scaffolding cost dominates. And if you genuinely need the doubled list (you reuse it three times), materializing it once is correct — laziness would recompute it on each consumption unless you force/cache it.
Lazy's decisive wins. Two cases make laziness not just faster but categorically necessary:
- Short-circuiting.
stream.map(expensive).filter(p).findFirst()evaluatesexpensiveonly until the first match. The eager equivalent computesexpensivefor every element first. On a 10M-element source where the answer is element 3, that is the difference between 3 calls and 10M. - Infinite / unbounded sources. A generator or
Stream.iterate(...)canmap/filteran infinite sequence andtake(k); the eager form cannot exist — it would never finish materializing stage one. (See Laziness & Streams.)
# Lazy short-circuit: stops at the first match; expensive() called ~3 times.
first_big = next(y for x in xs if (y := expensive(x)) > THRESH)
# Eager equivalent: expensive() called len(xs) times before we look at element 0.
first_big = [expensive(x) for x in xs if expensive(x) > THRESH][0] # wasteful + double-calls
# timeit sketch (illustrative), 10M ints, "double, keep >100, sum":
# list comprehension chain (2 lists) : 1.00x (baseline, ~2 GB peak via tracemalloc)
# map()+filter() returning iterators : ~0.95x (lazy in py3, but lambda call overhead)
# generator expression (fused) : ~0.70x (O(1) peak memory — the real win)
# hand loop : ~0.65x
# Lesson: in Python the memory story (generator: O(1) peak) often matters more than ns.
Diagnose it: the eager-vs-lazy choice is visible in
tracemalloc/memraypeak (lazy stays flat, eager spikes per stage) and in theallocs/opcolumn. If peak memory scales with input and you never reuse the intermediate, you have an eager chain that should be lazy.
Java Stream Internals & Parallelism¶
Java's Stream is the richest production example of a fused, lazy pipeline, and its internals reward understanding because its parallel mode is the most commonly misused feature in this whole topic.
Pipeline structure. A stream is a lazily-built linked structure of pipeline stages. Each intermediate operation (map, filter) creates a new stage that wraps the previous one; no work happens until a terminal operation (sum, collect, findFirst, reduce) is invoked. The terminal op drives a single traversal, pushing each element through the chained stages — this is the fusion: stages compose into one accept(element) call path, no intermediate collection.
Spliterator. The source is described by a Spliterator ("splittable iterator") — it knows how to traverse (tryAdvance) and how to split itself (trySplit) for parallelism, plus characteristics (SIZED, ORDERED, SORTED, DISTINCT) the pipeline uses to optimize. A well-characterized source (an ArrayList, an array, an IntStream.range) splits cheaply into balanced halves; a poorly-characterized one (a LinkedList, an I/O source, iterate) splits badly or not at all.
Short-circuiting. Terminal ops like findFirst, anyMatch, limit are short-circuiting — the traversal stops as soon as the answer is known, exactly the lazy win described above. findAny exists specifically so the parallel engine needn't preserve encounter order when you don't care.
// Sequential stream: one Spliterator, one thread, one fused pass.
OptionalInt firstBig = xs.stream()
.mapToInt(x -> x * 2)
.filter(y -> y > 100)
.findFirst(); // short-circuits — stops at the first match
Parallel streams: the fork-join trap¶
.parallel() reuses the common ForkJoinPool to split the source via trySplit, process chunks on multiple threads, and combine results. It is one method call — and that is the danger, because it hides a cost model that frequently makes it slower than sequential.
parallelStream() tends to lose when:
- The source splits badly.
LinkedList,HashSetiteration,Stream.iterate, or any non-SIZED/non-balanced source produces lopsided splits; one thread gets most of the work and the parallelism evaporates while you still pay coordination overhead. - N is small or per-element work is tiny. The fork/split/merge and pool-scheduling overhead is fixed; below a threshold (rule of thumb: N × cost-per-element must be large, often cited as needing ≳ ~10⁴–10⁵ cheap elements to break even) sequential wins outright.
- The merge is expensive or stateful. Combining results (
Collectors.toList,groupingBy) has its own cost; an ordered collector forces extra work to restore encounter order. - It contends on shared state. A lambda touching a shared mutable structure (a non-concurrent collector, an external counter) either corrupts data or serializes on a lock, erasing the gain. Parallel stream lambdas must be stateless and non-interfering — this is a correctness requirement, not just a performance one.
- It steals the common pool. The default
ForkJoinPool.commonPool()is shared process-wide; a blocking or long parallel stream starves every other parallel stream and CompletableFuture in the JVM. Blocking I/O inside a parallel stream is a classic production incident.
// LOOKS faster, often slower. Measure before believing.
int sum = bigList.parallelStream() // LinkedList here = bad splits → no speedup
.mapToInt(x -> compute(x)) // if compute() is cheap, overhead dominates
.sum(); // and it borrows the shared common pool
// Correct parallel use: large, well-characterized, SIZED source; substantial per-element
// work; stateless lambdas; sequential measured as the baseline first.
int sum2 = Arrays.stream(bigArray) // primitive array → SIZED, splits perfectly
.parallel()
.map(x -> expensivePureTransform(x)) // heavy + pure → parallelism pays
.sum();
Diagnose it: JMH with
@Threads/Blackholecomparing.stream()vs.parallelStream()on your source and your element cost — never assume. JFR will show ForkJoinPool worker activity and whether splits are balanced. The default answer to "should I add.parallel()?" is no, until a JMH benchmark on the real workload says yes.
Boxing, Branch Cost & Measurement¶
Two per-element costs determine whether a fused chain is fast or merely fused: boxing in numeric reduces, and branch prediction in filters.
Boxing: IntStream vs Stream<Integer>¶
On the JVM, Stream<Integer> stores boxed Integer objects — each is a heap object with a header, reached through a reference, and each arithmetic step unboxes and re-boxes. A map(i -> i * 2) on a Stream<Integer> allocates a fresh Integer per element (outside the −128..127 cache) and chases a pointer per access. The primitive specializations — IntStream, LongStream, DoubleStream — store raw primitives in a flat array, no boxing, no pointer chase, vectorizable.
// SLOW — boxed. Every element is a heap Integer; sum() unboxes 10M times.
int s1 = xs.stream() // Stream<Integer>
.map(i -> i * 2) // allocates a new Integer per element
.reduce(0, Integer::sum); // unbox both args every step
// FAST — primitive. Flat int storage, no allocation, no unboxing.
int s2 = xs.stream()
.mapToInt(Integer::intValue) // (or start from IntStream)
.map(i -> i * 2)
.sum();
# JMH sketch (illustrative), sum of doubled 10M ints:
Benchmark Mode Cnt Score Error Units
boxedStreamReduce avgt 5 58.402 ± 2.1 ms/op ← Integer boxing tax
intStreamSum avgt 5 12.118 ± 0.6 ms/op ← ~4.8x faster, 0 boxing alloc
intStreamSum:·gc.alloc.rate avgt 5 ≈0 MB/sec ← the proof: no allocation
boxedStreamReduce:·gc.alloc.rate avgt 5 ≈3200 MB/sec ← 10M Integers/op
The gc.alloc.rate row (JMH's -prof gc) is the smoking gun: the boxed form allocates an Integer per element, the primitive form allocates nothing. The same trap exists in Python only weakly (ints are always objects, but small ints are cached) and in Go not at all for []int — but on the JVM it is the most common silent 3–5× tax in numeric pipelines.
Branch / predicate cost in filter¶
A filter is a branch per element. The CPU's branch predictor makes a well-predicted branch nearly free (~1 cycle) and a mispredicted one expensive (a pipeline flush, ~15–20 cycles). So filter's real cost depends on the predictability of the predicate over your data:
- A predicate that's ~99% one way (
x > 0on mostly-positive data) predicts perfectly — nearly free. - A predicate that's ~50/50 and data-order-random is a misprediction factory — every other element flushes the pipeline.
This is why two filters with identical big-O can differ measurably in throughput, and why predicate ordering matters: put the cheapest, most-selective predicate first so it rejects most elements before the expensive predicate runs (.filter(cheapCommon).filter(expensiveRare)), and the downstream stages see fewer elements.
The measurement toolkit¶
| Concern | Go | Java / JVM | Python |
|---|---|---|---|
| Microbenchmark | testing.B + benchstat | JMH (@Benchmark, Blackhole) | timeit, pyperf |
| Allocation per op | -benchmem → B/op, allocs/op | JMH -prof gc → gc.alloc.rate | tracemalloc, memray |
| CPU profile | -cpuprofile, pprof | async-profiler, JFR | cProfile, py-spy, scalene |
| Branch/cache counters | perf stat -e branches,branch-misses | perf, async-profiler hw events | perf stat python … |
| Did it fuse / inline? | go build -gcflags=-m | -XX:+PrintInlining | (CPython doesn't inline) |
# Go: prove the intermediates are gone — allocs/op should hit 0 for the fused form.
go test -bench=DoubleFilterSum -benchmem -count=10 | tee out.txt && benchstat out.txt
# Java: prove boxing — gc.alloc.rate separates IntStream (≈0) from Stream<Integer>.
java -jar benchmarks.jar DoubleFilter -prof gc
# Python: prove the memory win — peak stays O(1) for the generator form.
python -c "import tracemalloc, timeit; ..." # compare list-comp peak vs genexpr peak
Discipline: the three columns that prove this topic's lessons are
allocs/op/B/op(did fusion remove the intermediates?),gc.alloc.rate(didIntStreamremove the boxing?), andbranch-misses(is the filter predictable?). Wall-clock alone tells you that it's faster; these tell you why, which is what lets you generalize the fix.
reduce as a Catamorphism¶
reduce/fold is not merely "the third member of the trio." It is the universal recursion scheme over a sequence — in category-theory terms, a catamorphism: the unique structure-collapsing morphism from an algebraic data type to a result. The practical payoff of knowing this is twofold.
1. map and filter are special cases of fold. You can define both in terms of reduce, which is why fusion works: a chain of maps and filters is a single fold whose step function is the composition of the per-stage transforms. That is the theoretical statement of which loop fusion is the implementation.
-- map and filter are folds; therefore map∘filter∘fold collapses to one fold.
map f = foldr (\x acc -> f x : acc) []
filter p = foldr (\x acc -> if p x then x : acc else acc) []
-- A fold is the catamorphism for lists: it replaces (:) and [] with your (step, init).
2. Associativity is what enables parallel reduce. A sequential foldl threads an accumulator left-to-right — inherently serial. But if the combining operation is associative ((a ⊕ b) ⊕ c == a ⊕ (b ⊕ c)), the reduction can be split into independent sub-reductions on different threads and combined in any grouping — this is exactly what Java's Stream.reduce(identity, accumulator, combiner) and parallel reduce require. The identity must be a true identity element (x ⊕ identity == x) and the operation must be associative, or the parallel result is nondeterministic.
// Parallel reduce is correct ONLY if combiner is associative and identity is a real
// identity. '+' with 0 qualifies; subtraction or "append with a separator" does not.
int sum = stream.parallel().reduce(0, Integer::sum); // ✓ associative, 0 is identity
// reduce(1.0, (a,b) -> a - b) ✗ not associative → wrong/nondeterministic in parallel
This is why sum, min, max, count, and monoidal concatenation parallelize trivially while "running difference" or "build a string left-to-right with state" do not. The algebra (monoid: associative op + identity) is the precondition for the fork-join split to be correct — the catamorphism view tells you, before you benchmark, whether a reduce is even eligible for parallelism.
The through-line: fusion (one pass) and parallelism (many passes combined) are the two performance superpowers of the trio, and both are theorems about
fold— fusion because map/filter are folds, parallelism because an associative fold splits. The theory isn't decoration; it tells you which optimizations are legal.
Common Mistakes¶
Professional-level mistakes — subtle, and therefore expensive:
- Materializing intermediates you immediately consume.
collect(toList())between amapand afilter, or stage-by-stage list comprehensions, allocatesO(n)garbage for nothing. Keep the pipeline lazy end-to-end; collect once, at the terminal. - Assuming your language fuses. Python list comprehensions and Go slice helpers are eager — no fusion, one allocation per stage. Only generators/
Stream/iterators/Haskell fuse. Know your default before reasoning about cost. - Reaching for
Stream<Integer>in a numeric reduce. Boxing is a silent 3–5× tax with an allocation per element. UseIntStream/LongStream/DoubleStream; confirm with JMHgc.alloc.rate ≈ 0. - Sprinkling
.parallel()as a free speedup. On small N, cheap per-element work, badly-splitting sources (LinkedList), or stateful lambdas it is slower or wrong — and it borrows the shared common pool, hurting the whole JVM. Default to sequential; add parallel only when a JMH benchmark on the real source says yes. - Filtering after the expensive map.
.map(expensive).filter(p)runsexpensiveon every element. Push selective, cheap filters before the expensive map so fewer elements reach it. (Don't confuse this with fusion — order matters even in a fused pass because it changes how many elements each stage sees.) - Reusing a consumed lazy stream / iterator. A
Streamor Python generator is single-use; touching it twice throws or yields nothing. If you need two passes, materialize once or build two pipelines. - Parallel-reducing with a non-associative combiner. Subtraction, division, ordered string-building with state — these give nondeterministic results under parallel reduce. Verify the monoid (associative op + true identity) before parallelizing.
- Benchmarking only wall-clock. A faster ns/op without checking
allocs/opandgc.alloc.ratedoesn't tell you why — and a "win" that's really the JIT eliminating dead code (noBlackhole) is a measurement artifact. Always capture the allocation columns and consume your results.
Test Yourself¶
- Explain, in terms of passes and allocations, why
xs.map(f).filter(p).sum()written with eager helpers is slower than the hand-written single loop — and name the benchmark columns that would prove the difference. - What is loop fusion / deforestation, and name three distinct mechanisms (across different languages) that achieve it. When does fusion fail to happen or lose to the eager form?
- Give two cases where laziness is not merely faster but categorically necessary (impossible to do eagerly).
- On the JVM, why is
Stream<Integer>summed viareduceso much slower thanIntStream.sum()? Which JMH profiler row is the smoking gun? - List four conditions under which
parallelStream()is slower thanstream(), and the tool you'd use to decide on a real workload. - Why does the associativity of a reduce's combining operation determine whether it can be parallelized? Give one operation that qualifies and one that doesn't.
- Two
filterpredicates have identical big-O but one is measurably slower. Give the microarchitectural reason and the counter that confirms it.
Answers
1. The eager form runs **three passes** (map, filter, reduce) and allocates **two intermediate collections** of `O(n)` that become garbage immediately; the hand loop runs **one pass** and allocates **zero** intermediates. The proof columns are **`allocs/op`** and **`B/op`** (Go `-benchmem`) — they drop to 0 for the fused/loop form — plus throughput as the consequence. 2. Deforestation/fusion = collapsing a composition of transforms into one pass with no intermediate data structures. Mechanisms: (a) **lazy pull-based iteration** (Java `Stream`, Python generators, Rust iterators — element-at-a-time, intermediates never materialize); (b) **transducers** (Clojure — compose reducing functions into one); (c) **compiler rewrite rules** (GHC `foldr/build` / stream fusion RULES). It fails when the language is eager (Python lists, Go slices) or the RULES don't match (a forced/`seq`'d intermediate); it *loses* on small N where per-element closure/virtual-call overhead exceeds the saved allocation. 3. (a) **Short-circuiting** — `map(expensive).filter(p).findFirst()` evaluates `expensive` only until the first match; eager computes it for all elements first. (b) **Infinite/unbounded sources** — `Stream.iterate`/generators can map/filter then `take(k)`; the eager form would never finish materializing stage one. 4. `StreamCheat Sheet¶
| Concern | Symptom | Measure with | Fix |
|---|---|---|---|
| Intermediate allocations | High allocs/op / peak memory O(n) per stage; GC churn with little real work | -benchmem, tracemalloc/memray, JFR alloc | Keep pipeline lazy end-to-end; collect once at the terminal |
| No fusion (eager default) | Python list-comp chains, Go slice helpers | tracemalloc peak, B/op | Generators / Stream / iterators / Haskell; or hand loop |
| Boxing in numeric reduce | Stream<Integer>; gc.alloc.rate in GB/s | JMH -prof gc | IntStream/LongStream/DoubleStream |
Unpredictable filter | Same big-O, slower; high branch-misses | perf stat -e branch-misses | Order predicates cheap+selective first; restructure data |
| Wrong-source parallel stream | .parallel() no faster or slower | JMH .stream() vs .parallelStream(), JFR | Sequential by default; parallel only on large, SIZED, heavy, stateless work |
| Non-associative parallel reduce | Nondeterministic parallel result | Property test vs sequential | Use a monoid (associative op + identity) or stay sequential |
Three golden rules: - Count the intermediates first; the collections nobody named are usually what the profiler blames. - Know your default — eager (Python list, Go slice) allocates per stage; lazy (Stream, generator, Haskell) fuses to one pass with none. - Prove every optimization with the allocation columns (allocs/op, gc.alloc.rate), not wall-clock alone — and add .parallel() only when a benchmark on the real source says yes.
Summary¶
- The trio's semantics are trivial; its cost is intermediate-collection allocation and number of passes — not the transform functions. An eager
map→filter→reducechain allocates one collection and one pass per stage. - Loop fusion / deforestation collapses the chain into a single pass with zero intermediates. It arrives three ways: lazy pull-based iteration (Java
Stream, Python generators, Rust iterators), transducers (Clojure), and compiler rewrite RULES (GHC). Fusion isn't free — on small N the per-element scaffolding can lose to the eager or hand-loop form. - Eager vs lazy is the central distinction: eager (Python lists, Go slices) materializes
O(n)per stage and can't short-circuit; lazy (Stream, generators, Haskell) usesO(1)extra memory, one pass, short-circuits, and handles infinite sources — at the price of higher per-element overhead and single-use semantics. - Java
Streaminternals: lazily-linked pipeline stages, no work until the terminal op, traversal/splitting viaSpliterator, short-circuiting terminals..parallel()reuses the shared ForkJoinPool and frequently loses — bad-splitting sources, small N, cheap elements, stateful lambdas, expensive merges. Default sequential; parallelize only on a JMH-proven workload. - Boxing is the JVM's silent numeric tax:
Stream<Integer>allocates anIntegerper element;IntStream/LongStream/DoubleStreamstore flat primitives — often 3–5× faster with ≈0 allocation (gc.alloc.rateis the proof). Filter branch cost depends on predicate predictability; order cheap, selective predicates first. - Measure first, always: the proof columns are
allocs/op/B/op(fusion removed intermediates),gc.alloc.rate(IntStreamremoved boxing), andbranch-misses(predictable filter) — via JMH,testing.B+benchstat, andtimeit/tracemalloc. reduceis a catamorphism — the universal fold.mapandfilterare folds (which is why fusion works), and an associative combiner with a true identity is the precondition for parallel reduce. The theory tells you which optimizations are legal before you benchmark which are fast.- This completes the level ladder for the trio:
junior.md→middle.md→senior.md→ professional.md (runtime & fusion).
Further Reading¶
- Deforestation: transforming programs to eliminate trees — Philip Wadler (1990) — the original statement of removing intermediate data structures from composed transforms.
- Stream Fusion: From Lists to Streams to Nothing at All — Coutts, Leshchinsky, Stewart (ICFP 2007) — the framework behind GHC's list fusion.
- Java 8 in Action / Modern Java in Action — Urma, Fusco, Mycroft — Stream pipeline internals, Spliterator, parallel stream cost model.
- JMH (Java Microbenchmark Harness) — official samples, especially the GC profiler (
-prof gc) andBlackholefor avoiding dead-code elimination. - Optimizing Java — Evans, Gough, Newland (2018) — boxing cost, escape analysis, when parallel streams pay.
- High Performance Python — Gorelick & Ozsvald (2nd ed., 2020) — generators vs comprehensions,
tracemalloc, memory-vs-speed trade-offs. - Clojure transducers — Rich Hickey's reference docs — fusion as a first-class, composable value.
- Go testing & benchmarking —
testing.B,-benchmem, andbenchstatdocs atgolang.org/x/perf.
Related Topics¶
- Laziness & Streams — the evaluation model that makes fusion and short-circuiting possible; this file's closest sibling.
- Composition —
map/filter/reducechains are function composition over sequences; fusion is composition collapsed into one pass. - Recursion & Tail Calls —
foldis the structured recursion scheme; the catamorphism view connects them. - Pure Functions & Referential Transparency — purity is what lets fusion and parallel reduce reorder/merge work safely.
- Immutability — why transforms return new values, and the structural-sharing cost model behind them.
- Over-Engineering → Premature Optimization — profile before you fuse or parallelize; the counterweight to this file.
- big-o-analysis · profiling-techniques · memory-leak-detection — the measurement and complexity toolkits referenced throughout.
In this topic