Unnecessary Allocation — Middle Level¶
Category: Performance Anti-Patterns → Unnecessary Allocation — throwaway objects, boxing, and copies churned in a hot path.
Table of Contents¶
- Introduction
- Prerequisites
- How to See Allocations:
-benchmem - Form 1 — String Building
- Form 2 — Presizing Collections
- Form 3 — Boxing and Autoboxing
- Form 4 — Reuse vs Re-create
- Form 5 — Intermediate Collections in Pipelines
- Reading the Numbers Honestly
- Common Mistakes
- Test Yourself
- Cheat Sheet
- Summary
- Further Reading
- Related Topics
Introduction¶
Focus: The common forms and the fixes — measured, not guessed.
At the junior level you learned the headline case (loop string concatenation) and the one rule (allocate fewer times, only where hot). This file is the working catalog: the five forms of unnecessary allocation you'll actually meet, each with a before/after and — critically — the numbers that prove the fix.
The skill that separates this level from the last is measurement. You stop believing an allocation is there and start seeing it: allocs/op and B/op from a benchmark, allocation counts from a profiler. A change that "should be faster" but moves no numbers is not an optimization — it's a guess. Every fix below comes with the flag that makes the allocation visible.
The discipline: before you change code to allocate less, run the benchmark with
-benchmem(Go) or an allocation profiler (Java/Python). After, run it again. Ifallocs/opdidn't drop, you fixed nothing.
Prerequisites¶
- Required:
junior.md— you can spot loop concatenation and reach for a builder. - Required: You can run a benchmark in your language — Go
testing.B, a JMH harness or a timed loop in Java,timeit/tracemallocin Python. - Helpful: A rough mental model of the heap and a tracing GC (objects live on the heap; the GC reclaims unreachable ones).
- Helpful: Familiarity with the
profiling-techniquesskill —-benchmemis the entry-level version of allocation profiling.
How to See Allocations: -benchmem¶
In Go, every benchmark can report allocations for free. This single flag is the most useful tool in this entire topic:
func BenchmarkJoin(b *testing.B) {
words := make([]string, 1000)
for i := range words {
words[i] = "abc"
}
b.ReportAllocs() // or pass -benchmem on the command line
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = joinSlow(words)
}
}
BenchmarkJoinSlow-8 2014 583012 ns/op 2557104 B/op 999 allocs/op
BenchmarkJoinFast-8 228913 5216 ns/op 5376 B/op 2 allocs/op
Read the last two columns first. B/op is bytes allocated per operation; allocs/op is the number of allocations per operation. The slow version makes 999 allocations (one per word, as predicted) and ~2.5 MB of garbage per call; the builder makes 2 and ~5 KB. The ns/op improvement (112×) follows from the allocation drop. allocs/op is the signal; ns/op is the consequence.
In Java the equivalent is JMH with -prof gc, which reports gc.alloc.rate.norm — bytes allocated per operation. In Python, tracemalloc snapshots show allocation by line. We'll use all three below.
Form 1 — String Building¶
The junior case, now with numbers and the subtler variants.
// Before — O(n²), n allocations. After — O(n), ~2 allocations.
func joinSlow(ws []string) string { s := ""; for _, w := range ws { s += w }; return s }
func joinFast(ws []string) string {
var b strings.Builder
b.Grow(totalLen(ws)) // optional: presize the buffer → 1 alloc
for _, w := range ws { b.WriteString(w) }
return b.String()
}
b.Grow(n) is the pro touch: if you know the final length, presize the builder's buffer so it never reallocates while growing. That takes joinFast from ~2 allocs to 1.
The hidden version — fmt.Sprintf in a hot loop. It's a builder-killer because it allocates for formatting and boxes its ...interface{} arguments:
// Hot-path logging — Sprintf allocates a lot.
msg := fmt.Sprintf("user=%d action=%s", id, action) // formatting + arg boxing
// Cheaper when hot:
var b strings.Builder
b.WriteString("user="); b.WriteString(strconv.Itoa(id))
b.WriteString(" action="); b.WriteString(action)
Don't reflexively kill every
Sprintf— it's fine on a cold path and far more readable. Replace it only where a profile shows the allocations matter.
Form 2 — Presizing Collections¶
A slice or list that doesn't know its final size reallocates as it grows: it allocates a backing array, fills it, then when full allocates a bigger one (typically ~2×), copies everything over, and discards the old one. For n elements you pay roughly log₂(n) reallocations and copy ~2n elements total — all avoidable if you knew the size.
// Before — grows from nil; ~log2(n) reallocations.
func squares(n int) []int {
out := []int{} // cap 0
for i := 0; i < n; i++ { out = append(out, i*i) }
return out
}
// After — one allocation, exact size known.
func squaresFast(n int) []int {
out := make([]int, 0, n) // cap n up front
for i := 0; i < n; i++ { out = append(out, i*i) }
return out
}
BenchmarkSquaresSlow-8 122163 9788 ns/op 357627 B/op 19 allocs/op
BenchmarkSquaresFast-8 384720 3110 ns/op 81920 B/op 1 allocs/op
Same answer, 19 allocations → 1. The Java equivalent is the constructor argument:
List<Integer> out = new ArrayList<>(n); // not new ArrayList<>(); presize to n
Map<String,V> m = new HashMap<>(n * 4 / 3 + 1); // presize to avoid rehash/resize
Presizing is the cheapest, lowest-risk allocation win there is: no reuse, no pooling, no correctness hazard — you just tell the collection what you already know.
Form 3 — Boxing and Autoboxing¶
A boxed primitive is the value wrapped in a heap object: Java's Integer for int, Double for double. A List<Integer> doesn't store ints — it stores pointers to Integer objects, each its own heap allocation. Autoboxing silently converts int ↔ Integer at the boundaries, so the allocations hide in plain-looking code.
// Before — every element is a boxed Integer on the heap.
long sumBoxed(List<Integer> nums) {
long total = 0;
for (Integer n : nums) { // unbox each; the list already boxed them
total += n;
}
return total;
}
// After — primitives never leave the stack / a flat array.
long sumPrimitive(int[] nums) {
long total = 0;
for (int n : nums) { // no boxing, no per-element object
total += n;
}
return total;
}
A JMH run with -prof gc tells the story:
sumBoxed gc.alloc.rate.norm 16.0 B/op (plus the list's boxing upstream)
sumPrimitive gc.alloc.rate.norm 0.0 B/op ← zero allocation
Boxing's worst hiding spots:
Map<Integer, Integer>/Map<Long, …>— keys and values boxed. Reach for a primitive-specialized map (Eclipse Collections, fastutil, HPPC) in hot code.- Streams over
Integer— useIntStream,LongStream,DoubleStream, which carry primitives, not boxed wrappers. Integercache trap —==on boxed values works for −128..127 (cached) and breaks above it. That's a correctness bug riding along with the allocation cost.
Go has no autoboxing of numbers, but it has the analogous trap: putting a value into an interface{}/any boxes it (the value escapes to the heap). fmt.Println(x), []any{...}, and sync.Map storing primitives all box.
Form 4 — Reuse vs Re-create¶
When the same buffer is needed every iteration, allocating a fresh one each time is pure waste. Hoist it out of the loop and reuse it.
// Before — a new scratch buffer per record.
func process(records [][]byte) {
for _, r := range records {
buf := make([]byte, 0, 4096) // fresh allocation every iteration
buf = transform(buf, r)
emit(buf)
}
}
// After — one buffer, reset and reused. (buf[:0] keeps the capacity.)
func processFast(records [][]byte) {
buf := make([]byte, 0, 4096) // allocate once
for _, r := range records {
buf = transform(buf[:0], r) // reuse the backing array
emit(buf)
}
}
buf[:0] resets the length to zero while keeping the capacity, so transform writes into the same memory. Allocations drop from one-per-record to one total.
The catch — only reuse what's truly done being used.
emit(buf)must consumebufbefore the next iteration overwrites it. Ifemitstashes the slice for later, you've created an aliasing bug — the stored slices all point at the same buffer and show the last record's bytes. Reuse trades a small allocation win for a real correctness obligation. (The heavyweight version of reuse —sync.Pool— and its dangers aresenior.md.)
Form 5 — Intermediate Collections in Pipelines¶
Each map/filter/collect step that materializes a whole new collection allocates a throwaway. Chaining four of them allocates four lists you never name.
// Before — three intermediate lists, then a fourth result.
List<String> result = users.stream()
.collect(Collectors.toList()) // materialize
.stream().filter(User::isActive)
.collect(Collectors.toList()) // materialize
.stream().map(User::name)
.collect(Collectors.toList()); // materialize
// After — one lazy pipeline, one terminal allocation.
List<String> result = users.stream()
.filter(User::isActive)
.map(User::name)
.collect(Collectors.toList()); // single materialization
Streams are lazy: filter and map don't allocate collections — they're applied element-by-element and only the terminal collect materializes. The "before" forces a list at each step by re-streaming, defeating laziness. The fix is simply don't break the pipeline — keep it lazy until one terminal operation.
Python's analog: prefer generators over building intermediate lists when you only iterate once.
# Before — two full lists in memory.
names = [u.name for u in users]
active = [n for n in names if is_active(n)]
# After — generator pipeline, no intermediate list.
active = (u.name for u in users if u.is_active) # lazy; materialize only if needed
Reading the Numbers Honestly¶
A few rules so your benchmarks don't lie to you:
- Look at
allocs/opandB/op, not justns/op. Allocation cost is partly deferred into future GC cycles; a micro-benchmark may not pay that bill, so thens/opwin understates the real one. The allocation count is the honest signal. - Make sure the result is used. If you discard the benchmark's output, the compiler may delete the work (dead-code elimination). Assign it to a package-level sink or
b-scoped variable. (More inprofessional.md.) - Benchmark realistic sizes. Loop concatenation looks fine at n=10 and catastrophic at n=100,000. Use an input size near production.
- Same answer, fewer allocations. Every fix above is behavior-preserving. If the output changed, you didn't optimize — you broke it.
Common Mistakes¶
- Trusting
ns/opalone. A change can shave nanoseconds in a micro-benchmark while making more garbage that hits you under real GC load. Watchallocs/op. - Presizing with the wrong number.
make([]T, n)(length n) vsmake([]T, 0, n)(capacity n) — the first prefills n zero values; appending then doubles the slice. For "append n items," you want0, n. - Killing every
Sprintf/stream/Integer. These are fine on cold paths and far clearer. Replace them only where a profile points. - Reusing a buffer that's still referenced. Buffer reuse is a correctness contract: the old contents must be fully consumed before overwrite. Get this wrong and you have an aliasing bug, not a speedup.
HashMapwithout sizing, then surprised by rehashes. A map that outgrows its capacity rehashes — reallocating the whole table. Presize from the known element count.- Boxing in the hot loop you didn't notice.
map.get(key)wherekeyis anintautoboxes on every call. The allocation is invisible in the source.
Test Yourself¶
- You see
BenchmarkX-8 ... 120000 B/op 300 allocs/op. Which number tells you there's an unnecessary-allocation problem, and what does each mean? - In Go, what's the difference between
make([]int, n)andmake([]int, 0, n)when you're about toappendn items? Which avoids reallocation? - Why does
List<Integer>allocate more thanint[]for the same numbers? What's the name for wrapping a primitive in a heap object? - Rewrite to one allocation:
- A four-step Java stream pipeline calls
.collect(Collectors.toList())after every step. How many lists does it materialize, and how do you get to one? - You hoist a
make([]byte, 4096)out of a loop and reuse it withbuf[:0]. What new bug must you check for?
Answers
1. **`allocs/op` (300)** is the key signal — 300 separate allocations per call strongly suggests one-per-something in a loop. `B/op` (120 KB) is the total bytes; high bytes with low allocs is one big allocation, high allocs is many small ones (usually the fixable kind). 2. `make([]int, n)` creates a slice of **length n**, prefilled with n zeros; appending adds a 2nd batch and triggers growth. `make([]int, 0, n)` creates **length 0, capacity n**; appending n items fits exactly with **no reallocation**. Use the second. 3. `ListCheat Sheet¶
| Form | Before | After | What drops |
|---|---|---|---|
| String building | s += x in a loop | strings.Builder / StringBuilder (+Grow) | n allocs → 1–2 |
| Presizing | []T{} / new ArrayList<>() | make([]T,0,n) / new ArrayList<>(n) | log n reallocs → 1 |
| Boxing | List<Integer>, Map<Integer,…> | int[], primitive-specialized maps | per-element objects → 0 |
| Reuse | make(...) inside loop | hoist + buf[:0] | n allocs → 1 (mind aliasing) |
| Pipelines | .collect() per step | one lazy pipeline, one terminal | n lists → 1 |
Measure both ends:
go test -benchmem, JMH-prof gc, Pythontracemalloc. Ifallocs/opdidn't fall, you changed nothing.
Summary¶
- The five everyday forms of unnecessary allocation are string building, un-presized collections, boxing/autoboxing, re-creating instead of reusing, and intermediate collections in pipelines.
- Each fix is behavior-preserving and provable: run the benchmark with
-benchmem(Go),-prof gc(JMH), ortracemalloc(Python) and watchallocs/opfall. allocs/opis the signal,ns/opis the consequence — and a micro-benchmark understates allocation cost because GC is deferred.- Presizing is the cheapest, safest win (no reuse, no pool). Boxing hides in
List<Integer>andinterface{}. Buffer reuse is fast but carries an aliasing correctness contract. - Apply these where they're measured to matter; on cold paths, prefer the clearer code. (Premature optimization is the failure on the other side.)
- Next:
senior.md— reading a real allocation profile, escape analysis (-gcflags=-m),sync.Pooland its dangers, and the rule "profile first — most allocations don't matter."
Further Reading¶
- Systems Performance — Brendan Gregg (2nd ed., 2020) — connecting allocation rate to observed CPU and latency.
- Java Performance — Scott Oaks (2nd ed., 2020) — autoboxing costs, the young generation, and allocation rate as the GC's primary input.
- The Go Blog — escape analysis (go.dev/blog) — why some allocations vanish entirely when a value stays on the stack.
- Effective Java — Joshua Bloch (3rd ed., 2018) — Item 6 (unnecessary objects), Item 61 (prefer primitives to boxed primitives).
Related Topics¶
- Premature Optimization Traps — apply these fixes only where measured; don't pre-pool cold paths.
- N+1 in Code — repeated work in a loop, the structural cousin of repeated allocation.
- Wrong Data Structure — the wrong collection often allocates badly and runs slow; aliasing on reuse links here.
- Profiling Techniques — the
profiling-techniquesskill;-benchmemis the entry point to allocation profiling. junior.md·senior.md— the headline case, then reducing allocation in a real hot path.
In this topic
- junior
- middle
- senior
- professional