Benchmark Deep — Find the Bug
Each snippet compiles and produces a number. The number is wrong. Identify the bug, explain why the benchmark lies, and write the corrected version.
Bug 1 — The vanishing call
The result is suspiciously fast. What did the compiler do, and how do you prevent it? Hint: there are two separate problems here — one about the input, one about the output.
Answer sketch: the input is a string constant so the byte slice is hoisted out of the loop and the compiler can prove the result is unused. Fix by introducing a per-iteration input (e.g. []byte(strconv.Itoa(i))) and assigning the return value to a package-level var sink [32]byte.
Bug 2 — Timer not reset
func BenchmarkParse(b *testing.B) {
	data, _ := os.ReadFile("big.json")
	for i := 0; i < b.N; i++ {
		var out map[string]any
		json.Unmarshal(data, &out)
	}
}
Why does b.N come out tiny and the reported ns/op huge for the first run? What is missing?
Answer sketch: os.ReadFile is inside the timer window. For small b.N values the framework picks during scaling, that I/O dominates. Add b.ResetTimer() between the read and the loop. For extra credit, move the read above b.Run if you are in a sub-benchmark.
Bug 3 — Setup inside the loop
func BenchmarkSort(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := rand.Perm(10000)
		sort.Ints(s)
	}
}
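rand.Perm(10000) allocates a fresh 10000-element slice and runs a Fisher–Yates shuffle on every iteration. The benchmark is measuring shuffle + sort, not sort. Fix without losing the property that each iteration sees a different input order.
Answer sketch: precompute a pool of permutations outside the timer (one per iteration if memory allows, or a small pool that you cycle through), or generate a single permutation and copy it into a reusable scratch buffer inside a b.StopTimer/b.StartTimer pair on each iteration. Either way sort.Ints never sees already-sorted input; only the pool gives each iteration a different order.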
Bug 4 — The optimised-away accumulator
var sink int
func BenchmarkSum(b *testing.B) {
	xs := []int{1, 2, 3, 4, 5}
	for i := 0; i < b.N; i++ {
		s := 0
		for _, x := range xs {
			s += x
		}
		sink = s
	}
}
This looks correct — there is a sink. But the reported time is still way too small. Why?
Answer sketch: xs is a tiny slice with constant contents, so the inner sum is loop-invariant. The compiler is free to compute it once and hoist it out of the outer loop, leaving little more than the store to sink to be timed. Use an input that varies with i (e.g. xs[i%len(xs)] driving the work) or randomise xs per iteration in StopTimer'd setup.
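A sketch of the "input varies with i" fix. Writing `i` into one element per iteration is just one possible way to break loop invariance; the helper name `sum` is illustrative.

```go
package main

import (
	"fmt"
	"testing"
)

var sink int

// sum adds up the elements of xs.
func sum(xs []int) int {
	s := 0
	for _, x := range xs {
		s += x
	}
	return s
}

func benchmarkSum(b *testing.B) {
	xs := []int{1, 2, 3, 4, 5}
	for i := 0; i < b.N; i++ {
		// One element changes every iteration, so the sum depends on i
		// and cannot be computed once outside the loop.
		xs[i%len(xs)] = i
		sink = sum(xs)
	}
}

func main() {
	fmt.Println(testing.Benchmark(benchmarkSum))
}
```

The extra store and modulo are part of the measured time; for a five-element sum that overhead is not negligible, so a larger input gives a more meaningful number.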
Bug 5 — Wrong baseline for benchstat
go test -bench BenchmarkNew -count=10 > new.txt
git stash
go test -bench BenchmarkOld -count=10 > old.txt
git stash pop
benchstat old.txt new.txt
What is wrong with this comparison workflow?
Answer sketch: benchstat matches results by benchmark name. The names differ (BenchmarkOld vs BenchmarkNew), so benchstat reports two unrelated benchmarks rather than a paired comparison. Give the benchmark the same name in both versions, or rewrite the name in one of the result files before running benchstat.
Bug 6 — Single-sample comparison
benchstat prints ~ for everything. The dev says "the test is broken". What is actually wrong?
Answer sketch: each file has -count=1 so benchstat has only one sample per benchmark and cannot compute a confidence interval — it correctly refuses to claim significance. Re-run with -count=10 or more.
Bug 7 — Frequency-scaled laptop
A benchmark reports 12 ns/op in the morning and 19 ns/op after lunch. The dev concludes their change is non-deterministic. Diagnose.
Answer sketch: the CPU went into turbo while cool and dropped out under thermal load. Mitigate by disabling turbo (echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo on Linux), pinning to a single core with taskset, and running -count=10 in a warmed-up state. Or move the workload to a server with frequency lock.
Bug 8 — GC pause hides in the mean
A benchmark reports a stable 200 ns/op mean across 10 runs, but tail latency in production is terrible. Why is the mean lying?
Answer sketch: GC pauses are rare relative to b.N so they get averaged out. Switch to b.ReportMetric with p95/p99 by collecting per-iteration time.Now() deltas, or sample /gc/pauses:seconds from runtime/metrics in parallel. The mean is the wrong statistic for a latency-sensitive benchmark.
Bug 9 — Captured loop variable
Two bugs at once. Name both.
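Answer sketch: (1) classic captured-variable bug: every goroutine shares the single loop variable i and typically sees a later or final value. Fix with i := i or by passing i as a parameter (modules that declare go 1.22 or later get a fresh i per iteration, so this only bites on older settings). (2) the timer covers the goroutine launches, but the work runs concurrently and may finish after the benchmark returns, so the benchmark measures scheduling, not work. Either run serially or use a sync.WaitGroup and include the wait inside the timer.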
Bug 10 — Benchcmp output misread
A dev shows benchcmp old.txt new.txt reporting "-12% +/-?" and merges the PR. Why is this risky?
Answer sketch: benchcmp does not do a statistical test and is deprecated. The 12% delta may be entirely within noise. Re-run with benchstat and check that p < 0.05.
Bug 11 — The non-resetting timer in a sub-benchmark
func BenchmarkAll(b *testing.B) {
	data := loadHugeFixture()
	for _, n := range []int{1, 10, 100, 1000} {
		b.Run(fmt.Sprintf("N=%d", n), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink = process(data, n)
			}
		})
	}
}
The outer loadHugeFixture happens once, fine. But why are the small-N sub-benches dominated by setup time anyway?
Answer sketch: each b.Run sub-benchmark gets its own *testing.B with its own timer and its own b.N selection, and the framework starts that timer when the sub-bench closure is invoked. Any per-sub-bench setup inside the closure (e.g. prep := make([]int, n)) is therefore timed, and for small n it can outweigh the work. Add b.ResetTimer() inside the closure, after the setup and before the loop.
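A sketch of a sub-benchmark whose own setup is excluded from the timer. The fixture and `process` here are small hypothetical stand-ins for `loadHugeFixture` and the real workload.

```go
package main

import (
	"fmt"
	"testing"
)

var sink int

// process is a hypothetical workload: it sums the first n elements.
func process(data []int, n int) int {
	total := 0
	for _, v := range data[:n] {
		total += v
	}
	return total
}

func benchmarkAll(b *testing.B) {
	// Stand-in for loadHugeFixture(): runs once, outside every sub-benchmark.
	fixture := make([]int, 1000)
	for i := range fixture {
		fixture[i] = i
	}
	for _, n := range []int{1, 10, 100, 1000} {
		b.Run(fmt.Sprintf("N=%d", n), func(b *testing.B) {
			prep := make([]int, n) // per-sub-benchmark setup
			copy(prep, fixture)
			b.ResetTimer() // exclude the setup above from this sub-benchmark
			for i := 0; i < b.N; i++ {
				sink = process(prep, n)
			}
		})
	}
}

func main() {
	// Outside "go test" the sub-benchmark results are folded into one estimate.
	fmt.Println(testing.Benchmark(benchmarkAll))
}
```

Under `go test -bench` each `N=...` case is reported on its own line, which is the form you would normally read.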
Bug 12 — Allocs counted from a defer
trackCall does a fmt.Sprintf. The benchmark reports 1 allocs/op and the dev cannot find the source. Where is the alloc?
Answer sketch: trackCall itself allocates via fmt.Sprintf, and the defer can add a heap-allocated defer record on top, so the report may show 1-2 allocs/op. Remove or refactor trackCall, and keep the defer simple: since Go 1.14 simple defers are open-coded and allocation-free, while defers the compiler cannot open-code (for example a defer inside a loop) can still allocate.
Bug 13 — Pre-allocation that perturbs cache
func BenchmarkLookup(b *testing.B) {
	m := buildMap(10000)
	keys := buildKeys(10000)
	// 1MB of warmup data:
	warm := make([]byte, 1<<20)
	for i := range warm { warm[i] = byte(i) }
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = m[keys[i%len(keys)]]
	}
}
The dev added warm to "warm up the cache" but the bench got slower. Why?
Answer sketch: warm is 1 MB of otherwise unused data, and writing it competes with the map's buckets for cache capacity: filling warm evicts the buckets that building the map had just pulled in. The timed loop therefore starts with a cold cache for the map, not a warm one. Remove warm, or write a real warm-up that touches m[keys[...]] in a loop before ResetTimer.
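A sketch of a warm-up that touches the data the timed loop will actually use. `buildKeys` and `buildMap` are hypothetical implementations (the originals are not shown, and the signature of `buildMap` here differs from the one in the snippet).

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

var sink int

// buildKeys returns the keys "0".."n-1". (Hypothetical helper.)
func buildKeys(n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = strconv.Itoa(i)
	}
	return keys
}

// buildMap maps each key to its index. (Hypothetical helper.)
func buildMap(keys []string) map[string]int {
	m := make(map[string]int, len(keys))
	for i, k := range keys {
		m[k] = i
	}
	return m
}

func benchmarkLookup(b *testing.B) {
	keys := buildKeys(10000)
	m := buildMap(keys)
	// Real warm-up: read the same keys and buckets the timed loop reads.
	for _, k := range keys {
		sink = m[k]
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = m[keys[i%len(keys)]]
	}
}

func main() {
	fmt.Println(testing.Benchmark(benchmarkLookup))
}
```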
Bug 14 — The bench that times out
func BenchmarkSlowDB(b *testing.B) {
	db := connectDB(b)
	for i := 0; i < b.N; i++ {
		sink = db.Query("SELECT * FROM big_table")
	}
}
go test -bench=BenchmarkSlowDB runs forever and times out at 10 minutes. The dev concludes "the DB is broken." What is actually happening?
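Answer sketch: the framework predicts the next b.N from the previous pass and grows it until a pass lasts at least -benchtime (1s by default). With iterations that take seconds, one overshoot is fatal: a pass with b.N=100 at 6 s per query takes 10 minutes, which is exactly go test's default -timeout. Use -benchtime=10x (a fixed iteration count, so b.N never grows) and raise -timeout if necessary; note that a longer duration such as -benchtime=10s asks for more iterations, not fewer. Or, better, do not microbench against a real DB; mock the slow parts.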
Bug 15 — The b.N used as input
func BenchmarkInsert(b *testing.B) {
	list := newList()
	for i := 0; i < b.N; i++ {
		list.Insert(b.N - i)
	}
}
The "input size" is b.N itself. As b.N grows during stabilisation the benchmark measures inserts into ever-larger lists. The reported ns/op is meaningless — it averages costs across input sizes that differ by orders of magnitude.
Fix: separate input size from iteration count. Use a sub-bench parameterised by size:
for _, n := range []int{100, 10000, 1000000} {
	b.Run(fmt.Sprintf("N=%d", n), func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			b.StopTimer()
			list := newList()
			b.StartTimer()
			for j := 0; j < n; j++ {
				list.Insert(j)
			}
		}
	})
}
Now b.N controls how many times you build a fresh list of size n, which is what you actually wanted.
Bug 16 — Cross-benchmark state leak
// in one bench file:
var globalCache map[string]*Resource
func BenchmarkA(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = lookup("key")
	}
}
// in another bench file of the same package:
func BenchmarkB(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = lookup("key") // sees the cache BenchmarkA filled
	}
}
globalCache is populated by BenchmarkA and BenchmarkB sees the hot cache. This requires both benchmarks to live in the same package: go test builds one test binary per package and runs it as a separate process, so a Go variable cannot leak between packages. Run order matters; reverse the order and the numbers flip. What is the right fix?
Answer sketch: globals shared across benches are an antipattern. Reset the global in b.Cleanup, or refactor so the cache is passed as a parameter. The fundamental cause is the global itself.
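A sketch of the b.Cleanup variant. `Resource` and `lookup` are hypothetical implementations of the names used in the snippet; the only point is that the benchmark leaves the global cache empty for whichever benchmark runs next.

```go
package main

import (
	"fmt"
	"testing"
)

// Resource is a hypothetical cached value.
type Resource struct{ name string }

var globalCache = map[string]*Resource{}

var sink *Resource

// lookup returns the cached Resource for key, creating it on a miss.
func lookup(key string) *Resource {
	if r, ok := globalCache[key]; ok {
		return r
	}
	r := &Resource{name: key}
	globalCache[key] = r
	return r
}

func benchmarkA(b *testing.B) {
	// Reset the shared state when this benchmark function returns.
	b.Cleanup(func() { globalCache = map[string]*Resource{} })
	for i := 0; i < b.N; i++ {
		sink = lookup("key")
	}
}

func main() {
	fmt.Println(testing.Benchmark(benchmarkA))
	fmt.Println("entries left in cache:", len(globalCache))
}
```

Passing the cache into `lookup` as a parameter removes the need for the cleanup altogether, which is why the refactor is usually the better long-term fix.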
Bug 17 — Misuse of -cpu=1
A dev sees BenchmarkParallel-1 and BenchmarkParallel-8 and benchstat-compares them. Why is this wrong?
Answer sketch: benchstat compares the same benchmark across two files, treating the -N suffix as part of the name. The two rows are different benchmarks; benchstat reports them as unrelated, not as a scaling comparison. The right tool for scaling analysis is to extract the ns/op manually and plot.
Bug 18 — Forgetting b.Cleanup on a leaked goroutine
func BenchmarkWithBackground(b *testing.B) {
	go backgroundLoop() // never stops
	for i := 0; i < b.N; i++ {
		sink = foreground()
	}
}
Each invocation of the benchmark (the framework calls it many times during stabilisation) spawns a new background goroutine. Memory and CPU leak across iterations. By the end of -count=10 you may have hundreds of background goroutines polluting the bench. Fix with b.Cleanup: