Benchmarking Strategy: Writing Effective Benchmarks
A benchmark is only useful when it answers a question accurately. This file is about writing benchmarks that hold up — under varying inputs, under repeated runs, under reviewers — and avoiding the dozens of subtle traps that turn benchmarks into expensive coincidences.
1. Start with a question
Before writing one line of benchmark code, write down the question.
| Bad question | Better question |
|---|---|
| "Is Foo fast?" | "Is Foo faster than Bar for inputs in the range we ship?" |
| "How fast is JSON parsing?" | "What is the per-byte cost of json.Unmarshal on payloads of 1 KB, 10 KB, 100 KB?" |
| "Does pooling help?" | "Does pooling reduce allocations per request below 5 for handler H?" |
A precise question fixes the workload, the comparison target, and the success metric. Without these three, a benchmark is just a number.
2. Identify the real workload
Benchmarks must reflect production. Three dimensions:
Input size. Pull histogram data from logs or a profile. If 90% of requests are 1–4 KB and 10% are 100 KB+, benchmark both ends, not the median.
Input shape. Is the data random? Sorted? Mostly-empty? Repeating? A strings.IndexByte benchmark on "aaaaa..." measures a different path than on natural English text.
Concurrency. Is the function called once per request (single-threaded), once per CPU (parallel-bounded), or thousands of times concurrently (heavily contended)? Each needs a different benchmark form.
A useful exercise: list five real inputs the function sees in production. Make each one a sub-benchmark.
3. Choose a representative input table
var inputs = []struct {
name string
data []byte
}{
{"small_json_ok", loadFixture("small_ok.json")},
{"small_json_err", loadFixture("small_err.json")},
{"medium_json_typical", loadFixture("medium.json")},
{"large_json_typical", loadFixture("large.json")},
{"adversarial_deep", loadFixture("deep_nested.json")},
}
func BenchmarkParseJSON(b *testing.B) {
for _, in := range inputs {
b.Run(in.name, func(b *testing.B) {
b.SetBytes(int64(len(in.data)))
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_, _ = parse(in.data)
}
})
}
}
Now one benchmark answers five questions. The small_json_err case covers the error path, and the adversarial_deep case is especially useful: it catches worst-case regressions (such as deep-nesting handling) that "typical" cases miss.
4. Hoist setup correctly
The single most common benchmark mistake: setup measured as part of the work.
// BAD
func BenchmarkProcess(b *testing.B) {
for i := 0; i < b.N; i++ {
cfg := loadConfig("config.yaml") // 5 ms per iteration!
process(cfg)
}
}
// GOOD
func BenchmarkProcess(b *testing.B) {
cfg := loadConfig("config.yaml") // once, outside the timer
b.ResetTimer()
for i := 0; i < b.N; i++ {
process(cfg)
}
}
When you cannot hoist (because the function under test consumes its input destructively), use b.StopTimer/b.StartTimer around the per-iteration setup:
func BenchmarkConsume(b *testing.B) {
template := buildPayload()
for i := 0; i < b.N; i++ {
b.StopTimer()
input := slices.Clone(template)
b.StartTimer()
consume(input)
}
}
But: StopTimer/StartTimer have non-zero overhead (~50 ns each). For very fast benchmarks they can become 20% of the measurement. In those cases, prefer to fix the design — make consume non-destructive, or use a ring buffer of pre-built inputs.
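The ring-buffer alternative can be sketched as follows. This is a minimal sketch under stated assumptions: buildPayload and consume are hypothetical stand-ins for the fixture builder and the destructive function under test, and ringSize is an arbitrary value. Because consume destroys its input, the ring is refilled once every ringSize iterations, so the StopTimer/StartTimer cost is amortized over a whole batch instead of being paid per iteration.

```go
package main

import "testing"

// ringSize is a hypothetical pool size; tune it to your memory budget.
const ringSize = 256

// buildPayload is a hypothetical stand-in for the real fixture builder.
func buildPayload() []byte {
	p := make([]byte, 64)
	for i := range p {
		p[i] = byte(i + 1)
	}
	return p
}

// consume is a hypothetical destructive function: it sums its input
// and zeroes it, so the same slice cannot be reused without a refill.
func consume(p []byte) int {
	sum := 0
	for i, v := range p {
		sum += int(v)
		p[i] = 0
	}
	return sum
}

// sinkSum keeps the result observable so the call is not eliminated.
var sinkSum int

func BenchmarkConsumeRing(b *testing.B) {
	template := buildPayload()
	ring := make([][]byte, ringSize)
	for j := range ring {
		ring[j] = make([]byte, len(template))
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if i%ringSize == 0 {
			// Refill the whole ring with the timer stopped:
			// one Stop/Start pair per ringSize iterations.
			b.StopTimer()
			for j := range ring {
				copy(ring[j], template)
			}
			b.StartTimer()
		}
		sinkSum = consume(ring[i%ringSize])
	}
}
```

The index arithmetic (i%ringSize) is still inside the timed region, but it is far cheaper than a timer toggle on every iteration.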
5. Defeat dead-code elimination
The compiler may remove unused work. Use one of:
Package-level sink:
var sinkResult Result
func BenchmarkCompute(b *testing.B) {
for i := 0; i < b.N; i++ {
sinkResult = compute(input)
}
}
runtime.KeepAlive for allocations without a natural sink:
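A minimal sketch of the runtime.KeepAlive form, assuming a hypothetical newBuffer function stands in for the code under test. runtime.KeepAlive marks its argument as still reachable at the point of the call, which gives the result a use without a package-level variable. Whether the buffer lands on the heap or the stack is still decided by escape analysis, so check allocs/op rather than assuming.

```go
package main

import (
	"runtime"
	"testing"
)

// newBuffer is a hypothetical allocation-heavy function under test.
func newBuffer(n int) []byte {
	return make([]byte, n)
}

func BenchmarkNewBuffer(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := newBuffer(4096)
		// Treat buf as in use up to this point, so the result
		// of newBuffer is not simply discarded.
		runtime.KeepAlive(buf)
	}
}
```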
b.Loop() on Go 1.24+:
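A minimal sketch of the b.Loop form, which needs Go 1.24 or newer to compile. The compute function here is a hypothetical stand-in for the code under test. The testing package documents that calls inside a for b.Loop() body are not optimized away and that the timer is reset on the first call to Loop, so no explicit sink or ResetTimer is required in this form.

```go
package main

import "testing"

// compute is a hypothetical function under test: sum of squares 1..n.
func compute(n int) int {
	sum := 0
	for i := 1; i <= n; i++ {
		sum += i * i
	}
	return sum
}

func BenchmarkComputeLoop(b *testing.B) {
	input := 100 // setup before the loop is excluded from timing
	for b.Loop() {
		compute(input)
	}
}
```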
Verify the work runs. Two cheap checks:
- Disassemble: go test -c -o test.bin && go tool objdump -s 'BenchmarkCompute' test.bin.
- Profile: go test -bench=BenchmarkCompute -cpuprofile=cpu.out -run=^$ then go tool pprof -top cpu.out. The function under test should dominate.
6. Avoid microbench traps
A microbench measures a tiny operation in isolation. Some traps unique to that scale:
Cache effects. A benchmark that re-reads the same 1 KB array forever runs entirely in L1, while the production function sweeps a 1 GB working set. Rotate through a pool of inputs larger than the cache so the benchmark sees production-like misses.
Branch prediction. A benchmark that repeats if x > 0 with the same input every iteration trains the predictor; production data has varying branches. Use varied input.
Inlining decisions. The benchmark calls compute(x) directly; production calls it via an interface or function pointer. The compiler may inline one and not the other.
Constant folding. compute(42) with a literal argument can be folded at compile time; compute(arg) where arg comes from a function call cannot. Pass arguments through a var x = ... to break folding.
// Susceptible to folding
for i := 0; i < b.N; i++ {
_ = parse("hello world")
}
// Folding-resistant
var input = []byte("hello world")
for i := 0; i < b.N; i++ {
_ = parse(input)
}
7. Pre-allocate inputs
For benchmarks that consume an input slice or buffer:
const N = 1024
func BenchmarkSort(b *testing.B) {
data := make([][]int, b.N)
for i := range data {
data[i] = randomInts(N)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
sort.Ints(data[i])
}
}
Each iteration sorts a fresh, independent slice. The allocation cost is excluded by ResetTimer. If you used a single slice and re-randomized in the loop, you'd be measuring randomInts plus Sort — not what you wanted.
Catch: data itself is now b.N slices of N ints = 8 * N * b.N bytes. For b.N = 10M, that's 80 GB. Pre-allocate a pool of K independent inputs (e.g., K=1024) and rotate:
const PoolSize = 1024
data := make([][]int, PoolSize)
for i := range data {
data[i] = randomInts(N)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
sort.Ints(data[i%PoolSize])
}
Caveat: now you're sorting already sorted slices on later iterations. Restore each slice or use a fresh shuffle. There is no free lunch.
8. Allocation budgets
Decide ahead of time what the alloc budget should be. Encode it in a test:
func TestBenchmarkRouteAllocBudget(t *testing.T) {
result := testing.Benchmark(BenchmarkRoute)
if got := result.AllocsPerOp(); got > 3 {
t.Errorf("BenchmarkRoute: got %d allocs/op, budget 3", got)
}
if got := result.AllocedBytesPerOp(); got > 256 {
t.Errorf("BenchmarkRoute: got %d B/op, budget 256", got)
}
}
The test runs in go test ./... (no -bench flag needed because testing.Benchmark is just a function call). A PR that doubles allocations breaks the test. Reviewers see it before merge.
9. Concurrent benchmarks done right
b.RunParallel is for contended code. Three rules:
Don't share mutable state between iterations. Each pb.Next() call should be independent:
func BenchmarkLookup(b *testing.B) {
var m sync.Map
for i := 0; i < 1000; i++ {
m.Store(i, i)
}
b.RunParallel(func(pb *testing.PB) {
rng := rand.New(rand.NewSource(rand.Int63())) // per-goroutine RNG
for pb.Next() {
m.Load(rng.Intn(1000))
}
})
}
Vary the keys across goroutines. If every goroutine looks up the same key, you're measuring read-only cache coherence, not realistic load.
Don't use RunParallel for single-threaded code. It adds scheduler overhead and produces less-informative numbers.
10. Sub-benchmarks for scaling curves
The single best use of b.Run is to plot a scaling curve:
func BenchmarkSort(b *testing.B) {
for _, n := range []int{10, 100, 1000, 10000, 100000} {
b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
data := randomInts(n)
buf := make([]int, n)
b.ResetTimer()
for i := 0; i < b.N; i++ {
copy(buf, data)
sort.Ints(buf)
}
})
}
}
Plot ns/op vs n on log-log axes. A slope of 1 means linear, a slope slightly above 1 is consistent with n log n, and a slope of 2 means quadratic. Surprises here ("we expected O(n log n) but the slope is 1.4: what's the constant?") drive real investigations.
11. Statistical hygiene
Three rules that prevent most false-conclusion bugs:
Always -count=10 or more. A single sample is noise. Current versions of benchstat need at least 6 samples per benchmark to report a 95% confidence interval; 10 is the conventional default.
Run base and head on the same machine, in the same session, with similar warmup. Comparing yesterday's bench to today's is risky — kernel updates, BIOS firmware, even ambient temperature affect numbers.
Reject p ≥ 0.05 differences. A 4% improvement with p=0.30 is not real. Either gather more samples or accept the result is inconclusive.
go test -bench=. -count=10 -run=^$ ./... | tee old.txt
# make change
go test -bench=. -count=10 -run=^$ ./... | tee new.txt
benchstat old.txt new.txt
If you see ~ (not significantly different), the change had no measurable effect. Don't claim it did.
12. Workload realism techniques
| Technique | When |
|---|---|
| Read inputs from a fixtures file | Production-shaped JSON, protobuf payloads |
| Sample inputs from a recorded request log | Heavy-tailed distributions matter |
| Generate randomized inputs with a fixed seed | Stress + reproducibility |
| Use testing/quick for property-style inputs | Catches edge cases benchmarks miss |
| Replay anonymized production traffic | Highest fidelity, most expensive |
For most teams, fixture files plus seeded randomness covers 90% of the value. Recorded traffic is the gold standard for systems where input distribution dominates.
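Seeded randomness can be as small as the helper below. This is a sketch under stated assumptions: seededPayloads and its parameters are hypothetical names, and the uniform size distribution is a placeholder for whatever shape your logs actually show. The point is that a fixed seed makes every run, and every machine, benchmark the same bytes.

```go
package main

import "math/rand"

// seededPayloads returns count pseudo-random payloads whose lengths
// fall in [minLen, maxLen]. The same seed always yields the same data.
// It assumes 0 <= minLen <= maxLen.
func seededPayloads(seed int64, count, minLen, maxLen int) [][]byte {
	rng := rand.New(rand.NewSource(seed))
	out := make([][]byte, count)
	for i := range out {
		p := make([]byte, minLen+rng.Intn(maxLen-minLen+1))
		rng.Read(p) // fills p with pseudo-random bytes
		out[i] = p
	}
	return out
}
```

Build the payloads once before b.ResetTimer() and use each one as a sub-benchmark input or rotate through them in the loop.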
13. When the function does I/O
Pure-CPU benchmarks are easy. Benchmarks of code that touches the network, disk, or kernel are tricky because their cost depends on system state outside the benchmark.
Strategy: separate the CPU and I/O costs.
func BenchmarkHTTPHandler(b *testing.B) {
handler := newHandler()
req := httptest.NewRequest("GET", "/users/42", nil)
rec := httptest.NewRecorder()
b.ResetTimer()
for i := 0; i < b.N; i++ {
handler.ServeHTTP(rec, req)
rec.Body.Reset()
}
}
httptest.NewRecorder eliminates the kernel and network from the measurement. You're measuring the handler's CPU and allocation behavior in isolation — which is what's reproducible across machines.
For end-to-end timing (with real network, real disk), use a load tester (vegeta, hey, wrk) — not testing.B.
14. The opaque-input trick
When the compiler is too clever — folding constants, inlining everything, proving allocations away — break the chain with an explicit barrier:
//go:noinline
func opaque[T any](v T) T { return v }
func BenchmarkCompute(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = compute(opaque(42))
}
}
opaque is //go:noinline-marked, so the compiler cannot fold the argument through it. Use sparingly — it's a hack — but it's the cleanest way to defeat over-optimization in some Go versions.
15. A checklist before you commit a benchmark
- The benchmark name is descriptive (BenchmarkRouteRequest_LargePayload, not BenchmarkX).
- The input shape is documented (a comment names the fixture, or the code makes it obvious).
- b.ResetTimer() is called after setup, before the loop.
- The loop runs exactly b.N times (or uses b.Loop()).
- Returns are assigned to a sink, or b.Loop() is used.
- b.ReportAllocs() is on, or -benchmem is part of the team's standard invocation.
- You ran it with -count=10 and checked variance.
- If it's a comparison, you ran benchstat and noted the p-value.
- If on a tracked path, you added it to the tracked suite (build tag or name prefix).
- An alloc budget test exists if zero/low allocations are the goal.
16. Common output anti-patterns
| Output | What it suggests |
|---|---|
| 0.31 ns/op | Compiler removed the work. Add a sink. |
| 10000 iterations, ns/op fluctuates 50% | Benchmark is too noisy; -benchtime=5s, more -count, or hardware issue. |
| allocs/op jumps from 0 to 4 with no code change | A dependency update added boxing; check go.sum. |
| MB/s falling as size shrinks | Per-call overhead dominates; the function isn't I/O-bound at small sizes. |
| Sub-benchmarks with wildly different variance | One is dominated by setup, one isn't. Hoist consistently. |
| b.N ends very small (e.g., 100) | Each iteration is slow (ms+); use longer -benchtime for stability. |
17. Summary
Effective benchmarks start with a precise question, use a representative workload, and defend against the compiler removing work. Hoist setup, use b.ResetTimer, pin sinks, sweep input sizes with b.Run, and always confirm significance with -count=10 and benchstat. Allocation budgets enforced as tests are the cheapest way to lock in gains. Microbench wins that ignore production context are worse than no benchmark at all.
Further reading
- Damian Gryski, go-perfbook: https://github.com/dgryski/go-perfbook
- Dave Cheney, High Performance Go: https://dave.cheney.net/high-performance-go-workshop/dotgo-paris.html
- Go blog — Profile-Guided Optimization: https://go.dev/blog/pgo
- runtime.KeepAlive: https://pkg.go.dev/runtime#KeepAlive
- go.dev/blog/subtests: https://go.dev/blog/subtests