Benchmarking Strategy: Writing Effective Benchmarks

A benchmark is only useful when it answers a question accurately. This file is about writing benchmarks that hold up — under varying inputs, under repeated runs, under reviewers — and avoiding the dozens of subtle traps that turn benchmarks into expensive coincidences.

1. Start with a question

Before writing one line of benchmark code, write down the question.

Bad question	Better question
"Is `Foo` fast?"	"Is `Foo` faster than `Bar` for inputs in the range we ship?"
"How fast is JSON parsing?"	"What is the per-byte cost of `json.Unmarshal` on payloads of 1 KB, 10 KB, 100 KB?"
"Does pooling help?"	"Does pooling reduce allocations per request below 5 for handler `H`?"

A precise question fixes the workload, the comparison target, and the success metric. Without these three, a benchmark is just a number.

2. Identify the real workload

Benchmarks must reflect production. Three dimensions:

Input size. Pull histogram data from logs or a profile. If 90% of requests are 1–4 KB and 10% are 100 KB+, benchmark both ends, not the median.

Input shape. Is the data random? Sorted? Mostly-empty? Repeating? A strings.IndexByte benchmark on "aaaaa..." measures a different path than on natural English text.

Concurrency. Is the function called once per request (single-threaded), once per CPU (parallel-bounded), or thousands of times concurrently (heavily contended)? Each needs a different benchmark form.

A useful exercise: list five real inputs the function sees in production. Make each one a sub-benchmark.

3. Choose a representative input table

var inputs = []struct {
    name string
    data []byte
}{
    {"small_json_ok",       loadFixture("small_ok.json")},
    {"small_json_err",      loadFixture("small_err.json")},
    {"medium_json_typical", loadFixture("medium.json")},
    {"large_json_typical",  loadFixture("large.json")},
    {"adversarial_deep",    loadFixture("deep_nested.json")},
}

func BenchmarkParseJSON(b *testing.B) {
    for _, in := range inputs {
        b.Run(in.name, func(b *testing.B) {
            b.SetBytes(int64(len(in.data)))
            b.ReportAllocs()
            for i := 0; i < b.N; i++ {
                _, _ = parse(in.data)
            }
        })
    }
}

Now one benchmark answers five questions. The adversarial_deep case is especially useful: it catches regressions on deeply nested input that "typical" cases miss, while small_json_err keeps the error path covered.

4. Hoist setup correctly

The single most common benchmark mistake: setup measured as part of the work.

// BAD
func BenchmarkProcess(b *testing.B) {
    for i := 0; i < b.N; i++ {
        cfg := loadConfig("config.yaml")   // 5 ms per iteration!
        process(cfg)
    }
}

// GOOD
func BenchmarkProcess(b *testing.B) {
    cfg := loadConfig("config.yaml")       // once, outside the timer
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        process(cfg)
    }
}

When you cannot hoist (because the function under test consumes its input destructively), use b.StopTimer/b.StartTimer around the per-iteration setup:

func BenchmarkConsume(b *testing.B) {
    template := buildPayload()
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        input := slices.Clone(template)
        b.StartTimer()
        consume(input)
    }
}

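But: StopTimer and StartTimer are not free. In current Go implementations each call reads runtime memory statistics, which costs far more than a nanosecond-scale operation. For very fast benchmarks the toggling can distort the measurement and dominate the wall-clock run time. In those cases, prefer to fix the design: make consume non-destructive, or use a ring buffer of pre-built inputs.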

5. Defeat dead-code elimination

The compiler may remove unused work. Use one of:

Package-level sink:

var sinkResult Result

func BenchmarkCompute(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sinkResult = compute(input)
    }
}

runtime.KeepAlive for allocations without a natural sink:

for i := 0; i < b.N; i++ {
    buf := make([]byte, 4096)
    runtime.KeepAlive(buf)
}

b.Loop() on Go 1.24+:

for b.Loop() {
    _ = compute(input)
}

Verify the work runs. Two cheap checks:

Disassemble: go test -c -o test.bin && go tool objdump -s 'BenchmarkCompute' test.bin.
Profile: go test -bench=BenchmarkCompute -cpuprofile=cpu.out -run=^$ then go tool pprof -top cpu.out. The function under test should dominate.

6. Avoid microbench traps

A microbench measures a tiny operation in isolation. Some traps unique to that scale:

Cache effects. A benchmark that re-reads the same 1 KB array forever runs entirely in L1; the production function may sweep 1 GB of working set. Rotate through a pool of inputs larger than the cache level you care about. A warm-up pass alone does not help here, because it only makes the cache hotter.

Branch prediction. A benchmark that repeats if x > 0 with the same input every iteration trains the predictor; production data has varying branches. Use varied input.

Inlining decisions. The benchmark calls compute(x) directly; production calls it via an interface or function pointer. The compiler may inline one and not the other.

Constant folding. compute(42) with a literal argument can be folded at compile time once compute is inlined; compute(arg) where arg is loaded from a package-level variable cannot. Pass arguments through a package-level var x = ... to break folding.

// Susceptible to folding: the argument is a compile-time constant
for i := 0; i < b.N; i++ {
    _ = parse([]byte("hello world"))
}

// Folding-resistant: input is a package-level variable,
// declared outside the benchmark function
var input = []byte("hello world")

for i := 0; i < b.N; i++ {
    _ = parse(input)
}
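The cache-effects trap can be made visible by running the same function over one hot buffer and over a rotating pool. This is a sketch under stated assumptions: `sum`, `newPool`, `sinkSum`, and the sizes are hypothetical, and the 64 MB pool is assumed to exceed the last-level cache of the machine under test.

```go
package main

import "testing"

const (
	bufSize  = 1 << 10 // 1 KB per buffer
	poolSize = 1 << 16 // 65,536 buffers = 64 MB, assumed larger than the last-level cache
)

var sinkSum uint64 // package-level sink so the work is not eliminated

// sum is a hypothetical memory-bound function under test.
func sum(p []byte) uint64 {
	var s uint64
	for _, v := range p {
		s += uint64(v)
	}
	return s
}

// newPool allocates n independent buffers filled with non-zero data.
func newPool(n int) [][]byte {
	pool := make([][]byte, n)
	for i := range pool {
		pool[i] = make([]byte, bufSize)
		for j := range pool[i] {
			pool[i][j] = byte(i + j)
		}
	}
	return pool
}

// BenchmarkSumHot re-reads one buffer, which stays resident in L1.
func BenchmarkSumHot(b *testing.B) {
	pool := newPool(1)
	b.SetBytes(bufSize)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sinkSum = sum(pool[0])
	}
}

// BenchmarkSumRotating walks a working set larger than the cache.
func BenchmarkSumRotating(b *testing.B) {
	pool := newPool(poolSize)
	b.SetBytes(bufSize)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sinkSum = sum(pool[i%poolSize])
	}
}
```

Sequential rotation is friendly to hardware prefetchers, so the gap between the two benchmarks is a lower bound on the cache effect; indexing the pool in a shuffled order would be a stricter variant.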

7. Pre-allocate inputs

For benchmarks that consume an input slice or buffer:

const N = 1024

func BenchmarkSort(b *testing.B) {
    data := make([][]int, b.N)
    for i := range data {
        data[i] = randomInts(N)
    }
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        sort.Ints(data[i])
    }
}

Each iteration sorts a fresh, independent slice. The allocation cost is excluded by ResetTimer. If you used a single slice and re-randomized it in the loop, you'd be measuring randomInts plus sort.Ints, which is not what you wanted.

Catch: data itself is now b.N slices of N ints = 8 * N * b.N bytes on a 64-bit platform. For N = 1024 and b.N = 10M, that's roughly 80 GB. Instead, pre-allocate a pool of K independent inputs (e.g., K=1024) and rotate:

const PoolSize = 1024
data := make([][]int, PoolSize)
for i := range data {
    data[i] = randomInts(N)
}
b.ResetTimer()

for i := 0; i < b.N; i++ {
    sort.Ints(data[i%PoolSize])
}

Caveat: now you're sorting already sorted slices on later iterations. Restore each slice or use a fresh shuffle. There is no free lunch.

8. Allocation budgets

Decide ahead of time what the alloc budget should be. Encode it in a test:

func TestBenchmarkRouteAllocBudget(t *testing.T) {
    result := testing.Benchmark(BenchmarkRoute)
    if got := result.AllocsPerOp(); got > 3 {
        t.Errorf("BenchmarkRoute: got %d allocs/op, budget 3", got)
    }
    if got := result.AllocedBytesPerOp(); got > 256 {
        t.Errorf("BenchmarkRoute: got %d B/op, budget 256", got)
    }
}

The test runs in go test ./... (no -bench flag needed because testing.Benchmark is just a function call). A PR that doubles allocations breaks the test. Reviewers see it before merge.
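When only the allocation count matters, `testing.AllocsPerRun` is a lighter alternative that avoids running a full timed benchmark inside the test. This is a sketch of that alternative, not the approach shown above; `joinPath`, `sinkPath`, and the budget of 1 are hypothetical.

```go
package main

import (
	"strings"
	"testing"
)

var sinkPath string // package-level sink so the result escapes

// joinPath is a hypothetical stand-in for the code under budget.
func joinPath(parts []string) string {
	return strings.Join(parts, "/")
}

func TestJoinPathAllocBudget(t *testing.T) {
	parts := []string{"users", "42", "profile"}
	const budget = 1

	// AllocsPerRun reports the average number of allocations per call of f.
	got := testing.AllocsPerRun(1000, func() {
		sinkPath = joinPath(parts)
	})
	if got > budget {
		t.Errorf("joinPath: got %.0f allocs/op, budget %d", got, budget)
	}
}
```

AllocsPerRun reports only the number of allocations, not bytes, so a B/op budget still needs the testing.Benchmark form.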

9. Concurrent benchmarks done right

b.RunParallel is for contended code. Three rules:

Don't share mutable state between iterations. Each pb.Next() call should be independent:

func BenchmarkLookup(b *testing.B) {
    var m sync.Map
    for i := 0; i < 1000; i++ {
        m.Store(i, i)
    }
    b.RunParallel(func(pb *testing.PB) {
        rng := rand.New(rand.NewSource(rand.Int63()))  // per-goroutine RNG
        for pb.Next() {
            m.Load(rng.Intn(1000))
        }
    })
}

Vary the keys across goroutines. If every goroutine looks up the same key, you're measuring read-only cache coherence, not realistic load.

Don't use RunParallel for single-threaded code. It adds scheduler overhead and produces less-informative numbers.

10. Sub-benchmarks for scaling curves

The single best use of b.Run is to plot a scaling curve:

func BenchmarkSort(b *testing.B) {
    for _, n := range []int{10, 100, 1000, 10000, 100000} {
        b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
            data := randomInts(n)
            buf := make([]int, n)
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                copy(buf, data)
                sort.Ints(buf)
            }
        })
    }
}

Plot ns/op vs n on log-log axes. A slope of 1 means linear, a slope slightly above 1 means n log n, and a slope of 2 means quadratic. Constant factors shift the line up or down but do not change its slope. Surprises here ("we expected O(n log n) but the slope is 1.4; what else is growing?") drive real investigations.

11. Statistical hygiene

Three rules that prevent most false-conclusion bugs:

Always -count=10 or more. A single sample is noise. benchstat warns that it needs at least 6 samples to compute a 95% confidence interval; 10 is the conventional default.

Run base and head on the same machine, in the same session, with similar warmup. Comparing yesterday's bench to today's is risky — kernel updates, BIOS firmware, even ambient temperature affect numbers.

Reject p ≥ 0.05 differences. A 4% improvement with p=0.30 is not real. Either gather more samples or accept the result is inconclusive.

go test -bench=. -count=10 -run=^$ ./... | tee old.txt
# make change
go test -bench=. -count=10 -run=^$ ./... | tee new.txt
benchstat old.txt new.txt

If you see ~ (not significantly different), the change had no measurable effect. Don't claim it did.

12. Workload realism techniques

Technique	When
Read inputs from a fixtures file	Production-shaped JSON, protobuf payloads
Sample inputs from a recorded request log	Heavy-tailed distributions matter
Generate randomized inputs with a fixed seed	Stress + reproducibility
Use `testing/quick` for property-style inputs	Catches edge cases benchmarks miss
Replay anonymized production traffic	Highest fidelity, most expensive

For most teams, fixture files plus seeded randomness covers 90% of the value. Recorded traffic is the gold standard for systems where input distribution dominates.
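Seeded randomness from the table above can be sketched as follows. The `seededPayloads` function and its parameters are hypothetical; the uniform size distribution is an assumption and would be replaced by one fitted to production data.

```go
package main

import "math/rand"

// seededPayloads returns count pseudo-random payloads whose sizes are drawn
// uniformly from [minSize, maxSize]. The same seed always yields the same
// payloads, so runs on different days or machines see identical inputs.
func seededPayloads(seed int64, count, minSize, maxSize int) [][]byte {
	rng := rand.New(rand.NewSource(seed))
	out := make([][]byte, count)
	for i := range out {
		size := minSize + rng.Intn(maxSize-minSize+1)
		out[i] = make([]byte, size)
		for j := range out[i] {
			out[i][j] = byte(rng.Intn(256))
		}
	}
	return out
}
```

Generate the payloads before b.ResetTimer and record the seed next to the benchmark, so a surprising result can be reproduced later with the same inputs.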

13. When the function does I/O

Pure-CPU benchmarks are easy. Benchmarks of code that touches the network, disk, or kernel are tricky because their cost depends on system state outside the benchmark.

Strategy: separate the CPU and I/O costs.

func BenchmarkHTTPHandler(b *testing.B) {
    handler := newHandler()
    req := httptest.NewRequest("GET", "/users/42", nil)
    rec := httptest.NewRecorder()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        handler.ServeHTTP(rec, req)
        rec.Body.Reset()
    }
}

httptest.NewRecorder eliminates the kernel and network from the measurement. You're measuring the handler's CPU and allocation behavior in isolation — which is what's reproducible across machines.

For end-to-end timing (with real network, real disk), use a load tester (vegeta, hey, wrk) — not testing.B.

14. The opaque-input trick

When the compiler is too clever — folding constants, inlining everything, proving allocations away — break the chain with an explicit barrier:

//go:noinline
func opaque[T any](v T) T { return v }

func BenchmarkCompute(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Store into the package-level sink from section 5 so the result stays live.
        sinkResult = compute(opaque(42))
    }
}

opaque is marked //go:noinline, so the compiler cannot see the constant 42 through the call and cannot fold it into compute. Use sparingly, since it's a hack, but it's the cleanest way to defeat over-optimization in some Go versions.

15. A checklist before you commit a benchmark

16. Common output anti-patterns

Output	What it suggests
`0.31 ns/op`	Compiler removed the work. Add a sink.
`10000 iterations, ns/op fluctuates 50%`	Benchmark is too noisy; `-benchtime=5s`, more `-count`, or hardware issue.
`allocs/op` jumps from 0 to 4 with no code change	A dependency or toolchain update changed escape analysis or added boxing; check `go.mod`, `go.sum`, and the Go version.
`MB/s` falling as size shrinks	Per-call overhead dominates; the function isn't throughput-bound at small sizes.
Sub-benchmarks with wildly different variance	One is dominated by setup, one isn't. Hoist consistently.
`b.N` ends very small (e.g., 100)	Each iteration is slow (ms+); use longer `-benchtime` for stability.

17. Summary

Effective benchmarks start with a precise question, use a representative workload, and defend against the compiler removing work. Hoist setup, use b.ResetTimer, pin sinks, sweep input sizes with b.Run, and always confirm significance with -count=10 and benchstat. Allocation budgets enforced as tests are the cheapest way to lock in gains. Microbench wins that ignore production context are worse than no benchmark at all.