Benchmarking Strategy: Specification

Focus: Precise reference for Go's built-in benchmarking facilities — the testing.B API, the go test -bench driver, output format, and the benchstat companion tool used for statistical comparison.

Sources:
- testing package documentation: https://pkg.go.dev/testing
- go test command reference: https://pkg.go.dev/cmd/go#hdr-Testing_flags
- benchstat tool: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
- Go blog (testing): https://go.dev/blog/subtests
- Dave Cheney, High Performance Go Workshop: https://dave.cheney.net/high-performance-go-workshop/dotgo-paris.html

1. What `testing.B` is

A benchmark function has the signature

func BenchmarkXxx(b *testing.B)

and lives in a _test.go file. The go test driver discovers it, invokes it with successively larger values of b.N, and measures wall-clock time per operation. The result is reported in nanoseconds per operation (ns/op) and, when requested, bytes and allocations per operation.

Benchmarks share the test binary with TestXxx and ExampleXxx functions but are never run by go test alone; they require the -bench flag.
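As a minimal sketch of the shape described above, the following program defines one benchmark and runs it through `testing.Benchmark`, which applies the same ramp-up outside of `go test`. The function `sum` and the helper `runSum` are hypothetical names invented for this illustration; in a real project `BenchmarkSum` would sit in a `_test.go` file and be run with `-bench`.

```go
package main

import (
	"fmt"
	"testing"
)

// sum is a hypothetical function under test; it is not part of any real library.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// sink keeps the result observable so the compiler cannot discard the call.
var sink int

// BenchmarkSum has the signature the driver looks for in a _test.go file.
func BenchmarkSum(b *testing.B) {
	xs := make([]int, 1024)
	for i := range xs {
		xs[i] = i
	}
	b.ResetTimer() // exclude the slice setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = sum(xs)
	}
}

// runSum runs the benchmark programmatically and returns the final b.N.
func runSum() int {
	return testing.Benchmark(BenchmarkSum).N
}

func main() {
	// Prints the iteration count and ns/op, similar to one line of go test output.
	fmt.Println(testing.Benchmark(BenchmarkSum))
}
```

The printed numbers depend on the machine, so no expected output is shown.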

2. The `testing.B` API surface

Member	Purpose
`b.N`	The number of iterations the driver wants you to run. Loop body must run `b.N` times.
`b.ResetTimer()`	Zero the elapsed time and memory-allocation counters and delete user-reported metrics; call after expensive setup. Does not change whether the timer is running.
`b.StartTimer()`	(Re)start the timer (automatic at function entry).
`b.StopTimer()`	Pause the timer for setup/teardown inside the loop.
`b.ReportAllocs()`	Force `-benchmem` style reporting for this benchmark only.
`b.ReportMetric(value, unit)`	Add a custom metric column to the output (e.g., `req/s`, `MB/s`).
`b.SetBytes(n)`	Declare bytes processed per op; enables the `MB/s` column.
`b.Run(name, fn)`	Sub-benchmark; enables table-driven measurement.
`b.RunParallel(fn)`	Distribute `b.N` across goroutines for concurrent benchmarks.
`b.SetParallelism(p)`	Multiplier on `GOMAXPROCS` for `RunParallel` (default 1).
`b.Loop()` (Go 1.24+)	Idiomatic replacement for `for i := 0; i < b.N; i++`; prevents many footguns.
`b.Helper()`	Mark a helper so failures point at the caller's line.
`b.Cleanup(fn)`	Run `fn` after the benchmark completes.
`b.Fatalf(...)` / `b.Errorf(...)`	Fail the benchmark. A failed benchmark prints `--- FAIL` with its log output instead of a result line.
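To show how a few of these members combine, here is a small sketch that uses `SetBytes`, `ReportAllocs`, and `ResetTimer` in one benchmark and then reads the recorded values back from the `testing.BenchmarkResult`. The constant `payloadSize` and the helper `copyStats` are assumptions made for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// payloadSize is an arbitrary size chosen for this illustration.
const payloadSize = 4096

// BenchmarkCopy copies a buffer and declares the bytes handled per iteration.
func BenchmarkCopy(b *testing.B) {
	src := bytes.Repeat([]byte{'x'}, payloadSize)
	dst := make([]byte, payloadSize)

	b.SetBytes(payloadSize) // enables the MB/s column
	b.ReportAllocs()        // under go test: B/op and allocs/op without -benchmem
	b.ResetTimer()          // discard time and allocations spent on setup

	for i := 0; i < b.N; i++ {
		copy(dst, src)
	}
}

// copyStats returns the declared bytes per op and the measured allocations per op.
func copyStats() (bytesPerOp int64, allocsPerOp int64) {
	r := testing.Benchmark(BenchmarkCopy)
	return r.Bytes, r.AllocsPerOp()
}

func main() {
	r := testing.Benchmark(BenchmarkCopy)
	// String() covers N, ns/op and MB/s; MemString() covers B/op and allocs/op.
	fmt.Println(r, r.MemString())
}
```

Because the two slices are allocated before `ResetTimer`, the loop itself should report zero allocations per op.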

3. Driver flags

Flag	Effect
`-bench=<regex>`	Run benchmarks whose name matches regex. Use `.` for all.
`-benchmem`	Include `B/op` and `allocs/op` columns.
`-benchtime=<dur>`	Target wall time per benchmark (e.g., `5s`, `500ms`). Default `1s`.
`-benchtime=Nx`	Pin `b.N` to exactly `N` iterations. Useful for deterministic runs.
`-count=N`	Repeat each benchmark N times back-to-back; `benchstat` needs several samples per benchmark to test for significance.
`-cpu=1,2,4`	Run each benchmark for each `GOMAXPROCS` value.
`-run=^$`	Skip all `TestXxx`; standard prefix to a `-bench` invocation.
`-cpuprofile=cpu.out`	Write a CPU profile during benchmarks.
`-memprofile=mem.out`	Write a heap profile after the run.
`-blockprofile=block.out`	Write a goroutine blocking profile (time spent waiting on channels, locks, and similar).
`-mutexprofile=mutex.out`	Write a mutex contention profile.
`-trace=trace.out`	Write an execution trace.
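`-timeout=<dur>`	Abort the test binary if it runs longer than `dur` (default `10m`). The limit covers the whole run, not each benchmark, so raise it for long `-bench` sessions.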

Canonical invocation for a regression-comparable run:

go test -run=^$ -bench=. -benchmem -count=10 -benchtime=2s ./... | tee new.txt

4. Output format

goos: linux
goarch: amd64
pkg: example.com/mylib
cpu: AMD Ryzen 9 7950X
BenchmarkSum-32         12345678              97.4 ns/op            24 B/op          1 allocs/op

Column	Meaning
`BenchmarkSum-32`	Benchmark name plus active `GOMAXPROCS`.
`12345678`	Final `b.N` reached during scaling.
`97.4 ns/op`	Mean wall time per loop iteration.
`24 B/op`	Mean heap bytes allocated per op (with `-benchmem`).
`1 allocs/op`	Mean heap allocations per op (with `-benchmem`).

b.SetBytes(n) adds an MB/s column. b.ReportMetric(v, "req/s") adds a custom column.
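Because every result line is a name, an iteration count, and then value/unit pairs, it can be split mechanically. The sketch below is one possible way to do that; the `Result` type and `parseLine` function are hypothetical names for this illustration, and production tooling would more likely rely on a maintained parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Result holds one parsed benchmark line. The type is illustrative only.
type Result struct {
	Name    string             // e.g. "BenchmarkSum-32"
	N       int64              // iteration count
	Metrics map[string]float64 // unit -> value, e.g. "ns/op" -> 97.4
}

// parseLine splits a result line into its name, iteration count,
// and the value/unit pairs that follow.
func parseLine(line string) (Result, error) {
	f := strings.Fields(line)
	if len(f) < 4 || len(f)%2 != 0 || !strings.HasPrefix(f[0], "Benchmark") {
		return Result{}, fmt.Errorf("not a benchmark result line: %q", line)
	}
	n, err := strconv.ParseInt(f[1], 10, 64)
	if err != nil {
		return Result{}, fmt.Errorf("bad iteration count: %w", err)
	}
	r := Result{Name: f[0], N: n, Metrics: map[string]float64{}}
	for i := 2; i < len(f); i += 2 {
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			return Result{}, fmt.Errorf("bad value %q: %w", f[i], err)
		}
		r.Metrics[f[i+1]] = v
	}
	return r, nil
}

func main() {
	r, err := parseLine("BenchmarkSum-32  12345678  97.4 ns/op  24 B/op  1 allocs/op")
	if err != nil {
		panic(err)
	}
	// Prints: BenchmarkSum-32 12345678 97.4 24 1
	fmt.Println(r.Name, r.N, r.Metrics["ns/op"], r.Metrics["B/op"], r.Metrics["allocs/op"])
}
```

Keying the metrics by unit means custom columns added with `b.ReportMetric` are picked up without any special handling.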

5. How the driver picks `b.N`

The driver first calls the function with b.N = 1. It measures the elapsed time and uses it to estimate how many iterations are needed to fill -benchtime. It then raises b.N to that estimate, growing by at most 100× per step, and calls the function again. The loop continues until a run lasts at least -benchtime or b.N reaches the driver's cap, which is 1e9 iterations in current Go releases.

Consequences:

A benchmark function written with a b.N loop is called several times by the driver during a single -bench invocation, so it must behave the same no matter how often it runs. State that leaks from one call into the next skews the measurement.
Setup code at the top of the function executes once per ramp-up call, not once per -count. Use b.ResetTimer() to exclude it from the timing.
A pinned run (-benchtime=10000x) skips the ramp-up: after the initial b.N = 1 probe, the driver calls the function once with b.N = 10000.
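The ramp-up can be simulated with a few lines of arithmetic. The sketch below is modeled on the behaviour described in this section and is not the driver's actual source. The constants (roughly 20% overshoot, a 100× growth cap, a ceiling of 1e9) are assumptions based on recent Go releases, and `nextN` and `rampUp` are hypothetical names.

```go
package main

import "fmt"

// nextN estimates the next iteration count from the previous run.
// goalNs is the -benchtime target; prevN and prevNs describe the last run.
func nextN(goalNs, prevN, prevNs int64) int64 {
	if prevNs <= 0 {
		prevNs = 1 // avoid dividing by zero for extremely fast runs
	}
	n := goalNs * prevN / prevNs // iterations needed at the observed rate
	n += n / 5                   // overshoot by about 20%
	if n > 100*prevN {
		n = 100 * prevN // grow by at most 100x per step
	}
	if n <= prevN {
		n = prevN + 1 // always make progress
	}
	if n > 1e9 {
		n = 1e9 // upper bound on iterations
	}
	return n
}

// rampUp returns the sequence of b.N values for an operation with a
// constant per-iteration cost, until one run lasts at least goalNs.
func rampUp(goalNs, costNs int64) []int64 {
	n := int64(1)
	seq := []int64{n}
	for n*costNs < goalNs && n < 1e9 {
		n = nextN(goalNs, n, n*costNs)
		seq = append(seq, n)
	}
	return seq
}

func main() {
	// 1s target, 100 ns per op.
	// Prints: [1 100 10000 1000000 12000000]
	fmt.Println(rampUp(1_000_000_000, 100))
}
```

Under these assumptions a 100 ns operation leads to five calls of the benchmark function, which is why setup placed at the top of the function runs more than once.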

6. `benchstat` semantics

benchstat is a separate tool that takes two (or more) output files and computes a statistical comparison.

go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt

Column	Meaning
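`time/op` (`sec/op` in current versions)	Per-benchmark summary of the samples in each file: the median in current versions, the mean in the legacy tool. A `geomean` row, where printed, aggregates across benchmarks.
`delta` (`vs base` in current versions)	Percent change relative to `old.txt`; negative means the value in `new.txt` is lower.
`p=`	Two-sample Mann–Whitney U-test p-value.
`n=`	Number of samples used per file; at most the `-count=N` value.
`±`	Variation of the samples around the summary value: a 95% confidence interval in current versions, the maximum deviation from the mean in the legacy tool.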

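By default benchstat treats a difference as significant only if p < 0.05 (adjustable with -alpha). Differences that are not significant are printed as ~ (no significant change). With too few samples the test cannot reach significance and no confidence interval can be computed, so current versions print ~ together with a warning; 10 samples per file is the conventional minimum.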

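benchstat was substantially rewritten in early 2023. The rewritten tool accepts flags such as -filter and -confidence, while the legacy tool has a different flag set (for example -geomean). Run benchstat -h to see what the installed version supports.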
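To make the summary and delta columns concrete, the following sketch computes a median per file and the percent change between them. It only illustrates the arithmetic and sign convention; it is not benchstat's implementation and performs no significance test. The sample values are invented, and `median` and `deltaPercent` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle value of xs (the mean of the two middle values
// for an even count). It copies xs so the caller's slice is not reordered.
// xs is assumed to be non-empty.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// deltaPercent reports the change from oldV to newV relative to oldV.
// For time per op, a negative value means the new run is faster.
func deltaPercent(oldV, newV float64) float64 {
	return (newV - oldV) / oldV * 100
}

func main() {
	// Hypothetical ns/op samples from two runs with -count=6.
	oldRun := []float64{101, 99, 100, 102, 98, 100}
	newRun := []float64{91, 89, 90, 92, 88, 90}
	o, n := median(oldRun), median(newRun)
	// Prints: old=100.0 new=90.0 delta=-10.0%
	fmt.Printf("old=%.1f new=%.1f delta=%+.1f%%\n", o, n, deltaPercent(o, n))
}
```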

7. Custom metrics with `ReportMetric`

func BenchmarkRouter(b *testing.B) {
    for i := 0; i < b.N; i++ {
        route("/users/42")
    }
    b.ReportMetric(float64(b.N)/b.Elapsed().Seconds(), "req/s")
    b.ReportMetric(0, "ns/op") // suppress the default column
}

ReportMetric(0, "ns/op") is the documented way to remove the default time column. Custom metrics participate in benchstat comparisons the same way ns/op does.
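Custom metrics do not have to be derived from time. As a further sketch, the benchmark below reports a count-based metric and reads it back from the `Extra` map of the `testing.BenchmarkResult`. The names `batchSize`, `BenchmarkBatch`, and `itemsPerOp` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"testing"
)

// batchSize is an arbitrary number of items handled per iteration.
const batchSize = 8

var total int // package-level sink for the loop's result

// BenchmarkBatch reports a domain-specific metric alongside ns/op.
func BenchmarkBatch(b *testing.B) {
	items := 0
	for i := 0; i < b.N; i++ {
		for j := 0; j < batchSize; j++ {
			total += j
			items++
		}
	}
	// Units ending in "/op" are conventionally per-iteration values,
	// so the total is divided by b.N before reporting.
	b.ReportMetric(float64(items)/float64(b.N), "items/op")
}

// itemsPerOp runs the benchmark and reads the custom metric back.
func itemsPerOp() float64 {
	return testing.Benchmark(BenchmarkBatch).Extra["items/op"]
}

func main() {
	// The custom metric appears as an extra column in the printed result.
	fmt.Println(testing.Benchmark(BenchmarkBatch))
}
```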

8. Sub-benchmarks (`b.Run`)

func BenchmarkParse(b *testing.B) {
    sizes := []int{16, 256, 4096, 65536}
    for _, n := range sizes {
        b.Run(fmt.Sprintf("size=%d", n), func(b *testing.B) {
            payload := make([]byte, n)
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                _ = parse(payload)
            }
        })
    }
}

Each sub-benchmark gets its own b.N, its own output line, and a slash-separated name (BenchmarkParse/size=4096). -bench=BenchmarkParse/size=256$ runs just one.

9. Parallel benchmarks

func BenchmarkLookup(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            _ = cache.Get("key")
        }
    })
}

RunParallel creates p × GOMAXPROCS goroutines, where p is the value passed to b.SetParallelism (default 1), each draining a shared b.N budget via pb.Next(). Use it for code paths where contention is the point (locks, atomics, channel ops). Benchmarks of single-threaded code should not use RunParallel.
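A minimal sketch of a contention benchmark follows, assuming a single shared counter as the contended resource. `hits`, `BenchmarkAtomicAdd`, and `parallelRun` are hypothetical names. The helper also shows that the iterations across all goroutines add up to the b.N budget of each call.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

var hits int64 // shared counter; contention on it is the thing being measured

// BenchmarkAtomicAdd increments one counter from many goroutines.
func BenchmarkAtomicAdd(b *testing.B) {
	b.SetParallelism(2) // 2 * GOMAXPROCS goroutines; must precede RunParallel
	b.RunParallel(func(pb *testing.PB) {
		// Per-goroutine setup belongs here, before the pb.Next loop.
		for pb.Next() {
			atomic.AddInt64(&hits, 1)
		}
	})
}

// parallelRun resets the counter, runs the benchmark, and returns the
// final b.N together with the total number of increments observed.
func parallelRun() (n int, total int64) {
	atomic.StoreInt64(&hits, 0)
	r := testing.Benchmark(BenchmarkAtomicAdd)
	return r.N, atomic.LoadInt64(&hits)
}

func main() {
	n, total := parallelRun()
	// total is at least n because earlier ramp-up calls also increment the counter.
	fmt.Println(n, total)
}
```

Running the same benchmark under `go test` with `-cpu=1,2,4` would show how throughput changes as more goroutines compete for the counter.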

10. Avoiding compiler optimization

The compiler may delete a benchmark loop body entirely if the result is unused. Three defenses:

Technique	When
Package-level `var sink T` and assign `sink = result`	Most common; works in all versions.
`runtime.KeepAlive(x)`	When the optimization is "this allocation has no observable use".
`b.Loop()` (Go 1.24+)	Parameters and results of function calls inside the loop are kept alive, so the body is not optimized away.

Without these, a benchmark of 1 + 2 can report a fraction of a nanosecond per op (for example 0.31 ns/op), which is only the cost of the loop counter.
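The first two defenses can be sketched side by side. `square` is a hypothetical function chosen because it is small enough to be inlined, and `runSink` is a helper invented for this example. Whether the compiler would actually remove an unguarded loop depends on the Go version, so the sketch shows only the guarded forms.

```go
package main

import (
	"fmt"
	"runtime"
	"testing"
)

// square is a hypothetical function small enough to be inlined.
func square(x int) int { return x * x }

var sink int // package-level sink

// BenchmarkSquareSink stores each result in a package-level variable,
// which the compiler must treat as observable.
func BenchmarkSquareSink(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = square(i)
	}
}

// BenchmarkSquareKeepAlive folds results into a local value and marks
// that value as used once, after the loop.
func BenchmarkSquareKeepAlive(b *testing.B) {
	acc := 0
	for i := 0; i < b.N; i++ {
		acc ^= square(i)
	}
	runtime.KeepAlive(acc)
}

// runSink runs the sink variant and returns the final b.N and the last stored value.
func runSink() (n, last int) {
	r := testing.Benchmark(BenchmarkSquareSink)
	return r.N, sink
}

func main() {
	fmt.Println("sink:     ", testing.Benchmark(BenchmarkSquareSink))
	fmt.Println("keepalive:", testing.Benchmark(BenchmarkSquareKeepAlive))
}
```

The sink form adds a store to memory on every iteration, while the KeepAlive form adds a single use after the loop; for very cheap operations that difference can show up in the reported ns/op.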

11. The `b.Loop()` form (Go 1.24+)

func BenchmarkFoo(b *testing.B) {
    for b.Loop() {
        _ = expensive()
    }
}

b.Loop() serves the same purpose as for i := 0; i < b.N; i++ but with three improvements:

Parameters and results of function calls in the loop body are kept alive, so the compiler cannot eliminate the body as dead code.
Setup before the first b.Loop() call is excluded from the timer automatically, and the timer stops when the loop ends, so no b.ResetTimer() is needed.
The benchmark function is called exactly once per -count; the ramp-up happens inside b.Loop(), so expensive setup and cleanup run only once.

Prefer b.Loop() in new code targeting Go 1.24+.

12. Stability requirements

Reliable benchmark numbers require a quiet, stable system. The table below summarizes common sources of noise and their mitigations, drawing on the perflock documentation:

Source of noise	Mitigation
CPU frequency scaling	Disable turbo, pin governor to `performance` (Linux)
Other processes	`taskset -c 2,3` to pin; `perflock` to serialize benchmark runs
Thermal throttling	Run short, give breathing room between runs
ASLR / address randomization	Run multiple times (`-count=10`) and use `benchstat`
GC	Built into the measurement; do not disable

Variance under 1% is achievable on a quiet Linux machine with perflock; 3–5% is typical on a laptop; 10%+ is normal on CI runners.

13. Non-goals / limitations

testing.B is not a load testing tool; it cannot drive a separate process or simulate users.
Benchmarks measure CPU-bound micro-operations well and end-to-end systems poorly.
The driver does not statistically validate a single run; pair with -count=10 and benchstat.
b.SetBytes does not measure I/O; it asserts a value you supplied.
Memory metrics under -benchmem are only heap allocations; stack allocations are invisible.

testing package: https://pkg.go.dev/testing
go test flags: https://pkg.go.dev/cmd/go#hdr-Testing_flags
benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
Dave Cheney, benchmarking pitfalls: https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go
Go 1.24 b.Loop() proposal: https://go.dev/issue/61515
