Microbenchmarking & Statistical Rigor (benchstat)
Write Go benchmarks that don't lie. Most microbenchmarks measure the compiler, not the code — dead-code elimination, hoisting, cold caches, and a single noisy run conspire to produce a number you'll quote in a PR and regret. Learn to measure a change and prove the difference is real with a p-value, then wire that proof into CI so a real regression fails the build.
| Field | Value |
|---|---|
| Tier | Load-testing (perf craft) |
| Primary domain | Benchmarking methodology / Go runtime |
| Skills exercised | testing.B, b.N mechanics, dead-code-elimination defeat, allocation accounting, RunParallel, variance & noise control, benchstat A/B with p-values, CI perf-regression gates, micro-vs-macro reasoning |
| Interview sections | 1 (Go language & runtime), 17 (performance engineering), 15 (testing) |
| Est. effort | 2–4 focused days |
1. Context
You're optimizing the hot path of a high-RPS service — say, the request-decode and routing step that runs on every one of 80k req/s. Someone sends a PR claiming "new JSON path is 30% faster, see the benchmark." You run it and get a different number. You run it again and get a third. The benchmark allocates a slice the compiler can prove is unused, the loop body gets hoisted out of b.N, the laptop throttled mid-run, and the "30%" is inside the noise.
This is the normal state of microbenchmarks. A microbenchmark is a measurement instrument, and an uncalibrated instrument is worse than none — it gives false confidence. Your job in this project is to build benchmarks you can trust, learn exactly how testing.B can betray you, and connect a micro number to the macro load-test number so you know when a 12 ns/op win actually matters and when it's theatre. You will produce numbers with confidence intervals, not opinions.
2. Goals / Non-goals

Goals

- Write a single benchmark that is correct: no dead-code elimination, warmed up, allocations reported, and stable enough that repeated runs agree.
- Defeat the compiler: prove (via `go test -gcflags=-m` and disassembly) that the work you think you're measuring is actually executed.
- Quantify variance and reduce it: measure the run-to-run coefficient of variation on your machine and drive it down with pinning, frequency control, and `-count`.
- Run an A/B comparison the right way with `benchstat`: multiple runs per side, a p-value, and a confidence/± on the delta. Learn to withhold a conclusion when p > 0.05.
- Benchmark concurrent code with `RunParallel` and a `-cpu` sweep; read per-core scalability, not just a single throughput number.
- Wire a CI gate that fails a build on a statistically significant slowdown of a tracked benchmark, and tune it so it doesn't fire on noise.
- Connect micro to macro: take the hot-path function you benchmarked, drive it through a load test, and show the relationship (and the limits) between the ns/op and the system's RPS/p99.

Non-goals

- Profiling and flame graphs: that's project 07 (this project tells you whether a change helped; 07 tells you where to spend the effort).
- Allocation-reduction technique itself (`sync.Pool`, escape analysis, arenas): that's project 05. Here you only need to measure allocations honestly.
- Building a benchmark framework. Use `testing.B` and `benchstat` as shipped.
- Cross-language benchmark wars. One language (Go), measured rigorously.
3. Functional requirements

- A package under test (`hotpath/`) with at least two implementations of the same hot-path function (e.g. a request-key builder, a JSON field extractor, or a small encoder), so you have a real A vs B to compare.
- A benchmark suite (`hotpath/bench_test.go`) that:
  - measures both implementations with `b.ReportAllocs()` and a defended sink so no result is eliminated;
  - calls `b.ResetTimer()` after any per-benchmark setup;
  - parameterizes input size via sub-benchmarks (`b.Run(fmt.Sprintf("n=%d", n), …)`);
  - includes a `RunParallel` variant for the concurrent path.
- A `scripts/bench.sh` that runs each side with `-count=10` (or more) into separate files and produces a `benchstat old.txt new.txt` report committed as an artifact.
- A CI workflow (`/.github/workflows/bench.yml` or equivalent) that runs the tracked benchmarks on a PR, compares against a stored baseline with `benchstat`, and fails when a tracked metric regresses with `p < 0.05` by more than a stated threshold.
- A short bridge to the load test (project 01): drive the same hot-path through the load generator at a defined RPS and record the system p99, so the findings note can place the micro delta next to the macro delta.
4. Load & data profile
This project's "load" is the input to the function under test and the parallelism of the benchmark — that's where the two scale axes live.
- Input sizes: sweep at least `n ∈ {1, 64, 1024, 65536}` so an O(n) and an O(n²) implementation visibly diverge. State the units (bytes, elements, fields).
- Input shape: deterministic given a seed (`cmd/gen` or a `testing` helper); include at least one adversarial shape (e.g. all-distinct keys vs all-same key, worst-case for a map or a sort) so you don't benchmark the lucky case.
- Parallelism: sweep `-cpu=1,2,4,8` (and up to your core count) for the `RunParallel` benchmark; record per-core `ns/op` and total throughput.
- Runs per measurement: `-count ≥ 10` per side for any A/B claim; more if the coefficient of variation is high. One run is never a measurement.
- Machine state: record CPU model, core count, `GOMAXPROCS`, governor/turbo state, and whether the machine was otherwise idle. A benchmark number without its machine context is not reproducible.
5. Non-functional requirements / SLOs
These are the measurement-quality SLOs — the bar your instrument must clear before any performance claim it produces is allowed to count.
| Metric | Target |
|---|---|
| Correctness of the benchmark (no DCE) | The measured work is provably executed — -gcflags=-m and/or disassembly show the call is not elided; removing the sink changes ns/op |
| Run-to-run coefficient of variation (single benchmark, -count=10) | < 3% on a quiesced machine; if you can't get under ~5%, you must say so and widen your significance threshold accordingly |
| Warm-up | b.N auto-scales to ≥ ~1 s of work; first iterations not counted (timer reset after setup) |
| Allocation accounting | B/op and allocs/op reported for every benchmark; a "zero-alloc" claim shows 0 allocs/op |
| A/B significance | A speedup is only claimed when benchstat reports p < 0.05; the report includes the ± / confidence and the delta % |
| Regression-detection threshold (CI gate) | Gate fires on a real ≥ X% slowdown (state X, e.g. 5%) at p < 0.05; false-positive rate of the gate on a no-op PR is ~0 over 10 trial runs |
| Micro↔macro link | The findings note states the hot-path ns/op, the fraction of request CPU it represents, and the measured effect on system p99/RPS |
The goal is not a fast number — it's a number you would bet the deploy on. A benchmark you can't reproduce within its stated variance is not a result.
6. Architecture constraints & guidance

- Defeat dead-code elimination explicitly. Assign the result to a package-level `var sink T` (or accumulate into one), then read it. The compiler must not be able to prove the result is unused. A function call whose result is discarded and whose body has no observable effect can be eliminated entirely, and then you're timing an empty loop.
- Beware hoisting & constant-folding. If the benchmark input is a constant, the compiler can compute the answer once and hoist it out of the `b.N` loop. Vary the input by loop index, or read it from a slice the compiler can't see through.
- Use `//go:noinline` deliberately, not reflexively. Inlining is real behavior you usually want to measure; force `noinline` only when you're isolating a single function's cost and inlining would fuse it into the loop and let the optimizer cheat. State why each `noinline` is there.
- `b.ResetTimer()` after setup, `b.StopTimer()`/`b.StartTimer()` around per-iteration setup you don't want counted. Know that `StopTimer` in a tight loop adds overhead and can itself distort cheap benchmarks; prefer building inputs once outside the timed region.
- `b.ReportAllocs()` on every benchmark (or `-benchmem`). Allocations are often the real story; an implementation that's 5 ns/op slower but does `0 allocs/op` may win at scale because it doesn't feed the GC.
- Pin and quiet the machine. Disable turbo / set a fixed CPU governor (`performance`), close everything else, and ideally pin to specific cores (`taskset` on Linux). Thermal throttling and frequency scaling are the #1 source of fake variance. On a laptop, run plugged in.
- `benchstat` is the arbiter. Never eyeball two `ns/op` numbers and declare a winner. Feed `-count`-many runs of each side to `benchstat`; it computes the delta, the `±` (or confidence interval), and a p-value via the Mann–Whitney U test. `p > 0.05` means "no detectable difference," full stop.
- Tooling versions: modern `benchstat` lives at `golang.org/x/perf/cmd/benchstat`. Pin its version in your findings; output format and the default statistics have changed across versions.
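The first few points above (sink, index-varied input, setup outside the timed region, allocation reporting) fit in one small benchmark. The following is a sketch under stated assumptions: `buildKey` is a hypothetical stand-in for the real hot-path function, and `main` uses `testing.Benchmark` only so the example runs on its own. In the project the benchmark function would live in `hotpath/bench_test.go` and run under `go test -bench`.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// sink is package-level so the compiler cannot prove the result is unused.
var sink string

// buildKey is a hypothetical hot-path function, not the project's real one.
func buildKey(tenant string, id int) string {
	return tenant + ":" + strconv.Itoa(id)
}

// BenchmarkBuildKey shows the sink pattern with setup kept out of the timer.
func BenchmarkBuildKey(b *testing.B) {
	// Setup: inputs are built once, before the timed region starts.
	ids := make([]int, 1024)
	for i := range ids {
		ids[i] = i * 7919
	}
	b.ReportAllocs()
	b.ResetTimer()

	var local string
	for i := 0; i < b.N; i++ {
		// Index-varied input, so the argument is not a foldable constant.
		local = buildKey("tenant-a", ids[i%len(ids)])
	}
	sink = local // publish the result so the loop body has an observable effect
}

func main() {
	r := testing.Benchmark(BenchmarkBuildKey)
	fmt.Println(r.String(), r.MemString())
	fmt.Println("sink non-empty:", len(sink) > 0) // read the sink
}
```

A useful sanity check is to delete the `sink = local` line and the `local =` assignment and rerun: if `ns/op` does not change, the original was probably not measuring the call either.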
7. Data model
benchmark sample (one -count run, one b.Run case):
{ name string, n int, cpu int, ns_per_op float64, b_per_op int, allocs_per_op int }
run file (go test -bench=. -count=10 -benchmem > old.txt):
goos/goarch/pkg/cpu header + N lines of:
BenchmarkXxx/n=1024-8 123456 812 ns/op 256 B/op 3 allocs/op (×10 lines)
benchstat comparison (benchstat old.txt new.txt):
per benchmark row: old (center ± / CI), new (center ± / CI), delta %, p-value (center is a mean in older benchstat releases, a median in current ones)
p < 0.05 → significant; "~" → not significant (withhold the claim)
CI baseline:
baseline.txt committed per tracked benchmark; PR run produces new.txt;
gate = benchstat baseline.txt new.txt, parsed for delta% > X at p<0.05 → exit 1
ns/op is a mean over b.N iterations; -count gives you a sample of means, and that sample is what benchstat reasons over. The single ns/op you see in one run is one draw from a distribution — never quote it alone.
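To make the sample record concrete, here is a sketch of turning one result line into that struct. It assumes the `n=<int>` sub-benchmark naming used in this project and treats any trailing `-<int>` as the GOMAXPROCS suffix, which is ambiguous for sub-benchmark names that themselves end in a number when GOMAXPROCS is 1. The type and function names are illustrative; for real tooling, the `golang.org/x/perf/benchfmt` package is the more robust parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Sample mirrors the "benchmark sample" record above.
type Sample struct {
	Name        string
	N           int // input size from an "/n=<int>" name component, 0 if absent
	CPU         int // GOMAXPROCS suffix, 1 if the name carries none
	NsPerOp     float64
	BPerOp      int
	AllocsPerOp int
}

// parseLine parses one "BenchmarkXxx-8  N  v ns/op  v B/op  v allocs/op" line.
func parseLine(line string) (Sample, error) {
	f := strings.Fields(line)
	if len(f) < 4 || !strings.HasPrefix(f[0], "Benchmark") {
		return Sample{}, fmt.Errorf("not a benchmark result line: %q", line)
	}
	s := Sample{Name: f[0], CPU: 1}

	// A trailing "-<int>" is taken as the GOMAXPROCS suffix (assumption).
	if i := strings.LastIndexByte(f[0], '-'); i >= 0 {
		if cpu, err := strconv.Atoi(f[0][i+1:]); err == nil {
			s.Name, s.CPU = f[0][:i], cpu
		}
	}
	// Input size from a "/n=<int>" sub-benchmark component, if present.
	if i := strings.Index(s.Name, "/n="); i >= 0 {
		rest := s.Name[i+3:]
		if j := strings.IndexByte(rest, '/'); j >= 0 {
			rest = rest[:j]
		}
		s.N, _ = strconv.Atoi(rest)
	}
	// f[1] is the iteration count; value/unit pairs follow it.
	for i := 2; i+1 < len(f); i += 2 {
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			return Sample{}, fmt.Errorf("bad value %q: %w", f[i], err)
		}
		switch f[i+1] {
		case "ns/op":
			s.NsPerOp = v
		case "B/op":
			s.BPerOp = int(v)
		case "allocs/op":
			s.AllocsPerOp = int(v)
		}
	}
	return s, nil
}

func main() {
	s, err := parseLine("BenchmarkXxx/n=1024-8   123456   812 ns/op   256 B/op   3 allocs/op")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```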
8. Interface contract

- `go test -run=^$ -bench=BenchmarkHotpath -benchmem -count=10 ./hotpath/`
- `go test -bench=BenchmarkHotpathParallel -cpu=1,2,4,8 ./hotpath/`
- `scripts/bench.sh <ref-old> <ref-new>` → checks out each, runs the suite with `-count=10` into `old.txt`/`new.txt`, prints `benchstat old.txt new.txt`.
- `scripts/gate.sh new.txt baseline.txt -threshold 5` → exit non-zero iff any tracked benchmark regressed > 5% at `p < 0.05`.
- Verification of "no DCE": `go test -gcflags=-m ./hotpath/ 2>&1 | grep <fn>` plus a disassembly check (`go test -gcflags=-S`), recorded in the findings note.
- Micro↔macro bridge: `cmd/loadbridge -rate <rps> -duration 2m` drives the same function through project 01's harness and emits system p50/p99/RPS.
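The gate's exit rule is a two-part predicate: the slowdown must exceed the threshold and be statistically significant. A minimal sketch of just that decision, assuming the delta and p-value for each tracked benchmark have already been extracted from the `benchstat` report. The extraction step is left out because the output format differs between `benchstat` versions; the type, function names, and rows below are hypothetical.

```go
package main

import "fmt"

// Comparison is one benchstat row reduced to the numbers the gate needs.
type Comparison struct {
	Name     string
	DeltaPct float64 // (new-old)/old*100 for a time metric; positive = slower
	P        float64 // p-value reported for the row
}

// regressed reports whether a row is both significant and over the threshold.
func regressed(c Comparison, thresholdPct, alpha float64) bool {
	return c.P < alpha && c.DeltaPct > thresholdPct
}

// gate returns the rows that should fail the build.
func gate(rows []Comparison, thresholdPct, alpha float64) []Comparison {
	var failed []Comparison
	for _, r := range rows {
		if regressed(r, thresholdPct, alpha) {
			failed = append(failed, r)
		}
	}
	return failed
}

func main() {
	// Hypothetical rows, not real measurements.
	rows := []Comparison{
		{"BenchmarkHotpath/n=1024", 8.1, 0.001}, // significant and over threshold
		{"BenchmarkHotpath/n=64", 6.5, 0.210},   // over threshold, not significant
		{"BenchmarkHotpath/n=1", 1.2, 0.003},    // significant, under threshold
	}
	failed := gate(rows, 5, 0.05)
	for _, f := range failed {
		fmt.Printf("REGRESSION %s: +%.1f%% (p=%.3f)\n", f.Name, f.DeltaPct, f.P)
	}
	// A real gate would exit non-zero here when len(failed) > 0.
	fmt.Println("gate failed:", len(failed) > 0)
}
```

Requiring both conditions is what keeps the gate quiet on a no-op PR: a noisy 6% swing without significance does not fire, and neither does a significant but negligible 1% shift.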
9. Key technical challenges

- Dead-code elimination is silent. The most common failure: the benchmark reports a few hundred picoseconds/op because the compiler deleted the work entirely. The tell is an implausibly tiny `ns/op` and no allocations. You must prove the work runs: sink + read, then confirm with `-gcflags=-m` (which reports inlining and escape decisions) and the disassembly (which shows whether the call survives). Until you've seen DCE fire on a deliberately broken benchmark, you don't really know it.
- `b.N` mechanics & warm-up. `testing` runs the body with a growing `b.N` until the timed region reaches ~1 s, then reports the last run's `ns/op`. If your per-iteration setup leaks into the loop, you measure setup, not work. If the function has a one-time cold-cache cost, the early trial runs with small `b.N` are dominated by it and say little about steady state.
- CPU frequency scaling & noise. Turbo boost, thermal throttling, and background processes can produce 10–30% swings that dwarf the change you're hunting. This is why naive A/B "it's faster on my machine" claims are worthless. Control it, or measure it and inflate your significance bar.
- Single-run delusion. One `ns/op` vs another `ns/op` tells you nothing about significance. A 4% "improvement" from a single run is, on most machines, pure noise. The fix is `-count` + `benchstat`, and the discipline to report `~` (no difference) when the statistics say so.
- Benchmarking concurrency honestly. `RunParallel` measures aggregate throughput across `GOMAXPROCS` workers, but contention (a shared mutex, false sharing, an atomic hot line) makes per-core `ns/op` rise as you add cores. A benchmark that only runs at `-cpu=1` hides the exact scalability problem the high-RPS service will hit. The `-cpu` sweep is where contention becomes visible.
- Micro ≠ macro. A function that's 2× faster in isolation may move system p99 by 0%, because it wasn't on the critical path, or the bottleneck is the DB, or GC pressure dominates, or the win is amortized away by batching. A micro number is a hypothesis about the system; the load test is the test of that hypothesis.
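The concurrency point can be seen with two toy counters: one where every worker takes the same mutex, and one where each worker counts privately and merges once. This is a sketch under stated assumptions, not the project's hot path. `main` emulates the `-cpu` sweep by setting `GOMAXPROCS` before each `testing.Benchmark` call so the example runs standalone; under `go test` the `-cpu=1,2,4,8` flag does this for you. Multiplying the reported `ns/op` by the worker count is used here as an approximation of per-core cost.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"testing"
)

var sink int64 // keeps the counters observable

// BenchmarkSharedMutex is the contended version: one lock for all workers.
func BenchmarkSharedMutex(b *testing.B) {
	var mu sync.Mutex
	var counter int64
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			counter++
			mu.Unlock()
		}
	})
	sink = counter // RunParallel has returned, so all workers are done
}

// BenchmarkLocalCounter keeps state per worker and merges once at the end.
func BenchmarkLocalCounter(b *testing.B) {
	var total int64
	b.RunParallel(func(pb *testing.PB) {
		var local int64
		for pb.Next() {
			local++
		}
		atomic.AddInt64(&total, local)
	})
	sink = total
}

// nsPerOp returns wall-clock nanoseconds per operation as a float.
func nsPerOp(r testing.BenchmarkResult) float64 {
	if r.N == 0 {
		return 0
	}
	return float64(r.T.Nanoseconds()) / float64(r.N)
}

// sweep runs both benchmarks at each worker count the machine supports.
func sweep(cpus []int) {
	prev := runtime.GOMAXPROCS(0)
	defer runtime.GOMAXPROCS(prev)
	for _, p := range cpus {
		if p > runtime.NumCPU() {
			break
		}
		runtime.GOMAXPROCS(p)
		shared := nsPerOp(testing.Benchmark(BenchmarkSharedMutex))
		local := nsPerOp(testing.Benchmark(BenchmarkLocalCounter))
		fmt.Printf("cpu=%d shared=%.1f ns/op (x cpu %.1f) local=%.1f ns/op (x cpu %.1f)\n",
			p, shared, shared*float64(p), local, local*float64(p))
	}
}

func main() {
	sweep([]int{1, 2, 4, 8})
}
```

The local-counter loop does almost no work, so its number is mostly `pb.Next()` overhead; what matters for the comparison is how each column changes as the worker count grows.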
Stages (0 simple → 1 big data → 2 high RPS → 3 both)
Build Stage 0 correct first — a benchmark that lies at Stage 0 lies at every stage. Each stage pushes one axis (input size, then parallelism) before combining them and gating the result in CI.
| Stage | Input size | Parallelism | What's measured | Pass criterion |
|---|---|---|---|---|
| 0 · Simple | one fixed n (e.g. 1 KB) | -cpu=1 | A single clean benchmark: ns/op, B/op, allocs/op of one implementation | No DCE (proven via -gcflags=-m + sink-removal sanity check); timer reset after setup; CV < 3% over -count=10 on a quiesced machine |
| 1 · Big data | sweep n ∈ {1, 64, 1k, 64k} | -cpu=1 | ns/op and allocs/op as a function of n across both implementations | The complexity curve is read off the numbers: an O(n) vs O(n²) gap is visible in the sweep, and per-n allocation growth is quantified and explained |
| 2 · High RPS | one fixed n | RunParallel, -cpu=1,2,4,8,… | Aggregate throughput and per-core ns/op as cores are added | Scalability characterized: state whether the path scales linearly or where contention (mutex/atomic/false-sharing) flattens or worsens per-core cost, with the inflection point named |
| 3 · Both | large n (≥ 64k) | high parallelism (-cpu = core count) | Worst realistic load on the hot path + a benchstat-gated CI regression check | A↔B compared with benchstat (p < 0.05, ± reported); CI gate fails on a real ≥5% slowdown and does not fire on a no-op PR over 10 trials; findings note links the micro ns/op to the measured system p99/RPS |
10. Experiments to run (break it / tune it)
Record before/after numbers, the benchstat row (with p-value), and a one-line conclusion for each:
1. Watch DCE fire. Write the "obvious" benchmark that discards the result. Note the absurd `ns/op`. Add the sink + read; watch `ns/op` jump to a realistic value. Confirm with `go test -gcflags=-m`. This is the foundational experiment; do it first.
2. Hoisting demo. Benchmark with a constant input, then with an index-varied input from a slice. Show the constant version is suspiciously fast because the compiler folded it. Quantify the gap.
3. Warm-up / timer placement. Put expensive setup inside the timed loop, then move it out with `b.ResetTimer()`. Show how much the misplaced setup inflated `ns/op`, and how `StopTimer`/`StartTimer` overhead distorts a cheap benchmark.
4. Variance hunt. Run the same benchmark with `-count=20` with turbo on vs a fixed `performance` governor (and idle vs a busy machine). Report the coefficient of variation for each condition. Show that an A/B "win" smaller than your noise is undetectable.
5. A/B done right. Compare implementation A vs B with `benchstat` at `-count=10`. Report delta %, `±`, and p. Then construct a case where the difference is real but tiny and `p > 0.05`, and correctly report "no detectable difference."
6. Input-size sweep (complexity). Run both implementations across the `n` sweep. Plot `ns/op` and `allocs/op` vs `n`. Identify which is O(n) and which is O(n²) from the numbers alone, and verify against the code.
7. Parallel scaling / contention. Run the `RunParallel` benchmark across the `-cpu` sweep. Show a contended version (shared mutex/atomic) whose per-core `ns/op` rises with cores, vs a sharded/local version that stays flat. Name the inflection point.
8. CI gate, true & false positive. Introduce a deliberate 8% slowdown and prove the gate fails the build with its `benchstat` evidence. Then run the gate on a no-op PR 10 times and prove it never fires. Report the gate's threshold and false-positive rate.
9. Micro↔macro reconciliation. Take the A/B micro delta from experiment 5, drive both versions through the load harness at a fixed RPS, and report the change in system p99 and max sustainable RPS. Explain any gap between the micro win and the macro effect (critical-path fraction, GC, batching, the real bottleneck).
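For the micro↔macro reconciliation it helps to compute, before the load test, the most a micro win could save per request. A rough sketch, assuming you know (from profiling or instrumentation) the total CPU time per request and how many times the function runs per request. All numbers below are hypothetical, and the result bounds CPU saved per request; it does not predict p99.

```go
package main

import "fmt"

// requestCPUSaving returns the fraction of per-request CPU time removed when a
// function called callsPerRequest times goes from oldNs to newNs per call.
func requestCPUSaving(oldNs, newNs, callsPerRequest, requestCPUNs float64) float64 {
	if requestCPUNs <= 0 {
		return 0
	}
	return (oldNs - newNs) * callsPerRequest / requestCPUNs
}

func main() {
	// Hypothetical inputs: a 30% micro win on a function that runs once
	// per request, where the whole request costs 40 µs of CPU.
	oldNs, newNs := 812.0, 568.0
	saving := requestCPUSaving(oldNs, newNs, 1, 40_000)
	fmt.Printf("micro delta: %.0f%%\n", (oldNs-newNs)/oldNs*100)
	fmt.Printf("upper bound on per-request CPU saved: %.2f%%\n", saving*100)
}
```

When the bound is well under the load test's own run-to-run noise, a flat p99 is the expected outcome rather than a surprise, and the findings note can say so.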
11. Milestones

- `hotpath/` with A and B; first clean Stage-0 benchmark; DCE proven defeated (experiment 1) and CV measured.
- Input-size sweep + allocation accounting (Stage 1, experiments 2, 6); complexity curves read off the numbers.
- `RunParallel` + `-cpu` sweep (Stage 2, experiment 7); contention characterized.
- `benchstat` A/B discipline + variance control (experiments 3, 4, 5); a result you'd bet on.
- CI regression gate with proven true-positive and ~0 false-positive (Stage 3, experiment 8); micro↔macro bridge (experiment 9); findings note.
12. Acceptance criteria (definition of done)

- A Stage-0 benchmark with proven no-DCE (sink + `-gcflags=-m` evidence in the note) and a measured CV < 3% (or a stated, justified higher bar).
- `B/op` and `allocs/op` reported for every benchmark; any "zero-alloc" claim shows `0 allocs/op`.
- An input-size sweep that reveals the O(n) vs O(n²) difference from the numbers, with the allocation-growth explanation.
- A `RunParallel` `-cpu` sweep with per-core `ns/op` and a named contention inflection point (or a justified "scales linearly" with evidence).
- An A/B `benchstat` report committed, including a case correctly reported as `~` (not significant), proving you withhold conclusions when the stats say so.
- A CI gate that fails on a real ≥5% slowdown (show the failing run) and does not fire on a no-op PR over 10 trials (show the green runs).
- A findings note connecting the micro `ns/op` to the measured system p99/RPS, including any gap and its explanation.
- Every number is reproducible from a committed command + the recorded machine context (CPU, GOMAXPROCS, governor, idle state, `benchstat` version).
13. Stretch goals

- `benchstat` over time: store baselines per commit and plot the tracked benchmark's trend across the last N merges to catch slow drift (the "boiling frog" regression no single PR trips).
- `-benchtime` sensitivity: run with `-benchtime=100x` vs `-benchtime=1s` vs `-benchtime=10s` and show how fixed-iteration vs fixed-time benchmarking changes variance and what each is good for.
- `perflock` / cpuset isolation: add machine-level isolation (a dedicated benchmark runner, `perflock`, or isolated cores) and re-measure your CV to show the variance floor your CI gate can actually rely on.
- PGO interaction: build with profile-guided optimization and re-benchmark; show how PGO moves inlining decisions and therefore your micro numbers.
- Allocation regression gate: extend the CI gate to fail on an `allocs/op` increase too, not just `ns/op`; it is often the earlier and more reliable signal.
14. Evaluation rubric
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Benchmark correctness | Knows to add a sink; benchmark looks right | Proves no DCE via -gcflags=-m/disassembly; has seen DCE fire and can spot a fake number on sight |
| Variance control | Runs -count and averages | Measures CV, controls frequency scaling, states the machine context, and sets the significance bar to the noise floor |
| Statistical reasoning | Uses benchstat for A/B | Reads the p-value and ±; withholds the claim when p > 0.05; explains what the test actually tests |
| Allocation accounting | Reports allocs/op | Treats allocations as a first-class signal; knows when the alloc delta predicts the macro win better than ns/op |
| Concurrency benchmarking | Uses RunParallel | Sweeps -cpu, reads per-core scaling, locates the contention inflection, ties it to the high-RPS path |
| CI gating | Has a benchmark in CI | Gate fires on real regressions, doesn't fire on noise (proven both ways); threshold justified by measured variance |
| Micro↔macro judgment | Reports the ns/op | Connects (or disconnects) the micro win from the system p99; trusts only statistically-sound numbers and says so when a micro number is meaningless |
| Communication | Clear findings note | Could defend every delta — and every ~ — to a staff panel |
Staff bar in one line: trusts only statistically-sound numbers, knows when a micro number is meaningless, and never ships a perf claim without a p-value and a machine context behind it.
15. References

- Go `testing` package docs: `B`, `b.N`, `ResetTimer`, `ReportAllocs`, `RunParallel`, `-bench`, `-benchmem`, `-benchtime`, `-count`, `-cpu`.
- `golang.org/x/perf/cmd/benchstat`: installation, output format, the significance test it uses; pin the version you used.
- Dave Cheney, "How to write benchmarks in Go" and "High Performance Go Workshop": DCE, sinks, `b.N` mechanics, the canonical pitfalls.
- Gil Tene on measurement honesty (the load-side companion: project 01, coordinated omission): the same "your instrument is lying to you" discipline, applied to latency.
- Builds on `load-testing/01-distributed-load-generator` (the macro side of the micro↔macro bridge); feeds `load-testing/07-profiling-guided-optimization` (benchmarks tell you whether; profiling tells you where).
- See also: `Interview Question/17-performance-engineering/` (benchmarking, measurement, regression) and `Interview Question/01-golang-language-and-runtime/` (compiler optimizations, inlining, escape analysis, the GC the allocations feed).