Microbenchmarking & Statistical Rigor (benchstat)

Write Go benchmarks that don't lie. Most microbenchmarks measure the compiler, not the code — dead-code elimination, hoisting, cold caches, and a single noisy run conspire to produce a number you'll quote in a PR and regret. Learn to measure a change and prove the difference is real with a p-value, then wire that proof into CI so a real regression fails the build.


Tier	Load-testing (perf craft)
Primary domain	Benchmarking methodology / Go runtime
Skills exercised	`testing.B`, `b.N` mechanics, dead-code-elimination defeat, allocation accounting, `RunParallel`, variance & noise control, `benchstat` A/B with p-values, CI perf-regression gates, micro-vs-macro reasoning
Interview sections	1 (Go language & runtime), 17 (performance engineering), 15 (testing)
Est. effort	2–4 focused days

1. Context

You're optimizing the hot path of a high-RPS service, say the request-decode and routing step that runs on every request at 80k req/s. Someone sends a PR claiming "new JSON path is 30% faster, see the benchmark." You run it and get a different number. You run it again and get a third. The benchmark allocates a slice the compiler can prove is unused, the loop body's work gets hoisted out of the b.N loop, the laptop throttled mid-run, and the "30%" is inside the noise.

This is the normal state of microbenchmarks. A microbenchmark is a measurement instrument, and an uncalibrated instrument is worse than none — it gives false confidence. Your job in this project is to build benchmarks you can trust, learn exactly how testing.B can betray you, and connect a micro number to the macro load-test number so you know when a 12 ns/op win actually matters and when it's theatre. You will produce numbers with confidence intervals, not opinions.

2. Goals / Non-goals

Goals
- Write a single benchmark that is correct: no dead-code elimination, warmed up, allocations reported, and stable enough that repeated runs agree.
- Defeat the compiler: prove (via go test -gcflags=-m and disassembly) that the work you think you're measuring is actually executed.
- Quantify variance and reduce it: measure the run-to-run coefficient of variation on your machine and drive it down with pinning, frequency control, and -count.
- Run an A/B comparison the right way with benchstat: multiple runs per side, a p-value, and a confidence/± on the delta; learn to withhold a conclusion when p > 0.05.
- Benchmark concurrent code with RunParallel and a -cpu sweep; read per-core scalability, not just a single throughput number.
- Wire a CI gate that fails a build on a statistically significant slowdown of a tracked benchmark, and tune it so it doesn't fire on noise.
- Connect micro to macro: take the hot-path function you benchmarked, drive it through a load test, and show the relationship (and the limits) between the ns/op and the system's RPS/p99.

Non-goals
- Profiling and flame graphs: that's project 07 (this project tells you whether a change helped; 07 tells you where to spend the effort).
- Allocation-reduction technique itself (sync.Pool, escape analysis, arenas): that's project 05. Here you only need to measure allocations honestly.
- Building a benchmark framework. Use testing.B and benchstat as shipped.
- Cross-language benchmark wars. One language (Go), measured rigorously.

3. Functional requirements

A package under test (hotpath/) with at least two implementations of the same hot-path function (e.g. a request-key builder, a JSON field extractor, or a small encoder), so you have a real A vs B to compare.
A benchmark suite (hotpath/bench_test.go) that:
  - measures both implementations with b.ReportAllocs() and a defended sink so no result is eliminated;
  - calls b.ResetTimer() after any per-benchmark setup;
  - parameterizes input size via sub-benchmarks (b.Run(fmt.Sprintf("n=%d", n), …));
  - includes a RunParallel variant for the concurrent path.
A scripts/bench.sh that runs each side -count=10 (or more) into separate files and produces a benchstat old.txt new.txt report committed as an artifact.
A CI workflow (/.github/workflows/bench.yml or equivalent) that runs the tracked benchmarks on a PR, compares against a stored baseline with benchstat, and fails when a tracked metric regresses with p < 0.05 by more than a stated threshold.
A short bridge to the load test (project 01): drive the same hot-path through the load generator at a defined RPS and record the system p99, so the findings note can place the micro delta next to the macro delta.
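As a rough illustration of the benchmark-suite requirements above, here is a minimal sketch of what hotpath/bench_test.go might contain. The names buildKeyA, buildKeyB, and makeInputs are hypothetical stand-ins for your two implementations and input helper, and the size sweep is shortened. The main function exists only so the sketch runs standalone through testing.Benchmark; in a real _test.go file you would drop it, use your package name, and let go test -bench drive the Benchmark functions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync/atomic"
	"testing"
)

// sink is package-level so the compiler cannot prove a result is unused.
var sink string

// parallelSink receives one write per worker, keeping results observable
// without adding per-iteration contention.
var parallelSink int64

// buildKeyA is hypothetical implementation A: repeated concatenation.
func buildKeyA(fields []string) string {
	key := ""
	for _, f := range fields {
		key += f + "|"
	}
	return key
}

// buildKeyB is hypothetical implementation B: a strings.Builder.
func buildKeyB(fields []string) string {
	var sb strings.Builder
	for _, f := range fields {
		sb.WriteString(f)
		sb.WriteByte('|')
	}
	return sb.String()
}

// makeInputs builds n deterministic fields; call it outside the timed region.
func makeInputs(n int) []string {
	in := make([]string, n)
	for i := range in {
		in[i] = "f" + strconv.Itoa(i)
	}
	return in
}

// benchImpl sweeps input size with one sub-benchmark per n.
func benchImpl(b *testing.B, impl func([]string) string) {
	for _, n := range []int{1, 64, 1024} {
		in := makeInputs(n)
		b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
			b.ReportAllocs()
			b.ResetTimer() // anything above this line is setup, not work
			for i := 0; i < b.N; i++ {
				sink = impl(in)
			}
		})
	}
}

func BenchmarkHotpathA(b *testing.B) { benchImpl(b, buildKeyA) }
func BenchmarkHotpathB(b *testing.B) { benchImpl(b, buildKeyB) }

// BenchmarkHotpathParallel is the RunParallel variant for the concurrent path.
func BenchmarkHotpathParallel(b *testing.B) {
	in := makeInputs(64)
	b.ReportAllocs()
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		var total int64
		for pb.Next() {
			total += int64(len(buildKeyB(in)))
		}
		atomic.AddInt64(&parallelSink, total)
	})
}

func main() {
	// Standalone driver for this sketch only; `go test -bench` replaces it.
	r := testing.Benchmark(BenchmarkHotpathParallel)
	fmt.Println("BenchmarkHotpathParallel", r.String(), r.MemString())
}
```

The parallel variant accumulates into a worker-local total and publishes it once, so the result stays observable without turning the sink itself into a contention point.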

4. Load & data profile

This project's "load" is the input to the function under test and the parallelism of the benchmark — that's where the two scale axes live.

Input sizes: sweep at least n ∈ {1, 64, 1024, 65536} so an O(n) and an O(n²) implementation visibly diverge. State the units (bytes, elements, fields).
Input shape: deterministic given a seed (cmd/gen or a testing helper); include at least one adversarial shape (e.g. all-distinct keys vs all-same key, worst-case for a map or a sort) so you don't benchmark the lucky case.
Parallelism: sweep -cpu=1,2,4,8 (and up to your core count) for the RunParallel benchmark; record the reported ns/op, the derived per-core ns/op (reported ns/op × cores), and total throughput.
Runs per measurement: -count ≥ 10 per side for any A/B claim; more if the coefficient of variation is high. One run is never a measurement.
Machine state: record CPU model, core count, GOMAXPROCS, governor/turbo state, and whether the machine was otherwise idle. A benchmark number without its machine context is not reproducible.
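One way to get deterministic inputs with an adversarial shape is a seeded generator like the sketch below. The Shape type, the genKeys and distinct helpers, and the key format are all hypothetical; the point is only that the output depends on nothing but (seed, n, shape), so every run of the benchmark sees the same data.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Shape selects the input distribution (hypothetical names).
type Shape int

const (
	AllDistinct Shape = iota // every key different
	AllSame                  // every key identical: worst case for some structures
)

// genKeys returns n keys that depend only on (seed, n, shape).
func genKeys(seed int64, n int, shape Shape) []string {
	rng := rand.New(rand.NewSource(seed))
	keys := make([]string, n)
	same := fmt.Sprintf("key-%016x", rng.Uint64())
	for i := range keys {
		switch shape {
		case AllSame:
			keys[i] = same
		default:
			// The index guarantees distinctness; the random suffix varies content.
			keys[i] = fmt.Sprintf("key-%d-%08x", i, rng.Uint32())
		}
	}
	return keys
}

// distinct counts unique keys, as a sanity check on the generated shape.
func distinct(keys []string) int {
	seen := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		seen[k] = struct{}{}
	}
	return len(seen)
}

func main() {
	for _, n := range []int{1, 64, 1024, 65536} {
		d := genKeys(42, n, AllDistinct)
		s := genKeys(42, n, AllSame)
		fmt.Printf("n=%d distinct-shape=%d same-shape=%d\n", n, distinct(d), distinct(s))
	}
}
```

Generate the inputs once per sub-benchmark, before b.ResetTimer(), so generation cost never lands in the timed region.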

5. Non-functional requirements / SLOs

These are the measurement-quality SLOs — the bar your instrument must clear before any performance claim it produces is allowed to count.

Metric	Target
Correctness of the benchmark (no DCE)	The measured work is provably executed — `-gcflags=-m` and/or disassembly show the call is not elided; removing the sink changes `ns/op`
Run-to-run coefficient of variation (single benchmark, `-count=10`)	< 3% on a quiesced machine; if you can't get under ~5%, you must say so and widen your significance threshold accordingly
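Warm-up	`b.N` auto-scales until the timed region reaches `-benchtime` (1 s by default); setup is excluded by resetting the timer, and only the final run is reported, so the smaller trial runs act as warm-up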
Allocation accounting	`B/op` and `allocs/op` reported for every benchmark; a "zero-alloc" claim shows `0 allocs/op`
A/B significance	A speedup is only claimed when `benchstat` reports `p < 0.05`; the report includes the `±` / confidence and the delta %
Regression-detection threshold (CI gate)	Gate fires on a real ≥ X% slowdown (state X, e.g. 5%) at `p < 0.05`; false-positive rate of the gate on a no-op PR is ~0 over 10 trial runs
Micro↔macro link	The findings note states the hot-path `ns/op`, the fraction of request CPU it represents, and the measured effect on system p99/RPS

The goal is not a fast number — it's a number you would bet the deploy on. A benchmark you can't reproduce within its stated variance is not a result.

6. Architecture constraints & guidance

Defeat dead-code elimination explicitly. Assign the result to a package-level var sink T (or accumulate into one), then read it. The compiler must not be able to prove the result is unused. A function call whose result is discarded and whose body has no observable effect can be eliminated entirely — and then you're timing an empty loop.
Beware hoisting & constant-folding. If the benchmark input is a constant, the compiler can compute the answer once and hoist it out of the b.N loop. Vary the input by loop index, or read it from a slice the compiler can't see through.
Use //go:noinline deliberately, not reflexively. Inlining is real behavior you usually want to measure; force noinline only when you're isolating a single function's cost and inlining would fuse it into the loop and let the optimizer cheat. State why each noinline is there.
b.ResetTimer() after setup, b.StopTimer()/b.StartTimer() around per-iteration setup you don't want counted — but know that StopTimer in a tight loop adds overhead and can itself distort cheap benchmarks; prefer building inputs once outside the timed region.
b.ReportAllocs() on every benchmark (or -benchmem). Allocations are often the real story; an implementation that's 5 ns/op slower but does 0 allocs/op may win at scale because it doesn't feed the GC.
Pin and quiet the machine. Disable turbo / set a fixed CPU governor (performance), close everything else, and ideally pin to specific cores (taskset on Linux). Thermal throttling and frequency scaling are the #1 source of fake variance. On a laptop, run plugged in.
benchstat is the arbiter. Never eyeball two ns/op numbers and declare a winner. Feed many runs of each side (-count=10 or more) to benchstat; it computes the delta, the ± (or confidence interval), and a p-value via the Mann–Whitney U test. p > 0.05 means "no detectable difference at this sample size," full stop.
Tooling versions: modern benchstat lives at golang.org/x/perf/cmd/benchstat. Pin its version in your findings — output format and the default statistics have changed across versions.
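The dead-code-elimination and constant-folding points above can be seen side by side in a sketch like the following. The mix function is a hypothetical stand-in for cheap hot-path work, and main drives the three variants through testing.Benchmark only so the sketch runs standalone. Whether the compiler actually eliminates or folds a given variant depends on the toolchain version and its inlining decisions, so treat the printed numbers as something to inspect, not a guaranteed outcome.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so stored results count as observable.
var sink uint64

// mix is a small straight-line function the compiler is likely to inline.
func mix(x uint64) uint64 {
	x ^= x >> 30
	x *= 0xbf58476d1ce4e5b9
	x ^= x >> 27
	x *= 0x94d049bb133111eb
	x ^= x >> 31
	return x
}

// benchDiscarded drops the result, so the call is a candidate for elimination.
func benchDiscarded(b *testing.B) {
	for i := 0; i < b.N; i++ {
		mix(42)
	}
}

// benchConstant keeps the result but feeds a constant the compiler may fold.
func benchConstant(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = mix(42)
	}
}

// benchVaried reads inputs from a slice and stores the result in the sink.
func benchVaried(b *testing.B) {
	inputs := make([]uint64, 1024)
	for i := range inputs {
		inputs[i] = uint64(i)*2654435761 + 1
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = mix(inputs[i&1023])
	}
}

// nsPerOp keeps fractional nanoseconds, which the integer NsPerOp method drops.
func nsPerOp(r testing.BenchmarkResult) float64 {
	return float64(r.T.Nanoseconds()) / float64(r.N)
}

func main() {
	fmt.Printf("discarded result: %.3f ns/op\n", nsPerOp(testing.Benchmark(benchDiscarded)))
	fmt.Printf("constant input:   %.3f ns/op\n", nsPerOp(testing.Benchmark(benchConstant)))
	fmt.Printf("varied + sink:    %.3f ns/op\n", nsPerOp(testing.Benchmark(benchVaried)))
}
```

If the first two figures come out at roughly the cost of an empty loop iteration while the third is clearly larger, the first two were timing the loop rather than mix. Confirm by reading the compiler's inlining output and the disassembly, not by trusting the timing alone.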

7. Data model

benchmark sample (one -count run, one b.Run case):
  { name string, n int, cpu int, ns_per_op float64, b_per_op int, allocs_per_op int }

run file (go test -bench=. -count=10 -benchmem > old.txt):
  goos/goarch/pkg/cpu header + N lines of:
  BenchmarkXxx/n=1024-8   123456   812 ns/op   256 B/op   3 allocs/op   (×10 lines)

benchstat comparison (benchstat old.txt new.txt):
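  per benchmark row: old (center ± spread), new (center ± spread), delta %, p-value
  (current benchstat summarizes each side by its median with a confidence interval; older versions reported a mean ± range)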
  p < 0.05  → significant;  "~"  → not significant (withhold the claim)

CI baseline:
  baseline.txt committed per tracked benchmark; PR run produces new.txt;
  gate = benchstat baseline.txt new.txt, parsed for delta% > X at p<0.05 → exit 1

ns/op is a mean over b.N iterations; -count gives you a sample of means, and that sample is what benchstat reasons over. The single ns/op you see in one run is one draw from a distribution — never quote it alone.
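To make the benchmark-sample record concrete, here is a small sketch that parses one result line of go test -bench output into a struct. The Sample type and the parseLine and splitCPU helpers are hypothetical, and the parser handles only the three units shown in the run-file example. It treats a trailing -N on the name as the GOMAXPROCS suffix, which is an assumption that breaks if a sub-benchmark name itself ends in a hyphen and digits.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Sample mirrors the benchmark-sample record (hypothetical field names).
type Sample struct {
	Name        string
	CPU         int
	Iters       int
	NsPerOp     float64
	BytesPerOp  int64
	AllocsPerOp int64
}

// splitCPU separates a trailing "-N" GOMAXPROCS suffix from the name.
// A name without a numeric suffix is treated as CPU=1.
func splitCPU(name string) (string, int) {
	i := strings.LastIndexByte(name, '-')
	if i < 0 {
		return name, 1
	}
	cpu, err := strconv.Atoi(name[i+1:])
	if err != nil || cpu < 1 {
		return name, 1
	}
	return name[:i], cpu
}

// parseLine parses one result line; ok is false for headers and other lines.
func parseLine(line string) (s Sample, ok bool) {
	f := strings.Fields(line)
	if len(f) < 4 || !strings.HasPrefix(f[0], "Benchmark") {
		return Sample{}, false
	}
	iters, err := strconv.Atoi(f[1])
	if err != nil {
		return Sample{}, false
	}
	s.Name, s.CPU = splitCPU(f[0])
	s.Iters = iters
	// The remaining fields are value/unit pairs.
	for i := 2; i+1 < len(f); i += 2 {
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			return Sample{}, false
		}
		switch f[i+1] {
		case "ns/op":
			s.NsPerOp = v
		case "B/op":
			s.BytesPerOp = int64(v)
		case "allocs/op":
			s.AllocsPerOp = int64(v)
		}
	}
	return s, true
}

func main() {
	line := "BenchmarkXxx/n=1024-8   123456   812 ns/op   256 B/op   3 allocs/op"
	s, ok := parseLine(line)
	fmt.Printf("%+v ok=%v\n", s, ok)
}
```

Collecting NsPerOp across the ten lines for one benchmark name gives exactly the sample of means that the paragraph above describes.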

8. Interface contract

go test -run=^$ -bench=BenchmarkHotpath -benchmem -count=10 ./hotpath/
go test -run=^$ -bench=BenchmarkHotpathParallel -cpu=1,2,4,8 ./hotpath/
scripts/bench.sh <ref-old> <ref-new> → checks out each, runs the suite -count=10 into old.txt/new.txt, prints benchstat old.txt new.txt.
scripts/gate.sh new.txt baseline.txt -threshold 5 → exit non-zero iff any tracked benchmark regressed > 5% at p < 0.05.
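Verification of "no DCE": go test -gcflags=-m ./hotpath/ 2>&1 | grep <fn> (this prints inlining and escape-analysis decisions, which tell you whether the call can be fused into the loop) plus a disassembly check (go test -gcflags=-S, or go tool objdump on the binary built by go test -c) showing the work is still inside the loop; both recorded in the findings note.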
Micro↔macro bridge: cmd/loadbridge -rate <rps> -duration 2m drives the same function through project 01's harness and emits system p50/p99/RPS.
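The gate's decision rule is small enough to sketch directly. The version below is written in Go rather than shell and assumes the per-benchmark delta and p-value have already been extracted from the benchstat report; the Row type, the regressions function, and the sample rows are hypothetical. A real gate would exit non-zero where this sketch only prints.

```go
package main

import "fmt"

// Row is one tracked benchmark from a comparison (hypothetical shape).
type Row struct {
	Name     string
	DeltaPct float64 // positive means slower for ns/op
	P        float64
}

// regressions returns rows that are both statistically significant and
// slower than the threshold. Either condition alone is not enough.
func regressions(rows []Row, thresholdPct, alpha float64) []Row {
	var out []Row
	for _, r := range rows {
		if r.P < alpha && r.DeltaPct > thresholdPct {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rows := []Row{
		{"HotpathA/n=1024", 8.1, 0.001}, // significant and over threshold
		{"HotpathB/n=1024", 6.0, 0.210}, // over threshold but not significant
		{"HotpathParallel", 1.2, 0.030}, // significant but under threshold
	}
	bad := regressions(rows, 5, 0.05)
	for _, r := range bad {
		fmt.Printf("REGRESSION %s: +%.1f%% (p=%.3f)\n", r.Name, r.DeltaPct, r.P)
	}
	if len(bad) > 0 {
		fmt.Println("gate: FAIL (a real gate would exit non-zero here)")
	} else {
		fmt.Println("gate: PASS")
	}
}
```

Requiring both conditions is what keeps the gate quiet on a no-op PR: a large but insignificant delta is noise, and a significant but tiny delta is below the stated threshold.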

9. Key technical challenges

Dead-code elimination is silent. The most common failure: the benchmark reports a few hundred picoseconds/op because the compiler deleted the work entirely. The tell is an implausibly tiny ns/op and no allocations. You must prove the work runs — sink + read, and confirm with -gcflags=-m. Until you've seen DCE fire on a deliberately broken benchmark, you don't really know it.
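b.N mechanics & warm-up. testing runs the body with a growing b.N until the timed region reaches the benchtime (~1 s by default), then reports the last run's ns/op. If your per-iteration setup leaks into the loop, you measure setup, not work. If the function has a one-time cold-cache cost, the early small-b.N trial runs absorb it and are discarded, so the reported figure is a warm-cache number; that is the right number only if production also runs warm.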
CPU frequency scaling & noise. Turbo boost, thermal throttling, and background processes produce 10–30% swings that dwarf the change you're hunting. This is the reason naive A/B "it's faster on my machine" claims are worthless. Control it, or measure it and inflate your significance bar.
Single-run delusion. One ns/op vs another ns/op tells you nothing about significance. A 4% "improvement" from a single run is, on most machines, pure noise. The fix is -count + benchstat — and the discipline to report ~ (no difference) when the statistics say so.
Benchmarking concurrency honestly. RunParallel splits the b.N iterations across GOMAXPROCS goroutines and reports wall-clock time per operation, so with perfect scaling the reported ns/op falls in proportion to the core count. Contention (a shared mutex, false sharing, an atomic hot line) shows up as per-core cost (reported ns/op × cores) rising as you add cores. A benchmark that only runs at -cpu=1 hides the exact scalability problem the high-RPS service will hit. The -cpu sweep is where contention becomes visible.
Micro ≠ macro. A function that's 2× faster in isolation may move system p99 by 0% — because it wasn't on the critical path, or the bottleneck is the DB, or GC pressure dominates, or the win is amortized away by batching. A micro number is a hypothesis about the system; the load test is the test of that hypothesis.
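A quick back-of-the-envelope calculation helps decide whether a micro win can matter before you run the load test. The sketch below uses the 80k req/s and 12 ns/op figures from the context section; the coreFraction helper and the 5 µs comparison case are hypothetical. It estimates CPU budget only and says nothing about p99, which still needs the load test.

```go
package main

import "fmt"

// coreFraction returns the fraction of one CPU core consumed (or saved) by
// nsPerOp nanoseconds of work executed rps times per second.
func coreFraction(rps, nsPerOp float64) float64 {
	return rps * nsPerOp / 1e9
}

func main() {
	const rps = 80_000.0

	// 80,000 req/s x 12 ns = 960,000 ns of CPU per second of wall time.
	fmt.Printf("12 ns/op saved at 80k req/s   = %.4f%% of one core\n", coreFraction(rps, 12)*100)

	// Hypothetical larger win for comparison: 5 microseconds per request.
	fmt.Printf("5000 ns/op saved at 80k req/s = %.1f%% of one core\n", coreFraction(rps, 5000)*100)
}
```

The first line prints 0.0960% and the second 40.0%: the same request rate turns a 12 ns saving into a rounding error and a 5 µs saving into a capacity change, which is why the findings note has to state the fraction of request CPU the hot path represents.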

Stages (0 simple → 1 big data → 2 high RPS → 3 both)

Build Stage 0 correct first — a benchmark that lies at Stage 0 lies at every stage. Each stage pushes one axis (input size, then parallelism) before combining them and gating the result in CI.

Stage	Input size	Parallelism	What's measured	Pass criterion
0 · Simple	one fixed `n` (e.g. 1 KB)	`-cpu=1`	A single clean benchmark: `ns/op`, `B/op`, `allocs/op` of one implementation	No DCE (proven via `-gcflags=-m` + sink-removal sanity check); timer reset after setup; CV < 3% over `-count=10` on a quiesced machine
1 · Big data	sweep `n ∈ {1, 64, 1k, 64k}`	`-cpu=1`	`ns/op` and `allocs/op` as a function of n across both implementations	The complexity curve is read off the numbers: an O(n) vs O(n²) gap is visible in the sweep, and per-`n` allocation growth is quantified and explained
2 · High RPS	one fixed `n`	`RunParallel`, `-cpu=1,2,4,8,…`	Aggregate throughput and per-core `ns/op` as cores are added	Scalability characterized: state whether the path scales linearly or where contention (mutex/atomic/false-sharing) flattens or worsens per-core cost, with the inflection point named
3 · Both	large `n` (≥ 64k)	high parallelism (`-cpu` = core count)	Worst realistic load on the hot path + a benchstat-gated CI regression check	A↔B compared with `benchstat` (`p < 0.05`, `±` reported); CI gate fails on a real ≥5% slowdown and does not fire on a no-op PR over 10 trials; findings note links the micro `ns/op` to the measured system p99/RPS
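For Stage 2, the per-core figure has to be derived, because RunParallel reports wall-clock time divided by b.N. The sketch below contrasts a contended counter with a worker-local one and multiplies the reported ns/op by the worker count. The benchmark bodies, perCoreNs, and measure are hypothetical, and main sets GOMAXPROCS directly only because this runs outside go test (where -cpu does the same job). No expected output is shown since the numbers depend entirely on the machine.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"testing"
)

var (
	mu     sync.Mutex
	shared int64 // guarded by mu
	total  int64 // written once per worker via atomic add
)

// benchContended makes every iteration take the same lock.
func benchContended(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			shared++
			mu.Unlock()
		}
	})
}

// benchLocal keeps state per worker and publishes once at the end.
func benchLocal(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		var local int64
		for pb.Next() {
			local++
		}
		atomic.AddInt64(&total, local)
	})
}

// perCoreNs converts the reported wall-clock ns/op into per-worker cost.
func perCoreNs(reportedNsPerOp float64, workers int) float64 {
	return reportedNsPerOp * float64(workers)
}

// measure runs f with the given GOMAXPROCS and returns the reported ns/op.
func measure(f func(*testing.B), workers int) float64 {
	prev := runtime.GOMAXPROCS(workers)
	defer runtime.GOMAXPROCS(prev)
	r := testing.Benchmark(f)
	return float64(r.T.Nanoseconds()) / float64(r.N)
}

func main() {
	for _, p := range []int{1, runtime.NumCPU()} {
		c := measure(benchContended, p)
		l := measure(benchLocal, p)
		fmt.Printf("cpu=%d contended: %.2f ns/op (%.2f per core) local: %.2f ns/op (%.2f per core)\n",
			p, c, perCoreNs(c, p), l, perCoreNs(l, p))
	}
}
```

A path that scales keeps the per-core figure roughly flat as workers are added; a contended path shows it climbing, and the worker count where it starts to climb is the inflection point the pass criterion asks you to name.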

10. Experiments to run (break it / tune it)

Record before/after numbers, the benchstat row (with p-value), and a one-line conclusion for each:

1. Watch DCE fire. Write the "obvious" benchmark that discards the result. Note the absurd ns/op. Add the sink + read; watch ns/op jump to a realistic value. Confirm with go test -gcflags=-m. This is the foundational experiment; do it first.
2. Hoisting demo. Benchmark with a constant input, then with an index-varied input from a slice. Show the constant version is suspiciously fast because the compiler folded it. Quantify the gap.
3. Warm-up / timer placement. Put expensive setup inside the timed loop, then move it out with b.ResetTimer(). Show how much the misplaced setup inflated ns/op, and how StopTimer/StartTimer overhead distorts a cheap benchmark.
4. Variance hunt. Run the same benchmark -count=20 with turbo on vs a fixed performance governor (and idle vs a busy machine). Report the coefficient of variation for each condition. Show that an A/B "win" smaller than your noise is undetectable.
5. A/B done right. Compare implementation A vs B with benchstat at -count=10. Report delta %, ±, and p. Then construct a case where the difference is real but tiny and p > 0.05, and correctly report "no detectable difference."
6. Input-size sweep (complexity). Run both implementations across the n sweep. Plot ns/op and allocs/op vs n. Identify which is O(n) and which is O(n²) from the numbers alone, and verify against the code.
7. Parallel scaling / contention. Run the RunParallel benchmark across the -cpu sweep. Show a contended version (shared mutex/atomic) whose per-core ns/op rises with cores, vs a sharded/local version that stays flat. Name the inflection point.
8. CI gate, true & false positive. Introduce a deliberate 8% slowdown and prove the gate fails the build with its benchstat evidence. Then run the gate on a no-op PR 10 times and prove it never fires. Report the gate's threshold and false-positive rate.
9. Micro↔macro reconciliation. Take the A/B micro delta from experiment 5, drive both versions through the load harness at a fixed RPS, and report the change in system p99 and max sustainable RPS. Explain any gap between the micro win and the macro effect (critical-path fraction, GC, batching, the real bottleneck).

11. Milestones

hotpath/ with A and B; first clean Stage-0 benchmark; DCE proven defeated (experiment 1) and CV measured.
Input-size sweep + allocation accounting (Stage 1, experiments 2, 6); complexity curves read off the numbers.
RunParallel + -cpu sweep (Stage 2, experiment 7); contention characterized.
benchstat A/B discipline + variance control (experiments 3, 4, 5); a result you'd bet on.
CI regression gate with proven true-positive and ~0 false-positive (Stage 3, experiment 8); micro↔macro bridge (experiment 9); findings note.

12. Acceptance criteria (definition of done)

A Stage-0 benchmark with proven no-DCE (sink + -gcflags=-m evidence in the note) and a measured CV < 3% (or a stated, justified higher bar).
B/op and allocs/op reported for every benchmark; any "zero-alloc" claim shows 0 allocs/op.
An input-size sweep that reveals the O(n) vs O(n²) difference from the numbers, with the allocation-growth explanation.
A RunParallel -cpu sweep with per-core ns/op and a named contention inflection point (or a justified "scales linearly" with evidence).
An A/B benchstat report committed, including a case correctly reported as ~ (not significant) — proving you withhold conclusions when the stats say so.
A CI gate that fails on a real ≥5% slowdown (show the failing run) and does not fire on a no-op PR over 10 trials (show the green runs).
A findings note connecting the micro ns/op to the measured system p99/RPS, including any gap and its explanation.
Every number is reproducible from a committed command + the recorded machine context (CPU, GOMAXPROCS, governor, idle state, benchstat version).
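Part of the machine context can be captured programmatically so it lands next to the numbers every time. The sketch below records what the Go runtime exposes and, on Linux, tries to read the frequency governor for CPU 0 from sysfs. The MachineContext type and the capture and readGovernor helpers are hypothetical; CPU model, turbo state, and idle state are not covered and still need to be recorded by other means.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

// MachineContext holds the reproducibility fields available from the runtime.
type MachineContext struct {
	GoVersion  string
	GOOS       string
	GOARCH     string
	NumCPU     int
	GOMAXPROCS int
	Governor   string // Linux only; "unknown" elsewhere
}

// readGovernor reads the cpufreq governor for CPU 0 on Linux.
// It returns "unknown" when the file is missing or unreadable.
func readGovernor() string {
	data, err := os.ReadFile("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
	if err != nil {
		return "unknown"
	}
	return strings.TrimSpace(string(data))
}

// capture snapshots the current process's view of the machine.
func capture() MachineContext {
	return MachineContext{
		GoVersion:  runtime.Version(),
		GOOS:       runtime.GOOS,
		GOARCH:     runtime.GOARCH,
		NumCPU:     runtime.NumCPU(),
		GOMAXPROCS: runtime.GOMAXPROCS(0), // 0 queries without changing the value
		Governor:   readGovernor(),
	}
}

func main() {
	fmt.Printf("%+v\n", capture())
}
```

Writing this struct into the same artifact as old.txt and new.txt means a later reader can tell whether two reports came from comparable machines before comparing their numbers.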

13. Stretch goals

benchstat over time: store baselines per commit and plot the tracked benchmark's trend across the last N merges to catch slow drift (the "boiling frog" regression no single PR trips).
-benchtime sensitivity: run with -benchtime=100x vs -benchtime=1s vs -benchtime=10s and show how fixed-iteration vs fixed-time benchmarking changes variance and what each is good for.
perflock / cpuset isolation: add machine-level isolation (a dedicated benchmark runner, perflock, or isolated cores) and re-measure your CV to show the variance floor your CI gate can actually rely on.
PGO interaction: build with profile-guided optimization and re-benchmark — show how PGO moves inlining decisions and therefore your micro numbers.
Allocation regression gate: extend the CI gate to fail on an allocs/op increase too, not just ns/op — often the earlier and more reliable signal.
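For the allocation regression gate, the standard library's testing.AllocsPerRun gives a direct allocs-per-call figure that can be compared against a budget. The sketch below is one possible shape, assuming a hypothetical zero-allocation hashBytes and a deliberately allocating hashViaString; the exceedsBudget helper and the budget of 0 are illustrative choices, not a prescribed design.

```go
package main

import (
	"fmt"
	"testing"
)

var (
	sink  uint64
	input = []byte("GET /v1/items/42")
)

// hashBytes is an FNV-1a style loop that does not allocate.
func hashBytes(p []byte) uint64 {
	h := uint64(14695981039346656037)
	for _, c := range p {
		h ^= uint64(c)
		h *= 1099511628211
	}
	return h
}

// hashViaString formats the input first, which allocates on every call.
func hashViaString(p []byte) uint64 {
	s := fmt.Sprintf("%x", p)
	return hashBytes([]byte(s))
}

// exceedsBudget reports whether measured allocs per call break the budget.
func exceedsBudget(measured, budget float64) bool {
	return measured > budget
}

func main() {
	const budget = 0 // hypothetical budget: the tracked path must stay zero-alloc

	clean := testing.AllocsPerRun(100, func() { sink = hashBytes(input) })
	dirty := testing.AllocsPerRun(100, func() { sink = hashViaString(input) })

	fmt.Printf("hashBytes:     %.0f allocs/call, over budget: %v\n", clean, exceedsBudget(clean, budget))
	fmt.Printf("hashViaString: %.0f allocs/call, over budget: %v\n", dirty, exceedsBudget(dirty, budget))
}
```

Allocation counts are far less noisy than timings, so a gate like this can often use an exact budget where the ns/op gate needs a percentage threshold and a p-value.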

14. Evaluation rubric

Dimension	Senior bar	Staff bar
Benchmark correctness	Knows to add a sink; benchmark looks right	Proves no DCE via `-gcflags=-m`/disassembly; has seen DCE fire and can spot a fake number on sight
Variance control	Runs `-count` and averages	Measures CV, controls frequency scaling, states the machine context, and sets the significance bar to the noise floor
Statistical reasoning	Uses `benchstat` for A/B	Reads the p-value and `±`; withholds the claim when `p > 0.05`; explains what the test actually tests
Allocation accounting	Reports `allocs/op`	Treats allocations as a first-class signal; knows when the alloc delta predicts the macro win better than `ns/op`
Concurrency benchmarking	Uses `RunParallel`	Sweeps `-cpu`, reads per-core scaling, locates the contention inflection, ties it to the high-RPS path
CI gating	Has a benchmark in CI	Gate fires on real regressions, doesn't fire on noise (proven both ways); threshold justified by measured variance
Micro↔macro judgment	Reports the `ns/op`	Connects (or disconnects) the micro win from the system p99; trusts only statistically-sound numbers and says so when a micro number is meaningless
Communication	Clear findings note	Could defend every delta — and every `~` — to a staff panel

Staff bar in one line: trusts only statistically-sound numbers, knows when a micro number is meaningless, and never ships a perf claim without a p-value and a machine context behind it.

15. References

Go testing package docs: B, b.N, ResetTimer, ReportAllocs, RunParallel, -bench, -benchmem, -benchtime, -count, -cpu.
golang.org/x/perf/cmd/benchstat — installation, output format, the significance test it uses; pin the version you used.
Dave Cheney, "How to write benchmarks in Go" and "High Performance Go Workshop" — DCE, sinks, b.N mechanics, the canonical pitfalls.
Gil Tene on measurement honesty (the load-side companion: project 01, coordinated omission) — the same "your instrument is lying to you" discipline, applied to latency.
Builds on load-testing/01-distributed-load-generator (the macro side of the micro↔macro bridge); feeds load-testing/07-profiling-guided-optimization (benchmarks tell you whether; profiling tells you where).
See also: Interview Question/17-performance-engineering/ (benchmarking, measurement, regression) and Interview Question/01-golang-language-and-runtime/ (compiler optimizations, inlining, escape analysis, the GC the allocations feed).