Benchmarking Strategy — Hands-on Tasks
Work through these in order. Each has explicit acceptance criteria. Use Go 1.22+ (1.24+ for b.Loop() tasks).
Task 1: Your first benchmark
Write Sum(xs []int) int and a benchmark for it on a slice of length 10.
Acceptance criteria
- [ ] go test -bench=BenchmarkSum -run=^$ -benchmem reports a finite ns/op.
- [ ] You can name b.N, ns/op, B/op, allocs/op and describe each in one sentence.
- [ ] You change the slice length to 1000 and observe ns/op rising roughly linearly.
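One possible starting point is sketched below. It is meant for a file ending in _test.go; the package name, the sumSink variable, and the slice contents are assumptions made for illustration, not part of the task statement.

```go
package main

import "testing"

// Sum returns the sum of all elements in xs.
func Sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// sumSink keeps the result observable so the compiler cannot discard the call.
var sumSink int

func BenchmarkSum(b *testing.B) {
	xs := make([]int, 10) // change to 1000 for the scaling check
	for i := range xs {
		xs[i] = i
	}
	b.ResetTimer() // exclude slice construction from the measurement
	for i := 0; i < b.N; i++ {
		sumSink = Sum(xs)
	}
}
```

Running go test -bench=BenchmarkSum -run=^$ -benchmem against this file should print one result line containing the iteration count, ns/op, B/op, and allocs/op.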
Task 2: Catch dead-code elimination
Write the benchmark:
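A minimal sketch of such a benchmark follows, assuming the name BenchmarkAdd and the expression 1 + 2 that the criteria below refer to. The exact loop body is an assumption, and BenchmarkAddSink is a hypothetical name for the sink variant, included for comparison.

```go
package main

import "testing"

// BenchmarkAdd discards the result, so the compiler is free to drop the
// work from the loop body. Expect a suspiciously small ns/op.
func BenchmarkAdd(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = 1 + 2
	}
}

// sink is package-level, so a store to it is an observable side effect.
var sink int

// BenchmarkAddSink stores the result on every iteration.
func BenchmarkAddSink(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = 1 + 2
	}
}
```

Because 1 + 2 is a constant expression, the compiler folds it to 3 at compile time, so what the disassembly is likely to show in the sink version is a store of that constant rather than an add instruction.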
Acceptance criteria
- [ ] The reported ns/op is under 1 ns.
- [ ] You add a package-level var sink int and assign sink = 1 + 2. The new number is meaningfully larger.
- [ ] You disassemble the test binary (go test -c, then go tool objdump -s 'BenchmarkAdd' on the resulting .test binary) and confirm the work appears in the inner loop only with the sink. With constant operands the compiler may fold the addition, in which case the work shows up as a store of the constant 3 rather than an add instruction.
- [ ] You rewrite using for b.Loop() (Go 1.24+) and confirm the work also survives.
Task 3: Compare two implementations with benchstat
Implement string concatenation two ways: with += and with strings.Builder (preallocated with Grow).
Acceptance criteria
- [ ] You write BenchmarkPlus and BenchmarkBuilder over the same input.
- [ ] You run each with -count=10 -run=^$ and save output to plus.txt and builder.txt.
- [ ] benchstat plus.txt builder.txt produces a comparison with a p value. benchstat pairs results by benchmark name, so give both results a common name in the saved files (for example with sed) before comparing.
- [ ] You note in writing whether the delta is statistically significant (p < 0.05).
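The two implementations might look like the sketch below. The helper names concatPlus and concatBuilder, the words input, and strSink are hypothetical; only BenchmarkPlus, BenchmarkBuilder, and the use of Grow come from the task.

```go
package main

import (
	"strings"
	"testing"
)

// words is a hypothetical shared input; both benchmarks must use the same data.
var words = strings.Fields(strings.Repeat("alpha beta gamma delta ", 64))

// concatPlus joins parts with +=, which builds a new string on each step.
func concatPlus(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder joins parts with a strings.Builder sized up front via Grow.
func concatBuilder(parts []string) string {
	n := 0
	for _, p := range parts {
		n += len(p)
	}
	var sb strings.Builder
	sb.Grow(n)
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// strSink keeps the results observable.
var strSink string

func BenchmarkPlus(b *testing.B) {
	for i := 0; i < b.N; i++ {
		strSink = concatPlus(words)
	}
}

func BenchmarkBuilder(b *testing.B) {
	for i := 0; i < b.N; i++ {
		strSink = concatBuilder(words)
	}
}
```

Adding -benchmem to both runs also lets benchstat compare B/op and allocs/op, which is usually where the two approaches differ most visibly.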
Task 4: Sub-benchmarks for an input-size sweep
Pick a parser (e.g., json.Unmarshal into a small struct). Build inputs at sizes 16, 256, 4096, 65536 bytes.
Acceptance criteria
- [ ] One BenchmarkParse uses b.Run to produce four sub-benchmarks.
- [ ] Each calls b.SetBytes(int64(n)) so output includes MB/s.
- [ ] You plot ns/op vs size on log-log paper (or in a spreadsheet) and identify the slope.
- [ ] You describe in one sentence what the slope tells you about per-call overhead vs per-byte cost.
Task 5: Parallel benchmark for a shared map
Build two read-mostly maps protected differently: one sync.RWMutex around map[string]int, one sync.Map.
Acceptance criteria
- [ ] Each has a BenchmarkXxx using b.RunParallel.
- [ ] You run with -cpu=1,2,4,8 -count=10 and observe how ns/op changes with concurrency.
- [ ] You report which scales better and at what -cpu value the difference becomes significant.
- [ ] You add a write workload (10% writes) and re-run; comment on how scaling changes.
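The read-only half of the task could be set up roughly as follows. The rwMap type, the key set, and the benchmark names are hypothetical, and the 10% write workload is left out.

```go
package main

import (
	"strconv"
	"sync"
	"testing"
)

const numKeys = 1024

// keys is precomputed so key formatting is not part of the measurement.
var keys = func() []string {
	ks := make([]string, numKeys)
	for i := range ks {
		ks[i] = "key-" + strconv.Itoa(i)
	}
	return ks
}()

// rwMap is a plain map guarded by a sync.RWMutex.
type rwMap struct {
	mu sync.RWMutex
	m  map[string]int
}

func newRWMap() *rwMap {
	r := &rwMap{m: make(map[string]int, numKeys)}
	for i, k := range keys {
		r.m[k] = i
	}
	return r
}

func (r *rwMap) Load(k string) (int, bool) {
	r.mu.RLock()
	v, ok := r.m[k]
	r.mu.RUnlock()
	return v, ok
}

func newSyncMap() *sync.Map {
	m := new(sync.Map)
	for i, k := range keys {
		m.Store(k, i)
	}
	return m
}

func BenchmarkRWMutexMap(b *testing.B) {
	m := newRWMap()
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		i := 0 // per-goroutine counter, so goroutines share no state besides the map
		for pb.Next() {
			m.Load(keys[i%numKeys])
			i++
		}
	})
}

func BenchmarkSyncMap(b *testing.B) {
	m := newSyncMap()
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			m.Load(keys[i%numKeys])
			i++
		}
	})
}
```

RunParallel starts GOMAXPROCS goroutines by default, and the -cpu flag sets GOMAXPROCS for each run, which is what makes the -cpu=1,2,4,8 sweep meaningful.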
Task 6: Allocation regression budget
Pick a function on a hot path of any small service or library you wrote. Write a benchmark that exercises it with b.ReportAllocs().
Acceptance criteria
- [ ] You record the current allocs/op.
- [ ] You add a TestXxxAllocBudget that calls testing.Benchmark(BenchmarkXxx) and t.Fatals if AllocsPerOp() exceeds your budget.
- [ ] You deliberately introduce an extra allocation (e.g., a fmt.Sprintf in the hot path) and confirm the test fails.
- [ ] You revert and confirm the test passes again.
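The budget test can be sketched as below. appendID is a hypothetical stand-in for your own hot-path function, and the budget of 0 is an assumption that fits this stand-in only; your recorded baseline determines the real value.

```go
package main

import (
	"strconv"
	"testing"
)

// appendID is a hypothetical hot-path function: it appends the decimal form
// of id to dst and returns the extended slice.
func appendID(dst []byte, id uint64) []byte {
	return strconv.AppendUint(dst, id, 10)
}

var idSink []byte

func BenchmarkAppendID(b *testing.B) {
	b.ReportAllocs()
	buf := make([]byte, 0, 32) // reused across iterations
	b.ResetTimer()             // do not count the buffer allocation itself
	for i := 0; i < b.N; i++ {
		buf = appendID(buf[:0], uint64(i))
	}
	idSink = buf
}

// allocBudget is the maximum allocs/op tolerated.
const allocBudget = 0

func TestAppendIDAllocBudget(t *testing.T) {
	res := testing.Benchmark(BenchmarkAppendID)
	if got := res.AllocsPerOp(); got > allocBudget {
		t.Fatalf("allocs/op = %d, budget = %d", got, allocBudget)
	}
}
```

testing.Benchmark runs the benchmark for the default benchmark time, so this test takes on the order of a second; that cost is the trade-off for reusing the benchmark unchanged.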
Task 7: Profile-driven benchmark
Take a small Go service you've written. Generate a CPU profile under realistic load via /debug/pprof/profile?seconds=30.
Acceptance criteria
- [ ] You run go tool pprof -top cpu.pprof and identify the top 3 functions by flat time.
- [ ] For each, you write a benchmark that calls it with inputs shaped like the profile suggests.
- [ ] You collect baseline -count=10 numbers for all three.
- [ ] You write a short paragraph (5–8 lines) explaining why these specific benchmarks were chosen.
Task 8: Setup bias
Take the benchmark from Task 7 and deliberately introduce setup bias: move the input construction inside the b.N loop without StopTimer.
Acceptance criteria
- [ ] You measure the biased version and the corrected version side by side.
- [ ] You compute the ratio of biased / corrected ns/op.
- [ ] You write a comment in the file explaining what was being measured incorrectly.
- [ ] You verify with -cpuprofile that the bias was indeed in the setup code, not the function under test.
Task 9: Statistical noise floor
On your normal development machine, run any one benchmark from Task 4 with -count=30.
Acceptance criteria
- [ ] You compute the coefficient of variation (σ/μ) across the 30 samples.
- [ ] You repeat after pinning the CPU governor to performance (Linux) or disabling background apps (macOS).
- [ ] You report the before/after CoV.
- [ ] You explain in one sentence what the smallest detectable improvement on your machine would be.
Task 10: CI integration
Add a GitHub Actions (or GitLab CI) workflow that runs benchmarks on every PR and posts a benchstat diff as a comment.
Acceptance criteria
- [ ] The workflow checks out the base commit, runs -bench=. -count=10 -benchmem -run=^$, saves output.
- [ ] The workflow checks out the head commit, runs the same command, saves output.
- [ ] It runs benchstat base.txt head.txt and posts the result.
- [ ] A test PR demonstrates the comment showing up.
- [ ] You write a one-paragraph note about why a self-hosted runner is or isn't worth it for your project.
Task 11: Branch predictor experiment
Write a benchmark that counts elements above 128 in a slice of 1024 ints.
Acceptance criteria
- [ ] Run with sorted input (0..1023), random input, and reverse-sorted input.
- [ ] Each run uses -count=10.
- [ ] You report the ns/op for all three.
- [ ] You explain the variation in one paragraph, referencing branch prediction.
- [ ] You note which input shape matches your production data.
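A sketch of the three input shapes as sub-benchmarks follows. It assumes the random input is a shuffled permutation of 0..1023, so all three shapes hold the same values; the shape type and helper names are hypothetical.

```go
package main

import (
	"math/rand"
	"sort"
	"testing"
)

// countAbove returns how many elements of xs exceed threshold.
func countAbove(xs []int, threshold int) int {
	n := 0
	for _, x := range xs {
		if x > threshold {
			n++
		}
	}
	return n
}

// shape pairs an input ordering with its sub-benchmark name.
type shape struct {
	name string
	xs   []int
}

// shapes returns the same 1024 values in sorted, shuffled, and reversed order.
func shapes() []shape {
	random := rand.New(rand.NewSource(1)).Perm(1024)
	sorted := append([]int(nil), random...)
	sort.Ints(sorted)
	reversed := append([]int(nil), sorted...)
	sort.Sort(sort.Reverse(sort.IntSlice(reversed)))
	return []shape{{"sorted", sorted}, {"random", random}, {"reversed", reversed}}
}

var countSink int

func BenchmarkCountAbove(b *testing.B) {
	for _, s := range shapes() {
		b.Run(s.name, func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				countSink = countAbove(s.xs, 128)
			}
		})
	}
}
```

Two caveats are worth checking before interpreting the numbers. The compiler may turn the if into branch-free code, in which case the three shapes can measure about the same; the disassembly will show whether a conditional jump survives. Also, with values 0..1023 and a threshold of 128, only about one element in eight fails the test, so the branch is fairly predictable even on shuffled input.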
Task 12: PGO-aware benchmark
Build a small program, collect a CPU profile from one workload, and rebuild with -pgo=profile.pprof.
Acceptance criteria
- [ ] You benchmark a function from that program with and without PGO.
- [ ] You report the delta via benchstat.
- [ ] You explain in writing whether the gain matches Go's documented PGO range (2–14%) and why your number falls where it does.
Stretch — Task 13: Zero-allocation kernel
Pick a small kernel (e.g., binary header parser, fixed-format log line writer, ring-buffer push). Drive it to 0 allocs/op.
Acceptance criteria
- [ ] Benchmark with -benchmem reports 0 allocs/op.
- [ ] Each technique used (preallocated output buffer, byte-slice in/out, no interfaces, no fmt.*) has a comment explaining why.
- [ ] A TestKernelAllocBudget test enforces zero allocations in CI.
- [ ] You run benchstat against an "easy" version (e.g., the same kernel using fmt.Sprintf) and report the delta.
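As one illustration, a binary header parser could be shaped like the sketch below. The 8-byte wire layout, the header type, and errShort are all hypothetical, and the budget test uses testing.AllocsPerRun as a quicker alternative to running the full benchmark.

```go
package main

import (
	"encoding/binary"
	"errors"
	"testing"
)

// header is a hypothetical 8-byte wire header: version, flags, two reserved
// bytes (ignored), then a big-endian uint32 payload length.
type header struct {
	Version uint8
	Flags   uint8
	Length  uint32
}

const headerSize = 8

// errShort is a preallocated sentinel, so the error path does not allocate.
var errShort = errors.New("header: buffer too short")

// parseHeader takes a byte slice and fills a caller-provided struct, so the
// function itself has nothing to place on the heap.
func parseHeader(buf []byte, h *header) error {
	if len(buf) < headerSize {
		return errShort
	}
	h.Version = buf[0]
	h.Flags = buf[1]
	h.Length = binary.BigEndian.Uint32(buf[4:8])
	return nil
}

func BenchmarkParseHeader(b *testing.B) {
	buf := []byte{1, 2, 0, 0, 0, 0, 1, 0}
	var h header
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if err := parseHeader(buf, &h); err != nil {
			b.Fatal(err)
		}
	}
}

func TestKernelAllocBudget(t *testing.T) {
	buf := []byte{1, 2, 0, 0, 0, 0, 1, 0}
	var h header
	allocs := testing.AllocsPerRun(1000, func() { _ = parseHeader(buf, &h) })
	if allocs != 0 {
		t.Fatalf("allocs per run = %v, want 0", allocs)
	}
}
```

Passing the destination struct by pointer rather than returning a new value is a design choice made here to keep ownership with the caller; returning a small struct by value would typically stay off the heap as well.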
Submission
Each task should produce:
- A short writeup (5–15 lines) of what you observed.
- The code you ran or modified.
- The benchmark or benchstat output that backs your conclusions.
These artifacts are what turn "I ran go test -bench" into "I can defend a performance claim with data".