Benchmarks — Tasks
Hands-on exercises. Do them in order: each one introduces a tool or technique you will need in the next.
Task 1 — Write your first benchmark
Goal. Produce a working benchmark and read its output.
Steps.
- Create a new directory bench-task-01 and run go mod init example/bench01 inside it.
- Add a file sum.go that defines Sum, a function that adds up the elements of a []int.
- Add a file sum_test.go:
    package bench01

    import "testing"

    var input = make([]int, 1000)

    func init() {
        for i := range input {
            input[i] = i
        }
    }

    func BenchmarkSum(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = Sum(input)
        }
    }
- Run go test -bench=. -benchmem.
Deliverable. Paste the output line. Identify:
- The chosen b.N.
- The ns/op.
- The B/op and allocs/op.
Expected observation. allocs/op should be 0 — Sum does not allocate.
Task 2 — Convert to table-driven
Goal. Use b.Run so a single benchmark function exercises many input sizes.
Refactor BenchmarkSum so it runs four sub-benchmarks: sizes 100, 1_000, 10_000, 100_000.
    // Requires "fmt" in the import list of sum_test.go.
    func BenchmarkSum(b *testing.B) {
        sizes := []int{100, 1_000, 10_000, 100_000}
        for _, n := range sizes {
            // Build the input outside b.Run so the allocation is not timed.
            xs := make([]int, n)
            for i := range xs {
                xs[i] = i
            }
            b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
                for i := 0; i < b.N; i++ {
                    _ = Sum(xs)
                }
            })
        }
    }
Run with -benchmem. Verify ns/op scales roughly linearly with n. If it does not, your benchmark has a bug — find it.
Task 3 — Add b.SetBytes
Sum reads n*8 bytes per call (assuming int is 8 bytes). Inside each sub-benchmark, call b.SetBytes(int64(len(xs) * 8)).
Deliverable. Output that now includes a MB/s column. The number should be roughly constant across sizes — that is the bandwidth your CPU can sustain on a tight integer-sum loop. Note the value.
Task 4 — Setup excluded with b.ResetTimer
Build a benchmark whose setup is deliberately slow. Compare with and without b.ResetTimer.
    func BenchmarkWithSetup(b *testing.B) {
        // Expensive setup we do NOT want timed.
        data := make([]byte, 10_000_000)
        for i := range data {
            data[i] = byte(i)
        }

        // Without ResetTimer: setup time inflates ns/op.
        // With ResetTimer: only the inner work is measured.
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            _ = data[i%len(data)]
        }
    }
Run the benchmark twice — once with the b.ResetTimer() line, once without. Report the ratio of ns/op. Explain.
Task 5 — Compare two implementations with benchstat
Goal. Demonstrate a real comparison workflow.
- Install benchstat:
    go install golang.org/x/perf/cmd/benchstat@latest
- Write two implementations of string concatenation:
    package bench05

    import "strings"

    func ConcatPlus(parts []string) string {
        var s string
        for _, p := range parts {
            s += p
        }
        return s
    }

    func ConcatBuilder(parts []string) string {
        var b strings.Builder
        for _, p := range parts {
            b.WriteString(p)
        }
        return b.String()
    }
- Benchmark both with the same input:
    package bench05

    import "testing"

    var parts = make([]string, 100)

    func init() {
        for i := range parts {
            parts[i] = "abcdef"
        }
    }

    func BenchmarkConcatPlus(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            _ = ConcatPlus(parts)
        }
    }

    func BenchmarkConcatBuilder(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            _ = ConcatBuilder(parts)
        }
    }
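- Run each ten times. benchstat pairs results by benchmark name, so rename both to a common name (here with sed) before comparing; otherwise it prints two unrelated rows and no delta:
    go test -bench=ConcatPlus -count=10 -benchmem | sed 's/ConcatPlus/Concat/' > plus.txt
    go test -bench=ConcatBuilder -count=10 -benchmem | sed 's/ConcatBuilder/Concat/' > builder.txt
    benchstat plus.txt builder.txt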
Deliverable. Paste the benchstat output. Identify the percentage delta and the p-value.
Task 6 — Identify a benchmark trap
The following benchmark gives 0.27 ns/op. Explain why and fix it.
    package bench06

    import "testing"

    func square(x int) int { return x * x }

    func BenchmarkSquare(b *testing.B) {
        for i := 0; i < b.N; i++ {
            square(i)
        }
    }
Expected fix. Either assign to a package-level sink:
    var sink int

    func BenchmarkSquare(b *testing.B) {
        var s int
        for i := 0; i < b.N; i++ {
            s = square(i)
        }
        sink = s
    }
Or use for b.Loop() on Go 1.24+. Run both forms; compare ns/op.
Task 7 — RunParallel on a mutex-protected counter
Implement two counters: one with sync.Mutex, one with sync/atomic.Int64. Benchmark both under contention with b.RunParallel.
    package bench07

    import (
        "sync"
        "sync/atomic"
        "testing"
    )

    func BenchmarkMutexCounter(b *testing.B) {
        var (
            mu sync.Mutex
            n  int64
        )
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                mu.Lock()
                n++
                mu.Unlock()
            }
        })
    }

    func BenchmarkAtomicCounter(b *testing.B) {
        var n atomic.Int64
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                n.Add(1)
            }
        })
    }
Run with -cpu=1,2,4,8 to see scaling. Report which scales better.
Task 8 — Profile-collecting run
For the slower of the two counters from Task 7, collect a CPU profile and a mutex profile.
go test -bench=BenchmarkMutexCounter -cpuprofile=cpu.out -mutexprofile=mutex.out -count=1
go tool pprof -top cpu.out
go tool pprof -top mutex.out
Deliverable. The top three entries of each profile. Explain where contention shows up.
Task 9 — Reproducibility experiment
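Run Task 5's benchmark five times on your laptop, each time with -count=10. Save each as run-N.txt. Then run benchstat over all five files, for example:
    benchstat run-1.txt run-2.txt run-3.txt run-4.txt run-5.txt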
Question. Do the means drift across runs? By how much? This is your laptop's noise floor — improvements smaller than this number are statistically indistinguishable.
Task 10 — Stretch goal: noise reduction
If you are on a Linux box:
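- Set the CPU governor to performance (root), for example with cpupower:
    sudo cpupower frequency-set -g performance
- Disable turbo (Intel), for example through the intel_pstate driver:
    echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
- Pin to one physical core, for example with taskset (core 2 is an arbitrary choice):
    taskset -c 2 go test -bench=. -count=10 > pinned.txt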
Compare pinned.txt to a normal run. The reduction in stddev is what professional benchmarkers buy with this setup.
Submission checklist
- Task 1 raw output.
- Task 2 sub-benchmark output for all four sizes.
- Task 3 MB/s numbers and CPU model.
- Task 4 ns/op ratio with and without ResetTimer.
- Task 5 benchstat output.
- Task 6 fixed benchmark + before/after ns/op.
- Task 7 mutex vs atomic numbers under -cpu=1,2,4,8.
- Task 8 top-3 profile entries.
- Task 9 noise floor estimate.
- Task 10 (optional) stddev reduction percentage.