Optimization Workflow: Middle
1. The loop in detail
The five-step loop introduced at the junior level is correct but skeletal. At the middle level we expand each step into the concrete artifacts that experienced engineers produce.
| Step | Artifact | Verified by |
|---|---|---|
| Goal | Written target with numbers and context | A line in the ticket: "p99 ≤ 30 ms at 500 RPS" |
| Baseline | Saved benchmark output, profile file, or production query | A committed bench/baseline.txt or a saved pprof |
| Hotspot | A specific function name plus a percentage | "render.encode is 38% of CPU and 42% of allocations" |
| Hypothesis | One-sentence theory naming a technique | "Pre-allocating the byte buffer should reduce allocs by half" |
| Verification | benchstat diff, or canary comparison | p < 0.05 and the hotspot dropped out of the top 10 |
Each artifact is something you can show another engineer. If you can't show it, the step didn't happen.
2. Where the data actually lives
A middle-level engineer learns to gather the right data from the right place for the question being asked.
| Question | Best data source |
|---|---|
| Which function is slow? | CPU profile from production (real workload mix) |
| Which line within that function? | pprof -list or pprof -weblist |
| Which allocation is responsible for GC pressure? | Heap profile in -alloc_objects mode |
| Is this lock the bottleneck? | Mutex / block profile |
| Did the optimization help? | Microbenchmark + benchstat |
| Is the system meeting its SLO? | Metrics from the load balancer or APM |
| Is goroutine count growing without bound? | runtime.NumGoroutine() time series |
The two most common mistakes: using a microbenchmark to answer a system question, and using a system metric to compare two implementations of a function.
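Two of the sources in the table need a little setup inside the process. The following is a minimal sketch, assuming you can add a few lines at startup: the mutex and block profiles record nothing until a rate is set, and the goroutine count only becomes a time series if something samples it. The helper names (`enableContentionProfiles`, `profileSize`, `sampleGoroutines`) and the sampling rates are illustrative assumptions, not recommendations.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// enableContentionProfiles switches on the mutex and block profiles,
// which are off by default. The rates below are assumed values.
func enableContentionProfiles() {
	runtime.SetMutexProfileFraction(5)  // report roughly 1 in 5 contention events
	runtime.SetBlockProfileRate(10_000) // roughly one sample per 10µs spent blocked
}

// profileSize serializes a named runtime profile ("mutex", "block",
// "heap", "goroutine", ...) and returns its encoded size in bytes.
func profileSize(name string) (int, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return 0, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 0); err != nil { // debug=0: compressed protobuf
		return 0, err
	}
	return buf.Len(), nil
}

// sampleGoroutines records runtime.NumGoroutine n times, one per interval.
func sampleGoroutines(n int, interval time.Duration) []int {
	samples := make([]int, 0, n)
	for i := 0; i < n; i++ {
		samples = append(samples, runtime.NumGoroutine())
		time.Sleep(interval)
	}
	return samples
}

func main() {
	enableContentionProfiles()
	size, err := profileSize("mutex")
	fmt.Println("mutex profile bytes:", size, "err:", err)
	fmt.Println("goroutine counts:", sampleGoroutines(3, 10*time.Millisecond))
}
```

In a long-running service the same profiles are usually exposed over HTTP by importing net/http/pprof, and the goroutine samples would go to your metrics system rather than to stdout.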
3. CPU vs. memory vs. contention vs. I/O
A workload's bottleneck almost always reduces to one of these four resources. The diagnostic pattern:
| Bottleneck | CPU usage | Latency under load | Diagnostic |
|---|---|---|---|
| CPU-bound | At or near 100% on saturated cores | Scales linearly with load | CPU profile shows hot functions |
| Memory / GC-bound | High, but GC CPU fraction > 20% | Variable; long-tail GC pauses | GODEBUG=gctrace=1, heap profile |
| Contention-bound | Low (cores idle) at high RPS | Latency jumps at low concurrency | Mutex profile shows wait time |
| I/O-bound | Low | Latency dominated by wait time | runtime/trace shows blocking on syscall |
Run a small experiment: scale request concurrency and watch throughput, latency, and CPU together. If throughput plateaus and latency climbs as you add concurrency while CPU stays low, you're contention-bound or I/O-bound, not CPU-bound. If latency rises roughly linearly with concurrency and CPU is pegged, you're CPU-bound. The shape of the curve tells you which tool to reach for next.
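The concurrency experiment can be sketched as a small harness. This is an illustration under stated assumptions, not a load-testing tool: `measure` and `contendedWork` are hypothetical names, and the workload is deliberately contention-bound (a sleep held under one mutex) so the signature is easy to see: throughput stays roughly flat while mean latency grows with concurrency.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// result summarizes one run at a fixed concurrency level.
type result struct {
	Concurrency int
	Throughput  float64       // completed operations per second
	MeanLatency time.Duration // average wall time per operation
}

// measure runs about `total` operations of work, split evenly across
// `concurrency` goroutines, and reports throughput and mean latency.
func measure(concurrency, total int, work func()) result {
	var wg sync.WaitGroup
	sums := make([]time.Duration, concurrency) // one slot per worker, no sharing
	per := total / concurrency
	start := time.Now()
	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < per; i++ {
				t := time.Now()
				work()
				sums[w] += time.Since(t)
			}
		}(w)
	}
	wg.Wait()
	elapsed := time.Since(start)

	var sum time.Duration
	for _, s := range sums {
		sum += s
	}
	done := per * concurrency
	return result{
		Concurrency: concurrency,
		Throughput:  float64(done) / elapsed.Seconds(),
		MeanLatency: sum / time.Duration(done),
	}
}

var mu sync.Mutex

// contendedWork simulates a critical section: every caller serializes
// on the same lock, so extra goroutines only add waiting time.
func contendedWork() {
	mu.Lock()
	time.Sleep(time.Millisecond)
	mu.Unlock()
}

func main() {
	for _, c := range []int{1, 2, 4, 8} {
		r := measure(c, 64, contendedWork)
		fmt.Printf("concurrency=%d throughput=%.0f/s mean latency=%v\n",
			r.Concurrency, r.Throughput, r.MeanLatency)
	}
}
```

Swapping `contendedWork` for a pure computation should give the other shape on a multi-core machine: throughput grows until the cores are saturated, and only then does latency start to climb.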
4. Choosing between a benchmark and a production profile
This is the most common decision at this level, and it is easy to get backward.
| Situation | Reach for |
|---|---|
| You have a candidate optimization and want to know if it helps | Microbenchmark |
| You don't yet know which function is slow | Production profile |
| The problem only appears under realistic load mix | Production profile |
| The problem is reproducible with a fixed input | Microbenchmark |
| You're comparing five implementations of a parser | Microbenchmark |
| The slowness is tied to specific tenants or request shapes | Production profile (with labels) |
Rule of thumb: profile production to find the hotspot, then benchmark to optimize it. Going the other way — benchmarking a function you guessed was slow — wastes engineering hours.
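For the "with labels" row above, the standard library's profiler labels are one way to tie samples to a tenant or request shape. A minimal sketch, assuming a hypothetical `withTenant` wrapper around your request handling; `observedTenant` exists only to show that the label travels with the context.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// withTenant runs work with a "tenant" profiler label attached, so CPU
// samples taken while work runs can be attributed to that tenant.
func withTenant(ctx context.Context, tenant string, work func(context.Context)) {
	pprof.Do(ctx, pprof.Labels("tenant", tenant), work)
}

// observedTenant reports the label visible from inside work.
func observedTenant(tenant string) (string, bool) {
	var got string
	var ok bool
	withTenant(context.Background(), tenant, func(ctx context.Context) {
		got, ok = pprof.Label(ctx, "tenant")
	})
	return got, ok
}

func main() {
	fmt.Println(observedTenant("acme")) // acme true
}
```

Labels are inherited by goroutines started inside `work`, which is what makes them useful for per-tenant breakdowns of a production CPU profile.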
5. Walking a CPU profile
| Command | What it shows |
|---|---|
| top | Flat: time spent in this function only |
| top -cum | Cumulative: time spent in this function and its callees |
| list <fn> | Annotated source code with per-line CPU |
| web | Call graph rendered as SVG in your browser (the flame graph view lives in the -http web UI) |
| traces | Sampled stack traces |
| peek <fn> | Callers and callees of a function |
Read top -cum first to see "where does the program spend its time at a high level", then top to see "which leaf functions are actually working". Flame graphs are excellent for presentations and overview; the text views are faster for sustained analysis.
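To practice the flat versus cumulative distinction you need a profile where the two differ. The sketch below is one way to produce it, assuming made-up functions `parent` and `leaf`: `leaf` does the arithmetic, `parent` only calls it, and both are marked noinline so they stay separate in the profile. `captureCPUProfile` is a hypothetical helper around the standard runtime/pprof calls.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
)

var sink uint64 // keeps the compiler from discarding the work

// leaf does the actual arithmetic, so it should carry most of the flat time.
//
//go:noinline
func leaf(n int) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < n; i++ {
		h = (h ^ uint64(i)) * 1099511628211
	}
	return h
}

// parent mostly calls leaf: high cumulative time, little flat time.
//
//go:noinline
func parent(rounds, n int) {
	for i := 0; i < rounds; i++ {
		sink += leaf(n)
	}
}

// captureCPUProfile profiles fn and returns the encoded profile.
func captureCPUProfile(fn func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	fn()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := captureCPUProfile(func() { parent(200, 1_000_000) })
	if err != nil {
		fmt.Fprintln(os.Stderr, "profile failed:", err)
		os.Exit(1)
	}
	if err := os.WriteFile("cpu.pprof", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote cpu.pprof,", len(data), "bytes")
}
```

Opening the file with go tool pprof and comparing top against top -cum should show `leaf` near the head of the flat listing and `parent` ranked high only in the cumulative one, though exact numbers depend on the machine and sampling.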
6. The benchmark you actually need
Beyond for i := 0; i < b.N; i++ { fn() }, several patterns are essential.
```go
import (
	"fmt"
	"testing"
)

// newEncoder, newMessage, and hash stand in for your own code under test.

// Allocations + time, the default for "is this allocating?"
func BenchmarkEncode(b *testing.B) {
	enc := newEncoder()
	msg := newMessage()
	b.ReportAllocs()
	b.ResetTimer() // exclude the setup above from the timing
	for i := 0; i < b.N; i++ {
		_ = enc.Encode(msg)
	}
}

// Parallel: tests scaling with GOMAXPROCS
func BenchmarkEncodeParallel(b *testing.B) {
	msg := newMessage()
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		enc := newEncoder() // per-goroutine
		for pb.Next() {
			_ = enc.Encode(msg)
		}
	})
}

// Table-driven: compare implementations or input sizes
func BenchmarkHash(b *testing.B) {
	sizes := []int{16, 256, 4096, 65536}
	for _, n := range sizes {
		b.Run(fmt.Sprintf("size=%d", n), func(b *testing.B) {
			data := make([]byte, n)
			b.SetBytes(int64(n)) // enables the MB/s column
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				_ = hash(data)
			}
		})
	}
}
```
b.SetBytes(n) reports MB/s for throughput-oriented work — invaluable when comparing implementations of streaming code.
7. Reading benchstat output
```
$ benchstat old.txt new.txt
name      old time/op    new time/op    delta
Encode-8   412ns ± 2%     298ns ± 1%   -27.67%  (p=0.000 n=10+10)

name      old alloc/op   new alloc/op   delta
Encode-8    256B ± 0%      128B ± 0%   -50.00%  (p=0.000 n=10+10)

name      old allocs/op  new allocs/op  delta
Encode-8    3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.000 n=10+10)
```
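The p value comes from a Mann-Whitney U test on the two sets of runs. p=0.000 is a rounded value: if the change had no real effect, a difference this large would almost never appear by chance. p > 0.05 means the runs cannot distinguish the change from noise, so even a 10% delta with p=0.2 is not evidence of an improvement; collect more runs or reduce variance before drawing a conclusion.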
The ± 2% after each number is the spread across the 10 runs. A noisy benchmark (spread > 5%) should be re-run with -count=20 or fixed by removing variability (background jobs, frequency scaling, thermal throttling).
8. Allocations are a CPU cost
Newcomers often think allocations only matter for memory pressure. They are also a major CPU cost, through two channels:
- The allocator itself runs code on every call.
- The garbage collector does work in proportion to the allocation rate: the faster you allocate, the more often it runs.
For a CPU-bound service, dropping allocs/op from 12 to 3 may improve throughput by about 15%, even if total memory usage barely changes. That's because GC CPU drops from (say) 18% to 5%, so the application's share of CPU rises from 82% to 95%, and 95 / 82 ≈ 1.16.
Always include -benchmem in your benchmark runs. Track allocs as a first-class metric, not a side note.
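Outside a benchmark, testing.AllocsPerRun gives a quick allocation count for a single function. The sketch below compares two made-up encoders, one that grows its output by repeated append and one that pre-allocates; the names, the 300-byte input, and the package-level `sink` (used to force the result onto the heap) are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

var sink []byte // package-level so results escape to the heap

// encodeGrow appends without a capacity hint, so the backing array is
// reallocated several times as the output grows.
func encodeGrow(msg []byte) []byte {
	var out []byte
	for _, c := range msg {
		out = append(out, c)
	}
	return out
}

// encodePrealloc sizes the output once up front.
func encodePrealloc(msg []byte) []byte {
	out := make([]byte, 0, len(msg))
	for _, c := range msg {
		out = append(out, c)
	}
	return out
}

// compareAllocs returns average allocations per call for both versions.
func compareAllocs() (grow, prealloc float64) {
	msg := make([]byte, 300)
	grow = testing.AllocsPerRun(100, func() { sink = encodeGrow(msg) })
	prealloc = testing.AllocsPerRun(100, func() { sink = encodePrealloc(msg) })
	return grow, prealloc
}

func main() {
	grow, prealloc := compareAllocs()
	fmt.Printf("grow: %.0f allocs/op, prealloc: %.0f allocs/op\n", grow, prealloc)
}
```

The exact count for the growing version depends on the runtime's slice growth policy, so treat the numbers as relative; the comparison, not the absolute value, is the point.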
9. The hotspot has dependencies
When you find that function X is 40% of CPU, ask the next question: why is X being called so much? Often the win is not in optimizing X but in calling X less often.
Example. A profile shows json.Marshal is 35% of CPU. The naive fix is "find a faster JSON library." The better fix is often:
| Question | Possible fix |
|---|---|
| Is the same value being marshaled repeatedly? | Cache the marshaled bytes |
| Is half the marshaled payload thrown away? | Use a smaller struct |
| Is this on a path that doesn't need JSON? | Use a different serializer |
| Is this called per-element where it could be per-batch? | Marshal the whole slice once |
The fastest function is the one that doesn't run. Always check whether the bottleneck function can be called less often before trying to speed it up.
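The first row of the table, caching the marshaled bytes, can be sketched as follows. This is a minimal illustration, assuming values stored under a key never change (otherwise the cache serves stale bytes) and that callers treat the returned slice as read-only; `jsonCache` and `user` are hypothetical types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

type user struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// jsonCache memoizes marshaled bytes by key.
type jsonCache struct {
	mu      sync.Mutex
	entries map[string][]byte
	misses  int // number of real json.Marshal calls
}

func newJSONCache() *jsonCache {
	return &jsonCache{entries: make(map[string][]byte)}
}

// Marshal returns the cached encoding for key, calling json.Marshal
// only the first time the key is seen.
func (c *jsonCache) Marshal(key string, v any) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.entries[key]; ok {
		return b, nil
	}
	b, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	c.entries[key] = b
	c.misses++
	return b, nil
}

func main() {
	c := newJSONCache()
	u := user{ID: 1, Name: "ada"}
	for i := 0; i < 3; i++ {
		b, _ := c.Marshal("user:1", u)
		fmt.Println(string(b))
	}
	fmt.Println("marshal calls:", c.misses) // marshal calls: 1
}
```

The cache trades memory and a staleness risk for CPU, and it has no eviction, so a real version would need a size bound and an invalidation rule.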
10. The hierarchy applied
When the hot function is identified, you have a menu of optimizations. Work the menu top down.
| Level | Concrete questions for myFunc |
|---|---|
| Algorithm | Is the asymptotic complexity wrong? Linear scan instead of map? |
| Data structure | Is a slice better than a map here? Sorted insert versus sort-then-scan? |
| Implementation | Are we doing the same work twice? Calling a thing on every iteration that's invariant? |
| Compiler / runtime | Is the function inlinable? Does it allocate where it doesn't need to? |
| Micro-opt | Branch hints, SIMD, hand-unroll |
Only descend if the upper level either (a) is already optimal or (b) you've measured that it isn't the cost.
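As an illustration of the top of the menu, here is the "linear scan instead of map" question from the table in code. It is a sketch with made-up names (`containsLinear`, `buildSet`), not a claim that the map always wins.

```go
package main

import "fmt"

// containsLinear costs O(len(ids)) on every lookup.
func containsLinear(ids []string, id string) bool {
	for _, v := range ids {
		if v == id {
			return true
		}
	}
	return false
}

// buildSet pays O(n) once so that each later lookup is O(1) on average.
func buildSet(ids []string) map[string]struct{} {
	set := make(map[string]struct{}, len(ids))
	for _, v := range ids {
		set[v] = struct{}{}
	}
	return set
}

func main() {
	ids := []string{"a", "b", "c"}
	set := buildSet(ids)

	_, inSet := set["b"]
	fmt.Println(containsLinear(ids, "b"), inSet) // true true

	_, inSet = set["z"]
	fmt.Println(containsLinear(ids, "z"), inSet) // false false
}
```

For a handful of elements the slice scan can be as fast as the map or faster, which is why the data structure row sits right below the algorithm row: benchmark both at your real input sizes before committing.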
11. Reading flame graphs more carefully
A flame graph reads top-down (call depth). The left-to-right order is alphabetical, not chronological, so position along the horizontal axis says nothing about when code ran. The width of each box is proportional to its sample count.
Patterns worth recognizing:
- Tall, narrow stacks: deep call chains, classic of recursion or middleware. Investigate whether the depth itself is the cost.
- Wide, shallow plateaus: hot loops. Inspect the function and its loop body.
- A wide GC tower: see lots of runtime.gcBgMarkWorker or runtime.scanobject? You have allocation pressure, not CPU work.
- A wide runtime.mallocgc: same diagnosis; allocations are the bottleneck.
- A wide runtime.futex / runtime.sysmon / runtime.notetsleep: contention or scheduling, not pure CPU.
The flame graph is best at telling you which family of cost dominates. Use it for triage; use pprof -list for the actual line.
12. The "trade-offs" conversation
Every middle-level optimization decision is also a trade-off decision. Document them explicitly.
| Choice | Cost | Benefit |
|---|---|---|
| Add an LRU cache | Memory, complexity, staleness | Latency, throughput |
| Pre-allocate a large pool | RSS at startup | Lower per-request variance |
| Use sync.Pool | Code complexity, retention quirks | Reduced GC pressure |
| Use unsafe.String/unsafe.Slice | Risk if memory is mutated | Skip one copy |
| Inline a helper | Code duplication | Lets escape analysis stack-allocate |
| Batch operations | Higher per-call latency | Higher throughput |
A change without a documented trade-off is suspicious. Either the trade-off exists and you didn't notice, or the change is too small to matter.
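The sync.Pool row is a good example of a trade-off that is visible in the code itself. A minimal sketch, assuming a hypothetical `render` function that builds a short string through a pooled bytes.Buffer:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. The runtime may drop pooled
// objects at any GC, so the pool is a hint, not a guarantee.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render builds its output in a pooled buffer.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a reused buffer still holds the previous contents
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result outlives the Put
}

func main() {
	fmt.Println(render("ada"))   // hello, ada
	fmt.Println(render("grace")) // hello, grace
}
```

The costs from the table show up as the two lines that are easy to forget: the Reset before use and the copy before Put. Returning `buf.Bytes()` here instead of a copy would hand the caller memory that the next `render` call overwrites.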
13. When to stop iterating
The discipline isn't only knowing how to optimize; it's knowing when to stop. Practical signals:
| Signal | Action |
|---|---|
| The next candidate is < 5% improvement | Stop on this hotspot; find the next |
| You're chasing benchmark variance, not real wins | Stop and improve the benchmark harness |
| The optimization makes the code substantially harder to read | Strongly consider reverting |
| You've met the goal with > 20% margin | Stop |
| Your changes have started reducing readability without measurable gain | Stop and revert the last change |
Knowing when to stop comes from honesty about the data. If benchstat says p=0.18, the win you're seeing is in your head, not the program.
14. The change log entry
A middle-level engineer writes a commit message that future-you, or your replacement, can act on:
```
render: pre-allocate output buffer in Encode

The buffer was growing 4 times per call because we appended without
hinting capacity. The output of `pprof -alloc_objects` showed Encode
at 38% of all allocations.

Before:        After:
412ns ± 2%     298ns ± 1%    -27.7%
256B ± 0%      128B ± 0%     -50.0%
3 allocs       1 alloc

benchstat p=0.000, n=10+10.

We chose to leave the buffer initial capacity at 512 instead of the
worst-case 8192 because the median message is 300 bytes and the
worst case is rare. Worst case still works correctly via append.
```
Numbers, technique, trade-off. Three short paragraphs that pay back five times over the next year.
15. Summary
At the middle level, the loop is unchanged but each step has substance: artifacts you save, profiles you read, benchmark patterns you know by heart. Your toolset is pprof, benchstat, the standard library benchmark patterns, and the four bottleneck categories. The two skills that separate middle from junior: choosing the right data source for the question, and stopping when the goal is met instead of chasing 1% wins.
Further reading
- pprof user guide: https://github.com/google/pprof/blob/main/doc/README.md
- runtime/trace: https://pkg.go.dev/runtime/trace
- Dave Cheney, Five things that make Go fast: https://dave.cheney.net/2014/06/07/five-things-that-make-go-fast
- Damian Gryski, go-perfbook: https://github.com/dgryski/go-perfbook