IR & Middle-End: Optimize
The allocation-reduction playbook, organized as a workflow. The premise: on CPU-bound Go services, the cheapest large wins usually come from removing avoidable heap allocations (less GC work) and from letting the inliner do its job. The middle-end diagnostics tell you exactly where to act.
1. Measure first: alloc profile + -benchmem
Never optimize allocations you haven't measured. Two complementary tools:
# Benchmark with allocation accounting
go test -run=^$ -bench=. -benchmem -count=10 ./... > base.txt
# Heap allocation profile to find WHERE allocations happen
# (profile flags accept a single package, so name one instead of ./...)
go test -run=^$ -bench=BenchmarkHot -memprofile=mem.out .
go tool pprof -alloc_objects mem.out # 'top', 'list Func', 'web'
-benchmem adds B/op and allocs/op columns. allocs/op is the number to drive to zero on hot paths. pprof -alloc_objects ranks call sites by allocation count (use -alloc_space for bytes). Always keep a base.txt to compare against.
2. Read -m: see the decisions
For every hot function from the profile, read the compiler's verdict:
go build -gcflags='-m=2' ./... 2>&1 | grep '<file>.go'
You are looking for:
- moved to heap: x / escapes to heap → an allocation you might remove.
- leaking param: p → the function forces its argument to escape at callers.
- cannot inline f / absence of can inline f → a missed inline.
- flow: lines (at -m=2) → the pointer path that caused an escape; follow it to the root store.
The decision (-m) tells you what to change; the profile tells you what's worth changing. Use them together.
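As a minimal sketch of what these messages correspond to in source, the hypothetical file below has one function per pattern. The names (kept, newCounter, retain, sum) are illustrative only, and the exact wording and position of the diagnostics vary between Go releases.

```go
package main

import "fmt"

// kept is package-level, so anything stored in it outlives the storing frame.
var kept *int

// newCounter returns the address of a local, so x cannot live on the stack.
// -m typically reports: "moved to heap: x".
func newCounter() *int {
	x := 0
	return &x
}

// retain stores its argument in a global.
// -m typically reports: "leaking param: p".
func retain(p *int) {
	kept = p
}

// sum only reads through the pointer, so the caller's array can stay on the stack.
// -m typically reports: "p does not escape".
func sum(p *[4]int) int {
	total := 0
	for _, v := range p {
		total += v
	}
	return total
}

func main() {
	c := newCounter()
	*c += 2
	retain(c)
	arr := [4]int{1, 2, 3, 4}
	fmt.Println(*kept, sum(&arr)) // prints "2 10"
}
```

Building this file with -gcflags=-m and matching each message to its function is a quick way to calibrate before reading the output for a real package.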
3. Keep things on the stack
The highest-leverage moves, roughly in order of payoff:
- Return values, not pointers, for small structs. func New() T over func New() *T keeps the result on the caller's stack whether or not New is inlined; the pointer form avoids the heap only when New inlines and the result does not escape.
- Avoid interface boxing on hot paths. Don't pass hot values through any / error / ...interface{} / fmt. Use typed paths (strconv.Append*, typed logger fields).
- Pre-size slices/maps: make([]T, 0, n), make(map[K]V, n). Stops growslice/rehash and can keep small bounded slices on the stack.
- Don't store locals into longer-lived structures unless they truly must outlive the frame. A store into a global/field that escapes drags the local with it.
- Don't return or store closures you only call synchronously; captured vars stay on the stack then.
- Use array/value receivers where mutation/identity isn't needed, so the receiver isn't forced to the heap.
After each change, re-run -m and confirm the moved to heap line is gone, then -benchmem to confirm allocs/op dropped.
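The first two moves can be checked directly with testing.AllocsPerRun. The sketch below uses hypothetical names (point, newPointPtr, newPointVal, appendID); both constructors are marked //go:noinline on purpose, to model a constructor too large to inline, which is the case where the return type decides whether an allocation happens.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

type point struct{ x, y int }

// Both constructors are noinline to stand in for a constructor that the
// inliner would reject; with inlining, the pointer form can also stay on the stack.

//go:noinline
func newPointPtr(x, y int) *point { return &point{x, y} }

//go:noinline
func newPointVal(x, y int) point { return point{x, y} }

// appendID formats into a caller-owned buffer instead of going through fmt.
func appendID(dst []byte, id int64) []byte {
	dst = append(dst, "id="...)
	return strconv.AppendInt(dst, id, 10)
}

var sink int // package-level sink so results are observably used

func main() {
	ptr := testing.AllocsPerRun(1000, func() {
		p := newPointPtr(1, 2)
		sink = p.x + p.y
	})
	val := testing.AllocsPerRun(1000, func() {
		p := newPointVal(1, 2)
		sink = p.x + p.y
	})
	fmt.Printf("pointer return: %v allocs/op, value return: %v allocs/op\n", ptr, val)

	id := int64(123456)
	boxed := testing.AllocsPerRun(1000, func() {
		sink = len(fmt.Sprintf("id=%d", id)) // boxes id and allocates the result string
	})
	buf := make([]byte, 0, 32) // allocated once, reused for every call below
	typed := testing.AllocsPerRun(1000, func() {
		buf = appendID(buf[:0], id)
		sink = len(buf)
	})
	fmt.Printf("fmt.Sprintf: %v allocs/op, strconv.AppendInt: %v allocs/op\n", boxed, typed)
}
```

The value-returning constructor and the strconv path should both report zero; the exact count for fmt.Sprintf depends on the Go version, so treat it as "at least one" rather than a fixed number.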
4. Help the inliner
Inlining removes call overhead and enables escape analysis, devirtualization, and constant folding in the caller. To get more of it:
- Keep hot functions small. Stay under the ~80 cost budget. Split rare/cold branches into separate functions so the hot function shrinks and inlines.
- Outline the slow path behind //go:noinline so the hot fast path is small and inlinable:
func Get(k Key) (Val, bool) {
	if v, ok := m[k]; ok { // hot, tiny, inlines
		return v, true
	}
	return getSlow(k) // marked //go:noinline below
}

//go:noinline
func getSlow(k Key) (Val, bool) { /* big miss handling */ }
- Avoid inline blockers in hot leaf code: select, heavy defer, very large switch. If a hot function just misses, removing one of these often tips it under budget.
- Don't over-inline. If -l -l style aggressiveness bloats the binary and hurts icache, that's a real regression; measure binary size and benchmarks, not just allocs.
Verify with go build -gcflags=-m 2>&1 | grep 'inlining call to <fn>'.
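A second, self-contained instance of the "split the cold branch out" advice, this time with error construction as the cold part. The stack type and its methods are hypothetical; whether push actually inlines should be confirmed with -gcflags=-m on your toolchain.

```go
package main

import "fmt"

// stack is a fixed-capacity stack of ints (hypothetical example type).
type stack struct {
	buf [8]int
	n   int
}

// push is the hot path: one comparison, one store, one increment.
func (s *stack) push(v int) error {
	if s.n == len(s.buf) {
		return s.overflow(v) // cold: error formatting lives elsewhere
	}
	s.buf[s.n] = v
	s.n++
	return nil
}

// overflow builds the error. Keeping the fmt.Errorf call out of push keeps
// push's own inline cost low.
//
//go:noinline
func (s *stack) overflow(v int) error {
	return fmt.Errorf("stack full (cap %d): dropping %d", len(s.buf), v)
}

func main() {
	var s stack
	for i := 0; i < 9; i++ {
		if err := s.push(i); err != nil {
			fmt.Println(err) // prints "stack full (cap 8): dropping 8"
		}
	}
	fmt.Println("stored:", s.n) // prints "stored: 8"
}
```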
5. Devirtualize the hot interface calls
An interface call is an indirect call: the method address is loaded from the itab at run time, so the compiler cannot inline it. To turn the hot ones direct:
- Make the calling function inlinable so the concrete type flows in (static devirtualization).
- Or accept the concrete type on the hottest path instead of the interface.
- Or use PGO (next section) for profile-guided devirtualization when one dynamic type dominates.
Direct calls then become inline candidates themselves — devirtualization chains into inlining.
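The first two options can be sketched as follows. Shape, Square, totalIface and totalSquares are hypothetical names; the diagnostic mentioned in the comment is what -m typically prints, not a guaranteed string.

```go
package main

import "fmt"

type Shape interface{ Area() int }

type Square struct{ Side int }

func (s Square) Area() int { return s.Side * s.Side }

// totalIface makes one dynamic call per element: the compiler does not know
// which Area it is calling, so it cannot inline it.
func totalIface(shapes []Shape) int {
	total := 0
	for _, s := range shapes {
		total += s.Area()
	}
	return total
}

// totalSquares takes the concrete type: every call is direct and Area is
// small enough to be an inline candidate.
func totalSquares(squares []Square) int {
	total := 0
	for _, s := range squares {
		total += s.Area()
	}
	return total
}

func main() {
	squares := []Square{{1}, {2}, {3}}
	shapes := make([]Shape, len(squares))
	for i, sq := range squares {
		shapes[i] = sq
	}

	// Static devirtualization: the concrete type is visible at the call site,
	// so -m typically reports something like "devirtualizing one.Area to Square".
	var one Shape = Square{3}

	fmt.Println(totalIface(shapes), totalSquares(squares), one.Area()) // prints "14 14 9"
}
```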
6. sync.Pool: when stack isn't possible
If an object legitimately escapes (handed to async I/O, genuinely large, retained across calls) and is reused at high rate, pool it:
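var bufPool = sync.Pool{New: func() any {
	b := make([]byte, 0, 4096)
	return &b // pool a *[]byte: a bare slice is boxed (allocated) on every Put
}}
func handle(w io.Writer) {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0]
	defer func() { *bp = buf; bufPool.Put(bp) }() // return the (possibly grown) buffer
	// ... fill buf, write ...
}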
Tradeoffs to respect before reaching for it:
- The GC may drop pooled objects (since Go 1.13 an unused object survives one collection and is freed by the next); only worth it when reuse between collections is high.
- The Get().(T) assertion and Put bookkeeping cost something; for small objects it can be slower than plain allocation.
- It adds complexity and aliasing hazards (don't keep references to pooled buffers after Put).
Decision order: stack > pre-sized reuse > sync.Pool > plain heap. Benchmark each step.
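The "pre-sized reuse" step in that order is often enough on its own. A minimal sketch, assuming a hypothetical encoder type that owns one scratch buffer for its whole lifetime; the 128 KiB size is chosen only so the buffer is clearly too large for the compiler to place on the stack.

```go
package main

import (
	"fmt"
	"testing"
)

const scratchSize = 128 << 10 // large enough that the buffer must live on the heap

// processFresh allocates a new scratch buffer for every record.
func processFresh(records [][]byte) int {
	sum := 0
	for _, r := range records {
		buf := make([]byte, 0, scratchSize) // one heap allocation per iteration
		buf = append(buf, r...)
		sum += len(buf)
	}
	return sum
}

// encoder owns a single pre-sized buffer and reuses it for every record.
type encoder struct{ scratch []byte }

func newEncoder() *encoder {
	return &encoder{scratch: make([]byte, 0, scratchSize)}
}

func (e *encoder) process(records [][]byte) int {
	sum := 0
	for _, r := range records {
		e.scratch = append(e.scratch[:0], r...) // reset length, keep capacity
		sum += len(e.scratch)
	}
	return sum
}

var sink int

func main() {
	records := [][]byte{[]byte("alpha"), []byte("beta"), []byte("gamma")}
	enc := newEncoder()

	fresh := testing.AllocsPerRun(100, func() { sink = processFresh(records) })
	reuse := testing.AllocsPerRun(100, func() { sink = enc.process(records) })

	fmt.Println("sums:", processFresh(records), enc.process(records)) // prints "sums: 14 14"
	fmt.Printf("fresh: %v allocs/op, reuse: %v allocs/op\n", fresh, reuse)
}
```

The fresh version should report one allocation per record and the reusing version none. The encoder is not safe for concurrent use; that constraint is what sync.Pool relaxes, at the costs listed above.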
7. PGO: free hot-path inlining/devirtualization
For services, collect a production-representative CPU profile and feed it back:
# 1. capture a profile (runtime/pprof or net/http/pprof) under real load
# 2. place it as default.pgo next to the main package (auto-detected),
# or pass explicitly:
go build -pgo=cpu.pprof ./...
# compare before/after, building the benchmarks with the same profile
go test -pgo=cpu.pprof -bench=. -benchmem -count=10 ./... > pgo.txt
benchstat base.txt pgo.txt
PGO raises the inline budget on hot call sites (inlining functions normally too big, only where it pays) and devirtualizes hot interface calls. It's safe (heuristics only, never correctness) and tolerant of slightly stale profiles. Typical gains are a few percent CPU with zero code change — do this before hand-tuning.
8. Benchmarking allocations correctly
func BenchmarkParse(b *testing.B) {
	b.ReportAllocs() // force allocs/op even without -benchmem
	input := makeInput()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = parse(input) // assign to package var to prevent dead-code elimination
	}
}

var sink Result
Rules:
- Assign results to a package-level sink so the optimizer can't delete the work.
- b.ResetTimer() after setup; b.ReportAllocs() to always show allocs.
- Run -count=10 and compare with benchstat; a single run is noise.
- Use testing.AllocsPerRun(n, f) in a unit test to assert zero allocs and fail CI on regressions.
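The last rule might look like the sketch below. splitKV and zeroAllocs are hypothetical names; the Test function is shown in the same file only to keep the example self-contained, and in a real project it would live in a _test.go file.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// splitKV is the function under guard; strings.Cut returns substrings of s,
// so no new memory is needed.
func splitKV(s string) (key, val string) {
	key, val, _ = strings.Cut(s, "=")
	return key, val
}

var sinkLen int

// zeroAllocs returns an error if f averages one or more allocations per call.
func zeroAllocs(name string, f func()) error {
	if n := testing.AllocsPerRun(100, f); n != 0 {
		return fmt.Errorf("%s: got %v allocs/op, want 0", name, n)
	}
	return nil
}

// TestSplitKVZeroAllocs is the CI gate: it fails as soon as splitKV starts allocating.
func TestSplitKVZeroAllocs(t *testing.T) {
	input := "host=example"
	err := zeroAllocs("splitKV", func() {
		k, v := splitKV(input)
		sinkLen = len(k) + len(v)
	})
	if err != nil {
		t.Fatal(err)
	}
}

func main() {
	input := "host=example"
	err := zeroAllocs("splitKV", func() {
		k, v := splitKV(input)
		sinkLen = len(k) + len(v)
	})
	fmt.Println("splitKV allocation check:", err) // prints "splitKV allocation check: <nil>"
}
```

AllocsPerRun reports an integer-valued average, so comparing against 0 is stable; for functions that are allowed a fixed number of allocations, compare against that bound instead.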
9. Checklist
- Profiled first; targeting hot allocations (pprof -alloc_objects), not all of them.
- Read -gcflags='pkg=-m=2' for each hot function; identified moved to heap / missed inlines.
- Returned small structs by value instead of pointer.
- Removed interface boxing (fmt / any) from hot paths.
- Pre-sized slices and maps with known capacity.
- Hoisted per-iteration allocations out of loops; reused buffers with buf[:0].
- Kept hot functions under inline budget; outlined cold paths with //go:noinline.
- Devirtualized hot interface calls (inline the caller / concrete type / PGO).
- Considered sync.Pool only for reused legitimately-escaping objects; benchmarked it.
- Enabled PGO from a production profile; compared with benchstat.
- Pinned results with AllocsPerRun / benchstat gates in CI.
- Confirmed each change with -m (decision) and -benchmem (runtime truth).
10. Summary
- Optimize allocations top-down: profile → -m → fix → -benchmem / benchstat.
- Keep values on the stack (return by value, no boxing, pre-size, don't escape into globals/closures).
- Help the inliner by keeping hot functions small and outlining cold paths; inlining unlocks escape analysis and devirtualization.
- Use sync.Pool only for high-reuse escaping objects, and benchmark it.
- PGO gives a safe few-percent hot-path win with no code change.
- Benchmark allocations honestly: ReportAllocs, a sink, -count=10, benchstat, and CI AllocsPerRun gates.
Further reading
- Diagnostics: pprof, memprofile, trace
- Profile-guided optimization
- golang.org/x/perf/cmd/benchstat
- testing package: ReportAllocs, AllocsPerRun, benchmarks
- Go source: cmd/compile/internal/escape and internal/inline