Profile-Guided Optimization (PGO): Optimize

1. Goal of this file

This file is about extracting more from PGO than the 2–5 % most teams get by default. The premise: you already have a working PGO pipeline and want to push the gain into the 5–10 % range. The levers are:

- Make the profile more representative.
- Make the workload more PGO-friendly.
- Pair PGO with complementary build flags.
- Audit the resulting binary to confirm the wins happened where you expected.

2. Representativeness: the dominant factor

Almost all "PGO gave me 1 %" stories trace back to the profile not matching the workload. Steps to fix:

| Step | Why |
| --- | --- |
| Capture at peak hour | Hot paths at peak differ from those at 03:00 |
| Capture for ≥ 60 seconds | < 30 s misses tail behavior |
| Capture from multiple pods and merge | Single pod is noisy; one slow client can skew |
| Exclude deploy/restart window | Init code is not your steady-state path |
| Re-capture after major refactors | Stale profile leaves money on the table |

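Combine all five: a profile captured at peak hour, at least 60 seconds long, merged from many instances, taken after warm-up, and refreshed after major refactors. This matches the Go team's PGO documentation, which recommends representative production profiles and merging profiles from several instances.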
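The capture step can be sketched in Go as follows. This is a minimal sketch, not a definitive tool: it assumes each pod exposes the standard net/http/pprof endpoint on a reachable address, and the pod URLs, the `pod-N.pprof` file names, and the 60-second window are illustrative choices.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"sync"
	"time"
)

// profileURL builds the CPU-profile endpoint served by net/http/pprof.
func profileURL(base string, seconds int) string {
	return fmt.Sprintf("%s/debug/pprof/profile?seconds=%d", base, seconds)
}

// fetchProfile downloads one CPU profile from base and writes it to path.
func fetchProfile(ctx context.Context, base, path string, seconds int) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, profileURL(base, seconds), nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: unexpected status %s", base, resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	pods := os.Args[1:] // hypothetical base URLs, e.g. http://10.0.0.1:6060
	if len(pods) == 0 {
		fmt.Println("usage: capture <pod-base-url>...")
		return
	}
	const seconds = 60
	// Allow the capture window plus some slack for the transfer.
	ctx, cancel := context.WithTimeout(context.Background(), (seconds+30)*time.Second)
	defer cancel()

	// Capture all pods over the same window so the profiles are comparable.
	var wg sync.WaitGroup
	for i, pod := range pods {
		wg.Add(1)
		go func(i int, pod string) {
			defer wg.Done()
			path := fmt.Sprintf("pod-%d.pprof", i)
			if err := fetchProfile(ctx, pod, path, seconds); err != nil {
				fmt.Fprintln(os.Stderr, "capture failed:", err)
			}
		}(i, pod)
	}
	wg.Wait()
}
```

The resulting files can then be merged with `go tool pprof -proto pod-*.pprof > default.pgo`, and the merged file placed in the main package directory so that `-pgo=auto` picks it up.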

3. Sample count is what makes the profile stable

The compiler decides what is hot from sample counts, so the number of samples determines how stable its decisions are. A profile with 50 samples is statistically noisy; a profile with 50 000 samples is stable.

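| Capture window | Approx. samples (single pod, about one busy core at the default 100 Hz) | Stability |
| --- | --- | --- |
| 10 s | ~1 000 | Very noisy |
| 30 s | ~3 000 | Noisy |
| 60 s | ~6 000 | Usable |
| 5 min | ~30 000 | Stable |
| 5 pods × 60 s merged | ~30 000 | Stable |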

Aim for 30 000+ aggregated samples. Below that, inlining decisions may flip between profile refreshes for no reason other than sampling noise.

4. Workload-side optimization: write PGO-friendly code

Some code patterns offer more for PGO to optimize than others. To maximize gains:

| Pattern | Why PGO helps |
| --- | --- |
| Interface call with one dominant implementation | Devirtualization opportunity |
| Small accessor functions called millions of times | Inlining removes the call overhead |
| Hot loops calling stable utility functions | Cross-package inlining |
| Generic functions with one heavy instantiation | Per-instantiation inlining |

Anti-patterns where PGO has nothing to do:

| Pattern | Why PGO doesn't help |
| --- | --- |
| cgo-bound hot path | C code not compiled by Go |
| Reflection-based dispatch (`reflect.Value.Call`) | Not statically resolvable |
| Plugin / shared-library code | Different compilation unit |
| Allocation-bound (GC-dominated) | GC is unaffected by PGO |
| Already monomorphic, single function | Nothing to specialize |

If your hot path matches one of these anti-patterns, PGO will tell you so by giving you 0 %. Move on to other optimizations: allocation reduction, SIMD (hand-written assembly, with CPU feature detection from golang.org/x/sys/cpu), or fewer syscalls.

5. Devirtualization wins: how to set them up

Devirtualization is the highest-leverage PGO mechanism. Set up your code so it can fire:

// Good for PGO: one dominant impl at runtime
type Cache interface { Get(string) []byte }

func handle(c Cache, k string) []byte {
    return c.Get(k)
}

// At runtime: 95% of calls pass *RedisCache → PGO devirtualizes

Anti-pattern:

// Bad for PGO: roughly equal mix of three impls
// → compiler will not devirtualize

If you have legitimate polymorphism (multiple impls used roughly equally), PGO can't help that call site. Consider whether a generic function or a switch over a typed enum would let you avoid the interface entirely.
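To make the mechanism concrete, the sketch below pairs the `handle` function from above with a hand-written approximation of what PGO devirtualization produces: a type check for the dominant implementation, a direct (and therefore inlinable) call on the fast path, and the ordinary interface call as fallback. `RedisCache`, `MemCache`, and `handleDevirtualized` are hypothetical and exist only for illustration; the compiler performs this transformation internally rather than in source.

```go
package main

import "fmt"

type Cache interface{ Get(string) []byte }

// RedisCache stands in for the dominant implementation (hypothetical).
type RedisCache struct{ data map[string][]byte }

func (r *RedisCache) Get(k string) []byte { return r.data[k] }

// MemCache stands in for a rarely used implementation (hypothetical).
type MemCache struct{ data map[string][]byte }

func (m *MemCache) Get(k string) []byte { return m.data[k] }

// handle is the code as written: an indirect call through the interface.
func handle(c Cache, k string) []byte {
	return c.Get(k)
}

// handleDevirtualized approximates the shape PGO gives the hot call site.
func handleDevirtualized(c Cache, k string) []byte {
	if rc, ok := c.(*RedisCache); ok {
		return rc.Get(k) // direct call; eligible for inlining
	}
	return c.Get(k) // fallback: regular interface dispatch
}

func main() {
	redis := &RedisCache{data: map[string][]byte{"k": []byte("from-redis")}}
	mem := &MemCache{data: map[string][]byte{"k": []byte("from-mem")}}
	for _, c := range []Cache{redis, mem} {
		fmt.Printf("%s %s\n", handle(c, "k"), handleDevirtualized(c, "k"))
	}
}
```

Both functions return the same result for every implementation; the difference is only in how the dominant case is dispatched. This also shows why an even mix of implementations gains nothing: no single type check would hit often enough to pay for itself.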

6. The interplay with `-gcflags`

-gcflags and -pgo are independent but interact.

| Flag combo | Effect |
| --- | --- |
| `-pgo=auto` (default) | PGO applies; inlining active |
| `-pgo=auto -gcflags="all=-l"` | Inlining off in every package; PGO mostly inert. Without `all=`, only the packages named on the command line are affected |
| `-pgo=auto -gcflags="all=-N -l"` | Both inlining and optimization off; PGO inert. Debug only. |
| `-pgo=auto -gcflags="-m=2"` | PGO applies; you see inline decisions in stderr |
| `-pgo=auto -gcflags="-d=pgodebug=1"` | PGO applies; compiler prints debug info about PGO choices |

For audit work:

go build -pgo=auto -gcflags='-m=2 -d=pgodebug=1' ./cmd/myapp 2> pgo-trace.txt

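pgodebug=1 makes the compiler report where the profile influenced a decision, such as raised inline budgets and devirtualized calls; the exact output varies by Go version. As written, the flags cover only the packages matched by ./cmd/myapp; use an all= prefix or a package pattern in -gcflags to include dependencies.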

7. Pairing PGO with bench-driven optimization

PGO is not a substitute for hand optimization; it is a multiplier. The workflow:

1. Profile and bench the service as-is.
2. Find the top three CPU hot spots.
3. Hand-optimize each (allocation reduction, algorithm change, sync.Pool, etc.).
4. Re-bench, confirm wins.
5. Then enable PGO.

PGO on top of hand-tuned code typically yields the higher end of the gain range (7–10 %). PGO on top of un-optimized code is muddled by larger inefficiencies and looks smaller (2–4 %).

Order matters: optimize first, PGO last.
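Steps 3 and 4 of the workflow can be sketched as below: a before/after pair is measured with `testing.Benchmark` so the hand optimization is confirmed before PGO enters the picture. The `render` and `renderPooled` functions are hypothetical stand-ins for a real hot spot, and the sketch makes no claim about which one wins on your hardware.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"testing"
)

// render is the hypothetical "before" version: a fresh buffer per call.
func render(name string) string {
	var buf bytes.Buffer
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}

// bufPool recycles buffers between calls to cut allocations.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// renderPooled is the hypothetical "after" version using sync.Pool.
func renderPooled(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.WriteString("hello, ")
	buf.WriteString(name)
	s := buf.String() // copies the bytes, so the buffer can be reused
	bufPool.Put(buf)
	return s
}

func main() {
	cases := []struct {
		name string
		fn   func(string) string
	}{
		{"render", render},
		{"renderPooled", renderPooled},
	}
	for _, c := range cases {
		res := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				c.fn("gopher")
			}
		})
		// Prints iterations, ns/op, B/op, and allocs/op for each variant.
		fmt.Printf("%-13s %s\t%s\n", c.name, res.String(), res.MemString())
	}
}
```

In a real service the same comparison would normally live in `_test.go` files and be run with `go test -bench`, once before and once after enabling PGO, so the two effects can be told apart.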

8. Generic instantiation effects

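Generic functions are instantiated per type-argument combination (in the current gc toolchain, per GC shape, so all pointer type arguments share one compiled body). Each instantiation is a distinct symbol from PGO's perspective.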

func Encode[T any](v T) []byte { ... }

// At runtime: Encode[User] is hot, Encode[Order] is cold, Encode[int] is unused.
// PGO will inline Encode[User] aggressively, leave the others alone.

To benefit:

- Capture a profile after you have a stable set of instantiations.
- If you add a hot instantiation later, refresh the profile.

If you have many cold instantiations and one hot one, your binary doesn't bloat unfairly: PGO budget increases apply only to the hot instantiation.
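A runnable version of the example makes the instantiations visible. The JSON body of `Encode` and the `User` and `Order` types are assumptions, since the original elides them; the point is only that each distinct struct type argument yields its own instantiation, while pointer type arguments typically share one in the current gc toolchain.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// User and Order are hypothetical payload types with different layouts.
type User struct{ Name string }
type Order struct{ ID int }

// Encode is a stand-in body for the elided function in the text.
func Encode[T any](v T) []byte {
	b, err := json.Marshal(v)
	if err != nil {
		return nil
	}
	return b
}

func main() {
	// Two struct types with different layouts: two separate instantiations,
	// so a profile can mark one hot and the other cold.
	fmt.Printf("%s\n", Encode(User{Name: "ada"})) // {"Name":"ada"}
	fmt.Printf("%s\n", Encode(Order{ID: 7}))      // {"ID":7}

	// Two pointer types: these usually share one compiled body, so the
	// profile cannot treat them differently.
	fmt.Printf("%s\n", Encode(&User{Name: "ada"})) // {"Name":"ada"}
	fmt.Printf("%s\n", Encode(&Order{ID: 7}))      // {"ID":7}
}
```

If the hot and cold callers both pass pointers, passing values (or splitting the function) is one way to give PGO separate instantiations to work with; whether that is worth the copy cost needs a benchmark.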

9. PGO + `-buildmode` choices

| `-buildmode` | PGO support |
| --- | --- |
| `default` (exe) | Yes |
| `pie` | Yes |
| `c-archive` / `c-shared` | Yes for the Go side |
| `plugin` | Yes for the plugin; profile must come from the host process running the plugin |
| `shared` | Yes |

The interesting case is plugins: a profile captured from a host that loads many plugins can still drive optimization of each plugin independently, but you need the captured stacks to actually enter the plugin code paths.

10. Auditing the result

After enabling PGO, prove the wins happened where you expected.

10.1 List inlined functions

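go build -gcflags='-m=2' -pgo=off  ./cmd/myapp 2>off.txt
go build -gcflags='-m=2' -pgo=auto ./cmd/myapp 2>on.txt
diff <(grep -E "can inline|inlining call to" off.txt | sort) <(grep -E "can inline|inlining call to" on.txt | sort) | head -50

The hot functions should appear only on the PGO side of the diff (lines prefixed with >). The process substitution requires bash or a compatible shell. The flags as written cover only the packages matched by ./cmd/myapp; if the hot code lives in other packages, widen -gcflags with an all= prefix or a package pattern.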

10.2 Look at machine code

go tool objdump -s 'myapp/internal/hot.Process' ./bin/myapp | head -40

If Process is small and hot, you may not find it in the objdump output at all: it has most likely been inlined into all its callers, leaving no standalone copy in the binary.

10.3 PGO debug log

go build -gcflags='-d=pgodebug=1' -pgo=auto ./cmd/myapp 2>pgo.log
grep -E '(inlining|devirtualizing)' pgo.log | head

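The matching lines look roughly like this (illustrative only; the exact wording and position format differ between Go versions):

pgo: inlining call to hot.Process at hot/handler.go:42
pgo: devirtualizing iface call to (*RedisCache).Get at hot/lookup.go:18

That is the smoking gun: PGO actually fired on the sites you cared about.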

11. Binary-size budget

PGO grows the binary modestly. Track:

ls -l ./bin/myapp.no-pgo ./bin/myapp.pgo
go tool nm -size -sort=size ./bin/myapp.pgo | head -20

Expected delta: +1 to +3 %. Anomalies:

| Delta | Interpretation |
| --- | --- |
| < 1 % growth | Inlining had little to do (small wins) |
| 1–3 % growth | Typical, healthy |
| 3–10 % growth | Aggressive inlining; usually OK |
| > 10 % growth | Profile likely too narrow (over-fitted to one path) |

If size growth is large, capture a broader profile (longer window, more pods).
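The size check is easy to automate, for example as a CI step. The sketch below applies the thresholds from the table above; the function names and verdict strings are hypothetical, and the two binary paths are passed as arguments.

```go
package main

import (
	"fmt"
	"os"
)

// growthPercent returns the size increase of the PGO binary in percent.
func growthPercent(base, pgo int64) float64 {
	if base == 0 {
		return 0
	}
	return float64(pgo-base) * 100 / float64(base)
}

// classify maps a growth percentage onto the table's interpretation.
func classify(pct float64) string {
	switch {
	case pct < 1:
		return "little inlining change"
	case pct <= 3:
		return "typical"
	case pct <= 10:
		return "aggressive inlining"
	default:
		return "profile may be too narrow"
	}
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: sizecheck <no-pgo-binary> <pgo-binary>")
		return
	}
	base, err := os.Stat(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	pgo, err := os.Stat(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	pct := growthPercent(base.Size(), pgo.Size())
	fmt.Printf("%+.2f%% (%s)\n", pct, classify(pct))
}
```

Run it as `sizecheck ./bin/myapp.no-pgo ./bin/myapp.pgo`; a verdict in the last bucket is the cue to capture a broader profile.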

12. Combining with `GOAMD64=v3` and friends

If every machine in your fleet supports a newer microarchitecture level, that setting matters too.

GOAMD64=v3 go build -pgo=auto ./cmd/myapp

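PGO and GOAMD64 are orthogonal: PGO picks which functions to inline; GOAMD64 picks which instructions the compiler may emit (for v3, extensions such as BMI2 and AVX2). They compose freely, and the gains are usually roughly additive; confirm with a benchmark. A binary built with GOAMD64=v3 refuses to start on a CPU that lacks those extensions, so check the whole fleet first.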

For ARM64: GOARM64=v8.1 and similar (available since Go 1.23).

13. Continuous A/B in production

The mature setup: a small fraction of pods always run -pgo=off as a control. Compare their CPU/latency to PGO pods continuously.

# 95% pods: pgo-on
# 5% pods: pgo-off (control)

Prometheus ratio query (the pgo label is assumed to be attached by your deployment or relabeling rules):

avg(rate(container_cpu_usage_seconds_total{pgo="on"}[5m]))
  /
avg(rate(container_cpu_usage_seconds_total{pgo="off"}[5m]))

A value steadily below 1.0 (e.g., 0.94) shows PGO saving ~6 % CPU on live traffic, provided both groups receive the same per-pod load. That is continuous proof, not a one-shot benchmark.
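The conversion from ratio to savings is simple arithmetic, sketched here with made-up per-pod CPU rates. The numbers and function names are illustrative assumptions, not measurements.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// savingsPercent mirrors the query: (1 - avg(on) / avg(off)) × 100.
func savingsPercent(onCPU, offCPU []float64) float64 {
	off := mean(offCPU)
	if off == 0 {
		return 0
	}
	return (1 - mean(onCPU)/off) * 100
}

func main() {
	on := []float64{0.93, 0.95, 0.94} // CPU cores used per PGO pod (made up)
	off := []float64{1.00}            // CPU cores used per control pod (made up)
	fmt.Printf("%.1f%%\n", savingsPercent(on, off)) // 6.0%
}
```

Because the control group is small, its average is noisier than the PGO group's; looking at the ratio over hours rather than minutes keeps one busy control pod from dominating the result.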

14. Edge cases that erode gain

| Pattern | Erosion |
| --- | --- |
| Service shifted from interface dispatch to generics | Devirtualization wins disappear; reconsider |
| Recent refactor removed several hot functions | Profile partially stale; refresh |
| New feature added a parallel cold code path | Compiler spends inline budget on cold code |
| GC pressure increased substantially | PGO % shrinks even though absolute time saved is constant |

When measured gain slips, walk through this list before assuming PGO is broken.

15. Summary

Maximum PGO gain comes from (a) a representative profile (peak hour, aggregated, 30 000+ samples), (b) PGO-friendly code patterns (single-impl interfaces, small hot functions, generics with one hot instantiation), (c) avoiding patterns PGO cannot help (cgo, reflection, GC-bound), (d) pairing PGO with hand optimization in the right order, (e) auditing the result with -gcflags='-m=2 -d=pgodebug=1'. Expect 1–3 % binary size growth and 5–10 % CPU savings on a well-suited workload.