Profile-Guided Optimization (PGO): Optimize
1. Goal of this file
This file is about extracting more from PGO than the default 2–5 % most teams get. The premise: you already have a working PGO pipeline; you want to push the gain to the higher end of the 5–10 % range. The levers are:
- Make the profile more representative.
- Make the workload more PGO-friendly.
- Pair PGO with complementary build flags.
- Audit the resulting binary to confirm the wins happened where you expected.
2. Representativeness: the dominant factor
Almost all "PGO gave me 1 %" stories trace back to the profile not matching the workload. Steps to fix:
| Step | Why |
|---|---|
| Capture at peak hour | Hot paths at peak differ from those at 03:00 |
| Capture for ≥ 60 seconds | < 30 s misses tail behavior |
| Capture from multiple pods and merge | Single pod is noisy; one slow client can skew |
| Exclude deploy/restart window | Init code is not your steady-state path |
| Re-capture after major refactors | Stale profile leaves money on the table |
Combine all five: aggregated, peak-hour, steady-state, post-warmup profile from many instances. This is the configuration the Go team's PGO documentation effectively recommends.
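3. Sample count is what makes the profile stable
The compiler weights its decisions by sample counts. A profile with 50 samples is statistically noisy; a profile with 50 000 samples is stable. The table assumes Go's default CPU sampling rate of 100 Hz and roughly one busy core per pod.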
| Capture window | Approx samples (single pod) | Stability |
|---|---|---|
| 10 s | ~1 000 | Very noisy |
| 30 s | ~3 000 | Noisy |
| 60 s | ~6 000 | Useable |
| 5 min | ~30 000 | Stable |
| 5 pods × 60 s merged | ~30 000 | Stable |
Aim for 30 000+ aggregated samples. Below that, you may see inline decisions flip between profile refreshes for non-reasons.
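The table's figures follow from simple arithmetic, sketched below under the assumption of the default 100 Hz sampling rate. The helper `estimateSamples` and its `busyCores` parameter are hypothetical; idle CPU time produces no samples, so a lightly loaded pod yields fewer than the estimate.

```go
package main

import "fmt"

// defaultSampleHz is the default CPU profiling rate of the Go runtime.
const defaultSampleHz = 100

// estimateSamples approximates the CPU sample count of a merged profile.
// busyCores is the average number of cores executing Go code per pod,
// a value you have to measure for your own service.
func estimateSamples(windowSeconds, busyCores float64, pods int) int {
	return int(windowSeconds * busyCores * defaultSampleHz * float64(pods))
}

func main() {
	fmt.Println(estimateSamples(60, 1, 1))  // one pod, 60 s
	fmt.Println(estimateSamples(300, 1, 1)) // one pod, 5 min
	fmt.Println(estimateSamples(60, 1, 5))  // five pods, 60 s each, merged
}
```

With one busy core per pod this prints 6000, 30000, and 30000, matching the table rows.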
4. Workload-side optimization: write PGO-friendly code
Some code patterns offer more for PGO to optimize than others. To maximize gains:
| Pattern | Why PGO helps |
|---|---|
| Interface call with one dominant implementation | Devirtualization opportunity |
| Small accessor functions called millions of times | Inlining removes per-call overhead |
| Hot loops calling stable utility functions | Cross-package inlining |
| Generic functions with one heavy instantiation | Per-instantiation inlining |
Anti-patterns where PGO has nothing to do:
| Pattern | Why PGO doesn't help |
|---|---|
| cgo-bound hot path | C code not compiled by Go |
| Reflection-based dispatch (reflect.Value.Call) | Not statically resolvable |
| Plugin / shared-library code | Different compilation unit |
| Allocation-bound (GC-dominated) | GC is unaffected by PGO |
| Already monomorphic, single function | Nothing to specialize |
If your hot path matches one of these anti-patterns, PGO will tell you so by giving you 0 %. Move on to other optimizations (allocation reduction, hand-written SIMD assembly gated by feature detection from golang.org/x/sys/cpu, fewer syscalls).
5. Devirtualization wins: how to set them up
Devirtualization is the highest-leverage PGO mechanism. Set up your code so it can fire:
```go
// Good for PGO: one dominant impl at runtime
type Cache interface{ Get(string) []byte }

func handle(c Cache, k string) []byte {
	return c.Get(k)
}

// At runtime: 95% of calls pass *RedisCache → PGO devirtualizes
```
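To make the transformation concrete, here is a hand-written sketch of the shape a devirtualized call site takes: a type check for the dominant implementation guarding a direct call, with the interface call kept as the fallback. `RedisCache`, `MemCache`, and `handleDevirtualized` are hypothetical names for illustration; the compiler does this internally and you would not normally write it yourself.

```go
package main

import "fmt"

type Cache interface{ Get(string) []byte }

// RedisCache and MemCache are hypothetical implementations.
type RedisCache struct{}

func (*RedisCache) Get(k string) []byte { return []byte("redis:" + k) }

type MemCache struct{}

func (*MemCache) Get(k string) []byte { return []byte("mem:" + k) }

// handleDevirtualized mirrors by hand what PGO does to a hot call site:
// check for the dominant concrete type, call it directly (which makes the
// call eligible for inlining), and fall back to the interface call otherwise.
func handleDevirtualized(c Cache, k string) []byte {
	if rc, ok := c.(*RedisCache); ok {
		return rc.Get(k) // direct call
	}
	return c.Get(k) // indirect call for every other implementation
}

func main() {
	fmt.Printf("%s\n", handleDevirtualized(&RedisCache{}, "a"))
	fmt.Printf("%s\n", handleDevirtualized(&MemCache{}, "a"))
}
```

Both paths return the same results as the plain interface call; only the dispatch cost on the dominant path changes.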
Anti-pattern: a call site with legitimate polymorphism, where multiple implementations are used roughly equally. PGO can't help that call site. Consider whether a generic function or a switch over a typed enum would let you avoid the interface entirely.
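One possible shape for the typed-enum alternative is sketched below. All names (`Backend`, `Store`, `redisCache`, `memCache`) are hypothetical; the point is only that every call inside the switch targets a concrete type, so the compiler can resolve and inline it without any profile.

```go
package main

import "fmt"

// Backend is a hypothetical typed enum naming each concrete cache.
type Backend int

const (
	BackendRedis Backend = iota
	BackendMem
)

type redisCache struct{}

func (redisCache) Get(k string) []byte { return []byte("redis:" + k) }

type memCache struct{}

func (memCache) Get(k string) []byte { return []byte("mem:" + k) }

// Store holds concrete values, so each call in Get is a static call.
type Store struct {
	kind  Backend
	redis redisCache
	mem   memCache
}

// Get dispatches with a switch instead of an interface method table.
func (s *Store) Get(k string) []byte {
	switch s.kind {
	case BackendRedis:
		return s.redis.Get(k)
	case BackendMem:
		return s.mem.Get(k)
	default:
		return nil
	}
}

func main() {
	s := &Store{kind: BackendMem}
	fmt.Printf("%s\n", s.Get("k"))
}
```

The trade-off is that adding a backend means touching the enum and the switch, which an interface would not require.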
6. The interplay with -gcflags
-gcflags and -pgo are independent but interact.
| Flag combo | Effect |
|---|---|
| -pgo=auto (default) | PGO applies; inlining active |
| -pgo=auto -gcflags="-l" | Inlining off; PGO mostly inert |
| -pgo=auto -gcflags="all=-N -l" | Both inlining and optimization off; PGO inert. Debug only. |
| -pgo=auto -gcflags="-m=2" | PGO applies; you see inline decisions in stderr |
| -pgo=auto -gcflags="-d=pgodebug=1" | PGO applies; compiler prints debug info about PGO choices |
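For audit work, -d=pgodebug=1 makes the compiler list every site where PGO made a decision: budget increases, devirtualization choices, etc. Note that -gcflags without a package pattern applies only to the packages named on the command line; prefix the flags with all= (as in the debug-only row above) to cover dependencies too.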
7. Pairing PGO with bench-driven optimization
PGO is not a substitute for hand optimization; it is a multiplier. The workflow:
- Profile and bench the service as-is.
- Find the top three CPU hot spots.
- Hand-optimize each (allocation reduction, algorithm change, sync.Pool, etc.).
- Re-bench, confirm wins.
- Then enable PGO.
PGO on top of hand-tuned code typically yields the higher end of the gain range (7–10 %). PGO on top of un-optimized code is muddled by larger inefficiencies and looks smaller (2–4 %).
Order matters: optimize first, PGO last.
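8. Generic instantiation effects
Generic functions are instantiated per GC shape of their type arguments: types with different underlying types get separate instantiations, while all pointer types share one. Each instantiation is a distinct symbol from PGO's perspective.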
```go
func Encode[T any](v T) []byte { ... }

// At runtime: Encode[User] is hot, Encode[Order] is cold, Encode[int] is unused.
// PGO will inline Encode[User] aggressively, leave the others alone.
```
To benefit:
- Capture a profile after you have a stable set of instantiations.
- If you add a hot instantiation later, refresh the profile.
If you have many cold instantiations and one hot one, your binary doesn't bloat unfairly: PGO budget increases apply only to the hot instantiation.
9. PGO + -buildmode choices
| -buildmode | PGO support |
|---|---|
| default (exe) | Yes |
| pie | Yes |
| c-archive / c-shared | Yes for the Go side |
| plugin | Yes for the plugin; profile must come from the host process running the plugin |
| shared | Yes |
The interesting case is plugins: a profile captured from a host that loads many plugins can still drive optimization of each plugin independently, but you need the captured stacks to actually enter the plugin code paths.
10. Auditing the result
After enabling PGO, prove the wins happened where you expected.
10.1 List inlined functions
```bash
go build -gcflags='-m=2' -pgo=off ./cmd/myapp 2>off.txt
go build -gcflags='-m=2' -pgo=auto ./cmd/myapp 2>on.txt
diff <(grep "can inline" off.txt | sort) <(grep "can inline" on.txt | sort) | head -50
```
You should see the hot functions in the diff (in the auto set, not the off set).
10.2 Look at machine code
If Process was small and hot, you may not even find it in objdump — it has been inlined into all its callers.
10.3 PGO debug log
```bash
go build -gcflags='-d=pgodebug=1' -pgo=auto ./cmd/myapp 2>pgo.log
grep -E '(inlining|devirtualizing)' pgo.log | head
```
You see lines resembling the following (the exact wording varies between Go releases):
```text
pgo: inlining call to hot.Process at hot/handler.go:42
pgo: devirtualizing iface call to (*Cache).Get at hot/lookup.go:18
```
That's the smoking gun — PGO actually fired on the sites you cared about.
11. Binary-size budget
PGO grows the binary modestly. Track the size of the PGO build against a -pgo=off build of the same commit.
Expected delta: +1 to +3 %. Anomalies:
| Delta | Interpretation |
|---|---|
| < 1 % growth | Inlining had little to do (small wins) |
| 1–3 % growth | Typical, healthy |
| 3–10 % growth | Aggressive inlining; usually OK |
| > 10 % growth | Profile likely too narrow (over-fitted to one path) |
If size growth is large, capture a broader profile (longer window, more pods).
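A small sketch of how the size check could be automated in CI, mapping a measured growth onto the interpretation table above. The helper names and the byte sizes in `main` are made up; in practice the sizes would come from `os.Stat` on the two binaries.

```go
package main

import "fmt"

// growthPercent returns how much larger the PGO binary is than the
// baseline binary, as a percentage of the baseline size.
func growthPercent(baseBytes, pgoBytes int64) float64 {
	return float64(pgoBytes-baseBytes) / float64(baseBytes) * 100
}

// classifyGrowth maps a growth percentage onto the table's interpretations.
func classifyGrowth(pct float64) string {
	switch {
	case pct < 1:
		return "inlining had little to do"
	case pct <= 3:
		return "typical, healthy"
	case pct <= 10:
		return "aggressive inlining; usually OK"
	default:
		return "profile likely too narrow"
	}
}

func main() {
	// Made-up sizes for illustration only.
	base, pgo := int64(20_000_000), int64(20_400_000)
	pct := growthPercent(base, pgo)
	fmt.Printf("%.1f%% growth: %s\n", pct, classifyGrowth(pct))
}
```

Failing the build on the last category is one way to catch an over-fitted profile before it ships.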
12. Combining with GOAMD64=v3 and friends
For latest-generation CPUs, the microarchitecture level matters too.
PGO and GOAMD64 are orthogonal: PGO picks which functions to inline; GOAMD64 picks which instructions to emit (BMI2, AVX-class). Compose freely. Combined gain is usually additive.
For ARM: GOARM64=v8.1 and similar.
13. Continuous A/B in production
The mature setup: a small fraction of pods always run -pgo=off as a control. Compare their CPU/latency to PGO pods continuously.
Prometheus ratio query (the pgo label is assumed to be attached by your deployment):
```promql
avg(rate(container_cpu_usage_seconds_total{pgo="on"}[5m]))
/
avg(rate(container_cpu_usage_seconds_total{pgo="off"}[5m]))
```
A value steadily below 1.0 (e.g., 0.94) shows PGO saving ~6 % CPU on live traffic. Continuous proof, not a one-shot benchmark.
14. Edge cases that erode gain
| Pattern | Erosion |
|---|---|
| Service shifted from interface dispatch to generics | Devirtualization wins disappear; reconsider |
| Recent refactor removed several hot functions | Profile partially stale; refresh |
| New feature added a parallel cold code path | Compiler spends inline budget on cold code |
| GC pressure increased substantially | PGO % shrinks even though absolute time saved is constant |
When measured gain slips, walk through this list before assuming PGO is broken.
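The GC-pressure row is easiest to see with numbers. The sketch below uses made-up figures: PGO removes a fixed 6 ms of CPU per unit of work, and the measured percentage drops once unrelated GC time inflates the baseline. `gainPercent` is a hypothetical helper.

```go
package main

import "fmt"

// gainPercent expresses the CPU time PGO removed (savedMs) as a share
// of the total CPU time without PGO (baselineMs).
func gainPercent(savedMs, baselineMs float64) float64 {
	return savedMs / baselineMs * 100
}

func main() {
	// Illustrative figures, not measurements.
	fmt.Printf("before GC growth: %.1f%%\n", gainPercent(6, 100))
	fmt.Printf("after GC growth:  %.1f%%\n", gainPercent(6, 150))
}
```

The absolute saving is identical in both cases; only the denominator changed, which is why a shrinking percentage alone does not prove the profile went stale.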
15. Summary
Maximum PGO gain comes from (a) a representative profile (peak hour, aggregated, 30 000+ samples), (b) PGO-friendly code patterns (single-impl interfaces, small hot functions, generics with one hot instantiation), (c) avoiding patterns PGO cannot help (cgo, reflection, GC-bound), (d) pairing PGO with hand optimization in the right order, (e) auditing the result with -gcflags='-m=2 -d=pgodebug=1'. Expect 1–3 % binary size growth and 5–10 % CPU savings on a well-suited workload.
Further reading
- Official PGO guide: https://go.dev/doc/pgo
- pgodebug flag: search for pgodebug in https://github.com/golang/go/tree/master/src/cmd/compile/internal/pgo
- Devirtualization design: https://go.googlesource.com/proposal/+/master/design/55022-pgo-implementation.md
- Continuous profiling for PGO: https://www.polarsignals.com
- GOAMD64 levels: https://github.com/golang/go/wiki/MinimumRequirements#amd64