Generic Performance — Professional Level¶
Table of Contents¶
- Real-world migration: sort.Slice to slices.SortFunc
- Profiling generic code with pprof
- Reading flame graphs with stencil names
- Decision framework: keep, generic, specialize
- PGO for generics
- Migration playbook
- Case studies
- Continuous performance regression checks
- Summary
Real-world migration: sort.Slice to slices.SortFunc¶
A canonical professional migration. The pre-1.18 idiom:
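```go
// Reconstructed typical call; users is a []User, as in the generic version below.
sort.Slice(users, func(i, j int) bool {
	return users[i].Age < users[j].Age
})
```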
The 1.21+ idiom:
import "slices"
slices.SortFunc(users, func(a, b User) int {
return a.Age - b.Age // (or cmp.Compare(a.Age, b.Age))
})
What changed¶
| Aspect | sort.Slice | slices.SortFunc |
|---|---|---|
| Element access | Index-based; users[i], users[j] | Value-based; a, b |
| Comparator | func(i, j int) bool | func(a, b T) int |
| Internal dispatch | Reflection; cannot inline | Generic; comparator inlines |
| Stability | Unstable | Unstable; use SortStableFunc for stable |
Why it is faster¶
The pre-generic sort.Slice calls into the runtime, which uses reflection to swap and access elements. The comparator is also called indirectly, defeating inlining.
slices.SortFunc knows the type at compile time. The comparator inlines into the partition step. No reflection.
Real numbers¶
For 10,000 random User structs:
| Implementation | ns/op | allocs/op |
|---|---|---|
| sort.Slice | 880,000 | ~30 |
| slices.SortFunc | 540,000 | 0 |
Roughly 38% faster and zero allocations.
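For context, a sketch of the kind of benchmark behind such numbers; `User`, the slice size, and the inline setup are illustrative, and it assumes `cmp`, `slices`, `math/rand`, and `testing` are imported:

```go
func BenchmarkSortFunc(b *testing.B) {
	base := make([]User, 10_000)
	for i := range base {
		base[i].Age = rand.Intn(100) // any realistic distribution works
	}
	users := make([]User, len(base))
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		copy(users, base) // restore the unsorted input each iteration
		slices.SortFunc(users, func(x, y User) int {
			return cmp.Compare(x.Age, y.Age)
		})
	}
}
```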
Migration checklist¶
- Bump to Go 1.21+.
- Identify all `sort.Slice` and `sort.SliceStable` call sites.
- Convert comparators from `(i, j int) bool` to `(a, b T) int`.
- Run benchmarks before / after on the hottest sort sites.
- Remove `import "sort"` once all sites are converted.
- Document in CHANGELOG so future reviewers know not to revert.
Profiling generic code with pprof¶
pprof is unchanged in mechanics — what changes is how to read the output.
Capturing a CPU profile¶
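The standard route for a running service is `net/http/pprof`; the port and duration below are illustrative:

```sh
# the binary imports _ "net/http/pprof" and serves HTTP on :6060
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
```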
Capturing a heap profile¶
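Same endpoint family for the heap:

```sh
go tool pprof http://localhost:6060/debug/pprof/heap
```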
Capturing in tests¶
```go
func TestHotPath(t *testing.T) {
	f, err := os.Create("cpu.pprof") // needs "os" and "runtime/pprof" imported
	if err != nil {
		t.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		t.Fatal(err)
	}
	defer pprof.StopCPUProfile()
	runWorkload()
}
```
Reading generic-specific output¶
The flat / cum tables show entries like:
```
      flat  flat%   sum%        cum   cum%
     1.20s  24.0%  24.0%      1.20s  24.0%  pkg.Find[go.shape.string]
     0.80s  16.0%  40.0%      0.80s  16.0%  pkg.Find[go.shape.int_0]
     0.30s   6.0%  46.0%      0.30s   6.0%  pkg.Find[go.shape.*pkg.User]
```
Three different stencils, three different costs. Do not group them mentally — they are independent hot paths. If Find[*User] is the bottleneck, the optimization (specialize, change type, drop generic) only applies to that stencil.
Heap profile signals¶
Look for:
- `runtime.convT2I` / `runtime.convT2E` — boxing into an interface (likely outside the generic).
- Allocations attributed to `[go.shape.*]` symbols — the generic body itself escaped a value.
- `runtime.mapaccess2` for generic maps — the dictionary path.
go tool trace¶
For latency-sensitive services, runtime/trace reveals goroutine blocking and scheduler events. Generic functions show up under their stencil names in the trace viewer.
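A typical capture from a test or benchmark run; the package path is illustrative:

```sh
go test -bench=. -trace=trace.out ./internal/cache
go tool trace trace.out
```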
Reading flame graphs with stencil names¶
Flame graphs become noisier with generics — the same logical function appears in multiple stripes, one per shape. Tactics:
1. Treat each stencil as a separate optimization target¶
If Find[go.shape.int_0] and Find[go.shape.string] are both hot, you have two optimization targets. The fix may be different for each.
2. Use --inuse_objects and --alloc_objects¶
When chasing allocations:
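For example, against a saved `heap.pprof`:

```sh
go tool pprof -alloc_objects heap.pprof   # count of allocations
go tool pprof -inuse_objects heap.pprof   # live object count
```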
Distinguish bytes allocated (size) from number of allocations (count). For generic-induced boxing, the number is what hurts.
3. Filter by stencil¶
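One way is pprof's `-focus` flag, which restricts the report to samples passing through frames matching a regex (file name illustrative):

```sh
go tool pprof -focus='Find\[go\.shape' cpu.pprof
```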
Shows only entries matching the regex. Useful when one generic helper dominates.
4. Diff profiles¶
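pprof's `-diff_base` flag subtracts a baseline profile (file names illustrative):

```sh
go tool pprof -diff_base=before.pprof after.pprof
```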
Confirm that your optimization actually moved the needle on the right stencil and did not regress others.
Decision framework: keep, generic, specialize¶
For each performance-relevant function, choose:
- Keep — concrete, no change.
- Generic — convert to a type-parameterised version.
- Specialize — keep generic API, add a hand-rolled wrapper for the hot type.
The framework¶
```
          Is this on a hot path?
            │                │
            no              yes
            │                │
            ▼                ▼
     Convenience /    Single concrete type?
     deduplication?      │              │
            │           yes             no (multiple shapes)
            ▼            │              │
       Generic is        ▼              ▼
       fine          Concrete       Generic + specialized
                     wrapper        wrapper for the hot shape
```
Worked example 1 — Internal helper¶
Used in 50 places, none on a hot path. Decision: generic. Saves duplication.
Worked example 2 — Cache hot path¶
Service-wide cache, called 50k QPS, exclusively string → *User. Decision: specialize. Keep Cache[K, V] for tests and edge cases; add a userCache wrapper for production.
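A minimal sketch of what that wrapper might look like; `userCache` and its methods are illustrative, not a prescribed API:

```go
// userCache pins the hot instantiation to concrete types, so the compiler
// emits fully specialized code with no dictionary indirection.
type userCache struct {
	mu sync.RWMutex
	m  map[string]*User
}

func (c *userCache) Get(k string) (*User, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	u, ok := c.m[k]
	return u, ok
}

func (c *userCache) Set(k string, u *User) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = u
}
```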
Worked example 3 — Sorting¶
sort.Slice everywhere, called from many places. Decision: migrate to slices.SortFunc — it is generic and faster.
PGO for generics¶
Profile-guided optimization (Go 1.21+) lets the compiler use a runtime profile to make better decisions:
- Capture a CPU profile from production: `cpu.pprof`.
- Place it next to `main.go` as `default.pgo` (or pass it with `-pgo=...`).
- Build with `go build -pgo=auto` (the default in Go 1.21+ if `default.pgo` exists).
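Concretely, assuming the service exposes `net/http/pprof` and the main package lives in `./cmd/server`:

```sh
curl -o cpu.pprof 'http://svc:6060/debug/pprof/profile?seconds=30'
cp cpu.pprof cmd/server/default.pgo
go build ./cmd/server   # -pgo=auto is the default and picks up default.pgo
```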
Why generics benefit¶
PGO devirtualizes more aggressively. A generic method call that the compiler could not statically resolve becomes a direct call when the profile shows one type dominates. Per the Go team's own measurements, PGO saves 2-5% on real services with generic-heavy hot paths.
Practical workflow¶
- Deploy a build without PGO; capture profile.
- Build with PGO; deploy.
- Verify the new build's profile.
- Periodically refresh the PGO profile (monthly is enough for most services).
Migration playbook¶
A repeatable plan for moving a codebase from interface{} to generics with performance in mind.
Phase 0 — Baseline¶
- Add benchmarks for the hot paths you intend to migrate.
- Capture a `pprof` from production.
- Note current p50/p99 latencies and CPU per request.
Phase 1 — Pilot¶
- Pick one self-contained helper (a cache, a queue, a utility function).
- Convert to generics.
- Benchmark before / after on the hot path.
- Compare p99 latency and allocation rate.
- Document numbers in the PR.
Phase 2 — Replicate¶
- Apply the same pattern to two or three more modules.
- Watch for regressions in compile time and binary size.
- Establish team-wide naming and constraint conventions.
Phase 3 — Scale¶
- Roll out across the codebase.
- Update CI to run benchmarks on PRs that touch generic helpers.
- Add a "no perf regression" check to the deploy pipeline.
Phase 4 — Maintain¶
- Refresh PGO profiles monthly.
- Re-benchmark on every Go release; performance drifts as the compiler evolves.
- Document migration outcomes in an internal "performance journal" so the team learns.
Case studies¶
Case study 1 — A high-throughput message broker¶
A team migrating an in-memory broker from chan interface{} to chan T via generics:
- Before: 1.2M msg/sec; 30% of CPU spent in `runtime.convT2E`.
- After: 1.9M msg/sec; boxing path gone.
- Cost: 0.5% binary size growth.
- Lesson: generic channels (`chan T` in a generic struct) gave the biggest single win in the codebase.
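The shape of that migration, as a minimal sketch (names illustrative):

```go
// Broker[T] replaces a chan interface{} broker; messages stay unboxed,
// so runtime.convT2E never appears on the publish path.
type Broker[T any] struct {
	ch chan T
}

func NewBroker[T any](buf int) *Broker[T] {
	return &Broker[T]{ch: make(chan T, buf)}
}

func (b *Broker[T]) Publish(msg T)      { b.ch <- msg }
func (b *Broker[T]) Messages() <-chan T { return b.ch }
```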
Case study 2 — A cache that got slightly slower¶
A team rewrote a map[string]interface{} cache as Cache[K comparable, V any]. The expectation: faster.
- Before: 60 ns/op, 1 allocation
- After: 75 ns/op, 0 allocations
Why? The cache was instantiated for 15 distinct value types in one binary. The dictionary cost added up. The fix:
- Keep `Cache[K, V]` for the long tail of types.
- Specialize for the three types that account for 80% of traffic.
After specialization: 35 ns/op for the hot types; 75 ns/op for the rest. Total CPU fell.
Case study 3 — slices.SortFunc win¶
A logging pipeline sorted millions of records by timestamp. Migration from sort.Slice to slices.SortFunc:
- Before: 880 µs per 10k sort
- After: 540 µs per 10k sort
- Wall-clock saving: 8 minutes per nightly batch
A two-line change with negligible risk and a meaningful business effect.
Case study 4 — A service that did not benefit¶
A request-handler used interface{}-keyed maps for a feature flag cache. The team genericized it and saw no difference — the cache was cold (rarely hit), and even per-request the boxing cost was negligible.
Lesson: measure before and after; not every generic conversion improves performance.
Continuous performance regression checks¶
Generic performance is not stable across Go releases. The compiler keeps improving, but occasionally a release moves a particular benchmark in the wrong direction. Professionals invest in CI checks.
Benchstat-driven gating¶
Set a regression threshold (e.g., +5% on critical benchmarks fails CI). Tooling: benchstat, gobenchdata, or a homegrown script.
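A typical benchstat flow; the package path and file names are illustrative:

```sh
go test -run='^$' -bench=. -count=10 ./internal/cache > old.txt
# apply the change, then:
go test -run='^$' -bench=. -count=10 ./internal/cache > new.txt
benchstat old.txt new.txt   # fail CI when the delta exceeds the threshold
```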
Track binary size¶
Regression threshold: e.g., +2% binary growth requires manual approval.
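A simple measurement to feed that check, assuming the main package at `./cmd/server`:

```sh
go build -o app ./cmd/server
wc -c < app   # record the byte count and diff against a stored baseline
```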
Track allocations¶
testing.B.ReportAllocs() covers benchmarks; in production, export the allocation rate via runtime.ReadMemStats or expvar and alert if it spikes after a deploy.
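A minimal expvar sketch; the variable name is illustrative:

```go
import (
	"expvar"
	"runtime"
)

func init() {
	// Publish cumulative allocated bytes; a scraper can derive the rate.
	expvar.Publish("alloc_bytes_total", expvar.Func(func() any {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms) // stops the world briefly; sample sparingly
		return ms.TotalAlloc
	}))
}
```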
Profile-driven canaries¶
Before promoting a generic refactor:
- Deploy to a canary at 1% traffic.
- Capture a `pprof`.
- Compare against a baseline.
- Promote only if metrics are within bounds.
Tooling matrix¶
| Tool | What it gates |
|---|---|
| benchstat | Microbenchmarks |
| pprof diff | CPU and heap |
| go tool nm | Binary symbol size |
| Datadog / Prometheus | Production p50/p99/QPS/allocations |
| CI (GitHub Actions, Buildkite) | Run benchmarks on PRs touching generic code |
Summary¶
The professional level of generic performance is operational discipline:
- Benchmark before refactoring, with realistic input sizes.
- Capture pprof in production, read stencil names, treat shapes as separate hot paths.
- Decide per function — keep, generic, or specialize — based on workload, not aesthetics.
- Use PGO for further wins on hot generic paths.
- Migrate gradually, with a baseline-pilot-replicate-scale plan.
- Gate regressions in CI so improvements stick.
Generic performance is a continuous practice, not a one-time decision. Each Go release shifts the trade-offs slightly. A professional team treats this like any other operational concern — measured, monitored, and iterated.
The next file (specification.md) collects the formal references to the implementation design documents you will need when arguing about generic performance with the compiler team or reviewing a runtime CL.