Optimization Workflow: Professional

1. The production framing

In a production service, the optimization workflow is not "an engineer runs pprof and tunes a function." It is a continuous, team-level practice with budgets, gates, runbooks, and an incident pipeline. The professional job, roughly:

Define SLOs that express what "fast enough" means to users.
Translate SLOs into per-service performance budgets that the team can engineer against.
Continuously measure real user latency, tail behavior, allocation rate, GC CPU, and resource ceilings.
Gate releases on regression of the four numbers that matter.
Triage and fix the performance bugs that escape the gates.
Run periodic deep optimization sprints with executive support, not in spare time.

The rest of this file is what each of those looks like in practice.

2. SLOs and the error budget

A performance SLO is two numbers and a window:

"p99 of POST /checkout < 300 ms over a rolling 28-day window, with a 99.9% achievement target."

The "achievement target" leaves a budget: 0.1% of the 28 days (about 40 minutes) can be over the threshold without violating the contract. That budget is what gives the team room to take engineering risks.
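
The arithmetic behind that budget can be sketched in a few lines of Go. The function names are illustrative only, and the 32 "bad" minutes in the usage are a made-up figure to show how a burn fraction falls out.

```go
package main

import "fmt"

// errorBudgetMinutes returns how many minutes of the window may violate the
// latency threshold for a given achievement target (0.999 means 99.9%).
func errorBudgetMinutes(windowDays int, target float64) float64 {
	windowMinutes := float64(windowDays) * 24 * 60
	return windowMinutes * (1 - target)
}

// budgetBurned returns the fraction of the error budget already consumed.
func budgetBurned(violatingMinutes, budgetMinutes float64) float64 {
	return violatingMinutes / budgetMinutes
}

func main() {
	budget := errorBudgetMinutes(28, 0.999)
	fmt.Printf("error budget: %.1f minutes\n", budget)

	// Hypothetical: 32 minutes over threshold so far in this window.
	fmt.Printf("burned: %.0f%%\n", 100*budgetBurned(32, budget))
}
```

With the SLO above, the budget comes out at roughly 40 minutes, and 32 violating minutes puts the team near the 80% burn level at which optimization becomes the top priority.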

Three principles:

Principle	Meaning
SLO is set by the business, not by engineering	"Our customers tolerate 300 ms" — measured, not assumed
Error budget allows risk-taking	Burn it on feature velocity if reliable; spend it carefully if recovering
Optimization work is justified by budget burn	If you're burning 80% of the budget, optimization is now top priority

The professional alternative to "optimize when someone complains" is "optimize when the error budget says to."

3. Translating SLO into budget

The user-visible SLO of 300 ms p99 must be allocated across all the systems that contribute. A typical allocation:

Component	Budget at p99
Edge / load balancer / TLS	20 ms
API gateway + auth	30 ms
Application (this service)	100 ms
Downstream service A	50 ms
Downstream service B	30 ms
Database (sum of queries)	50 ms
Margin / variance	20 ms
Total	300 ms

The application's 100 ms is the actual engineering target. Performance work in the service is judged against whether p99 of the service's own processing stays under 100 ms. The senior engineer who set the 100 ms is now the professional who has to defend it across releases.
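
One way to keep the allocation honest is to check it mechanically. The sketch below mirrors the table and, like the table, treats the per-component p99 budgets as additive, which is a conservative simplification. The type and function names are assumptions of this example.

```go
package main

import "fmt"

// component is one contributor to the end-to-end p99 budget.
type component struct {
	name     string
	budgetMS int
}

// allocation mirrors the budget table for the 300 ms SLO.
var allocation = []component{
	{"edge / load balancer / TLS", 20},
	{"API gateway + auth", 30},
	{"application (this service)", 100},
	{"downstream service A", 50},
	{"downstream service B", 30},
	{"database (sum of queries)", 50},
	{"margin / variance", 20},
}

// totalMS sums the per-component budgets.
func totalMS(cs []component) int {
	total := 0
	for _, c := range cs {
		total += c.budgetMS
	}
	return total
}

// checkAllocation returns an error when the components exceed the SLO.
func checkAllocation(cs []component, sloMS int) error {
	if got := totalMS(cs); got > sloMS {
		return fmt.Errorf("allocated %d ms exceeds SLO of %d ms", got, sloMS)
	}
	return nil
}

func main() {
	fmt.Println("total allocated:", totalMS(allocation), "ms")
	if err := checkAllocation(allocation, 300); err != nil {
		fmt.Println("over budget:", err)
	}
}
```

A check like this is useful when someone proposes raising one component's budget: the increase has to come out of another row or out of the margin.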

4. The four dashboards every service needs

Dashboard	Source	What it tells you
Latency histogram	Server-side timing, exported to Prometheus / Datadog	p50, p95, p99, p99.9 of each endpoint
Throughput and saturation	Request rate, CPU%, queue depth	Whether the service is at capacity
GC and allocation	`runtime/metrics`, gctrace	GC CPU fraction, allocation rate, live heap
Resource ceilings	Container metrics	RSS vs. limit, FD count, goroutine count

Each chart should have an alert pointed at the threshold that maps to an action. "p99 climbing toward SLO" is not an alert; "p99 over 250 ms for 5 minutes" is.
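
The difference between a trend and an actionable alert is the "for 5 minutes" clause. In practice the monitoring system evaluates it, but the logic is simple enough to sketch. The types, the one-minute scrape interval, and the `readings` helper below are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	alertThresholdMS = 250.0           // "p99 over 250 ms"
	alertHold        = 5 * time.Minute // "for 5 minutes"
	scrapeInterval   = time.Minute     // assumed scrape cadence
)

// sample is one p99 reading taken from the latency histogram.
type sample struct {
	at    time.Time
	p99MS float64
}

// sustainedAbove reports whether p99 has stayed above thresholdMS for at
// least hold, judged from the newest sample backwards.
// Samples must be ordered oldest to newest.
func sustainedAbove(samples []sample, thresholdMS float64, hold time.Duration) bool {
	if len(samples) == 0 {
		return false
	}
	newest := samples[len(samples)-1].at
	for i := len(samples) - 1; i >= 0; i-- {
		s := samples[i]
		if s.p99MS <= thresholdMS {
			return false // dipped back under the threshold inside the window
		}
		if newest.Sub(s.at) >= hold {
			return true // the breach has lasted the full hold period
		}
	}
	return false // not enough history yet
}

// readings builds evenly spaced samples for demonstration purposes.
func readings(step time.Duration, values ...float64) []sample {
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	out := make([]sample, len(values))
	for i, v := range values {
		out[i] = sample{at: base.Add(time.Duration(i) * step), p99MS: v}
	}
	return out
}

func main() {
	steady := readings(scrapeInterval, 260, 270, 265, 280, 275, 290)
	spike := readings(scrapeInterval, 200, 210, 205, 290, 210, 200)
	fmt.Println("steady breach fires:", sustainedAbove(steady, alertThresholdMS, alertHold))
	fmt.Println("single spike fires:", sustainedAbove(spike, alertThresholdMS, alertHold))
}
```

The hold period is what keeps a single slow scrape from paging anyone, while a real regression still fires within minutes.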

5. Performance budgets per route

The 100 ms application budget is itself further decomposed into per-route budgets:

Route	Budget at p99	Reason
`POST /checkout`	100 ms	The main customer flow
`GET /products/{id}`	50 ms	High-frequency, cacheable
`POST /admin/*`	500 ms	Tolerated; low frequency
`GET /health`	10 ms	Probed every second; can't drift

The budget is enforced via dashboards. A PR that pushes /checkout p99 from 80 ms to 95 ms in canary is not "still in budget"; it is "burning the budget" (15 of the remaining 20 ms of headroom), and the team should know about it before the rollout proceeds.
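
A way to make "burning the budget" concrete is to report how much of the remaining headroom a change consumes, instead of only whether the route is still under its limit. This is a minimal sketch; the `headroomConsumed` metric is an illustration, not an established convention.

```go
package main

import (
	"fmt"
	"math"
)

// budgetsMS mirrors the per-route table (p99 budgets in milliseconds).
var budgetsMS = map[string]float64{
	"POST /checkout":     100,
	"GET /products/{id}": 50,
	"POST /admin/*":      500,
	"GET /health":        10,
}

// headroomConsumed returns the share of the remaining headroom that a change
// uses up when it moves p99 from beforeMS to afterMS under budgetMS.
// It returns 0 when nothing regressed and a value above 1 when the change
// pushes the route over its budget.
func headroomConsumed(beforeMS, afterMS, budgetMS float64) float64 {
	if afterMS <= beforeMS {
		return 0 // no regression
	}
	headroom := budgetMS - beforeMS
	if headroom <= 0 {
		return math.Inf(1) // already over budget before the change
	}
	return (afterMS - beforeMS) / headroom
}

func main() {
	budget := budgetsMS["POST /checkout"]
	used := headroomConsumed(80, 95, budget)
	fmt.Printf("POST /checkout: %.0f%% of remaining headroom consumed\n", 100*used)
}
```

For the 80 ms to 95 ms example this reports 75%, which is a much louder signal than "95 ms is under 100 ms".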

6. The CI gate

Microbenchmarks run in CI on every PR. The output is fed through benchstat against a known-good baseline branch.

git fetch origin main
git checkout origin/main
go test -bench=. -benchmem -run='^$' -count=10 ./... > /tmp/base.txt
git checkout -
go test -bench=. -benchmem -run='^$' -count=10 ./... > /tmp/new.txt
benchstat /tmp/base.txt /tmp/new.txt

The gating rule is configured per benchmark:

Benchmark	Allowed delta	Why
`BenchmarkHotPath_*`	< 2% slower	On the SLO critical path
`BenchmarkAdmin_*`	< 15% slower	Off-critical
`BenchmarkParser_*`	< 5% slower	Used by many endpoints
`BenchmarkAllocations`	0 new allocations	A regression of allocs/op is a release-blocker

The CI script enforces this. PRs that violate a rule must either add a written justification or include an offsetting improvement elsewhere.
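
The enforcement step might look like the sketch below. It assumes the deltas have already been extracted from the benchstat comparison by a parser that is not shown, and it treats the `_*` patterns in the table as name prefixes. The `rule` and `result` types and the benchmark names in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is the allowed slowdown for benchmarks whose name starts with prefix.
type rule struct {
	prefix     string
	maxSlowPct float64
}

// rules mirrors the gating table.
var rules = []rule{
	{"BenchmarkHotPath_", 2},
	{"BenchmarkAdmin_", 15},
	{"BenchmarkParser_", 5},
}

// result is one benchmark's change versus the baseline.
type result struct {
	name         string
	timeDeltaPct float64 // positive means slower
	allocsDelta  float64 // change in allocs/op
}

// violations returns one message per broken rule; an empty slice means the
// PR passes the gate.
func violations(results []result) []string {
	var out []string
	for _, r := range results {
		if r.name == "BenchmarkAllocations" && r.allocsDelta > 0 {
			out = append(out, fmt.Sprintf("%s: %+.0f allocs/op", r.name, r.allocsDelta))
		}
		for _, ru := range rules {
			if strings.HasPrefix(r.name, ru.prefix) && r.timeDeltaPct >= ru.maxSlowPct {
				out = append(out, fmt.Sprintf("%s: %+.1f%% slower, limit %.0f%%",
					r.name, r.timeDeltaPct, ru.maxSlowPct))
			}
		}
	}
	return out
}

func main() {
	results := []result{
		{name: "BenchmarkHotPath_Encode", timeDeltaPct: 3.1},
		{name: "BenchmarkAdmin_Export", timeDeltaPct: 9.0},
		{name: "BenchmarkAllocations", allocsDelta: 1},
	}
	for _, v := range violations(results) {
		fmt.Println("GATE FAIL:", v)
	}
}
```

A real gate would also ignore deltas that benchstat reports as statistically insignificant, so that noise on shared CI runners does not block merges.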

7. Continuous profiling

Capturing a profile only when on-call gets paged is too late. The professional answer is continuous profiling: a background agent captures short profiles (10–30 seconds, every few minutes) and ships them to a central store.

Tools:

Tool	Notes
`pprof` over HTTP, scraped by a sidecar	Self-rolled but flexible
Pyroscope (now Grafana Pyroscope)	Open source, integrates with Grafana
Datadog Continuous Profiler	Commercial, low overhead
Polar Signals Parca	Open source, eBPF-based

The value is twofold:

When an incident happens, you have profile data from before the incident.
Slow drift across releases is visible as a change in profile shape, not just numbers.

A service running a continuous profiler can expose drift weeks before it shows up as an SLO breach.
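
For the self-rolled option, the core of an in-process agent is small. The sketch below uses the standard `runtime/pprof` package; the `ship` callback, the loop structure, and the short demo duration are assumptions of this example, and a production agent would add upload retries, labels, and jitter.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// demoDuration keeps the example fast; the text suggests 10-30 seconds.
const demoDuration = 100 * time.Millisecond

// captureCPUProfile records a CPU profile of the current process for d and
// returns the encoded pprof data.
func captureCPUProfile(d time.Duration) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err // for example, another CPU profile is already running
	}
	time.Sleep(d)
	pprof.StopCPUProfile() // returns once the profile is fully written
	return buf.Bytes(), nil
}

// profileLoop captures a profile of length d every interval and hands it to
// ship until stop is closed.
func profileLoop(d, interval time.Duration, ship func([]byte), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if data, err := captureCPUProfile(d); err == nil {
				ship(data)
			}
		}
	}
}

func main() {
	data, err := captureCPUProfile(demoDuration)
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of pprof data\n", len(data))
}
```

In a service, `profileLoop` would run in its own goroutine with something like a 30-second capture every few minutes, with `ship` uploading to whatever central store the team uses.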

8. The PGO pipeline

Profile-Guided Optimization (Go 1.21+) is integrated into the build pipeline.

production → continuous profile collection → 24h aggregate → default.pgo
                                                     ↓
                                                go build -pgo=auto
                                                     ↓
                                                next release binary

Steps in detail:

The continuous profiler aggregates 24 hours of CPU profiles into a single representative profile.
A scheduled job pulls the aggregate, validates it (sample count > 100k, average load > N), and writes it as default.pgo in the main package's directory, which is where the Go toolchain looks for it.
The next CI build picks up default.pgo automatically with -pgo=auto (the default since Go 1.21).
The team reviews the PR that bumps the profile, looking for unexpected shape changes.

Typical wins: 2–10% on CPU. The discipline is keeping the profile fresh: a stale profile produces a binary optimized for the wrong workload. Refresh default.pgo at least monthly, and again after any major workload change.
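
The validation step in the scheduled job can be as simple as the sketch below. How the sample count and load figure are extracted from the profile store is outside its scope, the `profileStats` type is hypothetical, and the load floor stays a parameter because the text leaves N unspecified.

```go
package main

import (
	"errors"
	"fmt"
)

// profileStats summarizes an aggregated CPU profile.
type profileStats struct {
	samples int     // total CPU samples in the aggregate
	avgLoad float64 // average load during collection, in the team's own unit
}

var (
	errTooFewSamples = errors.New("profile has too few samples")
	errLoadTooLow    = errors.New("profile collected under too little load")
)

// validateProfile rejects aggregates that are too thin to be representative.
// Both bounds are strict, matching "sample count > 100k, average load > N".
func validateProfile(s profileStats, minSamples int, minLoad float64) error {
	if s.samples <= minSamples {
		return fmt.Errorf("%w: %d <= %d", errTooFewSamples, s.samples, minSamples)
	}
	if s.avgLoad <= minLoad {
		return fmt.Errorf("%w: %.2f <= %.2f", errLoadTooLow, s.avgLoad, minLoad)
	}
	return nil
}

func main() {
	// 0.5 is a placeholder for N; pick a floor that reflects real traffic.
	err := validateProfile(profileStats{samples: 250_000, avgLoad: 0.7}, 100_000, 0.5)
	fmt.Println("validation error:", err)
}
```

Rejecting a thin profile and keeping the previous default.pgo is usually the safer failure mode, since a profile from a quiet period can steer the compiler toward the wrong code paths.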

9. The canary and the rollback

Every release is rolled out gradually. The professional shape of the rollout:

Stage	Traffic	Duration	Block if
Smoke	Internal traffic only	5 min	Any error or > 2× latency vs. prev
Canary 1%	1% of prod	30 min	p99 +15%, error rate +0.5%, GC CPU +30%
Canary 10%	10% of prod	1 hour	Same thresholds
Full	100%	—	Same thresholds, with auto-rollback hook

The auto-rollback hook reverts the deployment if any threshold is crossed in any stage. The release engineer doesn't have to be online; the automation does it.

This is the production version of "re-measure after the change." Microbenchmarks said the change was a win; the canary will tell you whether it's a win at real load.
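
The threshold check behind the auto-rollback hook can be sketched as a comparison of two snapshots. The `snapshot` type is hypothetical, and the sketch assumes that "+15%" and "+30%" are relative increases while "error rate +0.5%" is an absolute increase of 0.5 percentage points; the table does not say which, so confirm the interpretation before reusing this.

```go
package main

import "fmt"

// snapshot holds the three canary signals for one deployment.
type snapshot struct {
	p99MS     float64 // p99 latency in milliseconds
	errorRate float64 // fraction of requests that failed (0.005 = 0.5%)
	gcCPU     float64 // fraction of CPU time spent in GC
}

// breaches compares the canary against the previous version and returns one
// message per crossed threshold. An empty result means the stage may proceed.
func breaches(base, canary snapshot) []string {
	var out []string
	if canary.p99MS > base.p99MS*1.15 {
		out = append(out, "p99 latency up more than 15%")
	}
	if canary.errorRate-base.errorRate > 0.005 {
		out = append(out, "error rate up more than 0.5 points")
	}
	if canary.gcCPU > base.gcCPU*1.30 {
		out = append(out, "GC CPU up more than 30%")
	}
	return out
}

func main() {
	base := snapshot{p99MS: 80, errorRate: 0.001, gcCPU: 0.05}
	canary := snapshot{p99MS: 95, errorRate: 0.001, gcCPU: 0.05}
	if b := breaches(base, canary); len(b) > 0 {
		fmt.Println("roll back:", b)
	} else {
		fmt.Println("promote to next stage")
	}
}
```

Comparing against the previous version running at the same time, instead of against a fixed number, keeps normal daily traffic swings from triggering rollbacks.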

10. Regression incident response

When a regression slips through the gate and hits production:

Confirm. Latency dashboard shows a step change at the deploy time? Compare canary vs. previous.
Roll back first, diagnose second. The service was fine yesterday; restore that state.
Bisect. Which commit in the release caused the regression?
Capture. A profile from the regressed version (if a pod still exists) vs. a profile from the fixed version.
Fix forward. Either revert that commit alone or fix the root cause in a follow-up.
Post-mortem. Why did the gate miss this? Add a benchmark or alert that would have caught it.

The post-mortem step is what makes the system better. Each escape is an opportunity to harden the gate.

11. Team practices

Performance is a property of the system, but it is maintained by the team. Practices that work:

Practice	Cadence
Performance review per PR (lightweight)	Each PR — reviewer looks at benchstat output
Performance backlog	Continuous; each known issue ticketed with a profile snapshot
Optimization sprints	Quarterly; week-long focused effort
Performance "office hours"	Weekly; engineers bring profiles, the team triages
Post-mortems for SLO breaches	Every breach
Shared dashboards	Always; the four-number panel is a fixture for every service

The professional team treats performance as a first-class engineering concern, not as "we'll get to it after features."

12. The `runtime/metrics` panel

Build one Grafana panel per service exporting these:

Metric	Why it matters
`/sched/goroutines:goroutines`	Goroutine leaks
`/sched/latencies:seconds` (histogram)	Scheduler delay; signal of saturation
`/gc/heap/live:bytes`	Working set
`/gc/heap/allocs:bytes` (rate)	Allocation pressure
`/gc/pauses:seconds` (histogram)	GC pause distribution
`/cpu/classes/gc/total:cpu-seconds` (rate)	GC CPU fraction
`/cpu/classes/scavenge/total:cpu-seconds` (rate)	Background scavenger work
`/memory/classes/heap/objects:bytes`	Heap occupancy

These map directly to the Go collector in prometheus/client_golang when its runtime-metrics option is enabled (GoRuntimeMetricsCollection in older releases, WithGoCollectorRuntimeMetrics in newer ones). Use one panel layout, identical across services, so on-call engineers don't have to relearn each service.
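
The same metrics can be read directly with the standard `runtime/metrics` package, which is handy for a debug endpoint or for checking what a given Go version supports. In this sketch the `readPanel` helper and its choice to collapse histograms into an event count are simplifications made for brevity; a real exporter would ship the buckets.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// panelMetrics lists the runtime/metrics names used on the panel.
var panelMetrics = []string{
	"/sched/goroutines:goroutines",
	"/sched/latencies:seconds",
	"/gc/heap/live:bytes",
	"/gc/heap/allocs:bytes",
	"/gc/pauses:seconds",
	"/cpu/classes/gc/total:cpu-seconds",
	"/cpu/classes/scavenge/total:cpu-seconds",
	"/memory/classes/heap/objects:bytes",
}

// readPanel takes one snapshot of the panel metrics. Counters are cumulative,
// so rates come from the difference between two snapshots.
func readPanel() map[string]float64 {
	samples := make([]metrics.Sample, len(panelMetrics))
	for i, name := range panelMetrics {
		samples[i].Name = name
	}
	metrics.Read(samples)

	out := make(map[string]float64)
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			out[s.Name] = float64(s.Value.Uint64())
		case metrics.KindFloat64:
			out[s.Name] = s.Value.Float64()
		case metrics.KindFloat64Histogram:
			// Reduced to the total number of recorded events.
			var n uint64
			for _, c := range s.Value.Float64Histogram().Counts {
				n += c
			}
			out[s.Name] = float64(n)
		case metrics.KindBad:
			// Not supported by this Go version; leave it out.
		}
	}
	return out
}

func main() {
	snapshot := readPanel()
	for _, name := range panelMetrics {
		if v, ok := snapshot[name]; ok {
			fmt.Printf("%-45s %g\n", name, v)
		} else {
			fmt.Printf("%-45s unsupported\n", name)
		}
	}
}
```

Metrics that the running Go version does not know come back with `KindBad`, so the same code works across toolchain upgrades.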

13. Cost-aware optimization

At scale, performance is also cost.

Scenario	Cost dimension
1000 pods × 10% CPU saved	$X/month less in compute
1000 pods × 20% memory saved	Allows smaller pod size, lower cost
Allocation rate halved	Lower per-pod CPU, often allows fewer pods
Latency 50 ms → 40 ms on customer-facing path	Hard to monetize; sometimes maps to conversion lift
Latency 200 ms → 150 ms on internal job	Often nothing

Engineering hours are also money. A 10% improvement that saves $500/month in compute but takes two weeks of engineer time at $5000/week is a $10,000 investment for $6000/year of return, a payback period of 20 months. The professional makes that math explicit before committing.
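
Making the math explicit takes only a few lines. The sketch below uses the figures from the example (two engineer-weeks at $5000/week against $500/month saved); the function names are illustrative.

```go
package main

import "fmt"

// paybackMonths returns how long the compute savings take to repay the
// engineering time spent on the optimization.
func paybackMonths(engineerWeeks, weeklyCost, monthlySavings float64) float64 {
	return engineerWeeks * weeklyCost / monthlySavings
}

// firstYearNet returns savings minus investment over the first 12 months.
// A negative value means the work has not paid for itself within a year.
func firstYearNet(engineerWeeks, weeklyCost, monthlySavings float64) float64 {
	return 12*monthlySavings - engineerWeeks*weeklyCost
}

func main() {
	fmt.Printf("payback: %.0f months\n", paybackMonths(2, 5000, 500))
	fmt.Printf("first-year net: $%.0f\n", firstYearNet(2, 5000, 500))
}
```

With these inputs the work costs $10,000, returns $6000 in the first year, and breaks even after 20 months, which is worth knowing before the sprint is scheduled.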

14. The optimization sprint

Periodic focused optimization works better than continuous trickle-optimization. Structure of a one-week sprint:

Day	Activity
Monday	Gather profiles, agree on top 3 hotspots, write benchmarks for each
Tuesday	Apply candidate optimizations; pair on benchstat results
Wednesday	Continue; merge wins to a branch, leave failures documented
Thursday	Soak tests on the canary cluster; gather production profiles
Friday	Document, write up findings, plan next sprint's targets

The output of the week is (a) measurable improvements landed and (b) a written record of what was tried and why. Both compound across sprints; the second sprint is faster than the first because the team is better calibrated.

15. Cross-team performance

When the bottleneck is in a service owned by another team, the workflow includes communication:

Step	Action
Identify	Profile shows the bottleneck is in a call to service Y
Quantify	Measure the contribution: "Service Y's p99 is 150 ms of our 200 ms budget"
Communicate	Open an issue with the data — profile, latency histogram, traffic shape
Collaborate	Offer to help; sometimes the fix is theirs, sometimes ours (batch, cache, retry policy)
Track	Add a dashboard for the cross-service latency; revisit

The professional doesn't say "their service is slow" and move on. Cross-team performance work is some of the highest-leverage performance work because the bottlenecks usually involve more than one team.

16. Long-lived performance documentation

Each service maintains a PERFORMANCE.md (or wiki page) with:

Section	Content
SLOs	Numbers, window, achievement target
Budget	Per-component breakdown of the SLO
Known hotspots	Functions you've optimized before and why they're shaped the way they are
Tuning history	Each non-obvious optimization, with date, person, before/after
Anti-patterns to avoid	"Do not replace this stack buffer with strconv"
Runbooks	RSS climbing, GC CPU high, goroutine leak
Benchmark baselines	Where to find them and how to run them

This is the document a new team member reads in their first week. It is the memory of a team. Without it, every new engineer relearns the same lessons.

17. The gates against entropy

Performance regresses without active resistance. The five mechanisms that hold the line:

Gate	Catches
Microbenchmark CI	Function-level regressions
Canary thresholds	System-level regressions before full rollout
Continuous profiling alerts	Slow drift across releases
Quarterly optimization sprint	Accumulated debt that didn't trigger any gate
`PERFORMANCE.md`	Knowledge loss across team changes

Each one catches what the others miss. A team running all five holds its performance budgets far more reliably than a team running just one or two.

18. Summary

Professional optimization is institutional: SLOs translate into per-route budgets; budgets are enforced by CI gates and canary thresholds; continuous profiling makes drift visible; PGO closes the loop from production back to compile; and team practices keep the work funded against feature pressure. The individual engineer's loop — measure, identify, change, re-measure — is unchanged, but it runs against a backdrop of institutional support and accountability. That's how performance survives organizational entropy.