Optimization Workflow: Professional
1. The production framing
In a production service, the optimization workflow is not "an engineer runs pprof and tunes a function." It is a continuous, team-level practice with budgets, gates, runbooks, and an incident pipeline. The professional job, roughly:
- Define SLOs that express what "fast enough" means to users.
- Translate SLOs into per-service performance budgets that the team can engineer against.
- Continuously measure real user latency, tail behavior, allocation rate, GC CPU, and resource ceilings.
- Gate releases on regression of the four numbers that matter.
- Triage and fix the performance bugs that escape the gates.
- Run periodic deep optimization sprints with executive support, not in spare time.
The rest of this file is what each of those looks like in practice.
2. SLOs and the error budget
A performance SLO is two numbers and a window:
"p99 of `POST /checkout` < 300 ms over a rolling 28-day window, with a 99.9% achievement target."
The "achievement target" leaves a budget: 0.1% of the 28 days (about 40 minutes) can be over the threshold without violating the contract. That budget is what gives the team room to take engineering risks.
Three principles:
| Principle | Meaning |
|---|---|
| SLO is set by the business, not by engineering | "Our customers tolerate 300 ms" — measured, not assumed |
| Error budget allows risk-taking | Burn it on feature velocity if reliable; spend it carefully if recovering |
| Optimization work is justified by budget burn | If you're burning 80% of the budget, optimization is now top priority |
The professional alternative to "optimize when someone complains" is "optimize when the error budget says to."
3. Translating SLO into budget
The user-visible SLO of 300 ms p99 must be allocated across all the systems that contribute. A typical allocation:
| Component | Budget at p99 |
|---|---|
| Edge / load balancer / TLS | 20 ms |
| API gateway + auth | 30 ms |
| Application (this service) | 100 ms |
| Downstream service A | 50 ms |
| Downstream service B | 30 ms |
| Database (sum of queries) | 50 ms |
| Margin / variance | 20 ms |
| Total | 300 ms |
The application's 100 ms is the actual engineering target. Performance work in the service is judged against whether p99 of the service's own processing stays under 100 ms. The senior engineer who set the 100 ms is now the professional who has to defend it across releases.
4. The four dashboards every service needs
| Dashboard | Source | What it tells you |
|---|---|---|
| Latency histogram | Server-side timing, exported to Prometheus / Datadog | p50, p95, p99, p99.9 of each endpoint |
| Throughput and saturation | Request rate, CPU%, queue depth | Whether the service is at capacity |
| GC and allocation | runtime/metrics, gctrace | GC CPU fraction, allocation rate, live heap |
| Resource ceilings | Container metrics | RSS vs. limit, FD count, goroutine count |
Each chart should have an alert pointed at the threshold that maps to an action. "p99 climbing toward SLO" is not an alert; "p99 over 250 ms for 5 minutes" is.
5. Performance budgets per route
The 100 ms application budget is itself further decomposed into per-route budgets:
| Route | Budget at p99 | Reason |
|---|---|---|
| `POST /checkout` | 100 ms | The main customer flow |
| `GET /products/{id}` | 50 ms | High-frequency, cacheable |
| `POST /admin/*` | 500 ms | Tolerated; low frequency |
| `GET /health` | 10 ms | Probed every second; can't drift |
The budget is enforced via dashboards. A PR that pushes /checkout p99 from 80 ms to 95 ms in canary is not "still in budget"; it is "burning the budget" and the team should know about it before it merges.
6. The CI gate
Microbenchmarks run in CI on every PR. The output is fed through benchstat against a known-good baseline branch.
```bash
# Benchmark the known-good baseline.
git fetch origin main
git checkout origin/main
go test -bench=. -benchmem -run='^$' -count=10 ./... > /tmp/base.txt

# Benchmark the PR branch.
git checkout -
go test -bench=. -benchmem -run='^$' -count=10 ./... > /tmp/new.txt

# Compare the two runs.
benchstat /tmp/base.txt /tmp/new.txt
```
The gating rule is configured per benchmark:
| Benchmark | Allowed delta | Why |
|---|---|---|
| `BenchmarkHotPath_*` | < 2% slower | On the SLO critical path |
| `BenchmarkAdmin_*` | < 15% slower | Off-critical |
| `BenchmarkParser_*` | < 5% slower | Used by many endpoints |
| `BenchmarkAllocations` | 0 new allocations | A regression of allocs/op is a release-blocker |
The CI script enforces this. PRs that violate a rule must add a justification or include an offsetting improvement elsewhere.
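The enforcement step might look like the sketch below. It assumes the benchstat output has already been parsed into per-benchmark deltas and that only statistically significant deltas are passed in. The `rule` and `result` types and the `violations` function are hypothetical; benchstat itself does not provide them.

```go
package main

import (
	"fmt"
	"strings"
)

// rule mirrors one row of the gating table: benchmarks whose name starts
// with prefix must be less than maxSlowdownPct slower than the baseline.
type rule struct {
	prefix         string
	maxSlowdownPct float64
}

var rules = []rule{
	{"BenchmarkHotPath_", 2},
	{"BenchmarkAdmin_", 15},
	{"BenchmarkParser_", 5},
}

// result is an already-parsed comparison for one benchmark (assumed input).
type result struct {
	name         string
	timeDeltaPct float64 // positive means slower than baseline
	allocsDelta  int     // change in allocs/op
}

// violations returns one message per broken gating rule.
func violations(results []result) []string {
	var out []string
	for _, r := range results {
		if strings.HasPrefix(r.name, "BenchmarkAllocations") && r.allocsDelta > 0 {
			out = append(out, fmt.Sprintf("%s: +%d allocs/op", r.name, r.allocsDelta))
		}
		for _, ru := range rules {
			if strings.HasPrefix(r.name, ru.prefix) && r.timeDeltaPct >= ru.maxSlowdownPct {
				out = append(out, fmt.Sprintf("%s: %.1f%% slower (limit %.0f%%)",
					r.name, r.timeDeltaPct, ru.maxSlowdownPct))
			}
		}
	}
	return out
}

func main() {
	results := []result{
		{"BenchmarkHotPath_Encode", 3.1, 0},
		{"BenchmarkAdmin_List", 10.0, 0},
		{"BenchmarkAllocations", 0.0, 1},
	}
	for _, v := range violations(results) {
		fmt.Println("GATE FAILED:", v)
	}
}
```

A CI job would exit non-zero when the returned slice is non-empty, unless the PR carries a recorded justification.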
7. Continuous profiling
Capturing a profile only when on-call gets paged is too late. The professional answer is continuous profiling: a background agent captures short profiles (10–30 seconds, every few minutes) and ships them to a central store.
Tools:
| Tool | Notes |
|---|---|
| `pprof` over HTTP, scraped by a sidecar | Self-rolled but flexible |
| Pyroscope (now Grafana Pyroscope) | Open source, integrates with Grafana |
| Datadog Continuous Profiler | Commercial, low overhead |
| Polar Signals Parca | Open source, eBPF-based |
The value is twofold:
- When an incident happens, you have profile data from before the incident.
- Slow drift across releases is visible as a change in profile shape, not just numbers.
A service running a continuous profiler can expose drift weeks before it shows up as an SLO breach.
8. The PGO pipeline
Profile-Guided Optimization (Go 1.21+) is integrated into the build pipeline.
```
production → continuous profile collection → 24h aggregate → default.pgo
                                                                 ↓
                                                        go build -pgo=auto
                                                                 ↓
                                                        next release binary
```
Steps in detail:
- The continuous profiler aggregates 24 hours of CPU profiles into a single representative profile.
- A scheduled job pulls the aggregate, validates it (sample count > 100k, average load > N), and writes it as `default.pgo` in the main package directory of the source tree.
- The next CI build picks up `default.pgo` automatically with `-pgo=auto`.
- The team reviews the PR that bumps the profile, looking for unexpected shape changes.
Typical wins: 2–10% on CPU. The discipline is keeping the profile fresh — a stale profile produces a binary optimized for the wrong workload. Re-collect monthly or on major workload changes.
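The validation in the scheduled job can be as simple as the sketch below. It assumes the sample count and average load have already been extracted from the aggregate; the `profileStats` type and `validate` function are hypothetical, and the load threshold is left as a parameter because N is service-specific.

```go
package main

import "fmt"

// profileStats summarizes an aggregated profile (assumed to be computed elsewhere).
type profileStats struct {
	samples int     // total CPU samples in the aggregate
	avgLoad float64 // average load during collection, in the unit the team uses for N
}

// validate rejects aggregates that are too small or were collected under
// too little load to be representative.
func validate(p profileStats, minSamples int, minLoad float64) error {
	if p.samples <= minSamples {
		return fmt.Errorf("only %d samples, need more than %d", p.samples, minSamples)
	}
	if p.avgLoad <= minLoad {
		return fmt.Errorf("average load %.1f is not above %.1f", p.avgLoad, minLoad)
	}
	return nil
}

func main() {
	const minSamples = 100_000
	const minLoad = 1.0 // placeholder for N; choose per service

	agg := profileStats{samples: 250_000, avgLoad: 12.5}
	if err := validate(agg, minSamples, minLoad); err != nil {
		fmt.Println("keep the existing default.pgo:", err)
		return
	}
	fmt.Println("profile accepted; open the PR that updates default.pgo")
}
```

Rejecting a weak aggregate and keeping the previous profile is the safer failure mode, since the build still gets a profile that was valid when it was collected.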
9. The canary and the rollback
Every release is rolled out gradually. The professional shape of the rollout:
| Stage | Traffic | Duration | Block if |
|---|---|---|---|
| Smoke | Internal traffic only | 5 min | Any error or > 2× latency vs. prev |
| Canary 1% | 1% of prod | 30 min | p99 +15%, error rate +0.5%, GC CPU +30% |
| Canary 10% | 10% of prod | 1 hour | Same thresholds |
| Full | 100% | — | Same thresholds, with auto-rollback hook |
The auto-rollback hook reverts the deployment if any threshold is crossed in any stage. The release engineer doesn't have to be online; the automation does it.
This is the production version of "re-measure after the change." Microbenchmarks said the change was a win; the canary will tell you whether it's a win at real load.
10. Regression incident response
When a regression slips through the gate and hits production:
- Confirm. Latency dashboard shows a step change at the deploy time? Compare canary vs. previous.
- Roll back first, diagnose second. The service was fine yesterday; restore that state.
- Bisect. Which commit in the release caused the regression?
- Capture. A profile from the regressed version (if a pod still exists) vs. a profile from the fixed version.
- Fix forward. Either revert that commit alone or fix the root cause in a follow-up.
- Post-mortem. Why did the gate miss this? Add a benchmark or alert that would have caught it.
The post-mortem step is what makes the system better. Each escape is an opportunity to harden the gate.
11. Team practices
Performance is a property of the system, but it is maintained by the team. Practices that work:
| Practice | Cadence |
|---|---|
| Performance review per PR (lightweight) | Each PR — reviewer looks at benchstat output |
| Performance backlog | Continuous; each known issue ticketed with a profile snapshot |
| Optimization sprints | Quarterly; week-long focused effort |
| Performance "office hours" | Weekly; engineers bring profiles, the team triages |
| Post-mortems for SLO breaches | Every breach |
| Shared dashboards | Always; the four-number panel is a fixture for every service |
The professional team treats performance as a first-class engineering concern, not as "we'll get to it after features."
12. The runtime/metrics panel
Build one Grafana panel per service exporting these:
| Metric | Why it matters |
|---|---|
| `/sched/goroutines:goroutines` | Goroutine leaks |
| `/sched/latencies:seconds` (histogram) | Scheduler delay; signal of saturation |
| `/gc/heap/live:bytes` | Working set |
| `/gc/heap/allocs:bytes` (rate) | Allocation pressure |
| `/gc/pauses:seconds` (histogram) | GC pause distribution |
| `/cpu/classes/gc/total:cpu-seconds` (rate) | GC CPU fraction |
| `/cpu/classes/scavenge/total:cpu-seconds` (rate) | Background scavenger work |
| `/memory/classes/heap/objects:bytes` | Heap occupancy |
These map directly to the Prometheus Go collector exposed by `prometheus/client_golang` with `GoRuntimeMetricsCollection`. Use one panel per service, identical across services, so on-call engineers don't have to relearn each service.
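To see what these metrics contain before wiring up an exporter, they can be read directly with the standard `runtime/metrics` package. The sketch below reads only the scalar metrics from the table; histogram and floating-point metrics need their own handling and are skipped. The helper name `readRuntimeMetrics` is illustrative.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readRuntimeMetrics samples the named metrics and returns those with
// uint64 values. Metrics unsupported by the running Go version, and
// non-uint64 kinds such as histograms, are skipped.
func readRuntimeMetrics(names []string) map[string]uint64 {
	samples := make([]metrics.Sample, len(names))
	for i, n := range names {
		samples[i].Name = n
	}
	metrics.Read(samples)

	out := make(map[string]uint64, len(names))
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			out[s.Name] = s.Value.Uint64()
		}
	}
	return out
}

func main() {
	names := []string{
		"/sched/goroutines:goroutines",
		"/gc/heap/live:bytes",
		"/gc/heap/allocs:bytes",
		"/memory/classes/heap/objects:bytes",
	}
	for name, v := range readRuntimeMetrics(names) {
		fmt.Printf("%-40s %d\n", name, v)
	}
}
```

Cumulative metrics such as `/gc/heap/allocs:bytes` only become useful as a rate, which is why the table marks them that way; the dashboard computes the rate from successive scrapes.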
13. Cost-aware optimization
At scale, performance is also cost.
| Scenario | Cost dimension |
|---|---|
| 1000 pods × 10% CPU saved | $X/month less in compute |
| 1000 pods × 20% memory saved | Allows smaller pod size, lower cost |
| Allocation rate halved | Lower per-pod CPU, often allows fewer pods |
| Latency 50 ms → 40 ms on customer-facing path | Hard to monetize; sometimes maps to conversion lift |
| Latency 200 ms → 150 ms on internal job | Often nothing |
Engineering hours are also money. A 10% improvement that saves $500/month in compute but takes two weeks of engineer time at $5000/week is a $10,000 investment for $6000/year of return, so it needs about 20 months to pay for itself. The professional makes that math explicit before committing.
14. The optimization sprint
Periodic focused optimization works better than continuous trickle-optimization. Structure of a one-week sprint:
| Day | Activity |
|---|---|
| Monday | Gather profiles, agree on top 3 hotspots, write benchmarks for each |
| Tuesday | Apply candidate optimizations; pair on benchstat results |
| Wednesday | Continue; merge wins to a branch, leave failures documented |
| Thursday | Soak tests on the canary cluster; gather production profiles |
| Friday | Document, write up findings, plan next sprint's targets |
The output of the week is (a) measurable improvements landed and (b) a written record of what was tried and why. Both compound across sprints; the second sprint is faster than the first because the team is better calibrated.
15. Cross-team performance
When the bottleneck is in a service owned by another team, the workflow includes communication:
| Step | Action |
|---|---|
| Identify | Profile shows the bottleneck is in a call to service Y |
| Quantify | Measure the contribution: "Service Y's p99 is 150 ms of our 200 ms budget" |
| Communicate | Open an issue with the data — profile, latency histogram, traffic shape |
| Collaborate | Offer to help; sometimes the fix is theirs, sometimes ours (batch, cache, retry policy) |
| Track | Add a dashboard for the cross-service latency; revisit |
The professional doesn't say "their service is slow" and move on. Cross-team performance work is some of the highest-leverage performance work because the bottlenecks usually involve more than one team.
16. Long-lived performance documentation
Each service maintains a PERFORMANCE.md (or wiki page) with:
| Section | Content |
|---|---|
| SLOs | Numbers, window, achievement target |
| Budget | Per-component breakdown of the SLO |
| Known hotspots | Functions you've optimized before and why they're shaped the way they are |
| Tuning history | Each non-obvious optimization, with date, person, before/after |
| Anti-patterns to avoid | "Do not replace this stack buffer with strconv" |
| Runbooks | RSS climbing, GC CPU high, goroutine leak |
| Benchmark baselines | Where to find them and how to run them |
This is the document a new team member reads in their first week. It is the memory of a team. Without it, every new engineer relearns the same lessons.
17. The gates against entropy
Performance regresses without active resistance. The five mechanisms that hold the line:
| Gate | Catches |
|---|---|
| Microbenchmark CI | Function-level regressions |
| Canary thresholds | System-level regressions before full rollout |
| Continuous profiling alerts | Slow drift across releases |
| Quarterly optimization sprint | Accumulated debt that didn't trigger any gate |
| `PERFORMANCE.md` | Knowledge loss across team changes |
Each one catches what the others miss. A team running all five holds the line far better than a team running just one or two.
18. Summary
Professional optimization is institutional: SLOs translate into per-route budgets; budgets are enforced by CI gates and canary thresholds; continuous profiling makes drift visible; PGO closes the loop from production back to compile; and team practices keep the work funded against feature pressure. The individual engineer's loop — measure, identify, change, re-measure — is unchanged, but it runs against a backdrop of institutional support and accountability. That's how performance survives organizational entropy.
Further reading
- Google SRE book on SLOs: https://sre.google/sre-book/service-level-objectives/
- Go PGO documentation: https://go.dev/doc/pgo
- Grafana Pyroscope: https://grafana.com/oss/pyroscope/
- Polar Signals Parca: https://www.parca.dev/
- Datadog Continuous Profiler: https://docs.datadoghq.com/profiler/
- Susan Fowler, "Production-Ready Microservices" (book), chapter on performance