Profiling-Guided Optimization¶
Take a deliberately slow Go service, put it under real load, and make it fast the disciplined way: measure first, find the one bottleneck that actually matters, fix it, prove the win with
benchstatand a re-profile. No guessing. The deliverable is not a faster service — it's a chain of evidence.
| Tier | Load-testing (perf craft) |
| Primary domain | Profiling / performance engineering |
| Skills exercised | pprof (CPU/heap/block/mutex), flame graphs, go tool pprof (top/list/web/diff), runtime/trace, continuous profiling, the profile→hypothesis→change→benchstat→re-profile loop, Amdahl's law, profiling under load |
| Interview sections | 17 (performance engineering), 1 (go), 18 (observability) |
| Est. effort | 3–5 focused days |
1. Context¶
You've inherited a Go HTTP service — call it enrich — that takes a batch of events, looks each one up against an in-memory index, decorates it, and returns JSON. It is correct and it is slow: p99 sits near 400 ms at a modest 1.5k RPS and the box is pegged. The previous owner already "optimized" it twice — once by adding a sync.Mutex around a map (made it worse), once by switching JSON libraries (no measurable change) — because they tuned by intuition and never measured. You will not repeat that.
Your job is to drive this service to a stated SLO by profiling under load, naming the dominant cost, changing exactly that, and proving each step moved the number. Every optimization in your findings note must be backed by a before/after profile or a benchstat table. An optimization with no measurement attached is a guess, and guesses don't count.
2. Goals / Non-goals¶
Goals - Stand up the slow enrich service plus a load harness that holds a defined, open-model request rate (so a profile is taken while the queue is real). - Master the four profiles — CPU, heap (alloc & inuse), block, mutex — and state precisely what each one can and cannot tell you. - Run the optimize loop end to end: profile → hypothesis → one change → benchstat (micro) and re-profile (macro) → keep or revert. - Distinguish the four bottleneck regimes — CPU-bound, lock-bound, alloc/GC-bound, IO-bound — from profile shape alone, and prove which one you're in. - Apply Amdahl's law: refuse to optimize a 2% line; find the part whose fraction of total time makes the fix worth it. - Use runtime/trace to explain scheduler latency, GC pauses, and goroutine blocking that the aggregate profiles flatten away.
Non-goals - Rewriting the service in another language. The point is to optimize this Go code. - Chasing a leaderboard number. The target is a defended p99/throughput win, not the lowest possible nanosecond. - Premature unsafe/assembly micro-tricks. You earn those only after the profile says the hot loop is genuinely CPU-bound and algorithmically minimal.
3. Functional requirements¶
- The service (
cmd/enrich) exposesPOST /enrichtaking a JSON batch of N events; for each event it does a lookup against an in-memory index, enriches it, and returns the batch. Behavior is fixed and verified by a golden test — speed must never change correctness. net/http/pprofis mounted (on a separate admin port, not the hot path) so/debug/pprof/{profile,heap,block,mutex,trace,goroutine}are reachable live.runtime.SetBlockProfileRateandruntime.SetMutexProfileFractionare wired to flags (default off — they have overhead) so block/mutex profiles can be enabled for a measurement window.- A
make profile RPS=… DUR=…target drives load and captures all four profiles + a 5 s trace during the steady-state window, into a timestamped dir. - A microbenchmark suite (
*_test.go,testing.B) covers each hot function you touch, so every change has abenchstat-able A/B at the function level. - The slow version is preserved (tag/branch) as the baseline control so every later number is a delta against a fixed reference.
4. Load & data profile¶
- Index size: the lookup index holds 50M entries (≈ the "big data" axis). Key distribution is Zipfian (s≈1.1) so a few keys are hot — this is what makes the heap/inuse profile and cache behavior interesting.
- Request shape: batches of 100 events; event keys drawn from the same Zipfian. Two request mixes: fat-batch (big-data regime, 1k events/batch, low RPS) and thin-batch (high-RPS regime, 1 event/batch, very high request count).
- Traffic model: open model (fixed arrival rate, not closed-loop), so the service's own slowdown shows up as a growing queue and tail latency — a closed-loop "as fast as it drains" test hides exactly the contention you're hunting. Beware coordinated omission in the harness (see
load-testing/01). - Generator:
cmd/genbuilds the 50M-entry index and the request corpus deterministically from a seed. - The cardinal rule: profile under load, never at idle. An idle profile of this service is a lie — at idle there is no lock contention, the GC barely runs, and the scheduler never queues. The interesting profile only exists while the open-model load is applied.
5. Non-functional requirements / SLOs¶
Targets are stated as before → after. "Before" is the baseline control; "after" is what you must reach and prove with a re-profile.
| Metric | Before (baseline) | After (target) |
|---|---|---|
| p99 latency, thin-batch @ 1.5k RPS | ~400 ms | < 40 ms |
| Sustained throughput before p99 > 100 ms | ~1.5k RPS | ≥ 15k RPS |
| Allocations per request (heap alloc profile) | report it | ≥ 5× reduction on the hot path |
GC CPU fraction (runtime/trace / GODEBUG=gctrace=1) | report it | < 5% of CPU |
| Mutex wait time (mutex profile, high-RPS run) | report it | near-zero after sharding/lock fix |
| Correctness | golden test passes | identical golden output, every step |
The numbers are a forcing function, not the grade. The grade is: for each target, can you point at the profile that said this was the bottleneck, the one change you made, and the re-profile that proved it moved — and can you show the next bottleneck the win exposed?
6. Architecture constraints & guidance¶
- Two ports. Keep
pprofand the admin/metrics server on a port distinct from/enrich. Profiling endpoints under the hot path skew the very thing you measure. - Block & mutex profiling cost. They sample on every block/contention event. Leave the rates at 0 in production; enable them only for the measurement window, and note that enabling them slightly perturbs latency (acknowledge the observer effect).
- Symbolize correctly. Build with full symbols; capture profiles from the same binary you load-test so
list/webmap to real source lines. A stripped binary gives you useless flame graphs. - One change at a time. Each commit on the optimization branch is a single hypothesis. Two changes in one commit means you can't attribute the delta.
- Instrument request rate, in-flight requests, p50/p99/p999 (HdrHistogram), and the four
runtimegauges (heap inuse, goroutines, GC pauses) via Prometheus, so the macro picture lines up with the micro profile.
7. Data model¶
index entry: { key uint64, payload [48]byte } // 50M of these, the "big data"
request: { batch []Event } // Event = { key uint64, attrs map[string]string }
profile bundle (per measurement run):
cpu.pprof heap.pprof block.pprof mutex.pprof trace.out hist.json meta.json
^ rate, duration, commit, GOMAXPROCS
Each measurement run writes a self-describing bundle: the four profiles, a trace, the latency histogram, and a meta.json recording the input rate, duration, git commit, GOMAXPROCS, GOGC/GOMEMLIMIT, and the load model. A profile without its capture conditions is unattributable.
8. Interface contract¶
POST /enrich→{ "results": [ {enriched event}, … ] }(golden-verified).- Admin port:
/debug/pprof/*(CPU?seconds=30,heap,block,mutex,trace?seconds=5,goroutine?debug=2), and/metrics(Prometheus). - Toolchain the reviewer must be able to re-run from your committed commands:
go tool pprof -http=:0 cpu.pprof(flame graph),top,list <func>,web.go tool pprof -base old.pprof new.pprof(the diff — proves a delta).go tool trace trace.out(scheduler/GC/latency view).benchstat old.txt new.txt(statistical A/B at the function level).
9. Key technical challenges¶
- Reading flame width, not flame height. Width = total time in a call subtree; height = call depth. People optimize a tall-but-thin tower and gain nothing. Find the widest frame whose work is yours to remove.
- Knowing which profile to open first. A latency problem can be CPU, GC, locks, or IO — four different profiles. Opening the wrong one wastes the run. The trace and the
runtimegauges tell you which regime you're in before you guess. - alloc vs inuse.
alloc_space(everything ever allocated → GC pressure) andinuse_space(live now → footprint/leaks) answer different questions. Confusing them sends you optimizing the wrong thing. - Amdahl's law in practice. If a function is 2% of CPU, making it free buys you 2%. The profile's
topcumulative column is your Amdahl calculator — fix the line whose fraction makes the speedup worth the complexity. - The bottleneck moves. Fix the lock and the CPU profile reshapes; fix the CPU and allocation becomes dominant. You must re-profile after every change, because the next binding constraint is rarely the one you expected.
10. Experiments to run (each tied to a profile)¶
Record before/after for every one. Name the profile that drove it.
- Baseline bundle (control). Capture all four profiles + trace + histogram on the unmodified service under steady load. This is the reference every later number diffs against. Read the CPU flame graph and write down the top-3 widest frames.
- CPU-bound fix (CPU profile). Use
topthenliston the widest user frame. Likely culprit: per-request JSON (re)marshal or a redundant sort/scan in the hot loop. Change it; confirm withgo tool pprof -basethat the frame shrank and withbenchstatthat the micro-bench improved with significance (p < 0.05). - Alloc/GC-bound fix (heap
alloc_space+ trace). Find the allocation hot spot (per-request map,[]bytechurn,fmt.Sprintf). Cut it with async.Pool, pre-sized buffers, or by hoisting the allocation. Prove: allocs/op down ≥ 5× inbenchstat, GC CPU fraction down in the trace. - Lock-bound fix (mutex + block profiles). Enable mutex/block profiling under the high-RPS run. The single mutex around the index will dominate. Shard it (striped locks /
sync.Map/ sharded map) or make reads lock-free. Prove the mutex-wait frame collapses and throughput climbs. - IO-bound check (trace + block profile). Add a downstream call (or DB lookup); show in the trace that goroutines park on the network, not on CPU — and that throwing CPU/locks at it does nothing. The fix here is concurrency/batching, not micro-optimization. This teaches you to not CPU-optimize an IO wait.
- Amdahl discipline (CPU
topcumulative). Deliberately optimize a 2% frame to completion; show the end-to-end p99 barely moves. Then optimize the widest frame; show it moves a lot. Same effort, different fraction — that's the lesson. - Regime shift (big-data vs high-RPS). Run the same service under fat-batch (big-data) and thin-batch (high-RPS). Show the dominant profile changes: heap/CPU dominate under big-data; mutex/block dominate under high-RPS. One service, two bottlenecks, two different profiles to open.
- End-to-end proof (
-basediff + histogram). Stack all kept changes; produce a singlego tool pprof -base baseline newand a before/after latency histogram that show the SLO met. Revert any change whose diff doesn't justify its complexity.
11. Milestones¶
- Slow service +
cmd/gen(50M index) + open-model load harness + Prometheus board; capture the baseline bundle. - CPU + alloc passes (experiments 2–3); first p99 drop, with
-basediffs andbenchstattables committed. - Lock pass under high-RPS (experiment 4); mutex-wait collapses, throughput climbs.
- Regime + Amdahl studies (experiments 5–7); the "which profile, which regime" note.
- End-to-end proof (experiment 8); findings note with the full evidence chain.
12. Acceptance criteria (definition of done)¶
- SLO table met: p99 < 40 ms @ 1.5k RPS and ≥ 15k RPS before p99 > 100 ms, each with the profile that drove the fix attached.
- For every kept optimization: a
go tool pprof -basediff (macro) and abenchstattable with statistical significance (micro). - Allocs/op reduced ≥ 5× on the hot path; GC CPU fraction < 5% shown in the trace.
- Mutex-wait near-zero in the high-RPS mutex profile after the lock fix.
- A written demonstration of Amdahl's law (the 2%-vs-widest-frame experiment).
- A regime note showing the dominant profile differs between big-data and high-RPS.
- Golden output byte-identical across every step (correctness never traded).
- Every number reproducible from a committed command + the
meta.jsonbundle.
13. Stretch goals¶
- Continuous profiling. Wire Pyroscope/Parca (or Go's pprof labels +
pprof.Do) so profiles are tagged by endpoint/tenant and collected continuously; find a bottleneck that only shows up in production-shaped traffic, not in the lab. - pprof labels for attribution. Tag goroutines with
runtime/pproflabels and show a CPU profile split by request type — attributing cost to a code path you couldn't otherwise isolate. - PGO (profile-guided optimization). Feed a representative CPU profile to the Go compiler (
-pgo) and measure the inlining/devirtualization win on top of your manual fixes. - Escape-analysis pass. Use
-gcflags=-mto confirm an allocation moved to the stack after your fix; cross-check againstload-testing/05. - Flame-graph diff in CI. Gate a PR on a regression in the CPU profile's hottest frame, not just wall-clock — extending the benchstat gate from
load-testing/06.
14. Evaluation rubric¶
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Method | Profiles before changing code | Never guesses — every change traces to a profile and is proven by a re-profile; reverts changes the diff doesn't justify |
| Profile literacy | Reads a CPU flame graph | Picks the right profile per regime; distinguishes alloc vs inuse, block vs mutex, CPU vs IO from shape alone |
| Amdahl's law | Knows not to micro-optimize | Uses cumulative top to target the frame whose fraction pays; can predict the ceiling a fix can reach |
| Under-load discipline | Profiles a running service | Knows an idle profile lies; captures during steady-state open-model load and records the conditions |
| Tracing | Can open go tool trace | Explains scheduler latency, GC pauses, and goroutine blocking the aggregate profiles hide |
| Proof | Shows a faster number | benchstat significance + -base diff + histogram; the next exposed bottleneck named |
| Communication | Clear findings note | Could walk a staff panel through the evidence chain and defend why each fix was the right one to make |
15. References¶
- Go blog: "Profiling Go Programs";
runtime/pprof,net/http/pprof,runtime/tracepackage docs. go tool pprofREADME (top,list,web,-base,-diff_base) and the Go diagnostics guide.- Brendan Gregg — Flame Graphs (reading width vs height) and the USE method.
- Amdahl's law — and Systems Performance (Gregg) on choosing what to optimize.
- Go PGO docs (
-pgo); Pyroscope / Parca continuous-profiling docs. - Builds on
load-testing/06-microbenchmarking-and-benchstat(the micro A/B that proves each function-level change) andload-testing/03-soak-and-leak-hunting(profiling over time to separate a leak from a hot path). - See also:
Interview Question/17-performance-engineering/andInterview Question/18-observability/.