Mutex and Block Profiling: Professional
1. The production framing
Contention bugs are the silent killer. CPU dashboards look fine, request rate looks fine, and yet p99 latency drifts up release after release. The professional job around mutex/block profiling, in order:
- Enable both profiles in every service from day one, with conservative rates.
- Make captures cheap and routine, so that automation can trigger them, not just humans.
- Continuously ingest contention data into long-term storage and dashboards.
- Diff per release. A new top-5 stack in the mutex profile is treated like an SLO regression.
- Maintain a runbook for the "latency rising, CPU flat" incident class.
This file shows what each of those looks like in practice.
2. Enabling profiles in every service
A shared internal/profiling package does this exactly once per binary:
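    // Package profiling enables mutex and block profiling and serves the
    // pprof endpoints on a private address.
    package profiling

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
        "runtime"
    )

    // Config holds the sampling rates and the listen address.
    type Config struct {
        Addr        string // listen address for the pprof endpoints
        MutexRate   int    // sample about 1 in MutexRate mutex contention events
        BlockRateNs int    // sample about one blocking event per BlockRateNs ns blocked
    }

    // Default is the conservative always-on production configuration.
    var Default = Config{
        Addr:        "127.0.0.1:6060",
        MutexRate:   100,
        BlockRateNs: 10_000,
    }

    // Init sets the sampling rates and starts the pprof listener.
    // Call it once, early in main.
    func Init(cfg Config) {
        runtime.SetMutexProfileFraction(cfg.MutexRate)
        runtime.SetBlockProfileRate(cfg.BlockRateNs)

        // Expose only the pprof subtree, not everything on DefaultServeMux.
        mux := http.NewServeMux()
        mux.Handle("/debug/pprof/", http.DefaultServeMux)

        go func() {
            if err := http.ListenAndServe(cfg.Addr, mux); err != nil {
                log.Printf("profiling: listener on %s stopped: %v", cfg.Addr, err)
            }
        }()
    }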
cmd/myservice/main.go calls profiling.Init(profiling.Default) in the first dozen lines. Now every service exposes both profiles with identical defaults.
3. When to enable, when to tune
| Situation | Mutex fraction | Block rate (ns) | Why |
|---|---|---|---|
| Default in prod | 100 | 10_000 | Negligible overhead, useful trends |
| Active incident, low-traffic | 10 | 1_000 | More detail, still bounded |
| Active incident, high-traffic | 100 | 10_000 | Don't change rates — capture deltas instead |
| Staging benchmarks | 1 | 1 | All events; never on prod |
| Latency-critical paths confirmed safe | 1000 | 100_000 | Reduce overhead under microsecond-mutex storms |
The mistake to avoid is "enable on incident". By the time the incident is live, you want to compare against last week, and only always-on profiling in production gives you that baseline.
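For the low-traffic incident row, one way to raise detail temporarily and then return to the production defaults is sketched below. `Rates`, `Apply`, and `CurrentMutexFraction` are hypothetical helper names, not part of any library. The sketch relies on two documented behaviours of the `runtime` package: `SetMutexProfileFraction` returns the current rate without changing it when given a negative argument, and the block profile rate cannot be read back, so the caller passes the previous values in.

```go
package main

import (
	"fmt"
	"runtime"
)

// Rates pairs the two sampling knobs from the table above.
type Rates struct {
	MutexFraction int // about 1 in N mutex contention events is sampled
	BlockRateNs   int // about one sample per N ns spent blocked
}

// Apply installs r and returns a function that restores prev.
// The block profile rate cannot be read back from the runtime, so the
// caller supplies the previous rates explicitly.
func Apply(r, prev Rates) (restore func()) {
	runtime.SetMutexProfileFraction(r.MutexFraction)
	runtime.SetBlockProfileRate(r.BlockRateNs)
	return func() {
		runtime.SetMutexProfileFraction(prev.MutexFraction)
		runtime.SetBlockProfileRate(prev.BlockRateNs)
	}
}

// CurrentMutexFraction reads the mutex rate without changing it:
// a negative argument only reports the current value.
func CurrentMutexFraction() int {
	return runtime.SetMutexProfileFraction(-1)
}

func main() {
	prod := Rates{MutexFraction: 100, BlockRateNs: 10_000}
	incident := Rates{MutexFraction: 10, BlockRateNs: 1_000}

	Apply(prod, prod) // startup default
	restore := Apply(incident, prod)
	fmt.Println("during incident:", CurrentMutexFraction())
	restore()
	fmt.Println("after incident:", CurrentMutexFraction())
}
```

Returning a restore function keeps the "put it back" step next to the change, so an on-call engineer can `defer` it and cannot forget it.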
4. Continuous profiling pipelines
For one-off investigation, `curl /debug/pprof/mutex` works. For a fleet of services, you want a continuous profiling system that scrapes, dedupes, and stores profiles over the long term. Three common options, open-source and hosted:
| System | Notes |
|---|---|
| Pyroscope / Grafana Phlare | Self-hostable; agent scrapes pprof endpoints; UI has time-series + flame graphs |
| Parca | Similar concept, eBPF-augmented for syscalls/locks below the runtime |
| Datadog / Polar Signals / GCP Profiler | Hosted SaaS; same profile format under the hood |
Whichever you pick, the contract is the same:
- Agent fetches `/debug/pprof/mutex` and `/debug/pprof/block` every N seconds.
- Server stores the profiles keyed by `(service, version, instance, timestamp)`.
- UI lets you select a time range, drop to flame graph, diff with another range.
The single most useful query becomes: "show me the mutex profile for serviceX@v2.3 minus the same for v2.2." That's how regressions are caught.
5. Dashboard pattern: contention as a SLI
Add four panels to your service dashboard:
| Panel | Metric | Alert at |
|---|---|---|
| Mutex delay rate | sum(rate(profile_mutex_delay_ns[5m])) | 2× baseline for 10 min |
| Block delay rate | sum(rate(profile_block_delay_ns[5m])) | 2× baseline for 10 min |
| Top contended stack | topk(5, profile_mutex_delay_ns) | New entry appears |
| Goroutine count | go_goroutines | Trending up unboundedly |
The "top contended stack" panel deserves a story: continuous profiling lets you slice profiles by stack frame. You group by the top-3 frames and chart the delay attributed to each over time. A new bar appearing after a release is the signature of a contention regression.
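The grouping behind that panel can be sketched as follows. `Sample` is a hypothetical, already-parsed representation of a profile sample (a real pipeline would first decode the pprof protobuf, which is not shown), and the frame names in `main` are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a hypothetical, already-parsed profile sample: a call stack
// (leaf first) and the contention delay attributed to it.
type Sample struct {
	Stack   []string
	DelayNs int64
}

// ByTopFrames sums delay per group, keyed by the first n frames.
func ByTopFrames(samples []Sample, n int) map[string]int64 {
	out := make(map[string]int64)
	for _, s := range samples {
		frames := s.Stack
		if len(frames) > n {
			frames = frames[:n]
		}
		out[strings.Join(frames, ";")] += s.DelayNs
	}
	return out
}

// TopK returns the k group keys with the largest delay, biggest first.
func TopK(groups map[string]int64, k int) []string {
	keys := make([]string, 0, len(groups))
	for key := range groups {
		keys = append(keys, key)
	}
	sort.Slice(keys, func(i, j int) bool {
		if groups[keys[i]] != groups[keys[j]] {
			return groups[keys[i]] > groups[keys[j]]
		}
		return keys[i] < keys[j] // deterministic order for ties
	})
	if len(keys) > k {
		keys = keys[:k]
	}
	return keys
}

// NewEntries lists keys in current that are absent from baseline.
func NewEntries(baseline, current []string) []string {
	seen := make(map[string]bool, len(baseline))
	for _, key := range baseline {
		seen[key] = true
	}
	var added []string
	for _, key := range current {
		if !seen[key] {
			added = append(added, key)
		}
	}
	return added
}

func main() {
	cacheGet := Sample{[]string{"sync.(*Mutex).Lock", "cache.Get", "api.Handle", "main.serve"}, 400}
	statsRecord := Sample{[]string{"sync.(*Mutex).Lock", "stats.Record", "api.Handle", "main.serve"}, 900}

	before := TopK(ByTopFrames([]Sample{cacheGet}, 3), 5)
	after := TopK(ByTopFrames([]Sample{cacheGet, statsRecord}, 3), 5)
	fmt.Println("new in top 5:", NewEntries(before, after))
}
```

Truncating to three frames is a trade-off: fewer frames merge unrelated callers into one bar, more frames split one lock across many bars.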
6. Capturing profiles during incidents
A standardised incident command:
    #!/usr/bin/env bash
    # capture-contention.sh: run during an incident, attach the output to the ticket.
    # Usage: capture-contention.sh host:port
    set -euo pipefail

    host=${1:?usage: $0 host:port}
    out=incident-$(date +%s)
    mkdir "$out"

    # snapshot N: fetch the mutex, block, and goroutine profiles as <name>-N.pb.gz
    snapshot() {
        local p
        for p in mutex block goroutine; do
            curl -fsS "http://$host/debug/pprof/$p" -o "$out/$p-$1.pb.gz"
        done
    }

    snapshot 1
    sleep 60   # the incident window to measure
    snapshot 2

    echo "captured to $out; analyse with:"
    echo "  go tool pprof -base $out/mutex-1.pb.gz $out/mutex-2.pb.gz"
Two snapshots a minute apart. The diff captures contention during the incident window, not the lifetime total since process start. Put this script in your on-call runbook.
7. The "latency rising, CPU flat" runbook
When p99 climbs without CPU rising:
1. Confirm. Check that QPS and CPU are flat; if not, this is a different runbook.
2. Goroutine count. Hit `/debug/pprof/goroutine?debug=1`. If many goroutines are parked on `semacquire`, contention is live.
3. Mutex delta. Capture a minute, diff. Look at `top -cum`. The leader is the bottleneck.
4. Block delta. Same. If mutex is empty but block is loud on a channel, it's back-pressure, not lock contention.
5. Source. `list` the leader. The expensive line is what to fix.
6. Bypass. If you can deploy a fix quickly, do; if not, scale out (add replicas). Under contention, more replicas help even if more cores per replica wouldn't.
7. Post-incident. Add the offending stack as a tracked metric. Bake the fix in as a regression test.
The seventh point is what professionals do that hobbyists skip. Each incident expands the dashboards.
8. Release-time gating
Run the production-shaped benchmark on every PR:
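    import (
        "bytes"
        "os"
        "runtime"
        "runtime/pprof"
        "testing"
    )

    func BenchmarkHotPath(b *testing.B) {
        // Record every contention event. Fine in CI, never in prod.
        runtime.SetMutexProfileFraction(1)
        runtime.SetBlockProfileRate(1)

        b.ResetTimer() // exclude the setup above from the measurement
        for i := 0; i < b.N; i++ {
            hotPath()
        }
        b.StopTimer() // exclude profile writing as well

        for _, name := range []string{"mutex", "block"} {
            var buf bytes.Buffer
            pprof.Lookup(name).WriteTo(&buf, 0) // debug=0: gzip-compressed protobuf
            if err := os.WriteFile(name+".pb.gz", buf.Bytes(), 0o644); err != nil {
                b.Fatal(err)
            }
        }
    }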
CI compares the new profile to the baseline. If a new stack appears or an existing stack's delay doubles, the PR fails. The threshold is workload-specific; pick a number that matches your team's tolerance.
Tools that automate this: benchstat for time/alloc deltas, custom scripts for pprof deltas. A simple approach: write the output of `go tool pprof -top` to a text file and diff it against the same output for the baseline.
9. The profile retention policy
Profiles are small (low single-digit MiB each) but accumulate fast. A typical policy:
| Type | Retention | Storage |
|---|---|---|
| Routine 30-s scrape | 7 days | Hot storage |
| Hourly aggregate | 90 days | Warm storage |
| Daily baseline | 1 year | Cold storage |
| Per-release baseline | Forever | Object storage, immutable |
Keeping per-release baselines forever is what enables "this regression came in between v2.3 and v2.4" investigations a year later.
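Expressed as code, the table above reduces to a lookup plus an age check. The sketch below is one possible encoding for a cleanup job. The `Tier` names and the `Expired` function are hypothetical, and "1 year" is taken as 365 days.

```go
package main

import (
	"fmt"
	"time"
)

// Tier names the four rows of the retention table.
type Tier int

const (
	RoutineScrape Tier = iota
	HourlyAggregate
	DailyBaseline
	ReleaseBaseline
)

const day = 24 * time.Hour

// retention maps each tier to its maximum age; zero means "keep forever".
var retention = map[Tier]time.Duration{
	RoutineScrape:   7 * day,
	HourlyAggregate: 90 * day,
	DailyBaseline:   365 * day,
	ReleaseBaseline: 0,
}

// Expired reports whether a profile of the given tier and age may be
// deleted. Unknown tiers are never expired, which errs on the side of
// keeping data.
func Expired(t Tier, age time.Duration) bool {
	limit := retention[t]
	return limit != 0 && age > limit
}

func main() {
	fmt.Println("8-day-old routine scrape expired:", Expired(RoutineScrape, 8*day))
	fmt.Println("8-day-old hourly aggregate expired:", Expired(HourlyAggregate, 8*day))
	fmt.Println("old release baseline expired:", Expired(ReleaseBaseline, 5000*day))
}
```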
10. Cost of running profiles continuously
For a service handling 10 000 QPS with moderate contention:
| Profile | Rate | CPU overhead | Memory overhead |
|---|---|---|---|
| Mutex | 100 | ~0.1% | ~1 MiB profile buffer |
| Block | 10_000 | ~0.5% | ~2 MiB profile buffer |
| Goroutine snapshot | per-scrape | < 50 ms pause once per minute | n/a |
A modern service can afford this. The CPU cost is repaid many times over the first time an incident is resolved in minutes instead of hours.
Be careful about three multipliers:
- Block rate of `1` (recording every event) easily hits 5–10% on busy services.
- Capturing `goroutine?debug=2` with thousands of goroutines is expensive (full stacks, no sampling).
- Continuous flame graph rendering in the UI uses the client's CPU; it doesn't affect the service.
11. Privacy and security
Both profiles include source paths and function names from your binary. This is unrelated to user data but does leak architecture. Treat the endpoints as sensitive:
- Bind to localhost or an admin interface, never the user-facing port.
- Authenticate the scrape agent (mTLS or token).
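- Do not count on `-ldflags="-s -w"` to hide symbol names: it drops the symbol table and DWARF data, but Go profiles still carry function names and file paths. Fully symbolless profiles are mostly useless without an out-of-band symbol resolver.
- Store profiles with the same access controls as code.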
A production endpoint that responds to anonymous GETs of /debug/pprof/mutex is a CVE waiting to happen.
12. Cross-team workflows
When the contention top frame points at a shared library you don't own, the data is your leverage:
- File a ticket with the pprof flame graph URL, the diff against baseline, and the affected service.
- Include `list <function>` output for the worst stacks.
- Quantify impact (delay / second per replica × replicas × duration).
- Propose a fix — even a wrong one. The library owners read code, not prose.
Profiles are the only language that crosses teams unambiguously about contention. "Cache.Get is slow" sparks debate; "Cache.Get contributes 8 s/min of mutex delay across 60 replicas" gets the fix scheduled.
13. Profile-driven design reviews
For new components that include synchronisation, require the design doc to answer:
- What primitive (`Mutex`, `RWMutex`, `chan`, `atomic`)?
- Expected critical-section duration at p99?
- Expected QPS?
- Estimated contention (rough Amdahl) at 8, 32, 128 cores?
- Plan for sharding/copy-on-write if the estimate exceeds budget?
This single page costs an hour and saves the next on-call from re-discovering it the hard way. Reviewing it requires the reviewer to think about contention, which is itself the goal.
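The "rough Amdahl" estimate asked for in the design doc takes a few lines. The sketch below assumes the only serial part of the work is the time spent under the lock; the 5% serial fraction is a hypothetical input, and `Speedup` is an illustrative name.

```go
package main

import "fmt"

// Speedup applies Amdahl's law: with a fraction serial of the work held
// under the lock, n cores give at most 1 / (serial + (1-serial)/n).
func Speedup(serial float64, n int) float64 {
	return 1 / (serial + (1-serial)/float64(n))
}

func main() {
	const serial = 0.05 // hypothetical: 5% of each request runs under the lock
	for _, cores := range []int{8, 32, 128} {
		s := Speedup(serial, cores)
		fmt.Printf("%3d cores: %.1fx speedup, %.0f%% efficiency\n",
			cores, s, 100*s/float64(cores))
	}
}
```

The falling efficiency column is the argument for the sharding or copy-on-write plan: past some core count, the lock rather than the hardware sets the ceiling.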
14. Common production anti-patterns
| Anti-pattern | Failure mode |
|---|---|
| Block profile at `rate=1` in prod | 5–10% CPU steal; intermittent latency spikes from sampling itself |
| Mutex profile disabled in prod | When the incident hits, you have nothing |
| Profile endpoint on the user-facing port | Source leakage; potential DoS via repeated capture |
| Comparing absolute totals across restarts | Profiles accumulate since enabling; restarts reset them |
| "Add `sync.Map` everywhere" | `sync.Map` is slower than a mutex-guarded map for balanced read/write workloads |
| One global lock guarding "small" hot map | Eventually contended even if "small" |
| `defer mu.Unlock()` then `time.Sleep(100 * time.Millisecond)` | Defer scope is the function; lock held the entire sleep |
| Continuous profiling without diffing | You'll drown in noise; deltas surface signal |
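The `defer` row is the easiest one to reproduce. The sketch below contrasts the anti-pattern with a narrowed critical section. `Counter` and its methods are hypothetical, and the slow work is passed in as a callback so the example can probe the lock with `TryLock` (Go 1.18+) while that work runs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Counter struct {
	mu sync.Mutex
	n  int
}

// SlowUnderLock is the anti-pattern: the deferred Unlock runs when the
// function returns, so the lock is held for the whole of slow().
func (c *Counter) SlowUnderLock(slow func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
	slow()
}

// SlowOutsideLock keeps the critical section to the shared-state update
// and releases the lock before the slow work starts.
func (c *Counter) SlowOutsideLock(slow func()) {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
	slow()
}

// lockFree reports whether the mutex could be acquired right now.
func (c *Counter) lockFree() bool {
	if c.mu.TryLock() {
		c.mu.Unlock()
		return true
	}
	return false
}

func main() {
	var c Counter
	probe := func() {
		time.Sleep(10 * time.Millisecond) // stands in for I/O or a long computation
		fmt.Println("lock free during slow work:", c.lockFree())
	}
	c.SlowUnderLock(probe)
	c.SlowOutsideLock(probe)
}
```

`TryLock` is used here only as a probe for the demonstration; it is rarely the right tool in production code.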
15. Summary
Production mutex/block profiling is a budget plus a pipeline plus a runbook. Enable both with conservative rates from day one, ingest into continuous profiling, dashboard the top stacks, gate releases on contention deltas, and exercise the "latency up, CPU flat" runbook before you need it. Atomics, sharding, and copy-on-write are the standard fixes; pick by profile evidence, never by reflex. The teams that take contention seriously look quiet; the teams that don't drift up and to the right until customers notice.
Further reading
- Continuous profiling overview: https://www.cncf.io/blog/2022/05/31/what-is-continuous-profiling/
- Pyroscope: https://grafana.com/oss/pyroscope/
- Parca: https://www.parca.dev/
- Google's "Always-on continuous profiling": https://research.google/pubs/google-wide-profiling-a-continuous-profiling-infrastructure-for-data-centers/
- Felix Geisendörfer, profiler notes: https://github.com/DataDog/go-profiler-notes