Mutex and Block Profiling — Professional

1. The production framing

Contention bugs are the silent killer. CPU dashboards look fine, request rate looks fine, and yet p99 latency drifts up release after release. The professional job around mutex/block profiling, in order:

1. Enable both profiles in every service from day one, with conservative rates.
2. Make captures cheap and routine, so that automation reaches them, not just humans.
3. Continuously ingest contention data into long-term storage and dashboards.
4. Diff per release. A new top-5 stack in the mutex profile is treated like an SLO regression.
5. Maintain a runbook for the "latency rising, CPU flat" incident class.

This file shows what each of those looks like in practice.

2. Enabling profiles in every service

A shared internal/profiling package does this exactly once per binary:

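package profiling

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
    "runtime"
)

// Init sets the sampling rates and serves the pprof endpoints on cfg.Addr.
// Call it once, early in main.
func Init(cfg Config) {
    runtime.SetMutexProfileFraction(cfg.MutexRate) // report about 1 in MutexRate contention events
    runtime.SetBlockProfileRate(cfg.BlockRateNs)   // aim for one sample per BlockRateNs ns spent blocked

    // Dedicated listener that forwards only the pprof paths to DefaultServeMux.
    mux := http.NewServeMux()
    mux.Handle("/debug/pprof/", http.DefaultServeMux)

    go func() {
        // 127.0.0.1:6060 by default. Log the error: a silent bind failure
        // means no profiles when you need them.
        if err := http.ListenAndServe(cfg.Addr, mux); err != nil {
            log.Printf("profiling: pprof listener on %s stopped: %v", cfg.Addr, err)
        }
    }()
}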

type Config struct {
    Addr        string
    MutexRate   int
    BlockRateNs int
}

var Default = Config{
    Addr:        "127.0.0.1:6060",
    MutexRate:   100,
    BlockRateNs: 10_000,
}

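cmd/myservice/main.go calls profiling.Init(profiling.Default) in the first dozen lines. Now every service exposes both profiles with identical defaults. Because the blank import registers the handlers on http.DefaultServeMux, the user-facing server must use its own mux; otherwise it exposes /debug/pprof/ as well.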

3. When to enable, when to tune

Situation	Mutex fraction	Block rate (ns)	Why
Default in prod	100	10_000	Negligible overhead, useful trends
Active incident, low-traffic	10	1_000	More detail, still bounded
Active incident, high-traffic	100	10_000	Don't change rates — capture deltas instead
Staging benchmarks	1	1	All events; never on prod
Latency-critical paths confirmed safe	1000	100_000	Reduce overhead under microsecond-mutex storms

The mistake to avoid is "enable on incident". By the time the incident is live, you want to compare against last week, and only profiling that is always on in production gives you that baseline.
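Switching between the rows of the table is easier when the rates are not compiled in. The following is a minimal sketch, assuming the rates come from environment variables; the PROFILING_* variable names and the configFromEnv helper are hypothetical, and Config repeats the struct from section 2.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config mirrors the struct used by the profiling package in section 2.
type Config struct {
	Addr        string
	MutexRate   int
	BlockRateNs int
}

// lookupFunc has the same signature as os.LookupEnv, which keeps the
// parsing logic testable without touching the real environment.
type lookupFunc func(key string) (string, bool)

// intFrom returns the integer stored under key, or def when the variable
// is unset, malformed, or negative.
func intFrom(lookup lookupFunc, key string, def int) int {
	raw, ok := lookup(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return def
	}
	return n
}

// configFromEnv overlays the (hypothetical) PROFILING_* variables on def.
func configFromEnv(lookup lookupFunc, def Config) Config {
	cfg := def
	if addr, ok := lookup("PROFILING_ADDR"); ok && addr != "" {
		cfg.Addr = addr
	}
	cfg.MutexRate = intFrom(lookup, "PROFILING_MUTEX_FRACTION", def.MutexRate)
	cfg.BlockRateNs = intFrom(lookup, "PROFILING_BLOCK_RATE_NS", def.BlockRateNs)
	return cfg
}

func main() {
	def := Config{Addr: "127.0.0.1:6060", MutexRate: 100, BlockRateNs: 10_000}
	fmt.Printf("%+v\n", configFromEnv(os.LookupEnv, def))
}
```

Falling back to the default on a malformed value is a deliberate choice in this sketch: a typo in a deployment manifest should not silently switch the profiles off.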

4. Continuous profiling pipelines

For one-off investigation, curl /debug/pprof/mutex works. For a fleet of services, you want a continuous profiling system that scrapes, dedupes, and stores profiles long-term. Three common options, open-source and hosted:

System	Notes
Grafana Pyroscope (merged with Grafana Phlare)	Self-hostable; agent scrapes pprof endpoints; UI has time-series + flame graphs
Parca	Similar concept, eBPF-augmented for syscalls/locks below the runtime
Datadog / Polar Signals / GCP Profiler	Hosted SaaS; same profile format under the hood

Whichever you pick, the contract is the same:

Agent fetches /debug/pprof/mutex and /debug/pprof/block every N seconds.
Server stores the profiles keyed by (service, version, instance, timestamp).
UI lets you select a time range, drop to flame graph, diff with another range.

The single most useful query becomes: "show me the mutex profile for serviceX@v2.3 minus the same for v2.2." That's how regressions are caught.
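The agent side of that contract can be sketched in a few lines. This is an illustration under stated assumptions, not how any of the listed systems is implemented: fetchProfile and storageKey are hypothetical helpers, the key layout is invented for the example, and serviceX, v2.3 and host-1 are placeholder labels.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// storageKey encodes the (service, version, instance, timestamp) tuple from
// the contract above, plus the profile kind. The path layout is an assumption.
func storageKey(service, version, instance, kind string, unixSec int64) string {
	return fmt.Sprintf("%s/%s/%s/%s/%d.pb.gz", service, version, instance, kind, unixSec)
}

// fetchProfile downloads one profile (kind is "mutex" or "block") from a
// target's pprof endpoint and returns the raw gzipped protobuf bytes.
func fetchProfile(client *http.Client, target, kind string) ([]byte, error) {
	resp, err := client.Get("http://" + target + "/debug/pprof/" + kind)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch %s from %s: %s", kind, target, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	for _, kind := range []string{"mutex", "block"} {
		data, err := fetchProfile(client, "127.0.0.1:6060", kind)
		if err != nil {
			fmt.Println("scrape failed:", err)
			continue
		}
		key := storageKey("serviceX", "v2.3", "host-1", kind, time.Now().Unix())
		fmt.Printf("would store %d bytes under %s\n", len(data), key)
	}
}
```

Keeping the version in the key is what makes the "v2.3 minus v2.2" query cheap: the server can select both sides of the diff by prefix.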

5. Dashboard pattern: contention as an SLI

Add four panels to your service dashboard:

Panel	Metric	Alert at
Mutex delay rate	`sum(rate(profile_mutex_delay_ns[5m]))`	2× baseline for 10 min
Block delay rate	`sum(rate(profile_block_delay_ns[5m]))`	2× baseline for 10 min
Top contended stack	`topk(5, profile_mutex_delay_ns)`	New entry appears
Goroutine count	`go_goroutines`	Trending up unboundedly

The "top contended stack" panel deserves an explanation: continuous profiling lets you slice profiles by stack frame. You group samples by their top-3 frames and chart the delay attributed to each group over time. A new bar appearing after a release is the signature of a contention regression.
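The grouping step can be illustrated on a simplified data model. The sketch below assumes each sample has already been decoded into a leaf-first list of function names and a delay; the Sample type, the function names, and the numbers are all made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a simplified stand-in for one profile sample: a leaf-first call
// stack and the contention delay attributed to it.
type Sample struct {
	Stack   []string
	DelayNs int64
}

// groupByTopFrames sums delay per group, where the group key is the first n
// frames of the stack joined with ";".
func groupByTopFrames(samples []Sample, n int) map[string]int64 {
	totals := make(map[string]int64)
	for _, s := range samples {
		frames := s.Stack
		if len(frames) > n {
			frames = frames[:n]
		}
		totals[strings.Join(frames, ";")] += s.DelayNs
	}
	return totals
}

// topK returns the k keys with the largest total delay, largest first.
// Ties are broken alphabetically so the result is deterministic.
func topK(totals map[string]int64, k int) []string {
	keys := make([]string, 0, len(totals))
	for key := range totals {
		keys = append(keys, key)
	}
	sort.Slice(keys, func(i, j int) bool {
		if totals[keys[i]] != totals[keys[j]] {
			return totals[keys[i]] > totals[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > k {
		keys = keys[:k]
	}
	return keys
}

func main() {
	samples := []Sample{
		{Stack: []string{"sync.(*Mutex).Unlock", "cache.(*Cache).Get", "api.Handle", "http.serve"}, DelayNs: 700},
		{Stack: []string{"sync.(*Mutex).Unlock", "cache.(*Cache).Get", "api.Handle", "grpc.serve"}, DelayNs: 300},
		{Stack: []string{"sync.(*Mutex).Unlock", "pool.(*Pool).Put", "db.Query"}, DelayNs: 200},
	}
	totals := groupByTopFrames(samples, 3)
	for _, key := range topK(totals, 5) {
		fmt.Println(totals[key], key)
	}
}
```

The first two samples differ only below the third frame, so they fall into one group; that merging is what keeps the panel to a handful of bars.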

6. Capturing profiles during incidents

A standardised incident command:

#!/usr/bin/env bash
# capture-contention.sh: run during an incident, attach the output to the ticket
set -euo pipefail                       # stop at the first failed capture
host=${1:?usage: $0 host:port}
out=incident-$(date +%s)

mkdir "$out"
# -f makes curl fail on HTTP errors instead of saving an error page as a "profile"
curl -sf "http://$host/debug/pprof/mutex" -o "$out/mutex-1.pb.gz"
curl -sf "http://$host/debug/pprof/block" -o "$out/block-1.pb.gz"
curl -sf "http://$host/debug/pprof/goroutine" -o "$out/goroutine-1.pb.gz"
sleep 60
curl -sf "http://$host/debug/pprof/mutex" -o "$out/mutex-2.pb.gz"
curl -sf "http://$host/debug/pprof/block" -o "$out/block-2.pb.gz"
curl -sf "http://$host/debug/pprof/goroutine" -o "$out/goroutine-2.pb.gz"

echo "captured to $out; analyse with:"
echo "  go tool pprof -base $out/mutex-1.pb.gz $out/mutex-2.pb.gz"

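Two snapshots a minute apart. Mutex and block profiles accumulate from the moment they are enabled, so a single snapshot shows lifetime totals; the diff of the two isolates contention during the incident window. The goroutine profile is a point-in-time snapshot, so read the two copies side by side rather than diffing them. Put this script in your on-call runbook.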
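On Go 1.15 and later, net/http/pprof can also compute the delta on the server side: with a seconds query parameter the mutex and block handlers sample for that window and return only the difference. The sketch below assumes such a Go version on the target; deltaURL and captureDelta are hypothetical helper names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// deltaURL builds the URL for a delta profile: the handler waits for the
// given number of seconds and returns only what accumulated in that window.
func deltaURL(host, kind string, seconds int) string {
	return fmt.Sprintf("http://%s/debug/pprof/%s?seconds=%d", host, kind, seconds)
}

// captureDelta fetches one delta profile and writes it to path.
func captureDelta(host, kind string, seconds int, path string) error {
	// The request blocks for the whole window, so the client timeout
	// must be longer than the window itself.
	client := &http.Client{Timeout: time.Duration(seconds+30) * time.Second}
	resp, err := client.Get(deltaURL(host, kind, seconds))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: %s", kind, resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: capture host:port; would fetch", deltaURL("127.0.0.1:6060", "mutex", 60))
		return
	}
	for _, kind := range []string{"mutex", "block"} {
		if err := captureDelta(os.Args[1], kind, 60, kind+"-delta.pb.gz"); err != nil {
			fmt.Fprintln(os.Stderr, "capture failed:", err)
			os.Exit(1)
		}
	}
}
```

The resulting file opens directly in go tool pprof without a -base argument. The two-snapshot approach remains the fallback for older binaries.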

7. The "latency rising, CPU flat" runbook

When p99 climbs without CPU rising:

1. Confirm. Check that QPS and CPU are flat; if not, this is a different runbook.
2. Goroutine count. Hit /debug/pprof/goroutine?debug=1. If many goroutines are parked on semacquire, contention is live.
3. Mutex delta. Capture a minute, diff. Look at top -cum. The leader is the bottleneck.
4. Block delta. Same procedure. If the mutex delta is empty but the block delta is loud on a channel operation, it's back-pressure, not lock contention.
5. Source. Run list <function> on the leader. The expensive line is what to fix.
6. Bypass. If you can deploy a fix quickly, do; if not, scale out (add replicas). Under contention, more replicas help even where more cores per replica would not.
7. Post-incident. Add the offending stack as a tracked metric. Lock in the fix with a regression test.

The seventh point is what professionals do that hobbyists skip. Each incident expands the dashboards.
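The goroutine-count step is easy to automate. The sketch below is a hedged illustration: countParked is a hypothetical helper, and sampleDump is a hand-written, shortened imitation of the debug=1 text format, not real output. Note that matching on "semacquire" also catches sync.WaitGroup waiters, so treat the number as a hint rather than proof.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// countParked scans the text of /debug/pprof/goroutine?debug=1 and returns
// how many goroutines have a frame containing needle (case-insensitive),
// plus the total number of goroutines seen.
func countParked(dump, needle string) (parked, total int) {
	needle = strings.ToLower(needle)
	sc := bufio.NewScanner(strings.NewReader(dump))
	current, counted := 0, false
	for sc.Scan() {
		line := sc.Text()
		// Record header: "<count> @ <pc> <pc> ...".
		if head, _, ok := strings.Cut(line, " @"); ok {
			if n, err := strconv.Atoi(head); err == nil {
				current, counted = n, false
				total += n
				continue
			}
		}
		// Frame line: "# <pc> <function>+<offset> <file>:<line>".
		// Count each record at most once, even if several frames match.
		if strings.HasPrefix(line, "#") && !counted &&
			strings.Contains(strings.ToLower(line), needle) {
			parked += current
			counted = true
		}
	}
	return parked, total
}

// sampleDump imitates the debug=1 layout; addresses and paths are invented.
const sampleDump = `goroutine profile: total 3
2 @ 0x43a3b6 0x44b3a5 0x46a0a5
#	0x46a0a4	sync.runtime_SemacquireMutex+0x24	/usr/local/go/src/runtime/sema.go:77
#	0x4a1c10	sync.(*Mutex).lockSlow+0x170	/usr/local/go/src/sync/mutex.go:171

1 @ 0x43a3b6 0x4059fd
#	0x4059fc	main.worker+0x3c	/src/app/main.go:42
`

func main() {
	parked, total := countParked(sampleDump, "semacquire")
	fmt.Printf("%d of %d goroutines parked on semacquire\n", parked, total)
}
```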

8. Release-time gating

Run the production-shaped benchmark on every PR:

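import (
    "os"
    "runtime"
    "runtime/pprof"
    "testing"
)

func BenchmarkHotPath(b *testing.B) {
    // Record every contention event for this benchmark; restore afterwards.
    prev := runtime.SetMutexProfileFraction(1)
    runtime.SetBlockProfileRate(1)
    defer runtime.SetMutexProfileFraction(prev)
    defer runtime.SetBlockProfileRate(0)

    b.ResetTimer() // after setup, so only hotPath is timed
    // Run in parallel: a single goroutine cannot contend with itself.
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            hotPath()
        }
    })
    b.StopTimer() // keep profile writing out of the measurement

    // debug=0 writes the gzipped protobuf format that go tool pprof reads.
    for _, name := range []string{"mutex", "block"} {
        f, err := os.Create(name + ".pb.gz")
        if err != nil {
            b.Fatal(err)
        }
        if err := pprof.Lookup(name).WriteTo(f, 0); err != nil {
            b.Fatal(err)
        }
        f.Close()
    }
}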

CI compares the new profile to the baseline. If a new stack appears or an existing stack's delay doubles, the PR fails. The threshold is workload-specific; pick a number that matches your team's tolerance.

Tools that automate this: benchstat for time/alloc deltas, custom scripts for pprof deltas. A simple approach: write the output of go tool pprof -top to a text file and diff it against the baseline's.
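The gate itself reduces to a comparison of per-stack delays. The sketch below assumes both profiles have already been reduced to a map from stack key to delay in nanoseconds; the regressions function, the noise floor parameter, and the stack names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// regressions compares per-stack delay between a baseline and the current
// run. It reports a stack when it is new, or when its delay is at least
// factor times the baseline. Stacks below minNs are ignored as noise.
func regressions(baseline, current map[string]int64, factor float64, minNs int64) []string {
	var out []string
	for stack, cur := range current {
		if cur < minNs {
			continue // below the noise floor
		}
		base, seen := baseline[stack]
		switch {
		case !seen:
			out = append(out, fmt.Sprintf("%s: new stack (%dns)", stack, cur))
		case float64(cur) >= factor*float64(base):
			out = append(out, fmt.Sprintf("%s: delay grew %dns -> %dns", stack, base, cur))
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable CI logs
	return out
}

func main() {
	// Illustrative numbers only.
	baseline := map[string]int64{"cache.Get": 100_000, "pool.Put": 50_000}
	current := map[string]int64{"cache.Get": 250_000, "pool.Put": 60_000, "index.Rebuild": 80_000}

	found := regressions(baseline, current, 2.0, 10_000)
	for _, line := range found {
		fmt.Println("REGRESSION", line)
	}
	if len(found) == 0 {
		fmt.Println("ok")
	}
	// A real CI step would exit non-zero when found is not empty.
}
```

The factor of 2 matches the "delay doubles" rule above; the noise floor is the workload-specific part and needs tuning per benchmark.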

9. The profile retention policy

Profiles are small (low single-digit MiB each) but accumulate fast. A typical policy:

Type	Retention	Storage
Routine 30-s scrape	7 days	Hot storage
Hourly aggregate	90 days	Warm storage
Daily baseline	1 year	Cold storage
Per-release baseline	Forever	Object storage, immutable

The per-release-forever is what enables "this regression came in between v2.3 and v2.4" investigations a year later.
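The policy table translates directly into a lookup that a cleanup job could consult. This is only a sketch: the short type names and the expired helper are hypothetical, and "1 year" is approximated as 365 days.

```go
package main

import "fmt"

// retentionDays maps the profile types from the table to their retention
// in days. A value of -1 means "keep forever".
var retentionDays = map[string]int{
	"routine":     7,   // routine 30-s scrape
	"hourly":      90,  // hourly aggregate
	"daily":       365, // daily baseline
	"per-release": -1,  // per-release baseline
}

// expired reports whether a profile of the given type and age may be deleted.
func expired(kind string, ageDays int) (bool, error) {
	limit, ok := retentionDays[kind]
	if !ok {
		return false, fmt.Errorf("unknown profile type %q", kind)
	}
	return limit >= 0 && ageDays > limit, nil
}

func main() {
	for _, kind := range []string{"routine", "hourly", "daily", "per-release"} {
		gone, _ := expired(kind, 100)
		fmt.Printf("%-12s at 100 days: delete=%v\n", kind, gone)
	}
}
```

Returning an error for unknown types, rather than treating them as expired, keeps a misconfigured job from deleting data it does not understand.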

10. Cost of running profiles continuously

For a service handling 10 000 QPS with moderate contention:

Profile	Rate	CPU overhead	Memory overhead
Mutex	100	~0.1%	~1 MiB profile buffer
Block	10_000	~0.5%	~2 MiB profile buffer
Goroutine snapshot	per-scrape	< 50 ms pause once per minute	n/a

A modern service can afford this. The CPU cost is paid back many times over the first time an incident is resolved in minutes instead of hours.

Be careful about three cost factors:

Block rate of 1 (recording every event) easily hits 5–10% CPU on busy services.
Capturing goroutine?debug=2 with thousands of goroutines is expensive (full stacks, no sampling).
Continuous flame graph rendering in the UI uses the client's CPU; it doesn't affect the service.

11. Privacy and security

Both profiles include source paths and function names from your binary. This is unrelated to user data but does leak architecture. Treat the endpoints as sensitive:

Bind to localhost or an admin interface, never the user-facing port.
Authenticate the scrape agent (mTLS or token).
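Do not count on -ldflags="-s -w" to hide names. It strips the symbol table and DWARF data, but the runtime's own function table stays in the binary, and that is what Go uses to symbolize profiles; access control is the real protection.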
Store profiles with the same access controls as code.

A production endpoint that responds to anonymous GETs of /debug/pprof/mutex is a CVE waiting to happen.
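Token authentication for the scrape agent can be as small as one middleware. The sketch below assumes a static bearer token shared with the agent; requireToken, newDebugMux and statusFor are hypothetical names, and a production setup would more likely use mTLS or a secret manager.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// requireToken rejects requests whose Authorization header is not exactly
// "Bearer <token>". An empty token rejects everything.
func requireToken(token string, next http.Handler) http.Handler {
	want := []byte("Bearer " + token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := []byte(r.Header.Get("Authorization"))
		// Constant-time comparison avoids leaking the token through timing.
		if token == "" || subtle.ConstantTimeCompare(got, want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newDebugMux serves the pprof index, and the named profiles below it,
// behind the token check.
func newDebugMux(token string) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/debug/pprof/", requireToken(token, http.HandlerFunc(pprof.Index)))
	return mux
}

// statusFor sends an in-memory GET for the mutex profile and returns the
// HTTP status code, which makes the behaviour easy to check.
func statusFor(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/mutex", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newDebugMux("s3cret") // placeholder token
	fmt.Println("anonymous:", statusFor(mux, ""))
	fmt.Println("with token:", statusFor(mux, "Bearer s3cret"))
}
```

The mux would still be bound to localhost or an admin interface as described above; the token is a second layer, not a replacement.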

12. Cross-team workflows

When the contention top frame points at a shared library you don't own, the data is your leverage:

File a ticket with the pprof flame graph URL, the diff against baseline, and the affected service.
Include list <function> output for the worst stacks.
Quantify impact (delay / second per replica × replicas × duration).
Propose a fix — even a wrong one. The library owners read code, not prose.

Profiles are the only language that crosses teams unambiguously about contention. "Cache.Get is slow" sparks debate; "Cache.Get contributes 8 s/min of mutex delay across 60 replicas" gets the fix scheduled.
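The impact formula from the list above (delay per second per replica × replicas × duration) is simple enough to script for the ticket. The numbers below are hypothetical and fleetDelay is an illustrative helper name.

```go
package main

import (
	"fmt"
	"time"
)

// fleetDelay multiplies the mutex delay accrued per second on one replica
// by the number of replicas and by the length of the incident window.
func fleetDelay(perReplicaPerSecond time.Duration, replicas int, window time.Duration) time.Duration {
	seconds := int64(window / time.Second)
	return perReplicaPerSecond * time.Duration(replicas) * time.Duration(seconds)
}

func main() {
	// Hypothetical incident: 2 ms of delay per second on each of
	// 60 replicas, sustained for one hour.
	total := fleetDelay(2*time.Millisecond, 60, time.Hour)
	fmt.Println("total goroutine wait time:", total)
}
```

For these inputs the total is 432 seconds, or a little over seven minutes of accumulated goroutine waiting: a concrete figure that fits on one line of a ticket.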

13. Profile-driven design reviews

For new components that include synchronisation, require the design doc to answer:

What primitive (Mutex, RWMutex, chan, atomic)?
Expected critical-section duration at p99?
Expected QPS?
Estimated contention (rough Amdahl) at 8, 32, 128 cores?
Plan for sharding or copy-on-write if the estimate exceeds budget?

This single page costs an hour and saves the next on-call from re-discovering it the hard way. Reviewing it requires the reviewer to think about contention, which is itself the goal.
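The "rough Amdahl" estimate takes two formulas. If a fraction s of each request runs under the lock, the ideal speedup on n cores is 1 / (s + (1 - s) / n), and a single lock's utilization is acquisitions per second times hold time. The sketch below uses hypothetical inputs (5% serial fraction, 50 000 acquisitions per second, 10 µs hold time).

```go
package main

import "fmt"

// amdahlSpeedup returns the ideal speedup on the given number of cores when
// a fraction serial of the work must run under the lock:
// 1 / (serial + (1-serial)/cores).
func amdahlSpeedup(serial float64, cores int) float64 {
	return 1 / (serial + (1-serial)/float64(cores))
}

// lockUtilization estimates how busy one lock is: acquisitions per second
// multiplied by the hold time. Values approaching 1 mean the lock is saturated.
func lockUtilization(acquisitionsPerSec, holdNs float64) float64 {
	return acquisitionsPerSec * holdNs / 1e9
}

func main() {
	serial := 0.05 // hypothetical: 5% of each request is inside the critical section
	for _, cores := range []int{8, 32, 128} {
		fmt.Printf("%3d cores: speedup %.1fx\n", cores, amdahlSpeedup(serial, cores))
	}
	fmt.Printf("lock utilization: %.2f\n", lockUtilization(50_000, 10_000))
}
```

With these inputs the speedups come out at roughly 5.9x, 12.5x and 17.4x: going from 32 to 128 cores quadruples the hardware for about 40% more throughput, which is exactly the conversation the design review should force.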

14. Common production anti-patterns

Anti-pattern	Failure mode
Block profile at `rate=1` in prod	5–10% CPU overhead; intermittent latency spikes from the profiling itself
Mutex profile disabled in prod	When the incident hits, you have nothing
Profile endpoint on the user-facing port	Source leakage; potential DoS via repeated capture
Comparing absolute totals across restarts	Profiles accumulate since enabling; restarts reset
"Add `sync.Map` everywhere"	`sync.Map` is slower for balanced read/write workloads
One global lock guarding "small" hot map	Eventually contended even if "small"
`defer mu.Unlock()` then `time.Sleep(100 * time.Millisecond)`	Defer scope is the function; lock held the entire sleep
Continuous profiling without diffing	You'll drown in noise; deltas surface signal
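The defer-scope row is the one that most often surprises people, so here is a small demonstration. The Store type and both methods are invented for the example; slow stands in for the time.Sleep, and sync.Mutex.TryLock (Go 1.18+) is used only to observe whether the lock is free while the slow step runs.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical mutex-guarded map.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

// getHolding is the anti-pattern: the deferred Unlock runs only when the
// function returns, so the slow step executes with the lock held.
func (s *Store) getHolding(key string, slow func()) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	v := s.data[key]
	slow()
	return v
}

// getNarrow copies what it needs under the lock and releases it before
// starting the slow step.
func (s *Store) getNarrow(key string, slow func()) string {
	s.mu.Lock()
	v := s.data[key]
	s.mu.Unlock()
	slow()
	return v
}

// lockFreeDuring reports whether the store's mutex could be acquired while
// the slow step of call was running.
func lockFreeDuring(call func(s *Store, slow func())) bool {
	s := &Store{data: map[string]string{"k": "v"}}
	free := false
	call(s, func() {
		if s.mu.TryLock() {
			free = true
			s.mu.Unlock()
		}
	})
	return free
}

func main() {
	holding := lockFreeDuring(func(s *Store, slow func()) { s.getHolding("k", slow) })
	narrow := lockFreeDuring(func(s *Store, slow func()) { s.getNarrow("k", slow) })
	fmt.Println("lock free during slow step (defer version): ", holding)
	fmt.Println("lock free during slow step (narrow version):", narrow)
}
```

The deferred version reports false and the narrow version reports true. In a mutex profile the first shape shows up as a large delay attributed to the function's Unlock, even though the critical data access takes nanoseconds.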

15. Summary

Production mutex/block profiling is a budget plus a pipeline plus a runbook. Enable both with conservative rates from day one, ingest into continuous profiling, dashboard the top stacks, gate releases on contention deltas, and exercise the "latency up, CPU flat" runbook before you need it. Atomics, sharding, and copy-on-write are the standard fixes; pick by profile evidence, never by reflex. The teams that take contention seriously look quiet; the teams that don't drift up and to the right until customers notice.