Profiling Concurrent Go Code — Specification¶
Table of Contents¶
- Scope
- Sampler APIs
- Sampling Rate Semantics
- Recorded Stack Attribution
- Profile Output Format
- Trace Event Stream
- Goroutine Label Propagation
- Endpoint Contracts
- Compatibility and Versioning
- References
Scope¶
This document specifies the observable behaviour of Go's concurrency profiling subsystem: the goroutine, mutex, block, and trace facilities exposed by runtime, runtime/pprof, runtime/trace, and net/http/pprof. The specification reflects Go 1.22+ and notes earlier divergences explicitly. Behaviour not documented here is implementation-defined; programs should not rely on it.
Sampler APIs¶
The runtime exposes four functions controlling concurrent profiling state:
runtime.SetMutexProfileFraction(rate int) int¶
Sets the fraction of contended sync.Mutex and sync.RWMutex events recorded.
- rate < 0: the current fraction is returned and left unchanged.
- rate == 0: disabled. No mutex events are recorded.
- rate == 1: every contended event is recorded.
- rate >= 2: each contended event is recorded with probability 1/rate, independently.
Returns the previous fraction.
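The set-and-restore pattern implied by the return value can be sketched as below. The helper names `currentMutexFraction` and `withMutexFraction` are hypothetical and exist only for this illustration; the only runtime call used is `runtime.SetMutexProfileFraction`.

```go
package main

import (
	"fmt"
	"runtime"
)

// currentMutexFraction reads the mutex profile fraction without changing it.
// A negative argument acts as a read-only query.
func currentMutexFraction() int {
	return runtime.SetMutexProfileFraction(-1)
}

// withMutexFraction runs f with the given fraction in effect, then restores
// the previous value. It returns the fraction observed after f ran.
func withMutexFraction(rate int, f func()) int {
	prev := runtime.SetMutexProfileFraction(rate) // returns the old fraction
	defer runtime.SetMutexProfileFraction(prev)
	f()
	return currentMutexFraction()
}

func main() {
	before := currentMutexFraction()
	during := withMutexFraction(5, func() {})
	after := currentMutexFraction()
	fmt.Println(before, during, after)
}
```

Restoring the previous value keeps the helper safe to use in tests that share a process with other profiling code.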
runtime.SetBlockProfileRate(rate int)¶
Sets the block profile sampling rate, in nanoseconds.
- rate <= 0: disabled. No block events are recorded.
- rate == 1: every blocking event is recorded.
- rate > 1: events shorter than rate ns are recorded with probability proportional to duration; events at least rate ns long are always recorded.
The semantic guarantee is that, in expectation, the recorded total time approximates the actual total time when integrated over a sufficiently long window.
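A minimal sketch of enabling the block profile at rate 1 and observing that a blocked channel receive shows up in it. The helper name `blockedStacks` and the 10 ms delay are assumptions made for the example; `pprof.Lookup("block").Count()` reports the number of distinct stacks currently in the profile.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockedStacks enables the block profile at rate 1 (record everything),
// forces one channel receive to block, and returns how many distinct
// stacks the block profile holds afterwards.
func blockedStacks() int {
	runtime.SetBlockProfileRate(1)
	defer runtime.SetBlockProfileRate(0) // disable again on the way out

	ch := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond) // keep the receiver parked for a while
		ch <- struct{}{}
	}()
	<-ch // blocks until the helper goroutine sends

	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("stacks in block profile:", blockedStacks())
}
```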
runtime.SetCPUProfileRate(hz int)¶
Sets the CPU profiling rate in Hz. Documented elsewhere; included here for completeness because concurrent programs frequently change it.
runtime/trace.Start(w io.Writer) error and Stop()¶
Begin and end trace recording. While a trace is active, every event in the trace event protocol is recorded. Only one trace may be active at a time; Start returns an error otherwise.
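The single-active-trace rule can be observed with a short sketch. The helper name `startTwice` is hypothetical; the second destination is `io.Discard` because the second call is expected to fail before writing anything.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime/trace"
)

// startTwice starts a trace, then tries to start a second one while the
// first is still active. It returns the result of both attempts.
func startTwice() (first, second error) {
	var buf bytes.Buffer
	first = trace.Start(&buf)
	if first != nil {
		return first, nil // tracing was already active elsewhere
	}
	defer trace.Stop()
	second = trace.Start(io.Discard) // expected to fail: a trace is active
	return first, second
}

func main() {
	first, second := startTwice()
	fmt.Println("first:", first)
	fmt.Println("second:", second)
}
```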
Sampling Rate Semantics¶
Mutex profile fraction¶
For each contended lock release where any goroutine had to wait, the runtime evaluates cheaprandn(rate) == 0 and records the event if the result is true. Sampling is independent across events: a burst of contention on one lock is sampled at the same rate as isolated events.
Counter accumulation: when a contended event is sampled, two values are added to the relevant sample bucket:
- contentions: 1 (regardless of fraction).
- delay: total wait time across all waiters released, in nanoseconds.
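To estimate the true total from raw sampled values, multiply observed contentions by rate. Profiles written through runtime/pprof may already have this scaling applied, so check sample_type and how the values were produced before scaling again in go tool pprof or other tools.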
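The scale-up estimate can be illustrated with a small simulation. This is a sketch over raw, unscaled counts: `rng.Intn(rate) == 0` stands in for the runtime's random draw, and the function name, event count, and seed are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// estimateContentions simulates n contended events, samples each with
// probability 1/rate, and scales the sampled count back up by rate.
// A rate below 1 is treated as "profiling disabled".
func estimateContentions(n, rate int, seed int64) (sampled, estimate int) {
	if rate < 1 {
		return 0, 0
	}
	rng := rand.New(rand.NewSource(seed)) // fixed seed for a repeatable run
	for i := 0; i < n; i++ {
		if rng.Intn(rate) == 0 { // stand-in for the runtime's random draw
			sampled++
		}
	}
	return sampled, sampled * rate
}

func main() {
	const n, rate = 100000, 10
	sampled, estimate := estimateContentions(n, rate, 1)
	fmt.Printf("sampled=%d estimate=%d actual=%d\n", sampled, estimate, n)
}
```

The estimate converges on the actual count as the number of events grows, which is why short profiling windows at high rate values give noisy totals.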
Block profile rate¶
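The check at each potential blocking event drops the event only when rate > cycles and a pseudo-random draw taken modulo rate exceeds cycles; otherwise the event is recorded.
Here cycles is the blocked duration in CPU ticks, and rate is the configured rate after runtime.SetBlockProfileRate converts it from nanoseconds to ticks using the local clock rate.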
This formula guarantees:
- Events of duration rate or longer are always recorded.
- The expected recorded total time approaches the actual total as the window grows.
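A check with the two properties above can be sketched as follows. This is an illustration, not the runtime source: `math/rand` stands in for the runtime's internal generator, and weighting each kept short event as `rate` is an assumption of the sketch about how the recorded total stays unbiased.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampled reports whether a blocking event of the given duration is kept.
// Events at least as long as rate are always kept; shorter ones are kept
// with probability roughly cycles/rate.
func sampled(cycles, rate int64, rng *rand.Rand) bool {
	if rate <= 0 || (rate > cycles && rng.Int63()%rate > cycles) {
		return false
	}
	return true
}

// recordedTotal simulates n events of equal duration and returns the total
// time reported when each kept short event is weighted as rate (assumption)
// and each long event is recorded at its real duration.
func recordedTotal(n int, cycles, rate, seed int64) int64 {
	rng := rand.New(rand.NewSource(seed))
	var total int64
	for i := 0; i < n; i++ {
		if !sampled(cycles, rate, rng) {
			continue
		}
		if cycles < rate {
			total += rate
		} else {
			total += cycles
		}
	}
	return total
}

func main() {
	const n, cycles, rate = 100000, 100, 1000
	fmt.Println("actual:  ", int64(n)*cycles)
	fmt.Println("recorded:", recordedTotal(n, cycles, rate, 1))
}
```

With many short events the recorded total lands close to the actual total even though roughly nine in ten events are dropped.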
CPU profile¶
The CPU profiler uses SIGPROF (Unix) or a timer interrupt (Windows) at the configured Hz. Each interrupt yields one sample containing the stack of the goroutine executing on the interrupted thread.
Recorded Stack Attribution¶
Mutex profile¶
The recorded stack is the stack of the goroutine that released the lock, captured at the call to sync.(*Mutex).Unlock (or RUnlock). Up to Go 1.17 this was inconsistent; from Go 1.18 onward, attribution is uniformly at the unlock site.
Stack walking skips internal sync and runtime frames so that the user's defer mu.Unlock() line is visible at the top.
Block profile¶
The recorded stack is the stack of the goroutine that blocked, captured at the time of gopark. The wait time is the difference between unblock and block timestamps.
Goroutine profile¶
The recorded stack is the current stack of each live goroutine at the moment of the snapshot. The snapshot is taken under a brief stop-the-world (since Go 1.19, foreground-only) to ensure consistency.
CPU profile¶
The recorded stack is the stack of the goroutine executing on the interrupted thread at the moment SIGPROF arrived.
Profile Output Format¶
All profiles are emitted as gzip-compressed protobuf following pprof.proto. The schema is shared across pprof-emitting runtimes (Go, C++, JVM, Rust, etc.).
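The gzip framing can be checked without a protobuf parser. The sketch below writes a named profile with `debug=0` and gunzips the result; the helper name `decompressedSize` is hypothetical, and decoding the protobuf payload itself is out of scope here.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"runtime/pprof"
)

// decompressedSize writes the named profile in its default encoding
// (debug=0) and returns the size of the payload after gunzipping it.
func decompressedSize(name string) (int64, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return 0, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 0); err != nil {
		return 0, err
	}
	zr, err := gzip.NewReader(&buf) // fails if the stream is not gzip
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	return io.Copy(io.Discard, zr)
}

func main() {
	n, err := decompressedSize("goroutine")
	fmt.Println("protobuf bytes:", n, "err:", err)
}
```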
Sample types¶
| Profile | sample_type entries |
|---|---|
| goroutine | goroutine: count |
| heap | alloc_objects, alloc_space, inuse_objects, inuse_space |
| allocs | alloc_objects, alloc_space, inuse_objects, inuse_space |
| mutex | contentions: count, delay: nanoseconds |
| block | contentions: count, delay: nanoseconds |
| profile (CPU) | samples: count, cpu: nanoseconds |
| threadcreate | threadcreate: count |
Default sample index¶
When go tool pprof opens a profile without -sample_index=, it picks the last sample type. For mutex and block, that is delay.
Labels¶
pprof.Labels propagated through pprof.Do or SetGoroutineLabels are emitted as repeated Label entries on each sample. Tools may filter or aggregate by these labels.
Trace Event Stream¶
The trace stream is a custom binary format documented in src/runtime/trace2.go (Go 1.22+).
Guarantees¶
- Per-P ordering: events emitted by a given P are recorded in execution order.
- Cross-P ordering: timestamps allow global ordering with TSC precision on supported hardware. Without TSC, nanotime() provides millisecond-or-better precision.
- Completeness while enabled: every event is recorded; the tracer does not sample.
- No event reorder: the parser sees events in emission order.
What is not guaranteed¶
- Goroutines may briefly miss events if their P's trace buffer is being flushed when the goroutine moves between Ps. The parser detects and skips such gaps.
- Across a GOMAXPROCS change mid-trace, P identity is preserved but the count changes; events emit EvProcsChange.
- Trace events emitted from inside cgo callbacks may be attributed to a different M than the original Go goroutine; this is documented as imprecise.
User-defined events¶
trace.NewTask, trace.StartRegion, and trace.Logf emit EvUserTaskBegin, EvUserRegion, and EvUserLog respectively. These are no-ops when no trace is active.
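A short sketch of the three annotation calls, run once without a trace and once with one. The task, region, and log category names are made up for the example, and `trace.WithRegion` is used as the closure form of `trace.StartRegion`.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/trace"
)

// sumSquares annotates its work with a task, a region, and a log message.
func sumSquares(ctx context.Context, n int) int {
	ctx, task := trace.NewTask(ctx, "sumSquares") // hypothetical task name
	defer task.End()

	total := 0
	trace.WithRegion(ctx, "loop", func() {
		for i := 1; i <= n; i++ {
			total += i * i
		}
	})
	trace.Logf(ctx, "result", "total=%d", total)
	return total
}

// untracedRun calls sumSquares with no trace active.
func untracedRun(n int) int {
	return sumSquares(context.Background(), n)
}

// tracedRun runs sumSquares under an active trace and returns the result
// together with the number of trace bytes produced.
func tracedRun(n int) (result, traceBytes int, err error) {
	var buf bytes.Buffer
	if err = trace.Start(&buf); err != nil {
		return 0, 0, err
	}
	result = sumSquares(context.Background(), n)
	trace.Stop() // flushes remaining trace data into buf
	return result, buf.Len(), nil
}

func main() {
	fmt.Println("untraced:", untracedRun(3))
	result, size, err := tracedRun(3)
	fmt.Println("traced:", result, "trace bytes:", size, "err:", err)
}
```

The annotated function returns the same result in both runs; only the traced run produces output in the buffer.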
Goroutine Label Propagation¶
Goroutine labels live in g.labels, a pointer to a labels map stored on the goroutine.
Setting¶
- pprof.SetGoroutineLabels(ctx): sets the current goroutine's labels to those of ctx.
- pprof.Do(ctx, labels, f): sets labels for the duration of f. Restored on exit.
- pprof.WithLabels(ctx, labels): returns a derived context carrying labels (does not change the goroutine).
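The context side of these calls can be inspected with `pprof.Label`, which looks up a key on a context. The sketch below uses a hypothetical `tenant` key and made-up values; it shows what each context carries and does not read the goroutine's own label state.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// labelOf returns the value stored under key in ctx, or "" if absent.
func labelOf(ctx context.Context, key string) string {
	v, _ := pprof.Label(ctx, key)
	return v
}

// tenantLabels reports the "tenant" label seen on a context derived with
// WithLabels, on the context passed into Do, and on the unlabelled root.
func tenantLabels() (derived, inside, base string) {
	root := context.Background()

	ctx := pprof.WithLabels(root, pprof.Labels("tenant", "acme"))
	derived = labelOf(ctx, "tenant")

	pprof.Do(root, pprof.Labels("tenant", "globex"), func(ctx context.Context) {
		inside = labelOf(ctx, "tenant")
	})

	base = labelOf(root, "tenant")
	return derived, inside, base
}

func main() {
	derived, inside, base := tenantLabels()
	fmt.Printf("derived=%q inside=%q base=%q\n", derived, inside, base)
}
```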
Propagation across go¶
A goroutine started with the go statement inherits whatever labels its parent carries at that moment, whether they were set by an enclosing pprof.Do call or by SetGoroutineLabels. A go f() issued from an unlabelled goroutine produces an unlabelled child.
Specifically, the runtime copies g.labels to the new goroutine at runtime.newproc.
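Inheritance can be observed through the text form of the goroutine profile, which lists labels next to each stack. In this sketch the `pool` key and the helper name `childInheritsLabel` are hypothetical; the child is kept parked so that it is alive when the snapshot is taken.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/pprof"
	"strings"
)

// childInheritsLabel starts a goroutine inside pprof.Do, lets Do return,
// and then checks whether the label value still appears in the goroutine
// profile, where only the child can be carrying it.
func childInheritsLabel(value string) (bool, error) {
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	pprof.Do(context.Background(), pprof.Labels("pool", value), func(context.Context) {
		go func() { // created while the parent carries the label
			defer close(done)
			close(started)
			<-release
		}()
	})
	<-started // Do has returned, so the parent's labels are restored

	var out bytes.Buffer
	err := pprof.Lookup("goroutine").WriteTo(&out, 1) // text form lists labels
	close(release)
	<-done
	return strings.Contains(out.String(), value), err
}

func main() {
	ok, err := childInheritsLabel("ingest-workers")
	fmt.Println("child labelled:", ok, "err:", err)
}
```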
Restoration¶
pprof.Do saves the previous labels on entry and restores them on exit. Even if the function panics, the labels are restored via deferred cleanup.
Behaviour at profile sample time¶
When a profile sample is recorded, the runtime reads g.labels and emits them as sample labels. The read is atomic.
Endpoint Contracts¶
net/http/pprof registers handlers on the default HTTP mux. Each is documented below.
/debug/pprof/¶
Index page listing all registered profiles. Returns HTML.
/debug/pprof/goroutine¶
Goroutine profile. Query parameters:
- debug=0 (default): gzip-compressed protobuf.
- debug=1: plain-text, one line per unique stack with count.
- debug=2: plain-text, one block per goroutine with state and wait reason.
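debug=0 and debug=1 invoke a brief STW. debug=2 is produced through runtime.Stack over all goroutines, which keeps the world stopped while the stacks are formatted, so it is the most disruptive mode on processes with many goroutines.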
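The goroutine endpoint can be exercised in-process with a test server, which avoids binding a fixed port. This is a sketch: the helper names are hypothetical, and it assumes a loopback listener is available in the environment where it runs.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on http.DefaultServeMux
	"strings"
)

// fetchGoroutineText requests the goroutine profile in debug=1 form from an
// in-process test server and returns the status code and the first line.
func fetchGoroutineText() (status int, firstLine string, err error) {
	srv := httptest.NewServer(http.DefaultServeMux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/debug/pprof/goroutine?debug=1")
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	firstLine, _, _ = strings.Cut(string(body), "\n")
	return resp.StatusCode, firstLine, nil
}

// looksLikeGoroutineProfile checks for the text-format header line.
func looksLikeGoroutineProfile(line string) bool {
	return strings.HasPrefix(line, "goroutine profile:")
}

func main() {
	status, line, err := fetchGoroutineText()
	fmt.Println(status, line, err)
}
```

In production code the handlers are usually mounted on a separate, access-controlled listener rather than the default mux of a public server.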
/debug/pprof/heap¶
Heap profile. Query parameters:
- gc=1: invoke runtime.GC() before sampling.
- debug=0 (default): protobuf.
/debug/pprof/allocs¶
Allocations profile. Same data source as heap, different default sample index.
/debug/pprof/mutex¶
Mutex profile. No query parameters of note. Returns protobuf.
/debug/pprof/block¶
Block profile. No query parameters. Returns protobuf.
/debug/pprof/profile¶
CPU profile. Query parameters:
- seconds=N (default 30): duration to sample.
Returns protobuf after the duration elapses.
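The same kind of capture can be approximated in-process with `pprof.StartCPUProfile` and `pprof.StopCPUProfile`. The sketch below is not the handler's implementation; the helper name, the busy loop, and the millisecond parameter are assumptions made for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// captureCPUProfile profiles the process for roughly the given number of
// milliseconds while spinning, and returns the size of the encoded profile.
func captureCPUProfile(millis int) (int, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return 0, err // for example, a CPU profile is already running
	}
	deadline := time.Now().Add(time.Duration(millis) * time.Millisecond)
	spins := 0
	for time.Now().Before(deadline) {
		spins++ // busy work so the sampler has something to observe
	}
	_ = spins
	pprof.StopCPUProfile() // returns once the profile has been written
	return buf.Len(), nil
}

func main() {
	n, err := captureCPUProfile(100)
	fmt.Println("profile bytes:", n, "err:", err)
}
```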
/debug/pprof/trace¶
Execution trace. Query parameters:
- seconds=N (default 1): duration to capture.
Returns the trace binary format. Only one active trace per process; concurrent requests serialise (or fail depending on Go version).
/debug/pprof/threadcreate¶
Thread creation profile. Returns protobuf.
Compatibility and Versioning¶
Profile protobuf¶
pprof.proto is versioned via the protobuf schema's compatibility rules. Newer fields are optional and ignored by older parsers. The format has been stable since 2015.
Trace format¶
The trace format has been rewritten twice (2014, 2022). Major version changes invalidate older tools — go tool trace of Go 1.22+ cannot read traces from Go 1.21 or earlier (and vice versa). Always parse a trace with the matching go binary.
SetMutexProfileFraction / SetBlockProfileRate defaults¶
Both are 0 (disabled) by default. This has been stable since the APIs were introduced (Go 1.8 for SetMutexProfileFraction, Go 1.1 for SetBlockProfileRate). Tests that rely on enabling them must do so explicitly.
Label propagation¶
Introduced in Go 1.9 (runtime/pprof.Do). Behaviour has been stable; minor improvements in label propagation across go statements landed in Go 1.18.
PGO¶
Profile-guided optimization is a build-system feature, not a runtime one, but accepts a CPU profile in the standard pprof format. Stable since Go 1.21.
References¶
- src/runtime/cpuprof.go
- src/runtime/mprof.go
- src/runtime/trace2.go
- src/runtime/pprof/pprof.go
- src/runtime/pprof/label.go
- https://github.com/google/pprof/blob/main/proto/profile.proto
- Go diagnostics overview: https://go.dev/doc/diagnostics
- Go 1.22 trace overhaul design: https://go.googlesource.com/proposal/+/master/design/60773-execution-tracer-overhaul.md