CPU Profiling in Go: Specification

Focus: Precise reference for how Go captures CPU profiles — the sampling mechanism, the public APIs, the wire format, and the tools that read them.

Sources:
- runtime/pprof package: https://pkg.go.dev/runtime/pprof
- net/http/pprof package: https://pkg.go.dev/net/http/pprof
- testing package profiling flags: https://pkg.go.dev/testing
- pprof tool: https://github.com/google/pprof
- pprof profile format: https://github.com/google/pprof/blob/main/proto/profile.proto

1. What a CPU profile is

A CPU profile is a statistical sample of program execution. The runtime periodically interrupts the program and records the call stack of whichever goroutine was running on each thread. The resulting set of stacks is aggregated into a weighted call graph: each frame's weight is proportional to the number of samples in which it appeared.

Property	Value
Mechanism	OS timer signal (`SIGPROF` on Linux/macOS) sent to threads
Default rate	100 Hz (one sample every 10 ms)
Resolution	The duration of a sample — 10 ms by default
Scope	One profile per process; covers all goroutines on all threads
Output	`pprof` protobuf (gzipped), file extension `.pprof` or `.pb.gz`
Cost	~1–5% CPU overhead at 100 Hz; negligible memory

A 30-second profile at 100 Hz on 8 cores yields up to 24,000 samples — enough resolution to find any function consuming ≥0.5% of CPU.
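The sample budget above is plain arithmetic and can be sketched as follows. The helper names `maxSamples` and `relativeError` are hypothetical, and the error estimate assumes samples behave like independent draws (a binomial model), which is an approximation and not something the runtime guarantees.

```go
package main

import (
	"fmt"
	"math"
)

// maxSamples returns the upper bound on collected samples when every
// core is busy for the whole profile: seconds × rate × cores.
func maxSamples(seconds, hz, cores int) int {
	return seconds * hz * cores
}

// relativeError estimates the relative standard error of a function's
// measured CPU share, assuming independent samples (binomial model).
func relativeError(share float64, samples int) float64 {
	stderr := math.Sqrt(share * (1 - share) / float64(samples))
	return stderr / share
}

func main() {
	n := maxSamples(30, 100, 8)
	fmt.Println("max samples:", n) // 24000
	fmt.Printf("samples for a 0.5%% function: %.0f\n", 0.005*float64(n))
	fmt.Printf("relative error of that share: %.1f%%\n", 100*relativeError(0.005, n))
}
```

Under these assumptions a function at 0.5% of CPU lands in roughly 120 samples, so its measured share is stable to within about a tenth of its value. On a mostly idle process the sample count is far lower and the estimate degrades accordingly.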

2. How sampling works under the hood

kernel → SIGPROF → signal handler → record stack of current goroutine → continue

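The runtime calls setitimer(ITIMER_PROF, ...) to request a periodic signal driven by the CPU time the process consumes. On Linux, Go 1.18 and later additionally use per-thread timers (timer_create) to reduce sampling bias.
On each tick, the kernel delivers SIGPROF to a thread that is consuming CPU. Delivery of the process-wide signal is not round-robin; the kernel chooses the receiving thread.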
The signal handler walks the goroutine's stack using compiler-emitted metadata and writes an entry to the lock-free profile buffer.
A reader goroutine drains the buffer and writes a pprof profile to the output writer.

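Because the timer counts CPU time rather than wall-clock time, the number of samples per wall-clock second scales with the number of busy threads (bounded by GOMAXPROCS): at 100 Hz with 8 fully busy threads, expect ~800 samples/second across all cores. A mostly idle program produces correspondingly few samples.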

3. Public APIs

`runtime/pprof`

Function	Purpose
`pprof.StartCPUProfile(w io.Writer) error`	Begin writing the CPU profile to `w`
`pprof.StopCPUProfile()`	Stop the current CPU profile and flush
`(*pprof.Profile).WriteTo(w, debug)`	Write a non-CPU profile (heap, goroutine, etc.) obtained via `Lookup`
`pprof.Lookup(name)`	Get a named profile (`"goroutine"`, `"heap"`, `"allocs"`, ...)
`pprof.Do(ctx, labels, fn)`	Tag samples with `runtime/pprof` labels
`pprof.SetGoroutineLabels(ctx)`	Attach labels to the current goroutine

Only one CPU profile may be active at a time. A second StartCPUProfile returns an error.
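A minimal sketch of the start/stop lifecycle, including the rejected second start. The helpers `burn` and `captureProfile` are hypothetical names for this illustration, and the profile is written to an in-memory buffer only to keep the example self-contained.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// burn spins on the CPU for roughly ms milliseconds so the profiler
// has something to sample.
func burn(ms int) uint64 {
	d := time.Duration(ms) * time.Millisecond
	var x uint64
	for start := time.Now(); time.Since(start) < d; {
		for i := 0; i < 1000; i++ {
			x = x*1664525 + 1013904223
		}
	}
	return x
}

// captureProfile profiles burn into a buffer. It returns the number of
// gzipped profile bytes written and whether a second StartCPUProfile
// call was rejected while the first profile was still active.
func captureProfile(ms int) (int, bool, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return 0, false, err
	}
	// Only one CPU profile can be active, so this call should fail.
	rejected := pprof.StartCPUProfile(&bytes.Buffer{}) != nil
	burn(ms)
	pprof.StopCPUProfile() // stops sampling and flushes into buf
	return buf.Len(), rejected, nil
}

func main() {
	size, rejected, err := captureProfile(200)
	fmt.Println("bytes:", size, "second start rejected:", rejected, "err:", err)
}
```

In a real program the writer would typically be a file from `os.Create("cpu.pprof")`, with `defer pprof.StopCPUProfile()` placed right after a successful start so the profile is flushed on every return path.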

`runtime`

Function	Purpose
`runtime.SetCPUProfileRate(hz int)`	Override the default 100 Hz (call before `StartCPUProfile`)
`runtime.CPUProfile()`	Deprecated; formerly returned raw chunks of profile data and now panics (use `pprof.StartCPUProfile`)

SetCPUProfileRate(0) disables profiling. Rates above roughly 1000 Hz are not reliably delivered on most platforms, because OS timer resolution limits how often SIGPROF can fire; requesting more does not necessarily yield more samples.

`net/http/pprof`

Importing _ "net/http/pprof" registers these handlers on http.DefaultServeMux:

Endpoint	Profile
`/debug/pprof/profile?seconds=N`	CPU profile over N seconds (default 30)
`/debug/pprof/heap`	In-use heap profile
`/debug/pprof/allocs`	All allocations since process start
`/debug/pprof/goroutine`	Goroutine stacks
`/debug/pprof/block`	Blocking profile (after `runtime.SetBlockProfileRate`)
`/debug/pprof/mutex`	Mutex contention (after `runtime.SetMutexProfileFraction`)
`/debug/pprof/trace?seconds=N`	Execution trace (not a pprof profile; view with `go tool trace`)
`/debug/pprof/cmdline`	Process command line
`/debug/pprof/symbol`	Address-to-symbol lookup
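When the default mux is not suitable (for example, to serve profiles on a separate internal port), the exported handlers can be registered by hand. The sketch below assumes a private mux; `newDebugMux` and `fetch` are hypothetical helpers, and the request is served through an in-memory recorder so no port is opened.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux registers the pprof handlers on a private mux. Note that
// importing net/http/pprof still registers them on
// http.DefaultServeMux as a side effect.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, ...
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// fetch serves one GET request against the mux in memory and returns
// the HTTP status code and the response body size.
func fetch(path string) (int, int) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	newDebugMux().ServeHTTP(rec, req)
	return rec.Code, rec.Body.Len()
}

func main() {
	code, n := fetch("/debug/pprof/goroutine?debug=1")
	fmt.Println("goroutine:", code, n > 0)
	code, n = fetch("/debug/pprof/profile?seconds=1") // blocks for about 1 s
	fmt.Println("cpu profile:", code, n > 0)
}
```

Against a live server the same CPU endpoint is usually consumed directly by the tool, e.g. `go tool pprof http://host:port/debug/pprof/profile?seconds=30`, where host and port are whatever the debug listener uses.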

`testing`

Flag	Effect
`-cpuprofile=cpu.out`	Write a CPU profile covering the test or benchmark run (e.g. `go test -bench=. -cpuprofile=cpu.out`)
`-blockprofile=block.out`	Write a block profile
`-mutexprofile=mutex.out`	Write a mutex profile
`-memprofile=mem.out`	Write a heap profile at the end
`-trace=trace.out`	Write an execution trace

These are honored by the standard testing.M flow without any code changes.

4. Profile file format

The on-disk format is a gzipped profile.proto (https://github.com/google/pprof/blob/main/proto/profile.proto).

Profile {
  repeated ValueType sample_type;   // e.g., "samples/count", "cpu/nanoseconds"
  repeated Sample sample;           // each sample: location stack + values
  repeated Location location;       // PC -> function + line
  repeated Function function;
  repeated Mapping mapping;         // module mappings
  repeated string string_table;
  int64 time_nanos;
  int64 duration_nanos;
  int64 period;                     // sampling period in nanoseconds (e.g., 10000000 for 100 Hz)
}

A CPU profile carries two values per sample: a count (the number of samples that hit this stack) and CPU nanoseconds (count × period). Tools can display either; the default is CPU time.
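The relationship between the two values is a single multiplication, sketched here with the hypothetical helpers `periodNanos` and `cpuNanos`.

```go
package main

import "fmt"

// periodNanos returns the sampling period in nanoseconds for a rate in Hz.
func periodNanos(hz int64) int64 {
	return 1_000_000_000 / hz
}

// cpuNanos converts a sample count into CPU time: count × period.
func cpuNanos(count, hz int64) int64 {
	return count * periodNanos(hz)
}

func main() {
	fmt.Println(periodNanos(100))   // 10000000 (10 ms)
	fmt.Println(cpuNanos(250, 100)) // 2500000000 (2.5 s of CPU)
}
```

A stack seen in 250 samples at the default rate therefore represents about 2.5 seconds of CPU time, which is the figure the tools show by default.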

5. The `go tool pprof` interface

go tool pprof [flags] [binary] profile.pprof

Command	Effect
`top`	Top N functions by flat CPU
`top -cum`	Top N by cumulative CPU
`list <regex>`	Source-level annotation of matching functions
`web`	Open SVG call graph in a browser (needs Graphviz)
`peek <regex>`	Show callers/callees of matching functions
`disasm <regex>`	Annotated disassembly
`tree`	Indented caller/callee tree
`traces`	Show individual sample stacks
`tags`	Show pprof labels and their values
`help`	Full command list

Flags:

Flag	Effect
`-http=:8080`	Launch the interactive web UI (top, source, flame graph)
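`-base=old.pprof`	Subtract `old` from the profile; intended for cumulative profiles taken from the same process at different times
`-diff_base=old.pprof`	Compare against `old` (regression diff); shows both positive and negative deltas, with percentages relative to the base profile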
`-output=out.svg`	Write the generated report to a file (combine with a format flag such as `-svg`)
`-focus=<regex>`	Keep only samples passing through matching nodes
`-ignore=<regex>`	Drop samples passing through matching nodes
`-nodecount=N`	Limit nodes in graph output

The web UI (-http) is the modern default; the interactive REPL remains useful in CI and ssh sessions.

6. Flat vs cumulative time

Term	Meaning
Flat (self)	Time spent in the function itself, excluding calls it made
Cumulative (cum)	Time spent in the function plus everything it called

A leaf function (e.g., runtime.memmove) has flat ≈ cum. A wrapper function (http.serverHandler.ServeHTTP) has tiny flat but huge cum.

The first question to ask of any profile: "What has high flat?" Those are real hotspots. Functions with only high cum are not where the work happens — they are where it is organized.
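The two definitions can be made concrete by computing them from raw stacks. This is an illustrative sketch, not how pprof itself is implemented; the `sample` type, `flatAndCum`, and the numbers in `demoSamples` are made up for the example.

```go
package main

import "fmt"

// sample is one aggregated stack: frames[0] is the leaf (the function
// that was executing) and the last element is the outermost caller.
type sample struct {
	frames []string
	count  int
}

// flatAndCum returns per-function flat and cumulative sample counts.
func flatAndCum(samples []sample) (map[string]int, map[string]int) {
	flat, cum := map[string]int{}, map[string]int{}
	for _, s := range samples {
		if len(s.frames) == 0 {
			continue
		}
		flat[s.frames[0]] += s.count // only the leaf gets flat time
		seen := map[string]bool{}    // count recursive frames once per stack
		for _, f := range s.frames {
			if !seen[f] {
				seen[f] = true
				cum[f] += s.count
			}
		}
	}
	return flat, cum
}

// demoSamples returns made-up data: 100 samples under one handler.
func demoSamples() []sample {
	return []sample{
		{[]string{"runtime.memmove", "encode", "ServeHTTP"}, 70},
		{[]string{"encode", "ServeHTTP"}, 25},
		{[]string{"ServeHTTP"}, 5},
	}
}

func main() {
	flat, cum := flatAndCum(demoSamples())
	for _, f := range []string{"runtime.memmove", "encode", "ServeHTTP"} {
		fmt.Printf("%-16s flat=%3d cum=%3d\n", f, flat[f], cum[f])
	}
}
```

In this data the handler has the largest cumulative count (100) but almost no flat time (5), while the leaf copy routine has flat equal to cum (70), matching the wrapper and leaf patterns described above.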

7. Runtime functions you will see

Function	Meaning
`runtime.mallocgc`	Heap allocation (the allocator's main entry point)
`runtime.gcBgMarkWorker`	Background GC worker
`runtime.gcAssistAlloc1`	Mutator assisting GC (allocation-bound)
`runtime.futex` / `runtime.semasleep`	Runtime thread (M) parking or waking on an OS futex or semaphore
`runtime.findrunnable` / `runtime.schedule`	Scheduler looking for work (often idle CPU)
`runtime.memmove` / `runtime.memclrNoHeapPointers`	Bulk copy/clear
`runtime.mapaccess1` / `runtime.mapassign`	Map operations
`runtime.growslice`	Slice reallocation on append
`runtime.convT*`	Boxing a value into an interface
`runtime.duffcopy` / `runtime.duffzero`	Unrolled copy/zero routines (Duff's device)
`syscall.Syscall6`	Time spent in kernel calls

Many seconds in runtime.findrunnable usually indicate an underloaded program, not real work: threads are burning CPU looking for runnable goroutines, and the profile counts on-CPU time for whatever the kernel scheduled. Filter it out with -ignore='runtime\.findrunnable'.

8. Profile labels (`pprof.Do`)

pprof.Do(ctx, pprof.Labels("endpoint", "/api/v1/users"), func(ctx context.Context) {
    handleUsers(ctx)
})

Samples taken while the goroutine carries these labels record them. go tool pprof can then group:

(pprof) tags endpoint
(pprof) tagfocus=endpoint=/api/v1/users

Goroutines started while fn is running inherit the label set of the goroutine that spawned them. Cost: a few words per Do call. Use labels to attribute CPU to logical units (request types, tenants, batch jobs).
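A self-contained sketch of the label lifecycle, using `pprof.Label` to read a value back from the context. The helper `labelDuring` is a hypothetical name introduced for this example.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// labelDuring runs a function under a single pprof label and reports
// the value visible on the context inside pprof.Do.
func labelDuring(key, value string) (string, bool) {
	var got string
	var ok bool
	pprof.Do(context.Background(), pprof.Labels(key, value), func(ctx context.Context) {
		// CPU samples taken on this goroutine here carry the label.
		got, ok = pprof.Label(ctx, key)
	})
	return got, ok
}

func main() {
	v, ok := labelDuring("endpoint", "/api/v1/users")
	fmt.Println(v, ok) // /api/v1/users true

	// A context that never passed through Do carries no labels.
	_, ok = pprof.Label(context.Background(), "endpoint")
	fmt.Println(ok) // false
}
```

When Do returns, the goroutine's previous label set is restored, so labels stay scoped to the function passed to Do. Nested Do calls add to the labels already present on the context.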

9. Sample bias and limitations

Issue	Effect
Signal delivery skew	Functions that often have `SIGPROF` masked (CGO, syscalls) under-sample
Inlined functions	Inlined callees are reconstructed from compiler inlining metadata, but attribution around inlined code can be approximate (use `-gcflags="-l"` to disable inlining for analysis)
Tail calls / TCO	Not used by Go; not an issue
Short-lived programs	Fewer than ~1000 samples → noisy results; raise rate or extend duration
CGO time	Recorded as `_cgoexp_*` / `runtime.cgocall`; in-C time is opaque
Goroutine starvation	Profile shows what ran, not what waited; combine with block/mutex profiles

A CPU profile answers "where did CPU go", not "why was the request slow". For latency questions, use traces (go tool trace) plus block/mutex profiles.

10. Non-goals

pprof CPU profiling is not a code coverage tool. Cold paths are absent from the profile.
It is not a wall-clock latency profile. Off-CPU time (sleeping on I/O, locks) is invisible.
It cannot attribute time inside C libraries beyond the entry point.
It cannot tell you why memory was allocated — that is the heap profile's job.

pprof README: https://github.com/google/pprof/blob/main/doc/README.md
"Profiling Go Programs" (Go blog, classic): https://go.dev/blog/pprof
runtime/pprof source: https://github.com/golang/go/tree/master/src/runtime/pprof
Continuous profiling overview: https://grafana.com/oss/pyroscope/
