CPU Profiling in Go: Specification
Focus: a precise reference for how Go captures CPU profiles, covering the sampling mechanism, the public APIs, the wire format, and the tools that read them.
Sources:
- `runtime/pprof` package: https://pkg.go.dev/runtime/pprof
- `net/http/pprof` package: https://pkg.go.dev/net/http/pprof
- `testing` package profiling flags: https://pkg.go.dev/testing
- `pprof` tool: https://github.com/google/pprof
- pprof profile format: https://github.com/google/pprof/blob/main/proto/profile.proto
1. What a CPU profile is
A CPU profile is a statistical sample of program execution. The runtime periodically interrupts the program and records the call stack of whichever goroutine was running on each thread. The resulting set of stacks is aggregated into a weighted call graph: each frame's weight is proportional to the number of samples in which it appeared.
| Property | Value |
|---|---|
| Mechanism | OS timer signal (SIGPROF on Linux/macOS) sent to threads |
| Default rate | 100 Hz (one sample every 10 ms) |
| Resolution | One sampling period: 10 ms by default |
| Scope | One profile per process; covers all goroutines on all threads |
| Output | pprof protobuf (gzipped), file extension .pprof or .pb.gz |
| Cost | ~1–5% CPU overhead at 100 Hz; negligible memory |
A 30-second profile at 100 Hz on 8 cores yields up to 24,000 samples — enough resolution to find any function consuming ≥0.5% of CPU.
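The arithmetic behind these figures can be checked with a short sketch. The helper names `maxSamples` and `samplesAtShare` are illustrative and not part of any Go API, and the calculation assumes every core stays busy for the whole window.

```go
package main

import (
	"fmt"
	"math"
)

// maxSamples returns the upper bound on samples collected when busyThreads
// threads are on-CPU for the whole window at the given sampling rate.
func maxSamples(seconds, hz, busyThreads int) int {
	return seconds * hz * busyThreads
}

// samplesAtShare estimates how many of total samples land in a function
// that consumes sharePercent of the CPU time.
func samplesAtShare(total int, sharePercent float64) int {
	return int(math.Round(float64(total) * sharePercent / 100))
}

func main() {
	total := maxSamples(30, 100, 8)
	fmt.Println("max samples:", total)
	fmt.Println("samples at 0.5% share:", samplesAtShare(total, 0.5))
}
```

Under these assumptions a function at 0.5% of CPU is backed by about 120 samples, which is enough to separate it from sampling noise.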
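2. How sampling works under the hood
- The runtime calls `setitimer(ITIMER_PROF, ...)` to request a periodic signal driven by the CPU time the process consumes. On Linux, Go 1.18 and later also use per-thread timers (`timer_create`), which reduces delivery bias on many-core machines.
- On each tick, the kernel delivers `SIGPROF` to a thread that is currently running on a CPU. Delivery is not round-robin: idle threads accumulate no CPU time and receive no signals.
- The signal handler walks the goroutine's stack using compiler-emitted metadata and writes an entry to the lock-free profile buffer.
- A reader goroutine drains the buffer and writes a `pprof` profile to the output writer.

Because the timer counts CPU time rather than wall-clock time, the number of samples scales with the number of busy threads: at 100 Hz with 8 threads fully busy, expect ~800 samples per wall-clock second across all cores.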
3. Public APIs
runtime/pprof

| Function | Purpose |
|---|---|
| `pprof.StartCPUProfile(w io.Writer) error` | Begin writing the CPU profile to `w` |
| `pprof.StopCPUProfile()` | Stop the current CPU profile and flush |
| `(*pprof.Profile).WriteTo(w io.Writer, debug int) error` | Write a non-CPU profile (heap, goroutine, etc.) |
| `pprof.Lookup(name string) *pprof.Profile` | Get a named profile ("goroutine", "heap", "allocs", ...) |
| `pprof.Do(ctx, labels, fn)` | Run `fn` with labels attached so that samples taken during the call carry them |
| `pprof.SetGoroutineLabels(ctx)` | Set the current goroutine's labels to those stored in `ctx` |

Only one CPU profile may be active at a time. A second StartCPUProfile returns an error.
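A minimal sketch of the start/stop lifecycle, including the single-profile rule. The helpers `busyWork`, `captureProfile`, and `secondStartFails` are hypothetical names introduced for illustration; only the `pprof.StartCPUProfile` and `pprof.StopCPUProfile` calls come from the package.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime/pprof"
)

// busyWork burns CPU so the profiler has something to sample.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// captureProfile profiles work and returns the gzipped pprof bytes.
func captureProfile(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // returns only after all buffered samples are written
	return buf.Bytes(), nil
}

// secondStartFails reports whether starting a profile while one is already
// active is rejected.
func secondStartFails() bool {
	if err := pprof.StartCPUProfile(io.Discard); err != nil {
		return false // some other profile was already running
	}
	defer pprof.StopCPUProfile()
	return pprof.StartCPUProfile(io.Discard) != nil
}

func main() {
	data, err := captureProfile(func() { busyWork(200_000_000) })
	if err != nil {
		fmt.Println("could not start profile:", err)
		return
	}
	fmt.Println("profile bytes:", len(data))
	fmt.Println("second start rejected:", secondStartFails())
}
```

In a real program the writer would usually be a file, so that the result can be opened later with `go tool pprof`.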
runtime

| Function | Purpose |
|---|---|
| `runtime.SetCPUProfileRate(hz int)` | Override the default 100 Hz (call before `StartCPUProfile`) |
| `runtime.CPUProfile() []byte` | Deprecated; returns the next chunk of profile data (use `pprof.StartCPUProfile`) |

`SetCPUProfileRate(0)` disables profiling. Very high rates are of little use: in practice operating systems cannot deliver profiling signals much faster than about 500 Hz, and handling each signal has a cost, so requesting more does not yield proportionally more samples.
net/http/pprof
Importing `_ "net/http/pprof"` registers these handlers on `http.DefaultServeMux`:

| Endpoint | Profile |
|---|---|
| `/debug/pprof/profile?seconds=N` | CPU profile over N seconds (default 30) |
| `/debug/pprof/heap` | In-use heap profile |
| `/debug/pprof/allocs` | Sampled record of all allocations since process start |
| `/debug/pprof/goroutine` | Goroutine stacks |
| `/debug/pprof/block` | Blocking profile (after `runtime.SetBlockProfileRate`) |
| `/debug/pprof/mutex` | Mutex contention (after `runtime.SetMutexProfileFraction`) |
| `/debug/pprof/trace?seconds=N` | Execution trace (not a pprof profile; see `go tool trace`) |
| `/debug/pprof/cmdline` | Process command line |
| `/debug/pprof/symbol` | Address-to-symbol lookup |

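The registration side effect can be observed without opening a port. The sketch below sends in-process requests to `http.DefaultServeMux`; `probe` is a hypothetical helper, and the `localhost:6060` address in the comment is an assumption, not something the package mandates.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// probe sends an in-process GET to http.DefaultServeMux and returns the
// HTTP status code, without opening a network listener.
func probe(path string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	paths := []string{"/debug/pprof/cmdline", "/debug/pprof/heap", "/debug/pprof/goroutine"}
	for _, path := range paths {
		fmt.Println(path, probe(path))
	}
	// In a real service, serve the default mux instead (address is an assumption):
	//   log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With a listener running, a CPU profile can be fetched directly by pointing `go tool pprof` at the `/debug/pprof/profile` URL. Keep the endpoint off the public internet, since it exposes process internals.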
testing

| Flag | Effect |
|---|---|
| `-cpuprofile=cpu.out` | Write a CPU profile covering the test or benchmark run |
| `-blockprofile=block.out` | Write a block profile |
| `-mutexprofile=mutex.out` | Write a mutex profile |
| `-memprofile=mem.out` | Write an allocation profile after the tests finish |
| `-trace=trace.out` | Write an execution trace |

These are honored by the standard testing.M flow without any code changes.
4. Profile file format
The on-disk format is a gzipped `profile.proto` message (https://github.com/google/pprof/blob/main/proto/profile.proto). An abridged view of the top-level message:

    Profile {
      repeated ValueType sample_type;  // e.g., "samples/count", "cpu/nanoseconds"
      repeated Sample sample;          // each sample: location stack + values
      repeated Location location;      // PC -> function + line
      repeated Function function;
      repeated Mapping mapping;        // module mappings
      repeated string string_table;
      int64 time_nanos;
      int64 duration_nanos;
      int64 period;                    // sampling period in nanoseconds (e.g., 10000000 for 100 Hz)
    }

A CPU profile stores two values per sample: a count (the number of samples that hit this stack) and CPU nanoseconds (count × period). Tools can display either; the default is CPU time.
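The relationship between the two sample values is plain multiplication, sketched below. `periodNanos` and `cpuNanos` are illustrative helper names, and the figure of 250 samples is an arbitrary input chosen for the example.

```go
package main

import "fmt"

// periodNanos converts a sampling rate in Hz to the profile's period field.
func periodNanos(hz int64) int64 {
	return 1_000_000_000 / hz
}

// cpuNanos derives the CPU-time value of a sample from its count value.
func cpuNanos(count, period int64) int64 {
	return count * period
}

func main() {
	period := periodNanos(100)
	fmt.Println("period (ns):", period)
	fmt.Println("cpu time for 250 samples (ns):", cpuNanos(250, period))
}
```

At 100 Hz, 250 samples on one stack therefore correspond to 2.5 seconds of CPU time.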
5. The go tool pprof interface

| Command | Effect |
|---|---|
| `top` | Top N functions by flat CPU |
| `top -cum` | Top N by cumulative CPU |
| `list <regex>` | Source-level annotation of matching functions |
| `web` | Open SVG call graph in a browser (needs Graphviz) |
| `peek <regex>` | Show callers/callees of matching functions |
| `disasm <regex>` | Annotated disassembly |
| `tree` | Indented caller/callee tree |
| `traces` | Show individual sample stacks |
| `tags` | Show pprof labels and their values |
| `help` | Full command list |

Flags:

| Flag | Effect |
|---|---|
| `-http=:8080` | Launch the interactive web UI (top, source, flame graph) |
| `-base=old.pprof` | Subtract a base profile from the source; intended for cumulative profiles taken from the same process at different times |
| `-diff_base=old.pprof` | Compare against a base profile (regression diff); shows both increases and decreases, with percentages relative to the base total |
| `-output=out.svg` | Write the generated graph or report to a file |
| `-focus=<regex>` | Keep only samples passing through matching nodes |
| `-ignore=<regex>` | Drop samples passing through matching nodes |
| `-nodecount=N` | Limit nodes in graph output |

The web UI (`-http`) is the modern default; the interactive REPL remains useful in CI and SSH sessions.
6. Flat vs cumulative time
| Term | Meaning |
|---|---|
| Flat (self) | Time spent in the function itself, excluding calls it made |
| Cumulative (cum) | Time spent in the function plus everything it called |
A leaf function (e.g., runtime.memmove) has flat ≈ cum. A wrapper function (http.serverHandler.ServeHTTP) has tiny flat but huge cum.
The first question to ask of any profile: "What has high flat?" Those are real hotspots. Functions with only high cum are not where the work happens — they are where it is organized.
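The two measures can be derived from raw sample stacks, as in the sketch below. The `sample` type, the `flatAndCum` helper, and the stack contents (including the `encode` function) are made up for illustration; pprof's real aggregation works on the protobuf structures but follows the same idea.

```go
package main

import "fmt"

// sample is one aggregated stack: frames are ordered leaf first, and count is
// the number of times the profiler observed exactly this stack.
type sample struct {
	frames []string
	count  int
}

// flatAndCum computes per-function flat and cumulative sample counts.
func flatAndCum(samples []sample) (flat, cum map[string]int) {
	flat = make(map[string]int)
	cum = make(map[string]int)
	for _, s := range samples {
		if len(s.frames) == 0 {
			continue
		}
		flat[s.frames[0]] += s.count // only the leaf frame was executing
		seen := make(map[string]bool)
		for _, f := range s.frames {
			if !seen[f] { // count a recursive function once per sample
				seen[f] = true
				cum[f] += s.count
			}
		}
	}
	return flat, cum
}

func main() {
	samples := []sample{
		{frames: []string{"runtime.memmove", "encode", "ServeHTTP"}, count: 70},
		{frames: []string{"encode", "ServeHTTP"}, count: 25},
		{frames: []string{"ServeHTTP"}, count: 5},
	}
	flat, cum := flatAndCum(samples)
	for _, fn := range []string{"runtime.memmove", "encode", "ServeHTTP"} {
		fmt.Printf("%-16s flat=%3d cum=%3d\n", fn, flat[fn], cum[fn])
	}
}
```

In this made-up data the wrapper `ServeHTTP` has a flat count of 5 but a cumulative count of 100, while the leaf `runtime.memmove` has flat equal to cum at 70.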
7. Runtime functions you will see

| Function | Meaning |
|---|---|
| `runtime.mallocgc` | Heap allocation slow path |
| `runtime.gcBgMarkWorker` | Background GC worker |
| `runtime.gcAssistAlloc1` | Mutator assisting GC (allocation-bound) |
| `runtime.futex` / `runtime.semasleep` | OS thread sleeping on, or waking another thread from, a runtime lock or semaphore |
| `runtime.findrunnable` / `runtime.schedule` | Scheduler looking for work (often idle CPU); spelled `runtime.findRunnable` in newer Go releases |
| `runtime.memmove` / `runtime.memclrNoHeapPointers` | Bulk copy/clear |
| `runtime.mapaccess1` / `runtime.mapassign` | Map operations |
| `runtime.growslice` | Slice reallocation on append |
| `runtime.convT*` | Boxing a value into an interface |
| `runtime.duffcopy` / `runtime.duffzero` | Unrolled copy/zero routines (Duff's device) |
| `syscall.Syscall6` | Time spent in kernel calls |

Many seconds in `runtime.findrunnable` usually indicate mostly idle capacity: threads spinning in the scheduler while looking for work, not application work. The profile counts on-CPU time for whatever ran, including the runtime. Filter it out with `-ignore='runtime\.find[rR]unnable'`.
8. Profile labels (pprof.Do)

    pprof.Do(ctx, pprof.Labels("endpoint", "/api/v1/users"), func(ctx context.Context) {
        handleUsers(ctx)
    })

Samples taken while the goroutine carries these labels record them. `go tool pprof` can then group or filter by label, for example with the `tags` command or the `-tagfocus` option.
Goroutines started inside the `Do` callback inherit its labels. Cost: a few words per `Do` call. Use labels to attribute CPU to logical units (request types, tenants, batch jobs).
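A self-contained sketch of the snippet above, showing that the label is visible through the context inside the callback and absent outside it. `handleUsers` and `labeledCall` are hypothetical functions; `pprof.Do`, `pprof.Labels`, and `pprof.Label` are the real package calls.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// handleUsers is a hypothetical handler; it reads back the label that
// pprof.Do stored in the context.
func handleUsers(ctx context.Context) string {
	endpoint, _ := pprof.Label(ctx, "endpoint")
	return endpoint
}

// labeledCall runs handleUsers with an "endpoint" label attached and
// returns the label value the handler observed.
func labeledCall(endpoint string) string {
	var seen string
	labels := pprof.Labels("endpoint", endpoint)
	pprof.Do(context.Background(), labels, func(ctx context.Context) {
		seen = handleUsers(ctx)
	})
	return seen
}

func main() {
	fmt.Println("label inside Do:", labeledCall("/api/v1/users"))
	_, ok := pprof.Label(context.Background(), "endpoint")
	fmt.Println("label present outside Do:", ok)
}
```

Prefer low-cardinality label values such as route templates over raw URLs or user IDs, so that grouped output stays readable.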
9. Sample bias and limitations
| Issue | Effect |
|---|---|
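| Signal delivery skew | Functions that often run with SIGPROF masked (CGO, syscalls) are under-sampled |
| Inlined functions | Inlined calls have no physical stack frame; the runtime reconstructs them from compiler metadata and pprof marks them `(inline)`. Building with `-gcflags="-l"` disables inlining for analysis but changes the performance being measured |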
| Tail calls / TCO | Not used by Go; not an issue |
| Short-lived programs | Fewer than ~1000 samples → noisy results; raise rate or extend duration |
| CGO time | Recorded as _cgoexp_* / runtime.cgocall; in-C time is opaque |
| Goroutine starvation | Profile shows what ran, not what waited; combine with block/mutex profiles |
A CPU profile answers "where did CPU go", not "why was the request slow". For latency questions, use traces (go tool trace) plus block/mutex profiles.
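As a sketch of how the complementary block and mutex profiles can be collected from code: the helpers `contend` and `snapshot` are hypothetical, and the rate of 1 passed to both runtime setters records every event, which is chosen here for illustration and is likely too expensive for production use.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// contend makes goroutines fight over one mutex so both profiles have data.
func contend(workers, iterations int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	total := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				total++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total
}

// snapshot returns the named profile ("block" or "mutex") as gzipped pprof bytes.
func snapshot(name string) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	err := p.WriteTo(&buf, 0) // debug=0 selects the binary pprof format
	return buf.Bytes(), err
}

func main() {
	runtime.SetBlockProfileRate(1)     // record every blocking event
	runtime.SetMutexProfileFraction(1) // record every contention event
	contend(8, 10_000)
	for _, name := range []string{"block", "mutex"} {
		data, err := snapshot(name)
		fmt.Println(name, len(data), err)
	}
}
```

Both profiles stay empty unless their rates are set before the contention occurs, which is why the setters are called first.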
10. Non-goals
- pprof CPU profiling is not a code coverage tool. Cold paths are absent from the profile.
- It is not a wall-clock latency profile. Off-CPU time (sleeping on I/O, locks) is invisible.
- It cannot attribute time inside C libraries beyond the entry point.
- It cannot tell you why memory was allocated — that is the heap profile's job.
11. Related references
- pprof README: https://github.com/google/pprof/blob/main/doc/README.md
- "Profiling Go Programs" (Go blog, classic): https://go.dev/blog/pprof
- `runtime/pprof` source: https://github.com/golang/go/tree/master/src/runtime/pprof
- Continuous profiling overview: https://grafana.com/oss/pyroscope/