Detecting Goroutine Leaks — Professional Level¶
Table of Contents¶
- Introduction
- How runtime.NumGoroutine Works
- The allgs Table and Stop-The-World
- runtime.Stack Internals
- The Goroutine Profile Format
- pprof.Lookup("goroutine") Implementation
- Goroutine States and Wait Reasons
- Goroutine Labels — How They Are Stored
- schedtrace and gctrace for Detection
- runtime/trace Internals
- Cost of Profile Capture at Scale
- Building a Custom Detector
- Self-Assessment
- Summary
Introduction¶
This file digs into the runtime mechanics that everything else in this section stands on. Most engineers can detect leaks without knowing any of this. You need it when:
- You are writing a profiling library and want to understand the contract.
- You see a pprof output that does not match your mental model and need to know whether it is a runtime bug or yours.
- You need to reason about the cost of profile capture on a million-goroutine server.
- You are interviewing for a runtime/observability role.
References point at the Go source tree at the time of writing (Go 1.22). Line numbers drift; function names are more stable. Look for the function name in your installed $GOROOT/src/runtime.
How runtime.NumGoroutine Works¶
runtime.NumGoroutine (in runtime/proc.go) returns:
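// Paraphrased from runtime/proc.go (Go 1.22): the exported API is a thin
// wrapper over the scheduler's internal counter.
func NumGoroutine() int {
    return int(gcount())
}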
gcount() derives its result from counters that the runtime maintains as goroutines are created (newproc) and exit (goexit0). Specifically, it combines:
- allglen — the total number of goroutines ever observed by the runtime (a high-water mark).
- sched.ngsys — the number of system goroutines.
- A running tally of "dead" goroutines kept on a free list for reuse.
gcount returns allglen - dead - system. It does not lock; it relies on atomic loads. This is why two consecutive calls can differ by a small amount that is not explained by user-visible spawning — the GC may have moved a goroutine between dead and alive states, or sysmon may have ticked.
For detection that requires an exact count, take a sample, GC, and re-sample. Even then expect ±2.
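A minimal sketch of that discipline (the helper name and the 10 ms settle window are illustrative, not a runtime contract):

// settledGoroutineCount forces a GC, lets the scheduler settle, then samples
// twice; callers should still tolerate a difference of around ±2.
func settledGoroutineCount() (first, second int) {
    runtime.GC()
    time.Sleep(10 * time.Millisecond)
    first = runtime.NumGoroutine()
    runtime.GC()
    time.Sleep(10 * time.Millisecond)
    second = runtime.NumGoroutine()
    return first, second
}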
The allgs Table and Stop-The-World¶
When runtime.Stack(buf, true) or pprof.Lookup("goroutine").WriteTo walks all goroutines, it must read a coherent snapshot. The runtime does this by calling stopTheWorld("profile"). Briefly:
- The current goroutine grabs sched.lock.
- Every other P is signalled to stop at the next safepoint.
- Once all Ps are stopped, the calling goroutine walks allgs (the global slice of every goroutine).
- For each goroutine, it reads gp.atomicstatus, gp.sched.pc, and the parked frame.
- STW is released; the world resumes.
Total STW time for "profile" is typically microseconds, but grows linearly with goroutine count. On a server with 100,000 goroutines it can be a few milliseconds. On a server with 5,000,000 (yes, that exists), it is hundreds of milliseconds — a noticeable latency blip.
This is why the senior file recommended not hitting /debug/pprof/goroutine?debug=2 on every request. Even a single capture is expensive at scale.
Newer Go versions (1.19+) optimised this with per-goroutine stack collection that does not require a full STW for debug=2 — instead, each goroutine is individually preempted to a safepoint, walked, and resumed. The STW window is shortened to almost nothing, at the cost of slightly higher steady-state instrumentation. See runtime/mprof.go::goroutineProfileWithLabels.
runtime.Stack Internals¶
runtime.Stack(buf, all) calls (paraphrased):
func Stack(buf []byte, all bool) int {
if all {
stw := stopTheWorld(stwAllGoroutinesStack)
n := writeAllStacks(buf)
startTheWorld(stw)
return n
}
return writeMyStack(buf)
}
writeMyStack walks the current goroutine's gp.sched.bp (frame base pointer) chain, decoding each frame via the symbol table embedded in the binary (the pclntab). Decoding is what produces the human-readable file:line strings; without symbol info the stack is just addresses.
If your binary is stripped (-ldflags '-s -w'), runtime.Stack still resolves names: the pclntab is distinct from the symbol table and DWARF debug info, and the runtime always keeps it because tracebacks depend on it. What stripping removes is the metadata that external tools (debuggers, disassemblers) rely on, not the runtime's own traceback data.
A subtle quirk: argument values printed in the stack trace ("main.handler(0xc0000a0080, 0x42)") are the values at the time the goroutine was parked. For inlined functions or escape-analysed locals, those values may be in the register file rather than memory; the runtime falls back to printing ? when it cannot recover them.
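Calling it directly is straightforward; a sketch (the 1 MiB buffer size is an arbitrary choice, and Stack truncates rather than reallocating if the buffer is too small):

buf := make([]byte, 1<<20)    // 1 MiB; grow this on busy servers
n := runtime.Stack(buf, true) // all=true: every goroutine, brief STW
os.Stdout.Write(buf[:n])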
The Goroutine Profile Format¶
The protobuf format (?debug=0) is the pprof profile.proto. Each sample has:
- value — a single int64, the count of goroutines sharing this stack.
- location — a list of Location IDs, each mapping to a function + line.
- label — key-value pairs (the pprof.SetGoroutineLabels data).
go tool pprof and github.com/google/pprof/profile consume this. The format is the same as for CPU, heap, and block profiles; only the sample_type differs. For goroutine profiles, sample_type is [{Type: "goroutine", Unit: "count"}].
?debug=1 is a text rendering: each unique stack is printed once with a count, then frame addresses, then resolved file:line.
?debug=2 is the most verbose form: every goroutine printed individually, with state, wait reason, duration, and full stack including arguments. This is what runtime.Stack(buf, true) produces.
There is no standard format between these — they are three different rendering paths through the same internal data. A tool that reads debug=2 must parse text; a tool that reads debug=0 parses protobuf.
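For the protobuf path, a sketch using github.com/google/pprof/profile (the localhost URL is illustrative, and real code should guard against empty Location and Line slices before indexing):

import "github.com/google/pprof/profile"

resp, err := http.Get("http://localhost:6060/debug/pprof/goroutine") // debug=0 by default
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()
p, err := profile.Parse(resp.Body)
if err != nil {
    log.Fatal(err)
}
for _, s := range p.Sample {
    count := s.Value[0]                         // goroutines sharing this stack
    leaf := s.Location[0].Line[0].Function.Name // innermost frame
    fmt.Println(count, leaf, s.Label)           // s.Label carries goroutine labels
}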
pprof.Lookup("goroutine") Implementation¶
In runtime/pprof/pprof.go:
goroutineProfile.WriteTo(w, debug) dispatches on debug:
- debug == 0 — call writeGoroutine(w), which invokes runtime.goroutineProfileWithLabels and serialises into protobuf.
- debug == 1 — call printCountProfile; aggregates by stack, prints text counts.
- debug == 2 — call writeGoroutineStacks, which calls runtime.Stack(_, true) and writes the bytes directly.
The label-aware path is significant: only debug == 0 (protobuf) includes labels. debug=2 does not. If you want labels in human-readable form, post-process the protobuf yourself.
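A quick sketch of exercising all three paths from the same handle (writing to os.Stdout is illustrative; error handling is elided):

p := pprof.Lookup("goroutine")

var pb bytes.Buffer
_ = p.WriteTo(&pb, 0)       // protobuf: machine-readable, includes labels
_ = p.WriteTo(os.Stdout, 1) // text: one aggregated entry per unique stack
_ = p.WriteTo(os.Stdout, 2) // text: every goroutine with state and wait reason, no labels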
Goroutine States and Wait Reasons¶
The runtime distinguishes between status (what kind of state) and wait reason (why it is parked). Both appear in debug=2.
Status values (_Grunnable, _Grunning, _Gwaiting, etc.) are internal. The string the user sees in [chan send] etc. is the wait reason, an enum in runtime/runtime2.go:
const (
waitReasonChanReceive // "chan receive"
waitReasonChanSend // "chan send"
waitReasonSelect // "select"
waitReasonSelectNoCases // "select (no cases)"
waitReasonSyncMutexLock // "sync.Mutex.Lock"
waitReasonSyncRWMutexRLock // "sync.RWMutex.RLock"
waitReasonSyncWaitGroupWait // "sync.WaitGroup.Wait"
waitReasonIOWait // "IO wait"
waitReasonSemacquire // "semacquire"
waitReasonGCMarkTermination // "GC mark termination"
waitReasonForceGCIdle // "force gc (idle)"
// ... many more
)
When triaging, the wait reason is the first clue. chan send and chan receive are the classic leak shapes. select with (no cases) is select {} — typically intentional, sometimes a bug. sync.Mutex.Lock is rare for leaks but common for deadlocks.
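A small sketch of turning a debug=2 dump into a wait-reason tally (it assumes the dump text is already in a string named dump and that header lines follow the usual "goroutine N [reason, duration]:" layout):

header := regexp.MustCompile(`(?m)^goroutine \d+ \[([^,\]]+)`)
counts := map[string]int{}
for _, m := range header.FindAllStringSubmatch(dump, -1) {
    counts[m[1]]++ // e.g. "chan send", "chan receive", "select", "IO wait"
}
fmt.Println(counts)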
runtime.Goexit exits the goroutine; you never see a "goexit" state in a profile because by the time you capture one, the goroutine is already dead and excluded from the dump.
Goroutine Labels — How They Are Stored¶
pprof.SetGoroutineLabels(ctx) walks the context, extracts the labelMap (added by pprof.Do/pprof.Labels), and stores a pointer to it in gp.labels. The pointer is heap-allocated; the label map is immutable once set.
When goroutineProfileWithLabels collects samples, it reads gp.labels and embeds the labels into the protobuf sample. This is why debug=0 carries labels but debug=2 (which renders only gp.atomicstatus and the stack) does not.
A new goroutine spawned with go f() inherits the labels of the goroutine that created it; the runtime copies the current label set at creation time (this is documented on SetGoroutineLabels). pprof.Do(ctx, labels, f) is the recommended way to set labels: it applies them for the duration of f, restores the previous set when f returns, and every goroutine spawned inside f inherits the augmented set. Propagation follows goroutine creation, not the context; handing a labelled ctx to an already-running goroutine changes nothing.
Cost: a label map is a few bytes per goroutine. Setting labels on millions of goroutines is fine; the cost is in the protobuf serialisation, which scales with sample count × label count.
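A minimal usage sketch (the handler and tenant keys, and the worker and process functions, are illustrative):

labels := pprof.Labels("handler", "checkout", "tenant", tenantID)
pprof.Do(ctx, labels, func(ctx context.Context) {
    go worker(ctx) // spawned inside Do, so it inherits the label set
    process(ctx)
})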
schedtrace and gctrace for Detection¶
The GODEBUG environment variable exposes two tracers: schedtrace=N, which prints a scheduler summary every N milliseconds, and gctrace=1, which prints one line per GC cycle.
With GODEBUG=schedtrace=1000, every 1000 ms the runtime prints:
SCHED 12345ms: gomaxprocs=8 idleprocs=0 threads=20 spinningthreads=2 needspinning=0 idlethreads=4 runqueue=128 [22 14 18 9 27 33 11 6]
- gomaxprocs — the configured P count.
- idleprocs — Ps with no work.
- threads — total OS threads.
- runqueue — total runnable goroutines in the global queue.
- The bracketed list — per-P local run queue lengths.
For leak detection, the interesting columns are:
- A sustained large runqueue means goroutines are runnable but not running — possible scheduler contention.
- A monotonically increasing threads means many goroutines are blocked in syscalls.
- A gomaxprocs=8, idleprocs=8 while requests are coming in means goroutines are parked.
gctrace=1 is more general but also useful: each line summarises one GC cycle, including wall-clock and CPU time per phase, heap size before and after, the live heap, and the P count.
A long mark phase combined with high goroutine count suggests the GC is iterating over many live goroutine stacks — confirmation that goroutines are accumulating.
runtime/trace Internals¶
The trace records every scheduling event:
- EvGoCreate — go f() was executed.
- EvGoStart — goroutine started running.
- EvGoBlock* — goroutine blocked, with a specific reason (Recv, Send, Mutex, etc.).
- EvGoUnblock — goroutine unblocked.
- EvGoEnd — goroutine returned.
- EvGoExit — goroutine called runtime.Goexit.
The viewer reconstructs lifetimes by matching EvGoCreate to EvGoEnd/EvGoExit. A goroutine with EvGoCreate and EvGoBlock but no EvGoUnblock and no EvGoEnd within the trace window is, by definition, leaked for at least the trace window.
Programmatic access:
import "golang.org/x/exp/trace"
f, _ := os.Open("trace.out")
r, _ := trace.NewReader(f)
for {
ev, err := r.ReadEvent()
if err != nil { break }
switch ev.Kind() {
case trace.EventStateTransition:
// goroutine transition
}
}
Custom analysis is possible; it is how go tool trace's "Goroutine analysis" page is built.
Cost: trace recording is expensive. The default writes around 1 MB/sec under moderate load. Do not enable it for long durations or in production without rate-limiting. Use short windows (5–10 seconds) for diagnostic capture.
Cost of Profile Capture at Scale¶
Profile capture is O(N) in goroutine count. Concrete numbers from a 64-core machine, Go 1.22:
| Goroutines | NumGoroutine() | pprof.Lookup debug=0 | pprof.Lookup debug=2 | runtime.Stack(_, true) |
|---|---|---|---|---|
| 1,000 | 50 ns | 80 µs | 300 µs | 250 µs |
| 100,000 | 50 ns | 8 ms | 30 ms | 25 ms |
| 1,000,000 | 50 ns | 80 ms | 300 ms | 250 ms |
The NumGoroutine cost is constant — it is an atomic load. The other operations scale linearly. At a million goroutines, a debug=2 dump is a 300 ms STW (or near-STW with the per-G optimisation), which is a noticeable latency event. Capture profiles sparingly at that scale.
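Before wiring capture into anything hot, it is worth measuring it on your own workload; a sketch (io.Discard as the sink is illustrative):

start := time.Now()
if err := pprof.Lookup("goroutine").WriteTo(io.Discard, 0); err != nil {
    log.Fatal(err)
}
log.Printf("debug=0 capture took %v for %d goroutines",
    time.Since(start), runtime.NumGoroutine())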
go_goroutines as a metric is always cheap — it is NumGoroutine once per scrape. Even at scrape intervals of 1 second, the overhead is negligible.
Building a Custom Detector¶
If you need to ship a detector that does not depend on goleak, the skeleton is:
package leakdetect
import (
"runtime"
"runtime/pprof"
"strings"
"time"
)
type Detector struct {
baseline int
paths []string // top-frame prefixes that mean "runtime, ignore"
}
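// New forces a GC, lets the runtime settle briefly, and records the current
// goroutine count as the baseline for later count-based comparisons.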
func New() *Detector {
runtime.GC()
time.Sleep(10 * time.Millisecond)
return &Detector{
baseline: runtime.NumGoroutine(),
paths: []string{
"runtime.gopark",
"runtime.bgsweep",
"runtime.bgscavenge",
"runtime.gcBgMarkWorker",
"runtime.forcegchelper",
},
}
}
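// Suspects settles the runtime again, captures a debug=2 dump, and returns the
// stack blocks that do not look like runtime-owned goroutines.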
func (d *Detector) Suspects() ([]string, error) {
runtime.GC()
time.Sleep(10 * time.Millisecond)
var sb strings.Builder
if err := pprof.Lookup("goroutine").WriteTo(&sb, 2); err != nil {
return nil, err
}
return d.filter(sb.String()), nil
}
func (d *Detector) filter(profile string) []string {
blocks := strings.Split(profile, "\n\n")
var out []string
for _, b := range blocks {
if d.isRuntime(b) {
continue
}
if len(b) > 0 {
out = append(out, b)
}
}
return out
}
func (d *Detector) isRuntime(block string) bool {
for _, p := range d.paths {
if strings.Contains(block, p) {
return true
}
}
return false
}
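A hedged sketch of wiring it into a test or service teardown (the log call is illustrative):

d := leakdetect.New()
defer func() {
    if suspects, err := d.Suspects(); err == nil && len(suspects) > 0 {
        log.Printf("possible goroutine leaks:\n\n%s", strings.Join(suspects, "\n\n"))
    }
}()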
This is roughly what goleak does internally, simplified. The real implementation is more careful about waitstate strings and parallel test isolation. Read the source of go.uber.org/goleak for the full details — it is a few hundred lines and worth a careful read.
Self-Assessment¶
- I can explain runtime.NumGoroutine in terms of allglen and the dead/system counters.
- I can describe the STW behaviour of runtime.Stack(_, true) and how Go 1.19+ optimised it.
- I can read the protobuf goroutine profile format.
- I know which debug= mode emits labels and which does not.
- I can list at least six wait reasons and which ones typically indicate leaks.
- I can interpret a GODEBUG=schedtrace=1000 line.
- I can write a runtime/trace consumer that detects goroutines with no EvGoEnd.
- I have measured the cost of pprof.Lookup("goroutine") on my own service.
- I can build a minimal custom leak detector without depending on goleak.
Summary¶
Professional-level detection is about owning the mechanism, not just the tool. runtime.NumGoroutine is an atomic load on a runtime-maintained counter. pprof.Lookup("goroutine") walks allgs under a (now-minimised) STW. Wait reasons are the runtime's classification of why a goroutine is parked, and they map cleanly to leak archetypes. runtime/trace captures the full creation/destruction event stream and lets you reconstruct lifetimes. The cost of capture scales linearly with goroutine count; at a million goroutines a full debug=2 is a noticeable latency event, and design must account for that. Armed with this mechanism, you can build custom detectors, write performant integrations, and reason precisely about what your monitoring system is showing you. The prevention story (03-preventing-leaks) is the natural sequel; the pprof tooling (04-pprof-tools) shares all the same machinery for heap, block, and CPU profiles.