Detecting Goroutine Leaks — Middle Level¶
Table of Contents¶
- Introduction
- The Full pprof goroutine Workflow
- Reading debug=1 vs debug=2
- Grouping Stacks by Frame
- False Positives — Runtime-Owned Goroutines
- goleak in Depth
- Filtering with pprof.SetGoroutineLabels
- Programmatic Profile Capture
- Diffing Two Profiles
- go tool pprof Interactive Session
- gops Walk-through
- runtime/trace for Lifetime Events
- Common Anti-Patterns
- Self-Assessment
- Summary
Introduction¶
At junior level you learned the tools individually: NumGoroutine, pprof, goleak. At middle level you put them together into an investigation workflow. The mindset shifts from "this tool exists" to "given a service whose memory is climbing at 50 MB per hour, in what order do I reach for which tool, and how do I interpret what each one tells me?"
After this file you will:
- Execute the full pprof goroutine workflow from production binary to root-caused file:line.
- Read debug=1 and debug=2 outputs fluently, including state strings.
- Group thousands of stacks into a handful of buckets and prioritise.
- Recognise runtime-internal goroutines and ignore them correctly.
- Use goleak options (IgnoreTopFunction, IgnoreCurrent, Cleanup) idiomatically.
- Label goroutines so you can filter profiles by subsystem.
- Diff two profiles taken minutes apart to find the leaking signature.
- Drive go tool pprof in interactive mode and read traces and peek output.
- Inspect a running process with gops without restarting it.
- Capture a runtime/trace and see goroutine creation/destruction events in the browser viewer.
This file does not yet cover production monitoring (Prometheus, OpenTelemetry) — that is the senior file. It also does not cover scheduler-level internals — that is the professional file. Cross-reference 01-lifecycle for goroutine state names, 03-preventing-leaks for fixes, and 04-pprof-tools for the broader pprof tool family.
The Full pprof goroutine Workflow¶
A realistic incident:
"Memory has been climbing 50 MB per hour since yesterday's deploy. Process is at 8 GB and not stabilising."
Step by step:
1. Confirm it is a goroutine leak. Compare go_goroutines (the metric) with go_memstats_alloc_bytes. If both are rising in lockstep, suspect goroutines. If goroutines are flat and heap is climbing, it is a heap leak — different investigation.
2. Take a baseline goroutine profile from a healthy replica or a cold start, saving it as base.txt.
3. Take a current profile from the leaking pod, saving it as now.txt.
4. Diff and sort by count. Stacks that appear far more often in now.txt than in base.txt are your suspects.
5. Open now.txt and find the top stack by count. The first line of every block has a number — that is how many goroutines share that stack.
6. Read the topmost function. That is where they are parked. The created by line says who spawned them.
7. Inspect the code at that file:line. Look for the missing context.Done, the unclosed channel, the lock that no one releases.
8. Patch and verify. After deploy, re-run step 3. The count for that stack should drop to a small constant.
Total time for an experienced engineer: 10–20 minutes. The longest step is usually step 7 (understanding why the code is wrong).
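Steps 2 to 4 can be sketched in shell, assuming the standard net/http/pprof endpoint on port 6060; the hostnames are placeholders for your own replicas:

```shell
# Step 2: baseline from a healthy replica (placeholder hostname)
curl -s "http://healthy-host:6060/debug/pprof/goroutine?debug=1" > base.txt
# Step 3: current state of the leaking pod (placeholder hostname)
curl -s "http://leaking-host:6060/debug/pprof/goroutine?debug=1" > now.txt
# Step 4: per-stack counts, biggest first; run on both files and compare.
# Field 3 of the "#" line is the topmost function name.
awk '/^[0-9]+ @/ {n=$1; getline; print n, $3}' now.txt | sort -rn | head
```

The awk one-liner collapses each block to "count topmost-function", which is usually all you need to spot the leak.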
Reading debug=1 vs debug=2¶
debug=1 — counts and unique stacks¶
goroutine profile: total 5187
5102 @ 0x103d8b6 0x103d7d1 0x1043f0f 0x1067e2a 0x1046521
# 0x1067e29 main.(*pollster).poll+0x29 /src/poll.go:42
# 0x1046520 main.startPollster.func1+0x40 /src/poll.go:18
12 @ 0x103d8b6 0x103d7d1 0x1043f0f 0x10a31a8
# 0x10a31a7 net/http.(*conn).serve+0x4a7 /usr/go/src/net/http/server.go:1990
...
Each block is one unique stack trace. The leading number is the count of goroutines sharing it. 5102 of 5187 are at poll.go:42. That is your leak. The other 85 are spread across legitimate work.
debug=2 — every goroutine printed individually¶
goroutine 1 [chan receive]:
main.main()
/src/main.go:25 +0x44
goroutine 2 [force gc (idle), 18 minutes]:
runtime.gopark(...)
runtime.forcegchelper()
/usr/go/src/runtime/proc.go:305 +0xb0
created by runtime.init.6
/usr/go/src/runtime/proc.go:293 +0x25
debug=2 is verbose but shows you:
- The state string in brackets: chan receive, chan send, select, IO wait, sync.Mutex.Lock, sleep, force gc (idle).
- The duration since the goroutine entered that state. [chan receive, 18 minutes] is suspicious; [chan receive] (no duration) means under a minute.
- The argument values at each frame — the runtime captures them when the goroutine was parked.
Rule of thumb: start with debug=1 for triage, switch to debug=2 once you know which stack to investigate.
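Once debug=1 has named the suspect, a fixed-string grep slices just those goroutines out of the debug=2 dump; the function name below is the one from the example above, and the host is a placeholder:

```shell
# Save the verbose dump once, then slice it per suspect.
curl -s "http://host:6060/debug/pprof/goroutine?debug=2" > dump.txt
# -B 1 keeps the "goroutine N [state, duration]:" header above the match,
# -A 4 keeps a few frames below it.
grep -F -B 1 -A 4 'main.(*pollster).poll' dump.txt
```

Grepping the saved file, rather than the live endpoint, lets you slice the same dump for several suspects.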
Grouping Stacks by Frame¶
When debug=1 shows millions of unique stacks (rare, but possible with deep recursion or closures spawned in loops), you need to group at a coarser granularity. The principle is "the closest function frame to the parked state is the root cause."
Approach in pure shell:
curl -s 'host:6060/debug/pprof/goroutine?debug=1' \
| awk '
/^[0-9]+ @/ { count=$1; getline; print count, $3 }
' \
| sort -rn | head
This prints, for every unique stack, the count and the topmost function. Sorted by count, the leak stands out.
Approach in Go with the protobuf format:
import (
	"os"

	"github.com/google/pprof/profile"
)
func loadTopFunctions(path string) (map[string]int64, error) {
f, err := os.Open(path)
if err != nil {
return nil, err
}
defer f.Close()
p, err := profile.Parse(f)
if err != nil {
return nil, err
}
counts := make(map[string]int64)
for _, s := range p.Sample {
if len(s.Location) == 0 {
continue
}
top := s.Location[0]
if len(top.Line) == 0 {
continue
}
fn := top.Line[0].Function.Name
counts[fn] += s.Value[0]
}
return counts, nil
}
Now you have a map[function]count you can sort, log, or expose as a metric.
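The map is unordered; to log the worst offenders first, copy it into a slice and sort. A minimal sketch, with type and function names of my own choosing:

```go
package main

import (
	"fmt"
	"sort"
)

// topFunc pairs a function name with its goroutine count.
type topFunc struct {
	Name  string
	Count int64
}

// rank turns the counts map (as produced by loadTopFunctions)
// into a slice sorted by descending count, ready to print or log.
func rank(counts map[string]int64) []topFunc {
	out := make([]topFunc, 0, len(counts))
	for name, n := range counts {
		out = append(out, topFunc{name, n})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Count > out[j].Count })
	return out
}

func main() {
	// Illustrative counts matching the example earlier in this file.
	counts := map[string]int64{
		"main.(*pollster).poll":  5102,
		"net/http.(*conn).serve": 12,
	}
	for _, tf := range rank(counts) {
		fmt.Printf("%6d %s\n", tf.Count, tf.Name)
	}
}
```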
False Positives — Runtime-Owned Goroutines¶
The Go runtime spawns goroutines that look like leaks but are not. Treat them as wallpaper:
| Top frame | Role | Permanent? |
|---|---|---|
| runtime.gopark from runtime.forcegchelper | Force-GC trigger | Yes |
| runtime.gopark from runtime.bgscavenge | Memory scavenger | Yes |
| runtime.gopark from runtime.bgsweep | GC sweeper | Yes |
| runtime.gopark from runtime.runfinq | Finaliser runner | Yes |
| runtime.notetsleep from runtime.sysmon | System monitor | Yes (but on its own OS thread, often not counted) |
| internal/poll.runtime_pollWait from net.(*netFD).Read | Network read | Only while connection is open |
| runtime.gopark from time.Sleep | A real sleep | Yes, for the duration |
| runtime.gcBgMarkWorker | GC mark worker | Comes and goes per GC cycle |
goleak already filters most of these. If you write your own detector, you must filter them yourself, or every test will fail.
A useful helper:
import "strings"

func isRuntime(stack string) bool {
for _, prefix := range []string{
"runtime.forcegchelper",
"runtime.bgscavenge",
"runtime.bgsweep",
"runtime.runfinq",
"runtime.gcBgMarkWorker",
"runtime.sysmon",
} {
if strings.Contains(stack, prefix) {
return true
}
}
return false
}
Be careful: filtering too aggressively hides real leaks. The above list is "things I have seen and verified are runtime"; a leak with a runtime.gopark topmost frame and a non-runtime created by line is still a leak.
goleak in Depth¶
VerifyTestMain with options¶
func TestMain(m *testing.M) {
goleak.VerifyTestMain(m,
goleak.IgnoreTopFunction("github.com/golang/glog.(*loggingT).flushDaemon"),
goleak.IgnoreTopFunction("go.opencensus.io/stats/view.(*worker).start"),
goleak.IgnoreCurrent(),
)
}
- IgnoreTopFunction("pkg.Fn") — drop goroutines whose topmost frame is pkg.Fn. Useful for known background workers in dependencies you cannot modify.
- IgnoreCurrent() — snapshot the goroutines alive at this call; treat them as the baseline for "no leaks." Useful when TestMain itself spawns long-lived workers before m.Run.
- IgnoreAnyFunction("pkg.Fn") — drop goroutines whose stack anywhere mentions pkg.Fn. More aggressive than IgnoreTopFunction.
- Cleanup(func(exitCode int)) — run a callback with the test exit code once the leak check finishes, instead of goleak calling os.Exit itself (VerifyTestMain only).
VerifyNone inside a single test¶
func TestWorkerShutsDown(t *testing.T) {
defer goleak.VerifyNone(t,
goleak.IgnoreTopFunction("internal/poll.runtime_pollWait"),
)
w := startWorker()
w.Stop()
}
The defer ensures VerifyNone runs after the test body, but note that it runs before any t.Cleanup callbacks, because the testing framework invokes those only after the test function (and its defers) returns. If your teardown lives in t.Cleanup, register the leak check first, via t.Cleanup(func() { goleak.VerifyNone(t) }), so that it runs last (cleanups run in LIFO order). If the worker did not actually stop, this test fails with the offender's stack printed.
Custom Option for tests with parallel subtests¶
func TestParallel(t *testing.T) {
snapshot := goleak.IgnoreCurrent()
t.Run("a", func(t *testing.T) {
t.Parallel()
defer goleak.VerifyNone(t, snapshot)
runScenarioA(t)
})
t.Run("b", func(t *testing.T) {
t.Parallel()
defer goleak.VerifyNone(t, snapshot)
runScenarioB(t)
})
}
IgnoreCurrent is captured once at the parent's entry; each child checks against it. Without this, parallel subtests see each other's goroutines and report false leaks.
When goleak is not enough¶
If a test legitimately needs to spawn a long-running goroutine that lives past the test (rare but real — e.g. a global initialiser), goleak will fail. Options:
1. Move the legitimate goroutine into TestMain and call goleak.IgnoreCurrent() after starting it.
2. Use IgnoreTopFunction with the exact function name.
3. Restructure the code so the goroutine has a Close() method and the test can call it.
Option 3 is almost always the right answer. The other two are escape hatches.
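A sketch of option 3, assuming nothing beyond the standard library; the type name and tick interval are my own:

```go
package main

import (
	"sync"
	"time"
)

// Poller owns one background goroutine and gives it a Close
// method, so a test can call Close and goleak sees no leak.
type Poller struct {
	stop chan struct{}
	wg   sync.WaitGroup
}

func StartPoller() *Poller {
	p := &Poller{stop: make(chan struct{})}
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		t := time.NewTicker(50 * time.Millisecond)
		defer t.Stop()
		for {
			select {
			case <-p.stop:
				return
			case <-t.C:
				// one unit of polling work per tick
			}
		}
	}()
	return p
}

// Close signals the goroutine and waits for it to exit, which is
// exactly the guarantee goleak.VerifyNone needs from a test.
func (p *Poller) Close() {
	close(p.stop)
	p.wg.Wait()
}

func main() {
	p := StartPoller()
	p.Close() // returns only after the goroutine is gone
}
```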
Filtering with pprof.SetGoroutineLabels¶
You can attach key-value labels to a goroutine. Subsequent goroutine profiles include those labels, and go tool pprof lets you filter on them.
import (
"context"
"runtime/pprof"
)
func handleRequest(ctx context.Context, req *Request) {
labels := pprof.Labels(
"subsystem", "billing",
"tenant", req.Tenant,
)
pprof.Do(ctx, labels, func(ctx context.Context) {
processRequest(ctx, req)
})
}
pprof.Do sets the labels for the current goroutine and any goroutines it spawns inside the callback. After the callback returns, labels are restored.
To filter, point go tool pprof at the endpoint with a tag filter: go tool pprof -tagfocus=subsystem=billing http://host:6060/debug/pprof/goroutine. Only billing goroutines appear. The CPU profile and block profile honour the same labels.
When to use it: you have a server that handles many subsystems, and a leak is concentrated in one. Labels let you find the subsystem in seconds without parsing every stack.
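The labels also travel through the context, which you can check programmatically with pprof.Label. A small sketch; the helper name is my own:

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// labelOf runs a callback under the given label set and reports
// what pprof.Label sees inside the callback's context.
func labelOf(key, value string) (got string, ok bool) {
	labels := pprof.Labels(key, value)
	pprof.Do(context.Background(), labels, func(ctx context.Context) {
		// Inside the callback both the current goroutine and the
		// derived context carry the labels.
		got, ok = pprof.Label(ctx, key)
	})
	return got, ok
}

func main() {
	v, ok := labelOf("subsystem", "billing")
	fmt.Println(v, ok)
}
```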
Programmatic Profile Capture¶
Hard-coded triggers:
import (
	"fmt"
	"log"
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

// On SIGUSR1, dump goroutine profile to /tmp/goroutines-<unix>.txt
func install() {
c := make(chan os.Signal, 1)
signal.Notify(c, syscall.SIGUSR1)
go func() {
for range c {
path := fmt.Sprintf("/tmp/goroutines-%d.txt", time.Now().Unix())
f, err := os.Create(path)
if err != nil {
log.Println(err)
continue
}
_ = pprof.Lookup("goroutine").WriteTo(f, 2)
f.Close()
log.Println("wrote", path)
}
}()
}
Now kill -USR1 <pid> writes a timestamped stack dump. Two snapshots five minutes apart give you a diff.
Threshold-based:
import (
	"context"
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func watchLeaks(ctx context.Context, threshold int) {
t := time.NewTicker(30 * time.Second)
defer t.Stop()
seen := 0
for {
select {
case <-ctx.Done():
return
case <-t.C:
n := runtime.NumGoroutine()
if n > threshold && n > seen+100 {
seen = n
				f, err := os.Create(fmt.Sprintf("/var/log/goroutines-%d.txt", time.Now().Unix()))
				if err != nil {
					log.Println(err)
					continue
				}
				_ = pprof.Lookup("goroutine").WriteTo(f, 1)
				f.Close()
log.Printf("leak watch: %d goroutines, dumped", n)
}
}
}
}
Self-instrumenting: when the count crosses a threshold and keeps climbing, automatically write a profile to disk. The on-call engineer wakes up with the evidence already collected.
Diffing Two Profiles¶
The protobuf format works with go tool pprof -base:
curl -s http://host:6060/debug/pprof/goroutine > t0.pb.gz
sleep 300
curl -s http://host:6060/debug/pprof/goroutine > t1.pb.gz
go tool pprof -base t0.pb.gz t1.pb.gz
(pprof) top
The top output shows only the delta: goroutines that appeared between t0 and t1. The top entry is your leak signature.
Text-format diffing for debug=1 is rougher but works:
diff <(awk '/^[0-9]+ @/ {n=$1; getline; print n, $0}' base.txt | sort) \
<(awk '/^[0-9]+ @/ {n=$1; getline; print n, $0}' now.txt | sort)
You get added lines (in now.txt) and counts that changed. New stacks are by definition the leak.
go tool pprof Interactive Session¶
$ go tool pprof http://host:6060/debug/pprof/goroutine
Fetching profile over HTTP from http://host:6060/debug/pprof/goroutine
Saved profile in /home/u/pprof/pprof.goroutine.001.pb.gz
Type: goroutine
(pprof) top
Showing nodes accounting for 5187, 100% of 5187 total
flat flat% sum% cum cum%
5102 98.36% 98.36% 5102 98.36% main.(*pollster).poll
12 0.23% 98.59% 12 0.23% net/http.(*conn).serve
...
Useful commands inside the prompt:
- top — top N by goroutine count.
- top -cum — sort by cumulative count (includes descendants).
- list poll — show source code annotated with sample counts.
- peek poll — show callers and callees of the matching function.
- traces — print every individual stack with its count.
- web — render an SVG (needs graphviz installed).
- tree — text-form call tree, useful in remote shells.
The list command is the magic moment: it shows you the function source with the line where each goroutine is parked underlined by the sample count.
gops Walk-through¶
Install once: go install github.com/google/gops@latest
Inside your program (only if you want the richer features):
import "github.com/google/gops/agent"
func main() {
if err := agent.Listen(agent.Options{}); err != nil {
log.Fatal(err)
}
// ... your program ...
}
Now from another terminal:
$ gops
12345 my-server go1.22.0 /home/u/bin/my-server
$ gops stack 12345
... full goroutine stack dump ...
$ gops stats 12345
goroutines: 5187
OS threads: 32
GOMAXPROCS: 8
num CPU: 8
$ gops memstats 12345
heap alloc: 4.8 GB
total alloc: 21.6 GB
GC cycles: 144
...
Useful when you cannot or will not expose pprof over HTTP — the gops agent uses a local socket, not the network. Great for daemons and CLIs.
runtime/trace for Lifetime Events¶
A goroutine trace captures every goroutine creation, blocking event, unblocking event, syscall, and GC pause. The viewer is a browser timeline.
Capture a 5-second trace:
import (
	"log"
	"os"
	"runtime/trace"
)
func main() {
f, _ := os.Create("trace.out")
defer f.Close()
if err := trace.Start(f); err != nil {
log.Fatal(err)
}
defer trace.Stop()
// ... do your work ...
}
Or via HTTP if net/http/pprof is registered: curl -o trace.out "http://host:6060/debug/pprof/trace?seconds=5"
View with: go tool trace trace.out
This opens a browser. The "Goroutine analysis" page lists each goroutine's lifetime. The "Goroutines" timeline shows creation and destruction events. A leak shows up as a goroutine bar that begins but never ends within the trace window.
For long-running leaks you may not see the end event (because there is no end), but you will see the creation, the function that owns it, and the user-visible region tag if you used trace.WithRegion.
Common Anti-Patterns¶
- Catching leaks only in production. By then it has already cost you outages. Move detection left: tests, CI, staging.
- Trusting a single NumGoroutine reading. Always take two with a delay; trends matter, not snapshots.
- Ignoring goleak failures with IgnoreTopFunction. Every ignore is a future bug.
- Setting the same label on every goroutine. Labels lose value when they have no cardinality. Use them on a hot path or a tenant boundary.
- Triggering pprof.WriteTo on every request. A profile dump on a 100k QPS endpoint will starve the runtime. Trigger on signal or on threshold, not on every call.
- Diffing debug=2 text by diff directly. Stacks have varying argument values and pointers; diff produces garbage. Always diff debug=1 (counts + unique stacks) or the protobuf form via -base.
- Leaving runtime/trace running for minutes. Trace files grow at megabytes per second. Keep windows short (5–10 seconds).
Self-Assessment¶
- I can fetch a goroutine profile from a running process and tell which stack has the most goroutines.
- I know the difference between debug=1 and debug=2 and when to use each.
- I can list at least five runtime-internal goroutines I should not flag as leaks.
- I have written a TestMain with goleak.VerifyTestMain and at least one IgnoreTopFunction.
- I can use pprof.Do to label a request's goroutines and filter a profile by tag.
- I can capture a goroutine profile on SIGUSR1 from my own code.
- I have used go tool pprof -base to diff two profiles.
- I can drive go tool pprof interactively (top, list, peek).
- I have installed gops and used it to inspect a running process.
- I can capture a runtime/trace and find a goroutine that never ends.
Summary¶
Middle-level leak detection is procedural. You learn the order of operations: confirm the symptom is goroutine-shaped, snapshot a baseline, snapshot the current state, diff, identify the highest-count stack, walk the code at that file:line, and patch. The tooling around this workflow — pprof goroutine?debug=1 for triage, pprof goroutine?debug=2 for stack reading, pprof.SetGoroutineLabels for filtering, goleak for tests, gops for live inspection, runtime/trace for lifetime events — covers everything from a unit test to a 2 AM incident. The senior file (senior.md) builds on this with production monitoring (Prometheus, OpenTelemetry, alerting); the professional file dives into the runtime internals that make these tools tick.