pprof and Profiling Tools — Middle Level¶
Table of Contents¶
- Introduction
- The Full Profile Catalogue
- Mutex and Block Profiles in Depth
- Heap, Allocs, and the gc Parameter
- go tool pprof REPL Mastery
- The Web UI and Flame Graphs
- runtime/trace for Scheduler Insight
- Profile Diffs
- Production Snapshot Discipline
- Testing with pprof
- Common Anti-Patterns
- Diagnostics Playbook
- Self-Assessment
- Summary
Introduction¶
At junior level you learned the surface: how to expose pprof, take a profile, and run top in the REPL. At middle level you treat pprof as a daily tool. You know all eight profile types and when to reach for each. You read flame graphs fluently. You compare profiles across time. You combine runtime/trace with pprof when scheduler behaviour matters. And you have hardened pprof so it can ship to production safely.
After this file you will:
- Know every profile in the standard library and what it samples.
- Enable mutex and block profiles correctly and read their output.
- Use go tool pprof filters: -focus, -ignore, -hide, -show, -tagfocus.
- Diff two heap profiles with -base.
- Capture and analyse a runtime/trace and align it with pprof data.
- Write tests that compare pre/post goroutine counts.
- Recognise the anti-patterns: pprof on 0.0.0.0, infinite-window CPU profiles, ignoring the gc flag.
For the deeper material — labels, continuous profiling backends, custom profile registration — see senior.md. For internals — how the sampler decides what to record, the protobuf format itself — see professional.md.
The Full Profile Catalogue¶
The pprof registry exposes eight well-known profile types. Memorise them; they are the menu.
goroutine¶
Snapshot of every live goroutine. Use first for leaks, stuck services, and "why is runtime.NumGoroutine() so high".
- HTTP: /debug/pprof/goroutine
- Three debug levels: 0 binary, 1 grouped text, 2 every goroutine.
- Briefly stops the world; cheap unless the goroutine count is huge.
heap¶
Sampled allocations that are still in use (have not been garbage collected). The default view shows in-use memory by call site.
- HTTP: /debug/pprof/heap
- Useful flags: ?gc=1 forces GC before sampling.
- Sample rate: runtime.MemProfileRate (default 512 KB).
allocs¶
Total allocations since program start. Same data source as heap but a different view: it shows everything sampled, not just what is still live.
- HTTP: /debug/pprof/allocs
- CLI: go tool pprof -alloc_objects or -alloc_space.
threadcreate¶
Stack at the point each OS thread was created. Useful when the OS thread count (the M count) climbs.
- HTTP: /debug/pprof/threadcreate
- Always enabled; the output is rarely large.
block¶
Goroutines blocked on synchronisation primitives: channel send/receive, select, sync.Mutex.Lock, sync.WaitGroup.Wait, sync.Cond.Wait. (time.Sleep is not recorded.)
- HTTP: /debug/pprof/block
- Off by default. Enable with runtime.SetBlockProfileRate(rate). The rate is in nanoseconds: a value of 1 records every blocking event; a value of 100000 (100 microseconds) fully records events that blocked at least that long and samples shorter events in proportion to their duration.
mutex¶
Contended sync.Mutex.Lock and sync.RWMutex.Lock calls — specifically the ones that had to wait.
- HTTP: /debug/pprof/mutex
- Off by default. Enable with runtime.SetMutexProfileFraction(rate). A rate of 1 samples every contention event, 100 samples about 1%.
profile (CPU)¶
Periodic CPU sampling. Duration-based. The HTTP endpoint takes a seconds=N parameter.
- HTTP: /debug/pprof/profile?seconds=30
- Default sample rate: 100 Hz.
- Overhead: ~5% during collection.
trace¶
A full execution trace, not a sampled profile. Captured by runtime/trace. Massive output; richest signal.
- HTTP: /debug/pprof/trace?seconds=5
- Inspect with go tool trace trace.out, not go tool pprof.
A note on cmdline and symbol¶
The endpoints /debug/pprof/cmdline and /debug/pprof/symbol are not profiles. The first returns the binary's command-line arguments; the second resolves PC addresses to symbol names. They exist to support the pprof tool when fetching profiles over HTTP.
Mutex and Block Profiles in Depth¶
Most production services run with both off. To debug a contention problem you must enable one — usually at startup, behind a feature flag.
import (
    "runtime"
    "time"
)

func enableContentionProfiles() {
    // Sample 1 in 100 contended mutex events.
    runtime.SetMutexProfileFraction(100)
    // Fully record blocking events of 1ms or longer; shorter events
    // are sampled in proportion to their duration.
    runtime.SetBlockProfileRate(int(time.Millisecond))
}
Tuning the rates¶
For SetBlockProfileRate:
- 0 (default): off.
- 1: every event. Expensive.
- N: fully record events that blocked at least N nanoseconds; sample shorter events in proportion to their duration.
For SetMutexProfileFraction:
- 0 (default): off.
- 1: every contention.
- N: about 1 in N.
In production, start with SetMutexProfileFraction(100) and SetBlockProfileRate(int(time.Millisecond)). That gives meaningful samples without significant overhead.
Reading a block profile¶
(pprof) top
Showing nodes accounting for 4.2s, 100% of 4.2s total
flat flat% sum% cum cum%
3.5s 83.33% 83.33% 3.5s 83.33% main.(*Cache).Get.func1
0.7s 16.67% 100% 0.7s 16.67% main.worker
The numbers are time spent blocked, not wall-clock time. If three goroutines all blocked for 1 second simultaneously on the same call site, the profile shows 3 seconds.
Reading a mutex profile¶
(pprof) top
Showing nodes accounting for 2.1s, 100% of 2.1s total
flat flat% sum% cum cum%
1.8s 85.71% 85.71% 1.8s 85.71% sync.(*Mutex).Lock /shared/cache.go:42
0.3s 14.29% 100% 0.3s 14.29% sync.(*RWMutex).RLock
Each sample is the time other goroutines spent waiting for the lock, attributed to the call site that released it. A high number means "this lock is a bottleneck."
Heap, Allocs, and the gc Parameter¶
The heap profile records sampled allocations. Two pivots are important.
In-use vs alloc¶
# memory currently in use (default)
go tool pprof http://host:6060/debug/pprof/heap
# memory ever allocated (often gives more clues for GC pressure)
go tool pprof -alloc_space http://host:6060/debug/pprof/allocs
A function that allocates 10 GB but releases it all immediately shows nothing in inuse_space but huge in alloc_space.
Forcing GC for a clean view¶
By default the heap profile includes objects that are dead but have not been collected yet. To see only live memory, pass gc=1:
curl -o heap.prof 'http://host:6060/debug/pprof/heap?gc=1'
The HTTP handler runs runtime.GC() first. Slight pause, much cleaner data.
The four heap views¶
| View | Flag | Meaning |
|---|---|---|
| inuse_space | default | bytes still allocated |
| inuse_objects | -inuse_objects | object count still allocated |
| alloc_space | -alloc_space | total bytes ever allocated |
| alloc_objects | -alloc_objects | total objects ever allocated |
Switch live in the REPL with sample_index, e.g. (pprof) sample_index = alloc_space.
go tool pprof REPL Mastery¶
The REPL has more commands than top and list. The useful ones:
top¶
(pprof) top # top 10 by flat
(pprof) top 30 # top 30
(pprof) top -cum # sort by cumulative
(pprof) top -flat # default
- Flat: time spent in this function itself.
- Cum: time in this function plus everything it called.
For a goroutine profile, "time" is "goroutine count."
list¶
(pprof) list main.worker
ROUTINE ======================== main.worker in /path/main.go
0 5 | func worker(ch <-chan int) {
0 5 | for v := range ch {
5 5 | process(v)
0 0 | }
0 0 | }
Left column is flat, right is cum.
peek¶
Shows the callers and callees of functions matching a regex, without the source listing.
traces¶
(pprof) traces
-----------+-------------------------------------------------------
5 main.worker
main.dispatch
main.main
Each unique stack is shown once, with the sample count. For a goroutine profile, this is essentially ?debug=1 but interactive.
web¶
Opens a graph rendering in your browser (requires graphviz). Each node is a function; each edge a caller→callee relationship; node size is proportional to flat samples; edge weight is sample count.
Filters¶
| Flag | Effect |
|---|---|
| -focus=regex | only samples matching the regex |
| -ignore=regex | drop samples matching the regex |
| -hide=regex | hide functions but keep their cost in the parent |
| -show=regex | inverse of -hide |
| -tagfocus=key=val | only samples with this label |
| -tagignore=key=val | drop samples with this label |
Examples, inside the REPL:
(pprof) focus=main\.worker
(pprof) ignore=runtime\.
The same flags work on the command line:
go tool pprof -focus='main\.worker' -ignore='runtime\.' cpu.prof
The Web UI and Flame Graphs¶
Launch the web UI with go tool pprof -http=:9090 cpu.prof, then open http://localhost:9090. Five views in the dropdown:
- Top — same as the REPL top.
- Graph — the graphviz call graph.
- Flame Graph — the icicle view.
- Source — annotated source listing.
- Peek — callers.
How to read a flame graph¶
- x-axis: relative cost (CPU time, goroutine count, allocation bytes — depends on the profile).
- y-axis: stack depth. The leaf end of each tower (the frame farthest from the root) is the function actually doing work.
- Width: the wider the rectangle, the more cost.
- Colour: usually random or by package — not meaningful unless your tool sets a colour scheme.
A leak hunt example: in a goroutine flame graph, the widest leaf rectangle is the frame on which most leaked goroutines are parked. Click it; it zooms in. The frames between it and the root are the call path that created those goroutines.
Icicle vs flame¶
The default go tool pprof web flame graph is icicle-shaped (stack grows down). Brendan Gregg's classical flame graph is upside-down (stack grows up). Same data, mirrored.
runtime/trace for Scheduler Insight¶
Profiles tell you what code is running. Traces tell you when and how the scheduler interleaved it. The trace records every:
- Goroutine create / start / end / block / unblock
- GC start / stop
- Network poller wakeup
- Syscall enter / exit
- User-defined regions and tasks
Capturing a trace programmatically¶
import (
    "os"
    "runtime/trace"
)
f, _ := os.Create("trace.out")
defer f.Close()
trace.Start(f)
defer trace.Stop()
// ... work to be traced ...
Capturing over HTTP¶
Fetch it with curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5', then run go tool trace trace.out. The browser opens with views:
- View trace — the timeline. Each goroutine on a horizontal lane.
- Goroutine analysis — per-function aggregates.
- Synchronization blocking profile — like the block profile but tied to specific events.
- User-defined tasks / regions — if you used trace.NewTask and trace.WithRegion.
- Scheduler latency profile — time goroutines spent runnable but not running.
When to reach for trace instead of pprof¶
| Question | Better tool |
|---|---|
| Which function uses the most CPU? | CPU profile |
| Which goroutines exist right now? | goroutine profile |
| Where do my goroutines block? | block profile or trace |
| Why is my p99 latency bad? | trace |
| Is my GC interfering with requests? | trace |
| Do my goroutines wait long to be scheduled? | trace |
Profile Diffs¶
The -base flag compares two profiles. The output is current - base. Positive numbers grew, negative shrank.
# baseline at startup
curl -o g1.prof http://localhost:6060/debug/pprof/goroutine
# after suspect load
curl -o g2.prof http://localhost:6060/debug/pprof/goroutine
go tool pprof -base g1.prof g2.prof
A leak hunt then becomes: (pprof) top and look at the largest positive numbers. Those are the call sites that gained goroutines between the two snapshots.
Heap diffs work the same way:
curl -o h1.prof 'http://localhost:6060/debug/pprof/heap?gc=1'
# ... wait under load ...
curl -o h2.prof 'http://localhost:6060/debug/pprof/heap?gc=1'
go tool pprof -base h1.prof h2.prof
Production Snapshot Discipline¶
Most teams hit pprof reactively — only after an incident. The mature pattern is to capture profiles continuously and routinely, so when something goes wrong you have history.
Hourly snapshots¶
import (
    "context"
    "fmt"
    "log"
    "os"
    "path/filepath"
    "runtime/pprof"
    "time"
)

func snapshotLoop(ctx context.Context, dir string) {
    t := time.NewTicker(time.Hour)
    defer t.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-t.C:
            ts := time.Now().UTC().Format("20060102T150405Z")
            for _, name := range []string{"goroutine", "heap", "allocs"} {
                path := filepath.Join(dir, fmt.Sprintf("%s-%s.prof", name, ts))
                f, err := os.Create(path)
                if err != nil {
                    log.Printf("snapshot: %v", err)
                    continue
                }
                _ = pprof.Lookup(name).WriteTo(f, 0)
                _ = f.Close()
            }
        }
    }
}
Rotate older files to S3 or just keep the last N. Disk is cheap.
Signal-triggered snapshots¶
ch := make(chan os.Signal, 1)
signal.Notify(ch, syscall.SIGUSR1)
go func() {
    for range ch {
        path := fmt.Sprintf("/tmp/g-%d.prof", time.Now().Unix())
        f, _ := os.Create(path)
        _ = pprof.Lookup("goroutine").WriteTo(f, 0)
        _ = f.Close()
        log.Printf("snapshot at %s", path)
    }
}()
kill -USR1 <pid> from an on-call shell.
Snapshots from the orchestrator¶
In Kubernetes, exec into the pod and fetch the profile over loopback (assuming the image ships wget or curl), then analyse offline:
kubectl exec <pod> -- wget -qO- 'http://127.0.0.1:6060/debug/pprof/goroutine?debug=2' > goroutines.txt
Testing with pprof¶
A useful pattern in tests: assert that a function does not leak goroutines.
func TestNoLeak(t *testing.T) {
    before := runtime.NumGoroutine()
    doWork()
    time.Sleep(50 * time.Millisecond) // allow background goroutines to settle
    after := runtime.NumGoroutine()
    if after > before {
        // dump for diagnosis
        buf := new(bytes.Buffer)
        _ = pprof.Lookup("goroutine").WriteTo(buf, 1)
        t.Fatalf("leaked %d goroutines\n%s", after-before, buf.String())
    }
}
Better: use go.uber.org/goleak (covered in 02-detecting-leaks/). The pprof dump on failure makes diagnosis painless.
Compile-time profile injection¶
The go test command has flags for profiles:
go test -cpuprofile cpu.prof -memprofile mem.prof -mutexprofile mu.prof -blockprofile bl.prof
go tool pprof cpu.prof
For benchmarks:
go test -bench=. -cpuprofile cpu.prof -memprofile mem.prof
Common Anti-Patterns¶
Anti-pattern: pprof on 0.0.0.0¶
If the host is on the internet, pprof is too. Always 127.0.0.1:6060.
Anti-pattern: massive CPU windows¶
Ten-minute profiles are rarely useful and produce huge files. Two 30-second profiles are better than one 600-second profile.
Anti-pattern: ignoring gc=1 on heap¶
Without ?gc=1, your heap profile contains dead objects waiting to be collected. Often the "leak" you are chasing is just retained garbage.
Anti-pattern: enabling mutex profile at fraction 1 in production¶
Every contention is sampled. The runtime spends measurable time recording. Use 100 or higher.
Anti-pattern: comparing CPU profiles of different durations¶
A 10-second profile and a 60-second profile are not directly comparable in absolute numbers. Always match seconds=.
Anti-pattern: importing net/http/pprof in a library¶
The package mutates the default mux via init(). Libraries should never do this. Only main packages should import it.
Diagnostics Playbook¶
| Symptom | First step | Second step |
|---|---|---|
| Memory growing | curl heap?gc=1, compare two | allocs for total churn |
| CPU pegged | profile?seconds=30 | list the hot functions |
| Goroutine count growing | goroutine?debug=1 | ?debug=2 for state |
| p99 latency spike | trace?seconds=5 | scheduler latency view |
| Lock-heavy code | enable mutex profile | top -cum, then list |
| Channel waits | enable block profile | distinguish chan vs mutex stacks |
| Too many threads | threadcreate | look for cgo or syscall stacks |
A leak hunt walk-through¶
- curl -s 'host:6060/debug/pprof/goroutine?debug=1' | head -1 for a quick count.
- Wait 1 minute under steady load. Repeat.
- If the count grew: curl -o g1.prof host:6060/debug/pprof/goroutine, wait, then again with -o g2.prof.
- go tool pprof -base g1.prof g2.prof.
- (pprof) top -cum.
- The top stack is where goroutines are accumulating. Read it with list.
A latency walk-through¶
- curl -o trace.out 'host:6060/debug/pprof/trace?seconds=5' during a busy period.
- go tool trace trace.out.
- Synchronization blocking profile for blocking culprits.
- Scheduler latency profile for "runnable but not running" time.
- View trace if you need to see one specific request.
Self-Assessment¶
- I can list all eight built-in profile types and what they sample.
- I know how to enable mutex and block profiles and at what rates.
- I use ?gc=1 when looking at heap.
- I read flame graphs by stack depth and width.
- I diff profiles with -base.
- I capture a runtime/trace and know what to look at first.
- I never bind pprof to a public address.
- I have hourly snapshots wired in at least one service.
- I can spot the anti-patterns: 0.0.0.0, fraction=1 in prod, library import of pprof.
Summary¶
Middle-level pprof is about fluency. The toolchain has a small surface — eight profile types, a handful of REPL commands, a few CLI flags — but you only get value when you use them habitually. Enable mutex and block profiles where they help. Diff profiles instead of staring at one. Pair runtime/trace with pprof when latency is the question. Keep pprof on a private port. Snapshot continuously so you have a baseline. The combination of goroutine?debug=1 for triage, -base for diffs, and the web UI for visualisation handles most production debugging without ever stepping into a debugger.