pprof Deep Dive — Middle

1. The full interactive command set

Once you go beyond top and list, the shell becomes a small DSL. Here's everything you'll actually use, in roughly the order you'll need it.

Command	What it does	When
`top [N]`	Top N by flat self	First pass
`top -cum [N]`	Sort by cumulative	"Where does this subtree spend its time?"
`top -flat`	Force flat sort	After switching cumulative=true
`list <regex>`	Annotated source for matching functions	Once you have a suspect
`disasm <regex>`	Annotated assembly	When source isn't enough (loops, branches)
`web`	Open SVG call graph	Big-picture shape
`weblist <regex>`	Source + assembly side by side, in a browser	Deep inspection
`peek <regex>`	Callers and callees with edge weights	"Who calls this hot function?"
`traces`	Print every sample as a full stack	Rare paths, attribution
`tree`	Caller-to-callee tree (text)	Like `web`, in the terminal
`granularity=lines|files|functions|addresses`	Aggregation level	Line-level when you suspect one expression
`focus=<regex>`	Keep only samples whose stack touches the regex	Slice to one subsystem
`ignore=<regex>`	Drop samples whose stack touches the regex	Hide noise (e.g., `runtime\.`)
`hide=<regex>`	Hide frames (not whole samples)	Clean up output
`show=<regex>`	Keep only matching frames	Force a narrow view
`sample_index=<name>`	Switch value column	Heap: inuse vs. alloc
`nodecount=N`	Limit nodes in graph views	Big profiles
`nodefraction=F`	Drop nodes below F (default 0.005)	Big profiles
`unit=<unit>`	Display unit (`ms`, `kb`, `mb`)	Readability
`o`	Print current options	Sanity-check filters
`quit` / Ctrl-D	Exit	—

granularity=lines is the underused one. With it, top shows you which line of source carried the cost, not just which function.

2. Focus, ignore, hide, show: the regex language

These four are the filter primitives:

focus=parser           # only samples whose stack contains a frame matching "parser"
ignore=runtime\.       # drop every sample with a runtime frame anywhere in its stack
hide=^runtime\.        # don't print runtime frames, but keep samples
show=^(main|service)\. # only show frames in main or service package

Regexes use RE2 syntax and match against names such as package.Func or package.(*Type).Method. The escape \. matters: an unescaped . matches any character, so runtime. would also match a name like runtimeX.

A common combo for "look at my code only":

ignore=runtime\.|reflect\.|encoding/
hide=^runtime\.

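Then top is dominated by your hot paths instead of allocator and reflection frames. Because ignore drops a sample if any frame in its stack matches, a pattern as broad as runtime\. can also discard samples that merely pass through runtime.main; if the profile comes back nearly empty, narrow the pattern (for example, runtime\.mallocgc).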

3. Peek: the function-centric view

(pprof) peek parseLine
Showing nodes accounting for 2.10s, 42.51% of 4.94s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            1.30s 61.90% |   main.handleRequest
                                            0.50s 23.81% |   main.parseHeader
                                            0.30s 14.29% |   main.parseFooter
     800ms 16.19% 16.19%      2.10s 42.51%                | main.parseLine
                                            0.90s 42.86% |   strings.Split
                                            0.60s 28.57% |   runtime.mallocgc
                                            0.20s  9.52% |   strconv.Atoi

The middle row is parseLine itself; above it are callers (and how much of parseLine's cum came from each); below are callees (and how much they cost). This is the fastest way to answer "is this function expensive because of what it calls or because of what it does?"

4. Granularity

By default, samples aggregate to functions. Change that:

(pprof) granularity=lines
(pprof) top

Now top ranks individual source lines. A function with one expensive line and many cheap ones suddenly tells the truth: you see exactly which line is hot.

granularity=files collapses to files — useful when one package dominates and you want a per-file split. granularity=addresses is rarely needed outside disassembly work.
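As a rough illustration of what changing the aggregation level means, the sketch below sums the same samples by function and by line. It is a toy model, not how pprof stores profiles; main.decode and the numbers are hypothetical.

```go
package main

import "fmt"

// sample is a simplified leaf-frame sample: where the time was spent and how much.
type sample struct {
	function string
	line     int
	ms       int
}

// byFunction sums cost per function, like the default granularity.
func byFunction(in []sample) map[string]int {
	out := map[string]int{}
	for _, s := range in {
		out[s.function] += s.ms
	}
	return out
}

// byLine sums cost per function:line, like granularity=lines.
func byLine(in []sample) map[string]int {
	out := map[string]int{}
	for _, s := range in {
		out[fmt.Sprintf("%s:%d", s.function, s.line)] += s.ms
	}
	return out
}

func main() {
	samples := []sample{
		{"main.decode", 13, 90},
		{"main.decode", 17, 700},
		{"main.decode", 17, 10},
	}
	fmt.Println(byFunction(samples)["main.decode"]) // 800
	fmt.Println(byLine(samples)["main.decode:17"])  // 710
	fmt.Println(byLine(samples)["main.decode:13"])  // 90
}
```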

5. The web UI in detail

go tool pprof -http=:8080 cpu.pb.gz

Five views, switched in the top-left menu:

Top. Sortable table. Search box filters by function name.
Graph. Hot subtrees pop visually. Right-click a node for "Focus" / "Hide" / "Show source".
Flame Graph. Width = share of selected sample type. Click to zoom; the breadcrumb at the top lets you zoom out.
Peek. Type a function name to see callers/callees.
Source. Annotated source for a chosen function.

Two non-obvious controls:

The "Refine" menu applies focus, ignore, hide, show interactively. The current URL encodes them, so the page is shareable.
The "Sample" menu in the top bar switches sample_index; a flame graph of alloc_objects looks very different from one of inuse_space.

-http=: (empty port) picks a free port at random. -no_browser prints the URL without opening anything — useful when you're SSH'd into a server with port forwarding.

6. Heap profiling, slightly more carefully

/debug/pprof/heap and /debug/pprof/allocs return the same underlying profile. The difference is the default sample_index:

Endpoint	Default `sample_index`
`/heap`	`inuse_space`
`/allocs`	`alloc_space`

So if you grab /heap and then ask for sample_index=alloc_space, you get exactly what /allocs would have shown you by default. Save one profile, query both.
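These endpoints only exist if the program registers them. A minimal sketch, assuming the service uses http.DefaultServeMux: a blank import of net/http/pprof registers the handlers, and the hypothetical probe helper below checks one of them in-process so the example runs without opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// probe sends an in-process GET to the default mux and returns the status code.
func probe(path string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// In a real service you would serve the mux instead, for example:
	//   go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()
	fmt.Println(probe("/debug/pprof/heap"))
	fmt.Println(probe("/debug/pprof/allocs"))
}
```

Binding to localhost rather than all interfaces is a deliberate choice here: the profiling endpoints expose internals you usually do not want reachable from outside the host.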

go tool pprof -alloc_objects heap.pb.gz is a synonym for go tool pprof -sample_index=alloc_objects heap.pb.gz.

# Bytes currently live on the heap (related to RSS, but not the same number)
go tool pprof -inuse_space -http=: heap.pb.gz

# Bytes ever allocated (GC pressure)
go tool pprof -alloc_space -http=: heap.pb.gz

If one function dominates alloc_objects but is invisible in inuse_space, it's a churn problem — short-lived allocations the GC handles fine but that cost CPU. The fix is usually a sync.Pool or pre-sized buffer, not a leak hunt.
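One way to apply the sync.Pool fix mentioned above is sketched below. formatRecord is a hypothetical hot-path function standing in for whatever dominates alloc_objects in your profile; whether a pool actually helps should be confirmed by re-profiling.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so each call does not allocate a new one.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// formatRecord is a hypothetical hot-path function that builds a short string.
func formatRecord(id int, name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old contents
	defer bufPool.Put(buf)

	fmt.Fprintf(buf, "%d:%s", id, name)
	return buf.String() // String copies the bytes, so reuse of buf is safe
}

func main() {
	fmt.Println(formatRecord(7, "alice")) // 7:alice
	fmt.Println(formatRecord(8, "bob"))   // 8:bob
}
```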

7. Diff two profiles

The single most valuable habit when optimizing: save a baseline, change one thing, diff.

curl -o before.pb.gz "http://localhost:6060/debug/pprof/profile?seconds=30"
# deploy a change
curl -o after.pb.gz  "http://localhost:6060/debug/pprof/profile?seconds=30"

go tool pprof -http=: -base=before.pb.gz after.pb.gz

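-base subtracts the baseline from the new profile. It is meant for cumulative profiles (for example, two block profiles taken from the same process at different times), and percentages are relative to the difference between the two totals. Negative deltas are still shown unless you set drop_negative.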

go tool pprof -http=: -diff_base=before.pb.gz after.pb.gz

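-diff_base is the comparison view: percentages are relative to the baseline's total, and in the web UI red boxes got worse while green boxes got better. Use this for before/after comparisons of separate runs, and in code review.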

Both require the profiles to have the same sample_type. You can't diff a CPU profile against a heap profile.
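To make the arithmetic of a signed comparison concrete, here is a toy model: per-function deltas, with percentages taken relative to the baseline total as the pprof documentation describes for -diff_base. The maps and numbers are invented; real profiles diff whole stacks, not flat per-function totals.

```go
package main

import "fmt"

// delta is the signed change for one function.
type delta struct {
	change  int     // after minus before, in the same unit as the inputs
	percent float64 // change as a percentage of the baseline total
}

// diffProfiles compares two flat per-function totals.
func diffProfiles(before, after map[string]int) map[string]delta {
	baseTotal := 0
	for _, v := range before {
		baseTotal += v
	}
	out := map[string]delta{}
	add := func(name string) {
		d := delta{change: after[name] - before[name]}
		if baseTotal != 0 {
			d.percent = 100 * float64(d.change) / float64(baseTotal)
		}
		out[name] = d
	}
	for name := range before {
		add(name)
	}
	for name := range after {
		add(name) // also covers functions that only exist in the new profile
	}
	return out
}

func main() {
	before := map[string]int{"main.parseLine": 2100, "main.encodeResponse": 900}
	after := map[string]int{"main.parseLine": 1500, "main.encodeResponse": 1200}
	d := diffProfiles(before, after)
	fmt.Printf("%+d ms (%.1f%%)\n", d["main.parseLine"].change, d["main.parseLine"].percent)           // -600 ms (-20.0%)
	fmt.Printf("%+d ms (%.1f%%)\n", d["main.encodeResponse"].change, d["main.encodeResponse"].percent) // +300 ms (10.0%)
}
```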

8. Goroutine profiles

go tool pprof http://localhost:6060/debug/pprof/goroutine

A goroutine profile has one sample per stack, not per goroutine. The count value is "how many goroutines are currently at this exact stack".

(pprof) top
Showing nodes accounting for 4203, 99.95% of 4205 total
      flat  flat%   sum%        cum   cum%
      4000 95.13% 95.13%       4000 95.13%  runtime.gopark
       200  4.76% 99.88%        200  4.76%  net/http.(*persistConn).readLoop

That 4000 is "4000 goroutines parked". peek runtime.gopark then shows what they're parked on. The text dump (?debug=2) is often easier:

curl -s "http://localhost:6060/debug/pprof/goroutine?debug=2" | less

…shows the full stack of every goroutine, along with its state and, for goroutines blocked for a minute or more, how long it has waited (for example, goroutine 42 [chan receive, 5 minutes]:). This is the fastest way to find a deadlock or stuck worker pool.

9. Block and mutex profiles

Both are disabled by default. Enable them in your program before they produce anything useful:

import "runtime"

func init() {
	// Set each rate once; a later call replaces the earlier setting.

	// Block profile: aim for one sample per 10000 ns spent blocked.
	// A rate of 1 records every blocking event (high overhead).
	runtime.SetBlockProfileRate(10000)

	// Mutex profile: report on average 1 in 100 contention events (1%).
	// A fraction of 1 reports every contention event.
	runtime.SetMutexProfileFraction(100)
}

In production, use a fraction (e.g., 100) to bound overhead. In a benchmark, 1 is fine.

go tool pprof -http=: http://localhost:6060/debug/pprof/block
go tool pprof -http=: http://localhost:6060/debug/pprof/mutex

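block shows time goroutines spent waiting on synchronization primitives: channel operations, select, sync.Mutex, sync.Cond, and sync.WaitGroup. It does not cover time.Sleep, system calls, or network I/O waits. mutex covers only sync.Mutex / sync.RWMutex contention, and attributes the delay to the code that held the lock rather than to the goroutine that waited. They overlap but are distinct.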

10. The `traces` command

(pprof) traces
-----------+-------------------------------------------------
       80ms   runtime.mallocgc
              runtime.makeslice
              main.parseLine
              main.handleRequest
              main.(*server).serveHTTP
-----------+-------------------------------------------------
       30ms   runtime.mallocgc
              runtime.makeslice
              main.encodeResponse
              main.handleRequest
              main.(*server).serveHTTP

Every sample, with its full stack and its flat value. Slow to read for big profiles but invaluable when:

You suspect one specific call path is responsible.
The graph view collapsed the path you care about.
You want to grep for a string in your stacks.

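The output is plain text, so ordinary tools work on it. From the command line, go tool pprof -traces cpu.pb.gz | grep parseLine prints every frame line that mentions parseLine; to see only the complete stacks that pass through it, set focus=parseLine before running traces.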

11. List, with line-level precision

(pprof) granularity=lines
(pprof) list parseLine
ROUTINE ======================== main.parseLine in /home/me/app/parse.go
     800ms      2.10s (flat, cum) 42.51% of Total
         .          .     12:func parseLine(s string) (r Record, err error) {
      90ms       90ms     13:    parts := strings.Split(s, ",")
      10ms       10ms     14:    for _, p := range parts {
         .       50ms     15:        n, err := strconv.Atoi(p)
         .          .     16:        if err != nil { return Record{}, err }
     700ms      1.30s     17:        r.values = append(r.values, n)
         .          .     18:    }
         .          .     19:    return r, nil
         .          .     20:}

Read it line by line. The append on line 17 is itself 700 ms flat and 1.30 s cumulative. Why? Because r.values starts empty, so append keeps outgrowing its backing array and calling into runtime.growslice. The fix: pre-size r.values = make([]int, 0, len(parts)).
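A sketch of the pre-sized version follows. Only the values field of Record is known from the listing, so the type definition here is an assumption made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Record is a stand-in for the type in the listing; only values is known.
type Record struct {
	values []int
}

// parseLine parses comma-separated integers, allocating the slice once.
func parseLine(s string) (Record, error) {
	parts := strings.Split(s, ",")
	// Capacity is known up front, so append never has to grow the slice.
	r := Record{values: make([]int, 0, len(parts))}
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return Record{}, err
		}
		r.values = append(r.values, n)
	}
	return r, nil
}

func main() {
	r, err := parseLine("10,20,30")
	fmt.Println(r.values, cap(r.values), err) // [10 20 30] 3 <nil>
}
```

After a change like this, re-run the profile and list the function again to confirm the hot line has actually cooled down.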

This is the form in which pprof answers most real questions. Get comfortable with it.

12. Combining profiles

go tool pprof a.pb.gz b.pb.gz c.pb.gz does a union: same-stack samples are added. Useful for:

Aggregating profiles from many replicas of the same service.
Combining short profiles into one with enough samples to be useful.
Building a "fleet-wide" view from a sample set.

The profiles must agree on sample_type. Mixing CPU profiles from different binaries also works — symbols are embedded — but the union only makes sense if the binaries share code.

13. Reading the graph view

Each node shows the function name, flat (flat%), and cum (cum%). Edge width = how much of the parent's cumulative time was spent in the child. A wide edge into a single child means that child explains the parent.

Three patterns to recognize:

Wide self-loop or wide flat in a leaf — the function itself is slow.
One wide outgoing edge — the function exists mainly as a wrapper; look at the child.
Many narrow outgoing edges — the function dispatches; cost is distributed.

The default nodefraction=0.005 hides anything below 0.5%. For complex programs, bump it to 0.02 to declutter; for missing context, drop to 0.001.

14. The `--seconds` trap

go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

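If your workload only runs for 5 seconds inside that 30-second window, the other 25 seconds contribute almost no samples. Percentages in pprof are relative to the samples collected, not to wall time, so idle time does not shrink them directly; instead the header reports something like Duration: 30s, Total samples = 5s (16.67%), and background work such as GC and timers takes up a larger share of a small sample set. The profile will look like "CPU is mostly idle" because, well, it was.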

Match seconds=N to your workload. For HTTP services under load, 30 s is fine. For short batch jobs, profile from inside the program with pprof.StartCPUProfile instead.

15. Summary

The interactive shell is a small filter language: focus, ignore, hide, show, sample_index, granularity. The web UI exposes all of that interactively. peek answers "what calls this?" and "what does this call?"; traces answers "show me every sample". Saving profiles to disk and diffing with -base or -diff_base turns optimization from guesswork into a measured loop. Beyond this, the topics are: how the format works (senior), how labels and continuous profiling work (professional), and how to read flame graphs really well (optimize).