pprof Deep Dive: Professional

1. From "I can profile" to "we run profiling"

A senior engineer can read a profile. A professional builds the system that makes profiles cheap to collect, safe to expose, easy to compare across releases, and routine to inspect during incidents. The rest of this file is the work that turns pprof from a debugging tool into a production capability.

The four pillars:

A locked-down profiling endpoint on every service.
Continuous profiling that captures profiles automatically, week in, week out.
Label conventions so profiles can be sliced by route, tenant, version.
A small set of runbooks that everyone on call has read.

2. Production-safe pprof endpoint

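import (
    "log"
    "net/http"
    "net/http/pprof"
    "runtime"
    "time"
)

func startDebugServer(addr string) {
    mux := http.NewServeMux()

    // pprof.Index also serves the named profiles (heap, goroutine, allocs, block, mutex, threadcreate).
    mux.HandleFunc("/debug/pprof/", pprof.Index)
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

    // Enable block and mutex profiling with bounded overhead.
    runtime.SetBlockProfileRate(10000)   // on average one sample per 10 µs spent blocked
    runtime.SetMutexProfileFraction(100) // sample about 1 in 100 contention events

    srv := &http.Server{
        Addr:              addr, // e.g. "127.0.0.1:6060"
        Handler:           mux,
        ReadHeaderTimeout: 5 * time.Second,
    }
    go func() {
        // Surface bind failures instead of dropping the error silently.
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Printf("debug server: %v", err)
        }
    }()
}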

Operational rules:

Bind to 127.0.0.1 (or a Unix socket). pprof endpoints leak source-level information; never expose them on a public port.
Use a dedicated mux and port. Importing "net/http/pprof" (including the blank form _ "net/http/pprof") registers its handlers on http.DefaultServeMux as a side effect. If you also serve your application on DefaultServeMux, you've published your profiles. Use an explicit mux for both the application listener and the debug listener.
Enable block/mutex with a fraction, not 1. A rate of 1 records every event and can cost noticeable CPU under contention.
Don't authenticate via cookies. If you must expose the port through a proxy, use mTLS or a header secret.

Access pattern: kubectl port-forward from a workstation to the localhost-only port. No NetworkPolicy hole, no auth proxy.

3. Continuous profiling: what and why

The case for continuous profiling: most performance regressions land in production and stay invisible for weeks. By the time someone notices, the responsible commit is buried. Continuous profiling solves this by retaining profiles over a long window and letting you diff any two points in time (the equivalent of -diff_base in go tool pprof).

Modern options:

Tool	What it is	Storage model
Pyroscope / Grafana Pyroscope	Self-hosted; in-process SDK pushes, or a collector scrapes `/debug/pprof/*`	Time-series indexed by labels
Parca	Self-hosted; eBPF + pprof agent	Open-source, kube-native
GCP Cloud Profiler	Managed; agent pushes from inside the process	Linked to GCP project
AWS CodeGuru Profiler	Managed; similar to GCP	Linked to AWS account
Polar Signals Cloud	Managed Parca	Hosted
Datadog / NR continuous profiler	Managed APM extension	Tied to vendor APM

The pattern is the same regardless of vendor: a sidecar or in-process agent collects profiles at a fixed cadence (e.g., 30 s every 10 min), tags them with service/version/host/labels, and uploads.

4. Pyroscope-style agent setup

import (
    "log"
    "os"

    "github.com/grafana/pyroscope-go"
)

// buildInfo.Version is assumed to be populated by your build (for example via -ldflags).
if _, err := pyroscope.Start(pyroscope.Config{
    ApplicationName: "checkout-api",
    ServerAddress:   "http://pyroscope:4040",
    Tags: map[string]string{
        "version": buildInfo.Version,
        "env":     "prod",
        "region":  os.Getenv("REGION"),
    },
    ProfileTypes: []pyroscope.ProfileType{
        pyroscope.ProfileCPU,
        pyroscope.ProfileInuseObjects,
        pyroscope.ProfileInuseSpace,
        pyroscope.ProfileAllocObjects,
        pyroscope.ProfileAllocSpace,
        pyroscope.ProfileGoroutines,
        // Mutex and block types stay empty unless runtime.SetMutexProfileFraction
        // and runtime.SetBlockProfileRate have been set.
        pyroscope.ProfileMutexCount,
        pyroscope.ProfileMutexDuration,
        pyroscope.ProfileBlockCount,
        pyroscope.ProfileBlockDuration,
    },
}); err != nil {
    log.Fatalf("pyroscope: %v", err)
}

The agent itself uses runtime/pprof under the hood. The crucial value-add is label propagation: any pprof.Labels you attach with pprof.Do become Pyroscope query dimensions. The Go runtime currently records labels only in CPU and goroutine profiles, so this slicing applies to those types and not to heap, block, or mutex data. You can answer "show me the CPU flame graph for /checkout requests from tenant=acme in the last hour" without writing any new code.

5. Label conventions

Pick a small, consistent label vocabulary across all your services. A starting set:

Label	Cardinality	Purpose
`service`	low	Service name
`version`	low	Build / commit
`env`	low	dev / staging / prod
`region`	low	Cloud region
`route`	medium	HTTP route pattern (`/api/v1/orders/:id`, not `/api/v1/orders/4839`)
`endpoint`	medium	RPC method name
`tenant`	high (in multi-tenant)	Customer ID
`priority`	low	Request priority class

Cardinality matters. Pyroscope and Parca index by labels; high-cardinality labels (user_id, request_id) bloat the index and slow queries. Use route (the pattern), not the path with IDs.

Apply labels at the boundary of each request:

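// Uses "runtime/pprof" (labels), not "net/http/pprof" (HTTP handlers).
func instrumented(next http.HandlerFunc, routePattern string) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        labels := pprof.Labels(
            "route", routePattern,
            "method", r.Method,
        )
        // pprof.Do attaches the labels to both the context and the current goroutine,
        // and restores the previous labels when the function returns.
        pprof.Do(r.Context(), labels, func(ctx context.Context) {
            next(w, r.WithContext(ctx))
        })
    }
}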

Wire this into your router middleware. Now every CPU sample taken during a handler invocation carries the route — a huge improvement in profile readability.
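Because a mislabeled middleware fails silently, it can be worth checking that the label really reaches the handler. The sketch below does that with pprof.Label and an in-memory request. withRouteLabel and routeSeenBy are hypothetical helper names, and the route values are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime/pprof"
)

// withRouteLabel labels the handler's goroutine and context with the route
// pattern for the duration of the request.
func withRouteLabel(routePattern string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		labels := pprof.Labels("route", routePattern, "method", r.Method)
		pprof.Do(r.Context(), labels, func(ctx context.Context) {
			next(w, r.WithContext(ctx))
		})
	}
}

// routeSeenBy sends one request through the middleware and reports the
// "route" label that the inner handler observes on its context.
func routeSeenBy(routePattern, path string) string {
	var seen string
	h := withRouteLabel(routePattern, func(w http.ResponseWriter, r *http.Request) {
		seen, _ = pprof.Label(r.Context(), "route")
	})
	h(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	return seen
}

func main() {
	// The concrete path contains an ID; the label carries only the pattern.
	fmt.Println(routeSeenBy("/api/v1/orders/:id", "/api/v1/orders/4839"))
	// prints "/api/v1/orders/:id"
}
```

The same check works as a unit test for any router integration, since it exercises only the handler and needs no listening socket.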

6. Labels and goroutines

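pprof.Do labels the current goroutine, and a goroutine started while the function runs inherits the label set of its creator. Labels do not follow work that is handed to a goroutine created elsewhere, such as a long-lived worker pool started at boot. Those goroutines keep whatever labels their creator had (usually none) unless you set them explicitly:

labels := pprof.Labels("workunit", "batchJobX")

for i := 0; i < N; i++ {
    go func(i int) {
        // The goroutine starts with its creator's labels (none, if the creator is unlabeled).
        // Replace them with the labels carried by ctx plus the work-unit label.
        pprof.SetGoroutineLabels(pprof.WithLabels(ctx, labels))
        worker(i)
    }(i)
}

A cleaner idiom:

go func(i int) {
    pprof.Do(ctx, pprof.Labels("worker_id", strconv.Itoa(i)), func(ctx context.Context) {
        worker(ctx, i)
    })
}(i)

In a continuous profiling deployment, this means every goroutine launched outside pprof.Do is unlabeled by default, and pooled goroutines never pick up the labels of the request they are serving. Plan for it: spawn helper goroutines from inside a pprof.Do block, or re-label at the start of each unit of work.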
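One way to check which labels a spawned goroutine actually carries is to look at the goroutine profile in text form, which prints a "# labels" line for labeled goroutines. The runtime/pprof documentation states that a new goroutine inherits the labels of the goroutine that created it, and the sketch below observes that. childLabelsVisible is a hypothetical helper, and the match relies on the current text layout of the debug=1 output, which is not a stable API.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/pprof"
	"strings"
)

// childLabelsVisible starts a goroutine inside pprof.Do, lets Do return, and
// then reports whether the goroutine profile still shows the label. Once Do
// has returned, the caller's own labels are restored, so a match can only
// come from the child goroutine.
func childLabelsVisible() bool {
	release := make(chan struct{})
	done := make(chan struct{})

	pprof.Do(context.Background(), pprof.Labels("workunit", "batchJobX"), func(ctx context.Context) {
		go func() { // created inside Do, so it inherits the creator's labels
			<-release
			close(done)
		}()
	})

	var buf bytes.Buffer
	// debug=1 groups identical stacks and prints their labels.
	_ = pprof.Lookup("goroutine").WriteTo(&buf, 1)

	close(release)
	<-done
	return strings.Contains(buf.String(), `"workunit":"batchJobX"`)
}

func main() {
	fmt.Println("child goroutine carries the label:", childLabelsVisible())
}
```

Moving the go statement outside the pprof.Do call is a quick way to see the opposite case, where the child starts with no labels.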

7. Continuous profiling vs. on-demand

You'll have both. Continuous profiling lives in the background, gives you the long view, and powers regression detection. On-demand pprof lives on each pod's debug port and gives you the deep dive during an incident.

Use case	Tool
"Did this release regress CPU?"	Continuous (diff two windows)
"Why is this one pod slow right now?"	On-demand `pprof -http`
"What allocates over the next 24 h?"	Continuous
"Pod has 50k goroutines stuck"	On-demand `/debug/pprof/goroutine?debug=2`
"Mutex contention spike at 14:23 yesterday"	Continuous
"Reproducer in a load test"	On-demand, save profiles to a folder

Treat them as separate disciplines that share a tool.

8. Release-gate with profile diffs

Wire profile collection into your CI/CD:

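# pseudocode pipeline step
- name: "Collect baseline and candidate profiles"
  run: |
    # Distinct local ports so both forwards can run at the same time.
    kubectl port-forward svc/myapp-baseline 6060:6060 & PF1=$!
    kubectl port-forward svc/myapp-canary 6061:6060 & PF2=$!
    sleep 5
    # Capture both windows concurrently so they cover the same traffic period.
    # -f makes curl fail on HTTP errors instead of saving an error page as a profile.
    curl -sf -o baseline.pb.gz "http://localhost:6060/debug/pprof/profile?seconds=60" & C1=$!
    curl -sf -o candidate.pb.gz "http://localhost:6061/debug/pprof/profile?seconds=60" & C2=$!
    wait "$C1" "$C2"
    kill "$PF1" "$PF2"

- name: "Diff"
  run: |
    # -diff_base reports changes relative to the baseline profile's total.
    go tool pprof -text -diff_base=baseline.pb.gz candidate.pb.gz | tee diff.txt
    # Custom script asserts no function regresses by > 20%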

The trickiest part is reproducing the same load on both. The two practical patterns:

Canary in production traffic. Route 1–5% of real traffic to the new version. Collect both profiles concurrently for 60 seconds. Diff. This is honest data but noisy.
Synthetic load test. Replay a captured RPS pattern against both versions in a sandbox. More deterministic; less representative.

9. The four profiles to capture every release

For each release, archive these in your build artifacts:

Profile	Why archive
`cpu_60s.pb.gz` under steady-state load	Catch CPU regressions
`heap.pb.gz` at end of 30-min soak	Catch retained-memory regressions
`allocs.pb.gz` after 30-min soak	Catch allocation-rate regressions
`goroutine.pb.gz` at end	Catch leaked-goroutine regressions

Six months later, when someone asks "what changed?", you have receipts.

# nightly snapshot script
set -eu
DATE=$(date +%Y%m%d)
mkdir -p "profiles/$DATE"
for kind in profile heap allocs goroutine; do
  args=""
  [ "$kind" = profile ] && args="?seconds=60"
  # -f: fail on HTTP errors rather than archiving an error page as a profile
  curl -sf "http://localhost:6060/debug/pprof/$kind$args" \
       -o "profiles/$DATE/$kind.pb.gz"
done

Store in object storage with a 1-year retention. Disk is cheap; regret is expensive.

10. GCP Cloud Profiler example

import (
    "log"

    "cloud.google.com/go/profiler"
)

func main() {
    if err := profiler.Start(profiler.Config{
        Service:        "checkout-api",
        ServiceVersion: buildVersion,
        ProjectID:      "my-gcp-project",
        MutexProfiling: true,
    }); err != nil {
        log.Fatal(err)
    }
    runServer()
}

The agent runs in-process, captures profiles periodically, and uploads. The GCP console gives you a flame graph and a built-in diff between any two time windows. The data model is the same pprof format; you can download a .pb.gz from the UI and use go tool pprof locally.

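Costs to budget: Google documents under 5% CPU overhead while a profile is being collected, which amortizes to well under 1% when averaged over time and replicas, plus a few MiB per pod per day of network egress.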

11. Operating block and mutex profiles at scale

These two are tricky in production:

They are disabled by default because they cost something.
The "right" sampling rate depends on contention level — too low and you see nothing, too high and you slow the program.

A reasonable default for a busy service:

runtime.SetBlockProfileRate(10_000)        // events ≥ 10 µs always recorded; shorter ones sampled proportionally
runtime.SetMutexProfileFraction(100)       // 1% of contention events

If a contention hotspot is too rare to show up at the default rates, lower the rate values temporarily via an admin endpoint (a smaller value samples more events), capture, and restore. Don't leave either setting at 1 in steady state.

12. Cross-process diff (canary)

# canary pod
curl -s "http://canary:6060/debug/pprof/profile?seconds=120" -o canary.pb.gz

# stable pod
curl -s "http://stable:6060/debug/pprof/profile?seconds=120" -o stable.pb.gz

# compare on your laptop
go tool pprof -http=: -diff_base=stable.pb.gz canary.pb.gz

In the web UI, red boxes got worse on the canary, green got better. This is the production version of "does my change help?". When a release seems risky, run it for 5 minutes on a canary, collect both, diff. If red boxes exceed your tolerance, roll back.

Two practical gotchas:

The two pods must serve similar traffic. If the mix differs, compare route-labeled subsets of each profile with -tagfocus instead of the whole profiles.
A 2-minute window on a noisy service has high variance. For a meaningful diff, prefer 5–10 minutes.

13. Incident response runbook: "service is slow"

1. SSH-tunnel to a slow pod's 6060.
2. curl /debug/pprof/profile?seconds=30 → cpu.pb.gz
3. go tool pprof -http=: cpu.pb.gz
   - Flame graph: is the hot box your code or runtime/syscall?
4. If your code: jump to source view, identify the line.
   If runtime.mallocgc dominates: also grab /heap, sample_index=alloc_objects.
   If syscall.Syscall dominates: also grab /goroutine?debug=2.
5. Cross-check with metrics: GC CPU? Allocation rate?
6. Save the profile. Add it to the incident ticket.

The pattern: always save profiles. They are your evidence trail.

14. Incident response runbook: "RSS climbing"

1. curl -o heap1.pb.gz /debug/pprof/heap
   ... wait 30 minutes ...
2. curl -o heap2.pb.gz /debug/pprof/heap
3. go tool pprof -http=: -base=heap1.pb.gz heap2.pb.gz
   - The "top" shows what grew. Each function is "new bytes since heap1".
4. List the suspects. Read what they retain.
5. Confirm with /goroutine?debug=2 — sometimes the leak is goroutines holding objects, not direct heap leak.

The -base diff is the difference between "I see lots of memory" and "I see exactly what new code is responsible".

15. Incident response runbook: "goroutine count exploded"

1. curl -o gr1.txt "/debug/pprof/goroutine?debug=2"
2. Read with less / pager. Search for repeated stacks.
3. The most-repeated stack is the leak. Look at the line where it's parked.
   - chan receive → who's not sending?
   - chan send    → who's not receiving?
   - select       → which case never fires?
   - sync.Cond.Wait → who's not Broadcasting?

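The text dump (?debug=2) is more useful than the binary profile here because you want to read representative stacks, with their wait reasons and durations, rather than aggregate them. When you only need a count of identical stacks, ?debug=1 groups them for you.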

16. Storage costs

For continuous profiling, plan storage budget per service:

profile_size × profiles_per_hour × hours × replicas

Typical numbers for a Go service:

CPU profile (30 s, 100 Hz, normal service): 50–500 KiB compressed.
Heap profile: 50–500 KiB.
Goroutine profile: depends on goroutine count; usually <100 KiB.

As a worked example, 10 uploads/min × 4 profile types × 10 KiB average comes to 400 KiB/min per replica, or about 23 MiB/h and 560 MiB/day. A 50-pod fleet generates roughly 27 GiB/day. Plan for tiered storage (hot 30 days, cold 1 year).

17. Don'ts

Don't	Why
Expose pprof on a public listener	Source-level information leak
Import `_ "net/http/pprof"` alongside your real handlers on the same mux	Same listener serves both
`SetBlockProfileRate(1)` in production for hours	Real CPU cost
`runtime.MemProfileRate = 1` in production	Profile overhead, big profile files
Diff CPU vs. heap profiles	Different sample_type; tool errors
Trust heap profile bytes as "ground truth"	Sampled; use MemStats for absolute numbers
Profile a 5 s workload over 30 s	25 s of idle dilutes percentages

18. Summary

Production pprof is two systems, not one: an on-demand interactive endpoint on every service, and a continuous-profiling pipeline that archives profiles long-term. Pair them with a strict label vocabulary (route, version, env, region, tenant), a hardened debug listener bound to localhost, and a few runbooks for the canonical incidents. Use diffs (-base / -diff_base) for both release-gating and incident triage. Treat profiles as build artifacts: archive them so you can answer "what changed?" six months later.