Runtime Internals Used by Stdlib — Professional Level¶

Table of Contents¶

Introduction
Production Debugging Workflow
The Block Profile in Anger
The Mutex Profile in Anger
Reading /debug/pprof/goroutine
runtime.Stack for Crash Dumps
Reading the Goroutine Dump Format
Triggering Dumps on Signals
sysmon and Async Preemption Diagnostics
runtime/trace for I/O-Bound Services
Hardening Finalizer Code Paths
Cgo Thread Affinity in Production
GODEBUG Flags You Need to Know
Sample Incident Walkthroughs
Performance Budget for Profiling
Checklist for Production Readiness
Further Reading

Introduction¶

Focus: how to use runtime primitives to diagnose, harden, and debug a production Go service.

This file is for the engineer who is paged at 03:00 because a service is hung, has runaway memory, or has tail latency spikes that look like nothing in the CPU profile. The runtime exposes enough machinery — block profiles, mutex profiles, goroutine dumps, trace events, GODEBUG flags — that almost any concurrency-related production issue can be diagnosed without attaching a debugger to a live process. But you have to know which primitive to reach for, what overhead it costs, and how to read its output.

By the end of this file you should be able to:

Choose between block profile, mutex profile, CPU profile, and trace for a given symptom.
Read a goroutine dump and triage by state (runnable, chan receive, sync.Mutex, IO wait, syscall).
Write a signal-driven dump handler suitable for production.
Reason about the cost of SetBlockProfileRate(1) vs SetBlockProfileRate(1000).
Diagnose async-preemption-related stalls.
Use GODEBUG=gctrace=1,schedtrace=1000,scheddetail=1,asyncpreemptoff=1 to narrow scope.

Production Debugging Workflow¶

Concrete decision tree for "my service is slow / hung / leaking":

Get a goroutine snapshot first — cheapest signal, often diagnostic.

curl -s http://localhost:6060/debug/pprof/goroutine?debug=2 > goroutines.txt

Count goroutines by state. If thousands are in chan receive, your producer died. If thousands are in IO wait, your downstream is slow. If thousands are in sync.Mutex.Lock, you have lock contention.
If contention is suspected, enable mutex profile briefly.
If blocking duration is the symptom, enable block profile briefly.
If CPU is the symptom, CPU profile.
If "nothing is the symptom" but tail latency is bad, runtime/trace.

Always observe before changing. Each profile has overhead; turn off when done.

The Block Profile in Anger¶

What it measures¶

Time goroutines spend blocked on synchronization (channels, mutexes, sync.Cond.Wait, select, network I/O via internal/poll).

Sample rate is set with runtime.SetBlockProfileRate(rate int): - rate = 0 — off (default). - rate = 1 — every event. - rate = n — sample if blocking duration was >= n nanoseconds, with proportional weighting.

Capture¶

import _ "net/http/pprof"
import "net/http"

func main() {
    runtime.SetBlockProfileRate(1)
    go http.ListenAndServe(":6060", nil)
    // ...
    runtime.SetBlockProfileRate(0) // turn off when done
}

go tool pprof http://localhost:6060/debug/pprof/block
(pprof) top 20
(pprof) list mySuspectFunction
(pprof) web

Interpreting¶

The profile shows blocking duration (not call count). A function with one call blocked for 5 s ranks above 1000 calls each blocked for 1 ms.

Common patterns. - runtime.semacquire in sync.(*WaitGroup).Wait: producer goroutine forgot Done or is wedged. - runtime.chanrecv from a fan-in collector: senders crashed before sending. - runtime.gopark from internal/poll.(*pollDesc).wait: blocked on network I/O — symptom of slow downstream.

Overhead¶

For rate = 1, each blocking call adds a stack-trace capture (~1 us). Under heavy contention, expect 5-15% throughput loss. Use rate = 1000 (one event per microsecond of block time) in production-on-demand.

The Mutex Profile in Anger¶

What it measures¶

Contention on sync.Mutex and sync.RWMutex only. Each Unlock of a contended mutex records the waiting time of the previous holder along with the unlocker's stack.

Sample rate via runtime.SetMutexProfileFraction(rate int): - rate = 0 — off (default). - rate = 1 — every contended unlock. - rate = n — 1 in n events.

Capture and read¶

go tool pprof http://localhost:6060/debug/pprof/mutex
(pprof) top
(pprof) list hotMutexUnlocker

Interpreting¶

A line attributed to function F means: "another goroutine waited for the mutex that F holds". So F is the contention source. The fix is to: - shrink the critical section, - shard the mutex, - replace with lock-free / atomics where possible, - replace Mutex with RWMutex if reads dominate.

Caveats¶

The mutex profile does not see runtime.lock (runtime-internal locks).
It does not see channel contention. Use the block profile for that.
A high mutex profile total is not always a problem — what matters is whether waiters are accumulating. Cross-check with goroutine dump.

Reading `/debug/pprof/goroutine`¶

curl -s http://localhost:6060/debug/pprof/goroutine?debug=2 > g.txt

debug=2 gives full stack dumps in text form. Sample:

goroutine 42 [chan receive, 5 minutes]:
github.com/foo/bar.(*Pipeline).consume(0xc0001a8000)
    /src/pipeline.go:81 +0xa0
created by github.com/foo/bar.(*Pipeline).Start
    /src/pipeline.go:65 +0x12c

Fields. - 42 — goroutine id (not stable across Go versions — diagnostic only). - chan receive — current state (see below for the list). - 5 minutes — duration in current state. Long durations are red flags.

Goroutine states you will see¶

State string	Meaning	Common cause if many
`running`	currently on a P	normal
`runnable`	on a run queue, waiting for CPU	starved Ps (insufficient GOMAXPROCS or busy goroutines)
`syscall`	in a blocking syscall	downstream syscall slow / many syscalls
`IO wait`	parked on netpoller	network I/O slow / no traffic
`chan send`	blocked sending to chan	consumer slow or dead
`chan receive`	blocked receiving from chan	producer slow or dead
`select`	blocked in select with no default	one of several channels is slow
`sync.Mutex.Lock`	waiting for mutex	mutex contention
`sync.WaitGroup.Wait`	waiting for WG counter to hit 0	a goroutine forgot `Done`
`sync.Cond.Wait`	waiting on Cond	broadcaster wedged
`semacquire`	low-level sema wait	similar to Mutex
`finalizer wait`	finalizer goroutine idle	normal (you have one of these)
`force gc (idle)`	force-gc goroutine idle	normal
`GC worker (idle)`	GC worker idle	normal
`timer goroutine (idle)`	(older Go) timer goroutine idle	pre-1.14

Triage by counting¶

grep '^goroutine' g.txt | wc -l       # total
grep 'chan receive' g.txt | wc -l     # by state
grep 'sync.Mutex.Lock' g.txt | wc -l

If chan receive is 10000 and your service has 100 connections, you have a leak.

`runtime.Stack` for Crash Dumps¶

Inside a panic handler:

func dumpAllGoroutines(w io.Writer) {
    buf := make([]byte, 1<<20)
    for {
        n := runtime.Stack(buf, true)
        if n < len(buf) {
            w.Write(buf[:n])
            return
        }
        buf = make([]byte, 2*len(buf))
    }
}

The growth loop handles dumps larger than 1 MiB. Always cap (e.g., 64 MiB) so a runaway dump doesn't OOM the process.

Cost. runtime.Stack(buf, true) stops the world. On a service with 50k goroutines this can pause 100-500 ms. Acceptable for a crash dump; unacceptable for periodic logging.

Reading the Goroutine Dump Format¶

Format documented at src/runtime/traceback.go goroutineheader:

goroutine <id> [<status>, <waitsincemin> minutes]:
<func1>(<args>)
    <file>:<line> +<pcoffset>
<func2>(<args>)
    <file>:<line> +<pcoffset>
created by <func>
    <file>:<line> +<pcoffset>

Arg parsing. <args> are the actual register values at the time the function was entered. Go's calling convention from 1.17 stores args in registers, so the displayed values may be opaque; older binaries show stack-passed args inline.

created by — the parent goroutine's stack frame. Tells you which goroutine spawned this one. If thousands of goroutines all have created by mypkg.startWorker, you spawn workers in a loop without bound.

Triggering Dumps on Signals¶

import (
    "os"
    "os/signal"
    "runtime"
    "syscall"
)

func installDumpSignal() {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGUSR1)
    go func() {
        for range c {
            buf := make([]byte, 1<<20)
            for {
                n := runtime.Stack(buf, true)
                if n < len(buf) {
                    os.Stderr.Write(buf[:n])
                    break
                }
                buf = make([]byte, 2*len(buf))
            }
        }
    }()
}

Then kill -USR1 <pid> from operations to capture state without restarting.

Note: Go's runtime already dumps on SIGQUIT (default Ctrl-\). For production use a less commonly-used signal like SIGUSR1 or SIGUSR2 that you control.

sysmon and Async Preemption Diagnostics¶

sysmon (src/runtime/proc.go:sysmon) is a special goroutine that runs without a P, every 10-20 µs to 10 ms (back-off based on idleness). It:

Polls the netpoller (timed netpoll with 0 ns timeout for non-blocking check).
Forces preemption via preemptone on goroutines that have run > 10 ms.
Triggers GC when needed.
Runs forced timers.

Symptom: tight loop with no preemption¶

Pre-1.14 Go required function-call safepoints for preemption. A tight loop with no calls would never be preempted. Since Go 1.14, sysmon sends a SIGURG signal to the target M, the signal handler turns the next instruction into a fake function call, and the goroutine is preempted on entry.

Diagnostic. Set GODEBUG=asyncpreemptoff=1 and watch the program hang at the same spot — confirmation that async preemption was your safety net.

Symptoms in goroutine dump. Goroutines stuck in running state for many minutes, despite the service being interactive. They will all be in the same tight loop.

`runtime/trace` for I/O-Bound Services¶

CPU profile is useless when the service spends its time blocked on I/O — every sample lands in runtime.netpoll. The execution trace shows you what happened in between.

import "runtime/trace"
import "os"

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()
    // your service
}

Then go tool trace trace.out. Open in browser; you see: - Goroutine analysis — per-goroutine lifelines. - Network blocking profile — which call sites parked goroutines on pollWait. - Synchronization blocking profile — which channels and mutexes were the slowest. - Scheduler latency profile — time from goready to actual execution. - User-defined regions — if you wrap code in trace.WithRegion.

Cost. Tracing writes ~1-10 MB/s of binary data. Run for 5-30 s at most. Do not leave on in production.

Hardening Finalizer Code Paths¶

Production rules for finalizers:

Never block in a finalizer. All finalizers share one goroutine; a slow one stalls all others.
Never panic in a finalizer. Panicking from the finalizer goroutine crashes the program.
Never re-SetFinalizer from inside a finalizer unless you understand reachability.

Always provide an explicit Close and clear the finalizer in Close:

func (r *Resource) Close() error {
    if !r.closed.CompareAndSwap(false, true) {
        return nil
    }
    runtime.SetFinalizer(r, nil)
    return r.cleanup()
}

Treat finalizers as leak detectors, not cleanup. Log a warning, do not perform real cleanup, because timing is non-deterministic and cleanup may run after the resource is needed elsewhere.

Sample warning-only finalizer:

runtime.SetFinalizer(r, func(p *Resource) {
    if !p.closed.Load() {
        log.Warn().Int("id", p.id).Msg("Resource leaked: Close was never called")
    }
})

Cgo Thread Affinity in Production¶

Rules:

Pin only when C requires it. Examples: OpenGL contexts (per-thread state), GUI main thread on macOS, OpenSSL error queue (per-thread on some versions), GTK main loop.
Pair with defer runtime.UnlockOSThread in 99% of cases.
For dedicated worker goroutines that own thread-state for life, omit UnlockOSThread; the runtime will terminate the M when the G exits.
Avoid LockOSThread in worker pools unless absolutely needed — it reduces parallelism.
Monitor M count. Excessive locked Ms can exhaust GOMAXPROCS * threads budget and cause "runtime: program exceeds 10000-thread limit".

Production pattern: a single dedicated cgo-bound goroutine, fed by a channel:

var cgoWork = make(chan func())

func init() {
    go func() {
        runtime.LockOSThread()
        for f := range cgoWork {
            f()
        }
    }()
}

func callCFromAny(f func()) {
    done := make(chan struct{})
    cgoWork <- func() { defer close(done); f() }
    <-done
}

This serialises all cgo work on one OS thread; the rest of the program is unaffected.

GODEBUG Flags You Need to Know¶

Set via environment, no recompile:

Flag	Effect
`gctrace=1`	Log every GC cycle to stderr
`schedtrace=1000`	Log scheduler state every 1000 ms
`scheddetail=1`	(with schedtrace) include per-G/P/M details
`asyncpreemptoff=1`	Disable signal-based preemption (debug only)
`madvdontneed=1`	Use MADV_DONTNEED on Linux (older default)
`cgocheck=0`	Disable cgo pointer checking (faster, less safe)
`cgocheck=2`	Full cgo pointer checking (debug only)
`tracebackancestors=N`	Show N ancestor stack traces on panic
`gocachehash=1`	Trace the build cache (compile-time only)
`allocfreetrace=1`	Trace every allocation/free (very slow; tiny programs only)
`tracebrokenpipe=1`	Don't suppress broken-pipe SIGPIPE backtraces
`panicnil=1`	Allow `panic(nil)` (legacy compat)

Production tip. Set GODEBUG=gctrace=0 explicitly to override any cluster-level default that might add overhead.

Sample Incident Walkthroughs¶

Incident A — service hung at 100% CPU¶

top shows the Go process pinned at one core × GOMAXPROCS.
kill -USR1 <pid> (custom signal handler installed) → all goroutines.
Triage: many goroutines in running state, all in the same call site.
Disassemble: tight loop without function call.
Fix: introduce runtime.Gosched() or refactor to break out.

Incident B — service has 100k goroutines and growing¶

curl /debug/pprof/goroutine?debug=2 > g.txt.
awk '/^goroutine/{print $3}' g.txt | sort | uniq -c | sort -rn.
99000 are in chan receive. Trace the channel: producer goroutine died.
Fix: producer must close the channel; consumers must respect _, ok := <-ch.

Incident C — tail latency p99 = 200 ms, p50 = 5 ms¶

CPU profile: nothing obvious.
Block profile (rate=1000): top entry is runtime.semacquire in a custom mutex slow path.
Mutex profile: contention on cache.mu.
Fix: shard the cache, replace global mutex with per-shard mutex.

Incident D — OOM after 6 hours¶

Heap profile (/debug/pprof/heap): shows 8 GB of []byte.
Goroutine dump: 100k goroutines in IO wait, each holding a buffer.
Root cause: handler does not return after timeout, holds buffer reference.
Fix: enforce request timeouts via http.Server.ReadTimeout and context.WithTimeout.

Performance Budget for Profiling¶

Profile	Overhead	When to enable
Goroutine dump (one-shot)	STW briefly	Always on demand
Heap profile	~1% steady	Always-on with default `MemProfileRate=512K`
CPU profile (rate=100)	<1%	On demand, 30 s
CPU profile (rate=1000)	2-5%	Diagnosis only
Block profile (rate=1)	5-15%	Diagnosis only
Block profile (rate=1000)	<1%	Steady acceptable
Mutex profile (frac=1)	5-15%	Diagnosis only
Mutex profile (frac=100)	<1%	Steady acceptable
Execution trace	~10% + huge data	Short bursts only

Rule: never leave block/mutex profiles at full rate in production. Use a debug endpoint that enables them for a bounded window and turns them off automatically.

Runtime Internals Used by Stdlib — Professional Level¶

Table of Contents¶

Introduction¶

Production Debugging Workflow¶

The Block Profile in Anger¶

What it measures¶

Capture¶

Interpreting¶

Overhead¶

The Mutex Profile in Anger¶

What it measures¶

Capture and read¶

Interpreting¶

Caveats¶

Reading `/debug/pprof/goroutine`¶

Goroutine states you will see¶

Triage by counting¶

`runtime.Stack` for Crash Dumps¶

Reading the Goroutine Dump Format¶

Triggering Dumps on Signals¶

sysmon and Async Preemption Diagnostics¶

Symptom: tight loop with no preemption¶

`runtime/trace` for I/O-Bound Services¶

Hardening Finalizer Code Paths¶

Cgo Thread Affinity in Production¶

GODEBUG Flags You Need to Know¶

Sample Incident Walkthroughs¶

Incident A — service hung at 100% CPU¶

Incident B — service has 100k goroutines and growing¶

Incident C — tail latency p99 = 200 ms, p50 = 5 ms¶

Incident D — OOM after 6 hours¶

Performance Budget for Profiling¶

Checklist for Production Readiness¶

Further Reading¶

Runtime Internals Used by Stdlib — Professional Level¶

Table of Contents¶

Introduction¶

Production Debugging Workflow¶

The Block Profile in Anger¶

What it measures¶

Capture¶

Interpreting¶

Overhead¶

The Mutex Profile in Anger¶

What it measures¶

Capture and read¶

Interpreting¶

Caveats¶

Reading /debug/pprof/goroutine¶

Goroutine states you will see¶

Triage by counting¶

runtime.Stack for Crash Dumps¶

Reading the Goroutine Dump Format¶

Triggering Dumps on Signals¶

sysmon and Async Preemption Diagnostics¶

Symptom: tight loop with no preemption¶

runtime/trace for I/O-Bound Services¶

Hardening Finalizer Code Paths¶

Cgo Thread Affinity in Production¶

GODEBUG Flags You Need to Know¶

Sample Incident Walkthroughs¶

Incident A — service hung at 100% CPU¶

Incident B — service has 100k goroutines and growing¶

Incident C — tail latency p99 = 200 ms, p50 = 5 ms¶

Incident D — OOM after 6 hours¶

Performance Budget for Profiling¶

Checklist for Production Readiness¶

Further Reading¶

Reading `/debug/pprof/goroutine`¶

`runtime.Stack` for Crash Dumps¶

`runtime/trace` for I/O-Bound Services¶