runtime Package Deep — Find the Bug
1. How to use this file
Fourteen buggy snippets of Go code that touch the runtime, runtime/debug, runtime/pprof, and runtime/trace packages: finalizers, OS-thread locking, GC tuning, memory stats, profiling. Read each in 30-60 seconds, decide where the defect is, then expand <details> for the answer.
Runtime-package bugs almost never produce a compile error and rarely produce a clean panic. They warp behaviour around the edges: a finalizer that fires too late or not at all, an OS thread that quietly leaks, a goroutine count read at a moment that doesn't exist yet, a CPU profile that grows to gigabytes because no one called StopCPUProfile. Three questions to ask every snippet:
- What does this runtime call actually promise, and when (or whether) does the runtime keep that promise?
- Who owns the lifecycle: the caller, the GC, the scheduler, or the kernel?
- What is the cost? Most of runtime is "advisory hint" or "stop-the-world request"; both have non-obvious blast radius.
If a snippet can't answer all three, there's a bug.
Bug 1 — Finalizer closure captures the object; finalizer never fires
type Conn struct{ fd int }
func NewConn(fd int) *Conn {
c := &Conn{fd: fd}
runtime.SetFinalizer(c, func(_ *Conn) {
log.Printf("closing fd %d for conn %p", c.fd, c) // BUG: captures c
syscall.Close(c.fd)
})
return c
}
Observed behavior: Conn instances are allocated and dropped by callers, but the finalizer never runs. pprof shows the heap full of unreachable-looking *Conns; file descriptors leak until ulimit is hit.
Hint
What does the finalizer closure refer to? Trace every reachable pointer from the closure back to the object the finalizer is attached to. Read the comment block at the top of `runtime/mfinal.go`; there's a worked example of exactly this trap.
Diagnosis: runtime.SetFinalizer(obj, fn) registers fn to run when obj becomes unreachable. The closure here closes over c itself, so the closure object holds a strong reference to the *Conn. The finalizer keeps c alive forever; c is never collected; the finalizer never runs.
runtime/mfinal.go (SetFinalizer's doc comment) spells it out explicitly: "the finalizer function ... must not refer to the object itself; doing so will prevent it from being collected." Internally, addfinalizer (in runtime/mfinal.go) records (obj, fn, ...) in a side table; the GC's reachability scan walks the closure, finds c, and marks the object live every cycle.
This is the single most common finalizer bug. It usually surfaces not as "the finalizer was buggy" but as "the GC is broken — my objects don't free."
Fix: Use the parameter, never the outer variable:
func NewConn(fd int) *Conn {
c := &Conn{fd: fd}
runtime.SetFinalizer(c, func(x *Conn) { // use the parameter x; the closure captures nothing
log.Printf("closing fd %d for conn %p", x.fd, x)
syscall.Close(x.fd)
})
return c
}
Even better: keep finalizers minimal. All finalizers in a program run sequentially on a single runtime-owned goroutine, so a slow log.Printf delays every other finalizer, and nothing guarantees that a finalizer runs before the program exits. Best of all: don't rely on finalizers (see Bug 2).
Bug 2 — Finalizer used to close a file; relies on GC timing
func OpenLog(path string) (*os.File, error) {
f, err := os.OpenFile(path, os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0o644)
if err != nil { return nil, err }
runtime.SetFinalizer(f, func(f *os.File) { _ = f.Close() }) // BUG: deferred to GC
return f, nil
}
// caller
for i := 0; i < 100_000; i++ {
f, _ := OpenLog("/var/log/app.log")
_, _ = f.Write([]byte("entry\n"))
// no explicit Close — "the finalizer will get it"
}
Observed behavior: Program crashes with too many open files. The GC is running (MemStats.NumGC increases) but file descriptors keep climbing.
Hint
When does a finalizer actually run, relative to when the object becomes unreachable? Read the second paragraph of `SetFinalizer`'s doc in `runtime/mfinal.go`. Then ask: how big is `os.File`? How many of them fit in the heap before the GC bothers to look?
Diagnosis: Finalizers run some time after the object becomes unreachable. There's no upper bound. The GC's trigger is heap growth, not file-descriptor count. *os.File is tiny (a few words), so a very large number of unreachable *os.File values fit in the heap before the GC trigger fires.
Even when the GC does run, all finalizers execute sequentially on a single dedicated goroutine (runfinq in runtime/mfinal.go); under pressure they back up. The kernel's FD table is a hard, low limit (the soft RLIMIT_NOFILE is commonly 1024 on Linux). Heap memory is a soft, high limit. Tying FD release to heap pressure mismatches the resource the system actually constrains.
os.File's standard-library implementation does set a finalizer (os/file_unix.go's newFile calls runtime.SetFinalizer(f.file, (*file).close)) — but it's a safety net for code that forgot to Close, not a primary mechanism. The doc on os.File.Close says: "Programs should close files as soon as they are no longer needed."
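Fix: Close explicitly at the call site. Use defer f.Close() in a function that opens one file; inside a loop call f.Close() directly, because deferred calls would pile up until the enclosing function returns. The finalizer is the last line of defence, not the plan.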
for i := 0; i < 100_000; i++ {
f, err := OpenLog("/var/log/app.log")
if err != nil { return err }
_, _ = f.Write([]byte("entry\n"))
f.Close() // explicit, deterministic
}
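For pooled or shared file handles, use an explicit pool with a defined close path or a reference-counted wrapper. sync.Pool is the wrong tool here: it may drop pooled items at any GC cycle without closing them. Never use finalizers for resources whose limit is enforced outside the Go heap: file descriptors, sockets, kernel handles, GPU buffers, database connections, locks.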
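A reference-counted wrapper for a shared handle can be sketched as follows. The names sharedCloser, newShared, and fakeFile are hypothetical and exist only for this illustration; fakeFile stands in for *os.File so the example needs no filesystem. This is one possible shape, not a standard-library facility.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync"
)

// sharedCloser is a hypothetical reference-counted wrapper around any io.Closer.
type sharedCloser struct {
	mu   sync.Mutex
	refs int
	c    io.Closer
}

// newShared wraps c with an initial reference held by the caller.
func newShared(c io.Closer) *sharedCloser { return &sharedCloser{refs: 1, c: c} }

// Acquire adds a reference; it fails once the resource has been closed.
func (s *sharedCloser) Acquire() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.refs == 0 {
		return errors.New("sharedCloser: already closed")
	}
	s.refs++
	return nil
}

// Release drops a reference and closes the resource when the last one goes.
func (s *sharedCloser) Release() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.refs == 0 {
		return errors.New("sharedCloser: release after close")
	}
	s.refs--
	if s.refs == 0 {
		return s.c.Close()
	}
	return nil
}

// fakeFile stands in for *os.File in this sketch.
type fakeFile struct{ closed bool }

func (f *fakeFile) Close() error { f.closed = true; return nil }

func main() {
	f := &fakeFile{}
	s := newShared(f)
	_ = s.Acquire() // a second user takes a reference
	_ = s.Release() // first user done; one reference remains
	fmt.Println("closed after first release:", f.closed)
	_ = s.Release() // last reference gone; Close runs here, deterministically
	fmt.Println("closed after last release:", f.closed)
}
```

The close happens at a point the program controls (the last Release), which is the property a finalizer cannot give.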
Bug 3 — LockOSThread without UnlockOSThread; M is destroyed when goroutine exits
func runInPinnedThread(work func()) {
go func() {
runtime.LockOSThread() // BUG: no UnlockOSThread
// set up thread-local state: CUDA context, GTK GUI, signal mask, etc.
cudaSetDevice(0)
work()
// goroutine returns here
}()
}
// caller invokes runInPinnedThread thousands of times
Observed behavior: Every call costs a whole OS thread. strace -f shows a clone(2) and a thread exit for each invocation, the thread IDs under /proc/<pid>/task churn constantly, and throughput is far lower than the goroutine count suggests. When work() is slow and calls overlap, ps -o thcount climbs with the number of in-flight calls, and a large enough burst hits the runtime's thread limit (10,000 by default, see debug.SetMaxThreads) and crashes the process.
Hint
What happens to the OS thread (M, in scheduler terms) when a goroutine that has locked itself to it exits? Is the M returned to the pool? Read the comment on `LockOSThread` in `runtime/proc.go`, specifically the sentence beginning "If the calling goroutine exits without unlocking the thread".
Diagnosis: runtime.LockOSThread pins the calling goroutine to its current M (OS thread). If the goroutine exits while still locked, the runtime terminates that thread rather than returning it to the idle-M pool: the goroutine-exit path in runtime/proc.go sees the lock and lets the M exit instead of rescheduling it.
This is intentional: thread-local state the goroutine set up (CUDA context, OpenGL context, X11 connection, sigaltstack, syscall identity on Linux when unshared) may not be safe to inherit. The Go runtime takes the conservative path and burns the thread.
For cudaSetDevice, gtk_main, pthread_setname_np, runtime.LockOSThread is exactly the right tool. But the contract is symmetric: lock for as long as the thread-local state is needed, unlock before the goroutine returns if the thread is to be reused.
The cost is easy to miss because the scheduler creates replacement M's on demand; the program works, just with one thread created and destroyed per call. Each cycle pays for thread creation, a fresh scheduler stack, and teardown, and none of the thread-local setup is ever reused.
Fix: Pair every LockOSThread with an UnlockOSThread — and use defer:
func runInPinnedThread(work func()) {
go func() {
runtime.LockOSThread()
defer runtime.UnlockOSThread() // returns M to the pool
cudaSetDevice(0)
work()
}()
}
If the thread-local state genuinely cannot be safely reused (CUDA contexts often can't be re-initialised cleanly), keep the lock — but then dedicate one long-lived goroutine that owns the thread for the lifetime of the program, and feed it work over a channel. Don't spawn-and-burn.
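The "one long-lived goroutine that owns the thread" pattern can be sketched as below. pinnedWorker, newPinnedWorker, Do, and Close are hypothetical names for this illustration, and the thread-local setup is left as a comment because it depends on the library being wrapped.

```go
package main

import (
	"fmt"
	"runtime"
)

// pinnedWorker owns one OS thread for its whole lifetime and runs
// submitted jobs on it, one at a time.
type pinnedWorker struct {
	jobs chan func()
}

func newPinnedWorker() *pinnedWorker {
	w := &pinnedWorker{jobs: make(chan func())}
	go func() {
		// Deliberately no UnlockOSThread: the thread state is assumed
		// not to be reusable, so the thread is torn down once, when
		// the worker exits, instead of once per job.
		runtime.LockOSThread()
		// Thread-local setup (for example a device context) would go here.
		for job := range w.jobs {
			job()
		}
	}()
	return w
}

// Do runs fn on the pinned thread and waits for it to finish.
func (w *pinnedWorker) Do(fn func()) {
	done := make(chan struct{})
	w.jobs <- func() {
		defer close(done)
		fn()
	}
	<-done
}

// Close stops the worker; the pinned thread exits with its goroutine.
func (w *pinnedWorker) Close() { close(w.jobs) }

func main() {
	w := newPinnedWorker()
	defer w.Close()
	sum := 0
	for i := 1; i <= 3; i++ {
		n := i
		w.Do(func() { sum += n }) // Do blocks, so sum is not accessed concurrently
	}
	fmt.Println("sum:", sum)
}
```

Callers pay one channel round trip per job instead of one thread creation per job.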
Bug 4 — runtime.GC() in a hot loop; stop-the-world cascade
func ingest(items []Record) {
for _, r := range items {
process(r)
if r.Big {
runtime.GC() // BUG: "free that memory now"
}
}
}
Observed behavior: Single-threaded throughput drops 10x once Big records appear. p99 latency for unrelated request handlers spikes simultaneously. GODEBUG=gctrace=1 shows GC cycles back-to-back, each STW pause adding to tail latency.
Hint
What does `runtime.GC()` actually do? Look at `runtime/mgc.go`: `GC()` calls `gcStart` with the `gcTriggerCycle` trigger and *blocks until the cycle completes*, including sweep. Compare that to the runtime's automatic pacing.
Diagnosis: runtime.GC() runs a full mark-and-sweep cycle synchronously. From runtime/mgc.go:
// GC runs a garbage collection and blocks the caller until the
// garbage collection is complete. It may also block the entire
// program.
func GC() {
n := work.cycles.Load()
gcWaitOnMark(n)
gcStart(gcTrigger{kind: gcTriggerCycle, n: n + 1})
gcWaitOnMark(n + 1)
// ... and waits for sweep, too
}
Every call walks the entire reachable heap once. The pacer (runtime/mgcpacer.go) is designed to amortize this cost — calling GC() manually defeats it. Worse, each manual GC resets the heap-growth target for the next automatic GC, so even non-Big records that follow trigger sooner than they would have.
Two stop-the-world phases bracket each cycle (mark-setup and mark-termination). Each STW is hundreds of microseconds to single-digit milliseconds on a healthy server; calling GC() thousands of times stacks those pauses.
runtime.GC() has exactly two legitimate uses: (1) before benchmarking, to get a clean baseline; (2) before snapshotting memory stats for a regression test. Calling it to "free memory now" is always wrong — the runtime decides when, based on the pacer.
Fix: Trust the pacer. If memory pressure is real, tune GOGC (lower = more frequent GC, smaller heap) or set a soft memory limit via debug.SetMemoryLimit (Go 1.19+):
import "runtime/debug"
func main() {
debug.SetMemoryLimit(8 << 30) // 8 GiB soft cap
ingest(items) // no manual GC
}
If a specific large allocation needs early release, scope its lifetime tightly (use a function so the stack frame frees its references) — but don't call runtime.GC().
Bug 5 — Missing runtime.KeepAlive after unsafe.Pointer conversion
import (
"syscall"
"unsafe"
)
type Buffer struct {
data []byte
}
func (b *Buffer) WriteToFD(fd int) (int, error) {
ptr := unsafe.Pointer(&b.data[0])
n, _, errno := syscall.Syscall(
syscall.SYS_WRITE,
uintptr(fd),
uintptr(ptr), // BUG: ptr is now uintptr, GC can't see it
uintptr(len(b.data)),
)
if errno != 0 { return 0, errno }
return int(n), nil
}
// elsewhere, b has a finalizer that releases pinned memory back to a pool
Observed behavior: Most calls succeed. Under GC pressure (concurrent allocators on other goroutines), rare calls return garbage bytes, return EFAULT, or trigger a SIGSEGV inside the kernel write path.
Hint
Once `&b.data[0]` is converted to `uintptr`, does the garbage collector know `b` is still in use? Read `runtime.KeepAlive`'s doc in `runtime/mfinal.go` and the "Conversion of a Pointer to a uintptr" rule in `unsafe`'s package docs.
Diagnosis: A uintptr is an integer to the GC, not a pointer. Once unsafe.Pointer(&b.data[0]) is cast to uintptr for the syscall, the GC and finalizer scanner stop tracking b through that variable. If b has no other live references, the runtime is free to:
- Run b's finalizer, returning the underlying memory to a pool or releasing it.
- Move the underlying array if b.data lives in a region the runtime relocates (the current Go runtime doesn't move heap objects, but the unsafe rules are written so that a moving collector remains possible).
While the kernel is inside write(2), dereferencing the now-stale pointer, either of these is a use-after-free.
The unsafe package's documented rule:
Note that the pointer must point into an allocated object, so it may not be nil. ... The compiler handles a Pointer converted to a uintptr in the argument list of a call to a function implemented in assembly by arranging that the referenced allocated object, if any, is retained and not moved until the call completes ...
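syscall.Syscall is implemented in assembly, so the compiler keeps the referenced object (the backing array of b.data) alive for the duration of that single call. That retention covers the array, not b. After &b.data[0] has been evaluated, nothing in the function refers to b again, so if the caller holds no other reference, b's finalizer may run while the kernel is still reading the buffer and hand the memory back to the pool.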
In practice with syscall.Syscall the per-call retention saves most code. But the moment you split the conversion across statements, or wrap the syscall in a helper, or interleave anything between the conversion and the call, the retention contract breaks.
Fix: Use runtime.KeepAlive(b) after the syscall to extend b's reachability past the call:
func (b *Buffer) WriteToFD(fd int) (int, error) {
n, _, errno := syscall.Syscall(
syscall.SYS_WRITE,
uintptr(fd),
uintptr(unsafe.Pointer(&b.data[0])), // conversion in arg list
uintptr(len(b.data)),
)
runtime.KeepAlive(b) // b reachable until here
if errno != 0 { return 0, errno }
return int(n), nil
}
runtime.KeepAlive (in runtime/mfinal.go) does nothing observable at run time. Its sole purpose is to count as a use of the variable, so the compiler's liveness analysis, and therefore the GC, treats b as reachable up to that point. It costs essentially nothing and prevents an entire class of unsafe bugs.
Bug 6 — Reading runtime.MemStats in a loop without realising it requires STW
import (
"log"
"runtime"
"time"
)
func memMonitor() {
var m runtime.MemStats
for {
runtime.ReadMemStats(&m) // BUG: STW every tick
log.Printf("heap_alloc=%d gc_count=%d", m.HeapAlloc, m.NumGC)
time.Sleep(100 * time.Millisecond)
}
}
Observed behavior: Application's p99 latency rises by 1-5 ms once memMonitor is enabled. Throughput on hot endpoints drops a few percent. pprof shows time spent in runtime.stopTheWorld.
Hint
How does `ReadMemStats` get a consistent snapshot of every per-P heap-stat counter at once? Look at the function in `runtime/mstats.go`. What does it call before reading?
Diagnosis: runtime.ReadMemStats stops the world to take a consistent snapshot. From runtime/mstats.go:
func ReadMemStats(m *MemStats) {
_ = m.Alloc // nil check
stw := stopTheWorld(stwReadMemStats)
systemstack(func() {
readmemstats_m(m)
})
startTheWorld(stw)
}
Per-P stats are kept lock-free and only reconciled to global counters at GC checkpoints. To produce a coherent MemStats value across all P's, ReadMemStats halts every goroutine, flushes per-P deltas, copies the global stats, and resumes. The pause is short (microseconds to low milliseconds depending on GOMAXPROCS and the worst-case syscall-exit latency) but it's a full STW.
Calling this ten times a second adds ten STW events per second to your latency profile. On a process serving 50k QPS, that's measurable in p99.
Fix: Read sparingly — once per scrape interval (Prometheus default 15s) is the right cadence:
func memMonitor() {
var m runtime.MemStats
tick := time.NewTicker(15 * time.Second)
for range tick.C {
runtime.ReadMemStats(&m)
log.Printf("heap_alloc=%d gc_count=%d", m.HeapAlloc, m.NumGC)
}
}
For high-frequency observability, use the runtime/metrics package (Go 1.16+). metrics.Read fills in only the samples you ask for and does not stop the world. It's the modern replacement for ReadMemStats for monitoring:
import "runtime/metrics"
samples := []metrics.Sample{
{Name: "/memory/classes/heap/objects:bytes"},
{Name: "/gc/cycles/total:gc-cycles"},
}
metrics.Read(samples) // no STW
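Metric names differ between Go versions, so it is worth checking each sample's Kind before using its value. A minimal sketch, where readUint64 is a hypothetical helper and the third metric name is deliberately bogus:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readUint64 reads one runtime metric and reports whether it exists on this
// runtime and holds a uint64.
func readUint64(name string) (uint64, bool) {
	s := []metrics.Sample{{Name: name}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindUint64 {
		return 0, false // KindBad means the runtime does not know this metric
	}
	return s[0].Value.Uint64(), true
}

func main() {
	for _, name := range []string{
		"/memory/classes/heap/objects:bytes",
		"/gc/cycles/total:gc-cycles",
		"/no/such/metric:bytes", // made up, to show the failure path
	} {
		v, ok := readUint64(name)
		fmt.Println(name, v, ok)
	}
}
```

In a real monitor, allocate the []metrics.Sample slice once and reuse it on every tick rather than building it per call as this helper does.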
Bug 7 — CPU profile started but never stopped; the profile is never written
import (
"os"
"runtime/pprof"
)
func main() {
f, _ := os.Create("cpu.prof")
pprof.StartCPUProfile(f) // BUG: no StopCPUProfile
// defer pprof.StopCPUProfile() omitted
runApplication()
}
Observed behavior: cpu.prof stays at 0 bytes for as long as the process runs, and it is still empty after the process exits. go tool pprof cpu.prof cannot load it. Meanwhile the process pays the sampling overhead the whole time and holds every distinct sampled stack in memory.
Hint
What is the CPU profiler doing in the background while it's running? Look at `runtime/cpuprof.go`, specifically the signal handler that fires every 10 ms. Where does each sample go? When is the profile actually written out?
Diagnosis: pprof.StartCPUProfile enables the runtime's SIGPROF-based sampler: roughly every 10ms of CPU time consumed, the kernel delivers SIGPROF; the Go runtime captures the current stack and appends it to an in-memory buffer. A goroutine in runtime/pprof drains that buffer and aggregates the samples by stack.
From runtime/cpuprof.go:
func SetCPUProfileRate(hz int) {
// ... starts the signal-driven sampler at hz Hz (StartCPUProfile uses 100 Hz)
}
Without StopCPUProfile, sampling never stops and nothing is ever serialized. Two consequences:
- The profile is a gzip-compressed protobuf that runtime/pprof builds in memory and writes to the io.Writer only when StopCPUProfile is called. If the process exits without Stop, the samples are lost and the file stays empty.
- The aggregated sample table lives for the rest of the process and grows with every distinct stack the sampler sees.
The SIGPROF signal itself has a small but non-zero cost, commonly on the order of 1% at the default 100 Hz. Forever-on CPU profiling in production wastes CPU on a profile nobody can read.
Fix: Always pair Start with Stop. The canonical idiom:
func main() {
f, err := os.Create("cpu.prof")
if err != nil { log.Fatal(err) }
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil { log.Fatal(err) }
defer pprof.StopCPUProfile() // critical
runApplication()
}
For long-running services, expose pprof via net/http/pprof and let operators trigger profiles on demand (go tool pprof http://host/debug/pprof/profile?seconds=30) — the HTTP handler starts and stops the profiler per request, so leaks are impossible.
Bug 8 — runtime.NumGoroutine() called from init(), assumed final
var initialGoroutines int
func init() {
initialGoroutines = runtime.NumGoroutine() // BUG: main hasn't started
log.Printf("baseline goroutines: %d", initialGoroutines)
}
func checkLeaks() {
extra := runtime.NumGoroutine() - initialGoroutines
if extra > 10 {
log.Printf("goroutine leak: %d extra", extra)
}
}
Observed behavior: baseline goroutines: 1 printed at startup. Later, checkLeaks reports "leaks" of 3-5 goroutines even on a quiescent process. The numbers don't match what pprof goroutine shows in the running server.
Hint
What goroutines exist at the moment `init()` runs, and which of them does `NumGoroutine` count? When do the program's own long-lived goroutines (signal watcher, listeners, pool maintenance) start? Look at `gcount` in `runtime/proc.go` and at where `gcenable` is called relative to `doInit`.
Diagnosis: Package init() functions run on the main goroutine before main.main, so before any of the program's own long-lived goroutines exist. From runtime/proc.go's main function:
func main() {
// ...
gcenable() // starts bgsweep and bgscavenge
// ...
main_init_done = make(chan bool)
// run user init functions
doInit(&main_inittask)
close(main_init_done)
main_main() // user's main()
}
The runtime's own helpers (forcegchelper, bgsweep, bgscavenge) are already running by the time your init() executes, but they are system goroutines and NumGoroutine does not count them. That is why the baseline is 1: the main goroutine is the only user goroutine at that point.
Everything the program starts afterwards is counted: the os/signal watcher created by the first signal.Notify, HTTP listeners and their connection goroutines, connection-pool and cache janitors started by libraries. On a healthy, quiescent process those add up to a handful of permanent goroutines. Your "leak detector" compares them against a baseline of 1 and reports them as leaks.
Fix: Take the baseline once startup has settled, not in init:
var initialGoroutines atomic.Int64
func main() {
// crude: give goroutines started during setup time to appear
time.Sleep(50 * time.Millisecond)
initialGoroutines.Store(int64(runtime.NumGoroutine()))
runApplication()
}
If runApplication itself starts the long-lived goroutines, record the baseline from inside it, after they are up.
Better: for leak detection in tests, use go.uber.org/goleak or testing.M-wrapped checks that snapshot after m.Run() and ignore known runtime goroutines by stack-prefix. runtime.NumGoroutine() alone is too blunt — it counts everything, including the ones you don't own.
Bug 9 — runtime/trace.Start called twice; second call's error is silently ignored
import "runtime/trace"
var tracing bool
func startTrace(path string) {
f, _ := os.Create(path)
trace.Start(f) // BUG: no error check, no idempotency
tracing = true
}
// HTTP handler exposed to operators
func handleTrace(w http.ResponseWriter, r *http.Request) {
startTrace(r.URL.Query().Get("path"))
fmt.Fprintln(w, "tracing started")
}
Observed behavior: First call works; the trace file fills. Second call returns 200 to the HTTP client but the new trace file stays at 0 bytes. The original trace continues writing into the first file. Operators see "trace started" and wait — nothing arrives.
Hint
What does `trace.Start` return? What happens inside if tracing is already running? Read `runtime/trace/trace.go` and follow into `runtime/trace.go`'s `StartTrace`.
Diagnosis: trace.Start returns an error if tracing is already enabled. From runtime/trace/trace.go:
func Start(w io.Writer) error {
tracing.Lock()
defer tracing.Unlock()
if tracing.enabled {
return errors.New("tracing is already enabled")
}
// ... set up writer, call runtime.StartTrace
}
The runtime maintains a single global trace state (runtime/trace.go's traceLock, trace.enabled). The second trace.Start returns an error immediately without ever wiring up the new writer; the first writer keeps receiving events.
Ignoring the error masks the fact that the second Start did nothing. The caller is left wondering why the new file is empty.
Fix: Check the error. Implement Stop/Start cycling correctly:
var (
traceMu sync.Mutex
traceF *os.File
)
func startTrace(path string) error {
traceMu.Lock()
defer traceMu.Unlock()
if traceF != nil {
trace.Stop()
traceF.Close()
traceF = nil
}
f, err := os.Create(path)
if err != nil { return err }
if err := trace.Start(f); err != nil {
f.Close()
return err
}
traceF = f
return nil
}
func stopTrace() {
traceMu.Lock()
defer traceMu.Unlock()
if traceF == nil { return }
trace.Stop()
traceF.Close()
traceF = nil
}
net/http/pprof.Trace handles the lifecycle for you and is the safer default for ops-exposed tracing.
Bug 10 — runtime.SetBlockProfileRate(1); one in one is the most aggressive setting
import "runtime"
func init() {
runtime.SetBlockProfileRate(1) // BUG: every blocking event sampled
runtime.SetMutexProfileFraction(1) // same trap on the mutex side
}
Observed behavior: Service throughput drops 30-50% with no other change. pprof shows huge amounts of time in runtime.chansend1, runtime.semacquire1, and inside the profiler's own recording functions.
Hint
What does `SetBlockProfileRate(rate int)` mean: is `1` "1 Hz" or something else? Read the doc and follow into `runtime/mprof.go`'s `blocksampled` / `mutexevent`.
Diagnosis: The argument to SetBlockProfileRate is not a frequency. It is a sampling period measured in nanoseconds of blocked time, so a smaller value means more samples. From the doc:
SetBlockProfileRate controls the fraction of goroutine blocking events that are reported in the blocking profile. The profiler aims to sample an average of one blocking event per rate nanoseconds spent blocked. To include every blocking event in the profile, pass rate = 1.
So 1 means "record every blocking event, no matter how short". On a busy server with channel operations, mutex contention, and sync.WaitGroup waits in the millions per second, every one of those events gets a stack trace captured, hashed, and stored. The profiler's hash table lock becomes the new contention bottleneck.
SetMutexProfileFraction(1) is the same trap on the mutex side: 1 means "1 in 1" = every mutex contention event sampled.
Reasonable production values: SetBlockProfileRate(10000) (sample one event per 10µs of cumulative block time on average) or higher; SetMutexProfileFraction(100) (1 in 100 contention events). Even those are usually only enabled when investigating a specific suspected contention problem.
runtime/mprof.go's blocksampled runs cheaply when the cycles threshold is high; with rate=1, the cycles threshold is 1, so the check always succeeds and the full stack walk + hash + lock path runs on every block.
Fix: Pick a sampling rate. Keep it off by default; enable behind a debug flag. Parse the flag in main, not init: flag.Parse called from an init function runs before other packages have registered their flags.
import (
"flag"
"runtime"
)
var blockProfRate = flag.Int("block-profile-rate", 0,
"ns; <=0 disables, 10000 = once per 10µs of block time")
func main() {
flag.Parse()
if *blockProfRate > 0 {
runtime.SetBlockProfileRate(*blockProfRate)
}
runApplication()
}
For production "always on but cheap" telemetry, leave block/mutex profiling off and use distributed tracing (OpenTelemetry) plus per-handler timing histograms.
Bug 11 — runtime.Stack(buf, true) with too-small buf; dump truncated silently
import (
"log"
"os"
"runtime"
)
func crashHandler() {
if r := recover(); r != nil {
buf := make([]byte, 4096) // BUG: too small for goroutine dump
n := runtime.Stack(buf, true)
log.Printf("PANIC: %v\nSTACK:\n%s", r, buf[:n])
os.Exit(2)
}
}
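Observed behavior: On a panic in production, the log shows the panic message and a truncated dump: the panicking goroutine's own trace, which is written first, and then only the first few of the other goroutines, the last one cut off mid-frame. The goroutine that holds the lock or produced the bad value, often the one you need next, is missing. Debugging the panic requires guessing.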
Hint
What does `runtime.Stack(buf, all bool) int` do when the dump doesn't fit? Does it return an error? Does it grow the buffer? Read its doc and `runtime/mprof.go`.
Diagnosis: runtime.Stack(buf, all) writes as many bytes as fit into buf and returns the count written. If the dump is larger than buf, the excess is silently discarded: no error, no indicator. From the doc:
Stack formats a stack trace of the calling goroutine into buf and returns the number of bytes written to buf. If all is true, Stack formats stack traces of all other goroutines into buf after the trace for the current goroutine.
With all=true, the buffer must hold every goroutine's stack. On a production server with 10,000 goroutines, that's hundreds of KB to several MB. A 4 KB buffer holds maybe 5-10 goroutines.
Worse, the truncation happens mid-frame: the last line of the buffer is usually a half-printed PC or file path. The calling goroutine is written first, so in a recover handler the panicking stack usually survives, but the other goroutines follow in the runtime's internal order, and whichever ones fall past the 4 KB mark simply vanish.
runtime/mprof.go's Stack stops the world when all is true and prints the tracebacks straight into buf; output past the end of buf is dropped. There's no second-pass "tell me how much I need."
Fix: Use a growable buffer. Double until it fits:
func dumpAllStacks() []byte {
buf := make([]byte, 1<<20) // start at 1 MiB
for {
n := runtime.Stack(buf, true)
if n < len(buf) {
return buf[:n]
}
buf = make([]byte, 2*len(buf)) // double and retry
}
}
func crashHandler() {
if r := recover(); r != nil {
log.Printf("PANIC: %v\nSTACK:\n%s", r, dumpAllStacks())
os.Exit(2)
}
}
For just the current goroutine (all=false), 64 KB is almost always enough. For all=true, start at 1 MB and grow.
Alternative: write directly to a file via pprof.Lookup("goroutine").WriteTo(f, 2) — the 2 is the "full goroutine dump including stacks" mode and it streams to the writer without a fixed buffer.
Bug 12 — Setting GOMAXPROCS in init() before reading the cgroup CPU quota
import "runtime"
func init() {
runtime.GOMAXPROCS(runtime.NumCPU()) // BUG: NumCPU sees host, not cgroup
}
Observed behavior: Containerised service runs on Kubernetes with resources.limits.cpu: 2. The node has 64 cores. The process sets GOMAXPROCS=64. The scheduler creates 64 P's, runs all 64 in parallel for short bursts, then the cgroup throttler suspends them for tens of milliseconds while quota replenishes. Latency p99 is terrible; CPU utilization on the cgroup is pegged at 100%.
Hint
What does `runtime.NumCPU` actually count? Look at `runtime/os_linux.go`: what syscall does it use? Does the result reflect the Linux cgroup CPU quota?
Diagnosis: runtime.NumCPU returns the number of logical CPUs visible to the process via sched_getaffinity(2) on Linux (see runtime/os_linux.go).
sched_getaffinity reflects the kernel's CPU set for the process — which Kubernetes does not restrict for the common cpu cgroup limit (CFS quota). The cgroup quota limits how much CPU time per period the process gets; it does not change the set of CPUs the process can be scheduled on. So NumCPU returns 64 on a 64-core host even with a 2-CPU CFS quota.
Setting GOMAXPROCS=64 makes the runtime create 64 P's. The scheduler tries to run 64 goroutines in parallel. The CFS quota is consumed in 2 cores' worth of wall time, then the kernel throttles every thread in the cgroup until the next 100ms period — for tens of milliseconds at a time. Tail latency explodes.
The fix is to set GOMAXPROCS to match the quota, not the visible CPU count. Reading the cgroup's quota directly is annoying (different paths for cgroup v1 vs v2), so let a library or the runtime do it.
Fix: On Go 1.25 and later, delete the call: when neither the GOMAXPROCS environment variable nor runtime.GOMAXPROCS sets a value, the runtime on Linux takes the cgroup CPU limit into account when choosing the default. On older toolchains, use automaxprocs (Uber):
import _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from cgroup at init time
func main() {
log.Printf("GOMAXPROCS=%d", runtime.GOMAXPROCS(0))
runApplication()
}
Or manually parse /sys/fs/cgroup/cpu.max (cgroup v2) and divide quota by period.
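The manual route amounts to a few lines of parsing. The sketch below handles only the cgroup v2 cpu.max format and works on the file's contents as a string, so it runs anywhere; parseCPUMax and procsFor are hypothetical helpers, and rounding up to a whole number of P's is a choice made for this sketch (automaxprocs, for instance, makes its own rounding choice).

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseCPUMax parses the contents of a cgroup v2 cpu.max file, which holds
// "<quota> <period>" in microseconds, or "max <period>" when unlimited.
// It returns the limit in cores, and false when there is no usable limit.
func parseCPUMax(contents string) (float64, bool) {
	fields := strings.Fields(contents)
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.ParseInt(fields[0], 10, 64)
	period, err2 := strconv.ParseInt(fields[1], 10, 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return 0, false
	}
	return float64(quota) / float64(period), true
}

// procsFor converts a fractional CPU limit to a GOMAXPROCS value:
// round up, never below 1.
func procsFor(limit float64) int {
	return int(math.Max(1, math.Ceil(limit)))
}

func main() {
	for _, s := range []string{"200000 100000\n", "50000 100000\n", "max 100000\n"} {
		if limit, ok := parseCPUMax(s); ok {
			fmt.Printf("%q: limit %.1f cores, GOMAXPROCS %d\n", s, limit, procsFor(limit))
		} else {
			fmt.Printf("%q: no limit\n", s)
		}
	}
}
```

A real program would read the string with os.ReadFile, fall back to the default when the file is missing, and pass the result to runtime.GOMAXPROCS.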
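Setting it in init() is fine if the value is correct; the bug is reading NumCPU rather than the cgroup quota. Before Go 1.25 the default GOMAXPROCS (when unset) is NumCPU, which has the same problem, so containerized workloads on those versions should always override it. From Go 1.25 on, an explicit runtime.GOMAXPROCS(runtime.NumCPU()) is worse than doing nothing, because it overrides the cgroup-aware default.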
Bug 13 — runtime.GC() then expecting RSS to drop instantly
import "runtime"
func releaseMemory() {
runtime.GC() // BUG: doesn't return memory to OS
// expected: ps shows lower RSS
// actual: ps shows same RSS
}
Observed behavior: A monitoring script triggers releaseMemory when a job finishes; immediately after, ps -o rss shows no change. Engineer concludes "GC is broken" and adds more aggressive calls, none of which help.
Hint
What does `runtime.GC()` do that's different from returning memory to the OS? What's the difference between heap-free spans and unmapped memory? Read `runtime/mheap.go`'s `scavenge` and `runtime/debug.FreeOSMemory`.
Diagnosis: Two distinct steps:
- runtime.GC() runs a mark-and-sweep, marking now-unreachable objects as free. The memory is returned to the free spans of the runtime's heap allocator, not to the OS.
- The scavenger (runtime/mgcscavenge.go's bgscavenge) gradually returns unused heap spans to the OS via madvise(MADV_DONTNEED) (Linux) or MADV_FREE. The scavenger runs in the background, paced to the runtime's memory-limit signal, and may take seconds to minutes to release a large chunk.
So after runtime.GC(), the Go heap has free space, but the kernel still accounts the pages against your process's RSS until the scavenger releases them and the kernel reclaims them.
To force immediate release, you need both steps: GC to free the heap objects, then debug.FreeOSMemory to run the scavenger synchronously. From runtime/debug/garbage.go:
// FreeOSMemory forces a garbage collection followed by an
// attempt to return as much memory to the operating system as possible.
func FreeOSMemory() {
freeOSMemory()
}
Internally freeOSMemory runs a full GC and then scavenges the entire heap synchronously.
Even after FreeOSMemory, RSS may not drop instantly depending on the platform:
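- Linux MADV_DONTNEED: the kernel drops the pages immediately and supplies zeroed pages on the next access; RSS drops in top right away on most kernels.
- Linux MADV_FREE: the kernel keeps the pages mapped but lazily reclaimable; RSS only drops under memory pressure, so ps still shows the high number. Go used MADV_FREE by default on Linux 4.5+ from Go 1.12 through Go 1.15.
Since Go 1.16 the runtime defaults to MADV_DONTNEED again. On Go 1.12 to 1.15, set GODEBUG=madvdontneed=1 to get the RSS-visible behavior.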
Fix: Call debug.FreeOSMemory, which already includes the GC, and understand the platform:
import "runtime/debug"
func releaseMemory() {
debug.FreeOSMemory() // forces a GC, then scavenges
}
Even better: don't try to manage RSS manually. Set debug.SetMemoryLimit and let the runtime keep the process within budget automatically. Manual FreeOSMemory is a niche tool for batch jobs that have a clear "now is a good time to give back" moment.
Bug 14 — runtime/pprof exposed on a public port without auth¶
import (
	"net/http"
	_ "net/http/pprof" // BUG: registers handlers on DefaultServeMux
)
func main() {
	go func() {
		// expose admin port for ops
		http.ListenAndServe(":6060", nil) // BUG: 0.0.0.0, no auth
	}()
	runApplication()
}
Observed behavior: Service runs fine for weeks. A security scan reports that port :6060 is exposed to the internet (LB misconfig). An attacker hits /debug/pprof/heap and downloads a heap profile containing the stack trace, function names, and file paths of every sampled allocation site: effectively a map of the program's code and its recent activity. The profile does not carry the contents of the allocated memory, but the structure alone is valuable to an attacker. Worse, /debug/pprof/profile?seconds=300 keeps the CPU profiler sampling for 5 minutes per request and holds a connection open for the duration, which makes it a cheap DoS lever.
Hint
What does `import _ "net/http/pprof"` actually do at init? Where does it register handlers? What's reachable at `/debug/pprof/`?
Diagnosis: Importing net/http/pprof for its side effect registers debugging handlers on http.DefaultServeMux. From net/http/pprof/pprof.go:
func init() {
	http.HandleFunc("/debug/pprof/", Index)
	http.HandleFunc("/debug/pprof/cmdline", Cmdline)
	http.HandleFunc("/debug/pprof/profile", Profile)
	http.HandleFunc("/debug/pprof/symbol", Symbol)
	http.HandleFunc("/debug/pprof/trace", Trace)
}
If anywhere in the program http.ListenAndServe(addr, nil) is called with nil (meaning DefaultServeMux) on a public interface, the pprof endpoints are live for the world. The exposure:
- /debug/pprof/heap: heap profile with function/file/line for every sampled allocation site. Reveals code structure, dependency paths and versions, and allocation volumes.
- /debug/pprof/goroutine?debug=2: full goroutine stacks. Reveals every concurrent operation in progress, including which handlers are mid-request and their arguments printed as raw words.
- /debug/pprof/profile?seconds=N: starts a CPU profile for N seconds. An attacker can request seconds=3600 back to back, keeping the SIGPROF sampler running for hours and eating real CPU.
- /debug/pprof/cmdline: full os.Args. Often contains config paths, secrets accidentally passed on the command line, or hostnames.
- /debug/pprof/symbol: symbolisation oracle that turns PC values into function names.
These are not "debug info"; they are a forensic surface and a DoS amplifier.
Fix: Bind to localhost, or use a separate mux that is reachable only from inside the cluster:
import (
	"log"
	"net/http"
	"net/http/pprof"
)
func main() {
	// private mux: only the pprof handlers registered here are served
	pprofMux := http.NewServeMux()
	pprofMux.HandleFunc("/debug/pprof/", pprof.Index)
	pprofMux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	pprofMux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	pprofMux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	pprofMux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	go func() {
		// localhost-only; reached via kubectl port-forward or SSH tunnel
		log.Fatal(http.ListenAndServe("127.0.0.1:6060", pprofMux))
	}()
	// public API on a separate mux with its own auth
	apiMux := http.NewServeMux()
	apiMux.HandleFunc("/v1/", apiHandler)
	log.Fatal(http.ListenAndServe(":8080", apiMux))
}
Belt and braces: put pprof behind authentication middleware (mTLS, bearer token from a vault, SPIFFE). Never rely on "we'll remember to firewall it"; the default should be unreachable.
Also: never use http.DefaultServeMux for the public listener. Once any dependency imports net/http/pprof (or expvar), DefaultServeMux is contaminated. Always create your own *http.ServeMux.
Summary¶
These bugs cluster into four families.
Lifecycle and timing (1, 2, 3, 7, 9): finalizers that never fire because they pin their own object; finalizers used in place of explicit Close; OS threads burned when locked goroutines exit; CPU profiles that grow forever because Stop was forgotten; traces double-started with the error swallowed. The runtime package brokers between you and the GC/scheduler/profiler — every "start" needs a matching "stop", and every callback runs on the runtime's timeline, not yours.
Cost not visible at the call site (4, 6, 10): runtime.GC() invoked manually; ReadMemStats polled every 100ms; block/mutex profile rate set to 1. These look like single function calls; each one is a STW or per-event hot-path tax. Read the implementation in runtime/mgc.go, runtime/mstats.go, runtime/mprof.go once and you stop reaching for them.
Pointer safety and platform reality (5, 12, 13): unsafe.Pointer → uintptr for syscalls without KeepAlive; NumCPU instead of the cgroup quota; GC() instead of FreeOSMemory plus scavenger reality. The runtime's contract is precise; the OS's contract is messier; gaps between them are where bugs live.
Operational exposure (8, 11, 14): NumGoroutine in init returning a meaningless baseline; Stack(buf, true) silently truncating; pprof on a public port. These don't crash anything — they quietly lie about the system or hand it to whoever asks.
Review checklist for any PR touching runtime, runtime/debug, runtime/pprof, or runtime/trace:
- Every SetFinalizer closure uses the parameter, never the outer variable. The function does not reference the object it's attached to, directly or transitively.
- No finalizer is the primary mechanism for releasing a non-heap resource (FDs, sockets, kernel handles, locks). Explicit Close/defer at the call site, finalizer only as backstop.
- Every runtime.LockOSThread is paired with runtime.UnlockOSThread, ideally via defer. Goroutines that exit while locked do so deliberately, with comments explaining why the M should be destroyed.
- No runtime.GC() calls outside benchmarks and tests. Memory pressure tuned via GOGC or debug.SetMemoryLimit.
- After unsafe.Pointer is converted to uintptr for a syscall (or any non-Go consumer), runtime.KeepAlive(holder) follows the call to anchor the lifetime.
- runtime.ReadMemStats is not called in a hot loop. High-frequency observability uses runtime/metrics instead.
- Every pprof.StartCPUProfile is paired with defer pprof.StopCPUProfile(); every trace.Start is paired with trace.Stop and the Start error is checked.
- runtime.NumGoroutine baselines are taken after main is running and the runtime is warm, not in init.
- runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction are off by default; turned on with values appropriate to the host (rate=10000 for block, fraction=100 for mutex are reasonable starting points) and behind a flag.
- runtime.Stack(buf, true) uses a growable buffer (start ≥1 MiB, double on overflow), or routes through pprof.Lookup("goroutine").WriteTo(w, 2).
- GOMAXPROCS in containerised deployments reads the cgroup quota (automaxprocs or equivalent), not runtime.NumCPU.
- Code that wants RSS to drop calls debug.FreeOSMemory(), not just runtime.GC(), and understands the platform-specific MADV_FREE and MADV_DONTNEED semantics.
- net/http/pprof is bound to localhost or behind authentication. The public listener uses a private *http.ServeMux, never DefaultServeMux.