Runtime Goroutine Management — Middle Level¶
Table of Contents¶
- Introduction
- GOMAXPROCS in Production
- GOMEMLIMIT and SetMemoryLimit
- GOGC and SetGCPercent
- SetMaxStack and SetMaxThreads as Safety Nets
- Stack Trace Discipline
- LockOSThread Patterns
- Adopting runtime/metrics
- Profile and Trace Sampling Rates
- GODEBUG Companion Knobs
- Common Anti-Patterns
- Diagnostics Playbook
- Self-Assessment
- Summary
Introduction¶
At junior level you learned which APIs exist. At middle level you apply them in production: containers, capacity planning, leak hunts, OOM avoidance, and pprof-driven debugging. The mindset shifts from "what does this function do" to "given a Kubernetes pod with 2 CPU, 1 GB memory, and 30k req/s, which runtime settings are correct?"
After this file you will:
- Choose GOMAXPROCS correctly for containers, NUMA hosts, and mixed workloads.
- Set GOMEMLIMIT to prevent OOM kills without crippling throughput.
- Tune GOGC for the right CPU/memory tradeoff.
- Use SetMaxStack and SetMaxThreads as production safety nets.
- Take diagnostic dumps without surprising production traffic.
- Integrate runtime/metrics into Prometheus.
- Recognise the common anti-patterns: Gosched sprinkles, manual GC calls, library-set GOMAXPROCS.
For internals — how the scheduler actually consumes GOMAXPROCS, how SetMemoryLimit interacts with the pacer — see the professional file and the scheduler/GC sections. Cross-reference to 02-vs-os-threads for the M:N model that GOMAXPROCS directly tunes.
GOMAXPROCS in Production¶
What it actually controls¶
GOMAXPROCS is the number of P (processor) contexts the Go scheduler creates. Each P is the resource an M (OS thread) must hold to execute Go code. Net effect: at most GOMAXPROCS goroutines run user code simultaneously. Other Ms exist for blocking syscalls and cgo, so the OS thread count exceeds GOMAXPROCS under I/O load.
Default behaviour¶
- Before Go 1.5: default 1 (single-core).
- Go 1.5 to 1.24: default runtime.NumCPU(). In a container limited to 2 CPUs via cpu.cfs_quota_us, this still returned the host's core count.
- Go 1.25+: default reads the cgroup CPU quota when present and rounds up.
If you cannot rely on 1.25+, you must set it yourself. The standard library does not.
The container problem¶
A Kubernetes pod with resources.limits.cpu: 2 runs on a 64-core host. Before Go 1.25, runtime.NumCPU() returned 64. The scheduler created 64 Ps, each spinning trying to run goroutines, but the kernel only let 2 cores actually progress. Effects:
- Heavy lock contention on internal scheduler structures.
- GC workers oversubscribing CPU.
- p99 latency spikes from CPU throttling.
- Higher kernel CPU usage (scheduling, throttling).
The fix: import a package that sets GOMAXPROCS from the cgroup quota at startup, such as go.uber.org/automaxprocs:
import _ "go.uber.org/automaxprocs"
Or manually:
quota := readCPUQuota("/sys/fs/cgroup/cpu.max") // implement per platform
runtime.GOMAXPROCS(int(math.Ceil(quota)))
Or via env: GOMAXPROCS=2 ./server.
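A minimal sketch of the quota-reading helper for cgroup v2, where cpu.max contains "&lt;quota&gt; &lt;period&gt;" or "max &lt;period&gt;". The function name and the v2-only path are illustrative; cgroup v1 splits the quota and period across separate files, so a real implementation needs both paths:

```go
package main

import (
	"fmt"
	"math"
	"os"
	"strconv"
	"strings"
)

// parseCPUMax converts the contents of cgroup v2's cpu.max file into a
// CPU count. It returns 0 when no quota is set ("max") or on bad input.
func parseCPUMax(contents string) float64 {
	fields := strings.Fields(contents)
	if len(fields) != 2 || fields[0] == "max" {
		return 0 // no quota: caller should fall back to runtime.NumCPU()
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || period <= 0 {
		return 0
	}
	return quota / period
}

func main() {
	raw, err := os.ReadFile("/sys/fs/cgroup/cpu.max") // cgroup v2 only
	if err != nil {
		return // not a cgroup v2 container; keep the default
	}
	if q := parseCPUMax(string(raw)); q > 0 {
		// Round up so a 1.5-CPU quota still gets 2 Ps.
		fmt.Println("suggested GOMAXPROCS:", int(math.Ceil(q)))
	}
}
```

The rounding direction matters: rounding down a 1.5-CPU quota to 1 P leaves half a core unused, while rounding up accepts a little throttling in exchange for using the whole allowance.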
When to deviate from NumCPU¶
- Containerised, no auto-detection: set to the container's CPU quota.
- Cgo-heavy workload: sometimes lower; cgo holds Ms, and oversubscription thrashes.
- Mixed I/O and CPU: leave at
NumCPU. Blocking I/O parks goroutines without consuming a P. - Single-core box, CPU-bound: leave at 1. Extra Ps do nothing useful.
- Diagnosing a race: set to 1 to serialize goroutines. Useful for race reproduction; do not ship.
What GOMAXPROCS is not¶
- Not a hard cap on OS threads. Use
SetMaxThreadsfor that. - Not a goroutine cap. Spawn as many as you like.
- Not a guarantee of parallelism if the OS does not give you the cores.
Reading GOMAXPROCS from env¶
The env var is parsed once at startup. To change GOMAXPROCS afterwards you must call runtime.GOMAXPROCS(n); changing the environment variable mid-process has no effect.
GOMEMLIMIT and SetMemoryLimit¶
What it does¶
debug.SetMemoryLimit(bytes int64) int64 (Go 1.19+) sets a soft memory cap. The GC pacer uses the cap as a target: as live memory + stacks + reserved memory approaches the cap, the runtime triggers GC more aggressively. The "soft" qualifier means the runtime cannot stop user code mid-allocation; under burst allocation, you may briefly exceed the cap.
The environment variable equivalent: GOMEMLIMIT=900MiB.
Why you should set it in containers¶
The default GC trigger is purely heap-growth based (GOGC=100, GC when heap doubles since last live). It has no awareness of "I am running in a 1 GB container." Sequence of events without GOMEMLIMIT:
- Heap grows to 600 MB.
- Working set lives there for a while.
- A burst happens. Heap grows to 1100 MB before next GC trigger.
- Kernel OOM kills the process.
With GOMEMLIMIT=900MiB:
- Heap grows to 600 MB.
- As live + stacks approach 900 MB, GC frequency rises.
- The runtime may GC every few allocations to stay under the limit.
- Throughput degrades but the process survives.
Net: GOMEMLIMIT trades CPU for memory in a controlled way.
How to choose the value¶
- Set below the hard cap. A common heuristic is 90% of the container's memory limit, leaving headroom for stacks, runtime overhead, and burst.
- Subtract a guess for non-heap memory: goroutine stacks (~8 KB per active goroutine), runtime metadata, scratch space.
- If your workload's allocation rate is bursty, leave more headroom. If it is steady, you can be tighter.
// In init() or main, before serving traffic.
func init() {
const containerMem = 1024 << 20 // 1 GiB
debug.SetMemoryLimit(int64(containerMem) * 9 / 10)
}
Pathological case: GC death spiral¶
If GOMEMLIMIT is set too tight relative to working-set, the runtime enters a GC death spiral: it spends nearly all CPU running GC, throughput collapses, requests time out, the working set never shrinks. The runtime caps GC CPU usage at ~50% to mitigate this, but the symptom is still real.
Signs:
- /cpu/classes/gc/total:cpu-seconds rising sharply.
- /gc/pauses:seconds showing many small pauses.
- p99 latency spiking even though the program "is not OOM."
Fix: raise GOMEMLIMIT or reduce allocation rate.
Disabling¶
debug.SetMemoryLimit(math.MaxInt64) removes the cap. Same as not setting it.
GOGC and SetGCPercent¶
What it does¶
debug.SetGCPercent(p int) sets the GC trigger ratio. With GOGC=100 (the default), the GC starts when the new heap is 100% larger than the live heap at the end of the previous GC. GOGC=50 triggers when the heap grows by 50%; GOGC=200 when it grows by 200%, i.e. triples relative to the live set.
Special values:
- 0: maximally aggressive. The heap goal equals the live heap, so GC runs almost back-to-back.
- A negative value (or GOGC=off): disables automatic GC entirely. Used in go test -bench setups, never in production.
CPU/memory tradeoff¶
| GOGC | Behaviour |
|---|---|
| 25 | Aggressive GC. Lower peak heap. ~2–4× more GC CPU. |
| 100 | Default. Heap roughly doubles between GCs. |
| 200 | Lazier GC. Higher peak heap. ~30% less GC CPU. |
| 500 | Very lazy. Useful when memory is cheap and CPU is the bottleneck. |
| off (-1) | No automatic GC. Memory grows unbounded. |
Interaction with GOMEMLIMIT¶
When GOMEMLIMIT is set, the runtime uses both triggers: GC runs at GOGC ratio or when approaching the memory limit, whichever comes first. In practice, once you set GOMEMLIMIT, GOGC matters most when memory is abundant.
When to tune¶
- Latency-sensitive service with abundant memory: raise GOGC to 200 or 300 to reduce GC frequency.
- Memory-tight service: lower GOGC to 50 to keep the heap smaller.
- Throughput-focused batch job: leave at 100.
Always measure before and after with runtime/metrics or pprof.
SetMaxStack and SetMaxThreads as Safety Nets¶
SetMaxStack¶
Default: 1 GB on 64-bit platforms. Exceeding the cap is a fatal error, not a recoverable panic: the runtime prints runtime: goroutine stack exceeds N-byte limit and kills the whole process.
Reasons to lower:
- Catch infinite recursion faster (1 MB is plenty for normal code).
- Container with limited memory — a single runaway goroutine eating 1 GB is wasteful.
- Compliance / testing environments where stack-overflow detection should be early.
Reasons to raise: very deep recursive algorithms (you should usually iterativise instead).
SetMaxThreads¶
Default: 10 000. Exceeding the limit crashes the whole process with a fatal error, not a panic. There is no recovery.
Reasons to lower:
- Defend against cgo-induced thread explosions. A cgo call holds an M; many concurrent cgo calls create many Ms.
- Crash on bug rather than thrash the host.
- Make thread limits explicit in capacity planning.
Reasons to keep the default: most pure-Go workloads never approach it.
Putting them together¶
func init() {
debug.SetMaxStack(8 << 20) // 8 MB; catch runaway recursion early
debug.SetMaxThreads(2000) // crash if cgo storms
}
Add a comment explaining why these are set. Future readers will not know.
Stack Trace Discipline¶
When to call which¶
| API | When to use |
|---|---|
| debug.Stack() | In a panic handler. Returns the current goroutine's stack. |
| debug.PrintStack() | Quick local debug. Writes to stderr. |
| runtime.Stack(buf, false) | Same as debug.Stack but with a caller-supplied buffer. |
| runtime.Stack(buf, true) | "Show me everything." Stops the world. |
| SIGQUIT (kill -3) | Built-in: dumps every goroutine's stack and exits. |
| pprof.Lookup("goroutine").WriteTo(w, n) | Structured dump via pprof. Does not stop the world for long. |
The pprof.Lookup alternative¶
import "runtime/pprof"
func dump(w io.Writer) error {
return pprof.Lookup("goroutine").WriteTo(w, 2)
}
debug verbosity:
- 0: protobuf format (for go tool pprof).
- 1: human-readable, counts only.
- 2: human-readable, full stacks (like kill -3).
pprof's debug=1 dump is less disruptive than runtime.Stack(_, true) because it aggregates goroutines by creation site instead of printing every stack — but for "every goroutine, full stack" (debug=2) the cost is similar.
Production endpoint¶
mux.HandleFunc("/debug/stacks", func(w http.ResponseWriter, r *http.Request) {
pprof.Lookup("goroutine").WriteTo(w, 2)
})
Behind authentication. Always.
Buffer sizing for runtime.Stack¶
Naive code uses a 64 KB buffer and silently truncates. Correct pattern:
func allStacks() []byte {
n := 64 << 10
for {
buf := make([]byte, n)
got := runtime.Stack(buf, true)
if got < n {
return buf[:got]
}
n *= 2
if n > 64<<20 {
return buf // give up; 64 MB of stacks is already absurd
}
}
}
A server with 50 000 goroutines may need several MB of buffer. Do not hardcode a single size.
LockOSThread Patterns¶
Full coverage: 02-vs-os-threads. Summary at middle level:
Use cases¶
- Calling C libraries that use TLS (OpenGL, GTK, some OpenSSL paths).
- Setting a signal mask on the current thread.
- Setting CPU affinity on the current thread.
- Setting Linux capabilities or namespaces (these are thread-local on Linux).
Reference-counting since Go 1.10¶
runtime.LockOSThread() // count = 1
runtime.LockOSThread() // count = 2
runtime.UnlockOSThread() // count = 1
runtime.UnlockOSThread() // count = 0; goroutine unpinned
Mismatch (more Unlocks than Locks) panics.
Idiom: lock-as-init¶
func initOpenGLThread() {
runtime.LockOSThread()
C.gladLoadGL()
// never UnlockOSThread; this goroutine is the GL thread for life
}
If a goroutine exits while still locked, the runtime destroys its OS thread instead of returning it to the pool. This is a safety net: it prevents polluted thread-local state from leaking into other goroutines.
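When many goroutines need a thread-affine API, the usual pattern is to dedicate one locked goroutine to it and funnel work through a channel. A sketch under stated assumptions — the work type and function names are illustrative, and the pinned function would normally be a TLS-using C call rather than plain Go:

```go
package main

import (
	"fmt"
	"runtime"
)

// threadWork is a unit of work sent to the dedicated thread.
type threadWork struct {
	fn   func()
	done chan struct{}
}

var workCh = make(chan threadWork)

// startDedicatedThread pins one goroutine to its OS thread for life.
// Every thread-affine call is executed on this thread.
func startDedicatedThread() {
	go func() {
		runtime.LockOSThread() // never unlocked: this thread belongs to us
		for w := range workCh {
			w.fn()
			close(w.done)
		}
	}()
}

// runOnThread executes fn on the dedicated thread and waits for it.
func runOnThread(fn func()) {
	w := threadWork{fn: fn, done: make(chan struct{})}
	workCh <- w
	<-w.done
}

func main() {
	startDedicatedThread()
	runOnThread(func() { fmt.Println("running on the pinned thread") })
}
```

All callers serialize through one thread, which is exactly what TLS-based C libraries require; parallelism, if needed, has to come from multiple dedicated threads, each with its own channel.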
Anti-pattern: lock for performance¶
A common bad idea: "if I lock to a thread, my goroutine will be faster because no migration." False. Scheduler migration costs well under a microsecond, and any cache-locality benefit is usually smaller than the cost of waiting for the pinned thread to become free.
Adopting runtime/metrics¶
Why over MemStats¶
runtime.ReadMemStats returns a 30-field struct of mostly memory data, and it briefly stops the world. runtime/metrics:
- Stops nothing (most metrics are continuously updated by the runtime).
- Typed (KindFloat64, KindUint64, KindFloat64Histogram).
- Versioned: stable names like /sched/goroutines:goroutines.
- Includes scheduler latencies and GC pause histograms that MemStats lacks.
Reading¶
import "runtime/metrics"
var (
goroutines metrics.Sample
heapBytes metrics.Sample
gcCPU metrics.Sample
schedLat metrics.Sample
)
func init() {
goroutines.Name = "/sched/goroutines:goroutines"
heapBytes.Name = "/memory/classes/heap/objects:bytes"
gcCPU.Name = "/cpu/classes/gc/total:cpu-seconds"
schedLat.Name = "/sched/latencies:seconds"
}
func scrape() {
samples := []metrics.Sample{goroutines, heapBytes, gcCPU, schedLat}
metrics.Read(samples)
for _, s := range samples {
switch s.Value.Kind() {
case metrics.KindUint64:
log.Printf("%s = %d", s.Name, s.Value.Uint64())
case metrics.KindFloat64:
log.Printf("%s = %f", s.Name, s.Value.Float64())
case metrics.KindFloat64Histogram:
h := s.Value.Float64Histogram()
log.Printf("%s p99 = %f", s.Name, percentile(h, 0.99))
}
}
}
Useful metric names¶
| Name | Meaning |
|---|---|
| /sched/goroutines:goroutines | Live goroutine count (cheaper than NumGoroutine). |
| /sched/latencies:seconds | Histogram of "time goroutine was runnable but not running." |
| /memory/classes/heap/objects:bytes | Live heap allocations. |
| /memory/classes/heap/free:bytes | Free space in heap not yet returned to OS. |
| /memory/classes/total:bytes | All memory the runtime tracks. |
| /gc/heap/allocs:bytes | Cumulative bytes allocated (total). |
| /gc/pauses:seconds | Histogram of GC stop-the-world pause durations. |
| /gc/cycles/total:gc-cycles | Number of completed GC cycles. |
| /cpu/classes/gc/total:cpu-seconds | CPU spent in GC. |
| /cpu/classes/scavenge/total:cpu-seconds | CPU spent returning memory to OS. |
Full list: metrics.All().
Histograms¶
KindFloat64Histogram gives you Counts []uint64 and Buckets []float64. You sum counts to get totals, walk buckets to compute percentiles. Helper:
func percentile(h *metrics.Float64Histogram, p float64) float64 {
var total uint64
for _, c := range h.Counts {
total += c
}
target := uint64(float64(total) * p)
var sum uint64
for i, c := range h.Counts {
sum += c
if sum >= target {
return h.Buckets[i+1]
}
}
return h.Buckets[len(h.Buckets)-1]
}
Prometheus integration¶
Use prometheus/collectors.NewGoCollector(collectors.WithGoCollections(collectors.GoRuntimeMetricsCollection)). It exposes the entire runtime/metrics catalog as Prometheus metrics with go_ prefix. One line of setup, dozens of free dashboards.
Profile and Trace Sampling Rates¶
Mutex profile¶
runtime.SetMutexProfileFraction(n) controls sampling: 0 disables (the default); 1 records every contention event (high overhead); 5 samples one event in five (20%), a reasonable production setting.
Read with pprof.Lookup("mutex").WriteTo(w, 0).
Block profile¶
runtime.SetBlockProfileRate(rate) sets the rate at which blocking events are recorded, sampled by blocked nanoseconds: 0 disables (the default); 1 records every event (high overhead); 10000 records roughly one event per 10 µs of blocking time.
Read with pprof.Lookup("block").WriteTo(w, 0).
Both default off¶
These profiles are opt-in because they add overhead per channel-op / mutex-op. Turn them on for a few minutes during an investigation, then turn them back off.
Goroutine profile¶
Always on. Read with pprof.Lookup("goroutine").WriteTo(w, 0). Gives a profile of goroutine creation sites and their counts — perfect for leak hunts.
Heap profile¶
Always on. pprof.Lookup("heap").WriteTo(w, 0). Reports current and total allocations.
runtime/trace¶
import "runtime/trace"
f, _ := os.Create("trace.out")
trace.Start(f)
// run workload
trace.Stop()
f.Close()
Then go tool trace trace.out opens an interactive UI showing every goroutine state transition, every GC event, every syscall. Heaviest profiling tool Go has. Run for a few seconds at most.
Production tip: expose via HTTP for on-demand capture, behind authentication.
mux.HandleFunc("/debug/trace", func(w http.ResponseWriter, r *http.Request) {
sec, _ := strconv.Atoi(r.URL.Query().Get("seconds"))
if sec <= 0 { sec = 5 }
trace.Start(w)
time.Sleep(time.Duration(sec) * time.Second)
trace.Stop()
})
GODEBUG Companion Knobs¶
Some runtime knobs are env-only, no API equivalent. Notable ones:
| GODEBUG setting | Effect |
|---|---|
| gctrace=1 | Print one line per GC cycle to stderr. Cheap, informative. |
| schedtrace=1000 | Every 1000 ms, print scheduler state (Ps, Ms, queues). |
| scheddetail=1 | Combined with schedtrace, also print per-goroutine state. |
| allocfreetrace=1 | Trace every allocation/free. Use only for tiny test programs. |
| madvdontneed=1 | Use MADV_DONTNEED instead of MADV_FREE on Linux. Useful when monitoring tools only see RSS. |
| asyncpreemptoff=1 | Disable async preemption. Diagnostic only — your program may starve. |
| cgocheck=2 | Heavy cgo argument checking. Slow but catches pointer-passing bugs. |
| inittrace=1 | Print init function timing. |
| panicnil=1 | Allow panic(nil) (legacy compat). |
You set them via env: GODEBUG=gctrace=1,schedtrace=1000 ./server.
Common Anti-Patterns¶
Anti-pattern: runtime.Gosched in hot loops¶
Each call forces a scheduler round-trip per iteration. Async preemption (Go 1.14+) already interrupts long-running loops after ~10 ms; manual yields just add overhead. Remove them.
Anti-pattern: manual runtime.GC() calls¶
// WRONG
func handleRequest(req Request) Response {
runtime.GC() // "clean up before responding"
return process(req)
}
A 5–50 ms forced pause per request, with no long-term memory improvement. The runtime would have GC'd at a better moment.
The legitimate uses of runtime.GC():
- Microbenchmarks: force a clean baseline before measurement.
- Tests that check finalizer behaviour.
- After known massive deallocations in a startup hook, if you can prove it reduces future GC pressure.
Anti-pattern: runtime.GOMAXPROCS in a library¶
A library has no business setting a process-global tunable. The application controls GOMAXPROCS; a library cannot know how many cores are available for its share.
Anti-pattern: SetMaxThreads(N) "to limit thread usage"¶
SetMaxThreads is a kill-switch. Hitting it terminates the program. Use it for safety nets, not as a normal-operations limit. If your service routinely needs more threads, raise the limit, don't set it tight.
Anti-pattern: finalizers for resource cleanup¶
// WRONG
type FileWrapper struct{ f *os.File }
func NewFileWrapper(path string) *FileWrapper {
f, _ := os.Open(path)
fw := &FileWrapper{f: f}
runtime.SetFinalizer(fw, func(fw *FileWrapper) { fw.f.Close() })
return fw
}
The file may stay open arbitrarily long. Use explicit Close() with defer. Reserve finalizers for the standard library's last-resort safety net pattern (and even there, the std lib is moving away from them).
Anti-pattern: LockOSThread for "performance"¶
Locking does not improve cache locality enough to matter for typical Go code. It does block scheduler migration, which can hurt throughput in mixed workloads. Lock only when there is a thread-affine API (TLS, signals) that requires it.
Diagnostics Playbook¶
"My goroutine count is rising"¶
- Run pprof.Lookup("goroutine").WriteTo(w, 1) to get a count-by-creation-site dump.
- Identify the call site with the highest count. That is your leak.
- Check: does the goroutine block on a channel that is never closed? On a context.Context that is never cancelled? On a network read with no timeout?
- Fix the exit condition. Run again. The count should plateau.
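A lightweight guard that catches leaks before they reach production: assert in tests that the goroutine count settles back to its baseline after a workload. A sketch (the helper name is illustrative; purpose-built detectors such as goleak do this more carefully):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// checkNoLeak runs fn and reports whether the goroutine count
// returned to (at most) the pre-fn baseline within a grace period.
func checkNoLeak(fn func()) bool {
	before := runtime.NumGoroutine()
	fn()
	// Exiting goroutines are reaped asynchronously; poll briefly.
	for i := 0; i < 50; i++ {
		if runtime.NumGoroutine() <= before {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}

func main() {
	ok := checkNoLeak(func() {
		done := make(chan struct{})
		go func() { close(done) }()
		<-done // the spawned goroutine exits before we return
	})
	fmt.Println("no leak:", ok)
}
```

NumGoroutine gives only a count; when this check fails, the count-by-creation-site dump above tells you which goroutines stayed behind.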
"My memory is rising"¶
- Capture two heap profiles 60 seconds apart: go tool pprof http://host/debug/pprof/heap.
- Compare with top -diff_base: which allocation site grew?
- Distinguish: more live objects vs slower GC.
- Inspect /memory/classes/heap/free:bytes — if huge, the runtime is just not releasing memory to the OS yet. Set GOMEMLIMIT or wait for the scavenger.
"My latency p99 is bad"¶
- Read the /sched/latencies:seconds histogram. If p99 is > 1 ms, the scheduler is overloaded — too many runnable goroutines.
- Read /gc/pauses:seconds. High p99 = stop-the-world pauses too long.
- Read /cpu/classes/gc/total:cpu-seconds over time. If GC is consuming > 20% CPU, GC is over-pressured.
- Possible fixes: raise GOMEMLIMIT, raise GOGC, reduce allocation rate via pooling.
"My service was OOM-killed"¶
- Did you set GOMEMLIMIT? If not, set it.
- If yes, check the container's actual memory at kill time: was the limit reasonable?
- Look at /memory/classes/heap/objects:bytes over time. Was there a sustained climb or a burst?
- If burst: investigate the request that caused it. Consider per-request memory budgets.
"My program is stuck"¶
- kill -SIGQUIT <pid> to dump every goroutine's stack. Capture it in logs.
- Look for runtime.gopark and chan send / chan receive patterns — a common deadlock signature.
- Look for runtime.semacquire — mutex deadlock.
- Cross-reference goroutine creation sites: which call introduced the deadlock?
"Sudden thread spike"¶
- Check the pid's thread count: cat /proc/<pid>/status | grep Threads.
- Is the spike correlated with a cgo call site? Each cgo call holds an M.
- Is the spike correlated with a blocking syscall on a non-net fd?
- Mitigate: bound concurrency for the cgo call, or use a netpoller-eligible API.
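Bounding cgo concurrency with a buffered-channel semaphore keeps the M count predictable: at most the semaphore's capacity in extra threads can be parked inside C at once. A sketch (boundedCall stands in for wrapping the real cgo call; the limit is illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// At most maxConcurrent calls run at once, so at most that many
// extra Ms can be blocked inside the C call at any moment.
const maxConcurrent = 8

var cgoSem = make(chan struct{}, maxConcurrent)

// boundedCall acquires a semaphore slot before entering the
// (hypothetical) cgo call and releases it afterwards.
func boundedCall(fn func()) {
	cgoSem <- struct{}{}        // acquire
	defer func() { <-cgoSem }() // release
	fn()
}

func main() {
	var wg sync.WaitGroup
	var inFlight, peak int64
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			boundedCall(func() {
				n := atomic.AddInt64(&inFlight, 1)
				for { // record the high-water mark
					p := atomic.LoadInt64(&peak)
					if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
						break
					}
				}
				time.Sleep(time.Millisecond) // simulated C work
				atomic.AddInt64(&inFlight, -1)
			})
		}()
	}
	wg.Wait()
	fmt.Println("peak in-flight calls:", atomic.LoadInt64(&peak))
}
```

This bounds thread growth at the cost of queueing: under load, callers wait on the semaphore instead of spawning Ms, which is usually the right trade for thread-hungry cgo paths.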
Self-Assessment¶
- I can choose GOMAXPROCS for a Kubernetes pod and justify the number.
- I have set GOMEMLIMIT in at least one production service and verified it prevents OOM.
- I can read a /sched/latencies:seconds histogram and explain what it tells me.
- I have wired runtime/metrics into Prometheus.
- I know which profiles are on by default and which require SetMutexProfileFraction / SetBlockProfileRate.
- I can take a runtime/trace snapshot from a running service.
- I know runtime.Gosched and runtime.GC are anti-patterns in production code.
- I have an HTTP endpoint that dumps all goroutine stacks, behind authentication.
- I can explain why LockOSThread is rare and what its lock count means.
- I can dump and read a goroutine count by creation site.
Summary¶
At middle level the runtime API moves from curiosity to discipline. Three rules cover most production scenarios:
- Set GOMAXPROCS and GOMEMLIMIT for your environment. Containers in particular cannot rely on the defaults before Go 1.25.
- Read, do not write, the diagnostic surface. NumGoroutine, runtime/metrics, and pprof.Lookup should be in every dashboard. GC, Gosched, and Goexit should be in almost no business code.
- Make profiling part of the deployment: a net/http/pprof endpoint, mutex/block rates tuned, pprof.Do wrapping handlers, SIGQUIT working.
The senior file goes deeper into labels, request-scoped profiling, and runtime/trace analysis. The professional file explains how each of these knobs is wired into the runtime internals.