Tuning GOMAXPROCS — Professional Level¶
Table of Contents¶
- Introduction
- runtime.GOMAXPROCS in the Source
- procresize — The STW Path
- What "Stop-the-World" Costs
- Invariants Preserved by procresize
- The allp Slice and Lock Order
- P State Transitions During Resize
- Cgroup Detection in runtime/proc.go
- Comparison With Java ForkJoinPool
- Comparison With Tokio Worker Threads
- Practical Implications
- Self-Assessment
- Summary
Introduction¶
Professional-level treatment of GOMAXPROCS means reading the actual Go runtime code, understanding the STW pause it incurs, and knowing exactly which invariants must hold across a resize. Most engineers will never need this depth — but if you write runtime patches, debug rare scheduler hangs, or design a runtime-level autoscaler, you must know it. The file references functions in runtime/proc.go and assumes you can navigate the runtime source.
runtime.GOMAXPROCS in the Source¶
The public entry point is in src/runtime/debug.go:
// GOMAXPROCS sets the maximum number of CPUs that can be executing
// simultaneously and returns the previous setting. It defaults to
// the value of runtime.NumCPU. If n < 1, it does not change the current setting.
// This call will go away when the scheduler improves.
func GOMAXPROCS(n int) int {
    if GOOS == "wasip1" || GOOS == "js" {
        // ... wasm always single-threaded ...
        return 1
    }
    lock(&sched.lock)
    ret := int(gomaxprocs)
    unlock(&sched.lock)
    if n <= 0 || n == ret {
        return ret
    }
    stopTheWorldGC(stwGOMAXPROCS)
    // newprocs will be processed by startTheWorld
    newprocs = int32(n)
    startTheWorldGC()
    return ret
}
Key observations:
- It locks sched.lock to read gomaxprocs atomically. The read itself is cheap.
- It returns the previous value. Crucial for restore patterns in tests.
- For a no-op set (n == ret), it bails before the STW. Calling runtime.GOMAXPROCS(8) when it is already 8 is free.
- stopTheWorldGC(stwGOMAXPROCS) is the STW request with a reason tag. The runtime tracks STW reasons for traces.
- The actual resize happens in startTheWorld — by the time it returns, newprocs has been applied and Ps have been allocated/freed.
A subtle point: the function comment ends with "This call will go away when the scheduler improves." That comment has been in the file since at least Go 1.5. It is wishful — GOMAXPROCS remains the canonical user-facing knob.
procresize — The STW Path¶
The heavy lifting is in procresize(nprocs int32) *p in runtime/proc.go. Greatly simplified:
func procresize(nprocs int32) *p {
    // Caller must hold sched.lock and be at STW.
    old := gomaxprocs
    if old < 0 || nprocs <= 0 {
        throw("procresize: invalid arg")
    }

    // 1. Update timer of each P to track totals if needed.
    now := nanotime()
    if sched.procresizetime != 0 {
        sched.totaltime += int64(old) * (now - sched.procresizetime)
    }
    sched.procresizetime = now

    // 2. Grow allp if needed. (The real code also resizes the
    // idle-P bitmasks here; elided.)
    if nprocs > int32(len(allp)) {
        lock(&allpLock)
        if nprocs <= int32(cap(allp)) {
            allp = allp[:nprocs]
        } else {
            nallp := make([]*p, nprocs)
            copy(nallp, allp[:cap(allp)])
            allp = nallp
        }
        unlock(&allpLock)
    }

    // 3. Initialize new Ps.
    for i := old; i < nprocs; i++ {
        pp := allp[i]
        if pp == nil {
            pp = new(p)
        }
        pp.init(i)
        atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
    }

    // 4. Free old Ps when shrinking.
    for i := nprocs; i < old; i++ {
        pp := allp[i]
        // Move all runnable goroutines to global runq.
        // Move all timers to other Ps.
        // Move all defer caches, sudog pool, GC work bufs.
        pp.destroy()
        allp[i] = nil
    }

    // 5. Trim allp slice.
    if int32(len(allp)) != nprocs {
        lock(&allpLock)
        allp = allp[:nprocs]
        unlock(&allpLock)
    }

    // 6. Update global gomaxprocs.
    var runnablePs *p
    for i := nprocs - 1; i >= 0; i-- {
        pp := allp[i]
        if _g_.m.p.ptr() == pp {
            continue
        }
        pp.status = _Pidle
        if runqempty(pp) {
            pidleput(pp, now)
        } else {
            pp.m.set(mget())
            pp.link.set(runnablePs)
            runnablePs = pp
        }
    }
    gomaxprocs = nprocs
    return runnablePs
}
What this function does, step by step:
- Captures the current gomaxprocs and validates the new value.
- Grows the allp slice if the new P count is larger than the current capacity.
- Initialises new p structs for indices [old, nprocs). Each p.init(i) sets up the local runqueue, the per-P GC work buffer, the defer pool, and so on.
- Destroys removed Ps if nprocs < old. Their local runqueues are drained into the global runqueue; timers are migrated; caches flushed.
- Trims allp.
- Marks remaining Ps idle and puts them on the idle-P list (or, if they have local work, queues them for wakeup).
- Updates gomaxprocs to the new value.
The function returns a linked list of Ps with runnable work; startTheWorld walks the list and wakes Ms to attach to them.
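The drain in step 4 can be modeled with toy types. These are illustrative stand-ins, not the runtime's real structures:

```go
package main

import "fmt"

// g and pLocal are toy stand-ins for the runtime's g and p.
type g struct{ id int }

type pLocal struct{ runq []*g }

// drainToGlobal models what pp.destroy() does for runnable work:
// append the local runqueue to the global queue so no G is lost,
// then leave the P empty before it is freed.
func drainToGlobal(pp *pLocal, global *[]*g) {
	*global = append(*global, pp.runq...) // every runnable G keeps a home
	pp.runq = nil                         // the destroyed P owns nothing
}

func main() {
	pp := &pLocal{runq: []*g{{1}, {2}, {3}}}
	var global []*g
	drainToGlobal(pp, &global)
	fmt.Println(len(global), len(pp.runq)) // prints "3 0"
}
```

The real destroy path does the same for timers, defer caches, sudog pools, and GC work buffers, each into its own global home.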
What "Stop-the-World" Costs¶
STW means every other goroutine is paused. The runtime calls stopTheWorldGC, which:
- Sets a global "preempt" flag.
- Sends async preemption signals (SIGURG) to every running M.
- Waits for every G to reach a safe point (function preamble, channel op, syscall) and stop.
- Once all Gs are stopped, only the STW caller can run.
procresize then runs without interruption. After it returns, startTheWorld:
- Reverses the stop: clears the preempt flag.
- Wakes Ms that have runnable work.
- Resumes execution of paused Gs.
Cost. STW for GOMAXPROCS is typically dozens of microseconds to a few hundred microseconds on a healthy process. Three things drive cost:
- Time to reach all goroutines. Most goroutines hit a safe point within tens of µs. A goroutine deep in a non-cooperative loop may take longer; async preemption (since 1.14) ensures it does not block STW indefinitely.
- procresize itself. Allocating P structs is fast (~1 µs per P). Destroying Ps requires draining the local runqueue and migrating timers — bounded by P-local state size.
- Wakeup of Ms. After STW ends, Ms must be unparked and attached to Ps. Each M wakeup is ~10 µs.
For a GOMAXPROCS=8 → 16 resize on a quiescent process, expect ~50 µs total pause. Under load with high goroutine count, expect 200–500 µs. Under pathological conditions (very long cgo calls preventing safe-point arrival), the pause can be longer — but async preemption makes this rare.
Implication. Calling runtime.GOMAXPROCS(n) once at startup is invisible. Calling it 100 times per second in production is a 5–50 ms continuous latency penalty.
Invariants Preserved by procresize¶
The function must preserve several invariants. If it violated any, the scheduler would corrupt state.
- Every runnable G has a home. When a P is destroyed, its local runqueue is drained into the global runqueue. No G is lost.
- Every timer fires. Timers attached to a destroyed P are migrated to surviving Ps.
- GC work bufs are flushed. Each P holds a small per-P GC work buffer (gcw). On destroy, it is flushed back to the global GC work pool.
- The mcache is reattached. Each P holds an mcache (allocator local cache). On destroy, it is returned to mheap.
- The defer pool and sudog pool are returned. Per-P pools are merged into global pools.
- allp[0] is never destroyed during a resize. P0 is special; it is preserved as the "anchor" P.
- The calling M's P is preserved. The M running procresize itself must still have a P to return to. The function explicitly excludes _g_.m.p from idle-listing.
- gomaxprocs is updated last. All other state must be consistent before the global value flips, so any concurrent reader of gomaxprocs (post-STW) sees a coherent world.
If any of these invariants is violated, you get races, lost goroutines, or stuck Ps. The Go runtime tests cover most of them, but bugs have shipped in past versions — search the Go issue tracker for "procresize" to see the history.
The allp Slice and Lock Order¶
allp is the slice of all Ps in the runtime. It is read frequently (every scheduler decision) and written rarely (only during procresize).
Lock order: allpLock ranks below sched.lock in the lock-rank hierarchy. Code that takes both must take sched.lock first. Violating this triggers a runtime throw in lock-rank-instrumented builds (GOEXPERIMENT=staticlockranking).
Reading without the lock: allp is also read by scheduler hot paths like findrunnable. Reads use atomic operations and a snapshot of len(allp) at the start of the scan. A resize that grows allp is safe because the new slots are nil-checked. A resize that shrinks is the dangerous case — the scheduler must not race with destruction.
The shrinking protocol:
- procresize is called from STW. No other Gs are running.
- Old slots are nil'd out and then allp is trimmed.
- STW ends. Other Gs resume; they see the trimmed allp.
Because all this happens under STW, the readers do not need to lock — they read a snapshot of len(allp) and trust it. This is why procresize must be STW.
If you ever wonder "why can't GOMAXPROCS be cheap?" — this is why. The allp snapshot protocol depends on STW.
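User code that wants the same lock-free-reader property without an STW typically uses the copy-on-write alternative: writers publish a fresh slice behind an atomic pointer and never mutate a published one, so readers take a single snapshot and iterate without locks. A sketch (names are ours; this is not how the runtime itself does it):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// procWork stands in for allp: an atomically published slice.
var procWork atomic.Pointer[[]int]

// publish installs a new snapshot. Because the old slice is never
// mutated after publication, in-flight readers stay safe.
func publish(s []int) {
	procWork.Store(&s)
}

// sumAll loads the slice header once and iterates it lock-free,
// analogous to how findrunnable scans a stable view of allp.
func sumAll() int {
	p := procWork.Load()
	if p == nil {
		return 0
	}
	sum := 0
	for _, v := range *p {
		sum += v
	}
	return sum
}

func main() {
	publish([]int{1, 2, 3})
	fmt.Println(sumAll()) // prints 6
}
```

The runtime instead mutates allp in place, which is only safe because STW guarantees no reader is mid-scan; that is the trade it makes to avoid reallocating on every read-side structure.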
P State Transitions During Resize¶
Each P has a state machine: _Pidle, _Prunning, _Psyscall, _Pgcstop, _Pdead. During procresize:
- Before STW: Ps are in various states (_Prunning, _Pidle, etc.).
- STW begins: all Ps move to _Pgcstop as their Ms reach safe points.
- procresize runs: shrinking Ps are moved through _Pdead and freed; growing Ps are created in _Pidle.
- STW ends: Ps that have work go to _Prunning (attached to a fresh M); others stay _Pidle.
The _Pdead state is short-lived — a P is only _Pdead between "destruction started" and "memory freed". You will only see it in scheduler traces during a resize.
Cgroup Detection in runtime/proc.go¶
The runtime's cgroup detection is in getCPUCount() (Linux-specific, file runtime/os_linux.go and related). Simplified:
func getCPUCount() int32 {
    // Try cgroup v2 first.
    if n, ok := readCgroupV2CPU(); ok {
        return max(1, n)
    }
    // Fall back to cgroup v1.
    if n, ok := readCgroupV1CPU(); ok {
        return max(1, n)
    }
    // Fall back to sched_getaffinity.
    return int32(numCPUFromAffinity())
}
For cgroup v2:
func readCgroupV2CPU() (int32, bool) {
    // Read /sys/fs/cgroup/cpu.max
    data, err := readFile("/sys/fs/cgroup/cpu.max")
    if err != nil {
        return 0, false
    }
    // Parse "quota period". A quota of "max" means unlimited; it
    // fails the integer parse and so also falls through.
    var quota, period int64
    if _, err := fmt.Sscanf(data, "%d %d", &quota, &period); err != nil {
        return 0, false
    }
    n := (quota + period - 1) / period // ceil
    return int32(n), true
}
(The real code is more careful: it walks /proc/self/mountinfo and /proc/self/cgroup to find the right cgroup subpath, handles edge cases for max, and respects environment overrides.)
The detection runs once at program startup, before main(). It does not re-read the cgroup if quotas are mutated at runtime. If your orchestrator dynamically resizes pod limits, you must restart the process or implement your own re-read.
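A user-level re-read can be sketched by re-implementing the same ceil(quota/period) rule over cpu.max. The helper names here are ours, not the runtime's:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCPUMax applies ceil(quota/period) to the contents of a cgroup v2
// cpu.max file. It returns ok=false for "max" (no quota) or malformed input.
func parseCPUMax(data string) (int, bool) {
	fields := strings.Fields(data)
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.Atoi(fields[0])
	period, err2 := strconv.Atoi(fields[1])
	if err1 != nil || err2 != nil || period <= 0 {
		return 0, false
	}
	return (quota + period - 1) / period, true // ceil
}

func main() {
	// The runtime reads this file once at startup; re-reading it later is
	// how a process can notice a quota that was mutated after start.
	if data, err := os.ReadFile("/sys/fs/cgroup/cpu.max"); err == nil {
		if n, ok := parseCPUMax(string(data)); ok {
			fmt.Println("current cgroup CPU limit:", n)
			return
		}
	}
	fmt.Println("no cgroup v2 CPU quota detected")
}
```

A process that polls this periodically and feeds the result into runtime.GOMAXPROCS is the skeleton of the "own re-read" mentioned above.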
Comparison With Java ForkJoinPool¶
Java's analogue is ForkJoinPool.commonPool(), whose default parallelism is derived from Runtime.availableProcessors() (specifically availableProcessors() - 1, overridable via the java.util.concurrent.ForkJoinPool.common.parallelism system property). The corresponding internal entity is a worker thread; the JVM does not have a separate "processor context" abstraction like Go's P.
Key differences:
- No procresize-style STW. The pool can grow workers on demand without pausing the world. The JVM pays for this by holding a separate work-stealing deque per worker; resizing means allocating a new deque and migrating tasks, but no global pause.
- Multiple pools coexist. A JVM may have a commonPool, a separate ScheduledExecutorService pool, a database connection pool, etc. Go has only one scheduler.
- Cgroup-awareness arrived in JDK 10 / JDK 8u191. Earlier JVMs in containers over-threaded. The -XX:ActiveProcessorCount=N flag overrides detection.
- No equivalent to runtime.GOMAXPROCS(n) at runtime. Pool sizes are typically fixed at creation; dynamic resizing happens at the application level (custom executors).
For a Java engineer reading Go code: think of GOMAXPROCS as "the parallelism of the entire JVM" — there is only one pool, and you cannot create alternatives.
Comparison With Tokio Worker Threads¶
Rust's Tokio runtime is the closest analogue to Go's scheduler. The relevant knob:
worker_threads(n) is Tokio's GOMAXPROCS. Defaults to num_cpus::get().
Differences:
- Cgroup detection varies by version. Older num_cpus releases read only /proc/self/status or CPU affinity; recent versions of the crate (and std::thread::available_parallelism) do consult cgroup CPU quotas on Linux. Verify for your toolchain rather than assuming container-awareness.
- TOKIO_WORKER_THREADS env var — equivalent to the GOMAXPROCS env var.
- No equivalent of procresize. You cannot resize a Tokio runtime after build; create a new runtime instead.
- Threads are real OS threads. Tokio does not have an M/P split — workers are threads. Closer to a thread pool than to Go's M:N scheduler.
For a Rust engineer: Go's runtime is similar to Tokio's multi_thread runtime, with the added flexibility that the P/M split lets the runtime spawn extra threads for blocking calls. Tokio handles this differently via spawn_blocking (delegates to a separate pool).
The trade-off: Tokio is more explicit (you build the runtime; you know what you have) but less adaptive (no syscall handoff equivalent for arbitrary blocking).
Practical Implications¶
Three concrete things to remember when writing low-level Go.
1. Never call runtime.GOMAXPROCS(n) in a hot path. Any call that changes the value incurs an STW. A no-op set (n == current) bails before the STW, but the lock acquisition still costs ~10 ns. Read with runtime.GOMAXPROCS(0) if you need the value frequently.
2. If you build an autoscaler, batch decisions. Adjust at most once every minute or so. Frequent STWs add up.
3. If you need to know whether the runtime is cgroup-aware, log it at startup. Compare runtime.NumCPU() with the cgroup file content. If they match, you are getting container-aware sizing. If not, you may be on an old Go or an unusual sandbox.
func reportSizing() {
    log.Printf("NumCPU=%d GOMAXPROCS=%d", runtime.NumCPU(), runtime.GOMAXPROCS(0))
    if data, err := os.ReadFile("/sys/fs/cgroup/cpu.max"); err == nil {
        log.Printf("cgroup.cpu.max=%s", strings.TrimSpace(string(data)))
    }
}
The diagnostic value of these few log lines exceeds that of almost any other piece of runtime introspection.
Self-Assessment¶
- I can read runtime.GOMAXPROCS in runtime/debug.go and explain it line by line.
- I can describe what procresize does and which invariants it preserves.
- I can quantify STW cost for GOMAXPROCS resizes under typical conditions.
- I know that allp reads in scheduler hot paths rely on STW for safety.
- I can read runtime/os_linux.go and find the cgroup detection routine.
- I can compare Go's GOMAXPROCS to Java's availableProcessors and Tokio's worker_threads.
- I know that Java's ForkJoinPool can resize without STW; Go cannot.
- I have logged the cgroup file content to verify runtime detection.
Summary¶
GOMAXPROCS is, mechanically, a single line in runtime/proc.go that updates the gomaxprocs global. But that update is wrapped in a stop-the-world because the allp slice — the scheduler's index of all processors — is read lock-free in hot paths and relies on STW for its mutation protocol. Every other invariant (runqueue drain, timer migration, mcache flush) follows from that mutation needing to be atomic.
The practical lessons:
- The procresize STW is small (tens to hundreds of microseconds) but real.
- Calling runtime.GOMAXPROCS frequently is a continuous latency penalty.
- Cgroup detection is one-shot at startup; runtime quota changes are not picked up.
- Go's design is comparable to Java's availableProcessors and Tokio's worker_threads, with Go and Java shipping the more container-aware defaults.
The detailed runtime walk is below the surface most engineers ever touch. Knowing it is the difference between "I trust the scheduler" and "I can debug it when it surprises me".