Goroutines — Professional Level (Under the Hood)

Table of Contents

  1. Introduction
  2. The GMP Model
  3. Anatomy of a g Struct
  4. The go Statement at the Assembly Level
  5. Run Queues: Local, Global, and Net Poller
  6. Work-Stealing
  7. Sysmon: the Background Monitor
  8. Asynchronous Preemption (Go 1.14+)
  9. Stack Growth and Shrinking
  10. The Network Poller and Gwaiting
  11. Syscalls and the M-Park Dance
  12. GOMAXPROCS and Pidle
  13. Goroutine Identity and Reuse
  14. Tracing the Scheduler
  15. Limits, Failure Modes, and Defaults
  16. Self-Assessment
  17. Summary

Introduction

Everything you have learned about goroutines so far is true at the level of behaviour. This document explains how the Go runtime makes them work: the scheduler's data structures, the algorithms that move goroutines between threads, the technique used to preempt a tight loop, and the mechanisms that let a million goroutines share a handful of OS threads.

The references throughout point at files in the Go runtime source (src/runtime/), which is by design readable Go code. The runtime is not a black box; you can read it.


The GMP Model

The Go scheduler is a work-stealing scheduler over three abstractions:

Letter   Meaning                                    Lives in
G        Goroutine — the unit of work               runtime.g struct
M        Machine — an OS thread                     runtime.m struct
P        Processor — a logical scheduler context    runtime.p struct

Their relationship:

              Gs (millions)
              ┌──────────────────────────────┐
              │  G G G G G G G G G G G G ... │
              └──────────────────────────────┘
                          ↓ scheduled onto
              ┌──────────────────────────────┐
              │     P     P     P     P      │  <- GOMAXPROCS Ps
              └──────────────────────────────┘
                          ↓ executed by
              ┌──────────────────────────────┐
              │  M  M  M  M  M  M  M  ...    │  <- as many Ms as needed
              └──────────────────────────────┘
                          ↓ run on
              ┌──────────────────────────────┐
              │     OS kernel threads         │
              └──────────────────────────────┘

What is a P?

A P is the scheduler's per-CPU bookkeeping: a local run queue of runnable Gs, a small free list of Gs and stacks, and the scheduler state for one logical processor. By default GOMAXPROCS = NumCPU, so on a 16-core machine there are 16 Ps.

A G can run only when bound to a P, and a P can run only when bound to an M. The triple (G, P, M) is the runtime's atomic unit of execution.
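A quick way to inspect these counts from ordinary code — a minimal sketch using only the public runtime API:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("CPUs (potential parallelism):", runtime.NumCPU())
    fmt.Println("GOMAXPROCS (number of Ps):   ", runtime.GOMAXPROCS(0)) // 0 = query without changing
    fmt.Println("Live goroutines (Gs):        ", runtime.NumGoroutine())
}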

Why Ps exist

Without Ps, the scheduler would need a global lock to dequeue runnable goroutines. Ps give each scheduler "lane" its own runqueue, so most scheduling is contention-free. Ps also make it possible to enforce GOMAXPROCS without counting Ms: there is exactly one P per concurrent execution slot.

Ms come and go

A new M is created when:

  • A G makes a blocking syscall and the P needs another M to keep running.
  • All Ms are busy and there are runnable Gs.

Idle Ms are parked on a list and reused. The runtime does not normally destroy an M once it exists; the main exception is a thread whose goroutine exits while still locked to it via runtime.LockOSThread, which is terminated.


Anatomy of a g Struct

runtime.g (in runtime/runtime2.go) is the in-memory representation of a goroutine. Selected fields:

type g struct {
    stack       stack       // current stack range [stack.lo, stack.hi)
    stackguard0 uintptr     // checked by stack-growth prologue
    m           *m          // current M (or nil if not running)
    sched       gobuf       // saved register state when not running
    atomicstatus uint32     // Grunnable, Grunning, Gwaiting, ...
    goid         int64      // unique ID (not exposed via API)
    waitreason  waitReason  // why is it blocked, if any
    preempt     bool        // request to preempt at next safe point
    parentGoid  int64       // who spawned us (since 1.21)
    // ... ~50 more fields
}

A g is small — roughly a few hundred bytes — separate from its stack. The runtime keeps a free list of g structs and reuses them across goroutines, so the per-goroutine bookkeeping cost amortises near zero.

The goid is intentionally not exposed through the runtime package: it is unstable across versions and would tempt patterns the runtime authors consider harmful.


The go Statement at the Assembly Level

When you write go f(x), the compiler emits a call to runtime.newproc:

// runtime/proc.go (simplified)
func newproc(fn *funcval) {
    gp := getg()
    pc := getcallerpc()
    systemstack(func() {
        newg := newproc1(fn, gp, pc)
        pp := getg().m.p.ptr()  // the current P
        runqput(pp, newg, true) // place on the P's local run queue
        if mainStarted {
            wakep()             // maybe wake another P/M to run it
        }
    })
}

Concretely:

  1. Allocate or recycle a g struct from the P-local free list.
  2. Allocate a 2 KB stack (also from a free list when possible).
  3. Initialise g.sched so that, when this G is dispatched, control jumps to a small assembly trampoline that calls f.
  4. Push the new G onto the current P's local run queue.
  5. If there are idle Ps and we have surplus work, wake one (wakep).

The whole thing runs in a few hundred nanoseconds — far cheaper than pthread_create, which calls into the kernel and allocates a megabyte.
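If you want a rough number for your own hardware, here is a minimal benchmark sketch (the package name is arbitrary); it measures spawn plus a trivial amount of scheduling and synchronisation, so treat the result as an upper bound:

package sched_test

import (
    "sync"
    "testing"
)

// BenchmarkSpawnAndWait times `go` + goroutine exit + WaitGroup signalling.
func BenchmarkSpawnAndWait(b *testing.B) {
    var wg sync.WaitGroup
    for i := 0; i < b.N; i++ {
        wg.Add(1)
        go func() { wg.Done() }()
    }
    wg.Wait()
}

Run it with go test -bench=SpawnAndWait.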

The arguments are a closure

The compiler synthesises a funcval that captures f and the evaluated arguments. That is why go f(getValue()) evaluates getValue() in the parent goroutine before the new G is even allocated.
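A tiny sketch making the evaluation order visible (the helper getValue is made up for illustration):

package main

import "fmt"

func getValue() int {
    fmt.Println("getValue: evaluated in the parent goroutine, at the go statement")
    return 42
}

func main() {
    done := make(chan struct{})
    go func(v int) {
        fmt.Println("child goroutine sees v =", v) // v was captured before the G existed
        close(done)
    }(getValue())
    <-done
}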


Run Queues: Local, Global, and Net Poller

The scheduler chooses the next G to run by looking, in order, at three queues:

1. Local run queue of the current P  (lock-free, ~256 capacity)
2. Global run queue                  (mutex-protected, unbounded)
3. Network poller                    (Gs woken by I/O readiness)

Local run queue

A 256-slot ring buffer per P. Lock-free for the owner P (uses atomics). Pushes and pops are nanosecond-scale.

When the queue overflows, the owner P moves half of it to the global run queue.

Global run queue

A linked list, protected by sched.lock. Used as overflow for local queues and as the seed for new Ps.

To prevent global-queue starvation, the scheduler pulls from the global queue every 61 scheduler ticks (schedtick%61 == 0), even if the local queue is non-empty.

Network poller

The runtime integrates I/O readiness through netpoll, a thin wrapper over epoll (Linux), kqueue (BSD/macOS), or I/O completion ports (Windows). There is no dedicated poller goroutine: Ms call netpoll from the scheduler (findrunnable) and from sysmon. When an FD becomes ready, the runtime moves the parked G back to runnable state. The scheduler treats the net poller as a third source of work.

This is why a Go web server with 100 000 idle WebSocket connections runs on a handful of OS threads: only the connection that just became readable is on a runqueue; the other 99 999 Gs are parked in Gwaiting and consume no scheduling attention until their I/O event fires.
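A sketch that makes the cost of parked goroutines concrete: it parks 100 000 Gs on a channel and reports how much extra memory the process asked the OS for (numbers vary by Go version and platform):

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)

    var started sync.WaitGroup
    block := make(chan struct{}) // never closed, so every goroutine parks
    for i := 0; i < 100_000; i++ {
        started.Add(1)
        go func() {
            started.Done()
            <-block // Gwaiting from here on
        }()
    }
    started.Wait() // all goroutines exist; their stacks are allocated

    runtime.ReadMemStats(&after)
    fmt.Println("goroutines:", runtime.NumGoroutine())
    fmt.Printf("extra memory from the OS: ~%.1f MiB\n", float64(after.Sys-before.Sys)/(1<<20))
}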


Work-Stealing

When a P's local runqueue is empty and the global queue is empty, the P does not idle — it tries to steal work from another P.

// runtime/proc.go (simplified)
for i := 0; i < 4; i++ {
    p2 := randomPotherThanSelf()
    if g := runqsteal(self, p2, ...); g != nil {
        return g
    }
}

The thief takes half of the victim's local queue. Stealing is the load-balancing mechanism: an unevenly distributed workload self-balances within a few microseconds.

If stealing also fails:

  1. Check the network poller for ready Gs.
  2. Pick up idle-priority GC mark work, if any is available.
  3. Park the M.

A parked M is genuinely idle: it is sleeping on a futex/condition variable, costing zero CPU. When new work arrives and a P needs help, wakep wakes the M.
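A sketch that leans on work-stealing: every goroutine below is spawned from main, so all of them initially land on one P's local queue, yet the elapsed time comes out close to total CPU work divided by GOMAXPROCS because the other Ps steal. Run it with GODEBUG=schedtrace=100 to watch the queues drain:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

// spin burns CPU for roughly d without blocking.
func spin(d time.Duration) {
    for start := time.Now(); time.Since(start) < d; {
    }
}

func main() {
    n := runtime.GOMAXPROCS(0)
    var wg sync.WaitGroup
    start := time.Now()
    for i := 0; i < 4*n; i++ { // 4 goroutines per P, ~50 ms of CPU each
        wg.Add(1)
        go func() {
            defer wg.Done()
            spin(50 * time.Millisecond)
        }()
    }
    wg.Wait()
    fmt.Printf("%d Ps, %d goroutines, elapsed %v (ideal ≈ 200ms)\n",
        n, 4*n, time.Since(start).Round(time.Millisecond))
}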


Sysmon: the Background Monitor

sysmon is a special M that runs without a P. It runs forever in runtime.sysmon (in runtime/proc.go). Its job is to do the things the regular scheduler can't:

  • Retake Ps from goroutines that have been running too long. If a G has been running on the same P for more than 10ms (forcePreemptNS), sysmon requests preemption by setting g.preempt and, since Go 1.14, sending a preemption signal to the M.
  • Retake Ps from blocked syscalls. If an M has been in a syscall longer than ~20μs, sysmon hands the P to another M so the runtime can keep scheduling.
  • Force a garbage collection if none has run for too long (the two-minute forcegcperiod), independently of GOGC's heap-growth trigger.
  • Poll the network if it has not been polled for more than 10ms, so I/O readiness is noticed even when every P is busy with CPU-bound work.
  • In older releases, kick the heap scavenger to return unused memory to the OS (this now runs as a dedicated background goroutine).

Sysmon runs at adaptive frequency: as fast as 20 microseconds when the system is busy, slowing to 10 milliseconds when idle.

Sysmon is the reason your CPU-bound goroutines do not starve the scheduler: even if every goroutine is in a tight loop, sysmon eventually tells the runtime to preempt them.


Asynchronous Preemption (Go 1.14+)

Before Go 1.14, preemption was cooperative: the runtime could only preempt a goroutine at function-call boundaries (where the stack-growth check lived). A function with no inner calls — say, a tight loop — could run forever, ignoring preemption requests.

Famous bug:

for { /* no calls */ }

In pre-1.14 Go with GOMAXPROCS=1, this loop blocked GC and every other goroutine. The whole runtime froze.
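A sketch of the same situation on a modern toolchain: on Go 1.14+ the program below terminates because sysmon preempts the hot loop and lets main run; on earlier versions it would hang forever:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1)

    go func() {
        for { // no function calls: no cooperative preemption points
        }
    }()

    time.Sleep(100 * time.Millisecond) // waking up requires preempting the loop
    fmt.Println("main ran despite the hot loop") // reached only with async preemption
}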

Go 1.14 introduced asynchronous preemption based on POSIX signals (Linux: SIGURG):

  1. Sysmon sees a G has run too long.
  2. The runtime sends SIGURG to the M running that G.
  3. The signal handler verifies that the G is at a safe point, then rewrites its state so that, when the handler returns, the goroutine calls runtime.asyncPreempt, which saves its registers and yields.
  4. The G is descheduled; the M picks up another G.

Asynchronous preemption makes the scheduler truly preemptive. Tight loops, infinite recursion (until stack exhaustion), and CPU-hot workloads no longer stall GC or other goroutines.

The implementation is delicate: signals can interrupt at any instruction, so the runtime must verify that the interrupted state is "safe" — registers properly saved, stack maps known. The mechanism is documented in the Go runtime sources and was the subject of Austin Clements' GopherCon 2020 talk.


Stack Growth and Shrinking

A goroutine starts with a stack of 2 KB (since 1.4; before that, 8 KB). When the stack overflows, the runtime grows it.

How the check works

Almost every function prologue (emitted by cmd/compile) contains a stack-bound check; small leaf functions that cannot overflow skip it:

CMPQ SP, g_stackguard0(R14)   // compare SP to the stack guard
JLS  morestack                // if too low, grow

If SP falls below stackguard0, the function jumps into runtime.morestack, which:

  1. Allocates a new stack twice the size of the current one.
  2. Copies the contents of the old stack to the new stack.
  3. Adjusts every pointer that points into the old stack (the runtime knows where they all are because of the per-instruction stack maps the compiler emits).
  4. Restarts the interrupted function on the new stack and resumes execution.

Stack growth is rare in steady state but occurs on first call, deep recursion, or large local variables.
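A sketch showing that growth is invisible to the program: the goroutine starts with ~2 KB of stack, yet a frame holding a 64 KiB local works because morestack grows and copies the stack on the way in:

package main

import "fmt"

//go:noinline
func bigFrame() int {
    var buf [64 * 1024]byte // far larger than the 2 KB starting stack
    for i := range buf {
        buf[i] = byte(i)
    }
    return int(buf[len(buf)-1])
}

func main() {
    done := make(chan int)
    go func() { done <- bigFrame() }() // fresh goroutine, fresh 2 KB stack
    fmt.Println("result:", <-done)
}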

Shrinking

The garbage collector triggers stack shrinking. If a goroutine's stack is mostly empty, the GC may copy it back down to a smaller stack to save memory. Shrinking is conservative — it only happens if at least 75% of the stack is unused.

Limits

The default maximum stack size is 1 GB on 64-bit systems (250 MB on 32-bit). You can change it with runtime/debug.SetMaxStack. Hitting the limit crashes the program with a "stack overflow" fatal error.


The Network Poller and Gwaiting

Blocking network I/O in Go's net package — and I/O on other pollable descriptors, such as pipes created by os.Pipe — is implemented via the network poller. When a goroutine calls conn.Read:

  1. The runtime sets the file descriptor to non-blocking.
  2. The goroutine attempts the read; if no data, the syscall returns EAGAIN.
  3. The runtime parks the goroutine in Gwaiting, registers the FD with epoll/kqueue, and the M moves on to other work.
  4. When epoll/kqueue reports the FD readable, netpoll returns the parked G and the scheduler moves it back to Grunnable.

This is why goroutine I/O does not consume threads. The OS knows about a small pool of Ms; it does not know there are 50 000 goroutines reading from sockets.

Why this is more efficient than thread-per-connection

The kernel's epoll_wait is a single syscall per batch of I/O events, far cheaper than waking one thread per event. Combine that with the per-goroutine 2 KB stack (versus a per-thread ~1 MB stack), and Go's I/O model is roughly two orders of magnitude more memory-efficient than the thread-per-connection model used by classical Apache or pre-NIO Java servers.
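A goroutine-per-connection sketch (the listen address is illustrative): each accepted connection costs one mostly-parked G, and all of them are multiplexed over the single poller:

package main

import (
    "io"
    "log"
    "net"
)

func main() {
    ln, err := net.Listen("tcp", "127.0.0.1:9000")
    if err != nil {
        log.Fatal(err)
    }
    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Print(err)
            continue
        }
        go func(c net.Conn) { // one G per connection, parked in Gwaiting while idle
            defer c.Close()
            io.Copy(c, c) // echo: each Read parks the G until the FD is readable
        }(conn)
    }
}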


Syscalls and the M-Park Dance

Some syscalls are not pollable (file I/O on most filesystems, DNS lookups via cgo). The goroutine cannot be parked on epoll; the M actually blocks in the kernel.

The runtime handles this via a hand-off:

  1. Before a blocking syscall, the runtime calls entersyscall. This detaches the M from its P.
  2. Sysmon notices the M has been in syscall too long and assigns the P to another M (creating one if necessary).
  3. When the syscall returns, the M tries to reacquire its old P. If unavailable, it grabs any idle P. If none, the G is handed to the global run queue and the M is parked.

Effect: a goroutine doing os.Read on a file does block one OS thread, but the rest of the runtime keeps running on other threads.

Cost

Every blocking syscall adds overhead from the P hand-off: typically ~1-2μs. For high-frequency syscall workloads, this matters; for an HTTP server, it disappears in the noise.
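A Unix-only sketch (it uses syscall.Pipe to get a raw, blocking descriptor, so the reads bypass the netpoller): each goroutine blocked in read(2) pins an M, and the thread-creation profile count rises accordingly:

package main

import (
    "fmt"
    "runtime"
    "runtime/pprof"
    "syscall"
    "time"
)

func main() {
    fmt.Println("threads created so far:", pprof.Lookup("threadcreate").Count())

    var fds [2]int
    if err := syscall.Pipe(fds[:]); err != nil {
        panic(err)
    }
    for i := 0; i < 20; i++ {
        go func() {
            buf := make([]byte, 1)
            syscall.Read(fds[0], buf) // blocking syscall: this goroutine's M sleeps in the kernel
        }()
    }

    time.Sleep(200 * time.Millisecond) // give sysmon time to hand off the Ps and spawn Ms
    fmt.Println("goroutines:", runtime.NumGoroutine())
    fmt.Println("threads created now:   ", pprof.Lookup("threadcreate").Count())
}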


GOMAXPROCS and Pidle

GOMAXPROCS controls the number of Ps. Setting it to N means up to N goroutines can run in parallel (truly simultaneously on different cores).

Defaults

Since Go 1.5, the default is runtime.NumCPU(). Since Go 1.25, on Linux, the default also takes cgroup CPU quotas into account (so containers see the right number); on older runtimes you must account for the quota yourself (see the table below).

Calling runtime.GOMAXPROCS(0) (or any value below 1) returns the current value without changing it.

When to override

Situation                                               Adjustment
Container with a CPU limit on a pre-1.25 runtime        Set GOMAXPROCS to the quota explicitly, e.g. via the automaxprocs library (sketch below)
Latency-sensitive service co-located with other work    Reduce GOMAXPROCS to leave headroom
CPU-bound batch job alone on a host                     Default is correct
Benchmark comparing single-thread performance           Set GOMAXPROCS=1 to remove scheduler noise
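The automaxprocs row above refers to the common pattern of letting go.uber.org/automaxprocs set GOMAXPROCS from the cgroup quota at startup; a minimal sketch:

package main

import (
    "fmt"
    "runtime"

    _ "go.uber.org/automaxprocs" // adjusts GOMAXPROCS to the container's CPU quota in init()
)

func main() {
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}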

Pidle and wake-up logic

Idle Ps live on a stack (sched.pidle). When a goroutine is spawned and the spawner has surplus work, the runtime pops a P from pidle and wakes a corresponding M. This is the wakep step of newproc.

If there are no idle Ps, the goroutine simply lands on the spawner's local runq and will be picked up later. No M-creation is forced.


Goroutine Identity and Reuse

Goroutines have IDs (g.goid), but the runtime does not expose them. The reasons:

  • IDs would tempt code to track goroutines as identities (anti-pattern; use context.Context).
  • Goroutine structs are recycled; IDs change semantics across versions.
  • A stable ID would enable goroutine-local storage, which the Go authors deliberately do not provide.

You can reach the ID via runtime/debug.Stack parsing or unsafe tricks, but don't. Pass identity in context.Context instead.
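A sketch of the sanctioned alternative: carry identity with the work in a context.Context (the key type and the "req-42" value are made up for illustration):

package main

import (
    "context"
    "fmt"
)

type requestIDKey struct{} // unexported key type avoids collisions

func withRequestID(ctx context.Context, id string) context.Context {
    return context.WithValue(ctx, requestIDKey{}, id)
}

func requestID(ctx context.Context) string {
    id, _ := ctx.Value(requestIDKey{}).(string)
    return id
}

func handle(ctx context.Context) {
    // Identity travels with the context, not with whichever goroutine runs this.
    fmt.Println("handling", requestID(ctx))
}

func main() {
    ctx := withRequestID(context.Background(), "req-42")
    done := make(chan struct{})
    go func() { handle(ctx); close(done) }()
    <-done
}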

Reuse

When a G exits, its struct is placed on the P-local g free list (p.gFree) for reuse. The stack may be recycled too. This is why goroutine creation amortises so cheaply — most "creations" are reuses.


Tracing the Scheduler

GODEBUG=schedtrace=1000 prints scheduler statistics every 1000 ms:

SCHED 1003ms: gomaxprocs=8 idleprocs=2 threads=12 spinningthreads=1 idlethreads=4 runqueue=3 [0 1 0 4 0 0 2 0]

Fields:

  • gomaxprocs = current P count
  • idleprocs = Ps not running anything
  • threads = total Ms
  • spinningthreads = Ms actively looking for work (not yet parked)
  • idlethreads = Ms parked
  • runqueue = global runq depth
  • [...] = each P's local runq depth

Add scheddetail=1 for per-P / per-M / per-G detail. Beware: extremely chatty.

runtime/trace

f, _ := os.Create("trace.out") // error handling elided
trace.Start(f)
defer trace.Stop()

Produces a binary trace consumable by go tool trace. The browser visualisation shows every goroutine's life, every syscall, every GC pause, every preemption. Indispensable for debugging scheduler-induced latency.


Limits, Failure Modes, and Defaults

Limit                              Default                      Adjustable?
Goroutine count                    unbounded                    implicitly, by available memory
Per-goroutine starting stack       2 KB                         not via API
Per-goroutine max stack            1 GB on 64-bit               debug.SetMaxStack
GOMAXPROCS                         NumCPU()                     env var or runtime.GOMAXPROCS
Local runq capacity                256                          not adjustable
Sysmon period                      20 μs to 10 ms (adaptive)    not adjustable
Async preemption signal            SIGURG (Linux)               not adjustable
Global runq pickup                 every 61 scheduler ticks     not adjustable

Failure modes

  • OOM from goroutine leak. Each leaked goroutine costs ~2 KB of stack plus whatever its closure retains on the heap. A million leaks is ~2-4 GB. (A leak-check sketch follows this list.)
  • Stack overflow. Hit at 1 GB; usually means infinite recursion.
  • Scheduler livelock. Pre-1.14 with a tight loop and GOMAXPROCS=1. Solved by async preemption.
  • GC pressure. A high allocation rate triggers frequent GC cycles and mark-assist work that steals CPU from your goroutines. Tune GOGC or reduce the allocation rate.
  • Cgo thread exhaustion. Blocking cgo calls each pin an OS thread; enough simultaneous callers can hit the default 10 000-thread limit (debug.SetMaxThreads) and crash the process. Bound concurrent cgo callers, for example with a semaphore.
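A crude leak-check sketch for tests, comparing goroutine counts before and after the code under test (libraries such as go.uber.org/goleak do this more thoroughly):

package leakcheck_test

import (
    "runtime"
    "testing"
    "time"
)

func TestNoGoroutineLeak(t *testing.T) {
    before := runtime.NumGoroutine()

    // ... exercise the code under test here ...

    time.Sleep(100 * time.Millisecond) // let finished goroutines actually exit
    if after := runtime.NumGoroutine(); after > before {
        t.Fatalf("possible goroutine leak: %d goroutines before, %d after", before, after)
    }
}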

Self-Assessment

  • I can describe the GMP model in a whiteboard interview.
  • I know where in the runtime sources newproc, findrunnable, runqsteal, and sysmon live.
  • I can explain how a goroutine moves between Grunnable, Grunning, Gwaiting, and Gdead.
  • I understand why async preemption was needed and how SIGURG implements it.
  • I know what work-stealing is and when it kicks in.
  • I can read a GODEBUG=schedtrace=1000 line and interpret it.
  • I have used runtime/trace to debug a real latency issue.
  • I know when goroutine-per-connection breaks down (memory) and when event loops would help (rarely).
  • I understand stack growth: when, how, what it costs.
  • I have read at least 200 lines of runtime/proc.go.

Summary

The Go scheduler is a work-stealing, partly preemptive scheduler over an (M, P, G) triple. Each P holds a local 256-entry runqueue; idle Ps steal work from busy ones. A background monitor thread (sysmon) preempts goroutines that run too long, hands off Ps stuck in syscalls, and pokes the network poller. Asynchronous preemption (Go 1.14+) ensures even tight loops can be paused. Goroutines start with 2 KB stacks that grow on demand. I/O is multiplexed through a single epoll/kqueue/IOCP loop, so 100 000 idle network goroutines cost almost nothing.

The whole edifice exists for one reason: to make "spawn a goroutine" feel free, while still scaling to millions of them. Once you understand how it works, you can debug the few failure modes that occur (cgo deadlocks, GC starvation, scheduler latency) instead of treating the runtime as magic.

Read runtime/proc.go. Read runtime/runtime2.go. Read runtime/netpoll.go. They are some of the best-commented Go code in existence, and they make every goroutine you ever write a little less mysterious.