The G-M-P Model — Professional¶
Table of Contents¶
- Introduction
- Source Layout
- Reading g Field by Field
- Reading m Field by Field
- Reading p Field by Field
- Reading schedt Field by Field
- newproc Line by Line
- schedule Line by Line
- findrunnable Walk-Through
- runqput, runqget, runqsteal — the Atomic Protocol
- gopark and goready Internals
- wakep and the Spinning Cap
- stopm and startm
- pidleget, pidleput, mput, mget
- Lock Ranks and Invariants
- Version-by-Version Evolution
- Reading Path for Maximum Yield
- Summary
Introduction¶
The professional level is where the runtime stops being a black box. You read src/runtime/runtime2.go for the structs and src/runtime/proc.go for the scheduler logic, and you know — for any line of your Go program — which function in the runtime executes when, on which goroutine. This document walks the canonical paths in order.
References are to Go 1.22 source; line numbers approximate. The struct names and field positions have been broadly stable since Go 1.5 (when the runtime was rewritten in Go from C); cosmetic changes happen each release but the model is unchanged.
Source Layout¶
src/runtime/
runtime2.go — struct definitions for g, m, p, schedt, sudog, ...
proc.go — schedule, findrunnable, newproc, gopark, ready, ...
asm_amd64.s — gogo, mcall, systemstack assembly
chan.go — channels (calls gopark/goready)
mheap.go, mcache.go, mcentral.go — allocator (uses P-local mcache)
sema.go, lock_futex.go — semaphores, futex notes
sigqueue.go — signal forwarding
sys_linux_amd64.s — clone(2) and syscall stubs
runtime2.go is the data-structure file. proc.go is by far the largest file in the runtime, running to several thousand lines; it contains the entire scheduler. Everything you read in this document maps to those two files unless otherwise noted.
Reading g Field by Field¶
From runtime/runtime2.go (around line 400):
type g struct {
// Stack parameters.
// stack describes the actual stack memory: [stack.lo, stack.hi).
// stackguard0 is the stack pointer compared in the Go stack growth prologue.
// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
// stackguard1 is the stack pointer compared in the C stack growth prologue.
// It is stack.lo+StackGuard on g0 and gsignal stacks.
// It is ~0 on other goroutine stacks, to trigger a call to morestackc.
stack stack // offset known to runtime/cgo
stackguard0 uintptr // offset known to liblink
stackguard1 uintptr // offset known to liblink
_panic *_panic // innermost panic - offset known to liblink
_defer *_defer // innermost defer
m *m // current m; offset known to arm liblink
sched gobuf
syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
stktopsp uintptr // expected sp at top of stack, to check in traceback
// param is a generic pointer parameter field used to pass values
// in particular contexts where other storage for the parameter would be difficult to find.
param unsafe.Pointer
atomicstatus atomic.Uint32
stackLock uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
goid uint64
schedlink guintptr
waitsince int64 // approx time when the g become blocked
waitreason waitReason // if status==Gwaiting
preempt bool // preemption signal, duplicates stackguard0 = stackpreempt
preemptStop bool // transition to _Gpreempted on preemption; otherwise, just deschedule
preemptShrink bool // shrink stack at synchronous safe point
// asyncSafePoint is set if g is stopped at an asynchronous
// safe point. This means there are frames on the stack
// without precise pointer information.
asyncSafePoint bool
paniconfault bool // panic (instead of crash) on unexpected fault address
gcscandone bool // g has scanned stack; protected by _Gscan bit in status
throwsplit bool // must not split stack
activeStackChans bool
parkingOnChan atomic.Bool
raceignore int8 // ignore race detection events
nocgocallback bool // whether disable callback from C
tracking bool // whether we're tracking this G for sched latency statistics
trackingSeq uint8 // used to decide whether to track this G
trackingStamp int64 // timestamp of when the G last started being tracked
runnableTime int64 // the amount of time spent runnable, cleared when running, only used when tracking
lockedm muintptr
sig uint32
writebuf []byte
sigcode0 uintptr
sigcode1 uintptr
sigpc uintptr
parentGoid uint64 // goid of goroutine that created this goroutine
gopc uintptr // pc of go statement that created this goroutine
ancestors *[]ancestorInfo // ancestor information goroutine(s) that created this goroutine (only used if debug.tracebackancestors)
startpc uintptr // pc of goroutine function
racectx uintptr
waiting *sudog // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
cgoCtxt []uintptr // cgo traceback context
labels unsafe.Pointer // profiler labels
timer *timer // cached timer for time.Sleep
selectDone atomic.Uint32 // are we participating in a select and did someone win the race?
coroarg *coro // argument during coroutine transfers
gcAssistBytes int64
}
Field roles, grouped:
| Group | Fields | Notes |
|---|---|---|
| Stack | stack, stackguard0, stackguard1, stktopsp | Stack bounds; preempt is signalled by poisoning stackguard0. |
| Saved regs | sched gobuf | Saved when parked; restored by gogo. |
| Status | atomicstatus | CAS'd between states. |
| Linkage | schedlink, m, lockedm, waiting, param | Pointers used by the scheduler. |
| Identity | goid, parentGoid, gopc, startpc | Tracking, traceback, debugging. |
| Preemption | preempt, preemptStop, preemptShrink, asyncSafePoint | See 02-preemption. |
| Sigs | sig, sigcode0, sigcode1, sigpc | Signal forwarding to user. |
| GC | gcAssistBytes, gcscandone | Per-G GC assist credit. |
| Tracing | tracking, trackingSeq, trackingStamp, runnableTime | Scheduler latency tracking. |
The gobuf:
type gobuf struct {
sp uintptr
pc uintptr
g guintptr
ctxt unsafe.Pointer
ret uintptr
lr uintptr
bp uintptr // for framepointer-enabled architectures
}
pc, sp, bp, ret, lr — the saved register set. ctxt is the function-value pointer that gogo will install in the closure context register. g is a back-pointer ensuring resumption restores the right G pointer in TLS.
Reading m Field by Field¶
type m struct {
g0 *g // goroutine with scheduling stack
morebuf gobuf // gobuf arg to morestack
divmod uint32 // div/mod denominator for arm - known to liblink
// Fields not known to debuggers.
procid uint64 // for debuggers, but offset not hard-coded
gsignal *g // signal-handling g
goSigStack gsignalStack // Go-allocated signal handling stack
sigmask sigset // storage for saved signal mask
tls [tlsSlots]uintptr // thread-local storage (for x86 extern register)
mstartfn func()
curg *g // current running goroutine
caughtsig guintptr // goroutine running during fatal signal
p puintptr // attached p for executing go code (nil if not executing go code)
nextp puintptr
oldp puintptr // the p that was attached before executing a syscall
id int64
mallocing int32
throwing throwType
preemptoff string // if != "", keep curg running on this m
locks int32
dying int32
profilehz int32
spinning bool // m is out of work and is actively looking for work
blocked bool // m is blocked on a note
newSigstack bool // minit on C thread called sigaltstack
printlock int8
incgo bool // m is executing a cgo call
isextra bool // m is an extra m
isExtraInC bool // m is an extra m that is not executing Go code
isExtraInSig bool // m is an extra m in a signal handler
freeWait atomic.Uint32 // Whether it is safe to free g0 and delete m (one of freeMRef, freeMStack, freeMWait)
needextram bool
traceback uint8
ncgocall uint64 // number of cgo calls in total
ncgo int32 // number of cgo calls currently in progress
cgoCallersUse atomic.Uint32 // if non-zero, cgoCallers in use temporarily
cgoCallers *cgoCallers // cgo traceback if crashing in cgo call
park note
alllink *m // on allm
schedlink muintptr
lockedg guintptr
createstack [32]uintptr // stack that created this thread, it's used for StackRecord.Stack0, so it must align with it.
lockedExt uint32 // tracking for external LockOSThread
lockedInt uint32 // tracking for internal lockOSThread
nextwaitm muintptr // next m waiting for lock
mLockProfile mLockProfile // fields relating to runtime.lock contention
// wait* are used to carry arguments from gopark into park_m, because
// there's no stack to put them on. That is their sole purpose.
waitunlockf func(*g, unsafe.Pointer) bool
waitlock unsafe.Pointer
waitTraceBlockReason traceBlockReason
waitTraceSkip int
syscalltick uint32
freelink *m // on sched.freem
trace mTraceState
// these are here because they are too large to be on the stack
// of low-level NOSPLIT functions.
libcall libcall
libcallpc uintptr // for cpu profiler
libcallsp uintptr
libcallg guintptr
winsyscall winlibcall // stores syscall parameters on windows
vdsoSP uintptr // SP for traceback while in VDSO call (0 if not in call)
vdsoPC uintptr // PC for traceback while in VDSO call
// preemptGen counts the number of completed preemption signals. This is used to detect if a preemption call missed.
preemptGen atomic.Uint32
// Whether this is a pending preemption signal on this M.
signalPending atomic.Uint32
dlogPerM
mOS
chacha8 chacha8rand.State
cheaprand uint64
// Up to 10 locks held by this m, maintained by the lock ranking code.
locksHeldLen int
locksHeld [10]heldLockInfo
}
Critical fields:
- g0 — system goroutine for scheduler/GC code; its stack is the OS thread's stack.
- curg — currently running user G, or nil.
- p / nextp / oldp — current, hinted-next, and previously-bound P.
- park — futex word for idle parking.
- spinning / blocked — booleans counted in sched.nmspinning / sched.nmidle.
- incgo — true during a cgo call; sysmon uses this to avoid retaking the P.
- lockedg / lockedExt / lockedInt — LockOSThread bookkeeping.
- preemptGen / signalPending — async-preemption signal tracking.
The mOS field at the end embeds OS-specific state; on Linux it carries the per-thread CPU-profiling timer bookkeeping. (procid and the vdso* fields are declared directly on m, as shown above.)
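The lockedg/lockedExt machinery is reachable from user code through runtime.LockOSThread. A minimal illustration using only standard-library calls; the use cases named in the comment are typical examples, not runtime requirements:

```go
// LockOSThread wires g.lockedm and m.lockedg together (and bumps lockedExt),
// so schedule() routes this M through stoplockedm/startlockedm instead of
// picking arbitrary work for it.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.LockOSThread() // this goroutine now runs only on this OS thread
	defer runtime.UnlockOSThread()

	// Typical uses: C libraries with thread-local state, per-thread OS
	// facilities (Linux namespaces), GUI main loops.
	fmt.Println("pinned to one OS thread")
}
```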
Reading p Field by Field¶
type p struct {
id int32
status uint32 // one of pidle/prunning/...
link puintptr
schedtick uint32 // incremented on every scheduler call
syscalltick uint32 // incremented on every system call
sysmontick sysmontick // last tick observed by sysmon
m muintptr // back-link to associated m (nil if idle)
mcache *mcache
pcache pageCache
raceprocctx uintptr
deferpool []*_defer // pool of available defer structs (see panic.go)
deferpoolbuf [32]*_defer
// Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
goidcache uint64
goidcacheend uint64
// Queue of runnable goroutines. Accessed without lock.
runqhead uint32
runqtail uint32
runq [256]guintptr
// runnext, if non-nil, is a runnable G that was ready'd by
// the current G and should be run next instead of what's in
// runq if there's time remaining in the running G's time
// slice. It will inherit the time left in the current time
// slice. If a set of goroutines is locked in a
// communicate-and-wait pattern, this schedules that set as a
// unit and eliminates the (potentially large) scheduling
// latency that otherwise arises from adding the ready'd
// goroutines to the end of the run queue.
//
// Note that while other P's may atomically CAS this to zero,
// only the owner can CAS it to a valid G.
runnext guintptr
// Available G's (status == Gdead)
gFree struct {
gList
n int32
}
sudogcache []*sudog
sudogbuf [128]*sudog
// Cache of mspan objects from the heap.
mspancache struct {
len int
buf [128]*mspan
}
pinnerCache *pinner
trace pTraceState
palloc persistentAlloc // per-P to avoid mutex
// Per-P GC state
gcAssistTime int64 // Nanoseconds in assistAlloc
gcFractionalMarkTime int64 // Nanoseconds in fractional mark worker (atomic)
// limiterEvent tracks events for the GC CPU limiter.
limiterEvent limiterEvent
// gcMarkWorkerMode is the mode for the next mark worker to run in.
gcMarkWorkerMode gcMarkWorkerMode
gcMarkWorkerStartTime int64
// gcw is this P's GC work buffer cache. The work buffer is
// filled by write barriers, drained by mutator assists, and
// disposed on certain GC state transitions.
gcw gcWork
// wbBuf is this P's GC write barrier buffer.
//
// TODO: Consider caching this in the running G.
wbBuf wbBuf
runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point
// statsSeq is a counter indicating whether this P is currently
// writing any stats. Its value is even when not, odd when it is.
statsSeq atomic.Uint32
// Lock for timers. We normally access the timers while running
// on this P, but the scheduler can also do it from a different P.
timersLock mutex
// Actions to take at some time. This is used to implement the
// standard library's time package.
// Must hold timersLock to access.
timers []*timer
// Number of timers in P's heap.
numTimers atomic.Uint32
// Number of timerDeleted timers in P's heap.
deletedTimers atomic.Uint32
// Race context used while executing timer functions.
timerRaceCtx uintptr
maxStackScanDelta int64
// gc-time statistics about current goroutines
scannedStackSize uint64 // stack size of goroutines scanned by this P
scannedStacks uint64 // number of goroutines scanned by this P
// preempt is set to indicate that this P should be enter the
// scheduler ASAP (regardless of what G is running on it).
preempt bool
// pageTraceBuf is a buffer for writing out page allocation/free/scavenge traces.
pageTraceBuf pageTraceBuf
// Padding is no longer needed. False sharing is now not a worry because p is large enough
// that its size class is an even multiple of the cache line size (for any of our supported
// architectures).
}
Eight sub-systems live inside a P:
- Identity and status — id, status, link, m.
- Tick counters — schedtick, syscalltick, sysmontick.
- Runqueue — runqhead, runqtail, runq[256], runnext.
- G cache — gFree, goidcache, goidcacheend.
- Sync caches — sudogcache, sudogbuf, deferpool, deferpoolbuf.
- Allocator caches — mcache, pcache, mspancache, palloc, pinnerCache.
- GC state — gcAssistTime, gcMarkWorkerMode, gcw, wbBuf, scannedStacks, scannedStackSize, maxStackScanDelta.
- Timers — timers, timersLock, numTimers, deletedTimers, timerRaceCtx.
The size of p is around 5–6 KiB on 64-bit. It is the largest scheduler struct, justified by the volume of per-P caches.
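One p exists per GOMAXPROCS slot; resizing GOMAXPROCS stops the world and runs procresize, which allocates or releases p structs along with all the caches listed above. A quick user-level check with the standard runtime API:

```go
// GOMAXPROCS(0) queries the current P count without changing it; a nonzero
// argument resizes the P set (stop-the-world plus procresize in the runtime).
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("Ps:", runtime.GOMAXPROCS(0), "CPUs:", runtime.NumCPU())

	prev := runtime.GOMAXPROCS(2) // shrink the P set to 2
	fmt.Println("was:", prev, "now:", runtime.GOMAXPROCS(0))
}
```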
Reading schedt Field by Field¶
type schedt struct {
goidgen atomic.Uint64
lastpoll atomic.Int64 // time of last network poll, 0 if currently polling
pollUntil atomic.Int64 // time to which current poll is sleeping
lock mutex
// When increasing nmidle, nmidlelocked, nmsys, or nmfreed, be sure to
// call checkdead().
midle muintptr // idle m's waiting for work
nmidle int32 // number of idle m's waiting for work
nmidlelocked int32 // number of locked m's waiting for work
mnext int64 // number of m's that have been created and next M ID
maxmcount int32 // maximum number of m's allowed (or die)
nmsys int32 // number of system m's not counted for deadlock
nmfreed int64 // cumulative number of freed m's
ngsys atomic.Int32 // number of system goroutines
pidle puintptr // idle p's
npidle atomic.Int32
nmspinning atomic.Int32
needspinning atomic.Uint32 // See "Delicate dance" comment in proc.go. Boolean. Must hold sched.lock to set to 1.
// Global runnable queue.
runq gQueue
runqsize int32
// disable controls selective disabling of the scheduler.
//
// Use schedEnableUser to control this.
//
// disable is protected by sched.lock.
disable struct {
// user disables scheduling of user goroutines.
user bool
runnable gQueue // pending runnable Gs
n int32 // length of runnable
}
// Global cache of dead G's.
gFree struct {
lock mutex
stack gList // Gs with stacks
noStack gList // Gs without stacks
n int32
}
// Central cache of sudog structs.
sudoglock mutex
sudogcache *sudog
// Central pool of available defer structs.
deferlock mutex
deferpool *_defer
// freem is the list of m's waiting to be freed when their
// m.exited is set. Linked through m.freelink.
freem *m
gcwaiting atomic.Bool // gc is waiting to run
stopwait int32
stopnote note
sysmonwait atomic.Bool
sysmonnote note
// safepointFn should be called on each P at the next GC
// safepoint if p.runSafePointFn is set.
safePointFn func(*p)
safePointWait int32
safePointNote note
profilehz int32 // cpu profiling rate
procresizetime int64 // nanotime() of last change to gomaxprocs
totaltime int64 // ∫gomaxprocs dt up to procresizetime
// sysmonlock protects sysmon's actions on the runtime.
//
// Acquire and release with sysmon being on its own M makes
// contended uses of this lock predictable.
sysmonlock mutex
// timeToRun is a distribution of scheduling latencies, defined
// as the sum of time a G spends in the _Grunnable state before
// it transitions to _Grunning.
timeToRun timeHistogram
// idleTime is the total CPU time Ps have "spent" idle.
//
// Reset on each GC cycle.
idleTime atomic.Int64
// totalMutexWaitTime is the sum of time goroutines have spent in _Gwaiting
// with a waitreason of the form waitReasonSync{RW,}Mutex{R,}Lock.
totalMutexWaitTime atomic.Int64
}
The sched global is the answer to "where do I look up state that is not per-P?"
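Some of these fields are visible from user code through the runtime/metrics package: the timeToRun histogram backs the "/sched/latencies:seconds" sample, and gomaxprocs is exported as "/sched/gomaxprocs:threads". A small reader:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/sched/latencies:seconds"},  // distribution of time Gs spend _Grunnable
		{Name: "/sched/gomaxprocs:threads"}, // current GOMAXPROCS
	}
	metrics.Read(samples)

	fmt.Println("GOMAXPROCS:", samples[1].Value.Uint64())

	hist := samples[0].Value.Float64Histogram()
	var total uint64
	for _, c := range hist.Counts {
		total += c
	}
	fmt.Println("scheduling-latency samples recorded:", total)
}
```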
newproc Line by Line¶
From runtime/proc.go:
// Create a new g running fn.
// Put it on the queue of g's waiting to run.
// The compiler turns a go statement into a call to this.
func newproc(fn *funcval) {
gp := getg()
pc := getcallerpc()
systemstack(func() {
newg := newproc1(fn, gp, pc, false, waitReasonZero)
pp := getg().m.p.ptr()
runqput(pp, newg, true)
if mainStarted {
wakep()
}
})
}
// Create a new g in state _Grunnable (or _Gwaiting if parked is true),
// starting at fn. callerpc is the address of the go statement that created
// this. The caller is responsible for adding the new g to the scheduler.
func newproc1(fn *funcval, callergp *g, callerpc uintptr, parked bool, waitreason waitReason) *g {
if fn == nil {
fatal("go of nil func value")
}
mp := acquirem() // disable preemption because we hold M and P in local vars.
pp := mp.p.ptr()
newg := gfget(pp)
if newg == nil {
newg = malg(stackMin)
casgstatus(newg, _Gidle, _Gdead)
allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
}
if newg.stack.hi == 0 {
throw("newproc1: newg missing stack")
}
if readgstatus(newg) != _Gdead {
throw("newproc1: new g is not Gdead")
}
totalSize := uintptr(4*goarch.PtrSize + sys.MinFrameSize) // extra space in case of reads slightly beyond frame
totalSize = alignUp(totalSize, sys.StackAlign)
sp := newg.stack.hi - totalSize
spArg := sp
if usesLR {
// caller's LR
*(*uintptr)(unsafe.Pointer(sp)) = 0
prepGoExitFrame(sp)
spArg += sys.MinFrameSize
}
memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
newg.sched.sp = sp
newg.stktopsp = sp
newg.sched.pc = abi.FuncPCABI0(goexit) + sys.PCQuantum
newg.sched.g = guintptr(unsafe.Pointer(newg))
gostartcallfn(&newg.sched, fn)
newg.parentGoid = callergp.goid
newg.gopc = callerpc
newg.ancestors = saveAncestors(callergp)
newg.startpc = fn.fn
if isSystemGoroutine(newg, false) {
sched.ngsys.Add(1)
} else {
// Only user goroutines inherit pprof labels.
if mp.curg != nil {
newg.labels = mp.curg.labels
}
if goroutineProfile.active {
newg.goroutineProfiled.Store(goroutineProfileSatisfied)
}
}
newg.trackingSeq = uint8(cheaprand())
if newg.trackingSeq%gTrackingPeriod == 0 {
newg.tracking = true
}
gcController.addScannableStack(pp, int64(newg.stack.hi-newg.stack.lo))
// Get a goid and switch to runnable. Make all this atomic
// to the tracer.
trace := traceAcquire()
var status uint32 = _Grunnable
if parked {
status = _Gwaiting
newg.waitreason = waitreason
}
if pp.goidcache == pp.goidcacheend {
// Sched.goidgen is the last allocated id, this batch must be [sched.goidgen+1, sched.goidgen+GoidCacheBatch].
pp.goidcache = sched.goidgen.Add(_GoidCacheBatch)
pp.goidcache -= _GoidCacheBatch - 1
pp.goidcacheend = pp.goidcache + _GoidCacheBatch
}
newg.goid = pp.goidcache
pp.goidcache++
casgstatus(newg, _Gdead, status)
if trace.ok() {
trace.GoCreate(newg, newg.startpc, parked)
traceRelease(trace)
}
releasem(mp)
return newg
}
Step by step:
- acquirem() disables preemption — we are about to read mp.p, which would be invalid if we got preempted and rescheduled onto another M.
- gfget(pp) tries to reuse a dead G from pp.gFree. Hit: returns an existing G with a stack. Miss: returns nil.
- malg(stackMin) allocates a fresh G with a 2 KiB stack.
- allgadd publishes it under allglock so the GC scanner can find it.
- memclrNoHeapPointers zeros newg.sched to start clean.
- gostartcallfn pushes a "call" frame onto the new G's stack so that, when scheduled, it executes fn. The return PC is goexit, so when fn returns, control jumps to goexit, which cleans up.
- casgstatus(newg, _Gdead, status) atomically marks the G as _Grunnable.
- Back in newproc, runqput(pp, newg, true) places it in runnext (and bumps the previous runnext into runq).
- wakep() wakes an idle P+M pair if one is available, so the new G has a chance to run in parallel.
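To watch newproc, runqput, and the overflow to the global queue from outside, run a goroutine-heavy program under GODEBUG=schedtrace=1000: the runtime prints one summary line per second with gomaxprocs, thread counts, the global run queue length (runqueue=) and per-P local queue lengths in brackets. A small program that keeps the queues visibly full (the busy loops are arbitrary work chosen only to keep the Gs runnable):

```go
// Run with: GODEBUG=schedtrace=1000 go run main.go
// Each spawned G spins, so local run queues fill and overflow to the global
// queue (runqputslow); the schedtrace lines show nonzero runqueue= and [..].
package main

import "time"

func main() {
	for i := 0; i < 1000; i++ {
		go func() {
			for {
				// Busy loop; async preemption (Go 1.14+) keeps this fair,
				// so main's Sleep below still wakes up on time.
			}
		}()
	}
	time.Sleep(3 * time.Second)
}
```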
schedule Line by Line¶
// One round of scheduler: find a runnable goroutine and execute it.
// Never returns.
func schedule() {
mp := getg().m
if mp.locks != 0 {
throw("schedule: holding locks")
}
if mp.lockedg != 0 {
stoplockedm()
execute(mp.lockedg.ptr(), false) // Never returns.
}
// We should not schedule away from a g that is executing a cgo call,
// since the cgo call is using the m's g0 stack.
if mp.incgo {
throw("schedule: in cgo")
}
top:
pp := mp.p.ptr()
pp.preempt = false
// Safety check: if we are spinning, the runqueue should be empty.
// Check this before calling checkTimers, as that might call goready
// to put a ready goroutine on the local runqueue.
if mp.spinning && (pp.runnext != 0 || pp.runqhead != pp.runqtail) {
throw("schedule: spinning with local work")
}
gp, inheritTime, tryWakeP := findRunnable() // blocks until work is available
if debug.dontfreezetheworld > 0 && freezing.Load() {
// See comment in freezetheworld. We don't want to perturb
// scheduler state, so we didn't gcstopm in findRunnable, but
// we aren't going to let this 'g' run unless gp == mp.g0.
if gp != mp.g0 {
// We didn't find any user code to run. Just stop the M.
stopm()
goto top
}
}
// This thread is going to run a goroutine and is not spinning anymore,
// so if it was marked as spinning we need to reset it now and potentially
// start a new spinning M.
if mp.spinning {
resetspinning()
}
if sched.disable.user && !schedEnabled(gp) {
// Scheduling of this goroutine is disabled. Put it on
// the list of pending runnable goroutines for when we re-enable
// user scheduling and look again.
lock(&sched.lock)
if schedEnabled(gp) {
// Something re-enabled scheduling while we
// were acquiring the lock.
unlock(&sched.lock)
} else {
sched.disable.runnable.pushBack(gp)
sched.disable.n++
unlock(&sched.lock)
goto top
}
}
// If about to schedule a not-normal goroutine (a GCworker or tracereader),
// wake a P if there is one.
if tryWakeP {
wakep()
}
if gp.lockedm != 0 {
// Hands off own p to the locked m,
// then blocks waiting for a new p.
startlockedm(gp)
goto top
}
execute(gp, inheritTime)
}
The function does not return; it always ends in execute, which switches into the G's code via gogo. The scheduler runs again only when that G yields (Gosched), blocks (gopark), exits (goexit), or is preempted.
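From user code, the cleanest way to hand control back to schedule is runtime.Gosched, which (via mcall of gosched_m) puts the current G on the global run queue and re-enters the scheduler. The exact interleaving below is typical rather than guaranteed:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P makes the hand-off easy to observe
	done := make(chan struct{})

	go func() { // lands in runnext via newproc -> runqput(pp, newg, true)
		fmt.Println("B: other goroutine ran")
		close(done)
	}()

	fmt.Println("A: main before yield")
	runtime.Gosched() // main goes to the global queue; schedule() picks the other G
	<-done
	fmt.Println("C: main after yield")
}
```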
findrunnable Walk-Through¶
The actual function is ~250 lines including comments. We will not reproduce it verbatim but list the search order and the gotchas:
func findRunnable() (gp *g, inheritTime, tryWakeP bool) {
mp := getg().m
top:
pp := mp.p.ptr()
// STW check
if sched.gcwaiting.Load() { gcstopm(); goto top }
if pp.runSafePointFn != 0 { runSafePointFn() }
// (1) Local runq, including runnext.
if gp, inheritTime = runqget(pp); gp != nil {
return gp, inheritTime, false
}
// (2) Global runq.
if sched.runqsize != 0 {
lock(&sched.lock)
gp = globrunqget(pp, 0)
unlock(&sched.lock)
if gp != nil { return gp, false, false }
}
// (3) Network poller (non-blocking).
if netpollinited() && netpollAnyWaiters() && sched.lastpoll.Load() != 0 {
if list, delta := netpoll(0); !list.empty() {
gp := list.pop()
injectglist(&list)
netpollAdjustWaiters(delta)
casgstatus(gp, _Gwaiting, _Grunnable)
return gp, false, false
}
}
// (4) Spinning to steal.
if mp.spinning || 2*sched.nmspinning.Load() < gomaxprocs-sched.npidle.Load() {
if !mp.spinning {
mp.becomeSpinning()
}
gp, inheritTime, tnow, w, newWork := stealWork(now)
if gp != nil { return gp, inheritTime, false }
if newWork { goto top } // we made it ourselves
}
// (5) GC-only Gs.
if gcBlackenEnabled != 0 && gcMarkWorkAvailable(pp) {
node := (*gcBgMarkWorkerNode)(gcBgMarkWorkerPool.pop())
if node != nil { ... return mark worker G ... }
}
// (6) Last chance global runq.
lock(&sched.lock)
if sched.runqsize != 0 {
gp := globrunqget(pp, 0)
unlock(&sched.lock)
return gp, false, false
}
if releasep() != pp { throw("findrunnable: wrong p") }
now := pidleput(pp, now)
unlock(&sched.lock)
// (7) Stop spinning if applicable.
wasSpinning := mp.spinning
if mp.spinning {
mp.spinning = false
if sched.nmspinning.Add(-1) < 0 {
throw("findrunnable: negative nmspinning")
}
// Final check
...
}
// (8) Park.
stopm()
goto top
}
Key invariant: an M spins only while every run queue it has examined is empty; it never parks while work it could see remains. If a G becomes runnable on another P during a steal attempt, the spinning M will see it on the next pass (or wakep will wake another M).
Steps (6) and (7) include the "delicate dance" — racing between dropping the P, checking once more, and parking. The full code has detailed comments explaining why each re-check is necessary.
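The shape of that dance is worth internalizing: a worker must publish that it is going idle and re-check for work before sleeping, or a producer that enqueued just after the worker's last check might never wake anyone. A toy version with an explicit hand-off; all names here are illustrative, not runtime APIs:

```go
package main

import (
	"fmt"
	"sync"
)

type pool struct {
	mu       sync.Mutex
	work     []int
	sleepers []chan int
}

// put either hands work directly to a parked worker or queues it.
func (p *pool) put(v int) {
	p.mu.Lock()
	if len(p.sleepers) > 0 {
		ch := p.sleepers[0]
		p.sleepers = p.sleepers[1:]
		p.mu.Unlock()
		ch <- v // direct hand-off, like startm giving nextp to a woken M
		return
	}
	p.work = append(p.work, v)
	p.mu.Unlock()
}

// get re-checks for work and registers as idle under the same lock, so a
// concurrent put can never slip between "saw nothing" and "went to sleep".
// findRunnable does the moral equivalent with pidleput plus one last queue
// scan before stopm, since it cannot hold sched.lock across the futex sleep.
func (p *pool) get() int {
	p.mu.Lock()
	if len(p.work) > 0 {
		v := p.work[0]
		p.work = p.work[1:]
		p.mu.Unlock()
		return v
	}
	ch := make(chan int, 1)
	p.sleepers = append(p.sleepers, ch)
	p.mu.Unlock()
	return <-ch
}

func main() {
	var p pool
	go p.put(42)
	fmt.Println(p.get()) // 42, whether put runs before or after get parks
}
```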
runqput, runqget, runqsteal — the Atomic Protocol¶
const _RunqMask = 255
// runqput tries to put g on the local runnable queue.
// If next is false, runqput adds g to the tail of the runnable queue.
// If next is true, runqput puts g in the pp.runnext slot.
// If the run queue is full, runnable g is put on the global queue.
// Executed only by the owner P.
func runqput(pp *p, gp *g, next bool) {
if randomizeScheduler && next && fastrandn(2) == 0 {
next = false
}
if next {
retryNext:
oldnext := pp.runnext
if !pp.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
goto retryNext
}
if oldnext == 0 {
return
}
// Kick the old runnext out to the regular run queue.
gp = oldnext.ptr()
}
retry:
h := atomic.LoadAcq(&pp.runqhead) // load-acquire, synchronize with consumers
t := pp.runqtail
if t-h < uint32(len(pp.runq)) {
pp.runq[t%uint32(len(pp.runq))].set(gp)
atomic.StoreRel(&pp.runqtail, t+1) // store-release, makes the item available for consumption
return
}
if runqputslow(pp, gp, h, t) {
return
}
// the queue is not full, now the put above must succeed
goto retry
}
The atomics:
- runqhead is accessed by multiple writers (the owning P, and stealing Ps via CAS).
- runqtail is written by a single writer (the owning P), so it uses LoadAcq/StoreRel only.
- runnext is accessed by multiple writers (the owning P, and stealing Ps via CAS, but only on later steal passes).
runqputslow:
func runqputslow(pp *p, gp *g, h, t uint32) bool {
var batch [len(pp.runq)/2 + 1]*g
// First, grab a batch from local queue.
n := t - h
n = n / 2
if n != uint32(len(pp.runq)/2) {
throw("runqputslow: queue is not full")
}
for i := uint32(0); i < n; i++ {
batch[i] = pp.runq[(h+i)%uint32(len(pp.runq))].ptr()
}
if !atomic.CasRel(&pp.runqhead, h, h+n) {
return false
}
batch[n] = gp
if randomizeScheduler {
for i := uint32(1); i <= n; i++ {
j := fastrandn(i + 1)
batch[i], batch[j] = batch[j], batch[i]
}
}
// Link the goroutines.
for i := uint32(0); i < n; i++ {
batch[i].schedlink.set(batch[i+1])
}
var q gQueue
q.head.set(batch[0])
q.tail.set(batch[n])
// Now put the batch on global queue.
lock(&sched.lock)
globrunqputbatch(&q, int32(n+1))
unlock(&sched.lock)
return true
}
Half of the local run queue (128 Gs), plus the G that did not fit, is moved to the global run queue under sched.lock. The random-shuffle branch (randomizeScheduler is true in race-detector builds) ensures tests don't accidentally depend on FIFO order.
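To internalize the protocol, here is a user-space sketch of the same single-producer ring with CAS-stealing consumers. It uses sync/atomic's sequentially consistent operations, which are stronger than the runtime's LoadAcq/StoreRel but preserve the same reasoning; the type and method names are illustrative, not runtime code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type ringQueue struct {
	head atomic.Uint32 // advanced by owner (runqget) and stealers, via CAS
	tail atomic.Uint32 // written only by the owner
	buf  [256]int      // stand-in for [256]guintptr
}

// put is called only by the owner; mirrors runqput's fast path.
func (q *ringQueue) put(v int) bool {
	h, t := q.head.Load(), q.tail.Load()
	if t-h < uint32(len(q.buf)) {
		q.buf[t%uint32(len(q.buf))] = v
		q.tail.Store(t + 1) // publish the new item
		return true
	}
	return false // full: the runtime would overflow half to the global queue
}

// get is called only by the owner; mirrors runqget.
func (q *ringQueue) get() (int, bool) {
	for {
		h, t := q.head.Load(), q.tail.Load()
		if t == h {
			return 0, false
		}
		v := q.buf[h%uint32(len(q.buf))]
		if q.head.CompareAndSwap(h, h+1) { // a stealer may race us; retry if so
			return v, true
		}
	}
}

// steal is called by other "Ps": grab half of the queue, like runqgrab.
// If the CAS fails the slots we copied may be stale, so we simply retry.
func (q *ringQueue) steal(batch []int) int {
	for {
		h, t := q.head.Load(), q.tail.Load()
		n := t - h
		n = n - n/2
		if n == 0 {
			return 0
		}
		for i := uint32(0); i < n; i++ {
			batch[i] = q.buf[(h+i)%uint32(len(q.buf))]
		}
		if q.head.CompareAndSwap(h, h+n) {
			return int(n)
		}
	}
}

func main() {
	var q ringQueue
	for i := 0; i < 10; i++ {
		q.put(i)
	}
	var batch [128]int
	fmt.Println("stolen:", q.steal(batch[:])) // 5: half of what was queued
	for v, ok := q.get(); ok; v, ok = q.get() {
		fmt.Print(v, " ") // the owner drains the rest
	}
	fmt.Println()
}
```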
gopark and goready Internals¶
gopark parks the calling G. goready makes a parked G runnable. Together they implement every blocking primitive.
func gopark(unlockf func(*g, unsafe.Pointer) bool, lock unsafe.Pointer, reason waitReason, traceReason traceBlockReason, traceskip int) {
if reason != waitReasonSleep {
checkTimeouts()
}
mp := acquirem()
gp := mp.curg
status := readgstatus(gp)
if status != _Grunning && status != _Gscanrunning {
throw("gopark: bad g status")
}
mp.waitlock = lock
mp.waitunlockf = unlockf
gp.waitreason = reason
mp.waitTraceBlockReason = traceReason
mp.waitTraceSkip = traceskip
releasem(mp)
// can't do anything that might move the G between Ms here.
mcall(park_m)
}
// park continuation on g0.
func park_m(gp *g) {
mp := getg().m
trace := traceAcquire()
// N.B. Not using casGToWaiting here because the waitreason is
// set by park_m's caller.
casgstatus(gp, _Grunning, _Gwaiting)
if trace.ok() {
trace.GoPark(mp.waitTraceBlockReason, mp.waitTraceSkip)
traceRelease(trace)
}
dropg()
if fn := mp.waitunlockf; fn != nil {
ok := fn(gp, mp.waitlock)
mp.waitunlockf = nil
mp.waitlock = nil
if !ok {
// spurious wake — put G back.
if trace.ok() {
trace.GoUnpark(gp, 2)
traceRelease(trace)
}
casgstatus(gp, _Gwaiting, _Grunnable)
execute(gp, true) // never returns.
}
}
schedule()
}
goready:
// goready marks the G as ready to run.
func goready(gp *g, traceskip int) {
systemstack(func() {
ready(gp, traceskip, true)
})
}
// Mark gp ready to run.
func ready(gp *g, traceskip int, next bool) {
status := readgstatus(gp)
// Mark runnable.
mp := acquirem()
if status&^_Gscan != _Gwaiting {
dumpgstatus(gp)
throw("bad g->status in ready")
}
casgstatus(gp, _Gwaiting, _Grunnable)
runqput(mp.p.ptr(), gp, next)
wakep()
releasem(mp)
}
The next=true argument from goready is why a channel-driven wakeup places the G in runnext. The pattern is so important that goready does not even accept next=false; callers that want the regular queue call ready(gp, ..., false) directly.
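The waitReason recorded by gopark is visible from user code: a full goroutine dump annotates every parked G with it. A small demonstration using only runtime.Stack (the sleep is just to give the child time to park):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	ch := make(chan int)
	go func() {
		<-ch // chanrecv -> gopark(..., waitReasonChanReceive, ...)
	}()
	time.Sleep(100 * time.Millisecond) // let the goroutine park

	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true) // true = dump all goroutines
	fmt.Printf("%s\n", buf[:n])   // the parked G is listed as "[chan receive]"

	close(ch)
}
```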
wakep and the Spinning Cap¶
// Tries to add one more P to execute G's.
// Called when a G is made runnable (newproc, ready).
// Must be called with a P.
func wakep() {
// Be conservative about spinning threads, only start one if none exist
// already.
if sched.nmspinning.Load() != 0 || !sched.nmspinning.CompareAndSwap(0, 1) {
return
}
// Disable preemption until ownership of pp transfers to the next M in startm.
// Otherwise preemption here would leave pp stuck in _Pgcstop.
mp := acquirem()
var pp *p
lock(&sched.lock)
pp, _ = pidlegetSpinning(0)
if pp == nil {
if sched.nmspinning.Add(-1) < 0 {
throw("wakep: negative nmspinning")
}
unlock(&sched.lock)
releasem(mp)
return
}
unlock(&sched.lock)
startm(pp, true, false)
releasem(mp)
}
Key check: nmspinning.CompareAndSwap(0, 1). A new spinning M is started only when no M is currently spinning. If many wakep calls race, only one proceeds; the rest see nmspinning != 0 (or lose the CAS) and return.
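The gate generalizes: when many events race to wake one worker, a CompareAndSwap from 0 to 1 lets exactly one through. A standalone sketch of the pattern; names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var nmspinning atomic.Int32

// wake mirrors wakep's guard: bail out unless we win the 0 -> 1 transition.
func wake(started *atomic.Int32) {
	if nmspinning.Load() != 0 || !nmspinning.CompareAndSwap(0, 1) {
		return
	}
	started.Add(1) // stand-in for startm(pp, true, false)
}

func main() {
	var started atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			wake(&started)
		}()
	}
	wg.Wait()
	fmt.Println("startm calls:", started.Load()) // exactly 1
}
```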
startm:
// Schedules some M to run the p (creates an M if necessary).
// If p == nil, tries to get an idle P, if no idle P's does nothing.
func startm(pp *p, spinning bool, lockheld bool) {
mp := acquirem()
if !lockheld {
lock(&sched.lock)
}
if pp == nil {
if spinning {
// ...
unlock(&sched.lock)
releasem(mp)
return
}
pp, _ = pidleget(0)
if pp == nil {
if !lockheld { unlock(&sched.lock) }
releasem(mp)
return
}
}
nmp := mget()
if nmp == nil {
// No M is available, we must drop sched.lock and call newm.
id := mReserveID()
unlock(&sched.lock)
var fn func()
if spinning {
fn = mspinning
}
newm(fn, pp, id)
if lockheld { lock(&sched.lock) }
releasem(mp)
return
}
if !lockheld { unlock(&sched.lock) }
nmp.spinning = spinning
nmp.nextp.set(pp)
notewakeup(&nmp.park)
releasem(mp)
}
The decision tree: idle M available → wake it; otherwise → newm creates a new OS thread via clone(2).
stopm and startm¶
stopm parks the calling M:
// Stops execution of the current m until new work is available.
// Returns with acquired P.
func stopm() {
gp := getg()
if gp.m.locks != 0 {
throw("stopm holding locks")
}
if gp.m.p != 0 {
throw("stopm holding p")
}
if gp.m.spinning {
throw("stopm spinning")
}
lock(&sched.lock)
mput(gp.m)
unlock(&sched.lock)
mPark()
acquirep(gp.m.nextp.ptr())
gp.m.nextp = 0
}
func mPark() {
gp := getg()
notesleep(&gp.m.park)
noteclear(&gp.m.park)
}
mput puts the M on sched.midle. notesleep is a futex wait. acquirep binds m.nextp to the M (nextp was set by whoever woke us).
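A rough user-space analogue of the note that mPark sleeps on, sketched with a buffered channel. The real Linux implementation is a futex wait in lock_futex.go; the semantics here are deliberately simplified (no noteclear):

```go
package main

import "fmt"

// note approximates the runtime's note: sleep blocks until wakeup is called.
type note struct{ ch chan struct{} }

func newNote() *note { return &note{ch: make(chan struct{}, 1)} }

func (n *note) wakeup() {
	select {
	case n.ch <- struct{}{}: // set the flag
	default: // already set; wakeup is idempotent
	}
}

func (n *note) sleep() { <-n.ch } // block until the flag is set

func main() {
	n := newNote()
	go n.wakeup() // like notewakeup(&nmp.park) in startm
	n.sleep()     // like notesleep(&gp.m.park) in mPark
	fmt.Println("M woke up")
}
```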
pidleget, pidleput, mput, mget¶
The idle-list management:
func pidleput(pp *p, now int64) int64 {
assertLockHeld(&sched.lock)
if !runqempty(pp) {
throw("pidleput: P has non-empty run queue")
}
if now == 0 {
now = nanotime()
}
updateTimerPMask(pp) // clear timerpMask bit
idlepMask.set(pp.id)
pp.link = sched.pidle
sched.pidle.set(pp)
sched.npidle.Add(1)
if !pp.limiterEvent.start(limiterEventIdle, now) {
throw("pidleput: idle event already started")
}
return now
}
func pidleget(now int64) (*p, int64) {
assertLockHeld(&sched.lock)
pp := sched.pidle.ptr()
if pp != nil {
if now == 0 {
now = nanotime()
}
timerpMask.set(pp.id)
idlepMask.clear(pp.id)
sched.pidle = pp.link
sched.npidle.Add(-1)
pp.limiterEvent.stop(limiterEventIdle, now)
}
return pp, now
}
func mput(mp *m) {
assertLockHeld(&sched.lock)
mp.schedlink = sched.midle
sched.midle.set(mp)
sched.nmidle++
checkdead()
}
func mget() *m {
assertLockHeld(&sched.lock)
mp := sched.midle.ptr()
if mp != nil {
sched.midle = mp.schedlink
sched.nmidle--
}
return mp
}
idlepMask and timerpMask are bitmasks the runtime maintains alongside the list, enabling fast "any idle P?" and "any P has timers?" checks without walking the list.
checkdead is called after mput: if every G is parked and no Ms can do anything, the runtime prints the famous deadlock fatal.
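The simplest way to trip checkdead from user code is to park the only goroutine on something that can never be signalled:

```go
// Every G is in _Gwaiting, no M can ever find work, so mput -> checkdead
// reports: fatal error: all goroutines are asleep - deadlock!
package main

func main() {
	ch := make(chan int)
	<-ch // parks forever; nothing will ever send
}
```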
Lock Ranks and Invariants¶
From runtime/lockrank.go (the relevant ranks for the scheduler):
const (
lockRankSysmon = lockRank(iota)
lockRankScavenge
lockRankForcegc
lockRankDefer
lockRankSudog
lockRankAllg
lockRankAllp
lockRankSched
lockRankAllocmW
lockRankExecW
lockRankCpuprof
lockRankPollDesc
lockRankWakeableSleep
lockRankHchan
lockRankNotifyList
lockRankSweep
lockRankMheap
...
)
sched.lock sits in the middle of this list. In the ranking above, channel locks (hchan.lock) and heap locks (mheap.lock) rank above it; sudog/defer pool locks rank below. The rule is that locks must be acquired in increasing rank order: a lower-rank lock may never be taken while a higher-rank one is held. That rule governs every call sequence in the scheduler.
A consequence: channel code releases hchan.lock before calling goready, because goready may need sched.lock (which is lower-rank). Acquiring sched.lock while holding hchan.lock would violate the rank order.
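A toy version of the discipline, checking that ranks only increase as locks are acquired. The real checker lives in lockrank_on.go, is compiled in under the staticlockranking GOEXPERIMENT, and additionally admits a partial order between ranks; everything below is an illustrative sketch, not the runtime's API:

```go
package main

import "fmt"

type lockRank int

const (
	rankSudog lockRank = iota // lower ranks must be taken first
	rankSched
	rankHchan
	rankMheap
)

// held records the ranks this "thread" currently holds, in acquisition order.
type held struct{ ranks []lockRank }

func (h *held) acquire(r lockRank) {
	for _, prev := range h.ranks {
		if r <= prev {
			panic(fmt.Sprintf("lock ordering violation: rank %d after rank %d", r, prev))
		}
	}
	h.ranks = append(h.ranks, r)
}

func (h *held) release() { h.ranks = h.ranks[:len(h.ranks)-1] }

func main() {
	var h held
	h.acquire(rankSched)
	h.acquire(rankHchan) // fine: ranks increase
	h.release()
	h.release()

	defer func() { fmt.Println("recovered:", recover()) }()
	h.acquire(rankHchan)
	h.acquire(rankSched) // violation: sched ranks below the held hchan
}
```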
Version-by-Version Evolution¶
| Version | Change |
|---|---|
| Go 1.0 (Mar 2012) | Single global runqueue, single lock. |
| Go 1.1 (May 2013) | G-M-P introduced (Vyukov design). Work-stealing scheduler. |
| Go 1.2 | Preemption points added at function entry (checked in the stack-growth prologue). |
| Go 1.5 | Runtime rewritten in Go (was C). mcache per-P. Cooperative preemption via morestack calls only. |
| Go 1.7 | Net poller integrated into findrunnable. |
| Go 1.14 | Async preemption via SIGURG. Loops without function calls now preemptible. Timers moved onto per-P heaps. |
| Go 1.18 | Internal cleanup; runnext semantics formalised. |
| Go 1.21 | Improved scheduler latency tracking via timeToRun histogram. |
| Go 1.22 | Cgroup-aware GOMAXPROCS proposals begin. |
| Go 1.25 | Default GOMAXPROCS reads cgroup CPU quota. |
Each release tweaks heuristics — the number of steal passes, the every-61st-tick check of the global queue for fairness, the spinning cap, gFree thresholds — but the struct shapes and the core function signatures have been stable in their essentials since 1.5.
Reading Path for Maximum Yield¶
If you have an hour, read these in order:
- runtime/runtime2.go, lines 380–550 (struct definitions).
- runtime/proc.go: schedule() (around line 3500).
- runtime/proc.go: findRunnable() (just after schedule).
- runtime/proc.go: runqput, runqget, runqsteal (toward the bottom).
- runtime/proc.go: gopark, park_m, ready (in the middle of the file).
- runtime/proc.go: newproc, newproc1.
If you have a day, also:
- runtime/HACKING.md — the runtime team's developer notes.
- Dmitry Vyukov's "Scalable Go Scheduler Design Doc" in full.
- runtime/lockrank.go for the rank list.
You will emerge able to read any stack trace in any Go panic and know precisely which scheduler path was active.
Summary¶
The G-M-P model is implemented by three structs (g, m, p) plus one global (sched) defined in runtime/runtime2.go, manipulated by a handful of functions in runtime/proc.go: newproc, schedule, findrunnable, runqput/runqget/runqsteal, gopark/ready, wakep/startm/stopm. Every Go program — every go f(), every channel send, every time.Sleep, every network read — exercises some path through this code. The atomic protocol on the ring-buffer runqueue, the lock-rank discipline, and the spinning-M cap are the three subtleties that make it scalable. Read the source once and the rest of the deep-dive section becomes commentary on what you have already seen.