Syscall Handling — Professional Level

Table of Contents

  1. Introduction
  2. Source Files You Will Read
  3. reentersyscall Line by Line
  4. exitsyscall and the Fast/Slow Branch
  5. exitsyscallfast Internals
  6. sysmon Loop and retake for _Psyscall
  7. handoffp Implementation
  8. The _Psyscall State Machine
  9. Per-Platform Thread Creation: newosproc
  10. The Netpoller Init and Wakeup Paths
  11. entersyscallblock — the Pre-Emptive Handoff
  12. Cgo Call Internals
  13. Locked Goroutines Across Syscalls
  14. Reading Runtime Telemetry from C Code
  15. Summary

Introduction

At professional level you read the Go runtime source. You can point to the exact lines that implement syscall handoff, sysmon's check, and per-OS thread creation. You can patch the runtime to add diagnostics or experiment with alternative policies.

References are to Go 1.22 source, with file paths relative to src/runtime/. Line numbers drift between minor versions; the structure is stable. Open the source in another window as you read.

The mental model from middle and senior pages stays the same — we are now annotating it with code citations.


Source Files You Will Read

Files, and what each contains:

  • runtime2.go: type definitions for g, m, p, sched, and state constants such as _Psyscall.
  • proc.go: scheduler core (entersyscall, exitsyscall, sysmon, handoffp, schedule, findRunnable).
  • os_linux.go: Linux newosproc (calls clone(2)), signal init, OS time.
  • os_darwin.go: macOS newosproc (calls pthread_create).
  • os_windows.go: Windows newosproc (CreateThread).
  • netpoll.go: cross-platform netpoller core.
  • netpoll_epoll.go: Linux epoll integration.
  • netpoll_kqueue.go: BSD/macOS kqueue integration.
  • cgocall.go: cgocall, cgocallback, cgo M state.
  • preempt.go: async preemption (touches the syscall paths).
  • sys_linux_amd64.s: the runtime's assembly syscall stubs (the syscall package's Go wrappers bracket the raw SYSCALL instruction with entersyscall/exitsyscall).

HACKING.md (alongside the source) is essential prerequisite reading.


reentersyscall Line by Line

runtime.entersyscall is a thin wrapper:

//go:nosplit
func entersyscall() {
    reentersyscall(getcallerpc(), getcallersp())
}

The real work is reentersyscall:

//go:nosplit
func reentersyscall(pc, sp uintptr) {
    gp := getg()

    // Disable preemption because we hold m.p.
    gp.m.locks++
    gp.stackguard0 = stackPreempt
    gp.throwsplit = true

    // Leave SP and PC for trace.
    save(pc, sp)
    gp.syscallsp = sp
    gp.syscallpc = pc
    casgstatus(gp, _Grunning, _Gsyscall)

    if staticLockRanking {
        // ... lock rank tracking ...
    }

    if sched.sysmonwait.Load() {
        systemstack(entersyscall_sysmon)
    }

    if gp.m.p.ptr().runSafePointFn != 0 {
        // runSafePointFn may schedule, so it has to be done on the systemstack.
        systemstack(runSafePointFn)
    }

    gp.m.syscalltick = gp.m.p.ptr().syscalltick
    gp.sysblocktraced = true
    pp := gp.m.p.ptr()
    pp.m = 0
    gp.m.oldp.set(pp)
    gp.m.p = 0
    atomic.Store(&pp.status, _Psyscall)
    if sched.gcwaiting.Load() {
        systemstack(entersyscall_gcwait)
    }

    gp.m.locks--
}

Annotated:

  1. gp.m.locks++: disable preemption; we are about to be in an inconsistent state.
  2. gp.stackguard0 = stackPreempt: make sure any function-prologue stack check fires (paranoia).
  3. save(pc, sp): save the caller's PC and SP into gp so the runtime can restore them.
  4. casgstatus(gp, _Grunning, _Gsyscall): atomically transition the G's state.
  5. sched.sysmonwait check: if sysmon is asleep waiting for activity, wake it so it can do handoff if needed.
  6. runSafePointFn check: the GC may have scheduled a "do this safe-point work" function.
  7. Detach the P from the M: set pp.m = 0, save oldp, clear m.p.
  8. atomic.Store(&pp.status, _Psyscall): flag the P as available for handoff.
  9. gcwaiting check: if the GC wants to stop the world, help it.
  10. gp.m.locks--: re-enable preemption.

After this returns, the M executes the actual SYSCALL instruction in the syscall package wrapper (e.g., Syscall6), sketched below.
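
For reference, the Linux wrapper looks roughly like this (paraphrased from src/syscall/syscall_linux.go; details vary by release):

func Syscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno) {
    runtime_entersyscall() // the reentersyscall path above
    r1, r2, errno := RawSyscall6(trap, a1, a2, a3, 0, 0, 0)
    runtime_exitsyscall() // fast/slow exit, next section
    return r1, r2, Errno(errno)
}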

Why the m.locks++ / m.locks-- bracket?

m.locks is the runtime's preemption disable counter. Anything > 0 means "don't preempt me". Setting it during reentersyscall is essential because halfway through, the M's P is in an inconsistent state (m.p == 0 but pp.status == _Psyscall).


exitsyscall and the Fast/Slow Branch

//go:nosplit
func exitsyscall() {
    gp := getg()

    gp.m.locks++
    if getcallersp() > gp.syscallsp {
        throw("exitsyscall: syscall frame is no longer valid")
    }

    gp.waitsince = 0
    oldp := gp.m.oldp.ptr()
    gp.m.oldp = 0
    if exitsyscallfast(oldp) {
        // Fast path: re-acquired oldp (or some P).
        // ... profiling hooks ...
        gp.m.locks--
        if gp.preempt {
            // Preemption was requested while we were in syscall.
            gp.stackguard0 = stackPreempt
        } else {
            gp.stackguard0 = gp.stack.lo + _StackGuard
        }
        gp.throwsplit = false

        if sched.disable.user && !schedEnabled(gp) {
            // Schedule was disabled while we were away.
            Gosched()
        }
        return
    }

    gp.m.locks--

    // Slow path. Park M, requeue G.
    mcall(exitsyscall0)

    // Re-enter Go scheduling on the new M.
    gp.syscallsp = 0
    gp.m.p.ptr().syscalltick++
    gp.throwsplit = false
}

The two outcomes:

  • Fast path: exitsyscallfast(oldp) returns true. We have a P. Mark the G running and continue inline.
  • Slow path: mcall(exitsyscall0) switches to the M's g0 stack and parks the M.

Notice the gp.preempt check: a preemption can be requested while we were in syscall. We honour it on the way out by setting stackguard0 = stackPreempt, which causes the next function-prologue check to call morestack, which calls into the scheduler.


exitsyscallfast Internals

//go:nosplit
func exitsyscallfast(oldp *p) bool {
    // Freezetheworld set the stopwait but did not retake P's.
    if sched.stopwait == freezeStopWait {
        return false
    }

    // Try to re-acquire the old P.
    if oldp != nil && oldp.status == _Psyscall && atomic.Cas(&oldp.status, _Psyscall, _Pidle) {
        // Yes! Re-attach.
        wirep(oldp)
        exitsyscallfast_reacquired()
        return true
    }

    // Try to get any other P.
    if sched.pidle != 0 {
        var ok bool
        systemstack(func() {
            ok = exitsyscallfast_pidle()
        })
        if ok {
            return true
        }
    }
    return false
}

The key CAS:

atomic.Cas(&oldp.status, _Psyscall, _Pidle)

If sysmon has already CAS'd it to _Pidle (handoff started), our CAS fails, and we fall back to looking for any other idle P.

exitsyscallfast_pidle is the systemstack helper:

func exitsyscallfast_pidle() bool {
    lock(&sched.lock)
    pp, _ := pidlegetSpinning(0)
    if pp != nil && sched.sysmonwait.Load() {
        sched.sysmonwait.Store(false)
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
    if pp != nil {
        acquirep(pp)
        return true
    }
    return false
}

The global scheduler lock (sched.lock) is taken briefly. pidlegetSpinning pops the head of the idle-P list; if we got one, acquirep attaches it.

The sysmon wakeup: if we just acquired a P, the system has runnable work again, so wake sysmon if it was sleeping.


sysmon Loop and retake for _Psyscall

sysmon is launched at startup with no P:

func sysmon() {
    lock(&sched.lock)
    sched.nmsys++
    unlock(&sched.lock)

    lasttrace := int64(0)
    idle := 0
    delay := uint32(0)

    for {
        if idle == 0 {
            delay = 20  // 20 microseconds
        } else if idle > 50 {
            delay *= 2
        }
        if delay > 10*1000 {
            delay = 10 * 1000  // cap at 10 ms
        }
        usleep(delay)

        now := nanotime()

        // ... scheduler bookkeeping ...

        // Retake: preempt long-running Gs, hand off syscall Ps.
        if retake(now) != 0 {
            idle = 0
        } else {
            idle++
        }

        // ... GC trigger, scavenge, etc ...
    }
}

retake is the function we care about:

func retake(now int64) uint32 {
    n := 0
    lock(&allpLock)
    for i := 0; i < len(allp); i++ {
        pp := allp[i]
        if pp == nil {
            continue
        }
        pd := &pp.sysmontick
        s := pp.status
        sysretake := false
        if s == _Prunning || s == _Psyscall {
            // Preempt G if it's running too long.
            t := int64(pp.schedtick)
            if int64(pd.schedtick) != t {
                pd.schedtick = uint32(t)
                pd.schedwhen = now
            } else if pd.schedwhen+forcePreemptNS <= now {
                preemptone(pp)
                sysretake = true
            }
        }
        if s == _Psyscall {
            // Retake P from syscall if it's been more than 1 sysmon tick.
            t := int64(pp.syscalltick)
            if !sysretake && int64(pd.syscalltick) != t {
                pd.syscalltick = uint32(t)
                pd.syscallwhen = now
                continue
            }
            if runqempty(pp) && sched.nmspinning.Load()+sched.npidle.Load() > 0 &&
                pd.syscallwhen+10*1000 > now {
                continue
            }
            // Need to decrement number of idle locked M's
            // (pretending that one more is running) before the CAS.
            unlock(&allpLock)
            incidlelocked(-1)
            if atomic.Cas(&pp.status, s, _Pidle) {
                if traceEnabled() {
                    traceGoSysBlock(pp)
                    traceProcStop(pp)
                }
                n++
                pp.syscalltick++
                handoffp(pp)
            }
            incidlelocked(1)
            lock(&allpLock)
        }
    }
    unlock(&allpLock)
    return uint32(n)
}

The interesting check:

if runqempty(pp) && sched.nmspinning.Load()+sched.npidle.Load() > 0 &&
    pd.syscallwhen+10*1000 > now {
    continue
}

This says: if the P has nothing to do AND there are spinning Ms or idle Ps available AND the syscall is < 10 µs old, don't bother handing off. The runtime is being conservative: handoff is only worthwhile if (a) there is work waiting, or (b) the syscall is long enough that we cannot afford to wait.

The CAS from _Psyscall to _Pidle is the moment of handoff. If we win, we own the P and call handoffp(pp) to dispose of it.
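
You can watch this machinery from the outside. Here is a hedged experiment (Unix-only; exact counts vary by machine and Go version): block goroutines in real read(2) calls on a blocking pipe fd (raw syscall fds are not netpoller-managed) and watch thread creation rise as sysmon retakes their Ps and startm spins up replacements.

package main

import (
    "fmt"
    "runtime/pprof"
    "syscall"
    "time"
)

func main() {
    threads := pprof.Lookup("threadcreate")
    fmt.Println("threads at start:", threads.Count())

    var fds [2]int
    if err := syscall.Pipe(fds[:]); err != nil {
        panic(err)
    }
    for i := 0; i < 16; i++ {
        go func() {
            buf := make([]byte, 1)
            syscall.Read(fds[0], buf) // parks this M in read(2); its P gets retaken
        }()
    }
    time.Sleep(200 * time.Millisecond) // leave sysmon time to retake and hand off
    fmt.Println("threads after:", threads.Count())
}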


handoffp Implementation

func handoffp(pp *p) {
    // 1. If P has runnable work, start an M to run it.
    if !runqempty(pp) || sched.runqsize != 0 {
        startm(pp, false)
        return
    }
    // 2. If GC wants the P, start an M.
    if gcBlackenEnabled != 0 && gcMarkWorkAvailable(pp) {
        startm(pp, false)
        return
    }
    // 3. If no spinning M and no idle P, start a spinning M.
    if sched.nmspinning.Load()+sched.npidle.Load() == 0 &&
        sched.nmspinning.CompareAndSwap(0, 1) {
        startm(pp, true)
        return
    }
    lock(&sched.lock)
    if sched.gcwaiting.Load() {
        pp.status = _Pgcstop
        sched.stopwait--
        if sched.stopwait == 0 {
            notewakeup(&sched.stopnote)
        }
        unlock(&sched.lock)
        return
    }
    // 4. If safe-point function pending, run it.
    if pp.runSafePointFn != 0 && atomic.Cas(&pp.runSafePointFn, 1, 0) {
        sched.safePointFn(pp)
        sched.safePointWait--
        if sched.safePointWait == 0 {
            notewakeup(&sched.safePointNote)
        }
    }
    if sched.runqsize != 0 {
        unlock(&sched.lock)
        startm(pp, false)
        return
    }
    // ... more special cases ...

    // 5. No work for this P; park it.
    pidleput(pp, 0)
    unlock(&sched.lock)
}

The cases (numbered above):

  1. P has runnable Gs: spin up an M to run them. Most common case.
  2. GC needs CPU: same.
  3. No spinning Ms anywhere: create one — important for waking up the scheduler.
  4. Safe-point work: run it now.
  5. Nothing to do: park the P on the idle list.

startm(pp, spinning):

func startm(pp *p, spinning bool) {
    mp := mget()  // try to get a parked M from sched.midle
    if mp == nil {
        // No parked M. Create one.
        var fn func()
        if spinning {
            fn = mspinning
        }
        newm(fn, pp, -1)
        return
    }
    // Wake an existing parked M.
    if mp.spinning {
        throw("startm: m is spinning")
    }
    if mp.nextp != 0 {
        throw("startm: m has p")
    }
    mp.spinning = spinning
    mp.nextp.set(pp)
    notewakeup(&mp.park)
}

mget() pulls from sched.midle (the M pool). If that is empty: newm → newm1 → newosproc → clone(2) (on Linux).

Time cost:

  • mget success: ~100 ns + the cost of waking the M (a futex wake on Linux, ~1 µs).
  • newm (cold path): ~5–50 µs because of clone(2).

The _Psyscall State Machine

        Schedule()
      [_Prunning]
            |
        entersyscall
            |
       [_Psyscall]  <----- M still attached via oldp,
            |                 P available for handoff
            |
       +----+----+---------------------------------------+
       |         |                                       |
   exitsyscallfast                                  sysmon's retake
   CAS _Psyscall->_Pidle                            CAS _Psyscall->_Pidle
       |                                                 |
       v                                                 v
   acquirep -> [_Prunning]                          handoffp(pp)
                                                          |
                                                          v
                                                    [_Pidle] / [_Prunning]
                                                    (depending on whether
                                                     handoffp parked or
                                                     started an M)

The CAS race between exitsyscallfast and retake is the central concurrency invariant. Only one of them wins. If exitsyscallfast wins, the M re-attaches and continues — sysmon's handoff is skipped this round. If sysmon wins, the M takes the slow path.

This is what makes the syscall handling lock-free in the common case.
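
The race is easy to model outside the runtime. Below is a toy sketch (stand-in states and names, not runtime code): two contenders CAS the same status word, and exactly one wins.

// Toy model of the _Psyscall -> _Pidle race. Exactly one CAS
// succeeds, so exactly one side owns the P afterwards.
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

const (
    pSyscall uint32 = iota // stand-in for _Psyscall
    pIdle                  // stand-in for _Pidle
)

func main() {
    status := pSyscall
    var wg sync.WaitGroup
    winners := make(chan string, 2)

    claim := func(name string) {
        defer wg.Done()
        if atomic.CompareAndSwapUint32(&status, pSyscall, pIdle) {
            winners <- name
        }
    }
    wg.Add(2)
    go claim("exitsyscallfast") // the M returning from the syscall
    go claim("sysmon retake")   // the monitor thread
    wg.Wait()
    close(winners)
    for w := range winners {
        fmt.Println("P handed to:", w) // printed exactly once
    }
}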


Per-Platform Thread Creation: newosproc

Linux

// runtime/os_linux.go
func newosproc(mp *m) {
    stk := unsafe.Pointer(mp.g0.stack.hi)
    var oset sigset
    sigprocmask(_SIG_SETMASK, &sigset_all, &oset)
    ret := retryOnEAGAIN(func() int32 {
        r := clone(cloneFlags, stk,
            unsafe.Pointer(mp),
            unsafe.Pointer(mp.g0),
            unsafe.Pointer(abi.FuncPCABI0(mstart)))
        if r >= 0 {
            return 0
        }
        return -r
    })
    sigprocmask(_SIG_SETMASK, &oset, nil)
    if ret != 0 {
        print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", ret, ")\n")
        if ret == _EAGAIN {
            println("runtime: may need to increase max user processes (ulimit -u)")
        }
        throw("newosproc")
    }
}

var cloneFlags = uintptr(_CLONE_VM | _CLONE_FS | _CLONE_FILES | _CLONE_SIGHAND |
    _CLONE_SYSVSEM | _CLONE_THREAD)

Note that sigprocmask blocks all signals around clone. The child inherits the blocked set; minit later unblocks signals once the M is fully set up. This prevents signals from being delivered to a half-initialised M.

macOS

// runtime/os_darwin.go
func newosproc(mp *m) {
    stk := unsafe.Pointer(mp.g0.stack.hi)
    var attr pthreadattr
    if pthread_attr_init(&attr) != 0 {
        writeErrStr(failthreadcreate)
        exit(1)
    }
    if pthread_attr_setstacksize(&attr, threadStackSize) != 0 {
        writeErrStr(failthreadcreate)
        exit(1)
    }
    if pthread_attr_setdetachstate(&attr, _PTHREAD_CREATE_DETACHED) != 0 {
        writeErrStr(failthreadcreate)
        exit(1)
    }
    var oset sigset
    sigprocmask(_SIG_SETMASK, &sigset_all, &oset)
    err := pthread_create(&attr, abi.FuncPCABI0(mstart_stub), unsafe.Pointer(mp))
    sigprocmask(_SIG_SETMASK, &oset, nil)
    if err != 0 {
        writeErrStr(failthreadcreate)
        exit(1)
    }
}

macOS uses pthread_create (which under the hood calls bsdthread_create). The detached state means the runtime does not need to pthread_join.

Windows

// runtime/os_windows.go
func newosproc(mp *m) {
    thandle := stdcall6(_CreateThread, 0, 0,
        abi.FuncPCABI0(tstart_stdcall), uintptr(unsafe.Pointer(mp)),
        0, 0)
    // ...
}

Windows uses CreateThread directly. The new thread starts at tstart_stdcall (assembly stub), which calls mstart.

Cost comparison

  • Linux: clone(2), ~5–50 µs. Direct syscall; no libc round-trip.
  • macOS: pthread_create (bsdthread_create underneath), ~50–100 µs. Slightly heavier; pthreads adds overhead.
  • Windows: CreateThread, ~50–200 µs depending on AV/EDR hooking.
  • FreeBSD: thr_new, ~10–50 µs. Similar to Linux.

The runtime's M pool absorbs this cost. Once an M exists, waking it is ~1 µs (a futex wake on Linux, condition signal on macOS, event on Windows).


The Netpoller Init and Wakeup Paths

The netpoller has two activities: registering fds and polling for readiness.

Registration (per-fd, once)

// runtime/netpoll.go
func poll_runtime_pollOpen(fd uintptr) (*pollDesc, int) {
    pd := pollcache.alloc()
    // ... init pollDesc ...
    var errno int32
    errno = netpollopen(fd, pd)
    return pd, int(errno)
}

// runtime/netpoll_epoll.go
func netpollopen(fd uintptr, pd *pollDesc) int32 {
    var ev epollevent
    ev.events = _EPOLLIN | _EPOLLOUT | _EPOLLRDHUP | _EPOLLET
    *(**pollDesc)(unsafe.Pointer(&ev.data)) = pd
    return -epollctl(epfd, _EPOLL_CTL_ADD, int32(fd), &ev)
}

_EPOLLET is edge-triggered. The kernel reports readiness once per state change, not continuously. The runtime is responsible for draining the fd until EAGAIN.
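
A user-space sketch of that obligation (hypothetical drainFd helper, not runtime code): once the kernel reports readiness, keep reading until EAGAIN, or the edge is lost.

package etdemo

import (
    "io"
    "syscall"
)

// drainFd reads until the kernel reports EAGAIN, which is mandatory
// under edge-triggered notification: the next event only arrives
// after a new state change. fd must be in non-blocking mode.
func drainFd(fd int, handle func([]byte)) error {
    buf := make([]byte, 4096)
    for {
        n, err := syscall.Read(fd, buf)
        if n > 0 {
            handle(buf[:n])
        }
        switch {
        case err == syscall.EAGAIN:
            return nil // fully drained; wait for the next edge
        case err != nil:
            return err
        case n == 0:
            return io.EOF // peer closed the fd
        }
    }
}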

Wait

// runtime/netpoll.go (called by scheduler)
func netpoll(delay int64) gList {
    if epfd == -1 {
        return gList{}
    }
    var waitms int32
    if delay < 0 {
        waitms = -1
    } else if delay == 0 {
        waitms = 0
    } else if delay < 1e6 {
        waitms = 1
    } else if delay < 1e15 {
        waitms = int32(delay / 1e6)
    } else {
        waitms = 1e9
    }

    var events [128]epollevent
retry:
    n := epollwait(epfd, &events[0], int32(len(events)), waitms)
    if n < 0 {
        if n != -_EINTR {
            println("runtime: epollwait on fd", epfd, "failed with", -n)
            throw("runtime: netpoll failed")
        }
        if waitms > 0 {
            return gList{}
        }
        goto retry
    }
    var toRun gList
    for i := int32(0); i < n; i++ {
        ev := &events[i]
        if ev.events == 0 {
            continue
        }
        // ... handle ev_wakeup_pipe ...
        pd := *(**pollDesc)(unsafe.Pointer(&ev.data))
        var mode int32
        if ev.events&(_EPOLLIN|_EPOLLRDHUP|_EPOLLHUP|_EPOLLERR) != 0 {
            mode += 'r'
        }
        if ev.events&(_EPOLLOUT|_EPOLLHUP|_EPOLLERR) != 0 {
            mode += 'w'
        }
        if mode != 0 {
            netpollready(&toRun, pd, mode)
        }
    }
    return toRun
}

netpoll(delay) is called from findRunnable (the scheduler's main "find me a goroutine" function). It returns a gList of goroutines now ready to resume.

netpollready unparks the goroutine attached to the pollDesc and adds it to the list.

Wakeup pipe

The netpoller also registers a wakeup descriptor with epoll: an eventfd on modern Linux, a pipe on other platforms and in older releases. Calling netpollBreak() writes to it, which makes epollwait return immediately. This is how the scheduler wakes the netpoller when other work appears, avoiding a busy-wait on epollwait's timeout.
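
The same trick is easy to reproduce in user code. A Linux-only sketch (a pipe here, though the runtime itself now uses an eventfd): a blocked epoll_wait returns as soon as another thread writes one byte.

package main

import (
    "fmt"
    "syscall"
    "time"
)

func main() {
    epfd, err := syscall.EpollCreate1(0)
    if err != nil {
        panic(err)
    }
    var fds [2]int
    if err := syscall.Pipe(fds[:]); err != nil {
        panic(err)
    }
    ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(fds[0])}
    if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, fds[0], &ev); err != nil {
        panic(err)
    }

    go func() {
        time.Sleep(100 * time.Millisecond)
        syscall.Write(fds[1], []byte{0}) // the userspace "netpollBreak"
    }()

    events := make([]syscall.EpollEvent, 8)
    n, err := syscall.EpollWait(epfd, events, -1) // -1: block indefinitely
    fmt.Println("epoll_wait returned:", n, err)   // returns on the write
}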


entersyscallblock — the Pre-Emptive Handoff

//go:nosplit
func entersyscallblock() {
    gp := getg()
    gp.m.locks++
    gp.throwsplit = true
    gp.stackguard0 = stackPreempt
    gp.m.syscalltick = gp.m.p.ptr().syscalltick
    gp.sysblocktraced = true
    gp.m.p.ptr().syscalltick++

    pc := getcallerpc()
    sp := getcallersp()
    save(pc, sp)
    gp.syscallsp = gp.sched.sp
    gp.syscallpc = gp.sched.pc

    casgstatus(gp, _Grunning, _Gsyscall)

    systemstack(entersyscallblock_handoff)

    save(getcallerpc(), getcallersp())
    gp.m.locks--
}

func entersyscallblock_handoff() {
    if traceEnabled() {
        traceGoSysCall()
        traceGoSysBlock(getg().m.p.ptr())
    }
    handoffp(releasep())
}

Unlike entersyscall, which detaches the P but leaves the handoff to sysmon, entersyscallblock calls handoffp immediately. Used when the runtime knows the next operation will block for a while:

  • semasleep (semaphore wait).
  • notetsleep with long timeouts.
  • Internal blocking operations.

When you write user-level code calling <-ch (an unbuffered channel receive with no sender), the runtime parks the G via gopark, not entersyscallblock. They are different mechanisms — gopark is purely user-space and doesn't involve the kernel at all.
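
The difference is observable (a hedged experiment): goroutines blocked on a channel park via gopark and hold no OS threads, so the "threadcreate" profile stays near its baseline no matter how many of them block.

package main

import (
    "fmt"
    "runtime/pprof"
    "time"
)

func main() {
    ch := make(chan struct{})
    for i := 0; i < 1000; i++ {
        go func() { <-ch }() // gopark: pure user-space parking
    }
    time.Sleep(200 * time.Millisecond)
    // Stays near the baseline: parked Gs pin no Ms or OS threads.
    fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())
}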


Cgo Call Internals

// runtime/cgocall.go
func cgocall(fn, arg unsafe.Pointer) int32 {
    if !iscgo && GOOS != "solaris" && GOOS != "illumos" && GOOS != "windows" {
        throw("cgocall unavailable")
    }
    if fn == nil {
        throw("cgocall nil")
    }
    if raceenabled {
        racereleasemerge(unsafe.Pointer(&racecgosync))
    }
    mp := getg().m
    mp.ncgocall++
    mp.ncgo++

    // Reset traceback so that it does not see the cgo call.
    mp.cgoCallers = nil

    // Announce we are entering a system call.
    entersyscall()

    osPreemptExtEnter(mp)

    mp.incgo = true
    errno := asmcgocall(fn, arg)

    mp.incgo = false
    mp.ncgo--

    osPreemptExtExit(mp)

    exitsyscall()

    // Keep fn, arg, and mp reachable by the GC until the C call
    // has returned, so they cannot be collected while C code may
    // still hold pointers to them.
    KeepAlive(fn)
    KeepAlive(arg)
    KeepAlive(mp)

    return errno
}

Note:

  • entersyscall is called first (so the P can be handed off).
  • mp.incgo = true is set after entersyscall, distinguishing "in cgo" from "in a regular syscall" for telemetry.
  • asmcgocall is the assembly that switches stacks. It saves the goroutine context, switches to the M's g0 stack, calls the C function, switches back, restores the goroutine context.
  • exitsyscall is called after the C function returns.
  • osPreemptExtEnter/Exit manage preemption signals during the cgo call (different platforms have different rules about signal delivery to threads in foreign code).

The cgo cost breakdown (for a 1 ns C function):

  • entersyscall: ~30 ns
  • stack switch (asm): ~10 ns
  • the C function itself: ~1 ns
  • stack switch back: ~10 ns
  • exitsyscall (fast path): ~30 ns
  • total: ~80–150 ns

For C functions that take > 1 µs, this overhead is negligible. For C functions that take < 100 ns, the cgo overhead dominates and is usually not worth it.
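
A minimal benchmark sketch to reproduce numbers in this ballpark (hypothetical package layout; cgo cannot appear in _test.go files, so the C call lives in a regular file):

// file: nop.go (cgo is not allowed directly in _test.go files)
package cgobench

/*
static int nop(void) { return 0; }
*/
import "C"

// cNop pays the full cgo toll: entersyscall, stack switch, C call,
// stack switch back, exitsyscall.
func cNop() { C.nop() }

// file: nop_test.go
package cgobench

import "testing"

func BenchmarkCgoNop(b *testing.B) {
    for i := 0; i < b.N; i++ {
        cNop()
    }
}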


Locked Goroutines Across Syscalls

When a LockOSThread'd G enters a syscall:

  1. entersyscall runs normally. P detaches.
  2. Sysmon may hand off the P. Handoff is unaffected by the lock.
  3. When the syscall returns, exitsyscall tries to reacquire. It must find a P for the specific M (since the G is locked to it).
  4. If no P is free, the M parks. The G waits on the M, not on a generic runqueue.

The interesting case is the slow path. In exitsyscall0:

func exitsyscall0(gp *g) {
    casgstatus(gp, _Gsyscall, _Grunnable)
    dropg()
    lock(&sched.lock)
    var pp *p
    if schedEnabled(gp) {
        pp, _ = pidleget(0)
    }
    var locked bool
    if pp == nil {
        globrunqput(gp)
        // We stoplockedm below if the G is locked: only its own M
        // may run it, so it must wait for that M, not for any M.
        locked = gp.lockedm != 0
    } else {
        // ... wake sysmon if it was sleeping ...
    }
    unlock(&sched.lock)
    if pp != nil {
        acquirep(pp)
        execute(gp, false) // Never returns.
    }
    if locked {
        stoplockedm()
        execute(gp, false) // Never returns.
    }
    stopm()
    schedule() // Never returns.
}

When the G is locked and we have no P, stoplockedm is called. It parks the M but in a state where only this specific G can wake it. When eventually a P is available, the runtime knows to wake this M (not any M).

Performance: locked goroutines can starve worse than unlocked ones, because the runtime cannot run them on any free M. Always size your locked-G workload carefully.
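
A common mitigation is to funnel all thread-affine work through a single goroutine that owns its thread, rather than locking many goroutines. A sketch with hypothetical names:

package main

import "runtime"

// request is a hypothetical envelope for work that must run on the
// dedicated thread.
type request struct {
    fn   func()
    done chan struct{}
}

// startThreadOwner parks one goroutine on its own M for its lifetime
// and serves requests on it.
func startThreadOwner() chan<- request {
    reqs := make(chan request)
    go func() {
        runtime.LockOSThread() // this goroutine now owns its M
        defer runtime.UnlockOSThread()
        for r := range reqs {
            r.fn() // runs on the dedicated thread, syscalls included
            close(r.done)
        }
    }()
    return reqs
}

func main() {
    owner := startThreadOwner()
    done := make(chan struct{})
    owner <- request{fn: func() { /* thread-affine call here */ }, done: done}
    <-done
}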


Reading Runtime Telemetry from C Code

Sometimes you need to read scheduler state from inside C (cgo callbacks, profiling). The runtime deliberately exposes very little of it directly; most scheduler state is private.

What is available:

  • runtime.Callers(skip, pc) fills pc with the return PCs of the calling goroutine's stack at the current point.
  • runtime/metrics (Go 1.16+) is a stable API for reading scheduler metrics.
  • /proc/self/status on Linux gives thread count.

What is not available:

  • The raw P/M/G state machine. The runtime does not export it.
  • goid (intentionally hidden).
  • The exact sched.midle size.

For low-level inspection, you can patch the runtime locally (it is just Go code). Don't ship patched runtimes; do use them for debugging.
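
As an example of the stable path listed above, runtime/metrics reads scheduler metrics by name:

package main

import (
    "fmt"
    "runtime/metrics"
)

func main() {
    samples := []metrics.Sample{
        {Name: "/sched/goroutines:goroutines"},
        {Name: "/sched/latencies:seconds"},
    }
    metrics.Read(samples)
    for _, s := range samples {
        switch s.Value.Kind() {
        case metrics.KindUint64:
            fmt.Println(s.Name, "=", s.Value.Uint64())
        case metrics.KindFloat64Histogram:
            h := s.Value.Float64Histogram()
            fmt.Println(s.Name, "histogram with", len(h.Buckets), "buckets")
        }
    }
}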


Summary

At professional level, syscall handling is no longer a mechanism — it is a piece of source code you can navigate. You know:

  • reentersyscall in runtime/proc.go is where the G/P/M state transition happens for syscalls.
  • exitsyscall and exitsyscallfast in the same file are the return paths, with the critical CAS from _Psyscall to _Pidle.
  • sysmon loops every 20 µs–10 ms, calling retake, which hands off Ps in _Psyscall for > 10 µs.
  • handoffp decides whether to start an M, run a safe-point function, or park the P.
  • newosproc is per-OS: clone(2) on Linux, pthread_create on macOS, CreateThread on Windows.
  • The netpoller (netpoll_epoll.go, netpoll_kqueue.go, netpoll_windows.go) is a separate path that does not involve entersyscall at all.
  • entersyscallblock is the pre-emptive variant for known-long calls.
  • cgocall is entersyscall + stack switch + exitsyscall, with M-state bookkeeping for "in cgo".
  • Locked Gs complicate the slow path: the M can only resume its locked G.

You can patch the runtime, write profilers that inspect syscall paths, and explain to anyone why a specific syscall took the path it took.

The specification level catalogues the invariants and runtime guarantees more formally, separating "what the language promises" from "what the runtime currently does".