Goroutines vs OS Threads — Middle Level¶
Table of Contents¶
- Introduction
- Numbers You Should Know
- The M:N Model in Detail
- How a Blocking Syscall Is Handed Off
- The Netpoller — Network I/O Without Threads
- Cgo and Thread Costs
- LockOSThread in Practice
- Tuning GOMAXPROCS
- Containers, cgroups, and GOMAXPROCS
- Goroutine Identity vs Thread Identity
- Signals and Threads
- Comparing Goroutines to Other Languages' Concurrency
- Diagnostics — Counting Threads and Goroutines
- Performance Gotchas
- Best Practices for Established Codebases
- Self-Assessment
- Summary
Introduction¶
At junior level you learned that goroutines are user-space, threads are kernel-space, and the runtime multiplexes one onto the other. At middle level you start applying that knowledge: choosing GOMAXPROCS correctly in containers, diagnosing thread-count anomalies, recognising cgo-induced M storms, and knowing when LockOSThread is the right hammer.
After this file you will:
- Reason about why a Go service uses N threads under load and predict that N.
- Know what happens to the thread pool when goroutines block in syscalls.
- Tune GOMAXPROCS for containers, NUMA boxes, and mixed workloads.
- Spot performance regressions caused by cgo, signals, or thread starvation.
- Use runtime/trace and pprof to see goroutine-vs-thread behaviour.
Numbers You Should Know¶
Order-of-magnitude figures for a modern x86-64 Linux machine, Go 1.22+:
| Operation | Goroutine | OS thread |
|---|---|---|
| Create + first-yield | ~200–1 000 ns | ~5 000–50 000 ns |
| Context switch (user-space) | ~100–300 ns | n/a |
| Context switch (kernel-mediated) | ~1 000 ns (cgo / syscall) | ~1 000–10 000 ns |
| Initial stack | 2 KB | 8 MB on Linux (mmaped, lazy-paged) |
| Hard upper bound (memory) | Millions | A few thousand to tens of thousands |
| Hard upper bound (cores) | n/a (concurrency, not parallelism) | One thread per core, fully parallel |
These are not promises. They depend on kernel version, CPU, and Go version. Measure on your hardware.
A practical heuristic:
- For pure CPU work: number of useful Ms ≈ number of cores ≈ GOMAXPROCS. Spawning more goroutines does not help.
- For blocking I/O: the number of Ms is lower than the number of in-flight requests (because the netpoller handles waiters); thread count rises only with cgo and non-network syscalls.
The M:N Model in Detail¶
Go is an M:N scheduler: M goroutines mapped onto N OS threads, where typically M ≫ N.
Three runtime entities (mnemonic: G = goroutine, M = machine = OS thread, P = processor context):
G (goroutine) ── unit of work, ~2 KB stack
P (processor) ── scheduler context; holds a runqueue, GC state, defer pool
M (machine) ── OS thread
Scheduling invariants:
- To execute Go code, an M must hold a P.
- The number of Ps is fixed at GOMAXPROCS.
- The number of Ms is dynamic; the runtime spawns and parks them as needed.
- Goroutines live in a P's local runqueue (LRQ) primarily; overflow goes to the global runqueue (GRQ).
- Idle Ps "steal" half the goroutines from a busy P's LRQ when their own is empty.
GRQ (global runqueue)
|
+---------|-----------+----------+
| | | |
P0 P1 P2 P3 <- count = GOMAXPROCS
| | | |
LRQ LRQ LRQ LRQ
| | | |
M-a M-b M-c M-d <- attached Ms
+ extras parked / in syscall
This separation matters because:
- Most scheduler operations are lock-free on the P-local queue (no global mutex).
- Work stealing happens when a P empties, balancing load with minimal coordination.
- Blocking syscalls hand the P off to a fresh M, keeping the queue active.
The full GMP design is covered in 10-scheduler-deep-dive — at middle level you should understand the layering well enough to read scheduler traces.
How a Blocking Syscall Is Handed Off¶
A blocking syscall is one where the kernel may not return for a while: read or write on a regular file or pipe, open of a slow device, flock. (Socket operations such as connect and accept can also block, but Go routes them through the netpoller — see the next section.) Linux non-network I/O does not go through the netpoller — it goes straight into the kernel.
When a goroutine calls a blocking syscall:
- The runtime's entersyscall() is called (instrumented by the compiler/runtime).
- The current M detaches from its P. The P is now idle with a non-empty runqueue.
- The M makes the syscall. It is now stuck in the kernel.
- If there is a G waiting and no idle M, the runtime spawns or unparks an M and attaches it to the orphaned P. That M starts running goroutines.
- When the syscall returns, the original M tries to re-attach to a P. If one is available, fine. If not, the M parks itself (returns to the M pool) and the goroutine is put on the global runqueue to be resumed later.
Step 0: M1 attached to P0. G42 calls read(fd).
Step 1: runtime.entersyscall() — M1 detaches from P0.
Step 2: M1 enters the kernel (blocked).
P0 is idle with a runqueue full of pending goroutines.
Step 3: sysmon notices P0 is idle. It pulls an M from the M-pool
(or spawns a new one via clone(2)). Call it M5.
Step 4: M5 attaches to P0. Resumes scheduling other goroutines.
Step 5: Eventually, kernel completes read. M1 returns from the syscall.
Step 6: runtime.exitsyscall() — M1 tries to grab a P.
If P available: M1 attaches, runs G42 to continue.
If no P: M1 parks itself; G42 goes to a runqueue.
The consequence: a Go program may have more Ms than GOMAXPROCS, because some Ms are blocked in syscalls. top will reflect that.
Threshold for handoff: 20 µs¶
The runtime does not detach the P immediately. For very short syscalls (< ~20 µs), the M simply completes them and carries on. For longer ones, sysmon (a background monitor thread that wakes at roughly 20 µs–10 ms intervals) sees the P has been sitting in _Psyscall and forcibly hands it off.
Why is this important at middle level?¶
Two consequences you will encounter:
- Thread count grows under heavy blocking I/O. A workload calling read on many slow files spawns Ms proportional to in-flight syscalls.
- The Go scheduler is not "free" if you abuse syscalls. Each handoff is more expensive than a goroutine context switch.
For network I/O, see the next section — that path is much cheaper.
The Netpoller — Network I/O Without Threads¶
Network reads, network writes, and timer waits go through the netpoller, a runtime component that uses non-blocking I/O plus epoll (Linux), kqueue (BSD/macOS), or IOCP (Windows).
When a goroutine calls conn.Read:
- The runtime sets the underlying fd non-blocking (once, on first use).
- The runtime calls read. If data is ready, it returns immediately.
- If not, the runtime registers the fd with the netpoller and parks the goroutine. No M is held by the parked G.
- The netpoller code — run from the scheduler on one of the Ms — calls epoll_wait periodically (or continuously, depending on activity).
- When epoll_wait says the fd is ready, the netpoller unparks the goroutine.
- The goroutine is placed on a runqueue. Some M picks it up and the read retries (it will succeed this time).
Effect: 50 000 idle WebSocket connections cost ~50 000 goroutines and ~one M. The M handling epoll_wait is the same M as any other — there is no dedicated netpoller thread (one of the Ms takes turns handling it, called from the scheduler).
// Pseudocode for what happens under the hood of conn.Read
func (conn *netConn) Read(p []byte) (int, error) {
for {
n, err := nonblockingRead(conn.fd, p)
if err == nil { return n, nil }
if !errors.Is(err, syscall.EAGAIN) { return n, err }
// EAGAIN: not ready. Park.
netpoll.waitFor(conn.fd, 'r') // park goroutine, return when fd ready
}
}
The blocking-style conn.Read you write in Go is the cheap-and-friendly API on top of an event loop. Best of both worlds.
What can use the netpoller?¶
- Sockets (TCP, UDP, Unix domain).
- Pipes (sometimes).
- Disk I/O? No. Linux epoll on regular files is broken — it always reports "ready" — so disk reads are not netpolled. They go through the blocking syscall path.
- File watchers (inotify) — depends on the implementation.
This is why a server reading from a 10 GB log file may spike thread count, but a server with 50 000 idle TCP connections does not.
Cgo and Thread Costs¶
A cgo call (C.some_function()) holds an M for its entire duration. Why:
- A C function may block, may store thread-local state, may interact with signal masks. The Go runtime cannot safely interrupt or move it.
- During the cgo call, the M is treated much like one blocked in a syscall: it is detached from its P (so other goroutines can run) but cannot be reused for other Go code.
- When the cgo call returns, the M tries to re-attach to a P; if none is available, the M parks.
The M-creation storm¶
If many goroutines simultaneously call into C and each call blocks for a while:
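A minimal sketch of the pattern (the C function slow_io and the 50 ms figure are illustrative assumptions, not a real library):
/*
// hypothetical C function that blocks for ~50 ms
void slow_io(void);
*/
import "C"

func cgoStorm() {
    // 100 goroutines enter C at roughly the same time.
    // Each call ties up one M for the full ~50 ms the C code blocks,
    // so the runtime must back them with ~100 OS threads.
    for i := 0; i < 100; i++ {
        go func() { C.slow_io() }()
    }
}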
For 50 ms, the runtime needs 100 Ms to keep 100 cgo calls in flight. If only GOMAXPROCS + a few Ms exist, the runtime calls clone(2) to create more. Thread count grows from ~10 to ~100 in milliseconds.
Recovery: after the calls return, Ms park. The runtime keeps a small pool of parked Ms (free Ms) to reuse but eventually destroys excess.
Cgo overhead per call¶
Even a tiny cgo call costs ~100–200 ns of overhead just for the M-state transition. For a hot loop, batch as much C work into a single cgo call as possible:
// Bad: one cgo call per element — ~1M M-state transitions.
for _, x := range xs {
    C.process_one(C.int(x))
}
// Good: one cgo call for the whole batch.
// (xs must be []int32 here so its elements have the same size as C.int.)
C.process_many((*C.int)(unsafe.Pointer(&xs[0])), C.int(len(xs)))
When cgo is unavoidable¶
A few mitigations:
- Pool the C resources (open files, contexts) so each Go goroutine does not have to create one.
- Use runtime.LockOSThread if the C library is thread-affine, to avoid migration costs.
- Profile with runtime/trace; cgo state changes show up as separate events.
LockOSThread in Practice¶
runtime.LockOSThread() pins the calling goroutine to the OS thread it is currently running on. A matching number of runtime.UnlockOSThread() calls releases the pin.
Use cases (in order of frequency):
- Thread-local C library state. OpenGL contexts, GnuTLS sessions, some database client libraries (older Oracle OCI), CUDA contexts.
- OS APIs that are thread-scoped. Linux setns (namespace switching), unshare, prctl(PR_SET_NAME), setpriority, sched_setaffinity.
- Signal handling on a specific thread. Rare but real.
- Profiling. runtime/pprof.SetGoroutineLabels can be combined with pinning to get a single thread of execution that is easier to read in perf.
Example: switching Linux namespaces¶
import (
    "runtime"

    "golang.org/x/sys/unix"
)

func enterNetworkNamespace(fd int) error {
    // setns affects only the calling thread, so pin the goroutine to the
    // thread you want to switch and keep it there.
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    return unix.Setns(fd, unix.CLONE_NEWNET)
}
If you forget LockOSThread, the goroutine may run on M1 when you call Setns (switching M1's namespace), then drift to M2 before the next syscall — which is in a different namespace. Programs that need namespace switching almost always pin.
Cost of LockOSThread¶
- The pinned goroutine occupies its M exclusively. The runtime cannot move it.
- If the pinned goroutine blocks, that M is unusable for other Go work.
- If too many goroutines are pinned, you can starve or even deadlock the scheduler: every M that holds a P is occupied by a pinned goroutine, and other runnable goroutines wait until the runtime can spawn or unpark extra Ms.
Use sparingly; one or two pinned goroutines per process is typical.
UnlockOSThread semantics¶
runtime.LockOSThread()
runtime.LockOSThread() // counter = 2
runtime.UnlockOSThread() // counter = 1, still pinned
runtime.UnlockOSThread() // counter = 0, unpinned
Match each Lock with an Unlock. If the goroutine exits while still pinned, the runtime may destroy the M (does so when it cannot safely reuse the thread).
Tuning GOMAXPROCS¶
Defaults are usually right, but you may need to tune:
| Situation | Recommended value |
|---|---|
| Bare metal, dedicated machine | Default (NumCPU). |
| Container with CPU limit, Go ≥ 1.25 on Linux | Default (the runtime reads the cgroup CPU limit). |
| Container with CPU limit, Go < 1.25 or non-Linux | Use go.uber.org/automaxprocs or set GOMAXPROCS explicitly. |
| Mixed CPU-bound + I/O-bound workload | Default; the netpoller and syscall handoff handle the I/O. |
| Latency-critical, very low concurrency | GOMAXPROCS=1 can reduce tail latency by avoiding cross-core cache effects. |
| Embarrassingly parallel CPU work | GOMAXPROCS=NumCPU. More does not help. |
| Wide NUMA box (≥ 32 cores) | Profile; sometimes a lower value reduces cross-socket cache traffic. |
Reading and setting¶
Set early — before any work spawns — to avoid runtime reshuffling.
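A minimal sketch of reading and setting it from code (the value 4 is just an example):
import (
    "fmt"
    "runtime"
)

func init() {
    // GOMAXPROCS(0) reports the current value without changing it.
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))

    // GOMAXPROCS(n) with n > 0 sets the value and returns the previous one.
    prev := runtime.GOMAXPROCS(4)
    fmt.Println("previous value:", prev)
}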
Environment¶
The GOMAXPROCS environment variable (for example GOMAXPROCS=8 ./server) is honoured by the runtime at startup and overrides the default.
Heuristic: never set above NumCPU¶
GOMAXPROCS > NumCPU means more Ps than cores. The extra Ps give the kernel more runnable threads than it has cores, so Ms are context-switched more than necessary. Almost always a regression.
Heuristic: set below NumCPU when sharing the box¶
A Go service co-tenanted on a 32-core machine with three other Go services: each at GOMAXPROCS=8 is sometimes better than each at GOMAXPROCS=32. The scheduler does not coordinate across processes; you must.
Containers, cgroups, and GOMAXPROCS¶
The most common production bug at middle level: a Go service in a Kubernetes pod with cpu: 500m, but runtime.GOMAXPROCS(0) returns 64 (the node's CPU count).
Go versions and cgroup awareness¶
| Version | Behaviour |
|---|---|
| ≤ 1.24 | Ignores the cgroup CPU quota. The default GOMAXPROCS is the number of CPUs the process may run on — usually the node's CPU count. |
| ≥ 1.25 | Reads the cgroup CPU limit on Linux (modern Kubernetes nodes use cgroup v2) and lowers the default GOMAXPROCS accordingly; fractional limits such as cpu: 2500m are rounded up. |
If you cannot upgrade Go, use the Uber library:
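Typical usage is a blank import in package main; the adjustment happens in the package's init:
import _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the cgroup CPU quota at startup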
It reads the cgroup at init time and sets GOMAXPROCS accordingly. Print a log line if you want to confirm.
Symptoms of mis-tuning¶
- Latency spikes during bursts (scheduler thrash).
- High CPU usage with low throughput.
- GC pause variability.
- Inconsistent throughput across pods.
Always log runtime.GOMAXPROCS(0) at startup. It is a one-line change that catches container-bound surprises.
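The one line in question, using the standard log and runtime packages:
log.Printf("GOMAXPROCS=%d NumCPU=%d", runtime.GOMAXPROCS(0), runtime.NumCPU())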
Goroutine Identity vs Thread Identity¶
Threads have OS identity (TID, visible in /proc/<pid>/task/). Goroutines do not.
The Go runtime does maintain an internal goid, but does not expose it via a public API. Reasons:
- Goroutines were designed not to be addressable from outside. Cancellation is cooperative via context.Context. Identification would enable broken patterns ("send a kill to goroutine 47").
- Goroutine pools recycle G structs; a goid is not stable across runs.
How to identify a goroutine in your own program¶
For logging, use context.Context to carry a request ID. Do not chase goid. There are unofficial hacks (runtime.Stack parsing) but they are fragile and slow.
// Good
func handle(ctx context.Context, req Request) {
log.Printf("[%s] handling", req.ID)
}
// Bad (don't do this)
func goid() uint64 { /* parse runtime.Stack output */ }
Thread identity¶
Sometimes you do need thread identity (for gettid() to call a kernel API). Use syscall.Gettid() after runtime.LockOSThread:
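A minimal sketch (Linux-only; the helper name withThreadID is just for illustration):
import (
    "runtime"
    "syscall"
)

func withThreadID(f func(tid int)) {
    // Pin first: without LockOSThread the goroutine could migrate to a
    // different thread as soon as Gettid returns.
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    f(syscall.Gettid())
}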
Signals and Threads¶
POSIX signals can be delivered to any thread of a multi-threaded process. The Go runtime installs handlers that route signals into Go via os/signal.Notify. From the application's perspective, signals look like channel sends.
Two flavours of signal¶
| Type | Examples | Behaviour |
|---|---|---|
| Synchronous | SIGSEGV, SIGFPE, SIGBUS | Delivered to the thread that caused them. The Go runtime turns these into runtime panics (or crashes). |
| Asynchronous | SIGINT, SIGTERM, SIGHUP | Delivered to "some" thread of the process. Go routes them through signal.Notify. |
You can normally ignore the per-thread aspect and just use signal.Notify(ch, sig).
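A minimal sketch of the usual pattern:
import (
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Use a buffered channel; signal delivery does not block.
    ch := make(chan os.Signal, 1)
    signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
    <-ch // whichever thread received the signal, it surfaces here as a value
    // ... graceful shutdown ...
}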
When the per-thread aspect matters¶
If you call a C library that installs its own signal handlers, those handlers run on whichever thread received the signal. If they depend on thread-local state, you must pin or arrange handler delivery on a specific thread. This is exotic and rare.
SIGURG and async preemption¶
Go 1.14+ uses SIGURG internally for asynchronous goroutine preemption. If your C library uses SIGURG, it conflicts. Workarounds:
- Change the C library to use a different signal.
- Run with GODEBUG=asyncpreemptoff=1 (disables async preemption; not recommended for production).
Comparing Goroutines to Other Languages' Concurrency¶
| Model | Language | M:N? | Cost | Notes |
|---|---|---|---|---|
| OS threads | C, C++, Java pre-21 | 1:1 | High | Familiar, kernel-managed. |
| OS threads + futures | Java, C# | 1:1 | High | Thread pool reduces creation cost. |
| Coroutines (cooperative) | Python asyncio, JS | N:1 typically | Very low | Single-thread; CPU-bound work blocks all. |
| Stackful coroutines | Lua, Boost.Coroutine | N:1 | Very low | No preemption. |
| Goroutines | Go | M:N | Very low | Preemptive (since 1.14), runtime-scheduled. |
| Virtual threads | Java 21+ (Project Loom) | M:N | Very low | Like goroutines: cheap, parked on blocking. |
| Green threads | Erlang processes | M:N | Very low | Strict isolation: no shared memory between processes. |
| Async/await | Rust, JS, Python, C# | varies | low | Compile-time transformation to state machines. |
Goroutines are closest to Java virtual threads (Loom) and Erlang processes, with the caveat that Go shares memory (Erlang does not).
When to prefer goroutines¶
- High concurrency over high parallelism.
- Blocking-style code is desirable.
- Network-heavy workloads.
When to prefer something else¶
- True hard real-time → low-level C/Rust with priority threads.
- Strict isolation (one bad task does not kill peers) → Erlang/Elixir.
- Web frontend → JS async/await.
Diagnostics — Counting Threads and Goroutines¶
Goroutine count¶
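The quickest check is the live count — a one-line sketch:
fmt.Println("goroutines:", runtime.NumGoroutine())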
For a stack trace of every goroutine:
import (
    "os"
    "runtime"
)

buf := make([]byte, 1<<20)
n := runtime.Stack(buf, true) // true = dump every goroutine, not just the caller
os.Stderr.Write(buf[:n])
Production: use net/http/pprof:
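A minimal sketch (port 6060 is just the conventional choice):
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Then: curl http://localhost:6060/debug/pprof/goroutine?debug=2
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}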
Thread count (Linux)¶
Read the Threads: line from /proc/&lt;pid&gt;/status, or count the entries under /proc/&lt;pid&gt;/task/ (ls /proc/&lt;pid&gt;/task | wc -l). For per-thread CPU, use top -H -p &lt;pid&gt; or ps -T -p &lt;pid&gt;.
Thread count (Go runtime)¶
There is no dedicated function in package runtime. Options:
- Read /proc/self/status from inside the program on Linux.
- Use runtime.ReadMemStats and runtime/debug for indirect signs.
- runtime/metrics exposes /sched/threads:threads since Go 1.21. The same API reads any metric — for example, the scheduler's view of GOMAXPROCS:
import (
    "fmt"
    "runtime/metrics"
)

samples := []metrics.Sample{{Name: "/sched/gomaxprocs:threads"}}
metrics.Read(samples)
fmt.Println("gomaxprocs:", samples[0].Value.Uint64())
Future-proof your monitoring by going through runtime/metrics.
Scheduler trace¶
Run with GODEBUG=schedtrace=1000; the runtime prints one line per second to stderr:
SCHED 1000ms: gomaxprocs=4 idleprocs=2 threads=11 spinningthreads=0 needspinning=0 idlethreads=5 runqueue=0 [0 0 0 0]
- threads=11: total Ms (not just runnable ones).
- idlethreads=5: parked Ms in the pool.
- runqueue=0 [0 0 0 0]: global runqueue length, then each P's local queue length.
Add scheddetail=1 for per-G/P/M detail. Useful but very chatty.
Performance Gotchas¶
Gotcha 1: spurious thread growth from os/exec¶
exec.Cmd.Run (package os/exec) uses fork+exec. On Linux the runtime has special code to perform the fork without corrupting runtime state, but there is a brief window between fork and exec in which the child exists alongside all of the parent's threads.
Symptom: thread count visible in ps looks weird during exec spikes.
Usually harmless. If you exec aggressively, profile.
Gotcha 2: cgo storms¶
Covered above. Symptom: thread count climbs to hundreds during steady-state.
Mitigation: batch C calls; pin to a single goroutine that processes a queue.
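A sketch of the queue approach (the C function do_work and the queue size are illustrative assumptions):
/*
void do_work(int job);
*/
import "C"

// cgoJobs is drained by exactly one goroutine, so at most one M is ever
// tied up inside this C library no matter how many requests are in flight.
var cgoJobs = make(chan int, 128)

func init() {
    go func() {
        for job := range cgoJobs {
            C.do_work(C.int(job))
        }
    }()
}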
Gotcha 3: file I/O parks Ms¶
Non-network blocking I/O does not use the netpoller. Reading a 10 GB file in 100 goroutines spawns ~100 Ms.
Mitigation: bound the parallelism of file I/O via a semaphore. Disk parallelism is rarely > 4–8 useful anyway.
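A minimal sketch using a buffered channel as the semaphore (the limit of 8 is illustrative):
import (
    "os"
    "sync"
)

func readAll(paths []string) {
    sem := make(chan struct{}, 8) // at most 8 goroutines in a blocking read at once
    var wg sync.WaitGroup
    for _, p := range paths {
        wg.Add(1)
        go func(p string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it
            if _, err := os.ReadFile(p); err != nil {
                // handle the error in real code
            }
        }(p)
    }
    wg.Wait()
}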
Gotcha 4: GC writes can stall a goroutine¶
The Go GC is concurrent but has stop-the-world (STW) phases of ~100 µs–1 ms. During STW, every goroutine pauses. This is unrelated to threading but often confused with scheduler issues.
Gotcha 5: LockOSThread causes scheduler imbalance¶
A pinned goroutine that does little work still occupies an M. With GOMAXPROCS=4 and 4 long-pinned goroutines, the runtime has no Ms left for other work.
Mitigation: minimise pinned goroutines; offload work to non-pinned ones via channels.
Gotcha 6: runtime.Gosched is rarely useful¶
runtime.Gosched voluntarily yields the processor — a leftover from the pre-1.14, cooperative-preemption era. With async preemption, the scheduler interrupts long-running goroutines when it needs to. Calling Gosched in tight loops just adds overhead.
Gotcha 7: misreading top¶
top shows threads, not goroutines. A Go server with Threads: 12 is normal regardless of goroutine count.
Best Practices for Established Codebases¶
- Log GOMAXPROCS at startup. One line. Catches all cgroup misconfigurations.
- Run cgo-heavy code in a bounded pool — never "1 cgo call per request" without bounding.
- Pair every LockOSThread with a documented reason in a source comment.
- Track thread count as a metric (/sched/threads:threads since Go 1.21). Alarm if it exceeds a sane threshold.
- Avoid runtime.Gosched in production code. Tests sometimes need it; production rarely.
- Use context.Context for cancellation across goroutines. Do not try to cancel a goroutine via signals or other thread-level mechanisms.
- Never assume thread identity is stable unless you have called LockOSThread.
- Prefer network I/O, which goes through the netpoller, over blocking non-network syscalls when you have a choice.
Self-Assessment¶
- I can describe the M:N model and the role of G, M, and P.
- I know how the runtime handles a blocking syscall (P handoff).
- I know how the netpoller makes network I/O cheap.
- I can predict thread count for a typical Go server and explain spikes.
- I know when to use LockOSThread and when not to.
- I can tune GOMAXPROCS in a container (or know that Go 1.25+ does it automatically on Linux).
- I know that cgo calls hold an M and how that can lead to M-creation storms.
- I know that signal handlers run on "some" thread and how Go routes them via signal.Notify.
- I can read GODEBUG=schedtrace=1000 output.
- I prefer context.Context for cancellation over any thread-level mechanism.
Summary¶
The middle-level view of goroutines vs threads is about applying the layered model in real code:
- The Go scheduler is an M:N scheduler with three abstractions (G, M, P). Most operations are P-local and lock-free.
- Blocking syscalls trigger a P handoff: the M stays blocked in the kernel, a fresh M takes over the P, and goroutine scheduling continues. Thread count may briefly exceed GOMAXPROCS.
- Network I/O (and timers) goes through the netpoller, which uses non-blocking syscalls + epoll/kqueue. 50 000 idle connections cost zero extra threads.
- Cgo is the main source of M-creation storms; bound your cgo concurrency.
- LockOSThread is the explicit pin needed for OS- or library-thread-local state. Use sparingly.
- GOMAXPROCS in containers: trust Go 1.25+ on Linux; otherwise use automaxprocs or set it yourself.
- Goroutines have no public identity. Use context.Context for tracing and cancellation.
The senior level adds architectural decisions: choosing between a Go service, a thread-pool service, and an event-loop service; cross-language interop patterns; and surviving thread-level failures (cgo segfaults, signal masks, fork/exec) in long-running production systems.