Goroutines vs OS Threads — Middle Level¶
Table of Contents¶
- Introduction
- Numbers You Should Know
- The M:N Model in Detail
- How a Blocking Syscall Is Handed Off
- The Netpoller — Network I/O Without Threads
- Cgo and Thread Costs
- LockOSThread in Practice
- Tuning GOMAXPROCS
- Containers, cgroups, and GOMAXPROCS
- Goroutine Identity vs Thread Identity
- Signals and Threads
- Comparing Goroutines to Other Languages' Concurrency
- Diagnostics — Counting Threads and Goroutines
- Performance Gotchas
- Best Practices for Established Codebases
- Self-Assessment
- Summary
Introduction¶
At junior level you learned that goroutines are user-space, threads are kernel-space, and the runtime multiplexes one onto the other. At middle level you start applying that knowledge: choosing GOMAXPROCS correctly in containers, diagnosing thread-count anomalies, recognising cgo-induced M storms, and knowing when LockOSThread is the right hammer.
After this file you will:
- Reason about why a Go service uses N threads under load and predict that N.
- Know what happens to the thread pool when goroutines block in syscalls.
- Tune GOMAXPROCS for containers, NUMA boxes, and mixed workloads.
- Spot performance regressions caused by cgo, signals, or thread starvation.
- Use runtime/trace and pprof to see goroutine-vs-thread behaviour.
Numbers You Should Know¶
Order-of-magnitude figures for a modern x86-64 Linux machine, Go 1.22+:
| Operation | Goroutine | OS thread |
|---|---|---|
| Create + first-yield | ~200–1 000 ns | ~5 000–50 000 ns |
| Context switch (user-space) | ~100–300 ns | n/a |
| Context switch (kernel-mediated) | ~1 000 ns (cgo / syscall) | ~1 000–10 000 ns |
| Initial stack | 2 KB | 8 MB on Linux (mmaped, lazy-paged) |
| Hard upper bound (memory) | Millions | A few thousand to tens of thousands |
| Hard upper bound (cores) | n/a (concurrency, not parallelism) | One thread per core, fully parallel |
These are not promises. They depend on kernel version, CPU, and Go version. Measure on your hardware.
A practical heuristic:
- For pure CPU work: number of useful Ms ≈ number of cores ≈ GOMAXPROCS. Spawning more goroutines does not help.
- For blocking I/O: the number of Ms is lower than the number of in-flight requests (because the netpoller handles waiters); thread count rises only with cgo and non-network syscalls.
The M:N Model in Detail¶
Go is an M:N scheduler: M goroutines mapped onto N OS threads, where typically M ≫ N.
Three runtime entities (mnemonic: G = goroutine, M = machine = OS thread, P = processor context):
G (goroutine) ── unit of work, ~2 KB stack
P (processor) ── scheduler context; holds a runqueue, GC state, defer pool
M (machine) ── OS thread
Scheduling invariants:
- To execute Go code, an M must hold a P.
- The number of Ps is fixed at GOMAXPROCS.
- The number of Ms is dynamic; the runtime spawns and parks them as needed.
- Goroutines live in a P's local runqueue (LRQ) primarily; overflow goes to the global runqueue (GRQ).
- Idle Ps "steal" half the goroutines from a busy P's LRQ when their own is empty.
GRQ (global runqueue)
|
+---------|-----------+----------+
| | | |
P0 P1 P2 P3 <- count = GOMAXPROCS
| | | |
LRQ LRQ LRQ LRQ
| | | |
M-a M-b M-c M-d <- attached Ms
+ extras parked / in syscall
This separation matters because:
- Most scheduler operations are lock-free on the P-local queue (no global mutex).
- Work stealing happens when a P empties, balancing load with minimal coordination.
- Blocking syscalls hand the P off to a fresh M, keeping the queue active.
The full GMP design is covered in 10-scheduler-deep-dive — at middle level you should understand the layering well enough to read scheduler traces.
How a Blocking Syscall Is Handed Off¶
A blocking syscall is one where the kernel may not return for a while: read or write on a regular file or pipe, open of a slow device, flock. (Socket operations such as connect and accept can also block, but Go routes them through the netpoller — see the next section.) Linux non-network I/O does not go through the netpoller — it goes straight into the kernel.
When a goroutine calls a blocking syscall:
- The runtime's entersyscall() is called (instrumented by the compiler/runtime).
- The current M detaches from its P. The P is now idle with a non-empty runqueue.
- The M makes the syscall. It is now stuck in the kernel.
- If there is a G waiting and no idle M, the runtime spawns or unparks an M and attaches it to the orphaned P. That M starts running goroutines.
- When the syscall returns, the original M tries to re-attach to a P. If one is available, fine. If not, the M parks itself (returns to the M pool) and the goroutine is put on the global runqueue to be resumed later.
Step 0: M1 attached to P0. G42 calls read(fd).
Step 1: runtime.entersyscall() — M1 detaches from P0.
Step 2: M1 enters the kernel (blocked).
P0 is idle with a runqueue full of pending goroutines.
Step 3: sysmon notices P0 is idle. It pulls an M from the M-pool
(or spawns a new one via clone(2)). Call it M5.
Step 4: M5 attaches to P0. Resumes scheduling other goroutines.
Step 5: Eventually, kernel completes read. M1 returns from the syscall.
Step 6: runtime.exitsyscall() — M1 tries to grab a P.
If P available: M1 attaches, runs G42 to continue.
If no P: M1 parks itself; G42 goes to a runqueue.
The consequence: a Go program may have more Ms than GOMAXPROCS, because some Ms are blocked in syscalls. top will reflect that.
Threshold for handoff: 20 µs¶
The runtime does not detach the P immediately. For very short syscalls (< ~20 µs), the M simply completes them and carries on. For longer ones, sysmon (a background monitor thread that wakes at roughly 20 µs–10 ms intervals) sees the P has been sitting in _Psyscall and forcibly hands it off.
Why is this important at middle level?¶
Two consequences you will encounter:
- Thread count grows under heavy blocking I/O. A workload calling read on many slow files spawns Ms proportional to in-flight syscalls.
- The Go scheduler is not "free" if you abuse syscalls. Each handoff is more expensive than a goroutine context switch.
For network I/O, see the next section — that path is much cheaper.
The Netpoller — Network I/O Without Threads¶
Network reads, network writes, and timer waits go through the netpoller, a runtime component that uses non-blocking I/O plus epoll (Linux), kqueue (BSD/macOS), or IOCP (Windows).
When a goroutine calls conn.Read:
- The runtime sets the underlying fd non-blocking (once, on first use).
- The runtime calls read. If data is ready, it returns immediately.
- If not, the runtime registers the fd with the netpoller and parks the goroutine. No M is held by the parked G.
- The netpoller code — run from the scheduler on one of the Ms — calls epoll_wait periodically (or continuously, depending on activity).
- When epoll_wait says the fd is ready, the netpoller unparks the goroutine.
- The goroutine is placed on a runqueue. Some M picks it up and the read retries (it will succeed this time).
Effect: 50 000 idle WebSocket connections cost ~50 000 goroutines and ~one M. The M handling epoll_wait is the same M as any other — there is no dedicated netpoller thread (one of the Ms takes turns handling it, called from the scheduler).
// Pseudocode for what happens under the hood of conn.Read
func (conn *netConn) Read(p []byte) (int, error) {
for {
n, err := nonblockingRead(conn.fd, p)
if err == nil { return n, nil }
if !errors.Is(err, syscall.EAGAIN) { return n, err }
// EAGAIN: not ready. Park.
netpoll.waitFor(conn.fd, 'r') // park goroutine, return when fd ready
}
}
The blocking-style conn.Read you write in Go is the cheap-and-friendly API on top of an event loop. Best of both worlds.
What can use the netpoller?¶
- Sockets (TCP, UDP, Unix domain).
- Pipes (sometimes).
- Disk I/O? No. Linux epoll on regular files is broken — it always reports "ready" — so disk reads are not netpolled. They go through the blocking syscall path.
- File watchers (inotify) — depends on the implementation.
This is why a server reading from a 10 GB log file may spike thread count, but a server with 50 000 idle TCP connections does not.
Cgo and Thread Costs¶
A cgo call (C.some_function()) holds an M for its entire duration. Why:
- A C function may block, may store thread-local state, may interact with signal masks. The Go runtime cannot safely interrupt or move it.
- During the cgo call, the M is treated much like one blocked in a syscall: it is detached from its P (so other goroutines can run) but cannot be reused for other Go code.
- When the cgo call returns, the M tries to re-attach to a P; if none is available, the M parks.
The M-creation storm¶
If many goroutines simultaneously call into C and each call blocks for a while:
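A minimal sketch of the pattern (the C function slow_io and the 50 ms figure are illustrative assumptions, not a real library):
/*
// hypothetical C function that blocks for ~50 ms
void slow_io(void);
*/
import "C"

func cgoStorm() {
    // 100 goroutines enter C at roughly the same time.
    // Each call ties up one M for the full ~50 ms the C code blocks,
    // so the runtime must back them with ~100 OS threads.
    for i := 0; i < 100; i++ {
        go func() { C.slow_io() }()
    }
}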
For 50 ms, the runtime needs 100 Ms to keep 100 cgo calls in flight. If only GOMAXPROCS + a few Ms exist, the runtime calls clone(2) to create more. Thread count grows from ~10 to ~100 in milliseconds.
Recovery: after the calls return, Ms park. The runtime keeps a small pool of parked Ms (free Ms) to reuse but eventually destroys excess.
Cgo overhead per call¶
Even a tiny cgo call costs ~100–200 ns of overhead just for the M-state transition. For a hot loop, batch as much C work into a single cgo call as possible:
// Bad: one cgo call per element — ~1M M-state transitions.
for _, x := range xs {
    C.process_one(C.int(x))
}
// Good: one cgo call for the whole batch.
// (xs must be []int32 here so its elements have the same size as C.int.)
C.process_many((*C.int)(unsafe.Pointer(&xs[0])), C.int(len(xs)))
When cgo is unavoidable¶
A few mitigations:
- Pool the C resources (open files, contexts) so each Go goroutine does not have to create one.
- Use runtime.LockOSThread if the C library is thread-affine, to avoid migration costs.
- Profile with runtime/trace; cgo state changes show up as separate events.
LockOSThread in Practice¶
runtime.LockOSThread() pins the calling goroutine to the OS thread it is currently running on. A matching number of runtime.UnlockOSThread() calls releases the pin.
Use cases (in order of frequency):
- Thread-local C library state. OpenGL contexts, GnuTLS sessions, some database client libraries (older Oracle OCI), CUDA contexts.
- OS APIs that are thread-scoped. Linux setns (namespace switching), unshare, prctl(PR_SET_NAME), setpriority, sched_setaffinity.
- Signal handling on a specific thread. Rare but real.
- Profiling. runtime/pprof.SetGoroutineLabels can be combined with pinning to get a single thread of execution that is easier to read in perf.
Example: switching Linux namespaces¶
import (
    "runtime"

    "golang.org/x/sys/unix"
)

func enterNetworkNamespace(fd int) error {
    // setns affects only the calling thread, so pin the goroutine to the
    // thread you want to switch and keep it there.
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    return unix.Setns(fd, unix.CLONE_NEWNET)
}
If you forget LockOSThread, the goroutine may run on M1 when you call Setns (switching M1's namespace), then drift to M2 before the next syscall — which is in a different namespace. Programs that need namespace switching almost always pin.
Cost of LockOSThread¶
- The pinned goroutine occupies its M exclusively. The runtime cannot move it.
- If the pinned goroutine blocks, that M is unusable for other Go work.
- If too many goroutines are pinned, you can starve or even deadlock the scheduler: every M that holds a P is occupied by a pinned goroutine, and other runnable goroutines wait until the runtime can spawn or unpark extra Ms.
Use sparingly; one or two pinned goroutines per process is typical.
UnlockOSThread semantics¶
runtime.LockOSThread()
runtime.LockOSThread() // counter = 2
runtime.UnlockOSThread() // counter = 1, still pinned
runtime.UnlockOSThread() // counter = 0, unpinned
Match each Lock with an Unlock. If the goroutine exits while still pinned, the runtime may destroy the M (does so when it cannot safely reuse the thread).
Tuning GOMAXPROCS¶
Defaults are usually right, but you may need to tune:
| Situation | Recommended value |
|---|---|
| Bare metal, dedicated machine | Default (NumCPU). |
| Container with CPU limit, Go ≥ 1.25 on Linux | Default (the runtime reads the cgroup CPU limit). |
| Container with CPU limit, Go < 1.25 or non-Linux | Use go.uber.org/automaxprocs or set GOMAXPROCS explicitly. |
| Mixed CPU-bound + I/O-bound workload | Default; the netpoller and syscall handoff handle the I/O. |
| Latency-critical, very low concurrency | GOMAXPROCS=1 can reduce tail latency by avoiding cross-core cache effects. |
| Embarrassingly parallel CPU work | GOMAXPROCS=NumCPU. More does not help. |
| Wide NUMA box (≥ 32 cores) | Profile; sometimes a lower value reduces cross-socket cache traffic. |
Reading and setting¶
Set early — before any work spawns — to avoid runtime reshuffling.
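A minimal sketch of reading and setting it from code (the value 4 is just an example):
import (
    "fmt"
    "runtime"
)

func init() {
    // GOMAXPROCS(0) reports the current value without changing it.
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))

    // GOMAXPROCS(n) with n > 0 sets the value and returns the previous one.
    prev := runtime.GOMAXPROCS(4)
    fmt.Println("previous value:", prev)
}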
Environment¶
The GOMAXPROCS environment variable (for example GOMAXPROCS=8 ./server) is honoured by the runtime at startup and overrides the default.
Heuristic: never set above NumCPU¶
GOMAXPROCS > NumCPU means more Ps than cores. The extra Ps give the kernel more runnable threads than it has cores, so Ms are context-switched more than necessary. Almost always a regression.
Heuristic: set below NumCPU when sharing the box¶
A Go service co-tenanted on a 32-core machine with three other Go services: each at GOMAXPROCS=8 is sometimes better than each at GOMAXPROCS=32. The scheduler does not coordinate across processes; you must.
Containers, cgroups, and GOMAXPROCS¶
The most common production bug at middle level: a Go service in a Kubernetes pod with cpu: 500m, but runtime.GOMAXPROCS(0) returns 64 (the node's CPU count).
Go versions and cgroup awareness¶
| Version | Behaviour |
|---|---|
| ≤ 1.24 | Ignores the cgroup CPU quota. The default GOMAXPROCS is the number of CPUs the process may run on — usually the node's CPU count. |
| ≥ 1.25 | Reads the cgroup CPU limit on Linux (modern Kubernetes nodes use cgroup v2) and lowers the default GOMAXPROCS accordingly; fractional limits such as cpu: 2500m are rounded up. |
If you cannot upgrade Go, use the Uber library:
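Typical usage is a blank import in package main; the adjustment happens in the package's init:
import _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the cgroup CPU quota at startup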
It reads the cgroup at init time and sets GOMAXPROCS accordingly. Print a log line if you want to confirm.
Symptoms of mis-tuning¶
- Latency spikes during bursts (scheduler thrash).
- High CPU usage with low throughput.
- GC pause variability.
- Inconsistent throughput across pods.
Always log runtime.GOMAXPROCS(0) at startup. It is a one-line change that catches container-bound surprises.
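The one line in question, using the standard log and runtime packages:
log.Printf("GOMAXPROCS=%d NumCPU=%d", runtime.GOMAXPROCS(0), runtime.NumCPU())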
Goroutine Identity vs Thread Identity¶
Threads have OS identity (TID, visible in /proc/<pid>/task/). Goroutines do not.
The Go runtime does maintain an internal goid, but does not expose it via a public API. Reasons:
- Goroutines were designed not to be addressable from outside. Cancellation is cooperative via context.Context. Identification would enable broken patterns ("send a kill to goroutine 47").
- Goroutine pools recycle G structs; a goid is not stable across runs.
How to identify a goroutine in your own program¶
For logging, use context.Context to carry a request ID. Do not chase goid. There are unofficial hacks (runtime.Stack parsing) but they are fragile and slow.
// Good
func handle(ctx context.Context, req Request) {
log.Printf("[%s] handling", req.ID)
}
// Bad (don't do this)
func goid() uint64 { /* parse runtime.Stack output */ }
Thread identity¶
Sometimes you do need thread identity (for gettid() to call a kernel API). Use syscall.Gettid() after runtime.LockOSThread:
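A minimal sketch (Linux-only; the helper name withThreadID is just for illustration):
import (
    "runtime"
    "syscall"
)

func withThreadID(f func(tid int)) {
    // Pin first: without LockOSThread the goroutine could migrate to a
    // different thread as soon as Gettid returns.
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    f(syscall.Gettid())
}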
Signals and Threads¶
POSIX signals can be delivered to any thread of a multi-threaded process. The Go runtime installs handlers that route signals into Go via os/signal.Notify. From the application's perspective, signals look like channel sends.
Two flavours of signal¶
| Type | Examples | Behaviour |
|---|---|---|
| Synchronous | SIGSEGV, SIGFPE, SIGBUS | Delivered to the thread that caused them. The Go runtime turns these into runtime panics (or crashes). |
| Asynchronous | SIGINT, SIGTERM, SIGHUP | Delivered to "some" thread of the process. Go routes them through signal.Notify. |
You can normally ignore the per-thread aspect and just use signal.Notify(ch, sig).
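A minimal sketch of the usual pattern:
import (
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Use a buffered channel; signal delivery does not block.
    ch := make(chan os.Signal, 1)
    signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
    <-ch // whichever thread received the signal, it surfaces here as a value
    // ... graceful shutdown ...
}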
When the per-thread aspect matters¶
If you call a C library that installs its own signal handlers, those handlers run on whichever thread received the signal. If they depend on thread-local state, you must pin or arrange handler delivery on a specific thread. This is exotic and rare.
SIGURG and async preemption¶
Go 1.14+ uses SIGURG internally for asynchronous goroutine preemption. If your C library uses SIGURG, it conflicts. Workarounds:
- Change the C library to use a different signal.
- Run with GODEBUG=asyncpreemptoff=1 (disables async preemption; not recommended for production).
Comparing Goroutines to Other Languages' Concurrency¶
| Model | Language | M:N? | Cost | Notes |
|---|---|---|---|---|
| OS threads | C, C++, Java pre-21 | 1:1 | High | Familiar, kernel-managed. |
| OS threads + futures | Java, C# | 1:1 | High | Thread pool reduces creation cost. |
| Coroutines (cooperative) | Python asyncio, JS | N:1 typically | Very low | Single-thread; CPU-bound work blocks all. |
| Stackful coroutines | Lua, Boost.Coroutine | N:1 | Very low | No preemption. |
| Goroutines | Go | M:N | Very low | Preemptive (since 1.14), runtime-scheduled. |
| Virtual threads | Java 21+ (Project Loom) | M:N | Very low | Like goroutines: cheap, parked on blocking. |
| Green threads | Erlang processes | M:N | Very low | Strict isolation: no shared memory between processes. |
| Async/await | Rust, JS, Python, C# | varies | low | Compile-time transformation to state machines. |
Goroutines are closest to Java virtual threads (Loom) and Erlang processes, with the caveat that Go shares memory (Erlang does not).
When to prefer goroutines¶
- High concurrency over high parallelism.
- Blocking-style code is desirable.
- Network-heavy workloads.
When to prefer something else¶
- True hard real-time → low-level C/Rust with priority threads.
- Strict isolation (one bad task does not kill peers) → Erlang/Elixir.
- Web frontend → JS async/await.
Diagnostics — Counting Threads and Goroutines¶
Goroutine count¶
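The quickest check is the live count — a one-line sketch:
fmt.Println("goroutines:", runtime.NumGoroutine())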
For a stack trace of every goroutine:
import (
    "os"
    "runtime"
)

buf := make([]byte, 1<<20)
n := runtime.Stack(buf, true) // true = dump every goroutine, not just the caller
os.Stderr.Write(buf[:n])
Production: use net/http/pprof:
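A minimal sketch (port 6060 is just the conventional choice):
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Then: curl http://localhost:6060/debug/pprof/goroutine?debug=2
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}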
Thread count (Linux)¶
Read the Threads: line from /proc/&lt;pid&gt;/status, or count the entries under /proc/&lt;pid&gt;/task/ (ls /proc/&lt;pid&gt;/task | wc -l). For per-thread CPU, use top -H -p &lt;pid&gt; or ps -T -p &lt;pid&gt;.
Thread count (Go runtime)¶
There is no dedicated function in package runtime. Options:
- Read /proc/self/status from inside the program on Linux.
- Use runtime.ReadMemStats and runtime/debug for indirect signs.
- runtime/metrics exposes /sched/threads:threads since Go 1.21. The same API reads any metric — for example, the scheduler's view of GOMAXPROCS:
import (
    "fmt"
    "runtime/metrics"
)

samples := []metrics.Sample{{Name: "/sched/gomaxprocs:threads"}}
metrics.Read(samples)
fmt.Println("gomaxprocs:", samples[0].Value.Uint64())
Future-proof your monitoring by going through runtime/metrics.
Scheduler trace¶
Run with GODEBUG=schedtrace=1000; the runtime prints one line per second to stderr:
SCHED 1000ms: gomaxprocs=4 idleprocs=2 threads=11 spinningthreads=0 needspinning=0 idlethreads=5 runqueue=0 [0 0 0 0]
- threads=11: total Ms (not just runnable ones).
- idlethreads=5: parked Ms in the pool.
- runqueue=0 [0 0 0 0]: global runqueue length, then each P's local queue length.
Add scheddetail=1 for per-G/P/M detail. Useful but very chatty.
Performance Gotchas¶
Gotcha 1: spurious thread growth from os/exec¶
exec.Cmd.Run (package os/exec) uses fork+exec. On Linux the runtime has special code to perform the fork without corrupting runtime state, but there is a brief window between fork and exec in which the child exists alongside all of the parent's threads.
Symptom: thread count visible in ps looks weird during exec spikes.
Usually harmless. If you exec aggressively, profile.
Gotcha 2: cgo storms¶
Covered above. Symptom: thread count climbs to hundreds during steady-state.
Mitigation: batch C calls; pin to a single goroutine that processes a queue.
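A sketch of the queue approach (the C function do_work and the queue size are illustrative assumptions):
/*
void do_work(int job);
*/
import "C"

// cgoJobs is drained by exactly one goroutine, so at most one M is ever
// tied up inside this C library no matter how many requests are in flight.
var cgoJobs = make(chan int, 128)

func init() {
    go func() {
        for job := range cgoJobs {
            C.do_work(C.int(job))
        }
    }()
}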
Gotcha 3: file I/O parks Ms¶
Non-network blocking I/O does not use the netpoller. Reading a 10 GB file in 100 goroutines spawns ~100 Ms.
Mitigation: bound the parallelism of file I/O via a semaphore. Disk parallelism is rarely > 4–8 useful anyway.
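A minimal sketch using a buffered channel as the semaphore (the limit of 8 is illustrative):
import (
    "os"
    "sync"
)

func readAll(paths []string) {
    sem := make(chan struct{}, 8) // at most 8 goroutines in a blocking read at once
    var wg sync.WaitGroup
    for _, p := range paths {
        wg.Add(1)
        go func(p string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it
            if _, err := os.ReadFile(p); err != nil {
                // handle the error in real code
            }
        }(p)
    }
    wg.Wait()
}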
Gotcha 4: GC writes can stall a goroutine¶
The Go GC is concurrent but has stop-the-world (STW) phases of ~100 µs–1 ms. During STW, every goroutine pauses. This is unrelated to threading but often confused with scheduler issues.
Gotcha 5: LockOSThread causes scheduler imbalance¶
A pinned goroutine that does little work still occupies an M. With GOMAXPROCS=4 and 4 long-pinned goroutines, the runtime has no Ms left for other work.
Mitigation: minimise pinned goroutines; offload work to non-pinned ones via channels.
Gotcha 6: runtime.Gosched is rarely useful¶
runtime.Gosched voluntarily yields the processor — a leftover from the pre-1.14, cooperative-preemption era. With async preemption, the scheduler interrupts long-running goroutines when it needs to. Calling Gosched in tight loops just adds overhead.
Gotcha 7: misreading top¶
top shows threads, not goroutines. A Go server with Threads: 12 is normal regardless of goroutine count.
Best Practices for Established Codebases¶
- Log GOMAXPROCS at startup. One line. Catches all cgroup misconfigurations.
- Run cgo-heavy code in a bounded pool — never "1 cgo call per request" without bounding.
- Pair every LockOSThread with a documented reason in a source comment.
- Track thread count as a metric (/sched/threads:threads since Go 1.21). Alarm if it exceeds a sane threshold.
- Avoid runtime.Gosched in production code. Tests sometimes need it; production rarely.
- Use context.Context for cancellation across goroutines. Do not try to cancel a goroutine via signals or other thread-level mechanisms.
- Never assume thread identity is stable unless you have called LockOSThread.
- Prefer network I/O, which goes through the netpoller, over blocking non-network syscalls when you have a choice.
Self-Assessment¶
- I can describe the M:N model and the role of G, M, and P.
- I know how the runtime handles a blocking syscall (P handoff).
- I know how the netpoller makes network I/O cheap.
- I can predict thread count for a typical Go server and explain spikes.
- I know when to use LockOSThread and when not to.
- I can tune GOMAXPROCS in a container (or know that Go 1.25+ does it automatically on Linux).
- I know that cgo calls hold an M and how that can lead to M-creation storms.
- I know that signal handlers run on "some" thread and how Go routes them via signal.Notify.
- I can read GODEBUG=schedtrace=1000 output.
- I prefer context.Context for cancellation over any thread-level mechanism.
Summary¶
The middle-level view of goroutines vs threads is about applying the layered model in real code:
- The Go scheduler is an M:N scheduler with three abstractions (G, M, P). Most operations are P-local and lock-free.
- Blocking syscalls trigger a P handoff: the M stays blocked in the kernel, a fresh M takes over the P, and goroutine scheduling continues. Thread count may briefly exceed GOMAXPROCS.
- Network I/O (and timers) goes through the netpoller, which uses non-blocking syscalls + epoll/kqueue. 50 000 idle connections cost zero extra threads.
- Cgo is the main source of M-creation storms; bound your cgo concurrency.
- LockOSThread is the explicit pin needed for OS- or library-thread-local state. Use sparingly.
- GOMAXPROCS in containers: trust Go 1.25+ on Linux; otherwise use automaxprocs or set it yourself.
- Goroutines have no public identity. Use context.Context for tracing and cancellation.
The senior level adds architectural decisions: choosing between a Go service, a thread-pool service, and an event-loop service; cross-language interop patterns; and surviving thread-level failures (cgo segfaults, signal masks, fork/exec) in long-running production systems.