Goroutines vs OS Threads — Hands-on Tasks¶
Practical exercises from easy to hard. Each task says what to build, what success looks like, and a hint or expected outcome. Solutions or solution sketches are at the end.
Easy¶
Task 1 — Print the GMP knobs at startup¶
Write a program that, at startup, prints:
- Go version (runtime.Version()).
- runtime.NumCPU().
- runtime.GOMAXPROCS(0).
- runtime.NumGoroutine().
Run it. Then run with GOMAXPROCS=1 ./your-program and GOMAXPROCS=8 ./your-program. Observe the change.
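A minimal sketch of the startup printout (the output labels are just a suggestion):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("version:    ", runtime.Version())
    fmt.Println("NumCPU:     ", runtime.NumCPU())
    fmt.Println("GOMAXPROCS: ", runtime.GOMAXPROCS(0))
    fmt.Println("goroutines: ", runtime.NumGoroutine())
}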
Goal. Learn the standard runtime knobs and how environment variables override them.
Task 2 — Spawn 100 000 goroutines and measure spawn cost¶
Time how long it takes to:
- Spawn 100 000 goroutines, each calling wg.Done().
- Wait for them all.
Print the average per-goroutine cost.
const N = 100_000
var wg sync.WaitGroup
wg.Add(N)
start := time.Now()
for i := 0; i < N; i++ {
    go func() { wg.Done() }()
}
wg.Wait()
elapsed := time.Since(start)
fmt.Printf("avg %v / goroutine\n", elapsed/N)
Expected: 0.5–2 µs per goroutine on a modern laptop.
Goal. See the order-of-magnitude cost of goroutine creation.
Task 3 — Compare to thread spawn cost (via cgo)¶
Using cgo and pthread_create, spawn 1 000 OS threads in a loop and measure. Compare to 1 000 goroutines.
package main

/*
#include <pthread.h>

static void *worker(void *arg) { return NULL; }

// Spawn one OS thread and wait for it to exit.
static void spawn_and_join(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, NULL);
}
*/
import "C"

import (
    "fmt"
    "time"
)

func main() {
    const N = 1000
    start := time.Now()
    for i := 0; i < N; i++ {
        C.spawn_and_join()
    }
    elapsed := time.Since(start)
    fmt.Printf("threads: %v total, %v / thread\n", elapsed, elapsed/N)
}
Compare the timing to Task 2. Expect a thread spawn to be roughly 30–100× slower than a goroutine spawn.
Goal. Internalise that threads are much heavier than goroutines.
Task 4 — time.Sleep does not consume a thread¶
Spawn 10 000 goroutines, each calling time.Sleep(30 * time.Second). While they sleep, read /proc/self/status (Linux) and inspect the Threads: line.
package main

import (
    "fmt"
    "os"
    "strings"
    "sync"
    "time"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 10_000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            time.Sleep(30 * time.Second)
        }()
    }
    // Read after a moment to let goroutines spawn
    time.Sleep(1 * time.Second)
    data, _ := os.ReadFile("/proc/self/status")
    for _, line := range strings.Split(string(data), "\n") {
        if strings.HasPrefix(line, "Threads:") {
            fmt.Println(line)
        }
    }
    wg.Wait()
}
Expected: Threads: stays small (roughly 4–10) even with 10 000 sleeping goroutines.
Goal. Empirically confirm that goroutines parked on the timer/netpoller do not consume threads.
Task 5 — Find the thread count of an external Go process¶
Take any running Go program (a web server, a tool). Find its PID with pgrep, then inspect the Threads: line in /proc/<pid>/status (for example, grep Threads: /proc/<pid>/status).
Note the count. Compare to runtime.NumGoroutine() reported by the program (if it has a debug endpoint).
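If you would rather do the inspection from Go than from the shell, a minimal sketch of a stand-alone inspector, assuming Linux and that the target PID is passed as the first argument:

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: threadcount <pid>")
        os.Exit(1)
    }
    data, err := os.ReadFile("/proc/" + os.Args[1] + "/status")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, line := range strings.Split(string(data), "\n") {
        if strings.HasPrefix(line, "Threads:") {
            fmt.Println(line)
        }
    }
}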
Goal. Get comfortable inspecting Go process internals from outside.
Task 6 — Demonstrate GOMAXPROCS affecting parallelism¶
Spawn 4 goroutines that each loop incrementing a local counter for 1 second. Run with GOMAXPROCS=1, GOMAXPROCS=2, GOMAXPROCS=4. Measure total throughput (sum of counts).
package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func runWith(procs int) int64 {
    runtime.GOMAXPROCS(procs)
    var wg sync.WaitGroup
    counts := make([]int64, 4)
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            end := time.Now().Add(1 * time.Second)
            for time.Now().Before(end) {
                counts[i]++
            }
        }(i)
    }
    wg.Wait()
    var total int64
    for _, c := range counts {
        total += c
    }
    return total
}

func main() {
    for _, p := range []int{1, 2, 4} {
        fmt.Printf("GOMAXPROCS=%d total=%d\n", p, runWith(p))
    }
}
Expected: throughput scales roughly linearly with GOMAXPROCS (up to the core count).
Goal. See parallelism in action and confirm GOMAXPROCS is the right knob.
Medium¶
Task 7 — Measure context-switch cost¶
Two goroutines ping-pong an integer through a channel for a fixed duration. Count exchanges. Compute per-exchange time.
package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

func main() {
    a := make(chan int, 1)
    b := make(chan int, 1)
    done := make(chan struct{})
    var exchanges int64

    // Goroutine 1: receive on a, count the exchange, reply on b.
    go func() {
        for {
            select {
            case v := <-a:
                atomic.AddInt64(&exchanges, 1)
                b <- v + 1
            case <-done:
                return
            }
        }
    }()
    // Goroutine 2: receive on b, reply on a.
    go func() {
        for {
            select {
            case v := <-b:
                a <- v + 1
            case <-done:
                return
            }
        }
    }()

    a <- 0 // kick off the ping-pong
    const d = 1 * time.Second
    time.Sleep(d)
    close(done)

    n := atomic.LoadInt64(&exchanges)
    fmt.Printf("%d round-trips in %v, avg %v / round-trip\n", n, d, d/time.Duration(n))
}
Expected: roughly 200–500 ns per round-trip on a modern laptop.
Goal. Estimate goroutine context-switch cost.
Task 8 — Compare blocking vs non-blocking syscall behaviour¶
Write a program that:
- Spawns 100 goroutines; each opens a file, reads 1 byte, and closes it. The reads are blocking syscalls (regular file).
- Reports runtime.NumGoroutine() and the OS thread count during the work.
Run the same with 100 goroutines doing http.Get (network → netpoller).
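A minimal sketch of the file-read variant, assuming /etc/hostname exists; the iteration counts are arbitrary, and swapping the body of work() for an http.Get gives the netpoller variant:

package main

import (
    "fmt"
    "os"
    "runtime"
    "strings"
    "sync"
    "time"
)

// osThreads returns the Threads: value from /proc/self/status.
func osThreads() string {
    data, _ := os.ReadFile("/proc/self/status")
    for _, line := range strings.Split(string(data), "\n") {
        if strings.HasPrefix(line, "Threads:") {
            return strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
        }
    }
    return "?"
}

// work performs one blocking read(2) on a regular file.
func work() {
    f, err := os.Open("/etc/hostname")
    if err != nil {
        return
    }
    defer f.Close()
    buf := make([]byte, 1)
    f.Read(buf)
}

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 10_000; j++ { // keep the syscalls coming while we sample
                work()
            }
        }()
    }
    time.Sleep(200 * time.Millisecond)
    fmt.Println("goroutines:", runtime.NumGoroutine(), "threads:", osThreads())
    wg.Wait()
}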
Expected: file-read version shows thread count climbing (~10–50); network version stays at 5–10 threads.
Goal. See the difference between blocking-syscall and netpoller paths in real time.
Task 9 — Trigger a cgo M-creation storm¶
package main

/*
#include <unistd.h>
void slow(void) { sleep(2); }
*/
import "C"

import (
    "fmt"
    "os"
    "strings"
    "sync"
    "time"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            C.slow()
        }()
    }
    // Sample thread count during the storm
    for i := 0; i < 5; i++ {
        time.Sleep(400 * time.Millisecond)
        data, _ := os.ReadFile("/proc/self/status")
        for _, l := range strings.Split(string(data), "\n") {
            if strings.HasPrefix(l, "Threads:") {
                fmt.Println(l)
            }
        }
    }
    wg.Wait()
}
Expected: Threads: climbs from ~10 to ~100, then drops back as cgo calls return.
Goal. Witness a cgo M-creation storm; understand why bounding cgo concurrency matters.
Task 10 — Mitigate the storm with a semaphore¶
Take Task 9. Add a semaphore that limits cgo calls to 8 concurrent. Verify thread count stays bounded.
import "golang.org/x/sync/semaphore"
import "context"
sem := semaphore.NewWeighted(8)
for i := 0; i < 100; i++ {
wg.Add(1)
go func() {
defer wg.Done()
sem.Acquire(context.Background(), 1)
defer sem.Release(1)
C.slow()
}()
}
Expected: Threads: stays at ~15, even with 100 goroutines waiting.
Goal. Apply the standard mitigation for cgo storms.
Task 11 — LockOSThread and syscall.Gettid¶
Write a goroutine that:
- Calls runtime.LockOSThread.
- Reads syscall.Gettid() and stores it.
- Yields 100 times.
- After each yield, asserts syscall.Gettid() still equals the stored value.
import (
    "runtime"
    "syscall"
)

func main() {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    tid := syscall.Gettid()
    for i := 0; i < 100; i++ {
        runtime.Gosched()
        if syscall.Gettid() != tid {
            panic("drifted!")
        }
    }
}
Then remove LockOSThread and run a similar test under load with many goroutines. You may not always observe a drift, but the assertion is no longer guaranteed to hold.
Goal. Confirm that LockOSThread truly pins.
Task 12 — Measure errgroup.SetLimit vs unbounded errgroup¶
import "golang.org/x/sync/errgroup"
const N = 10000
unboundedTime := func() time.Duration {
g, _ := errgroup.WithContext(context.Background())
start := time.Now()
for i := 0; i < N; i++ {
g.Go(func() error {
time.Sleep(1 * time.Millisecond)
return nil
})
}
g.Wait()
return time.Since(start)
}
boundedTime := func(limit int) time.Duration {
g, _ := errgroup.WithContext(context.Background())
g.SetLimit(limit)
start := time.Now()
for i := 0; i < N; i++ {
g.Go(func() error {
time.Sleep(1 * time.Millisecond)
return nil
})
}
g.Wait()
return time.Since(start)
}
Compare unboundedTime() (10 000 goroutines at once, finishing in a few milliseconds) to boundedTime(100) (~100 ms, i.e. N × sleep / limit).
Goal. Understand the trade-off between concurrency level and resource use.
Task 13 — Implement GOMAXPROCS warning sidecar¶
Write a sidecar function called from main that, after a brief delay, compares runtime.GOMAXPROCS(0) to the cgroup limit (read /sys/fs/cgroup/cpu.max for cgroup v2). If GOMAXPROCS is 2× or more than the cgroup limit, log a loud warning.
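A minimal sketch, assuming cgroup v2 and that /sys/fs/cgroup/cpu.max is readable from inside the container; the 2-second pause stands in for the "brief delay", and warnIfOversubscribed is meant to be started with go warnIfOversubscribed() from main:

package main

import (
    "log"
    "os"
    "runtime"
    "strconv"
    "strings"
    "time"
)

// cgroupCPULimit parses /sys/fs/cgroup/cpu.max ("<quota> <period>" or "max <period>")
// and returns the effective CPU limit, or 0 if there is no limit.
func cgroupCPULimit() float64 {
    data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
    if err != nil {
        return 0
    }
    fields := strings.Fields(string(data))
    if len(fields) != 2 || fields[0] == "max" {
        return 0
    }
    quota, err1 := strconv.ParseFloat(fields[0], 64)
    period, err2 := strconv.ParseFloat(fields[1], 64)
    if err1 != nil || err2 != nil || period == 0 {
        return 0
    }
    return quota / period
}

// warnIfOversubscribed is the sidecar check.
func warnIfOversubscribed() {
    time.Sleep(2 * time.Second) // let any GOMAXPROCS tuning at startup run first
    limit := cgroupCPULimit()
    procs := runtime.GOMAXPROCS(0)
    if limit > 0 && float64(procs) >= 2*limit {
        log.Printf("WARNING: GOMAXPROCS=%d but the cgroup CPU limit is %.2f; expect CPU throttling",
            procs, limit)
    }
}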
Goal. Defensive coding for container misconfiguration.
Task 14 — Read scheduler trace¶
Run any Go program with GODEBUG=schedtrace=1000 (add ,scheddetail=1 for per-M and per-G detail).
Identify in the output:
- gomaxprocs
- threads
- idlethreads
- runqueue (global)
- per-P runqueue sizes
- per-M and per-G states (with scheddetail=1)
Find a moment when a syscall handoff happens (an M in syscall, a P in _Psyscall).
Goal. Read the runtime's own diagnostic output.
Hard¶
Task 15 — Build a thread-pinned worker for a thread-affine task¶
Imagine a fictional C library that requires the same thread for init, do, done. Wrap it in Go:
- One goroutine per "session," pinned via LockOSThread.
- A channel of work items.
- Initialise the C library in the goroutine; tear down before exit.
Test by spawning 4 sessions and submitting 100 work items each. Verify each goroutine stays on its thread (via syscall.Gettid).
Goal. Production-quality thread pinning pattern.
Task 16 — Measure scheduler latency under load¶
Use runtime/trace:
import "runtime/trace"
import "os"
func main() {
f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()
// ... your workload ...
}
Then go tool trace trace.out and explore the visualisation. Find:
- An M that is parked in a syscall.
- A P that is idle while Ms are spinning.
- GC stop-the-world.
- Goroutine creator-stack relationships.
Goal. Be fluent in the runtime trace UI.
Task 17 — Build a NUMA-aware deployment plan¶
For a Go service that runs on a 64-core, 4-socket box:
- Plan how many replicas, each with which GOMAXPROCS.
- Use numactl to pin each replica to one NUMA node.
- Verify with lscpu and numactl --hardware.
Compare throughput / latency to a single-process GOMAXPROCS=64. Report your findings.
Goal. Hands-on NUMA tuning.
Task 18 — Implement a goroutine ID extractor (for fun, not for production)¶
Parse the output of runtime.Stack to extract the current goroutine's ID. Write a test that confirms the ID is unique per goroutine. Note: this is hacky and slow. Do not use it in production code.
import (
    "bytes"
    "runtime"
    "strconv"
)

func goid() uint64 {
    var buf [64]byte
    n := runtime.Stack(buf[:], false)
    line := buf[:n]
    // First line looks like: "goroutine N [running]:"
    i := bytes.IndexByte(line, ' ')
    j := bytes.IndexByte(line[i+1:], ' ')
    id, _ := strconv.ParseUint(string(line[i+1:i+1+j]), 10, 64)
    return id
}
Goal. Understand why Go intentionally does not expose goroutine ID.
Task 19 — Compare goroutine vs thread for a per-request server¶
Implement two web servers:
- A: pure Go, one goroutine per request, no cgo.
- B: Go + cgo for some hot-path operation (e.g., calls a C function that takes 1 ms).
Load-test both with wrk or vegeta at 1 000 RPS.
Measure:
- Throughput.
- p99 latency.
- Thread count (the Threads: line in /proc/<pid>/status).
- Goroutine count.
Compare results. Explain the gap.
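A minimal sketch of server A (pure Go); for server B, replace the handler body with a cgo call that burns about 1 ms in C. The route, port, and busy-loop are arbitrary placeholders:

package main

import (
    "fmt"
    "net/http"
    "runtime"
)

func main() {
    http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
        // Simulated per-request work; in server B this is the ~1 ms cgo call.
        sum := 0
        for i := 0; i < 100_000; i++ {
            sum += i
        }
        fmt.Fprintf(w, "sum=%d goroutines=%d\n", sum, runtime.NumGoroutine())
    })
    http.ListenAndServe(":8080", nil)
}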
Goal. End-to-end measurement of the goroutine-vs-thread cost in a realistic workload.
Task 20 — Implement a "thread-watcher" service¶
Build a small Go service that, in the background, polls /proc/self/status:Threads every second and exposes it as a Prometheus metric. Add an alert rule: thread count > 50 for 30 s.
import (
    "os"
    "strconv"
    "strings"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var threadGauge = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "process_threads_total",
    Help: "OS threads used by this process (from /proc/self/status).",
})

func init() {
    // The gauge must be registered or it never appears on /metrics.
    prometheus.MustRegister(threadGauge)
}

func watchThreads() {
    for range time.Tick(1 * time.Second) {
        data, _ := os.ReadFile("/proc/self/status")
        for _, line := range strings.Split(string(data), "\n") {
            if strings.HasPrefix(line, "Threads:") {
                n, _ := strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Threads:")))
                threadGauge.Set(float64(n))
            }
        }
    }
}
Goal. Production-ready thread observability.
Solution Sketches¶
Task 1¶
Pure runtime calls; no surprises. Verify that the GOMAXPROCS environment variable changes the reported value.
Task 2 / Task 3¶
Compare absolute numbers. The C version is 30–100× slower per spawn.
Task 4¶
Confirms parking. Threads: stays small.
Task 6¶
Linear scaling visible.
Task 9 / Task 10¶
Storm → mitigation. Thread count drops from ~100 to ~15.
Task 15 sketch¶
type Session struct {
    in   chan Work
    quit chan struct{}
}

func NewSession() *Session {
    s := &Session{in: make(chan Work, 16), quit: make(chan struct{})}
    go func() {
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        C.session_init()
        defer C.session_done()
        for {
            select {
            case w := <-s.in:
                C.session_do(w.cdata)
            case <-s.quit:
                return
            }
        }
    }()
    return s
}
Task 20¶
Use prometheus/client_golang. Expose /metrics. A possible alert rule in Prometheus YAML (group and alert names are placeholders; the expr matches the metric name from the Task 20 code):
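# Placeholder names; adjust expr if you chose a different metric name.
groups:
  - name: thread-watcher
    rules:
      - alert: HighOSThreadCount
        expr: process_threads_total > 50
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "OS thread count above 50 for 30 s (possible cgo storm or blocking-syscall pile-up)"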
Wrap-up¶
After working through these tasks you should be able to:
- Quantify the cost of goroutines and threads on your hardware.
- Detect a cgo storm in seconds via thread count.
- Confirm LockOSThread pinning works as advertised.
- Use runtime/trace to inspect scheduler behaviour.
- Build observability that distinguishes goroutine-level from thread-level issues.
Next: find-bug.md for bug-finding exercises around thread affinity, signal handling, and OS-resource leaks.