Go Runtime GMP — Optimization Exercises¶
Each exercise presents code with a scheduler-related performance issue. Identify it, optimise, measure.
Exercise 1 — Overspawning goroutines¶
Baseline.
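The baseline isn't shown; a plausible reconstruction (using the same Item and process as the solution below) spawns one goroutine per item:
var wg sync.WaitGroup
for _, item := range items {
    wg.Add(1)
    go func(item Item) { // one goroutine per item, ~2 KB of stack each
        defer wg.Done()
        process(item)
    }(item)
}
wg.Wait()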
For 1 million items: 1 million goroutines. Stack overhead alone is ~2 GB (a million 2 KB minimum stacks), and scheduler churn is high.
Goal. Bound concurrency.
Solution.
sem := make(chan struct{}, runtime.NumCPU()*2)
var wg sync.WaitGroup
for _, item := range items {
    sem <- struct{}{} // acquire a slot before spawning
    wg.Add(1)
    go func(item Item) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot
        process(item)
    }(item)
}
wg.Wait()
Bounded concurrency at 2 * NumCPU. Memory usage stays small.
Exercise 2 — Container with default GOMAXPROCS¶
Baseline.
A Go service runs in a Kubernetes pod with a CPU limit of 2 cores on a 32-core host. Go versions before 1.25 (here 1.19) default GOMAXPROCS to all 32 cores, so the runtime schedules 32 cores' worth of work and the kernel throttles the process to its 2-core quota.
Goal. Match GOMAXPROCS to actual CPU.
Solution.
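One option is Uber's automaxprocs package, which adjusts GOMAXPROCS to the cgroup quota at startup; the blank import does the work in its init:
import _ "go.uber.org/automaxprocs"
Alternatively, set the GOMAXPROCS environment variable to the pod's CPU limit in the deployment spec.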
Or upgrade to Go 1.25+, whose runtime sets GOMAXPROCS from the container's cgroup CPU limit automatically.
Verify with:
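fmt.Println(runtime.GOMAXPROCS(0)) // an argument of 0 reads the current value without changing it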
Should report 2 in the container.
Exercise 3 — Cgo in a hot loop¶
Baseline.
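The baseline isn't shown; a plausible reconstruction, with a hypothetical add_one C helper, calls into C once per element:
/*
int add_one(int x) { return x + 1; }
*/
import "C"

func incAll(data []int32) {
    for i := range data {
        data[i] = int32(C.add_one(C.int(data[i]))) // one ~300 ns cgo transition per element
    }
}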
Each Cgo call costs ~300 ns transition. 1M calls = 300 ms in transitions.
Goal. Reduce Cgo overhead.
Solution: batch in C.
Define a C function that takes a count and loops:
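A sketch for the same hypothetical workload, batched into one C call (passing the slice's backing array to C for the duration of the call is allowed by the cgo pointer rules, since C does not retain it):
/*
void add_one_batch(int *xs, int n) {
    for (int i = 0; i < n; i++) xs[i] += 1;
}
*/
import "C"

import "unsafe"

func incAllBatched(data []int32) {
    if len(data) == 0 {
        return
    }
    C.add_one_batch((*C.int)(unsafe.Pointer(&data[0])), C.int(len(data))) // one transition total
}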
One transition; its cost is amortised over 1M iterations.
Exercise 4 — Excessive runtime.Gosched¶
Baseline.
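A plausible reconstruction of the baseline, yielding manually on every iteration:
for i := range items {
    process(items[i])
    runtime.Gosched() // manual yield on every iteration
}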
In Go 1.14+, async preemption handles fairness. Manual Gosched adds ~50 ns per call for no benefit.
Goal. Remove unnecessary yields.
Solution.
Delete the runtime.Gosched() calls; since Go 1.14, asynchronous preemption keeps goroutines fair without manual yields. Trust the scheduler.
Exercise 5 — Goroutine churn¶
Baseline.
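A plausible reconstruction of the baseline, spawning a goroutine per event:
for event := range events {
    go process(event) // create, schedule, destroy: once per event
}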
If events arrive at 100 000/sec, you spawn 100 000 goroutines per second, and the scheduler overhead becomes measurable.
Goal. Reuse goroutines.
Solution: worker pool.
workers := runtime.NumCPU() * 4
queue := make(chan Event, 1024)
var wg sync.WaitGroup
for i := 0; i < workers; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for e := range queue {
            process(e)
        }
    }()
}
for event := range events {
    queue <- event
}
close(queue)
wg.Wait() // let the workers drain the queue before returning
A fixed pool of goroutines reused across events. Scheduler churn drops dramatically.
Exercise 6 — Per-request runtime.GC¶
Baseline.
http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
    process(r)
    runtime.GC() // "free memory"
})
runtime.GC forces a full collection, which includes brief stop-the-world phases. At 1000 req/sec, you're stopping the world 1000 times per second.
Goal. Let GC manage itself.
Solution.
Remove the explicit runtime.GC(). Tune GOGC if memory growth is a concern:
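For example (the value is illustrative, not a recommendation):
GOGC=200 ./server   # collect when the heap grows 200% over the live set, instead of the default 100%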
Higher GOGC means less frequent GC at the cost of more memory.
Exercise 7 — LockOSThread overuse¶
Baseline.
func handler(w http.ResponseWriter, r *http.Request) {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    process(r)
}
Each request pins an OS thread for its duration. 10 000 concurrent requests means 10 000 OS threads.
Goal. Reserve thread pinning for code that needs it.
Solution.
Remove LockOSThread from the handler. Only use it for the specific code paths that genuinely require thread identity (Cgo with thread-local state, certain syscalls).
If a sub-function needs thread pinning, isolate it:
func handler(w http.ResponseWriter, r *http.Request) {
    process(r)
}

func process(r *http.Request) {
    if needsThreadAffinity() {
        doForeignThingLocked()
    }
    // rest of processing runs unpinned
}

// doForeignThingLocked pins only for the duration of the foreign call;
// deferring the unlock directly inside process would keep the thread
// locked for the rest of the request.
func doForeignThingLocked() {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    doForeignThing()
}
Exercise 8 — Many simultaneous syscalls¶
Baseline.
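A plausible reconstruction of the baseline, issuing every read at once:
var wg sync.WaitGroup
for _, path := range paths {
    wg.Add(1)
    go func(p string) {
        defer wg.Done()
        data, _ := os.ReadFile(p) // each blocking read parks an M
        process(data)
    }(path)
}
wg.Wait()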
10 000 concurrent file reads occupy 10 000 OS threads, because each blocking syscall holds its M.
Goal. Bound syscall concurrency.
Solution.
sem := make(chan struct{}, 16)
var wg sync.WaitGroup
for _, path := range paths {
    sem <- struct{}{} // acquire one of 16 slots before spawning
    wg.Add(1)
    go func(p string) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot
        data, err := os.ReadFile(p)
        if err != nil {
            return // handle the error as appropriate
        }
        process(data)
    }(path)
}
wg.Wait()
16 concurrent reads; 16 M's used. Throughput close to optimal for spinning disks; SSDs may benefit from 32–64.
Exercise 9 — time.After in a tight loop¶
Baseline.
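A plausible reconstruction of the baseline:
for {
    select {
    case v := <-ch:
        process(v)
    case <-time.After(time.Second): // allocates a fresh timer every iteration
        idle()
    }
}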
Each iteration allocates a new timer. Before Go 1.23, the previous timer stays live until it fires (up to 1 second here), so memory accumulates; on newer runtimes it is still an allocation per iteration.
Goal. Reuse a single timer.
Solution.
t := time.NewTimer(time.Second)
defer t.Stop()
for {
    // If the timer already fired, Stop returns false and the value sits
    // in t.C; drain it non-blockingly so Reset starts from a clean state.
    // (On Go 1.23+ the drain is unnecessary, but it is harmless.)
    if !t.Stop() {
        select {
        case <-t.C:
        default:
        }
    }
    t.Reset(time.Second)
    select {
    case v := <-ch:
        process(v)
    case <-t.C:
        idle()
    }
}
One timer reused. Less memory pressure, less GC work.
Exercise 10 — Allocations in hot path¶
Baseline.
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    buf := make([]byte, 0, 4096)
    buf = encode(buf, data)
    w.Write(buf)
})
Per request: 4 KB allocation. At 10 000 req/sec: 40 MB/sec to GC. GC frequency rises.
Goal. Reuse buffers.
Solution.
var bufPool = sync.Pool{
    New: func() interface{} {
        b := make([]byte, 0, 4096)
        return &b // store a pointer so the slice header itself isn't re-boxed
    },
}

http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    bp := bufPool.Get().(*[]byte)
    defer func() {
        *bp = (*bp)[:0] // reset length, keep capacity
        bufPool.Put(bp)
    }()
    *bp = encode(*bp, data)
    w.Write(*bp)
})
Buffers reused. Allocation rate drops; GC pressure falls.
Exercise 11 — Heavy runtime.NumGoroutine polling¶
Baseline.
http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "goroutines %d\n", runtime.NumGoroutine())
    // ... other metrics ...
})
runtime.NumGoroutine walks internal data structures and is not free under heavy load. Called on every metrics scrape, its accumulated cost matters.
Goal. Cache or sample less frequently.
Solution.
Sample periodically and cache:
var cachedNumGoroutine atomic.Int64

func init() {
    go func() {
        t := time.NewTicker(time.Second)
        for range t.C {
            cachedNumGoroutine.Store(int64(runtime.NumGoroutine()))
        }
    }()
}

http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "goroutines %d\n", cachedNumGoroutine.Load())
})
Per-scrape cost: nearly zero.
Exercise 12 — Spawning a goroutine per WebSocket message¶
Baseline.
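A plausible reconstruction of the baseline read loop:
for {
    msg := readMessage(conn)
    go handle(msg) // one goroutine per message
}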
If a connection sends 1000 msg/sec, you spawn 1000 goroutines per second per connection. With 10 000 connections, that is 10 million goroutines per second.
Goal. Process messages without spawning.
Solution.
If handle is non-blocking, just call it:
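for {
    msg := readMessage(conn)
    handle(msg) // runs inline on the connection's reader goroutine
}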
If handle is sometimes slow, use a worker pool:
queue := make(chan Message, 64)
for i := 0; i < 4; i++ {
    go func() {
        for m := range queue {
            handle(m)
        }
    }()
}

for {
    msg := readMessage(conn)
    queue <- msg
}
Exercise 13 — Avoiding lock contention in hot path¶
Baseline.
var (
    mu      sync.Mutex
    counter int64
)

http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
    mu.Lock()
    counter++
    mu.Unlock()
})
At 100 000 req/sec, the mutex is hot. Multiple cores contend on the lock; throughput plateaus.
Goal. Reduce contention.
Solution.
var counter atomic.Int64

http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
    counter.Add(1)
})
Atomic add: ~5 ns uncontended, no mutex. Throughput scales far better across cores, though a heavily contended cache line still serialises.
For even higher write rates, shard the counter (one atomic per shard or per P, summed on read), a pattern used by several metrics libraries.
Exercise 14 — Reducing GC pause time¶
Baseline.
A service with a large heap (10 GB) has 50 ms GC pauses, dropping p99 latency.
Goal. Reduce pause.
Solution.
- Reduce allocation rate (object pools, byte buffers).
- Reduce heap size (drop caches, smaller working set).
- Use GOMEMLIMIT to bound memory; GC runs more often but with less work each time.
- Use GOGC=200 to delay GC for higher allocation rates, accepting more memory.
- For latency-critical services, use debug.SetGCPercent to manage GC pace.
Each option has trade-offs; profile to choose. A sketch of setting these knobs in code follows.
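A minimal sketch of setting the knobs programmatically (the 8 GiB limit and 200% figure are placeholders, not recommendations):
import "runtime/debug"

func init() {
    debug.SetMemoryLimit(8 << 30) // same effect as GOMEMLIMIT=8GiB (Go 1.19+)
    debug.SetGCPercent(200)       // same effect as GOGC=200
}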
Exercise 15 — Excessive channel allocation¶
Baseline.
func process(req Request) Response {
    ch := make(chan Response, 1)
    go func() {
        ch <- compute(req)
    }()
    return <-ch
}
Per request: one channel allocation plus one goroutine spawn. At 100 000 req/sec, that is 100 000 short-lived channels and 100 000 short-lived goroutines every second.
Goal. Remove unnecessary indirection.
Solution.
If you wait on the channel synchronously, just call the function:
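func process(req Request) Response {
    return compute(req) // the caller blocks either way; no channel, no goroutine
}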
If a goroutine is needed for timeout enforcement, structure differently:
func processWithTimeout(req Request, d time.Duration) (Response, error) {
    ctx, cancel := context.WithTimeout(context.Background(), d)
    defer cancel()

    type result struct {
        r   Response
        err error
    }
    out := make(chan result, 1) // buffered: the goroutine can always send and exit
    go func() {
        out <- result{compute(req), nil}
    }()

    select {
    case r := <-out:
        return r.r, r.err
    case <-ctx.Done():
        return Response{}, ctx.Err()
    }
}
Even better: cancel the work in compute via the context.
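A sketch of that variant, assuming compute is changed to accept a context and return early on cancellation:
func processWithTimeout(req Request, d time.Duration) (Response, error) {
    ctx, cancel := context.WithTimeout(context.Background(), d)
    defer cancel()
    // compute checks ctx.Done() at convenient points and returns ctx.Err() when cancelled.
    return compute(ctx, req)
}
No extra goroutine, no channel, and the work actually stops at the deadline instead of running on in the background.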
Closing¶
Scheduler-aware optimisation patterns:
- Bound goroutine creation. Pool or semaphore-limit.
- Reuse goroutines. Worker pools beat per-event spawns.
- Reuse buffers. sync.Pool reduces GC pressure.
- Reuse timers. time.NewTimer + Reset instead of time.After in loops.
- Avoid Cgo in hot paths. Batch transitions.
- Avoid LockOSThread unless needed. Preserves scheduler flexibility.
- Avoid manual runtime.GC and Gosched. Trust the runtime.
- Use atomics for simple shared state. Channels for ownership transfer.
- Match GOMAXPROCS to container CPU. Use automaxprocs or Go 1.25+.
- Set GOMEMLIMIT for production services. Prevents OOM.
Optimise based on profile data, not intuition. The Go scheduler is highly tuned; the bottleneck is usually in your code, not in the runtime.