Runtime Source Dive — Optimization
1. How to use this file
Seventeen scenarios where code is slower, allocates more, or scales worse than it should because the runtime is being asked to do work the author didn't realize they were paying for. Each entry has a Before (code + benchmark) and a collapsible Solution (optimized code + benchmark + why + when NOT).
Anchored at Go 1.23, amd64. Numbers show the expected shape, not exact values: run go test -bench=. -benchmem on your hardware before quoting them. Runtime cost surfaces in seven recurring places: runtime.newproc (goroutine creation), runtime.chansend/chanrecv (channel ops with lock+park), runtime.gopark/goready (scheduler trips for sync primitives), runtime.mallocgc (heap allocation), runtime.startTimer (timer heap insertion), runtime.cgocall (Go↔C transition), and runtime.convT*/reflect (interface boxing, type-table lookups). Most wins remove one of those from the hot path.
Reading order: Ex. 1, 2, 4, 8 first. Ex. 3, 5, 7, 13 are the ones senior reviews flag most.
Exercise 1: Spawning a goroutine for trivially short work
Difficulty: Junior
Skills: scheduler awareness, profiling, runtime/proc.go
func ValidateRequest(r Req) error {
	errs := make(chan error, 4)
	var wg sync.WaitGroup
	wg.Add(4)
	go func() { defer wg.Done(); errs <- checkUserID(r.UserID) }()
	go func() { defer wg.Done(); errs <- checkEmail(r.Email) }()
	go func() { defer wg.Done(); errs <- checkAge(r.Age) }()
	go func() { defer wg.Done(); errs <- checkName(r.Name) }()
	wg.Wait()
	close(errs)
	for e := range errs {
		if e != nil {
			return e
		}
	}
	return nil
}
Why it's slow: runtime.newproc (runtime/proc.go) is ~400–600 ns per goroutine on amd64: malg allocates the g if the gfree list is empty, plus runq enqueue. Each checkX finishes in ~50 ns, so goroutine creation alone costs roughly 10× the work it wraps, before counting the four channel sends and the WaitGroup.
Hint
Profile shows `runtime.newproc1`, `runtime.malg`, `runtime.chanrecv1` near the top. If goroutine setup dwarfs goroutine work, inline it.
Solution
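A minimal sketch of the inlined version. The `Req` fields and the bodies of the four `checkX` helpers are hypothetical stand-ins (the exercise only names them); the point is the shape of `ValidateRequest`.

```go
package main

import (
	"errors"
	"strings"
)

// Req and the four check helpers are hypothetical stand-ins for the
// exercise's types; only the shape of ValidateRequest matters here.
type Req struct {
	UserID int64
	Email  string
	Age    int
	Name   string
}

func checkUserID(id int64) error {
	if id <= 0 {
		return errors.New("user id must be positive")
	}
	return nil
}

func checkEmail(email string) error {
	if !strings.Contains(email, "@") {
		return errors.New("email must contain @")
	}
	return nil
}

func checkAge(age int) error {
	if age < 0 || age > 150 {
		return errors.New("age out of range")
	}
	return nil
}

func checkName(name string) error {
	if name == "" {
		return errors.New("name must not be empty")
	}
	return nil
}

// ValidateRequest runs the checks sequentially on the caller's goroutine and
// returns the first failure: no goroutines, no channel, no WaitGroup.
func ValidateRequest(r Req) error {
	if err := checkUserID(r.UserID); err != nil {
		return err
	}
	if err := checkEmail(r.Email); err != nil {
		return err
	}
	if err := checkAge(r.Age); err != nil {
		return err
	}
	return checkName(r.Name)
}
```

Unlike the channel version, this one stops at the first failing check and always reports failures in the same order.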
**Benchmark before/after:**
**Why it's faster:** No `runtime.newproc`, no `g` allocation, no runq enqueue, no `gopark`/`goready` on the channel. The compiler can inline the checks. Rule: only spawn a goroutine when the body does ≥ 10 µs of CPU or any blocking I/O.
**When NOT:** When each check is itself slow I/O or independent network work that can truly run in parallel.
Exercise 2: Channel-based counter vs atomic.Int64
Difficulty: Middle
Skills: runtime/chan.go, sync/atomic, lock-free primitives
type Counter struct{ ch chan int64 }

func NewCounter() *Counter {
	c := &Counter{ch: make(chan int64, 1024)}
	go func() {
		var total int64
		for v := range c.ch {
			total += v
		}
		_ = total
	}()
	return c
}

func (c *Counter) Inc() { c.ch <- 1 }
Why it's slow: runtime.chansend (chan.go) calls lock(&c.lock), copies the element, may goready a parked receiver. Even uncontended, the lock + qcount/sendx bookkeeping is ~30 ns. Under cross-P contention, the channel becomes a serialization point.
Hint
A counter doesn't need ordering between writers. `atomic.Int64.Add` compiles to a single `LOCK XADD`, with no runtime trip.
Solution
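A minimal sketch of the atomic version the hint describes. The `Load` method is an addition for illustration (the channel version never exposes its total), and `atomic.Int64` needs Go 1.19 or later.

```go
package main

import "sync/atomic"

// Counter is safe for concurrent use. The zero value is ready to use, so no
// constructor and no aggregator goroutine are needed.
type Counter struct{ n atomic.Int64 }

// Inc adds one with a single atomic read-modify-write.
func (c *Counter) Inc() { c.n.Add(1) }

// Load returns the current total. It is an addition for illustration; the
// channel-based counter in the exercise has no equivalent.
func (c *Counter) Load() int64 { return c.n.Load() }
```

Because the zero value works, callers can embed `Counter` directly in another struct instead of calling `NewCounter`.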
**Benchmark before/after:**
**Why it's faster:** `LOCK XADD` costs ~3–10 ns. No `chansend`, no buffer index math, no aggregator goroutine. The atomic stays in L1 for the writing core; cross-core contention costs cache-line bounces but never the scheduler.
**When NOT:** When you need ordering between writers or batching across events, a channel is the right primitive.
Exercise 3: sync.Mutex ping-pong instead of sharded counters
Difficulty: Senior
Skills: runtime/sema.go, lock contention, cache-line padding
type Limiter struct {
	mu     sync.Mutex
	counts map[string]int
}

func (l *Limiter) Hit(tenant string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[tenant]++
	return l.counts[tenant]
}
Why it's slow: Under contention, sync.Mutex.Lock falls into the slow path (sync.runtime_SemacquireMutex → runtime.semacquire1 in sema.go). The runtime parks the goroutine on the mutex's sudog wait list and the M finds other work via findrunnable; a later Unlock goreadys the waiter. Each park/wake is 1–3 µs.
Hint
Per-tenant updates don't need to serialize with *other* tenants' updates. Shard by hash.
Solution
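A minimal sketch of a sharded limiter. The shard count of 64, the use of `hash/maphash` for the tenant hash, and the `NewLimiter` constructor are assumptions made for illustration; the padding assumes 64-byte cache lines.

```go
package main

import (
	"hash/maphash"
	"sync"
)

// shardCount is an assumed value; a power of two keeps the index a cheap mask.
const shardCount = 64

// shard pairs a mutex with its own map. The padding rounds the struct up to
// 64 bytes (8-byte mutex + 8-byte map pointer + 48) so the hot fields of
// neighbouring shards sit 64 bytes apart.
type shard struct {
	mu     sync.Mutex
	counts map[string]int
	_      [48]byte
}

type Limiter struct {
	seed   maphash.Seed
	shards [shardCount]shard
}

// NewLimiter is a hypothetical constructor; each shard needs its own map.
func NewLimiter() *Limiter {
	l := &Limiter{seed: maphash.MakeSeed()}
	for i := range l.shards {
		l.shards[i].counts = make(map[string]int)
	}
	return l
}

// Hit locks only the shard that owns the tenant, so different tenants
// usually proceed in parallel.
func (l *Limiter) Hit(tenant string) int {
	s := &l.shards[maphash.String(l.seed, tenant)&(shardCount-1)]
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[tenant]++
	return s.counts[tenant]
}
```

A given tenant always hashes to the same shard within one process, so per-tenant counts stay exact; only a cross-tenant total would need to visit every shard.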
**Benchmark before/after:**
**Why it's faster:** Contention drops by a factor of `shardCount`. `semacquire` slow-path is rarely entered; `Lock` stays on its atomic-CAS fast path. Cache-line padding prevents adjacent shards from invalidating each other under cross-core writes.
**When NOT:** Single-tenant or very low-QPS systems: sharding is wasted memory and complexity below ~10k QPS per core.
Exercise 4: Per-call heap allocation that should have stayed on the stack
Difficulty: Middle
Skills: escape analysis, -gcflags="-m", runtime.mallocgc
type Entry struct {
	Level   int
	Time    time.Time
	Message string
}

type Handler interface{ Handle(*Entry) }

func Log(h Handler, level int, msg string) {
	e := &Entry{Level: level, Time: time.Now(), Message: msg} // escapes
	h.Handle(e)
}
go build -gcflags="-m" says &Entry{...} escapes to heap.
Why it's slow: Because Handle takes *Entry through an interface, the compiler must assume the pointer may outlive the call, so e escapes. runtime.mallocgc runs: mcache tinyalloc or sizeclass lookup, plus GC bookkeeping. For high-volume logging this is the bottleneck.
Hint
Pass `Entry` by value. The value lives in the caller's stack frame; the concrete-typed handler avoids interface boxing.
Solution
type Entry struct {
	Level   int
	Time    time.Time
	Message string
}

type fastHandler struct{ w io.Writer }

func (f *fastHandler) Handle(e Entry) { fmt.Fprintln(f.w, e.Message) }

func Log(h *fastHandler, level int, msg string) { // concrete type, no iface
	h.Handle(Entry{Level: level, Time: time.Now(), Message: msg})
}
Exercise 5: time.After inside a loop
Difficulty: Middle
Skills: runtime/time.go, timer heap, NewTimer + Reset
func Worker(work <-chan Job, timeout time.Duration) {
	for {
		select {
		case j := <-work:
			j.Run()
		case <-time.After(timeout):
			return
		}
	}
}
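Why it's slow: Every iteration creates a new timer (runtime.startTimer in pre-1.23 sources) and inserts it into the per-P timer heap. Before Go 1.23 that timer stays live until it fires, even when work was selected first. In modules that declare go 1.23 or later, an unreferenced timer is eligible for garbage collection immediately, so the pile-up shrinks, but the per-iteration allocation and heap insertion remain. Insertion is O(log n) plus a heap alloc.
Hint
Allocate one `*time.Timer` outside the loop. `Stop` + `Reset` each iteration.
Solution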
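A minimal sketch of the single-timer loop. `Job` and `JobFunc` are hypothetical stand-ins for the exercise's job type. The non-blocking drain after `Stop` is only needed under pre-1.23 timer semantics and is harmless under the newer ones.

```go
package main

import "time"

// Job and JobFunc are hypothetical stand-ins for the exercise's job type.
type Job interface{ Run() }

type JobFunc func()

func (f JobFunc) Run() { f() }

// Worker runs jobs until none arrives for a full timeout, reusing one timer
// for the whole loop instead of allocating one per iteration.
func Worker(work <-chan Job, timeout time.Duration) {
	t := time.NewTimer(timeout)
	defer t.Stop()
	for {
		select {
		case j := <-work:
			j.Run()
			// Restart the idle timeout. If the timer fired while the job
			// ran, discard the stale tick without blocking.
			if !t.Stop() {
				select {
				case <-t.C:
				default:
				}
			}
			t.Reset(timeout)
		case <-t.C:
			return
		}
	}
}
```

As in the original loop, the timeout measures idle time since the last job finished, not total running time.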
**Benchmark before/after:**
**Why it's faster:** One timer allocation for the entire loop. `Reset` reuses the same `*runtimeTimer` via `resettimer`: re-insertion only, no malloc. The timer heap stays smaller because dead timers don't accumulate.
**When NOT:** Loop bodies that fire once per minute: `time.After` is one line vs five and the cost is invisible.
Exercise 6: Manual runtime.GC() calls
Difficulty: Junior
Skills: GC pacer, GOGC, debug.SetMemoryLimit
func HandleBigJob(j Job) Result {
	r := process(j)
	runtime.GC() // "free memory before next request"
	return r
}
Why it's slow: runtime.GC (runtime/mgc.go) runs a full collection and blocks the caller until it completes: STW start, concurrent mark, STW termination, sweep. The runtime would have run GC anyway once the GOGC=100 heap target was hit, so a manual call adds a cycle on top of the triggered ones, burning extra CPU and adding latency to the request that made the call.
Hint
Trust the pacer (`mgcpacer.go`). With `GOGC=100` it starts a cycle when the heap has grown to roughly twice the live heap left by the previous cycle. Manual calls don't free anything `GOGC=100` wouldn't.
Solution
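A minimal sketch of the handler without the forced collection, plus a startup helper for a soft memory limit. `Job`, `Result`, `process` and `ConfigureGC` are hypothetical names; `debug.SetMemoryLimit` is the standard library call and returns the previous limit.

```go
package main

import "runtime/debug"

// Job, Result and process are hypothetical stand-ins for the exercise's types.
type Job struct{ Payload []byte }

type Result struct{ Size int }

func process(j Job) Result { return Result{Size: len(j.Payload)} }

// HandleBigJob just returns; the pacer decides when the next cycle runs.
func HandleBigJob(j Job) Result {
	return process(j)
}

// ConfigureGC sets a soft memory limit in bytes and returns the previous
// limit. It is meant to be called once at startup; the right value depends
// on the workload and the container's memory budget.
func ConfigureGC(limitBytes int64) (previous int64) {
	return debug.SetMemoryLimit(limitBytes)
}
```

The same limit can be supplied without code through the `GOMEMLIMIT` environment variable.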
If memory pressure is the real concern, tune `GOGC` or `debug.SetMemoryLimit` (Go 1.19+).
**Benchmark before/after:**
**Why it's faster:** No forced mark/sweep. During a cycle the runtime budgets about 25% of GOMAXPROCS for background marking and paces cycles against the heap target; manual `GC()` calls bypass that pacing and run on the request critical path.
**When NOT:** Benchmarks where you want deterministic GC state: `runtime.GC()` before `b.ResetTimer()` is fine.
Exercise 7: Many short-lived []byte buffers
Difficulty: Middle
Skills: sync.Pool, runtime/mcache.go, allocator sizeclasses
func Render(w http.ResponseWriter, data Data) {
	buf := make([]byte, 0, 4096)
	buf = append(buf, "{\"id\":"...)
	buf = strconv.AppendInt(buf, data.ID, 10)
	buf = append(buf, ",\"name\":\""...)
	buf = append(buf, data.Name...)
	buf = append(buf, "\"}"...)
	w.Write(buf)
}
Why it's slow: Every call hits runtime.mallocgc for the 4 KB buffer (the 4096-byte size class). At 50k RPS that's 200 MB/s of garbage. The mcache refills from mcentral frequently; the heap grows under load; and although a []byte is pointer-free and its contents are never scanned, every dead buffer still has to be swept before its slot can be reused.
Hint
`sync.Pool` is purpose-built for this. P-local pool slot; `Get` is wait-free on the fast path.
Solution
var bufPool = sync.Pool{
	New: func() any { b := make([]byte, 0, 4096); return &b },
}

func Render(w http.ResponseWriter, data Data) {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0]
	// Store the (possibly grown) slice back before returning it to the pool.
	defer func() { *bp = buf; bufPool.Put(bp) }()
	buf = append(buf, "{\"id\":"...)
	buf = strconv.AppendInt(buf, data.ID, 10)
	buf = append(buf, ",\"name\":\""...)
	buf = append(buf, data.Name...)
	buf = append(buf, "\"}"...)
	w.Write(buf)
}
Exercise 8: fmt.Sprintf in the hot path
Difficulty: Middle
Skills: interface boxing, runtime.convT*, strconv.Append*
Why it's slow: Two allocations per call: (1) id boxes into interface{} via runtime.convT64 because Sprintf takes ...any; (2) the formatted result string. fmt also parses the format string at run time, type-switches on each argument (falling back to reflection for types it does not special-case), and dispatches on the verb before formatting the integer; none of that inlines through the variadic any path.
Hint
`strconv.AppendInt` writes into a `[]byte` you control. A stack array holds the result; one final `string()` conversion.
Solution
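A minimal sketch of the `strconv.AppendInt` approach from the hint. The function name `Key` and the `"user:"` prefix are assumptions; the exercise only tells us that an `int64` named `id` is being formatted.

```go
package main

import "strconv"

// Key builds "user:<id>" without fmt. The name and prefix are hypothetical;
// the assumed baseline is fmt.Sprintf("user:%d", id).
func Key(id int64) string {
	// 5 bytes of prefix plus at most 20 bytes for an int64 (sign included)
	// fit in 32, so append never has to grow past the stack array.
	var buf [32]byte
	out := append(buf[:0], "user:"...)
	out = strconv.AppendInt(out, id, 10)
	return string(out) // the only allocation left
}
```

If the caller can consume a `[]byte` directly (for example by writing it to an `io.Writer`), the final `string` conversion can go as well.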
**Benchmark before/after:**
**Why it's faster:** No `runtime.convT64`: `id` stays as `int64`, never boxed. No run-time dispatch over format verbs. Only the final `string(out)` allocation remains. Inner loop is `strconv.formatBits` writing into pre-sized memory.
**When NOT:** Format strings that change at runtime, or cold paths where readability beats 150 ns.
Exercise 9: Closure capture forcing a heap alloc
Difficulty: Middle
Skills: closure capture rules, escape analysis, defer-arg pattern
func Handle(req Req) Resp {
	start := time.Now() // captured below → escapes to heap
	defer func() {
		metrics.Observe("handle", time.Since(start))
	}()
	return process(req)
}
go build -gcflags="-m": moved to heap: start.
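Why it's slow: Semantically, Go closures capture variables by reference. When the compiler's escape analysis (not the runtime) decides the closure may outlive the frame, it moves start to the heap and allocates the closure environment alongside it: two allocations per call. Recent compilers capture never-reassigned variables by value and often keep a directly deferred closure on the stack, so confirm the -gcflags="-m" output on your toolchain before rewriting.
Hint
Pass `start` as an *argument* to the deferred function. Arguments are evaluated at `defer`-time and copied, so there is no capture and no escape.
Solution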
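A minimal sketch of the defer-argument pattern. `Req`, `Resp`, `process` and `observe` are hypothetical stand-ins; `observe` plays the role of `metrics.Observe` and simply records its last call.

```go
package main

import "time"

// Req, Resp, process and observe are hypothetical stand-ins for the
// exercise's types and for metrics.Observe.
type Req struct{}

type Resp struct{ OK bool }

var (
	lastName     string
	lastDuration = time.Duration(-1)
)

func observe(name string, d time.Duration) { lastName, lastDuration = name, d }

func process(Req) Resp { return Resp{OK: true} }

func Handle(req Req) Resp {
	// time.Now() is evaluated here, when the defer statement executes, and
	// passed as an argument; the function literal captures nothing.
	defer func(start time.Time) {
		observe("handle", time.Since(start))
	}(time.Now())
	return process(req)
}
```

The measured duration covers everything after the `defer` statement, so the statement belongs at the top of the function.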
**Benchmark before/after:**
**Why it's faster:** No closure environment on the heap. `start` is evaluated when the `defer` statement executes and copied into the deferred call's arguments on the goroutine stack. Since Go 1.14 a defer outside a loop can be open-coded: inlined at the return point, with no `runtime.deferproc` call.
**When NOT:** When you need the *current* value at defer-execution time, not at defer-statement time; capture is correct there.
Exercise 10: Spurious runtime.LockOSThread
Difficulty: Senior
Skills: scheduler M/P/G binding, work-stealing, when LockOSThread is required
func Worker(jobs <-chan Job) {
	runtime.LockOSThread() // copy-pasted "safety"
	defer runtime.UnlockOSThread()
	for j := range jobs { // pure Go work
		j.Run()
	}
}
Why it's slow: LockOSThread (runtime/proc.go) wires the goroutine to a specific OS thread (M). When the goroutine blocks, the runtime can't reuse the M for other work — it sits parked. If the locked goroutines block frequently, M's pile up and the runtime creates extras (newm → clone/pthread_create), wasting kernel resources.
Hint
`LockOSThread` is required only for: cgo callbacks needing stable TLS, OS APIs that bind to a thread (Linux namespaces, Windows GUI), or signal masks. Pure Go code never needs it.
Solution
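A minimal sketch of the worker with the thread pinning removed. `Job` and `JobFunc` are hypothetical stand-ins for the exercise's job type.

```go
package main

// Job and JobFunc are hypothetical stand-ins for the exercise's job type.
type Job interface{ Run() }

type JobFunc func()

func (f JobFunc) Run() { f() }

// Worker runs pure Go jobs until the channel is closed. Without
// LockOSThread the scheduler is free to move this goroutine between
// threads and to reuse its thread while the goroutine is blocked.
func Worker(jobs <-chan Job) {
	for j := range jobs {
		j.Run()
	}
}
```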
**Benchmark before/after:**
**Why it's faster:** The scheduler multiplexes goroutines onto the smallest set of M's. Blocked goroutines `gopark` without binding their M; `findrunnable` immediately gives the M new work.
**When NOT:** Cgo callbacks that store data in pthread TLS. Linux unshare-based sandboxing. OS GUI loops.
Exercise 11: GOMAXPROCS left at default in a cgroup-limited container
Difficulty: Senior
Skills: cgroup CPU quotas, GOMAXPROCS, uber-go/automaxprocs
// No automaxprocs. GOMAXPROCS = NumCPU() = 64 on a 64-core host.
func main() { http.ListenAndServe(":8080", nil) }
Run with docker run --cpus=2 ....
Why it's slow: Go reads runtime.NumCPU() from sched_getaffinity and sets GOMAXPROCS=64. The Linux CFS scheduler enforces the 2-CPU quota by throttling: when the cgroup's CPU bucket empties, all 64 threads pause until the next CFS period (100 ms). GC pacer math is also wrong — it computes assists assuming 64-way parallelism.
Hint
Import `go.uber.org/automaxprocs` for a zero-effort fix: it reads the cgroup quota at startup. Go 1.25+ does this automatically.
Solution
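The one-line fix named in the hint is a blank import of `go.uber.org/automaxprocs`. To show what such a library has to compute, here is a simplified standard-library sketch that handles only cgroup v2. The function names, the `/sys/fs/cgroup/cpu.max` path and the round-down policy are assumptions for illustration, not the library's implementation.

```go
package main

import (
	"os"
	"runtime"
	"strconv"
	"strings"
)

// procsFromCPUMax parses the contents of a cgroup v2 cpu.max file, which is
// either "<quota> <period>" or "max <period>", and returns the whole number
// of CPUs the quota allows (rounded down, minimum 1). ok is false when there
// is no limit or the input is malformed.
func procsFromCPUMax(contents string) (n int, ok bool) {
	fields := strings.Fields(contents)
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.ParseInt(fields[0], 10, 64)
	period, err2 := strconv.ParseInt(fields[1], 10, 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return 0, false
	}
	n = int(quota / period)
	if n < 1 {
		n = 1
	}
	return n, true
}

// ApplyCgroupLimit lowers GOMAXPROCS to the cgroup quota when one is set.
// The path is the usual cgroup v2 location inside a container (assumed).
func ApplyCgroupLimit() {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		return // not cgroup v2, or not readable: keep the default
	}
	if n, ok := procsFromCPUMax(string(data)); ok && n < runtime.NumCPU() {
		runtime.GOMAXPROCS(n)
	}
}
```

With `docker run --cpus=2`, cpu.max typically reads `200000 100000`, which this sketch maps to 2. A production service should prefer the library (or Go 1.25+), which also covers cgroup v1 and nested hierarchies.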
Or set explicitly: `runtime.GOMAXPROCS(2)`.
**Benchmark before/after:**
**Why it's faster:** No oversubscription. The 2 P's match the 2 real cores; no CFS throttling, no thrashing runqueues across phantom P's. GC pacer computes assist credit correctly.
**When NOT:** Containers with no CPU limit, or standalone servers using all cores.
Exercise 12: GC pressure from short-lived objects in a hot loop
Difficulty: Senior
Skills: sync.Pool, GC pacer, GODEBUG=gctrace=1, runtime/mgc.go
type Record struct {
	Fields [20]string
	Tags   []string
	Body   []byte
}

func Process(lines <-chan []byte) {
	for line := range lines {
		r := &Record{}
		json.Unmarshal(line, r)
		emit(r)
	}
}
Why it's slow: Each Record heap-allocates (escapes through emit). At 200k/sec × 2.4 KB = 480 MB/s of garbage. The pacer triggers a cycle every heap doubling; under steady high allocation rate, GC runs constantly. Mark assists run on the allocating goroutine, eating request time.
Hint
Pool the `Record`. `sync.Pool` is GC-aware: pooled objects are dropped at GC but typically survive long enough that allocation pressure plummets.
Solution
var recordPool = sync.Pool{
	New: func() any {
		return &Record{Tags: make([]string, 0, 8), Body: make([]byte, 0, 256)}
	},
}

func Process(lines <-chan []byte) {
	for line := range lines {
		r := recordPool.Get().(*Record)
		r.reset()
		json.Unmarshal(line, r)
		emit(r)
		recordPool.Put(r) // assumes emit copies what it needs
	}
}

// reset clears the previous contents but keeps the slices' capacity.
func (r *Record) reset() {
	for i := range r.Fields {
		r.Fields[i] = ""
	}
	r.Tags = r.Tags[:0]
	r.Body = r.Body[:0]
}
Exercise 13: runtime.SetFinalizer for resource cleanup
Difficulty: Senior
Skills: runtime/mfinal.go, finalizer queue, deterministic Close()
type Conn struct{ sock net.Conn }

func Dial(addr string) (*Conn, error) {
	s, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	c := &Conn{sock: s}
	runtime.SetFinalizer(c, func(c *Conn) { c.sock.Close() })
	return c, nil
}
Why it's slow: Finalizers (runtime/mfinal.go) run on a dedicated finalizergoroutine after the object's next GC cycle determines it unreachable. They double the object's GC lifetime: cycle N marks dead, cycle N+1 actually frees. Under low allocation rate, cycles are minutes apart. Sockets leak past the FD ulimit.
Hint
Finalizers are a safety net at best. Explicit `Close()` plus `defer` is the production pattern. Reserve finalizers to catch forgotten-close in *tests*.
Solution
type Conn struct{ sock net.Conn }

func Dial(addr string) (*Conn, error) {
	s, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	return &Conn{sock: s}, nil
}

func (c *Conn) Close() error { return c.sock.Close() }

// Usage (inside a function that returns an error):
c, err := Dial(addr)
if err != nil {
	return err
}
defer c.Close()
Exercise 14: Cgo calls in a hot loop
Difficulty: Senior
Skills: runtime/cgocall.go, entersyscall, batching
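/*
#include "hash.h"
*/
import "C"

// HashAll crosses the Go/C boundary once per record.
// It assumes every record is non-empty (&r[0] panics otherwise).
func HashAll(records [][]byte) []uint64 {
	out := make([]uint64, len(records))
	for i, r := range records {
		out[i] = uint64(C.hash((*C.char)(unsafe.Pointer(&r[0])), C.int(len(r))))
	}
	return out
}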
Why it's slow: Each C.hash goes through runtime.cgocall (cgocall.go): entersyscall detaches the goroutine's M from its P (so the P runs other goroutines), switch to the C stack, execute, then exitsyscall reattaches. Overhead is ~150–300 ns per call regardless of the C work; for tiny C bodies the runtime is doing more than C.
Hint
One cgo call hashing 10k records is hundreds of times cheaper than 10k cgo calls.
Solution
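/*
#include "hash.h"
void hash_batch(const char** ptrs, const int* lens, int n, unsigned long long* out);
*/
import "C"

import (
	"runtime"
	"unsafe"
)

// HashAll hashes every record with a single cgo call.
// It assumes every record is non-empty (&r[0] panics otherwise).
func HashAll(records [][]byte) []uint64 {
	n := len(records)
	out := make([]uint64, n)
	if n == 0 {
		return out // &ptrs[0] would panic on an empty batch
	}
	ptrs := make([]*C.char, n)
	lens := make([]C.int, n)
	// ptrs itself holds Go pointers, so each record is pinned for the duration
	// of the call (runtime.Pinner, Go 1.21+) to satisfy the cgo pointer rules.
	var pinner runtime.Pinner
	defer pinner.Unpin()
	for i, r := range records {
		pinner.Pin(&r[0])
		ptrs[i] = (*C.char)(unsafe.Pointer(&r[0]))
		lens[i] = C.int(len(r))
	}
	C.hash_batch(&ptrs[0], &lens[0], C.int(n), (*C.ulonglong)(unsafe.Pointer(&out[0])))
	return out
}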
Exercise 15: reflect-based field copy in a loop
Difficulty: Senior
Skills: reflect cost, cached type plans, code generation
func Copy(dst, src any) {
	dv := reflect.ValueOf(dst).Elem()
	sv := reflect.ValueOf(src).Elem()
	for i := 0; i < sv.NumField(); i++ {
		name := sv.Type().Field(i).Name
		df := dv.FieldByName(name) // O(N) name lookup
		if df.IsValid() && df.CanSet() {
			df.Set(sv.Field(i))
		}
	}
}
Why it's slow: FieldByName is O(N) per lookup: it walks the struct's field table with string compares, so the whole loop is O(N²). reflect.Value is 24 B, returned by value; storing it through an interface boxes it. Set re-checks settability and assignability on every call before copying the value.
Hint
Cache the field offset map per (src, dst) type pair on first call. Subsequent calls are O(1) lookups + direct memory writes.
Solution
type fieldCopy struct{ srcOff, dstOff, size uintptr }
type copyPlan struct{ fields []fieldCopy }

var planCache sync.Map // map[[2]reflect.Type]*copyPlan

func Copy[D, S any](dst *D, src *S) {
	key := [2]reflect.Type{reflect.TypeOf(*dst), reflect.TypeOf(*src)}
	pv, ok := planCache.Load(key)
	if !ok {
		pv = buildPlan(key)
		planCache.Store(key, pv)
	}
	plan := pv.(*copyPlan)
	dPtr, sPtr := unsafe.Pointer(dst), unsafe.Pointer(src)
	for _, f := range plan.fields {
		copyBytes(unsafe.Add(dPtr, f.dstOff), unsafe.Add(sPtr, f.srcOff), f.size)
	}
}
Exercise 16: select with time.After leaking timers under context cancel
Difficulty: Senior
Skills: timer heap, time.NewTimer + Stop, ctx-aware patterns
func Wait(ctx context.Context, d time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // timer leaks until d fires
	case <-time.After(d):
		return nil
	}
}
Why it's slow: time.After schedules a timer firing at now+d. If ctx cancels first, the select returns but, before Go 1.23, the timer sits in the runtime's per-P timer heap (siftup/siftdown in time.go) until d elapses. checkTimers scans the heap every scheduler tick; with thousands of dead timers, scan time dominates idle CPU. In modules that declare go 1.23 or later, an unreferenced timer becomes collectible as soon as the select returns, which removes the leak but not the per-call allocation.
Hint
`time.NewTimer` returns a `*Timer` you can `Stop`. Always pair creation with a `Stop` on the cancel path.
Solution
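A minimal sketch of the `NewTimer` + `Stop` pattern from the hint. The `demo` function is a hypothetical usage example that exercises both the timeout path and the cancel path.

```go
package main

import (
	"context"
	"time"
)

// Wait blocks for d or until ctx is done, whichever comes first.
func Wait(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop() // releases the timer when ctx wins the race
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// demo is a hypothetical usage example: a short wait that times out
// normally, then a long wait whose context is already canceled.
func demo() (timedOut, canceled error) {
	timedOut = Wait(context.Background(), time.Millisecond)
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	canceled = Wait(ctx, time.Hour)
	return timedOut, canceled
}
```

The deferred `Stop` is a no-op when the timer has already fired, so one `defer` covers both branches.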
**Benchmark before/after:**
**Why it's faster:** `t.Stop()` calls `runtime.stopTimer` (`runtime/time.go`), which removes the timer from the heap on next `checkTimers` scan. Memory and CPU stop accumulating. Under high cancel rate this is the difference between a memory-stable service and one OOMing in an hour.
**When NOT:** When `d` is sub-millisecond: the leak window is too small to matter. Code where `ctx` never cancels mid-wait.
Exercise 17: Map of large structs copies on every lookup
Difficulty: Middle
Skills: runtime/map.go, value vs pointer values, cache locality
type Position struct {
	Symbol   [32]byte
	Quantity int64
	Price    float64
	// ... 200+ more bytes (256 B total)
}

var book = map[uint64]Position{}

func Get(id uint64) (Position, bool) {
	p, ok := book[id]
	return p, ok
}
Why it's slow: The lookup (runtime/map.go) finds the bucket and returns a pointer to the value, and the compiler emits a memcpy of sizeof(Position) = 256 B into the caller's p. Values over 128 B also fall outside the mapaccess2_fast64 specialization: the generic runtime.mapaccess2 runs, and each bucket slot holds a pointer to a separately allocated 256 B value, so a hit costs an extra cache miss before the copy.
Hint
Store pointers. Lookup returns 8 B; allocation happens once on insert.
Solution
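A minimal sketch of the pointer-valued map. The padding field stands in for the elided fields so the struct is 256 B, and `Put` is a hypothetical insert helper added for illustration.

```go
package main

// Position mirrors the exercise's struct; the blank padding field stands in
// for the elided fields so the total size is 256 bytes.
type Position struct {
	Symbol   [32]byte
	Quantity int64
	Price    float64
	_        [208]byte
}

var book = map[uint64]*Position{}

// Get returns the stored pointer: 8 bytes copied instead of 256.
func Get(id uint64) (*Position, bool) {
	p, ok := book[id]
	return p, ok
}

// Put is a hypothetical insert helper. It stores a private copy, so the one
// heap allocation per entry happens here, at insert time.
func Put(id uint64, p Position) {
	cp := p
	book[id] = &cp
}
```

Like the original, this sketch is not safe for concurrent use; callers that share `book` across goroutines still need their own synchronization.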
Callers that need a defensive copy do `*p` themselves; most callers just read fields.
**Benchmark before/after:**
**Why it's faster:** 8-byte pointer copy vs 256-byte value copy. With 8-byte values the lookup stays on the `mapaccess2_fast64` path, and each bucket is 8 tophash bytes, 8 keys, 8 pointers and an overflow pointer (about 144 B), so a probe touches a few cache lines at most. The returned value is small enough to live in a register. Downstream code dereferences only the fields it needs.
**When NOT:** Tiny values (≤ 32 B) where the pointer is no smaller. Maps where callers mutate returned values: pointer aliasing causes spooky action at a distance.
19. When NOT to optimize
Runtime overhead dominates only when the per-operation cost rivals real work. A CLI tool that runs once per minute, a startup-only config loader, a test fixture — none benefit from arena allocation, sharded locks, or pooling. time.After leak (Ex. 16) doesn't matter if your service handles 10 RPS; runtime.LockOSThread (Ex. 10) is a non-issue with 4 workers and 1000 RPS.
Profile first. Runtime overhead has recognizable signatures in go tool pprof:
- runtime.newproc1, runtime.malg → Ex. 1 (excess goroutine spawn)
- runtime.chansend1, runtime.chanrecv → Ex. 2 (channel over atomic)
- sync.runtime_SemacquireMutex → Ex. 3 (mutex contention)
- runtime.mallocgc on a hot stack → Ex. 4, 7, 12 (escape, pooling)
- runtime.startTimer, runtime.checkTimers → Ex. 5, 16 (timer reuse, Stop on cancel)
- runtime.convT64, runtime.convT* → Ex. 8 (interface boxing in fmt)
- runtime.deferproc, closure allocations → Ex. 9 (defer-arg pattern)
- runtime.cgocall, runtime.exitsyscall → Ex. 14 (cgo batching)
- runtime.mapaccess* returning large values → Ex. 17 (pointers in maps)
Common premature optimizations: pooling Record (Ex. 12) when the service does 100 RPS; sharding the mutex (Ex. 3) on a low-QPS service; batching cgo (Ex. 14) when each call is already milliseconds of C work; replacing fmt.Sprintf (Ex. 8) in error and log paths.
Correctness gaps disguised as optimizations: sync.Pool (Ex. 7, 12) reused after Put → use-after-free; defer t.Stop() (Ex. 5, 16) where t.C was already drained → second drain blocks forever; sharded counter (Ex. 3) summed non-atomically → torn reads; pointer-in-map (Ex. 17) mutated through one alias and read through another → data race; reflect plan cache (Ex. 15) keyed on reflect.Type across plugin reloads → stale offsets; cgo batch (Ex. 14) where C holds Go pointers past return → memory model violation; closure replaced by defer-arg (Ex. 9) where the captured value was supposed to be the current one at defer time.
20. Summary
Always-ship wins: inline trivial goroutine bodies (Ex. 1); atomic over channel for counters (Ex. 2); pass start as a defer arg (Ex. 9); never call runtime.GC() in production (Ex. 6); never runtime.LockOSThread unless you need TLS or thread-pinned OS APIs (Ex. 10); time.NewTimer + Stop over time.After in any loop or cancelable wait (Ex. 5, 16); explicit Close() over finalizers (Ex. 13); automaxprocs in any containerized service (Ex. 11); pointers in maps when values exceed ~64 B (Ex. 17).
Wins behind a profile: shard mutexes (Ex. 3, when semacquire shows); pool buffers and structs (Ex. 7, 12, when mallocgc shows); replace fmt.Sprintf with strconv.Append* (Ex. 8, when convT* shows); batch cgo (Ex. 14, when cgocall shows per-record); cache reflect plans or use codegen (Ex. 15, when reflect methods show); refactor to avoid escape (Ex. 4, when mallocgc shows on a leaf and -gcflags="-m" confirms).
Specialty: hand-written arena allocators for parser/AST workloads with millions of nodes; custom lock-free queues for SPSC ring buffers; pinned-OS-thread workers for hardware syscalls (io_uring, perf_event); custom GC tuning via debug.SetMemoryLimit and tuned GOGC for batch jobs with large persistent heaps.
The runtime is fast — extraordinarily so, given what it does. Most overhead is self-inflicted: spawning goroutines for nanoseconds of work, choosing channels where atomics suffice, allocating where stacks would do, paying for time.After you forgot to stop. Read runtime/proc.go, runtime/chan.go, runtime/malloc.go, runtime/time.go, and runtime/cgocall.go once — three hours of source-diving — and you'll spot these patterns in your own code for the rest of your career. Profile, find the signature, apply the matching exercise. The runtime's hot paths are short and well-engineered; the user code calling them is where the wins live.