Runtime Source — Optimization
1. How to use this file
Fourteen scenarios where idiomatic Go talks to the runtime in a way that looks cheap but isn't. Each entry has a Before (code + benchmark) and a collapsible After (optimized code + benchmark + why + trade-offs + when NOT). The justification for every fix lives in src/runtime/*.go — the entries cite the functions and slow paths you'd land on with go tool pprof or a runtime/trace.
Anchored at Go 1.23, amd64, GOGC=100. Numbers are reproducible-shape — run go test -bench=. -benchmem on your hardware before quoting them. Runtime cost is dominated by six things: scheduler entry/exit (runtime.gopark/goready), per-G heap traffic (runtime.mallocgc), STW transitions (stopTheWorld), timer-heap mutations (addtimer), defer-chain construction (deferproc/deferreturn), and iface conversion (convT*). Most wins remove one of those six from the hot path. Reading order: Ex. 1, 2, 4, 11, then any order. Ex. 3, 9, 13 are the ones most senior reviews flag.
2. Exercise 1 — Channel send in a hot loop where atomic would do
A producer increments a shared count by sending 1 on a buffered chan to a reducer goroutine. Each send goes through runtime.chansend, which takes hchan.lock, copies the element, and may wake a reducer parked on the channel's recvq. For a plain counter, every one of those steps is wasted.
type Counter struct{ ch chan int64 }
func NewCounter() *Counter {
	c := &Counter{ch: make(chan int64, 1024)}
	go func() {
		for v := range c.ch {
			_ = v // reduce
		}
	}()
	return c
}
func (c *Counter) Inc() { c.ch <- 1 }
After
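A minimal sketch of the atomic replacement for this exercise, assuming the counter only ever needs a running total. The `Load` method is an addition for illustration and does not appear in the Before code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Counter is a lock-free tally. The zero value is ready to use.
type Counter struct{ n atomic.Int64 }

// Inc adds one with a single atomic read-modify-write: no channel,
// no mutex, no scheduler involvement.
func (c *Counter) Inc() { c.n.Add(1) }

// Load returns the current total (hypothetical helper for readers).
func (c *Counter) Load() int64 { return c.n.Load() }

func main() {
	var c Counter
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Inc()
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.Load()) // 8 goroutines x 1000 increments = 8000
}
```

Because `atomic.Int64` must not be copied after first use, pass `*Counter` around rather than `Counter`.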
`atomic.Int64.Add`: one `LOCK XADD`, no lock, no scheduler involvement. ~43× faster.
**Why faster:** `c.ch <- 1` compiles to `runtime.chansend1`, which calls `chansend` in `runtime/chan.go`. `chansend` takes the channel mutex with `lock(&c.lock)` and then does one of three things: hands the value straight to a parked receiver (`send` → `sendDirect`, then `goready`), copies it into the ring buffer with `typedmemmove` and bumps `qcount`, or, if the buffer is full, parks the sender with `gopark`. Even the cheapest of these, the buffered path with no waiter, acquires and releases `hchan.lock`, so every producer core contends on the same lock. An `atomic.Int64.Add` is a single `LOCK XADD`: no scheduler entry and no `hchan` lock traffic, although the counter's own cache line still moves between cores under contention.
**Trade-off:** You lose the reducer goroutine as a sequencing point: if `Inc` callers needed ordering with other channel ops, the atomic does not provide it. You also lose flow control, because the channel's buffer was implicit backpressure. For counters, neither matters; for queues, they do.
**When NOT:** When the chan delivers a real value (struct, pointer), not a tally. When you need the reducer to run a side effect once per increment (logging, batching). When the chan is the synchronization point with `select` elsewhere.
3. Exercise 2 — select { case x := <-ch: ... default: } in a spin loop
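A worker polls a chan with select … default inside a for to "check if there's work, otherwise spin." A single receive case plus default compiles to a non-blocking receive (runtime.selectnbrecv); with two or more cases plus default, each iteration goes through runtime.selectgo, one of the most expensive operations in the runtime. Either way the loop falls through default and burns a full CPU whenever there is no work.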
After
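A sketch of the blocking form for this exercise, assuming the worker needs both a job channel and a stop signal. The names `worker`, `jobs`, `stop`, and `handle` are illustrative, not taken from the original code.

```go
package main

import "fmt"

// worker blocks until a job arrives or stop is closed. While neither
// channel is ready the goroutine is parked and uses no CPU.
func worker(jobs <-chan int, stop <-chan struct{}, handle func(int)) {
	for {
		select { // no default: this select parks instead of spinning
		case j, ok := <-jobs:
			if !ok {
				return // jobs closed, nothing more to do
			}
			handle(j)
		case <-stop:
			return
		}
	}
}

func main() {
	jobs := make(chan int)
	stop := make(chan struct{})
	done := make(chan struct{})
	sum := 0
	go func() {
		defer close(done)
		worker(jobs, stop, func(j int) { sum += j })
	}()
	for i := 1; i <= 3; i++ {
		jobs <- i // unbuffered: returns once the worker has received i
	}
	close(stop)
	<-done // the worker has exited, so reading sum is race-free
	fmt.Println(sum) // 1 + 2 + 3 = 6
}
```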
Block on the receive. If you also need to watch a stop signal, add it as a second `case`; that is still a *blocking* select. Throughput unchanged; CPU at idle drops to zero. On a 16-core machine with 16 spinners, CPU usage drops from 1600% to ~0%.
**Why faster:** A spinning goroutine never parks, so it holds its P and M at 100% whether or not work exists. The per-iteration cost depends on the select's shape. With one case plus `default`, the compiler emits a non-blocking channel operation (`selectnbrecv`), which is cheap per call but runs back to back forever. With several cases, `runtime/select.go::selectgo` builds a randomized `pollorder` and a sorted `lockorder`, locks every involved channel (`sellock`), polls each case, and only then takes the `default` branch and unlocks. That multi-channel lock dance is what makes `selectgo` ~10× more expensive than a plain chan op, and the spin pays it on every iteration. A blocking receive or select instead enqueues the G on the channel's `recvq` and calls `gopark`; the M drops the G and the P picks up other runnable work or goes idle. No CPU burns.
**Trade-off:** You lose the ability to "do something else while waiting" within the same goroutine. If "something else" is real work, use a second goroutine and let the scheduler park each.
**When NOT:** When you genuinely have non-blocking work to perform on each iteration (a tick plus a chan check is `select` + `time.NewTicker`, not `select` + `default`). When you're in a `runtime.LockOSThread`'d goroutine where parking is undesirable, although a chan is probably the wrong tool there too.
4. Exercise 3 — time.After in a select loop
A handler uses time.After(5*time.Second) inside a for { select { ... } } loop. Every iteration creates a new timer and a new channel. Before Go 1.23 an abandoned timer also stayed in the runtime's timer heap until it fired; since Go 1.23 an unreferenced timer can be collected immediately, but each iteration still allocates a fresh timer and channel.
for {
	select {
	case msg := <-ch:
		handle(msg)
	case <-time.After(5 * time.Second): // new timer on every iteration
		return // idle timeout
	}
}
After
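A sketch of the single-timer loop for this exercise. The function name `drainUntilIdle` and its signature are assumptions made for illustration; the Stop-then-drain step is written so that it is correct under pre-1.23 timer semantics and harmless under Go 1.23 and later.

```go
package main

import (
	"fmt"
	"time"
)

// drainUntilIdle handles messages until none arrives for idle, or ch is
// closed. It returns the number of messages handled.
func drainUntilIdle(ch <-chan string, idle time.Duration, handle func(string)) int {
	t := time.NewTimer(idle) // one timer for the whole loop
	defer t.Stop()
	n := 0
	for {
		select {
		case msg, ok := <-ch:
			if !ok {
				return n
			}
			handle(msg)
			n++
			// Re-arm the same timer. If it already fired, remove any
			// stale tick without blocking before resetting.
			if !t.Stop() {
				select {
				case <-t.C:
				default:
				}
			}
			t.Reset(idle)
		case <-t.C:
			return n // idle timeout
		}
	}
}

func main() {
	ch := make(chan string, 3)
	ch <- "a"
	ch <- "b"
	ch <- "c"
	// Three messages are handled, then the idle timeout ends the loop.
	fmt.Println(drainUntilIdle(ch, 50*time.Millisecond, func(string) {}))
}
```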
One `time.Timer` outside the loop. `Reset` after each receive; `Stop` on exit. ~7.7× faster, allocation eliminated.
**Why faster:** `time.After` is a thin wrapper, `func After(d Duration) <-chan Time { return NewTimer(d).C }`, so each call constructs a fresh `*Timer`, a fresh chan, and a fresh runtime timer, then arms it. Arming takes the owning P's timer lock and sifts the timer up that P's 4-ary heap, which is O(log N) in the number of timers on the P; if the new timer is now the earliest, the runtime may also wake a sleeping thread so the scheduler re-evaluates its next deadline (there has been no dedicated timer goroutine since Go 1.14). In releases before Go 1.23 this path is `startTimer` → `addtimer` → `siftupTimer` in `runtime/time.go`; Go 1.23 reorganized that file, so the names in a profile differ, but the shape is roughly the same. `Reset` reuses the existing timer: it updates the `when` field and re-sifts in place, with no new chan and no new allocation.
**Trade-off:** `Reset`'s contract depends on the Go version. Before Go 1.23 the "drain `t.C` only if `Stop` returned false" dance is required and famously bug-prone. From Go 1.23 on (for modules that declare `go 1.23` or later) the timer channel is unbuffered, and `Stop` and `Reset` guarantee that no stale value is delivered afterwards, so the drain is unnecessary; it is still needed if the code must also build with older toolchains or runs with `GODEBUG=asynctimerchan=1`. Wrap the logic in a helper, or use `time.AfterFunc` for fire-and-forget patterns.
**When NOT:** When the loop iterates rarely (once per minute), the allocation noise is invisible. A deadline measured from each message's arrival is not an exception: you still `Reset` after every receive, so the reused timer still wins.
5. Exercise 4 — defer Unlock() in a tight loop
A function takes a mutex, mutates a map, and releases the mutex. It is called 10M times per second on a hot path. When the defer cannot be open-coded it costs ~30 ns per call for a runtime-managed defer record. Go 1.14+ open-coded defers remove most of that cost, but only when the compiler's conditions hold, and never for a defer inside a loop.
After
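A sketch of the manual-unlock shape for this exercise. `Registry`, `Bump`, and `Get` are hypothetical names; the point is that the critical section is a single map operation with no panic path between `Lock` and `Unlock`.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a mutex-guarded map (illustrative type).
type Registry struct {
	mu sync.Mutex
	m  map[string]int
}

// NewRegistry guarantees a non-nil map, which is what makes the
// defer-free critical sections below panic-free.
func NewRegistry() *Registry { return &Registry{m: make(map[string]int)} }

// Bump increments a key without defer.
func (r *Registry) Bump(key string) {
	r.mu.Lock()
	r.m[key]++
	r.mu.Unlock()
}

// Get copies the value out, unlocks, and only then returns.
func (r *Registry) Get(key string) (int, bool) {
	r.mu.Lock()
	v, ok := r.m[key]
	r.mu.Unlock()
	return v, ok
}

func main() {
	r := NewRegistry()
	r.Bump("hits")
	r.Bump("hits")
	fmt.Println(r.Get("hits")) // 2 true
}
```

If the body ever grows a call that can panic, switching back to `defer r.mu.Unlock()` is the safer default.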
Manual `Unlock`. Wrap the body in a helper if it has multiple early returns. ~3.3× faster.
**Why faster:** On the slow path, `runtime/panic.go::deferproc` records the deferred call in a `_defer` struct linked off the G, and `deferreturn` walks that chain at function return and invokes each entry. Go 1.13 moved most `_defer` records to the stack, and Go 1.14 added open-coded defers: when the defers in a function are few and statically known, the compiler emits the deferred calls inline at each return, guarded by a bitmask, with no `_defer` record at all. **But** open-coding is disabled if any defer sits in a loop, if the function has more than 8 defers, or if the number of return statements times the number of defers exceeds 15. For a lone `defer mu.Unlock()` the open-coded path usually applies, and the overhead drops from ~30 ns to ~5-8 ns, versus ~1 ns for a direct call. In a function called 10M times per second, ~5 ns × 10M is 50 ms of CPU per second spent on bookkeeping. Manual `Unlock` is a single call, and the compiler can often inline the fast path of `(*Mutex).Unlock`, an `atomic.AddInt32(&m.state, -mutexLocked)`.
**Trade-off:** Panic-safety. If anything between `Lock` and `Unlock` panics, the mutex is never released and subsequent acquirers deadlock. Mitigate by keeping the critical section truly trivial (one map op) and checking in code review that nothing inside it can panic.
**When NOT:** When the critical section can panic (arithmetic, slice indexing under uncertain length, calls into user code). When the function has multiple return paths and `defer` is the readability win. When the function is called < 1k times/sec, where the few nanoseconds saved are invisible.
6. Exercise 5 — Goroutine per item
A request handler fans out one goroutine per row in a 10k-row batch to call an enrichment service. The scheduler creates 10k Gs, runs them, and lets them die — 10k × ~2 KB stack + scheduler bookkeeping per request.
func Enrich(rows []Row) []Enriched {
	out := make([]Enriched, len(rows))
	var wg sync.WaitGroup
	for i, r := range rows {
		wg.Add(1)
		go func(i int, r Row) { // one goroutine per row
			defer wg.Done()
			out[i] = enrich(r)
		}(i, r)
	}
	wg.Wait()
	return out
}
After
Fixed worker pool sized to `GOMAXPROCS` (or `min(GOMAXPROCS, len(rows))`). Feed work through a chan; workers loop.
func Enrich(rows []Row) []Enriched {
	out := make([]Enriched, len(rows))
	workers := runtime.GOMAXPROCS(0)
	if workers > len(rows) {
		workers = len(rows)
	}
	type job struct {
		i int
		r Row
	}
	ch := make(chan job, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range ch { // each worker writes only its own out[j.i]
				out[j.i] = enrich(j.r)
			}
		}()
	}
	for i, r := range rows {
		ch <- job{i, r}
	}
	close(ch) // lets the workers' range loops finish
	wg.Wait()
	return out
}
7. Exercise 6 — Manual runtime.GC() to "free memory now"
A long-lived server calls runtime.GC() at the end of each request "to keep memory low." Every call forces a complete collection cycle, including its two stop-the-world phases, and blocks the calling handler until the cycle finishes.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	serve(w, r)
	runtime.GC() // "to keep memory bounded"
}
After
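A sketch of the replacement for this exercise: configure the collector once at startup instead of forcing cycles per request. `configureGC` is a hypothetical helper, and the 512 MiB limit is an arbitrary example value; `debug.SetMemoryLimit` and `debug.SetGCPercent` are the programmatic counterparts of the `GOMEMLIMIT` and `GOGC` environment variables.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// configureGC sets a soft memory limit and a GC percentage, and returns
// the previous settings so callers can log or restore them.
func configureGC(limitBytes int64, gcPercent int) (prevLimit int64, prevPercent int) {
	prevLimit = debug.SetMemoryLimit(limitBytes)
	prevPercent = debug.SetGCPercent(gcPercent)
	return prevLimit, prevPercent
}

func main() {
	// Once, at startup. No runtime.GC() calls on the request path.
	prevLimit, prevPercent := configureGC(512<<20, 100)
	fmt.Println("previous limit:", prevLimit, "previous GOGC:", prevPercent)
}
```

Setting `GOMEMLIMIT=512MiB GOGC=100` in the environment should have the same effect without a code change.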
Remove it. Tune `GOGC` (default 100: the next cycle starts when the heap has roughly doubled since the last one; set it lower for tighter memory, higher for less GC work) or set `GOMEMLIMIT` if you have a hard cap. ~600× faster on the request path. Peak memory stays bounded via `GOMEMLIMIT`.
**Why faster:** `runtime.GC()` drives a full cycle through `runtime/mgc.go::gcStart`: a stop-the-world to enter mark setup, concurrent mark on the background workers (`gcBgMarkWorker`), then a second stop-the-world for mark termination, and the caller blocks until all of it is done. The first STW is brief (microseconds), the concurrent mark runs alongside the program but imposes a write-barrier tax (`gcWriteBarrier`), and the second STW is bounded but real. Calling `runtime.GC()` per request forces a full cycle on every request and defeats the purpose of the concurrent collector. Left alone, the runtime triggers GC when the heap reaches its `GOGC` target (or approaches `GOMEMLIMIT`), runs it concurrently, and amortizes the cost over many requests. `debug.SetMemoryLimit` lets you pin a ceiling without forcing cycles. `runtime.GC()` exists for benchmarks and tests where you need a known heap state, not for production.
**Trade-off:** Without manual GC, peak heap is about `2 × live_set` at the default setting (in general `GOGC=X` ⇒ `(1+X/100) × live_set`). If you need a tighter peak, set `GOMEMLIMIT`: the runtime adjusts its trigger to respect the cap, paying CPU rather than RAM.
**When NOT:** Right before a long benchmark you want to start with a clean heap. Right before a memory-pressure test to force a known state. After releasing a *huge* one-shot dataset (10 GB load-then-process) where the natural trigger would happen seconds later, though even then prefer setting `GOGC` smaller in the loaded section.
8. Exercise 7 — runtime.Gosched() "to give others a chance"
A CPU-heavy goroutine sprinkles runtime.Gosched() calls "to be fair." Each call enters the scheduler, looks for runnable work, and either yields or returns. On a system with no other runnable goroutines, it's pure overhead.
func compute(data []float64) float64 {
	sum := 0.0
	for i, v := range data {
		sum += math.Sqrt(v) * math.Sin(v)
		if i%1000 == 0 {
			runtime.Gosched() // "be nice"
		}
	}
	return sum
}
After
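For completeness, the same loop as in the Before code with the yield removed; nothing else changes. This is only a sketch of the shape, relying on the runtime's own preemption for fairness.

```go
package main

import (
	"fmt"
	"math"
)

// compute is the Before loop without runtime.Gosched. On Go 1.14+ the
// runtime can preempt this loop asynchronously if it runs too long.
func compute(data []float64) float64 {
	sum := 0.0
	for _, v := range data {
		sum += math.Sqrt(v) * math.Sin(v)
	}
	return sum
}

func main() {
	data := make([]float64, 1_000_000)
	for i := range data {
		data[i] = float64(i)
	}
	fmt.Println(compute(data))
}
```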
Remove the call. The runtime preempts long-running goroutines on its own (Go 1.14+ asynchronous preemption). ~1.7× faster.
**Why faster:** `runtime.Gosched` switches to the scheduler stack and runs `runtime/proc.go::gosched_m`, which flips the G from `_Grunning` to `_Grunnable`, detaches it from the M (`dropg`), puts it on the *global* run queue (`globrunqput`), and re-enters `schedule()` to find work. Each call costs ~100-200 ns and pushes your loop out of the instruction cache, because scheduler code is cold relative to a math loop. Going through the global queue also defeats the per-P run queue's affinity: the next G to run on this P likely isn't yours. Since Go 1.14 none of this is needed for fairness. `sysmon`, a dedicated runtime thread, notices a G that has run for more than 10 ms and signals its thread; the signal handler injects a call to `asyncPreempt`, so the runtime can interrupt even a loop that makes no function calls. Manual `Gosched` is a leftover from before Go 1.14, when such a loop could not be preempted and could starve other goroutines.
**Trade-off:** None for CPU-bound code. The only place `Gosched` is occasionally defensible is inside a `for { /* spin */ }` waiting for a condition you can't express with an atomic or a chan, and such code should be restructured.
**When NOT:** When a spin-wait you cannot restructure yet depends on the yield to let the goroutine it is waiting for run. `runtime.LockOSThread`'d code is not an exception: a locked thread runs only its own G, so yielding shares nothing.
9. Exercise 8 — runtime.LockOSThread for no reason
A goroutine calls runtime.LockOSThread() "to make sure it stays on the same thread for performance." It does no cgo, no syscalls that require thread affinity (like X11 or thread-local storage in C libs), and no OS-thread-specific state.
func worker() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for job := range jobs {
		process(job) // pure Go, no C, no thread-locals
	}
}
After
Remove both calls. ~1.8× faster.
**Why faster:** `runtime/proc.go::dolockOSThread` (called by `LockOSThread`) sets `g.lockedm = m` and `m.lockedg = g`. From then on, the runtime cannot run this G on any other M, and cannot run any other G on this M. When this G blocks on a chan, its M parks too instead of picking up other work. When the G becomes runnable again, whichever thread finds it must hand its P over to that specific M and wake it, so every wake-up costs an OS-level thread switch instead of a cheap goroutine switch. With `GOMAXPROCS=8` and 8 locked workers, you get 8 dedicated OS threads with no sharing.
**Trade-off:** None when the lock was unnecessary. When it is necessary (cgo with thread-local storage, OpenGL contexts, per-thread credential or capability changes made through raw syscalls, Linux netns/userns work), the lock is non-negotiable.
**When NOT:** Cgo into a library with thread-local state (CUDA, OpenGL, X11). Linux per-thread namespace operations (`unshare`, `setns`). Signal handlers that must run on a specific thread. Calls to `syscall.RawSyscall` that mutate per-thread kernel state such as capabilities: Go's `syscall.Setuid` already applies to all threads on Linux, but raw syscalls don't.
10. Exercise 9 — make([]byte, n) per call for short-lived buffers
A request handler allocates a 64 KB buffer per call to decode protobuf, hands it to the decoder, then discards it. At 10k req/s that's 640 MB/s of GC pressure.
func decode(data []byte) (*Message, error) {
	buf := make([]byte, 64*1024) // fresh large allocation on every call
	n, err := unpack(data, buf)
	if err != nil {
		return nil, err
	}
	return parse(buf[:n])
}
After
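A sketch of the pooled-buffer pattern for this exercise. The names `bufPool`, `withBuffer`, and `checksum` are hypothetical, and `checksum` stands in for the `unpack`/`parse` step, which the original does not define. The one assumption that matters: whatever the decode step returns must not alias the pooled buffer, because the buffer is reused after `Put`.

```go
package main

import (
	"fmt"
	"sync"
)

const bufSize = 64 * 1024

// bufPool stores *[]byte so that Put does not box a slice header.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, bufSize)
		return &b
	},
}

// withBuffer runs fn with a pooled 64 KB buffer and returns the buffer
// to the pool afterwards. fn must not retain the slice. If fn panics the
// buffer is simply not returned, which is harmless.
func withBuffer(fn func(buf []byte) error) error {
	bp := bufPool.Get().(*[]byte)
	err := fn(*bp)
	bufPool.Put(bp)
	return err
}

// checksum is a stand-in for decoding: it copies data into the scratch
// buffer and derives a result that does not point into the buffer.
func checksum(data []byte) (sum int, err error) {
	err = withBuffer(func(buf []byte) error {
		n := copy(buf, data) // overwrite, then read only buf[:n]
		for _, b := range buf[:n] {
			sum += int(b)
		}
		return nil
	})
	return sum, err
}

func main() {
	fmt.Println(checksum([]byte{1, 2, 3})) // 6 <nil>
}
```

Reading only `buf[:n]` after overwriting it is what keeps stale bytes from a previous request out of the result; code that reads beyond `n` would need an explicit reset.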
`sync.Pool` keyed on buffer size. Reset and put back at the end. ~8× faster, ~2700× less GC pressure.
**Why faster:** Every `make([]byte, 64*1024)` goes through `runtime/malloc.go::mallocgc`. At 64 KB it takes the large-object path: objects over 32 KB bypass the per-P size-class cache, and `mcache.allocLarge` goes straight to the `mheap` and its lock. Smaller objects pay per-call overhead too: the `mcache` fast path finds a free slot in the current span, occasionally refills from `mcentral`, and the GC eventually has to account for the object. `sync.Pool` keeps a per-P local cache (`poolLocal`) that `Get` reads without a lock; on a miss it tries other Ps' pools and then the victim cache (`getSlow`). Store `*[]byte` rather than `[]byte`: putting a plain slice into the pool boxes its 24-byte header into an interface on every `Put`, an allocation the pointer form avoids.
**Trade-off:** `sync.Pool` is cleared across GC cycles, so a low-traffic endpoint sees a hit rate near zero. Buffers must be reset (or you leak data across calls). A single pool is not suitable for buffers of variable size; use a slab pool or one `sync.Pool` per size bucket.
**When NOT:** When buffers vary wildly in size (the pool either keeps oversized buffers, wasting memory, or constantly allocates new ones, defeating the point). When the allocation isn't on a hot path. When the buffer lifetime crosses goroutines in unclear patterns, where pool reuse becomes a correctness risk.
11. Exercise 10 — runtime.SetFinalizer to close resources
A struct wrapping a file handle uses runtime.SetFinalizer(f, (*File).close) "for safety." The finalizer runs eventually (GC-triggered), possibly after the FD limit is hit. Each finalizer adds work to the dedicated finalizer goroutine.
type File struct{ fd int }
func Open(path string) (*File, error) {
	fd, err := syscall.Open(path, syscall.O_RDONLY, 0)
	if err != nil {
		return nil, err
	}
	f := &File{fd: fd}
	runtime.SetFinalizer(f, (*File).close) // closes only when the GC gets to it
	return f, nil
}
func (f *File) close() { syscall.Close(f.fd) }
// Real symptom: "too many open files" errors under load before GC catches up.
// A benchmark is not the right tool; check the FD count under sustained Open without Close.
After
`Close()` method + `defer f.Close()` at the call site. Finalizer (if kept) only as a debugging belt-and-suspenders that *warns* on leak.
type File struct {
	fd     int
	closed bool
}
func Open(path string) (*File, error) {
	fd, err := syscall.Open(path, syscall.O_RDONLY, 0)
	if err != nil {
		return nil, err
	}
	f := &File{fd: fd}
	runtime.SetFinalizer(f, func(f *File) {
		if !f.closed {
			panic("File leaked without Close") // dev-only
		}
	})
	return f, nil
}
func (f *File) Close() error {
	if f.closed {
		return nil
	}
	f.closed = true
	runtime.SetFinalizer(f, nil) // disarm
	return syscall.Close(f.fd)
}
// Callers:
//	f, err := Open(path)
//	if err != nil { return err }
//	defer f.Close()
12. Exercise 11 — interface{} boxing in a hot path
A logger accepts ...interface{} and stores each value in a ring buffer. Every int, int64, float64, or other non-pointer value is boxed on the way in via runtime.convT64, runtime.convTstring, runtime.convTslice, and friends, which typically allocate a heap copy of the value for the interface to point at.
type Logger struct{ buf []any }
func (l *Logger) Logf(format string, args ...any) {
	l.buf = append(l.buf, formatted(format, args)...)
}
func main() {
	l := &Logger{}
	for i := 0; i < 1_000_000; i++ {
		l.Logf("iter=%d val=%f", i, float64(i)*1.5) // both arguments are boxed
	}
}
After
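Typed API for hot fields, or generics for monomorphized paths. `slog` uses typed `Attr` for exactly this reason.
// Kinds of value an Attr can carry.
const (
	KindInt uint8 = iota
	KindFloat
	KindStr
)
type Attr struct {
	Key   string
	Kind  uint8 // KindInt, KindFloat, or KindStr
	Int   int64
	Float float64
	Str   string
}
// Typed constructors: no interface conversion, so nothing is boxed.
func Int(key string, v int64) Attr     { return Attr{Key: key, Kind: KindInt, Int: v} }
func Float(key string, v float64) Attr { return Attr{Key: key, Kind: KindFloat, Float: v} }
// KV is the convenience path for cold call sites; it still boxes v.
func KV(key string, v any) Attr {
	switch v := v.(type) {
	case int:
		return Attr{Key: key, Kind: KindInt, Int: int64(v)}
	case int64:
		return Attr{Key: key, Kind: KindInt, Int: v}
	case float64:
		return Attr{Key: key, Kind: KindFloat, Float: v}
	case string:
		return Attr{Key: key, Kind: KindStr, Str: v}
	}
	return Attr{Key: key, Kind: KindStr, Str: fmt.Sprint(v)}
}
func (l *Logger) Log(msg string, attrs ...Attr) {
	// serialize without boxing
}
// Caller:
//	l.Log("iter", Int("i", int64(i)), Float("val", float64(i)*1.5))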
13. Exercise 12 — runtime.NumGoroutine() polled in metrics
A /metrics endpoint reports goroutine count by calling runtime.NumGoroutine() on every scrape and treats it as a measure of load. The call itself is cheap: it derives the count from runtime counters rather than walking every goroutine. The problem is what it measures, a runtime-level total that says little about how many requests or workers are actually active.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	n := runtime.NumGoroutine()
	fmt.Fprintf(w, "goroutines %d\n", n)
}
After
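A sketch of an application-level in-flight gauge for this exercise. `inFlight`, `tracked`, and `track` are hypothetical names; the metric name in `metricsHandler` is likewise an example.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// inFlight counts units of work currently executing.
var inFlight atomic.Int64

// tracked runs fn while counting it as in flight. The deferred decrement
// also runs if fn panics, so the gauge cannot drift upwards.
func tracked(fn func()) {
	inFlight.Add(1)
	defer inFlight.Add(-1)
	fn()
}

// track is HTTP middleware built on tracked.
func track(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tracked(func() { next.ServeHTTP(w, r) })
	})
}

// metricsHandler reports the gauge with a single atomic load.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "in_flight_requests %d\n", inFlight.Load())
}

func main() {
	var during int64
	tracked(func() { during = inFlight.Load() })
	fmt.Println(during, inFlight.Load()) // 1 while running, 0 afterwards

	mux := http.NewServeMux()
	mux.Handle("/work", track(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {})))
	mux.HandleFunc("/metrics", metricsHandler)
	_ = mux // a real server would pass mux to http.ListenAndServe
}
```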
For the *application-level* concept ("active workers", "in-flight requests") use an atomic counter. Keep `NumGoroutine` for occasional debugging.
**Why faster:** Only marginally; the real gain is semantic. `runtime.NumGoroutine` calls `runtime/proc.go::gcount`, which computes the count arithmetically: the length of `allgs`, minus the free (dead, reusable) Gs on the global and per-P free lists, minus system goroutines. It iterates over the Ps, not over `allgs`, and takes no lock, so its cost does not grow with the number of goroutines ever created; measure before assuming it is a bottleneck. The stronger argument is that the number is only loosely related to "how busy is my app": it counts every user goroutine, including parked idle workers and goroutines started by libraries. An atomic counter, one `LOCK XADD` per update, measures the thing the operator actually wants.
**Trade-off:** You're not measuring goroutine *leaks*; `NumGoroutine` does that, and you still want it on an occasional debug endpoint. The atomic counter only knows about the app-level units you instrument.
**When NOT:** When you specifically want to detect goroutine leaks: keep `NumGoroutine` on a debug endpoint, alongside the cheap atomic counters. When the goroutine count is itself the signal you alert on.
14. Exercise 13 — runtime.ReadMemStats per request
A handler emits memory stats per request to a tracing system. runtime.ReadMemStats stops the world to collect a consistent snapshot of all memory counters.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	serve(w, r)
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // stops the world on every request
	tracer.RecordHeapInUse(ms.HeapInuse)
}
After
`runtime/metrics.Read` for a non-STW snapshot of the metrics you actually want. ~8700× faster. No STW.
**Why faster:** `runtime/mstats.go::readmemstats_m` (the routine that `ReadMemStats` runs on the system stack) executes inside a stop-the-world so that the per-P allocation statistics can be flushed and summed into one coherent snapshot. The STW pauses every G, so calling it per request injects a pause into every other concurrent handler, once per request. `runtime/metrics` (Go 1.16+) was designed to expose individual metrics without that: `metrics.Read` fills in only the requested samples and does not stop the world. Recent versions of common observability clients (for example Prometheus's Go collector) read `runtime/metrics`; mostly legacy code still reads `MemStats`.
**Trade-off:** The set of metrics differs slightly. Anything in `MemStats` worth reporting has a `runtime/metrics` equivalent; see `metrics.All()` for the inventory.
**When NOT:** When you need a *coherent* snapshot across many counters (you usually don't; operators want individual gauges). One-shot tools where the STW pause is irrelevant. Tests that profile heap precisely after a known point.
15. Exercise 14 — runtime.Stack(buf, true) always-on dump
A panic handler proactively captures runtime.Stack(buf, true) on every request "in case we need debug info." all=true stops the world and walks every goroutine's stack, formatting frames into the buffer.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // all goroutines, every request
	slog.Debug("stacks", "dump", string(buf[:n]))
	serve(w, r)
}
After
Dump stacks only on the conditions that need them: panics, deadlock detection (timeouts), debug endpoint.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	defer func() {
		if rv := recover(); rv != nil {
			buf := make([]byte, 1<<20)
			n := runtime.Stack(buf, true) // only on panic
			slog.Error("panic", "value", rv, "stacks", string(buf[:n]))
		}
	}()
	serve(w, r)
}
// Plus a debug endpoint, registered once at startup (registerDebug is an illustrative name):
func registerDebug() {
	http.HandleFunc("/debug/goroutines", func(w http.ResponseWriter, r *http.Request) {
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true)
		w.Write(buf[:n])
	})
}
16. When NOT to optimize
Runtime cost dominates only when the runtime is on the hot path of a high-frequency operation. If your handler runs 10 times per minute, all 14 of these "wins" combined save you microseconds you can't measure. The signature in a profile that says "I should look at the runtime" is one of: runtime.mallocgc >5% of CPU, runtime.chansend1/chanrecv1 >2%, runtime.selectgo >2%, runtime.deferreturn on a hot stack, runtime.gopark with the wrong reason at the wrong frequency, runtime/mgc.go::gcBgMarkWorker >10% (or gcDrain consuming a worker), STW intervals >100 μs.
Profile first. Run go tool pprof on a CPU profile collected with runtime/pprof.StartCPUProfile and search the call graph for the runtime functions named in each exercise. runtime/trace shows scheduler events: EvGoBlockSend (Ex. 1, channel send blocked), EvGoBlockSelect (Ex. 2), EvGoSched (Ex. 7), EvGCSTWStart/EvGCSTWDone (Ex. 6, 13, 14). These are the event names of the legacy trace format; Go 1.22 and later record the same information in a newer format, so the labels in your tooling may differ. Each maps to a row in the trace UI; click into it and you see exactly which call site triggered it.
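A sketch of collecting both artifacts mentioned above from inside a program. `profileTo` and the output file names are assumptions for illustration; the workload in `main` is a placeholder for the code under investigation.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"runtime/trace"
)

// profileTo runs work while writing a CPU profile and an execution
// trace to the given paths.
func profileTo(cpuPath, tracePath string, work func()) error {
	cpuFile, err := os.Create(cpuPath)
	if err != nil {
		return err
	}
	defer cpuFile.Close()
	traceFile, err := os.Create(tracePath)
	if err != nil {
		return err
	}
	defer traceFile.Close()

	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		return err
	}
	defer pprof.StopCPUProfile() // runs before the files are closed
	if err := trace.Start(traceFile); err != nil {
		return err
	}
	defer trace.Stop()

	work()
	return nil
}

func main() {
	err := profileTo("cpu.pprof", "trace.out", func() {
		sum := 0
		for i := 0; i < 50_000_000; i++ { // placeholder workload
			sum += i % 7
		}
		fmt.Println("work done:", sum)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "profiling failed:", err)
		os.Exit(1)
	}
}
```

The results can then be opened with `go tool pprof cpu.pprof` and `go tool trace trace.out`.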
Common premature optimizations: atomic counter (Ex. 1) replacing a chan that genuinely carries values; manual Unlock (Ex. 4) inside a function that does anything that can panic; worker pool (Ex. 5) on a 50-row batch; removing runtime.GC (Ex. 6) before checking whether GOMEMLIMIT covers the use case; sync.Pool (Ex. 9) on buffers smaller than 1 KB; typed-attr API (Ex. 11) on a logger called twice per request.
Correctness gaps disguised as optimizations: atomic counter (Ex. 1) where ordering with other channel ops was load-bearing; blocking select (Ex. 2) that removes the non-blocking case where work did happen on default; Reset (Ex. 3) without the Stop+drain dance — the timer fires twice; manual Unlock (Ex. 4) inside a path that can panic — deadlock on retry; worker pool (Ex. 5) without context.Context cancellation — one error doesn't stop the batch; removing runtime.GC (Ex. 6) in a process that legitimately needs to bound heap before a known-large operation; removing Gosched (Ex. 7) inside a LockOSThread'd G that was sharing a thread with another locked G (don't do this); sync.Pool (Ex. 9) holding buffers that weren't reset, leaking data across requests; explicit Close (Ex. 10) at call sites that forgot defer; typed-attr API (Ex. 11) that silently drops attrs of unsupported kinds; atomic in-flight counter (Ex. 12) that doesn't reset when a panic skips the defer Add(-1) — fix with defer inside recover'd wrapper; metrics.Read (Ex. 13) consuming the same slice from multiple goroutines (it mutates the slice); on-demand runtime.Stack (Ex. 14) inside the same goroutine that's already panicking — captures a stack you already have via recover.
17. Summary
Always-ship wins (default in any new runtime-touching code): never call runtime.GC() in production (Ex. 6); never call runtime.Gosched() (Ex. 7) — the scheduler has been preempting for years; never runtime.LockOSThread() without a cgo/syscall reason (Ex. 8); never time.After in a loop (Ex. 3); never select { default: } as a spin (Ex. 2); use runtime/metrics over runtime.ReadMemStats (Ex. 13); dump stacks on panic, not per request (Ex. 14); defer f.Close() at the call site, not runtime.SetFinalizer for the happy path (Ex. 10).
Wins behind a profile (when measurements justify them): atomic counter over chan-of-counts (Ex. 1, when chansend shows); manual Unlock (Ex. 4, when deferproc shows on a million-call/s function); worker pool over goroutine-per-item (Ex. 5, when newproc shows); sync.Pool for short-lived buffers (Ex. 9, when mallocgc shows in the same span class); typed-attr/generic API over any (Ex. 11, when convT64/convTstring shows); atomic in-flight counter for app-level metrics (Ex. 12, when gcount or allglock shows in a metrics path).
Specialty (only when the design calls for it): custom timer wheel for systems with millions of pending timers per process (gobwas/ws style); per-P sharded counters using sync.Pool or runtime internals exposed via internal/runtime/atomic (cgo-grade hot paths only); manual stack snapshots via runtime.Callers (~200 ns) instead of runtime.Stack when you only need PCs, not symbolized text.
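A sketch of the `runtime.Callers` approach mentioned above: capture raw program counters on the hot path and symbolize them later, only if needed. `capturePCs` and `symbolize` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"runtime"
)

// capturePCs records up to len(pcs) return addresses of the caller's
// stack without symbolizing them. The skip value 2 drops the frames for
// runtime.Callers and capturePCs itself.
func capturePCs(pcs []uintptr) []uintptr {
	n := runtime.Callers(2, pcs)
	return pcs[:n]
}

// symbolize resolves PCs to function names. This is the expensive part,
// so it belongs off the hot path.
func symbolize(pcs []uintptr) []string {
	var names []string
	frames := runtime.CallersFrames(pcs)
	for {
		f, more := frames.Next()
		if f.Function != "" {
			names = append(names, f.Function)
		}
		if !more {
			break
		}
	}
	return names
}

func main() {
	pcs := capturePCs(make([]uintptr, 32)) // cheap: PCs only
	for _, name := range symbolize(pcs) {  // deferred symbolization
		fmt.Println(name)
	}
}
```

Unlike `runtime.Stack(buf, true)`, this covers only the calling goroutine and does not stop the world.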
Runtime cost is allocation, scheduler entry, lock contention, STW pauses, defer-chain construction, and iface boxing. Strip those six from the read path by choosing the right primitive: atomics for counters; blocking chans/selects for coordination, never spinning; pooled buffers for short-lived large objects; typed APIs for hot dispatch; explicit Close over finalizers; runtime/metrics over MemStats; on-demand stack dumps. The runtime itself is well-engineered — the wins come from not calling into it unnecessarily. Profile, then pick the lever; the call-graph signatures above tell you which one.