Modern Standard-Library Additions: Optimization
Honest framing first: most of these APIs are already fast, because the Go team designed them with allocation and contention in mind. The real wins come from using them correctly (the LogAttrs fast path, sets instead of linear scans, interning the right field) and from removing dependencies the stdlib now replaces (a logger, a router), which shrinks builds, binaries, and audit surface. Each entry states the problem, a "before", an "after", and the realistic gain. The closing sections cover measurement and the cases where the optimization is the wrong move.
Optimization 1: Use LogAttrs on hot logging paths
Problem: The convenient logger.Info(msg, args...) form boxes every value into any and assembles pairs at runtime. Those allocations add up in high-throughput request handlers.
Before: the variadic form, logger.Info(msg, args...) with alternating keys and values. Each call builds an []any and boxes code, elapsed, etc.
After:
logger.LogAttrs(ctx, slog.LevelInfo, "request done",
	slog.String("method", m),
	slog.Int("status", code),
	slog.Int64("ms", elapsed),
)
Attrs use slog.Value's union representation, so there is no per-value any boxing.
Expected gain: Fewer allocs/op per log line (often the dominant allocation in a request). Measure with -benchmem; in log-heavy services this meaningfully reduces GC pressure.
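As a rough illustration of the two call styles, the sketch below logs the same record both ways and prints the allocation count for each. The helper names (dropTime, logVariadic, logAttrs, render) are hypothetical, and the counts depend on Go version and argument values, so treat the numbers as something to observe rather than a guarantee.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log/slog"
	"testing"
)

// dropTime removes the timestamp so both call styles produce comparable output.
func dropTime(_ []string, a slog.Attr) slog.Attr {
	if a.Key == slog.TimeKey {
		return slog.Attr{}
	}
	return a
}

// logVariadic uses the convenient alternating key/value form.
func logVariadic(l *slog.Logger, m string, code int, elapsed int64) {
	l.Info("request done", "method", m, "status", code, "ms", elapsed)
}

// logAttrs uses the typed LogAttrs form.
func logAttrs(l *slog.Logger, m string, code int, elapsed int64) {
	l.LogAttrs(context.Background(), slog.LevelInfo, "request done",
		slog.String("method", m),
		slog.Int("status", code),
		slog.Int64("ms", elapsed),
	)
}

// render returns the line a call style writes through a text handler.
func render(f func(*slog.Logger, string, int, int64)) string {
	var buf bytes.Buffer
	h := slog.NewTextHandler(&buf, &slog.HandlerOptions{ReplaceAttr: dropTime})
	f(slog.New(h), "GET", 200, 1234)
	return buf.String()
}

func main() {
	// Both styles emit the same record.
	fmt.Print(render(logVariadic))
	fmt.Print(render(logAttrs))

	// Allocation counts per call, measured against a discarding handler.
	discard := slog.New(slog.NewTextHandler(io.Discard, nil))
	fmt.Println("variadic allocs/op:", testing.AllocsPerRun(1000, func() {
		logVariadic(discard, "GET", 500, 1<<40)
	}))
	fmt.Println("LogAttrs allocs/op:", testing.AllocsPerRun(1000, func() {
		logAttrs(discard, "GET", 500, 1<<40)
	}))
}
```

In a real service the same comparison belongs in a benchmark run with -benchmem, since AllocsPerRun here only gives a quick local reading.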
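Optimization 2: Guard expensive log construction with Enabled
Problem: Building an attribute is expensive (serialising a struct, formatting), but the log level is disabled in production, so the work is wasted.
Before: the same logger.Debug call without the guard, so expensiveJSON(state) runs on every call even when Debug is disabled.
After: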
if logger.Enabled(ctx, slog.LevelDebug) {
	logger.Debug("state", "snapshot", expensiveJSON(state))
}
// or make snapshot a LogValuer so it resolves lazily only when emitted.
Expected gain: The expensive call vanishes entirely on disabled levels. For a Debug line in a hot loop, this is the difference between paying the cost on every request and paying it never.
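The sketch below counts how often the expensive path runs for the unguarded call, the Enabled guard, and a LogValuer, using a logger whose minimum level is Info. The snapshot type, the expensiveJSON stand-in, and run are hypothetical names introduced only for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log/slog"
)

// snapshot is a hypothetical expensive-to-render value; calls counts how
// often the expensive path actually runs.
type snapshot struct{ calls *int }

// LogValue implements slog.LogValuer. A handler resolves it only when it
// actually processes the record.
func (s snapshot) LogValue() slog.Value {
	*s.calls++
	return slog.StringValue("rendered state")
}

// expensiveJSON stands in for the costly serialisation described in the text.
func expensiveJSON(calls *int) string {
	*calls++
	return `{"state":"..."}`
}

// run logs at Debug through an Info-level logger and reports how many times
// each variant paid the expensive cost.
func run() (eager, guarded, lazy int) {
	ctx := context.Background()
	h := slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: slog.LevelInfo})
	logger := slog.New(h)

	// Unguarded: the argument is evaluated before Debug can check the level.
	logger.Debug("state", "snapshot", expensiveJSON(&eager))

	// Guarded: the expensive call is skipped when Debug is disabled.
	if logger.Enabled(ctx, slog.LevelDebug) {
		logger.Debug("state", "snapshot", expensiveJSON(&guarded))
	}

	// LogValuer: the value is passed as-is and never resolved for a dropped record.
	logger.Debug("state", "snapshot", snapshot{calls: &lazy})
	return eager, guarded, lazy
}

func main() {
	eager, guarded, lazy := run()
	fmt.Println("eager:", eager, "guarded:", guarded, "lazy:", lazy)
}
```

The LogValuer variant keeps call sites free of if blocks, at the cost of defining a small wrapper type.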
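Optimization 3: Replace linear Contains with a set
Problem: slices.Contains is O(n). Calling it inside a loop over the same slice is O(n²).
Before: each incoming element is checked with slices.Contains(allowed, x), which scans all of allowed for every element.
After: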
set := make(map[T]struct{}, len(allowed))
for _, a := range allowed {
	set[a] = struct{}{}
}
for _, x := range incoming {
	if _, ok := set[x]; ok { // O(1) each
		admit(x)
	}
}
Expected gain: From O(n·m) to O(n+m). For allowed of thousands and incoming of thousands, this turns a multi-millisecond hot spot into microseconds. Use slices.Contains only for small or one-shot checks.
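To make the two shapes concrete, here is a small self-contained sketch with both variants written as generic functions. The names filterLinear and filterSet are hypothetical, and collecting the admitted elements into a slice stands in for the admit call in the text.

```go
package main

import (
	"fmt"
	"slices"
)

// filterLinear keeps the incoming elements found in allowed, scanning
// allowed once per element: O(n*m).
func filterLinear[T comparable](allowed, incoming []T) []T {
	var out []T
	for _, x := range incoming {
		if slices.Contains(allowed, x) {
			out = append(out, x)
		}
	}
	return out
}

// filterSet builds a set from allowed once, then does a map lookup per
// incoming element: O(n+m).
func filterSet[T comparable](allowed, incoming []T) []T {
	set := make(map[T]struct{}, len(allowed))
	for _, a := range allowed {
		set[a] = struct{}{}
	}
	var out []T
	for _, x := range incoming {
		if _, ok := set[x]; ok {
			out = append(out, x)
		}
	}
	return out
}

func main() {
	allowed := []string{"a", "b", "c"}
	incoming := []string{"c", "x", "a", "a"}
	fmt.Println(filterLinear(allowed, incoming)) // [c a a]
	fmt.Println(filterSet(allowed, incoming))    // [c a a]
}
```

Both functions return the same result in the same order; only the cost per membership check differs.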
Optimization 4: Intern a repetitive field with unique
Problem: Millions of records each carry a string drawn from a tiny set (region, status, tenant), so the heap stores millions of duplicate string headers and bytes.
Before: each row stores the field as a plain string (for example, Region string), so every record keeps its own copy.
After:
type Row struct{ Region unique.Handle[string] }
r := Row{Region: unique.Make(region)}
// later: r.Region.Value()
Expected gain: Heap residency for that field collapses to one canonical copy per distinct value, plus a small handle per row. Equality (r1.Region == r2.Region) becomes a pointer compare, speeding up group-by/dedup. Verify with inuse_space heap profiles — the win is residency, not Make micro-latency. Only apply to genuinely repetitive fields; interning unique data is pure overhead.
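A minimal sketch of the interned field, assuming Go 1.23 or later for the unique package. The newRow helper and the ID field are hypothetical additions for illustration; strings.Clone simulates values parsed from separate input records.

```go
package main

import (
	"fmt"
	"strings"
	"unique"
)

// Row carries an interned region instead of a plain string.
type Row struct {
	ID     int
	Region unique.Handle[string]
}

// newRow interns region so equal values share one canonical copy.
func newRow(id int, region string) Row {
	return Row{ID: id, Region: unique.Make(region)}
}

func main() {
	// strings.Clone gives each value its own backing bytes, as if each had
	// been parsed from a different input record.
	a := newRow(1, strings.Clone("eu-west-1"))
	b := newRow(2, strings.Clone("eu-west-1"))
	c := newRow(3, "us-east-1")

	fmt.Println(a.Region == b.Region) // true: same canonical handle
	fmt.Println(a.Region == c.Region) // false: different values
	fmt.Println(a.Region.Value())     // eu-west-1
}
```

Handles compare with ==, so they also work directly as map keys for group-by style aggregation.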
Optimization 5: Drop the router dependency
Problem: A gorilla/mux or chi dependency exists solely for method and path-variable routing that the stdlib now provides (Go 1.22+).
Before:
r := mux.NewRouter()
r.HandleFunc("/items/{id}", h).Methods("GET")
// + a go.mod dependency, its transitive deps, and its CVE surface
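After: register the handler on http.NewServeMux with the pattern "GET /items/{id}" and read the variable with r.PathValue("id").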
Expected gain: One fewer dependency (and its transitive closure) in go.mod — smaller builds, smaller binary, less to audit and patch. Routing performance is competitive, and the stdlib mux avoids some third-party per-request allocations. Keep chi only if you genuinely need its middleware/grouping ergonomics.
Optimization 6: Reuse a buffer pool in a custom JSON handler
Problem: A custom slog.Handler allocates a fresh []byte/strings.Builder per record; at high log volume this churns the GC.
Before:
func (h *h) Handle(_ context.Context, r slog.Record) error {
	var b strings.Builder // new allocation every record
	// ... format ...
	return nil
}
After:
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func (h *h) Handle(_ context.Context, r slog.Record) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	// ... format into buf ...
	_, err := h.w.Write(buf.Bytes())
	return err
}
Expected gain: Per-record buffer allocations drop to near zero under load. This is what the stdlib JSON handler does internally; replicate it in custom handlers serving hot paths.
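For a complete picture, the following is a deliberately minimal handler built around the pooled buffer. It writes a plain text line rather than JSON to stay short. The lineHandler type, its output format, and the format helper are assumptions made for this sketch, and it ignores groups and handler-level attributes, which a production handler would need to support.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log/slog"
	"sync"
)

var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// lineHandler is a hypothetical, simplified slog.Handler that formats each
// record into a pooled buffer before writing it out.
type lineHandler struct {
	mu *sync.Mutex // serialises writes to w
	w  io.Writer
}

func (h *lineHandler) Enabled(context.Context, slog.Level) bool { return true }

// WithAttrs and WithGroup are no-ops here to keep the sketch short.
func (h *lineHandler) WithAttrs([]slog.Attr) slog.Handler { return h }
func (h *lineHandler) WithGroup(string) slog.Handler      { return h }

func (h *lineHandler) Handle(_ context.Context, r slog.Record) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf) // return the buffer once the write is done

	buf.WriteString(r.Level.String())
	buf.WriteByte(' ')
	buf.WriteString(r.Message)
	r.Attrs(func(a slog.Attr) bool {
		buf.WriteByte(' ')
		buf.WriteString(a.Key)
		buf.WriteByte('=')
		buf.WriteString(a.Value.String())
		return true
	})
	buf.WriteByte('\n')

	h.mu.Lock()
	defer h.mu.Unlock()
	_, err := h.w.Write(buf.Bytes())
	return err
}

// format logs one record through the handler and returns the emitted line.
func format() string {
	var out bytes.Buffer
	logger := slog.New(&lineHandler{mu: &sync.Mutex{}, w: &out})
	logger.Info("request done", "status", 200)
	return out.String()
}

func main() {
	fmt.Print(format()) // INFO request done status=200
}
```

The mutex is a separate concern from the pool: the pool removes per-record allocations, while the lock keeps concurrent records from interleaving on the shared writer.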
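Optimization 7: Use math/rand/v2 to remove a global-mutex bottleneck
Problem: Concurrent goroutines calling v1 math/rand top-level functions can contend on a mutex-guarded global source. This applies to Go versions before 1.20 and, in later versions, to programs that call rand.Seed; otherwise Go 1.20+ already serves the top-level functions from a lock-free runtime generator.
Before:
import "math/rand" // v1; a seeded global source is guarded by one mutex
go func() { _ = rand.Intn(100) }() // contended under load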
After:
import "math/rand/v2" // global path avoids the v1 mutex bottleneck
_ = rand.IntN(100)
// or give each goroutine its own *rand.Rand for full independence
Expected gain: Reduced lock contention on randomness in highly-concurrent code (rate-limiter jitter, load-balancer choices). Under heavy parallelism this removes a measurable serialization point.
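The per-goroutine option mentioned above could be sketched as follows, assuming Go 1.22 or later. The jitter and perGoroutine helpers and the fixed PCG seeds are hypothetical; fixed seeds are used here so the output is reproducible, which real jitter code would usually not want.

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"sync"
)

// jitter draws n values in [0, 100) from a generator owned by the caller.
func jitter(r *rand.Rand, n int) []int {
	out := make([]int, n)
	for i := range out {
		out[i] = r.IntN(100)
	}
	return out
}

// perGoroutine gives each worker its own seeded generator, so workers share
// no random-number state at all.
func perGoroutine(workers, n int) [][]int {
	results := make([][]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// A *rand.Rand is not safe for concurrent use, so it stays
			// local to this goroutine.
			r := rand.New(rand.NewPCG(uint64(w), 42))
			results[w] = jitter(r, n)
		}(w)
	}
	wg.Wait()
	return results
}

func main() {
	// Top-level v2 functions are safe for concurrent use and randomly seeded.
	fmt.Println(rand.IntN(100))

	// Explicitly seeded generators give a reproducible sequence per worker.
	for w, vals := range perGoroutine(3, 5) {
		fmt.Println("worker", w, vals)
	}
}
```

Injecting the seeded source this way also covers the reproducibility concern: tests can pass fixed seeds while production code uses the top-level functions.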
Optimization 8: Pre-size and reuse with slices.Clip/Grow
Problem: Repeated append without capacity hints reallocates; and sharing a sliced-down backing array causes a later append to clobber data.
Before:
out := existing[:0] // reuse backing array
out = append(out, more...) // may overwrite data still referenced via `existing`
After:
out := slices.Clip(existing[:0]) // cap == len, so append reallocates safely
out = append(out, more...)
// or pre-size a fresh slice instead:
fresh := make([]T, 0, len(a)+len(b))
Expected gain: Avoids both surprise data corruption and repeated growth reallocations. slices.Grow(s, n) reserves capacity for n more elements in one allocation when the final size is known.
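The aliasing hazard and the Clip fix can be seen in a few lines. The helper names reuseUnsafe and reuseClipped are hypothetical and exist only for this sketch.

```go
package main

import (
	"fmt"
	"slices"
)

// reuseUnsafe appends into existing's backing array, overwriting elements
// that other references to that array can still see.
func reuseUnsafe(existing []int, more ...int) []int {
	return append(existing[:0], more...)
}

// reuseClipped drops the spare capacity first, so append must allocate a new
// backing array and the original data is left alone.
func reuseClipped(existing []int, more ...int) []int {
	return append(slices.Clip(existing[:0]), more...)
}

func main() {
	a := []int{1, 2, 3}
	_ = reuseUnsafe(a, 9)
	fmt.Println(a) // [9 2 3]: the first element was clobbered

	b := []int{1, 2, 3}
	_ = reuseClipped(b, 9)
	fmt.Println(b) // [1 2 3]: untouched

	// Grow reserves room for at least 10 more elements up front.
	s := slices.Grow([]int{1}, 10)
	fmt.Println(len(s), cap(s) >= 11) // 1 true
}
```

Note the trade-off: clipping a zero-length slice gives up the reuse of the backing array, so it buys safety at the price of one allocation.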
Measuring
- Allocations: go test -bench . -benchmem and read allocs/op. The slog variadic-vs-LogAttrs and custom-handler-pool wins show here.
- Heap residency: go test -memprofile / pprof -inuse_space. The unique interning win shows here, not in -bench.
- Contention: go test -bench . -cpu 1,4,16 and the mutex profile (runtime.SetMutexProfileFraction) for the math/rand global-mutex win.
- Stability: run benchmarks multiple times and compare with benchstat; single runs are noise.
- Build/binary size: go build -o /dev/null timing and ls -l on the binary before/after dropping a router dependency.
When These Optimizations Are the Wrong Move
- LogAttrs everywhere harms readability for cold-path logging (startup, rare errors). Use the variadic form where it is clearer and the path is not hot.
- unique on unique or low-repetition data is negative: Make costs a lookup with no residency win. Profile first.
- A buffer pool in a low-volume handler adds complexity for no measurable gain. Pool only proven hot paths.
- Sets instead of slices.Contains for tiny slices (a handful of elements) is over-engineering: the linear scan is faster than building a map.
- Dropping chi/echo purely to remove a dependency, when you actually use its middleware/grouping, trades real ergonomics for a marginal footprint win.
- math/rand/v2 for determinism-sensitive code without injecting an explicit seeded source breaks reproducibility. The absence of global Seed is a correctness consideration, not just a perf one.
The honest summary: the biggest, safest wins are using slog's fast path, choosing sets over linear scans on hot paths, and deleting dependencies the stdlib now replaces. Reach for interning and pooling only with a profile in hand.