sync — Optimization
1. How to use this file
Fourteen scenarios where sync-shaped code is slower, allocates more, or scales worse than a different primitive in the same package — or no primitive at all. Each entry has a Before (code + benchmark) and a collapsible After (optimized code + benchmark + why + trade-offs + when NOT).
Anchored at Go 1.23, amd64, GOMAXPROCS=8. Numbers are reproducible in shape, not in absolute value: run go test -bench=. -benchmem -cpu=8 on your hardware before quoting them.
sync cost is dominated by four things: contention on a single word (mutex state, atomic value), cache-line bouncing across cores, per-call allocation of pooled values, and runtime calls into runtime.semacquire/runtime.gopark. Most wins remove one of those four from the hot path.
Reading order: Ex. 1, 2, 11, then any order. Ex. 4, 6, 13 are the ones most senior reviews flag.
2. Exercise 1 — Mutex-protected counter
A request counter is read and incremented millions of times per second from many goroutines. Each mu.Lock/Unlock pair costs two atomic read-modify-write operations even when uncontended; under contention, waiters spin briefly and then park through runtime.semacquire.
type Counter struct {
	mu sync.Mutex
	n  int64
}
func (c *Counter) Inc() { c.mu.Lock(); c.n++; c.mu.Unlock() }
func (c *Counter) Load() int64 { c.mu.Lock(); defer c.mu.Unlock(); return c.n }
BenchmarkMutexCounter-8 50000000 42 ns/op 0 B/op 0 allocs/op // uncontended
BenchmarkMutexCounterParallel-8 8000000 178 ns/op 0 B/op 0 allocs/op // 8 goroutines
After
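A minimal sketch of the atomic counter this section describes. The `hammer` helper and the worker counts are illustrative assumptions, not part of the original exercise; they only show that concurrent increments are not lost.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Counter is the atomic replacement for the mutex-guarded counter.
type Counter struct {
	n atomic.Int64
}

// Inc adds one with a single atomic read-modify-write.
func (c *Counter) Inc() { c.n.Add(1) }

// Load returns the current value with an atomic load.
func (c *Counter) Load() int64 { return c.n.Load() }

// hammer (hypothetical helper) starts `workers` goroutines that each call
// Inc `perWorker` times, waits for them, and returns the final count.
func hammer(workers, perWorker int) int64 {
	var c Counter
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Inc()
			}
		}()
	}
	wg.Wait()
	return c.Load()
}

func main() {
	fmt.Println(hammer(8, 1000)) // prints 8000: 8 workers x 1000 increments
}
```

The type has no lock to hold, so anything that needs the counter to stay consistent with a second field still belongs behind a mutex.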
`atomic.Int64.Add` is a single `LOCK XADDQ` on x86: no park, no spin, no fairness machinery. Roughly 7× faster uncontended and 10× faster under contention.
**Why faster:** `sync.Mutex` does an atomic CAS plus bookkeeping (`state`, `sema`) and parks waiters on contention. `atomic.Int64.Add` compiles to one `LOCK XADDQ` instruction. No goroutine ever sleeps; the hardware's cache-coherence protocol arbitrates between cores.
**Trade-off:** Atomics protect one word. For two related fields (counter plus last-updated timestamp), you still need a mutex, or an immutable struct swapped through `atomic.Pointer`. Check-then-act sequences (`if c.Load() == 0 { ... }`) are racy unless the action itself is a compare-and-swap. Go's `sync/atomic` operations are sequentially consistent, which is easy to reason about, but on weakly ordered platforms such as arm64 that guarantee is paid for with memory barriers.
**When NOT:** The counter participates in invariants with other fields (e.g., "if n > 0, ptr must be non-nil"). The increment is conditional on a complex predicate. You want fairness or queue order; atomics don't queue.
3. Exercise 2 — Global Mutex on a hot path
A request-tracking map is guarded by one sync.Mutex for a whole-process metrics path. Eight goroutines spin on the same cache line; throughput collapses past 4 cores.
type Stats struct {
	mu       sync.Mutex
	perRoute map[string]int
}
func (s *Stats) Hit(route string) {
	s.mu.Lock()
	s.perRoute[route]++
	s.mu.Unlock()
}
After
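Shard the map. Pick the shard from a hash of the key; only goroutines hitting the same shard contend.
import (
	"hash/fnv"
	"sync"
)
const numShards = 64
type shard struct {
	mu sync.Mutex     // 8 B
	m  map[string]int // 8 B on amd64
	_  [48]byte       // 8 + 8 + 48 = 64 B, so neighboring shards don't share a cache line
}
type Stats struct{ shards [numShards]shard }
// NewStats initializes every shard's map; writing to a nil map panics.
func NewStats() *Stats {
	s := &Stats{}
	for i := range s.shards {
		s.shards[i].m = make(map[string]int)
	}
	return s
}
func (s *Stats) Hit(route string) {
	h := fnv.New32a()
	h.Write([]byte(route)) // hash.Hash.Write never returns an error
	sh := &s.shards[h.Sum32()%numShards]
	sh.mu.Lock(); sh.m[route]++; sh.mu.Unlock()
}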
4. Exercise 3 — RWMutex worse than Mutex (few readers, many writers)
A small in-memory routing table is read 1× per request and written 4× per request (counters, last-seen, latency, status). The author reached for sync.RWMutex because "it has reads," but write-heavy workloads pay RWMutex's larger state machine without the readers-in-parallel benefit.
type Routes struct {
	mu     sync.RWMutex
	routes map[string]*Route
}
func (r *Routes) Get(k string) *Route { r.mu.RLock(); defer r.mu.RUnlock(); return r.routes[k] }
func (r *Routes) Update(k string, fn func(*Route)) {
	r.mu.Lock(); fn(r.routes[k]); r.mu.Unlock()
}
BenchmarkRWMutexWriteHeavy-8 5000000 310 ns/op 0 B/op 0 allocs/op // 1 read : 4 writes, 8 goroutines
After
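A sketch of the plain-`Mutex` version this section describes. The `Route` fields, the `NewRoutes` constructor, and the choice to have `Get` return a copy and `Update` create missing routes are assumptions added for illustration; the original only swaps the lock type.

```go
package main

import (
	"fmt"
	"sync"
)

// Route stands in for the routing entry; its fields are an assumption.
type Route struct {
	Hits int
}

type Routes struct {
	mu     sync.Mutex // plain Mutex: writers dominate, so reader parallelism buys nothing
	routes map[string]*Route
}

func NewRoutes() *Routes { return &Routes{routes: make(map[string]*Route)} }

// Get returns a copy of the route so callers never read fields that a
// concurrent Update is modifying.
func (r *Routes) Get(k string) (Route, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	rt, ok := r.routes[k]
	if !ok {
		return Route{}, false
	}
	return *rt, true
}

// Update applies fn to the route for k under the lock, creating it if absent.
func (r *Routes) Update(k string, fn func(*Route)) {
	r.mu.Lock()
	defer r.mu.Unlock()
	rt := r.routes[k]
	if rt == nil {
		rt = &Route{}
		r.routes[k] = rt
	}
	fn(rt)
}

func main() {
	r := NewRoutes()
	for i := 0; i < 4; i++ {
		r.Update("/home", func(rt *Route) { rt.Hits++ })
	}
	got, ok := r.Get("/home")
	fmt.Println(got.Hits, ok) // prints "4 true"
}
```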
Plain `sync.Mutex`. Its fast path is one CAS to lock and one atomic add to unlock; `RWMutex`'s writer path layers a reader-counting protocol on top, which is costlier under contention. ~1.8× faster.
**Why faster:** `RWMutex.Lock` (writer) first takes an internal `sync.Mutex`, then atomically adjusts the reader count to announce itself, and parks until active readers drain. `Mutex.Lock` is a single CAS on `state` when uncontended. When writers dominate, the reader-counting machinery is pure overhead.
**Trade-off:** Loses parallel reads, but the workload had only 1 read per 4 writes, so there is little to parallelize. If the mix shifts to read-heavy (roughly 8:1 or more), switch back.
**When NOT:** Read:write ratio above ~8:1 with reads holding the lock for nontrivial work; `RWMutex` wins there. Long-held read locks where parallel reads matter more than per-op overhead. Mixed access patterns where the better lock type is hard to predict; measure both.
5. Exercise 4 — sync.Map for write-heavy work
sync.Map is optimized for "write once, read many" or "disjoint key sets per goroutine." A session cache that's read and written at near-equal rates pays for the read/dirty two-map machinery without the read-mostly win.
var cache sync.Map // map[string]*Session
func Get(id string) *Session {
	v, _ := cache.Load(id)
	if v == nil {
		return nil
	}
	return v.(*Session)
}
func Put(id string, s *Session) { cache.Store(id, s) }
After
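A sketch of the map-plus-mutex cache this section describes. The `Session` fields and the `sessionCache` wrapper type are assumptions for illustration; `Get` and `Put` keep the signatures of the Before code.

```go
package main

import (
	"fmt"
	"sync"
)

// Session stands in for the cached value; its fields are an assumption.
type Session struct{ User string }

// sessionCache keeps the lock next to the map it guards.
type sessionCache struct {
	mu sync.Mutex
	m  map[string]*Session
}

var cache = sessionCache{m: make(map[string]*Session)}

// Get returns the session for id, or nil if it is not cached.
func Get(id string) *Session {
	cache.mu.Lock()
	defer cache.mu.Unlock()
	return cache.m[id]
}

// Put stores s under id, replacing any previous value.
func Put(id string, s *Session) {
	cache.mu.Lock()
	cache.m[id] = s
	cache.mu.Unlock()
}

func main() {
	Put("s1", &Session{User: "ada"})
	fmt.Println(Get("s1").User, Get("missing") == nil) // prints "ada true"
}
```

If one lock becomes the bottleneck, the same wrapper can be sharded as in Ex. 2 without changing the callers.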
Plain `map[string]*Session` + `sync.Mutex`. Single lock, single map, predictable cost. ~3× faster, allocations eliminated.
**Why faster:** In the Go 1.23 implementation, `sync.Map.Store` on a missing key takes the internal mutex, may copy the whole `read` map into a fresh `dirty` map, and allocates a new entry; that is where the per-`Store` allocations come from. Lookups that miss `read` fall through to `dirty` under the lock, so they touch two maps. A plain map + mutex has one lookup, one assignment, and limits contention by holding the lock briefly.
**Trade-off:** All access serializes through one lock; shard (Ex. 2) for higher fan-in. You lose `Range`'s ability to iterate without holding a lock for the whole walk. Reads no longer run in parallel; if reads come to dominate, use `RWMutex`.
**When NOT:** True read-mostly (>10:1) workloads with stable key sets; `sync.Map` shines there. Per-goroutine key partitions where each goroutine owns its keys. Code that uses `LoadOrStore`/`CompareAndSwap` patterns that the mutex map would have to reinvent.
6. Exercise 5 — sync.Pool storing tiny objects
A logging path pools *[8]byte headers via sync.Pool "to avoid allocations." Each header is 8 bytes, smaller than the per-Get/Put bookkeeping. The pool adds per-P pinning and queue bookkeeping to every call.
var headerPool = sync.Pool{New: func() any { return new([8]byte) }}
func writeRecord(w io.Writer, ts int64) {
	h := headerPool.Get().(*[8]byte)
	binary.LittleEndian.PutUint64(h[:], uint64(ts))
	w.Write(h[:])
	headerPool.Put(h)
}
After
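A sketch of the pool-free version, under one loudly stated assumption: the writer is a concrete `*bytes.Buffer` rather than the `io.Writer` of the Before code, because `bytes.Buffer.Write` only copies its argument. Whether `h` really stays on the stack is the compiler's decision, so it is worth confirming with `go build -gcflags=-m`. The `encode` helper is hypothetical and exists only to make the result observable.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// writeRecord encodes ts into a local 8-byte array instead of a pooled one.
// w is a concrete *bytes.Buffer (an assumption of this sketch).
func writeRecord(w *bytes.Buffer, ts int64) {
	var h [8]byte
	binary.LittleEndian.PutUint64(h[:], uint64(ts))
	w.Write(h[:]) // bytes.Buffer.Write always returns a nil error
}

// encode (hypothetical helper) writes one record and returns the raw bytes.
func encode(ts int64) []byte {
	var buf bytes.Buffer
	writeRecord(&buf, ts)
	return buf.Bytes()
}

func main() {
	fmt.Printf("% x\n", encode(1)) // prints "01 00 00 00 00 00 00 00"
}
```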
Use a local `[8]byte` instead of the pool. When escape analysis can prove the slice is not retained, the array lives on the stack and allocation costs nothing. ~3× faster, with the same zero allocations when `h` stays on the stack.
**Why faster:** `sync.Pool.Get` pins the goroutine to its P, checks `local[P].private`, then `local[P].shared.popHead`, then steals from other Ps if both are empty. That machinery is ~30 ns even on a cache hit. For an 8-byte stack value, allocation costs zero, so the pool is pure overhead.
**Trade-off:** Stack allocation only happens if escape analysis agrees. Through an `io.Writer` interface the compiler generally cannot see what `Write` does with `h[:]`, so it treats the array as escaping unless the call is devirtualized; with a concrete writer whose `Write` only copies its argument (such as `*bytes.Buffer`), `h` can stay on the stack. Confirm with `go build -gcflags=-m`. If `h` does escape (interface call, goroutine spawn), you allocate per call, which for 8 B is still ~5 ns, not 55.
**When NOT:** The pooled value is large (≥ 256 B) and escapes; the pool saves the allocator call and GC pressure. The pool stores a value with expensive initialization (for example a reset `bytes.Buffer` that keeps its grown capacity). The hot path allocates ≥ 1 MB across goroutines per second.
7. Exercise 6 — Channel-based mutex pattern
A "Go-idiomatic" lock uses a buffered channel of capacity 1. The author thought "channels not mutexes." Send/receive on a buffered channel is slower than mutex lock/unlock because it goes through runtime.chansend1/chanrecv1 with parking machinery.
type ChanLock struct{ ch chan struct{} }
func NewChanLock() *ChanLock { return &ChanLock{ch: make(chan struct{}, 1)} }
func (l *ChanLock) Lock() { l.ch <- struct{}{} }
func (l *ChanLock) Unlock() { <-l.ch }
BenchmarkChanLock-8 30000000 58 ns/op 0 B/op 0 allocs/op // uncontended
BenchmarkChanLockParallel-8 5000000 240 ns/op 0 B/op 0 allocs/op // 8 goroutines
After
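A sketch of both sides of this exercise: `sync.Mutex` for plain mutual exclusion, and the channel lock kept only where acquisition must be cancellable. The `guarded` type, `LockCtx`, and the `tryHeldLock` helper are illustrative names that do not appear in the original.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// guarded uses sync.Mutex for plain mutual exclusion.
type guarded struct {
	mu sync.Mutex
	n  int
}

func (g *guarded) Inc() {
	g.mu.Lock()
	g.n++
	g.mu.Unlock()
}

// ChanLock remains useful when lock acquisition must respect a context.
type ChanLock struct{ ch chan struct{} }

func NewChanLock() *ChanLock { return &ChanLock{ch: make(chan struct{}, 1)} }

// LockCtx acquires the lock, or gives up when ctx is done.
func (l *ChanLock) LockCtx(ctx context.Context) error {
	select {
	case l.ch <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (l *ChanLock) Unlock() { <-l.ch }

// tryHeldLock (hypothetical helper) tries to take an already-held lock with
// a short timeout and returns the resulting error.
func tryHeldLock() error {
	l := NewChanLock()
	if err := l.LockCtx(context.Background()); err != nil {
		return err
	}
	defer l.Unlock()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	return l.LockCtx(ctx) // the lock is held, so this waits until the deadline
}

func main() {
	var g guarded
	g.Inc()
	fmt.Println(g.n, tryHeldLock())
}
```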
`sync.Mutex` is the right primitive for mutual exclusion. ~2.4× faster uncontended, ~2.1× under contention.
**Why faster:** The `sync.Mutex.Lock` fast path is one CAS on a 32-bit `state` word. The channel path enters `runtime.chansend`, which acquires the channel's own internal lock, checks the buffer, possibly enqueues on `sendq`, and may park. Even when the buffer has room, that bookkeeping is roughly twice the work of a mutex CAS.
**Trade-off:** Channels are needed when you also want `select`, timeouts, or cancellation; `Mutex` doesn't compose with `<-ctx.Done()`. If you really need "try-lock with timeout," use a `chan struct{}` with `select`, or `TryLock` (Go 1.18+) plus a wait loop.
**When NOT:** You need `select { case <-lock: ...; case <-ctx.Done(): return ctx.Err() }` semantics. The "lock" is conceptually a token (semaphore with N > 1); then a buffered channel of capacity N is the right primitive. Educational examples illustrating CSP.
8. Exercise 7 — WaitGroup + result via shared slice
A fan-out collects results into a []Result indexed by goroutine ID. A sync.Mutex guards each result write and the shared error slot; a WaitGroup signals completion. The shape works but tangles concerns: error propagation is bolted on as a shared error slot, and the API leaks the index discipline.
func fetchAll(urls []string) ([]Result, error) {
	results := make([]Result, len(urls))
	var firstErr error
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			r, err := fetch(u)
			mu.Lock()
			if err != nil && firstErr == nil {
				firstErr = err
			}
			results[i] = r
			mu.Unlock()
		}(i, u)
	}
	wg.Wait()
	return results, firstErr
}
After
`errgroup.Group` with context-cancel on first error, results in a pre-sized slice keyed by index. No mutex is needed because each goroutine writes a disjoint cell.
import (
	"context"
	"golang.org/x/sync/errgroup"
)
func fetchAll(ctx context.Context, urls []string) ([]Result, error) {
	results := make([]Result, len(urls))
	g, ctx := errgroup.WithContext(ctx)
	for i, u := range urls {
		i, u := i, u // redundant under Go 1.22+ loop semantics, harmless otherwise
		g.Go(func() error {
			r, err := fetchCtx(ctx, u) // fetchCtx must honor ctx, or siblings are not cancelled
			if err != nil {
				return err
			}
			results[i] = r // disjoint index, no mutex
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return results, nil
}
9. Exercise 8 — sync.Cond for one-time event
A connection-init path uses sync.Cond to broadcast "connection ready" once at startup. Cond is right for repeated events with multiple waiters; for one-shot signaling, a channel close is simpler and faster.
type Conn struct {
	mu    sync.Mutex
	cond  *sync.Cond
	ready bool
}
func NewConn() *Conn { c := &Conn{}; c.cond = sync.NewCond(&c.mu); return c }
func (c *Conn) WaitReady() {
	c.mu.Lock()
	for !c.ready {
		c.cond.Wait()
	}
	c.mu.Unlock()
}
func (c *Conn) MarkReady() {
	c.mu.Lock()
	c.ready = true
	c.cond.Broadcast()
	c.mu.Unlock()
}
After
A `chan struct{}` closed once. Any number of receivers wake up; subsequent receives return immediately.
type Conn struct {
	ready chan struct{} // closed when ready
}
func NewConn() *Conn { return &Conn{ready: make(chan struct{})} }
func (c *Conn) WaitReady() { <-c.ready }
func (c *Conn) WaitReadyCtx(ctx context.Context) error {
	select {
	case <-c.ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
// MarkReady must be called exactly once; closing a closed channel panics.
func (c *Conn) MarkReady() { close(c.ready) }
10. Exercise 9 — sync.Once on the hot path (already optimal)
A config loader uses sync.Once.Do to lazily initialize a singleton. The Once is called millions of times per second after init; the worry is "is Do slow on the fast path?" — but Once has a fast-path atomic load, so re-checking is a single load.
var (
	once sync.Once
	cfg  *Config
)
func GetConfig() *Config {
	once.Do(func() { cfg = loadConfig() })
	return cfg
}
BenchmarkOnceHot-8 1000000000 1.2 ns/op 0 B/op 0 allocs/op // post-init, single goroutine
BenchmarkOnceHotParallel-8 500000000 2.4 ns/op 0 B/op 0 allocs/op
After (no change — confirm it's already optimal)
`Once.Do` reads `o.done` atomically first; if it is set, `Do` returns immediately. No mutex is acquired on the fast path. Don't replace it with `atomic.Bool` + manual init unless you're solving a different problem. No change: the benchmark confirms `Once.Do` post-init is ~1.2 ns, within noise of an inline `atomic.Bool.Load` + branch.
**Why already optimal:** `Once.Do` starts with an atomic load of `o.done`. If it is non-zero, `Do` returns. The slow path (mutex + closure call) runs only on the first call per `Once`. The 1.2 ns measurement is just the atomic load + branch + return.
**Trade-off:** Don't be tempted to read the package-level `cfg` directly to "skip the Once"; you lose the initialization-safety guarantee if any caller can race the loader. The 1.2 ns is not the bottleneck.
**When NOT (to "optimize"):** Never. Hand-rolled atomic-bool replacements for `Once` are how double-checked-locking memory-ordering bugs got into older codebases. Trust the primitive; profile something else.
11. Exercise 10 — Per-call allocation of struct with Mutex
A per-request rate limiter allocates a *Bucket per call, locks it briefly, then drops it. The allocation is the cost — mutex use is fine.
type Bucket struct {
	mu     sync.Mutex
	tokens float64
	last   time.Time
}
func newBucket() *Bucket { return &Bucket{tokens: 10, last: time.Now()} }
func Allow() bool {
	b := newBucket() // alloc per call
	b.mu.Lock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds()
	b.last = now
	ok := b.tokens >= 1
	if ok {
		b.tokens--
	}
	b.mu.Unlock()
	return ok
}
After
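A sketch of the pooled version this section describes. The `bucketPool` variable name and the reset-on-`Get` placement are assumptions; the important part is that every field is reset, because a pooled `Bucket` still carries the previous caller's state.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Bucket struct {
	mu     sync.Mutex
	tokens float64
	last   time.Time
}

// bucketPool recycles Bucket values; New runs only when the pool is empty.
var bucketPool = sync.Pool{New: func() any { return new(Bucket) }}

func Allow() bool {
	b := bucketPool.Get().(*Bucket)
	// Reset every field: a recycled Bucket holds whatever the last user left.
	b.tokens = 10
	b.last = time.Now()

	b.mu.Lock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds()
	b.last = now
	ok := b.tokens >= 1
	if ok {
		b.tokens--
	}
	b.mu.Unlock()

	bucketPool.Put(b) // b must not be touched after this point
	return ok
}

func main() {
	fmt.Println(Allow(), Allow())
}
```

As in the Before code, each call starts from a full bucket, so this only demonstrates the pooling mechanics, not a working rate limiter.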
`sync.Pool` for the bucket. The allocation goes away on the hot path; only cold paths allocate. ~3.5× faster, allocations eliminated.
**Why faster:** The 48 B allocation per call was the dominant cost. `sync.Pool.Get`/`Put` is ~10 ns per call on cache hits, well under the 100+ ns saved by skipping `mallocgc` plus the GC's eventual sweep work.
**Trade-off:** The example is contrived because each "request" makes a *fresh* bucket; a real rate limiter would key buckets by user/IP and share them. If buckets are shared, you don't need a pool; if they're per-call, the right fix is usually not to allocate them at all. Treat this as the "if you really must do per-call allocation, pool it" recipe. Every field must be reset when a bucket is reused, or state leaks from the previous request.
**When NOT:** Buckets carry state that must persist across calls; sharing one per key beats pooling. Allocation rate < 100k/s; pool overhead approaches allocation savings. Pooled object holds resources (file handles, goroutines) that need explicit cleanup; the pool's "may be dropped" semantics break that.
12. Exercise 11 — Atomic on contended small field
A ConnState struct has an atomic.Int32 field for status. Eight goroutines hammer it across the connection pool. The field shares a cache line with the next field (stats), so updates to either ping-pong the line — false sharing.
type ConnState struct {
	status atomic.Int32 // 4 B at offset 0
	stats  atomic.Int64 // 8 B at offset 8, same 64 B cache line
}
BenchmarkFalseSharing-8 3000000 320 ns/op 0 B/op 0 allocs/op // status+stats updated by different goroutines
After
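A sketch of the padded layout this section describes, using only the standard library. The `cacheLine` constant assumes 64 B lines (typical for amd64), and the `statsOffset`/`stateSize` helpers are hypothetical, added only to make the layout checkable.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

const cacheLine = 64 // assumption: 64 B cache lines, as on typical amd64 parts

type ConnState struct {
	status atomic.Int32
	_      [cacheLine - 4]byte // pushes stats a full line away from status
	stats  atomic.Int64
	_      [cacheLine - 8]byte // keeps whatever follows off the line holding stats
}

// statsOffset (hypothetical helper) reports where stats sits in the struct.
func statsOffset() uintptr {
	var c ConnState
	return unsafe.Offsetof(c.stats)
}

// stateSize (hypothetical helper) reports the padded struct size.
func stateSize() uintptr {
	var c ConnState
	return unsafe.Sizeof(c)
}

func main() {
	fmt.Println(statsOffset(), stateSize()) // prints "64 128" with cacheLine = 64
}
```

Because the two fields start 64 bytes apart, they cannot land on the same 64 B line regardless of where the struct is allocated.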
Pad to separate cache lines. The Go convention is a `_ [N]byte` filler sized to push the next field onto its own line. ~2.5× faster under cross-field contention.
**Why faster:** False sharing forces the cache line to bounce between cores even though they're updating *different* fields. Padding gives each field its own 64 B line; writes stay in the L1 of the writing core without invalidating the other core's copy. Visible in `perf stat -e cache-misses` as a sharp drop.
**Trade-off:** Memory cost: each padded field consumes 64 B instead of 4 or 8. For 1M struct instances, that's roughly 60 MB of wasted RAM per padded field. A hard-coded 64 is also the wrong line size on some ARM parts (Apple M-series uses 128 B lines; most Cortex-A cores use 64 B); use `cpu.CacheLinePad` from `golang.org/x/sys/cpu` for portability.
**When NOT:** Atomic field is only ever updated by one goroutine, so there is no contention to optimize. The two fields are always read/written together by the same goroutine, so false sharing doesn't apply. Memory is constrained (embedded, mobile) and contention is mild.
13. Exercise 12 — sync.Map.LoadOrStore allocating on the hit path
LoadOrStore(key, value) always evaluates the second argument, even when the key exists. For values that are expensive to construct (sessions, compiled regexes), you allocate per call regardless of cache state.
var sessions sync.Map
func GetOrCreate(id string) *Session {
	s := newSession(id) // expensive, called even on cache hit
	actual, _ := sessions.LoadOrStore(id, s)
	return actual.(*Session)
}
After
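A sketch of the `Load`-first pattern this section describes. The `Session` fields and the `created` counter are assumptions added so the effect is observable; `newSession` here is a cheap stand-in for the expensive constructor.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Session stands in for the cached value; its fields are an assumption.
type Session struct{ ID string }

var (
	sessions sync.Map     // map[string]*Session
	created  atomic.Int64 // counts newSession calls, for demonstration only
)

// newSession stands in for the expensive constructor.
func newSession(id string) *Session {
	created.Add(1)
	return &Session{ID: id}
}

func GetOrCreate(id string) *Session {
	// Hit path: no allocation and no constructor call.
	if v, ok := sessions.Load(id); ok {
		return v.(*Session)
	}
	// Miss path: build once, then let LoadOrStore pick a single winner in
	// case another goroutine stored the same key in the meantime.
	actual, _ := sessions.LoadOrStore(id, newSession(id))
	return actual.(*Session)
}

func main() {
	a := GetOrCreate("u1")
	b := GetOrCreate("u1")
	fmt.Println(a == b, created.Load()) // prints "true 1"
}
```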
Cheap `Load` first; allocate and call `LoadOrStore` only on a miss. The `LoadOrStore` after the miss handles the race where another goroutine stored the same key concurrently. ~18× faster on the hit path, allocations gone.
**Why faster:** `Load` is lock-free against the read-only `read` map in `sync.Map`. The hit path becomes a single map lookup with no allocation. The miss path still allocates one `Session` and may race with another goroutine; `LoadOrStore` resolves the race by returning the winner.
**Trade-off:** Slightly more code than `LoadOrStore` alone. On a miss, you may allocate a `Session` that is then discarded because the racing goroutine's value won: wasted work, but rare and bounded. If `newSession` has side effects (opens a file, registers a callback), discarding it leaks; structure init to be cheap and idempotent.
**When NOT:** Hit rate is low (< 50%): you'd allocate often anyway, and the double-check adds overhead. `newSession` is trivially cheap (e.g., struct literal): the unconditional allocation is < 20 ns. You need exact "store wins iff key absent" semantics where the loser's value must be observable; only a bare `LoadOrStore` gives that.
14. Exercise 13 — Excessive read-lock acquisition on config-style reads
A feature-flag service has 50k QPS of reads per server, with config reloads every 30 seconds. Each read takes RLock/RUnlock on a sync.RWMutex. The atomic add inside RLock is cheap when uncontended, but it writes the lock's cache line on every read, so that line bounces between cores.
type FlagService struct {
	mu    sync.RWMutex
	flags map[string]bool
}
func (s *FlagService) Enabled(name string) bool {
	s.mu.RLock(); defer s.mu.RUnlock()
	return s.flags[name]
}
func (s *FlagService) Reload(newFlags map[string]bool) {
	s.mu.Lock(); s.flags = newFlags; s.mu.Unlock()
}
BenchmarkRLockRead-8 30000000 42 ns/op 0 B/op 0 allocs/op
BenchmarkRLockReadParallel-8 8000000 180 ns/op 0 B/op 0 allocs/op // 8 goroutines
After
`atomic.Pointer[map[string]bool]`: the map is immutable once published; reload swaps the pointer atomically. Reads are a single atomic load with no fence on x86.
import "sync/atomic"
type FlagService struct {
	flags atomic.Pointer[map[string]bool]
}
func NewFlagService() *FlagService {
	s := &FlagService{}
	m := map[string]bool{}
	s.flags.Store(&m) // publish an empty map so Enabled never sees a nil pointer
	return s
}
// Enabled must treat the loaded map as read-only.
func (s *FlagService) Enabled(name string) bool {
	return (*s.flags.Load())[name]
}
// Reload takes ownership of newFlags; the caller must not modify it afterwards.
func (s *FlagService) Reload(newFlags map[string]bool) {
	s.flags.Store(&newFlags)
}
15. Exercise 14 — Heavy work inside critical section
A request handler computes an expensive JSON marshal while holding a mutex on a shared cache. Other goroutines block on the mutex for the entire marshal duration, even though they don't need the marshaled result.
type Cache struct {
	mu   sync.Mutex
	data map[string]*Entry
}
func (c *Cache) Snapshot(key string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	entry := c.data[key]
	if entry == nil {
		return nil, errNotFound
	}
	return json.Marshal(entry) // 50-500 us under lock
}
After
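A sketch of the narrowed critical section this section describes. It assumes `Entry` is immutable after insertion; the `Entry` fields, the `errNotFound` definition, and the `demo` helper are illustrative and not taken from the original.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

var errNotFound = errors.New("not found")

// Entry is treated as immutable after insertion (an assumption of this sketch).
type Entry struct {
	Name string `json:"name"`
	Hits int    `json:"hits"`
}

type Cache struct {
	mu   sync.Mutex
	data map[string]*Entry
}

func (c *Cache) Snapshot(key string) ([]byte, error) {
	c.mu.Lock()
	entry := c.data[key] // only the pointer fetch happens under the lock
	c.mu.Unlock()

	if entry == nil {
		return nil, errNotFound
	}
	return json.Marshal(entry) // the expensive work no longer blocks peers
}

// demo (hypothetical helper) builds a one-entry cache and snapshots it.
func demo() (string, error) {
	c := &Cache{data: map[string]*Entry{"a": &Entry{Name: "a", Hits: 3}}}
	b, err := c.Snapshot("a")
	return string(b), err
}

func main() {
	fmt.Println(demo()) // prints {"name":"a","hits":3} <nil>
}
```

If `Entry` can change after insertion, copy the fields the marshaler needs while the lock is still held and marshal the copy instead.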
Narrow the critical section: lock to fetch and copy the pointer, unlock, then marshal outside the lock. If `Entry` is mutable, deep-clone the fields the marshaler reads. ~5× faster under 8-way contention (single-threaded numbers are identical; the win is in not blocking peers).
**Why faster:** Holding the lock during marshal serializes 8 goroutines on a 50 µs operation, capping throughput at ~20k/s. Narrowing the critical section to a pointer fetch (~50 ns) means the lock is held 1000× less time, and goroutines marshal in parallel. The total work is the same; the *blocking* drops by 1000×.
**Trade-off:** `entry` must be safe to read outside the lock. If `Entry` has fields mutated by other goroutines (counters, last-updated), you race. Fixes: deep-clone inside the lock, mark `Entry` immutable post-insert, or wrap its mutable fields in atomics. With one of those fixes in place, the remaining failure mode is marshaled output that is stale by microseconds, usually acceptable for read snapshots.
**When NOT:** The work *must* happen atomically with the lookup (e.g., increment a counter then read a related field; the invariant breaks if a writer interleaves). The critical section is already short (< 1 µs). Single-threaded code where there's no peer to block.
16. When NOT to optimize
sync cost dominates only when a primitive is on the hot path of a high-frequency operation across multiple cores. If your mutex is taken once per HTTP request (a few thousand QPS), every optimization here is irrelevant: request-scoped locks in middleware, init-time singletons, occasional config reloads — sync cost is dwarfed by the work it guards.
Profile first. Sync overhead has four signatures in a CPU profile: runtime.semacquire/runtime.gopark → Ex. 1, 2, 3, 6, 7 (contention or wrong primitive); runtime.mallocgc on a pool-shaped path → Ex. 5 or 10 (over- or under-pooling); sync.(*Map).Store or sync.(*Map).LoadOrStore with high allocs/op → Ex. 4 or 12; atomic operations with no runtime frames around them that still cap at ~50 ns/op under 8-way parallelism → Ex. 11 (false sharing).
Common premature optimizations: sharded mutex (Ex. 2) on a map with < 1k QPS; atomic counter (Ex. 1) where the counter participates in invariants with other fields — atomics break the invariant; replacing channel-based signaling with mutexes (Ex. 6) where you actually needed select/cancellation; padding atomics (Ex. 11) on structs allocated by the million; pooling tiny values (Ex. 5) where stack allocation already costs zero; preempting sync.Once (Ex. 9) — it's already optimal.
Correctness gaps disguised as optimizations: atomic.Int64 replacing a mutex (Ex. 1) on a counter that's part of a multi-field invariant; sharded mutex (Ex. 2) where snapshot-across-shards forgot to lock in a stable order — deadlock under "snapshot during update"; mutex-protected map (Ex. 4) replacing sync.Map where the workload actually was read-mostly with stable keys; errgroup (Ex. 7) where workers don't honor ctx — first error doesn't cancel siblings; channel close (Ex. 8) where the channel is closed twice — panic; pool of buckets (Ex. 10) where reset forgot a field — leaked state from a previous request; atomic.Pointer for config (Ex. 13) where callers mutated the loaded map — data race invisible until production; narrowed critical section (Ex. 14) where the entry was mutated by a writer between unlock and marshal — torn read in the JSON output.
17. Summary
Always-ship wins (default in any new sync-touching code): atomic.Int64 over sync.Mutex for plain counters (Ex. 1); sync.Mutex over channel-as-mutex for plain mutual exclusion (Ex. 6); narrow critical sections — never hold a lock across I/O or expensive computation (Ex. 14); chan struct{} close for one-shot events (Ex. 8); errgroup with disjoint result indices for fan-out with error propagation (Ex. 7); trust sync.Once on hot paths (Ex. 9).
Wins behind a profile (when measurements justify them): sharded mutex (Ex. 2, when one mutex shows in contention profile under high QPS); switching RWMutex to Mutex (Ex. 3, when writes dominate); plain map + mutex over sync.Map (Ex. 4, when write-heavy with churn); pool tiny objects only if they escape (Ex. 5, after escape analysis confirms heap allocation); pool for genuinely allocation-heavy paths (Ex. 10, when allocator shows in pprof); cache-line padding on contended atomic fields (Ex. 11, when cache-misses perf counter spikes); Load then LoadOrStore over an unconditional LoadOrStore (Ex. 12, when the second arg is allocation-heavy); atomic.Pointer for read-mostly config (Ex. 13, when RLock contention shows in 8+ core profiles).
Specialty (only when the design calls for it): cache-line-padded shards combined with per-shard pools for high-throughput connection state; atomic.Pointer to copy-on-write maps for hot-path config / feature flags; sync.Cond only when waiters need predicate checks on each wakeup, otherwise prefer channels; lock-free ring buffers (golang.org/x/sys/cpu + careful atomics) for single-producer single-consumer paths where even mutex CAS shows in profiles.
Sync cost is contention, cache-line bouncing, primitive mismatch, and locked work. Strip those four from the hot path by choosing the right primitive: atomic for single-word state; mutex for mutual exclusion; channel for signaling and cancellation; sync.Map only for read-mostly stable keys; errgroup for fan-out with error propagation; sync.Pool for genuinely large per-call allocations. The package itself is cheap; the wins come from matching the primitive to the access shape. Profile, then pick the lever; the four signatures above tell you which one.