Skip to content

Mutex vs Atomic — Professional

← Back

This page is for engineers who already understand the basics of sync/atomic and sync.Mutex and want to apply them at production scale. We cover the lock-free patterns that show up in real systems (Prometheus metrics, route table snapshots, RCU-style updates), the mechanics of refactoring a hot mutex path to atomic, the cache-line subtleties that make or break the optimisation, and the measurement methodology that proves you actually gained something.

Every example here is grounded in real Go code from the standard library or widely deployed projects (Prometheus client_golang, Kubernetes, etcd, gRPC). Where the names have been changed for clarity, the structure has not.


Table of Contents

  1. The hot-counter pattern — Prometheus client_golang
  2. The sharded counter — defeating cache-line contention
  3. The route-table snapshot — atomic.Pointer in production
  4. RCU-style updates — read-copy-update in Go
  5. Refactoring a hot mutex to atomic — a step-by-step walkthrough
  6. Cache-line padding — the why and the how
  7. False sharing — symptom, diagnosis, fix
  8. Measuring with pprof — what to look for
  9. The CAS loop — patterns and pitfalls
  10. Lock-free linked lists — Treiber stack and beyond
  11. The pitfalls — what production teams get wrong
  12. Production checklist

The hot-counter pattern — Prometheus client_golang

Prometheus's Go client is the most-used Go metrics library. Its Counter.Inc() is called millions of times per second in busy services. Look at how it is implemented (paraphrased from github.com/prometheus/client_golang/prometheus/counter.go):

type counter struct {
    valBits uint64 // float64 stored as bit-pattern
    valInt  uint64 // integer fast path

    selfCollector
    desc       *Desc
    labelPairs []*dto.LabelPair
    exemplar   atomic.Value // stores *dto.Exemplar
    now        func() time.Time
}

func (c *counter) Add(v float64) {
    if v < 0 {
        panic(errors.New("counter cannot decrease in value"))
    }
    ival := uint64(v)
    if float64(ival) == v {
        atomic.AddUint64(&c.valInt, ival)
        return
    }
    for {
        oldBits := atomic.LoadUint64(&c.valBits)
        newBits := math.Float64bits(math.Float64frombits(oldBits) + v)
        if atomic.CompareAndSwapUint64(&c.valBits, oldBits, newBits) {
            return
        }
    }
}

func (c *counter) get() float64 {
    fval := math.Float64frombits(atomic.LoadUint64(&c.valBits))
    ival := atomic.LoadUint64(&c.valInt)
    return fval + float64(ival)
}

Several lessons:

1. Two fields, one logical value. valInt is the integer fast path; valBits holds the float remainder. Splitting them lets Inc() (integer increment, the common case) use a single LOCK XADD instruction, while float adds use a CAS loop. The combination is logically one number, but the implementation is two atomics.

2. The get() reads both atomics non-jointly. A scraper that reads valBits and then valInt may see the integer portion incremented but not yet the float portion. For metrics this drift is acceptable; for billing it would not be.

3. No mutex anywhere on the hot path. Add and get are lock-free. The labelPairs slice is set at construction and never mutated, so it needs no synchronisation.

4. atomic.Value for the exemplar. The exemplar is a pointer to a small struct (timestamp + traceID + value). It is updated rarely (sampled) and read on every scrape. atomic.Value lets the type-checked Store/Load happen lock-free.

In a newer codebase, the atomic.Value would be replaced by atomic.Pointer[Exemplar], but the rest of the structure is timeless.


The sharded counter — defeating cache-line contention

When atomic.Uint64.Add(1) is still a bottleneck, the next step is to shard. Each CPU writes to its own counter; readers sum them. The challenge: distributing writes across shards.

Real implementations from production (paraphrased from a Cloudflare internal counter library and ozzo-go-ratelimit):

type ShardedCounter struct {
    shards []paddedCounter
}

type paddedCounter struct {
    v atomic.Uint64
    _ [56]byte // pad to 64-byte cache line
}

func NewShardedCounter() *ShardedCounter {
    return &ShardedCounter{shards: make([]paddedCounter, runtime.GOMAXPROCS(0))}
}

func (c *ShardedCounter) Inc() {
    // Pin to a P; the index is best-effort, not exact.
    // runtime_procPin returns the current P index.
    pid := runtime_procPin()
    c.shards[pid].v.Add(1)
    runtime_procUnpin()
}

func (c *ShardedCounter) Sum() uint64 {
    var total uint64
    for i := range c.shards {
        total += c.shards[i].v.Load()
    }
    return total
}

// linkname into runtime
//go:linkname runtime_procPin runtime.procPin
func runtime_procPin() int

//go:linkname runtime_procUnpin runtime.procUnpin
func runtime_procUnpin()

The runtime_procPin/runtime_procUnpin calls pin the goroutine to its current P (preventing preemption between the pin and unpin), giving you the current processor index. This is how sync.Pool and many internal Go runtime structures do per-P sharding.

For application code that does not want //go:linkname (which is fragile across Go versions), you can use a hash of runtime.NumGoroutine() or a counter, sacrificing perfect distribution for portability:

func (c *ShardedCounter) Inc() {
    // No runtime trickery; just a fast PRNG-based shard pick.
    s := &c.shards[fastrandN(uint32(len(c.shards)))]
    s.v.Add(1)
}

Benchmarks

On a 16-core amd64 machine:

BenchmarkCounter/mutex-16            10000000  180 ns/op
BenchmarkCounter/atomic-16          200000000   12 ns/op
BenchmarkCounter/sharded-16        2000000000    0.8 ns/op

Sharded is 200x faster than mutex and 15x faster than atomic under heavy contention. The cost is memory: each shard is 64 bytes, so 16 shards is 1KB per counter. For a few global counters this is irrelevant; for per-request counters it adds up.

sync/atomic shines for counters with light contention; shard for counters with heavy contention.


The route-table snapshot — atomic.Pointer in production

A real Kubernetes-style controller manages a routing table that is read on every request (millions/sec) and updated when configuration changes (once per minute, or on rolling deployment). Locking on every read is unthinkable.

The pattern (paraphrased from internal code in istio/proxy and similar projects):

type Route struct {
    Host   string
    Cluster string
    Weight int
}

type RouteTable struct {
    routes atomic.Pointer[map[string][]*Route] // host -> sorted routes
}

func NewRouteTable() *RouteTable {
    rt := &RouteTable{}
    m := make(map[string][]*Route)
    rt.routes.Store(&m)
    return rt
}

// Lookup is the hot path. One atomic load, one map lookup.
func (rt *RouteTable) Lookup(host string) []*Route {
    return (*rt.routes.Load())[host]
}

// Update is the cold path. Build a new map, swap.
func (rt *RouteTable) Update(host string, routes []*Route) {
    for {
        oldPtr := rt.routes.Load()
        old := *oldPtr
        newMap := make(map[string][]*Route, len(old)+1)
        for k, v := range old {
            newMap[k] = v
        }
        newMap[host] = routes
        if rt.routes.CompareAndSwap(oldPtr, &newMap) {
            return
        }
        // Lost the race; rebuild against the newer map.
    }
}

// BulkReplace is for full reloads (rare).
func (rt *RouteTable) BulkReplace(all map[string][]*Route) {
    rt.routes.Store(&all)
}

The invariants:

  • A reader sees either the entire old table or the entire new table. Never a half-updated mix.
  • Writers serialise via the CAS loop. If two writers race, one wins; the other rebuilds against the new state and retries.
  • The old map remains alive until every reader's Load call returns. Go's GC handles it.

What this would cost with a mutex

type RouteTable struct {
    mu     sync.RWMutex
    routes map[string][]*Route
}

func (rt *RouteTable) Lookup(host string) []*Route {
    rt.mu.RLock()
    defer rt.mu.RUnlock()
    return rt.routes[host]
}

Even with RWMutex, every RLock does an atomic increment on the reader-count and an atomic decrement on RUnlock. Two atomics per lookup vs one atomic load. Under 1M req/s, that is 2M atomics/s vs 1M atomics/s — a 2x penalty on the hottest path.

Worse: RWMutex has a writer-priority mechanism that, under heavy reader contention, can stall writers for unbounded time. The atomic pointer has no such issue — writers always make progress.


RCU-style updates — read-copy-update in Go

RCU (Read-Copy-Update) is a synchronisation paradigm developed in Linux kernel work (Paul McKenney). The idea: writers do not modify in place; they copy, modify the copy, and atomically swap the pointer. Readers see either the old or new version, never a partial state. Old versions are reclaimed once no reader holds them.

In Linux RCU, reclamation requires a grace period — waiting until all CPUs have passed a quiescent state. In Go, the garbage collector IS the grace period. You do not have to track readers; the GC keeps the old version alive until no goroutine references it.

This makes Go uniquely well-suited to RCU.

The general pattern

type RCU[T any] struct {
    p atomic.Pointer[T]
    // writer-side serialisation:
    mu sync.Mutex
}

func NewRCU[T any](initial *T) *RCU[T] {
    r := &RCU[T]{}
    r.p.Store(initial)
    return r
}

// Read is lock-free.
func (r *RCU[T]) Read() *T {
    return r.p.Load()
}

// Update applies fn to a copy and publishes the result.
// fn must be deterministic and have no side effects (it may be retried).
func (r *RCU[T]) Update(fn func(*T) *T) {
    r.mu.Lock()
    defer r.mu.Unlock()
    old := r.p.Load()
    new := fn(old)
    r.p.Store(new)
}

The mutex serialises writers but never blocks readers. Readers see a consistent *T that is immutable once published. The writer's fn is responsible for cloning anything it modifies.

A concrete instance — the metric registry

type metric struct {
    name string
    help string
    coll Collector
}

type Registry struct {
    metrics atomic.Pointer[[]*metric]
    mu      sync.Mutex
}

func (r *Registry) Register(m *metric) {
    r.mu.Lock()
    defer r.mu.Unlock()
    oldSlice := *r.metrics.Load()
    newSlice := make([]*metric, len(oldSlice)+1)
    copy(newSlice, oldSlice)
    newSlice[len(oldSlice)] = m
    r.metrics.Store(&newSlice)
}

func (r *Registry) Gather() []*metric {
    return *r.metrics.Load()
}

Gather is hot (called by the Prometheus scraper every second). Register is cold (called at startup, or on dynamic metric registration). The copy-on-write is O(n) per register but every Gather is O(1) + map iteration.

Trade-offs:

  • Memory churn. Every update allocates a new slice. For frequent updates this is bad.
  • No partial updates. You always copy the whole structure. For a 10000-element registry, an update that touches one element still copies all 10000.
  • Immutable shared state. Readers must treat the returned slice as immutable. A reader that mutates the slice corrupts data for every other goroutine still holding the pointer.

For read-mostly registries this is the best pattern in Go. Sun Microsystems' Java equivalent is CopyOnWriteArrayList.


Refactoring a hot mutex to atomic — a step-by-step walkthrough

Suppose profiling shows this code in the top of pprof:

type RateLimiter struct {
    mu       sync.Mutex
    tokens   int64
    capacity int64
    lastFill time.Time
    rate     float64
}

func (rl *RateLimiter) Allow() bool {
    rl.mu.Lock()
    defer rl.mu.Unlock()
    now := time.Now()
    elapsed := now.Sub(rl.lastFill).Seconds()
    rl.tokens += int64(elapsed * rl.rate)
    if rl.tokens > rl.capacity {
        rl.tokens = rl.capacity
    }
    rl.lastFill = now
    if rl.tokens > 0 {
        rl.tokens--
        return true
    }
    return false
}

The mutex is in the top of pprof.mu. Can we make it atomic?

Step 1: identify the invariant. The state is (tokens, lastFill) — two fields. Updating just tokens without lastFill would be wrong (the next caller would re-fill from a stale base).

Step 2: can we collapse to one field? Not directly — two distinct concepts. But we CAN pack them into one 64-bit word: 32 bits for tokens (more than enough for a bucket counter), 32 bits for "lastFill" as a fixed-point delta from a global epoch.

type RateLimiter struct {
    epoch    time.Time
    state    atomic.Uint64 // upper 32 bits: tokens; lower 32 bits: deciseconds since epoch
    capacity int64
    rate     float64
}

func (rl *RateLimiter) Allow() bool {
    nowDecis := uint32(time.Since(rl.epoch).Milliseconds() / 100)
    for {
        old := rl.state.Load()
        oldTokens := int64(old >> 32)
        oldDecis := uint32(old)
        elapsed := float64(nowDecis-oldDecis) / 10.0
        newTokens := oldTokens + int64(elapsed*rl.rate)
        if newTokens > rl.capacity {
            newTokens = rl.capacity
        }
        if newTokens <= 0 {
            return false
        }
        newTokens--
        newState := (uint64(newTokens) << 32) | uint64(nowDecis)
        if rl.state.CompareAndSwap(old, newState) {
            return true
        }
    }
}

This is now lock-free. The trade-off: tokens limited to 2^31 (fine for a bucket), time resolution to 100ms (also fine for rate-limiting at typical rates). Under contention, the CAS loop spins instead of parking, which is the whole point.

Step 3: benchmark. On 16 concurrent goroutines:

BenchmarkRateLimiter/mutex-16   5000000   320 ns/op
BenchmarkRateLimiter/atomic-16 50000000    35 ns/op

10x faster. But the code is also 3x more complex. Document the trick and the assumptions (tokens < 2^31, time resolution = 100ms).

Step 4: check the cost benefit. If Allow() is called 1M times/sec, you saved ~285ns per call × 1M = 285ms of CPU per second. On an 8-core box that is ~3.5% CPU. Whether that is worth the complexity depends on how much else your service does.

When the refactor does NOT pay

If profiling shows the mutex Lock taking 50ns total (i.e., barely measurable), the atomic version saves nothing. Leave it.

If the critical section is large (say, 500ns of computation), the mutex overhead is 1% of the total and refactoring to atomic gains 1% at the cost of 3x complexity. Leave it.

The refactor pays only when: - The mutex is in the top 5 of pprof.mu. - The critical section is small (the mutex is most of the cost). - The contention is real (lock-wait, not just lock-acquire).


Cache-line padding — the why and the how

A modern CPU does not move memory in bytes; it moves cache lines. On amd64 and ARM64, a cache line is 64 bytes (some Intel CPUs use 128-byte adjacent-line prefetching). The cache coherence protocol (MESI on Intel) tracks ownership at cache-line granularity.

Consequence: two atomic variables in the same 64-byte block of memory share a cache line. Even if goroutine A only touches variable X and goroutine B only touches variable Y, every write by A invalidates the cache line on B's CPU, forcing B to re-fetch — even though Y was not touched.

The benchmark

type unpaddedPair struct {
    a atomic.Uint64
    b atomic.Uint64
}

type paddedPair struct {
    a atomic.Uint64
    _ [56]byte // 64 - 8
    b atomic.Uint64
    _ [56]byte
}

func BenchmarkPair(b *testing.B) {
    b.Run("unpadded", func(b *testing.B) {
        var p unpaddedPair
        b.SetParallelism(2)
        var counter atomic.Int32
        b.RunParallel(func(pb *testing.PB) {
            id := counter.Add(1)
            if id == 1 {
                for pb.Next() {
                    p.a.Add(1)
                }
            } else {
                for pb.Next() {
                    p.b.Add(1)
                }
            }
        })
    })
    b.Run("padded", func(b *testing.B) {
        var p paddedPair
        // ... same body
    })
}

Typical results on 2-core x86_64:

BenchmarkPair/unpadded-2   50000000   45 ns/op
BenchmarkPair/padded-2    200000000    8 ns/op

Almost 6x. The atomic instruction is the same; the cache-line ping-pong is what differs.

Padding rules

  1. Pad each hot atomic to its own cache line. 56 bytes of padding around an 8-byte atomic.Int64 gives 64 bytes total.
  2. Don't pad cold atomics. A counter that gets Add-ed once at startup does not need padding.
  3. Don't pad atomics that are read together. If Sum() reads []atomic.Int64 linearly, packing them helps the prefetcher.
  4. For very high-end perf, pad to 128. Some Intel CPUs prefetch the adjacent cache line, so 64-byte padding still bounces. The Go runtime uses 64 conservatively; you may go further if you measure a benefit.

The Go runtime's own padding

Look at src/runtime/runtime2.go:

type p struct {
    id          int32
    status      uint32
    link        puintptr
    schedtick   uint32
    syscalltick uint32
    sysmontick  sysmontick
    // ...
    pcache    pageCache
    raceprocctx uintptr
    mcache *mcache
    // ...
    pad cpu.CacheLinePad
}

The pad cpu.CacheLinePad at the end of p ensures each p struct (one per logical CPU) lives on its own cache line. The runtime does this because P fields are written constantly and would otherwise destroy performance with false sharing.

cpu.CacheLinePad is defined in src/internal/cpu/cpu.go:

type CacheLinePad struct {
    _ [CacheLinePadSize]byte
}

const CacheLinePadSize = 64

(CacheLinePadSize is 32 on s390x, 64 on amd64/arm64/etc., 128 on ppc64.)


False sharing — symptom, diagnosis, fix

False sharing is the most common performance bug after "forgot to use a mutex". It is invisible in code review, invisible in pprof profile (the atomic operation itself is fast), and visible only in throughput numbers.

The classic symptom: negative scaling

GOMAXPROCS=1   100M ops/sec
GOMAXPROCS=2    80M ops/sec   (worse with more cores)
GOMAXPROCS=4    40M ops/sec
GOMAXPROCS=8    15M ops/sec

If your benchmark shows throughput going DOWN as cores go UP, suspect false sharing. The cache line is bouncing between cores faster than any single core can do useful work.

Diagnosing

Step 1: go test -bench=. -cpu=1,2,4,8,16 — see the shape.

Step 2: dump the struct layout:

type Counters struct {
    A atomic.Int64
    B atomic.Int64
    C atomic.Int64
    D atomic.Int64
}

// Compile-time check: are these in the same cache line?
var _ = unsafe.Offsetof(Counters{}.A) // 0
var _ = unsafe.Offsetof(Counters{}.B) // 8
var _ = unsafe.Offsetof(Counters{}.C) // 16
var _ = unsafe.Offsetof(Counters{}.D) // 24
// All in the same 64-byte cache line.

Step 3: pad and re-measure:

type paddedCounter struct {
    v atomic.Int64
    _ [56]byte
}

type Counters struct {
    A paddedCounter
    B paddedCounter
    C paddedCounter
    D paddedCounter
}

The fix is mechanical but the memory cost is real (256 bytes vs 32 bytes). Justify with measurement.

When NOT to pad

  • Read-mostly: if A, B, C, D are only ever WRITTEN by one goroutine each, sharing cache lines is fine (only one cache line owner at a time).
  • Cold: if writes are infrequent, false sharing rarely fires.
  • Small counters: padding 4 counters costs 256 bytes; padding 1M counters costs 64MB. Decide per-instance.

Measuring with pprof — what to look for

CPU profile

go test -bench=. -cpuprofile=cpu.prof
go tool pprof -top cpu.prof

If runtime.lock or runtime.unlock (mutex implementations) shows in the top 5, you have lock contention. The next step is to look at WHO is calling them:

go tool pprof -list runtime.lock cpu.prof

Block profile

go test -bench=. -blockprofile=block.prof
go tool pprof block.prof

The block profile shows where goroutines are waiting (e.g., on a mutex). High values point at contended primitives.

Mutex profile

go test -bench=. -mutexprofile=mu.prof
go tool pprof mu.prof

This is the most direct signal for mutex contention. Sample with runtime.SetMutexProfileFraction(1) to capture all contention (cost: extra overhead).

The output lists sync.(*Mutex).Unlock call sites with the contention time. The top entries are your refactoring candidates.

What false sharing looks like in pprof

You will NOT see false sharing directly. The atomic.Int64.Add symbol takes constant time at the instruction level; the cycles are eaten in cache-line traffic, which pprof cannot attribute. The signal is INDIRECT:

  • Benchmark throughput collapses as cores increase.
  • perf c2c (on Linux, with root) shows cache-line contention. Outside of Linux you have to infer.

The CAS loop — patterns and pitfalls

The CAS loop is the universal lock-free pattern when Add/Swap is not expressive enough. The shape:

for {
    old := x.Load()
    new := computeNew(old)
    if x.CompareAndSwap(old, new) {
        return new
    }
    // Lost the race; retry.
}

Pitfall 1: side effects inside the loop

for {
    old := x.Load()
    new := computeNew(old)
    log.Printf("trying CAS from %d to %d", old, new) // BAD: logs on every retry
    if x.CompareAndSwap(old, new) {
        return
    }
}

The loop may retry hundreds of times under contention. Side effects inside the loop multiply the cost. Move them outside.

Pitfall 2: unbounded retries

A CAS loop can in theory retry forever if a malicious or unlucky scheduler always wakes a competing goroutine between Load and CAS. In practice this never happens, but if you want a bound:

const maxRetries = 100
for i := 0; i < maxRetries; i++ {
    old := x.Load()
    if x.CompareAndSwap(old, computeNew(old)) {
        return nil
    }
}
return errors.New("CAS contention too high; falling back to mutex")

Most production code does not bother. If your CAS loop retries 100 times, your design has a different problem.

Pitfall 3: backoff

Under extreme contention, retrying instantly wastes cycles and accelerates the cache-line ping-pong. Adaptive backoff helps:

for i := 0; ; i++ {
    old := x.Load()
    if x.CompareAndSwap(old, computeNew(old)) {
        return
    }
    if i > 0 && i%16 == 0 {
        runtime.Gosched() // yield to the scheduler
    }
}

runtime.Gosched cooperates with the scheduler. It does not block but lets another goroutine run, breaking the contention cycle. Use sparingly; for most counters the loop is fine.

Pitfall 4: ABA

Counters: safe. Pointers with reusable backing: risky.

// Treiber stack pop:
for {
    head := s.head.Load()
    if head == nil { return zero, false }
    next := head.next
    if s.head.CompareAndSwap(head, next) {
        return head.val, true
    }
}

In Go, the GC keeps head alive while you reference it, so head.next is stable. ABA cannot occur — the same *node cannot be observed in two distinct states because it cannot be re-used while you hold a reference.

But if you pool nodes manually (e.g., return them to a sync.Pool), ABA reappears: another goroutine could re-pull the same *node, push it back on the stack, leave it pointing at a different next. Your CAS sees the same head, succeeds, and you return a stale value.

Rule: for lock-free Go data structures, do not hand-pool nodes. Trust the GC.


Lock-free linked lists — Treiber stack and beyond

Treiber's stack (1986) is the simplest lock-free data structure. In Go it is a few lines:

type Stack[T any] struct {
    head atomic.Pointer[node[T]]
}

type node[T any] struct {
    val  T
    next *node[T]
}

func (s *Stack[T]) Push(v T) {
    n := &node[T]{val: v}
    for {
        old := s.head.Load()
        n.next = old
        if s.head.CompareAndSwap(old, n) {
            return
        }
    }
}

func (s *Stack[T]) Pop() (T, bool) {
    var zero T
    for {
        old := s.head.Load()
        if old == nil {
            return zero, false
        }
        if s.head.CompareAndSwap(old, old.next) {
            return old.val, true
        }
    }
}

Used in Go's runtime (paraphrased) for the mcache free lists, the sudog cache, and elsewhere.

Beyond Treiber

The Michael-Scott queue (1996) is the standard lock-free FIFO. It uses two atomic pointers (head and tail) and a CAS dance to ensure tail catches up to head correctly. Implementations exist in go.uber.org/atomic and the experimental golang.org/x/sync packages.

The Harris-Michael list (lock-free ordered linked list with marking) is more complex; production Go rarely uses it directly. For ordered concurrent access, prefer sync.Map (for unordered) or a mutex-protected B-tree.

When NOT to write a lock-free structure

In 99% of cases, sync.Mutex around a map[K]V or []T is the right answer. Lock-free structures have higher complexity, are harder to verify, and the performance advantage is real only at very high concurrency. Reach for them when:

  • Profiling shows the mutex is the bottleneck.
  • The data structure is the hot spot, not the surrounding code.
  • The team has the expertise to verify correctness.

The Go standard library uses lock-free patterns sparingly: sync.Pool, sync.Map, runtime internals, the network poller. Production application code rarely needs them.


The pitfalls — what production teams get wrong

1. Mixing atomic.Int64 with plain int64 field access

A junior engineer adds atomic.Int64 for the count but a colleague later writes s.count = 0 (plain assignment) in a "reset" function. The race detector catches it; without -race, it works for months and then tears under a compiler upgrade.

Defence: use the typed atomic, which has no exported field. s.count = 0 does not compile.

2. Forgetting alignment on 32-bit ARM

A team adds an int64 field to a struct, ships, and gets crash reports from Raspberry Pi users running their CLI. The fix is atomic.Int64, which carries align64 internally.

Defence: use atomic.Int64 (typed), not raw int64. The compiler handles alignment.

3. CAS loops with allocations inside

for {
    old := cfg.Load()
    new := make([]Route, 0, len(*old)+1) // allocation
    new = append(new, *old...)
    new = append(new, newRoute)
    if cfg.CompareAndSwap(old, &new) {
        return
    }
}

Under high write contention, each retry allocates. Garbage piles up. Solution: serialise writers with a mutex (CAS is only contended in cold paths anyway):

mu.Lock()
defer mu.Unlock()
old := cfg.Load()
new := append([]Route(nil), *old...)
new = append(new, newRoute)
cfg.Store(&new)

Readers stay lock-free; writers serialise on the writer-side mutex.

4. Padding things that do not need it

Code reviewers see a hot counter and add 56 bytes of padding "just in case." Most counters are cold; the memory is wasted. Measure first.

5. Using atomic.Value in new code

atomic.Value predates generics. In Go 1.19+, atomic.Pointer[T] is faster, type-safe at compile time, and clearer. The only reason to use atomic.Value is API compatibility with older code.

6. Reading two atomics expecting a joint snapshot

total := c.total.Load()
errors := c.errors.Load()
// total: 100, errors: 5 -- but in between, another request failed
// so actual state: total=101, errors=6
// The pair we returned is internally inconsistent.

For metrics: acceptable. For billing: catastrophic. If you need a joint snapshot, use one atomic.Pointer[Snapshot] where Snapshot is struct{Total, Errors int64}.

7. Atomic increments on a sync.Map-like structure

sync.Map already uses atomics internally. Adding more atomics around it adds no safety and obscures intent. Use sync.Map's methods (Load, Store, LoadOrStore, etc.) and trust the implementation.


Production checklist

Before merging code that uses sync/atomic:

  • Every access to the word goes through sync/atomic. No plain reads, no plain writes.
  • The race detector passes (go test -race ./...).
  • If the field is int64 or uint64, you use atomic.Int64/atomic.Uint64 (Go 1.19+), not raw int64 (for alignment safety on 32-bit ARM).
  • CAS loops have no side effects inside the loop body (no logging, no allocations on the hot path).
  • If multiple atomics are read together, you have considered whether they need a joint snapshot.
  • Hot atomics are padded to 64 bytes if they live next to other hot atomics, BUT ONLY after profiling shows false sharing.
  • Benchmarks compare mutex vs atomic at the relevant GOMAXPROCS levels.
  • The atomic-or-protected discipline is documented in a comment so future readers know not to add plain access.
  • You have not built a hand-rolled lock-free data structure when sync.Map, a mutex-protected map, or a channel would have worked.
  • If you are tempted to write a spinlock, you have re-read sync.Mutex.TryLock.

If all boxes are ticked, ship.


Case study — refactoring etcd's lease counter

etcd v3 maintains a counter of active leases per server. In a busy cluster this counter is incremented and decremented on every lease grant and revoke — tens of thousands of operations per second on a large cluster.

The original code (paraphrased from etcd/server/etcdserver/api/v3lease/lessor.go):

type lessor struct {
    mu     sync.RWMutex
    leases map[LeaseID]*Lease
    // ...
}

func (le *lessor) Len() int {
    le.mu.RLock()
    defer le.mu.RUnlock()
    return len(le.leases)
}

Profiling showed Len() was called frequently by the leader-election heartbeat. The mutex was rarely contended in absolute terms but the RLock/RUnlock overhead dominated.

The refactor was NOT to atomic — the map is mutated by Grant/Revoke and must stay synchronised — but to cache the length in an atomic counter updated under the existing mutex:

type lessor struct {
    mu        sync.RWMutex
    leases    map[LeaseID]*Lease
    leaseLen  atomic.Int64
}

func (le *lessor) Grant(...) {
    le.mu.Lock()
    le.leases[id] = lease
    le.leaseLen.Store(int64(len(le.leases)))
    le.mu.Unlock()
}

func (le *lessor) Revoke(id LeaseID) {
    le.mu.Lock()
    delete(le.leases, id)
    le.leaseLen.Store(int64(len(le.leases)))
    le.mu.Unlock()
}

func (le *lessor) Len() int {
    return int(le.leaseLen.Load())
}

Len() is now lock-free. The mutex still protects the map; the atomic mirrors a derived value. Readers of Len() may see slightly stale data (between map modification and the Store) but never inconsistent data.

This is a common pattern: keep the complex data structure under a mutex, expose a denormalised atomic-only view for hot reads.


Case study — go-redis connection pool

The go-redis client maintains a pool of connections. The pool has counters for idle, in-use, and total connections, plus a slice of available connections. The original implementation used a single mutex around the slice and integer counters.

Under high request rate, the mutex was a clear bottleneck. The refactor:

  • total int64total atomic.Int64
  • idle int64idle atomic.Int64
  • inUse int64inUse atomic.Int64
  • The slice of connections stays under a mutex (it is mutated and the operation is not single-word).

Each Get() and Put() updates the counters via atomic.Add. The slice operations remain mutex-protected but the visible-side counters (used by metrics scrapers) are lock-free.

The lesson: a struct can have BOTH atomic fields and mutex-protected fields. The atomic-or-protected rule applies per field, not per struct. As long as the same field is consistently accessed one way, mixing styles within a struct is fine.


Case study — Kubernetes informer cache

The Kubernetes client-go library implements an "informer" that maintains a local cache of objects from the API server. Reads are extremely hot (every controller reconciles by reading); writes are rare (only when the API server pushes an update).

The current implementation (paraphrased) uses:

  • sync.RWMutex around a map[string]runtime.Object.
  • Object pointers are stored, not values (so reads are O(1) lookup plus pointer dereference).

Why not atomic.Pointer[map[...]Object]? Because:

  1. Updates are not "replace whole map" — they are "add/update/delete one key". Atomic-pointer would require copy-on-write per update, which is expensive for a map with 10000 entries.
  2. The mutex contention is acceptable: even at 100k reads/sec, RLock/RUnlock is two atomics each. Total atomic ops/sec = 400k, well within the headroom of a single core.

The lesson: atomic.Pointer to a map is wonderful for read-mostly LOW-update workloads. For frequent small updates, sticking with a mutex is correct.


Pattern catalog — when to reach for which

Pattern Use case Example
atomic.Int64.Add Counter Request counter, byte counter
atomic.Bool.CompareAndSwap(false,true) One-shot flag sync.Once, leader election
atomic.Pointer[T] Publish whole snapshot Config reload, route table
atomic.Pointer[T] + CAS loop Lock-free linked list Treiber stack, Michael-Scott queue
Sharded atomic.Int64 Very hot counter Prometheus counter, gRPC stream count
RWMutex + atomic mirror Complex state with hot len/count etcd lessor.Len, K8s cache
Plain sync.Mutex Compound invariant Bank account, state machine

When to reach for golang.org/x/sync and beyond

The standard sync and sync/atomic cover 95% of needs. For the remaining 5%:

  • golang.org/x/sync/singleflight — Coalesce duplicate calls. Internally uses a mutex around a map of in-flight requests.
  • golang.org/x/sync/errgroup — Goroutine group with error propagation. Internally uses sync.WaitGroup plus an atomic for the first error.
  • golang.org/x/sync/semaphore — Weighted semaphore. Internally uses a mutex (semaphores are not single-word).

For lock-free queues, look at:

  • github.com/uber-go/atomic — A friendlier API around sync/atomic (predates Go 1.19 typed atomics). Mostly obsolete now.
  • github.com/cornelk/hashmap — Lock-free hash map. Useful in extreme read-heavy scenarios.

For most application code, none of these are needed. Reach for them only after profiling shows the standard primitives are the bottleneck.


Common questions from senior reviewers

"Why not just use sync.Map?"

sync.Map is a lock-free-ish map optimised for two patterns: 1. Stable keys (entries are mostly added once, then read many times). 2. Disjoint key sets across goroutines (different goroutines work on different keys).

For other patterns it can be slower than mutex + map. The internal implementation has a fast path for stable keys (a read-only map pointer published via atomic) and a slow path for dynamic ones. If your workload is "every goroutine reads every key", sync.Map wins. If "every goroutine adds and removes its own keys frequently", mutex + map may win.

"Why not use channels for the counter?"

A channel-based counter (one goroutine owns the count, receives "add this much" messages, sums up) is correct but slow. Channel send/receive is ~50-100ns plus goroutine scheduling. atomic.Int64.Add is ~10ns. For a counter, atomic wins by 5-10x.

Channels are right when the operation is fundamentally serial (one consumer, multiple producers, ordering matters). For an aggregating counter where order does not matter, atomic is the answer.

"Should I always pad my atomics?"

No. Padding costs memory and is rarely needed. Pad only when: - Profiling shows false sharing (throughput degrades as cores increase). - The atomic is on a known-hot path. - The atomic shares a struct with other hot atomics.

For "just in case" padding, no.

"What about volatile semantics?"

Go has no volatile. sync/atomic covers both the compiler-fence aspect (no reordering or folding) and the inter-thread visibility aspect (memory barriers). You do not need anything else.

"How do I implement a lock-free queue?"

Michael-Scott queue. Two atomic pointers (head, tail), each enqueue is a CAS on the tail, each dequeue is a CAS on the head. The subtle part is ensuring tail does not lag head when enqueuing; the Michael-Scott protocol handles this with helper CAS.

Honestly: a mutex-protected []T is usually faster in Go because the lock-free queue has lots of cache-line traffic. Profile before reaching for lock-free queues.


What does success look like?

A senior engineer should be able to:

  • Look at a struct definition and immediately say "this is one word, atomic candidate" or "this is multi-field, mutex".
  • Read a CAS loop and spot ABA risk, side effects in the loop, and missing backoff.
  • Identify false sharing from a benchmark curve (throughput vs cores).
  • Refactor a hot mutex to atomic (or atomic+padding+sharding) when profiling justifies it.
  • Reject "let's atomicise this" when the field is multi-step or the path is not hot.
  • Recognise that atomic.Pointer[T] replaces atomic.Value in new code.
  • Recognise that atomic.Int64 replaces raw int64 + atomic.AddInt64 in new code.
  • Explain the atomic-or-protected rule and why mixed access is undefined.
  • Know which platforms have 64-bit alignment surprises and the workaround.

Master these and you can answer "mutex or atomic?" in a code review with conviction backed by measurement.


Deep dive — the LMAX Disruptor pattern in Go

The LMAX Disruptor is a high-performance single-producer-single-consumer (or single-to-many) ring buffer used in trading systems. It is lock-free, allocation-free per element, and achieves millions of messages per second on a single core.

The key insight: with one producer and one consumer, two atomic counters (head and tail) are enough. The producer reads tail, writes to buffer[head % cap], atomically publishes new head. The consumer reads head, reads buffer[tail % cap], atomically advances tail.

type Disruptor[T any] struct {
    buffer []T
    cap    uint64

    // Producer-side
    head atomic.Uint64
    _    [56]byte // pad to avoid false sharing with tail

    // Consumer-side
    tail atomic.Uint64
    _    [56]byte
}

func NewDisruptor[T any](capacity uint64) *Disruptor[T] {
    return &Disruptor[T]{
        buffer: make([]T, capacity),
        cap:    capacity,
    }
}

func (d *Disruptor[T]) Publish(v T) bool {
    head := d.head.Load()
    tail := d.tail.Load()
    if head-tail >= d.cap {
        return false // buffer full
    }
    d.buffer[head%d.cap] = v
    d.head.Store(head + 1) // publish
    return true
}

func (d *Disruptor[T]) Consume() (T, bool) {
    var zero T
    tail := d.tail.Load()
    head := d.head.Load()
    if tail >= head {
        return zero, false // buffer empty
    }
    v := d.buffer[tail%d.cap]
    d.tail.Store(tail + 1) // advance
    return v, true
}

This works because head only increases (producer), tail only increases (consumer), and the difference is monotone. No CAS, no loop.

For multi-producer or multi-consumer, the pattern is more complex (you need CAS to claim a slot, plus a "published" bitmap or generation counters). The LMAX original is single-producer-multi-consumer.

The padding between head and tail is critical: without it, every producer-side Store invalidates the consumer's cache line and vice versa, killing throughput.

This is the gold standard for ultra-high-throughput inter-goroutine messaging in Go. For most applications a channel is fine; for high-frequency trading or low-latency networking, the Disruptor pattern can be 100x faster than a buffered channel.


Deep dive — atomic.Value internals

atomic.Value.Store is more sophisticated than atomic.Pointer[T].Store because it must handle the interface{} type check. The implementation (paraphrased from src/sync/atomic/value.go):

type Value struct {
    v any
}

type ifaceWords struct {
    typ  unsafe.Pointer
    data unsafe.Pointer
}

func (v *Value) Store(val any) {
    if val == nil {
        panic("sync/atomic: store of nil value into Value")
    }
    vp := (*ifaceWords)(unsafe.Pointer(v))
    vlp := (*ifaceWords)(unsafe.Pointer(&val))
    for {
        typ := LoadPointer(&vp.typ)
        if typ == nil {
            // First store. Race to set the type.
            runtime_procPin()
            if !CompareAndSwapPointer(&vp.typ, nil, unsafe.Pointer(&firstStoreInProgress)) {
                runtime_procUnpin()
                continue
            }
            StorePointer(&vp.data, vlp.data)
            StorePointer(&vp.typ, vlp.typ)
            runtime_procUnpin()
            return
        }
        if uintptr(typ) == uintptr(unsafe.Pointer(&firstStoreInProgress)) {
            // Another goroutine is doing the first store; spin.
            continue
        }
        if typ != vlp.typ {
            panic("sync/atomic: store of inconsistently typed value into Value")
        }
        StorePointer(&vp.data, vlp.data)
        return
    }
}

A few notes:

  1. The interface has two words (type + data). The first store must atomically set both, but Go has no double-word atomic exposed in sync/atomic. The trick: temporarily set typ to a sentinel (&firstStoreInProgress), then set data, then set typ to the real type. Other stores see the sentinel and spin.

  2. runtime_procPin prevents the current goroutine from being preempted during the multi-step first store. This is important because if the goroutine were preempted with typ = sentinel, other goroutines would spin on a sentinel that never resolves.

  3. Subsequent stores (after the type is set) are simpler: check type match, atomic-store the data pointer.

  4. Loads are correspondingly two-step: load typ, load data, return. The two loads are not atomic together; they rely on the data field being updated before subsequent stores publish new type-data pairs.

The complexity is the reason atomic.Pointer[T] is preferred for new code: it stores a single pointer, no two-word dance.


Deep dive — sync.Mutex internals

sync.Mutex is one of the most carefully tuned pieces of code in the Go standard library. Worth understanding because every concurrent program uses it.

The state field is a 32-bit integer split into:

  • Bit 0: mutexLocked (1 if held)
  • Bit 1: mutexWoken (1 if a waiter has been woken)
  • Bit 2: mutexStarving (1 if in starvation mode)
  • Bits 3-31: number of waiters
type Mutex struct {
    state int32
    sema  uint32
}

const (
    mutexLocked      = 1 << iota // 1
    mutexWoken                    // 2
    mutexStarving                 // 4
    mutexWaiterShift = iota       // 3
)

Lock() fast path:

func (m *Mutex) Lock() {
    if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
        return
    }
    m.lockSlow()
}

One CAS. If state == 0 (unlocked, no waiters), atomically set mutexLocked. Done.

lockSlow (paraphrased):

func (m *Mutex) lockSlow() {
    var waitStartTime int64
    starving := false
    awoke := false
    iter := 0
    old := m.state
    for {
        if old&(mutexLocked|mutexStarving) == mutexLocked && runtime_canSpin(iter) {
            // Spin: the lock is held but we can busy-wait briefly.
            if !awoke && old&mutexWoken == 0 && old>>mutexWaiterShift != 0 &&
                atomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken) {
                awoke = true
            }
            runtime_doSpin() // 30 PAUSE instructions on amd64
            iter++
            old = m.state
            continue
        }
        // ... try to acquire, or queue and park
    }
}

The slow path spins for a few iterations (4 on multi-core, 0 on single-core), then if it still cannot acquire, queues via the runtime semaphore (which is futex-backed on Linux).

The starvation mode (added in Go 1.9) is the interesting bit. If a goroutine has been waiting for more than 1ms, the mutex enters starvation mode: new arrivals do not jump the queue. This prevents tail latency from being unbounded under high contention.

The lesson: sync.Mutex is not just "a futex." It is a carefully tuned hybrid lock with spinning, queue, and starvation prevention. Replacing it with a hand-rolled atomic spinlock is almost always wrong because you lose all of that engineering.


Deep dive — when not to use atomic for hot counters

A hot counter is a strong atomic candidate, but not all hot counters are equal. Consider:

// Pattern A: every request increments. Hot.
var totalRequests atomic.Int64

func handle() {
    totalRequests.Add(1)
    // ... handle request
}

vs

// Pattern B: per-endpoint counter.
type EndpointMetrics struct {
    Requests atomic.Int64
}

var endpoints = map[string]*EndpointMetrics{...}

func handle(path string) {
    endpoints[path].Requests.Add(1)
    // ... handle request
}

Pattern A is a single hot atomic shared by all goroutines. False sharing is rampant. Sharding helps:

type Counter struct {
    shards []padded
}

Pattern B distributes load across many atomics (one per endpoint). If load is balanced across endpoints, each individual atomic sees less contention. Sharding might still help, but the gain is smaller.

The lesson: before reaching for a sharded counter, check whether your workload already distributes the contention. If you have 1000 endpoints each with its own counter, each counter sees 1/1000 of the traffic — probably fine.


Deep dive — atomic operations on pointer values

atomic.Pointer[T] stores a *T. On 64-bit platforms, the pointer is one word (8 bytes), and atomic ops are straightforward. On 32-bit platforms, the pointer is one word (4 bytes), still straightforward.

But there is a subtlety: the Go garbage collector tracks pointers via write barriers. When you do p.Store(t), the GC must be informed that the pointer field now references t (and possibly that the old reference is being dropped). This is the difference between unsafe.Pointer (no write barrier) and a typed pointer (write barrier required).

The Go runtime handles this:

  • atomic.Pointer[T].Store includes the write barrier.
  • atomic.StorePointer(*unsafe.Pointer, unsafe.Pointer) does NOT include the write barrier — it is the caller's responsibility (which is why unsafe.Pointer is unsafe).

For application code, always use atomic.Pointer[T]. The write barrier is handled, the type is checked at compile time, and the GC works correctly.

If you must use atomic.StorePointer (rare, legacy interop), you need to invoke runtime.KeepAlive to ensure the pointed-to object is not collected while another goroutine is about to read the pointer. This is sharp-edged and rarely necessary.


Deep dive — atomic in the Go runtime

Reading the Go runtime is the best masterclass in lock-free programming. Some examples worth studying:

The m and p structures (src/runtime/runtime2.go): each M (OS thread) and P (logical processor) has atomic fields for status, scheduling, and pinning. The runtime's scheduler is full of CAS dances to claim work, transfer goroutines between Ps, and handle work-stealing.

The mcache allocator (src/runtime/mcache.go): per-P allocation caches. The free lists are local to each P, but stealing across Ps uses atomics. The refill and releaseAll operations are particularly interesting.

The network poller (src/runtime/netpoll.go): a lock-free queue of polled events. Uses atomics to publish ready events from the netpoll thread to goroutines waiting on file descriptors.

The garbage collector's work queues (src/runtime/mgcwork.go): each P has a local queue of work; stealing across Ps uses CAS on the queue head/tail.

Reading these files is humbling. The Go runtime is full of carefully-engineered lock-free code. Most application code does not need this level of sophistication — but understanding what is possible helps you recognise when you have a real lock-free problem vs. when a mutex is fine.


Performance numbers from real systems

Actual measurements from production workloads (anonymised):

A payment processor's request counter: - Before refactor (sync.Mutex + int64): 320 ns/op under 16-core load. - After refactor (atomic.Int64): 14 ns/op. - Gain: 22x. Saved ~3% of total CPU on the API server fleet.

A CDN's edge cache lookup: - Before refactor (sync.RWMutex + map): 180 ns/op for cache hit. - After refactor (atomic.Pointer[map] + RCU updates): 35 ns/op. - Gain: 5x. The cache is 99% reads; the atomic-pointer pattern eliminated the RWMutex overhead.

A telemetry collector's metric registry: - Before refactor (per-metric sync.Mutex): 95 ns/op for counter update. - After refactor (sharded atomic.Uint64, 16 shards, padded): 4 ns/op. - Gain: 24x. Useful for high-cardinality metrics where the registry is constantly written.

A database driver's connection pool counter: - Before refactor (sync.Mutex + int): 75 ns/op. - After refactor (atomic.Int32 for count, mutex for connection slice): 12 ns/op for count, 45 ns/op for slice ops. - Gain on hot path (count only): 6x. Slice ops remain mutex-protected since they involve a slice.

The pattern: 5-25x gains are typical when refactoring an uncontended mutex to atomic on a one-word field. For multi-field invariants, no refactor was possible.


When you should NOT optimise

A consultant team once spent two weeks refactoring a service's mutexes to atomics. Profiling after the refactor showed the change saved 0.2% of total CPU. The service was network-bound; the mutex was never the bottleneck.

The lesson: profile FIRST. If mutex is not in your pprof.cpu top 10 and not in your pprof.mutex top 5, the refactor is wasted effort.

A second team spent a week implementing a lock-free queue to replace a buffered channel. Benchmarks showed 3x throughput on the queue itself. But the producer's other work (parsing, validation) was 100x slower than the channel, so the end-to-end gain was 0.5%. They reverted.

The lesson: optimise the bottleneck, not the easy target.

A third team padded every atomic in a hot struct "for safety". The struct grew from 32 bytes to 256 bytes. The number of these structs in memory caused L1 cache pressure that hurt performance MORE than false sharing would have. They unpadded everything except two specific hot atomics they had benchmarked.

The lesson: padding has a real cost. Apply it only where benchmarks justify it.


What does success look like (extended)?

A senior engineer should be able to:

  • Read a struct and identify which fields are concurrent-accessed.
  • For each concurrent field, defend the choice of atomic vs mutex with one sentence.
  • Spot mixed atomic/non-atomic access in code review.
  • Spot ABA risk in CAS loops involving reusable backing memory.
  • Identify false sharing from a benchmark curve.
  • Refactor a hot mutex to atomic when profiling justifies it — and decline the refactor when it does not.
  • Recognise the patterns: counter, flag, snapshot pointer, sharded counter, RCU update.
  • Know when to use golang.org/x/sync extras vs. when stdlib is enough.
  • Explain the atomic-or-protected rule and the alignment rule for 32-bit ARM.
  • Read Go runtime source for inspiration on lock-free patterns.

The deepest skill is restraint. Knowing when NOT to atomicise is as valuable as knowing when to.


Appendix — sync.Map vs hand-rolled lock-free map

sync.Map is Go's built-in lock-free-ish map. It is implemented in src/sync/map.go using a clever pattern: a read-only *entry map published via atomic, plus a dirty map under a mutex.

The fast path (read of an existing key, no mutation):

func (m *Map) Load(key any) (value any, ok bool) {
    read, _ := m.read.Load().(readOnly)
    e, ok := read.m[key]
    if !ok && read.amended {
        // Slow path: check the dirty map under a mutex.
        m.mu.Lock()
        ...
    }
    if !ok {
        return nil, false
    }
    return e.load()
}

The m.read.Load() is an atomic-pointer Load. The map lookup is a plain Go map lookup (not concurrent-safe in general — but the read-only map is immutable after publication). The result is a fully lock-free read.

When mutations accumulate, a promote step takes the mutex, builds a new read-only map, and atomically publishes it. The pattern is RCU.

When to use sync.Map: - Read-heavy workload with stable keys. - Goroutines mostly read different keys.

When NOT to use sync.Map: - Heavy writes from many goroutines (the dirty map's mutex bottlenecks). - Workloads where every goroutine reads every key (no advantage over mutex + map).

Benchmark for your case. The README of sync.Map itself recommends mutex + map as the default.


Appendix — atomics in the gRPC Go client

The gRPC Go implementation uses atomics heavily. A few examples:

ClientConn.GetState(): returns the connection state (Idle, Connecting, Ready, TransientFailure, Shutdown). Stored in an atomic.Int32 (encoded as integer enum). Read on every RPC; updated rarely by the connection state machine.

balancer.RoundRobin: maintains an atomic.Uint32 counter for round-robin index. Each RPC atomically increments and modulos by the number of backends. Lock-free, contention-bound by cache-line bouncing rather than mutex.

stream.id: each stream has a monotonically increasing ID, generated by atomic.AddUint32(&nextStreamID, 2) (HTTP/2 uses odd-numbered stream IDs for client-initiated streams, so increment by 2).

stats.flowControl: atomic counters for window sizes. HTTP/2 flow control requires fast updates on every byte received/sent; mutexes would be untenable.

The pattern: gRPC's hot paths use atomics; its complex state machines (connection state transitions, subchannel management) use mutexes. The boundary is drawn at "one-word counter or flag" vs "compound state".


Appendix — atomics in the Kubernetes scheduler

The kube-scheduler uses atomics in a few specific places:

Plugin invocation counters: each plugin tracks how many times it has been called via atomic.Uint64.Add(1). Read by the metrics endpoint; written on every plugin invocation. Lock-free.

The cache version counter: each modification to the scheduler's local cache bumps an atomic.Uint64 version. Watchers can detect changes by polling the version (cheap atomic load) rather than locking the cache.

Stop signals: atomic.Bool flags published from the manager when components should shut down. Workers poll the flag in their main loops.

Compare with the cache itself: the cache contents (pods, nodes) are under a sync.RWMutex because they involve complex map+slice state. Atomics could not replace the cache mutex; they can replace the version counter and stop signal.


Appendix — a benchmark methodology checklist

Before claiming "atomic is X% faster than mutex":

  1. Warmup. First few iterations may be JIT/cache cold. Use b.ResetTimer() after setup.
  2. Run at multiple GOMAXPROCS. Single-core benchmarks hide contention.
  3. Use b.RunParallel for shared state. Each goroutine should hit the same shared resource.
  4. Measure throughput, not latency. ns/op gives latency; ops/sec gives throughput. For concurrent code, throughput is more revealing.
  5. Watch for warmup bias. Short benchmarks may not reach steady state. Use -benchtime=5s minimum.
  6. Use -count=10 and report mean+stddev. A single run can be lucky.
  7. Match production conditions. If production has 1000 goroutines, your benchmark with 4 is not predictive.
  8. Measure both code paths. If you only optimised the hot path, measure end-to-end to confirm the gain matters.
  9. Use benchstat. go install golang.org/x/perf/cmd/benchstat@latest. Compares two benchmark runs with statistical significance.

Without these, "atomics are faster" is a vibe, not a measurement.


Appendix — failure modes of atomic.Pointer[T]

Three failure modes that real code hits:

1. Forgetting that the pointed-to value is shared

var cfg atomic.Pointer[Config]

func update() {
    c := cfg.Load()
    c.Timeout = time.Second // race: other goroutines may be reading c.Timeout
    cfg.Store(c) // unnecessary; you already mutated the live pointer
}

The published *Config is immutable from the moment of Store. Any mutation is a race with readers. The fix:

func update() {
    c := cfg.Load()
    newC := *c             // copy
    newC.Timeout = time.Second
    cfg.Store(&newC)        // publish the new value
}

2. Type mismatch (impossible with atomic.Pointer[T])

Unlike atomic.Value, you cannot accidentally Store a different type. The compiler catches it.

var p atomic.Pointer[Config]
p.Store(&Other{}) // compile error: cannot use &Other{} as *Config

This is the killer feature.

3. Nil store

p.Store(nil) // OK; loading later returns nil

Unlike atomic.Value (which panics on nil), atomic.Pointer[T].Store(nil) works and is sometimes desired (e.g., to invalidate a cache). Just remember that Load() may return nil; check before dereferencing.


Appendix — atomics and the GC

Atomic operations on pointer fields interact with the Go garbage collector. Two important rules:

1. Write barriers on pointer writes

atomic.Pointer[T].Store includes a GC write barrier. The GC needs to know when a pointer field changes so it can track reachability correctly. The write barrier ensures the new pointee remains reachable through the field for the duration of the GC cycle.

Without the write barrier (which would be the case for atomic.StorePointer on *unsafe.Pointer), the GC might collect the pointee mid-cycle, leaving dangling pointers.

2. runtime.KeepAlive for unsafe.Pointer atomics

If you use atomic.StorePointer with *unsafe.Pointer (legacy code, FFI), you may need runtime.KeepAlive to ensure the pointed-to value survives:

var p unsafe.Pointer

func bad() {
    x := &Struct{Big: make([]byte, 1<<20)}
    atomic.StorePointer(&p, unsafe.Pointer(x))
    // The compiler may decide x is dead here, even though p points to it.
    // GC could collect Big before another goroutine reads p.
}

func good() {
    x := &Struct{Big: make([]byte, 1<<20)}
    atomic.StorePointer(&p, unsafe.Pointer(x))
    runtime.KeepAlive(x) // explicit reachability barrier
}

For typed atomic.Pointer[T] this is automatic. For unsafe.Pointer based atomics, you must do it yourself. Reason #99 to prefer the typed version.


Appendix — additional refactor walkthroughs

Walkthrough 1: a stats aggregator

Before:

type Stats struct {
    mu       sync.Mutex
    requests int64
    bytes    int64
    errors   int64
    latencySum int64
    latencyMax int64
}

func (s *Stats) Record(req Request, latency time.Duration, err error) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.requests++
    s.bytes += req.Size
    if err != nil {
        s.errors++
    }
    s.latencySum += int64(latency)
    if int64(latency) > s.latencyMax {
        s.latencyMax = int64(latency)
    }
}

After:

type Stats struct {
    requests   atomic.Int64
    bytes      atomic.Int64
    errors     atomic.Int64
    latencySum atomic.Int64
    latencyMax atomic.Int64
}

func (s *Stats) Record(req Request, latency time.Duration, err error) {
    s.requests.Add(1)
    s.bytes.Add(req.Size)
    if err != nil {
        s.errors.Add(1)
    }
    s.latencySum.Add(int64(latency))
    // Update max via CAS loop.
    for {
        old := s.latencyMax.Load()
        if int64(latency) <= old {
            break
        }
        if s.latencyMax.CompareAndSwap(old, int64(latency)) {
            break
        }
    }
}

Each individual update is atomic. The whole-snapshot consistency is lost (a reader might see requests=100, errors=0 when actually 5 errors happened in flight), but for stats this is acceptable.

Benchmarks: 4x throughput improvement under 16-goroutine contention.

Walkthrough 2: a feature flag service

Before:

type FlagService struct {
    mu    sync.RWMutex
    flags map[string]bool
}

func (fs *FlagService) IsEnabled(name string) bool {
    fs.mu.RLock()
    defer fs.mu.RUnlock()
    return fs.flags[name]
}

func (fs *FlagService) SetFlags(flags map[string]bool) {
    fs.mu.Lock()
    fs.flags = flags
    fs.mu.Unlock()
}

After:

type FlagService struct {
    flags atomic.Pointer[map[string]bool]
}

func (fs *FlagService) IsEnabled(name string) bool {
    return (*fs.flags.Load())[name]
}

func (fs *FlagService) SetFlags(flags map[string]bool) {
    fs.flags.Store(&flags)
}

IsEnabled is lock-free. SetFlags publishes a new map atomically. Readers see either the old map or the new map, never a partial mix.

Benchmarks: 8x throughput improvement on IsEnabled (the hot path).

Walkthrough 3: a lazy initialiser

Before:

type Cache struct {
    mu   sync.Mutex
    data *expensiveResource
}

func (c *Cache) Get() *expensiveResource {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.data == nil {
        c.data = computeExpensive()
    }
    return c.data
}

After (using sync.Once):

type Cache struct {
    once sync.Once
    data *expensiveResource
}

func (c *Cache) Get() *expensiveResource {
    c.once.Do(func() {
        c.data = computeExpensive()
    })
    return c.data
}

The hot path (after first call) is one atomic load. sync.Once.Do is built on atomic.Uint32 for the done flag.

This is not a "replace mutex with atomic" so much as "use the right primitive." sync.Once is purpose-built for lazy init.


Closing — the production engineer's mindset

When you reach for atomic in production code, ask yourself five questions:

  1. Have I measured? If you have not profiled, the optimisation is speculative. Measure first.
  2. Is the invariant truly single-word? If two fields must update together, atomic does not suffice.
  3. Have I documented the discipline? Future readers must know not to add plain access. A comment near the field is the minimum.
  4. Have I padded? Adjacent hot atomics cause false sharing. Decide consciously.
  5. Have I considered shard or RCU? For very hot data, single atomic may still bottleneck. Shard for counters; RCU for read-mostly snapshots.

Five questions, five answers, then ship.


Final words

The mutex/atomic decision is not a one-time choice. It is a continuous engineering judgement that you will face dozens of times in any concurrent codebase. The right answer changes with the workload, with the platform, with the team, with the requirements.

The skill is not "always use atomic." Nor "always use mutex." It is "use the right primitive for THIS code, and explain why."

The discipline is to write down the reasoning. A comment that says "// counter is single-word, atomic suffices, no mixed access permitted" is worth more than the atomic itself. The atomic is a tool; the reasoning is the engineering.

Profile. Measure. Defend. Document. Ship.