sync/atomic — Optimisation Exercises¶
Atomic operations are already the lowest-overhead synchronisation primitive in Go (~2 ns uncontended). The "optimisation" goal here is not faster atomics: it is choosing the right primitive, eliminating contention, and removing unnecessary synchronisation. Each exercise either replaces a slower primitive with atomic, or replaces a contended atomic with a contention-free shape.
Principle: Pick the Cheapest Tool for the Job¶
Approximate costs on modern x86-64, per operation:
| Operation | Uncontended | Contended (16 goroutines) |
|---|---|---|
| Non-atomic memory access | < 1 ns | N/A (would be racy) |
| `atomic.Int64.Load` | ~1 ns | ~1 ns (read-shared cache line) |
| `atomic.Int64.Add` | ~2 ns | ~60 ns (cache-line bounce) |
| `atomic.Int64.CompareAndSwap` | ~2 ns | ~100 ns (with retries) |
| `sync.Mutex` Lock+Unlock | ~15 ns | ~200 ns to 10 µs (with parking) |
| `sync.RWMutex` RLock+RUnlock | ~10 ns | ~100 ns (atomic counter) |
| Channel send/recv (buffered) | ~50 ns | ~100-1000 ns |
The optimisation game is:
- Push hot paths to the fastest primitive.
- Avoid contention via sharding.
- Avoid synchronisation entirely where possible (immutable data, single-goroutine ownership).
Exercise 1 — Replace a Mutex-Guarded Counter With Atomic¶
Starting code:
type Counter struct {
mu sync.Mutex
n int64
}
func (c *Counter) Inc() {
c.mu.Lock()
c.n++
c.mu.Unlock()
}
func (c *Counter) Value() int64 {
c.mu.Lock()
defer c.mu.Unlock()
return c.n
}
Optimisation. Replace with atomic.Int64:
type Counter struct {
n atomic.Int64
}
func (c *Counter) Inc() { c.n.Add(1) }
func (c *Counter) Value() int64 { return c.n.Load() }
Win. 5-10x faster for Inc (one CPU instruction vs lock-unlock). The contention curve flattens; multiple goroutines incrementing scale much better.
Verification. Benchmark with b.RunParallel:
func BenchmarkMutex(b *testing.B) {
var c Counter
b.RunParallel(func(pb *testing.PB) {
for pb.Next() { c.Inc() }
})
}
Expect roughly 3x the throughput of the mutex version at 4 goroutines.
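To compare the two implementations side by side in one run, a sketch (assuming the usual sync, sync/atomic, and testing imports; mutexCounter and atomicCounter are names introduced here so both versions can coexist):

type mutexCounter struct {
	mu sync.Mutex
	n  int64
}

func (c *mutexCounter) Inc() { c.mu.Lock(); c.n++; c.mu.Unlock() }

type atomicCounter struct{ n atomic.Int64 }

func (c *atomicCounter) Inc() { c.n.Add(1) }

func BenchmarkMutexCounter(b *testing.B) {
	var c mutexCounter
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.Inc()
		}
	})
}

func BenchmarkAtomicCounter(b *testing.B) {
	var c atomicCounter
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.Inc()
		}
	})
}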
Exercise 2 — Shard a Contended Counter¶
Starting code: the atomic.Int64 counter from Exercise 1.
At 4 goroutines, throughput is fine. At 64 goroutines on a 16-core machine, the counter becomes a bottleneck: every Add requires exclusive ownership of one cache line.
Optimisation. Shard into one counter per CPU:
const shards = 64
type ShardedCounter struct {
s [shards]struct {
v atomic.Int64
_ [56]byte
}
}
func (c *ShardedCounter) Add(n int64) {
idx := shardIndex()
c.s[idx].v.Add(n)
}
func (c *ShardedCounter) Load() int64 {
var total int64
for i := range c.s {
total += c.s[i].v.Load()
}
return total
}
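shardIndex is not shown above because Go does not expose a per-CPU index; any implementation is an approximation. One possible sketch, using the concurrency-safe top-level generator from math/rand/v2 (an assumption, not the only option; pinning via sync.Pool is another common trick):

import "math/rand/v2"

// shardIndex spreads writers across shards with a cheap pseudo-random pick.
// It is not a true per-CPU index, which Go does not expose, but it keeps
// concurrent writers off the same cache line most of the time.
func shardIndex() int {
	return rand.IntN(shards)
}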
Win. Add becomes contention-free. Throughput scales linearly with goroutine count up to the shard count.
Trade-off. Load is O(shards). Acceptable for metrics scraped every few seconds; bad for counters read on every operation.
Exercise 3 — Replace Load-then-Store With Swap¶
Starting code:
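A sketch of the pattern in question (pending, record, and drain are names introduced here):

var pending atomic.Int64

func record(n int64) { pending.Add(n) }

// drain is meant to return the accumulated count and reset it to zero.
func drain() int64 {
	n := pending.Load()
	pending.Store(0) // not atomic with the Load above
	return n
}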
Bug. Between Load and Store, increments can be lost. The drain reads n, then a producer adds 5, then drain writes 0 — the 5 is gone.
Optimisation. Use Swap:
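Sketch of the fix:

func drain() int64 {
	return pending.Swap(0) // returns the old value and stores 0 in one atomic step
}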
Win. Atomic snapshot-and-reset. No lost updates. Also one operation instead of two — a small performance bonus.
Exercise 4 — Replace CAS Loop With Add Where Possible¶
Starting code:
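A sketch of the CAS-loop shape being described (total and add are names introduced here):

var total atomic.Int64

func add(delta int64) {
	for {
		old := total.Load()
		if total.CompareAndSwap(old, old+delta) {
			return
		}
	}
}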
Optimisation. If the update is +delta, use Add:
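Sketch:

func add(delta int64) { total.Add(delta) }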
Win. One instruction instead of a loop. Under contention, the CAS loop retries; Add does not — the CPU hardware retries internally at the cache-coherence level, which is faster.
Use CAS only when the update is non-trivial (max-update, conditional update, pointer publication). Use Add for counters.
Exercise 5 — Pre-Build the Replacement in Copy-on-Write¶
Starting code:
var cfg atomic.Pointer[Config]
func reload() {
cfg.Store(&Config{})
cfg.Load().Endpoints = loadEndpoints() // BUG — race, and slow
}
Bug (covered in find-bug.md): mutation after publication.
Optimisation. Build the full struct before publishing:
func reload() {
newCfg := &Config{
Endpoints: loadEndpoints(),
Timeout: loadTimeout(),
}
cfg.Store(newCfg)
}
Win. Correct, and fewer operations. No partial updates visible to readers.
Exercise 6 — Cache-Line Padding to Eliminate False Sharing¶
Starting code:
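A sketch of the layout being described (Metrics is a name introduced here):

type Metrics struct {
	requests atomic.Int64 // bytes 0-7
	errors   atomic.Int64 // bytes 8-15: same 64-byte cache line
}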
Both fields fit in one 64-byte cache line. If requests is incremented by goroutine A on core 1 and errors by goroutine B on core 2, the line ping-pongs between caches even though the variables are independent.
Optimisation. Pad to separate cache lines:
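Sketch:

type Metrics struct {
	requests atomic.Int64
	_        [56]byte // push errors onto its own 64-byte cache line
	errors   atomic.Int64
	_        [56]byte
}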
Win. No false sharing. Each field lives on its own cache line; updates are independent. Measurable improvement at high concurrency (> 4 cores writing simultaneously).
Trade-off. Memory cost: 64 bytes per atomic instead of 8. Acceptable for hot counters; wasteful for thousands of cold counters.
Exercise 7 — Replace atomic.Value with atomic.Pointer[T]¶
Starting code:
var cfg atomic.Value
func reload(c *Config) { cfg.Store(c) }
func current() *Config {
if v := cfg.Load(); v != nil {
return v.(*Config)
}
return nil
}
Optimisation. Use the typed atomic pointer:
var cfg atomic.Pointer[Config]
func reload(c *Config) { cfg.Store(c) }
func current() *Config { return cfg.Load() }
Wins.
- Compile-time type safety. No type-mismatch panic at runtime.
- No interface dance. `atomic.Value.Load` returns `any` and requires a type assertion; `atomic.Pointer[T].Load` returns `*T` directly.
- Slightly faster. No two-word interface handling.
- Nil is allowed. `atomic.Value.Store(nil)` panics; `atomic.Pointer[T].Store(nil)` is fine.
Exercise 8 — Use Go 1.23 Or/And Instead of CAS Loops¶
Starting code:
var flags atomic.Uint32
func setFlag(f uint32) {
for {
old := flags.Load()
if flags.CompareAndSwap(old, old|f) {
return
}
}
}
func clearFlag(f uint32) {
for {
old := flags.Load()
if flags.CompareAndSwap(old, old&^f) {
return
}
}
}
Optimisation (Go 1.23+). Use the new bitwise atomic operations:
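A sketch using the Or and And methods added to the atomic integer types in Go 1.23:

func setFlag(f uint32)   { flags.Or(f) }
func clearFlag(f uint32) { flags.And(^f) }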
Win. One CPU instruction (LOCK OR, LOCK AND on x86) instead of a CAS loop. Faster, especially under contention where the CAS retries multiply.
If you cannot use Go 1.23, the CAS loop is the correct fallback.
Exercise 9 — Avoid Atomic Through Goroutine-Local State¶
Starting code:
var totalProcessed atomic.Int64
func worker(items <-chan Item) {
for it := range items {
process(it)
totalProcessed.Add(1) // hot, contended
}
}
Optimisation. Maintain a goroutine-local count, merge at the end:
var totalProcessed atomic.Int64
func worker(items <-chan Item) {
var local int64
for it := range items {
process(it)
local++
}
totalProcessed.Add(local) // one atomic at the end
}
Win. From one atomic per item to one atomic per goroutine. For a million items processed by 8 workers, this is 8 atomics vs 1 million. The local++ is a single register increment, essentially free.
Caveat. The global counter is no longer live during processing. If a metrics scraper wants the count mid-flight, this approach hides progress. Choose based on whether you need real-time visibility.
For a hybrid, periodically merge:
const batchSize = 1000
var local int64
for it := range items {
process(it)
local++
if local == batchSize {
totalProcessed.Add(local)
local = 0
}
}
totalProcessed.Add(local)
One atomic per 1000 items. The visible count lags by at most batchSize.
Exercise 10 — Profiling Atomic Hot Paths¶
Starting code:
type Service struct {
requests atomic.Int64
cfg atomic.Pointer[Config]
}
func (s *Service) handle(r Request) {
s.requests.Add(1)
cfg := s.cfg.Load()
process(r, cfg)
}
How to profile.
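For example, take a CPU profile of the benchmark that drives handle (BenchmarkHandle is a placeholder name assumed here):

go test -bench=BenchmarkHandle -cpuprofile=cpu.out
go tool pprof -top cpu.out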
If requests.Add shows up at the top, you have a hot counter. Options:
- Shard (Exercise 2).
- Move to goroutine-local + merge (Exercise 9).
- Accept the cost if it is a small fraction.

If cfg.Load shows up significantly, you have a hot read of a pointer. atomic.Pointer[T].Load is essentially a MOV on x86; if it dominates, something is unusual. Inspect:
- Is the function being inlined? -gcflags='-m' reports.
- Is the call going through an interface (Exercise 12)?
Win. Profiling identifies which atomic ops are actually hot. Most atomic ops are not bottlenecks; do not waste time optimising the ones that are not.
Exercise 11 — Cap CAS-Loop Retries Under Pathological Contention¶
Starting code:
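A sketch of the uncapped loop being described (x, update, and slowTransform are assumed names):

var x atomic.Int64

func update() {
	for {
		old := x.Load()
		// slowTransform is recomputed on every failed CAS attempt.
		if x.CompareAndSwap(old, slowTransform(old)) {
			return
		}
	}
}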
If slowTransform is expensive and contention is high, every retry wastes the slow computation. Under pathological contention, the loop may not terminate in any reasonable time.
Optimisation. Cap retries; fall back to a mutex:
const maxRetries = 16

var (
	x  atomic.Int64
	mu sync.Mutex
)

func update() {
	for i := 0; i < maxRetries; i++ {
		old := x.Load()
		updated := slowTransform(old)
		if x.CompareAndSwap(old, updated) {
			return
		}
	}
	// Too many failed CAS attempts: fall back to the mutex.
	mu.Lock()
	defer mu.Unlock()
	old := x.Load()
	x.Store(slowTransform(old))
}
Win. Pathological cases fall back to deterministic mutex behaviour. Common case still benefits from lock-free.
Caveat. This pattern is rare. Most CAS loops succeed in 1-2 iterations. The retry cap is for code that has profiled badly under contention.
Exercise 12 — Avoid Interface Dispatch on Atomic Methods¶
Starting code:
type Counter interface {
Inc()
}
type atomicCounter struct{ n atomic.Int64 }
func (c *atomicCounter) Inc() { c.n.Add(1) }
func process(c Counter) {
for i := 0; i < 1e6; i++ {
c.Inc() // interface dispatch
}
}
Optimisation. Pass the concrete type:
func process(c *atomicCounter) {
for i := 0; i < 1e6; i++ {
c.Inc() // direct call; inlined; emits LOCK XADDQ inline
}
}
Win. Removes interface vtable lookup (~3 ns) per call. For a hot path of 1M iterations, this saves 3 ms total. The compiler also inlines Inc and emits the atomic instruction directly — no function call.
Trade-off. Loses interface flexibility. If you need to swap counter implementations, keep the interface and accept the cost. If the implementation is fixed, the concrete type wins.
Exercise 13 — Use Per-CPU runtime.NumCPU for Shard Count¶
Starting code: the ShardedCounter from Exercise 2 with a hard-coded const shards = 64.
Optimisation. Match the actual hardware:
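A sketch, reusing the padded shard element from Exercise 2 (paddedCounter is a name introduced here; needs the runtime import):

type paddedCounter struct {
	v atomic.Int64
	_ [56]byte
}

var counters = make([]paddedCounter, runtime.NumCPU())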
Or round up to a power of 2 for masking:
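A sketch of the rounding helper (nextPow2 is a name introduced here):

// nextPow2 rounds n up to the next power of two so a shard index can be
// reduced with a mask (idx & (nShards - 1)) instead of a modulo.
func nextPow2(n int) int {
	p := 1
	for p < n {
		p <<= 1
	}
	return p
}

var nShards = nextPow2(runtime.NumCPU())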
Win. No wasted shards (on small machines), no excess contention (on large machines). The exact match between shards and cores eliminates the "two goroutines on the same core sharing a shard" pathology.
Caveat. runtime.NumCPU returns the OS view of cores, not the Go runtime's GOMAXPROCS. Usually the same; in containers they may differ. Honour GOMAXPROCS if your workload is GOMAXPROCS-bound.
Exercise 14 — Lazy Atomic Reset¶
Starting code: the same Load-then-Store(0) drain pattern as in Exercise 3.
Bug (covered in Exercise 3): lost updates between Load and Store.
Optimisation. Use Swap(0), exactly as in Exercise 3.
Win. Atomic snapshot-and-reset. Also one operation instead of two.
This is Exercise 3 repeated because the pattern is so common it deserves emphasis. Every time you see "Load then Store(0)" in production code, replace it with Swap(0).
Exercise 15 — Eliminate Atomic Where Single-Goroutine Owns the Variable¶
Starting code:
type Producer struct {
sent atomic.Int64
}
func (p *Producer) Run() {
for {
msg := nextMessage()
send(msg)
p.sent.Add(1)
}
}
func (p *Producer) Sent() int64 { return p.sent.Load() }
The producer is the only goroutine writing sent. Why atomic?
Optimisation. Use a plain int64, but ensure Sent() reads safely. Options:
- Accept stale reads. `Sent()` is for a periodic metric. Use a plain `int64` written by the producer and a periodic snapshot via a channel:
type Producer struct {
sent int64 // owned by Run
snapshotCh chan int64
}
func (p *Producer) Run() {
var localSent int64
for {
select {
case msg := <-...:
send(msg)
localSent++
case p.snapshotCh <- localSent:
}
}
}
Run loops on a select that either processes a message or replies to a snapshot request. The variable localSent is owned by Run. Readers ask via the channel.
- Use atomic, accept the cost. Often the right answer — atomic is cheap.
Win. Approach 1 makes ownership explicit; the variable is purely goroutine-local. The atomic disappears. Approach 2 is simpler and rarely costs anything.
The lesson: question whether you need an atomic at all. If a single goroutine writes and others only read for monitoring, a snapshot channel can replace the atomic.
Exercise 16 — Inline-Friendly Atomic Methods¶
Starting code:
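A sketch of the shape described below (the log call is an assumption about what bloats the method; uses the standard log package):

type Counter struct{ n atomic.Int64 }

func (c *Counter) Inc() {
	c.n.Add(1)
	log.Printf("count=%d", c.n.Load()) // pushes Inc past the inlining budget
}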
Optimisation. Remove logging from the hot path:
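Sketch:

func (c *Counter) Inc() { c.n.Add(1) } // small enough to inline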
Win. The Inc function is small enough to inline. The compiler emits LOCK XADDQ directly at the call site. With the log call, the function is too large to inline and every Inc is a function call (~3 ns extra).
Generally: keep atomic-wrapping methods one-liners. Logging belongs in slow paths, not hot atomic increments.
Exercise 17 — Choose Read-Friendly vs Write-Friendly Layouts¶
Read-heavy workload. For data read far more often than it is written, the copy-on-write atomic.Pointer shines: reads are lock-free, writes pay an allocation.
Write-heavy workload. For counters updated on every operation and read only occasionally, plain atomic.Int64.Add is the right choice. Reading once per second is fine.
Balanced read/write workload. Profile both. sync.Mutex + map often wins for balanced map access. For balanced counter access, atomic wins until contention forces sharding.
Win. The "right" primitive is workload-dependent. Profile before optimising.
Exercise 18 — Replace Channel Signalling With Atomic Flag¶
Starting code:
done := make(chan struct{})

go func() {
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			work()
		}
	}
}()

// later, to stop the worker:
close(done)
If the worker's select already handles ticker events, the channel is fine. But if the worker is in a tight loop:
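A sketch of that shape (workItem is a placeholder):

for {
	select {
	case <-done:
		return
	default:
	}
	workItem() // hot loop; the poll above runs on every iteration
}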
The select { case <-done: ... default: } is heavier than necessary. An atomic flag is faster.
Optimisation.
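A sketch with an atomic flag (workItem again a placeholder):

var stop atomic.Bool

go func() {
	for !stop.Load() {
		workItem()
	}
}()

// instead of close(done):
stop.Store(true)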
Win. atomic.Bool.Load is ~1 ns; the select-default pattern is ~5-10 ns. For a tight inner loop, atomic wins.
Caveat. If the worker is not in a tight loop and naturally blocks on a channel anyway, keep the channel. Atomic-flag polling is only better when polling is already what you want.
Exercise 19 — Benchmark Before and After Every Change¶
Always benchmark before and after, and compare the runs with benchstat:
go install golang.org/x/perf/cmd/benchstat@latest
go test -bench=. -count=10 > old.txt
# make changes
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt
Without measurement, "optimisations" can be regressions. Atomic-related changes especially: the constant-factor changes are small (~ns), and noise can dominate. benchstat reports confidence intervals.
Win. Verified improvements; no regressions.
Exercise 20 — Know When to Stop¶
Atomic optimisations have diminishing returns. After:
- Replacing mutex-guarded counters with atomic.
- Sharding hot counters.
- Using the right operation (`Add` not CAS, `Swap` not Load+Store).
- Eliminating false sharing with padding.
- Removing atomics in single-goroutine-owned variables.
The atomic cost is usually a small fraction of total CPU. Further optimisation belongs to:
- Algorithm changes (different data structure, better algorithm).
- Reducing work (fewer items processed, smarter caching).
- Eliminating allocations.
- Removing redundant operations.
A 50% improvement to a 2 ns operation saves 1 ns per call. A 10% improvement to a 200 ns operation saves 20 ns. Optimise the big numbers first.
Final Notes¶
The sync/atomic optimisation playbook:
- Replace mutex with atomic for single-variable updates.
- Replace CAS with Add/Swap/And/Or when applicable.
- Replace `atomic.Value` with `atomic.Pointer[T]` for type safety and speed.
- Shard hot counters with cache-line padding.
- Goroutine-local accumulation with periodic merge for hot per-item counters.
- Pass concrete types, not interfaces for atomic-method calls.
- Keep wrapper methods one-liners so the compiler inlines them.
- Question whether atomic is needed at all — single-owner variables do not need it.
- Profile before optimising — most atomic ops are not bottlenecks.
- Benchmark with `benchstat` to validate improvements.
The smallest correct primitive wins. Atomic for one variable, mutex for several, channel for messages. Pick correctly and the rest is constant-factor tuning.