Compare-and-Swap (CAS) Algorithms — Optimization¶
Table of Contents¶
- Introduction
- Measure Before You Optimise
- Optimisation 1: Prefer Add/Or/And Over CAS Loops
- Optimisation 2: Cache-Line Padding
- Optimisation 3: Sharding
- Optimisation 4: Batch Updates
- Optimisation 5: Read Less, Recompute More
- Optimisation 6: Early Exit
- Optimisation 7: Backoff Under Contention
- Optimisation 8: Lazy Publication
- Optimisation 9: Avoid CAS in the Read Path
- Optimisation 10: Layout for Locality
- Anti-Patterns
- Profiling Tools
- Summary
Introduction¶
This file is about making CAS-using code faster — under contention, under load, in real systems. Two principles run through every optimisation:
- The instruction is not the bottleneck. A single uncontended CAS costs 5-15 ns; at 10 ns each, it takes 100M CAS per second just to consume one full core of CPU time. The cost is elsewhere — usually cache-coherence traffic.
- Reduce contention, not the CAS. Most "CAS is slow" complaints reduce to "many cores fight for the same cache line." The fix is structural (shard, batch, layout), not micro-optimisation.
Measure Before You Optimise¶
The most important advice: profile first. Common CAS performance complaints are misattributed. The real cost may be:
- Cache misses on unrelated data.
- Allocation pressure from atomic.Pointer swaps.
- Scheduler overhead from too many goroutines.
- A mutex behind a wrapper function you assumed was CAS-based.
Tools:
- go test -bench to establish baselines.
- go tool pprof for CPU profiling.
- perf stat (Linux) for cache-miss counters, branch mispredicts, etc.
- runtime/trace for scheduler behaviour.
A 10-minute profile saves hours of misguided optimisation.
Baseline benchmark template¶
func BenchmarkCAS(b *testing.B) {
var v atomic.Int64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
for {
old := v.Load()
if v.CompareAndSwap(old, old+1) {
break
}
}
}
})
}
Run with -cpu=1,2,4,8 to see scaling:
If single-thread is 10 ns/op and 8-thread is 200 ns/op, you have contention scaling badly. Time to consider sharding.
Optimisation 1: Prefer Add/Or/And Over CAS Loops¶
If the operation maps directly to a hardware atomic instruction, use the dedicated function. No loop, no retry.
// before: CAS loop
for {
old := c.v.Load()
if c.v.CompareAndSwap(old, old+1) {
break
}
}
// after: single atomic instruction
c.v.Add(1)
Add compiles to LOCK XADDQ on x86 — one instruction. The CAS loop is at least three instructions (load, CAS, branch), plus any retries.
Similarly for bit ops on Go 1.23+:
// before: CAS loop for OR
for {
old := f.bits.Load()
if f.bits.CompareAndSwap(old, old|mask) {
break
}
}
// after: single instruction
f.bits.Or(mask)
The hardware has LOCK OR for atomic bitwise OR. The Go compiler emits it.
When CAS still wins. When the operation is conditional in a way that Add/Or/And cannot express: set-if-greater, increment-if-positive, compare-versions-then-update. The conditional logic forces CAS.
Optimisation 2: Cache-Line Padding¶
False sharing destroys atomic-operation performance. Two unrelated variables on the same cache line cause cache-line transfers on every write, even though the writes are logically independent.
Symptom¶
A microbenchmark that scales well at low core counts collapses at higher counts, and perf stat shows high cache-misses or mem_load_l3_miss_retired.remote_hitm.
Fix¶
Pad each hot atomic to its own cache line:
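A sketch of such padding in Go (the type name is illustrative; 64 bytes is the usual x86-64 cache-line size, and Go does not guarantee the struct itself starts on a line boundary):

```go
import "sync/atomic"

// PaddedCounter fills a whole 64-byte cache line:
// 8 bytes of atomic.Int64 plus 56 bytes of padding,
// so neighbouring variables cannot share its line.
type PaddedCounter struct {
	v atomic.Int64
	_ [56]byte
}
```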
For an array of counters:
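A sketch for the array case (the 8-element size is arbitrary):

```go
import "sync/atomic"

// Each element occupies its own 64-byte line, so writes to
// counters[i] and counters[j] never bounce a cache line.
var counters [8]struct {
	v atomic.Int64
	_ [56]byte
}
```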
Each shard is on its own line; writes to different shards don't bounce lines.
Cost¶
64 bytes per padded counter instead of 8. For 16 shards, 1 KB instead of 128 bytes. For a few atomics this is trivial; for thousands, it matters. Decide based on whether the variables are truly hot.
Don't pad cold data¶
type Stats struct {
coldCount int64 // updated once per minute
_ [56]byte // pointless padding
hotCount atomic.Int64
}
If coldCount is rarely touched, false sharing with it is rare. Padding wastes memory for no gain.
Detecting false sharing¶
Linux perf with the right counters:
perf stat -e cache-misses,cache-references ./mybin
perf stat -e mem_load_l3_miss_retired.remote_hitm ./mybin
A high HITM rate on a hot loop indicates inter-core cache-line traffic.
Optimisation 3: Sharding¶
When N cores all CAS the same variable, throughput is bounded by cache-coherence traffic. Splitting the variable into N shards (one per core, ideally) cuts the cost.
type ShardedCounter struct {
shards [N]struct {
v atomic.Int64
_ [56]byte // cache-line padding
}
}
func (c *ShardedCounter) Inc() {
shard := pickShard() // hash by goroutine ID, processor, etc.
c.shards[shard].v.Add(1)
}
func (c *ShardedCounter) Value() int64 {
var sum int64
for i := range c.shards {
sum += c.shards[i].v.Load()
}
return sum
}
Writers contend within their shard. With N = number of cores, contention drops to (workload / N) per shard.
Picking the shard¶
- By goroutine ID: hash a unique goroutine identifier. Cheap, fair.
- By P (processor): pin to a P, write to that P's shard. Requires runtime_procPin, which is not exported in the standard library; sync.Pool does this via runtime internals.
- By round-robin: keep a goroutine-local counter, mod N. Simple, but writes still bounce as goroutines migrate.
The Go runtime's sync.Pool uses per-P shards via internal hooks. User code typically settles for goroutine-ID-based hashing.
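One workable sketch of a shard picker uses math/rand/v2 instead of a true goroutine ID; the top-level rand functions are backed by a cheap runtime-seeded generator, and numShards is a hypothetical constant:

```go
import "math/rand/v2"

const numShards = 16 // hypothetical shard count

// pickShard spreads writers across shards with a cheap per-call
// random draw. It is not stable per goroutine, but it breaks up
// contention almost as well for plain counters.
func pickShard() int {
	return rand.IntN(numShards)
}
```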
Cost on the read side¶
Reading the total is O(N) — sum across shards. For read-heavy workloads, this can erase the writer gains. Sharding is a write-optimisation; if you read often, profile both sides.
Sharding is not a free lunch¶
- Order is lost: a sharded counter has no global "Nth call" answer.
- Some operations don't shard naturally (max, min, set-membership).
- Memory grows by N×.
Optimisation 4: Batch Updates¶
If a goroutine performs many small updates, batch them locally and publish in fewer CAS operations.
// Before: one atomic RMW per increment
for i := 0; i < 1000; i++ {
	c.Inc() // 1000 atomic operations
}

// After: local sum, one atomic op to publish
var local int64
for i := 0; i < 1000; i++ {
	local++
}
c.v.Add(local) // 1 atomic operation
The trade-off: the count is invisible until the batch lands. If readers need fresh totals, batching breaks freshness.
Batched producers¶
type BatchProducer struct {
queue chan int
batch []int
flushAt int
}
func (b *BatchProducer) Submit(v int) {
b.batch = append(b.batch, v)
if len(b.batch) >= b.flushAt {
b.flush()
}
}
func (b *BatchProducer) flush() {
for _, v := range b.batch {
b.queue <- v
}
b.batch = b.batch[:0]
}
Each Submit is a local append (no atomic op). The flush does N channel sends, but channel ops on a buffered channel batch nicely too.
Batching trades latency for throughput. Pick the trade-off based on your workload.
Optimisation 5: Read Less, Recompute More¶
In a CAS loop, every iteration loads from memory. If the surrounding code re-reads the same atomic outside the loop, those reads are redundant.
// Before: extra read after the loop
func (c *Counter) Inc() int64 {
	for {
		old := c.v.Load()
		if c.v.CompareAndSwap(old, old+1) {
			break
		}
	}
	return c.v.Load() // redundant, and may observe a later value
}
// After: use the value you already have
func (c *Counter) Inc() int64 {
for {
old := c.v.Load()
if c.v.CompareAndSwap(old, old+1) {
return old + 1
}
}
}
The Load at the end re-reads from memory. The value old + 1 we just published is in our register; use it.
Avoid unnecessary atomic loads¶
// Before
if c.v.Load() > 0 {
if c.v.Load() < c.max {
// ...
}
}
// After
v := c.v.Load()
if v > 0 && v < c.max {
// ...
}
Even though Load is cheap on x86, two separate loads can observe two different values if a writer lands in between; the single-load version checks one consistent snapshot. One Load is one cache transaction; two are two.
Optimisation 6: Early Exit¶
In a conditional CAS loop, check the condition before the CAS. If the condition fails, return without touching the atomic.
// Before: always CAS
func (m *Max) Observe(x int64) {
for {
old := m.v.Load()
new := max(old, x)
if m.v.CompareAndSwap(old, new) {
return
}
}
}
// After: early-return when no update needed
func (m *Max) Observe(x int64) {
for {
old := m.v.Load()
if x <= old {
return // no update, no CAS
}
if m.v.CompareAndSwap(old, x) {
return
}
}
}
Most calls to Observe with x <= current max exit without any CAS — saving the contention cost entirely. For a watermark that climbs early and plateaus, the steady-state CAS count is near zero.
Optimisation 7: Backoff Under Contention¶
Under heavy contention, fast retries make things worse — they multiply the cache-line bouncing. Backoff reduces the rate of failed CAS attempts.
Fixed backoff¶
for {
old := c.v.Load()
if c.v.CompareAndSwap(old, old+1) {
return
}
runtime.Gosched() // yield once per failure
}
Gosched yields the goroutine to the scheduler. Other goroutines run. By the time we resume, the contended line may have been written by them and the cache state has changed.
Exponential backoff¶
delay := time.Nanosecond
for {
old := c.v.Load()
if c.v.CompareAndSwap(old, old+1) {
return
}
time.Sleep(delay)
if delay < time.Microsecond {
delay *= 2
}
}
Doubles the wait on each failure, capped near a microsecond. This reduces contention but increases worst-case latency. Note also that time.Sleep's real granularity is far coarser than a nanosecond, so the early iterations sleep longer than requested.
When backoff helps¶
- Many goroutines (more than GOMAXPROCS) contending the same line.
- Each successful operation does measurable work that needs to land.
When backoff hurts¶
- Few goroutines, light contention. Backoff just adds latency.
- Real-time-sensitive code. Backoff makes worst-case latency unpredictable.
Profile both with and without backoff. Don't add backoff speculatively.
Optimisation 8: Lazy Publication¶
Instead of CAS-publishing every change, accumulate changes locally and publish periodically.
type LazyMax struct {
global atomic.Int64
}
type Local struct {
global *LazyMax
pending int64
}
func (l *Local) Observe(x int64) {
if x > l.pending {
l.pending = x
}
}
func (l *Local) Flush() {
for {
old := l.global.global.Load()
if l.pending <= old {
return
}
if l.global.global.CompareAndSwap(old, l.pending) {
return
}
}
}
Each goroutine tracks its local max. Periodically (every N observations, or every M ms), it CAS-flushes the local max to the global.
Trade-off: the global is stale by up to one flush-interval. Acceptable for metrics and watermarks; not for monotonic ID generation.
Optimisation 9: Avoid CAS in the Read Path¶
Reads via atomic.Load are essentially free on x86 (single MOV). Writes via Store or CAS are expensive (LOCK prefix, cache-line ownership). Design so reads dominate over CAS.
Publish-by-pointer for read-mostly state¶
type Config struct { /* fields */ }
var current atomic.Pointer[Config]
// Hot read path: one Load
func Read() *Config { return current.Load() }
// Cold write path: occasional Store
func Update(c *Config) { current.Store(c) }
Readers do one Load per access (no CAS, no allocation, no contention). Writers do one Store per update (or CAS if computing from old). For a workload with 1M reads per write, the writer cost is irrelevant.
Avoid CAS-then-Load patterns¶
// Bad: CAS then Load
ok := c.v.CompareAndSwap(old, new)
v := c.v.Load() // extra read
// Better: CAS, use the value you swapped to
ok := c.v.CompareAndSwap(old, new)
if ok {
use(new)
}
Optimisation 10: Layout for Locality¶
Co-locate fields that are accessed together. Separate fields that are accessed by different goroutines.
// Bad: hot and cold fields mixed
type Server struct {
name string // cold
requestCount atomic.Int64 // hot
startTime time.Time // cold
responseTime atomic.Int64 // hot
address string // cold
}
// Better: hot fields grouped, kept off the metadata's lines
type Server struct {
	// cold metadata
	name      string
	startTime time.Time
	address   string

	_ [64]byte // padding: keep the hot counters off the metadata's line

	// hot atomic counters (16 bytes), padded out to a full line
	requestCount atomic.Int64
	responseTime atomic.Int64
	_ [48]byte

	// Go does not guarantee 64-byte struct alignment, so this
	// isolation is approximate.
}
The hot counters are on their own cache line; metadata reads don't pull the counters' line, and counter updates don't dirty the metadata.
This optimisation is per-struct and per-workload. Profile to confirm.
Anti-Patterns¶
"Optimising" by adding CAS¶
A common mistake: replacing a sync.Mutex with a CAS-loop "for performance" without measuring.
For very short critical sections (single-word updates), CAS wins. For everything else, mutex is competitive or better, and far easier to maintain. Switch only with benchmarks.
Padding everything¶
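Blanket padding looks like this (a reconstructed sketch with three fields a, b, c, each padded):

```go
import "sync/atomic"

// Overpadded pads every field "just in case":
// 3 × 56 = 168 bytes of padding whether or not a, b, c are hot.
type Overpadded struct {
	a atomic.Int64
	_ [56]byte
	b atomic.Int64
	_ [56]byte
	c atomic.Int64
	_ [56]byte
}
```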
If a, b, c are not concurrently hot, the padding wastes 168 bytes for no gain. Pad only the fields you have measured as hot and contended.
Sharding low-contention counters¶
Sharding adds:
- O(shards) read cost.
- O(shards) memory.
- Code complexity (shard-picking, padding).
For a counter incremented once per second, all this is overhead. Sharding helps at 1M+ ops/s. Below that, don't bother.
Spinning forever¶
In a Go program with more goroutines than GOMAXPROCS, a pure spin can starve other goroutines on the same OS thread. The Go scheduler is preemptive (1.14+), so this is less catastrophic than it used to be, but runtime.Gosched() after failure remains a good citizen.
Optimising the CAS instead of the workload¶
If your CAS-loop retries 100x per success, the fix is not "make the CAS faster." It is "reduce the contention" — shard, batch, redesign.
Profiling Tools¶
go test -bench¶
The baseline. Always start here:
go test -bench=. -cpu=1,2,4,8 -count=5
-cpu shows scaling. -count=5 reduces benchmark noise.
go tool pprof¶
CPU profiling:
go test -bench=. -cpuprofile=cpu.out
go tool pprof cpu.out
Look for time in runtime/internal/atomic — that is CAS itself. Lots of time there suggests contention or the wrong primitive choice.
perf stat (Linux only)¶
Hardware counters:
perf stat -e cache-misses,cache-references ./mybin
For CAS-specific contention:
perf stat -e mem_load_l3_miss_retired.remote_hitm ./mybin
remote_hitm counts cache-line bounces between cores. A rate that is high and grows with core count means contention.
runtime/trace¶
Scheduler behaviour:
import "runtime/trace"
f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()
// ... workload ...
View with go tool trace trace.out. It shows when goroutines block, for how long, and on what. Contended CAS loops show up as goroutines spending lots of CPU time in user code with no blocking, distinct from mutex contention, which shows up as blocked goroutines.
runtime.SetMutexProfileFraction and SetBlockProfileFraction¶
Useful for mutex contention, less for CAS. CAS contention does not show up as blocking — it shows as CPU time.
Summary¶
Optimising CAS-using code follows a predictable hierarchy:
- Measure first. Without numbers, you are guessing.
- Pick the right primitive. Add/Or/And over CAS when possible.
- Reduce contention before reducing per-op cost. Shard, batch, lazy-publish.
- Pad hot fields to cache-line size. Eliminate false sharing.
- Lay out structs so hot and cold fields are separated. Hot fields together, cold fields together.
- Early-exit conditional CAS loops. Avoid CAS when no update is needed.
- Use backoff only when measured to help. Default to no backoff.
- Prefer read-mostly designs with atomic.Pointer. Wait-free reads, occasional CAS writes.
The single instruction LOCK CMPXCHG takes ~10 ns. Anything significantly slower than that, per CAS, is contention or design. Optimise the design, not the instruction.
Further Reading¶
- Ulrich Drepper, "What Every Programmer Should Know About Memory," 2007. Section 6 on cache effects.
- Dmitry Vyukov, "Lock-free algorithms," https://www.1024cores.net/.
- Brendan Gregg, "Systems Performance," 2nd ed. Chapter on CPU and cache profiling.
- Linux perf tutorial: https://www.brendangregg.com/perf.html.
- Go's sync.Pool source: src/sync/pool.go. Reference implementation of per-P sharding.
- golang.org/x/sync/singleflight: a CAS-adjacent pattern for deduplication.
- The Go runtime's runtime/internal/atomic: see how the runtime itself uses these primitives.