Skip to content

Sequential Consistency — Middle Level

Table of Contents

  1. Introduction
  2. The Go Memory Model: SC-DRF in Depth
  3. The Happens-Before Relation
  4. How sync/atomic Got SC Semantics (Go 1.19 Revision)
  5. Comparison with C/C++ Memory Orders
  6. Comparison with Java and Kotlin
  7. Comparison with Rust
  8. When SC Costs You
  9. Implementation Across Architectures
  10. Patterns That Rely on SC
  11. Patterns Where SC Is Visible Overhead
  12. Reading Go's sync/atomic Source
  13. Litmus Tests in Depth
  14. Mistakes Beyond Junior Level
  15. Production Patterns
  16. Testing for SC Compliance
  17. Benchmark Recipes
  18. Profiling SC Costs
  19. Edge Cases
  20. Self-Assessment
  21. Summary

Introduction

Focus: "Now that I know SC exists and how to use it, what does the Go memory model actually say, and how does Go's commitment compare with other languages? When is SC overkill? When is it the only correct choice?"

The junior page introduced sequential consistency as "what you wrote is what happens, in some global order." That phrasing is correct but glosses over many subtleties:

  • Whose writes are in this global order? All goroutines'? Yes, but only memory operations are ordered — not arbitrary side effects like file writes.
  • What does Go's memory model formally guarantee beyond "race-free programs behave under SC"?
  • How does Go 1.19's SC commitment for sync/atomic compare with C/C++'s opt-in memory_order_seq_cst?
  • What was the previous semantics, and why was the change made?
  • When is SC the right tool, and when is it overkill?

This page goes one level deeper. After reading it you will:

  • Understand the happens-before relation formally and how SC is built on top of it.
  • Know the precise wording of Go's memory model with respect to atomics, channels, mutexes, and sync.Once.
  • Understand the 2022 revision (Go 1.19) and what it changed.
  • Be able to translate C/C++ memory orders (relaxed, acquire, release, acq-rel, seq-cst) into Go's all-SC model.
  • Identify code where SC is overkill and where it is essential.
  • Read and predict the output of multi-goroutine litmus tests.
  • Use benchmarks and profilers to measure SC overhead.

You do not need to know how the Go runtime emits fences in detail, or how the compiler's scheduling passes preserve memory ordering — those are the senior and professional pages. This page is about the language-level contract and how to use it deliberately.


The Go Memory Model: SC-DRF in Depth

Two paragraphs from the official memory model

The Go memory model (revised 2022, see https://go.dev/ref/mem) is the authoritative document. Its core statement, paraphrased:

A read r of memory location x is allowed to observe a write w to x if both of the following hold: 1. r does not happen before w. 2. There is no other write w' to x that happens after w but before r.

And separately:

The Go memory model guarantees that, for a data-race-free program, all executions are sequentially consistent.

These two paragraphs together define SC-DRF. The first sentence defines the visibility relation: a read may see a write if there's no later write in between and the read is not strictly before the write. The second sentence promises that, for race-free programs, the resulting executions form a sequentially-consistent set.

Why "data-race-free" matters

A data race in Go is defined as: two goroutines accessing the same memory location, at least one is a write, and the accesses are not ordered by happens-before. If your program has any data race, the Go memory model gives no guarantees — your program has undefined behaviour.

This is identical to C/C++ for the purposes of "racy code is UB." It contrasts with Java, which provides bounded guarantees even for racy programs (you cannot get "out-of-thin-air" values from volatile reads — though plain reads in racy Java code can produce surprising results too).

The contract in code

package main

import (
    "fmt"
    "sync/atomic"
)

var (
    data  int
    ready atomic.Bool
)

func writer() {
    data = 42
    ready.Store(true) // SC store
}

func reader() {
    if ready.Load() { // SC load
        fmt.Println(data) // race-free: happens-after writer's data = 42
    }
}

Why is this race-free? Because the SC store creates a happens-before edge: writer's data = 42 happens before its ready.Store(true), and reader's ready.Load() happens before its data read. If ready.Load() observes true, then by SC, writer's store happened before reader's load. Transitively, writer's data = 42 happens before reader's read of data. No race.

What if both sides used a plain bool?

var (
    data  int
    ready bool
)

Both data and ready are now plain. Writer writes both; reader reads both. With no synchronisation, this is a data race on both variables. UB. Compiler may hoist, reorder, or eliminate operations. Even on x86 where the hardware "almost works," the compiler is free to break it.

What if only ready is plain, with data atomic?

Asymmetric: the SC store on data is wasted because ready's plain read/write still races. The race on ready invalidates SC-DRF for the whole program. UB.

The rule is symmetric: both sides of every shared variable must use synchronising operations.


The Happens-Before Relation

Happens-before is a partial order on memory operations. It is defined by the Go memory model as the smallest relation satisfying:

  1. Program order: within one goroutine, every statement happens-before every later statement.
  2. Synchronisation edges: certain operations across goroutines create cross-goroutine edges.

The synchronisation edges in Go are:

  • Channel send happens-before the corresponding receive completes. If you send a value on a channel and another goroutine receives it, the send happens-before the receive.
  • close(ch) happens-before a receive that observes the channel as closed.
  • sync.Mutex.Unlock happens-before a subsequent Lock on the same mutex.
  • sync.WaitGroup.Done happens-before Wait returns.
  • sync.Once.Do(f) returns after the first invocation completes; this completion happens-before all subsequent Do invocations on the same Once.
  • An atomic operation happens-before any later atomic operation on the same variable in the same total order (SC for atomics, Go 1.19+).
  • Goroutine creation (go f()) happens-before f's execution starts.

Visualising happens-before

Goroutine A:        Goroutine B:
  a = 1
  b = 2
  ch <- 0    ─────────► <- ch
                          fmt.Println(a, b) // sees 1, 2

The arrow is a synchronisation edge. A's earlier writes (a = 1, b = 2) happen-before A's send, which happens-before B's receive, which happens-before B's reads. Transitivity gives B's prints the right values.

Sequential consistency from happens-before

SC is the case where happens-before is total: every pair of operations is ordered one way or the other. Without atomics or mutexes, happens-before is sparse — only program-order edges within each goroutine exist. Adding synchronisation operations adds cross-goroutine edges and "tightens" the order. With enough synchronisation, the order becomes total — and that totality is SC.

What if happens-before is partial?

Consider:

go func() { a = 1 }()
go func() { b = 1 }()

Without synchronisation, a = 1 and b = 1 are unordered relative to each other across goroutines. Each happens-before nothing in the other goroutine. The reader (the main goroutine after both finish — if main waits) would race with each of them.

Now wrap with sync.WaitGroup:

var wg sync.WaitGroup
wg.Add(2)
go func() { defer wg.Done(); a = 1 }()
go func() { defer wg.Done(); b = 1 }()
wg.Wait()
fmt.Println(a, b)

wg.Wait() synchronises with each Done. After Wait returns, both writes happen-before the read. The output is always 1 1.

But what about the order between a = 1 and b = 1? It remains unordered. The race detector does not flag this because they touch different memory locations. SC permits any global order in which these two writes appear, as long as both happen-before main's read.


How sync/atomic Got SC Semantics (Go 1.19 Revision)

Before Go 1.19

The Go memory model up to 1.18 was silent on the precise ordering semantics of sync/atomic. The de-facto behaviour was "at least release-acquire," but the spec gave no formal commitment. Russ Cox's 2021 "Updating the Go Memory Model" blog post argued for either:

  1. Acquire-release for atomics, matching C++'s acquire-release.
  2. Sequential consistency for atomics, simpler and stronger.

The decision: SC. The Go team chose simplicity over the potential for slightly faster code with weaker orderings.

What changed in 1.19

The memory model was formally revised in Go 1.19 (released August 2022). Key changes:

  • sync/atomic operations are now formally specified as sequentially consistent.
  • A new typed API was added: atomic.Bool, atomic.Int32, atomic.Int64, atomic.Uint32, atomic.Uint64, atomic.Uintptr, atomic.Pointer[T].
  • The old free functions (atomic.LoadInt32, etc.) are retained for backwards compatibility, with the same SC semantics.
  • The phrase "happens-before" was formalised more rigorously.
  • 32-bit platforms' alignment requirement is hidden by the typed API.

Why SC and not acquire-release?

The Go team's reasoning (from the proposal):

  • Simpler model: one rule (SC) instead of five (relaxed, consume, acquire, release, acq-rel, seq-cst).
  • Existing code already assumed SC: developers writing atomic.LoadInt32(&flag) and atomic.StoreInt32(&flag, 1) had no way to express weaker orderings, so they wrote code assuming SC.
  • Modest performance impact: on x86 (TSO), SC store ≈ acq-rel store ≈ xchg. On ARM64, SC ≈ stlr + ldar ≈ acq-rel + an extra fence. The cost difference is small.
  • Easier to teach and review: less surface for misuse.
  • The race detector models SC, so the actual semantics matched the tooling.

What weaker semantics would have permitted

With acquire-release, the canonical store-buffer litmus could legally exhibit r1 == 0 && r2 == 0:

// acq-rel — r1 == 0 && r2 == 0 is allowed
x.store(1, release);
r1 = y.load(acquire);

y.store(1, release);
r2 = x.load(acquire);

With SC, this outcome is forbidden. The cost is a mfence-style fence after the store (or xchg). The benefit is that all developers can reason intuitively.

What about relaxed?

Relaxed atomics in C/C++ permit no ordering guarantees beyond atomicity of the single operation. They are useful for performance counters where ordering is irrelevant. Go does not expose relaxed atomics — you get SC or nothing. If you truly need a relaxed counter, the trick is to use a plain variable in a per-goroutine scope and sum at the end. See the optimize.md page for techniques.


Comparison with C/C++ Memory Orders

The five C++ memory orders

C++ exposes five orderings on atomics (plus consume, deprecated for most uses):

Order Guarantees
memory_order_relaxed Only atomicity. No ordering with other operations.
memory_order_acquire This load + all subsequent operations in this thread are ordered after this load. Pairs with release stores.
memory_order_release This store + all prior operations in this thread are ordered before this store. Pairs with acquire loads.
memory_order_acq_rel Both acquire and release (for RMW operations).
memory_order_seq_cst Acquire-release + a global total order across all seq-cst operations on all atomics.

Translating Go's SC to C++

Every Go sync/atomic operation maps to memory_order_seq_cst in C++. Mechanically:

// Go
var v atomic.Int64
v.Store(1)            // ≡ v.store(1, memory_order_seq_cst);
x := v.Load()         // ≡ x = v.load(memory_order_seq_cst);
v.Add(1)              // ≡ v.fetch_add(1, memory_order_seq_cst);
v.CompareAndSwap(o,n) // ≡ v.compare_exchange_strong(o, n, memory_order_seq_cst);

The cost on x86: one LOCK XCHG or LOCK CMPXCHG or XCHG per store, regular load for Load. On ARM64: LDAR/STLR (acquire/release; SC on ARMv8.3+).

When C++ developers reach for weaker orders

A C++ developer writing a hot counter would write:

std::atomic<uint64_t> count;
count.fetch_add(1, std::memory_order_relaxed);

This avoids the fence cost on weakly-ordered hardware. On x86 it makes little difference, but on ARM64 it can save 10–20 ns per increment.

A Go developer writing the same counter writes:

var count atomic.Uint64
count.Add(1) // SC

You pay the full fence. For most workloads this is fine. For ultra-hot paths, the optimisation is per-goroutine sharding.

Acquire-release pattern

The canonical acq-rel pattern in C++:

std::atomic<bool> ready;
int data;

// producer
data = 42;
ready.store(true, std::memory_order_release);

// consumer
while (!ready.load(std::memory_order_acquire)) {}
assert(data == 42);

The release store on ready publishes prior writes; the acquire load subscribes. In Go, this is just:

var ready atomic.Bool
var data int

// producer
data = 42
ready.Store(true)

// consumer
for !ready.Load() {}
// data == 42

The Go version is strictly stronger than acq-rel — it gives full SC. The cost difference is small on most hardware.

Memory order strength ladder

weakest                                                strongest
relaxed → consume → acquire/release → acq_rel → seq_cst
                                                Go's only option

Go sits at the top of the ladder. Every atomic is seq-cst.


Comparison with Java and Kotlin

Java's volatile

Java's volatile is roughly seq-cst:

  • Reads and writes are atomic.
  • Establish happens-before edges (a write happens-before any subsequent read).
  • Forbid hoisting out of loops.

So Java's volatile int x is similar to Go's atomic.Int32 x. The differences:

  • Java's volatile is a modifier, not a wrapper type. You can have volatile int[] arr.
  • Java requires AtomicInteger etc. for RMW operations like increment.
  • Java's synchronized block is equivalent to sync.Mutex with SC ordering.

Kotlin inherits Java's model; the same comparisons apply.

Java's final field guarantee

A final field set in a constructor is visible to all threads as soon as the object reference is "safely published" — a property similar to Go's "publication via atomic.Pointer." The Java memory model formalises this in JSR-133.

Java's "out of thin air" guarantee

JSR-133 specifies that no read can return a value that no thread ever wrote — even in racy code. This is a guarantee Go does not provide; in Go, racy code has UB. In practice this rarely matters, but it is a subtle distinction.


Comparison with Rust

Rust's std::sync::atomic mirrors C++:

use std::sync::atomic::{AtomicBool, Ordering};

let flag = AtomicBool::new(false);
flag.store(true, Ordering::SeqCst);
let v = flag.load(Ordering::SeqCst);

Orderings: Relaxed, Acquire, Release, AcqRel, SeqCst. Identical to C++.

The Rust ecosystem encourages using the weakest sufficient ordering. Idiomatic Rust uses Acquire/Release for most patterns; SeqCst only for genuinely seq-cst-dependent algorithms.

Go's choice is the opposite: always SC. Different design philosophies — Rust optimises for performance-critical low-level code; Go optimises for developer comfort and team velocity.


When SC Costs You

x86 cost profile

On x86 (TSO), SC adds a fence only on stores (to prevent store-load reordering). Loads are essentially free:

  • Plain load: 1 cycle
  • SC Load: 1 cycle (just a regular load — x86's TSO already provides acquire)
  • Plain store: 1 cycle
  • SC Store: ~20–30 cycles (xchg with implicit lock prefix)
  • SC Add: ~20–30 cycles (lock xadd)
  • SC CAS: ~20–30 cycles (lock cmpxchg)

For read-mostly workloads on x86, SC is essentially free. The cost shows up only at write-heavy contended atomic stores.

ARM64 cost profile

ARM64 is weakly ordered. SC requires explicit fences:

  • Plain load: 1 cycle
  • SC Load: LDAR instruction, ~5–10 cycles
  • Plain store: 1 cycle
  • SC Store: STLR instruction, ~10–20 cycles
  • SC Add: LDAXR/STLXR loop, ~20–40 cycles
  • SC CAS: LDAXR/STLXR or CASAL, ~20–40 cycles

On ARM, SC is meaningfully more expensive than acq-rel and much more expensive than relaxed. This is where Go's choice costs you most.

When the cost matters

  • Hot loops with millions of atomic ops per second.
  • Cache-line bouncing counters.
  • Lock-free data structures with high contention.

For typical web-service code (a few atomics per request handler), the cost is negligible.

Mitigation strategies

  • Per-goroutine sharding: keep a local plain counter; aggregate at the end.
  • Padding: prevent false sharing by aligning to cache lines.
  • Read-mostly with copy-on-write: avoid atomic writes on the hot path.
  • Channels for batched updates.

We will see these in the optimize.md page.


Implementation Across Architectures

How the Go compiler emits SC atomics

The compiler dispatches to architecture-specific assembly in runtime/internal/atomic. On each platform:

x86-64

// atomic.Int64.Store(v)
MOVQ AX, v
XCHGQ AX, [memory]

// atomic.Int64.Load
MOVQ [memory], AX  // x86 TSO gives acquire for free

// atomic.Int64.Add(d)
LOCK XADDQ d, [memory]

x86's TSO model means most operations need no extra fence. Only store-load reordering requires the XCHG or explicit MFENCE.

ARM64

// atomic.Int64.Store(v)
STLR v, [memory]    // store-release

// atomic.Int64.Load
LDAR [memory], v    // load-acquire

// atomic.Int64.Add(d)
LDAXR ...           // load-acquire-exclusive
ADD
STLXR ...           // store-release-exclusive
B.NE retry

ARMv8.3+ added LDAR/STLR with implicit SC semantics. Older ARMv8 needs explicit DMB ISH (data memory barrier, inner shareable) fences.

RISC-V

// atomic.Int64.Store(v)
amoswap.d.aqrl x0, v, (memory)

// atomic.Int64.Add(d)
amoadd.d.aqrl x0, d, (memory)

RISC-V's aqrl flag on atomic memory operations gives both acquire and release ordering, which combined gives SC.

MIPS, s390x, PowerPC, wasm

Each has its own dance: - MIPS: sync instruction emitted around operations. - s390x: TSO-like, but with bcr 15,0 fences. - PowerPC: lwsync and sync fences as needed. - wasm: relies on JavaScript engine's atomics; SC mapping built in.

The point is: the compiler does this for you. Your Go code is the same on every platform; the binary differs.


Patterns That Rely on SC

Pattern: Dekker's algorithm

Dekker's algorithm provides mutual exclusion for two threads without atomics — if memory is SC. With weaker orderings, it fails. In Go, you can implement it correctly:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

var (
    flag0, flag1 atomic.Bool
    turn         atomic.Int32
    critical     atomic.Int64
)

func dekker(id int) {
    var me, other *atomic.Bool
    if id == 0 {
        me, other = &flag0, &flag1
    } else {
        me, other = &flag1, &flag0
    }

    me.Store(true)
    for other.Load() {
        if turn.Load() != int32(id) {
            me.Store(false)
            for turn.Load() != int32(id) {
            }
            me.Store(true)
        }
    }

    // critical section
    critical.Add(1)

    turn.Store(int32(1 - id))
    me.Store(false)
}

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(2)
        go func() { defer wg.Done(); dekker(0) }()
        go func() { defer wg.Done(); dekker(1) }()
    }
    wg.Wait()
    fmt.Println("entries:", critical.Load())
}

Each goroutine "wants" to enter the critical section by setting its flag. The SC atomics ensure mutual exclusion. Note that Go programmers should use sync.Mutex in production — Dekker is for educational purposes.

Pattern: Peterson's algorithm

Similar to Dekker; relies on SC for correctness. Two threads, two flags, one turn variable. Works in Go with SC atomics.

Pattern: lock-free single-producer/single-consumer queue

type SPSC struct {
    buf  []int64
    head atomic.Int64
    tail atomic.Int64
}

func (q *SPSC) Push(v int64) bool {
    h := q.head.Load()
    t := q.tail.Load()
    if h-t == int64(len(q.buf)) {
        return false
    }
    q.buf[h%int64(len(q.buf))] = v
    q.head.Store(h + 1)
    return true
}

func (q *SPSC) Pop() (int64, bool) {
    h := q.head.Load()
    t := q.tail.Load()
    if h == t {
        return 0, false
    }
    v := q.buf[t%int64(len(q.buf))]
    q.tail.Store(t + 1)
    return v, true
}

Producer publishes buf[h] then stores head + 1. Consumer loads head, reads buf[t], stores tail + 1. SC guarantees the consumer sees the slot's value once it sees the updated head.

Pattern: multi-producer counter

var c atomic.Int64
// any number of goroutines call c.Add(1)

The SC total order over all RMWs guarantees no lost updates.

Pattern: epoch reclamation

var globalEpoch atomic.Int64

func enter() int64 { return globalEpoch.Load() }
func bump()        { globalEpoch.Add(1) }

Epoch-based reclamation requires SC to ensure that, once an epoch is observed, all prior epochs have been "passed."


Patterns Where SC Is Visible Overhead

Pattern: lock-free statistics

var requests atomic.Int64

func handle() {
    requests.Add(1)
    // ...
}

If handle is called a million times per second across 16 cores, the SC Add becomes a cache-bouncing bottleneck. The variable's cache line ricochets between cores.

Fix: per-goroutine counters, or sharded counters per CPU.

const shards = 64

type ShardedCounter struct {
    s [shards]struct {
        v atomic.Int64
        _ [56]byte // pad to cache line
    }
}

func (c *ShardedCounter) Inc(g int) { c.s[g%shards].v.Add(1) }
func (c *ShardedCounter) Sum() int64 {
    var total int64
    for i := range c.s {
        total += c.s[i].v.Load()
    }
    return total
}

Different goroutines hit different cache lines. The SC cost is still per-op, but contention is gone.

Pattern: hot-path flag

A flag checked on every request:

var enabled atomic.Bool

func handle() {
    if !enabled.Load() {
        return
    }
    // ...
}

On x86 this is essentially free (regular load). On ARM, each Load is a LDAR — small cost but adds up. Mitigation: cache the value at request entry; re-check on each request (which is unavoidable for correctness if the flag may change mid-request).

Pattern: read-mostly map

var routes atomic.Pointer[map[string]Handler]

func lookup(path string) Handler {
    return (*routes.Load())[path]
}

Each request does one SC Load. Cheap, but if you have 100k requests/sec across 32 cores, the cache line of routes (the atomic pointer) is hot. False sharing could matter if other atomics live nearby. Pad the struct.


Reading Go's sync/atomic Source

The sync/atomic package's source is a useful read. Key files:

  • src/sync/atomic/type.go: typed wrappers (Bool, Int32, etc.).
  • src/sync/atomic/value.go: untyped Value (legacy).
  • src/sync/atomic/asm.s: dispatches to runtime atomics.
  • src/runtime/internal/atomic/atomic_amd64.s: AMD64 implementations.
  • src/runtime/internal/atomic/atomic_arm64.s: ARM64.
  • src/runtime/internal/atomic/atomic_riscv64.s: RISC-V.

A representative snippet from type.go:

type Int64 struct {
    _ noCopy
    _ align64
    v int64
}

func (x *Int64) Load() int64           { return LoadInt64(&x.v) }
func (x *Int64) Store(val int64)       { StoreInt64(&x.v, val) }
func (x *Int64) Swap(new int64) int64  { return SwapInt64(&x.v, new) }
func (x *Int64) CompareAndSwap(old, new int64) bool {
    return CompareAndSwapInt64(&x.v, old, new)
}
func (x *Int64) Add(delta int64) int64 { return AddInt64(&x.v, delta) }

The noCopy and align64 markers prevent accidental copying and ensure 64-bit alignment on 32-bit platforms. The methods delegate to the legacy free functions, which dispatch to assembly.

Reading this source clarifies what each operation does and why. Spend an afternoon scrolling through it.


Litmus Tests in Depth

Litmus 1: Store-buffer (SB)

       Thread A          Thread B
       x = 1             y = 1
       r1 = y            r2 = x

Forbidden under SC: r1 == 0 && r2 == 0
Permitted under TSO (raw x86): r1 == 0 && r2 == 0

Litmus 2: Independent reads of independent writes (IRIW)

Thread A   Thread B   Thread C        Thread D
x = 1      y = 1      r1 = x          r3 = y
                      r2 = y          r4 = x

Forbidden under SC: r1=1, r2=0, r3=1, r4=0
(C sees x update before y; D sees y update before x — incompatible global orders)
Permitted under PowerPC/ARM without fences.

This is a famous test demonstrating that SC requires a single global order observed by all threads. Acquire-release alone is not enough.

Litmus 3: Load-buffer (LB)

       Thread A          Thread B
       r1 = x            r2 = y
       y = 1             x = 1

Forbidden under any reasonable model: r1 == 1 && r2 == 1
(Both reads see "future" writes — out of thin air)

This is forbidden under all real-world models because it would require values out of thin air.

Litmus 4: Write-write coherence

       Thread A          Thread B
       x = 1             x = 2
                         r = x

If r == 1, then under SC the write x = 2 happened after, but B observed x = 1.
SC requires: subsequent reads see the latest write in the total order.

Running litmus tests in Go

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func runSB() int {
    var bad int
    for i := 0; i < 100_000; i++ {
        var x, y atomic.Int32
        var r1, r2 int32
        var wg sync.WaitGroup
        wg.Add(2)
        go func() {
            defer wg.Done()
            x.Store(1)
            r1 = y.Load()
        }()
        go func() {
            defer wg.Done()
            y.Store(1)
            r2 = x.Load()
        }()
        wg.Wait()
        if r1 == 0 && r2 == 0 {
            bad++
        }
    }
    return bad
}

func main() {
    fmt.Println("SC violations (should be 0):", runSB())
}

This program prints 0 every time, on every supported architecture, demonstrating Go's SC guarantee.


Mistakes Beyond Junior Level

Mistake: assuming SC orders unrelated atomics

SC orders all atomics in some total order, but the order can be any consistent one. Two writes to different atomics from different goroutines may appear in either order to different observers.

Wait — that would violate SC. Let me restate: under SC, all atomics share one global order. Different observers see the same order. But the order in which two unrelated writes occur is not determined; only that some order is chosen and respected globally.

Mistake: thinking atomic.Pointer[T] makes T mutable safely

After publication, T must be treated as immutable. Mutating fields of *T after Store is racy.

Mistake: using atomic.Bool as a binary semaphore

atomic.Bool does not block. If you need waiters, use a channel or sync.Cond.

Mistake: spinning on an atomic in a goroutine pool

Spinning consumes CPU. The Go scheduler does not preempt cooperatively as aggressively as Java's. Long spins can starve other goroutines. Use channels for blocking signals.

Mistake: not aligning legacy 64-bit atomics on 32-bit ARM

Before the typed API, you had to manually align. Using atomic.Int64 (typed) since 1.19 handles this.

Mistake: assuming time.Sleep flushes writes

It does not. Sleeping is not a memory barrier. Writes can sit in the store buffer indefinitely without synchronisation.

Mistake: confusing atomicity with isolation

Atomic operations are indivisible. They do not provide multi-step isolation. For multi-step invariants, use sync.Mutex.

Mistake: mistaking memory model for execution model

The memory model says what observations are legal. It does not say which observation will occur. Real schedules vary by load, CPU model, even ambient temperature.


Production Patterns

Configuration hot-reload

type Config struct {
    Endpoints []string
    Timeout   time.Duration
}

type App struct {
    cfg atomic.Pointer[Config]
}

func (a *App) Reload(c *Config) { a.cfg.Store(c) }
func (a *App) Cfg() *Config     { return a.cfg.Load() }

Used at scale (every major Go web framework has variants).

Atomic stats

type Stats struct {
    Requests atomic.Int64
    Errors   atomic.Int64
    Latency  atomic.Int64 // nanoseconds, total
}

func (s *Stats) Record(elapsed time.Duration, err error) {
    s.Requests.Add(1)
    s.Latency.Add(int64(elapsed))
    if err != nil {
        s.Errors.Add(1)
    }
}

Production metrics often look like this. For very high throughput, shard.

Stop flags

type Worker struct {
    stop atomic.Bool
}

func (w *Worker) Stop()        { w.stop.Store(true) }
func (w *Worker) Stopping() bool { return w.stop.Load() }

Standard in goroutine pools and pipeline stages.

Sync.Once-like pattern

type Once struct {
    done atomic.Uint32
    m    sync.Mutex
}

func (o *Once) Do(f func()) {
    if o.done.Load() == 0 {
        o.m.Lock()
        defer o.m.Unlock()
        if o.done.Load() == 0 {
            f()
            o.done.Store(1)
        }
    }
}

Go's stdlib sync.Once is essentially this. The fast path is one atomic load; the slow path takes a mutex.

Connection pool size

type Pool struct {
    size atomic.Int64
}

func (p *Pool) Acquire() {
    p.size.Add(1)
}

func (p *Pool) Release() {
    p.size.Add(-1)
}

Live counter; readable via Load.

Health flag

type Health struct{ healthy atomic.Bool }

func (h *Health) MarkUnhealthy() { h.healthy.Store(false) }
func (h *Health) MarkHealthy()   { h.healthy.Store(true) }
func (h *Health) IsHealthy() bool { return h.healthy.Load() }

Every load balancer's health-check probe ends up reading a flag like this.


Testing for SC Compliance

go test -race

Always. Every PR. The race detector dynamically tracks happens-before edges and reports violations.

Stress testing with -cpu

go test -race -cpu=1,2,4,8 ./...

Forces the runtime to use different GOMAXPROCS settings, exposing races that depend on parallelism.

Repeated test runs

go test -race -count=100 ./...

Some races appear only under specific schedules. Repeated runs increase coverage.

Litmus harness

// See litmus tests above. Wrap them in `Benchmark` functions
// to run as part of CI:
func TestStoreBufferLitmus(t *testing.T) {
    for i := 0; i < 10_000; i++ {
        var x, y atomic.Int32
        var r1, r2 int32
        var wg sync.WaitGroup
        wg.Add(2)
        go func() {
            defer wg.Done()
            x.Store(1)
            r1 = y.Load()
        }()
        go func() {
            defer wg.Done()
            y.Store(1)
            r2 = x.Load()
        }()
        wg.Wait()
        if r1 == 0 && r2 == 0 {
            t.Fatalf("SC violation at iter %d", i)
        }
    }
}

This test passes deterministically.

Property tests

A property-based test:

For every pair of atomic stores X and Y in different goroutines, no observer can see X without Y if Y was sequenced before X in any execution.

Encode as a generator that runs many concurrent stores and asserts the invariant. Useful when designing new lock-free structures.


Benchmark Recipes

Benchmarking SC overhead

package atomic_bench

import (
    "sync/atomic"
    "testing"
)

var globalI64 int64
var globalAtomic atomic.Int64

func BenchmarkPlainLoad(b *testing.B) {
    var sink int64
    for i := 0; i < b.N; i++ {
        sink = globalI64
    }
    _ = sink
}

func BenchmarkSCLoad(b *testing.B) {
    var sink int64
    for i := 0; i < b.N; i++ {
        sink = globalAtomic.Load()
    }
    _ = sink
}

func BenchmarkPlainStore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        globalI64 = int64(i)
    }
}

func BenchmarkSCStore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        globalAtomic.Store(int64(i))
    }
}

func BenchmarkSCAdd(b *testing.B) {
    for i := 0; i < b.N; i++ {
        globalAtomic.Add(1)
    }
}

Run with go test -bench=. -benchmem. Typical x86 results: - PlainLoad: 0.3 ns/op - SCLoad: 0.3 ns/op - PlainStore: 0.3 ns/op - SCStore: 10 ns/op - SCAdd: 10 ns/op

ARM64 results show SCLoad around 5–10 ns.

Benchmarking contention

func BenchmarkContendedAdd(b *testing.B) {
    var counter atomic.Int64
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            counter.Add(1)
        }
    })
}

Run with -cpu=1,2,4,8,16. Throughput per goroutine drops as cores increase, due to cache-line bouncing.

Sharded counter benchmark

type Shard struct {
    v atomic.Int64
    _ [56]byte
}

func BenchmarkShardedAdd(b *testing.B) {
    const shards = 64
    var arr [shards]Shard
    var ctr atomic.Int64
    b.RunParallel(func(pb *testing.PB) {
        id := ctr.Add(1) - 1
        s := &arr[id%shards]
        for pb.Next() {
            s.v.Add(1)
        }
    })
}

This benchmark scales much better than the contended version.


Profiling SC Costs

go tool pprof

Add the standard pprof endpoints. Capture a CPU profile and look for:

  • Atomic functions: sync/atomic.AddInt64, runtime/internal/atomic.*.
  • Mutex contention: sync.(*Mutex).Lock.

If atomics dominate, sharding or relaxing the access pattern helps.

runtime/trace

import "runtime/trace"

trace.Start(os.Stderr)
// ... workload ...
trace.Stop()

The trace shows goroutine schedules, blocking, and helpful per-G timelines. Useful for diagnosing why a spin loop is hot.

Perf counters

On Linux, perf stat reveals cache-coherence traffic:

perf stat -e cache-references,cache-misses,bus-cycles ./your-binary

A high cache-misses rate on atomic-heavy code suggests false sharing or contention.


Edge Cases

atomic.Pointer[T] with T = chan

var ch atomic.Pointer[chan int]

You can publish a channel pointer atomically. Receivers must Load and then operate on the channel. Note that the channel value (the runtime hchan struct) is itself synchronised internally — the atomic only protects the pointer.

Atomics on slices of atomics

type Bus struct{ counters [64]atomic.Int64 }

Each counter is independent. The whole struct is not atomic — individual elements are.

Atomics inside structs passed by value

type Box struct { v atomic.Int64 }
func f(b Box) { b.v.Store(1) } // updates a *copy*

The atomic copy is unrelated to the caller's. Pass *Box or use a pointer type.

Atomics with unsafe.Pointer

var p atomic.Pointer[uintptr] // OK

atomic.UnsafePointer (legacy) and atomic.Pointer[T] (generic) handle pointer atomics with SC. The race detector understands them.

Atomics across cgo calls

Atomics are Go-side synchronisation. When calling C code via cgo, the C side has its own atomic semantics (typically C++'s memory_order_seq_cst if you use atomic.h). The two interact transparently as long as both use SC.


Self-Assessment

  • I can state the SC-DRF guarantee in formal terms (happens-before, race-freedom, sequential consistency).
  • I know which Go primitives create happens-before edges.
  • I can read Go's memory model document and find the specific guarantees about atomics, channels, mutexes.
  • I know that Go's sync/atomic provides SC (Go 1.19+), strictly stronger than C++'s default acq-rel.
  • I can identify when SC overhead matters (hot loops, high contention) and when it does not (request-rate atomics).
  • I can write the store-buffer litmus test and predict its outcome under SC vs TSO.
  • I have benchmarked SC atomic operations on x86 and observed the load-store cost asymmetry.
  • I can refactor a sync.RWMutex-protected read-mostly structure into a copy-on-write atomic.Pointer[T].
  • I can compare Go's SC commitment with C++'s memory_order_seq_cst and Java's volatile.
  • I have read at least one of Russ Cox's memory-model blog posts in full.

Summary

The middle level of SC mastery is about deliberate use:

  • You know the rule: race-free programs behave under SC; atomics are SC; this is stronger than acquire-release.
  • You understand the happens-before relation and its role in establishing SC.
  • You can compare Go's choice with C++'s opt-in seq-cst and Java's volatile.
  • You know where SC costs you (hot ARM atomics, contended cache lines) and where it does not (read-mostly x86 workloads).
  • You can implement and reason about Dekker, Peterson, SPSC queues, and other SC-dependent algorithms.
  • You can read Go's sync/atomic source and the runtime atomics for any supported architecture.
  • You can benchmark, profile, and reason about the cost of SC operations.

The senior page goes deeper still: formal models, hardware mappings in detail, fence emission strategy, and SC at the boundary with non-SC code (cgo, syscalls, mmap-shared memory).


Deeper Dive: The Happens-Before Lattice

The happens-before relation is a partial order on memory operations. To visualise it, draw operations as nodes and edges as ordering relations. The result is a lattice (a partially-ordered set with meets and joins).

A small example

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

var (
    a, b, c int
    flag    atomic.Bool
)

func main() {
    var wg sync.WaitGroup
    wg.Add(2)

    go func() {
        defer wg.Done()
        a = 1                  // 1
        b = 2                  // 2
        flag.Store(true)       // 3 — synchronisation point
    }()

    go func() {
        defer wg.Done()
        for !flag.Load() {    // 4
        }
        c = a + b              // 5
    }()

    wg.Wait()
    fmt.Println(a, b, c)       // 6
}

The happens-before edges:

  • 1 → 2 → 3 (program order)
  • 4 → 5 (program order)
  • 6 follows wg.Wait, after both Done's
  • 3 → 4 (SC synchronisation on flag, when 4 observes true)
  • Wait → 6

Transitively, 1 and 2 happen-before 5, so c = 3 is the only legal output of c.

Drawing it as a graph

   [a = 1]      [b = 2]      [flag.Store]
       \           |              /
        \          |             /
         +-----+   +----------- + ─→ (synchronisation edge)
                |  |               |
                v  v               v
              [reader reads a, b][flag.Load==true]
                                   |
                                   v
                                [c = a + b]
                                   |
                                   v
                                [Wait, main reads a, b, c]

SC is the case where this lattice is fleshed out enough to be a total order. With sparse synchronisation, the lattice has many incomparable pairs — and those pairs may interleave in any order.

Why SC is special

Among partial orders, SC is the case where every observation is consistent with a single total order. Weaker models permit different total orders for different observers. Under acquire-release, observer X may see writes in order w1, w2 while observer Y sees w2, w1 — both consistent with their respective acquire loads. Under SC, X and Y must agree.


Deeper Dive: Why the Go Team Chose SC (and Not Weaker)

Three arguments dominated the 2021–2022 Go memory-model discussion:

1. Existing code already assumed SC

atomic.LoadInt32 and atomic.StoreInt32 had been around since Go 1.0. Developers used them for flags, counters, and pointer publication. The de-facto behaviour on x86 (which most Go developers used) was essentially SC by accident — because x86 is "almost SC" already. Code worked. Refusing to bless this behaviour with a formal SC guarantee would have broken existing code on weakly-ordered hardware.

2. Simpler model, simpler bugs

C++ developers regularly mix up acquire and release. SC eliminates that class of bug entirely. The Go team prioritises teaching scalability — a model two new hires can grasp in an hour, versus one a senior engineer has to relearn every six months.

3. The performance hit is modest

Benchmarks across real Go workloads showed SC atomics costing 0–5% in throughput vs acquire-release. The gain from offering weaker orderings was, in the team's judgement, not worth the complexity.

What was rejected

Proposals to add atomic.LoadAcquire, atomic.StoreRelease, atomic.LoadRelaxed, etc., were rejected. The lone exception: Russ Cox left the door open for future relaxed atomics if a clear performance need emerges. As of Go 1.22+, this has not happened.


Translating C++ Patterns to Go

Below are common C++ patterns and their Go equivalents.

Spinlock

class spinlock {
    std::atomic<bool> flag{false};
public:
    void lock() {
        while (flag.exchange(true, std::memory_order_acquire)) {}
    }
    void unlock() { flag.store(false, std::memory_order_release); }
};
type spinlock struct{ flag atomic.Bool }

func (s *spinlock) Lock() {
    for s.flag.Swap(true) {
    }
}
func (s *spinlock) Unlock() { s.flag.Store(false) }

Go's version is full SC — slightly stronger than the C++ acquire-release version. Cost difference is small.

Reference-counted pointer

struct refcounted {
    std::atomic<int> rc{1};
    void retain() { rc.fetch_add(1, std::memory_order_relaxed); }
    void release() {
        if (rc.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this;
    }
};
type Refcounted struct{ rc atomic.Int64 }

func (r *Refcounted) Retain() { r.rc.Add(1) }
func (r *Refcounted) Release() {
    if r.rc.Add(-1) == 0 {
        // free
    }
}

C++ uses relaxed for retain (no ordering needed) and acq-rel for release (must order destructor with prior accesses). Go uses SC for both. The retain cost is slightly higher on ARM; release cost is similar.

Lock-free queue tail-update

auto tail = q->tail.load(std::memory_order_acquire);
node->next = nullptr;
q->tail.store(node, std::memory_order_release);
tail := q.tail.Load()
node.Next = nil
q.tail.Store(node)

The Go version is SC. Subtle gotcha: in C++ the node's next is updated with a plain (non-atomic) write and the C++ release ensures it's visible to readers via acquire. In Go, the SC store provides the same visibility. The node's next must be set before the atomic store, just as in C++.

Lazy initialisation with double-checked locking

class Singleton {
    static std::atomic<Singleton*> instance;
    static std::mutex mu;
public:
    static Singleton* get() {
        Singleton* s = instance.load(std::memory_order_acquire);
        if (!s) {
            std::lock_guard<std::mutex> lk(mu);
            s = instance.load(std::memory_order_relaxed);
            if (!s) {
                s = new Singleton();
                instance.store(s, std::memory_order_release);
            }
        }
        return s;
    }
};
var (
    instance atomic.Pointer[Singleton]
    mu       sync.Mutex
)

func Get() *Singleton {
    if s := instance.Load(); s != nil {
        return s
    }
    mu.Lock()
    defer mu.Unlock()
    if s := instance.Load(); s != nil {
        return s
    }
    s := &Singleton{}
    instance.Store(s)
    return s
}

The Go version is cleaner because every operation is SC by default.


A Tour of Mutex Internals (Just Enough)

sync.Mutex is layered on top of SC atomics. A simplified sketch:

type Mutex struct {
    state int32
    sema  uint32
}

func (m *Mutex) Lock() {
    if atomic.CompareAndSwapInt32(&m.state, 0, 1) {
        return // fast path
    }
    m.lockSlow()
}

func (m *Mutex) Unlock() {
    new := atomic.AddInt32(&m.state, -1)
    if new != 0 {
        m.unlockSlow(new)
    }
}

The real implementation is more complex (handles starvation, spinning, queue management), but the heart is SC atomic CAS and Add operations.

When you call Lock, you observe the lock's released state via SC. When you call Unlock, the SC store publishes all your prior writes inside the critical section. The combination provides SC for race-free code that uses the mutex correctly.


Reflections on the Race Detector

The race detector implements a vector clock algorithm. Each goroutine has a logical clock; synchronisation operations exchange and update clocks. A read or write to a memory location records the current clock; a later access from a different goroutine compares clocks and reports if there's no happens-before edge.

Key implications:

  • The detector is dynamic: it reports only races that actually occur in observed executions.
  • It is not exhaustive: a race that never manifests during the test run will not be reported.
  • It has no false positives: every reported race is real.
  • It has overhead: 5–10× slower execution, 5–10× more memory.

To increase coverage: - Run with diverse GOMAXPROCS settings. - Use -count=N to run tests many times. - Stress-test with realistic load. - Use synthetic schedulers like go.dev/x/tools/cmd/race extensions.


When SC Is Not the Right Tool

Case 1: independent counters with no read invariant

If you only care about the final count and the count is summed at the end (after all goroutines join), per-goroutine plain counters are simpler and faster:

type Tally struct {
    perG []int64
}

func (t *Tally) Inc(g int) { t.perG[g]++ }
func (t *Tally) Sum() int64 {
    var s int64
    for _, v := range t.perG {
        s += v
    }
    return s
}

perG[g] is updated only by goroutine g, so no race. Sum is called after wg.Wait, which establishes happens-before with each goroutine's writes.

Case 2: write-once, read-many configuration

If a value is set once at startup and never changes, no atomics are needed — if startup synchronises with all readers (e.g., readers are started after init):

var cfg *Config

func main() {
    cfg = load()
    serve()
}

func handler() { use(cfg) }

The go func calls in serve happen-after cfg = load(). Readers safely use cfg. No atomic needed.

But the moment you have hot reload, atomics are required.

Case 3: thread confinement

If a piece of state is only ever touched by one goroutine, no synchronisation is needed. This is the cleanest design when feasible. Use channels to pass ownership.

func worker(in <-chan job, out chan<- result) {
    var state State // confined to this goroutine
    for j := range in {
        state.update(j)
        out <- state.result()
    }
}

State is never shared. SC is irrelevant.


A Closer Look at sync.Once

sync.Once is the canonical "run exactly once" primitive. Internally:

type Once struct {
    done atomic.Uint32
    m    Mutex
}

func (o *Once) Do(f func()) {
    if o.done.Load() == 0 {
        o.doSlow(f)
    }
}

func (o *Once) doSlow(f func()) {
    o.m.Lock()
    defer o.m.Unlock()
    if o.done.Load() == 0 {
        defer o.done.Store(1)
        f()
    }
}

The fast path is one SC Load. If the value is 0, the slow path acquires the mutex and re-checks. The SC atomics ensure that after done == 1, all writes performed by f() are visible to any subsequent caller.

Once is, in essence, the publication pattern wrapped in a tidy API.


Working with atomic.Pointer[T] Idiomatically

type State struct{ /* fields */ }

type Holder struct {
    p atomic.Pointer[State]
}

func (h *Holder) Update(s *State) { h.p.Store(s) }
func (h *Holder) Get() *State     { return h.p.Load() }

func (h *Holder) Modify(f func(*State) *State) {
    for {
        old := h.p.Load()
        n := f(old)
        if h.p.CompareAndSwap(old, n) {
            return
        }
    }
}

The Modify method is a common compare-and-swap loop. It reads, computes a new value, and tries to swap. SC ensures the swap is totally ordered with every other modify.

Important: f must produce a new *State. Mutating old in place is wrong.


A Brief Detour: atomic.Value

The pre-1.19 atomic.Value allows storing arbitrary interface{}. Its constraints:

  • All values stored must have the same concrete type.
  • Reading requires a type assertion.
  • Allocates due to interface boxing.
var v atomic.Value
v.Store(&Config{...})
c := v.Load().(*Config)

Use it only when you have heterogeneous types or legacy code. For new code, prefer typed atomic.Pointer[T].


SC and the Garbage Collector

Go's GC interacts with the memory model in subtle ways:

  • The write barrier (inserted by the compiler around pointer writes during a GC cycle) is itself a synchronising operation.
  • Stop-the-world phases of GC (now rare and brief) involve full barriers.
  • Concurrent mark uses atomic word operations to track grey/black/white states.

You can mostly ignore this — the GC ensures all your atomics remain SC even during a GC cycle. But it does mean that GC pauses can have small effects on atomic-heavy benchmarks.


SC and the Scheduler

The Go scheduler (runtime/proc.go) is the most atomic-heavy piece of Go code. Every goroutine state transition (running, runnable, waiting, etc.) involves CAS operations on the scheduler's data structures. The scheduler relies on SC to ensure correctness.

This means: every go func(), every channel operation, every select, every blocking syscall — all of them involve SC atomics inside the runtime. The cost is amortised across the runtime; you rarely see it as a hot spot in your own code.


Channels and SC

Channel operations synchronise. The Go memory model formalises:

  • A send on a channel happens-before the corresponding receive completes.
  • The kth send happens-before the kth receive (for unbuffered or capacity-1 channels, this is trivial; for buffered channels, it follows from queue semantics).
  • A close happens-before a receive that observes the channel as closed.

Internally, channels use SC atomics on their buffer indices and the mutex protecting the channel header.

Combining channels and atomics in the same program is safe. The Go memory model composes them via happens-before.


A Worked Pattern: Reload-Safe Server

package main

import (
    "fmt"
    "net/http"
    "sync/atomic"
)

type Config struct {
    Greeting string
}

var cfg atomic.Pointer[Config]

func init() {
    cfg.Store(&Config{Greeting: "hello"})
}

func reload(greeting string) {
    cfg.Store(&Config{Greeting: greeting})
}

func handler(w http.ResponseWriter, r *http.Request) {
    c := cfg.Load()
    fmt.Fprintln(w, c.Greeting)
}

func adminReload(w http.ResponseWriter, r *http.Request) {
    g := r.URL.Query().Get("greeting")
    reload(g)
    fmt.Fprintln(w, "reloaded")
}

func main() {
    http.HandleFunc("/", handler)
    http.HandleFunc("/reload", adminReload)
    http.ListenAndServe(":8080", nil)
}

Every request handler reads the current config atomically. The admin endpoint swaps it. Concurrent requests during reload either see the old or the new — never a half-built state.


Another Worked Pattern: Bounded Spin

package main

import (
    "fmt"
    "runtime"
    "sync"
    "sync/atomic"
    "time"
)

type Latch struct {
    done atomic.Bool
}

func (l *Latch) Signal() { l.done.Store(true) }

func (l *Latch) Wait() {
    spins := 0
    for !l.done.Load() {
        if spins < 100 {
            spins++
        } else if spins < 1000 {
            runtime.Gosched()
            spins++
        } else {
            time.Sleep(time.Microsecond)
        }
    }
}

func main() {
    var l Latch
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        l.Wait()
        fmt.Println("released")
    }()
    time.Sleep(10 * time.Millisecond)
    l.Signal()
    wg.Wait()
}

This latch uses an escalating wait: tight spin → yield → sleep. It avoids burning CPU on a long wait while keeping low-latency wake-up for short waits.


Yet Another Pattern: Resettable Once

type ResettableOnce struct {
    state atomic.Int32 // 0 = idle, 1 = running, 2 = done
}

func (r *ResettableOnce) Do(f func()) bool {
    if !r.state.CompareAndSwap(0, 1) {
        return false
    }
    f()
    r.state.Store(2)
    return true
}

func (r *ResettableOnce) Reset() bool {
    return r.state.CompareAndSwap(2, 0)
}

Unlike sync.Once, this one can be reset. The SC atomics ensure correctness across goroutines.


Patterns to Avoid

Anti-pattern: condition variable on atomic

var flag atomic.Bool
var cond *sync.Cond

func waiter() {
    cond.L.Lock()
    for !flag.Load() {
        cond.Wait()
    }
    cond.L.Unlock()
}

If the mutex used by cond is the only synchronisation, the atomic is unnecessary — use a plain bool guarded by the same mutex. If the atomic is also synchronising, then cond.Wait may miss signals because the wake-up race window is open.

Use channels instead:

done := make(chan struct{})

func waiter() { <-done }
func signal() { close(done) }

Anti-pattern: atomic counter inside critical section

var counter atomic.Int64
var mu sync.Mutex

func work() {
    mu.Lock()
    defer mu.Unlock()
    counter.Add(1) // unnecessary atomic
    // ...
}

If the increment is always done under the mutex, plain int64 suffices. The atomic adds a fence with no benefit.

Anti-pattern: comparing atomic structs

var a, b atomic.Int64
if a == b { /* always false */ }

a and b are different struct values. Compare a.Load() == b.Load() instead.

Anti-pattern: pointer publication of a mutable struct

type S struct{ N int }
var p atomic.Pointer[S]

func update() {
    s := p.Load()
    s.N++ // RACE
}

Mutating *p.Load() is a race. Use a copy-on-write pattern.


SC at the Edge: Syscalls and Cgo

Syscalls

When a goroutine makes a syscall, the runtime parks it on a real OS thread until the syscall returns. Memory operations performed by the kernel (e.g., mmap, read) are synchronised with Go via the syscall's return — a happens-before edge.

But: if you mmap a shared region and share it with another process, you are on your own. Use kernel-provided synchronisation (futexes, semaphores) and treat the region as racy from Go's perspective.

Cgo

C code called via cgo has its own atomic semantics. If you call C code that uses <stdatomic.h> with memory_order_seq_cst, the semantics align with Go's. If the C code uses weaker orderings (relaxed, acquire-release), you must reason about the combined model carefully.

A pragmatic rule: use SC everywhere across the boundary. The performance cost is small; the bug surface is huge if you mix.

mmap-shared memory

For shared-memory IPC across processes, atomic operations on shared memory are tricky. Both processes must agree on the memory model. POSIX provides pthread_mutex with shared attribute (PTHREAD_PROCESS_SHARED) and atomics from <stdatomic.h>. Go does not directly expose these but you can use cgo to call them.


More Litmus Tests

Litmus: store-store with read

       Thread A          Thread B
       x = 1             y = 1
       y = 2             r = x

If r == 0, then under SC y == 1 must be visible (since y was written before r was read).
This requires reading y on Thread B after the read of x:

Add a read of y on B:

       Thread A          Thread B
       x = 1             r1 = x
       y = 2             r2 = y

Under SC: if r2 == 2, then r1 must be 1.

If you observe r2 == 2 && r1 == 0, that's an SC violation — the read of x happened "before" the write of x in some ordering, yet y had already been written.

Litmus: write-read-write

       Thread A          Thread B
       x = 1             r1 = x
       r2 = y            y = 1

Under SC: the global order must allow r1 == 1 to imply that A's write of x precedes B's read of x.
If r2 == 1, then y == 1 happened before A's read of y, so y == 1's write precedes A's read of y in global order.
Combining: if r1 == 1 && r2 == 1, the global order is x = 1 → r1 = x → y = 1 → r2 = y.
A wrote x = 1 before reading y → in program order. ✓
B wrote y = 1 after reading x → in program order. ✓
SC consistent.

Litmus: many threads, three atomics

T1: x = 1
T2: y = 1; r1 = x
T3: r2 = y; r3 = x
T4: r4 = x; r5 = y

Under SC: if r2 == 1, then y = 1 happened before T3's read of y. T3's read of x must see the global state at or after that point. So r3 == 1 if x = 1 happened anywhere in global order before T3's read.

These tests get tedious to enumerate by hand. Tools like Herd7 and RmEM automate them for C/C++ models. Go's race detector and the formal Cox proofs cover the Go side.


More Performance Tips

  • Use Swap for "set and get old": Swap(new) returns the old value and stores new. Equivalent to Load + Store but atomic. Useful for resettable counters.
  • Use CompareAndSwap to detect conflicts: if you need lock-free updates with retry, CAS is the tool.
  • Avoid for {old := Load(); CompareAndSwap(old, ...)} if Add would suffice: Add is one operation; CAS is a retry loop.
  • Avoid atomic operations on values you only write once: a one-shot publication is just a Store. No CAS needed.
  • Pad to cache lines for write-heavy atomics: 64 bytes on x86, 64 bytes on ARM64.

More Tests

Test: published value visible

func TestPublishVisible(t *testing.T) {
    var p atomic.Pointer[int]
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        v := 42
        p.Store(&v)
    }()
    wg.Wait()
    got := p.Load()
    if got == nil || *got != 42 {
        t.Fatalf("got %v, want 42", got)
    }
}

Test: ordering visible

func TestOrderingVisible(t *testing.T) {
    var a, b atomic.Int32
    for i := 0; i < 10_000; i++ {
        a.Store(0)
        b.Store(0)
        var wg sync.WaitGroup
        wg.Add(2)
        go func() {
            defer wg.Done()
            a.Store(1)
            if b.Load() == 0 {
                // ok: B hasn't run yet
            }
        }()
        go func() {
            defer wg.Done()
            b.Store(1)
            if a.Load() == 0 {
                // ok: A hasn't run yet
            }
        }()
        wg.Wait()
        if a.Load() != 1 || b.Load() != 1 {
            t.Fatalf("final state wrong: a=%d b=%d", a.Load(), b.Load())
        }
    }
}

Test: counter sum

func TestCounterSum(t *testing.T) {
    var c atomic.Int64
    var wg sync.WaitGroup
    const N = 16
    const each = 100_000
    for i := 0; i < N; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < each; j++ {
                c.Add(1)
            }
        }()
    }
    wg.Wait()
    if c.Load() != int64(N*each) {
        t.Fatalf("got %d, want %d", c.Load(), N*each)
    }
}

These three tests together exercise SC's core promises.


Reading Recommendations

  • Russ Cox, "Hardware Memory Models" (2021): https://research.swtch.com/hwmm
  • Russ Cox, "Programming Language Memory Models" (2021): https://research.swtch.com/plmm
  • Russ Cox, "Updating the Go Memory Model" (2021): https://research.swtch.com/gomm
  • Hans Boehm & Sarita Adve, "Foundations of the C++ Concurrency Memory Model" (2008).
  • Leslie Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" (1979).

These five sources together give the full picture of where SC came from, how it works, and why Go chose it.


Closing the Middle Chapter

You should now be able to:

  • Read Go's memory model spec and answer questions about specific guarantees.
  • Translate C++ memory-order code into Go's SC-by-default style.
  • Identify when SC overhead matters and apply mitigations (sharding, padding, batching).
  • Write SC-dependent algorithms (Dekker, SPSC, lock-free counters).
  • Use the race detector confidently as the SC-DRF verification tool.
  • Reason about happens-before edges through channels, mutexes, WaitGroups, and atomics.

The senior page formalises the model with full happens-before notation, presents Lamport's original SC paper in modern terms, and goes deep on store-buffer effects and fence emission across architectures.


Appendix M: A Tour of Real-World Codebases Using SC Atomics

The best teacher is real production code. Below are pointers to public Go codebases where SC atomics are used heavily, with brief notes on what to look for.

The Go runtime itself

src/runtime/proc.go — scheduler. Look at findrunnable, runqsteal, g0-related code. Atomic operations on goroutine queues are pervasive.

src/runtime/sema.go — semaphore implementation. Used by sync.Mutex for blocking waits. SC atomics on the wait list.

src/runtime/mgc.go — garbage collector. Heavy use of atomic word ops to track grey/black/white state.

Standard library

src/sync/once.gosync.Once as discussed.

src/sync/map.gosync.Map uses atomic pointer publication for its read-only fast path.

src/sync/pool.go — per-P caches, atomic for handoff.

src/net/http/server.go — uses atomic.Bool for inShutdown and similar flags.

src/context/context.gocancelCtx uses atomics to track done state.

  • prometheus/client_golang — heavy use of atomic.Uint64 for counters and gauges.
  • etcd — atomic indices for raftLog, peer state.
  • cockroachdb/pebble — atomic versions, generation counters in LSM trees.
  • nats-io/nats-server — atomic flags for connection state, sequence numbers.
  • dgraph-io/badger — atomic counters and pointers in the value log.
  • grpc-go — atomics for connection state machine transitions.

Reading 100 lines of any of these is more educational than reading 1000 lines of synthetic examples.


Appendix N: Common Q&A from Production Reviews

Q: We have an atomic.Int64 counter incremented per request. We see contention in pprof on runtime/internal/atomic.Xadd64. What do we do?

A: Shard the counter by P (logical processor). Use runtime.NumCPU() shards, each padded to a cache line. Sum on read.

type ShardedCounter struct {
    shards []struct {
        n atomic.Int64
        _ [56]byte // pad
    }
}

func New() *ShardedCounter {
    return &ShardedCounter{shards: make([]struct {
        n atomic.Int64
        _ [56]byte
    }, runtime.GOMAXPROCS(0))}
}

func (c *ShardedCounter) Inc() {
    g := runtime.GoroutineID() // hypothetical
    c.shards[g%uintptr(len(c.shards))].n.Add(1)
}

func (c *ShardedCounter) Sum() int64 {
    var s int64
    for i := range c.shards {
        s += c.shards[i].n.Load()
    }
    return s
}

(Note: Go doesn't expose goroutine IDs publicly. In practice you use runtime_procPin via unsafe linkname, or just modulo a counter passed in.)

Q: We hot-reload config via atomic.Pointer[Config]. Memory grows over time. Why?

A: Old configs are retained as long as some goroutine holds a reference. If a request handler Loads the config and holds it for the entire request, the previous configs cannot be GC'd until those handlers finish. Solutions:

  • Keep handlers short.
  • Use generation counters to detect stale configs and refresh.
  • Use weak references (Go does not have them natively; emulate with explicit lifetime management).

Q: We use atomic.Bool for a stop flag, but the worker takes 50ms to notice. Why?

A: The worker is probably blocked in a syscall (read, write, sleep). The atomic Load cannot fire while blocked. Solutions:

  • Use context.Context with cancellation; the syscall wrappers respect it.
  • Use channels with select and a timeout/cancel channel.
  • Set read/write deadlines on sockets.

Q: Should we use sync.Mutex or sync/atomic for a counter?

A: For a single integer counter incremented across goroutines, atomic.Int64.Add is faster than a mutex-protected int64. The atomic is one fence; the mutex is two atomics (lock and unlock) plus the actual increment.

Q: We have a struct with several fields. We need to update them atomically. What do we use?

A: A mutex. Atomics protect single-word values. For multi-field updates, either: - Use a mutex. - Use atomic.Pointer[T] with copy-on-write: build a new T with all fields set, then Store.

Q: Our atomic store on ARM is much slower than on x86. Expected?

A: Yes. x86 is "almost SC" (TSO); ARM is weakly ordered and needs explicit fences. SC stores on ARM cost ~10–20 ns vs ~10 ns on x86.


Appendix O: The Two-Slot Trick

A common pattern when you want atomic publication with no allocation:

type Holder struct {
    slots [2]Config
    cur   atomic.Int32
}

func (h *Holder) Update(c Config) {
    next := 1 - h.cur.Load()
    h.slots[next] = c
    h.cur.Store(next)
}

func (h *Holder) Read() Config {
    return h.slots[h.cur.Load()]
}

The writer writes to the inactive slot, then atomically flips the index. Readers may briefly see the old slot. No allocation, just a copy.

The catch: readers may be mid-read when a second update happens, overwriting the slot they're reading. This is dangerous unless the read is atomic word-sized. For multi-field structs, prefer atomic.Pointer[T] with allocation.


Appendix P: Hardware-Level Detail You Should Know

Cache coherence vs memory consistency

Coherence: all cores eventually agree on each individual location's value.

Consistency: ordering between operations on different locations.

SC is a consistency property. Hardware always provides coherence (MESI protocol or similar). Consistency varies by ISA.

MESI states

Each cache line is in one of four states: - Modified: this core has the only copy, has modified it. - Exclusive: this core has the only copy, unchanged. - Shared: multiple cores have read-only copies. - Invalid: this core's copy is stale.

When you atomic.Store(1), the cache line moves to Modified on your core, invalidating it on others. Subsequent reads from other cores trigger a cache-to-cache transfer.

Store buffers and load forwarding

On x86, each core has a store buffer holding pending writes. A new write enters the buffer immediately; it commits to L1 cache later. Reads check the buffer first (store-to-load forwarding) so the same core sees its own writes immediately.

The buffer creates the famous x86 reordering: a store followed by a load from a different address may appear as the load happening "before" the store from another core's perspective. SC eliminates this via xchg or mfence.

ARM's weaker model

ARM has store buffers like x86 but also reorders loads relative to each other. SC requires dmb ish or LDAR/STLR to prevent.

RISC-V's RVWMO

RISC-V's weak memory order permits all four reorderings. Fence instructions are explicit and granular: fence rw,rw is a full fence; fence r,r orders only reads. SC atomics use amoswap.w.aqrl (read-modify-write with both acquire and release semantics).


Appendix Q: A Tiny Formal Take

A program execution is a tuple (E, po, rf, mo):

  • E: set of memory events (reads, writes, atomics).
  • po: program order (per-thread total order).
  • rf: reads-from (each read mapped to the write it observes).
  • mo: modification order (per-location total order over writes).

Sequential consistency requires the existence of a total order < on E such that:

  1. < is consistent with po (per-thread program order respected globally).
  2. Each read sees the most recent write per <.

Weaker models relax (1) or (2). Acquire-release relaxes (1) for non-synchronising operations. Relaxed relaxes both heavily.

Go's SC guarantee: < exists for every race-free program.


Appendix R: A Comparison Matrix

Property C++ relaxed C++ acq-rel C++ seq-cst Go atomic (always)
Atomic operation yes yes yes yes
Order with prior ops no yes yes yes
Order with later ops no yes yes yes
Total order across atomics no no yes yes
Forbids store-buffer outcome no no yes yes
Cost on x86 low low medium medium
Cost on ARM low medium high high

Go sits in the rightmost column for all operations. The simplicity of the model is its strength.


Appendix S: Step-by-Step: From Plain to SC

A teaching exercise. Start with broken code, fix it step by step.

Step 1: the broken code

package main

import (
    "fmt"
    "sync"
)

var (
    ready bool
    data  int
)

func main() {
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        data = 42
        ready = true
    }()
    go func() {
        defer wg.Done()
        for !ready {
        }
        fmt.Println(data)
    }()
    wg.Wait()
}

Run with -race. Detector reports a race on both ready and data.

Step 2: make ready atomic

import "sync/atomic"

var (
    ready atomic.Bool
    data  int
)

func main() {
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        data = 42
        ready.Store(true)
    }()
    go func() {
        defer wg.Done()
        for !ready.Load() {
        }
        fmt.Println(data)
    }()
    wg.Wait()
}

Now ready is SC. The race on ready is gone. The race on data? Strictly speaking, the writer and reader both touch data, but the writer's data = 42 happens-before its ready.Store(true), which happens-before the reader's ready.Load() == true, which happens-before the reader's data read. So data is race-free thanks to the SC atomic. The race detector confirms.

Step 3: simplify with channels

func main() {
    done := make(chan struct{})
    var data int
    go func() {
        data = 42
        close(done)
    }()
    <-done
    fmt.Println(data)
}

Channels do the same job. The channel close happens-before the receive. The receive sees data = 42.

Step 4: when atomic is the right tool

If you have many readers and one writer, channels are awkward (you'd need to broadcast). Atomics are cleaner:

var (
    data  atomic.Pointer[int]
)

func writer() {
    v := 42
    data.Store(&v)
}

func reader() {
    for data.Load() == nil {
    }
    fmt.Println(*data.Load())
}

Multiple readers, one writer, one publication. SC guarantees every reader sees the same value.


Appendix T: The Big Diagram of Memory Models

                           STRONGER
                              │ Linearisability
                              │   (SC + real-time ordering)
                              │ Sequential Consistency  ← Go's promise
                              │   (one global total order)
                              │ Acquire-Release
                              │   (per-pair ordering, no global total)
                              │ Causal Consistency
                              │   (only causally related ops ordered)
                              │ Eventual Consistency
                              │   (reads converge eventually)
                           WEAKER

Go picks SC. C++/Rust let you pick anywhere on the ladder. Distributed databases live below SC. Filesystems are at various levels (POSIX is roughly SC for single-file ops).


Appendix U: Practice Problems

Each problem has a tiny code skeleton; flesh it out and test under -race.

Problem 1: race-free counter

Implement type Counter struct{...} with Inc(), Dec(), Val() int64. Use SC atomics. Stress test from 100 goroutines.

Problem 2: copy-on-write set

Implement type Set struct{...} with Add(s string), Has(s string) bool. Reads must be lock-free; writes copy and swap.

Problem 3: rate limiter

Implement a token-bucket rate limiter using atomic counters. Allow n requests per second. Test under load.

Problem 4: stop-the-world flag

Implement a stop flag with sub-100µs latency: when set, all workers exit within that bound. Use atomic.Bool + runtime.Gosched().

Problem 5: epoch reclamation

Implement an epoch counter and an Enter()/Exit() pair. Use atomic counters to ensure freed memory is observed dead by no one.

Problem 6: SPSC ring buffer

Implement a single-producer/single-consumer ring buffer with SC atomic head/tail. Benchmark against chan int.

Problem 7: priority queue with atomic version

Build a priority queue where each Pop returns the current version. Atomic version increments on each op.

Problem 8: read-mostly map

Implement Map[K, V] with COW on write, lock-free read. Compare to sync.Map.

Problem 9: lazy singleton

Use atomic.Pointer[T] + sync.Mutex for double-checked init. Compare cost to sync.Once.

Problem 10: TLS-style per-goroutine state

Implement a per-goroutine storage using goroutine ID. Test with 1000 goroutines.


Appendix V: A Reading Plan

If you have one afternoon to learn SC for Go:

  1. Read this page top to bottom (45 min).
  2. Read Russ Cox's "Updating the Go Memory Model" (30 min).
  3. Read src/sync/once.go and src/sync/atomic/type.go (15 min).
  4. Run the litmus tests yourself (30 min).
  5. Refactor one piece of your own code to use SC atomics (30 min).
  6. Run go test -race -count=10 on a real codebase (30 min).

If you have one week:

  • Day 1: this page + Russ Cox's blog series.
  • Day 2: Lamport 1979 + Adve/Boehm 2010.
  • Day 3: read Go runtime atomics for x86, ARM64.
  • Day 4: implement Dekker, Peterson, SPSC queue from scratch.
  • Day 5: refactor a production codebase to atomics where appropriate; measure.

Appendix W: Anti-Patterns Library

Below is a catalogue of anti-patterns we've seen in production reviews, with corrections.

Anti-pattern: shadowing atomic with local

var counter atomic.Int64

func bad() {
    counter := 0 // shadows the atomic!
    counter++
}

The local counter is a plain int, unrelated to the package-level atomic. The increment is wasted. Common typo.

Anti-pattern: atomic in receive-only field

type S struct { v atomic.Int64 }

func (s S) Get() int64 { return s.v.Load() } // value receiver copies

Use *S receiver. Value receiver copies the atomic, defeating the contract.

Anti-pattern: storing the wrong type via atomic.Value

var v atomic.Value
v.Store(&Config{})
v.Store(&OtherConfig{}) // PANIC at runtime

atomic.Value requires all stored values to share the same concrete type. Use atomic.Pointer[T] for type safety.

Anti-pattern: assuming Load is real-time

var done atomic.Bool

func wait() {
    for !done.Load() {
        time.Sleep(time.Millisecond) // wakes every 1ms but flag may be set immediately
    }
}

Load is immediate (no delay), but the Sleep between loads is. Use channels for low-latency wakeup.

Anti-pattern: atomic on slice header

var slice []int
go atomic.StoreInt64((*int64)(unsafe.Pointer(&slice[0])), 1)

Don't. Use a proper atomic type.


Appendix X: One More Real-World Pattern

The "atomic ticker":

type Ticker struct {
    period atomic.Int64 // nanoseconds
    stop   atomic.Bool
}

func (t *Ticker) SetPeriod(d time.Duration) { t.period.Store(int64(d)) }
func (t *Ticker) Stop()                     { t.stop.Store(true) }

func (t *Ticker) Run(work func()) {
    for !t.stop.Load() {
        p := time.Duration(t.period.Load())
        if p <= 0 {
            return
        }
        time.Sleep(p)
        work()
    }
}

The period can be changed at runtime via SC store. The stop signal is SC. Single field updates, no mutex needed.


Appendix Y: The Memory Model in One Page

Go memory model, version 1.19+:

1. A data race is two goroutines accessing the same memory location,
   with at least one write, not ordered by happens-before.

2. Programs with data races have undefined behaviour.

3. Programs without data races behave as if executed under
   sequential consistency: every memory operation appears to
   happen one at a time in a global total order consistent
   with each goroutine's program order.

4. Synchronising operations that establish happens-before edges:
     - Goroutine creation (go f() → f's first instruction).
     - Channel send → corresponding receive.
     - Channel close → receive observing closed.
     - Mutex unlock → next mutex lock on the same mutex.
     - WaitGroup Done → Wait returning.
     - Once.Do(f) → return of any subsequent Do on the same Once.
     - Atomic operation → any subsequent atomic operation on the
       same variable (in the global total order over all atomics).

5. Atomic operations:
     - All sync/atomic operations are sequentially consistent.
     - This means: there is a total order over all atomic
       operations in the program; each goroutine's atomic
       operations appear in that order in program-order; and
       every read sees the most recent write per that order.

6. Race detector (`-race`): dynamic detector of (1). Reports
   every observed race; misses races that don't occur.

7. Recommendation: use sync.Mutex, channels, or sync/atomic.
   Never rely on plain reads/writes for cross-goroutine
   communication.

This one-page summary is enough to refer back to in code reviews.


Appendix Z: Wrap-Up

This middle page should leave you with:

  • A formal understanding of Go's SC-DRF guarantee.
  • The ability to compare Go's choice with C++/Rust/Java.
  • Concrete patterns for SC atomics in production.
  • Awareness of cost on x86 vs ARM and how to mitigate.
  • Skills to use the race detector, benchmark atomics, and profile contention.
  • Familiarity with the litmus tests that distinguish SC from weaker models.

The senior page goes one more level deeper: formal happens-before lattices, the Lamport SC proof, fence emission strategy in the Go compiler, and SC at the boundary of the runtime, cgo, and the operating system.


Appendix AA: Pitfalls in Real Codebases

A short tour of bugs we've found by running -race on open-source code:

Bug 1: missing atomic on rate limiter

type Limiter struct {
    last time.Time
    n    int
}

func (l *Limiter) Allow() bool {
    now := time.Now()
    if now.Sub(l.last) > time.Second {
        l.last = now
        l.n = 0
    }
    l.n++
    return l.n < 100
}

Multiple goroutines call Allow. The struct fields are racy. Fix with mutex or atomic indirection.

Bug 2: shared map without sync.Map

var cache = make(map[string]string)

func Get(k string) string { return cache[k] }
func Set(k, v string)     { cache[k] = v }

Standard map is not concurrent-safe. Use sync.Map, a mutex, or COW with atomic.Pointer[map[...]...].

Bug 3: atomic flag but plain data

var ready atomic.Bool
var data *Config // plain pointer

func init() {
    data = &Config{}
    ready.Store(true)
}

func Get() *Config {
    if ready.Load() {
        return data // racy read
    }
    return nil
}

ready is SC but data is plain. The reader's plain read of data races with init's plain write. Fix: use atomic.Pointer[Config] for data too, or eliminate the indirection.

Actually wait — the writer's data = &Config{} happens-before its ready.Store(true), which happens-before the reader's ready.Load() == true. So data is not racy — it's covered by the publication via SC ready. The race detector would confirm. So this code is actually correct.

The lesson: the boundary between racy and race-free depends on happens-before edges. Even plain reads/writes can be race-free if surrounded by atomic synchronisation.

Bug 4: missing happens-before for the readback

var counter atomic.Int64

func work() {
    counter.Add(1)
    process(counter.Load())
}

counter.Load() may not see the most recent state — but it does see its own increment plus an SC-consistent view. The "bug" depends on what process expects.

Bug 5: atomic on the wrong granularity

type Stats struct {
    successes atomic.Int64
    total     atomic.Int64
}

func record(ok bool) {
    s.total.Add(1)
    if ok {
        s.successes.Add(1)
    }
}

func snapshot() (s, t int64) {
    return s.successes.Load(), s.total.Load()
}

snapshot reads successes and total non-atomically as a pair. Another goroutine may increment between the two loads. The pair is inconsistent. Either: - Accept the inconsistency (often fine for monitoring). - Use a mutex for the pair. - Use a single atomic.Pointer[StatsSnapshot] that the recorder updates atomically.

Bug 6: closure captures atomic by value

type Counter struct{ n atomic.Int64 }

func New() Counter {
    var c Counter
    return c
}

func main() {
    c := New()
    go func() { c.n.Add(1) }() // operates on a local copy
}

Returning Counter by value copies the atomic. The goroutine's Add updates an unrelated copy. Use *Counter and func New() *Counter { return &Counter{} }.

Bug 7: atomic on a pointer to mutable data

type Cache struct {
    m atomic.Pointer[map[string]string]
}

func (c *Cache) Set(k, v string) {
    m := c.m.Load()
    (*m)[k] = v // RACE: mutates published data
}

atomic.Pointer only protects the pointer. The map itself is shared mutable state. Fix: copy-on-write.


Appendix BB: Decision Tree for Synchronisation Choice

Are you sharing data between goroutines?
├─ NO → no synchronisation needed (thread confinement)
└─ YES
   Is the data immutable after creation?
   ├─ YES → safe to share unmuted (publish via channel, return value, etc.)
   └─ NO
      Is it a single word-sized value (int64, bool, pointer)?
      ├─ YES → sync/atomic (Bool, Int64, Pointer[T])
      └─ NO
         Multi-step critical section needed?
         ├─ YES → sync.Mutex (or sync.RWMutex for read-mostly)
         └─ NO
            Message passing fits?
            ├─ YES → channel
            └─ NO → sync.Mutex by default

Most real choices end at sync.Mutex, sync/atomic, or channels. The rest are minor variants.


Appendix CC: Performance Numbers from Production

Measured on an Intel Xeon 8-core, 3 GHz, Linux 6.5, Go 1.22:

Operation Time
Plain int64 load 0.3 ns
Plain int64 store 0.3 ns
atomic.Int64.Load 0.3 ns
atomic.Int64.Store (uncontended) 8 ns
atomic.Int64.Add (uncontended) 8 ns
atomic.Int64.CompareAndSwap (uncontended) 9 ns
atomic.Int64.Add (8 cores, contended) 200 ns
sync.Mutex.Lock+Unlock (uncontended) 18 ns
sync.Mutex.Lock+Unlock (8 cores, contended) 5000 ns
Channel send+receive (unbuffered) 100 ns
Channel send+receive (buffered, capacity 1) 60 ns

On a Raspberry Pi 4 (ARM64, 4-core, 1.5 GHz):

Operation Time
Plain int64 load 1 ns
atomic.Int64.Load 5 ns
atomic.Int64.Store 15 ns
atomic.Int64.Add (uncontended) 18 ns
sync.Mutex.Lock+Unlock 40 ns

These numbers illustrate: - x86 loads are essentially free even at SC. - ARM SC is meaningfully more expensive. - Mutex contention is the worst case by far. - Channels are slowest per-op but allow batching.


Appendix DD: Composition Rules

Synchronisation primitives compose:

  • Mutex + atomic: fine. Mutex provides multi-step isolation; atomics handle hot fields.
  • Channel + atomic: fine. Channel happens-before composes with atomic happens-before.
  • WaitGroup + atomic: fine. Done synchronises with Wait.
  • Once + atomic: fine. Do completion synchronises with all subsequent calls.

The Go memory model's happens-before is transitive, so composing primitives is straightforward. Race-free programs that mix primitives still behave under SC.


Appendix EE: Closing Recap

The middle level of SC competence in Go means you:

  • Understand the SC-DRF guarantee at the spec level.
  • Know happens-before edges and can draw the lattice for a small program.
  • Have used atomic.Pointer[T], atomic.Bool, atomic.Int64 in production patterns.
  • Have benchmarked SC cost on x86 and ARM.
  • Have applied sharding/padding to reduce contention.
  • Can compare Go's SC commitment with C++'s opt-in seq-cst, Java's volatile, Rust's Ordering::SeqCst.
  • Recognise anti-patterns and bugs in real code.
  • Have read at least one Russ Cox blog post on memory models.

Move to the senior page when you want to know how SC is implemented at the compiler and CPU level, why the Go runtime trusts it, and what happens at the boundary with non-SC subsystems.