Sequential Consistency — Senior Level¶

Table of Contents¶

Introduction
Lamport's Original Definition
Happens-Before, Formally
SC as a Total Order over Atomics
Store Buffers and Why SC Is Hard
The Go Compiler's Fence Emission
x86 TSO in Depth
ARM64 and the Acquire-Release Pair Trick
RISC-V and Explicit Fences
Cache Coherence vs Consistency
The IRIW Litmus and Why SC Differs from Acq-Rel
Reading the Go Runtime Atomics Code
Designing Lock-Free Data Structures on SC
SC at the Runtime Boundary
SC and Cgo, Syscalls, mmap
Formal Verification with TLA+
Production Engineering: When SC Is the Right Tool
Design Trade-offs in Memory Models
What Could Break SC in Go?
Self-Assessment
Summary

Introduction¶

The senior level of SC mastery is not about more code samples — it is about understanding why SC works at the level of compilers, CPUs, and formal models. After this page you should be able to:

State and apply Lamport's original SC definition.
Read the Go compiler's atomic-lowering pass and predict fence emission.
Reason about store-buffer behaviour on x86, ARM, RISC-V and how SC defeats it.
Distinguish the IRIW litmus as the SC-vs-acq-rel separating example.
Design lock-free data structures whose correctness arguments depend on SC.
Identify boundaries (cgo, syscalls, mmap) where SC reasoning ends and you must take over.
Compare memory-model design trade-offs at the language-design level.

This page assumes you have internalised the junior and middle pages: you know what SC-DRF means, you have used sync/atomic, you can read happens-before edges.

Lamport's Original Definition¶

Leslie Lamport defined sequential consistency in 1979:

The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

Two key clauses:

Some sequential order: there exists a global total order over all memory operations.
Program order preserved per processor: within each processor's operations, the global order respects program order.

Lamport's paper was about distributed systems, but the definition applies unchanged to shared-memory multi-core CPUs. A multi-core CPU is essentially a tiny distributed system.

Implications¶

The global order is not unique. Many orders may satisfy the definition for one execution.
The order is not real-time-consistent. SC does not require operations to appear in wall-clock order. (Linearisability does; SC does not.)
Per-processor program order is non-negotiable. The hardest constraint to maintain in practice.

Why this definition is hard to implement¶

The cheapest hardware behaviour is "everyone does what they want, eventually agreeing." Coordinating a global total order requires either:

Fully blocking on each operation (slow).
Speculative execution + rollback (complex).
Lazy ordering with fences at synchronisation points (the actual approach).

The Go runtime and compiler emit fences at exactly the synchronisation points needed to preserve SC. Outside those points, the hardware is free to optimise.

Happens-Before, Formally¶

The happens-before relation → on memory events is defined by:

Program order edges: if event a precedes event b in the same goroutine's program, then a → b.
Synchronisation edges: defined per primitive (see below).
Transitive closure: if a → b and b → c, then a → c.

Synchronisation edges in Go¶

Primitive	Edge
Goroutine creation	`go f()` (the statement) → first instruction of `f`
Channel send/receive	`send_k` → `receive_k` (the kth send / receive on the channel)
Channel close	`close(ch)` → any receive observing the channel as closed
Mutex	`Unlock_k` → `Lock_{k+1}` on the same mutex
RWMutex	`Unlock_w` → next `RLock` or `Lock`; all `RUnlock`s in a generation → next `Lock`
WaitGroup	`Done_i` → return of `Wait` (after the counter reaches zero)
Once	completion of `f` inside `Do` → return of any subsequent `Do` on the same Once
Atomic	`op_i` → `op_{i+1}` for the global total order over all atomic operations

Race definition¶

A data race exists between events a and b if: - Both access the same memory location. - At least one of a, b is a write. - Neither a → b nor b → a.

A program is data-race-free if no execution contains a race.

SC-DRF formally¶

For every data-race-free program, there exists a total order < on memory events such that: - < extends the happens-before relation: if a → b, then a < b. - Each read returns the value of the most recent write per <.

This is the formal statement of Go 1.19's SC-DRF guarantee.

SC as a Total Order over Atomics¶

Within Go's memory model, atomics get a stronger guarantee than other synchronising operations: there exists a global total order <_a over all atomic operations across all memory locations.

This is what makes Go's atomics seq-cst rather than just acq-rel:

Acq-rel: each atomic location has its own total order; pairs (release, acquire) cross-couple by happens-before; but two different locations may be observed in different orders by different goroutines.
Seq-cst: there is one total order over all atomic operations on all locations. Every goroutine agrees on this order.

The IRIW litmus separates the two (see below).

How the Go memory model phrases it¶

The spec, paraphrased:

The atomic operations have a total order, and each atomic operation is sequentially consistent with respect to this total order.

The order is established at runtime by the hardware fences emitted by the compiler. The key insight: a full fence on every atomic store (or a CAS / xchg-style operation, which is implicitly fenced) ensures that observed orderings are consistent across all observers.

Store Buffers and Why SC Is Hard¶

The store buffer¶

Each CPU core has a store buffer: a small FIFO queue of pending writes. When you execute a store instruction:

The write enters the store buffer.
The instruction "retires" (the next instruction may execute).
Later, the write commits to L1 cache and propagates via the cache-coherence protocol.

Loads check the store buffer first (store-to-load forwarding) so the same core sees its own writes immediately.

The problem¶

A store followed by a load to a different address may appear, from another core's perspective, as: the load happened first.

Why? The local core sees: store x, load y. But another core sees: load y (because the store to x is still in the buffer), then later store x. From the other core's perspective, the load happened first.

This is the classic x86 store-load reordering, and the only reordering x86 permits. It is enough to break SC.

The store-buffer litmus, replayed¶

Goroutine A         Goroutine B
x = 1               y = 1
r1 = y              r2 = x

Without fences: - A: x = 1 (in buffer), load y (sees 0). - B: y = 1 (in buffer), load x (sees 0). - Both r1 and r2 = 0. SC violation.

With SC fences (a fence between store and load): - A: x = 1 (in buffer), FENCE flushes buffer, load y. y is whatever B has propagated. - Either A's store committed before B's load, or B's store committed before A's load. Both reads cannot return 0.

The fence cost¶

A store buffer flush is not free. It stalls the core until pending writes commit, which requires cache-coherence traffic. On modern x86 a lock-prefixed instruction (which implies a fence) takes 20–30 cycles minimum. On uncontended cache lines it's still ~10 cycles. On highly contended cache lines it's hundreds of cycles.

This is why hot atomic stores are a bottleneck. Mitigation: shard, batch, or relax (in C++; not in Go).

The Go Compiler's Fence Emission¶

The Go compiler lowers sync/atomic operations to platform-specific assembly via the runtime/internal/atomic package. The dispatch is in cmd/compile/internal/ssagen/intrinsics.go and the assembly in runtime/internal/atomic/asm_*.s.

Strategy per platform¶

x86-64: - Load → MOV. (x86 TSO gives load-acquire for free.) - Store → XCHG (LOCK prefix implicit; full fence). - Add → LOCK XADD. - CAS → LOCK CMPXCHG.

ARM64 (ARMv8.0): - Load → DMB ISHLD + LDR. (Or LDAR, post-ARMv8.0.) - Store → STLR + DMB ISH. (Or DMB ISH + STR + DMB ISH for full SC.) - Add → LDAXR / ADD / STLXR retry loop. - CAS → LDAXR / CMP / STLXR retry loop, or CASAL post-ARMv8.1.

ARM64 (ARMv8.3+): - Load → LDAR. - Store → STLR. - Add → LDADDAL or LDADDAL-style. - CAS → CASAL.

LDAR/STLR pair gives full SC on ARM. The non-pair variants (LDAPR, STLR) give RCpc / RCsc, which is weaker.

RISC-V: - Load → LD with implicit aqrl in the runtime wrappers. - Store → SD with explicit fence rw,rw. - Add → amoadd.d.aqrl. - CAS → amocas.d.aqrl or LR/SC pair.

MIPS, s390x, PowerPC, wasm: each has its own fence repertoire.

The compiler's job¶

The compiler does not re-order across atomics. The SSA pass treats atomic ops as memory-ordering boundaries: instructions cannot be moved past them. Combined with the hardware fences, this provides language-level SC.

Inlining and aliasing¶

The compiler also handles aliasing: if you write &x.v (where x is a stack-allocated struct with an atomic field), the compiler ensures the address is correctly identified as a memory operand. Escape analysis may move x to the heap.

x86 TSO in Depth¶

x86's hardware model is Total Store Order (TSO). The single reordering it permits: a store followed by a load to a different address may appear as the load happening first (from another core's perspective).

Equivalently: stores are eagerly buffered; loads bypass the buffer.

Why TSO?¶

The store buffer is essential for performance. Without it, every store would stall the pipeline until cache commit. The buffer hides the latency of cache-coherence traffic.

Intel and AMD codify TSO as the architectural guarantee. Programmers can rely on: - No store-store reordering. - No load-load reordering. - No load-store reordering. - Yes store-load reordering (the only one).

SC on top of TSO¶

To get SC on x86, fence the store-load gap. A LOCK-prefixed instruction or XCHG with a memory operand does this. Modern compilers prefer XCHG over MFENCE because XCHG is slightly faster on most microarchitectures.

The cost is ~10–30 cycles, dominated by the cache-coherence handshake.

x86 quirks¶

Aligned loads/stores ≤ 8 bytes are atomic at the hardware level (no tearing). 16-byte stores via SSE may not be atomic on older CPUs.
MFENCE is a full memory barrier; LFENCE/SFENCE are partial. Go uses XCHG instead because it's faster.
LOCK prefix: applies to ADD, AND, OR, XOR, INC, DEC, XADD, CMPXCHG. Provides atomicity + fence.

ARM64 and the Acquire-Release Pair Trick¶

Background¶

ARM64 (ARMv8) is weakly ordered. By default, all four reorderings (LL, LS, SS, SL) are permitted absent fences.

Fences¶

DMB ISH: full memory barrier in the inner-shareable domain. ~30 cycles.
DMB ISHST: store-only barrier.
DMB ISHLD: load-only barrier.

Acquire-release instructions (ARMv8.0)¶

LDAR: load-acquire. Subsequent loads/stores cannot be reordered before this load.
STLR: store-release. Prior loads/stores cannot be reordered after this store.

The pair LDAR/STLR gives Release Consistency with sequential consistency for synchronising operations (RCsc). This is the SC model.

Cheaper variants (ARMv8.3+)¶

LDAPR: load-acquire processor-consistent. Cheaper but only RCpc (per-processor consistency). Not SC.
STLR: same as before, store-release.

Go uses LDAR/STLR for SC. Some C++ implementations use LDAPR for acq-rel — that's where you'd see a measurable difference vs Go's SC.

Compare-and-swap¶

ARMv8.0 uses LDAXR/STLXR (load-exclusive-acquire / store-exclusive-release) pairs:

retry:
    LDAXR x0, [addr]
    CMP   x0, expected
    B.NE  fail
    STLXR w1, new, [addr]
    CBNZ  w1, retry
fail:

ARMv8.1+ has CASAL (compare-and-swap, acquire+release) as a single instruction. Cheaper and avoids the retry loop on contention.

Fence-only emit¶

For programs that need a fence between two non-atomic ops (rare in Go because atomics imply fences), the compiler emits DMB ISH.

RISC-V and Explicit Fences¶

RISC-V's RVWMO (Weak Memory Order) permits all reorderings. Fences are explicit:

FENCE rw,rw: full memory barrier.
FENCE r,r: load-load only.
FENCE w,w: store-store only.
FENCE r,w, FENCE w,r: mixed.

Atomic instructions¶

The A extension provides:

AMOSWAP.w/d.aqrl: atomic swap with acquire and release.
AMOADD.w/d.aqrl: atomic add with acquire and release.
LR.w/d.aqrl + SC.w/d.aqrl: load-reserved and store-conditional, like ARM's LDAXR/STLXR.
AMOCAS (post-Zacas): single-instruction compare-and-swap (newer extension).

The aqrl suffix means the instruction acts as both acquire and release. Combined, the pair gives SC.

Why RISC-V matters¶

RISC-V is gaining production deployment. The Go toolchain supports it. SC semantics are preserved across architectures, so your Go code Just Works.

Cache Coherence vs Consistency¶

These are often conflated. They are distinct.

Cache coherence: all cores eventually agree on each individual memory location's value. Implemented by MESI / MOESI protocols. Hardware always provides this.

Memory consistency: the ordering between operations on different memory locations. Varies by ISA.

SC is a consistency property. The Go memory model provides SC for atomics; the hardware provides coherence for free.

MESI states¶

State	Meaning
Modified	This core has the only copy and has modified it; others are Invalid.
Exclusive	This core has the only copy, unchanged from memory.
Shared	Multiple cores have read-only copies.
Invalid	This core's copy is stale.

Transitions¶

Read miss → fetch line, share with other cores → S or E.
Write hit on S → invalidate others, transition to M.
Write miss → fetch line, invalidate others → M.

Cache-coherence traffic is the hidden cost of atomic operations: every store invalidates other cores' copies.

When two unrelated variables share a cache line, every store on one invalidates the other on remote cores. Performance plummets even though the variables are logically independent. Mitigation: pad to cache line.

type Padded struct {
    a atomic.Int64
    _ [56]byte
    b atomic.Int64
    _ [56]byte
}

64-byte cache line; 8 bytes for the atomic, 56 bytes of padding.

The IRIW Litmus and Why SC Differs from Acq-Rel¶

The Independent Reads of Independent Writes (IRIW) litmus distinguishes SC from acquire-release:

Thread A:  x.store(1)
Thread B:  y.store(1)
Thread C:  r1 = x.load(); r2 = y.load()
Thread D:  r3 = y.load(); r4 = x.load()

SC forbids:    r1=1, r2=0, r3=1, r4=0
Acq-rel allows: r1=1, r2=0, r3=1, r4=0

Under SC, there is one global order. If C observes x before y is set, and D observes y before x is set, then in C's view A happened before B, but in D's view B happened before A — contradiction.

Under acq-rel, each location has its own total order, but different locations may be observed differently by different observers. C sees A first; D sees B first. No contradiction.

Why SC matters here¶

If you write IRIW-style code in Go, you can rely on the forbidden outcome being impossible. In C++ with acq-rel, you cannot.

In practice, most algorithms do not depend on IRIW. But some do — Peterson-style mutual exclusion, certain lock-free designs, and any algorithm where multiple observers must agree on the order of independent events.

Reading the Go Runtime Atomics Code¶

Let's read a real example: runtime/internal/atomic/atomic_amd64.s (paraphrased):

TEXT ·Xchg64(SB), NOSPLIT, $0-24
    MOVQ ptr+0(FP), BX
    MOVQ new+8(FP), AX
    XCHGQ AX, 0(BX)
    MOVQ AX, ret+16(FP)
    RET

Xchg64 atomically swaps an int64. The instruction is XCHGQ, which on x86 has an implicit LOCK prefix when one operand is memory. This is a full barrier.

TEXT ·Xadd64(SB), NOSPLIT, $0-24
    MOVQ ptr+0(FP), BX
    MOVQ delta+8(FP), AX
    MOVQ AX, CX
    LOCK
    XADDQ AX, 0(BX)
    ADDQ  CX, AX
    MOVQ AX, ret+16(FP)
    RET

Xadd64 is LOCK XADD. The locked prefix flushes the store buffer and provides SC.

TEXT ·Cas64(SB), NOSPLIT, $0-25
    MOVQ ptr+0(FP), BX
    MOVQ old+8(FP), AX
    MOVQ new+16(FP), CX
    LOCK
    CMPXCHGQ CX, 0(BX)
    SETEQ ret+24(FP)
    RET

Cas64 is LOCK CMPXCHG. Atomic compare-and-swap with full barrier.

TEXT ·Load64(SB), NOSPLIT, $0-16
    MOVQ ptr+0(FP), AX
    MOVQ 0(AX), AX
    MOVQ AX, ret+8(FP)
    RET

Load64 is a plain MOVQ. On x86 (TSO), this is sufficient for SC load.

Now ARM64 (atomic_arm64.s, paraphrased):

TEXT ·Load64(SB), NOSPLIT, $0-16
    MOVD ptr+0(FP), R0
    LDAR R0, [R0]
    MOVD R0, ret+8(FP)
    RET

LDAR is load-acquire. ARM needs this fence even for a "simple" load.

TEXT ·Store64(SB), NOSPLIT, $0-16
    MOVD ptr+0(FP), R0
    MOVD val+8(FP), R1
    STLR R1, [R0]
    RET

STLR is store-release. Combined with LDAR on the other side, gives SC.

TEXT ·Xchg64(SB), NOSPLIT, $0-24
    MOVD ptr+0(FP), R0
    MOVD new+8(FP), R1
again:
    LDAXR R2, [R0]
    STLXR R1, [R0], R3
    CBNZ  R3, again
    MOVD  R2, ret+16(FP)
    RET

Xchg64 is the LDAXR/STLXR retry loop. SC via LDAXR (acquire) + STLXR (release).

The pattern is consistent: every atomic operation gets enough fences to satisfy SC, regardless of architecture. Cost varies; semantics don't.

Designing Lock-Free Data Structures on SC¶

SC simplifies the reasoning for lock-free algorithms substantially.

Treiber stack¶

Classic lock-free stack via CAS:

type node struct {
    val  int
    next atomic.Pointer[node]
}

type Stack struct {
    head atomic.Pointer[node]
}

func (s *Stack) Push(v int) {
    n := &node{val: v}
    for {
        old := s.head.Load()
        n.next.Store(old)
        if s.head.CompareAndSwap(old, n) {
            return
        }
    }
}

func (s *Stack) Pop() (int, bool) {
    for {
        old := s.head.Load()
        if old == nil {
            return 0, false
        }
        nxt := old.next.Load()
        if s.head.CompareAndSwap(old, nxt) {
            return old.val, true
        }
    }
}

The SC atomics give: - All pushes/pops globally ordered. - Each successful CAS happens-before subsequent operations that observe its result. - The ABA problem still exists (a node could be popped, freed, recycled, pushed back). Use hazard pointers or epoch-based reclamation.

Michael-Scott queue¶

Two-pointer lock-free FIFO queue using two atomic pointers (head and tail). Each enqueue and dequeue uses CAS. SC ensures global ordering and consistent state observation.

type node struct {
    val  int
    next atomic.Pointer[node]
}

type Queue struct {
    head atomic.Pointer[node]
    tail atomic.Pointer[node]
}

func New() *Queue {
    sentinel := &node{}
    q := &Queue{}
    q.head.Store(sentinel)
    q.tail.Store(sentinel)
    return q
}

func (q *Queue) Enqueue(v int) {
    n := &node{val: v}
    for {
        tail := q.tail.Load()
        next := tail.next.Load()
        if tail == q.tail.Load() {
            if next == nil {
                if tail.next.CompareAndSwap(nil, n) {
                    q.tail.CompareAndSwap(tail, n)
                    return
                }
            } else {
                q.tail.CompareAndSwap(tail, next)
            }
        }
    }
}

func (q *Queue) Dequeue() (int, bool) {
    for {
        head := q.head.Load()
        tail := q.tail.Load()
        next := head.next.Load()
        if head == q.head.Load() {
            if head == tail {
                if next == nil {
                    return 0, false
                }
                q.tail.CompareAndSwap(tail, next)
            } else {
                v := next.val
                if q.head.CompareAndSwap(head, next) {
                    return v, true
                }
            }
        }
    }
}

The double-check (if tail == q.tail.Load()) guards against another goroutine having moved tail between our load and the rest of our operations. SC ensures the comparison is meaningful: if the value matches, no SC-prior writes that we missed could have happened.

Hand-over-hand locking with atomics¶

A linked list with per-node locks. Lock the current and next nodes simultaneously to traverse. With SC atomic flags as the locks (lock = CAS to 1, unlock = Store 0), each lock acts as a synchronisation point.

Skip-list with atomic pointers¶

A lock-free skip list uses atomic.Pointer for each level's forward pointer. SC ensures that observed pointers point to fully-constructed nodes.

SC at the Runtime Boundary¶

The Go runtime itself relies on SC atomics for the scheduler, garbage collector, and memory allocator. Some examples:

Goroutine state transitions¶

// (simplified from runtime/runtime2.go)
type g struct {
    atomicstatus atomic.Uint32
    ...
}

const (
    _Gidle = 0
    _Grunnable = 1
    _Grunning = 2
    _Gsyscall = 3
    _Gwaiting = 4
    _Gdead = 6
)

func casgstatus(g *g, old, new uint32) {
    for !g.atomicstatus.CompareAndSwap(old, new) {
    }
}

Each transition is an SC CAS. The scheduler relies on SC ordering to know that, e.g., a goroutine moved from running to syscall before any reader observes the new state.

Work-stealing queue¶

The per-P run queue uses lock-free push/pop with SC atomics on head and tail indices. Stealing is via CAS on a remote P's tail.

GC tricolor marking¶

The garbage collector tracks each object as white, grey, or black. Marking is via atomic CAS on a status word. SC ensures that when an object transitions to black, no goroutine still observes it as white.

Memory allocator¶

The mcache, mcentral, mheap layers use atomic operations on free-list pointers and arena indices. SC keeps the allocator's state consistent across goroutines.

The point: SC is not academic. The Go runtime, written by the language designers, depends on it. If SC were violated by the compiler or hardware, the runtime would corrupt and Go programs would crash.

SC and Cgo, Syscalls, mmap¶

Cgo¶

When Go calls C via cgo, the cgo runtime saves and restores Go's scheduler state. Memory operations on the C side are governed by C's memory model, not Go's. Crucially:

C _Atomic operations (with memory_order_seq_cst) align with Go's SC.
C operations on volatile int do not align — volatile is not a synchronisation primitive in C.
Bare reads/writes in C are racy and not seen by Go's race detector.

If you share memory between Go and C, use C11 _Atomic with memory_order_seq_cst. The runtime atomics on both sides will compose.

Syscalls¶

A syscall transitions the goroutine to _Gsyscall state. The runtime tracks this transition with SC atomics. When the syscall returns, the runtime transitions back, performing a CAS that synchronises.

Memory written by the kernel (e.g., into a buffer passed to read()) is visible after the syscall returns, as the syscall acts as a synchronisation point.

mmap¶

mmaped shared memory between processes is outside Go's memory model. The Go runtime cannot help here. You must:

Use OS-level synchronisation (futexes, POSIX semaphores).
Treat shared regions as racy in Go.
Use cgo to call atomic operations on the shared region.

Or, more pragmatically, structure the shared region so that one process writes and others read, with the writer's "ready" signal communicated through an OS-level barrier.

Formal Verification with TLA+¶

For mission-critical lock-free algorithms, write a TLA+ specification and check it. TLA+ models memory consistency precisely:

EXTENDS Integers, Sequences, TLC

CONSTANTS Threads, Locations

VARIABLES memory, threadState

\* (... specification of operations ...)

Spec == Init /\ [][Next]_<<memory, threadState>>

SequentialConsistency ==
    \E order \in Permutations(operations):
        /\ ProgramOrderPreserved(order)
        /\ EachReadObservesLatestWrite(order)

The TLC model checker verifies the algorithm under SC. Useful for proving:

No data race (DRF).
Mutual exclusion.
Linearisability (stronger than SC).
Progress (no deadlock, lock-freedom, wait-freedom).

The investment is significant but pays off for algorithms shipped to millions of users.

Production Engineering: When SC Is the Right Tool¶

Case 1: hot-reload configuration¶

atomic.Pointer[Config] is the right tool. SC ensures readers see a fully-constructed Config. Writes are rare; reads are hot. Cost on x86: one MOV per read.

Case 2: stop signal¶

atomic.Bool is the right tool. Channels are alternative but heavier.

Case 3: counters¶

atomic.Int64 is fine if contention is low. Shard if contended.

Case 4: lock-free queue¶

Channels are preferred. Only roll your own if profiling shows channel overhead is the bottleneck.

Case 5: pub-sub broadcast¶

Channels with close for one-shot; multiple goroutines select on the channel.

Case 6: rate limiting¶

atomic.Int64 for the token bucket; refill via a goroutine that uses time.NewTicker.

Case 7: routing table¶

atomic.Pointer[Routes] for hot-swap. SC publication.

Case 8: scheduler state machine¶

CAS on atomic.Int32 for state transitions. SC ordering ensures observers agree on the current state.

Design Trade-offs in Memory Models¶

Simpler model vs more performance knobs¶

C/C++/Rust expose all six memory orders. Go exposes one (seq-cst). Trade-off: simplicity (Go) vs flexibility (C++).

Hardware-revealing vs hardware-abstracting¶

Java's volatile is roughly SC; modern Java atomic classes have variants. Kotlin inherits Java. Go takes the same "abstract away" stance.

Static vs dynamic detection¶

Go's race detector is dynamic (instrumented binaries). Rust's borrow checker is static (compile-time). Both are valid; Go's is more flexible at runtime, Rust's catches more at compile time.

Whole-program vs per-module¶

C++'s memory orders are per-operation. Go's is per-program (SC everywhere). The C++ approach allows fine-tuning at the cost of complexity.

Future direction¶

Go has not added weaker atomics in 1.20, 1.21, 1.22, 1.23. The team holds the line. Any future addition would likely be atomic.LoadRelaxed/StoreRelaxed for counters where ordering is irrelevant.

What Could Break SC in Go?¶

Pointer trickery via unsafe¶

var x atomic.Int64
// somewhere else
p := unsafe.Pointer(&x)
*(*int64)(p) = 999 // not atomic

The unsafe write bypasses the atomic. Race detector may or may not see it.

Cgo without proper atomics on the C side¶

C code using non-atomic shared memory accesses races with Go's atomics.

Custom assembly¶

If you write .s files with manual instructions, you can bypass SC. The Go runtime does this carefully; user code should not.

Compiler bugs¶

Rare but possible. Russ Cox's blog posts emphasise the need for the compiler to honour the model. A bug here breaks every Go program. Reports go to the Go issue tracker.

Hardware bugs¶

Extremely rare. Spectre/Meltdown affected speculative execution, which is below the architectural level. SC remains a guarantee on architectural state.

Misuse of `runtime.GC()`, `runtime.GOMAXPROCS`, etc.¶

These are not synchronisation primitives. Don't use them as such.

Self-Assessment¶

Summary¶

The senior level is about depth: - SC is Lamport's 1979 definition: a global total order respecting per-processor program order. - Go 1.19's memory model formalises SC-DRF and gives sync/atomic SC semantics. - The compiler emits architecture-specific fences (XCHG on x86, LDAR/STLR on ARM64, fence rw,rw on RISC-V) to realise SC. - Store buffers, cache coherence, and out-of-order execution are the hardware features SC must defeat. - The IRIW litmus is the canonical example separating SC from acq-rel. - The Go runtime relies heavily on SC atomics; without them, the scheduler, GC, and allocator would corrupt. - Cgo, syscalls, and mmap are boundaries where SC reasoning ends and ad-hoc synchronisation begins. - SC trade-offs: simplicity (Go) vs flexibility (C++). Go's bet is that simplicity wins.

The professional page steps even further out: language-design trade-offs, formal verification, the future of memory models, and SC at the intersection of distributed systems and shared memory.

Deep Dive: The Lamport SC Paper Revisited¶

Lamport's 1979 paper "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" is the seminal definition. Let's read it with modern eyes.

Lamport's setup¶

Lamport considered shared-memory multiprocessors, but his definition is general. He observed that simple multiprocessors at the time would not execute correctly the algorithms of Dijkstra and others, because the multiprocessors permitted reorderings that the algorithms assumed away.

His definition (modernised):

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

Two conditions for a SC multiprocessor¶

Lamport derived two sufficient conditions:

Each processor issues memory requests in the order specified by its program.
Memory requests from all processors issued to one memory location must be serviced from a single FIFO queue.

Condition 1 forbids per-processor reorderings (no store buffer optimisations). Condition 2 forbids parallel memory access to the same location.

Modern multiprocessors violate both conditions for performance — but synthesise SC via fences and cache coherence.

Lamport's example: Dekker's algorithm¶

Lamport showed that Dekker's mutual exclusion algorithm fails under non-SC memory. The two-thread argument:

Thread 1:                Thread 2:
flag[1] = true           flag[2] = true
if flag[2] == false:     if flag[1] == false:
    critical section         critical section

If both stores get buffered and both threads read the other's flag as false, both enter the critical section. Mutual exclusion violated.

The fix in 1979 was condition 1: serialise memory operations. The fix in 2026: SC atomic stores with a fence.

The legacy¶

Lamport's SC remains the canonical "intuitive" memory model. Every later memory model is defined either as "weaker than SC" or "stronger than SC." Linearisability (Herlihy & Wing 1990) adds real-time ordering. Causal consistency (Lamport 1978) weakens SC. Eventual consistency (used in distributed systems) is even weaker.

Go's choice to align sync/atomic with SC is, in essence, a return to Lamport's vision — a multiprocessor (here, Go runtime + hardware) that correctly executes multiprocessor programs, no fences explicit, no surprises.

Deep Dive: Linearisability vs SC¶

These two are often confused. The difference matters.

Linearisability¶

An object (or memory location) is linearisable if every operation appears to take effect at some instant between its invocation and response, and that instant is consistent with real-time ordering.

SC¶

A system is sequentially consistent if all operations across all processors appear in some global order consistent with per-processor program order. Real time is not required.

A separating example¶

Real time:
t=0: A starts x.store(1)
t=1: A finishes
t=2: B starts r = x.load()
t=3: B reads r = 0

Under SC, this is permitted if the global order is: r = x.load (reading 0), then x.store(1). Even though A's store completed before B's load started in real time, SC does not require real-time consistency.

Under linearisability, this is forbidden: A's store completed in real time before B's load began, so B must observe the post-store state.

Why this matters¶

Linearisability is the standard for distributed databases (e.g., Spanner's external consistency). SC is the standard for shared memory.

Go provides SC for atomics — not linearisability. This is a subtle distinction but matters for some algorithms.

Hardware reality¶

x86's TSO is not even SC; it permits store-load reorder. SC is enforced by software fences. Linearisability requires additional real-time ordering, which hardware does not provide. For most shared-memory algorithms, this is irrelevant.

Deep Dive: The Russ Cox 2021 Proposal¶

Russ Cox's blog series "Hardware Memory Models," "Programming Language Memory Models," and "Updating the Go Memory Model" framed the 2022 revision.

The historical problem¶

Pre-2022, Go's memory model was prose-only. It said:

Channel operations synchronise.
Mutex operations synchronise.
Atomics... were not formally specified.

The de-facto behaviour was SC, but the spec did not commit. Russ Cox surveyed:

C++11's memory_order_seq_cst (default for std::atomic).
Java's volatile (SC-like since JSR-133).
C#'s Interlocked.* (SC-like).
Rust's Ordering::SeqCst.
Swift's atomics package (SC, post-2020).

Modern language memory models converged on SC for atomics by default, with weaker orderings as opt-ins. Go's existing usage was already SC-like; formalising it was natural.

The proposal¶

Russ Cox proposed:

Make Go's sync/atomic operations formally SC.
Add typed wrappers: atomic.Bool, atomic.Int32, atomic.Int64, etc.
Add atomic.Pointer[T] for type-safe pointer atomics.
Do not add weaker orderings.

The proposal was accepted essentially unchanged, becoming Go 1.19 in August 2022.

Why not add weaker orderings?¶

Russ Cox's reasoning:

Adding LoadAcquire/StoreRelease etc. doubles or triples the API surface.
Weaker orderings are bug-prone. C++ programmers regularly misuse them.
The performance gain on real workloads is modest (5–15% on the most contended loops).
Go's design philosophy values simplicity.

This is a deliberate trade-off, and a defensible one. The cost: a few specialised algorithms (lock-free RCU, hazard pointers in their fastest form) cannot achieve their peak performance in pure Go.

Future possibilities¶

The door is left open: if a strong case emerges, weaker orderings could be added. As of Go 1.23, no such addition has been made. The community has not pushed for it; the runtime team has not needed it.

Deep Dive: Hardware Memory Models in Detail¶

x86 TSO¶

Intel and AMD codify TSO. The MESI cache-coherence protocol underlies it. Key properties:

Store buffer per core, FIFO.
Load buffer for in-flight loads, allowing speculative execution.
Store-to-load forwarding within a core.
Strict store ordering between cores (TSO).
Single store-load reordering permitted.

Intel's "Software Developer's Manual," volume 3, section 8.2, is the authoritative document. Key takeaways:

All stores from one core become visible to other cores in program order.
Loads from one core appear to other cores in program order for stores to the same location.
A core's own loads see its own stores immediately.

The LOCK prefix on instructions like ADD, XADD, CMPXCHG, XCHG provides full memory barrier semantics. Useful x86 instructions for SC:

LOCK ADD: SC RMW add.
LOCK XADD: SC fetch-and-add.
LOCK CMPXCHG: SC CAS.
XCHG mem, reg: SC swap (locked implicit).
MFENCE: SC fence (used rarely; XCHG is faster).
LFENCE/SFENCE: weaker fences, rarely used for SC.

ARM64 in detail¶

ARM's "Architecture Reference Manual" defines the ARMv8 memory model. It is acquire-release with cumulative ordering (RCsc) when using LDAR/STLR pairs.

ARM permits all four reorderings absent fences: - LL, LS, SS, SL all reorderable.

ARM instructions:

LDR/STR: relaxed; no ordering.
LDAR/STLR: acquire / release (since ARMv8.0).
LDAPR: load-acquire processor-consistent (since ARMv8.3). Weaker than LDAR.
LDADD, LDADDAL, etc.: atomic RMW (since ARMv8.1).
CAS, CASA, CASAL: compare-and-swap (since ARMv8.1).
DMB ISH: full memory barrier in inner shareable domain.
DMB ISHST: store-only.
DMB ISHLD: load-only.
DSB ISH: data synchronisation barrier (waits for all prior memory ops to complete).
ISB: instruction synchronisation barrier (flushes pipeline; rarely needed for memory).

Go's ARM64 backend uses LDAR/STLR for SC. CAS is emitted as CASAL on ARMv8.1+; on older targets, LDAXR/STLXR loops.

RISC-V in detail¶

RVWMO (RISC-V Weak Memory Order) is officially specified. Key instructions:

LD/SD: relaxed loads/stores.
AMOSWAP.w/d.aqrl: atomic swap with acquire+release.
AMOADD.w/d.aqrl: atomic add with acquire+release.
AMOAND, AMOOR, AMOXOR: atomic logical ops.
LR.w/d.aqrl: load-reserved (start of LL/SC).
SC.w/d.aqrl: store-conditional (end of LL/SC).
FENCE pred,succ: explicit fence with predecessor and successor specifications.
FENCE.I: instruction fence (for self-modifying code).

The aqrl flags on atomic ops give SC. Without them, the operation is relaxed.

MIPS, PowerPC, SPARC¶

MIPS uses sync instructions for fences. PowerPC uses lwsync/sync. SPARC (older systems) uses membar. All are supported by Go to varying degrees; consult runtime/internal/atomic for the specifics.

wasm¶

WebAssembly has its own atomic operations specification. JavaScript engines map them to host-platform atomics. Go's wasm backend uses these.

Deep Dive: Implementation Strategy in the Go Compiler¶

The Go compiler's SSA pass treats atomic operations as memory-ordering barriers.

Memory effects in SSA¶

In SSA form, every memory access has a memory effect attached. The optimiser cannot move two operations across a memory effect that orders them.

Atomic operations are marked as having "any memory effect" — equivalent to a function call to an unknown function. This prevents the optimiser from reordering loads/stores past atomics.

Code generation¶

Each architecture has its own backend in cmd/compile/internal/{amd64,arm64,...}. For atomic operations, the backend dispatches to the architecture-specific lowering.

A typical lowering for atomic.Int64.Store:

// amd64
case OpAtomicStore64:
    // Emit XCHGQ src, mem

The emitted machine code includes the fence-providing instruction. The compiler does not need additional fence emission; the instruction itself fences.

Inlining considerations¶

atomic.Int64.Load() etc. are tiny methods. The compiler inlines them, then lowers to the appropriate instruction. The user sees no function-call overhead.

Escape analysis¶

atomic.Int64 is a struct with one int64 field. Escape analysis often determines that the atomic does not escape — it stays on the stack. This is fine; SC semantics apply regardless of allocation site.

Race detector instrumentation¶

The race detector adds calls to __tsan_* functions around each memory access. Atomics are instrumented to record their happens-before edges in the vector-clock tracker.

Worked Example: A Hand-Verified Lock-Free Counter¶

type Counter struct {
    n atomic.Int64
}

func (c *Counter) Inc() int64 { return c.n.Add(1) }
func (c *Counter) Val() int64 { return c.n.Load() }

Correctness argument¶

Claim: After N total Inc calls (across all goroutines), Val() returns N (assuming Val is called after all Incs complete, with happens-before).

Proof: 1. Each Inc is an SC RMW (atomic Add). It reads the current value, adds 1, writes back. Indivisibly. 2. SC implies all RMWs on c.n are in a global total order. 3. Each RMW increments by exactly 1. 4. Therefore, after N RMWs, the value is N. 5. Val() is a Load, which sees the latest value in SC order. 6. If Val happens-after all Incs (via WaitGroup or similar), it observes the post-Nth-RMW state.

QED.

Edge case: Val during Inc¶

If Val is called during a phase where Incs are still happening, it observes some value between 0 and N. SC ensures the value reflects a consistent prefix of the RMW order — no "torn" or impossible value.

Edge case: counter overflow¶

Int64 overflows at 2^63. After overflow, the value wraps to negative. SC does not save you from arithmetic bugs.

Worked Example: A Lock-Free Single-Producer Single-Consumer Queue¶

type SPSC struct {
    buf  []int64
    head atomic.Int64 // producer writes; consumer reads
    tail atomic.Int64 // consumer writes; producer reads
}

func New(cap int) *SPSC {
    return &SPSC{buf: make([]int64, cap)}
}

func (q *SPSC) Push(v int64) bool {
    h := q.head.Load()
    t := q.tail.Load()
    if h-t >= int64(len(q.buf)) {
        return false
    }
    q.buf[h%int64(len(q.buf))] = v
    q.head.Store(h + 1)
    return true
}

func (q *SPSC) Pop() (int64, bool) {
    h := q.head.Load()
    t := q.tail.Load()
    if h == t {
        return 0, false
    }
    v := q.buf[t%int64(len(q.buf))]
    q.tail.Store(t + 1)
    return v, true
}

Correctness argument¶

Assume one producer goroutine and one consumer goroutine.

Claim: every Push(v) that succeeds will eventually be followed by a Pop() returning (v, true), and the order of Pops matches the order of Pushes.

Proof: 1. Producer's Push: load h, load t, write buf[h%len], store h+1. 2. Consumer's Pop: load h, load t, read buf[t%len], store t+1. 3. SC ensures producer's buf[h%len] = v happens-before its head.Store(h+1) (program order). 4. When consumer's head.Load() returns h+1 (or higher), SC ensures the producer's prior writes are visible. 5. Consumer's buf[t%len] read sees the producer's write. 6. The slot is logically owned by the consumer until consumer's tail.Store(t+1). 7. Producer's check h - t >= len ensures it doesn't overwrite an unconsumed slot.

QED.

Edge cases¶

Overflow: head and tail are int64; overflow happens after ~9.2 * 10^18 operations. Effectively never.
Concurrent producers: this design supports only one. Multi-producer needs CAS on head.
False sharing: head and tail should be on different cache lines. Pad the struct.

Padded version¶

type SPSC struct {
    buf  []int64
    head atomic.Int64
    _    [56]byte
    tail atomic.Int64
    _    [56]byte
}

Producer writes head; consumer writes tail. Different cache lines, no false sharing.

Worked Example: Hazard Pointers Sketch¶

Hazard pointers are an alternative to garbage collection for lock-free data structures. Each goroutine has a set of "hazard pointers" — addresses it is about to dereference. Before freeing memory, you check no goroutine has a hazard pointer to it.

type HazardSet struct {
    pointers [maxGoroutines]atomic.Pointer[any]
}

func (h *HazardSet) Set(g int, p any) { h.pointers[g].Store(&p) }
func (h *HazardSet) Clear(g int)      { h.pointers[g].Store(nil) }

func (h *HazardSet) IsHazarded(p any) bool {
    for i := range h.pointers {
        if cur := h.pointers[i].Load(); cur != nil && *cur == p {
            return true
        }
    }
    return false
}

The SC atomics ensure that when a hazard pointer is set, subsequent reads of the data are safe. Cleanup defers freeing until no hazard pointer references the object.

This pattern is rare in Go because the GC handles most of it. But it shows up in custom lock-free designs.

Worked Example: Read-Copy-Update (RCU)¶

RCU is a synchronisation paradigm for read-mostly data. Readers observe an immutable snapshot; writers create a new version and atomically swap.

type RCU[T any] struct {
    cur atomic.Pointer[T]
}

func (r *RCU[T]) Read() *T { return r.cur.Load() }

func (r *RCU[T]) Update(f func(*T) *T) {
    for {
        old := r.cur.Load()
        new := f(old)
        if r.cur.CompareAndSwap(old, new) {
            return
        }
    }
}

Readers are completely lock-free, no fences, no waits. Writers use CAS. SC ensures every reader sees a fully-constructed snapshot.

In the Linux kernel, RCU uses a grace period before freeing old versions — Go's GC handles this automatically.

Worked Example: A Concurrent Map with COW¶

type Map[K comparable, V any] struct {
    m atomic.Pointer[map[K]V]
}

func New[K comparable, V any]() *Map[K, V] {
    var m Map[K, V]
    empty := map[K]V{}
    m.m.Store(&empty)
    return &m
}

func (m *Map[K, V]) Get(k K) (V, bool) {
    v, ok := (*m.m.Load())[k]
    return v, ok
}

func (m *Map[K, V]) Set(k K, v V) {
    for {
        old := m.m.Load()
        cp := make(map[K]V, len(*old)+1)
        for kk, vv := range *old {
            cp[kk] = vv
        }
        cp[k] = v
        if m.m.CompareAndSwap(old, &cp) {
            return
        }
    }
}

func (m *Map[K, V]) Delete(k K) {
    for {
        old := m.m.Load()
        if _, ok := (*old)[k]; !ok {
            return
        }
        cp := make(map[K]V, len(*old)-1)
        for kk, vv := range *old {
            if kk != k {
                cp[kk] = vv
            }
        }
        if m.m.CompareAndSwap(old, &cp) {
            return
        }
    }
}

Reads are lock-free. Writes are O(n) per operation due to the copy. Use only when reads vastly outnumber writes.

SC ensures readers see either the old or the new map, never a partial state.

Worked Example: Lock-Free Bitmap¶

type Bitmap struct {
    words []atomic.Uint64
}

func New(size int) *Bitmap {
    return &Bitmap{words: make([]atomic.Uint64, (size+63)/64)}
}

func (b *Bitmap) Set(i int) {
    b.words[i/64].Or(1 << (i % 64))
}

func (b *Bitmap) Clear(i int) {
    b.words[i/64].And(^(1 << (i % 64)))
}

func (b *Bitmap) Get(i int) bool {
    return b.words[i/64].Load()&(1<<(i%64)) != 0
}

Or and And are atomic RMW. SC ensures consistent observations of multiple bits in the same word. Across words, SC orders the operations but does not provide multi-word atomicity — for that you need a mutex.

Deep Dive: Why ARM64 Differs in Cost¶

The ARM64 architecture is a Reduced Instruction Set Computer (RISC). It emphasises pipelining and simple instructions. The memory model is weaker than x86's because:

Reordering across loads improves pipeline utilisation.
Reordering across stores allows the store buffer to drain efficiently.
Reordering loads past stores allows speculative loads.

The cost: SC requires fences to forbid all four reorderings. The benefit (for non-SC code): faster average throughput.

Cost per fence type¶

DMB ISH (full barrier): ~20 cycles uncontended, ~100 cycles contended.
DMB ISHST (store-only): ~10 cycles.
LDAR (load-acquire): ~5 cycles uncontended.
STLR (store-release): ~10 cycles uncontended.

Compared to plain LDR/STR (~1 cycle each), SC atomic ops are 10-30× more expensive on ARM. This adds up in hot loops.

Mitigations specific to ARM¶

Use LDAPR (release-only acquire) if your algorithm can tolerate per-processor consistency. Go does not currently expose this.
Per-CPU sharding to avoid cache-line bouncing.
Batching: aggregate updates in a goroutine-local buffer, flush periodically.

Deep Dive: Cache-Line Layout¶

A 64-byte cache line is the unit of cache coherence. Atomics within the same cache line interfere — a write to one invalidates the line, even if the other atomic is logically independent.

Use perf stat -e cache-misses on Linux. High miss rates with low logical contention suggest false sharing.

pprof can show hot spots in atomic functions. If you see runtime/internal/atomic.Xadd64 dominating, sharding or padding may help.

Padding patterns¶

type Padded[T any] struct {
    v T
    _ [64 - unsafe.Sizeof(*new(T))]byte
}

Padding to 64 bytes per atomic ensures each is on its own cache line. (Note: cache-line size is 64 bytes on x86/ARM, 128 on PowerPC; check via sysconf(_SC_LEVEL1_DCACHE_LINESIZE).)

`sync.Pool` and per-P padding¶

The Go runtime's sync.Pool shards per-P (logical processor). Each P has its own pool, padded. Reduces cache-line bouncing for high-throughput pool usage.

Deep Dive: The Race Detector's Algorithm¶

The race detector uses vector clocks. Each goroutine has a logical clock. Each memory location remembers the clock of the last access (read or write).

Vector clocks¶

A vector clock is a per-goroutine counter incremented on synchronisation events. When goroutine A releases a mutex, its clock is associated with the mutex; when goroutine B acquires the same mutex, it merges its clock with the mutex's clock.

Race detection¶

For each memory access, the detector checks:

For a write: does any prior access (by another goroutine) lack a happens-before edge to this write?
For a read: does any prior write (by another goroutine) lack a happens-before edge?

If yes, a race is reported.

Synchronisation events tracked¶

Channel send/receive.
Mutex Lock/Unlock.
WaitGroup Done/Wait.
Once.Do.
Atomic operations (each is a synchronisation point).
Goroutine create.
Goroutine exit (Done in WG, or implicit).

Cost¶

Each instrumented access costs ~5-10× normal. Memory overhead: ~5× (one vector clock entry per memory location accessed).

Limitations¶

Only detects races in observed executions.
Cannot detect races in code that did not run.
May miss races dependent on specific schedules.

Mitigation: -cpu=1,2,4,8, -count=N, stress tests.

Deep Dive: SC in the Go Scheduler¶

The scheduler (runtime/proc.go) is the most concurrent piece of Go code. Some patterns to look for:

Goroutine status CAS¶

const (
    _Gidle = 0
    _Grunnable = 1
    _Grunning = 2
    // ...
)

func casgstatus(g *g, oldval, newval uint32) {
    for !g.atomicstatus.CompareAndSwap(oldval, newval) {
        // spin or yield
    }
}

State transitions are SC CAS. Observers (the scheduler reading state) see the new state along with all writes the goroutine made before the CAS.

Run-queue head/tail¶

Each P has a local run-queue with SC atomic head/tail. Push to head; pop from head (LIFO for cache locality). Stealing happens from tail (FIFO, by other Ps).

Steal protocol¶

When a P runs out of work, it tries to steal from another P. The steal uses CAS on the victim's tail. SC ensures no two stealers grab the same goroutine.

sysmon¶

A dedicated thread monitors goroutine progress, GC pacing, and forced preemption. Uses SC atomics for all coordination.

Locks within the runtime¶

The runtime has its own internal locks (runtime.lock / runtime.unlock). These are heavier than sync.Mutex, but they synchronise scheduler-level data structures. They depend on the same SC atomic primitives.

Deep Dive: SC and the GC¶

Go's GC is concurrent. Marking happens alongside user goroutines. The interaction with SC is subtle.

Write barriers¶

During a GC cycle, every pointer write goes through a write barrier (in the runtime). The barrier records the pointer for later marking. The barrier is implemented with SC atomic operations.

Tricolor invariants¶

Each object is white (unmarked), grey (marked, children unscanned), or black (marked, children scanned). The GC maintains:

No black object points to a white object.

Achieved via the write barrier: if a black object writes a pointer to a white object, the white object is greyed.

SC atomics ensure the marker sees a consistent view.

Stop-the-world phases¶

Brief pauses (sub-millisecond) where all goroutines stop. The runtime ensures memory consistency by using SC atomics for the "GC running" flag.

Deep Dive: Inter-Process SC¶

Within one Go process, SC is provided. Across processes, the situation is different.

Shared memory (mmap)¶

mmaped regions are not synchronised by Go. Two processes accessing the same region race. Solutions:

POSIX semaphores or mutexes (cgo).
Atomic ops via cgo with <stdatomic.h>.
Filesystem-based locking.

Files¶

File I/O is mediated by the OS. Writes to a file by one process are eventually visible to readers, but the semantics depend on the filesystem.

Sockets and pipes¶

POSIX guarantees: a write to a socket/pipe is delivered in order to the reader. Effectively, the socket provides happens-before across processes.

gRPC, HTTP¶

Application-level protocols establish their own ordering. Idempotent operations help.

Deep Dive: Comparing Memory Models Side by Side¶

Property	Go	C++ (default)	Rust (default)	Java volatile
Atomic op default	SC	seq-cst	seq-cst	SC-like
Weaker orderings available	no	yes	yes	(use VarHandle)
Race UB	yes	yes	no (Send/Sync)	bounded
Race detector	yes (-race)	yes (ThreadSanitizer)	no (compiler-enforced)	no (limited tools)
Compile-time prevention	no	no	yes	no
Lock-free std types	atomic.Int64 etc.	std::atomic	std::sync::atomic	AtomicInteger

Each language makes different trade-offs. Go's bet: a simple model + dynamic detection. Rust's bet: a strict type system. C++/Java: flexibility at the cost of complexity.

Future Directions for Memory Models¶

Likely additions to Go¶

sync/atomic.LoadRelaxed/StoreRelaxed for performance-critical counters. Possible but not promised.
sync/atomic.Float64, Float32 for atomic floats. Common request.
Larger atomics (128-bit) for specific use cases. Probably won't happen.

Not on the roadmap¶

memory_order_acquire/release. The team has rejected this.
memory_order_consume. Deprecated even in C++.
Per-architecture memory model exposure. Conflicts with portability.

Research directions¶

Hardware support for SC at the architectural level (e.g., RISC-V's RVTSO proposal).
Speculative SC via committed-state tracking.
Hybrid SC + weak orderings selectable per-region.

Practical Advice for Senior Engineers¶

When you see "race condition" in a bug report¶

Run with -race. Get the report.
Identify the involved memory locations.
Check if synchronisation surrounds them.
If using atomics, check for mixed atomic/plain access.
If using mutexes, check for unlocked accesses.
Fix and re-test.

When you see atomic operations in a profile¶

Quantify: how many ops/second? How many cores?
Identify contention (multiple cores writing).
Mitigate: shard, pad, batch.
Re-profile.

When you design a new lock-free data structure¶

Write the invariants in plain English.
Choose primitives (mostly atomic.Pointer, atomic.Int64, CAS).
Argue correctness under SC.
Implement.
Test under -race.
Benchmark vs the simple mutex version.
Decide if the complexity is worth it.

When you review concurrent code¶

Identify shared state.
Identify synchronisation primitives.
Trace happens-before edges.
Check for mixed atomic/plain.
Check for false sharing.
Check for spinning or busy loops.
Recommend mutex if in doubt.

Self-Assessment Extended¶

Summary Extended¶

Senior mastery of SC in Go means:

Reading Lamport 1979 and recognising the same definition in Go's memory model.
Reading Go's runtime atomic assembly and understanding fence emission.
Reasoning about store buffers, cache coherence, and out-of-order execution.
Designing lock-free algorithms with SC as a correctness foundation.
Diagnosing and mitigating SC overhead (sharding, padding, batching).
Comparing memory-model design trade-offs across languages.
Knowing the SC boundary: where it ends (cgo, syscalls, mmap) and how to bridge it.

The professional page goes beyond senior: language-design philosophy, formal verification at scale, the future of memory models in the era of weakly-ordered hardware, and SC at the intersection with distributed systems.

Appendix: A Long List of Real Patterns¶

A non-exhaustive catalogue of SC-dependent patterns seen in real Go codebases:

Hot-reload config: atomic.Pointer[Config].
Atomic counter: atomic.Int64.Add.
Stop flag: atomic.Bool.
Connection state machine: atomic.Int32 CAS for state transitions.
Generation counter: atomic.Int64 for cache versioning.
Sharded counter: per-CPU atomic.Int64 with padding.
Lock-free SPSC queue: SC head/tail indices.
Lock-free MPMC queue: CAS on slot status words.
Treiber stack: CAS on top pointer.
Michael-Scott queue: CAS on head and tail pointers.
Hazard pointers: SC atomic pointer array.
RCU: atomic.Pointer[T] for swap; GC for reclamation.
Copy-on-write map: atomic.Pointer[map].
Read-mostly cache: COW with atomic.Pointer.
Rate limiter: atomic.Int64 token bucket.
Circuit breaker: atomic.Int32 state + atomic.Int64 counters.
Backoff state: atomic.Int64 retry counter.
Tickers: atomic.Int64 period for runtime-adjustable timers.
Lifetime tracking: atomic.Int64 refcount.
Sequence numbers: atomic.Uint64 for log offsets, transaction IDs.

Each of these patterns relies on SC for correctness. None would be writable in Go's pre-1.19 memory model without significant additional reasoning.

Appendix: A Reading List for Senior Engineers¶

Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" (1979).
Herlihy & Wing, "Linearizability: A Correctness Condition for Concurrent Objects" (1990).
Adve & Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware" (CACM 2010).
Russ Cox, "Hardware Memory Models" (2021).
Russ Cox, "Programming Language Memory Models" (2021).
Russ Cox, "Updating the Go Memory Model" (2021).
C++ standard, [intro.races].
Java JSR-133: "Java Memory Model and Thread Specification."
ARM Architecture Reference Manual, Memory Model chapter.
Intel Software Developer's Manual, Volume 3, Section 8.2.
RISC-V Unprivileged ISA, RVWMO appendix.
Sorin, Hill, Wood, "A Primer on Memory Consistency and Cache Coherence" (Morgan & Claypool, 2nd ed.).

If you read all of these, you'll be in the top 0.1% of Go engineers on memory-model knowledge.

Final Words¶

Senior-level SC mastery is not memorising more code patterns — it is being able to:

Read a memory-model spec and answer compliance questions.
Read compiler-emitted assembly and predict fence behaviour.
Design new lock-free data structures with rigorous correctness arguments.
Identify and mitigate SC costs in production workloads.
Communicate trade-offs across architectures, languages, and design philosophies.

If you can do all this, the professional page is calibration: you will recognise much of it, with the new material being the design philosophy and formal-verification details.

Appendix: Reading the SSA Output¶

To peek inside the Go compiler, dump SSA for a simple atomic-using function:

GOSSAFUNC=increment go build .

This produces ssa.html showing each compilation pass. For:

func increment(c *atomic.Int64) int64 {
    return c.Add(1)
}

You'll see:

OpAtomicAdd64: high-level SSA node for atomic add.
Lowered to AMD64XADDQlock (or ARM64LoweredAtomicAdd64).
Final assembly: LOCK XADDQ (x86) or LDADDAL (ARM v8.1+).

The compiler does not move other memory operations across OpAtomicAdd64. The SSA scheduler treats it as a memory barrier.

Appendix: Trace-Level Inspection¶

Use runtime/trace to capture goroutine schedules:

import "runtime/trace"

f, _ := os.Create("trace.out")
trace.Start(f)
// ... workload ...
trace.Stop()
f.Close()

Analyse with go tool trace trace.out. You can see:

Per-goroutine execution timeline.
Synchronisation events.
GC phases.
Network and syscall blocks.

Useful when investigating "why does my atomic-heavy code stall here?"

Appendix: Deeper Litmus Tests¶

Litmus: SB (store buffer)¶

A: x = 1; r1 = y
B: y = 1; r2 = x
SC forbids: r1 = 0 && r2 = 0

The minimal x86 violation.

Litmus: MP (message passing)¶

A: data = 42; flag = 1
B: while (flag == 0); r = data
SC requires: r = 42

The publication pattern. SC ensures data = 42 is visible once flag = 1 is seen.

Litmus: WRC (write-to-read causality)¶

A: x = 1
B: r1 = x; y = 1
C: r2 = y; r3 = x
SC requires: r2 = 1 implies r3 = 1 (assuming r1 = 1 already)

Tests transitivity of causal chains. SC preserves it.

Litmus: 2+2W (two writers two writes)¶

A: x = 1; y = 2
B: y = 1; x = 2
After both finish:
  legal states: (x=1,y=1), (x=1,y=2), (x=2,y=2), (x=2,y=1)

SC permits any of these (the order of stores from each thread is internal). The race detector would flag this as a race because the writes are unordered, but SC allows any of these states. (A real program would need synchronisation.)

Litmus: ARM-specific weakness¶

A: x = 1; y = 1
B: r1 = y; dep = r1; r2 = x[dep & 0]

On ARM without fences, B could observe r1 = 1 (so y = 1 is visible) but r2 = 0 (so x = 1 is not visible). This is the "out of order load" weakness. SC forbids it via the LDAR fence.

Tools¶

herd7 (the herdtools suite) lets you specify litmus tests and check them against memory models.
RMEM is a research tool for hardware memory models.
Coq formalisations of memory models exist for x86, ARM, and POWER.

For Go specifically, no public formal tool yet. The race detector is the practical surrogate.

Appendix: Performance Investigation Workflow¶

Step 1: capture a CPU profile¶

import _ "net/http/pprof"
// ...
go func() { http.ListenAndServe(":6060", nil) }()

Run the workload, then:

go tool pprof -web http://localhost:6060/debug/pprof/profile?seconds=30

Step 2: look for atomic functions¶

In pprof, look for: - runtime/internal/atomic.Xadd64 - runtime/internal/atomic.Cas64 - runtime/internal/atomic.Xchg64 - sync.(*Mutex).Lock - sync.(*RWMutex).RLock

If these dominate, you have contention.

Step 3: capture a mutex profile¶

go tool pprof -web http://localhost:6060/debug/pprof/mutex

Or:

runtime.SetMutexProfileFraction(1)

Shows mutex contention hot spots.

Step 4: capture a goroutine trace¶

As above with runtime/trace. Shows where goroutines block.

perf stat -e cache-references,cache-misses,LLC-load-misses on Linux. High miss rates indicate cache-line bouncing.

Step 6: redesign¶

Apply the mitigations: shard, pad, batch.

Step 7: re-profile¶

Verify the bottleneck moved or shrank.

Appendix: Common Senior-Level Bugs¶

Bug 1: ABA in lock-free stack¶

type node struct {
    val  int
    next *node
}

type Stack struct{ top atomic.Pointer[node] }

func (s *Stack) Pop() (int, bool) {
    for {
        old := s.top.Load()
        if old == nil {
            return 0, false
        }
        nxt := old.next
        if s.top.CompareAndSwap(old, nxt) {
            return old.val, true
        }
    }
}

If thread A loads old, then B pops two nodes and pushes the same one back, A's CAS succeeds with stale nxt. ABA.

Fix: tagged pointers (extra version field) or hazard pointers or epoch-based reclamation.

Bug 2: write barrier missed¶

If you publish a pointer atomically but mutate the pointed-to data afterwards, readers race. Subtle when the mutation is in a different function.

Bug 3: forgetting cache-line padding¶

A struct with two atomics on the same cache line scales poorly. Add padding.

Bug 4: spin loops without backoff¶

A tight spin on an atomic burns CPU and starves the writer. Add runtime.Gosched() or exponential backoff.

Bug 5: incorrect happens-before assumption¶

go func() { data = 42 }()
runtime.Gosched()
fmt.Println(data) // race

Gosched is not a memory barrier. The print races with the assignment.

Bug 6: relying on `print` ordering¶

fmt.Println from goroutines acquires an internal mutex, so writes within Println are ordered. But the call order between goroutines is not guaranteed. Don't rely on print order for observable ordering.

Bug 7: shared map without mutex¶

A built-in map[K]V is not concurrent-safe. Even concurrent reads can race in the presence of writes. Use sync.Map, mutex, or COW.

Appendix: Architecture-Specific Tips¶

x86¶

SC is cheap. Use atomics liberally.
xchg is faster than mfence for full barriers.
The lock prefix is mandatory for atomic RMW; never omit.

ARM64¶

SC is expensive. Profile.
Use LDAR/STLR pair for SC load/store.
LDAPR is cheaper but RCpc, not SC.
Atomic RMW via LDADDAL/CASAL on v8.1+.

RISC-V¶

Explicit fences are required.
aqrl flag on AMO ops gives SC.
LR/SC pair for compare-and-swap loops.

PowerPC¶

lwsync for acq-rel; sync for full SC.
Less commonly seen in Go production.

s390x (IBM Z)¶

TSO-like with bcr 15,0 fences.
Mostly relevant for mainframe Go workloads.

wasm¶

JavaScript engine emits browser-platform fences.
Limited atomic operation set.

Appendix: A Comparative Sketch of Atomic APIs¶

Go¶

var x atomic.Int64
x.Load()
x.Store(1)
x.Add(1)
x.Swap(1)
x.CompareAndSwap(0, 1)

All operations are SC.

C++¶

std::atomic<int64_t> x;
x.load(std::memory_order_seq_cst);
x.store(1, std::memory_order_seq_cst);
x.fetch_add(1, std::memory_order_seq_cst);
x.exchange(1, std::memory_order_seq_cst);
x.compare_exchange_strong(expected, 1, std::memory_order_seq_cst);

The default is seq-cst; can be relaxed.

Rust¶

use std::sync::atomic::{AtomicI64, Ordering};
let x = AtomicI64::new(0);
x.load(Ordering::SeqCst);
x.store(1, Ordering::SeqCst);
x.fetch_add(1, Ordering::SeqCst);
x.swap(1, Ordering::SeqCst);
x.compare_exchange(0, 1, Ordering::SeqCst, Ordering::SeqCst);

Explicit Ordering. Rust borrow checker prevents some misuses.

Java¶

AtomicLong x = new AtomicLong(0);
x.get();
x.set(1);
x.getAndAdd(1);
x.getAndSet(1);
x.compareAndSet(0, 1);

Implicitly SC (volatile-like).

C¶

long x = 0;
Volatile.Read(ref x);
Volatile.Write(ref x, 1);
Interlocked.Add(ref x, 1);
Interlocked.Exchange(ref x, 1);
Interlocked.CompareExchange(ref x, 1, 0);

Mixed model: Volatile is acq-rel-ish; Interlocked is SC.

Appendix: A Compact SC Cheat Sheet for Senior Engineers¶

SC GUARANTEE (Go 1.19+):
  Race-free programs behave as if executed under sequential consistency.
  Every sync/atomic operation participates in a global total order.

WHEN TO USE:
  Single-word state with multi-goroutine access.
  Publication of immutable data.
  Counters, flags, generation numbers.
  Lock-free data structures.

WHEN NOT TO USE:
  Multi-step invariants (use sync.Mutex).
  Heavy contention with no sharding (consider mutex or batching).
  When channels work cleaner (don't over-engineer).

PRIMITIVES:
  atomic.Bool             one bit
  atomic.Int32, Int64     signed
  atomic.Uint32, Uint64   unsigned
  atomic.Uintptr          pointer-sized integer
  atomic.Pointer[T]       typed pointer (Go 1.19+)
  atomic.Value            untyped (legacy)

OPERATIONS:
  Load, Store, Swap, CompareAndSwap, Add (numeric)

COST (rough):
  x86 Load: 1 cycle
  x86 Store: 10 cycles
  x86 Add/CAS: 10-30 cycles uncontended; up to 1000+ contended
  ARM64 Load: 5 cycles
  ARM64 Store: 10 cycles
  ARM64 Add/CAS: 20-40 cycles

ANTI-PATTERNS:
  Mixed atomic/plain access to same var.
  Mutating *T after atomic.Pointer[T].Store.
  Spinning without backoff.
  False sharing (same cache line).

VERIFICATION:
  go test -race
  go test -race -count=N -cpu=1,2,4,8
  Read pprof for runtime/internal/atomic.* hot spots.

This cheat sheet captures the operational essentials.

Appendix: Final Worked Example — A Production-Grade Generation Cache¶

A real-world cache with hot reads, occasional writes, and a generation counter to detect staleness.

package gencache

import (
    "sync"
    "sync/atomic"
)

type Entry[V any] struct {
    Gen   int64
    Value V
}

type Cache[K comparable, V any] struct {
    gen atomic.Int64
    m   atomic.Pointer[map[K]Entry[V]]
    mu  sync.Mutex // serialises writers
}

func New[K comparable, V any]() *Cache[K, V] {
    var c Cache[K, V]
    empty := map[K]Entry[V]{}
    c.m.Store(&empty)
    return &c
}

func (c *Cache[K, V]) Get(k K) (V, int64, bool) {
    m := *c.m.Load()
    if e, ok := m[k]; ok {
        return e.Value, e.Gen, true
    }
    var zero V
    return zero, 0, false
}

func (c *Cache[K, V]) Set(k K, v V) int64 {
    c.mu.Lock()
    defer c.mu.Unlock()

    newGen := c.gen.Add(1)

    old := *c.m.Load()
    cp := make(map[K]Entry[V], len(old)+1)
    for kk, vv := range old {
        cp[kk] = vv
    }
    cp[k] = Entry[V]{Gen: newGen, Value: v}
    c.m.Store(&cp)
    return newGen
}

func (c *Cache[K, V]) Delete(k K) int64 {
    c.mu.Lock()
    defer c.mu.Unlock()

    newGen := c.gen.Add(1)

    old := *c.m.Load()
    if _, ok := old[k]; !ok {
        return newGen
    }
    cp := make(map[K]Entry[V], len(old)-1)
    for kk, vv := range old {
        if kk != k {
            cp[kk] = vv
        }
    }
    c.m.Store(&cp)
    return newGen
}

func (c *Cache[K, V]) Gen() int64 { return c.gen.Load() }

Correctness argument¶

Get is lock-free. The SC load of c.m gives a consistent snapshot of the map. The Entry inside it is immutable.
Set and Delete take a mutex, building a new map and swapping atomically.
The generation counter is SC; observers see a monotonically-increasing number.

Performance¶

Reads: one atomic Load + map lookup. ~10 ns on x86.
Writes: full copy O(n). Suitable for read-mostly workloads.
No false sharing (the atomic pointer is on its own field).

Limitations¶

Writes are O(n) due to the copy. Don't use for write-heavy workloads.
Memory churn: each write allocates a new map. GC pressure.

Appendix: Recommendations for Architects¶

If you are responsible for the concurrency architecture of a large Go codebase:

Document the synchronisation primitives policy. Which patterns are encouraged? Discouraged?
Enforce -race in CI. Non-negotiable.
Establish a code-review checklist for concurrent code. Shared state, primitives, happens-before, false sharing.
Provide internal libraries for common patterns. Don't make every team re-invent rate limiting.
Profile regularly. Don't wait for production fires.
Train. Junior engineers benefit hugely from a one-day course on the memory model.
Read the runtime source as a team. A team that has read sync/atomic together makes better decisions.

These are organisational, not technical, but they determine whether SC mastery scales beyond a few senior engineers.

Closing¶

The senior page covered SC in depth: the formal definition, the happens-before construction, the hardware mechanisms that implement it, and the practical patterns built on top. The professional page steps further out: the philosophy of memory-model design, formal verification, the future of weakly-ordered hardware, and SC as a building block in distributed systems.

If you've made it this far, you have enough SC knowledge to debug any race, design any lock-free data structure, and choose the right primitive for any Go concurrency challenge. The remaining ten percent — the design-level and philosophical material — is in the professional page.

Appendix: Building a Custom Lock-Free Bounded MPMC Queue¶

Multi-producer multi-consumer (MPMC) queues are notoriously tricky. Let's design one in Go using SC atomics.

Design¶

A circular buffer of slots, each with a sequence number. Producers and consumers compete via CAS on the slot's sequence.

package mpmc

import (
    "sync/atomic"
)

type slot struct {
    seq atomic.Uint64
    val interface{}
    _   [40]byte // pad to 64 bytes
}

type Queue struct {
    buf    []slot
    mask   uint64
    enq    atomic.Uint64
    _      [56]byte
    deq    atomic.Uint64
}

func New(capacity int) *Queue {
    if capacity&(capacity-1) != 0 {
        panic("capacity must be power of two")
    }
    q := &Queue{
        buf:  make([]slot, capacity),
        mask: uint64(capacity - 1),
    }
    for i := range q.buf {
        q.buf[i].seq.Store(uint64(i))
    }
    return q
}

func (q *Queue) Enqueue(v interface{}) bool {
    var cell *slot
    pos := q.enq.Load()
    for {
        cell = &q.buf[pos&q.mask]
        seq := cell.seq.Load()
        diff := int64(seq) - int64(pos)
        if diff == 0 {
            if q.enq.CompareAndSwap(pos, pos+1) {
                break
            }
        } else if diff < 0 {
            return false
        } else {
            pos = q.enq.Load()
        }
    }
    cell.val = v
    cell.seq.Store(pos + 1)
    return true
}

func (q *Queue) Dequeue() (interface{}, bool) {
    var cell *slot
    pos := q.deq.Load()
    for {
        cell = &q.buf[pos&q.mask]
        seq := cell.seq.Load()
        diff := int64(seq) - int64(pos+1)
        if diff == 0 {
            if q.deq.CompareAndSwap(pos, pos+1) {
                break
            }
        } else if diff < 0 {
            return nil, false
        } else {
            pos = q.deq.Load()
        }
    }
    v := cell.val
    cell.seq.Store(pos + q.mask + 1)
    return v, true
}

Correctness argument¶

Each slot has a sequence number. Producers wait for seq == pos (slot empty for this generation). Consumers wait for seq == pos + 1 (slot full for this generation). CAS on enq/deq claims a position. SC ensures the slot's value is visible after the producer's sequence update.

Performance¶

This is the famous Vyukov bounded MPMC queue, adapted to Go. It outperforms simple mutex-based queues under high contention. The padding avoids false sharing between enq and deq and between slots.

Caveats¶

Capacity must be power of two.
Storing interface{} causes boxing for non-pointer types. Use a typed version with generics for performance-sensitive code.
Spinning under contention — not wait-free, just lock-free.

Appendix: Garbage Collection Impact on SC¶

Go's GC interacts with SC in several ways.

Write barriers and SC¶

During a GC cycle, every pointer write goes through the write barrier:

*p = val
// becomes:
runtime.gcWriteBarrier(p, val)
*p = val

The barrier records the pointer for the marker. The barrier itself uses SC atomic operations on the GC's grey-set queues.

For your atomic operations, the barrier is automatic and invisible. But it adds latency: - Non-GC time: atomic.Pointer.Store is one fence. - During GC: Store is one fence + barrier write.

Stop-the-world (STW) pauses¶

Modern Go has very short STW pauses (sub-millisecond). During STW, all goroutines stop. SC is trivially preserved (no concurrent activity).

When goroutines resume, they see the post-STW state. The SC contract is unaffected.

Allocation pressure from COW¶

Copy-on-write patterns allocate on every write. The GC handles this, but write-heavy workloads can trigger frequent GC cycles. Mitigations:

Pool allocations with sync.Pool.
Pre-allocate during initialisation.
Use mutable structures with mutex for write-heavy state.

Appendix: SC in the Context of Lock Inversion¶

When you have multiple locks (mutexes), lock ordering matters. SC does not save you from deadlock if you violate the lock-acquisition order.

// DEADLOCK SCENARIO
var (
    a, b sync.Mutex
)

func task1() {
    a.Lock()
    b.Lock()
    // ...
    b.Unlock()
    a.Unlock()
}

func task2() {
    b.Lock()
    a.Lock() // potential deadlock with task1
    // ...
    a.Unlock()
    b.Unlock()
}

SC orders memory operations; it does not prevent deadlock. Always acquire locks in a consistent order.

For lock-free designs, deadlock is impossible (no blocking), but other progress hazards exist: livelock (constant retry without progress) and ABA.

Appendix: SC in Persistent Storage Contexts¶

If you persist atomic values to disk and read them back, the memory model does not apply. Disk reads and writes have their own semantics.

For atomic file updates: - Use rename-then-overwrite for atomic file writes (POSIX guarantees). - Use fsync to ensure data hits disk. - Use durable record formats (e.g., append-only logs) for crash safety.

If you mmap a file and use atomics on the mapped region, you get in-memory SC but not durability. A crash before the dirty page reaches disk loses the writes. Use msync for durability.

Appendix: SC and Time¶

The Go memory model says nothing about wall-clock time. SC orders operations logically; it does not require them to happen at any specific time.

This means: - A SC write may not propagate immediately. It might take microseconds. - Time-based synchronisation (e.g., "sleep 100ms and assume the write is visible") is wrong. - Use explicit synchronisation primitives for visibility, not time.

For monitoring or eventual visibility, time is fine. For correctness, never.

Appendix: SC and Cross-Datacenter Replication¶

This is well outside Go's memory model, but the analogy is useful.

In a single Go process: SC means all goroutines agree on a global order. In a distributed system: SC means all replicas agree on a global order.

The mechanisms differ: - Shared memory: fences and cache coherence. - Distributed: consensus protocols (Raft, Paxos).

The latency cost grows by orders of magnitude. Cross-DC SC takes milliseconds; in-process SC takes nanoseconds.

If your Go code participates in a SC distributed system (e.g., etcd, Spanner client), be aware that the latency cost of SC operations is dominated by the network, not by Go's atomics.

Appendix: Common Misconceptions, Senior Edition¶

"SC means linearisable"¶

No. SC is a logical ordering; linearisability adds real-time consistency. SC is weaker.

"Atomic ops are non-blocking"¶

True at the lock-free level — they don't block in the OS sense. But they can stall a core for cache-coherence latency. Not "blocking" in the threading sense.

"Mutex is always slower than atomic"¶

False under low contention with multi-step critical sections. A mutex's lock+unlock is two atomics; a critical section with N atomics is 2N. Mutex wins.

"x86 is sequentially consistent"¶

False. x86 is TSO. Almost SC. Store-load reordering is still permitted absent fences.

"ARM is fully relaxed"¶

ARM is weakly ordered but has acquire/release primitives. Modern ARM (v8.3+) provides SC via LDAR/STLR pairs cheaply.

"GC pauses don't affect concurrency"¶

GC pauses can delay every goroutine. Modern Go's GC pauses are sub-millisecond, but for tight latency budgets, they matter.

"Channels are slow"¶

Channels are slower than atomics per op but provide much richer semantics. For most workloads, channel cost is negligible.

"Spinning is fast"¶

Tight spinning wastes CPU and starves other goroutines. Always back off.

Appendix: A Walk Through `sync.Map`¶

Go's sync.Map is designed for two patterns: 1. Write-once / read-many. 2. Disjoint sets of keys per goroutine.

For these patterns, it outperforms a sync.RWMutex-protected map.

Internals (simplified)¶

type Map struct {
    mu Mutex
    read atomic.Pointer[readOnly]
    dirty map[any]*entry
    misses int
}

type readOnly struct {
    m map[any]*entry
    amended bool
}

type entry struct {
    p atomic.Pointer[any]
}

The read field is an atomic pointer to a read-only snapshot. Hot reads hit read lock-free. The dirty map is for new keys, protected by the mutex. Periodic promotion of dirty to read updates the atomic pointer.

The implementation is heavy on SC atomics. Reading the source (src/sync/map.go) is enlightening.

When to use¶

Read-mostly with stable key set: yes.
Mixed read/write or new keys per request: no, use mutex + map.

Appendix: A Walk Through `sync.Pool`¶

sync.Pool provides per-P caches for object reuse. Reduces GC pressure.

type Pool struct {
    local     unsafe.Pointer // [P]poolLocal
    localSize uintptr
    // ...
}

type poolLocal struct {
    poolLocalInternal
    pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}

type poolLocalInternal struct {
    private interface{}
    shared poolChain
}

Each P has a poolLocal with private storage and a shared chain. Get prefers private, then shared, then steals from other Ps. All operations use SC atomics on the chain.

The padding ensures one poolLocal per cache line. The chain uses atomic pointer operations for lock-free append/steal.

Appendix: The Future of Atomic APIs in Go¶

As of Go 1.23: - Typed API: stable. - atomic.Pointer[T]: stable. - Generics support: stable.

Future possibilities (speculation): - atomic.Float32, atomic.Float64: requested often. Likely to come eventually. - atomic.AlignedInt64: explicit-alignment variant for shared-memory IPC. Niche. - atomic.LoadRelaxed/StoreRelaxed: rejected so far. Door not fully closed. - atomic.Fence: explicit fence operation. Currently not exposed; runtime uses internally.

The Go team's pace is deliberate. Changes are weighed for the entire ecosystem.

Appendix: A Worked SPSC Disruptor Sketch¶

The LMAX Disruptor is a high-throughput SPSC/MPSC queue used in financial systems. The Go adaptation:

type Sequence struct{ n atomic.Int64 }

type Disruptor struct {
    buf   []Event
    mask  int64
    cursor Sequence // producer
    cached Sequence // consumer's view of cursor
    consumer Sequence
}

type Event struct{ data int64 }

func NewDisruptor(size int) *Disruptor {
    return &Disruptor{
        buf:  make([]Event, size),
        mask: int64(size - 1),
    }
}

func (d *Disruptor) Publish(data int64) {
    pos := d.cursor.n.Add(1) - 1
    for {
        consumed := d.consumer.n.Load()
        if pos-consumed < int64(len(d.buf)) {
            break
        }
        runtime.Gosched()
    }
    d.buf[pos&d.mask].data = data
    d.cursor.n.Store(pos + 1) // re-publish (post-increment idiom)
}

func (d *Disruptor) Consume() int64 {
    pos := d.consumer.n.Add(1) - 1
    for d.cursor.n.Load() <= pos {
        runtime.Gosched()
    }
    v := d.buf[pos&d.mask].data
    return v
}

This sketch is incomplete but illustrates the pattern: producer increments cursor, consumer follows. SC ensures the consumer sees the slot's value after the producer's cursor update.

In production, the Disruptor uses more elaborate barrier handling and waits. The principle remains: SC atomics provide the publication discipline.

Appendix: Lock-Free Skip List Sketch¶

A lock-free skip list uses atomic pointers per level. Insertion sets up the new node's forwards then CAS-links it into each level.

type Node struct {
    key  int
    val  int
    next []atomic.Pointer[Node] // per level
}

type SkipList struct {
    head    *Node
    maxLevel int
}

func (s *SkipList) Get(k int) (int, bool) {
    cur := s.head
    for level := s.maxLevel - 1; level >= 0; level-- {
        for {
            next := cur.next[level].Load()
            if next == nil || next.key > k {
                break
            }
            cur = next
        }
    }
    cur = cur.next[0].Load()
    if cur != nil && cur.key == k {
        return cur.val, true
    }
    return 0, false
}

Each Load is SC. Each CompareAndSwap for insertion is SC. The structure is observable consistently because all pointer reads/writes are SC.

Real lock-free skip lists are more elaborate (handle deletion, retries, marking), but the SC foundation is the same.

Appendix: Wait-Free vs Lock-Free vs Obstruction-Free¶

These are progress conditions:

Wait-free: every operation completes in a bounded number of steps.
Lock-free: at least one operation makes progress in a bounded number of steps.
Obstruction-free: an operation makes progress if it runs in isolation.

SC is orthogonal to these. SC orders operations; progress is a separate concern. Most "lock-free" Go code is lock-free in this technical sense.

Wait-free algorithms are rare in practice — the bounds are often weak. Lock-free is the practical sweet spot.

Appendix: A Note on Async/await and SC¶

Languages with async/await (Rust, JavaScript, Kotlin coroutines) have their own concurrency models. Async tasks may execute on multiple threads (Rust's tokio) or one thread (JavaScript). The memory model varies:

Rust async: same SC rules as sync threads; tokio runtime ensures correctness.
JavaScript: single-threaded; SC trivially holds.
Kotlin coroutines: Java/Kotlin memory model applies.

Go does not have async/await. Goroutines are the unit; SC applies uniformly.

Appendix: Performance Profiling Real Production¶

A common scenario:

"Our service has slowed down. We profile and see runtime/internal/atomic.Xadd64 at 30% CPU."

Investigation: 1. Look at the call stack. Which atomic operation? 2. Frequency: per-request or shared across requests? 3. Contention: how many cores? 4. False sharing: is the atomic on a hot cache line with other atomics?

Resolution paths: - Shard the counter (per-P). - Pad the cache line. - Batch updates (per-goroutine local, flush periodically). - Reconsider the algorithm (do you need exact counts?).

A 30% CPU spent on atomics usually means the atomic is in the hottest path and needs rethinking, not just tweaking.

Appendix: Mental Models for SC Reasoning¶

Model: SC is a serialiser¶

Imagine a serialiser that picks one goroutine at a time to make one memory access, then moves to the next. The output is the global order. SC is the existence of such a serialiser.

Model: SC is a sound proof¶

A proof that uses SC as an axiom can derive program correctness step by step. If SC holds, the proof is sound. If SC fails (race), the proof's assumptions fail and any conclusion is possible.

Model: SC is a contract¶

Between you and the runtime. You provide race-freedom; runtime provides SC.

Model: SC is a constraint set¶

The set of legal executions is the set of executions consistent with some global order. SC narrows the set vs weaker models.

Appendix: A Day in the Life of a Senior Go Concurrency Engineer¶

Monday: profile a slow service. Identify mutex contention. Refactor to SC atomics. Benchmark. Ship.

Tuesday: review a PR introducing a new lock-free data structure. Trace happens-before edges. Suggest hazard pointer for ABA prevention.

Wednesday: investigate a flaky test. Run with -race -count=100. Identify race in a shared map. Fix with sync.Map.

Thursday: design a configuration hot-reload feature. Use atomic.Pointer[Config]. Document the publication contract.

Friday: write a tech-talk explaining SC-DRF to junior engineers. Use the store-buffer litmus as motivation.

Repeat. The work is varied; the underlying knowledge is consistent.

Appendix: Closing Recap (Even Longer)¶

The senior-level SC knowledge for Go encompasses:

Lamport's 1979 SC definition and modern formalisations.
The happens-before relation and its construction from primitives.
The Go 1.19 memory model revision and its rationale.
Hardware memory models: x86 TSO, ARM RCsc/RCpc, RISC-V RVWMO.
Compiler fence emission strategy.
Cache coherence (MESI) vs consistency.
Store-buffer effects and the canonical litmus tests (SB, IRIW, MP, LB, WRC).
Lock-free data structures: Treiber stack, Michael-Scott queue, SPSC ring, MPMC queue.
Production patterns: hot-reload, sharded counters, RCU, COW, hazard pointers.
Cost profiles on different architectures.
Race detector internals (vector clocks).
Boundaries: cgo, syscalls, mmap, persistent storage.
Comparative analysis with C++, Rust, Java.

If you have internalised this, you are a true SC expert and can debug, design, review, and teach concurrent Go code at the highest level. The professional page is a final calibration: design philosophy, formal verification, future directions, and SC in the broader ecosystem.

Appendix: Long-Form Case Study — A High-Throughput Stats Engine¶

Context: a production service receives 100k events/sec. Each event needs to update several counters. The original design used a sync.Mutex per counter; profiling showed 40% CPU in mutex contention.

First refactor: replace mutexes with atomics¶

type Stats struct {
    Requests atomic.Int64
    Errors   atomic.Int64
    Total    atomic.Int64
}

func (s *Stats) Record(ok bool) {
    s.Total.Add(1)
    if ok {
        s.Requests.Add(1)
    } else {
        s.Errors.Add(1)
    }
}

CPU drops from 40% to 25% — better but still high. Atomics scale better than mutexes but still bounce cache lines.

Second refactor: sharding¶

const shards = 64

type Stats struct {
    requests [shards]struct {
        n atomic.Int64
        _ [56]byte
    }
    errors [shards]struct {
        n atomic.Int64
        _ [56]byte
    }
}

func (s *Stats) Record(g int, ok bool) {
    idx := uint(g) % shards
    if ok {
        s.requests[idx].n.Add(1)
    } else {
        s.errors[idx].n.Add(1)
    }
}

func (s *Stats) Sum() (req, err int64) {
    for i := 0; i < shards; i++ {
        req += s.requests[i].n.Load()
        err += s.errors[i].n.Load()
    }
    return
}

CPU drops to 4%. The sharding spreads contention across 64 cache lines, eliminating the bouncing.

Third refactor: per-goroutine batching¶

For ultra-high throughput, batch updates per goroutine:

type LocalStats struct {
    Requests int64
    Errors   int64
}

type Stats struct {
    shards [shards]struct {
        n atomic.Int64
        _ [56]byte
    }
}

func (s *Stats) Flush(g int, local *LocalStats) {
    idx := uint(g) % shards
    s.shards[idx].n.Add(local.Requests + local.Errors)
    local.Requests = 0
    local.Errors = 0
}

Each goroutine accumulates locally, flushing every N events. The atomic write rate drops by a factor of N.

Lessons¶

The first optimisation (mutex → atomic) is a 1.6× improvement.
The second (sharding) is a 6× improvement on top.
The third (batching) is another 10× on top.
Each step makes the code more complex; pick the right level for your throughput.

Appendix: Long-Form Case Study — A Lock-Free Configuration Store¶

Context: A service uses a sync.RWMutex to protect a large *Config struct. Reads happen on every request (100k/sec). Writes are rare (once per hour). The RWMutex's per-read overhead shows up in profiles.

Refactor: COW with atomic.Pointer¶

type Config struct {
    DBHosts []string
    Timeout time.Duration
    Routes  map[string]string
}

type Store struct {
    cfg atomic.Pointer[Config]
}

func (s *Store) Load() *Config { return s.cfg.Load() }

func (s *Store) Update(f func(*Config) *Config) {
    for {
        old := s.cfg.Load()
        new := f(old)
        if s.cfg.CompareAndSwap(old, new) {
            return
        }
    }
}

Reads: one atomic load (essentially free on x86). Writes: O(n) copy, but rare.

Pitfalls¶

f must produce a new *Config. Mutating old is a race.
Configs accumulate during reload until old request handlers release their references. Long handlers retain memory longer.

Benchmark¶

Before: 800 ns/read (RWMutex.RLock + access + RUnlock).
After: 1 ns/read (atomic.Load + access).
800× speedup on the hot path.

Production rolled this out and saw immediate latency reductions on request handlers.

Appendix: Long-Form Case Study — A Race Detected in Production¶

Context: A nightly job sometimes produces wrong totals. Engineers initially blame the database. Investigation reveals a Go race.

The buggy code¶

type Aggregator struct {
    total int64
    mu    sync.Mutex
}

func (a *Aggregator) Add(n int64) {
    a.mu.Lock()
    a.total += n
    a.mu.Unlock()
}

func (a *Aggregator) Total() int64 {
    return a.total // RACE: reads without lock
}

The Total() method reads without locking. Concurrent Add calls modify the field; the read sees torn or stale data.

Detection¶

go test -race -count=10 ./aggregator reports the race within a few seconds.

Fix¶

func (a *Aggregator) Total() int64 {
    a.mu.Lock()
    defer a.mu.Unlock()
    return a.total
}

Or, lock-free:

type Aggregator struct {
    total atomic.Int64
}

func (a *Aggregator) Add(n int64)  { a.total.Add(n) }
func (a *Aggregator) Total() int64 { return a.total.Load() }

The lock-free version is simpler and faster. Production rolled it out the same day.

Lessons¶

Always run -race in CI. This bug would have been caught at PR time.
"Reading without locking" is a common pattern that produces sporadic bugs.
SC atomics often replace mutex-protected counters with a one-line change.

Appendix: Long-Form Case Study — A Subtle ABA in a Custom Allocator¶

Context: A custom slab allocator uses a lock-free free list. Occasionally, two goroutines receive the same memory chunk, causing data corruption.

The bug¶

type Slab struct {
    head atomic.Pointer[node]
}

type node struct {
    next *node
}

func (s *Slab) Alloc() *node {
    for {
        old := s.head.Load()
        if old == nil {
            return nil
        }
        nxt := old.next
        if s.head.CompareAndSwap(old, nxt) {
            return old
        }
    }
}

func (s *Slab) Free(n *node) {
    for {
        old := s.head.Load()
        n.next = old
        if s.head.CompareAndSwap(old, n) {
            return
        }
    }
}

Scenario: 1. Thread A: loads head = X. 2. Thread B: pops X, pops Y, pushes X back. Now head = X again, but X.next points to a different node than A read. 3. Thread A: CAS succeeds (head was X, still X). But A's stale nxt is now incorrect. 4. Both A and B return X to callers. Memory corruption.

Fix: tagged pointer¶

Combine the pointer with a version number in a single atomic word:

type Slab struct {
    head atomic.Uint64 // pack (ptr | version)
    // ... node management ...
}

Each Alloc/Free increments the version. CAS on the combined word fails if the version changed. ABA eliminated.

Alternative fix: hazard pointers (each goroutine declares "I'm reading X"; Free defers if X is hazarded).

Lessons¶

ABA is subtle. SC atomics on raw pointers can hide ABA.
Lock-free data structures need explicit ABA prevention.
For most Go code, use channels or mutexes; reach for lock-free only with measured need.

Context: A benchmark of a worker pool shows linear scaling up to 4 cores, then plateaus.

Investigation¶

perf stat shows high cache-misses despite low theoretical contention.

The worker struct:

type Worker struct {
    id      int
    requests atomic.Int64 // per-worker counter
    errors   atomic.Int64
}

Eight workers in a slice: workers := make([]Worker, 8).

The two atomics (requests and errors) are adjacent in the struct, so they share a cache line. Worse, the entire slice is contiguous, so worker 0's atomics share a cache line with worker 1's. Every worker's atomic update invalidates other workers' cache lines.

Fix¶

type Worker struct {
    id      int
    requests atomic.Int64
    _       [56]byte // pad
    errors   atomic.Int64
    _       [56]byte // pad
}

Now each atomic is on its own cache line. Throughput scales linearly to all cores.

Benchmark¶

Before: 4 cores ≈ 2M ops/sec. 8 cores ≈ 2.5M ops/sec (plateau).
After: 4 cores ≈ 2M ops/sec. 8 cores ≈ 4M ops/sec.

The padding wasted 112 bytes per worker, but the throughput doubled. Trade-off was obvious.

Lessons¶

Cache lines are 64 bytes (x86, ARM). Atomics in the same line interfere.
Use unsafe.Sizeof to verify your struct layout.
perf stat -e cache-misses is your friend on Linux.
The padding cost is usually trivial; the throughput gain is large.

Appendix: Long-Form Case Study — Hot Reload Memory Growth¶

Context: A service hot-reloads config via atomic.Pointer[Config]. Memory grows steadily over time.

Investigation¶

go tool pprof -alloc_objects shows accumulated *Config allocations. The new configs are clearly being created, but the old ones aren't being freed.

Root cause: long-running goroutines (websocket handlers) hold references to the config they started with. The GC can't collect those configs until the handlers finish.

Fix¶

Strategy 1: re-fetch the config periodically.

func handler() {
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    var cfg *Config
    for {
        select {
        case <-ticker.C:
            cfg = store.Load() // refresh
        case msg := <-input:
            process(msg, cfg)
        }
    }
}

Strategy 2: keep only relevant fields.

type RequestCfg struct {
    Timeout time.Duration
    URL     string
}

func handler() {
    full := store.Load()
    rc := RequestCfg{Timeout: full.Timeout, URL: full.URL}
    // use rc instead of full
}

Now the handler holds only a small struct; the full *Config is collectable.

Lessons¶

Long-running goroutines holding atomic.Pointer-loaded values prevent GC of old versions.
Periodic re-loading or field extraction is the fix.
Profile allocations, not just CPU.

Appendix: Long-Form Case Study — A Race in `time.Now()` Caching¶

Context: An optimisation caches time.Now() once per second in an atomic. Profiling shows the cache works, but occasional handler logs show wildly wrong timestamps.

The bug¶

var cachedNow atomic.Int64 // unix nano

func init() {
    go func() {
        for {
            cachedNow.Store(time.Now().UnixNano())
            time.Sleep(time.Second)
        }
    }()
}

func Now() time.Time {
    return time.Unix(0, cachedNow.Load())
}

The bug: the goroutine starts in init, but cachedNow is zero until the first iteration. Handlers running immediately after startup see Now() == 1970-01-01.

Fix¶

func init() {
    cachedNow.Store(time.Now().UnixNano())
    go func() {
        for {
            time.Sleep(time.Second)
            cachedNow.Store(time.Now().UnixNano())
        }
    }()
}

Initialise before starting the updater. The atomic now always has a valid value.

Lessons¶

Initial state of atomics is zero. Always initialise before any reader can run.
Even with SC, you must establish the invariant before publication.

Appendix: A Final Set of Recommendations¶

For senior engineers leading Go teams:

Adopt the typed atomic API as the default. Migrate legacy code over time.
Run -race in CI on every PR. Non-negotiable.
Use atomic.Pointer[T] for hot-reload patterns. RWMutex is rarely the right tool for read-mostly state.
Profile contended atomics. Shard or batch when measurable.
Document publication contracts in code comments adjacent to atomic declarations.
Train juniors on the memory model with the store-buffer litmus and the publication pattern.
Read the Go runtime's atomic code as a team exercise once per year. It evolves; staying current pays off.
Don't roll your own lock-free unless you have measured need and verified correctness.
Prefer channels for messaging, atomics for state, mutexes for multi-step invariants.
Test with -cpu=1,2,4,8,16 to expose schedule-dependent races.

Appendix: The Ten Most Important Things¶

If you remember nothing else:

SC means program order is preserved globally.
Go 1.19+ promises SC for race-free programs.
sync/atomic operations are SC.
The race detector verifies SC's precondition (race-freedom).
Mutexes, channels, WaitGroup, Once also create happens-before edges.
Publication: write data, then SC store; SC load, then read data.
After publication, treat data as immutable.
Pad to cache lines for write-heavy contended atomics.
Shard counters to scale across cores.
SC is the simplest model; Go bet on it. Make the bet pay off.

Closing Senior Page¶

This concludes the senior-level treatment. The technical content is exhaustive: definitions, hardware models, compiler details, lock-free patterns, production case studies. If you have absorbed it, you are equipped to lead Go concurrency at any scale.

The professional page that follows is about meta-level concerns: why Go made these choices, how to evaluate trade-offs at the language level, formal verification techniques, and the future direction of shared-memory programming.

Appendix: Detailed Walkthrough — The Go 1.19 Memory Model Document¶

The official document at https://go.dev/ref/mem is dense. Let's walk through the key sections.

Section "Overview"¶

The opening paragraph sets the philosophy:

Programs that modify data being simultaneously accessed by multiple goroutines must serialize such access.

Note the must. Not "should." If you race, your program is broken.

To serialize access, protect the data with channel operations or other synchronization primitives such as those in the sync and sync/atomic packages.

Channels are listed first. The Go team's design preference.

Section "Memory Model"¶

The Go memory model specifies the conditions under which reads of a variable in one goroutine can be guaranteed to observe values produced by writes to the same variable in a different goroutine.

This is the heart. Observation guarantees, not implementation details.

Section "Happens Before"¶

The formal definition.

Within a single goroutine, the happens-before order is the order expressed by the program.

A read r of a variable v is allowed to observe a write w to v if both of the following hold: 1. r does not happen before w. 2. There is no other write w' to v that happens after w but before r.

Note: "is allowed to observe" — not "must." The model permits multiple legal observations.

Section "Synchronization"¶

Lists the synchronization primitives and their happens-before edges: - Initialization - Goroutine creation and destruction - Channel communication - Locks (sync.Mutex, sync.RWMutex) - Once (sync.Once) - Atomic Values (sync/atomic)

For atomics:

The APIs in the sync/atomic package are collectively "atomic operations" that can be used to synchronize the execution of different goroutines. If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B.

And critically:

All the atomic operations executed in a program behave as though executed in some sequentially consistent order.

This is the SC guarantee. One sentence, momentous implications.

Section "Incorrect synchronization"¶

Examples of bad code. Worth reading multiple times. They illustrate: - Plain bool flags don't work. - Time-based reasoning doesn't work. - Memory operations on unsync'd shared variables are races.

Section "Implementation restrictions for compilers"¶

Constraints on the Go compiler: - Reads of single machine words may not be split. - Writes of single machine words may not be split. - Speculative writes are forbidden where they may be observed by other goroutines.

These are programmer-visible constraints; the compiler honours them.

Appendix: The Memory Model in Comparison¶

Side-by-side comparison of "race-on-bool" semantics:

Language	Plain bool flag, racy	Defined behaviour
C++	undefined behaviour	undefined
Rust	won't compile (Sync)	n/a
Java	bounded — value will be old or new	bounded
C#	torn read possible but rare	implementation-defined
Go	undefined behaviour	undefined

Go and C++ are the strict ones. Java and Rust take different approaches.

Appendix: SC in the Context of `unsafe.Pointer`¶

The unsafe.Pointer type bypasses Go's type system. With it, you can:

var x int64
p := (*[8]byte)(unsafe.Pointer(&x))

Mutating via p is not atomic, even if you use atomic.StoreInt64(&x, ...) elsewhere. SC operations protect the location, not the type.

Worse: the race detector may miss races through unsafe pointers because the type information is lost. Use unsafe operations on shared memory only with extreme care.

A common safe pattern is converting a uintptr and back, used by the runtime for low-level data structures. User code rarely needs this.

Appendix: Atomic Operations on Floating-Point¶

Go's sync/atomic does not provide atomic.Float64. Workaround:

type AtomicFloat64 struct {
    bits atomic.Uint64
}

func (a *AtomicFloat64) Load() float64 {
    return math.Float64frombits(a.bits.Load())
}

func (a *AtomicFloat64) Store(v float64) {
    a.bits.Store(math.Float64bits(v))
}

func (a *AtomicFloat64) Add(d float64) float64 {
    for {
        old := a.Load()
        new := old + d
        if a.bits.CompareAndSwap(math.Float64bits(old), math.Float64bits(new)) {
            return new
        }
    }
}

Cast float to uint64 bits for storage; convert back on read. SC applies to the uint64; the float semantics follow.

The Add is a CAS loop, more expensive than integer add. For accumulators (e.g., training neural networks), this is acceptable.

Appendix: SC and Concurrent Maps¶

sync.Map is engineered for specific access patterns. Its internals use SC atomics extensively.

For other patterns, the alternatives are: - sync.RWMutex + map[K]V: simple, scales poorly under heavy reads. - atomic.Pointer[map[K]V] + COW: fast reads, slow writes. - Custom sharded map: linear scaling.

Each has trade-offs. Profile your workload before choosing.

Appendix: Spinning vs Blocking¶

A spinlock (atomic CAS in a loop) vs a blocking mutex (which parks the goroutine):

Property	Spinlock	Mutex
Hold time	very short (<μs)	any
Contention behaviour	wastes CPU	parks goroutine
Throughput under contention	poor	fair
Latency under contention	bad	acceptable
Implementation complexity	low	high

Go's sync.Mutex does brief spinning before parking (called "spin-park" or hybrid mutex). For most code, sync.Mutex is the right choice. Roll your own spinlock only if you have measured that mutex park-unpark is too slow.

Appendix: Memory Reclamation in Lock-Free Structures¶

A lock-free data structure that allocates nodes faces the question: when is it safe to free a node?

Options: 1. GC: Go's GC handles this automatically. Most lock-free Go structures rely on this. 2. Reference counting: with atomic refcounts, decrement on every "release"; free at zero. 3. Hazard pointers: each thread declares pointers it's reading; freeing defers until no thread hazards. 4. Epoch-based reclamation (EBR): each thread enters an epoch on read; freeing waits for all threads to advance past the epoch.

Go's GC is fast enough that hand-rolled schemes are rarely needed. The exceptions: - When the GC's pause behaviour is unacceptable (high-frequency trading). - When allocations are bottleneck (NUMA-aware allocators). - When you're writing the GC itself.

Appendix: SC and the Stack¶

Go goroutines have growable stacks. When a goroutine's stack grows or shrinks, all pointers into the stack are updated. This is a "stack copy" operation.

Atomics on stack-allocated variables work, but if the stack moves, the atomic's address changes. The race detector understands this. Other goroutines accessing the atomic via pointer must follow the moved value.

In practice: avoid passing stack-allocated atomics to other goroutines. Use heap-allocated *Atomic values or struct fields.

// risky
func f() {
    var a atomic.Int64
    go func() { a.Add(1) }() // a may move when f's stack grows
}

// safe
func f() {
    a := new(atomic.Int64)
    go func() { a.Add(1) }()
}

The Go compiler usually escapes the atomic to the heap automatically when it sees a closure capture. Still, being explicit is clearer.

Appendix: SC Across Goroutine Boundaries — Edge Cases¶

Passing a struct by value¶

type S struct { v atomic.Int64 }

go func(s S) { s.v.Add(1) }(local) // operates on a copy

The copy is independent. Use *S:

go func(s *S) { s.v.Add(1) }(&local)

Closure capture by reference¶

var x atomic.Int64
go func() { x.Add(1) }() // captures &x

This is captured by reference (since x is a struct that goroutine modifies). The Go compiler handles it. The atomic location is the original.

Channel of struct values¶

type S struct { v int64 }
ch := make(chan S)
go func() {
    s := <-ch
    s.v = 1 // operates on a copy
}()

Sending a struct copies it. The receiver's copy is independent. If you need shared state, send *S.

Interface boxing¶

var i any = SomeValue
go func() { /* use i */ }()

Interface values are two words: type and data. Reading an interface field is two loads. If the interface is shared and modified concurrently, you can observe a torn interface (one word from the old value, one from the new). Always use atomic.Pointer[any] for shared interfaces — or better, avoid sharing interface values mutably.

Appendix: A Reading Order for `runtime/internal/atomic`¶

If you want to study Go's atomic implementation, this order helps:

runtime/internal/atomic/types.go: type definitions.
runtime/internal/atomic/atomic_amd64.go: x86 wrappers.
runtime/internal/atomic/atomic_amd64.s: x86 assembly.
runtime/internal/atomic/atomic_arm64.go: ARM64 wrappers.
runtime/internal/atomic/atomic_arm64.s: ARM64 assembly.
runtime/internal/atomic/atomic_riscv64.s: RISC-V.
sync/atomic/type.go: user-facing typed wrappers.
sync/atomic/value.go: legacy untyped wrapper.

Read in order; the layers stack from low-level to high-level. An afternoon's study leaves you with a thorough understanding.

Appendix: A Day of Profiling Atomics¶

Suppose you join a team and the service is slow. A walkthrough:

Hour 1: profile¶

Capture a CPU profile and look for atomic hot spots:

curl -s 'http://localhost:6060/debug/pprof/profile?seconds=30' > prof
go tool pprof -top prof | head

Look for runtime/internal/atomic.* near the top.

Hour 2: identify the variable¶

In the profile, click into the hot atomic. Trace back to the application code. Identify the variable and access pattern.

Hour 3: hypothesise¶

Is it write-heavy? (Yes → consider sharding or batching.)
Is it read-heavy? (Atomic load should be cheap; check for false sharing.)
Is it CAS-heavy? (Spin contention? Add backoff.)

Hour 4: prototype¶

Implement one mitigation. Benchmark in isolation. Measure improvement.

Hour 5: integrate¶

Apply to the codebase. Run tests under -race.

Hour 6: re-profile¶

Confirm the bottleneck moved or shrank.

Hour 7: document¶

Update the design docs and code comments. Future-you and future-team will thank present-you.

Appendix: A Set of Interview Questions¶

Below are senior-level interview questions on SC. The interview.md page has its own list; these are deeper.

State Lamport's 1979 definition of SC. Apply it to a two-thread program.
Why is the store-buffer litmus the canonical separating example for SC vs TSO?
Why is the IRIW litmus the separating example for SC vs acq-rel?
Describe how Go's compiler emits SC operations on ARM64 vs x86.
What is the difference between cache coherence and memory consistency?
Why does Go's memory model give SC for race-free programs only?
What is happens-before? How is it constructed in Go?
Compare Go's SC commitment with C++'s memory_order_seq_cst default.
Describe the ABA problem and three solutions.
Why are mutexes preferred over atomics for multi-step critical sections?
What is false sharing? Demonstrate it in code and show the fix.
Why does the race detector use vector clocks?
Describe the implementation of sync.Once.
Describe the implementation of sync.Map's read fast path.
What is the cost of an SC atomic store on x86 vs ARM vs RISC-V?
Why are LDAR/STLR paired for SC on ARM64?
What happens if you mutate a *Config after publishing via atomic.Pointer[Config]?
Why is runtime.Gosched() not a memory barrier?
Why is time.Sleep not a memory barrier?
Describe the publication pattern with atomic.Pointer[T]. Why is it safe?

A senior candidate should answer 15 of 20 correctly within an hour.

Appendix: Final SC Mastery Checklist¶

If you can check 15 of 17, you are a senior-level expert. If you check all 17, you are ready to write the next generation of memory-model documentation.

Final Senior Wrap-Up¶

You have reached the end of the senior page. The content was:

Lamport's definition and its modern restatement.
Formal happens-before construction.
SC as a total order over atomics.
Store-buffer effects and the litmus tests that reveal them.
Compiler fence emission across architectures.
Hardware memory models: x86 TSO, ARM ARMv8 RCsc, RISC-V RVWMO.
Cache coherence (MESI) vs consistency.
The IRIW litmus and the SC-vs-acq-rel separating example.
Reading the Go runtime atomics source.
Lock-free data structure design: Treiber, Michael-Scott, SPSC, MPMC, skip list.
SC at the runtime boundary: scheduler, GC, allocator.
SC at the OS boundary: cgo, syscalls, mmap.
Formal verification with TLA+.
Production case studies: stats engine, configuration store, race detection, ABA, false sharing.
Comparative analysis: C++, Rust, Java, C#.
Future directions for Go's atomic API.

This page is the technical apex. The professional page that follows is about the meta — language design, ecosystem influence, and the philosophy behind the choices.

Appendix: Worked Microbenchmarks¶

Below are detailed microbenchmark setups for measuring SC overhead. Run them on your target hardware to calibrate intuition.

Benchmark: load throughput¶

package atomicbench

import (
    "sync/atomic"
    "testing"
)

var (
    plain int64
    atom  atomic.Int64
)

func BenchmarkPlainLoad(b *testing.B) {
    var sink int64
    for i := 0; i < b.N; i++ {
        sink = plain
    }
    _ = sink
}

func BenchmarkSCLoad(b *testing.B) {
    var sink int64
    for i := 0; i < b.N; i++ {
        sink = atom.Load()
    }
    _ = sink
}

Run: go test -bench=Load -benchmem.

Typical x86: PlainLoad and SCLoad both around 0.3 ns/op. The compiler may even compile them identically (a regular MOV).

Typical ARM64: PlainLoad ~1 ns/op, SCLoad ~5 ns/op (LDAR fence).

Benchmark: store throughput¶

func BenchmarkPlainStore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        plain = int64(i)
    }
}

func BenchmarkSCStore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        atom.Store(int64(i))
    }
}

x86: PlainStore ~0.3 ns/op, SCStore ~10 ns/op (XCHG).

ARM: PlainStore ~1 ns/op, SCStore ~10 ns/op (STLR).

Benchmark: contended add¶

func BenchmarkContendedAdd(b *testing.B) {
    var counter atomic.Int64
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            counter.Add(1)
        }
    })
}

Run with -cpu=1,2,4,8. Throughput per goroutine drops as cores increase. On 8 cores, single-counter contention can drop throughput by 10×.

Benchmark: sharded add¶

type Shard struct {
    n atomic.Int64
    _ [56]byte
}

func BenchmarkShardedAdd(b *testing.B) {
    const shards = 64
    var arr [shards]Shard
    var ctr atomic.Int64
    b.RunParallel(func(pb *testing.PB) {
        id := uint64(ctr.Add(1))
        s := &arr[id%shards]
        for pb.Next() {
            s.n.Add(1)
        }
    })
}

Throughput scales linearly with cores. Compare to contended: usually 10-100× improvement on 8 cores.

Benchmark: CAS contention¶

func BenchmarkCAS(b *testing.B) {
    var v atomic.Int64
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            for {
                old := v.Load()
                if v.CompareAndSwap(old, old+1) {
                    break
                }
            }
        }
    })
}

Heavy contention. Throughput degrades catastrophically under load. Use Add directly when possible.

Benchmark: pointer publication¶

type Config struct{ N int }

func BenchmarkAtomicPointerStore(b *testing.B) {
    var p atomic.Pointer[Config]
    for i := 0; i < b.N; i++ {
        p.Store(&Config{N: i})
    }
}

func BenchmarkAtomicPointerLoad(b *testing.B) {
    var p atomic.Pointer[Config]
    p.Store(&Config{N: 42})
    var sink *Config
    for i := 0; i < b.N; i++ {
        sink = p.Load()
    }
    _ = sink
}

Pointer atomics behave similarly to int atomics: load is cheap on x86, store is ~10 ns.

Benchmark: mutex vs atomic¶

var mu sync.Mutex
var counter int64

func BenchmarkMutexInc(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            mu.Lock()
            counter++
            mu.Unlock()
        }
    })
}

Uncontended: mutex is ~20 ns, atomic add is ~10 ns. Atomic wins.

Contended (8 goroutines on 4 cores): mutex throughput drops faster because of park-unpark overhead. Atomic with sharding wins.

Appendix: Microbenchmark Pitfalls¶

Benchmark code doesn't include real workload: a benchmark measuring only atomic.Add cost is misleading if the real path also does network I/O.
Compiler optimisations: _ = sink is needed; otherwise the compiler may elide the load entirely.
CPU frequency scaling: cpupower frequency-set --governor performance on Linux pins frequency for consistent measurement.
Thermal throttling: long benchmarks may slow down due to heat. Verify with perf stat.
Other goroutines: background goroutines from the test framework can interfere. Use b.ResetTimer after setup.
GC pauses: heap pressure during a benchmark introduces noise. runtime.GC() before timing.
CPU pinning: on NUMA systems, atomics on remote nodes are slower. taskset on Linux.

Appendix: A Library of SC Patterns¶

To close, a compact library of SC patterns suitable for senior engineers.

Pattern: lazy initialisation¶

type Lazy[T any] struct {
    p atomic.Pointer[T]
    m sync.Mutex
}

func (l *Lazy[T]) Get(init func() *T) *T {
    if v := l.p.Load(); v != nil {
        return v
    }
    l.m.Lock()
    defer l.m.Unlock()
    if v := l.p.Load(); v != nil {
        return v
    }
    v := init()
    l.p.Store(v)
    return v
}

Pattern: atomic Or¶

type Bits struct{ v atomic.Uint64 }

func (b *Bits) Set(mask uint64) {
    for {
        old := b.v.Load()
        if b.v.CompareAndSwap(old, old|mask) {
            return
        }
    }
}

func (b *Bits) Clear(mask uint64) {
    for {
        old := b.v.Load()
        if b.v.CompareAndSwap(old, old&^mask) {
            return
        }
    }
}

Go 1.23 added atomic.Uint64.Or and .And natively. Before that, you need the CAS loop above.

Pattern: epoch reclamation¶

type Epoch struct{ gen atomic.Int64 }

type Guard struct{ start int64 }

func (e *Epoch) Enter() Guard { return Guard{start: e.gen.Load()} }

func (e *Epoch) Bump() int64 { return e.gen.Add(1) }

Producers Bump after committing changes. Consumers Enter when starting, observing the current generation.

Pattern: generational counter¶

type GenCounter struct {
    gen atomic.Int64
    n   atomic.Int64
}

func (c *GenCounter) Inc() (gen, n int64) {
    return c.gen.Load(), c.n.Add(1)
}

func (c *GenCounter) Reset() {
    c.gen.Add(1)
    c.n.Store(0)
}

Each increment knows its generation. After Reset, new increments start at 1 in the next generation.

Pattern: latched event¶

type Latch struct {
    done atomic.Bool
    once sync.Once
    ch   chan struct{}
}

func New() *Latch {
    return &Latch{ch: make(chan struct{})}
}

func (l *Latch) Signal() {
    l.once.Do(func() {
        l.done.Store(true)
        close(l.ch)
    })
}

func (l *Latch) Wait() {
    <-l.ch
}

func (l *Latch) WaitOrSignaled() bool {
    select {
    case <-l.ch:
        return true
    default:
        return false
    }
}

Combines an atomic flag (for fast IsSignaled checks), a Once (for idempotent signal), and a channel (for blocking waits). Each primitive serves a purpose.

Appendix: Closing Notes from Production¶

After years of running atomics-heavy Go services, observations:

95% of bugs are missing synchronisation (plain access to shared state). Fixed by adding atomics or mutexes.
4% of bugs are incorrect synchronisation (atomic on one side, plain on the other). Fixed by symmetry.
1% of bugs are real concurrent algorithm bugs (ABA, sloppy CAS, etc.). Hardest to find.
The race detector catches the first two. The third needs careful design and review.
Performance issues with atomics are usually false sharing or contention. Sharding and padding are the cure.
Mutex contention is more common than atomic contention in real Go code.
Channels are the easiest tool to teach; atomics need more care.
Once you internalise SC, you stop thinking about it. It becomes the baseline assumption.

This concludes the senior page. Move to the professional page when you want the design-philosophy level — how memory models are designed, evaluated, and evolved at the language scale.

Appendix: Extra Patterns for Real-Time Systems¶

For real-time Go services (sub-millisecond response budgets), SC atomics have additional concerns:

Pattern: pre-allocated atomic-pointer pool¶

Avoid GC pressure by pre-allocating all atomic-pointer targets:

type Buffer struct{ data [4096]byte }

var pool [256]Buffer
var freeList atomic.Uint64 // bitmap

func Alloc() *Buffer {
    for {
        free := freeList.Load()
        if free == 0 {
            return nil
        }
        i := bits.TrailingZeros64(free)
        if freeList.CompareAndSwap(free, free&^(1<<i)) {
            return &pool[i]
        }
    }
}

func Free(b *Buffer) {
    i := (uintptr(unsafe.Pointer(b)) - uintptr(unsafe.Pointer(&pool[0]))) / unsafe.Sizeof(Buffer{})
    for {
        free := freeList.Load()
        if freeList.CompareAndSwap(free, free|(1<<i)) {
            return
        }
    }
}

256-slot pool with bitmap-tracked free list. SC atomic operations on the bitmap. No GC during the hot path.

Pattern: lock-free timestamp¶

var ts atomic.Int64

func init() {
    go func() {
        for {
            ts.Store(time.Now().UnixNano())
            time.Sleep(time.Millisecond)
        }
    }()
}

func Now() int64 { return ts.Load() }

Replaces 100ns time.Now() syscall with a 1ns atomic load. The cost: 1ms granularity.

Pattern: cache-line-padded latch¶

type Latch struct {
    done atomic.Bool
    _    [63]byte
}

Pad to avoid false sharing with neighbouring atomics.

Pattern: ring buffer with sequence numbers¶

type Ring struct {
    buf  []slot
    mask uint64
    head atomic.Uint64
    _    [56]byte
    tail atomic.Uint64
    _    [56]byte
}

Standard SPSC ring with cache-line padding for head and tail. Producer touches head; consumer touches tail. No false sharing.

Appendix: SC and Determinism¶

A determinism question: if SC promises a global order, is the order deterministic?

No. SC promises that some total order exists; the specific order depends on scheduling, hardware contention, GC pauses, and many other factors. Different runs may produce different orders.

For deterministic execution, you need: - Single-threaded execution. - Or, a deterministic schedule (e.g., simulator). - Or, careful sequencing via channels with deterministic protocols.

Most concurrent Go programs are nondeterministic. SC just bounds the set of legal executions.

Appendix: SC and Testing¶

Testing concurrent code under SC:

Unit tests: race detector + repeated runs.
Property tests: invariants under arbitrary schedules.
Stress tests: high contention, long runs.
Litmus tests: targeted invariant checks for specific reordering scenarios.

Tools to consider: - go test -race -count=N -cpu=1,2,4,8. - Goroutine schedulers like gopls's race tool extensions. - TLA+ for formal models (out-of-band).

A robust concurrent codebase invests in all of these.

Appendix: Acknowledgements and Influences¶

The SC memory model in Go owes much to:

Leslie Lamport (1979 SC paper).
Sarita Adve and Hans-J. Boehm (memory-model formalisation).
Russ Cox (Go memory-model revision).
The Go team for choosing simplicity over flexibility.
The C++ committee for the SC seq-cst formalisation that informed Go's choice.
Java's JSR-133 work that pioneered SC for managed languages.

These influences shaped Go's current memory model.

Truly Final Note¶

The senior page is complete. You have a comprehensive understanding of SC in Go: from intuition to formal definition, from hardware fences to lock-free algorithms, from production patterns to debugging. The professional page is a step toward the philosophy and the future.

Sequential Consistency — Senior Level¶

Table of Contents¶

Introduction¶

Lamport's Original Definition¶

Implications¶

Why this definition is hard to implement¶

Happens-Before, Formally¶

Synchronisation edges in Go¶

Race definition¶

SC-DRF formally¶

SC as a Total Order over Atomics¶

How the Go memory model phrases it¶

Store Buffers and Why SC Is Hard¶

The store buffer¶

The problem¶

The store-buffer litmus, replayed¶

The fence cost¶

The Go Compiler's Fence Emission¶

Strategy per platform¶

The compiler's job¶

Inlining and aliasing¶

x86 TSO in Depth¶

Why TSO?¶

SC on top of TSO¶

x86 quirks¶

ARM64 and the Acquire-Release Pair Trick¶

Background¶

Fences¶

Acquire-release instructions (ARMv8.0)¶

Cheaper variants (ARMv8.3+)¶

Compare-and-swap¶

Fence-only emit¶

RISC-V and Explicit Fences¶

Atomic instructions¶

Why RISC-V matters¶

Cache Coherence vs Consistency¶

MESI states¶

Transitions¶

False sharing¶

The IRIW Litmus and Why SC Differs from Acq-Rel¶

Why SC matters here¶

Reading the Go Runtime Atomics Code¶

Designing Lock-Free Data Structures on SC¶

Treiber stack¶

Michael-Scott queue¶

Hand-over-hand locking with atomics¶

Skip-list with atomic pointers¶

SC at the Runtime Boundary¶

Goroutine state transitions¶

Work-stealing queue¶

GC tricolor marking¶

Memory allocator¶

SC and Cgo, Syscalls, mmap¶

Cgo¶

Syscalls¶

mmap¶

Formal Verification with TLA+¶

Production Engineering: When SC Is the Right Tool¶

Case 1: hot-reload configuration¶

Case 2: stop signal¶

Case 3: counters¶

Case 4: lock-free queue¶

Case 5: pub-sub broadcast¶

Case 6: rate limiting¶

Case 7: routing table¶

Case 8: scheduler state machine¶

Design Trade-offs in Memory Models¶

Simpler model vs more performance knobs¶

Hardware-revealing vs hardware-abstracting¶

Static vs dynamic detection¶

Whole-program vs per-module¶

Future direction¶

What Could Break SC in Go?¶

Pointer trickery via unsafe¶

Cgo without proper atomics on the C side¶

Custom assembly¶

Compiler bugs¶

Hardware bugs¶

Misuse of runtime.GC(), runtime.GOMAXPROCS, etc.¶

Self-Assessment¶

Misuse of `runtime.GC()`, `runtime.GOMAXPROCS`, etc.¶

`sync.Pool` and per-P padding¶