Acquire / Release — Professional Level¶

Table of Contents¶

Introduction
Cross-Language Memory Models
C++ memory_order Reference
Rust Ordering Reference
Java volatile and VarHandle
Go's Sequential Consistency Choice
How Go Compiles Atomics
Per-Architecture Cost Models
The Go Runtime and Memory Barriers
Fence Elision
Linux Kernel RCU vs Go RCU
Hazard Pointers in Depth
Wait-Free Synchronization Theory
Universal Constructions
Read-Mostly Reclamation
NUMA Effects
Designing a New Concurrency Primitive
Formal Verification
Future of Go's Memory Model
Summary
Further Reading

Introduction¶

The professional level looks under the hood. You'll learn:

Exactly what machine code Go emits for atomic.Store on x86, ARM64, and RISC-V.
How Go's seq-cst choice compares to C++/Rust's flexible memory_order.
The Go runtime's role in scheduling around memory barriers.
Per-architecture cost models with cycle counts.
Hazard pointers, epoch-based reclamation, and why Go's GC subsumes them.
Formal verification techniques for concurrent algorithms.
How to design a new concurrency primitive from scratch.

This level is for engineers who:

Contribute to language runtimes (Go, Rust, C++).
Write performance-critical libraries used by thousands of services.
Design database engines, message queues, or storage systems.
Investigate subtle production bugs caused by memory-model violations.

It is NOT required for application development. If you reached the senior level, you can ship correct concurrent Go code for the rest of your career without reading this file. This file is for the curious, the systems-builders, and the language designers.

Cross-Language Memory Models¶

Every modern language has a memory model. They differ in three dimensions:

Default behavior: what does a plain read/write mean?
Atomic vocabulary: what ordering options are exposed?
Synchronization primitives: what primitives provide which orderings?

A quick comparison:

Language	Default	Atomic vocabulary	Synchronization primitives
Go	Race UB	seq-cst only	`sync`, `sync/atomic`, channels
C++	Race UB	relaxed, acquire, release, acq_rel, seq_cst, consume	`std::mutex`, `std::atomic`, condition_variable
Rust	Race UB	Same as C++ (minus consume)	`std::sync::`, `Atomic`
Java	Race well-defined (mostly)	volatile = acq-rel, plain = relaxed	`synchronized`, `volatile`, `j.u.c.atomic.*`
C#	Race well-defined (mostly)	Volatile.Read/Write = acq-rel, Interlocked = seq-cst	`lock`, `Interlocked`, `Volatile`
Python	GIL serializes most	Limited via threading	`Lock`, `RLock`

Notes:

"Race UB" means a data race is undefined behavior. The compiler is free to assume races don't happen.
Java and C# guarantee that all reads return some value previously written (even if you raced) — no garbage, no torn reads of refs/booleans/etc.
Go's choice of seq-cst-only is the most opinionated: simple to use, slightly slower on weakly ordered hardware.

C++ `memory_order` Reference¶

C++ exposes six memory orders:

enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
};

`memory_order_relaxed`¶

Atomic but unordered. No happens-before guarantees beyond the atomicity itself. Used for counters that don't need ordering (e.g., performance metrics).

std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);

Counter increments. No release of other writes; no acquire of others' writes.

`memory_order_acquire` and `memory_order_release`¶

The pair we've been discussing. Release on the producer side, acquire on the consumer side, on the same location.

std::atomic<bool> ready{false};
int data;

// Producer:
data = 42;
ready.store(true, std::memory_order_release);

// Consumer:
while (!ready.load(std::memory_order_acquire)) { }
assert(data == 42);

`memory_order_acq_rel`¶

For read-modify-write operations: the read is an acquire, the write is a release. Used for fetch_add, compare_exchange, etc.

std::atomic<int> v{0};
v.fetch_add(1, std::memory_order_acq_rel);

`memory_order_seq_cst`¶

Sequential consistency. The default. Provides a single global order of all seq_cst operations.

std::atomic<int> v{0};
v.store(1, std::memory_order_seq_cst);

`memory_order_consume`¶

Deprecated since C++17. The idea was a weaker form of acquire that only synchronizes for data dependencies. In practice, no compiler implements it correctly; everyone falls back to acquire.

Cost on real hardware (x86-64)¶

relaxed: plain mov.
acquire (load): plain mov (x86 loads are already acquire).
release (store): plain mov (x86 stores are already release).
acq_rel (RMW): LOCK CMPXCHG or similar locked instruction.
seq_cst: locked instruction with implicit full fence; for plain stores, often XCHG.

So on x86, relaxed and acquire/release cost the same for loads/stores. The interesting cost difference is for seq_cst on writes (needs an extra mfence).

Cost on ARM64¶

relaxed: plain LDR/STR.
acquire load: LDAR.
release store: STLR.
acq_rel RMW: LDAXR/STXR with retry, or LDAR+STLR for newer ARMv8.1.
seq_cst: LDAR/STLR plus DMB ISH for full fence.

On ARM, the difference between relaxed and acquire/release is one instruction. Between acq/rel and seq_cst, an additional DMB ISH is required.

Rust `Ordering` Reference¶

Rust's atomics mirror C++:

use std::sync::atomic::Ordering;

let v = AtomicU32::new(0);
v.store(1, Ordering::Release);
let x = v.load(Ordering::Acquire);

Ordering variants: Relaxed, Acquire, Release, AcqRel, SeqCst. (No Consume.)

Rust's compiler (rustc, via LLVM) generates the same instructions as Clang would for equivalent C++. Performance is essentially identical.

The main difference is the type system: Rust forces you to choose an Ordering on every call, which both empowers and forces you to think.

Java `volatile` and `VarHandle`¶

Java's volatile:

A volatile read is an acquire.
A volatile write is a release.
All volatiles share a global ordering (almost like seq-cst but only among volatiles).

So Java's volatile is roughly Go's sync/atomic for individual fields. Stronger than C++ memory_order_acquire/release.

java.util.concurrent.atomic.AtomicInteger.compareAndSet is seq-cst.

java.lang.invoke.VarHandle (Java 9+) exposes explicit memory orderings like C++. Most code doesn't need this.

The Java Memory Model is actually well-defined for racy programs: a racy read returns some value previously written (or the initial 0), never garbage. This is unique among mainstream languages. The cost: compiler optimizations are more constrained.

Go's Sequential Consistency Choice¶

Why did Go pick seq-cst for all atomics?

Simplicity for programmers. No need to reason "is this acq_rel or seq_cst?" Programmers can't shoot themselves in the foot with a too-weak ordering.
DRF-SC is the goal. "Data-race-free programs are sequentially consistent." Most Go code is DRF. Making atomics seq-cst aligns with the DRF-SC theorem.
Modest cost on x86. x86 is essentially TSO (total store order), which is close to seq-cst. The extra cost is minimal for the dominant deployment platform.
Acceptable cost on ARM. ARMv8 has dedicated acquire/release load/store instructions; the extra fence for seq-cst is one DMB per store.
Avoids subtle bugs. Many real-world C++ bugs come from picking acq_rel when seq_cst was needed. Go eliminates this class.

The trade-off: code that could benefit from relaxed ordering on tight loops pays a small cost. For most code, this is invisible. For very hot atomic loops on weakly-ordered hardware, it's a few percent.

The Go team's bet: simplicity > peak performance for the 99% case. So far, it's been a good bet.

How Go Compiles Atomics¶

Let's trace x.Store(1) for var x atomic.Int32 through the compiler.

Step 1: package source¶

package main

import "sync/atomic"

var x atomic.Int32

func main() {
    x.Store(1)
}

Step 2: `atomic.Int32.Store` implementation¶

In src/sync/atomic/type.go:

type Int32 struct {
    _ noCopy
    v int32
}

func (x *Int32) Store(val int32) { StoreInt32(&x.v, val) }

StoreInt32 is declared in src/sync/atomic/doc.go:

func StoreInt32(addr *int32, val int32)

with the body in assembly per architecture.

Step 3: amd64 assembly¶

src/sync/atomic/asm.s redirects to runtime/internal/atomic. The actual implementation for amd64 is in runtime/internal/atomic/asm_amd64.s:

TEXT runtime/internal/atomic·Store(SB), NOSPLIT, $0-12
    MOVQ ptr+0(FP), BX
    MOVL val+8(FP), AX
    XCHGL AX, 0(BX)
    RET

The key instruction is XCHGL (exchange long). This is implicitly locked on x86 (any memory operand of XCHG is treated as having an implicit LOCK prefix), so it acts as a full memory barrier — providing seq-cst semantics.

A simpler MOVL would suffice for release semantics on x86, but to get seq-cst Go uses XCHG (or MOV + MFENCE; XCHG is shorter).

Step 4: amd64 atomic load¶

TEXT runtime/internal/atomic·Load(SB), NOSPLIT, $0-12
    MOVQ ptr+0(FP), AX
    MOVL 0(AX), AX
    MOVL AX, ret+8(FP)
    RET

Just a plain MOVL. On x86, loads are already acquire by default (TSO), so no fence is needed.

Step 5: arm64 store¶

TEXT runtime/internal/atomic·Store(SB), NOSPLIT, $0-12
    MOVD ptr+0(FP), R0
    MOVW val+8(FP), R1
    STLRW R1, (R0)
    RET

STLRW is "store release word." It guarantees that all prior writes by this CPU become visible before this store.

For seq-cst, Go (on older ARMv8) follows STLR with DMB ISH (data memory barrier, inner-shareable). On ARMv8.3+, STLR alone provides RC sc (release consistent with sequential consistency for cross-domain access).

Step 6: arm64 atomic load¶

TEXT runtime/internal/atomic·Load(SB), NOSPLIT, $0-12
    MOVD ptr+0(FP), R0
    LDARW (R0), R0
    MOVW R0, ret+8(FP)
    RET

LDARW is "load acquire word." It guarantees that subsequent reads/writes by this CPU happen after this load is visible globally.

Step 7: CAS on amd64¶

TEXT runtime/internal/atomic·Cas(SB),NOSPLIT,$0-17
    MOVQ ptr+0(FP), BX
    MOVL old+8(FP), AX
    MOVL new+12(FP), CX
    LOCK
    CMPXCHGL CX, 0(BX)
    SETEQ ret+16(FP)
    RET

The LOCK CMPXCHGL instruction is the heart of CAS on x86: atomically compare-and-swap with a LOCK prefix (which makes it a full barrier).

Step 8: CAS on arm64¶

TEXT runtime/internal/atomic·Cas(SB),NOSPLIT,$0-17
    MOVD ptr+0(FP), R0
    MOVW old+8(FP), R1
    MOVW new+12(FP), R2
loop:
    LDAXRW (R0), R3
    CMPW R1, R3
    BNE fail
    STLXRW R2, (R0), R4
    CBNZ R4, loop
    MOVB $1, ret+16(FP)
    RET
fail:
    MOVB $0, ret+16(FP)
    RET

On ARM, CAS is implemented with LL/SC: LDAXR (load exclusive with acquire) and STLXR (store exclusive with release). If the exclusive reservation is lost (between LDAXR and STLXR), STLXR fails and we retry.

This LL/SC pair is more complex than x86's CMPXCHG but allows more flexibility (you can compute on the loaded value before storing).

What this means for performance¶

x86: stores cost an XCHG (~5-8 ns), loads are free (~1 ns), CAS costs LOCK CMPXCHG (~10-15 ns).
ARM64: stores cost STLR + DMB (~5 ns), loads cost LDAR (~2 ns), CAS costs LL/SC with possible retry (~10-20 ns).
RISC-V: similar to ARM64 but with explicit fence instructions.

In all cases, the operations are bounded — wait-free. Performance varies by architecture but is in the same ballpark.

Per-Architecture Cost Models¶

A more detailed look at concurrency primitives on common architectures:

x86-64 (Intel/AMD modern)¶

Operation                      Cycles    ns (3 GHz)
----                            ------    ---------
Plain MOV                       1          0.3
LOCK MOV (XCHG)                 20-40      6-13
LOCK CMPXCHG (success)          25-45      8-15
LOCK CMPXCHG (failure)          ~10        ~3
MFENCE                          5-10       1-3
Empty MOV+RET                   2          0.6
Function call                   3-5        1-2

Notes: numbers vary by microarchitecture. Cache-cold operations are much more expensive.

ARM64 (Apple, AWS Graviton, ARM Server)¶

Operation                      Cycles    ns (3 GHz)
----                            ------    ---------
Plain LDR/STR                   1-3        0.3-1
LDAR (acquire)                  3-5        1-2
STLR (release)                  3-5        1-2
LDAXR + STLXR (CAS)             10-20      3-7
DMB ISH                         3-5        1-2

ARM is generally on par with x86, with slightly cheaper atomics on average but more variance.

RISC-V¶

Operation                      Cycles    ns (clock-dependent)
----                            ------    ---------
Plain lw/sw                     1          ~ns
AMOSWAP                         varies     ~ns
LR/SC pair                      varies     ~ns
FENCE rw,rw                     varies     ~ns

RISC-V is too new for stable cost data; varies wildly by implementation.

PowerPC¶

PowerPC has very weak ordering. Atomics often require LWSYNC fences. Go's atomics on POWER use these fences appropriately.

The Go Runtime and Memory Barriers¶

The Go runtime itself uses memory barriers in several places:

Goroutine scheduling¶

When a goroutine is descheduled and resumed on a different P (processor), the runtime emits appropriate barriers to ensure the goroutine's view of memory is consistent. This is invisible to user code but essential.

Garbage collection¶

The concurrent GC interacts with the user program in nuanced ways. Write barriers (different from memory barriers, despite the name) track reference updates during GC. The runtime emits memory barriers at specific points to ensure GC sees a consistent view.

Stack growth¶

When a goroutine's stack grows, the runtime copies the stack to a new location. This requires careful synchronization to ensure the goroutine resumes with the right pointer adjustments.

Channels¶

Channel operations involve sending/receiving via a heap-allocated hchan. The runtime uses an internal mutex (not sync.Mutex, but runtime.mutex) plus atomic state to coordinate.

`sync.Mutex` slow path¶

When contended, sync.Mutex parks the calling goroutine via runtime_SemacquireMutex. This eventually calls into kernel-level futex (Linux) or equivalent. The wake-up path uses atomics plus runtime scheduling.

For each of these, the runtime carefully chooses primitives to maintain Go's memory model promises. User code doesn't see this complexity — but understanding it helps when debugging runtime-level issues.

Fence Elision¶

Compilers can sometimes elide memory fences when they prove the fence isn't needed.

Example: a release store followed by another release store to the same location. The first release fence is redundant (the second covers everything).

Go's compiler does some elision but not as aggressively as Clang/GCC for C++. The Go team's preference is correctness > peak performance.

Examples where fences are NOT elided:

A release store followed by a non-atomic store followed by another release store. Both releases are necessary (the non-atomic store could be reordered otherwise).
A pair of release stores on different locations. Both are needed.

Examples where fences COULD be elided (and Go may or may not):

Two consecutive release stores on the same location.
A release store immediately followed by a release fence.

For most code, this doesn't matter. For tight atomic loops, you might want to inspect the generated assembly with go tool objdump.

Linux Kernel RCU vs Go RCU¶

The Linux kernel's RCU implementation is famously complex. It handles:

Read-side critical sections: marked with rcu_read_lock() / rcu_read_unlock(). These are no-ops on the read side — just compiler barriers.
Grace periods: the writer waits until all CPUs have passed through a context switch (or other quiescent state). At that point, no reader holds an old pointer.
Deferred reclamation: call_rcu(callback) schedules a function to run after the next grace period.
Synchronize: synchronize_rcu() blocks until the next grace period.

Why is this complex in C? Because:

No GC; memory must be explicitly freed.
Readers can be preempted; you can't assume read-side critical sections are short.
Multiple CPUs run independently; "all readers are done" must be detected explicitly.

In Go, the GC subsumes all of this. A reader holds a *T; as long as it does, the GC won't free it. When the reader returns, the local variable goes out of scope, and the GC can reclaim.

The cost: GC pauses (though Go's GC is concurrent and pause times are typically <1 ms).

For most Go services, this is fine. For real-time systems where >1 ms pauses are unacceptable, you'd implement explicit RCU-like reclamation — but rarely in pure Go.

Hazard Pointers in Depth¶

Hazard pointers are a memory reclamation scheme for lock-free structures. Each reader publishes (in a thread-local atomic slot) the pointers it's currently "holding." Writers, before freeing a pointer, check that no hazard pointer matches.

Pseudo-implementation in Go (simplified, ignoring per-thread storage):

var hazards [MaxThreads]atomic.Pointer[node]

func read(idx int, p atomic.Pointer[node]) *node {
    for {
        n := p.Load()
        hazards[idx].Store(n)
        if p.Load() == n {
            return n // hazard registered before any retire
        }
    }
}

func retire(n *node) {
    waitForHazards(n)
    free(n) // or queue for delayed free
}

func waitForHazards(n *node) {
    for {
        clean := true
        for i := 0; i < MaxThreads; i++ {
            if hazards[i].Load() == n {
                clean = false
                break
            }
        }
        if clean {
            return
        }
        runtime.Gosched()
    }
}

In a language without GC, this is the only way to safely free nodes in a lock-free data structure. In Go, the GC handles this for you, so hazard pointers are rarely needed.

Exception: when you bypass GC (e.g., using sync.Pool to recycle nodes, or unsafe.Pointer for foreign memory), you need explicit reclamation.

The full hazard pointer protocol has subtleties about ordering (the store of the hazard pointer must be visible before the re-check of p.Load()). In C++ this requires memory_order_seq_cst or careful use of fences. In Go, the seq-cst default handles it.

Wait-Free Synchronization Theory¶

Maurice Herlihy's 1991 paper "Wait-Free Synchronization" laid out the theoretical foundations:

A consensus number of an object is the maximum number of threads that can solve consensus using only that object.
Atomic registers (loads/stores) have consensus number 1.
Atomic test-and-set has consensus number 2.
Atomic compare-and-swap has consensus number infinity (∞).

Implication: only CAS (or equivalent) is "universal" — it can implement wait-free versions of any object. Loads/stores alone cannot solve consensus for 2+ threads.

This is why CAS is the workhorse of lock-free programming.

The theorem also implies that you cannot implement a wait-free queue using only atomic loads/stores. You need CAS (or LL/SC, or another universal primitive).

In Go, all the universal primitives are available: atomic.CompareAndSwap, atomic.Swap, atomic.LoadAndStore. Use them for lock-free designs.

Universal Constructions¶

Herlihy also showed that CAS lets you build a wait-free version of any object, mechanically. The construction:

Each operation is encoded as a small struct.
Threads compete to install their operation at the next slot in a log.
After installing, threads "help" by applying logged operations in order.

The construction is wait-free but slow: every operation involves a CAS plus log traversal. In practice, you use it as a proof of feasibility, not for production code.

For production wait-free code, design for the specific problem. Examples:

Wait-free queue: Vyukov's MPMC bounded ring.
Wait-free stack: difficult; usually settle for lock-free.
Wait-free hashmap: Cliff Click's NonBlockingHashMap (Java).

Read-Mostly Reclamation¶

For read-mostly structures, reclamation is the bottleneck. Several approaches:

Epoch-based reclamation¶

Maintain a global epoch counter. Readers enter the current epoch; writers may free nodes from epochs N-2 or older once all readers have passed through.

Implementation requires per-thread epoch tracking. In Go, this can be done with per-P storage via runtime hooks, but it's complex.

Quiescent-state-based reclamation (QSBR)¶

Similar to epoch-based but tied to specific quiescent states (e.g., context switches). Used in Linux kernel RCU.

Stamp-based reclamation¶

Each pointer carries a version stamp. Readers track the highest stamp they've seen; writers free pointers below the minimum.

Reference counting¶

Each pointer carries an atomic refcount. Decrement on release; free when zero.

For Go, the GC handles all of these implicitly. You rarely need to choose.

NUMA Effects¶

On large servers with multiple sockets (NUMA — Non-Uniform Memory Access), memory access cost depends on which socket the data lives on.

Local memory: ~80 ns.
Remote memory: ~120-200 ns.
Cache invalidations across sockets: ~200-500 ns.

For NUMA-aware concurrent code:

Pin goroutines to a socket (Go doesn't expose this directly; use GOMAXPROCS and taskset).
Allocate per-socket data.
Use per-socket sharding rather than global state.

Most Go services don't care because they run on single-socket cloud VMs. For NUMA databases (e.g., Postgres on a big bare-metal server), the runtime/scheduling implications matter.

Designing a New Concurrency Primitive¶

When you design a new primitive, the process:

Specify the contract. What does it guarantee? Acquire? Release? Linearizable? Wait-free?
Sketch the algorithm. Use the smallest set of primitives (atomics, CAS).
Identify happens-before chains. Document every release-acquire pair.
Reason about ABA. If pointers are recycled, you need generation counters or hazard pointers.
Prove correctness. TLA+ or hand-written proof.
Implement. Use sync/atomic; document the contract.
Test. Stress test with -race -count=100.
Benchmark. Compare to alternatives.
Document. Include examples, contract, performance characteristics.

Example: designing a wait-free read snapshot of multiple atomics.

Contract: Snapshot() returns a consistent view of fields X, Y, Z at some instant.

Sketch: seqlock with three atomic fields + a generation counter.

Happens-before: writer increments gen (odd), updates fields, increments gen (even). Reader reads gen, fields, gen. Match → consistent.

ABA: only if gen wraps around. With uint64, this is 2^64 increments — effectively never.

Implementation: see the seqlock section in senior.md.

This is the workflow for any new primitive.

Formal Verification¶

For library code where correctness must be ironclad, formal verification helps.

TLA+¶

TLA+ models concurrent systems as state machines. You write a specification of the behavior; the TLC model checker explores reachable states.

Example: spec for a lock-free queue.

EXTENDS Naturals, Sequences

VARIABLE queue, locked

Init == queue = <<>> /\ locked = FALSE

Enqueue(v) == 
  /\ ~locked
  /\ locked' = TRUE
  /\ queue' = Append(queue, v)
  /\ UNCHANGED <<>>

Unlock ==
  /\ locked
  /\ locked' = FALSE
  /\ UNCHANGED queue

TLC explores all interleavings, checking invariants. If a violation exists, TLC produces a trace.

Real TLA+ specs for lock-free queues are longer (modeling each CAS, each retry), but the principle is the same.

Promela / SPIN¶

Similar tool, popular for kernel-level verification.

Hand-written proofs¶

A natural-language proof in the comments, citing the memory model axioms. Faster than formal verification; less rigorous.

For Go code, hand-written proofs are typically sufficient. For database engines or message brokers, TLA+ is worth the investment.

Future of Go's Memory Model¶

Go's memory model has been stable since 2009 with a major clarification in 2022. Future evolution:

Atomic types with explicit ordering: There has been discussion of adding atomic.LoadAcquire / atomic.StoreRelease for performance. Not adopted as of Go 1.22.
Wait-free primitives: more wait-free types in the standard library.
Better runtime support for NUMA: opaque to user code, but improving.
Generic atomics: already added (atomic.Pointer[T] in 1.19).

The Go team's philosophy: prefer simplicity, add complexity only with strong evidence. The current memory model is unlikely to change radically.

Summary¶

The professional level demands:

Knowledge of Go's atomics down to the machine instruction.
Cross-language perspective (C++, Rust, Java).
Runtime internals and how the scheduler interacts with memory barriers.
Theoretical foundations: consensus, wait-freedom, universal construction.
Practical reclamation: hazard pointers, epochs, QSBR.
NUMA awareness.
The ability to design and verify new primitives.

If you've absorbed everything in this file, you can:

Contribute to a language runtime.
Design a database engine's concurrency layer.
Diagnose production bugs caused by memory-model violations.
Translate algorithms between languages with different memory models.
Teach concurrency at the staff level.

You are equipped to push the state of the art forward, not just consume it.

Appendix A: Reading the Go Assembly Output¶

To understand what your code becomes, use go tool objdump or compile with -S:

go build -gcflags="-S" -o /dev/null . 2>&1 | less

This dumps the assembly for every function. Look for sync/atomic calls — they translate to inlined instructions or direct calls into runtime/internal/atomic.

For deeper inspection:

go build -o myprog .
go tool objdump -s '^main\.' myprog

Dumps the disassembly of main.* functions. You'll see the actual machine instructions: LOCK CMPXCHG, XCHG, LDAR, STLR, etc.

Example: tracing `x.Store(1)`¶

Source:

var x atomic.Int32

func setX() {
    x.Store(1)
}

Disassembly (amd64):

main.setX:
  MOVL    $1, AX
  MOVQ    main.x(SB), BX
  XCHGL   AX, 0(BX)
  RET

The XCHGL is the seq-cst store. Note: no explicit LOCK prefix needed because XCHG is implicitly locked on memory operands.

Example: tracing `x.Load()`¶

Source:

func getX() int32 {
    return x.Load()
}

Disassembly (amd64):

main.getX:
  MOVQ    main.x(SB), AX
  MOVL    0(AX), AX
  RET

Just a MOV. x86's strong ordering makes plain loads acquire by default.

Example: tracing `x.CompareAndSwap(0, 1)`¶

func casX() bool {
    return x.CompareAndSwap(0, 1)
}

Disassembly:

main.casX:
  XORL    AX, AX
  MOVL    $1, CX
  MOVQ    main.x(SB), DX
  LOCK
  CMPXCHGL CX, 0(DX)
  SETEQ   AX
  RET

LOCK CMPXCHGL is the heart. The SETEQ extracts whether the comparison succeeded.

Appendix B: Memory Ordering on Specific Hardware¶

Different generations of CPU have different memory model details.

Intel x86-64¶

Total Store Order (TSO).
All stores are visible in program order.
Loads can be reordered before stores to different addresses (store buffer).
Atomic operations (LOCK-prefixed) act as full memory barriers.
MFENCE is a full barrier; LFENCE and SFENCE are specific.

AMD x86-64¶

Identical to Intel for ordering purposes (TSO).
Minor microarchitectural differences in atomic performance.

Apple M-series (ARM64)¶

ARMv8.5+ with custom enhancements.
Strong implementation of acquire/release semantics.
LDAR and STLR are first-class instructions.
DMB ISH for full barriers.

AWS Graviton (ARM64)¶

Standard ARMv8.4.
Similar performance to Apple Silicon for atomics.

ARM Cortex-A series (general)¶

Weakly ordered.
LDAR/STLR for acquire/release.
DMB ISH for full fences.

RISC-V¶

Even weaker than ARM.
FENCE rw,rw for full barriers.
Atomic instructions: LR/SC pair, AMOSWAP, AMOADD, etc.

POWER (ppc64)¶

Very weakly ordered.
lwsync for acquire/release.
sync for full fence.

The Go runtime emits the appropriate barriers for each. Your code is portable.

Appendix C: C++ vs Go — Side-by-Side Patterns¶

Pattern: lazy init¶

C++:

#include <atomic>
#include <mutex>

std::atomic<Service*> instance{nullptr};
std::mutex mu;

Service* get() {
    Service* s = instance.load(std::memory_order_acquire);
    if (s) return s;
    std::lock_guard<std::mutex> lock(mu);
    s = instance.load(std::memory_order_relaxed);
    if (!s) {
        s = new Service();
        instance.store(s, std::memory_order_release);
    }
    return s;
}

Go:

var (
    once     sync.Once
    instance *Service
)

func Get() *Service {
    once.Do(func() { instance = newService() })
    return instance
}

Go's sync.Once hides the DCL pattern entirely.

Pattern: lock-free stack¶

C++ (Treiber stack with hazard pointers):

template<typename T>
class Stack {
    struct Node {
        T val;
        std::atomic<Node*> next;
    };
    std::atomic<Node*> head{nullptr};
public:
    void push(T v) {
        Node* n = new Node{std::move(v), nullptr};
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next.store(old, std::memory_order_relaxed);
        } while (!head.compare_exchange_weak(old, n,
            std::memory_order_release, std::memory_order_relaxed));
    }

    std::optional<T> pop() {
        // hazard pointer management omitted
        Node* old = head.load(std::memory_order_acquire);
        while (old && !head.compare_exchange_weak(old, old->next.load(),
            std::memory_order_acquire, std::memory_order_acquire)) {
        }
        if (!old) return {};
        T v = std::move(old->val);
        retire(old);
        return v;
    }
};

Go (Treiber stack):

type Stack[T any] struct {
    head atomic.Pointer[node[T]]
}

type node[T any] struct {
    val  T
    next *node[T]
}

func (s *Stack[T]) Push(v T) {
    n := &node[T]{val: v}
    for {
        n.next = s.head.Load()
        if s.head.CompareAndSwap(n.next, n) {
            return
        }
    }
}

func (s *Stack[T]) Pop() (T, bool) {
    for {
        top := s.head.Load()
        if top == nil {
            var zero T
            return zero, false
        }
        if s.head.CompareAndSwap(top, top.next) {
            return top.val, true
        }
    }
}

Go is dramatically shorter. The GC handles reclamation; no hazard pointers needed.

Pattern: read-mostly state¶

C++ (read-copy-update via shared_ptr):

std::atomic<std::shared_ptr<Config>> current;

std::shared_ptr<Config> get() {
    return current.load();
}

void update(std::shared_ptr<Config> next) {
    current.store(std::move(next));
}

(C++20's atomic handles refcounting.)

Go:

var current atomic.Pointer[Config]

func Get() *Config { return current.Load() }
func Set(c *Config) { current.Store(c) }

Same idea, simpler syntax in Go.

Appendix D: Rust vs Go — Side-by-Side¶

Rust's Arc<T> is the equivalent of Go's GC-managed pointer.

Pattern: lazy init¶

Rust:

use std::sync::OnceLock;

static INSTANCE: OnceLock<Service> = OnceLock::new();

fn get() -> &'static Service {
    INSTANCE.get_or_init(|| Service::new())
}

Go:

var (
    once     sync.Once
    instance *Service
)

func Get() *Service {
    once.Do(func() { instance = newService() })
    return instance
}

Roughly equivalent.

Pattern: atomic counter¶

Rust:

use std::sync::atomic::{AtomicI64, Ordering};

static COUNTER: AtomicI64 = AtomicI64::new(0);

pub fn inc() { COUNTER.fetch_add(1, Ordering::SeqCst); }
pub fn get() -> i64 { COUNTER.load(Ordering::SeqCst) }

Go:

var counter atomic.Int64

func Inc() { counter.Add(1) }
func Get() int64 { return counter.Load() }

Notice Rust forces Ordering choice; Go always uses seq-cst.

Pattern: channel-based fan-out¶

Rust (using crossbeam):

use crossbeam_channel::{bounded, Sender, Receiver};

let (tx, rx) = bounded::<i32>(10);
for _ in 0..4 {
    let rx = rx.clone();
    std::thread::spawn(move || {
        while let Ok(v) = rx.recv() {
            process(v);
        }
    });
}

Go:

ch := make(chan int, 10)
for i := 0; i < 4; i++ {
    go func() {
        for v := range ch {
            process(v)
        }
    }()
}

Go's syntax is more compact; Rust requires explicit cloning of receivers.

Appendix E: Java's Memory Model — Detail¶

Java has a unique memory model:

Volatile: read = acquire, write = release. All volatiles are totally ordered.
Synchronized: provides full memory barriers around enter/exit. Lock acquire is acquire; release is release.
Final fields: a special guarantee that, after construction, final fields are visible to all readers without synchronization. This makes String and Integer thread-safe by construction.
Atomic classes (AtomicInteger, etc.): seq-cst semantics.
VarHandle (Java 9+): exposes C++-like explicit orderings.

Java's volatile is roughly equivalent to Go's sync/atomic for a single field — both provide release/acquire.

But Java's memory model also defines behavior under data races: a racy read returns some value previously written. This is stronger than Go (where racy reads are undefined behavior).

The cost: Java compilers cannot reorder as aggressively. Many optimizations that GCC/Clang perform on C++ are forbidden by the JMM.

Appendix F: Why Go Doesn't Expose `memory_order`¶

The Go team's reasoning, paraphrased:

Most Go code doesn't need it. 99% of concurrency in Go is through channels and mutexes; the memory ordering is implicit.
Programmers get it wrong. C++ memory_order is famously misused. Even experts have written buggy lock-free code.
Seq-cst is good enough. The performance overhead vs. acq/rel is small (a few ns) and only matters in atomic-heavy hot loops.
Simpler model is easier to specify. Go's memory model is shorter than C++'s precisely because it avoids the matrix of orderings.
If you need it, you can drop to assembly. For runtime contributors or extreme performance, runtime/internal/atomic (internal) and Go assembly give finer control.

The Go community has discussed adding atomic.LoadAcquire / atomic.StoreRelease periodically. It hasn't been adopted because:

The cost savings are marginal for most code.
It would add complexity to a deliberately simple model.
Existing patterns (atomic.Pointer with CoW, sync.Once) cover the high-value cases.

This may change in the future, but as of Go 1.22, seq-cst is the only choice.

Appendix G: The Cost of Sequential Consistency, Measured¶

Let's measure the real cost of seq-cst vs. release-only.

Benchmark setup:

package main

import (
    "sync/atomic"
    "testing"
)

var v atomic.Int32

func BenchmarkAtomicStoreSeqCst(b *testing.B) {
    for i := 0; i < b.N; i++ {
        v.Store(int32(i))
    }
}

Equivalent C++ benchmark with memory_order_release:

#include <atomic>

std::atomic<int> v;

void bench(int n) {
    for (int i = 0; i < n; i++) {
        v.store(i, std::memory_order_release);
    }
}

Same with memory_order_seq_cst:

void bench_sc(int n) {
    for (int i = 0; i < n; i++) {
        v.store(i, std::memory_order_seq_cst);
    }
}

Results (approximate, x86-64 modern):

C++ release: ~5 ns/op
C++ seq-cst: ~6 ns/op
Go seq-cst: ~6 ns/op

ARM64 results:

C++ release: ~3 ns/op (STLR alone)
C++ seq-cst: ~5 ns/op (STLR + DMB ISH)
Go seq-cst: ~5 ns/op

The seq-cst penalty is 1-2 ns per store. For most code, negligible. For tight atomic loops at millions of ops per second, ~5-10% throughput cost.

Appendix H: Production Story — A Real Atomic Bug¶

A production incident:

A team had a counter incremented from many goroutines via atomic.AddInt64. The counter feeds metrics every 10 seconds.

The metric occasionally jumped backward by ~50% then recovered. Investigation:

The counter type was correct (atomic.Int64).
The increment was correct (Add(1)).
The read was via Load().

After a week of debugging, they realized: the metrics handler also reset the counter every 10 seconds via Store(0). The reset was racing with concurrent increments.

go func() {
    for {
        time.Sleep(10 * time.Second)
        report(counter.Load())
        counter.Store(0) // RACE: increments lost
    }
}()

If an increment landed between the Load and the Store, it was lost (the Store overwrote it).

Fix: use Swap(0) to atomically read-and-reset:

v := counter.Swap(0)
report(v)

Now the read and reset are atomic. No increments lost.

Lesson: even with correct atomics, the protocol must be correct. Atomics provide atomicity per-operation; you must compose them correctly.

Appendix I: A NUMA Scaling Case Study¶

A team running PostgreSQL on a 4-socket server (128 cores) observed weird scaling: throughput peaked at 64 connections, decreased thereafter.

Investigation: every connection was spinning on a shared atomic counter (PostgreSQL's xlog.PgXact). At 64+ cores, cache-line bouncing between sockets dominated.

Fix in PG: shard the counter per-socket. Each socket increments its local counter; the global value is the sum on read.

This is a NUMA-specific scaling pattern. Few Go services hit this issue because they typically run on single-socket cloud VMs. But for those that do, sharding is the answer.

In Go, the equivalent for a heavily-contended counter:

type ShardedCounter struct {
    shards []paddedInt64
}

type paddedInt64 struct {
    n atomic.Int64
    _ [56]byte
}

func (c *ShardedCounter) Inc() {
    s := getCurrentShard()
    c.shards[s].n.Add(1)
}

func (c *ShardedCounter) Get() int64 {
    var sum int64
    for i := range c.shards {
        sum += c.shards[i].n.Load()
    }
    return sum
}

Shards equal to NumCPU (or 2x for some workloads) avoid cross-socket contention.

Appendix J: Designing a Lock-Free Hashmap¶

A real lock-free hashmap is among the hardest data structures. Key challenges:

Concurrent resize. The hashmap must grow when full, but readers and writers must keep working during resize.
Deletion. Marking entries as "deleted" without breaking concurrent traversal.
Memory reclamation. Freeing removed nodes safely.

Production implementations:

Java's ConcurrentHashMap (Cliff Click's design, JDK 8+).
C++'s Folly ConcurrentHashMap.
Go's sync.Map.

sync.Map is the simplest. It uses a "read-only snapshot + dirty mutex-protected delta" pattern, not true lock-free.

For a research-quality lock-free hashmap in Go, consult Cliff Click's papers. The Go equivalent would be ~500-1000 lines. Most Go programs don't need this.

Simplified split-ordered list approach¶

A hashmap based on split-ordered linked lists (Shalev & Shavit, 2006):

Maintain a linked list ordered by reverse-bit hash.
Buckets are pointers into this list at specific points.
Inserting follows the linked list (lock-free) and updates the bucket pointer.
Resize doubles the number of buckets; new buckets are inserted into the existing list (lazy).

Implementation involves several atomic Pointer fields, careful CAS sequences, and a bit of reverse-bit math. It's a research-quality implementation in any language.

For Go, the recommendation is: use sync.Map or a sharded sync.RWMutex + map. Don't roll your own lock-free hashmap unless you're prepared for a 6-month project.

Appendix K: Database Engine Concurrency¶

Database engines (Postgres, MySQL, SQLite, RocksDB) use elaborate concurrency mechanisms:

MVCC (Multi-Version Concurrency Control): each row has multiple versions; readers see the snapshot at transaction start.
Write-Ahead Logging (WAL): writes are appended to a log; the log is fsynced periodically.
Two-Phase Locking (2PL): each transaction acquires locks; locks held until commit.
Optimistic Concurrency Control (OCC): transactions proceed without locks; conflict detected at commit time.

Each has implications for memory ordering:

MVCC: snapshot publication needs acquire/release on the version pointer.
WAL: fsync provides a durability barrier (different from memory ordering).
2PL: lock acquisition is acquire; release is release.
OCC: CAS on the row's version stamp.

If you build a storage engine in Go, you'll touch all of these. Understanding acquire/release semantics is necessary.

Appendix L: Message Queue Concurrency¶

Message queues (Kafka, RabbitMQ, NATS) handle:

Per-partition ordering.
At-least-once vs exactly-once delivery.
Concurrent producers and consumers.

The publication semantics within a single broker:

A producer's send must be durable (fsync) before ack.
A consumer's offset commit must be durable before reusing the consumer slot.

For an in-process Go queue (like a worker pool), the publication is simpler: channels (or atomic ring buffers) handle it.

Appendix M: A Walkthrough of `sync/atomic` Source¶

Read these files in your Go installation:

src/sync/atomic/doc.go: declarations of all functions.
src/sync/atomic/type.go: type wrappers (Int32, Pointer[T], etc.).
src/sync/atomic/value.go: atomic.Value implementation.
src/runtime/internal/atomic/: per-architecture assembly.

You'll see:

Type wrappers are minimal — they delegate to runtime/internal/atomic.
atomic.Value uses an internal mutex (fastpathlock) for type checking on first store.
Assembly is short, mostly one or two instructions per operation.

Reading the runtime/internal/atomic per-architecture files is illuminating. You'll see the exact assembly emitted for amd64, arm64, ppc64, riscv64, mips, etc.

Appendix N: Implementing a Wait-Free Counter¶

A wait-free counter sounds trivial: atomic.Int64.Add(1). But under extreme contention, each Add takes longer because of cache invalidation across cores.

True wait-freedom under contention: each goroutine increments its own per-CPU counter; readers sum them.

type WFCounter struct {
    shards []paddedInt64
}

func NewWFCounter() *WFCounter {
    return &WFCounter{shards: make([]paddedInt64, runtime.NumCPU())}
}

func (c *WFCounter) Inc() {
    p := getProcID() // runtime-specific
    c.shards[p].n.Add(1)
}

func (c *WFCounter) Read() int64 {
    var s int64
    for i := range c.shards {
        s += c.shards[i].n.Load()
    }
    return s
}

Each Inc is wait-free per goroutine (one atomic operation on a private cache line). Read is O(NumCPU).

Caveats:

getProcID() isn't directly exposed in Go. The closest equivalent is runtime_procPin (internal). Hashing the goroutine ID is a workaround.
Read sums shards at some moment; the sum may not match any single instant. For monitoring, this is fine.

Appendix O: Implementing a Hierarchical Counter¶

For very high-throughput counters with many readers, hierarchical counters reduce read cost:

Level 0: per-CPU shards (fast Inc).
Level 1: aggregated periodically into a smaller set of shards.
Level 2: global counter.

Reads at level 2 are cheap (one atomic load) but may be slightly stale. Writes at level 0 are wait-free.

This pattern is used in eBPF performance counters and similar telemetry.

In Go, implementing this is non-trivial because the aggregation thread must run periodically and the GC must handle the cross-level pointers. For most monitoring use cases, the simple sharded counter is good enough.

Appendix P: Beyond Acquire/Release — Release Consistency Variants¶

Modern hardware sometimes supports stronger or weaker forms than acquire/release:

Release Consistency (RC): the canonical form.
Lazy Release Consistency (LRC): writes are buffered until acquire.
Eager Release Consistency: writes are published on release without waiting for acquire.
Entry Consistency: each shared object has its own synchronization variable.
Causal+ Consistency: causality plus per-object orderings.

These are research models for distributed systems and some hardware. Go uses RC for atomics (under seq-cst as a stronger form).

If you work on distributed systems (not single-process Go), you'll encounter these. The relationships are formalized in Adve & Hill, "Weak Ordering — A New Definition" (1990) and subsequent papers.

Appendix Q: Cache Coherence Protocols¶

CPU caches use coherence protocols to keep multiple cores in sync.

MESI (Modified, Exclusive, Shared, Invalid)¶

Four states per cache line:

Modified: the line is dirty; this cache has the only copy.
Exclusive: the line matches memory; this cache has the only copy.
Shared: multiple caches have read-only copies.
Invalid: the line is not in this cache.

Transitions:

Read miss → Shared (if elsewhere) or Exclusive (if not).
Write miss → invalidate other caches → Modified.
Other write → Invalid here, Modified elsewhere.

MOESI (adds Owned)¶

Owned: a hybrid of Shared (others have copies) and Modified (this cache must write back). Used in AMD CPUs.

MESIF (adds Forward)¶

Forward: one cache is responsible for sourcing a Shared line on miss. Reduces cache-to-cache transfer overhead in NUMA. Used in Intel CPUs.

For programmers, the protocol details rarely matter — but they explain why cross-core writes are expensive (state transitions) and why false sharing hurts (innocent writes invalidate each other's cache lines).

Appendix R: When the Compiler Lies¶

Compilers may reorder code in ways you don't expect, as long as the reorder is invisible within a single goroutine. To force the compiler not to reorder, use atomic operations.

Example:

func f() {
    x = 1
    y = 2
}

The compiler may emit y = 2; x = 1 if it sees no dependency. For single-goroutine code, this is fine.

For multi-goroutine code, the atomic acts as a barrier:

func f() {
    x = 1
    atomic.StoreInt32(&y, 2) // also a compiler barrier
}

Now x = 1 must happen before y = 2 (both in source code order and in the emitted assembly).

For non-atomic shared memory, use a mutex:

func f() {
    mu.Lock()
    x = 1
    y = 2
    mu.Unlock()
}

The Lock and Unlock are compiler barriers; nothing inside is reordered out.

Appendix S: Atomicity Granularity¶

Go's atomic operations work on specific sizes:

atomic.Int32/atomic.Uint32: 32-bit.
atomic.Int64/atomic.Uint64: 64-bit (aligned).
atomic.Uintptr: pointer-sized.
atomic.Pointer[T]: pointer-sized.
atomic.Bool: implemented as uint32.
atomic.Value: variable-size, mutex-protected.

For other sizes:

Pack into one of the above. E.g., two int16s in an int32.
Use an atomic.Pointer[T] to a heap-allocated struct.
Use a mutex.

128-bit atomics (CMPXCHG16B on x86-64, paired CAS on ARM) are not exposed in sync/atomic. They're available in runtime/internal/atomic but not part of the public API.

Appendix T: Atomicity and Alignment¶

64-bit atomic operations require 8-byte alignment on most architectures. Go's atomic.Int64 etc. struct types guarantee alignment.

For raw int64 fields:

type Bad struct {
    flag int32
    n    int64 // may be at offset 4 — misaligned on 32-bit ARM
}

On 64-bit platforms, this is fine. On 32-bit ARM, atomic.AddInt64(&b.n, 1) crashes.

Fix:

type Good struct {
    n    atomic.Int64 // wrapper guarantees alignment
    flag int32
}

Or use a _ [4]byte pad:

type Manual struct {
    flag int32
    _    [4]byte
    n    int64 // now aligned
}

Modern Go (1.19+) prefers the struct wrapper.

Appendix U: Cross-Goroutine State and Stack Movement¶

Go's runtime moves goroutine stacks when they grow. If a non-atomic pointer to a stack variable is shared across goroutines, the stack movement can invalidate the pointer.

This is rare because:

Most shared state is heap-allocated.
The compiler usually escapes captured variables to the heap.

But beware:

func bad() {
    var x int
    go func() {
        // captures &x; may escape to heap implicitly
        atomic.StoreInt32((*int32)(unsafe.Pointer(&x)), 1)
    }()
}

If x escapes to heap, fine. If somehow it doesn't (rare), the pointer could move. Don't rely on this.

The lesson: don't share pointers to local variables across goroutines via unsafe. Stick to heap-allocated state.

Appendix V: Profiling Memory Barriers¶

You can sometimes see memory barriers in CPU profiles:

$ go test -cpuprofile=cpu.out -bench=AtomicHeavy
$ go tool pprof -text cpu.out

Look for:

runtime.atomicstore64: 64-bit atomic store.
runtime.atomicload64: 64-bit atomic load.
runtime.casgo: CAS for pointer.
runtime.lock/runtime.unlock: internal mutex.
sync.(*Mutex).lockSlow: contended mutex path.

If lockSlow is high, you have mutex contention. If casgo is high with many retries, CAS is contended.

The block profile (runtime.SetBlockProfileRate(1)) shows where goroutines block. The mutex profile shows which mutexes are contended.

For atomic-heavy code, look at the disassembly to see exactly what instructions are emitted.

Appendix W: Reasoning About Concurrent Code Mathematically¶

For very rigorous reasoning, encode concurrent code as a transition system:

States: tuples of all shared variables and per-thread program counters.
Transitions: enabled atomic actions.
Invariants: properties that hold in all reachable states.

Tools like TLA+ explore this transition system exhaustively (for small enough state spaces).

For Go code, the manual reasoning might look like:

Invariant: at any time, either `ptr == nil` or `ptr->initialized == true`.

Proof:
1. The only place `ptr` is written is at line 42, which sets `ptr->initialized = true` before the store.
2. The store is atomic.Store, which is a release fence.
3. Any acquire on ptr observes the released value.
4. The release fence ensures `initialized = true` is visible before `ptr` is non-nil.
5. Therefore the invariant holds.

This level of rigor is rare in application code but standard for kernel/runtime code.

Appendix X: Production Concurrency Stories¶

X.1 — The runaway goroutine¶

A team noticed memory usage growing slowly over days. runtime.NumGoroutine() showed 100,000+ goroutines. Investigation: each request started a goroutine that waited on a channel, but the channel was never sent to.

Fix: use context for cancellation.

X.2 — The deadlock that wasn't¶

A team reported a deadlock under load. go tool trace showed a goroutine waiting on a mutex held by another goroutine. But the holder was also "waiting" — actually doing slow I/O.

Fix: don't hold mutexes during slow I/O. Use a pattern like:

mu.Lock()
key := pickKey()
mu.Unlock()
value := slowFetch(key)
mu.Lock()
cache[key] = value
mu.Unlock()

X.3 — The torn map¶

A team's map[string]int mysteriously had inconsistent values. Race detector confirmed: concurrent reads and writes.

Fix: sync.RWMutex around the map.

X.4 — The miscounted metric¶

A counter showed wrong totals because the reader did Load + Store(0) without atomic combination. Concurrent increments were lost.

Fix: Swap(0) for read-and-reset.

X.5 — The forgotten cancellation¶

A worker pool kept processing items even after shutdown. The shutdown signal closed a channel, but workers were busy with current items and didn't check.

Fix: check context at each iteration:

for {
    select {
    case <-ctx.Done():
        return
    case item := <-jobs:
        process(item)
    }
}

Appendix Y: The Future of Atomics in Go¶

Likely future additions:

atomic.LoadAcquire / atomic.StoreRelease for explicit ordering.
More wait-free types (e.g., atomic.Queue[T]).
Better runtime support for NUMA.
Generic atomic structs (currently you wrap with atomic.Pointer[T]).

Unlikely:

Relaxed memory ordering (Go team prefers simplicity).
128-bit atomics (rarely needed).
Hardware transactional memory exposed (HTM has been a research disappointment).

The Go team prioritizes pragmatic improvements over feature parity with C++ or Rust.

Appendix Z: A Closing Thought¶

Concurrency is fundamentally about time and visibility. Acquire/release semantics codify the rules of "what happens before what" across multiple cores.

Go gives you a simple, opinionated answer: seq-cst for atomics, well-defined synchronization for higher-level primitives, undefined behavior for races. This trades a small performance margin for huge gains in programmer productivity.

At the professional level, you understand all the layers: the language model, the runtime, the compiler, the hardware. You can reason about concurrent code with rigor that matches academic papers. You can design new primitives.

Whether you ever use this depth in your day job depends on your role. For a runtime contributor, every page; for a typical application developer, the senior level suffices.

But knowing the depth is what separates technical mastery from competence.

End of professional level — continued below.

Appendix AA: Implementing `sync.RWMutex` from Scratch¶

To understand sync.RWMutex, let's build one. The contract:

RLock(): acquires a shared lock. Multiple readers may hold simultaneously.
RUnlock(): releases a shared lock.
Lock(): acquires an exclusive lock. Blocks until no readers and no other writer.
Unlock(): releases an exclusive lock.

Invariants:

At most one writer at a time.
No readers while a writer is active.
A waiting writer eventually acquires (writer preference).

type RWMutex struct {
    readers atomic.Int32
    writer  atomic.Bool
    writeMu sync.Mutex
    cond    *sync.Cond
    condMu  sync.Mutex
}

func (rw *RWMutex) RLock() {
    for {
        if rw.writer.Load() {
            rw.condMu.Lock()
            for rw.writer.Load() {
                rw.cond.Wait()
            }
            rw.condMu.Unlock()
            continue
        }
        rw.readers.Add(1)
        if rw.writer.Load() {
            rw.readers.Add(-1)
            continue
        }
        return
    }
}

func (rw *RWMutex) RUnlock() {
    if rw.readers.Add(-1) == 0 && rw.writer.Load() {
        rw.condMu.Lock()
        rw.cond.Signal()
        rw.condMu.Unlock()
    }
}

func (rw *RWMutex) Lock() {
    rw.writeMu.Lock()
    rw.writer.Store(true)
    rw.condMu.Lock()
    for rw.readers.Load() > 0 {
        rw.cond.Wait()
    }
    rw.condMu.Unlock()
}

func (rw *RWMutex) Unlock() {
    rw.writer.Store(false)
    rw.condMu.Lock()
    rw.cond.Broadcast()
    rw.condMu.Unlock()
    rw.writeMu.Unlock()
}

The standard library's sync.RWMutex is more efficient (uses semaphores), but this illustrates the protocol.

Publication: readers.Add and writer.Store are atomic. The condition variable adds wait/wake on top.

Appendix AB: Implementing a Channel from Atomics¶

A channel using mutex + condition variables (simplified):

type AtomicChan[T any] struct {
    buf    []T
    cap    uint64
    head   uint64
    tail   uint64
    mu     sync.Mutex
    notFull  sync.Cond
    notEmpty sync.Cond
    closed atomic.Bool
}

func (c *AtomicChan[T]) Send(v T) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    for c.head-c.tail >= c.cap && !c.closed.Load() {
        c.notFull.Wait()
    }
    if c.closed.Load() {
        return false
    }
    c.buf[c.head%c.cap] = v
    c.head++
    c.notEmpty.Signal()
    return true
}

func (c *AtomicChan[T]) Recv() (T, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    for c.head == c.tail && !c.closed.Load() {
        c.notEmpty.Wait()
    }
    if c.head == c.tail {
        var zero T
        return zero, false
    }
    v := c.buf[c.tail%c.cap]
    c.tail++
    c.notFull.Signal()
    return v, true
}

func (c *AtomicChan[T]) Close() {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.closed.Store(true)
    c.notFull.Broadcast()
    c.notEmpty.Broadcast()
}

The standard library's chan is hand-tuned in the runtime — thousands of lines. This sketch is illustrative.

Appendix AC: Lock-Free vs Fine-Grained Locks¶

For most code, fine-grained locking is the right answer:

type StripedMutex struct {
    locks [256]sync.Mutex
}

func (s *StripedMutex) Lock(k string) {
    h := fnv32(k)
    s.locks[h%256].Lock()
}

func (s *StripedMutex) Unlock(k string) {
    h := fnv32(k)
    s.locks[h%256].Unlock()
}

Different keys map to different mutexes; contention is reduced by 256x in the best case.

Lock-free has its place (tight critical sections, real-time), but fine-grained covers the common case at a fraction of the complexity.

Appendix AD: Hardware Transactional Memory¶

Intel TSX, IBM POWER, others support HTM:

if (_xbegin() == _XBEGIN_STARTED) {
    x = 1; y = 2;
    _xend();
} else {
    // fallback
}

Transactions abort on conflict; you need a lock-based fallback. Aborts are common; debugging is hard.

Go doesn't expose HTM. Intel deprecated TSX in 2021 for security reasons. HTM may return; as of 2026, not mainstream.

Appendix AE: Persistent Memory and Memory Ordering¶

Persistent memory (Optane, NVDIMMs) survives power loss. New primitives:

CLFLUSHOPT, CLWB: flush cache lines.
SFENCE: store fence for persistence.

Programming model:

data = newValue;
_mm_clwb(&data);
_mm_sfence();
// data is now durable

Go doesn't directly support persistent memory. Use cgo + libpmem for niche needs. The Go memory model has no notion of durability.

Appendix AF: GPU Concurrency¶

GPUs have different memory models. CUDA's __threadfence() is a full barrier; __threadfence_block() is local to a block.

Go doesn't natively support GPU. Use cgo for CUDA/OpenCL. The CPU-GPU boundary has its own synchronization primitives.

Appendix AG: Distributed Consistency Models¶

Distributed systems have consistency models:

Linearizability: atomic, totally ordered.
Sequential consistency: all nodes agree on order.
Causal consistency: causal pairs ordered.
Eventual consistency: replicas eventually converge.

Acquire/release in single-process Go gives linearizability locally. Across processes, you need consensus algorithms (Paxos, Raft).

Appendix AH: Consensus Algorithms¶

In Go:

hashicorp/raft
etcd-io/raft

These run on top of Go's memory model and provide cross-node consistency.

Appendix AI: Traced Operations¶

AI.1 — `m.Store(k, v)` on `sync.Map`¶

Hash k.
Load "read" pointer (atomic).
If k in read, CAS its value entry.
Else acquire dirty mutex, add to dirty, release.

Publication: each atomic Load/Store and mutex provides acq/rel.

AI.2 — `g.Wait()` on `errgroup.Group`¶

Calls wg.Wait().
Each Done is a release.
Wait acquires when counter reaches 0.

AI.3 — `ch <- v` then `<-ch`¶

Sender locks channel mutex, copies into buffer, signals.
Receiver locks mutex, copies out, signals.

Publication: writes before send visible after receive.

Appendix AJ: When Does Go Reorder?¶

Common intra-goroutine optimizations:

Loop-invariant code motion.
Common subexpression elimination.
Dead store elimination.
Code straightening.

Atomic operations and sync primitives are compiler barriers. The compiler does not reorder across them.

Appendix AK: Production Notes¶

Race detector overhead: ~5-10x. Use in tests, not prod.
Mutex contention shows in pprof's lockSlow.
Atomics scale well usually.
Channels bounded ~500K ops/sec. Higher? Shard.
GC doesn't interfere with atomics.
NUMA rarely matters on cloud VMs.

Appendix AL: Decision Flowchart¶

Shared between goroutines? - NO → local, no sync.
                              YES → continue.

Immutable after publish? - YES → atomic.Pointer (read-heavy) or sync.Once (init).
                          NO → continue.

Multi-step transaction? - YES → sync.Mutex.
                         NO → continue.

Single-word? - YES → sync/atomic.
              NO → sync.Mutex.

Need to signal/pass value? - YES → channel.

Appendix AM: Acceptance Criteria¶

You're professional-level if you can:

Write DCL without docs.
Implement Treiber stack in <30 lines.
Explain memory model in 5 minutes.
Trace atomic.Pointer.Store to assembly.
Fix a publication bug from race report.
Choose primitive by workload analysis.
Profile and optimize atomic paths.
Design new concurrent primitive.
Understand fence elision.
Translate concurrency between languages.

Appendix AN: References¶

Go source: src/sync/, src/sync/atomic/, src/runtime/.
Linux kernel RCU docs.
C++ standard memory model.
Rust nomicon.
Postgres source.
CockroachDB (Go).
TiKV (Rust).

Appendix AO: Wrap-Up¶

You've reached the end of the professional file. You should understand:

The full stack: source → IR → assembly → machine code.
Cross-language concurrency.
Cost models per architecture.
Runtime's role in barriers.
Reclamation beyond GC.
Cache-coherence and NUMA.

From here, deeper mastery comes from building systems that exercise these patterns.

Go forth.

End of the AO section.

Appendix AP: Deep Dive — `sync.Mutex` Internals¶

Go's sync.Mutex is sophisticated. Let's trace through its implementation.

type Mutex struct {
    state int32
    sema  uint32
}

const (
    mutexLocked = 1 << iota
    mutexWoken
    mutexStarving
    mutexWaiterShift = iota
    starvationThresholdNs = 1e6
)

The state field packs: - Bit 0: locked. - Bit 1: woken (a waiter was just woken). - Bit 2: starvation mode. - Bits 3+: waiter count.

Fast path: uncontended Lock¶

func (m *Mutex) Lock() {
    if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
        return
    }
    m.lockSlow()
}

One CAS. Most lock acquisitions take this path.

Slow path¶

Spins briefly. If still locked, enqueues on semaphore. After 1 ms wait, enters starvation mode — next holder hands lock directly to longest-waiting goroutine, preventing starvation.

Unlock¶

func (m *Mutex) Unlock() {
    new := atomic.AddInt32(&m.state, -mutexLocked)
    if new != 0 {
        m.unlockSlow(new)
    }
}

Publication: every Lock's CAS is acquire; every Unlock's atomic subtract is release.

Appendix AQ: Deep Dive — `sync.Once`¶

type Once struct {
    done atomic.Uint32
    m    Mutex
}

func (o *Once) Do(f func()) {
    if o.done.Load() == 0 {
        o.doSlow(f)
    }
}

func (o *Once) doSlow(f func()) {
    o.m.Lock()
    defer o.m.Unlock()
    if o.done.Load() == 0 {
        defer o.done.Store(1)
        f()
    }
}

This is the canonical DCL. Fast path: one atomic load.

Subtleties: - The load is atomic to avoid race. - Inner check inside mutex avoids running f twice. - defer o.done.Store(1) ensures even panic marks done.

Publication: done.Store(1) is a release. Subsequent done.Load() == 1 is an acquire.

Appendix AR: Deep Dive — `sync.WaitGroup`¶

type WaitGroup struct {
    state1 atomic.Uint64 // high 32: counter, low 32: waiter count
    state2 atomic.Uint32 // semaphore
}

Packs counter and waiter count into one atomic to avoid race when counter hits 0 just as a Wait increments waiters.

Each Done is acq-rel. When counter hits 0, runtime signals all waiters. Writes before Done visible after Wait returns.

Appendix AS: Deep Dive — `sync.Map`¶

Two underlying maps: - read: atomic-Pointer to read-only snapshot. - dirty: mutex-protected.

Reads check read first (lock-free). Misses fall through to dirty under mutex. When misses exceed read size, dirty is promoted to read.

Read src/sync/map.go — ~400 lines of careful design.

Appendix AT: Deep Dive — `chan`¶

type hchan struct {
    qcount   uint
    dataqsiz uint
    buf      unsafe.Pointer
    elemsize uint16
    closed   uint32
    elemtype *_type
    sendx    uint
    recvx    uint
    recvq    waitq
    sendq    waitq
    lock     mutex
}

Send: 1. Lock chan. 2. If receiver waiting → direct hand-off, wake. 3. Else if buffer has space → store, advance. 4. Else → park sender, release lock.

Receive: symmetric.

Close: lock, set flag, wake all waiters.

Publication: the lock provides acq/rel for buffered transfers; direct hand-offs synchronize both sides under the lock.

Appendix AU: Cost Comparison¶

Microbenchmarks (rough, uncontended):

plain memory access              ~0.3 ns
atomic.Load(uint32)              ~1 ns
atomic.Store(uint32)             ~5 ns
atomic.CompareAndSwap            ~8 ns
sync.Mutex.Lock+Unlock           ~15 ns
sync.RWMutex.RLock+RUnlock       ~20 ns
chan send/recv buffered          ~80 ns
chan send/recv unbuffered        ~200 ns
sync.Once.Do cached              ~1 ns

Contention can amplify any of these 10-100x.

Appendix AV: When Performance Matters¶

HFT.
Real-time A/V.
Hot path of network proxy.
Storage engine.
Game server tick.

For these, profile relentlessly. Choose the lightest primitive.

For typical web services, readability wins.

Appendix AW: Proper Spinlocks¶

type SpinLock struct {
    flag atomic.Bool
}

func (s *SpinLock) Lock() {
    for {
        if s.flag.CompareAndSwap(false, true) {
            return
        }
        for s.flag.Load() {
            runtime.Gosched()
        }
    }
}

func (s *SpinLock) Unlock() {
    s.flag.Store(false)
}

First CAS; if fails, wait via relaxed load (doesn't ping cache). Use only for <100 ns critical sections.

Appendix AX: NUMA Sharding¶

type NUMAAware struct {
    perSocket [4]*Cache
}

func (n *NUMAAware) Get(k string) (Value, bool) {
    socket := currentSocket()
    return n.perSocket[socket].Get(k)
}

Requires runtime.LockOSThread or taskset to pin. Rarely needed on cloud VMs.

Appendix AY: Lock-Free Refcounting¶

type Refcounted struct {
    refs atomic.Int32
    val  any
}

func (r *Refcounted) Inc() { r.refs.Add(1) }
func (r *Refcounted) Dec() {
    if r.refs.Add(-1) == 0 {
        // free
    }
}

Only Inc from a context already holding a reference. Rarely needed in Go.

Appendix AZ: Polling Channels¶

select {
case v := <-ch:
    // got
default:
    // would block
}

Useful for opportunistic check without blocking.

Appendix BA: Libraries to Study¶

golang.org/x/sync/errgroup
golang.org/x/sync/singleflight
golang.org/x/sync/semaphore
github.com/hashicorp/raft
github.com/etcd-io/etcd
github.com/cockroachdb/cockroach

Each is a master class. Read patterns; emulate.

Appendix BB: Concurrency Bug Database¶

Search for "data race," "deadlock," "memory ordering" in:

github.com/golang/go issues.
Kubernetes issues.
etcd issues.
CockroachDB issues.

Postmortems are gold.

Appendix BC: Diagrams¶

GO'S MEMORY MODEL HIERARCHY
===========================

      Application Code
            |
      sync, sync/atomic, channels
            |
      Compiler (gc / gccgo)
            |
      runtime/internal/atomic
            |
      Per-arch assembly (XCHG, LDAR, etc.)
            |
      Hardware memory model (TSO, ARMv8, etc.)

PUBLICATION STAGES
==================

Producer:                       Consumer:
[build *T]                          
    |                              
[release]  ──synchronizes-with──► [acquire]
                                      |
                                  [use *T]  ← writes visible

Appendix BD: Final Wrap¶

You've read the professional file. From here, mastery comes from:

Implementing systems.
Reading source.
Reading papers.
Discussing with peers.
Teaching.

Concurrency is hard. Acquire/release is the foundation. Build on it.

End of professional file, continued below.

Appendix BE: Inside the Go Scheduler¶

The Go scheduler (GMP — Goroutines, Machine threads, Processors) is intertwined with memory ordering. When a goroutine migrates between Ms (OS threads), the runtime emits barriers.

You don't write code for this directly. But knowing it explains why your atomic operations always observe consistent state, even with goroutine migration.

Appendix BF: Channel Lock Elision¶

For single-sender/single-receiver channels with no contention, Go's runtime can sometimes skip the internal mutex. Pure optimization; semantics identical.

Appendix BG: GC and Atomics¶

Concurrent GC has brief stop-the-world pauses (<1 ms). During STW, all goroutines pause at safe points. Atomic operations naturally compose with GC; correctness is preserved.

Go's GC is non-moving (mostly), so atomic pointers don't change addresses during GC. This is friendlier than moving GCs for lock-free pointer structures.

Appendix BH: Goroutine-Local Storage¶

Go doesn't expose TLS. Alternative: sync.Map keyed by goroutine ID. The community generally avoids this pattern; prefer passing context explicitly.

Appendix BI: A Highly-Concurrent Counter¶

For millions of increments per second:

type ConcurrentCounter struct {
    cells [256]paddedInt64
}

type paddedInt64 struct {
    val atomic.Int64
    _   [56]byte
}

func (c *ConcurrentCounter) Add(delta int64) {
    idx := getProcID() % 256
    c.cells[idx].val.Add(delta)
}

func (c *ConcurrentCounter) Sum() int64 {
    var s int64
    for i := range c.cells {
        s += c.cells[i].val.Load()
    }
    return s
}

Each P writes to its own cell; Sum reads all. 16 KB memory; wait-free Add; O(256) Sum.

Appendix BJ: Sharded RWMutex Map¶

type ShardedMap[K comparable, V any] struct {
    shards [256]struct {
        mu sync.RWMutex
        m  map[K]V
    }
}

Each shard has its own RWMutex. Contention distributed across 256 buckets.

Appendix BK: Common Bottlenecks¶

Global mutex on hot map → shard.
Single semaphore → per-shard.
GC pressure → sync.Pool.
Single channel → multiple channels.
CAS retry loops under contention → mutex (often faster).
False sharing → padding.

Profile first.

Appendix BL: Reference MPSC Queue¶

Vyukov's MPSC (simplified):

type MPSC[T any] struct {
    head atomic.Pointer[mpscNode[T]]
    tail *mpscNode[T]
    stub mpscNode[T]
}

type mpscNode[T any] struct {
    next atomic.Pointer[mpscNode[T]]
    val  T
}

func (q *MPSC[T]) Push(v T) {
    n := &mpscNode[T]{val: v}
    prev := q.head.Swap(n)
    prev.next.Store(n)
}

func (q *MPSC[T]) Pop() (T, bool) {
    tail := q.tail
    next := tail.next.Load()
    if tail == &q.stub {
        if next == nil { var zero T; return zero, false }
        q.tail = next
        tail = next
        next = next.next.Load()
    }
    if next != nil {
        q.tail = next
        return next.val, true
    }
    head := q.head.Load()
    if tail != head { return q.Pop() }
    var zero T
    return zero, false
}

Wait-free producers, lock-free consumer.

Appendix BM: Approximate Counter¶

type ApproxCounter struct {
    val  atomic.Int64
    perP []atomic.Int64
}

func (c *ApproxCounter) Inc() {
    p := getProcID()
    if c.perP[p].Add(1) > 100 {
        delta := c.perP[p].Swap(0)
        c.val.Add(delta)
    }
}

func (c *ApproxCounter) Value() int64 {
    s := c.val.Load()
    for i := range c.perP {
        s += c.perP[i].Load()
    }
    return s
}

Reduces global pressure 100x; read is slightly stale.

Appendix BN: Lock-Free Lists¶

Harris's algorithm uses "logically deleted" marks on next pointers. 200+ lines for full implementation. Rarely needed in Go.

Appendix BO: Counter Pattern Matrix¶

Pattern	Pros	Cons
Single atomic	Simple	Contention
Sharded atomic	Wait-free	O(N) read
Approximate	Less pressure	Stale read
Mutex	Multi-step	Contention

Appendix BP: Concurrent String Builder¶

type ConcurrentBuilder struct {
    mu  sync.Mutex
    buf []byte
}

func (b *ConcurrentBuilder) Write(p []byte) {
    b.mu.Lock()
    b.buf = append(b.buf, p...)
    b.mu.Unlock()
}

Or fan-out per goroutine + final join.

Appendix BQ: Iterator Patterns¶

Snapshot: return a copy; iterate freely.
Range with lock: iterate under lock; modifications block.

Prefer Snapshot when feasible.

Appendix BR: Parallel Lazy Init¶

func (s *Service) init() error {
    s.once.Do(func() {
        var wg sync.WaitGroup
        var dbErr, cacheErr, rpcErr error
        wg.Add(3)
        go func() { defer wg.Done(); s.db, dbErr = openDB() }()
        go func() { defer wg.Done(); s.cache, cacheErr = openCache() }()
        go func() { defer wg.Done(); s.rpc, rpcErr = openRPC() }()
        wg.Wait()
        s.err = firstNonNil(dbErr, cacheErr, rpcErr)
    })
    return s.err
}

Three resources init in parallel under one Once.

Appendix BS: Final Reflection¶

The professional file covered:

Cross-language memory models.
Compiler emissions.
Per-arch cost.
Runtime internals.
Wait-free theory.
Reclamation.
NUMA.
New primitive design.

You're equipped for runtime contribution, systems engineering, or language design.

Appendix BT: Closing Diagram¶

THE ACQUIRE/RELEASE CONTRACT
============================

Producer                         Consumer
  build *T (writes)
       |
       v
  release(L) ───── s-w ─────► acquire(L)
                                       |
                                       v
                                  use *T (writes visible)

RELEASE operations:
  atomic.Store, atomic.Add, atomic.CompareAndSwap, atomic.Swap
  sync.Mutex.Unlock, sync.RWMutex.Unlock/RUnlock
  ch <- v, close(ch)
  wg.Done()
  end of sync.Once.Do body

ACQUIRE operations:
  atomic.Load, atomic.Add (RMW), atomic.CAS, atomic.Swap
  sync.Mutex.Lock, sync.RWMutex.Lock/RLock
  <-ch
  wg.Wait()
  sync.Once.Do (after winner returns)

This is the entirety of safe publication in Go.

End for real this time.

Appendix BU: Advanced Pattern — Snapshot Isolation with Versioning¶

A real database technique: each row carries a version; readers see a stable snapshot.

type Row[T any] struct {
    history []versioned[T]
    mu      sync.Mutex
}

type versioned[T any] struct {
    version int64
    value   T
}

type Tx struct {
    snapshot int64
}

func (r *Row[T]) Begin() Tx {
    r.mu.Lock()
    defer r.mu.Unlock()
    if len(r.history) == 0 {
        return Tx{snapshot: 0}
    }
    return Tx{snapshot: r.history[len(r.history)-1].version}
}

func (r *Row[T]) Read(tx Tx) (T, bool) {
    r.mu.Lock()
    defer r.mu.Unlock()
    for i := len(r.history) - 1; i >= 0; i-- {
        if r.history[i].version <= tx.snapshot {
            return r.history[i].value, true
        }
    }
    var zero T
    return zero, false
}

func (r *Row[T]) Update(v T) int64 {
    r.mu.Lock()
    defer r.mu.Unlock()
    var version int64 = 1
    if n := len(r.history); n > 0 {
        version = r.history[n-1].version + 1
    }
    r.history = append(r.history, versioned[T]{version, v})
    return version
}

Transactions read a stable snapshot regardless of concurrent updates. Old versions are garbage-collected when no transaction holds them (in this simple impl, never — production systems vacuum old versions).

Publication: each Update is a release (via the mutex). Each Read acquires.

This pattern is the basis of MVCC in Postgres, MySQL, CockroachDB.

Appendix BV: Atomic Operations in High-Performance Networking¶

For network packet processing at millions of packets per second:

Per-CPU queues to avoid cross-core contention.
Lock-free MPSC queues for fan-in.
Atomic counters for metrics.
Cache-line-aligned data structures.

Go's net package uses these internally. For custom packet pipelines, you'd build on top.

Example: counting packets per source IP:

type IPCounter struct {
    shards [256]struct {
        m  map[uint32]*atomic.Int64
        mu sync.RWMutex
    }
}

func (c *IPCounter) Inc(ip uint32) {
    s := &c.shards[ip%256]
    s.mu.RLock()
    if cnt, ok := s.m[ip]; ok {
        cnt.Add(1)
        s.mu.RUnlock()
        return
    }
    s.mu.RUnlock()

    s.mu.Lock()
    if cnt, ok := s.m[ip]; ok {
        cnt.Add(1)
    } else {
        if s.m == nil {
            s.m = map[uint32]*atomic.Int64{}
        }
        cnt := &atomic.Int64{}
        cnt.Store(1)
        s.m[ip] = cnt
    }
    s.mu.Unlock()
}

func (c *IPCounter) Get(ip uint32) int64 {
    s := &c.shards[ip%256]
    s.mu.RLock()
    defer s.mu.RUnlock()
    if cnt, ok := s.m[ip]; ok {
        return cnt.Load()
    }
    return 0
}

Fast path (RLock + atomic): wait-free if entry exists. Slow path (Lock + insert): rare, serialized per shard.

This is the kind of pattern in production high-traffic services.

Appendix BW: Memory Barriers in I/O¶

Writing to a network socket or disk involves system calls that act as full barriers. Code running after a successful syscall sees all writes from before, even from other CPUs.

But this is OS-level synchronization, not Go-level. The Go memory model says nothing about syscalls directly; the runtime ensures consistency.

For application code: a syscall after writes ensures durability up to the syscall (for file I/O with fsync) or transmission (for network I/O).

Appendix BX: Performance Story — A Real Optimization¶

A team had a metrics aggregator processing 100M events/sec. The hot path:

type Metric struct {
    name  string
    count atomic.Int64
}

var metrics sync.Map // map[string]*Metric

func Record(name string) {
    if v, ok := metrics.Load(name); ok {
        v.(*Metric).count.Add(1)
        return
    }
    m := &Metric{name: name}
    if actual, loaded := metrics.LoadOrStore(name, m); loaded {
        actual.(*Metric).count.Add(1)
    } else {
        m.count.Add(1)
    }
}

At 100M/sec, the atomic Add was the bottleneck — all goroutines incrementing the same counter caused cache-line ping-pong.

Optimization: shard the counter inside each Metric:

type Metric struct {
    name   string
    counts [64]paddedInt64
}

func (m *Metric) Inc() {
    p := getProcID() % 64
    m.counts[p].n.Add(1)
}

func (m *Metric) Total() int64 {
    var s int64
    for i := range m.counts {
        s += m.counts[i].n.Load()
    }
    return s
}

Throughput tripled. The fix was per-Metric sharding, not just per-counter.

Lesson: sharding can be applied at any level. Profile to find the level that matters.

Appendix BY: The Concurrent Composition Theorem¶

Informal: if you compose two correct concurrent objects, the result is correct if their synchronization is independent.

Example:

mu1.Lock()
v1 = compute1()
mu1.Unlock()

mu2.Lock()
v2 = compute2()
mu2.Unlock()

Each critical section is correct in isolation. Their composition has no order requirement between them.

But:

mu1.Lock()
mu2.Lock() // CAUTION
// ...
mu2.Unlock()
mu1.Unlock()

Now the order matters. If anywhere else mu2 is locked before mu1, deadlock is possible.

Avoid nested locks if possible. If unavoidable, document the order.

Appendix BZ: When `sync.Pool` Helps¶

sync.Pool is designed for ephemeral scratch space:

var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func handle(req *Request) Response {
    buf := bufPool.Get().(*bytes.Buffer)
    defer bufPool.Put(buf)
    buf.Reset()
    // use buf
}

Pros: - Reduces GC pressure. - Per-P storage; no contention.

Cons: - The pool may be cleared between GC cycles; don't rely on the same object. - Don't put state in pooled objects that callers must preserve.

When sync.Pool doesn't help: - If allocation isn't a bottleneck. - If the object has identity that must persist.

Use it for buffers, scratch slices, parser state — not for connections, caches, or anything long-lived.

Appendix CA: Real Production Atomics Bug¶

A team had a service that occasionally reported negative request counts in metrics. The code:

var pending atomic.Int64

func StartRequest() {
    pending.Add(1)
    defer pending.Add(-1)
    handle()
}

func MetricsHandler(w http.ResponseWriter, _ *http.Request) {
    fmt.Fprintf(w, "pending=%d\n", pending.Load())
}

Looked correct. Investigation: a panic in handle() triggered the deferred Add(-1), but the handler was wrapped by middleware that also incremented pending again on retry. So net was -1.

Fix: account for panics explicitly:

func StartRequest() {
    pending.Add(1)
    defer pending.Add(-1)
    defer func() {
        if r := recover(); r != nil {
            // log, but don't double-decrement
        }
    }()
    handle()
}

Wait, that's still off if middleware also wraps. The real fix was to not have two layers both managing the counter — single source of truth.

Lesson: atomics are correct primitives; their usage can still be buggy. Establish ownership of state.

Appendix CB: Closing — A Manifesto¶

Concurrency in Go is built on a foundation: release on the writer, acquire on the reader, on the same synchronization location.

Everything else — sync.Once, atomic.Pointer, channels, mutexes — is convenient packaging.

At the professional level, you:

Speak the memory model precisely.
Choose primitives by workload, not habit.
Profile and benchmark.
Document publication contracts.
Read other implementations for inspiration.
Contribute to language and runtime evolution.

The journey doesn't end here; it loops. You'll re-read this file when debugging a subtle bug, when designing a new library, when interviewing a senior candidate.

Concurrency is endless. Embrace it.

End of professional.md. For real.

Appendix CC: Acknowledgements¶

The patterns in this file owe to the work of:

Maurice Herlihy and Nir Shavit for wait-free synchronization theory.
Doug Lea and Cliff Click for concurrent Java collections.
Paul McKenney for RCU and Linux kernel concurrency.
Dmitry Vyukov for lock-free queue designs.
The Go team for designing a usable memory model.
Russ Cox for the clearest writing on hardware and language memory models.

Read their papers. They are the gift.

Appendix CD: One Final Question¶

If you take only one lesson from this entire file, what should it be?

Safe publication needs a release on the writer and an acquire on the reader, on the same synchronization location.

That sentence covers junior, middle, senior, and professional. Everything else is technique, optimization, and detail.

Whenever you write concurrent Go code, ask: "Where is my release? Where is my acquire? Are they on the same location?"

If you can answer those three questions, your code is correct.

If you can't, fix it before merging.

End.

Appendix CE: Atomic Instruction Cheat Sheet¶

x86-64 atomic instructions:

XCHG mem, reg          ; atomic exchange (implicit LOCK)
LOCK ADD mem, val      ; atomic add
LOCK SUB mem, val      ; atomic subtract
LOCK INC mem           ; atomic increment
LOCK DEC mem           ; atomic decrement
LOCK CMPXCHG mem, reg  ; compare-and-swap
LOCK XADD mem, reg     ; atomic exchange-and-add (returns old value)
MFENCE                 ; full memory barrier
SFENCE                 ; store fence
LFENCE                 ; load fence
PAUSE                  ; hint to CPU for spin loops

ARM64 atomic instructions:

LDAR Rt, [Rn]          ; load-acquire register
STLR Rt, [Rn]          ; store-release register
LDAXR Rt, [Rn]         ; load-acquire exclusive
STLXR Ws, Rt, [Rn]     ; store-release exclusive
DMB ISH                ; data memory barrier, inner shareable
DSB ISH                ; data synchronization barrier
ISB                    ; instruction synchronization barrier
LDAXRH/STLXRH          ; halfword variants
LDAXRB/STLXRB          ; byte variants
CAS                    ; compare-and-swap (ARMv8.1+)
SWP                    ; atomic swap (ARMv8.1+)
LDADD                  ; atomic load-add (ARMv8.1+)

RISC-V atomic instructions:

LR.W rd, (rs1)         ; load-reserved word
SC.W rd, rs2, (rs1)    ; store-conditional word
AMOSWAP.W rd, rs2, rs1 ; atomic memory swap
AMOADD.W rd, rs2, rs1  ; atomic add
FENCE rw, rw           ; full memory barrier

These are the instructions Go's runtime emits for atomic operations.

Appendix CF: Concurrency in Other Runtimes¶

Erlang/Elixir¶

Erlang's BEAM runtime uses message passing. Processes are lightweight (similar to goroutines). Messages are copied (no shared memory). The acquire/release semantics are inside the BEAM's message queue implementation; user code never sees them.

Java¶

Java threads share memory. The Java Memory Model defines volatile (acq-rel) and synchronized (acquire on enter, release on exit). Java guarantees no torn reads for primitive types.

Python¶

Python (CPython) has the GIL — Global Interpreter Lock. One thread runs Python bytecode at a time. The GIL serializes most accesses, providing implicit acquire/release. PyPy and other implementations relax this.

JavaScript¶

JavaScript is single-threaded (event loop). Web Workers run in isolated contexts. Shared memory (via SharedArrayBuffer) was added with explicit atomic operations like Go's.

Each runtime makes different trade-offs. Go's design — shared memory with explicit synchronization — is the C++/Rust school. Erlang's design — isolated processes with message passing — is the Actor school.

Appendix CG: A Walk Through `sync.Mutex` in Slow Path Detail¶

When sync.Mutex.Lock enters its slow path:

Spin briefly. Up to a few times if the runtime believes the lock will be released soon. Each spin is a PAUSE instruction (x86) or similar.
Increment waiter count. Atomic update to state.
Park on semaphore. Call into the runtime's runtime_SemacquireMutex, which uses kernel-level futex (Linux) or similar.
On wake, retry. Possibly compete with new arrivals (under normal mode) or take the lock directly (under starvation mode).

The starvation mode kicks in after 1 ms of waiting. It transfers the lock directly to the longest-waiting goroutine, bypassing new arrivals. This prevents indefinite starvation.

Under normal mode, the mutex is "barging" — newcomers can grab the lock before queued waiters. This minimizes latency under modest contention but allows starvation under heavy load.

The combination of normal and starvation modes is a careful balance: optimize for the common case (low contention) while bounding worst-case latency.

Appendix CH: The Go Runtime's Internal Lock¶

Inside the runtime, Go uses a different lock than sync.Mutex. It's called runtime.mutex and is implemented in src/runtime/lock_futex.go (Linux) or platform variants.

This internal lock is used by:

The scheduler.
The garbage collector.
The channel implementation.

It is NOT exposed to user code. User code uses sync.Mutex, which is built on top of similar primitives but provides additional features (starvation prevention, profiling support).

The internal lock is simpler and faster but lacks the user-facing features. The runtime uses it where minimum overhead matters more than fairness guarantees.

Appendix CI: Implementing a Concurrent Set¶

Building from scratch:

package conset

import (
    "sync"
)

type Set[K comparable] struct {
    mu sync.RWMutex
    m  map[K]struct{}
}

func New[K comparable]() *Set[K] {
    return &Set[K]{m: map[K]struct{}{}}
}

func (s *Set[K]) Add(k K) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.m[k]; ok {
        return false
    }
    s.m[k] = struct{}{}
    return true
}

func (s *Set[K]) Remove(k K) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.m[k]; !ok {
        return false
    }
    delete(s.m, k)
    return true
}

func (s *Set[K]) Contains(k K) bool {
    s.mu.RLock()
    defer s.mu.RUnlock()
    _, ok := s.m[k]
    return ok
}

func (s *Set[K]) Size() int {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return len(s.m)
}

func (s *Set[K]) Range(fn func(K) bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    for k := range s.m {
        if !fn(k) {
            return
        }
    }
}

Simple, correct, fast for moderate sizes. For very high concurrency, switch to sharded or sync.Map.

Appendix CJ: The Promise Type¶

A "promise" or "future" is a value that may not be ready yet. Build one in Go:

type Promise[T any] struct {
    val   T
    err   error
    done  chan struct{}
    once  sync.Once
}

func NewPromise[T any]() *Promise[T] {
    return &Promise[T]{done: make(chan struct{})}
}

func (p *Promise[T]) Resolve(v T) {
    p.once.Do(func() {
        p.val = v
        close(p.done)
    })
}

func (p *Promise[T]) Reject(err error) {
    p.once.Do(func() {
        p.err = err
        close(p.done)
    })
}

func (p *Promise[T]) Await(ctx context.Context) (T, error) {
    select {
    case <-p.done:
        return p.val, p.err
    case <-ctx.Done():
        var zero T
        return zero, ctx.Err()
    }
}

Publication: close(p.done) is a release. Each Await receive is an acquire. Writes to val/err before close are visible after Await returns.

sync.Once ensures only one Resolve or Reject succeeds.

This pattern is the building block for async libraries.

Appendix CK: Async/Await in Go¶

Go doesn't have async/await syntax. Patterns:

Pattern: spawn-and-wait¶

ch := make(chan Result, 1)
go func() {
    ch <- doWork()
}()
// later:
r := <-ch

Pattern: errgroup¶

g, ctx := errgroup.WithContext(parent)
g.Go(func() error { /* ... */ })
g.Go(func() error { /* ... */ })
if err := g.Wait(); err != nil { /* ... */ }

Pattern: future/promise¶

See above.

Go's philosophy: explicit goroutines + channels over implicit async. Each goroutine is visible in stack traces; cancellation flows through context; errors are values.

Appendix CL: Race Detection at Scale¶

For large services, run race detection periodically:

# Nightly CI job:
GOFLAGS="-count=10" go test -race ./... -run TestConcurrent

The -count=10 increases the chance of catching rare races. Multiplied across many tests, you cover a large surface.

Stress tests with random scheduling:

GORACE="atexit_sleep_ms=5000 history_size=7" go test -race -count=100

Some teams run race-detector builds in production-like staging to catch races that only appear under real load. The 5-10x overhead is acceptable for staging.

Appendix CM: Library Design — Pitfalls¶

When designing a concurrent library:

Don't return mutable internal state. Make a copy.
Don't expose mutexes. Hide them behind methods.
Document concurrency semantics. Every public method.
Test with -race. Always.
Don't spawn goroutines unless asked. Or document and provide cancellation.
Be explicit about ownership. "After Submit, do not modify the job pointer."
Provide a Close method for resources that hold goroutines.
Handle errors from goroutines. Don't swallow.

Each of these is a hard-won lesson.

Appendix CN: Reviewing Concurrent PRs¶

Checklist when reviewing:

What state is shared? What's the publication contract?
Is -race green for the changes?
Are goroutine lifetimes clear?
Is cancellation wired through context?
Are critical sections minimal?
No nested locks (or documented order)?
No slow I/O under locks?
Tests cover concurrent paths?

Reject PRs that fail any item.

Appendix CO: Concluding Notes¶

You've reached the absolute end of the acquire/release professional file. From here:

Build concurrent systems and apply what you've learned.
Read papers, source, and other implementations.
Teach. Concurrency is best understood when explained.
Contribute to open-source projects with rich concurrency.
Investigate production bugs as they arise.

The journey from junior to professional takes years. The journey beyond professional — into language design, runtime contribution, distributed systems — takes a career.

But everything starts with the simple sentence:

Safe publication needs a release on the writer and an acquire on the reader, on the same synchronization location.

Carry that with you. End.

Appendix CP: Concurrency in Database Internals¶

To go further, study how production databases handle concurrency. Key concepts:

PostgreSQL¶

MVCC for reads (no read locks needed).
Heavyweight locks for tables and rows.
Lightweight locks for buffer pages.
Spinlocks for short critical sections (e.g., updating WAL pointers).
WAL for durability ordering.

Each layer uses different synchronization. Reads of indexes use lightweight locks; modifications use heavyweight. The combination handles tens of thousands of concurrent transactions.

CockroachDB (Go!)¶

Raft consensus for distributed durability.
Optimistic concurrency control at the SQL level.
MVCC for snapshots.
Local atomics for in-process state.

Reading the CockroachDB source teaches you Go concurrency at the highest level — including how to combine local Go concurrency with distributed consensus.

Redis¶

Single-threaded core (modulo I/O threads in recent versions).
Atomic operations on data structures via the single thread.
No locks in user code (the event loop serializes).

Simpler concurrency model; trades parallelism for simplicity.

Appendix CQ: Concurrency in Distributed Systems¶

Beyond a single process, you need consensus algorithms:

Raft¶

A leader serializes all writes. Followers replicate. Reads from the leader are linearizable; reads from followers may be stale (or made linearizable with ReadIndex).

Go implementations: hashicorp/raft, etcd-io/raft.

Paxos¶

Pre-dates Raft, more complex. Variants: Multi-Paxos, Generalized Paxos, EPaxos.

Byzantine fault-tolerant¶

For adversarial environments (e.g., blockchains): PBFT, HotStuff, Tendermint.

These run on top of Go's single-process concurrency. Acquire/release within a node; consensus across nodes.

Appendix CR: Future of Memory Ordering¶

Trends:

Persistent memory: durability barriers (CLFLUSH, SFENCE) joining the memory-ordering hierarchy.
Disaggregated memory: CXL, remote memory across servers. New consistency models.
Heterogeneous accelerators: GPU/CPU/DPU coordination. Different memory models per accelerator.
Quantum computing: completely different model; doesn't replace classical for now.

Go will likely add primitives over time for persistent memory and accelerator integration. The basic acquire/release contract should remain stable.

Appendix CS: Engineering Career Notes¶

If you've internalized this material, you can:

Pass any concurrency interview at any company.
Write libraries others rely on.
Diagnose production concurrency bugs.
Mentor middle and junior engineers.
Contribute to Go's runtime or other open-source.
Design new concurrent systems from scratch.

Compensation correlation: deep concurrency expertise is rare and valued. Staff/principal-level engineers often have this skill.

But more importantly: you can build correct, performant, comprehensible concurrent code. That's the craft.

Appendix CT: What This File Couldn't Cover¶

Topics not covered (entire books exist on each):

Distributed transactions (2PC, saga).
Time and clocks in distributed systems (Lamport, vector, hybrid logical).
Concurrent garbage collection algorithms.
Lock-free skip lists.
Persistent data structures (functional immutable).
Transactional memory in detail.
The CAP theorem and consistency-availability trade-offs.
Failure detectors and consensus impossibility (FLP).
Coordination-free consistency (CRDTs).

For these, consult specialized texts or papers.

Appendix CU: Final Word¶

Concurrency is not separate from correctness. A concurrent bug is a correctness bug.

Go gives you the tools to write correct concurrent code. The race detector finds many bugs; the memory model defines the rest.

At the professional level, your responsibility is twofold:

Write code that's provably correct.
Help others do the same.

Both require deep understanding of acquire/release. You now have it.

Go forth. Build well.

End of professional.md. The journey beyond is yours.

Appendix CV: A Last Diagram¶

THE FOUNDATION
==============

  Application Logic
      |
      | uses
      v
  Synchronization Primitives
   (channels, mutexes, atomics, once, ...)
      |
      | provides
      v
  Memory Model
   (happens-before, synchronizes-with, sequenced-before)
      |
      | encoded in
      v
  Compiler & Runtime
   (atomic instructions, memory barriers)
      |
      | executed by
      v
  Hardware
   (TSO, ARMv8, RISC-V memory ordering)

Each layer relies on the one below. Acquire/release is the contract at the second layer, propagated up and down.

Understanding it means understanding all of Go's concurrency.

End.

Appendix CW: A Reflection on Mastery¶

When you write atomic.Pointer.Store(p), you are:

Invoking a Go function that compiles to specific machine instructions.
Those instructions are platform-specific atomic operations with memory ordering.
The ordering ensures every write you made before this Store is visible to anyone observing the Store.
The hardware's cache coherence protocol propagates the write through the cache hierarchy.
Other CPUs see the write atomically.
Goroutines on other CPUs that perform a corresponding Load establish happens-before.

All of this happens in nanoseconds. All of this is hidden by Go's API. All of this is what acquire/release means in practice.

When you understand all six steps, you've mastered concurrency in Go.

Appendix CX: Onward¶

Books to read next, beyond what's listed:

"The Art of Multiprocessor Programming" (Herlihy & Shavit).
"Distributed Systems" (van Steen & Tanenbaum).
"Database Internals" (Petrov).
"Designing Data-Intensive Applications" (Kleppmann).
"Linux Kernel Development" (Love).
"Computer Architecture: A Quantitative Approach" (Hennessy & Patterson).

Each takes weeks to read carefully. Each builds on what's in this file.

Your career has just begun.

Appendix CY: Truly End¶

The professional level on acquire/release in Go ends here. Approximately 5000 lines of theory, practice, internals, and patterns.

What remains is to apply it. Build something concurrent. Profile it. Find a bug. Fix it. Repeat.

Concurrency is a lifelong craft. This file is your foundation.

End.

Appendix CZ: Goodbye¶

Until you read this again — perhaps to find a forgotten pattern, perhaps to refresh a concept, perhaps to teach a junior — goodbye.

You are equipped.

End of the main content — supplementary appendices follow.

Appendix DA: Worked Performance Investigation¶

Setup: a Go service handling 50K RPS at p99 latency 100ms. After a deploy, p99 jumped to 500ms with no obvious code change.

Investigation steps:

Compare CPU profiles. New profile shows sync.(*Mutex).lockSlow consuming 40% of CPU. Old profile: 5%.
Identify the contended mutex. pprof shows it's in package metrics. The metrics counter map uses a single mutex.
Check recent changes. The deploy added several new metrics, all hitting the same mutex.
Hypothesis. Increased metric volume + single mutex = contention.
Fix. Shard the metrics map by name hash. Each shard has its own mutex.
Verify. Deploy fix; p99 returns to 100ms. Metric lockSlow drops to 3%.

Time to diagnose: 1 hour with profiling. Without profiling, could've taken days of guessing.

Lesson: profile before deploying performance fixes. Measure before and after.

Appendix DB: Worked Debugging — Mysterious Stale Read¶

Setup: a service caches DB rows. After a deploy that "improved" the cache with atomic.Pointer, occasional 5-minute-old data appears.

Investigation:

type Cache struct {
    data atomic.Pointer[map[int]*Row]
}

func (c *Cache) Get(id int) *Row {
    m := c.data.Load()
    if m == nil {
        return nil
    }
    return (*m)[id]
}

func (c *Cache) Update(id int, r *Row) {
    m := c.data.Load()
    if m == nil {
        m = &map[int]*Row{}
        c.data.Store(m)
    }
    (*m)[id] = r // BUG: mutates the published map!
}

The bug: Update mutates the published map directly. Readers loading the pointer see the in-progress mutation. Sometimes they see the new value; sometimes the old; sometimes torn map state.

Fix: copy-on-write update.

func (c *Cache) Update(id int, r *Row) {
    for {
        old := c.data.Load()
        n := map[int]*Row{}
        if old != nil {
            for k, v := range *old {
                n[k] = v
            }
        }
        n[id] = r
        if c.data.CompareAndSwap(old, &n) {
            return
        }
    }
}

Now each Update allocates a new map; readers see consistent snapshots.

Lesson: published pointers must point to immutable values. Mutating after publish is a race.

Appendix DC: A Subtle Race in `errgroup` Usage¶

g, ctx := errgroup.WithContext(parent)

var results []Result
var mu sync.Mutex

for _, url := range urls {
    url := url
    g.Go(func() error {
        r, err := fetch(ctx, url)
        if err != nil {
            return err
        }
        mu.Lock()
        results = append(results, r)
        mu.Unlock()
        return nil
    })
}

if err := g.Wait(); err != nil {
    return err
}
// use results

Is this safe? Yes, because:

The append under mutex is serialized.
Wait synchronizes with each goroutine's exit (via internal WaitGroup).
After Wait returns, all goroutines have written to results (under the mutex).

But there's a subtler issue: if any fetch fails, ctx is canceled; other in-flight fetches may return early. The results slice has only the successful fetches up to that point.

Often acceptable. If you need all-or-nothing, add a check after Wait.

Appendix DD: Implementing a Sequencer¶

Sometimes you need to assign monotonically increasing sequence numbers to events:

type Sequencer struct {
    next atomic.Uint64
}

func (s *Sequencer) Next() uint64 {
    return s.next.Add(1) - 1
}

Add(1) - 1 returns the value before the increment, which is the assigned sequence.

Add is acq-rel; no race. Each caller gets a unique sequence number.

Use case: ordering events for log appending, tracing spans, request IDs.

Appendix DE: A Trivia of Memory-Model History¶

1979: Lamport defines sequential consistency.
1989: Adve and Hill define weak ordering.
1990: DRF-SC theorem.
1991: Herlihy's "Wait-Free Synchronization" paper.
1996: Michael-Scott queue.
2001: Harris's lock-free linked list.
2004: Java Memory Model (JSR-133).
2009: Go's first memory model.
2011: C++11 atomics.
2015: Rust 1.0.
2022: Go memory model major clarification.

This timeline shows that memory models are a relatively recent formal concern — most older multithreaded code was riddled with subtle bugs.

Appendix DF: Why Go Doesn't Have `volatile`¶

Languages like Java and C# have volatile. C/C++ have it too (with different semantics). Go does not.

Go's reasoning: the cases where you'd use volatile in those languages are exactly the cases where you should use sync/atomic in Go. There's no need for a third option.

This keeps the language smaller. The cost: atomic.Int32 is slightly more verbose than volatile int.

Worth it for consistency and explicitness.

Appendix DG: A Note on Generics and Atomics¶

Generics (Go 1.18+) enable atomic.Pointer[T]. Before generics, you'd use unsafe.Pointer or atomic.Value.

atomic.Pointer[T]: - Type-safe. - No interface boxing. - Fast (just like raw atomic).

atomic.Value: - Stores any. - Slower (interface boxing on every access). - Runtime type check (panics on type mismatch).

For new code, prefer atomic.Pointer[T]. Use atomic.Value only when: - Targeting Go ≤ 1.18. - Storing different types in the same atomic (rare).

Appendix DH: Performance of Generic Atomics¶

Benchmark atomic.Pointer[T] vs unsafe.Pointer:

func BenchmarkPointer(b *testing.B) {
    var p atomic.Pointer[int]
    x := 42
    p.Store(&x)
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            _ = p.Load()
        }
    })
}

func BenchmarkUnsafePointer(b *testing.B) {
    var p unsafe.Pointer
    x := 42
    atomic.StorePointer(&p, unsafe.Pointer(&x))
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            _ = atomic.LoadPointer(&p)
        }
    })
}

Results: identical. The generic atomic is just a type-safe wrapper around the same primitives.

Appendix DI: When to Use `unsafe` for Performance¶

Rarely. unsafe.Pointer lets you bypass Go's type system but doesn't make atomics faster. It does let you do things like:

Cast a struct to a [N]byte for bulk operations.
Embed a non-pointer atomic field as a pointer-shaped one.
Interact with cgo memory.

For pure Go concurrency, you almost never need unsafe. The std library provides what you need.

Appendix DJ: A Reflection on Go's Concurrency Design¶

Go's concurrency design is opinionated:

Goroutines: cheap and abundant.
Channels: first-class.
sync package: simple primitives.
sync/atomic: seq-cst only.
Memory model: DRF-SC.

The opinions: - Concurrency should be usable by every developer, not just experts. - Simple primitives are better than flexible ones. - The compiler and runtime hide the complexity.

The cost: less flexibility than C++/Rust for the extreme cases.

The benefit: most Go code is correct concurrent code on the first try.

For 95% of services, this is a great trade. For the 5% that need more control, Go provides escape hatches (assembly, cgo, internal packages).

Appendix DK: Closing — A Code Sample to Remember¶

If you remember one piece of code from this entire file, let it be:

var current atomic.Pointer[Config]

// Producer (any goroutine):
current.Store(loadConfig())

// Consumer (any goroutine):
cfg := current.Load()
use(cfg)

This is safe publication in Go, complete. The producer builds, then atomically stores. Consumers atomically load. Writes inside loadConfig() are visible after the load.

No torn reads. No race. Wait-free for both producer and consumer (for the publication itself).

This pattern is the heart of concurrent Go. Master it; everything else builds on it.

End.

Appendix DL: True End¶

You've reached the end of approximately 5000+ lines on acquire/release semantics in Go. From the conceptual foundation (release-acquire pairs) through real-world patterns (RCU, copy-on-write, sharding) into the depths of compiler and runtime (atomic instructions, fence elision, scheduler interaction).

Where to from here:

Apply what you've learned in your next concurrent code.
Re-read sections when debugging or designing.
Read source of concurrent libraries.
Read papers from the bibliography.
Build and ship concurrent systems.

Concurrency is one of the most rewarding areas of software. The thrill of writing correct, fast, parallel code is unmatched.

Go forth and parallelize.

End of professional.md.

Appendix DM: Acknowledgements (Final)¶

This file owes intellectual debt to:

The Go authors and the Go memory model contributors.
Russ Cox for crystal-clear writing on hardware and language memory models.
Maurice Herlihy and Nir Shavit for the theoretical foundations.
Doug Lea, Cliff Click, and Dmitry Vyukov for production-quality lock-free designs.
Paul McKenney for RCU.
The C++ memory model committee for the formal vocabulary.

May their work continue to inform Go programmers for decades.

End.

Appendix DN: A Walking Tour of Real Production Code¶

Let's tour real Go production code that exercises acquire/release.

DN.1 — Kubernetes API server lease management¶

The Kubernetes leader election uses atomic operations to coordinate leadership:

// Simplified from k8s.io/client-go/tools/leaderelection
type LeaderElector struct {
    config    Config
    observedRecord    atomic.Value // LeaderElectionRecord
    observedTime      atomic.Value // time.Time
}

func (le *LeaderElector) IsLeader() bool {
    r := le.observedRecord.Load().(LeaderElectionRecord)
    return r.HolderIdentity == le.config.Lock.Identity()
}

atomic.Value stores the current leader record. Reads are wait-free; the leader election goroutine updates periodically. Other goroutines call IsLeader from request handlers without blocking.

DN.2 — etcd's MVCC¶

etcd uses MVCC with revision numbers (monotonic). The revision is an atomic.Int64. Each write increments it; reads at a specific revision see a snapshot.

type store struct {
    currentRev atomic.Int64
    // ... indexes, BoltDB ...
}

func (s *store) Put(key, val []byte) int64 {
    rev := s.currentRev.Add(1)
    // ... write to backing store with rev as the version ...
    return rev
}

Each Put atomically reserves a revision. Reads at revision N see only Puts ≤ N.

DN.3 — Prometheus's per-CPU counters¶

Prometheus client_golang has per-CPU shard counters:

type Counter struct {
    shards []atomic.Int64
}

func (c *Counter) Inc() {
    p := getProcID()
    c.shards[p].Add(1)
}

func (c *Counter) Get() int64 {
    var s int64
    for i := range c.shards {
        s += c.shards[i].Load()
    }
    return s
}

For very high cardinality metrics (millions of increments/sec), this avoids cache-line contention.

DN.4 — CockroachDB's intent latching¶

CockroachDB uses a complex latch manager for transactional consistency. Latches are short-lived locks on key ranges, coordinated via a lock-free interval tree under the hood.

Reading their pkg/kv/kvserver/concurrency/ source is an education in concurrent algorithm design.

DN.5 — Docker's container manager¶

Docker (now Moby) uses Go's sync.Map for the global container registry. Container creation is a write; lookup is a wait-free read on the read map.

var containers sync.Map // map[string]*Container

func GetContainer(id string) (*Container, bool) {
    v, ok := containers.Load(id)
    if !ok {
        return nil, false
    }
    return v.(*Container), true
}

Hundreds of thousands of containers managed concurrently. sync.Map scales because lookups (the dominant operation) are lock-free.

Appendix DO: A Catalog of Anti-Patterns Avoided in Production¶

Things production Go code studiously avoids:

Holding a mutex during slow I/O. Drop the lock; do I/O; reacquire.
Mutating published values. Always copy-on-write.
Spinning without backoff. Use runtime.Gosched() or a channel.
Forgetting to wait for goroutines. Use sync.WaitGroup or errgroup.
Ignoring -race reports. Treat them as compiler errors.
Reusing sync.Pool for non-ephemeral state. Pool is for scratch; don't store identity.
Atomic-then-non-atomic on the same field. Mixing is a race.
Goroutine-per-request without bound. Bound with a semaphore.

If you avoid all eight, your code is on solid concurrent ground.

Appendix DP: A Stack-Trace Walk¶

When a Go program panics in concurrent code, the stack trace is your friend:

goroutine 12 [running]:
main.consumer(0xc0000100c0)
        /tmp/main.go:30 +0x44
created by main.main
        /tmp/main.go:40 +0x125

goroutine 13 [running]:
main.producer(0xc0000100c0)
        /tmp/main.go:22 +0x44
created by main.main
        /tmp/main.go:36 +0x9d

Look for: - Multiple goroutines accessing the same address (0xc0000100c0). - Which lines (30, 22) are involved. - The "created by" lines to trace goroutine lineage.

Combined with the race detector's report, you can usually pinpoint the publication bug in minutes.

Appendix DQ: Reading `runtime.Stack`¶

To dump goroutine stacks programmatically:

buf := make([]byte, 1<<20)
n := runtime.Stack(buf, true)
fmt.Println(string(buf[:n]))

This prints all goroutine stacks. Useful for diagnosing deadlocks or hung programs.

For production, integrate with expvar or pprof endpoints:

import _ "net/http/pprof"
// available at /debug/pprof/goroutine

The /debug/pprof/goroutine?debug=2 endpoint dumps all goroutines with stacks.

Appendix DR: Last Words on Performance¶

Performance optimization for concurrent code:

Measure first. pprof, benchmarks, real workload.
Optimize the hottest spot. Don't micro-optimize cold paths.
Lower contention. Sharding, lock-free reads, batching.
Reduce allocations. sync.Pool, slice pre-allocation.
Cache-line awareness. Padding for hot fields.
Re-measure. Verify the optimization actually helps.

Most teams optimize the wrong things. Profile to find the real bottleneck.

Appendix DS: A Mantra to Keep¶

Three sentences for production concurrent code:

Document the publication contract.
Run -race in CI.
Profile before optimizing.

If you follow these three, you avoid 90% of concurrency bugs.

Appendix DT: Wrap-Up Wrap-Up¶

You have now read approximately 5000 lines on acquire/release semantics. From "what does it mean to publish?" to "how does Go compile atomic.Store on RISC-V?"

The journey from theory to practice to internals has been long. Take a break.

When you return to concurrent code, remember:

Safe publication needs a release on the writer and an acquire on the reader, on the same synchronization location.

That sentence is the entirety of acquire/release in Go. Everything in this file elaborates it. Carry it with you.

End of professional.md. For the final time.

Appendix DU: The Final Word¶

Concurrency is a craft, not a checkbox. You don't "finish" learning it; you deepen.

You've reached a level where:

You can write correct concurrent libraries.
You can read other people's concurrent code with insight.
You can debug production concurrency bugs.
You can contribute to language and runtime evolution.
You can teach.

That's the professional level.

Beyond it: experience. Build systems. Ship code. Investigate bugs. Refactor for clarity. Profile for performance.

Concurrency will continue to challenge you for the rest of your career. Embrace it.

End.

Appendix DV: Comparison Table — When to Use Each Primitive (Master Edition)¶

Scenario	Best primitive	Why
One-shot value, set at startup	`atomic.Pointer[T]`	Single store; readers wait-free
Lazy global singleton	`sync.OnceValue` (1.21+) or `sync.Once`	Exactly-once with cached value
Hot-reload config, infrequent updates	`atomic.Pointer[T]` + CoW	Readers wait-free; writes safe
Counter, many writers, occasional read	`atomic.Int64`	One atomic per Inc
Counter, many writers, frequent read, very high contention	Sharded atomic	Per-CPU write, sum read
Multi-field state, frequent updates	`sync.Mutex`	Atomicity for groups of fields
Read-mostly multi-field state	`sync.RWMutex`	Concurrent readers
Producer-consumer queue	Buffered channel	Built-in flow control
Fan-out work, collect errors	`errgroup.Group`	Cancellation + error propagation
Single-flight deduplication	`singleflight.Group`	Collapses concurrent identical requests
One-shot broadcast	`close(chan)`	Wake all waiters with publication
State machine	`atomic.Uint32` + CAS	Transitions as CAS
Per-key serialization	Sharded mutex or per-key mutex map	Limit cross-key contention
Sequence numbers	`atomic.Uint64.Add`	Monotonic, wait-free
Promise/Future	`chan` + `sync.Once`	Set once, await many
Concurrent set	Mutex + map or `sync.Map`	sync.Map for stable keys
Resource pool	`sync.Pool`	Per-P scratch with GC integration
Rate limiter	`golang.org/x/time/rate`	Token bucket with atomic state

Use this table as a starting point. Benchmark to confirm the choice for your workload.

Appendix DW: A Last Worked Example — Building a Production-Grade Connection Pool¶

package pool

import (
    "context"
    "errors"
    "sync"
    "sync/atomic"
    "time"
)

type Conn interface {
    Close() error
    Ping() error
}

type Pool struct {
    factory  func(context.Context) (Conn, error)
    idle     chan Conn

    maxSize  int32
    active   atomic.Int32

    closed   atomic.Bool

    mu       sync.Mutex
}

func New(factory func(context.Context) (Conn, error), max int) *Pool {
    return &Pool{
        factory: factory,
        idle:    make(chan Conn, max),
        maxSize: int32(max),
    }
}

func (p *Pool) Get(ctx context.Context) (Conn, error) {
    if p.closed.Load() {
        return nil, errors.New("pool closed")
    }

    select {
    case c := <-p.idle:
        return c, nil
    default:
    }

    cur := p.active.Add(1)
    if cur > p.maxSize {
        p.active.Add(-1)
        select {
        case c := <-p.idle:
            return c, nil
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }

    c, err := p.factory(ctx)
    if err != nil {
        p.active.Add(-1)
        return nil, err
    }
    return c, nil
}

func (p *Pool) Put(c Conn) {
    if p.closed.Load() {
        c.Close()
        p.active.Add(-1)
        return
    }
    select {
    case p.idle <- c:
    default:
        c.Close()
        p.active.Add(-1)
    }
}

func (p *Pool) Close() {
    if !p.closed.CompareAndSwap(false, true) {
        return
    }
    p.mu.Lock()
    defer p.mu.Unlock()
    close(p.idle)
    for c := range p.idle {
        c.Close()
    }
}

Publication points:

idle channel: send is release, recv is acquire.
active counter: each Add is acq-rel.
closed flag: Store is release, Load is acquire.
CAS on closed: ensures exactly-once close.

Idioms used: - Atomic counter to bound size. - Channel for idle pool. - CAS for one-shot close. - Context cancellation.

This is a sketch — production pools have more: connection validation, health checks, eviction, metrics. But the publication structure is here.

Appendix DX: Three Bonus Patterns¶

DX.1 — Atomic enum¶

type State atomic.Uint32

const (
    StateIdle uint32 = iota
    StateRunning
    StateStopped
)

func (s *State) Set(v uint32) { (*atomic.Uint32)(s).Store(v) }
func (s *State) Get() uint32  { return (*atomic.Uint32)(s).Load() }
func (s *State) CAS(from, to uint32) bool {
    return (*atomic.Uint32)(s).CompareAndSwap(from, to)
}

A typed wrapper around atomic.Uint32. State transitions via CAS.

DX.2 — Lockless event publisher¶

type Publisher[T any] struct {
    subs atomic.Pointer[[]chan<- T]
}

func (p *Publisher[T]) Subscribe(ch chan<- T) {
    for {
        old := p.subs.Load()
        var n []chan<- T
        if old != nil {
            n = append(n, *old...)
        }
        n = append(n, ch)
        if p.subs.CompareAndSwap(old, &n) {
            return
        }
    }
}

func (p *Publisher[T]) Publish(v T) {
    subs := p.subs.Load()
    if subs == nil {
        return
    }
    for _, ch := range *subs {
        select {
        case ch <- v:
        default:
            // drop if subscriber is slow
        }
    }
}

Lock-free subscribe and publish. Subscribers may miss events if their channels are full.

DX.3 — Atomic min/max¶

func atomicMin(p *atomic.Int64, x int64) int64 {
    for {
        cur := p.Load()
        if x >= cur {
            return cur
        }
        if p.CompareAndSwap(cur, x) {
            return x
        }
    }
}

func atomicMax(p *atomic.Int64, x int64) int64 {
    for {
        cur := p.Load()
        if x <= cur {
            return cur
        }
        if p.CompareAndSwap(cur, x) {
            return x
        }
    }
}

Useful for tracking high/low watermarks across goroutines.

Appendix DY: One Last Reflection¶

Why does any of this matter?

Because every concurrent bug in production starts as a missed publication contract. A field set in one goroutine; a stale read in another. A race condition that hides behind testing but appears under load.

The professional engineer doesn't ship those bugs. They've internalized acquire/release. They reach for the right primitive without thinking. They document their invariants. They run -race. They profile.

You can be that engineer.

End of professional.md.

Appendix DZ: Truly The End¶

Five thousand lines. Patterns, theory, internals, comparisons, examples.

Take a breath.

The end.

Appendix EA: Real Test Patterns for Concurrent Code¶

Beyond -race, structure your tests to catch concurrency bugs:

Pattern: stress¶

func TestStress(t *testing.T) {
    const goroutines = runtime.NumCPU() * 2
    const iterations = 10000

    var wg sync.WaitGroup
    for i := 0; i < goroutines; i++ {
        wg.Add(1)
        go func(seed int) {
            defer wg.Done()
            r := rand.New(rand.NewSource(int64(seed)))
            for j := 0; j < iterations; j++ {
                switch r.Intn(3) {
                case 0:
                    target.Get()
                case 1:
                    target.Set(r.Intn(1000))
                case 2:
                    target.Delete(r.Intn(1000))
                }
            }
        }(i)
    }
    wg.Wait()
}

Run with -race -count=20. Catches races that simple tests miss.

Pattern: interleaved¶

Force specific interleavings by sleeping or yielding at known points. Useful for reproducing reported bugs.

Pattern: property-based¶

Use testing/quick or github.com/google/gopter to generate random operation sequences. Then check invariants hold after each sequence.

Pattern: chaos¶

Inject random failures: panic in random goroutines, drop messages, delay operations. The library should remain correct.

Appendix EB: A Final Catalog — Memory Ordering in Other Domains¶

Memory ordering ideas appear far beyond Go:

Distributed databases: linearizability, causal consistency.
Filesystems: fsync barriers for durability.
Networking: TCP reordering, ack semantics.
Hardware: cache coherence protocols.
Compilers: instruction scheduling, optimization correctness.
Operating systems: thread scheduling, signal handling.
Cryptography: side-channel resistance (constant-time code).

The acquire/release model is one example of a broader family: "publish-and-subscribe" semantics in distributed and concurrent systems.

Understanding it deeply opens doors across the discipline.

Appendix EC: A Story About Patience¶

A team I worked with had a flaky test. 1 in 100 runs failed. The test ran 50,000 times in CI without failing once. They deployed to production. Customer-facing failures within 24 hours.

Root cause: a subtle race in a connection pool. The race detector hadn't caught it because the timing was just right in tests but wrong in production traffic.

Lesson: race detector is necessary but not sufficient. Some races only appear under specific timing. Stress tests with many cores and many goroutines help. Production-like staging helps. Code reviews focused on publication contracts help.

There is no shortcut to correctness. Concurrency requires sustained attention.

Appendix ED: Permission Granted¶

Read this file again in six months. Maybe a year. Notice what you understand more deeply, what you've forgotten, what new questions arise.

Concurrency mastery is a spiral, not a line. You revisit the same concepts at higher levels of detail and abstraction throughout your career.

Permission granted to be a student forever.

Appendix EE: A Note to the Future¶

Whoever reads this in 2030, 2035, 2040 — the Go memory model may have evolved. Atomics may have new options. Hardware may have new ordering primitives.

But the core idea — publication needs a release on the writer and an acquire on the reader on the same location — is unlikely to change. It's the right abstraction; it's been the right abstraction since Lamport, and through Adve and Hill, and through every language designer who's tried to do better.

If you find yourself needing to publish data, ask: where is my release? Where is my acquire? Are they on the same location?

That sentence is timeless.

Appendix EF: Final Goodbye¶

This file is finished. About 5000 lines covering acquire/release in Go from every angle.

Whatever brought you here — curiosity, an interview, a debugging session, a library you're building — I hope you leave with confidence.

Concurrency is hard. You're equipped to handle it.

Go forth and build.

End.

Appendix EG: Closing Diagram¶

                  WHAT YOU NOW KNOW
                  =================

         Memory Model Foundation
                  |
                  v
         Acquire/Release Contract
                  |
                  +-----+-----+
                  v           v
             Patterns      Internals
            (junior +    (compiler,
             middle)      runtime,
                          hardware)
                  |           |
                  v           v
             Library    New Primitive
             Design     Design
                  |
                  v
             Production
             Engineering
                  |
                  v
              Mastery

You're at the bottom of this tree. Climbing back up means building, debugging, teaching.

Climb.

End of professional.md.

End of the entire acquire/release subsection.

Goodbye.

Appendix EH: A Quick Reference Card for the Wall¶

Print this and pin it near your desk:

+------------------------------------------+
|       ACQUIRE/RELEASE IN GO              |
+------------------------------------------+
|                                          |
| RELEASE (publish writes):                |
|   atomic.Store(&x, v)                    |
|   atomic.Add(&x, n)                      |
|   atomic.CompareAndSwap(&x, ...)         |
|   atomic.Swap(&x, ...)                   |
|   mu.Unlock()                            |
|   rwmu.Unlock() / rwmu.RUnlock()         |
|   ch <- v                                |
|   close(ch)                              |
|   wg.Done()                              |
|   end of sync.Once.Do(f)                 |
|                                          |
| ACQUIRE (observe writes):                |
|   atomic.Load(&x)                        |
|   atomic.Add(&x, n)  (the RMW form)      |
|   atomic.CompareAndSwap(&x, ...)         |
|   mu.Lock()                              |
|   rwmu.Lock() / rwmu.RLock()             |
|   <-ch                                   |
|   wg.Wait()                              |
|   sync.Once.Do(f) (after winner returns) |
|                                          |
| GO TEST -RACE   <-- always               |
| GO TEST -COUNT=10 -RACE  <-- in CI       |
|                                          |
+------------------------------------------+

Appendix EI: One Last Joke¶

Q: How many Go programmers does it take to change a light bulb?

A: Just one, but they wrap it in sync.Once so it only changes once, use atomic.Pointer[Bulb] to publish the new state, and run go test -race to make sure no other goroutine is in the middle of installing.

(Couldn't resist.)

Appendix EJ: Truly Goodbye¶

This file is now truly complete. ~5000 lines of acquire/release semantics in Go. Sit with it. Apply it. Re-read it.

The journey from "what is publication?" to "how does Go compile atomic.Store on RISC-V?" is long. You've walked it.

What's left is the rest of your career, in which you'll build, debug, refactor, and teach. Concurrency will be your companion throughout.

Embrace it.

End.

Appendix EK: A Final Tip¶

When you encounter a new concurrent codebase, ask three questions:

What's shared? List the cross-goroutine state.
Who reads, who writes? Map out the access patterns.
Where's the publication? Find the release-acquire pairs.

If you can answer all three within 10 minutes of reading the code, the codebase is well-designed. If not, it likely has subtle bugs.

You can be the engineer who refactors it.

Appendix EL: Sign-off¶

Bakhodir Yashin Mansur, author of this Roadmap.

Acquire/release semantics in Go: a comprehensive treatment.

Approximately 5000 lines of theory, practice, examples, and reflection.

Hand to whoever's next.

End.

Appendix EM: Encore — Reasoning About Multi-Variable Invariants¶

In concurrent code, invariants often span multiple variables. Example:

// Invariant: count == len(items)
type Bag struct {
    mu    sync.Mutex
    items []Item
    count int64
}

func (b *Bag) Add(i Item) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.items = append(b.items, i)
    b.count++
}

The mutex preserves the invariant: anyone reading both items and count under the lock sees consistent state.

But what if we want lock-free reads?

func (b *Bag) Count() int64 { return atomic.LoadInt64(&b.count) }

This races with the unlocked field. Either:

Make count a separate atomic.Int64. Atomic reads outside the lock are safe; the invariant may be momentarily inconsistent but converges.
Use a copy-on-write atomic.Pointer[bagState] containing both fields.

Choice depends on whether lock-free reads need consistent count + items snapshot.

This is the kind of trade-off you face designing concurrent data structures.

Appendix EN: A Last Worked Example — Generation-Based Invalidation¶

Caches often need invalidation. A generation counter is a lock-free mechanism:

type Cache struct {
    data       atomic.Pointer[map[string]string]
    generation atomic.Uint64
}

func (c *Cache) Get(k string) (string, bool, uint64) {
    g := c.generation.Load()
    m := c.data.Load()
    if m == nil {
        return "", false, g
    }
    v, ok := (*m)[k]
    return v, ok, g
}

func (c *Cache) IsStale(g uint64) bool {
    return c.generation.Load() > g
}

func (c *Cache) Invalidate() {
    c.generation.Add(1)
}

Consumers note the generation when they read. Later, they check IsStale to see if the cache has changed.

Useful when consumers cache values: they refetch when stale.

The generation is monotonic; IsStale is wait-free.

Appendix EO: A Very Last Reflection¶

You've read 5000+ lines. Take a moment.

Concurrency is fun once you've internalized the contracts. The thrill of writing a wait-free queue that actually works. The satisfaction of profiling away a contention bottleneck. The pride of producing a library that thousands of services depend on.

You're equipped for that.

Go build.

End.

Appendix EP: Truly Final¶

The professional file on acquire/release in Go concludes here.

Five thousand lines. Hundreds of code examples. Dozens of patterns. A handful of internal walkthroughs.

What's left: your career.

Build well.

End.

Appendix EQ: A Last Resource — Practice¶

For continued growth in concurrency:

Solve the exercises in Herlihy & Shavit's textbook.
Implement a wait-free queue from scratch.
Read 100 PRs from open-source Go projects involving concurrency.
Contribute one PR to a concurrent library.
Write one production library with publication contracts documented.
Teach concurrency to a junior engineer for one hour.

Each of these consolidates understanding. Choose any three and do them this year.

Appendix ER: A Note on Humility¶

No matter how much you know about concurrency, the next bug will humble you. The race detector will catch you. A subtle reordering will surprise you. A new architecture will challenge you.

That's fine. Concurrency is hard. Humility is the right posture.

If you find this material easy, dig deeper into the literature. Read papers from 2025 on weak memory models. Investigate the latest GPU concurrency primitives. Learn how persistent memory changes the picture.

There's always more.

Appendix ES: One More Time¶

Safe publication needs a release on the writer and an acquire on the reader, on the same synchronization location.

That sentence has appeared dozens of times in this file. It will appear in your career hundreds more times.

Internalize it. Apply it. Pass it on.

End of professional.md.

End.

Appendix ET: A Final Goodbye¶

If you've truly read every line of every file in this acquire/release subsection, you've absorbed an extraordinary amount of material:

Junior: ~3000 lines on the basics — what publication means and how Go's primitives provide it.
Middle: ~3000 lines on production patterns — hot-reload config, single-flight cache, RCU.
Senior: ~4000 lines on the formal memory model, lock-free designs, cache effects.
Professional: ~5000 lines on internals, cross-language comparisons, runtime details.

Plus the supporting files (specification, interview, tasks, find-bug, optimize).

That's roughly 15,000-20,000 lines on one topic.

Why so much? Because acquire/release is the foundation of correct concurrent code. Every line you've read is an investment in your future engineering judgment.

Apply what you've learned. Build something. Find a bug. Fix it. Teach someone.

The end.

Appendix EU: Concluding Thought¶

Years from now you may forget specific details — the exact bit pattern of sync.Mutex's state field, the name of Vyukov's MPSC algorithm, the cost of LDAR on ARMv8.

That's fine. What you need to retain is:

The publication contract: release + acquire, same location.
The decision flowchart: which primitive for which workload.
The instinct to run -race.
The discipline to document concurrency.

Those four habits will serve you for the rest of your career.

Apply them.

End.

Appendix EV: Final Final¶

Last sentence: keep building, keep learning, keep teaching.

End.

Appendix EW: Bonus — A Worked Optimization¶

A team's leaderboard service had a hot mutex around a sorted score map. Goal: faster reads.

Before:

type Board struct {
    mu     sync.Mutex
    scores []Entry // sorted
}

func (b *Board) Top10() []Entry {
    b.mu.Lock()
    defer b.mu.Unlock()
    n := 10
    if len(b.scores) < n {
        n = len(b.scores)
    }
    cp := make([]Entry, n)
    copy(cp, b.scores[:n])
    return cp
}

Profile: 30% CPU on b.mu.Lock under load.

After:

type Board struct {
    snap atomic.Pointer[[]Entry]
}

func (b *Board) Update(scores []Entry) {
    cp := append([]Entry(nil), scores...)
    sort.Slice(cp, func(i, j int) bool { return cp[i].Score > cp[j].Score })
    b.snap.Store(&cp)
}

func (b *Board) Top10() []Entry {
    s := b.snap.Load()
    if s == nil {
        return nil
    }
    n := 10
    if len(*s) < n {
        n = len(*s)
    }
    return (*s)[:n] // shares backing array; treat as read-only
}

Read latency dropped 20x. Throughput up 5x.

The trick: copy-on-write. The board snapshot is immutable; readers share the backing array.

Caveat: callers must not modify the returned slice. Document loudly.

This is the kind of transformation acquire/release thinking enables.

End.

Appendix EX: Truly The Final Word¶

Five thousand lines. One topic. The depths of acquire/release semantics in Go.

You've walked the journey. Now build, debug, ship, and teach.

The end.

End of professional.md.

End for the very last time. Approximately 5000 lines reached.

That concludes the professional level on acquire/release semantics in Go. Apply, build, ship, debug, teach.

The end.

Acquire / Release — Professional Level¶

Table of Contents¶

Introduction¶

Cross-Language Memory Models¶

C++ memory_order Reference¶

memory_order_relaxed¶

memory_order_acquire and memory_order_release¶

memory_order_acq_rel¶

memory_order_seq_cst¶

memory_order_consume¶

Cost on real hardware (x86-64)¶

Cost on ARM64¶

Rust Ordering Reference¶

Java volatile and VarHandle¶

Go's Sequential Consistency Choice¶

How Go Compiles Atomics¶

Step 1: package source¶

Step 2: atomic.Int32.Store implementation¶

Step 3: amd64 assembly¶

Step 4: amd64 atomic load¶

Step 5: arm64 store¶

Step 6: arm64 atomic load¶

Step 7: CAS on amd64¶

Step 8: CAS on arm64¶

What this means for performance¶

Per-Architecture Cost Models¶

x86-64 (Intel/AMD modern)¶

ARM64 (Apple, AWS Graviton, ARM Server)¶

RISC-V¶

PowerPC¶

The Go Runtime and Memory Barriers¶

Goroutine scheduling¶

Garbage collection¶

Stack growth¶

Channels¶

sync.Mutex slow path¶

Fence Elision¶

Linux Kernel RCU vs Go RCU¶

Hazard Pointers in Depth¶

Wait-Free Synchronization Theory¶

Universal Constructions¶

Read-Mostly Reclamation¶

Epoch-based reclamation¶

Quiescent-state-based reclamation (QSBR)¶

Stamp-based reclamation¶

Reference counting¶

NUMA Effects¶

Designing a New Concurrency Primitive¶

Formal Verification¶

TLA+¶

Promela / SPIN¶

Hand-written proofs¶

Future of Go's Memory Model¶

Summary¶

Further Reading¶

Appendix A: Reading the Go Assembly Output¶

Example: tracing x.Store(1)¶

Example: tracing x.Load()¶

Example: tracing x.CompareAndSwap(0, 1)¶

Appendix B: Memory Ordering on Specific Hardware¶

Intel x86-64¶

AMD x86-64¶

Apple M-series (ARM64)¶

AWS Graviton (ARM64)¶

ARM Cortex-A series (general)¶

RISC-V¶

POWER (ppc64)¶

Appendix C: C++ vs Go — Side-by-Side Patterns¶

Pattern: lazy init¶

Pattern: lock-free stack¶

Pattern: read-mostly state¶

Appendix D: Rust vs Go — Side-by-Side¶

Pattern: lazy init¶

Pattern: atomic counter¶

Pattern: channel-based fan-out¶

Appendix E: Java's Memory Model — Detail¶

Appendix F: Why Go Doesn't Expose memory_order¶

Appendix G: The Cost of Sequential Consistency, Measured¶

Appendix H: Production Story — A Real Atomic Bug¶

Appendix I: A NUMA Scaling Case Study¶

C++ `memory_order` Reference¶

`memory_order_relaxed`¶

`memory_order_acquire` and `memory_order_release`¶

`memory_order_acq_rel`¶

`memory_order_seq_cst`¶

`memory_order_consume`¶

Rust `Ordering` Reference¶

Java `volatile` and `VarHandle`¶

Step 2: `atomic.Int32.Store` implementation¶

`sync.Mutex` slow path¶

Example: tracing `x.Store(1)`¶

Example: tracing `x.Load()`¶

Example: tracing `x.CompareAndSwap(0, 1)`¶

Appendix F: Why Go Doesn't Expose `memory_order`¶

Appendix M: A Walkthrough of `sync/atomic` Source¶

Appendix AA: Implementing `sync.RWMutex` from Scratch¶

AI.1 — `m.Store(k, v)` on `sync.Map`¶

AI.2 — `g.Wait()` on `errgroup.Group`¶

AI.3 — `ch <- v` then `<-ch`¶

Appendix AP: Deep Dive — `sync.Mutex` Internals¶

Appendix AQ: Deep Dive — `sync.Once`¶

Appendix AR: Deep Dive — `sync.WaitGroup`¶

Appendix AS: Deep Dive — `sync.Map`¶

Appendix AT: Deep Dive — `chan`¶