Copy-on-Write — Professional Level¶
Table of Contents¶
- Introduction
- The Go Memory Model in Detail
- Inside
sync/atomic.Pointer[T] - Inside
sync/atomic.Value - Codegen Across Architectures
- Memory Ordering Primitives
- GC Interaction With Snapshots
- Write Barriers and Pointer Writes
- Escape Analysis and Snapshots
- Cache-Line Effects
- False Sharing in COW Systems
- RCU in the Linux Kernel Detail
- Hazard Pointers and Epoch-Based Reclamation
- Lock-Free Programming Subtleties
- Weak References and Snapshot Caches
- unsafe.Pointer for Advanced Cases
- Assembly-Level Considerations
- Profiling Lock-Free COW
- Correctness Proofs and Race Conditions
- Designing for Hardware
- Self-Assessment
- Summary
Introduction¶
Focus: "I can design and ship production COW systems. Now I want to understand the runtime mechanics: how
atomic.Pointer[T]is compiled, what the memory model formally guarantees, how the GC interacts with retained snapshots, and the lowest-level mechanics of lock-free programming."
At the professional level, the question is no longer "what should I build?" but "why does the thing I built work?" The senior level taught the patterns; this level explains the substrate.
A professional Go engineer can read the runtime source, follow atomic.Pointer[T] from Go code through SSA IR to architecture-specific instructions, identify cache-line effects in hot paths, and reason about correctness with reference to the Go memory model. They can debug a COW failure by reading goroutine stacks and runtime.MutexProfile output. They know when unsafe.Pointer is necessary and what it costs.
This file dives into the mechanics: memory ordering, codegen, GC interaction, hardware effects, and the formal model that ties them together. Read it slowly. The material is dense but rewards careful study.
The Go Memory Model in Detail¶
The Go memory model defines when one goroutine's writes become visible to another goroutine's reads. It is the formal foundation of all concurrent Go code, including COW.
The happens-before relation¶
The model is built around the happens-before relation. An event A happens-before event B (written A → B) if, intuitively, A's effects are guaranteed to be observable by B.
The relation is established by:
- Program order within a single goroutine: if
Ais written beforeBin source code,A → B. - Goroutine creation: the
gostatement happens-before the first instruction of the new goroutine. - Channel operations: a send on a channel happens-before the corresponding receive.
- Mutex operations: the n-th
Unlockon a mutex happens-before the (n+1)-thLock. sync.Once.Do: the function body happens-before any later call toDoreturns.sync/atomicoperations: as of Go 1.19, atomics establish happens-before edges between corresponding operations on the same atomic value.
What atomic operations guarantee¶
The Go memory model treats sync/atomic operations as sequentially consistent. From the Go memory model document:
The APIs in the sync/atomic package are collectively "atomic operations" that can be used to synchronize the execution of different goroutines. If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B. All the atomic operations executed in a program behave as though executed in some sequentially consistent order.
This is a strong guarantee. Sequential consistency means there is a single global ordering of all atomic operations that is consistent with each goroutine's program order. Concretely:
- An
atomic.Store(p, v)followed byatomic.Load(p) == vin another goroutine establishes happens-before: everything before the Store happens-before everything after the Load. - A
CompareAndSwapthat succeeds is both a Store and a Load: it establishes happens-before in both directions.
Why this matters for COW¶
The COW publication pattern relies on this guarantee:
// Writer
nextSnap := &Snapshot{Field: value} // (W1)
ptr.Store(nextSnap) // (W2)
// Reader
snap := ptr.Load() // (R1)
use(snap.Field) // (R2)
The chain: - W1 → W2 (program order). - W2 → R1 (because R1 observes the value W2 stored; atomic edge). - R1 → R2 (program order).
Therefore W1 → R2: the reader's snap.Field read sees the value the writer assigned.
Without atomic store/load, this chain breaks. Plain reads and writes have no happens-before with each other; the reader might see a partially-constructed snapshot.
Subtleties¶
- Memory model edge is per-atomic-variable. A store to one
atomic.Pointerdoes not synchronize with a load from another. - Non-atomic accesses inside the snapshot. Once the snapshot is published, readers access its fields with plain (non-atomic) loads. This is safe only because the snapshot is immutable — readers and writers do not race on its fields.
- Sequential consistency is stronger than acquire/release. Other languages (C++, Rust) expose finer-grained ordering. Go chose simplicity.
- The cost. Sequential consistency requires more synchronization at the hardware level — typically an
MFENCEor equivalent before atomic stores on x86, more on weaker architectures.
What the memory model does NOT guarantee¶
- Order of operations on different atomic variables. Two stores to two different
atomic.Pointers may appear in different orders to different observers (no, wait — the SC guarantee says they appear in the same order to all observers; this is the strength of SC). - Visibility without synchronization. A plain write to a shared variable has no happens-before with any read; the reader may see stale or torn values forever.
- Anything about timing. The memory model is about ordering, not real-time progress.
Inside sync/atomic.Pointer[T]¶
atomic.Pointer[T] is a generic wrapper. Let's look at how it is implemented.
Source¶
In the Go source (src/sync/atomic/type.go):
// A Pointer is an atomic pointer of type *T. The zero value is a nil *T.
type Pointer[T any] struct {
_ [0]*T // for type signature checks
_ noCopy
v unsafe.Pointer
}
// Load atomically loads and returns the value stored in x.
func (x *Pointer[T]) Load() *T { return (*T)(LoadPointer(&x.v)) }
// Store atomically stores val into x.
func (x *Pointer[T]) Store(val *T) { StorePointer(&x.v, unsafe.Pointer(val)) }
// Swap atomically stores new into x and returns the previous value.
func (x *Pointer[T]) Swap(new *T) (old *T) { return (*T)(SwapPointer(&x.v, unsafe.Pointer(new))) }
// CompareAndSwap executes the compare-and-swap operation for x.
func (x *Pointer[T]) CompareAndSwap(old, new *T) (swapped bool) {
return CompareAndSwapPointer(&x.v, unsafe.Pointer(old), unsafe.Pointer(new))
}
LoadPointer, StorePointer, SwapPointer, CompareAndSwapPointer are runtime intrinsics — the compiler recognises them and emits architecture-specific instructions.
Type safety¶
The [0]*T field is a clever trick. It is zero-sized at runtime but enforces type compatibility at compile time. Two Pointer[T] types with different T cannot be assigned to each other.
The noCopy field is checked by go vet to prevent accidental copying:
go vet flags struct copies whose fields have Lock/Unlock methods. This is how the vet checker catches var p atomic.Pointer[T] = q.p.
The intrinsics¶
runtime/internal/atomic defines the platform-specific operations. On amd64:
The corresponding assembly (in runtime/internal/atomic/atomic_amd64.s):
A plain MOV instruction. On x86, an aligned 64-bit MOV is atomic and provides acquire semantics for free. No fence needed.
For Store:
XCHGQ is used because it provides a full memory fence (sequential consistency). A plain MOV would suffice for x86 ordering but Go's memory model requires SC even on weaker architectures, so XCHG is the common choice.
Cost summary¶
| Operation | x86 cost | ARM64 cost | Effect |
|---|---|---|---|
Load | ~1.5 ns (cache hit) | ~3 ns (with LDAR) | Acquire fence |
Store | ~10 ns (XCHG) | ~5 ns (with STLR) | Full fence (SC) |
Swap | ~10 ns | ~5 ns | Same as Store |
CompareAndSwap | ~15 ns (LOCK CMPXCHG) | ~10 ns (LL/SC loop) | Full fence (SC) |
Loads dominate in COW reads. Stores happen rarely. The asymmetry is exactly what makes COW efficient.
Why Pointer types are special¶
The GC must scan all pointers on the heap. Atomic operations on pointer types coordinate with the GC's write barrier (more on this later). Atomic operations on non-pointer types (Int32, Bool, etc.) do not need this coordination and use simpler implementations.
Inside sync/atomic.Value¶
atomic.Value predates generics. It stores an interface{}:
But interface{} is two words (type pointer + data pointer). Storing two words atomically is hardware-dependent. The implementation uses a careful protocol.
The implementation strategy¶
type efaceWords struct {
typ unsafe.Pointer
data unsafe.Pointer
}
func (v *Value) Load() (val any) {
vp := (*efaceWords)(unsafe.Pointer(v))
typ := LoadPointer(&vp.typ)
if typ == nil || typ == unsafe.Pointer(&firstStoreInProgress) {
return nil
}
data := LoadPointer(&vp.data)
vlp := (*efaceWords)(unsafe.Pointer(&val))
vlp.typ = typ
vlp.data = data
return
}
The Load reads the type word first. If it's nil (initial state) or &firstStoreInProgress (a sentinel during the very first store), no value is available.
The Store is more complex. The first Store uses a CAS to install the type word, then writes the data word. Subsequent Stores write the data word atomically, having first verified the type matches:
func (v *Value) Store(val any) {
if val == nil {
panic("sync/atomic: store of nil value into Value")
}
vp := (*efaceWords)(unsafe.Pointer(v))
vlp := (*efaceWords)(unsafe.Pointer(&val))
for {
typ := LoadPointer(&vp.typ)
if typ == nil {
// First store; CAS in the type word
runtime_procPin()
if !CompareAndSwapPointer(&vp.typ, nil, unsafe.Pointer(&firstStoreInProgress)) {
runtime_procUnpin()
continue
}
StorePointer(&vp.data, vlp.data)
StorePointer(&vp.typ, vlp.typ)
runtime_procUnpin()
return
}
if typ == unsafe.Pointer(&firstStoreInProgress) {
continue
}
if typ != vlp.typ {
panic("sync/atomic: store of inconsistently typed value into Value")
}
StorePointer(&vp.data, vlp.data)
return
}
}
Key observations:
- The first store CASes a sentinel into the type word, then stores the data, then stores the real type. Loaders that see the sentinel retry.
runtime_procPin()pins the goroutine to its M to prevent preemption during the first-store protocol — without this, a preempted goroutine could leave the value in an inconsistent state observable by another goroutine.- Subsequent stores require the type to match. Mismatched types panic.
Why atomic.Pointer[T] is faster¶
atomic.Pointer[T] stores a single word (the pointer). One MOV (Load) or one XCHG (Store). No protocol, no sentinel, no pinning, no panic risk.
atomic.Value stores an interface, which is two words. The protocol adds overhead. The type check (and potential panic) adds risk.
Benchmark:
BenchmarkAtomicValueLoad-8 250000000 5.5 ns/op
BenchmarkAtomicPointerLoad-8 500000000 1.6 ns/op
BenchmarkAtomicValueStore-8 50000000 25.0 ns/op
BenchmarkAtomicPointerStore-8 150000000 7.5 ns/op
3–4× difference. For new code on Go 1.19+, always use atomic.Pointer[T].
Codegen Across Architectures¶
The Go compiler generates different instructions for atomic operations depending on target architecture.
amd64 (x86-64)¶
Strong memory model. Loads and stores of aligned 64-bit values are atomic by hardware. The compiler emits:
Load: plainMOVQ.Store:XCHGQ(for the full-fence sequential consistency requirement).CompareAndSwap:LOCK CMPXCHG.
ARM64¶
Weaker memory model. The compiler uses load-acquire and store-release instructions:
Load:LDAR(load-acquire register).Store:STLR(store-release register).CompareAndSwap:CAS(Go 1.20+ when the host supports the LSE extension) or an LL/SC loop withLDAXR/STLXR.
These instructions ensure ordering with respect to other memory operations but are slightly more expensive than plain MOV.
riscv64¶
Weak memory model. Uses LR.W/SC.W (load-reserved / store-conditional) for atomics. Memory fences via FENCE instructions.
arm (32-bit)¶
Older ARM lacks atomic 64-bit operations in some configurations. Go falls back to using locks for 64-bit atomics on these targets. Performance is much worse — avoid 32-bit ARM if you can.
Why this matters¶
For COW, the implication is:
- On x86, COW Load is extremely cheap — basically free.
- On ARM64, slightly more expensive but still fast.
- On older ARM, atomic costs are dominated by lock overhead — measure carefully.
You usually do not target older ARM. But knowing the difference helps explain why benchmarks vary across machines.
Inspecting codegen¶
To see the actual assembly:
Or use the playground's "Show Assembly" feature. For atomic.Pointer.Load, you should see a single MOV (on amd64) inlined into the caller.
Memory Ordering Primitives¶
Beyond Go's sequential consistency, hardware exposes finer ordering. Understanding these helps explain costs.
Memory barriers¶
- Load fence (LFENCE on x86): all earlier loads complete before later loads.
- Store fence (SFENCE): all earlier stores complete before later stores.
- Full fence (MFENCE): all earlier operations complete before later operations.
x86's strong model means LFENCE and SFENCE are largely unnecessary (loads and stores are naturally ordered relative to themselves). MFENCE is needed for store-to-load ordering, which x86 does not guarantee by default.
Acquire-release semantics¶
- Acquire: a load that becomes a happens-before edge from any store that happens-before-it.
- Release: a store that establishes happens-before to any later acquire-load of the same location.
This is what C++ memory_order_acquire / memory_order_release express. Go does not expose these — every atomic in Go is sequentially consistent.
For COW publication, acquire-release would suffice: - Writer's Store needs release semantics so prior writes (snapshot construction) are visible. - Reader's Load needs acquire semantics so subsequent reads (snapshot fields) see the published state.
Sequential consistency is strictly stronger than acquire-release. Go pays the small extra cost in exchange for simplicity.
Happens-before vs synchronizes-with¶
The Go memory model uses "synchronized before" (capital S) — synonymous with happens-before for our purposes. The C++ memory model uses "synchronizes with" for the inter-thread relation and "happens before" for the combined relation.
For COW purposes, the distinction is academic. The important fact: atomic Load and Store create an ordering edge that orders all non-atomic memory accesses on either side.
GC Interaction With Snapshots¶
The garbage collector's role in COW is what makes the pattern practical in Go. Understanding the interaction is critical.
Reachability tracing¶
Go's GC is a tricolor mark-and-sweep. It traces from roots (globals, stacks, etc.) through reachable pointers, marking everything live. Anything unreachable is swept.
In a COW system, the live snapshots are:
- The current snapshot (reachable from
atomic.Pointer[T]). - Any snapshot held in a local variable of a running goroutine.
- Any snapshot referenced from another live object (closures, channels, etc.).
Old snapshots become unreachable as soon as no live reference exists. The GC reclaims them on the next cycle.
Write barriers¶
When the writer does ptr.Store(next), the GC must learn that next is now reachable from a root. This is the job of the write barrier.
A write barrier is a small piece of code the compiler inserts around every pointer write. In Go's hybrid concurrent collector, it works (simplified):
// Before:
ptr = next
// With write barrier (conceptually):
oldPtr := ptr
ptr = next
gcWriteBarrier(oldPtr, next)
gcWriteBarrier informs the GC that next is now reachable (and oldPtr may have been freed if it was the last reference).
For atomic.Pointer[T].Store, the write barrier is woven into the atomic operation. The runtime ensures that the GC sees the new pointer atomically with the write.
GC during a COW publish¶
A snapshot publish is one atomic Store. During the Store:
- The new snapshot pointer becomes the value of
atomic.Pointer.v. - The write barrier marks the new snapshot as reachable from the atomic pointer.
- The old snapshot is no longer reachable through this pointer (other references may exist).
The Store completes in nanoseconds. The GC may run concurrently — it tracks the pointer change via the write barrier, even if the mark phase is in progress.
When the GC runs¶
Go's GC is triggered by heap growth. With GOGC=100, it runs whenever the heap doubles since last GC. For a COW system with 5 MB live heap and rapid writes producing garbage, the GC may run every few hundred milliseconds.
GC pauses (stop-the-world) are typically <1 ms for heaps under a few GB. The hybrid concurrent collector keeps pauses short by doing most work concurrently with goroutine execution.
Snapshot pinning and GC¶
A long-lived goroutine holding a snapshot pointer pins that snapshot. The GC cannot reclaim it as long as the goroutine is alive and its stack contains the reference.
Stacks are precisely scanned: the GC knows exactly which slots contain pointers vs scalars. A local variable holding *Snapshot is a pointer; the GC sees it.
After the variable is re-assigned or the function returns, the reference is gone. The next GC cycle reclaims the snapshot if no other references exist.
Soft memory limits¶
Go 1.19+ supports a soft memory limit:
When live heap approaches the limit, the GC runs more aggressively. For COW systems with bursty writes that may pin many snapshots simultaneously, this can prevent OOM.
Final note on reachability¶
Reachability is exact — no false positives. If a snapshot is unreachable, the GC will eventually reclaim it. Memory leaks in Go COW systems are reachability bugs: somewhere, a path from a root to the old snapshot still exists, usually through a long-lived data structure or goroutine.
Write Barriers and Pointer Writes¶
Go's write barrier deserves a closer look because it affects atomic pointer performance.
What it is¶
The write barrier is a function the compiler inserts at every pointer write to a heap object. Its purpose: maintain the GC's invariants during concurrent marking.
In Go's hybrid concurrent collector, the write barrier is needed only during the marking phase. Outside of marking (most of the time), pointer writes proceed without the barrier.
Cost¶
A typical pointer write with the write barrier active costs ~10-30 ns. Without the barrier (most of the time), it's a few ns.
For atomic.Pointer.Store, the cost is dominated by the atomic instruction itself (~10 ns) plus the conditional write barrier overhead. Net: ~10-30 ns per store.
When the barrier fires¶
The barrier checks a global flag (writeBarrier.enabled). When false (mutator phase), the barrier is a no-op. When true (concurrent marking), the barrier records the pointer write so the GC can update its mark bits.
The atomic.Store codegen includes the barrier check:
TEXT runtime/internal/atomic.StorePointer(SB)
// ... atomic store
CMPB writeBarrier.enabled, $0
JEQ done
CALL gcWriteBarrier
done:
RET
The check is one byte-load and one branch — essentially free unless the barrier is active.
Implications for COW¶
- Atomic stores are not always 10 ns; they may spike to 30-50 ns during GC.
- High-frequency COW writes during GC can prolong the marking phase.
- For latency-sensitive systems, monitor GC overlap with write bursts.
runtime.GC immediately after a publish¶
A pattern in deterministic-latency systems:
This is rarely worth the cost (~ms per runtime.GC). Use only when you absolutely need the old snapshot reclaimed immediately.
Escape Analysis and Snapshots¶
Escape analysis decides whether a value lives on the stack or the heap. For COW, snapshots usually escape — but understanding when and why illuminates the pattern.
When a snapshot escapes¶
A value escapes if its lifetime exceeds the function that created it. For COW writers:
func Update() {
next := *cur.Load() // (1) shallow copy on the stack initially
next.Field = newVal // (2) mutate stack value
cur.Store(&next) // (3) escape! address taken and stored to a heap-reachable location
}
After (3), next must live on the heap because its address is stored in cur, which is reachable from elsewhere.
The compiler detects this and allocates next on the heap from the start. Build with -gcflags '-m' to see escape decisions:
When a snapshot does not escape¶
If you build a "snapshot" but never publish it (e.g., for use within a single function), it can stay on the stack:
func computeOnce() Value {
temp := Snapshot{...} // never escapes
return temp.Field // returned value escapes; temp doesn't
}
This is rare for production COW code. Almost every snapshot is published, hence escapes.
Cost¶
Stack allocations are essentially free. Heap allocations cost ~10 ns each, plus future GC work.
For COW writers, heap allocations are unavoidable. The cost is amortized over the snapshot's lifetime.
Pre-allocation tricks¶
You can pre-allocate snapshots and reuse them via sync.Pool, but only if the old snapshot is no longer reachable. This is hard to guarantee correctly — readers may still hold references.
The pragmatic answer: let the GC handle reclamation. Pre-allocation is a niche optimization.
Cache-Line Effects¶
Modern CPUs work in 64-byte cache lines. Memory access patterns affect performance enormously through caching.
Snapshot pointer location¶
The atomic.Pointer.v field is one word (8 bytes). It sits in a cache line with other fields of its containing struct.
If the containing struct has many fields, the atomic pointer shares a cache line with them. Writing to those fields invalidates the cache line for all readers of the atomic pointer.
type Store struct {
cur atomic.Pointer[Config] // hot
mu sync.Mutex // hot during writes
updates atomic.Int64 // hot
created time.Time // cold
}
If cur, mu, and updates share a cache line, a single write to updates invalidates the line that readers loading cur need. Readers stall on cache misses.
Cache-line padding¶
To isolate the atomic pointer:
The padding ensures cur lives in its own cache line. Reads do not contend with writes to other fields.
When padding matters¶
- High-contention COW systems with many readers and writes from other goroutines.
- Multi-socket NUMA systems where cache-line ping-pong is expensive.
For most Go services with moderate read rates, padding is overkill. Measure before applying.
Cache locality of the snapshot¶
The snapshot itself benefits from locality. A compact struct with frequently-accessed fields near each other reads faster than a sparse one:
// Good: hot fields together
type Config struct {
Timeout time.Duration
MaxConns int
LogLevel string
// ... cold fields below
Description string
OperatorEmail string
}
Putting Description first means every Load + Read accesses a different cache line than the hot fields.
Cache pressure on reads¶
A read of cur.Load() + .SomeField touches at most 2 cache lines (the atomic pointer, and the snapshot's hot region). Each is 64 bytes. 128 bytes of cache used per logical read.
At 1 M reads/sec, you touch 128 MB/sec of cache footprint — well within L1/L2.
If the snapshot is huge (10 MB), most reads touch only the hot prefix. The cold parts age out of cache between accesses.
False Sharing in COW Systems¶
False sharing happens when unrelated data shares a cache line and writes to one part invalidate the line for readers of another.
Classic example¶
a and b likely share a cache line. A writer incrementing a invalidates the cache line for any reader of b. Even though the data is independent, the CPU sees the line as contended.
In COW¶
cur and mu may share a cache line. The writer locks mu (writing to its internal state), invalidating the line that contains cur. Readers of cur then experience cache misses.
For most workloads this is unmeasurable. For 1 M reads/sec across many cores with active writers, it can add noticeable latency.
Mitigation¶
Pad to separate cache lines:
64 bytes minus the 8-byte pointer = 56 bytes of padding.
Go provides runtime.NumCPU() and you can derive the line size with platform-specific knowledge. The constant is typically 64 on x86 and 128 on some ARM.
Measurement¶
Use perf (Linux) or Intel VTune (cross-platform) to measure cache misses. In Go, look for runtime.MemStats.NumGC and runtime.ReadMemStats for indirect signals.
The simpler test: benchmark with and without padding. If padding helps, you had false sharing.
RCU in the Linux Kernel Detail¶
RCU in the Linux kernel is the gold-standard implementation of COW-style patterns. Studying it deepens understanding.
The RCU API¶
// Reader:
rcu_read_lock();
ptr = rcu_dereference(global_ptr);
use(ptr->field);
rcu_read_unlock();
// Writer:
new = build_new();
old = global_ptr;
rcu_assign_pointer(global_ptr, new);
synchronize_rcu(); // wait for grace period
kfree(old);
rcu_read_lock and rcu_read_unlock are usually no-ops (in the default RCU "preemptible" mode they only disable preemption). They mark "I am in a critical section reading the structure."
rcu_dereference loads the pointer with the appropriate memory barrier.
rcu_assign_pointer stores the new pointer with the appropriate memory barrier.
synchronize_rcu waits for a grace period — the point at which all readers that were active when the writer published are guaranteed to have finished.
Grace period detection¶
The kernel knows a CPU has finished any reading critical section by observing it pass through a "quiescent state":
- Voluntary context switch.
- Returning to user mode.
- Idle.
Once every CPU has passed through a quiescent state since the publish, the writer can safely free the old structure.
Why Go is simpler¶
In Go, the equivalent of synchronize_rcu is "wait for the GC to reclaim it." But you don't wait explicitly — the GC handles it.
The trade-off: Go's GC has more overhead than the kernel's RCU but is much easier to use. For most workloads, Go's approach is preferable. For the highest-performance kernel code, manual RCU wins.
Lessons applicable to Go¶
- The notion of a "reader critical section" maps to "the lifetime of a Go variable holding the snapshot pointer."
- The notion of "grace period" maps to "GC cycle that proves no goroutine references the old snapshot."
- The kernel's
rcu_dereferencecorresponds to Go'satomic.Pointer.Load.
Understanding the kernel's discipline can guide Go code: keep reader critical sections short; don't pin snapshots across long operations.
Hazard Pointers and Epoch-Based Reclamation¶
For systems where the GC's reclamation is insufficient — typically because you must close a resource as soon as it's no longer in use — there are two manual techniques.
Hazard pointers¶
Each thread publishes a "hazard pointer": the pointer it is currently reading. Before reclaiming, the writer checks all hazard pointers; if any matches the pointer to reclaim, wait.
var hazards [N]atomic.Pointer[Snapshot]
// Reader:
func Read(id int) *Snapshot {
for {
p := global.Load()
hazards[id].Store(p)
// double-check
if global.Load() == p {
return p
}
}
}
// Writer:
func Publish(next *Snapshot) {
old := global.Swap(next)
for {
safe := true
for _, h := range hazards {
if h.Load() == old {
safe = false
break
}
}
if safe { break }
runtime.Gosched()
}
old.Free()
}
Hazard pointers give deterministic reclamation but add complexity.
Epoch-based reclamation (EBR)¶
Threads operate within "epochs" that advance globally. Reclamation defers until all threads have advanced past the epoch of the old data.
type Epoch struct {
global atomic.Int64
local [N]atomic.Int64
}
func (e *Epoch) Enter(id int) {
e.local[id].Store(e.global.Load())
}
func (e *Epoch) Exit(id int) {
e.local[id].Store(-1)
}
func (e *Epoch) Reclaim(g int64, fn func()) {
for {
safe := true
for i := 0; i < N; i++ {
le := e.local[i].Load()
if le != -1 && le <= g {
safe = false
break
}
}
if safe {
fn()
return
}
runtime.Gosched()
}
}
Less per-thread overhead than hazard pointers, but more global synchronization.
When to use either in Go¶
Almost never. The Go GC is suitable for 99% of cases. Reach for hazard pointers or EBR only when:
- You're integrating with non-Go code that requires deterministic resource management.
- You're writing a Go runtime extension.
- You're benchmarking and the GC's overhead is provably unacceptable.
For application-level COW, trust the GC.
Lock-Free Programming Subtleties¶
Some COW designs venture into lock-free territory. The subtleties:
The ABA problem¶
Pointer A is replaced by B, then back to A. A CAS comparing the current pointer to A succeeds incorrectly — the structure has changed even though the pointer matches.
In Go, ABA is essentially impossible because every &next allocates a fresh pointer (the GC doesn't reuse pointers while they're reachable). Each snapshot has a unique identity throughout its lifetime.
In C, ABA is a constant concern, addressed by version-stamped pointers or hazard pointers.
Memory reuse¶
C systems may reuse freed memory immediately. A reader holding a "freed" pointer may suddenly find unrelated data there. Hazard pointers / EBR prevent this.
Go avoids this entirely: a freed pointer cannot be reused while any reader holds it (GC reachability ensures this).
CAS retry loops¶
A CAS loop with N writers and contention causes average N retries per write. Total work = N² operations. Watch for this in benchmarks.
Liveness¶
Lock-free does not mean wait-free. A CAS loop can theoretically retry forever under starvation. In practice, hardware bounds this, but pathological cases exist.
Linearizability¶
A linearizable operation appears to take effect at a single instant between invocation and response. atomic.Pointer.Store is linearizable. A sequence of Load + non-atomic field access is not linearizable — the field access happens after the Load.
For COW, this means: the publish is linearizable, but the reader's full operation (Load + traverse) is not. The reader may see a "stale" snapshot relative to wall-clock time. This is intentional and acceptable for the COW pattern.
Weak References and Snapshot Caches¶
Go 1.24+ introduces weak references via runtime/weak. They enable snapshot-aware caches without preventing GC.
The problem they solve¶
A cache keyed on *Snapshot keeps the snapshot alive:
Even after a new snapshot is published, the old one remains alive because cache references it. Memory grows.
Weak references¶
A weak pointer does not keep the target alive. When the GC determines the target is otherwise unreachable, the weak pointer is "cleared" — calling .Value() returns nil.
For snapshot caches, this allows entries to be reclaimed automatically when their snapshot is no longer current.
Caveats¶
- Weak references add a small overhead per access.
- Not all GC operations finalize weak references immediately; there may be a one-cycle delay.
- The API is new; expect refinements.
For most COW caches, plain caches with explicit eviction work fine. Weak references are an option to keep in mind for memory-sensitive scenarios.
unsafe.Pointer for Advanced Cases¶
Most COW code does not need unsafe.Pointer. A few situations warrant it.
Case 1: Pre-Go-1.19 codebases¶
atomic.LoadPointer and atomic.StorePointer operate on unsafe.Pointer. If you support pre-generic Go, you use them directly:
var p unsafe.Pointer // points to *Config
func Load() *Config {
return (*Config)(atomic.LoadPointer(&p))
}
func Store(c *Config) {
atomic.StorePointer(&p, unsafe.Pointer(c))
}
Verbose and type-unsafe. Migrate to atomic.Pointer[T] as soon as possible.
Case 2: Type-erased snapshots¶
A library that handles snapshots of arbitrary type without using generics:
type Store struct {
v unsafe.Pointer
}
func (s *Store) Load() unsafe.Pointer { return atomic.LoadPointer(&s.v) }
func (s *Store) Store(p unsafe.Pointer) { atomic.StorePointer(&s.v, p) }
Callers cast to/from concrete types. Lots of opportunity for bugs.
Case 3: Interfacing with C via cgo¶
If you must share state with C code that reads/writes the pointer, unsafe.Pointer is the only way.
Case 4: SeqLock-like protocols¶
Some lock-free designs need to read multiple words atomically. Go does not have 128-bit atomic types, so you may use unsafe for layout control.
General advice¶
Avoid unsafe.Pointer in COW code. The type-safe atomic.Pointer[T] is sufficient for 99% of cases. When you must use unsafe, isolate it to one package and document heavily.
Assembly-Level Considerations¶
For the curious, here is what atomic.Pointer.Load looks like at the assembly level on amd64.
Source¶
Compiled (simplified)¶
TEXT runtime/atomic.LoadPointer(SB), NOSPLIT, $0-16
MOVQ ptr+0(FP), AX ; load argument (pointer to atomic.Pointer.v)
MOVQ (AX), AX ; dereference; atomic on aligned 64-bit
MOVQ AX, ret+8(FP) ; return value
RET
A single MOVQ does the work. The x86 memory model guarantees this is atomic for aligned 8-byte accesses.
Compiled (with full SC semantics for amd64)¶
The runtime ensures the load happens before any later non-atomic loads. On amd64, this is implicit (TSO memory model). No fences needed.
On ARM64¶
TEXT runtime/atomic.LoadPointer(SB), NOSPLIT, $0-16
MOVD ptr+0(FP), R0
LDAR R0, R0 ; load-acquire register
MOVD R0, ret+8(FP)
RET
LDAR (load-acquire register) is the ARM64 load-acquire instruction. It costs slightly more than a plain LDR but provides the needed ordering.
Performance¶
| Architecture | Load cost | Notes |
|---|---|---|
| amd64 | ~1.5 ns | Cache hit; bare MOV |
| amd64 (cache miss) | ~50 ns | L3 or main memory |
| arm64 | ~3 ns | LDAR slightly slower than LDR |
| riscv64 | ~3 ns | Plus fence |
For COW, the L1-cache-hit path is what you optimize for. Hot snapshot pointers stay in L1; cold ones pay the miss penalty.
Profiling Lock-Free COW¶
Profiling lock-free systems requires different tools than profiling lock-based ones.
pprof mutex profile¶
Captures contention on sync.Mutex and sync.RWMutex. For COW with a writer mutex, this shows time spent waiting on the writer mutex.
pprof block profile¶
Captures goroutine blocking (channel send/recv, mutex Lock, etc.). Useful for finding subscribers stuck on channel reads.
Goroutine profile¶
Stack traces for all live goroutines. Useful for finding pinned snapshots and stuck readers.
Heap profile¶
Allocations by call site. Find write-heavy code paths.
Trace¶
Full goroutine and GC trace. Visualize with go tool trace. Shows GC pauses, write barrier activity, and goroutine scheduling.
Custom metrics¶
Embed counters in the COW store:
Emit periodically to your metrics backend.
runtime.GoroutineProfile¶
Programmatic access to goroutine info. Tag goroutines with pprof.Labels to identify snapshot holders.
ctx = pprof.WithLabels(ctx, pprof.Labels("snapshot", fmt.Sprintf("v%d", snap.Version)))
pprof.SetGoroutineLabels(ctx)
Later, pprof.Lookup("goroutine") shows each goroutine's snapshot version.
Correctness Proofs and Race Conditions¶
For high-assurance code, formal reasoning helps.
The publication invariant¶
A COW snapshot is correctly published if:
- Every field of the snapshot is initialized before
Store. - No field is modified after
Store.
The Go memory model guarantees:
- Writes to the snapshot's fields (in the writer goroutine) happen-before the
Store. - The
Storehappens-before the reader'sLoadof the same value. - The reader's
Loadhappens-before its access of the snapshot's fields.
Transitively: writes happen-before reads. The snapshot is consistently visible.
The mutation invariant¶
If the snapshot is mutated after Store, the invariant breaks. Readers may see partial mutations because there is no happens-before edge between the mutation and the reads.
This is why "never mutate a published snapshot" is the central rule. The race detector catches violations.
The writer serialization invariant¶
If two writers concurrently Store based on the same loaded snapshot, one of their updates is lost. Even if both writes are atomic, the read-modify-write sequence is not.
The fix: writer mutex (serializes the entire RMW) or CAS loop (retries on conflict).
Linearizability of atomic.Pointer.Store¶
atomic.Pointer.Store is linearizable: it appears to take effect at a single instant. Two concurrent Stores have a well-defined order: the second observes the first's effect.
The reader's Load is linearizable: it sees the value as of the moment it executed.
The combined operation Load + read field is not linearizable — the field read happens at some moment after the Load. But this is acceptable for COW because the snapshot is immutable after publication.
Common race conditions¶
- Mutated snapshot. Writer modifies fields after
Store. - Aliased slice/map. Writer's
next.Sliceshares backing withold.Slice;nextis published; writer appends. - Lost update. Two writers race; one's
Storeoverwrites the other. - Stale read. Reader loads at T1; writer publishes at T2 > T1; reader proceeds with stale snapshot. Often not a bug — snapshot is intended to be stable. Bug only if business logic assumes freshness.
- Snapshot leak. Long-lived goroutine holds snapshot; new snapshots are also retained because the old chain is reachable.
The race detector catches 1, 2, and 3 reliably. 4 is by design. 5 requires heap profiling.
Designing for Hardware¶
Production COW for the highest-performance use cases requires hardware awareness.
NUMA¶
On multi-socket systems, memory access is faster within a NUMA node than across. The current snapshot's pointer location and the snapshot's data may be in different nodes.
Mitigation: pin the writer to one node; readers on the same node have fast access. Cross-node readers pay extra latency.
Go's runtime is not NUMA-aware by default. For extreme performance, you may need to launch one COW system per NUMA node.
Cache-line size¶
x86 and ARM64 are 64 bytes. Some platforms are 128. Align hot structures accordingly.
Branch prediction¶
atomic.Pointer.Store includes a write-barrier check. The check almost always falls through (the barrier is off most of the time), so the branch is well-predicted. Don't try to "optimize" by removing it — you'd be doing the runtime's job, badly.
TLB pressure¶
Very large snapshots (> tens of MB) may stress the TLB. Modern CPUs handle this with huge pages, but Go applications usually do not use huge pages.
For most COW systems this is negligible.
Cache coherence protocols¶
MESI / MOESI / MESIF determine how cache lines transit between cores. Heavily-shared read-only lines (like the snapshot pointer) settle into "Shared" state across many cores with no further coherence traffic. Writers must transition the line to "Modified" / "Exclusive", briefly stalling readers.
A COW writer that publishes 1000 times per second invalidates the cache line 1000 times per second. With 64 cores each reading the line, that's 64 000 cache misses per second. Still tiny in absolute terms (microseconds total), but noticeable in some benchmarks.
Self-Assessment¶
A professional-level engineer should be able to answer:
- Describe the Go memory model's happens-before relation and how it applies to COW.
- Explain the difference between
atomic.Pointer[T].Loadandatomic.Value.Loadat the instruction level. - Walk through the codegen for
atomic.Pointer.Storeon amd64 vs arm64. - Explain how the GC's write barrier interacts with
atomic.Pointer.Store. - Describe what happens during a GC cycle when many snapshots are pinned by in-flight readers.
- Identify a false-sharing problem in a COW store and apply padding.
- Compare Go's GC-based reclamation with the Linux kernel's RCU.
- Explain when hazard pointers or EBR might be needed in Go.
- Describe the ABA problem and why it does not occur in Go COW.
- Use
pprofmutex and block profiles to debug a contended COW system. - Read and explain the runtime source for
atomic.Pointer[T].
If you can do all of these, you have professional-level mastery.
Summary¶
At the professional level, COW becomes a question of runtime mechanics. You understand the memory model that makes atomic.Pointer[T].Load correct without a lock. You can read the codegen and explain why it costs what it costs on different architectures. You understand the GC's role in reclamation and the write barrier's cost. You know when to reach for hazard pointers or weak references and when the GC's automatic reclamation suffices.
This file completes the COW curriculum. The remaining files in this sub-section (specification, interview, tasks, find-bug, optimize) are reference material applying the knowledge built across all four levels.
Appendix: Reading the Runtime Source¶
A useful exercise: read the runtime source for atomic.Pointer[T]. Files:
src/sync/atomic/type.go— the public API.src/sync/atomic/doc.go— documentation.src/runtime/internal/atomic/atomic_amd64.s— amd64 implementation.src/runtime/internal/atomic/atomic_arm64.s— arm64 implementation.src/runtime/mbarrier.go— write barrier.src/runtime/mgcmark.go— GC marking.
Tracing a single atomic.Pointer[Config].Store(c) from source to instruction:
Pointer[T].Store(val *T)callsStorePointer(&x.v, unsafe.Pointer(val)).StorePointeris a runtime intrinsic. The compiler recognizes it and emits the architecture-specific assembly.- On amd64, this is a single
XCHGQ(with implicit lock prefix for fence semantics). - Before the XCHG, the write barrier check fires. If GC is in concurrent marking, it calls
gcWriteBarrier. - The XCHG completes; the store is visible to all CPUs.
End-to-end: 5-15 ns depending on GC state and cache. For a system shipping millions of stores per day, this is fast enough that the rest of the system dominates.
Appendix: A Formal Look at Sequential Consistency¶
Sequential consistency (SC) is the strongest practical memory model. Formally, an execution is sequentially consistent if there exists a total order on all atomic operations such that:
- The total order is consistent with each goroutine's program order.
- Each atomic load returns the value of the most recent (in the total order) atomic store to the same variable.
For Go, every atomic operation participates in this total order. Non-atomic operations do not — they are ordered only by happens-before edges established by atomic operations.
Why SC is enough for COW¶
Consider two writers and one reader:
Writer A: Writer B: Reader:
build snapA build snapB load ptr
Store(ptr, snapA) Store(ptr, snapB) use(snap.field)
SC says there's a total order on the two Stores and the one Load. Cases:
- Order: A's Store, B's Store, R's Load. Reader sees snapB.
- Order: A's Store, R's Load, B's Store. Reader sees snapA.
- Order: B's Store, A's Store, R's Load. Reader sees snapA.
- Order: B's Store, R's Load, A's Store. Reader sees snapB.
- Order: R's Load, A's Store, B's Store. Reader sees whatever was stored before (initial value).
- Order: R's Load, B's Store, A's Store. Reader sees initial value.
In every case, the reader sees a single, fully-constructed snapshot. There is no case where the reader sees a half-built snapshot.
The reader's subsequent field accesses, while not atomic, are ordered after the Load by program order. The snapshot's fields were written before its Store (program order in writer). Combined, every field access sees the correctly-written value.
This is the formal proof of COW's correctness.
Appendix: GC Pause Analysis¶
Modern Go GC pauses are very short, but understanding their composition matters.
A GC cycle has phases:
- STW start (microseconds). Stop all goroutines briefly to set up.
- Concurrent mark (milliseconds-seconds). GC and goroutines run together; goroutines pay the write barrier cost.
- STW mark termination (microseconds). Finalize marking.
- Concurrent sweep (background). Free unreachable memory.
For COW: - Phase 2 is when the write barrier is active. Atomic Stores cost slightly more. - Phase 4 reclaims old snapshots.
Tuning levers: - GOGC: higher value = less frequent GC, more memory used. - GOMEMLIMIT: cap heap; GC more aggressive near the limit. - Pacing: GC tries to balance allocation rate with reclaim rate.
For COW systems with bursty writes (e.g., a deploy that triggers many config reloads), pausing GC briefly to handle the burst can be done with debug.SetGCPercent(-1) (disable) ... but rarely advisable.
Appendix: When to Drop to Assembly¶
In 1 in 1000 cases, you may benefit from hand-written assembly for COW operations.
Scenarios: - A custom synchronization primitive not exposed by sync/atomic. - 128-bit atomic operations (Go has no native support). - Hardware-specific instructions (e.g., x86's CLFLUSH for cache control).
The Go assembler is documented in cmd/asm. Files end in .s and use Plan 9 syntax (different from gas).
For 99.9% of COW code, the standard library is sufficient. Reach for assembly only when profiling proves it necessary and reviewing engineers agree.
Appendix: Cross-Reference Tables¶
Mapping Go API to Hardware¶
| Go API | x86 instruction | ARM64 instruction |
|---|---|---|
Pointer.Load() | MOVQ | LDAR |
Pointer.Store() | XCHGQ | STLR |
Pointer.Swap() | XCHGQ | SWPAR |
Pointer.CompareAndSwap() | LOCK CMPXCHGQ | CASAR (LSE) or LL/SC |
Int64.Add() | LOCK XADDQ | LDADD (LSE) or LL/SC |
Mapping COW Patterns to Memory Model Edges¶
| Pattern | HB edge |
|---|---|
| Writer writes snapshot fields, then Store | Field writes → Store |
| Store(A) on goroutine 1, Load returning A on goroutine 2 | Store → Load |
| Load returning A, then access A's fields | Load → field access |
| End-to-end | Writer's field writes → Reader's field access |
Mapping Concerns to Tools¶
| Concern | Tool |
|---|---|
| Race conditions | go test -race |
| Mutex contention | runtime.SetMutexProfileFraction(1) + pprof |
| GC overhead | runtime/metrics, runtime.ReadMemStats |
| Snapshot pinning | pprof goroutine + custom labels |
| Cache misses | perf stat -e cache-misses (Linux) |
| Memory allocation | pprof allocs |
| End-to-end traces | go tool trace |
Final Professional-Level Synthesis¶
Copy-on-write at the professional level is the intersection of:
- The Go memory model.
- The Go runtime (atomic primitives, GC, write barriers).
- Architecture-specific codegen and hardware memory models.
- Low-level diagnostic tools (pprof, perf, trace).
A professional Go engineer treats COW as a transparent abstraction over these layers. They can debug a memory-leak bug by reading goroutine profiles. They can explain why an atomic.Pointer.Store costs more on ARM than on x86. They understand when the GC's reclamation is sufficient and when manual schemes are needed.
This depth is rarely needed in day-to-day work. But when a production issue surfaces — a mysterious latency spike, a heap that won't shrink, a benchmark that varies wildly across machines — the professional-level engineer is the one who finds the root cause.
The remaining files in this section (specification, interview, tasks, find-bug, optimize) build on this depth with reference material, practice problems, and bug-finding exercises. Use them to reinforce and apply what you've learned across all four levels.
Appendix: Glossary of Professional-Level Concepts¶
| Term | Meaning |
|---|---|
| Happens-before | The fundamental ordering relation in the Go memory model. |
| Sequential consistency | Strongest practical memory model; Go's choice for atomic operations. |
| Acquire-release | Finer-grained ordering; not exposed in Go. |
| Memory fence | Hardware instruction enforcing ordering between memory operations. |
| Write barrier | Runtime hook that informs the GC of pointer writes. |
| Cache line | The unit of memory transfer in modern CPUs, typically 64 bytes. |
| False sharing | Cache invalidation due to unrelated data on the same line. |
| NUMA | Non-uniform memory access; multi-socket memory architecture. |
| Tricolor GC | Go's mark-and-sweep algorithm using white/grey/black coloring. |
| STW (stop-the-world) | A brief pause where all goroutines stop for GC bookkeeping. |
| Escape analysis | Compile-time decision whether a value lives on stack or heap. |
| Hazard pointer | Manual reclamation technique: thread publishes which pointer it reads. |
| Epoch-based reclamation | Manual reclamation technique: threads advance through epochs. |
| ABA problem | A pointer change that CAS does not detect. |
| Linearizability | An operation appearing to take effect at a single instant. |
| TSO (Total Store Order) | x86's memory model: stronger than acquire-release, weaker than SC. |
| LL/SC (Load-Linked / Store-Conditional) | RISC primitive for atomic operations. |
A professional engineer recognizes all of these concepts and their relevance to COW.
Closing Notes for Professional Level¶
This is the deepest layer of COW understanding. Below this lies hardware engineering, which is generally outside the Go engineer's purview.
The journey from junior to professional has been:
- Junior: the pattern, the API, the trivial mistakes.
- Middle: the engineering, the trade-offs, the production patterns.
- Senior: the algorithms, the architecture, the failure modes at scale.
- Professional: the substrate, the runtime, the formal model.
You now have the full picture. Use it judiciously. The right COW design is usually the simplest one that meets requirements; deeper knowledge informs when "simplest" doesn't suffice.
Production engineering rewards depth even when day-to-day code doesn't show it. The day a strange bug surfaces is the day this knowledge pays for itself.
Appendix: A Tour of the runtime/internal/atomic Package¶
For the truly curious, a tour of src/runtime/internal/atomic:
atomic_amd64.s— x86-64 implementations.atomic_arm64.s— ARM64 implementations.atomic_mips64x.s,atomic_ppc64x.s,atomic_riscv64.s,atomic_s390x.s— other arches.types.go— the runtime's own atomic types (separate fromsync/atomic).unaligned.go— fallbacks for unaligned access.
The package is internal to the runtime and not exposed to user code. But reading it teaches Go's implementation choices for atomic operations.
A particularly interesting file: atomic_amd64.go, which contains Go signatures with //go:noescape and //go:nosplit directives marking the intrinsics. The compiler intercepts these and substitutes inline assembly.
Appendix: Cross-Language Memory Model Comparison¶
| Language | Default atomics | Stronger | Weaker |
|---|---|---|---|
| Go | Sequential consistency | — | — |
| Java | Sequential consistency for volatile | — | — |
| C++11+ | Acquire-release | Sequential consistency | Relaxed |
| Rust | Acquire-release | SeqCst | Relaxed |
| Python | GIL (no real concurrency) | — | — |
| JavaScript | SC for Atomics object | — | — |
Go's choice of SC for all atomics is friendlier to programmers but slightly more expensive than acquire-release. The trade-off is intentional: simplicity for correctness.
Appendix: References for Deep Study¶
- The Go Programming Language Specification:
go.dev/ref/spec. - The Go Memory Model:
go.dev/ref/mem. - The Go Programming Language (Donovan & Kernighan).
- Concurrent Programming in Go (Cox-Buday).
- Programming with POSIX Threads (Butenhof) — for portable concurrency.
- The Art of Multiprocessor Programming (Herlihy & Shavit).
- Is Parallel Programming Hard, And, If So, What Can You Do About It? (McKenney).
- Memory Models for C/C++ Programmers (Mazières lecture notes).
- "Hardware Memory Models" (Russ Cox).
- "Programming Language Memory Models" (Russ Cox).
- "Memory Models" essays at preshing.com.
- The Linux kernel RCU documentation.
The professional engineer has read most of these. The senior engineer has read at least half. The middle-level engineer has heard of them and can find what they need.
Final Note¶
Copy-on-write is one of the rare patterns where Go's design choices align perfectly with the problem. The atomic.Pointer[T] API is exactly the right tool. The garbage collector handles the hardest part for free. The memory model is strong enough to be intuitive.
This made it easy to ship COW code at the junior level. The middle level showed the engineering decisions. The senior level revealed the algorithmic depths. The professional level explained why all of this works.
A Go engineer who has internalized this material can deploy COW systems with confidence at any scale and debug them when they break. They know when to choose COW, when to reject it, and what to migrate to when its limits are reached.
That mastery is the goal of this curriculum.
Deep Dive: Implementing Your Own atomic.Pointer¶
To cement understanding, here is a from-scratch implementation of an atomic pointer type. This is for educational purposes; in production always use sync/atomic.Pointer[T].
Building blocks¶
package myatomic
import (
"sync/atomic"
"unsafe"
)
type Pointer[T any] struct {
_ noCopy
v unsafe.Pointer
}
type noCopy struct{}
func (*noCopy) Lock() {}
func (*noCopy) Unlock() {}
func (p *Pointer[T]) Load() *T {
return (*T)(atomic.LoadPointer(&p.v))
}
func (p *Pointer[T]) Store(val *T) {
atomic.StorePointer(&p.v, unsafe.Pointer(val))
}
func (p *Pointer[T]) Swap(new *T) (old *T) {
return (*T)(atomic.SwapPointer(&p.v, unsafe.Pointer(new)))
}
func (p *Pointer[T]) CompareAndSwap(old, new *T) bool {
return atomic.CompareAndSwapPointer(&p.v, unsafe.Pointer(old), unsafe.Pointer(new))
}
This is essentially the source of the standard atomic.Pointer[T]. Our version cuts the [0]*T field, but the standard library uses it for an extra layer of type checking.
What we lose without the [0]*T field¶
type FastPointer[T any] struct {
_ noCopy
v unsafe.Pointer
}
var a FastPointer[int]
var b FastPointer[string]
// Without [0]*T, this might compile if we were sloppy:
// (it doesn't, because the generic types differ, but conceptually...)
The [0]*T field is a zero-sized marker that ensures different T types produce distinct generic instantiations. Without it, in some compilers, two instantiations might be mistakenly equated.
Testing the implementation¶
func TestMyAtomic(t *testing.T) {
var p Pointer[int]
a := 42
p.Store(&a)
if *p.Load() != 42 { t.Fatal("load") }
b := 100
p.Store(&b)
if *p.Load() != 100 { t.Fatal("store2") }
c := 200
if !p.CompareAndSwap(&b, &c) { t.Fatal("CAS") }
if *p.Load() != 200 { t.Fatal("CAS load") }
}
This works because we delegate to the runtime's atomic primitives. We added nothing; we just adapted the API.
Why we have generics now¶
Before generics, every COW user wrote unsafe.Pointer casts. The generic atomic.Pointer[T] is purely an ergonomic improvement; it has the same machine code at runtime.
Deep Dive: Tracing a COW Operation¶
Walk through a complete COW publish and load at the lowest level.
The setup¶
var cfg atomic.Pointer[Config]
type Config struct {
Field int
}
func main() {
c := &Config{Field: 42}
cfg.Store(c) // (1) Initial publish
go func() {
got := cfg.Load() // (2) Read by goroutine
fmt.Println(got.Field)
}()
new := &Config{Field: 100}
cfg.Store(new) // (3) New publish
}
Step 1: Initial Store¶
The writer (main goroutine) executes cfg.Store(c).
- Compiler emits a call to
atomic.StorePointer(&cfg.v, unsafe.Pointer(c)). - On amd64, this becomes
XCHGQ(with implicit lock prefix). - Before the XCHG, write barrier check: is GC marking? If yes, call
gcWriteBarrier. (Usually no during program startup.) - The XCHG writes the pointer value of
cintocfg.v. Atomically. - After: any goroutine that reads
cfg.vsees the address ofc.
The writer's view of memory after step 1: c lives on the heap (because its address is taken and stored to a heap-reachable atomic pointer). c.Field == 42.
Step 2: Goroutine creation¶
go func() { ... }() starts a new goroutine. The Go spec guarantees:
The
gostatement happens-before the start of the goroutine.
So all writes from main (including c.Field = 42 and cfg.Store(c)) are visible to the new goroutine.
Step 3: Reader Load¶
The new goroutine executes cfg.Load().
- Compiler emits
atomic.LoadPointer(&cfg.v). - On amd64: a plain
MOVQfromcfg.v. (Memory model permits because x86 loads are acquire by default.) - Returns the pointer to
c. - The goroutine then accesses
got.Field— a plain (non-atomic) load. Reads 42.
How do we know Field == 42 and not, say, 0?
- Main wrote
c.Field = 42in program order beforecfg.Store(c). cfg.Store(c)happens-beforecfg.Load()(atomic synchronization).cfg.Load()happens-beforegot.Fieldaccess (program order).- Transitively:
c.Field = 42happens-beforegot.Fieldaccess.
This is what makes COW correct.
Step 4: Second Store¶
Main executes cfg.Store(new).
- Same as step 1: XCHGQ replaces
cfg.v. - Write barrier may fire if GC is marking.
- After:
cfg.vpoints atnew. The pointer tocis no longer held bycfg.
The goroutine that loaded c still holds it in its local variable got. From the GC's perspective, c is reachable until that goroutine finishes.
Step 5: GC reclamation¶
Eventually: - The goroutine finishes; got goes out of scope; c becomes unreachable. - Next GC cycle marks live heap; c is not visited; c is swept.
Memory reclaimed automatically.
Deep Dive: Atomic Operations Cost in Detail¶
The naive view: "atomic loads cost a nanosecond." Reality is richer.
On a quiet machine¶
BenchmarkLoadCold ~50 ns/op (cache miss to L3)
BenchmarkLoadL1Hit ~1.5 ns/op
BenchmarkLoadL2Hit ~5 ns/op
BenchmarkLoadL3Hit ~20 ns/op
BenchmarkStore ~10 ns/op (XCHG)
BenchmarkCAS ~15 ns/op (LOCK CMPXCHG)
The cache state dominates. Reads of frequently-accessed pointers stay in L1; reads of cold pointers pay L2/L3/main memory latency.
Under contention¶
When many cores write the same cache line, the line bounces between cores. Each access pays the latency of cross-core coherence traffic.
For COW: the pointer is written rarely, so reads stay in L1 across many cores. This is the magic.
When a writer publishes, the line transitions to Modified on the writer's core; readers' L1 copies become Invalid. The next read on each reader core fetches the line — adding ~50 ns one-time cost per reader core per publish.
For a service with 64 cores and 1 publish/sec, this is ~3 µs/sec of cache-coherence work. Trivial.
For a service with 64 cores and 100 publishes/sec, it's ~300 µs/sec. Still trivial.
For a service with 64 cores and 100 000 publishes/sec, it's ~300 ms/sec — 30% CPU just on coherence. At this point your design is wrong.
MESI state machine¶
[Invalid] --read--> [Shared] --write by other--> [Invalid]
[Shared] --write by me--> [Modified]
[Modified] --read by other--> [Shared]
COW's optimization: keep the line in Shared as much as possible. Writes briefly invalidate it; reads restore Shared. The pattern stays efficient as long as writes are rare.
Per-architecture differences¶
ARM and RISC-V have weaker memory models. Atomic stores require explicit fences, which are slightly more expensive than x86's implicit ordering. But absolute costs are still in the nanosecond range.
For most Go services, the cross-architecture difference is in the noise. Optimize for clarity; let hardware variance be a footnote.
Deep Dive: GC Pauses and COW¶
Let's trace what happens to a COW system during a GC cycle.
Pre-GC¶
- Heap is at threshold (e.g., 200 MB after last GC + 100 MB allocation since).
- COW system has 1 current snapshot + a few pinned snapshots (in flight requests).
- Total snapshot memory: ~10 MB out of 200 MB live.
GC start (STW)¶
The runtime briefly stops all goroutines (microseconds) to set up. During this time: - COW reads cannot proceed. - COW writes are blocked.
Concurrent mark begins¶
GC starts tracing reachable objects from roots. While this runs: - Goroutines resume. - COW reads happen normally. - COW writes now invoke the write barrier on the atomic Store.
The write barrier records the new pointer write. The GC sees the new snapshot and marks it reachable.
Marking duration¶
For a 200 MB heap with 1 million pointers, marking takes ~50-100 ms wall time. During this time: - COW writes are slightly slower (write barrier active). - New allocations (from COW writes) are accounted by the GC's pacing algorithm.
Mark termination (STW)¶
A brief stop-the-world to finalize marking. Microseconds.
Concurrent sweep¶
Unreachable objects are reclaimed background. Old snapshots that became unreachable after the last GC are freed during this phase.
Visible impact¶
For a COW system: - One ~100 µs pause at GC start, one at mark termination. - Slightly higher write latency during marking. - New allocations (snapshot rebuilds) trigger the next GC cycle.
For most workloads this is invisible. For latency-sensitive systems, monitor runtime/metrics:
samples := []metrics.Sample{
{Name: "/gc/pauses:seconds"},
{Name: "/gc/cycles/automatic:gc-cycles"},
}
Spikes in pause time correlate with allocation-heavy moments.
Deep Dive: Atomic Operations Across Compilation¶
Let's see what a single line of Go compiles to. Take:
Where cfg is atomic.Pointer[Config] and Config has a Timeout time.Duration field.
Go SSA IR (approximate)¶
v1 = AddrOf cfg.v
v2 = AtomicLoad64 v1
v3 = CastUnsafeToPointer v2
v4 = AddrOf <*Config>.Timeout from v3
v5 = Load v4
Return v5
amd64 codegen¶
TEXT GetTimeout(SB)
LEAQ cfg+0(SB), AX ; load address of cfg.v
MOVQ (AX), BX ; atomic load (aligned MOV is atomic on x86)
MOVQ Timeout+24(BX), CX ; load Timeout field
MOVQ CX, ret+0(FP)
RET
Three loads. Total cost: ~5 ns (each load is 1-2 ns from L1).
Why so few instructions?¶
Several optimizations: - The Go compiler inlines small functions including Pointer.Load. - The Go compiler inlines field access. - The Go compiler skips redundant pointer arithmetic when offsets are constants.
The function does essentially zero overhead beyond the loads themselves.
How escape analysis affects this¶
cfg.Load() returns a *Config. If the function uses cfg.Load().Timeout directly, the *Config pointer never escapes the function — it's a stack-local reference. No allocation.
If you stored it: c := cfg.Load(); use(c), escape analysis might still find that c doesn't escape. Compiler optimizes.
Deep Dive: Common Hardware Footguns¶
Footgun 1: Unaligned access¶
Atomic operations require alignment to the operand size. atomic.Int64 on 32-bit ARM may be unaligned within a struct, causing crashes or non-atomic behavior.
Go's runtime ensures alignment for top-level atomic types. But embedded in a struct, you must align manually:
atomic.Int64 is 8-byte aligned. Place it first or pad explicitly.
For atomic.Pointer[T], this isn't an issue because pointers are word-aligned on all supported architectures.
Footgun 2: Cache-line straddling¶
If a struct field crosses a cache-line boundary, atomic operations on it become non-atomic (the hardware may split the access). Go's runtime prevents this for atomic types by alignment, but if you use unsafe.Pointer arithmetic, beware.
Footgun 3: Memory bandwidth saturation¶
High-frequency atomic operations can saturate memory bandwidth. A single CPU can do ~1 billion atomic ops/sec uncontended, but the memory subsystem limits total throughput.
For COW: if writes happen at GHz frequency (you're doing it wrong), bandwidth becomes the bottleneck.
Footgun 4: NUMA effects¶
Atomic operations on a pointer in a different NUMA node pay extra latency. For NUMA-aware applications, partition data per node.
Go does not expose NUMA primitives directly. You can pin OS threads with runtime.LockOSThread and rely on the OS scheduler to keep them on a node.
Footgun 5: TSO vs. acquire/release¶
x86's TSO is stronger than acquire/release but weaker than SC. Specifically:
Wait, that's a store-buffer effect on weaker models, not TSO. On TSO, writes are visible in program order, so the above is impossible.
But for cross-variable atomic ordering, TSO is not enough — you need SC. Go's atomic.Store uses XCHG to provide SC even on TSO machines, paying a small extra cost.
For COW: not a concern. A single atomic pointer is always coherent.
Deep Dive: A SeqLock-Style Read of Multiple Words¶
Suppose you need to read a 128-bit value atomically (e.g., a 64-bit version plus a 64-bit pointer). Go's atomic.Pointer[T] only handles 64-bit pointers.
A SeqLock pattern:
type Versioned struct {
version atomic.Uint64
pointer atomic.Pointer[T]
}
func (v *Versioned) Read() (*T, uint64) {
for {
v1 := v.version.Load()
if v1&1 != 0 {
continue // writer in progress
}
p := v.pointer.Load()
v2 := v.version.Load()
if v1 == v2 {
return p, v1
}
}
}
func (v *Versioned) Write(p *T) {
v.version.Add(1) // odd version: writer in progress
v.pointer.Store(p)
v.version.Add(1) // even version: write complete
}
This is the SeqLock pattern. Readers can detect concurrent writes via the version field.
For pure COW, the SeqLock is unnecessary — atomic.Pointer[T] is itself atomic. But for multi-word reads (pointer + version, or pointer + flags), SeqLock-style is useful.
Why this is rarely needed in Go¶
Go's GC and atomic.Pointer give us almost everything for free. SeqLock-style is mainly useful when: - You need version-aware reads. - You're working with structures that span multiple cache lines. - You're matching the C kernel's RCU API exactly.
For 99% of Go code, plain atomic.Pointer[T] is sufficient.
Deep Dive: A Walk Through sync.Map¶
sync.Map is the standard library's most sophisticated read-mostly map. Its design has lessons for COW.
Two maps¶
type Map struct {
mu Mutex
read atomic.Value // readOnly
dirty map[any]*entry
misses int
}
type readOnly struct {
m map[any]*entry
amended bool
}
readis the read-only map. Read with atomic.Value.Load.dirtyis the writable map. Locked bymu.
Reads first check read. If the key is there, return atomically. If not (and the dirty map has it), fall through to the mutex.
Read fast path¶
func (m *Map) Load(key any) (value any, ok bool) {
read, _ := m.read.Load().(readOnly)
e, ok := read.m[key]
if !ok && read.amended {
// slow path
}
if !ok { return nil, false }
return e.load()
}
For keys present in read, the entire load is a few atomic operations — no mutex.
Write¶
Writes go to dirty. Periodically, dirty is promoted to read (with a new readOnly value stored atomically).
Lessons for COW¶
sync.Mapis COW for the read map and mutex-protected for the write map.- Promotion of dirty to read is a single atomic.Store of a new readOnly snapshot.
- It's a hybrid: COW for the read path, mutex for the write path, with periodic synchronization.
The pattern: use COW where you can, mutex where you must. Many real systems mix both.
Deep Dive: Detecting Reads During Marking¶
The Go GC's concurrent marker traces from roots. To remain correct, it uses a write barrier on pointer stores. But what about pointer reads?
Reads are free: the GC doesn't care if a reader sees a particular value. The barrier only ensures the GC sees pointer writes.
For COW: read paths pay no GC cost. Write paths pay the write-barrier cost during marking, but no other cost.
This asymmetry is a major reason COW is so efficient in Go.
The "snapshot at the beginning" technique¶
Go's GC takes a snapshot at the start of marking (logically), then traces from it. Pointers stored after the snapshot are tracked via the write barrier. This ensures any reachable object at the start (or made reachable during marking) is marked.
For COW: a new snapshot published during marking is correctly marked because of the write barrier. Old snapshots are reclaimed if no remaining root reaches them.
The hybrid concurrent GC's progress¶
Go's GC has improved over many versions: - 1.4: Stop-the-world. - 1.5: Concurrent marker, but with significant pauses. - 1.8+: Sub-millisecond pauses for most heaps. - 1.14+: Asynchronous preemption (helps GC progress without cooperative yield points). - 1.19+: Soft memory limit, improved pacing. - 1.21+: Continuous improvements to allocator and marker.
For COW systems, each version makes GC overhead lower. Production Go code today has GC pauses below 1 ms for heaps under a few GB.
Deep Dive: GOMAXPROCS and COW Scaling¶
GOMAXPROCS controls how many goroutines can run in parallel. For COW:
- Reads scale linearly with
GOMAXPROCS(no contention). - Writes scale with
GOMAXPROCSif you have multiple writers and shard or use CAS; otherwise serialized.
Read scaling¶
GOMAXPROCS=1: 1 read at a time. Throughput = 1 / (Load + use) ≈ 100M reads/sec.
GOMAXPROCS=16: 16 reads in parallel. Throughput ≈ 1.6G reads/sec.
Linear scaling because each reader's atomic Load doesn't invalidate any other reader's cache line.
Write scaling¶
With a single writer mutex, write throughput is constant regardless of GOMAXPROCS. With sharded mutexes (N shards), write throughput scales up to N.
For CAS-based writers: contention grows with the number of writers, so throughput plateaus.
Best practice¶
For read-mostly COW, GOMAXPROCS=NumCPU is correct. Reads benefit from parallelism. Writes are bounded by the design's serialization, not by GOMAXPROCS.
Deep Dive: Comparing COW to RW Locks at the Hardware Level¶
A sync.RWMutex.RLock does:
- Atomic increment of reader count.
- Check writer-waiting flag.
- (If writer waiting) wait.
- Otherwise proceed.
Step 1 is an atomic operation on the reader count. Multiple readers contend for this cache line. Even with many readers, contention causes ~10-30 ns per RLock.
A COW Load does:
- Atomic load of the pointer.
The pointer's cache line is shared (not modified by readers). No contention.
The hardware-level difference: COW reads do not invalidate any cache line. RW Mutex reads do.
Numbers¶
- COW Load on a 64-core machine: ~1.5 ns regardless of how many cores are reading.
- RWMutex RLock on a 64-core machine: ~30 ns uncontended, ~100 ns under heavy read load.
20× difference. The reason: RW Mutex requires modifying shared state (the reader count) on every RLock; COW does not.
When RW Mutex catches up¶
If your reads are heavy (do significant work after RLock), the constant overhead is amortized. RW Mutex becomes acceptable when read work is microseconds, not nanoseconds.
For trivial reads (single field access), COW dominates.
Deep Dive: Goroutine Stack Scanning¶
The GC scans every goroutine's stack to find pointers. This is the "stack scan" phase.
For COW: a goroutine holding a snapshot pointer in a local variable contributes that pointer to the live set. The snapshot is kept alive as long as the goroutine's stack contains the reference.
Precision¶
Go's stack scanning is precise — the compiler emits maps describing which stack slots are pointers. The GC follows them exactly. No false positives.
This is why COW works so well: pinned snapshots are exactly those goroutines hold, no more.
Implications for long-running goroutines¶
A goroutine that loads a snapshot once and runs for an hour pins it for an hour. The pin extends to anything else reachable from the snapshot (its fields, including slices and maps).
Mitigation: bound goroutine lifetimes or periodically re-load.
Stack growth¶
Goroutine stacks grow as needed. A function call that requires more stack space triggers a "stack growth" — the runtime allocates a larger stack and copies the contents. Stack-resident pointers (including snapshot references) move with the stack.
The atomic pointer is not on the stack — it's a global or heap-allocated struct. Stack growth doesn't affect it.
Deep Dive: A Cross-Architecture COW Benchmark¶
To see the differences in action, here is a benchmark comparing COW Load latency across architectures.
func BenchmarkCOWLoad(b *testing.B) {
var p atomic.Pointer[int]
x := 42
p.Store(&x)
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
_ = *p.Load()
}
})
}
Typical results (median):
| Architecture | Cores | ns/op | ops/sec/core |
|---|---|---|---|
| amd64 (Intel Xeon Gold) | 32 | 1.5 | 670M |
| amd64 (AMD EPYC) | 64 | 1.7 | 590M |
| arm64 (Apple M1) | 8 | 1.8 | 555M |
| arm64 (AWS Graviton2) | 64 | 2.0 | 500M |
| riscv64 (SiFive) | 4 | 4.5 | 220M |
x86 wins on absolute latency. ARM is close. RISC-V is slower (younger architecture, fewer optimizations).
The takeaway: for COW, modern hardware is fast. Algorithm-level optimizations matter much more than microsecond-level architecture differences.
Closing Professional-Level Synthesis¶
You now possess the complete mental model of COW in Go:
- The high-level pattern (junior).
- The architectural choices (middle).
- The algorithmic optimizations (senior).
- The runtime mechanics (professional).
Each layer builds on the previous. Each is sufficient for a range of work; together they make you fluent across the entire stack.
The next files in this section apply this knowledge in different formats: - specification.md — formal Go-spec excerpts. - interview.md — interview Q&A. - tasks.md — hands-on exercises. - find-bug.md — bug-finding practice. - optimize.md — performance tuning.
Use them to reinforce, practice, and apply what you've learned. Real mastery comes from doing.
Final Reference: A Cheat Sheet for the Professional¶
// Declare
var p atomic.Pointer[T]
// Initialize
p.Store(&initial)
// Read (one operation)
v := p.Load()
// Write (mutex-protected)
mu.Lock(); defer mu.Unlock()
next := *p.Load()
fn(&next)
p.Store(&next)
// CAS (lock-free)
for {
cur := p.Load()
next := transform(cur)
if p.CompareAndSwap(cur, next) { break }
}
Costs (amd64, L1-hit): - Load: ~1.5 ns - Store: ~10 ns - CAS: ~15 ns
Memory model guarantees: - Store(v) happens-before any Load that observes v. - Within a goroutine, program order is preserved. - Atomic operations are sequentially consistent.
GC interaction: - Pointers in snapshots are scanned during marking. - Old snapshots are reclaimed when no goroutine's reachable set contains them. - Write barriers fire on atomic Store during marking.
This is what every Go engineer should know about COW at the runtime level.
Appendix: Selected Go Runtime Source Pointers¶
For the curious, these are the most relevant source files to read:
src/sync/atomic/type.go— public API (Pointer, Int64, etc.).src/sync/atomic/doc.go— package documentation.src/runtime/internal/atomic/*— runtime intrinsics.src/runtime/mbarrier.go— write barrier.src/runtime/mgcmark.go— GC marking phase.src/runtime/proc.go— goroutine scheduling (relevant for stack scanning).src/runtime/mheap.go— heap management.src/runtime/malloc.go— allocation paths.
Reading even a few of these makes Go's atomic and GC behavior less mysterious.
A senior or professional engineer benefits from a quarterly habit: skim one runtime file and understand it. Over time, you build a complete picture of the runtime.
Appendix: The History of atomic.Value and atomic.Pointer[T]¶
A brief timeline:
- Go 1.4 (Dec 2014):
sync/atomic.Valueadded. Storesinterface{}. Designed for COW. - Go 1.13 (Sep 2019):
atomic.Value.Swapadded. - Go 1.17 (Aug 2021):
atomic.Value.CompareAndSwapadded. - Go 1.18 (Mar 2022): Generics introduced.
- Go 1.19 (Aug 2022):
atomic.Pointer[T],atomic.Int32,atomic.Int64,atomic.Uint32,atomic.Uint64,atomic.Uintptr,atomic.Booladded. - Go 1.22 (Feb 2024): Improved escape analysis for atomic types.
- Go 1.24+ (TBD): Continued refinement.
The progression shows Go's commitment to making atomic operations safer and more ergonomic. New code on 1.19+ should use the typed forms.
Appendix: Notable Bug Reports and CVEs¶
A few public bugs related to atomic operations and COW patterns:
- #52382 (atomic.Value with nil interface): Storing a
nilinterface caused panic in some paths; fixed. - #48141 (RWMutex starvation): Long-running writers could starve readers; fixed by writer-preferring semantics in 1.18+.
- #37052 (memory model clarification): A long-running discussion that eventually produced the Go 1.19 memory model, formalizing atomic ordering.
Following Go's issue tracker is the best way to stay current on subtle bug fixes that may affect your COW code.
Appendix: Future Directions¶
Looking ahead:
- More granular memory ordering. Some teams advocate exposing acquire/release semantics for performance. Trade-off: complexity vs speed.
- Improved escape analysis. Snapshot allocations may move to stack in more cases.
- Standard persistent collections. Long-discussed; may eventually ship.
- Hardware transactional memory. Intel TSX, ARM TME. Not yet widely supported in Go, but watch this space.
A professional engineer keeps an eye on these developments, anticipating shifts in best practice.
Appendix: A Final, Practical Example¶
To close, a complete production COW example using everything covered:
package config
import (
"context"
"encoding/json"
"errors"
"fmt"
"os"
"sync"
"sync/atomic"
"time"
"github.com/prometheus/client_golang/prometheus"
"go.opentelemetry.io/otel/trace"
)
type Config struct {
Version int64 `json:"version"`
PublishedAt time.Time `json:"published_at"`
Settings map[string]string `json:"settings"`
Features map[string]bool `json:"features"`
Routes []Route `json:"routes"`
}
type Route struct {
Pattern string `json:"pattern"`
Backend string `json:"backend"`
}
func (c *Config) Validate() error {
if c.Settings == nil { return errors.New("settings nil") }
if c.Features == nil { return errors.New("features nil") }
for i, r := range c.Routes {
if r.Pattern == "" || r.Backend == "" {
return fmt.Errorf("route %d invalid", i)
}
}
return nil
}
type Store struct {
cur atomic.Pointer[Config]
mu sync.Mutex
tracer trace.Tracer
metrics *metrics
path string
nextVer int64
}
type metrics struct {
version prometheus.Gauge
age prometheus.Gauge
updates prometheus.Counter
failures prometheus.Counter
loadLatency prometheus.Histogram
writeLatency prometheus.Histogram
}
func NewStore(initial *Config, path string, tracer trace.Tracer) (*Store, error) {
if err := initial.Validate(); err != nil {
return nil, err
}
s := &Store{
path: path,
tracer: tracer,
nextVer: 1,
metrics: newMetrics(),
}
initial.Version = s.nextVer
initial.PublishedAt = time.Now()
s.cur.Store(initial)
s.metrics.version.Set(float64(initial.Version))
s.nextVer++
return s, nil
}
func (s *Store) Get(ctx context.Context) *Config {
_, span := s.tracer.Start(ctx, "config.Get")
defer span.End()
c := s.cur.Load()
s.metrics.age.Set(time.Since(c.PublishedAt).Seconds())
return c
}
func (s *Store) Update(ctx context.Context, fn func(*Config) error) (err error) {
ctx, span := s.tracer.Start(ctx, "config.Update")
defer span.End()
start := time.Now()
defer func() { s.metrics.writeLatency.Observe(time.Since(start).Seconds()) }()
s.mu.Lock()
defer s.mu.Unlock()
old := s.cur.Load()
next := old.Clone()
if err = fn(next); err != nil {
s.metrics.failures.Inc()
span.RecordError(err)
return
}
if err = next.Validate(); err != nil {
s.metrics.failures.Inc()
span.RecordError(err)
return
}
next.Version = s.nextVer
next.PublishedAt = time.Now()
s.nextVer++
s.cur.Store(next)
s.metrics.updates.Inc()
s.metrics.version.Set(float64(next.Version))
return nil
}
func (s *Store) Reload(ctx context.Context) error {
ctx, span := s.tracer.Start(ctx, "config.Reload")
defer span.End()
b, err := os.ReadFile(s.path)
if err != nil {
s.metrics.failures.Inc()
return err
}
var next Config
if err := json.Unmarshal(b, &next); err != nil {
s.metrics.failures.Inc()
return err
}
s.mu.Lock()
defer s.mu.Unlock()
if err := next.Validate(); err != nil {
s.metrics.failures.Inc()
return err
}
next.Version = s.nextVer
next.PublishedAt = time.Now()
s.nextVer++
s.cur.Store(&next)
s.metrics.updates.Inc()
s.metrics.version.Set(float64(next.Version))
return nil
}
func (c *Config) Clone() *Config {
cp := *c
cp.Settings = make(map[string]string, len(c.Settings))
for k, v := range c.Settings { cp.Settings[k] = v }
cp.Features = make(map[string]bool, len(c.Features))
for k, v := range c.Features { cp.Features[k] = v }
cp.Routes = append([]Route(nil), c.Routes...)
return &cp
}
This is what production COW looks like. It uses: - atomic.Pointer[T] for the snapshot. - A writer mutex for serialization. - Validation before publish. - Deep clone via a method on the snapshot type. - Tracing (OpenTelemetry). - Metrics (Prometheus). - Reload from a JSON file. - Errors propagated to the caller.
Every pattern from junior through professional level is here. This is the synthesis.
A Final, Final Word¶
Copy-on-write is one of the most elegant patterns in Go. From a five-line atomic.Pointer[T] snippet to a million-entry sharded persistent map with hot reload and full observability, the pattern scales gracefully.
The journey from "I write go f()" to "I can explain how atomic.Pointer[T].Load is implemented on ARM64" is long. But it's a journey worth taking. The depth informs decision-making at every level.
For your next COW system: start simple. Use atomic.Pointer[T] and a writer mutex. Measure. If you need more, you have the toolkit to scale up.
That is the gift of having read four levels of this document. Use it well.
Extended Section: Memory Model — Worked Examples¶
To deepen understanding, walk through specific examples of memory-model reasoning applied to COW scenarios.
Example 1: Single producer, single consumer¶
var p atomic.Pointer[Data]
var ready atomic.Bool
// Producer
data := &Data{Value: 42}
p.Store(data)
ready.Store(true)
// Consumer
for !ready.Load() { runtime.Gosched() }
v := p.Load().Value
Question: is v == 42 guaranteed?
Walking through: - data.Value = 42 happens-before p.Store(data) (program order). - p.Store(data) happens-before ready.Store(true) (program order in producer). - ready.Load() returning true happens-after ready.Store(true) (atomic sync). - ready.Load() happens-before p.Load() (program order in consumer).
Transitively: data.Value = 42 happens-before p.Load().Value access.
Therefore v == 42 is guaranteed.
Example 2: Two writers, one reader¶
var p atomic.Pointer[int]
// Writer A
a := 1
p.Store(&a)
// Writer B
b := 2
p.Store(&b)
// Reader
v := *p.Load()
Question: what values can v be?
Sequential consistency requires a total order on the three atomic operations. The reader sees the value of the most recent Store in this order.
Cases: - Store(A), Store(B), Load: v == 2 - Store(B), Store(A), Load: v == 1 - Store(A), Load, Store(B): v == 1 - Load, Store(A), Store(B): v == (initial, likely panic on nil deref) - Other orderings: similarly bounded.
In all valid orderings, v is either 1, 2, or initial. No partial state.
Example 3: Snapshot field consistency¶
var p atomic.Pointer[Pair]
type Pair struct {
A int
B int
}
// Writer
p.Store(&Pair{A: 1, B: 2})
p.Store(&Pair{A: 10, B: 20})
// Reader
s := p.Load()
fmt.Printf("%d %d", s.A, s.B)
Question: can the reader see 1 20 or 10 2?
No. Each Pair literal is fully constructed before the Store. The reader sees one full Pair: 1 2 or 10 20.
The construction &Pair{A: 1, B: 2} happens entirely before the Store (program order in writer). The reader's field accesses happen after the Load (program order). The Store/Load atomic edge connects them.
Example 4: Mutated snapshot¶
var p atomic.Pointer[Pair]
pair := &Pair{A: 1, B: 2}
p.Store(pair)
// Reader 1
go func() {
s := p.Load()
fmt.Printf("R1: %d %d\n", s.A, s.B)
}()
// Buggy writer (mutates published snapshot!)
pair.A = 100
pair.B = 200
Question: what does R1 print?
Undefined behavior. R1 may print 1 2, 100 200, 1 200, 100 2, or anything. The race detector flags this as a data race.
There is no happens-before relationship between the buggy writer's pair.A = 100 and the reader's s.A access. They are racing.
Lesson: never mutate a published snapshot.
Example 5: Cross-pointer ordering¶
var pa, pb atomic.Pointer[int]
a, b := 1, 2
// Writer
pa.Store(&a)
pb.Store(&b)
// Reader
if pb.Load() != nil {
x := *pa.Load() // is x guaranteed to be 1?
}
Question: if the reader observes pb != nil, does it see pa updated?
Yes, by sequential consistency. In the total order: - pa.Store(&a) happens before pb.Store(&b) (program order in writer). - pb.Load() (returning non-nil) happens after pb.Store(&b) (atomic edge). - pa.Load() happens after pb.Load() (program order in reader).
By transitivity: pa.Store(&a) happens before pa.Load(). So pa.Load() returns &a, and *pa.Load() == 1.
This is a powerful guarantee — atomic operations on different variables are coordinated by the total order.
Example 6: Memory barriers without atomics¶
var x int
var done atomic.Bool
// Writer
x = 42
done.Store(true)
// Reader
if done.Load() {
fmt.Println(x)
}
Question: does the reader print 42?
Yes. The atomic Store/Load creates a happens-before edge. The non-atomic write x = 42 happens-before the Store (program order in writer); the Load (returning true) happens-after the Store (atomic edge); the read of x happens-after the Load (program order).
Transitively: x = 42 happens-before read of x. The reader sees 42.
This pattern is the foundation of "publishing data via an atomic flag" — common in lazy initialization.
Extended Section: Runtime Internals Deep Dive¶
A closer look at how Go's runtime supports COW.
The atomic intrinsics¶
The compiler treats functions in runtime/internal/atomic specially. They are recognized by name and replaced with inline assembly. Source:
// From src/runtime/internal/atomic/atomic_amd64.go:
//go:noescape
func LoadPointer(ptr *unsafe.Pointer) unsafe.Pointer
//go:noescape tells the compiler the function does not cause its arguments to escape — important for stack allocation.
The corresponding implementation is in atomic_amd64.s:
The function is "no-split" — it cannot trigger stack growth. This is required for runtime intrinsics because stack growth itself uses atomic operations.
The write barrier¶
src/runtime/mbarrier.go implements the GC write barrier. The key function:
//go:nowritebarrier
func gcWriteBarrier(ptr unsafe.Pointer, val unsafe.Pointer) {
// ... record the write in the GC's tracking
}
For every pointer write to a heap object, the compiler inserts a call to gcWriteBarrier. The barrier is conditional — it only does real work when GC is in concurrent marking phase.
The conditional is the byte writeBarrier.enabled. The compiler emits a fast check:
In the common case (no GC marking), this is one byte-load and one branch — well-predicted, essentially free.
The scheduler¶
src/runtime/proc.go is the scheduler. For COW:
runtime.Goschedyields the current goroutine.runtime_procPin(used byatomic.Value) pins the goroutine to its current P (processor) for safety during multi-word operations.- Stack scanning happens periodically; it sees all stack-local pointers including snapshot references.
Memory allocation¶
src/runtime/malloc.go handles allocations. Each snapshot rebuild allocates new memory; the path is:
- Determine size class.
- Allocate from the per-P cache (mcache).
- If cache empty, refill from central cache (mcentral).
- If central empty, fall to heap (mheap).
- If heap empty, request more from OS.
For COW writes: most allocations hit the per-P cache (~10 ns). Cold paths are rare.
GC heuristics¶
src/runtime/mgcpacer.go manages GC pacing. For COW workloads:
- GC triggers when heap grows by
GOGCpercent. - During marking, the allocator helps the GC by participating in marking work ("mutator assist").
- High allocation rates trigger more frequent GC.
For COW with high write rates, the GC paces itself to keep up. If it can't, allocations slow down to avoid heap blowup.
Extended Section: Architecture-Specific Details¶
A closer look at the differences across CPU architectures.
x86-64 (amd64)¶
The strongest commodity memory model. Key properties:
- TSO (Total Store Order). All cores see writes in the order they were performed. The only relaxation: a core may see its own writes before others (store buffer).
- Aligned 8-byte MOVs are atomic. No special instructions needed for plain reads/writes of aligned 64-bit values.
- LOCK prefix. Adds full-fence semantics. Used by atomic RMW operations (XCHG, XADD, CMPXCHG).
- Cache-line size: 64 bytes.
- MESI cache protocol (or MOESI on AMD).
For COW: - Load is a plain MOV — essentially free. - Store uses XCHG to enforce full fence (Go's SC requirement). - CAS uses LOCK CMPXCHG.
The cost of SC on x86 is low because TSO is already close to SC.
ARM64 (aarch64)¶
Weaker memory model. Loads and stores can be reordered freely unless you use special instructions.
- LDAR/STLR: load-acquire and store-release. Provide acquire/release semantics.
- CAS: As of ARMv8.1 LSE extension, native
CASinstruction. Without LSE, use LL/SC loop (LDAXR/STLXR). - Cache-line size: 64 bytes (typically); some chips have 128.
For COW: - Load uses LDAR, slightly more expensive than LDR. - Store uses STLR, with implicit ordering. - CAS uses CASAL (if LSE) or LL/SC loop.
Go's atomic.Pointer.Load on ARM64 compiles to a single LDAR — about 2-3 ns.
RISC-V (riscv64)¶
Weakest of the three major architectures. Memory operations have explicit ordering only via FENCE instructions.
- LR/SC: Load-Reserved / Store-Conditional for atomic RMW.
- AMO* instructions: Atomic Memory Operations (AMOADD, AMOSWAP, AMOOR, etc.).
- FENCE: Explicit barrier with
pr,pw,sr,swqualifiers.
For COW: - Load is plain LD plus FENCE r,rw for acquire. - Store is FENCE rw,w plus SD. - CAS uses an LR/SC loop or AMOSWAP.
Costs are similar to ARM64. Go's RISC-V support is fairly recent; expect performance to improve.
32-bit ARM¶
64-bit atomics on 32-bit ARM are non-atomic without special handling. Go falls back to a software lock-based implementation.
For COW: not a concern unless you target 32-bit ARM (Raspberry Pi 1-3, some embedded). Even then, costs are higher but functional.
MIPS, PPC, s390x¶
Go supports several other architectures. Each has its own memory model and instructions. The Go team handles the codegen; user code stays the same.
For most Go services, you do not need to think about these. The portability of atomic.Pointer[T] is one of its great virtues.
Extended Section: Hardware Effects in Practice¶
Beyond memory models, real hardware has effects worth knowing about.
Cache hierarchy¶
Modern CPUs have three levels of cache:
- L1: ~32 KB per core, ~1 ns latency.
- L2: ~256-1024 KB per core, ~5 ns latency.
- L3: ~16-64 MB shared, ~20-30 ns latency.
- Main memory: ~100 ns latency.
A COW snapshot pointer accessed from L1 costs ~1.5 ns. From main memory, ~100 ns. The 60× difference depends on access pattern.
For hot snapshots (read 1000 times per ms), L1 hit rate is essentially 100%. For cold snapshots (read rarely), miss rate is high.
NUMA¶
On multi-socket systems, memory has affinity to a socket. Cross-socket access pays extra latency (~150 ns vs ~100 ns).
Go does not surface NUMA, but you can: - Pin OS threads with runtime.LockOSThread. - Use numactl to launch the process bound to a socket.
For most Go services, NUMA is invisible. For 64+ core machines with heavy COW traffic, it can matter.
Cache coherence traffic¶
When a writer modifies a cache line, other cores' copies must invalidate. This is "coherence traffic" — a non-trivial cost.
For COW: the atomic pointer's cache line is written occasionally. After a write, all reader cores' copies of the line invalidate; their next read fetches the new value.
For 1 publish/sec and 64 reader cores: ~64 invalidations/sec. Trivial.
For 1000 publishes/sec: ~64 000 invalidations/sec. Each costs ~50 ns of coherence latency. Total: ~3 ms/sec of overhead — still small.
For 100 000 publishes/sec: prohibitive. Don't do this.
Branch prediction¶
Modern CPUs predict branches. A well-predicted branch costs ~1 cycle; a mispredicted branch costs ~10-20 cycles.
The write barrier check is well-predicted: it falls through 99% of the time. The branch is essentially free.
A poorly-written hot path with unpredictable branches (e.g., a series of if statements based on the snapshot's fields) can be slow even if the snapshot read is fast.
Speculative execution¶
CPUs execute instructions ahead of branches and commit when the branch resolves. For COW:
The CPU may speculatively execute do() before fully resolving the Load. If the branch was correctly predicted, the work is committed; if not, it's discarded.
This is generally transparent and helpful. Spectre-class vulnerabilities (speculation-related leaks) are a concern for security but not correctness.
Prefetching¶
CPUs prefetch ahead based on access patterns. For a COW reader that walks c.Items[0], then c.Items[1], etc., the prefetcher loads ahead.
For random access (e.g., hash map probes), the prefetcher is less effective. Hot snapshots usually exhibit good prefetch behavior.
Extended Section: A Lock-Free Producer-Consumer with COW¶
Here is a more advanced pattern combining COW with a lock-free queue.
Goal¶
A producer publishes a snapshot. Many consumers each process the snapshot. When a new snapshot is published, all consumers eventually switch to it.
Naive (broken) approach¶
The consumer holds s for one iteration. The next iteration reloads. Latency between publish and observe: ~one iteration of process.
Notification version¶
type Notifier struct {
cur atomic.Pointer[Snapshot]
ch atomic.Pointer[chan struct{}]
}
func (n *Notifier) Publish(s *Snapshot) {
old := n.ch.Load()
next := make(chan struct{})
n.ch.Store(&next)
n.cur.Store(s)
close(*old)
}
func (n *Notifier) Consume(fn func(*Snapshot)) {
for {
notif := *n.ch.Load()
s := n.cur.Load()
fn(s)
<-notif
}
}
Consumers block on the notification channel. When publish happens, the channel is closed, waking all consumers.
Issues: - Ordering: n.ch.Store(&next) happens before n.cur.Store(s) and before close(*old). Consumers waking up should re-Load n.cur and n.ch. - The consumer pattern needs to re-load both cur and ch each iteration.
Correct version¶
func (n *Notifier) Consume(fn func(*Snapshot)) {
var lastSeen *Snapshot
for {
notif := *n.ch.Load()
s := n.cur.Load()
if s != lastSeen {
fn(s)
lastSeen = s
}
<-notif
}
}
The consumer waits on notif. When woken, it re-loads cur. If cur differs from last, it processes the new snapshot.
Memory model check: - cur.Store(s) happens-before close(*old) (atomic + non-atomic ordering). - <-notif happens-after close(notif) (channel ordering). - cur.Load() after <-notif sees s (transitively).
Correct.
Variant: bounded consumers¶
Sometimes you only want one consumer to process each snapshot. Use a Once per snapshot:
type Snapshot struct {
data Data
once sync.Once
}
func (n *Notifier) ConsumeUnique(fn func(*Snapshot)) {
for {
notif := *n.ch.Load()
s := n.cur.Load()
s.once.Do(func() { fn(s) })
<-notif
}
}
The sync.Once ensures fn runs exactly once per snapshot, regardless of how many consumers see it.
Extended Section: A Wait-Free Atomic Snapshot¶
A subtle variation: a snapshot that contains multiple atomic pointers, each readable independently but together representing a consistent view.
Problem statement¶
You want to atomically update both Config and Routes. Readers can read either alone (and the field they read is current). But occasionally a reader wants both, consistent as of one moment.
Solution 1: Combined snapshot¶
The standard answer:
One Load gets both. Updates rebuild both. Simple and correct.
Solution 2: Version-tagged independent pointers¶
If updates to Config and Routes are independent and the combined view is rare, separate pointers may be better:
type Tagged[T any] struct {
Version int64
Value *T
}
var cfg atomic.Pointer[Tagged[Config]]
var routes atomic.Pointer[Tagged[Routes]]
func ReadConsistent() (*Config, *Routes) {
for {
cfgV := cfg.Load()
routesV := routes.Load()
if cfg.Load() == cfgV {
return cfgV.Value, routesV.Value
}
}
}
The check cfg.Load() == cfgV verifies cfg hasn't changed. If it has, retry. The version tag is unused here but useful for monitoring.
This pattern guarantees a consistent view (cfgV's version corresponds to a moment when routesV was also current... almost — the gap between the two Loads still allows a routes update). Truly consistent multi-pointer reads need more care.
Solution 3: A "world snapshot"¶
Periodically (e.g., once per second), take a complete snapshot of all stores:
type World struct {
Config *Config
Routes *Routes
}
var world atomic.Pointer[World]
func updater() {
for {
world.Store(&World{
Config: cfg.Load(),
Routes: routes.Load(),
})
time.Sleep(time.Second)
}
}
Consumers read world.Load() for consistent snapshots updated up to 1 sec stale. Cheap and correct.
The trade-off: consistency lags by ~1 sec.
Recommendation¶
For 99% of cases, Solution 1 (combined snapshot) is the right answer. It costs more per write but solves the problem cleanly.
Extended Section: Profiling COW Memory¶
Diagnosing memory issues in COW systems requires specific techniques.
Identifying pinned snapshots¶
Use pprof goroutine ?debug=2 to dump all goroutine stacks:
Search for goroutines holding snapshot references:
Goroutines stuck in handler functions while holding snapshots are pin sources.
Heap snapshot analysis¶
Take a heap snapshot:
In the pprof shell:
Look for many live *Snapshot allocations. If you see many, you have pinning.
Allocation profile¶
Shows cumulative allocations. The biggest allocators are usually writers building snapshots.
Tracking snapshot generation¶
Tag goroutines with the snapshot version they hold:
ctx = pprof.WithLabels(ctx, pprof.Labels("snapshot_version", fmt.Sprint(snap.Version)))
pprof.SetGoroutineLabels(ctx)
Then in the goroutine profile, you can see how many goroutines hold each version. A long tail of old versions indicates pinning.
Heap differential¶
Take heap snapshots at intervals:
curl 'http://localhost:6060/debug/pprof/heap?gc=1' > heap1.pprof
sleep 60
curl 'http://localhost:6060/debug/pprof/heap?gc=1' > heap2.pprof
go tool pprof -base heap1.pprof heap2.pprof
?gc=1 forces GC before snapshot, so you see only persisted growth. The diff shows what grew in 60 sec — usually a pin source if your COW is misbehaving.
Extended Section: A Production Postmortem (Hypothetical)¶
A typical COW failure and how it would be diagnosed.
The incident¶
A web service deployed yesterday is OOMing in production. Memory grows steadily; goroutines pile up. Restarts buy a few hours; the problem recurs.
Initial investigation¶
Heap profile shows 5 GB of *RouteTable objects alive. The current snapshot is 50 MB; there are ~100 versions alive.
Hypothesis¶
Some goroutine is pinning old snapshots.
Goroutine profile¶
Dump goroutines. Find that 10 000 goroutines are blocked in processRequest. Each holds a *RouteTable reference in a local variable.
Root cause¶
A recent change made requests time out at 10 minutes instead of 30 seconds. Long requests started piling up; each pins a *RouteTable. With route updates every 5 seconds, accumulated pins grow.
Fix¶
- Restore 30-second timeout.
- Add an alert on goroutine count.
- Add a metric on snapshot age across in-flight requests.
Lesson¶
COW pinning is a foot-gun. Always bound request lifetimes. Monitor snapshot age.
Extended Section: Cross-Cutting Concerns¶
Logging¶
COW publishes are natural log events:
log.Info("config published",
"version", next.Version,
"size", next.Size(),
"diff", computeDiff(old, next),
)
Logging on every publish gives auditability. Aggregating publishes (one log per N) reduces volume.
Tracing¶
Wrap reads and writes in tracing spans:
For high-frequency reads, this adds overhead. Use sampling:
Metrics¶
Track the basics: version, age, write rate, allocation rate. Export as Prometheus gauges and counters.
var (
configVersion = prometheus.NewGauge(prometheus.GaugeOpts{Name: "config_version"})
configAge = prometheus.NewGauge(prometheus.GaugeOpts{Name: "config_age_seconds"})
configUpdates = prometheus.NewCounter(prometheus.CounterOpts{Name: "config_updates_total"})
)
Health checks¶
A health check endpoint exposes config state:
func healthHandler(w http.ResponseWriter, r *http.Request) {
c := store.Get()
if c == nil || time.Since(c.PublishedAt) > 24*time.Hour {
http.Error(w, "config stale", 503)
return
}
json.NewEncoder(w).Encode(c)
}
Migration safety¶
When changing the snapshot type:
- Add fields, never remove. Old readers ignore new fields; new readers handle missing old fields.
- Add fields with default zero values.
- Don't change field types.
- Don't reorder fields if serialization order matters.
Snapshot schemas are forever in distributed systems.
Extended Section: COW for Specific Domains¶
COW for game servers¶
Game servers often have a "world snapshot" — positions, scores, events. Players read the snapshot; the server publishes updates many times per second.
For 60 fps games: 60 updates/sec. Each update can be a full snapshot or a delta. COW with delta encoding works well.
type WorldSnapshot struct {
Tick int64
Players []PlayerState
}
var world atomic.Pointer[WorldSnapshot]
Players (or their AI proxies) read the snapshot; the server publishes new ones. The pattern is identical to a web service's config.
Critical concern: latency. Players must see updates within milliseconds. atomic.Pointer.Load is ~1.5 ns; publish is ~ms (build cost). Both are well within 16 ms (60 fps) budget.
COW for trading systems¶
Order books in trading systems are read constantly and written frequently. Pure COW would generate too much garbage.
Patterns: - Per-symbol COW (each symbol's order book is independent). - Lock-free queue for incremental updates. - Snapshot publishes only on significant changes.
COW for ML inference¶
Models are loaded once and read many times. COW for the model snapshot allows hot-swap:
type Model struct {
weights []float32
graph *Graph
}
var current atomic.Pointer[Model]
func Predict(input []float32) []float32 {
m := current.Load()
return m.graph.Forward(m.weights, input)
}
func ReloadModel(path string) error {
next, err := LoadModel(path)
if err != nil { return err }
current.Store(next)
return nil
}
A new model file triggers a reload; new inference requests use the new model. In-flight requests finish on the old model.
For very large models (GB-scale), pinning the old model during in-flight requests adds memory pressure. Bound request times.
COW for routing in service meshes¶
Envoy, Istio, and similar service meshes maintain routing tables. Updates happen during config push; reads happen on every packet.
The natural COW use case. Most implementations use C++ atomic pointers in production; the Go equivalents are identical in spirit.
Extended Section: COW Anti-Patterns at the Runtime Level¶
Anti-pattern 1: Cache-line ping-pong¶
Two atomic variables sharing a cache line. Writers to one invalidate the line for readers of the other.
If counter and cur share a cache line, every counter increment invalidates the line. COW reads pay extra cache miss latency.
Fix: padding.
Anti-pattern 2: Allocating in the hot read path¶
Returning a struct by value may force allocation if the struct is large. Return a pointer.
Anti-pattern 3: Snapshot containing very large slices¶
Every rebuild copies (or shares) the slice header. Sharing is fine if you never append to next.History. But if the slice is mutated, each rebuild costs 100 MB.
For history-like data, use a persistent log structure (append-only with structural sharing).
Anti-pattern 4: Type assertions in hot paths¶
Type assertions cost a few ns. For 1 M reads/sec, that's measurable. Use atomic.Pointer[Config] instead.
Anti-pattern 5: Defer in tight publish loops¶
func (s *Store) Update(fn func(*Config)) {
s.mu.Lock()
defer s.mu.Unlock() // OK for one call; overhead in a loop
// ...
}
Defer has overhead (~30 ns). For one call this is fine. If you batch many calls inside a critical section, restructure to avoid the defer.
Anti-pattern 6: Reflection on snapshots¶
func dumpAll(s *Snapshot) {
v := reflect.ValueOf(s).Elem()
for i := 0; i < v.NumField(); i++ {
// ...
}
}
Reflection is slow. For occasional diagnostics it's fine. In hot paths, write explicit code.
Extended Section: The Future of COW in Go¶
A speculative look at where things might go.
Generic persistent collections¶
The Go team has discussed adding persistent collections to the standard library. If/when this happens, COW with structural sharing becomes a one-liner:
Until then, third-party libraries fill the gap.
Better escape analysis¶
The compiler continues to improve. Future versions may stack-allocate snapshots in more cases, reducing GC pressure.
Weak references (already in Go 1.24+)¶
weak.Pointer[T] lets caches reference snapshots without pinning them. Useful for snapshot-keyed memoization.
Hardware transactional memory¶
Intel TSX and ARM TME provide hardware-level transactions. Theoretically, snapshot publishes could be wrapped in transactions for stronger atomicity guarantees. In practice, Go doesn't expose these; you'd need CGO + assembly.
Larger-than-pointer atomic operations¶
A long-standing request: atomic 128-bit operations. With them, you could atomically update a (version, pointer) pair without a SeqLock. Go has not added this, but the standard library may eventually.
Memory model refinements¶
The Go memory model is mature but not frozen. Subtle clarifications continue to roll out. Stay informed via the Go team's blog and release notes.
Extended Section: Closing Reflections on Mastery¶
After four levels of this curriculum, the COW pattern should feel natural. You should be able to:
- Identify when a workload calls for COW within seconds.
- Sketch a working implementation in minutes.
- Predict the performance characteristics.
- Diagnose production failures from heap profiles and goroutine dumps.
- Explain the runtime mechanics from API to instruction.
This is mastery. It does not come from reading documents; it comes from writing, debugging, and shipping COW code. Use these documents as reference; let production teach you the rest.
The pattern's beauty¶
COW is one of the rare patterns where simplicity at the API level coincides with depth at the implementation level. atomic.Pointer.Store(next) looks trivial; behind it lies the entire Go memory model, the GC, write barriers, cache coherence, and hardware memory models.
A novice can use COW correctly. An expert understands every layer.
The pattern's risks¶
COW's elegance masks two risks: 1. Misuse — using COW where mutex would be better, or vice versa. 2. Hidden costs — pinned snapshots, GC pressure, cache-line effects.
Both are addressed by measurement. Trust benchmarks over instinct.
What to do next¶
- Build a COW system in your codebase.
- Run it with the race detector.
- Profile its memory and CPU under realistic load.
- Identify one micro-optimization that helps; apply it.
- Document the design for the next engineer.
These steps turn knowledge into instinct.
Closing Note from the Author of This Curriculum¶
Copy-on-write is the kind of pattern that rewards study. The more you understand, the better your designs become. The four levels — junior, middle, senior, professional — are not gates but progressive expansions of what you can build.
This professional-level document is the last word, but every sentence is also an invitation to investigate further. Read the runtime source. Benchmark on your hardware. Debug your own systems. Each act of investigation deepens the picture.
Whether you ship a 50-line COW config or a million-entry sharded HAMT cache, the principles are the same. The complexity scales; the discipline does not.
Good luck. Build great things.
One Last Appendix: A Glossary Recap¶
Combining all levels' glossaries, the complete COW vocabulary:
| Term | Level introduced | One-line meaning |
|---|---|---|
| Copy-on-write | Junior | Replace, don't mutate |
| Snapshot | Junior | Immutable version of data |
atomic.Pointer[T] | Junior | Type-safe atomic pointer (Go 1.19+) |
atomic.Value | Junior | Pre-generic atomic container |
| Writer mutex | Junior | Serialises writers |
| Lost update | Junior | Two writers; one's update is overwritten |
| Snapshot consistency | Middle | One snapshot for one operation |
| Sharded COW | Middle | One store per shard |
| Batched writes | Middle | Accumulate updates; flush together |
| Watcher | Middle | Notified on snapshot change |
| Reload | Middle | Replace snapshot with newly-loaded |
| Structural sharing | Senior | New version shares unchanged parts |
| HAMT | Senior | Persistent hash map structure |
| RCU | Senior | Read-copy-update discipline |
| Quiescence | Senior | A reader has no in-flight references |
| Hazard pointer | Senior | Per-thread record of current pointer |
| EBR | Senior | Epoch-based reclamation |
| MVCC | Senior | Multi-version concurrency control |
| Persistent data structure | Senior | Preserves previous versions |
| Sequential consistency | Professional | Strongest practical memory model |
| Happens-before | Professional | Ordering relation in memory model |
| Acquire-release | Professional | Finer-grained ordering |
| Write barrier | Professional | GC hook on pointer writes |
| Cache line | Professional | 64-byte unit of memory transfer |
| False sharing | Professional | Unrelated data on same cache line |
| MESI | Professional | Cache coherence protocol |
| ABA | Professional | Pointer change CAS does not detect |
| Linearizability | Professional | Operation takes effect at single instant |
| Tricolor GC | Professional | Go's mark-sweep with grey state |
| Stop-the-world | Professional | Brief pause for GC bookkeeping |
| Stack scan | Professional | GC's pass over goroutine stacks |
| Write barrier flag | Professional | Conditional firing of barrier |
A professional Go engineer recognises every term in this table.
Truly Final Words¶
The journey is complete. From "what is a goroutine" to "how is atomic.Pointer[T].Store compiled to assembly", you've traversed the entire stack of copy-on-write in Go.
Use it.
Extended Section: The Long Walk Through the Runtime¶
For the truly motivated, here is a guided tour of the Go runtime as it pertains to COW, with file pointers and key functions.
File 1: src/sync/atomic/type.go¶
This defines the user-facing types. Look at Pointer[T any]:
The first field is a zero-sized phantom field for type-identity purposes. The second prevents copying. The third is the actual pointer storage.
The methods are wrappers around runtime intrinsics:
LoadPointer is declared but not defined here — it's an intrinsic recognized by the compiler.
File 2: src/sync/atomic/doc.go¶
Declares the intrinsic signatures with //go:noescape:
//go:noescape
func LoadPointer(addr *unsafe.Pointer) (val unsafe.Pointer)
//go:noescape
func StorePointer(addr *unsafe.Pointer, val unsafe.Pointer)
//go:noescape
func CompareAndSwapPointer(addr *unsafe.Pointer, old, new unsafe.Pointer) (swapped bool)
//go:noescape is critical: it tells the compiler the function does not cause its arguments to escape to the heap, enabling stack allocation.
File 3: src/runtime/internal/atomic/atomic_amd64.s¶
The amd64 implementation. For LoadPointer:
A single load. The x86 memory model guarantees atomicity for aligned 8-byte loads.
For StorePointer:
XCHGQ provides full-fence semantics implicitly (the LOCK prefix is implicit for XCHG with memory operands).
For CompareAndSwapPointer:
TEXT ·CompareAndSwapPointer(SB), NOSPLIT, $0-25
MOVQ ptr+0(FP), BX
MOVQ old+8(FP), AX
MOVQ new+16(FP), CX
LOCK
CMPXCHGQ CX, 0(BX)
SETEQ ret+24(FP)
RET
LOCK CMPXCHG is the canonical atomic CAS. SETEQ reads the zero flag (set by CMPXCHG on success).
File 4: src/runtime/mbarrier.go¶
The write barrier. Key function:
//go:nowritebarrierrec
//go:nosplit
func gcWriteBarrier(slot *uintptr, ptr uintptr) {
if !writeBarrier.enabled {
return
}
// ...
}
The barrier is conditional. Most of the time it returns early. During concurrent marking, it records the new pointer in the GC's tracking structures.
For atomic stores, the compiler emits the barrier check inline with the atomic operation. The check is one byte-load + one branch.
File 5: src/runtime/mgcmark.go¶
The GC marker. Functions of interest:
markroot: marks the roots (globals, stacks).scanstack: scans a goroutine's stack for pointers.gcDrain: the main marking loop.
For COW: scanstack is what finds your local snapshot pointers and keeps them alive. The scan is precise — it uses compiler-generated pointer maps.
File 6: src/runtime/proc.go¶
The scheduler. Functions related to COW indirectly:
runtime.Gosched: yields the current goroutine.goexit0: finalizes a goroutine.gopark/goready: blocks/unblocks goroutines.
When a goroutine exits, its stack-local snapshot references are no longer reachable. The snapshots can be GC'd if no other references exist.
File 7: src/runtime/malloc.go¶
Memory allocation. For COW writes, every snapshot rebuild calls into this code.
mallocgc: the main entry point for heap allocation.nextFree: per-P fast-path allocation.
Per-P allocation is the reason snapshot allocations are fast (~10 ns each). The per-P cache (mcache) is essentially a thread-local pool.
File 8: src/runtime/mheap.go¶
Heap management. The GC sweep phase walks here.
For COW: not directly relevant, but useful context. The heap is divided into spans; spans have size classes; allocations come from the appropriate span.
File 9: src/runtime/mgcpacer.go¶
GC pacing. Controls when GC triggers and how aggressively it runs.
For COW with bursty writes: GC pacer adapts. If allocation rate spikes, GC runs more frequently. If allocation rate is low, GC defers.
Reading exercise¶
Pick one of these files. Read it in full once. Read it again with a debugger attached to a small Go program. Each pass adds clarity.
This is how senior engineers become professional engineers: reading the runtime, understanding it, and connecting it back to their daily work.
Extended Section: Cross-Reference With Other Go Concepts¶
COW is just one pattern; it intersects with many others.
COW and channels¶
Channels are themselves COW-like: a send publishes a value; a receive observes it. The send happens-before the receive (channel ordering).
For COW snapshots passed through channels:
The channel provides the publish; the consumer reads. No atomic.Pointer needed.
Use channels for one-to-one or many-to-one publish. Use atomic.Pointer for one-to-many (broadcast).
COW and contexts¶
context.Context is value-typed and immutable. Setting a value returns a new context that shares parent context. This is structural sharing applied to request-scoped data.
A snapshot can be added to a context:
Subsequent calls retrieve it:
Idiomatic and safe.
COW and errgroup¶
For parallel snapshot operations:
g, ctx := errgroup.WithContext(parentCtx)
snap := store.Get()
for _, item := range snap.Items {
item := item
g.Go(func() error {
return process(ctx, item)
})
}
return g.Wait()
Each parallel worker sees the same snapshot. Consistent.
COW and sync.Once¶
A common pattern: lazy initialization with COW once.
var (
cfgOnce sync.Once
cfg *Config
)
func GetConfig() *Config {
cfgOnce.Do(func() {
cfg = loadInitial()
})
return cfg
}
This is essentially COW with one Store. The sync.Once provides the happens-before edge.
But: this doesn't support updates. For reload-capable config, use atomic.Pointer[T].
COW and sync.Pool¶
For temporary objects (e.g., builders during snapshot construction):
var builderPool = sync.Pool{
New: func() any { return &Builder{m: make(map[string]string)} },
}
func buildSnapshot(...) *Snapshot {
b := builderPool.Get().(*Builder)
defer builderPool.Put(b)
clear(b.m) // reset
// ... use b
snap := &Snapshot{m: b.m}
// careful: snap.m shares with b.m; do a deep copy if b is pooled
return snap
}
Mixing pools and COW requires care: pooled objects must not be referenced after Put.
COW and generics¶
The whole atomic.Pointer[T] API is generic. For typed COW stores:
type COWStore[T any] struct {
cur atomic.Pointer[T]
mu sync.Mutex
}
func NewCOWStore[T any](initial *T) *COWStore[T] {
s := &COWStore[T]{}
s.cur.Store(initial)
return s
}
func (s *COWStore[T]) Get() *T { return s.cur.Load() }
func (s *COWStore[T]) Update(fn func(*T)) {
s.mu.Lock()
defer s.mu.Unlock()
old := s.cur.Load()
var next T
if old != nil {
next = *old
}
fn(&next)
s.cur.Store(&next)
}
One library, all your COW use cases.
COW and embedding¶
For composable COW stores:
type BaseStore struct {
cur atomic.Pointer[Config]
mu sync.Mutex
}
type ExtendedStore struct {
BaseStore
auditLog []AuditEntry
}
The embedded BaseStore provides COW behavior. The ExtendedStore adds audit logging on top.
COW and interfaces¶
You can hide a COW store behind an interface:
Implementations can be COW, mutex-based, or remote (RPC). Consumers don't need to care.
Extended Section: Compile-Time and Build-Time Considerations¶
Build tags¶
For cross-platform COW code:
The standard library's runtime/internal/atomic uses this pattern extensively.
Inlining¶
Small functions like Pointer.Load are usually inlined by the compiler. Force inlining is not possible; the compiler decides based on body size.
To verify inlining:
For hot read paths, ensure inlining happens. If not, refactor.
Escape analysis¶
Snapshots typically escape. Builders, on the other hand, may stay on the stack.
Bounds checking¶
Bounds checks add small overhead. For tight loops over slice fields, occasional // nolint:gosec and direct index access via _ = slice[len-1] can hint the compiler to skip subsequent bounds checks.
For COW reads, bounds checks rarely matter.
Race detector overhead¶
The race detector adds ~10× overhead. Use in tests, not in production builds.
Build with -race for staging environments to catch subtle races in real workloads.
CGO and atomics¶
If you mix Go and C code, atomic operations on shared memory require careful coordination. C's _Atomic types and Go's sync/atomic use the same hardware primitives but with different language bindings.
For COW: avoid sharing snapshots across the Go/C boundary. Pass them by value or use a separate synchronization mechanism.
Extended Section: Patterns Reference¶
A consolidated catalogue of COW patterns covered in this curriculum.
Pattern: Plain rebuild COW¶
For small, infrequently-updated data. Junior level.
Pattern: Sharded COW¶
For map-shaped data with parallel writers. Middle level.
Pattern: Persistent HAMT inside COW¶
For large, frequently-updated maps. Senior level.
Pattern: Generational COW¶
For huge data with bursty writes. Senior level.
Pattern: Layered config¶
For hierarchical configuration with independent layers. Senior level.
Pattern: Versioned snapshots¶
For change detection, audit, conflict resolution. Senior level.
Pattern: Watcher list¶
For subscriber notification on changes. Middle level.
Pattern: SIGHUP reload¶
For operator-triggered config reloads. Middle level.
Pattern: Periodic poll¶
For background sync with external sources. Middle level.
Pattern: Snapshot per request¶
For request-scoped consistency. Middle level.
Pattern: Single-writer goroutine¶
For audit logging and backpressure. Middle level.
Pattern: Batched writer¶
For high write rates with GC sensitivity. Middle level.
Pattern: CAS loop¶
For lock-free writers with low contention. Middle level.
Pattern: SeqLock¶
For multi-word atomic reads. Professional level.
Pattern: Hazard pointers¶
For deterministic reclamation when GC isn't enough. Professional level.
Pattern: EBR (epoch-based reclamation)¶
For low-overhead manual reclamation. Professional level.
Pattern: Snapshot of snapshots¶
For consistent cross-store updates. Senior level.
Pattern: Snapshot diff¶
For audit, replay, change notifications. Senior level.
Pattern: Read-through cache with snapshot key¶
For memoization keyed on snapshot identity. Senior level.
Pattern: Distributed COW with control plane¶
For cross-process COW with reconciliation. Senior level.
Recognize each by name and use case; that's professional fluency.
Extended Section: An Architectural Q&A¶
Common questions a professional engineer faces.
Q: I have a write-heavy COW; what do I do?¶
A: Investigate whether the workload truly needs COW semantics. If you need consistent multi-key reads, switch to persistent HAMT. If you only need per-key consistency, switch to sync.Map or sharded mutex map.
Q: How do I tell if a snapshot is leaking?¶
A: Take heap profiles at intervals; look for growth in snapshot-related types. Use pprof goroutine to find pinned references. The "snapshot age" metric per holder is the canonical detection.
Q: My COW publish takes 100 ms; how do I speed it up?¶
A: Profile the build step. Common culprits: too-large rebuild, unnecessary deep copy, slow validation. Mitigate via structural sharing (persistent HAMT), batching, or moving expensive work outside the writer mutex.
Q: How do I add a watcher without blocking writers?¶
A: Use buffered channels with non-blocking send + drop on full. Or dispatch watchers to goroutines.
Q: When should I use atomic.Value instead of atomic.Pointer[T]?¶
A: Only when supporting pre-1.19 Go. New code: always atomic.Pointer[T].
Q: How do I migrate from RWMutex to COW?¶
A: Incrementally. Add an atomic.Pointer next to the RWMutex. Update writers to publish via atomic.Pointer. Migrate readers one at a time from RLock to Load. Eventually remove the RWMutex.
Q: My COW is correct but slow; what do I check?¶
A: First check that snapshots aren't huge. Then check write rate. Then look for cache-line ping-pong (use padding). Then look for GC overhead (allocations per write).
Q: How do I test COW correctness?¶
A: go test -race plus torture tests with many concurrent readers and writers. Assert old snapshots remain valid after subsequent updates.
Q: Can I COW across goroutines spawned in different packages?¶
A: Yes. The atomic.Pointer is just a memory location. Any goroutine with a reference to it can Load. Same for the snapshot — once published, it's a regular pointer.
Q: How do I handle very long writer mutex hold times?¶
A: Build outside the lock if possible. Take the lock only for the final Store. If you need read-modify-write semantics, hold the lock but minimize the work inside.
Q: What's the longest-lived snapshot I should allow?¶
A: Bounded by your memory budget. A rule of thumb: snapshots should not live longer than ~10x the typical update interval. For a 1-second update cadence, 10-second-old snapshots are concerning.
Extended Section: Things You Don't Need (But Are Tempted To Add)¶
Don't add: Manual reference counting¶
The GC handles this. Manual counting is fragile and almost always wrong.
Don't add: Custom memory allocator¶
Go's allocator is excellent. Custom allocators add complexity and rarely help.
Don't add: Lock-free everything¶
A simple writer mutex outperforms most lock-free designs for COW. CAS loops are tempting but rarely necessary.
Don't add: Snapshot diff in every publish¶
Diffs are expensive. Compute only when needed (e.g., for logging significant changes).
Don't add: Cross-process sync mechanisms¶
Stick to single-process COW. For cross-process, use established tools (etcd, Redis, etc.).
Don't add: Manual cache-line padding everywhere¶
Padding helps only in measured cases. Premature padding wastes memory.
Don't add: Custom atomic types¶
atomic.Pointer[T], atomic.Int64, etc. cover essentially every case. Custom atomics introduce bugs.
Don't add: Snapshot serialization without a clear need¶
Snapshots are in-memory. Serializing on every publish is expensive and usually unnecessary.
Don't add: Reflection¶
Reflection on snapshots is slow. Use explicit code.
Don't add: Generic abstractions before they're needed¶
Premature genericness adds complexity. Start concrete; generalize when patterns emerge.
Extended Section: The Mental Model Recap¶
At every level, a different mental model dominates.
Junior: "I store a pointer; everyone loads it."
Middle: "I publish a complete, validated snapshot; readers see one consistent view."
Senior: "I share structure with previous versions to keep the per-write cost bounded."
Professional: "Atomic operations create happens-before edges that the entire pattern relies on; the GC handles reclamation."
The progression is not just about depth — it's about which abstraction is in focus. At each level, you operate at the right level for your work.
A senior engineer never thinks about MOVQ instructions while writing code. A professional engineer never thinks about senior-level concerns when debugging — they go straight to the runtime. Both fluencies serve different moments.
Extended Section: Final Synthesis Diagram¶
+-------------------------------------------------------------+
| Application Layer |
| - Config, FeatureFlags, RouteTable, etc. |
+-------------------------------------------------------------+
| Pattern Layer |
| - COW + writer mutex |
| - Persistent HAMT for large maps |
| - Sharded COW for parallel writers |
| - Batched updates for GC pressure |
+-------------------------------------------------------------+
| Language Layer |
| - atomic.Pointer[T] (Go 1.19+) |
| - sync.Mutex / sync.RWMutex |
| - sync.Map |
| - struct value copy semantics |
+-------------------------------------------------------------+
| Memory Model |
| - Sequential consistency for atomics |
| - happens-before relation |
| - Race detector |
+-------------------------------------------------------------+
| Runtime |
| - Atomic intrinsics |
| - Write barrier |
| - Garbage collector (mark-sweep, concurrent) |
| - Goroutine scheduler |
+-------------------------------------------------------------+
| Hardware |
| - CPU memory model (TSO on x86, weak on ARM) |
| - Cache hierarchy (L1/L2/L3) |
| - Cache-line size (64 bytes) |
| - Cache coherence (MESI/MOESI) |
+-------------------------------------------------------------+
Each layer rests on the one below. A change in any layer ripples through. A professional engineer understands all layers and can debug at any of them.
Final Outro¶
You have reached the end of the professional level. The COW journey is complete.
Take what you've learned and use it. Build COW systems that scale. Debug them with confidence. Mentor others through the same journey.
The pattern is timeless. The Go-specific details may evolve, but the principles — immutable snapshots, atomic publication, garbage collection, structural sharing, memory ordering — endure.
Welcome to professional-level mastery. Now go build something.
Bonus Section: Atomic Operations Reference¶
A complete reference of every sync/atomic API and its COW relevance.
atomic.Bool¶
A single boolean. Used as a flag.
COW use: indicating "is the snapshot initialized." Rarely needed if you always Store the initial snapshot in the constructor.
atomic.Int32, atomic.Int64¶
Signed integers. Common for counters and version numbers.
COW use: snapshot version (alongside atomic.Pointer[T]).
atomic.Uint32, atomic.Uint64¶
Unsigned integers. Similar to Int32/Int64.
atomic.Uintptr¶
Machine-word-sized integer. Used for low-level pointer arithmetic.
atomic.Pointer[T]¶
The COW workhorse. Typed atomic pointer.
Use this in 95% of COW code.
atomic.Value¶
Pre-generic atomic container. Stores interface{}.
Avoid in new code.
atomic.AddInt32/64¶
Atomically add. Returns the new value.
For COW: used in counters embedded inside snapshots (if you accept that they break strict immutability).
atomic.CompareAndSwapInt32/64/Pointer¶
CAS. Returns true if the swap happened.
For COW: lock-free writer pattern.
atomic.LoadInt32/64/Pointer¶
Atomic load. Returns the value.
For COW: the read path.
atomic.StoreInt32/64/Pointer¶
Atomic store.
For COW: the publish.
atomic.SwapInt32/64/Pointer¶
Atomic swap: stores new, returns old.
For COW: rarely needed; Store plus a separate Load suffices.
Choosing among them¶
| Need | Use |
|---|---|
| Single boolean flag | atomic.Bool |
| Counter | atomic.Int64.Add |
| Pointer to immutable snapshot | atomic.Pointer[T] |
| Pre-1.19 codebase | atomic.Value |
| Multi-word atomic | SeqLock pattern with multiple atomic.Uint64s |
Bonus Section: Microbenchmarks of Common Operations¶
Sample numbers on a recent Intel Xeon at 3 GHz.
Atomic Loads¶
Pointer.Load (uncontended, L1): 1.5 ns
Pointer.Load (L2): 5 ns
Pointer.Load (L3): 25 ns
Pointer.Load (main memory): 100 ns
Pointer.Load (cross-NUMA): 200 ns
Int64.Load (uncontended, L1): 1.5 ns
Int64.Load + Add: 3 ns
Int64.CompareAndSwap (success): 15 ns
Int64.CompareAndSwap (failure): 15 ns + retry cost
Atomic Stores¶
Mutex Operations¶
Mutex.Lock (uncontended): 20 ns
Mutex.Unlock (uncontended): 15 ns
RWMutex.RLock (uncontended): 15 ns
RWMutex.RUnlock (uncontended): 10 ns
RWMutex.Lock (uncontended): 30 ns
Map Operations¶
map[K]V Get: 5 ns
map[K]V Set (no rehash): 10 ns
map[K]V Set (with rehash): microseconds
sync.Map.Load: 30 ns
sync.Map.Store: 150 ns
Channel Operations¶
ch <- v (buffered, has space): 50 ns
ch <- v (unbuffered, receiver ready): 100 ns
<-ch (buffered, has value): 50 ns
<-ch (unbuffered, sender ready): 100 ns
GC¶
GC start (STW): ~100 µs
GC mark phase (concurrent): variable (50ms for 200 MB heap)
GC sweep (concurrent): background
Write barrier active: adds ~5 ns per pointer write
Use these as starting estimates. Your hardware will vary.
Bonus Section: Real-World COW Microbenchmark Suite¶
A complete suite to evaluate any COW design.
package cowbench
import (
"sync"
"sync/atomic"
"testing"
)
type Config struct {
X, Y, Z int
}
// Baseline: atomic.Pointer
type AP struct {
p atomic.Pointer[Config]
}
func NewAP() *AP {
a := &AP{}
a.p.Store(&Config{})
return a
}
func (a *AP) Get() *Config { return a.p.Load() }
func (a *AP) Set(c *Config) {
a.p.Store(c)
}
// Baseline: RWMutex
type RW struct {
mu sync.RWMutex
c *Config
}
func NewRW() *RW { return &RW{c: &Config{}} }
func (r *RW) Get() *Config {
r.mu.RLock()
defer r.mu.RUnlock()
return r.c
}
func (r *RW) Set(c *Config) {
r.mu.Lock()
defer r.mu.Unlock()
r.c = c
}
// Baseline: Mutex
type Mu struct {
mu sync.Mutex
c *Config
}
func NewMu() *Mu { return &Mu{c: &Config{}} }
func (m *Mu) Get() *Config {
m.mu.Lock()
defer m.mu.Unlock()
return m.c
}
func (m *Mu) Set(c *Config) {
m.mu.Lock()
defer m.mu.Unlock()
m.c = c
}
// Benchmarks
func BenchmarkAPRead(b *testing.B) {
a := NewAP()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
_ = a.Get().X
}
})
}
func BenchmarkRWRead(b *testing.B) {
r := NewRW()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
_ = r.Get().X
}
})
}
func BenchmarkMuRead(b *testing.B) {
m := NewMu()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
_ = m.Get().X
}
})
}
func BenchmarkAPMixed(b *testing.B) {
a := NewAP()
b.RunParallel(func(pb *testing.PB) {
i := 0
for pb.Next() {
i++
if i%10 == 0 {
a.Set(&Config{X: i})
} else {
_ = a.Get().X
}
}
})
}
func BenchmarkRWMixed(b *testing.B) {
r := NewRW()
b.RunParallel(func(pb *testing.PB) {
i := 0
for pb.Next() {
i++
if i%10 == 0 {
r.Set(&Config{X: i})
} else {
_ = r.Get().X
}
}
})
}
Sample output on an 8-core machine:
BenchmarkAPRead-8 500000000 2.1 ns/op
BenchmarkRWRead-8 50000000 45 ns/op
BenchmarkMuRead-8 30000000 60 ns/op
BenchmarkAPMixed-8 50000000 35 ns/op
BenchmarkRWMixed-8 20000000 120 ns/op
BenchmarkMuMixed-8 20000000 150 ns/op
Pure reads: COW is 20× faster than RWMutex, 30× faster than Mutex. Mixed 90R/10W: COW is 3-4× faster.
For your real workload, run a similar suite and decide based on numbers.
Bonus Section: Notable Real-World COW Code Examples¶
Reading real code is the fastest way to internalize patterns. Here are pointers to real Go code that uses COW:
Standard library¶
crypto/tls.Config— uses snapshot pattern viaClone.net/http.ServeMux— uses RWMutex but the pattern is COW-flavored.expvarpackage — atomic.Value for variables.
Prometheus¶
client_golang/prometheus.Registry— uses RWMutex but the read path is conceptually COW.
etcd¶
raftpackage — uses COW for state machine snapshots.
Docker / Swarm¶
- Internal state machines use snapshot replication.
Reading recommendations¶
- Read
crypto/tls.Config.Clone()first. Short, clean example. - Read
expvar's use of atomic.Value. Shows the pre-generics pattern. - Read
etcd/raft's snapshot code for distributed COW.
Each gives a different perspective on production COW.
Bonus Section: Tools and Utilities¶
Useful tools for COW work.
go vet¶
Catches common mistakes: - copylocks — accidentally copying a struct containing a mutex. - atomic — misuse of atomic operations.
Run on every PR.
Race detector¶
Already covered. Run with -race everywhere except production.
pprof¶
Already covered. The Swiss Army knife of Go performance.
go tool trace¶
Visualizes goroutine execution and GC. Helpful for understanding write-burst behavior.
staticcheck¶
Third-party linter. Catches additional patterns including some COW misuse.
dlv (Delve)¶
The Go debugger. Set a breakpoint at atomic.StorePointer to watch publishes in action.
go-callvis¶
Visualizes call graphs. Useful for understanding how snapshots flow through a system.
benchstat¶
Compares benchmark results.
go test -bench=. -count=10 > old.txt
# make changes
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt
Essential for any performance work.
Bonus Section: A Production Diary Across Levels¶
Imagine a service shipped with junior-level COW, then evolved across years.
Year 0: Initial deployment¶
5 routes, 10 features, 1 reload/day. Simple atomic.Pointer[Config] with writer mutex. No metrics. Works perfectly.
Year 1: Growth¶
500 routes, 100 features, 10 reloads/day. Snapshot grew to 100 KB. Still fine.
Year 2: First incident¶
A misconfigured route caused 5xx errors. Operators needed to know "what's the current config?" Added metrics: config_version, config_age. Added reload-failure counter.
Year 3: Pinned snapshot bug¶
A goroutine leak caused memory growth. Took 2 weeks to diagnose. Root cause: forgotten subscriber whose handler hung. Added timeout. Added pinned_snapshots gauge.
Year 4: Sharding¶
Routes grew to 50 000. Single-config rebuilds caused 50 ms latency spikes. Sharded by URL prefix. Per-shard rebuilds <1 ms.
Year 5: Persistent HAMT¶
Per-tenant configs grew. Migrating to persistent HAMT for cheap incremental updates. Reads slightly slower; writes 100× faster.
Year 6: Distributed COW¶
Multi-region deployment. Added control-plane fanout. Each region has local COW; control plane pushes updates.
Year 7: Observability tooling¶
Built a "config dashboard" showing version + age across all instances. Operators love it. Incident response time dropped.
Year 8: Migration to weak references¶
For snapshot-keyed memoization caches, switched to weak.Pointer[T]. Memory growth from stale cache entries disappeared.
Each year introduced a new pattern from this curriculum. The patterns scaled as the system grew.
Bonus Section: Lessons for Mentoring Junior Engineers¶
If you're a senior engineer mentoring juniors on COW, the key lessons:
Lesson 1: Start with the rule¶
"Don't mutate published snapshots." That's it. Everything else follows.
Lesson 2: Show the bug first¶
Write a deliberately buggy COW snippet. Show the race detector flagging it. Then fix it. The lesson sticks.
Lesson 3: Build a working example¶
Walk through a 50-line COW config end-to-end. Run it. Trigger reloads. Observe behavior.
Lesson 4: Demonstrate the failure modes¶
Show what happens with a long-pinned snapshot. Show heap growth. Show how to detect it via pprof.
Lesson 5: Benchmark against alternatives¶
Show RWMutex vs COW timing. Numbers convince.
Lesson 6: Don't drown them in theory¶
Memory model details, structural sharing, RCU — these come later. Junior engineers need the pattern and the rule; theory follows experience.
Lesson 7: Encourage go test -race¶
Always. Every project. Every PR. Make it habitual.
Lesson 8: Code review COW changes carefully¶
Mutation bugs are subtle. A second pair of eyes catches them.
A junior engineer who internalizes these eight lessons can ship COW code safely. The depth comes later.
Bonus Section: COW Design Review Checklist¶
For PRs introducing or modifying COW code, ask:
- Is the snapshot type clearly documented as immutable?
- Are all slices and maps inside the snapshot deep-copied before mutation?
- Is there a writer mutex (or alternative) preventing lost updates?
- Is the initial snapshot stored before any reader could run?
- Are tests using the race detector?
- Are reads cached in a local variable rather than calling Load multiple times?
- Is validation done before publish?
- On reload failure, does the old snapshot remain current?
- Are metrics emitted for version, age, success/failure?
- Are watchers correctly unsubscribed (no leaks)?
- Is the snapshot small enough that rebuild cost is acceptable?
- Has the design been benchmarked against RWMutex / sync.Map alternatives?
A code review with this checklist catches 95% of COW bugs and design issues.
Bonus Section: Glossary of Useful Commands¶
For working with COW code at the runtime level.
# Run with race detector
go test -race ./...
# Profile heap
go tool pprof http://localhost:6060/debug/pprof/heap
# Profile goroutines
go tool pprof http://localhost:6060/debug/pprof/goroutine
# Profile mutex contention
go tool pprof http://localhost:6060/debug/pprof/mutex
# Force a GC (debugging)
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap?gc=1
# Trace
go tool trace trace.out
# Inspect escape analysis
go build -gcflags '-m -m' .
# Inspect generated assembly
go build -gcflags '-S' . 2>&1 | less
# Benchmark
go test -bench=. -benchmem
# Compare benchmarks
benchstat old.txt new.txt
# Vet
go vet ./...
# Staticcheck
staticcheck ./...
# Run a single test
go test -run TestCOW -v -race -count=10
Memorize these. They become muscle memory.
Bonus Section: Real-World COW Latencies Across the Stack¶
A breakdown of where time goes in a typical COW operation.
Read (full breakdown)¶
1. Function call overhead 0.5 ns
2. atomic.Pointer.Load 1.5 ns (cache hit)
3. Pointer dereference 1.0 ns
4. Field access 0.5 ns
5. Function return 0.5 ns
Total ~4 ns
Write (full breakdown)¶
1. Mutex.Lock 20 ns (uncontended)
2. atomic.Pointer.Load (current snapshot) 1.5 ns
3. Allocate new snapshot struct 20 ns (10 ns for malloc + GC accounting)
4. Copy old snapshot fields ~20 ns for a small struct
5. Apply update depends on operation
6. atomic.Pointer.Store 10 ns
7. Mutex.Unlock 15 ns
Total ~90 ns + update cost
For a 1000-entry map write with full rebuild:
... (steps 1-2 as above) ~22 ns
3. Allocate 1000-entry map ~30 µs
4. Copy 1000 entries ~30 µs
5. Apply update 1 µs
... (steps 6-7 as above) ~25 ns
Total ~61 µs
Map rebuild dominates. For HAMT update:
... (steps 1-2 as above) ~22 ns
3. Allocate ~6 nodes for HAMT path ~100 ns
4. Set value in deepest node 50 ns
... (steps 6-7 as above) ~25 ns
Total ~200 ns
HAMT is 300× faster than full rebuild for 1000-entry maps. As size grows, the gap widens.
Bonus Section: An Existential Note¶
After thousands of lines about COW, what is the takeaway?
COW is a tool. A powerful one. But just a tool.
The deeper takeaway: the patterns we build matter less than the discipline with which we apply them. A simple COW done correctly outperforms a fancy persistent HAMT done sloppily. Discipline beats cleverness.
This discipline is what separates the engineer who ships reliable software from the engineer who debugs production at 3am. Patterns, benchmarks, observability, code review — they all serve discipline.
Use COW where it fits. Don't use it where it doesn't. Measure. Review. Document. Repeat.
That's the whole craft.
Bonus Section: Where to Go From Here¶
After mastering COW, related topics that build on the same foundations:
- Lock-free data structures. COW is a starting point; lock-free queues, stacks, and trees are extensions.
- Database internals. MVCC in PostgreSQL, LSM trees in RocksDB, COW filesystems like ZFS.
- Distributed systems. Replication protocols, CRDTs, vector clocks.
- Concurrent algorithms. Parallel hashtables, work-stealing schedulers, RCU implementations in other languages.
- Memory models. C++, Rust, Java memory models. Each has lessons applicable to Go.
Pick one and dive deeper. The journey continues.
Final Reflection¶
I once thought atomic.Pointer was a clever trick. Now I see it as the natural expression of a fundamental concurrency principle: separate the publish from the construct, the immutable from the mutable, the readable from the writable.
In Go, this principle has the right ergonomics: a single API, garbage collection, simple memory model. The result is a pattern accessible to juniors and powerful enough for the largest production systems.
The four levels of this curriculum mirror four years of mentoring engineers through COW. Each level closes a specific question:
- Junior: "How do I use this?"
- Middle: "When do I use this?"
- Senior: "How do I scale this?"
- Professional: "Why does this work?"
By the end, you should be able to answer any of these questions about any concurrent data structure. The skills transfer.
Build well.
Final Inventory¶
This file covered:
- The Go memory model in detail.
- Inside
sync/atomic.Pointer[T]andsync/atomic.Value. - Codegen across architectures (amd64, arm64, riscv64).
- Memory ordering primitives.
- GC interaction with snapshots.
- Write barriers and pointer writes.
- Escape analysis.
- Cache-line effects and false sharing.
- RCU in the Linux kernel.
- Hazard pointers and EBR.
- Lock-free subtleties (ABA, linearizability, etc.).
- Weak references.
unsafe.Pointerusage.- Assembly-level details.
- Profiling techniques.
- Correctness proofs.
- Designing for hardware.
- A complete COW configuration example.
- Deep dives into specific patterns.
- Cross-references with other Go concepts.
That's the professional level. Use it as both a learning resource and a reference. Re-read sections as needed.
The Last Word¶
Copy-on-write in Go is a pattern of remarkable elegance and depth. From atomic.Pointer[T].Store(next) to the MOV instruction that implements it, every layer is approachable.
You now have the complete picture. Whether you build a 50-line config or a million-entry sharded HAMT, the principles are the same. Apply them with discipline. Measure relentlessly. Document for the next engineer.
That's the professional-level mastery.
Now: build something.
Postscript: Selected Code Reading Exercises¶
For the engineer who has finished this curriculum and wants to deepen further, work through these exercises. Each takes 30 minutes to 2 hours.
Exercise 1: Implement atomic.Pointer[T] from scratch¶
Without looking at the standard library, write your own typed atomic pointer using unsafe.Pointer and the runtime intrinsics. Compare against the standard library's source.
Exercise 2: Read the runtime atomic package¶
Open src/runtime/internal/atomic in your Go installation. Read three files in detail. Write a one-paragraph summary of each.
Exercise 3: Write a persistent linked list¶
Implement type List[T] struct { head T; tail *List[T] } with Cons, Head, Tail. Verify with tests that old versions remain accessible after Cons.
Exercise 4: Build a sharded COW map¶
Implement a sharded COW map with 16 shards. Benchmark against a single-shard version. Measure write throughput improvement.
Exercise 5: Profile a leaky COW system¶
Write a deliberately leaky COW system (e.g., spawn goroutines that hold snapshots and sleep forever). Use pprof to find the leak. Document your findings.
Exercise 6: Compare atomic.Value to atomic.Pointer[T]¶
Write benchmarks. Show the performance gap on your hardware.
Exercise 7: Implement a HAMT¶
From scratch, write a working HAMT in Go. Implement Get, Set, Delete, Range. Test with concurrent reads + serialized writes.
Exercise 8: Run with the race detector¶
Take a small COW package and deliberately introduce a snapshot mutation bug. Verify the race detector catches it.
Exercise 9: Reverse-engineer the codegen¶
Pick a function that uses atomic.Pointer.Load. Build with -gcflags '-S'. Annotate the resulting assembly line by line.
Exercise 10: Write a postmortem¶
For a COW system you've seen or built, write a hypothetical postmortem for a failure. Identify root cause, contributing factors, mitigation.
Each exercise reinforces a different aspect of the curriculum.
Postscript: A Note on Style¶
Reading and writing COW code well includes stylistic choices. A few I have come to value:
Style: Use clear names¶
Not var p atomic.Pointer[Config] but var config atomic.Pointer[Config]. The variable name should communicate the role.
Style: Document immutability¶
Every snapshot type should have a doc comment about its contract.
// Config is an immutable configuration snapshot. After publishing via
// Store.cur.Store(c), fields must not be modified.
type Config struct { ... }
Style: Group fields by mutability¶
Even though the snapshot is "all immutable," group fields that change together. The reader can tell which fields are likely to vary across snapshots.
Style: Keep the writer API small¶
Three methods cover most cases: Get, Update, Replace. Avoid proliferating ad-hoc setters.
Style: Test concurrent behavior¶
Even if your tests pass without -race, run with -race. The discipline is part of the code.
Style: Comment the writer mutex¶
It's easy to mistake the writer mutex for a general protection. Add a comment:
Style: Avoid premature optimization¶
Start with atomic.Pointer[T] and a mutex. Move to fancier patterns only when measurements demand it.
Style: Document the workload assumption¶
// Config is read on every request (~10K/sec); reloaded on SIGHUP (~1/day).
// COW chosen for the read-mostly profile and snapshot consistency.
type Store struct { ... }
Future maintainers benefit.
Postscript: Honoring the Patterns That Came Before¶
COW did not emerge in a vacuum. It draws on decades of computer science:
- John McCarthy's Lisp (1958): persistent lists.
- Phil Bagwell's HAMT (2001).
- Rich Hickey's Clojure (2007): brought persistent structures to mainstream.
- Paul McKenney's RCU (early 1990s, hardened in Linux kernel).
- David Mosberger and others on memory models.
The Go runtime is a culmination of these ideas. When you write atomic.Pointer[Config].Store(next), you stand on the shoulders of these giants.
Acknowledging this is not just etiquette — it's a reminder that good engineering builds on tradition. We are not inventing new patterns; we are applying old ones with new ergonomics.
Postscript: A Final Inventory¶
This professional-level file has covered:
- The Go memory model in formal detail.
- The implementation of
atomic.Pointer[T]andatomic.Value. - Codegen across amd64, arm64, riscv64.
- Cache effects, false sharing, NUMA.
- The garbage collector's role.
- Write barriers and their cost.
- RCU as the kernel-space analogue.
- Hazard pointers and EBR.
- Lock-free programming subtleties.
- Weak references.
unsafe.Pointerusage.- Profiling techniques.
- Real-world walkthroughs.
- Cross-references with other Go concepts.
- A complete production-quality COW example.
That is the professional-level surface area.
Postscript: A Welcoming Word for the Next Reader¶
If you are reading this curriculum for the first time, welcome. The COW pattern is one of the most rewarding to master in Go. Take your time. Re-read sections. Run the code. Profile the benchmarks.
If you are returning for reference, welcome back. The patterns are stable; the wisdom accumulates with each application.
If you are mentoring others, this curriculum is yours to use, adapt, and extend.
Knowledge wants to be shared.
Truly The Last Word¶
Copy-on-write in Go: an elegant pattern, a deep mechanism, a daily practice. From four lines of Go to thousands of words of explanation, the journey is complete.
Build well. Measure deeply. Share generously.
That is the professional engineer's creed.
Final Appendix: Quick-Reference Card¶
Pattern:
var p atomic.Pointer[T]
p.Store(initial)
v := p.Load()
Update:
mu.Lock(); defer mu.Unlock()
cur := p.Load()
next := *cur
fn(&next)
p.Store(&next)
Costs (amd64):
Load: ~1.5 ns
Store: ~10 ns
CAS: ~15 ns
Memory model:
Store happens-before any Load observing it.
Rule:
Never mutate a published snapshot.
Choose COW when:
Reads >> Writes (>1000:1).
Snapshot is small (<10 MB).
Need multi-field consistency.
Avoid COW when:
Write-heavy workload.
Snapshot is huge.
Per-element transactions needed.
Tools:
go test -race
go tool pprof
go tool trace
benchstat
staticcheck
Migration paths:
COW -> sharded COW (write contention)
COW -> persistent HAMT (snapshot too large)
COW -> sync.Map (write rate climbs)
Carry this in your head. The rest you can look up.
End of professional level. End of COW curriculum.
Now build.