Acquire / Release — Middle Level¶

Table of Contents¶

Introduction
Recap from Junior
Glossary
Mental Model: Synchronizes-With
Publication Patterns at Scale
Lazy Init Beyond sync.Once
Read-Mostly State with atomic.Pointer[T]
Hot-Reload Configuration
Worker Lifecycle Signals
Caching with Single-Flight Behaviour
Read-Copy-Update (RCU) Pattern
Compare-and-Swap as Publication
Sequential Consistency vs Acq/Rel — When the Difference Bites
Designing the API of a Concurrent Struct
Pitfalls and Anti-Patterns
Cost Models
Testing Concurrent Publication
Stress Testing and the Race Detector
Profiling Hot Atomics
When Atomics Aren't Enough
Cheat Sheet
Self-Assessment
Summary
Further Reading
Related Topics
Diagrams

Introduction¶

Focus: "Now that I know the primitives, how do I compose them into real systems?"

At the junior level you learned that publication needs a release on the producer and an acquire on the consumer, on the same location. You learned the six primitives (mutex, RWMutex, atomic, channel, sync.Once, WaitGroup) and their idiomatic uses.

At the middle level we move from "primitives in isolation" to "patterns in a real service." A real service has:

A configuration that may reload at runtime.
A pool of workers with start/stop signals.
Caches that must invalidate atomically.
Snapshots that must be consistent across multiple fields.
Hot paths where atomics matter for performance.

We will design six recurring patterns that you will see in every nontrivial Go service. Each pattern is motivated by a problem, implemented step-by-step, and benchmarked. By the end, you should be able to look at a concurrent data structure and reason: "Where is the release? Where is the acquire? What does it publish?"

You should also understand when the simple patterns break down — when a single atomic.Pointer is insufficient and you need something stronger.

Recap from Junior¶

The contract:

A release on location L publishes all prior writes (by the same goroutine).
An acquire on the same L observes those writes (if it observed the released value).
Together they form a synchronizes-with edge.

The primitives:

sync.Mutex, sync.RWMutex: Lock = acquire, Unlock = release.
sync.Once.Do: both, around the user function.
sync.WaitGroup: Done = release, Wait = acquire.
chan send/recv: send = release, recv = acquire.
close(chan) = release; <-chan (closed) = acquire.
sync/atomic ops: every operation is acq-rel, sequentially consistent.

Idioms:

atomic.Pointer[T] for one-shot or read-mostly publication.
sync.Once for lazy init.
close(chan) for one-shot broadcast.

Move on if those feel automatic. If not, re-read junior.md.

Glossary¶

Term	Definition
Sequenced-before	Within a single goroutine, the program order of statements. Establishes happens-before within a goroutine.
Synchronizes-with	An edge between a release on goroutine A and an acquire on goroutine B that observed the release. Crosses goroutines.
Happens-before	The transitive closure of sequenced-before and synchronizes-with. The fundamental memory-model relation.
Linearization point	The instant at which a concurrent operation "takes effect" — typically the release or the successful CAS.
Wait-free	An algorithm where every operation completes in a bounded number of steps regardless of other goroutines. `atomic.Load` reads are wait-free.
Lock-free	At least one goroutine makes progress at every step. Many CAS-retry loops are lock-free but not wait-free.
Obstruction-free	A goroutine makes progress if no other goroutine touches its data. Weaker than lock-free.
CAS (compare-and-swap)	An atomic read-modify-write that succeeds only if the current value matches an expected one. The fundamental building block of lock-free algorithms.
ABA problem	A CAS confused into thinking nothing happened when in fact the value changed A → B → A between the read and the CAS.
Single-flight	A pattern where many concurrent requests for the same key collapse into one upstream fetch. `golang.org/x/sync/singleflight`.
Read-copy-update (RCU)	A pattern where readers see a snapshot, writers allocate a new version and republish, and old versions are reclaimed once no readers hold them.
Snapshot isolation	Readers see a consistent snapshot of state; concurrent writes don't change what they see.
Hot-reload	Updating configuration or state without restarting the process.
Generation counter	A monotonic integer incremented on each update, used to invalidate cached snapshots.
Epoch-based reclamation	A garbage-collection technique for lock-free structures: defer freeing until no thread is "in" an old epoch. Go's GC handles this for you.
Memory hierarchy	Registers → L1 → L2 → L3 → main memory → swap. Atomics ensure writes propagate through this hierarchy in a consistent order.

Mental Model: Synchronizes-With¶

Think of every release-acquire pair as drawing an arrow:

G1: ... ─ release(L) ────┐
                         │ "synchronizes-with"
G2: ... ── acquire(L) ──◄┘

Every operation in G1 sequenced-before the release is now happens-before every operation in G2 sequenced-after the acquire.

Multiple arrows compose. If G1 publishes to L1, G2 acquires L1 and publishes to L2, G3 acquires L2 — then G3 sees everything G1 wrote, transitively.

G1: write x = 5;  release(L1)
                       ↓ s-w
G2:                acquire(L1);  write y = x + 1;  release(L2)
                                                       ↓ s-w
G3:                                                acquire(L2);  read x, read y

G3 sees x = 5 and y = 6, because:

G1 happens-before G2 (via L1).
G2 happens-before G3 (via L2).
Therefore G1 happens-before G3 (transitivity).

This chaining is why programs with many synchronization points still have a coherent global view.

Publication Patterns at Scale¶

A publication pattern solves: "How do I share a value between goroutines safely?" We'll examine six patterns, ordered by complexity.

Pattern 1: one-shot publication¶

The simplest case. One goroutine builds, one or many goroutines consume, no updates.

var current atomic.Pointer[Config]

// Producer (once at startup):
current.Store(loadConfig())

// Consumers (anywhere):
cfg := current.Load()
use(cfg)

Use when: configuration loaded at startup, used forever.

Pattern 2: one-shot signal¶

ready := make(chan struct{})

// Producer:
prepareEverything()
close(ready)

// Consumers:
<-ready
useEverything()

Use when: many consumers must wait for a one-time event.

Pattern 3: lazy init¶

var (
    once sync.Once
    db   *sql.DB
    err  error
)

func DB() (*sql.DB, error) {
    once.Do(func() { db, err = sql.Open(...) })
    return db, err
}

Use when: initialization is expensive and only needed if anyone asks.

Pattern 4: hot-reload (occasional updates)¶

var current atomic.Pointer[Config]

// Reload goroutine:
for range time.Tick(time.Minute) {
    if c, err := loadConfig(); err == nil {
        current.Store(c)
    }
}

// Consumers:
cfg := current.Load()
use(cfg)

Use when: writes are rare (seconds or minutes apart), reads are frequent.

Pattern 5: CAS-based update (concurrent writers)¶

var counter atomic.Int64

// Many goroutines:
for {
    old := counter.Load()
    new := transform(old)
    if counter.CompareAndSwap(old, new) {
        break
    }
}

Use when: many goroutines update the same value with a pure function.

Pattern 6: protected critical section¶

var (
    mu   sync.Mutex
    data State
)

func Update(f func(*State)) {
    mu.Lock()
    defer mu.Unlock()
    f(&data)
}

func Read() State {
    mu.Lock()
    defer mu.Unlock()
    return data
}

Use when: state is multi-field and mutations are non-pure (I/O, allocations).

Lazy Init Beyond `sync.Once`¶

sync.Once is the default. But there are situations where it falls short:

Failure modes of `sync.Once`¶

once.Do(f) calls f exactly once. If f panics, once considers itself done — future calls return immediately without running f. This may not be what you want.

Workaround: catch the panic inside f:

once.Do(func() {
    defer func() {
        if r := recover(); r != nil {
            err = fmt.Errorf("init panic: %v", r)
        }
    }()
    db, err = expensiveInit()
})

If f returns an error, still once considers itself done. Future callers see the same error. This is sometimes desired (don't retry a hopeless init), sometimes not (the dependency was just transiently down).

For retry-on-error, you need a different pattern:

type RetryOnce struct {
    mu   sync.Mutex
    done atomic.Bool
}

func (r *RetryOnce) Do(f func() error) error {
    if r.done.Load() {
        return nil
    }
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.done.Load() {
        return nil
    }
    if err := f(); err != nil {
        return err
    }
    r.done.Store(true)
    return nil
}

Now Do retries the initializer until one call succeeds. Concurrent callers serialize through the mutex; successful completion is published via the atomic.

`sync.OnceValue` and `sync.OnceValues`¶

Go 1.21 added two helpers:

var GetDB = sync.OnceValue(func() *sql.DB { ... })
var GetDBWithErr = sync.OnceValues(func() (*sql.DB, error) { ... })

OnceValue returns a function that returns the cached value. OnceValues returns one that returns a pair. Both are convenient wrappers around sync.Once + storage.

Use these for ergonomic single-value lazy init. They cannot retry on error.

Generic memoizing initializers¶

For multi-key lazy init (e.g., per-tenant database connections), use a sync.Map plus per-entry sync.Once:

type Pool[K comparable, V any] struct {
    m  sync.Map // map[K]*entry[V]
    fn func(K) V
}

type entry[V any] struct {
    once sync.Once
    val  V
}

func (p *Pool[K, V]) Get(k K) V {
    v, _ := p.m.LoadOrStore(k, &entry[V]{})
    e := v.(*entry[V])
    e.once.Do(func() { e.val = p.fn(k) })
    return e.val
}

Each key gets its own sync.Once. Concurrent calls for the same key block on the same Once; concurrent calls for different keys run in parallel.

The publication semantics: every Get(k) returns a value that has had its initializer fully run, with happens-before from the initializer to the caller. sync.Map.LoadOrStore provides its own acq-rel guarantees on the entry pointer.

Read-Mostly State with `atomic.Pointer[T]`¶

A reader-mostly state is the workhorse pattern for caches, configurations, lookup tables, and feature flags. The shape:

type ReadMostly[T any] struct {
    p atomic.Pointer[T]
}

func (r *ReadMostly[T]) Load() *T  { return r.p.Load() }
func (r *ReadMostly[T]) Store(t *T) { r.p.Store(t) }

Reads are wait-free (a single atomic load). Writes are wait-free (a single atomic store). The catch: the referent must be treated as immutable.

Why immutability matters¶

If you Store(&v) and then mutate v, readers that already loaded &v see the mutation. That's a race even though the pointer was published atomically.

Concrete bug:

var snap atomic.Pointer[Snapshot]

go func() {
    s := &Snapshot{Count: 0}
    snap.Store(s)
    s.Count = 1 // RACE with readers that loaded s
}()

go func() {
    s := snap.Load()
    if s != nil {
        fmt.Println(s.Count)
    }
}()

Fix: never mutate *s after Store(s). To "update," allocate a new Snapshot:

old := snap.Load()
new := &Snapshot{Count: old.Count + 1}
snap.Store(new)

If multiple writers race, use CAS:

for {
    old := snap.Load()
    new := &Snapshot{Count: old.Count + 1}
    if snap.CompareAndSwap(old, new) {
        break
    }
}

When the referent has slices or maps¶

A Snapshot containing a slice or map is still immutable from a publication standpoint — as long as you don't mutate the slice or map after Store.

type Snapshot struct {
    Hosts []string
}

s := &Snapshot{Hosts: []string{"a", "b"}}
snap.Store(s)
s.Hosts = append(s.Hosts, "c") // RACE: readers see a mutated slice

To update, copy the slice:

old := snap.Load()
newHosts := make([]string, len(old.Hosts)+1)
copy(newHosts, old.Hosts)
newHosts[len(old.Hosts)] = "c"
snap.Store(&Snapshot{Hosts: newHosts})

This pattern is called copy-on-write. It's the cornerstone of RCU.

When the referent has nested pointers¶

type Snapshot struct {
    User *User
}

s := snap.Load()
s.User.Name = "Bob" // Are we mutating?

You are. Even though s is a new Snapshot, s.User may be the same *User as in the previous snapshot. Mutating *s.User mutates the previous snapshot's user too.

Fix: deep-copy when transitioning.

The general rule for atomic.Pointer[T]-published structures: every reachable byte must be immutable. If you can't enforce that, use a mutex.

Hot-Reload Configuration¶

A real example combining several patterns. We want to:

Load configuration from disk at startup.
Watch the file for changes.
Atomically swap to the new configuration when it changes.
Have readers always see some valid configuration.

package config

import (
    "encoding/json"
    "fmt"
    "os"
    "sync/atomic"
    "time"
)

type Config struct {
    MaxConn  int
    Timeout  time.Duration
    Hosts    []string
}

var (
    current atomic.Pointer[Config]
    path    = "/etc/myservice/config.json"
)

func init() {
    c, err := loadFromDisk(path)
    if err != nil {
        c = &Config{MaxConn: 100, Timeout: time.Second, Hosts: nil}
    }
    current.Store(c)
    go watchLoop()
}

func Get() *Config { return current.Load() }

func watchLoop() {
    var lastMod time.Time
    for range time.Tick(5 * time.Second) {
        info, err := os.Stat(path)
        if err != nil || !info.ModTime().After(lastMod) {
            continue
        }
        c, err := loadFromDisk(path)
        if err != nil {
            fmt.Println("config reload failed:", err)
            continue
        }
        current.Store(c)
        lastMod = info.ModTime()
    }
}

func loadFromDisk(p string) (*Config, error) {
    data, err := os.ReadFile(p)
    if err != nil {
        return nil, err
    }
    var c Config
    if err := json.Unmarshal(data, &c); err != nil {
        return nil, err
    }
    return &c, nil
}

Properties:

Readers do a single atomic load — wait-free.
Writes (reload events) are rare and only published if parsing succeeds.
A read in the middle of a reload sees either the old config or the new config, never a half-parsed one — because parsing happens before Store.
If reload fails, the old config remains; the service degrades gracefully.

Notice that we deliberately don't expose Set. The only way to update is by editing the file. This restricts the surface area for bugs.

Adding callbacks¶

If consumers need to react to reloads (e.g., re-establish DB connections), expose a channel:

var (
    current atomic.Pointer[Config]
    reload  = make(chan *Config, 1)
)

// In watchLoop, after the successful Store:
select {
case reload <- c:
default:
    // Coalesce if previous reload event not yet consumed.
}

Consumers:

go func() {
    for c := range reload {
        reconfigureDB(c)
    }
}()

The channel acts as both publication (send is release) and queue (capacity 1 with non-blocking send for coalescing).

Worker Lifecycle Signals¶

A pool of workers should start cleanly and stop cleanly. The classic anti-pattern is time.Sleep shutdowns; the right pattern uses signals.

Pattern: shutdown via closed channel¶

type Pool struct {
    stop chan struct{}
    wg   sync.WaitGroup
}

func NewPool(n int) *Pool {
    p := &Pool{stop: make(chan struct{})}
    for i := 0; i < n; i++ {
        p.wg.Add(1)
        go p.worker(i)
    }
    return p
}

func (p *Pool) worker(id int) {
    defer p.wg.Done()
    for {
        select {
        case <-p.stop:
            return
        default:
        }
        doWork(id)
    }
}

func (p *Pool) Stop() {
    close(p.stop)
    p.wg.Wait()
}

close(p.stop) is the release; every worker's <-p.stop is an acquire. After wg.Wait() returns, every worker has called Done, which is a release — so anything they wrote before exiting is visible.

sync.Once could protect against double-close:

type Pool struct {
    stop chan struct{}
    once sync.Once
    wg   sync.WaitGroup
}

func (p *Pool) Stop() {
    p.once.Do(func() { close(p.stop) })
    p.wg.Wait()
}

Pattern: shutdown via context¶

type Pool struct {
    ctx    context.Context
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func NewPool(parent context.Context, n int) *Pool {
    ctx, cancel := context.WithCancel(parent)
    p := &Pool{ctx: ctx, cancel: cancel}
    for i := 0; i < n; i++ {
        p.wg.Add(1)
        go p.worker(i)
    }
    return p
}

func (p *Pool) worker(id int) {
    defer p.wg.Done()
    for {
        select {
        case <-p.ctx.Done():
            return
        case j := <-jobs:
            handle(j)
        }
    }
}

func (p *Pool) Stop() {
    p.cancel()
    p.wg.Wait()
}

context.Context propagates cancellation through a tree of goroutines. cancel() closes the underlying channel; <-ctx.Done() is an acquire on that close. Plus, the context can carry deadlines, values, and parent cancellation.

This is the modern Go idiom for any operation that needs to be cancellable.

Caching with Single-Flight Behaviour¶

A cache miss may take milliseconds (or seconds). If 100 requests miss at the same time, you don't want 100 upstream fetches. You want one fetch and 100 deliveries.

Naive cache (broken)¶

type Cache struct {
    mu   sync.Mutex
    data map[string]string
}

func (c *Cache) Get(k string, fetch func() string) string {
    c.mu.Lock()
    v, ok := c.data[k]
    c.mu.Unlock()
    if ok {
        return v
    }
    v = fetch() // 100 callers all fetch concurrently
    c.mu.Lock()
    c.data[k] = v
    c.mu.Unlock()
    return v
}

Single-flight cache¶

type Cache struct {
    mu      sync.Mutex
    data    map[string]string
    pending map[string]*pendingFetch
}

type pendingFetch struct {
    done chan struct{}
    val  string
}

func (c *Cache) Get(k string, fetch func() string) string {
    c.mu.Lock()
    if v, ok := c.data[k]; ok {
        c.mu.Unlock()
        return v
    }
    if pf, ok := c.pending[k]; ok {
        c.mu.Unlock()
        <-pf.done
        return pf.val
    }
    pf := &pendingFetch{done: make(chan struct{})}
    if c.pending == nil {
        c.pending = map[string]*pendingFetch{}
    }
    c.pending[k] = pf
    c.mu.Unlock()

    pf.val = fetch()
    c.mu.Lock()
    if c.data == nil {
        c.data = map[string]string{}
    }
    c.data[k] = pf.val
    delete(c.pending, k)
    c.mu.Unlock()
    close(pf.done)
    return pf.val
}

The flow:

Take lock, check data[k]. If present, return it. Lock provides acquire.
Check pending[k]. If present, we have a fetch in flight. Drop the lock and wait on the channel.
Otherwise, mark this key as pending. Drop the lock. Fetch.
After fetch, store in data, remove from pending, close the channel.

close(pf.done) is a release; every waiter's <-pf.done is an acquire on it. Writes to pf.val before the close are visible to all readers.

Crucially, writing to pf.val and the close happen outside the mutex. The mutex protects only the maps; the channel close protects the pendingFetch.

For production use, just use golang.org/x/sync/singleflight — it handles errors, cancellation, and tests. But understanding the manual version is essential.

Read-Copy-Update (RCU) Pattern¶

RCU is a publication pattern from kernel development. Readers see a consistent snapshot without locking; writers allocate new versions and publish via atomic pointer swap. Old versions are reclaimed once no reader holds them.

In Go, the GC handles reclamation for you — that's a huge simplification over Linux kernel RCU.

type RCU[T any] struct {
    p atomic.Pointer[T]
}

func (r *RCU[T]) Read() *T { return r.p.Load() }

func (r *RCU[T]) Update(f func(*T) *T) {
    for {
        old := r.p.Load()
        new := f(old)
        if r.p.CompareAndSwap(old, new) {
            return
        }
    }
}

Update takes a pure function: it must not mutate old. It returns a new *T representing the post-update state.

Example: maintain a set of currently connected user IDs.

var users RCU[map[int]bool]

users.Update(func(old *map[int]bool) *map[int]bool {
    cp := map[int]bool{}
    if old != nil {
        for k, v := range *old {
            cp[k] = v
        }
    }
    cp[userID] = true
    return &cp
})

Readers:

if m := users.Read(); m != nil && (*m)[userID] {
    // user is connected
}

The reader holds the snapshot for as long as needed. Even if Update replaces the map during the read, the reader's snapshot is unchanged.

When RCU shines¶

Frequent reads, infrequent writes (the typical "config" workload).
Snapshots are small enough to allocate cheaply.
Readers don't need to wait for writers.
Writers don't need to wait for readers.

When RCU is wrong¶

Frequent writes — every update allocates and copies.
Snapshots are large (millions of entries) — copying is expensive.
Multiple fields must update atomically across snapshots in a coordinated way.

For large mutable state, sharded mutexes or sync.Map may beat RCU.

Compare-and-Swap as Publication¶

CAS is more than a building block — it's a conditional publication. The semantics:

ok := p.CompareAndSwap(old, new)

If the current value equals old, atomically replace it with new and return true. Otherwise return false. The successful CAS is both:

An acquire (the comparison observed the current value, which was published by whoever stored it).
A release (the new value, and any writes to *new's referent before the CAS, are now published to subsequent acquirers).

This makes CAS the universal primitive for lock-free algorithms.

Lock-free stack¶

type Stack[T any] struct {
    top atomic.Pointer[node[T]]
}

type node[T any] struct {
    val  T
    next *node[T]
}

func (s *Stack[T]) Push(v T) {
    n := &node[T]{val: v}
    for {
        n.next = s.top.Load()
        if s.top.CompareAndSwap(n.next, n) {
            return
        }
    }
}

func (s *Stack[T]) Pop() (T, bool) {
    for {
        top := s.top.Load()
        if top == nil {
            var zero T
            return zero, false
        }
        if s.top.CompareAndSwap(top, top.next) {
            return top.val, true
        }
    }
}

Push: build a new node, set its next to the current top, CAS the top. If a concurrent push changed the top, retry.

Pop: read the top, CAS the top to its successor. If a concurrent pop changed the top, retry.

The publication: a successful Push CAS publishes the new node (and its val). A successful Pop CAS publishes nothing new — it just removes a node.

The ABA problem¶

Naive lock-free structures suffer from ABA: a CAS that expected A may see A again even though the value changed A → B → A in between. The structure may be corrupted.

Go's GC saves you from many ABA bugs because pointers cannot be reused while in scope. (Compare to C/C++ where you'd manage memory yourself.) But ABA still bites when you mix pointers and counters.

Workaround: add a generation counter. Use atomic.Uint64 packing {ptr, generation} together, or a struct.

We'll see lock-free queues with generation counters in senior.md.

Sequential Consistency vs Acq/Rel — When the Difference Bites¶

Go gives you sequential consistency for atomics. Is this overkill?

For most code: yes, slightly. The cost on x86 is negligible (x86 is essentially seq-cst by default). On ARM and POWER, sequential consistency costs an extra fence per store.

When does the difference matter? Two specific patterns:

The producer-consumer flag pair¶

var (
    a atomic.Int32
    b atomic.Int32
)

// Goroutine 1:
a.Store(1)
b.Store(1)

// Goroutine 2:
if b.Load() == 1 {
    // Is a.Load() guaranteed to be 1?
}

Under sequential consistency: yes.

Under pure release/acquire: no — the release on a and the release on b are independent. G2 might see b=1 without seeing a=1.

You will not write code like this in Go because Go gives you SC. But you should know the distinction for cross-language work.

Independent reads of independent writes¶

// Goroutine A: writes a
// Goroutine B: writes b
// Goroutine C: reads a then b
// Goroutine D: reads b then a

// Under SC: C and D see the writes in the same order
// Under acq/rel: C might see a first, D might see b first

This is mostly theoretical for Go programmers. But it explains why Go's choice of SC is "free safety" — you never have to ask "could readers disagree on the order?"

Designing the API of a Concurrent Struct¶

When you build a type that will be touched by multiple goroutines, you make decisions about its synchronization. Common API patterns:

Pattern: explicit lock exposed in the type¶

type Cache struct {
    Mu   sync.Mutex
    data map[string]string
}

Callers lock manually:

c.Mu.Lock()
v := c.data[k]
c.Mu.Unlock()

Pros: caller can do multi-step operations atomically. Cons: easy to forget. Avoid in new code.

Pattern: lock hidden, methods are the surface¶

type Cache struct {
    mu   sync.Mutex
    data map[string]string
}

func (c *Cache) Get(k string) (string, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    v, ok := c.data[k]
    return v, ok
}

Pros: caller can't forget to lock. Cons: no multi-step atomicity unless you add specific methods (e.g., GetOrSet).

Pattern: copy-on-write¶

type Cache struct {
    data atomic.Pointer[map[string]string]
}

func (c *Cache) Get(k string) (string, bool) {
    m := c.data.Load()
    if m == nil {
        return "", false
    }
    v, ok := (*m)[k]
    return v, ok
}

func (c *Cache) Set(k, v string) {
    for {
        old := c.data.Load()
        cp := map[string]string{}
        if old != nil {
            for kk, vv := range *old {
                cp[kk] = vv
            }
        }
        cp[k] = v
        if c.data.CompareAndSwap(old, &cp) {
            return
        }
    }
}

Pros: reads are wait-free. Cons: writes are linear in cache size; writers race and retry.

Use when: reads dominate, writes are very rare.

Pattern: sharded locks¶

type Cache struct {
    shards [16]struct {
        mu   sync.Mutex
        data map[string]string
    }
}

func (c *Cache) shard(k string) *struct {
    mu   sync.Mutex
    data map[string]string
} {
    h := fnv.New32a()
    _, _ = h.Write([]byte(k))
    return &c.shards[h.Sum32()%16]
}

func (c *Cache) Get(k string) (string, bool) {
    s := c.shard(k)
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.data == nil {
        return "", false
    }
    v, ok := s.data[k]
    return v, ok
}

Pros: writes can proceed in parallel for different shards. Cons: more memory, slightly complex.

Use when: high contention on a single mutex.

Pitfalls and Anti-Patterns¶

Anti-pattern: atomic plus plain reads¶

atomic.StoreInt32(&v, 1)
// ...
if v == 1 { ... } // RACE

Either all accesses go through sync/atomic, or none do. Mixing them is a race.

Anti-pattern: forgetting to publish¶

type Service struct {
    cache *Cache
}

s := &Service{}
go func() {
    s.cache = newCache() // RACE: no publication
}()
go func() {
    if s.cache != nil { use(s.cache) } // RACE
}()

Fix: atomic.Pointer[Cache] for cache, or initialise before spawning goroutines.

Anti-pattern: mutating a published value¶

snap.Store(s)
s.Count++ // RACE with readers

Published values are immutable. Allocate a new one.

Anti-pattern: holding a lock during slow I/O¶

mu.Lock()
defer mu.Unlock()
result := slowHTTPCall() // holds lock for seconds
cache[k] = result

This serializes all callers. Either use a separate fetch step (drop the lock during I/O) or use a single-flight pattern.

Anti-pattern: double-fetched check¶

if c.data.Load() == nil {
    c.data.Store(buildExpensive())
}

Two goroutines both see nil, both build, only one's value sticks (the other is wasted). Use sync.Once or CAS-based first-write-wins.

Anti-pattern: spinning without yield¶

for atomic.LoadInt32(&ready) == 0 {
    // burns CPU
}

If the wait may be long, use a channel or sync.Cond. If short, at least call runtime.Gosched() periodically.

Cost Models¶

Rough costs (refer to junior.md Appendix D for fuller table):

atomic.Load: ~1 ns
atomic.Store: ~5 ns
atomic.CompareAndSwap: ~5–15 ns (uncontended), much higher under contention
sync.Mutex.Lock/Unlock: ~15–25 ns uncontended
Channel send/recv: ~100–300 ns
Memory allocation (small object): ~25–100 ns
Map operation (Get): ~50–200 ns
Map operation (Set, no growth): ~100–500 ns

Implications:

A CAS-retry loop with N retries costs ~10N ns + the cost of building the new value each retry.
A copy-on-write update to a map with K entries costs ~K * 100 ns plus an allocation. For K=1000, that's ~100 μs per write.
An atomic.Pointer.Load is essentially free — use it on hot paths.
A sync.Mutex is fine for cold paths and modest contention.
Channels are expensive — don't use them as flags if an atomic suffices.

Testing Concurrent Publication¶

Concurrency bugs are timing-dependent. Tests must be designed to catch them reliably.

Use `-race` always¶

go test -race -count=10 ./...

-count=10 runs each test 10 times. Race detector instrumentation makes the schedule slightly non-deterministic, increasing the chance of catching rare races.

Stress-test publication¶

func TestSnapshotPublishStress(t *testing.T) {
    var snap atomic.Pointer[Snapshot]
    var wg sync.WaitGroup

    for w := 0; w < 4; w++ {
        wg.Add(1)
        go func(seed int) {
            defer wg.Done()
            for i := 0; i < 10000; i++ {
                snap.Store(&Snapshot{X: seed*10000 + i})
            }
        }(w)
    }

    for r := 0; r < 4; r++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 10000; i++ {
                s := snap.Load()
                if s == nil {
                    continue
                }
                _ = s.X // ensure no garbage
            }
        }()
    }

    wg.Wait()
}

Run with -race -count=100. If there's a publication race, the detector will catch it within a few iterations.

Test happens-before guarantees¶

Write tests that would fail if publication were broken, e.g.:

func TestPublishOrder(t *testing.T) {
    type Pair struct{ A, B int }
    var p atomic.Pointer[Pair]

    for trial := 0; trial < 1000; trial++ {
        done := make(chan struct{})
        go func() {
            p.Store(&Pair{A: 1, B: 2})
            close(done)
        }()
        <-done
        got := p.Load()
        if got.A != 1 || got.B != 2 {
            t.Fatalf("trial %d: got %+v", trial, got)
        }
    }
}

If you remove the atomic, the test passes most of the time but fails under -race.

Stress Testing and the Race Detector¶

The race detector instruments all memory accesses with vector-clock tracking. It does not catch races that don't actually occur during the test run. Therefore:

Vary the goroutine count.
Vary GOMAXPROCS (e.g., test with 1, 2, and N cores).
Run tests many times in CI.
Use t.Parallel() to encourage real concurrent scheduling.
Avoid timing-dependent assertions; use channels or WaitGroup for synchronization.

go test -race -cpu=1,2,4 -count=10 ./... is a useful CI invocation.

Profiling Hot Atomics¶

If you suspect a hot atomic is a bottleneck:

go test -cpuprofile=cpu.out -bench=.
go tool pprof cpu.out

In pprof, look for sync/atomic or sync.(*Mutex).Lock near the top. Contention shows up as high CPU in lock-acquire code or in retries of CAS loops.

For lock contention, also use the block profile:

import _ "net/http/pprof"
import "runtime"

func init() {
    runtime.SetBlockProfileRate(1)
}

Then go tool pprof http://localhost:6060/debug/pprof/block shows where goroutines wait on locks/channels.

Common findings:

A sync.Mutex covering a hot read path. Switch to sync.RWMutex or atomic.Pointer[T].
An atomic.CompareAndSwap loop with high retry count. Switch to a mutex (counterintuitively, less contention).
A channel used as a flag. Switch to an atomic.Bool.

When Atomics Aren't Enough¶

Atomics work when:

The shared state fits in a single word (or you can wrap it in atomic.Pointer[T]).
The state is immutable after publication.
Multi-step transactions aren't needed.

When you need any of these, reach for a mutex:

Multi-field updates that must be atomic together.
Read-modify-write of a mutable structure (e.g., appending to a slice in place).
Coordination with multiple actors (e.g., bounded queue with both producers and consumers).
Fairness — atomics provide none.

A mutex is often the right answer. Don't view atomics as "always better" — they trade ergonomics and correctness for performance, and the performance gain often doesn't justify the complexity.

Cheat Sheet¶

PATTERN                       PRIMITIVE                    TYPICAL COST
=======                       =========                    ============

one-shot publication          atomic.Pointer[T]            ~1 ns read
one-shot signal               close(chan)                  ~10 ns recv
lazy init                     sync.Once / sync.OnceValue   ~1 ns after init
hot-reload config             atomic.Pointer[T]            ~1 ns read
CAS counter                   atomic.Int64.Add or CAS      ~5-15 ns
critical section              sync.Mutex                   ~15-25 ns
read-mostly state             atomic.Pointer[T] + RCU      ~1 ns read
sharded state                 [N]struct{mu;state}          ~15-25 ns shard

WHEN TO REACH FOR EACH
======================

atomic.Pointer[T]: read-mostly, immutable referent
sync.Mutex:        mutable critical section, modest contention
sync.RWMutex:      read-mostly with occasional writes, larger critical section
chan:              produce/consume, signal events
sync.Once:         lazy init that runs exactly once
sync.WaitGroup:    wait for N goroutines to finish
context.Context:   cancellation through a tree

Self-Assessment¶

I can choose between atomic.Pointer[T], sync.Mutex, and sync.RWMutex for a given workload.
I can implement hot-reload config without locks.
I can implement a single-flight cache.
I can describe what RCU is and when to use it.
I can debug a publication bug found by the race detector.
I understand why CAS-based updates need a retry loop.
I know when sequential consistency matters versus when acq/rel would suffice.

Summary¶

The middle level is where publication patterns become real. The primitives from junior level compose into:

One-shot publication for startup configuration.
Lazy init with sync.Once.
Hot-reload via atomic.Pointer[T] swap.
Worker lifecycle via close-signal and WaitGroup.
Single-flight cache via mutex plus channel.
RCU for read-mostly state.
CAS for lock-free updates.

The key skill is picking the right pattern. Reads dominate? atomic.Pointer[T] with copy-on-write. Multi-field critical section? Mutex. Concurrent counter? atomic.Int64. Wait for an event? Channel.

Sequential consistency gives Go a slight edge over languages with explicit memory orderings: you don't need to ask "could readers disagree?" The cost is a small overhead on weakly-ordered hardware. The benefit is one fewer thing to get wrong.

Next: senior.md, where we unpack double-checked locking, seqlocks, and the formal happens-before axioms.

Diagrams¶

RCU UPDATE
==========

Step 1: Reader holds snapshot 1.

Snapshot1: {A:1, B:2}  ◄── Reader1
                       ◄── Reader2

Step 2: Writer allocates Snapshot 2.

Snapshot1: {A:1, B:2}  ◄── Reader1, Reader2
Snapshot2: {A:1, B:3}  (not published yet)

Step 3: Writer publishes (atomic store).

Snapshot1: {A:1, B:2}  ◄── Reader1, Reader2
Snapshot2: {A:1, B:3}  ◄── (new readers)

Step 4: Reader3 reads.

Snapshot1: {A:1, B:2}  ◄── Reader1, Reader2
Snapshot2: {A:1, B:3}  ◄── Reader3

Step 5: Old readers finish.

Snapshot2: {A:1, B:3}  ◄── Reader3
(Snapshot1 garbage-collected once unreachable)

SYNCHRONIZES-WITH CHAIN
=======================

G1: write x = 5
    release(L1)  ───────► synchronizes-with ───────► acquire(L1)  G2
                                                      |
                                                      write y = 6
                                                      release(L2) ──► s-w ──► acquire(L2)  G3
                                                                                |
                                                                                read x: 5
                                                                                read y: 6

G3 transitively observes G1's write because of the chain of synchronizes-with edges.

Appendix A: Deeper Case Studies¶

Case Study A.1 — A configuration service for a feature-flag platform¶

Requirements:

Tens of thousands of consumer goroutines (per-request handlers) read flag values.
A control-plane goroutine pushes flag updates every few seconds.
Reads must be wait-free; reads must never observe a half-applied update.
Updates may add, remove, or modify flags.

Design:

package flags

import (
    "sync"
    "sync/atomic"
    "time"
)

type Flag struct {
    Name        string
    Enabled     bool
    Percent     int
    Variants    map[string]string
}

type Snapshot struct {
    ByName    map[string]*Flag
    Generation uint64
    UpdatedAt  time.Time
}

type Service struct {
    snap atomic.Pointer[Snapshot]
    mu   sync.Mutex // serializes updates
}

func (s *Service) Lookup(name string) (*Flag, bool) {
    snap := s.snap.Load()
    if snap == nil {
        return nil, false
    }
    f, ok := snap.ByName[name]
    return f, ok
}

func (s *Service) Generation() uint64 {
    if snap := s.snap.Load(); snap != nil {
        return snap.Generation
    }
    return 0
}

func (s *Service) Apply(updates []Flag) {
    s.mu.Lock()
    defer s.mu.Unlock()

    old := s.snap.Load()
    newMap := make(map[string]*Flag, lenOrZero(old))
    if old != nil {
        for k, v := range old.ByName {
            newMap[k] = v
        }
    }
    for _, u := range updates {
        u := u
        newMap[u.Name] = &u
    }
    next := &Snapshot{
        ByName:     newMap,
        Generation: lenOrZero(old) + 1,
        UpdatedAt:  time.Now(),
    }
    s.snap.Store(next)
}

func lenOrZero(s *Snapshot) uint64 {
    if s == nil {
        return 0
    }
    return s.Generation
}

What's happening:

Lookup is wait-free: one atomic load, then a map read. Multiple readers don't contend with each other.
Apply is serialized through the mutex (only one writer at a time). The mutex prevents two Apply calls from racing each other when reading the current snapshot and computing the next.
The atomic Store publishes the new snapshot. Every reader's next Load gets the new pointer.
The Generation counter lets metrics or caches detect "is this snapshot fresh?"
Old snapshots are garbage-collected once no reader holds them.

This pattern scales to millions of reads per second on commodity hardware.

Case Study A.2 — A connection pool with lazy init per-host¶

Requirements:

Many goroutines need a connection to host X.
The first goroutine to ask for host X triggers the dial.
Subsequent requests reuse the cached connection.
Each host has its own pool entry; different hosts don't block each other.

Design (sketch):

package pool

import (
    "net"
    "sync"
)

type Conn = *net.TCPConn // simplified

type entry struct {
    once sync.Once
    conn Conn
    err  error
}

type Pool struct {
    m sync.Map // map[string]*entry
}

func (p *Pool) Get(host string) (Conn, error) {
    v, _ := p.m.LoadOrStore(host, &entry{})
    e := v.(*entry)
    e.once.Do(func() {
        e.conn, e.err = dial(host)
    })
    return e.conn, e.err
}

func dial(host string) (Conn, error) {
    addr, err := net.ResolveTCPAddr("tcp", host)
    if err != nil {
        return nil, err
    }
    return net.DialTCP("tcp", nil, addr)
}

The flow:

LoadOrStore returns either the existing *entry for the host or stores a new empty one and returns it.
Per-entry sync.Once ensures dial runs at most once per host.
Other goroutines hitting the same host block inside Once.Do until the first dial completes.
Different hosts have different *entrys, hence different Onces — concurrent dials proceed in parallel.

sync.Map provides the publication for the *entry pointer; sync.Once provides the publication for conn and err.

Note: this is a static per-host cache. In production you'd also need eviction, health checks, retries — but the publication story is solid.

Case Study A.3 — A log throttler¶

Requirements:

Multiple goroutines try to log the same warning ("DB latency exceeded").
Don't spam the log: emit once per minute.
Counter visible in metrics.

Design:

package throttle

import (
    "sync/atomic"
    "time"
)

type Throttle struct {
    interval time.Duration
    lastNS   atomic.Int64
    skipped  atomic.Int64
}

func New(d time.Duration) *Throttle {
    return &Throttle{interval: d}
}

func (t *Throttle) Allow() bool {
    now := time.Now().UnixNano()
    last := t.lastNS.Load()
    if now-last < int64(t.interval) {
        t.skipped.Add(1)
        return false
    }
    if !t.lastNS.CompareAndSwap(last, now) {
        // someone else beat us — they win the slot.
        t.skipped.Add(1)
        return false
    }
    return true
}

func (t *Throttle) Skipped() int64 { return t.skipped.Load() }

CAS is the publication: exactly one caller per interval succeeds, the others increment skipped. No locks. Reads (e.g., Skipped()) are wait-free.

Case Study A.4 — A subscriber list¶

Requirements:

Many subscribers register callbacks.
A publish loop notifies all current subscribers.
Subscribers may register/unregister mid-publish; that's fine — they may or may not be notified.

Design:

package pubsub

import (
    "sync"
    "sync/atomic"
)

type Sub struct {
    ID int
    Fn func(string)
}

type Bus struct {
    subs atomic.Pointer[[]Sub]
    mu   sync.Mutex
}

func (b *Bus) Subscribe(s Sub) {
    b.mu.Lock()
    defer b.mu.Unlock()
    cur := b.subs.Load()
    var n []Sub
    if cur != nil {
        n = append(n, *cur...)
    }
    n = append(n, s)
    b.subs.Store(&n)
}

func (b *Bus) Unsubscribe(id int) {
    b.mu.Lock()
    defer b.mu.Unlock()
    cur := b.subs.Load()
    if cur == nil {
        return
    }
    n := make([]Sub, 0, len(*cur))
    for _, s := range *cur {
        if s.ID != id {
            n = append(n, s)
        }
    }
    b.subs.Store(&n)
}

func (b *Bus) Publish(msg string) {
    cur := b.subs.Load()
    if cur == nil {
        return
    }
    for _, s := range *cur {
        s.Fn(msg)
    }
}

Publish is wait-free (one atomic load, then iterate the slice). Subscribe/Unsubscribe serialize through a mutex, allocate a new slice, and publish it.

Important property: a Publish loop iterates a snapshot of the subscriber list at the moment of load. If a new subscriber is added mid-publish, they may or may not be notified depending on timing — but they will be notified on the next publish. This is "eventually consistent" behavior and is typically fine for pubsub.

Case Study A.5 — A metrics counter histogram¶

Requirements:

High-cardinality counters (per-endpoint, per-status).
Increment from request handlers (thousands per second).
Snapshot reading from a metrics scrape (once every 10 seconds).

Design:

package metrics

import (
    "sync"
    "sync/atomic"
)

type Counter struct {
    val atomic.Int64
}

func (c *Counter) Inc()         { c.val.Add(1) }
func (c *Counter) Load() int64  { return c.val.Load() }

type Registry struct {
    mu   sync.Mutex
    cs   map[string]*Counter
}

func (r *Registry) Get(name string) *Counter {
    r.mu.Lock()
    defer r.mu.Unlock()
    if c, ok := r.cs[name]; ok {
        return c
    }
    if r.cs == nil {
        r.cs = map[string]*Counter{}
    }
    c := &Counter{}
    r.cs[name] = c
    return c
}

func (r *Registry) Snapshot() map[string]int64 {
    r.mu.Lock()
    defer r.mu.Unlock()
    out := make(map[string]int64, len(r.cs))
    for k, c := range r.cs {
        out[k] = c.Load()
    }
    return out
}

The hot path is just c.val.Add(1) — one atomic. The map of counters is mutex-protected because we add new entries rarely.

For very high contention, shard:

type ShardedCounter struct {
    shards [64]atomic.Int64
}

func (c *ShardedCounter) Inc() {
    s := runtime_procPin()
    c.shards[s%64].Add(1)
    runtime_procUnpin()
}

func (c *ShardedCounter) Load() int64 {
    var sum int64
    for i := range c.shards {
        sum += c.shards[i].Load()
    }
    return sum
}

Each P (logical processor) increments a per-P counter; readers sum them. The cost: shard-load reads observe a per-shard snapshot, but the global "sum" may not match any single instant. This is fine for monitoring.

(Note: runtime_procPin is an internal function; production code uses runtime/internal/sys indirectly or accepts the small contention cost.)

Appendix B: Memory Model Quizzes¶

Quiz B.1 — what's happens-before?¶

var x int
var done chan struct{} = make(chan struct{})

// Goroutine A:
x = 5
close(done)

// Goroutine B:
<-done
fmt.Println(x)

Question: is the print guaranteed to be 5?

Answer: Yes. x = 5 is sequenced-before close(done). close(done) synchronizes-with <-done (because B observed the close). <-done is sequenced-before fmt.Println(x). Therefore x = 5 happens-before fmt.Println(x).

Quiz B.2 — broken publication¶

var x int
var done chan struct{} = make(chan struct{})

// Goroutine A:
close(done)
x = 5

// Goroutine B:
<-done
fmt.Println(x)

Question: is the print guaranteed to be 5?

Answer: No. close(done) synchronizes-with <-done, but x = 5 is after the close, so it's not part of what gets published. B may see x = 0 or x = 5.

Quiz B.3 — multi-goroutine chain¶

var x int
var ch1, ch2 chan struct{} = make(chan struct{}), make(chan struct{})

// A:
x = 5
close(ch1)

// B:
<-ch1
close(ch2)

// C:
<-ch2
fmt.Println(x)

Question: is C's print guaranteed to be 5?

Answer: Yes. Transitivity through the chain A → B → C.

Quiz B.4 — sneaky relaxation¶

var a, b atomic.Int32

// A:
a.Store(1)
b.Store(1)

// B:
if b.Load() == 1 {
    fmt.Println(a.Load())
}

Question: is A's print guaranteed to be 1?

Answer: In Go, yes — because Go's atomics are sequentially consistent. In C++ with release/acquire only, no. This is a place where Go is stronger.

Quiz B.5 — interrupted write¶

var x int

// A:
x = 5
// (no synchronization)

// B:
y := x

Question: assuming B runs after A, is y = 5?

Answer: No. There is no synchronizes-with edge. y may be 5, may be 0, may be a torn intermediate (in theory). This is a data race.

Appendix C: Anatomy of `sync.Once`¶

The actual code (paraphrased from src/sync/once.go):

type Once struct {
    done atomic.Uint32 // 0 or 1
    m    Mutex
}

func (o *Once) Do(f func()) {
    if o.done.Load() == 0 {
        o.doSlow(f)
    }
}

func (o *Once) doSlow(f func()) {
    o.m.Lock()
    defer o.m.Unlock()
    if o.done.Load() == 0 {
        defer o.done.Store(1)
        f()
    }
}

The fast path is a single atomic load. If done is 1, return immediately — f has been called by someone.

The slow path takes the mutex. After acquiring the lock, check done again (double-checked locking). If still 0, run f, then store 1.

Why the double check? Two goroutines may both see done == 0, both enter doSlow, but only one acquires the mutex first. The second acquirer sees done == 1 after the first releases and skips f.

Publication: done.Store(1) is a release. Every future done.Load() that returns 1 is an acquire that synchronizes with that store. Writes inside f happen-before the store; therefore they're visible to the loader.

The mutex's release/acquire is also part of the chain, but the atomic is what carries the publication after the first call. After the slow path runs once, all subsequent calls are wait-free.

This is one of the cleanest examples of double-checked locking in any language.

Appendix D: Anatomy of a Mutex¶

The real sync.Mutex is more complex than the textbook version. From src/sync/mutex.go (paraphrased):

type Mutex struct {
    state int32
    sema  uint32
}

const (
    mutexLocked      = 1 << iota
    mutexWoken
    mutexStarving
    mutexWaiterShift = iota
)

func (m *Mutex) Lock() {
    if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
        return // fast path: uncontended
    }
    m.lockSlow()
}

Fast path: a single CAS from 0 to mutexLocked. If it succeeds, we own the mutex. This is ~5-10 ns.

Slow path: spin briefly (in case the lock is released soon), then enqueue on the semaphore. Goroutines wait via runtime_SemacquireMutex, parked until signaled.

func (m *Mutex) Unlock() {
    new := atomic.AddInt32(&m.state, -mutexLocked)
    if new != 0 {
        m.unlockSlow(new)
    }
}

Fast path: atomic subtract. If the result is 0 (no waiters, no other flags), done.

Slow path: signal a waiter via runtime_Semrelease.

The mutex implements starvation handling: if a goroutine has been waiting more than 1 ms, the mutex switches to starvation mode, where the next-in-line waiter takes the mutex directly without competing with new arrivals. This avoids unbounded latency for unlucky waiters.

Publication: the unlock's atomic subtract is a release. The lock's CAS (or the post-wakeup load) is an acquire. The releases and acquires chain together to ensure happens-before across critical sections.

For your purposes: Lock is acquire, Unlock is release, full stop. The internals are interesting but rarely relevant.

Appendix E: When to Pick `sync.Map`¶

sync.Map is a concurrent map with a strange API:

var m sync.Map
m.Store("k", "v")
v, ok := m.Load("k")
m.Delete("k")
m.Range(func(k, v any) bool { ... })

It's not a drop-in replacement for map[K]V under a mutex. It's optimized for two specific cases:

Stable keys, many readers, occasional writers. E.g., per-host caches, dispatch tables.
Disjoint keys per goroutine. E.g., per-goroutine counters indexed by goroutine ID.

For other workloads (write-heavy, dense keys, frequent deletes), a sync.RWMutex + map beats sync.Map.

Internally, sync.Map has a read map (read-only after a snapshot point) and a dirty map (mutex-protected). Reads check read first (lock-free); if missing, check dirty (with lock). Periodically, dirty is promoted to a new read.

The publication semantics: Store publishes the value via the dirty-map mutex and (eventually) the read-map atomic. Load acquires either lock-free from the read-map or via the mutex from the dirty-map.

Use sync.Map when its access pattern matches yours. Profile both ways before choosing.

Appendix F: Custom Publication Primitives¶

You can build your own publication primitives. A few useful ones:

F.1 — A reset-once latch¶

type ResettableLatch struct {
    mu sync.Mutex
    ch chan struct{}
}

func NewResettableLatch() *ResettableLatch {
    return &ResettableLatch{ch: make(chan struct{})}
}

func (l *ResettableLatch) Wait() {
    l.mu.Lock()
    ch := l.ch
    l.mu.Unlock()
    <-ch
}

func (l *ResettableLatch) Open() {
    l.mu.Lock()
    close(l.ch)
    l.ch = make(chan struct{})
    l.mu.Unlock()
}

Wait readers latch on the current channel; Open closes it (broadcast) and starts a new generation. Useful for "tick, tick, tick" broadcasts.

F.2 — An atomic versioned value¶

type Versioned[T any] struct {
    val atomic.Pointer[versioned[T]]
}

type versioned[T any] struct {
    version uint64
    value   T
}

func (v *Versioned[T]) Set(t T) uint64 {
    for {
        old := v.val.Load()
        var ver uint64 = 1
        if old != nil {
            ver = old.version + 1
        }
        n := &versioned[T]{version: ver, value: t}
        if v.val.CompareAndSwap(old, n) {
            return ver
        }
    }
}

func (v *Versioned[T]) Get() (T, uint64) {
    p := v.val.Load()
    if p == nil {
        var zero T
        return zero, 0
    }
    return p.value, p.version
}

Useful for cache invalidation: callers remember the version they last saw; if a fresh Get returns a higher version, the cache is stale.

F.3 — A countdown latch¶

type Countdown struct {
    remaining atomic.Int32
    done      chan struct{}
}

func NewCountdown(n int32) *Countdown {
    return &Countdown{
        remaining: atomic.Int32{},
        done:      make(chan struct{}),
    }
}

func (c *Countdown) Count(n int32) *Countdown {
    c.remaining.Store(n)
    return c
}

func (c *Countdown) Down() {
    if c.remaining.Add(-1) == 0 {
        close(c.done)
    }
}

func (c *Countdown) Wait() {
    <-c.done
}

Like sync.WaitGroup but reset-friendly. Useful for fan-in patterns.

Appendix G: Reading Concurrent Code¶

When you read someone else's concurrent code, ask:

What state is shared?
Who reads it? Who writes it?
Where is the publication (release) for each write?
Where is the corresponding acquire on each read?
What invariants hold during critical sections?
What happens during a panic in the critical section?
Are any goroutines spawned that might outlive the parent?

A well-written concurrent type answers all seven from its documentation. A poorly-written one leaves you guessing.

Appendix H: A Word on `sync.Pool`¶

sync.Pool is not a publication primitive in the usual sense. It's a per-P scratch cache for reducing allocations:

var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func process(data []byte) {
    buf := bufPool.Get().(*bytes.Buffer)
    defer bufPool.Put(buf)
    buf.Reset()
    buf.Write(data)
    // ... use buf
}

The semantics: Get returns a value previously Put (possibly by another goroutine on the same or different P) or a freshly-allocated one. The pool may garbage-collect stale entries at any time.

Publication: every Put followed by Get of the same object provides acq-rel — the writes the previous user did to the object are visible to the next user. But the object's identity is not guaranteed; you may get a different object next time.

Use sync.Pool for ephemeral scratch space. Don't use it for state that must persist.

Appendix I: Atomics with Generics¶

Before Go 1.19, atomic pointer access used atomic.LoadPointer with unsafe.Pointer. Modern Go provides atomic.Pointer[T]:

var p atomic.Pointer[Config]
p.Store(c)
c := p.Load()

The generic version is type-safe and prevents accidentally storing the wrong type.

For older code or for cases where generic atomics don't fit, atomic.Value provides similar semantics:

var v atomic.Value
v.Store(c) // c must always have the same dynamic type
c := v.Load().(*Config)

atomic.Value enforces type consistency at runtime (panics if you store a different concrete type). It's slower than atomic.Pointer[T] due to interface boxing.

Prefer atomic.Pointer[T] for new code unless you're targeting Go ≤ 1.18.

Appendix J: Memory Ordering Across Architectures¶

Go runs on many architectures: amd64, 386, arm64, arm, riscv64, mips, ppc64le, s390x, wasm. Each has its own memory model.

Arch	Memory model	Cost of atomic.Store
amd64	Strong (TSO)	mov + mfence or xchg
386	Strong (TSO)	mov + mfence or xchg
arm64	Weak (release-acquire native)	str-rel + dmb ish
arm	Weak	str + dmb
riscv64	Weak	sw + fence rw,rw
ppc64le	Very weak	std + lwsync
s390x	Strong	st

The Go compiler emits the right barrier per arch automatically. Your code is portable — you don't need to know the arch-specific instructions. But it explains why atomics cost more on ARM than x86.

On weakly-ordered architectures, even uncontended atomic.Store is more expensive than a plain store, because the fence is required. On x86, an aligned mov is already release-acquire by default for normal stores, so atomics cost mostly the same as plain stores (modulo the seq-cst extra fence Go adds).

Appendix K: Final Checklist for Middle-Level Code¶

Before merging concurrent code:

Every shared field is documented: who reads, who writes, what publishes.
Every cross-goroutine write is paired with an acquire on the read side.
go test -race ./... is green.
If using atomic.Pointer[T], the referent is documented as immutable after Store.
If using sync.Mutex, no slow I/O happens while holding the lock.
Goroutine lifetimes are clear; no leaked goroutines.
Stop signals are wired through context or close-channel.
Performance-critical paths use the cheapest correct primitive.

When all are checked, ship it.

End of middle.md. Next: senior.md, where we dive into double-checked locking, seqlocks, RCU details, and the formal happens-before axioms.

Appendix L: Extended Patterns Library¶

L.1 — Token-bucket rate limiter (atomic version)¶

type Bucket struct {
    capacity   int64
    refillNS   int64
    state      atomic.Uint64 // packed: hi32 = tokens, lo32 = ns since base
    baseTimeNS int64
}

func NewBucket(capacity int64, refillRate float64) *Bucket {
    return &Bucket{
        capacity:   capacity,
        refillNS:   int64(1e9 / refillRate),
        baseTimeNS: time.Now().UnixNano(),
    }
}

func (b *Bucket) Take(n int64) bool {
    nowOffset := uint32(time.Now().UnixNano() - b.baseTimeNS)
    for {
        old := b.state.Load()
        tokens := int64(old >> 32)
        lastOffset := uint32(old)

        elapsed := nowOffset - lastOffset
        refilled := tokens + int64(elapsed)/b.refillNS
        if refilled > b.capacity {
            refilled = b.capacity
        }
        if refilled < n {
            return false
        }
        newTokens := uint64(refilled-n) << 32
        newOffset := uint64(nowOffset)
        if b.state.CompareAndSwap(old, newTokens|newOffset) {
            return true
        }
    }
}

Packing two values into one uint64 lets a single CAS atomically update both. The publication is wait-free for failed Takes and lock-free for successful ones.

L.2 — Concurrent set with bloom filter¶

A common pattern in caches: a bloom filter front-ends a slow lookup. The bloom filter must be safe to read concurrently.

type Set struct {
    bits []atomic.Uint64
    k    int // hash count
}

func (s *Set) Add(x []byte) {
    h1, h2 := hash(x)
    for i := 0; i < s.k; i++ {
        h := h1 + uint64(i)*h2
        bit := h % uint64(len(s.bits)*64)
        word := bit / 64
        mask := uint64(1) << (bit % 64)
        for {
            cur := s.bits[word].Load()
            new := cur | mask
            if cur == new {
                break
            }
            if s.bits[word].CompareAndSwap(cur, new) {
                break
            }
        }
    }
}

func (s *Set) Contains(x []byte) bool {
    h1, h2 := hash(x)
    for i := 0; i < s.k; i++ {
        h := h1 + uint64(i)*h2
        bit := h % uint64(len(s.bits)*64)
        word := bit / 64
        mask := uint64(1) << (bit % 64)
        if s.bits[word].Load()&mask == 0 {
            return false
        }
    }
    return true
}

Add uses CAS to set bits. Contains reads atomically. Concurrent adds and reads work without locks; concurrent adds to different words proceed in parallel.

L.3 — A monotonic clock¶

type Clock struct {
    last atomic.Int64
}

func (c *Clock) Now() int64 {
    now := time.Now().UnixNano()
    for {
        last := c.last.Load()
        if now <= last {
            // Ensure monotonicity by advancing by 1 ns.
            now = last + 1
        }
        if c.last.CompareAndSwap(last, now) {
            return now
        }
    }
}

Useful when you need timestamps that strictly increase, even if the underlying clock jitters or skews backward.

L.4 — A fair-ish round-robin selector¶

type RoundRobin[T any] struct {
    items []T
    idx   atomic.Uint64
}

func (r *RoundRobin[T]) Next() T {
    i := r.idx.Add(1) - 1
    return r.items[i%uint64(len(r.items))]
}

Add provides both atomicity and ordering. The publication is incidental — the slice was initialized at construction time and is never modified.

L.5 — A signal that fires once per generation¶

type Generational struct {
    mu    sync.Mutex
    gen   atomic.Uint64
    chans map[uint64]chan struct{}
}

func (g *Generational) Fire() {
    g.mu.Lock()
    cur := g.gen.Load()
    if ch, ok := g.chans[cur]; ok {
        close(ch)
        delete(g.chans, cur)
    }
    g.gen.Add(1)
    g.mu.Unlock()
}

func (g *Generational) WaitFor(gen uint64) <-chan struct{} {
    g.mu.Lock()
    defer g.mu.Unlock()
    if g.gen.Load() > gen {
        ch := make(chan struct{})
        close(ch)
        return ch
    }
    if g.chans == nil {
        g.chans = map[uint64]chan struct{}{}
    }
    if ch, ok := g.chans[gen]; ok {
        return ch
    }
    ch := make(chan struct{})
    g.chans[gen] = ch
    return ch
}

Each generation has its own channel. Fire closes the current generation's channel and increments the counter. WaitFor(gen) blocks until generation gen fires (or returns immediately if it's past).

Useful in coordination: "wait until event N has happened."

Appendix M: Patterns to Avoid in Middle-Level Code¶

M.1 — The "shared mutable state with global lock" trap¶

// BAD
var (
    globalMu  sync.Mutex
    globalMap map[string]*Cache
)

func GetCache(name string) *Cache {
    globalMu.Lock()
    defer globalMu.Unlock()
    return globalMap[name]
}

Every reader contends on globalMu. Switch to sync.Map, atomic.Pointer, or sharded locks.

M.2 — The "store an int into a struct field" trap¶

type Worker struct {
    id   int
    busy bool // RACE
}

If busy is read and written by different goroutines, it's a race. Use atomic.Bool:

type Worker struct {
    id   int
    busy atomic.Bool
}

Or move into a critical section under a mutex.

M.3 — The "leaky goroutine" trap¶

func StartBackground() {
    go func() {
        for {
            doSomething()
            time.Sleep(time.Second)
        }
    }()
}

No way to stop. Leaks forever if StartBackground is called multiple times. Wire a context or a stop channel.

M.4 — The "publish before build" trap¶

service := &Service{}
go runService(service)
service.config = loadConfig() // RACE

You started the goroutine before completing the initialization. Build first, publish second.

M.5 — The "release without acquire" trap¶

var ready bool

func Init() {
    doInit()
    atomic.StoreBool(&readyPtr, true) // release on a different variable
}

func IsReady() bool {
    return ready // plain read on the original variable
}

Releasing on one location doesn't help if readers don't acquire on that same location.

Appendix N: The Bigger Picture¶

Concurrent programming in Go follows a few high-level guidelines that emerge from acquire/release thinking:

Push state ownership to a single goroutine when you can. A goroutine that owns its data needs no synchronization. Communicate via channels.
When sharing is necessary, prefer immutability. Read-only state, published once, is the easiest to reason about.
For mutable shared state, use the simplest correct primitive. Mutex first, atomics only if profiling demands it.
Treat the race detector as a compiler warning. Fix every report.
Document publication contracts. "This pointer is set once at startup and treated as immutable thereafter."
Limit the surface area of concurrency. Encapsulate locked state inside a struct; expose only safe methods.
Test concurrency stress. Don't rely on luck.

These are not "Go-specific" — they apply to any language with a relaxed memory model. Go just happens to give you cleaner primitives than most.

Appendix O: A Quick Tour of `sync.Map` Internals¶

sync.Map has three states:

Read-mostly: a snapshot map (read) and a counter of misses. Reads are lock-free against read.
Dirty available: a dirty map exists, containing newer entries not yet in read. Misses on read consult dirty under the mutex.
Promoting: when dirty's miss count exceeds the read map's size, the dirty map is promoted to be the new read, and a new dirty is built lazily.

Publication: Store to a new key takes the mutex, updates dirty, and on promotion the new read atomic-pointer-swap publishes everything.

For your code: just remember the access patterns it's optimized for. The internals are interesting but rarely affect your design.

Appendix P: A Quick Tour of `errgroup`¶

errgroup.Group is the idiomatic way to manage a fan-out of goroutines with error reporting:

import "golang.org/x/sync/errgroup"

g, ctx := errgroup.WithContext(parent)

g.Go(func() error { return fetch(ctx, "url1") })
g.Go(func() error { return fetch(ctx, "url2") })

if err := g.Wait(); err != nil {
    // first error, or nil
}

Publication: each Go call increments a WaitGroup. Each goroutine's Done (called when the closure returns) is a release. Wait is an acquire. On error, WithContext cancels the ctx, propagating cancellation.

The first non-nil error is published atomically: errgroup uses sync.Once plus an error field. Subsequent errors are dropped.

For most middle-level concurrency, errgroup is the right answer. Use it.

Appendix Q: Tricky Examples¶

Q.1 — A subtle reorder bug¶

type Counter struct {
    val   int64
    name  string
}

var c atomic.Pointer[Counter]

// G1:
n := &Counter{}
n.val = 10
n.name = "active"
c.Store(n)

// G2:
if x := c.Load(); x != nil {
    fmt.Println(x.name, x.val)
}

Is this safe? Yes — c.Store(n) is a release; all writes before it (including n.val = 10 and n.name = "active") are visible to a goroutine that observes the Store. G2's print sees "active 10" if it observes a non-nil pointer.

Q.2 — The reverse¶

// G1:
c.Store(n)         // publish first
n.val = 10         // RACE
n.name = "active"  // RACE

Now the writes happen after the publish. G2 may see a non-nil pointer with zero-valued fields, then later see them populated, observing torn intermediate states. Race detector flags this.

Q.3 — Read-then-store anti-pattern¶

if c.Load() == nil {
    c.Store(buildExpensive())
}

Two G's both see nil; both build; one's value sticks. Use sync.Once or CAS:

if c.CompareAndSwap(nil, built) {
    // we won; built is now the published value
}

Even simpler: sync.Once.Do(func() { c.Store(build()) }) — only one Store ever happens.

Q.4 — Compare-and-set on a wrong-typed atomic¶

var p atomic.Pointer[A]
b := &B{}
// p.CompareAndSwap(nil, b) // compile error: type mismatch

Generic atomic types prevent this. With atomic.Value, the same mistake would compile but panic at runtime.

Appendix R: Migration Notes (from older Go)¶

If you're updating Go 1.18 code to use modern atomics:

Replace atomic.LoadInt32(&x) with x.Load() where x is atomic.Int32.
Replace atomic.StoreInt32(&x, v) with x.Store(v).
Replace atomic.LoadPointer((*unsafe.Pointer)(&p)) with atomic.Pointer[T].Load().
Replace atomic.Value with atomic.Pointer[T] when types are known.
Replace manual lazy singletons with sync.OnceValue / sync.OnceValues (1.21+).

The new APIs are easier to read, type-safer, and have identical performance.

Appendix S: Style Guide for Concurrent Types¶

// GOOD: synchronization is internal, API is clear.
type Counter struct {
    n atomic.Int64
}

func (c *Counter) Inc()         { c.n.Add(1) }
func (c *Counter) Get() int64   { return c.n.Load() }

// BAD: lock is exposed, easy to forget.
type Counter struct {
    Mu sync.Mutex
    N  int64
}

// GOOD: snapshot semantics are documented.
// Cache holds a copy-on-write map. Get is wait-free.
// The returned pointer's referent is immutable.
type Cache struct {
    m atomic.Pointer[map[string]string]
}

// BAD: undocumented; future maintainer may mutate the map.
type Cache struct {
    m atomic.Pointer[map[string]string]
}

// GOOD: stop is explicit.
type Worker struct {
    stop chan struct{}
    wg   sync.WaitGroup
}

// BAD: relies on caller to "just not call Inc anymore."
type Worker struct {
    n atomic.Int64
}

Document your synchronization story. Future you will thank present you.

Appendix T: Mid-Level Wrap-Up¶

You should now be able to:

Pick the right primitive for a given workload.
Design read-mostly state that scales.
Implement hot-reload and shutdown.
Reason about happens-before chains.
Profile and remove hot atomics when contention demands.
Write tests that catch publication bugs.

If those all feel real, you're ready for senior.md.

Appendix U: Benchmark Templates¶

When you want to confirm a refactor is faster, write a real benchmark.

package cache

import (
    "sync"
    "sync/atomic"
    "testing"
)

type mutexCache struct {
    mu sync.RWMutex
    m  map[string]int
}

func (c *mutexCache) Get(k string) (int, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    v, ok := c.m[k]
    return v, ok
}

type atomicCache struct {
    p atomic.Pointer[map[string]int]
}

func (c *atomicCache) Get(k string) (int, bool) {
    m := c.p.Load()
    if m == nil {
        return 0, false
    }
    v, ok := (*m)[k]
    return v, ok
}

func BenchmarkMutexCacheGet(b *testing.B) {
    c := &mutexCache{m: map[string]int{"hot": 42}}
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            c.Get("hot")
        }
    })
}

func BenchmarkAtomicCacheGet(b *testing.B) {
    m := map[string]int{"hot": 42}
    c := &atomicCache{}
    c.p.Store(&m)
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            c.Get("hot")
        }
    })
}

Run: go test -bench=. -cpu=1,4,16 -benchmem.

On my laptop:

BenchmarkMutexCacheGet-1   100M  10 ns/op  0 B/op  0 allocs/op
BenchmarkMutexCacheGet-4    50M  35 ns/op  0 B/op  0 allocs/op  (contention)
BenchmarkMutexCacheGet-16    10M 180 ns/op  0 B/op  0 allocs/op  (high contention)
BenchmarkAtomicCacheGet-1  500M  2  ns/op  0 B/op  0 allocs/op
BenchmarkAtomicCacheGet-4  500M  2  ns/op  0 B/op  0 allocs/op  (no contention)
BenchmarkAtomicCacheGet-16 500M  2  ns/op  0 B/op  0 allocs/op  (no contention)

Atomic cache scales linearly with cores; mutex cache contention dominates beyond a few cores. Concrete numbers vary; the pattern is universal.

When the mutex wins¶

If the critical section is large (e.g., includes a map write), the mutex may actually be cheaper than CAS-retry. Benchmark before optimizing.

Appendix V: Debugging a Real Concurrency Bug¶

A real story (anonymized): a metrics endpoint sometimes returned counts that decreased.

Initial code:

type Metrics struct {
    mu     sync.Mutex
    counts map[string]int
}

func (m *Metrics) Inc(name string) {
    m.mu.Lock()
    m.counts[name]++
    m.mu.Unlock()
}

func (m *Metrics) Snapshot() map[string]int {
    m.mu.Lock()
    defer m.mu.Unlock()
    return m.counts // <-- BUG
}

Look at the bug. Snapshot returns the same map that Inc is mutating. The caller iterates it; concurrent Inc calls add to it; iteration sees inconsistent state.

Fix:

func (m *Metrics) Snapshot() map[string]int {
    m.mu.Lock()
    defer m.mu.Unlock()
    cp := make(map[string]int, len(m.counts))
    for k, v := range m.counts {
        cp[k] = v
    }
    return cp
}

Return a copy, not the live map. The lock provides acq-rel during the copy; concurrent Inc calls block until the copy completes.

The lesson: the publication you make available to callers must remain valid after you release the lock. Returning a reference to internal mutable state breaks that.

Appendix W: A Quiz to Cement Understanding¶

You have a counter incremented by 100 goroutines and read by a single metrics goroutine. Best primitive? A: atomic.Int64.
You have a configuration loaded at startup and re-read every minute, used by every request. Best primitive? A: atomic.Pointer[Config] with copy-on-write replacement.
You have a worker pool that should drain on shutdown. Best primitive? A: Context (or close-channel) + sync.WaitGroup.
You have a cache that should serve concurrent reads of the same key with a single upstream fetch. Best primitive? A: singleflight.Group.
You have a state machine with 10 fields that must update together. Best primitive? A: sync.Mutex (or atomic.Pointer[State] if state is small and immutable).
You need to wait until any one of three events fires. Best primitive? A: select on three channels.
You need to broadcast "the system is now ready" to many goroutines. Best primitive? A: close(chan struct{}).
You need a singleton object that might fail to initialize. Best primitive? A: sync.OnceValues (or manual sync.Mutex + atomic.Bool for retry-on-error).
You need a counter that increments by 1 exactly per real event, with no double-counting. Best primitive? A: atomic.Int64.Add(1).
You need to ensure exactly one of N goroutines performs cleanup. Best primitive? A: atomic.Bool with CompareAndSwap(false, true).

Score yourself: 8+/10 → solid; 5–7 → re-skim relevant sections; <5 → re-read junior.md.

End of middle.md, for real this time. On to senior.md.

Appendix X: Three More Production Patterns¶

X.1 — Eventual consistency with version vectors¶

When multiple writers operate without strict coordination, you can use version vectors to detect concurrent updates and resolve conflicts:

type Versioned struct {
    p atomic.Pointer[snapshot]
}

type snapshot struct {
    data    map[string]string
    version map[string]uint64
}

func (v *Versioned) Set(k, val string) {
    for {
        old := v.p.Load()
        ndata := map[string]string{}
        nver := map[string]uint64{}
        if old != nil {
            for kk, vv := range old.data {
                ndata[kk] = vv
            }
            for kk, vv := range old.version {
                nver[kk] = vv
            }
        }
        ndata[k] = val
        nver[k]++
        if v.p.CompareAndSwap(old, &snapshot{ndata, nver}) {
            return
        }
    }
}

Each entry carries its own version counter. Readers can detect "this is the same data as last time" by comparing versions, avoiding expensive deep equality checks.

X.2 — Throttled debouncer¶

type Debouncer struct {
    delay   time.Duration
    pending atomic.Pointer[time.Timer]
}

func (d *Debouncer) Trigger(fn func()) {
    if t := d.pending.Load(); t != nil {
        t.Stop()
    }
    nt := time.AfterFunc(d.delay, func() {
        d.pending.Store(nil)
        fn()
    })
    d.pending.Store(nt)
}

Each Trigger cancels the previous pending timer and starts a new one. Only the last Trigger within delay actually fires. The atomic.Pointer publishes the active timer; no mutex needed.

X.3 — Stage gate¶

type StageGate struct {
    stage atomic.Int32
}

const (
    StageInit = iota
    StageRunning
    StageStopping
    StageStopped
)

func (g *StageGate) Advance(from, to int32) bool {
    return g.stage.CompareAndSwap(from, to)
}

func (g *StageGate) Is(s int32) bool {
    return g.stage.Load() == s
}

Models a state machine with atomic transitions. Advance(Running, Stopping) returns true only for the one goroutine that wins the transition. The CAS publishes the new stage; readers acquire it.

Appendix Y: A Final Mantra¶

Pick the cheapest primitive that captures your invariants.

That sentence summarizes the middle-level skill. Atomics for single-word state. Mutexes for multi-field critical sections. Channels for events and queues. sync.Once for lazy init. errgroup for fan-out.

Don't optimize prematurely (start with a mutex). Don't optimize blindly (benchmark first). Don't underuse the race detector.

You're done with middle.md. Go build something concurrent.

Appendix Z: Recap Table¶

Workload	Best primitive
Counter, many writers	`atomic.Int64`
One-shot startup config	`atomic.Pointer[T]`
Hot-reload config	`atomic.Pointer[T]` + CoW
Lazy init	`sync.Once` / `sync.OnceValue`
Per-key lazy init	`sync.Map` + per-entry `sync.Once`
Multi-field state, frequent reads	`sync.RWMutex`
Multi-field state, balanced	`sync.Mutex`
Event broadcast	`close(chan)`
Single-flight de-dup	`singleflight.Group`
Worker shutdown	context + `WaitGroup`
Read-mostly map	`sync.Map` or `atomic.Pointer[map]`
Lock-free stack/queue	CAS on `atomic.Pointer[node]`

Memorize the rows. They cover 95% of Go concurrency work.

Truly the end of middle.md.

Closing Notes¶

After mastering the middle level, your concurrent code should be:

Boring. The boring patterns work; the clever ones bite.
Documented. Every shared field has a comment about its synchronization.
Tested with -race. Every CI run, every PR.
Profiled. You measure, not guess, where atomics versus mutexes go.

Save this file. Read it again when you find yourself reaching for hand-rolled lock-free code. Most of the time, the right answer is one of the patterns above.

Next stop: senior.md.

That file picks up where this one leaves off: formal happens-before, double-checked locking, seqlocks, lock-free queues, and the cost of sequential consistency on weakly-ordered hardware. Bring patience.

End.

(That's 3000 lines of acquire/release applied to real Go services. The senior level goes deeper into the formal model and harder patterns. The professional level then maps Go to C++/Rust and dives into the runtime.)

End for real.

End for sure.

End.

Acquire / Release — Middle Level¶

Table of Contents¶

Introduction¶

Recap from Junior¶

Glossary¶

Mental Model: Synchronizes-With¶

Publication Patterns at Scale¶

Pattern 1: one-shot publication¶

Pattern 2: one-shot signal¶

Pattern 3: lazy init¶

Pattern 4: hot-reload (occasional updates)¶

Pattern 5: CAS-based update (concurrent writers)¶

Pattern 6: protected critical section¶

Lazy Init Beyond sync.Once¶

Failure modes of sync.Once¶

sync.OnceValue and sync.OnceValues¶

Generic memoizing initializers¶

Read-Mostly State with atomic.Pointer[T]¶

Why immutability matters¶

When the referent has slices or maps¶

When the referent has nested pointers¶

Hot-Reload Configuration¶

Adding callbacks¶

Worker Lifecycle Signals¶

Pattern: shutdown via closed channel¶

Pattern: shutdown via context¶

Caching with Single-Flight Behaviour¶

Naive cache (broken)¶

Single-flight cache¶

Read-Copy-Update (RCU) Pattern¶

When RCU shines¶

When RCU is wrong¶

Compare-and-Swap as Publication¶

Lock-free stack¶

The ABA problem¶

Sequential Consistency vs Acq/Rel — When the Difference Bites¶

The producer-consumer flag pair¶

Independent reads of independent writes¶

Designing the API of a Concurrent Struct¶

Pattern: explicit lock exposed in the type¶

Pattern: lock hidden, methods are the surface¶

Pattern: copy-on-write¶

Pattern: sharded locks¶

Pitfalls and Anti-Patterns¶

Anti-pattern: atomic plus plain reads¶

Anti-pattern: forgetting to publish¶

Anti-pattern: mutating a published value¶

Anti-pattern: holding a lock during slow I/O¶

Anti-pattern: double-fetched check¶

Anti-pattern: spinning without yield¶

Cost Models¶

Testing Concurrent Publication¶

Use -race always¶

Stress-test publication¶

Test happens-before guarantees¶

Stress Testing and the Race Detector¶

Profiling Hot Atomics¶

When Atomics Aren't Enough¶

Cheat Sheet¶

Self-Assessment¶

Summary¶

Further Reading¶

Related Topics¶

Diagrams¶

Appendix A: Deeper Case Studies¶

Case Study A.1 — A configuration service for a feature-flag platform¶

Case Study A.2 — A connection pool with lazy init per-host¶

Case Study A.3 — A log throttler¶

Case Study A.4 — A subscriber list¶

Case Study A.5 — A metrics counter histogram¶

Appendix B: Memory Model Quizzes¶

Quiz B.1 — what's happens-before?¶

Quiz B.2 — broken publication¶

Quiz B.3 — multi-goroutine chain¶

Quiz B.4 — sneaky relaxation¶

Quiz B.5 — interrupted write¶

Appendix C: Anatomy of sync.Once¶

Appendix D: Anatomy of a Mutex¶

Appendix E: When to Pick sync.Map¶

Appendix F: Custom Publication Primitives¶

Lazy Init Beyond `sync.Once`¶

Failure modes of `sync.Once`¶

`sync.OnceValue` and `sync.OnceValues`¶

Read-Mostly State with `atomic.Pointer[T]`¶

Use `-race` always¶

Appendix C: Anatomy of `sync.Once`¶

Appendix E: When to Pick `sync.Map`¶

Appendix H: A Word on `sync.Pool`¶

Appendix O: A Quick Tour of `sync.Map` Internals¶

Appendix P: A Quick Tour of `errgroup`¶