Sequential Consistency — Optimization Exercises¶

Performance-focused exercises to reduce SC overhead in real workloads. Each exercise shows a baseline implementation and asks you to optimise.

Baseline:

type Stats struct {
    A atomic.Int64
    B atomic.Int64
    C atomic.Int64
}

Multiple goroutines update different fields. Profile shows poor scaling.

Optimise: Add cache-line padding.

type Stats struct {
    A atomic.Int64
    _ [56]byte
    B atomic.Int64
    _ [56]byte
    C atomic.Int64
    _ [56]byte
}

Benchmark target: 3-5x improvement on 8-core machines.

Verification: perf stat -e cache-misses shows reduction.

Exercise 2: Shard a Hot Counter¶

Baseline:

var counter atomic.Int64

func inc() { counter.Add(1) }

Profiles show runtime/internal/atomic.Xadd64 as a hot spot under high contention.

Optimise: Per-CPU sharding.

type ShardedCounter struct {
    shards [64]struct {
        v atomic.Int64
        _ [56]byte
    }
}

func (c *ShardedCounter) Inc(g int) {
    c.shards[g%64].v.Add(1)
}

func (c *ShardedCounter) Sum() int64 {
    var s int64
    for i := range c.shards {
        s += c.shards[i].v.Load()
    }
    return s
}

Benchmark target: 10-50x improvement on 16+ cores.

Exercise 3: Batch Updates¶

Baseline:

// Each event increments shared counter
func handleEvent() {
    counter.Add(1)
}

Counter is hot. Sharding helps but the cost is still per-event.

Optimise: Per-goroutine local accumulation, periodic flush.

type LocalCounter struct {
    local   int64
    counter *atomic.Int64
}

func (l *LocalCounter) Inc() {
    l.local++
    if l.local >= 100 {
        l.counter.Add(l.local)
        l.local = 0
    }
}

Benchmark target: 50-100x reduction in atomic operations.

Trade-off: Counter reads see slightly stale values until flush.

Exercise 4: Replace RWMutex with COW¶

Baseline:

type Config struct{ /* fields */ }

type Store struct {
    mu  sync.RWMutex
    cfg Config
}

func (s *Store) Read() Config {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return s.cfg
}

Reads are frequent. RLock adds overhead.

Optimise: Copy-on-write with atomic.Pointer.

type Store struct {
    cfg atomic.Pointer[Config]
}

func (s *Store) Read() *Config { return s.cfg.Load() }

func (s *Store) Update(c *Config) { s.cfg.Store(c) }

Benchmark target: 100-800x improvement on reads.

Exercise 5: Eliminate Unnecessary Atomics¶

Baseline:

type Server struct {
    requests atomic.Int64
    errors   atomic.Int64
}

func (s *Server) handle() {
    s.requests.Add(1) // every request increments
    // ...
}

Profile shows atomic.Add as significant overhead.

Optimise: Aggregate per-handler stats locally, flush at handler exit.

type LocalStats struct {
    requests int64
    errors   int64
}

func (s *Server) handle() {
    var local LocalStats
    local.requests++
    // do work, possibly local.errors++
    s.requests.Add(local.requests)
    s.errors.Add(local.errors)
}

Benchmark target: Reduce atomic ops by 50%.

Trade-off: Stats only accurate after handler exit.

Exercise 6: Optimise Spin Loops¶

Baseline:

for !ready.Load() {
}

Tight spin burns CPU.

Optimise: Backoff strategy.

spins := 0
for !ready.Load() {
    if spins < 100 {
        spins++
    } else if spins < 1000 {
        runtime.Gosched()
        spins++
    } else {
        time.Sleep(time.Microsecond)
    }
}

Or replace with a channel:

<-readyCh

Benchmark target: CPU usage drops significantly during waits.

Exercise 7: Read-Only Snapshots¶

Baseline:

type Map struct {
    mu sync.RWMutex
    m  map[string]int
}

func (m *Map) Get(k string) int {
    m.mu.RLock()
    defer m.mu.RUnlock()
    return m.m[k]
}

Optimise: Atomic snapshot.

type Map struct {
    m atomic.Pointer[map[string]int]
}

func (m *Map) Get(k string) int {
    return (*m.m.Load())[k]
}

Benchmark target: Reads become near-free.

Exercise 8: NUMA-Aware Sharding¶

Baseline: Sharded counter on a 4-NUMA-node machine. Throughput plateaus across NUMA boundaries.

Optimise: Pin shards to NUMA nodes.

// Requires platform-specific code (Linux: sched_setaffinity).
// Each goroutine pinned to a CPU; uses the shard local to that NUMA node.

Benchmark target: Reduced cross-node traffic; improved throughput.

Note: NUMA-pinning in Go requires cgo or unsafe. Use carefully.

Exercise 9: Avoid CAS Retry Loops Where Possible¶

Baseline:

for {
    old := v.Load()
    if v.CompareAndSwap(old, old+1) {
        break
    }
}

CAS retries under contention.

Optimise: Use Add directly.

v.Add(1)

Add is a single instruction; CAS may retry many times.

Benchmark target: Eliminate retry overhead.

Exercise 10: Lock-Free Stack vs Channel¶

Baseline: Use a channel for inter-goroutine work queue.

ch := make(chan Work, 100)

Optimise: For very high throughput, use lock-free queue.

// Vyukov bounded MPMC queue (see senior.md)

Benchmark target: 3-5x throughput improvement under extreme load.

Trade-off: More complex code; less Go-idiomatic. Prefer channels unless measured benefit.

Exercise 11: Reduce Atomic Pointer Allocations¶

Baseline:

type State struct{ /* large */ }

var cur atomic.Pointer[State]

func update(s State) {
    cur.Store(&s) // allocates
}

Each update allocates a new State on the heap.

Optimise: Pool allocated States.

var pool = sync.Pool{New: func() any { return new(State) }}

func update(s State) {
    n := pool.Get().(*State)
    *n = s
    cur.Store(n)
}

Benchmark target: Reduced GC pressure.

Caveat: Must not put old State back into pool while readers still hold pointer.

Exercise 12: Cache-Aware Data Layout¶

Baseline:

type Entry struct {
    Hot atomic.Int64
    Cold [60]byte // rarely accessed
}

Hot and cold data share cache lines.

Optimise: Separate hot and cold data.

type Entry struct {
    Hot atomic.Int64
    _   [56]byte
}

type ColdData struct {
    Data [60]byte
}

Benchmark target: Less cache pollution; better hot-path performance.

Exercise 13: Pre-Allocate Atomics¶

Baseline:

counters := make(map[string]*atomic.Int64)
// counters added at runtime

Map lookup overhead per access.

Optimise: Pre-allocate known keys.

const (
    CounterA = iota
    CounterB
    CounterC
    NumCounters
)

counters := [NumCounters]atomic.Int64{}

Benchmark target: Replace map lookup with array index.

Exercise 14: Reduce Sharding Overhead¶

Baseline: 64-shard counter sums 64 atomic loads per read.

Optimise: Cache the sum periodically.

type Counter struct {
    shards [64]struct{ v atomic.Int64; _ [56]byte }
    cached atomic.Int64
    last   atomic.Int64 // unix nano
}

func (c *Counter) Sum() int64 {
    last := c.last.Load()
    now := time.Now().UnixNano()
    if now-last < int64(time.Second) {
        return c.cached.Load()
    }
    var s int64
    for i := range c.shards {
        s += c.shards[i].v.Load()
    }
    c.cached.Store(s)
    c.last.Store(now)
    return s
}

Benchmark target: Sum is amortised O(1) on most calls.

Exercise 15: Use Faster CAS on ARM¶

Baseline: Standard atomic.CompareAndSwap.

Optimise: Ensure Go is targeting ARMv8.1+ for native CAS instructions (CASAL).

Set GOARM64=v8.1 or compile with -ldflags "-X main.armVer=8.1".

Benchmark target: Reduced CAS overhead on modern ARM.

Exercise 16: Combine Atomic Reads¶

Baseline:

a := atomicA.Load()
b := atomicB.Load()
// use a and b

Two atomic loads.

Optimise: Pack into one atomic if possible.

var combined atomic.Uint64 // high 32 bits = a, low 32 bits = b

v := combined.Load()
a := uint32(v >> 32)
b := uint32(v)

Benchmark target: One atomic load instead of two; consistent snapshot.

Trade-off: Type packing complicates code.

Exercise 17: Reduce GC Pressure from atomic.Pointer¶

Baseline: COW pattern allocates a new map on every write.

Optimise: Use mutable map with mutex if writes are frequent.

Or: amortise allocations with a "dirty" buffer:

type Cache struct {
    m         atomic.Pointer[map[string]string]
    pending   sync.Mutex
    dirty     map[string]string
}

func (c *Cache) Set(k, v string) {
    c.pending.Lock()
    c.dirty[k] = v
    c.pending.Unlock()
    // periodic flush merges dirty into m
}

Benchmark target: Reduced allocations.

Exercise 18: Use sync.Pool for Temporary Atomics¶

Baseline: Temporary atomic objects allocated per request.

Optimise: Pool them.

var statsPool = sync.Pool{New: func() any { return &Stats{} }}

func handle() {
    s := statsPool.Get().(*Stats)
    defer statsPool.Put(s)
    // use s
}

Benchmark target: Reduced allocations.

Exercise 19: Compile-Time vs Runtime Atomics¶

Baseline: Atomic on a known-immutable value:

var once atomic.Bool
func init() { once.Store(true) }
func check() bool { return once.Load() }

The atomic is unnecessary because the value never changes after init.

Optimise: Plain variable, set once at init:

var initialized = true // race-free because init runs before any other goroutine

Benchmark target: Eliminate the atomic load.

Exercise 20: Profile-Driven Optimisation¶

Process:

Run with pprof:

go test -cpuprofile cpu.prof -bench=.
go tool pprof -top cpu.prof

Identify atomic-heavy paths.
Apply mitigations from exercises 1-19.
Re-profile to confirm.
Measure throughput improvement.

Acceptance: Documented before/after numbers.

General Optimization Principles¶

Measure first. Don't optimise blind.
Profile hot paths. Focus where time is spent.
Sharding scales. Padding prevents false sharing.
Atomics are cheap when uncontended. They scale poorly under contention.
Channels are slower per-op than atomics. But richer semantics.
Less synchronisation is better synchronisation. Avoid sharing if possible.
Trade staleness for performance. Cached sums, batched updates.
Architecture matters. x86 ≠ ARM ≠ RISC-V. Test on target.
Watch GC pressure. Atomic.Pointer COW allocates.
Stay correct. Optimisations must preserve race-freedom.

Closing¶

These exercises build the practical optimisation skills of a senior Go engineer.

For each: - Implement both baseline and optimised versions. - Benchmark with go test -bench. - Profile with pprof if needed. - Document the improvement.

Mastering optimisation is the difference between "works" and "scales."

End.

Sequential Consistency — Optimization Exercises¶

Exercise 1: Reduce False Sharing¶

Exercise 2: Shard a Hot Counter¶

Exercise 3: Batch Updates¶

Exercise 4: Replace RWMutex with COW¶

Exercise 5: Eliminate Unnecessary Atomics¶

Exercise 6: Optimise Spin Loops¶

Exercise 7: Read-Only Snapshots¶

Exercise 8: NUMA-Aware Sharding¶

Exercise 9: Avoid CAS Retry Loops Where Possible¶

Exercise 10: Lock-Free Stack vs Channel¶

Exercise 11: Reduce Atomic Pointer Allocations¶

Exercise 12: Cache-Aware Data Layout¶

Exercise 13: Pre-Allocate Atomics¶

Exercise 14: Reduce Sharding Overhead¶

Exercise 15: Use Faster CAS on ARM¶

Exercise 16: Combine Atomic Reads¶

Exercise 17: Reduce GC Pressure from atomic.Pointer¶

Exercise 18: Use sync.Pool for Temporary Atomics¶

Exercise 19: Compile-Time vs Runtime Atomics¶

Exercise 20: Profile-Driven Optimisation¶

General Optimization Principles¶

Closing¶