Cache Coherence — Hands-On Tasks¶
A progressive set of practical exercises. Each builds on the previous.
Task 1: Reproduce False Sharing¶
Goal: Observe false sharing with your own eyes.
Write a benchmark with two int64 fields in one struct, two goroutines each updating one field. Run with -cpu=2. Note ns/op.
package falseshare
import (
"sync"
"sync/atomic"
"testing"
)
type Pair struct{ a, b int64 }
func BenchmarkPacked(b *testing.B) {
var p Pair
for i := 0; i < b.N; i++ {
var wg sync.WaitGroup
wg.Add(2)
go func() { defer wg.Done(); for j := 0; j < 1_000_000; j++ { atomic.AddInt64(&p.a, 1) } }()
go func() { defer wg.Done(); for j := 0; j < 1_000_000; j++ { atomic.AddInt64(&p.b, 1) } }()
wg.Wait()
}
}
Run: go test -bench=BenchmarkPacked -cpu=2 -benchtime=3s.
Expected: ns/op should be significantly higher than a single-goroutine benchmark for one variable.
Task 2: Fix False Sharing with Padding¶
Goal: Apply the simplest fix and measure the improvement.
Add 56 bytes of padding between a and b:
Write BenchmarkPadded mirroring BenchmarkPacked. Run with the same settings.
Expected: ns/op for padded version is 2-5× faster than packed.
Compare with benchstat:
Task 3: Sweep the Cache Line Boundary¶
Goal: Find the exact cache line size empirically.
Write a benchmark generator that varies the padding between two int64 fields from 0 to 128 bytes in 8-byte steps. For each, measure ns/op.
import "unsafe"
func runWithPad(b *testing.B, pad int) {
buf := make([]byte, 8 + pad + 8)
a := (*int64)(unsafe.Pointer(&buf[0]))
bb := (*int64)(unsafe.Pointer(&buf[8+pad]))
b.ResetTimer()
for i := 0; i < b.N; i++ {
var wg sync.WaitGroup
wg.Add(2)
go func() { defer wg.Done(); for j := 0; j < 100_000; j++ { atomic.AddInt64(a, 1) } }()
go func() { defer wg.Done(); for j := 0; j < 100_000; j++ { atomic.AddInt64(bb, 1) } }()
wg.Wait()
}
}
func BenchmarkPad0(b *testing.B) { runWithPad(b, 0) }
func BenchmarkPad8(b *testing.B) { runWithPad(b, 8) }
// ... up to pad=128
Expected: A step change around pad=56 (when the second int64 crosses to a new cache line on x86) or pad=120 (Apple silicon).
Task 4: Per-CPU Sharded Counter¶
Goal: Build a counter that scales linearly with cores.
Implement:
type ShardedCounter struct {
shards []paddedInt64
}
type paddedInt64 struct {
v atomic.Int64
_ [56]byte
}
func New() *ShardedCounter {
return &ShardedCounter{shards: make([]paddedInt64, runtime.NumCPU())}
}
func (c *ShardedCounter) Add(hint uint64, d int64) {
c.shards[hint%uint64(len(c.shards))].v.Add(d)
}
func (c *ShardedCounter) Sum() int64 {
var s int64
for i := range c.shards {
s += c.shards[i].v.Load()
}
return s
}
Benchmark with -cpu=1,2,4,8. Compare ns/op against a shared atomic.Int64.
Expected: Sharded counter ns/op stays roughly flat (or improves) as cores increase; shared counter degrades.
Task 5: Snapshot Pointer Config Store¶
Goal: Implement a read-mostly config store with atomic.Pointer.
type Config struct {
Timeout time.Duration
MaxConcurrent int
}
type Store struct {
snap atomic.Pointer[Config]
}
func New() *Store { return &Store{} }
func (s *Store) Set(c *Config) { s.snap.Store(c) }
func (s *Store) Get() *Config { return s.snap.Load() }
Benchmark: one writer updating config every second, N readers polling continuously. Vary N (1, 4, 16). Measure read ns/op.
Expected: Read ns/op stays flat as N increases. Writer impact on readers is brief.
Task 6: Layout Verification¶
Goal: Verify your struct layout at runtime.
For a padded struct, verify the hot fields are on different cache lines:
import "unsafe"
func init() {
var c Counter
a := uintptr(unsafe.Pointer(&c.a))
b := uintptr(unsafe.Pointer(&c.b))
if a/64 == b/64 {
panic("Counter.a and Counter.b share a cache line")
}
}
Add such checks for any struct where layout is load-bearing.
Task 7: Detect Coherence in pprof¶
Goal: Learn to read pprof for coherence.
Take any benchmark from Task 1 or 4. Generate a CPU profile:
go test -bench=BenchmarkPacked -benchtime=10s -cpu=4 -cpuprofile=cpu.prof
go tool pprof cpu.prof
(pprof) top10
Expected: runtime/internal/atomic.Xadd64 or similar should be near the top.
In pprof: web for a graph; list <function> to see source-level breakdown.
Task 8: Use perf c2c (Linux)¶
Goal: Generate and read a perf c2c report.
Build a benchmark binary:
Run with perf:
sudo perf c2c record -F 4000 -a -- ./falseshare.test -test.bench=BenchmarkPacked -test.benchtime=5s
sudo perf c2c report --stdio > c2c.txt
Examine c2c.txt. Find the hot cache lines. Verify they correspond to the contended struct fields.
Task 9: Build a Sharded Mutex Map¶
Goal: Implement a concurrent map with sharded mutexes.
type ShardedMap struct {
shards []mapShard
}
type mapShard struct {
mu sync.Mutex
m map[string]interface{}
_ [40]byte // pad
}
func New(n int) *ShardedMap {
shards := make([]mapShard, n)
for i := range shards { shards[i].m = make(map[string]interface{}) }
return &ShardedMap{shards: shards}
}
func (m *ShardedMap) Get(key string) (interface{}, bool) {
s := &m.shards[hash(key)%uint64(len(m.shards))]
s.mu.Lock()
defer s.mu.Unlock()
v, ok := s.m[key]
return v, ok
}
func (m *ShardedMap) Set(key string, value interface{}) {
s := &m.shards[hash(key)%uint64(len(m.shards))]
s.mu.Lock()
s.m[key] = value
s.mu.Unlock()
}
func hash(s string) uint64 {
var h uint64 = 14695981039346656037
for i := 0; i < len(s); i++ {
h ^= uint64(s[i])
h *= 1099511628211
}
return h
}
Benchmark Get and Set under concurrent load. Compare against sync.Map and a single-mutex-protected map.
Task 10: SPSC Lock-Free Ring Buffer¶
Goal: Implement a wait-free single-producer single-consumer ring.
type Ring[T any] struct {
write atomic.Uint64
_ [56]byte
read atomic.Uint64
_ [56]byte
buf []T
mask uint64
}
func New[T any](size int) *Ring[T] {
size = nextPow2(size)
return &Ring[T]{buf: make([]T, size), mask: uint64(size - 1)}
}
func (r *Ring[T]) Push(v T) bool {
w := r.write.Load()
rr := r.read.Load()
if w-rr >= uint64(len(r.buf)) {
return false
}
r.buf[w&r.mask] = v
r.write.Store(w + 1)
return true
}
func (r *Ring[T]) Pop() (T, bool) {
var zero T
rr := r.read.Load()
w := r.write.Load()
if rr == w {
return zero, false
}
v := r.buf[rr&r.mask]
r.read.Store(rr + 1)
return v, true
}
func nextPow2(n int) int {
p := 1; for p < n { p *= 2 }; return p
}
Benchmark with one producer goroutine pushing as fast as possible and one consumer goroutine popping. Measure throughput in items/sec.
Expected: ~200M items/sec on a fast x86.
Task 11: Hot/Cold Struct Split¶
Goal: Restructure a real struct with mixed access patterns.
Take a struct from your own codebase that has both frequently-mutated fields and rarely-touched fields. Refactor to put hot fields first with padding, cold fields after.
type Service struct {
// hot
requests atomic.Int64
_ cpu.CacheLinePad
errors atomic.Int64
_ cpu.CacheLinePad
// cold
name string
address string
config *Config
}
Benchmark before and after. If the struct is in a hot path, you should see throughput improvement.
Task 12: Read-Mostly Map with Copy-on-Write¶
Goal: Implement a COW map.
type COWMap[K comparable, V any] struct {
m atomic.Pointer[map[K]V]
}
func New[K comparable, V any]() *COWMap[K, V] {
m := make(map[K]V)
c := &COWMap[K, V]{}
c.m.Store(&m)
return c
}
func (c *COWMap[K, V]) Get(k K) (V, bool) {
m := c.m.Load()
v, ok := (*m)[k]
return v, ok
}
func (c *COWMap[K, V]) Set(k K, v V) {
for {
old := c.m.Load()
next := make(map[K]V, len(*old)+1)
for k2, v2 := range *old { next[k2] = v2 }
next[k] = v
if c.m.CompareAndSwap(old, &next) { return }
}
}
Benchmark Get vs Set. Compare with sync.Map and a regular map + Mutex. Find the sweet-spot read:write ratio for each.
Task 13: Histogram Aggregation¶
Goal: Build a per-CPU histogram.
type Histogram struct {
shards []histShard
bounds []float64
}
type histShard struct {
buckets []atomic.Int64
_ cpu.CacheLinePad
}
func New(bounds []float64) *Histogram {
n := runtime.NumCPU()
shards := make([]histShard, n)
for i := range shards {
shards[i].buckets = make([]atomic.Int64, len(bounds)+1)
}
return &Histogram{shards: shards, bounds: bounds}
}
func (h *Histogram) Observe(hint uint64, value float64) {
idx := h.bucketIndex(value)
h.shards[hint%uint64(len(h.shards))].buckets[idx].Add(1)
}
func (h *Histogram) Snapshot() []int64 {
out := make([]int64, len(h.bounds)+1)
for i := range h.shards {
for j := range h.shards[i].buckets {
out[j] += h.shards[i].buckets[j].Load()
}
}
return out
}
func (h *Histogram) bucketIndex(value float64) int {
for i, b := range h.bounds {
if value <= b { return i }
}
return len(h.bounds)
}
Benchmark with many concurrent Observers. Compare with a single shared []atomic.Int64.
Task 14: Set Up CI Benchmarks¶
Goal: Add CI that catches regressions.
Add a .github/workflows/bench.yml:
name: bench
on: [pull_request]
jobs:
bench:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
- name: benchmark
run: |
go test -bench=. -count=5 -benchtime=3s -cpu=1,4 -timeout=15m ./... > new.txt
cat new.txt
Add benchstat comparison logic. Fail on >10% regression.
Task 15: Profile Production¶
Goal: Diagnose a real coherence problem.
Pick a service you operate. Run a CPU profile during load:
curl http://service:6060/debug/pprof/profile?seconds=60 > prof.out
go tool pprof -http=:8080 prof.out
Look at the top functions. If any atomic or mutex function is in the top 5, hypothesize a coherence issue. Confirm with the benchmark suite. Fix.
Task 16: Read the Runtime¶
Goal: Internalise the patterns.
Read src/sync/pool.go end to end. Note every place padding appears. Note the pin/unpin pattern.
Read src/runtime/proc.go enough to understand findrunnable. Note how per-P data is accessed.
Read src/runtime/mheap.go to see central array padding.
Write a one-page summary for yourself.
Task 17: Mentor¶
Goal: Teach a junior engineer.
Pair-program a coherence-related fix with a junior. Walk them through:
- Identifying the suspect struct.
- Measuring with a benchmark.
- Applying the fix.
- Confirming the improvement.
Mentoring solidifies your own understanding.
Task 18: Contribute Upstream¶
Goal: Make the ecosystem better.
Pick a library you use. Profile it under load. If you find a coherence issue, fix it locally. Then:
- Open an issue on the project's tracker.
- Submit a PR with measurements.
- Engage with maintainers.
Even small fixes (padding a field, sharding a counter) benefit many users.
Task 19: Design Document¶
Goal: Plan a cache-aware design.
For a service or library you maintain, write a design doc covering:
- Hot data structures and their access patterns.
- Padding/sharding decisions.
- Benchmarks demonstrating scaling.
- CI guardrails.
- Operational monitoring.
Share with your team.
Task 20: Reflect¶
Goal: Internalise the discipline.
After completing the previous tasks, reflect:
- What surprised you?
- Where in your own codebase is the next opportunity?
- How will you operationalise cache awareness on your team?
Write a journal entry. Revisit in three months.
Closing¶
Twenty tasks. Each takes from minutes to days. Complete them in sequence; cache-coherence intuition will become reflex.
End of tasks.md.