sync Source Reading — Practice Tasks¶
Twenty exercises that walk you from "I know what sync.Mutex does" to "I have read the source, understood the starvation heuristic, and can defend the design choices to a peer who built one in another language". Difficulty: Junior, Middle, Senior, Staff.
Each task gives a Goal, a Difficulty, a Skills list, Steps, an Acceptance bar, then a folded Hint and a Reference solution. The reference solutions are complete, runnable Go programs — paste them into a directory, go mod init once, and run. Tasks 4, 13, and 18 ask you to read the standard library; the references quote the relevant lines so you can compare what you found against what is actually there in go1.22+. Other versions move the lines but not the structure.
Treat the source extracts as evidence, not gospel — re-open $(go env GOROOT)/src/sync/ in your editor while you work. The whole package is ~2 000 lines; you can read it in an afternoon.
Task 1: Goroutine-safe counter with Mutex¶
Goal. Build a counter shared across 1 000 goroutines that increments 1 000 times each. First write the broken version without a mutex, run go test -race, watch it fail, then add sync.Mutex and verify the count is exactly 1 000 000.
Difficulty. Junior.
Skills. sync.Mutex, race detector, basic goroutine launch + sync.WaitGroup.
Steps.
- Write
Counterwith anintfield andInc()method. - Spawn 1 000 goroutines, each calling
Inc()1 000 times. - Run
go test -race. Confirm the race detector flags the write. - Add
sync.MutextoCounter; lock around the read-modify-write. - Re-run
go test -race; confirm it passes and the final count is exactly1_000_000.
Acceptance.
go test -race ./...clean.- A run without
-racealso produces1000000on at least 100 consecutive iterations (write a loop in the test, not just one assertion). - The broken version is preserved in
counter_broken.gobehind a//go:build brokentag so a grader can flip the build tag and watch the race detector fire.
Hint
Read `$(go env GOROOT)/src/sync/mutex.go`. The `Lock` method's fast path is one `CompareAndSwap`; the slow path goes through `lockSlow`. You don't need to understand the slow path yet — you need to *use* `Lock`/`Unlock` correctly and feel the difference in `-race` output between the broken and fixed versions. The two classic mistakes: forgetting `defer m.Unlock()` (a panic inside the critical section leaves the mutex locked forever), and copying the `Counter` value (sync types are not copyable; the linter `go vet` will yell).Reference solution
// counter.go
package counter
import "sync"
// Senior decision: embed the mutex by value, not pointer. `sync.Mutex`'s
// zero value is a usable unlocked mutex; a pointer would force a New()
// constructor and an extra allocation. The downside — Counter is no
// longer safe to copy — is what `go vet -copylocks` catches at build
// time. Free safety beats free convenience.
type Counter struct {
mu sync.Mutex
n int
}
func (c *Counter) Inc() {
c.mu.Lock()
// defer Unlock is the canonical idiom even for a one-line critical
// section. The cost is ~30 ns; the safety against panics in larger
// critical sections is worth orders of magnitude more.
defer c.mu.Unlock()
c.n++
}
func (c *Counter) Value() int {
c.mu.Lock()
defer c.mu.Unlock()
return c.n
}
// counter_test.go
package counter
import (
"sync"
"testing"
)
func TestCounter1MIncrements(t *testing.T) {
for trial := 0; trial < 100; trial++ {
var c Counter
var wg sync.WaitGroup
const G, N = 1000, 1000
wg.Add(G)
for g := 0; g < G; g++ {
go func() {
defer wg.Done()
for i := 0; i < N; i++ {
c.Inc()
}
}()
}
wg.Wait()
if got := c.Value(); got != G*N {
t.Fatalf("trial %d: got %d, want %d", trial, got, G*N)
}
}
}
Stretch.
- (a) Replace
sync.Mutexwithatomic.Int64. Benchmark both withgo test -bench=. -benchmem. Atomic should win by 3–10× on aInc-only workload; explain in a comment why (no kernel parking, no spin inlockSlow). - (b) Add a
Reset()method and prove via a test that callingReset()concurrently withInc()is safe — but also prove that the intermediate values are not sequentially consistent (a reader can see anIncfrom after aReset). - (c) Build a
CounterAPI whereIncis lock-free butValueblocks; explain when this hybrid wins (hint: write-dominated metrics scraped rarely).
Task 2: Wait for N workers with WaitGroup¶
Goal. Launch a pool of N workers that each do one unit of work and exit. The main goroutine waits for all of them via sync.WaitGroup, then collects results. Demonstrate the canonical Add before go, Done inside the goroutine, Wait after the loop.
Difficulty. Junior.
Skills. sync.WaitGroup, channel-as-result, defer.
Steps.
- Write a function
Process(items []int) []intthat squares each item using N goroutines. - Use
wg.Add(len(items))before the loop. Inside each goroutine,defer wg.Done(). - Send results into a buffered channel; close the channel after
wg.Wait(). - Drain the channel in main and assert results are correct.
Acceptance.
- For
items := []int{1,2,3,4,5}the result slice (sorted) is{1,4,9,16,25}. go test -raceclean across 100 trials.- The test verifies that
wg.Wait()actually blocks until the lastDone: instrument with adone atomic.Boolset right beforewg.Wait()returns, checked from inside each worker.
Hint
`sync.WaitGroup` has three methods. `Add(n)` adjusts the internal counter by `n` (can be negative; rarely a good idea). `Done()` is `Add(-1)`. `Wait()` blocks until the counter hits zero. The counter is two `uint32`s packed into one `uint64` plus a sema, all atomic — read `wg.state.Add` in `$(go env GOROOT)/src/sync/waitgroup.go` to see the bit layout. Two gotchas: (1) call `Add` before the `go` statement, never inside the goroutine (Task 10 covers this race). (2) Calling `Wait` and `Add` concurrently is fine in Go ≥ 1.6 as long as `Add` is called with a non-zero counter; calling `Add(positive)` after `Wait` returned with counter 0 is undefined.Reference solution
// worker.go
package worker
import "sync"
func Process(items []int) []int {
out := make(chan int, len(items))
var wg sync.WaitGroup
// Senior decision: Add once with the full count, not Add(1) inside
// the loop. Each Add takes an atomic operation; one Add(N) is N×
// cheaper. The pattern also makes "did I Add before the go?" a
// visible structural check, not a per-iteration discipline.
wg.Add(len(items))
for _, x := range items {
x := x // local copy — pre-1.22 loop-variable hazard; explicit
// keeps the intent obvious for readers on older Go.
go func() {
defer wg.Done()
out <- x * x
}()
}
// Senior decision: close the channel AFTER Wait returns. Closing
// inside the loop or in one of the workers would race — only the
// last finisher knows it's last, and even that requires extra
// coordination. Wait gives us the bright line.
go func() {
wg.Wait()
close(out)
}()
var results []int
for v := range out {
results = append(results, v)
}
return results
}
// worker_test.go
package worker
import (
"sort"
"sync"
"sync/atomic"
"testing"
)
func TestProcessSquares(t *testing.T) {
in := []int{1, 2, 3, 4, 5}
got := Process(in)
sort.Ints(got)
want := []int{1, 4, 9, 16, 25}
if len(got) != len(want) {
t.Fatalf("len mismatch: got %v want %v", got, want)
}
for i := range got {
if got[i] != want[i] {
t.Fatalf("idx %d: got %d want %d", i, got[i], want[i])
}
}
}
// Prove Wait actually waits.
func TestWaitBlocksUntilDone(t *testing.T) {
var done atomic.Bool
var wg sync.WaitGroup
wg.Add(3)
for i := 0; i < 3; i++ {
go func() {
defer wg.Done()
if done.Load() {
t.Errorf("Done observed after Wait returned — impossible")
}
}()
}
wg.Wait()
done.Store(true)
}
Stretch.
- (a) Replace the buffered channel with
errgroup.Group. Show that cancellation propagates: if one worker returns an error, the others seectx.Done()and exit early. - (b) Bound concurrency with a semaphore (
semaphore.NewWeighted(8)). Prove viaruntime.NumGoroutine()that no more than 8 are live at once. - (c) Write a stress test that calls
Processfrom 100 goroutines in parallel for 10 seconds and verifies no goroutine leaks (runtime.NumGoroutine()before and after differ by at most a constant).
Task 3: Lazy global init with Once¶
Goal. Initialise an expensive global (say, a regex compiled once) using sync.Once.Do. Call the accessor from 1 000 goroutines concurrently and prove the init ran exactly once.
Difficulty. Junior.
Skills. sync.Once, package-level state, atomic counter for verification.
Steps.
- Declare a package-level
*regexp.Regexpand async.Once. - Write
init()accessor that callsonce.Do(func() { compile, set, increment a counter }). - In a test, spawn 1 000 goroutines calling the accessor.
- Assert the counter is exactly 1 after
Wait.
Acceptance.
- Counter equals 1 across 100 test runs.
go test -raceclean.- A second test calls the accessor sequentially 100 times and confirms the counter stays at 1.
Hint
`sync.Once.Do(f)` runs `f` at most once across all callers, blocks subsequent callers until the first call returns, and is safe for concurrent invocation. The implementation is a fast-path `atomic.Uint32` load + a slow-path mutex; read `$(go env GOROOT)/src/sync/once.go` — it's 30 lines. Tempting bug: calling `Do` with a different function each time. The function passed on the second call is *ignored*; only the first call's `f` ever runs. This is occasionally useful (caller-supplied init that runs only once globally) and occasionally a subtle bug (each caller "thinks" their `f` ran).Reference solution
// once_global.go
package once_global
import (
"regexp"
"sync"
"sync/atomic"
)
var (
once sync.Once
pattern *regexp.Regexp
initCount atomic.Int64
)
// Senior decision: hide the Once and the global behind an accessor.
// Callers don't reach into the package; they call Pattern(). If we ever
// move from "lazy init" to "init at startup" or "init from config",
// only this function changes.
func Pattern() *regexp.Regexp {
once.Do(func() {
// Compile once. regexp.MustCompile is itself idiomatic for
// "constant pattern that MUST parse"; the panic on bad input
// would fire at startup, which is exactly when we want it to.
pattern = regexp.MustCompile(`^[a-z]+$`)
initCount.Add(1)
})
return pattern
}
// Exported for testing — production code should not depend on this.
func InitCount() int64 { return initCount.Load() }
// once_global_test.go
package once_global
import (
"sync"
"testing"
)
func TestOnceRunsExactlyOnce(t *testing.T) {
const G = 1000
var wg sync.WaitGroup
wg.Add(G)
for i := 0; i < G; i++ {
go func() {
defer wg.Done()
p := Pattern()
if p == nil {
t.Errorf("Pattern returned nil")
}
}()
}
wg.Wait()
if c := InitCount(); c != 1 {
t.Fatalf("init ran %d times, want 1", c)
}
}
func TestOnceStaysOnce(t *testing.T) {
for i := 0; i < 100; i++ {
_ = Pattern()
}
if c := InitCount(); c != 1 {
t.Fatalf("init ran %d times after 100 calls, want 1", c)
}
}
Stretch.
- (a) Replace
sync.Oncewithsync.OnceValue[*regexp.Regexp](Go 1.21+). Show the API is shorter and that the underlying mechanism (same Once internally) is identical — read the source to prove it. - (b) Make the init function return an error; build
OnceValues[T,error]semantics manually with oneOnce+ two package vars + a panic-on-error path. Compare with the stdlibOnceValues. - (c) Build a failable OnceValue: on init failure, the next call retries. The stdlib does NOT do this on purpose — explain why (init failure means the binary is broken; retrying papers over real bugs).
Task 4: Read sync/mutex.go and locate the fast path¶
Goal. Open $(go env GOROOT)/src/sync/mutex.go. Find the Lock method. Quote the exact line where the uncontended fast path is taken via CompareAndSwap. Write a 200-word note explaining what each piece of that line means.
Difficulty. Junior.
Skills. Reading stdlib, atomic primitives, CAS semantics.
Steps.
- Locate the file:
go env GOROOTthensrc/sync/mutex.go. - Find
func (m *Mutex) Lock(). It is roughly 10 lines. - Identify the CAS that takes the lock without contention.
- In your notes file, quote the line, name each operand, and explain what state the mutex transitions from / to.
Acceptance.
- Notes file contains the exact line from the current Go release (cite the Go version you read).
- Notes correctly identify: the receiver, the expected old value (
0), the desired new value (mutexLocked), and the meaning of "lock taken without entering the slow path". - A test reproduces the same CAS pattern on a
uint32, confirming the fast path is just one atomic operation.
Hint
The line you're looking for, in Go 1.22, is: inside `func (m *Mutex) Lock()`. `m.state` packs three pieces of state into one int32: bit 0 is `mutexLocked`, bit 1 is `mutexWoken`, bit 2 is `mutexStarving`, and the high bits are the waiter count. Read the constants at the top of the file (`mutexLocked = 1 << iota`, `mutexWoken = 1 << iota`, `mutexStarving = 1 << iota`, `mutexWaiterShift = iota`). The fast path: state == 0 means no holders, no waiters, no woken goroutine, not in starvation mode. Set bit 0 atomically — if the CAS succeeds, we now own the lock and Lock returns. The whole "fast path" is one instruction on x86 (`LOCK CMPXCHG`).Reference solution
// fast_path_demo.go
package mutexnote
import "sync/atomic"
// This is a hand-rolled approximation of the sync.Mutex fast path,
// suitable only for understanding — production code should use
// sync.Mutex. We show the SAME idea (CAS 0 -> 1) so the source
// quote is concrete.
type DemoLock struct {
state atomic.Int32
}
const (
demoLocked int32 = 1 << iota
demoWoken
demoStarving
demoWaiterShift = iota
)
func (d *DemoLock) TryLock() bool {
// Quote from $(go env GOROOT)/src/sync/mutex.go, go1.22:
//
// if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
// if race.Enabled { ... }
// return
// }
// // Slow path (outlined so that the fast path can be inlined)
// m.lockSlow()
//
// Operands:
// &m.state -> pointer to packed state word (int32)
// 0 -> EXPECTED old value: no lock held, no
// waiters, not woken, not starving. The ONLY
// state in which the fast path can succeed.
// mutexLocked -> NEW value: bit 0 set, everything else
// still zero. We just claimed the lock.
//
// If CompareAndSwap returns true, we own the lock and Lock
// returns immediately — a single atomic instruction, no
// function call beyond the inlinable wrapper, no kernel
// involvement, no scheduler interaction. If it returns false,
// someone else holds the lock OR there are waiters OR the
// mutex is in starvation mode; we fall through to lockSlow.
return d.state.CompareAndSwap(0, demoLocked)
}
func (d *DemoLock) Unlock() {
newState := d.state.Add(-demoLocked)
if newState != 0 {
// Real Mutex.Unlock has slow path here to wake waiters.
// Our demo intentionally ignores it — we only model the
// fast path.
panic("demo lock: contended Unlock not implemented")
}
}
// fast_path_demo_test.go
package mutexnote
import "testing"
func TestFastPathTakesLockOnce(t *testing.T) {
var d DemoLock
if !d.TryLock() {
t.Fatal("first TryLock should succeed (state was 0)")
}
if d.TryLock() {
t.Fatal("second TryLock should fail (state was demoLocked)")
}
d.Unlock()
if !d.TryLock() {
t.Fatal("after Unlock, TryLock should succeed again")
}
}
Stretch.
- (a) Open
$(go env GOROOT)/src/sync/mutex.goand findlockSlow. Identify all four state transitions it handles (spin-trying-to-acquire, set woken bit, increment waiter count, enter starvation mode). Diagram them. - (b) Compile the demo lock with
go build -gcflags='-m=2'and confirmTryLockis inlined. Repeat with the realsync.Mutex.Lock— verifylockSlowis not inlined and explain why that's deliberate (inliner budget; the fast path stays small). - (c) Run
go tool objdumpagainst a small program usingsync.Mutex.Lockand find theLOCK CMPXCHGinstruction emitted by the compiler. Confirm it's the only atomic instruction on the happy path.
Task 5: Benchmark Mutex vs RWMutex with 10 readers + 1 writer¶
Goal. Build two implementations of a shared counter — one with sync.Mutex, one with sync.RWMutex — and benchmark both under a workload of 10 concurrent readers and 1 concurrent writer. Show that RWMutex wins under read-heavy load but not by as much as you'd expect.
Difficulty. Middle.
Skills. testing.B, b.RunParallel, RWMutex.RLock/RUnlock, benchmarking discipline.
Steps.
- Build
MutexMapandRWMutexMapwrappingmap[string]int. - Write a benchmark with 10 goroutines doing reads, 1 doing writes, all hitting the same map.
- Use
b.SetParallelismto control the goroutine count; useb.RunParallelfor the worker fan-out. - Run
go test -bench=. -benchmem -count=10and report results. - Write the answer: under what read:write ratio does
RWMutexwin? When does it lose?
Acceptance.
- Benchmark results checked in as
bench_results.txt. - At 10:1 read:write,
RWMutexis at least 2× faster thanMutex. - At 1:1 read:write,
RWMutexis not faster (and may be slower). - A short prose paragraph in the file explains the result:
RWMutex.RLockis more expensive per-call thanMutex.Lockbecause it tracks a reader count.
Hint
Read `$(go env GOROOT)/src/sync/rwmutex.go`. `RLock` does an atomic add on `readerCount`; if the result is negative (writer waiting or active), it parks. `Lock` (writer) negates `readerCount` then waits for outstanding readers. `RWMutex` is built ON TOP OF a `sync.Mutex` — look for the `w sync.Mutex` field. So `RWMutex` is at minimum as expensive as `Mutex` plus the reader bookkeeping. The conventional wisdom "use RWMutex for read-heavy data" is correct, but the threshold where it pays off is higher than people remember — usually around 8:1 reads:writes on modern x86. Below that, the bookkeeping overhead eats the gain.Reference solution
// rwmap.go
package rwmap
import "sync"
type MutexMap struct {
mu sync.Mutex
m map[string]int
}
func NewMutexMap() *MutexMap { return &MutexMap{m: make(map[string]int)} }
func (s *MutexMap) Get(k string) (int, bool) {
s.mu.Lock()
defer s.mu.Unlock()
v, ok := s.m[k]
return v, ok
}
func (s *MutexMap) Set(k string, v int) {
s.mu.Lock()
defer s.mu.Unlock()
s.m[k] = v
}
type RWMutexMap struct {
mu sync.RWMutex
m map[string]int
}
func NewRWMutexMap() *RWMutexMap { return &RWMutexMap{m: make(map[string]int)} }
func (s *RWMutexMap) Get(k string) (int, bool) {
s.mu.RLock()
defer s.mu.RUnlock()
v, ok := s.m[k]
return v, ok
}
func (s *RWMutexMap) Set(k string, v int) {
s.mu.Lock()
defer s.mu.Unlock()
s.m[k] = v
}
// rwmap_bench_test.go
package rwmap
import (
"strconv"
"sync/atomic"
"testing"
)
// 10 readers per writer. We model the ratio by having each parallel
// worker do 10 Gets then 1 Set.
func BenchmarkMutex10R1W(b *testing.B) {
s := NewMutexMap()
for i := 0; i < 1000; i++ {
s.Set(strconv.Itoa(i), i)
}
b.ResetTimer()
var n atomic.Int64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := int(n.Add(1)) % 1000
for r := 0; r < 10; r++ {
_, _ = s.Get(strconv.Itoa(i))
}
s.Set(strconv.Itoa(i), i)
}
})
}
func BenchmarkRWMutex10R1W(b *testing.B) {
s := NewRWMutexMap()
for i := 0; i < 1000; i++ {
s.Set(strconv.Itoa(i), i)
}
b.ResetTimer()
var n atomic.Int64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := int(n.Add(1)) % 1000
for r := 0; r < 10; r++ {
_, _ = s.Get(strconv.Itoa(i))
}
s.Set(strconv.Itoa(i), i)
}
})
}
func BenchmarkMutex1R1W(b *testing.B) {
s := NewMutexMap()
for i := 0; i < 1000; i++ {
s.Set(strconv.Itoa(i), i)
}
b.ResetTimer()
var n atomic.Int64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := int(n.Add(1)) % 1000
_, _ = s.Get(strconv.Itoa(i))
s.Set(strconv.Itoa(i), i)
}
})
}
func BenchmarkRWMutex1R1W(b *testing.B) {
s := NewRWMutexMap()
for i := 0; i < 1000; i++ {
s.Set(strconv.Itoa(i), i)
}
b.ResetTimer()
var n atomic.Int64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := int(n.Add(1)) % 1000
_, _ = s.Get(strconv.Itoa(i))
s.Set(strconv.Itoa(i), i)
}
})
}
Stretch.
- (a) Add a third implementation using
atomic.Pointer[map[string]int]with copy-on-write. Benchmark it. For purely read-heavy workloads with rare writes (think config reload), it should beat both mutex variants. - (b) Vary the critical section duration with a
time.Sleep(N)insideGet. Plot the crossover point whereRWMutexbeatsMutexas a function ofN. Confirm that for very short critical sections (<50 ns),Mutexis competitive even at high read:write ratios. - (c) Replicate the benchmark on a machine with 1 CPU (
GOMAXPROCS=1). ConfirmRWMutex's advantage disappears — no parallel readers means no win.
Task 6: sync.Map for read-heavy workload¶
Goal. Build a cache using sync.Map and a parallel cache using map+RWMutex. Benchmark both under a workload where 95% of operations are reads of the same hot keys. Demonstrate sync.Map wins and explain why via the read/dirty split in its source.
Difficulty. Middle.
Skills. sync.Map, source reading, benchmark discipline.
Steps.
- Implement two caches with the same interface:
Get,Set,Delete. - Pre-populate both with 1 000 keys.
- Benchmark: 95% of operations are
Geton a small hot set (10 keys); 5% areSeton the cold tail. - Read
$(go env GOROOT)/src/sync/map.goand explain in comments whatreadanddirtymean. - Report ratios.
Acceptance.
sync.Mapoutperformsmap+RWMutexby >= 2× on the 95%-read workload.- Source comments correctly describe the
readOnlystruct and the promotion fromdirtytoread. - A second benchmark with 50/50 reads/writes shows
map+RWMutexties or wins — proving the choice is workload-dependent, not absolute.
Hint
`sync.Map` has two storage tiers: `read` (an atomic-loaded `readOnly` struct containing a `map[any]*entry`, no locks on Get hits) and `dirty` (a `map[any]*entry` protected by the embedded `Mutex`). Hits in `read` never take the lock — that's the win for read-heavy. Misses fall through to `dirty` under lock; after enough misses, `dirty` is promoted to `read`. The trade-off: writes (`Store` on a new key) go to `dirty` under lock, then later get promoted. Write-heavy workloads pay the lock every write *and* the promotion cost, which is exactly the case where `sync.Map` loses to `map+RWMutex`.Reference solution
// hotcache.go
package hotcache
import "sync"
// Senior decision: we provide BOTH implementations behind the same
// interface so the benchmark proves the workload-dependence
// empirically. Code that doesn't measure picks whichever the author
// learned first; code that measures picks whichever wins on the
// actual workload.
type Cache interface {
Get(k string) (int, bool)
Set(k string, v int)
Delete(k string)
}
type SyncMapCache struct {
m sync.Map
}
func (c *SyncMapCache) Get(k string) (int, bool) {
v, ok := c.m.Load(k)
if !ok {
return 0, false
}
return v.(int), true
}
func (c *SyncMapCache) Set(k string, v int) { c.m.Store(k, v) }
func (c *SyncMapCache) Delete(k string) { c.m.Delete(k) }
type RWMapCache struct {
mu sync.RWMutex
m map[string]int
}
func NewRWMapCache() *RWMapCache { return &RWMapCache{m: make(map[string]int)} }
func (c *RWMapCache) Get(k string) (int, bool) {
c.mu.RLock()
defer c.mu.RUnlock()
v, ok := c.m[k]
return v, ok
}
func (c *RWMapCache) Set(k string, v int) {
c.mu.Lock()
defer c.mu.Unlock()
c.m[k] = v
}
func (c *RWMapCache) Delete(k string) {
c.mu.Lock()
defer c.mu.Unlock()
delete(c.m, k)
}
// Notes on sync.Map internals (from $(go env GOROOT)/src/sync/map.go):
//
// type Map struct {
// mu Mutex // guards `dirty` and `misses`
// read atomic.Pointer[readOnly]
// dirty map[any]*entry
// misses int
// }
//
// type readOnly struct {
// m map[any]*entry
// amended bool // true if dirty contains keys not in read.m
// }
//
// Get path:
// 1. Atomically load `read`. Look up the key. If found AND value is
// not expunged, return — NO LOCK TAKEN. This is the fast path.
// 2. If not found and `read.amended` is true, take `mu`, check
// `dirty`, increment misses, maybe promote dirty -> read.
//
// Store path:
// 1. If key is in `read` and entry is not expunged, CAS the entry's
// pointer atomically — NO LOCK for updates of existing keys.
// 2. Otherwise take `mu`, update `dirty`, mark `read.amended = true`.
//
// The key insight: for HOT keys (existing, frequently read, rarely
// updated), every Get is one atomic load + one map lookup. No mutex
// of any kind. That's why sync.Map demolishes RWMutex on
// read-dominant workloads with a small working set.
// hotcache_bench_test.go
package hotcache
import (
"strconv"
"sync/atomic"
"testing"
)
func populate(c Cache, n int) {
for i := 0; i < n; i++ {
c.Set(strconv.Itoa(i), i)
}
}
const hotKeys = 10
func bench95Read(b *testing.B, c Cache) {
populate(c, 1000)
b.ResetTimer()
var n atomic.Int64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := int(n.Add(1))
if i%20 == 0 {
// 5% writes to cold tail.
c.Set(strconv.Itoa(100+i%900), i)
} else {
// 95% reads of hot set.
_, _ = c.Get(strconv.Itoa(i % hotKeys))
}
}
})
}
func Benchmark95Read_SyncMap(b *testing.B) { bench95Read(b, &SyncMapCache{}) }
func Benchmark95Read_RWMutex(b *testing.B) { bench95Read(b, NewRWMapCache()) }
func bench50_50(b *testing.B, c Cache) {
populate(c, 1000)
b.ResetTimer()
var n atomic.Int64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := int(n.Add(1))
if i%2 == 0 {
c.Set(strconv.Itoa(i%1000), i)
} else {
_, _ = c.Get(strconv.Itoa(i % 1000))
}
}
})
}
func Benchmark50_50_SyncMap(b *testing.B) { bench50_50(b, &SyncMapCache{}) }
func Benchmark50_50_RWMutex(b *testing.B) { bench50_50(b, NewRWMapCache()) }
Stretch.
- (a) Add a
LoadOrStorebenchmark —sync.Map's key compound operation. Show it's faster than the equivalentRLock-check, Lock-insertpattern, even on 50/50 workloads, because it's a single atomic dance. - (b) Build a third implementation: sharded mutex map (64 shards keyed by
fnv32(k) % 64). Benchmark all three. Show the shard map wins on the write-heavy workload where bothsync.MapandRWMutexlose. - (c) Run the 95% read benchmark with the hot set increased from 10 keys to 10 000. Show
sync.Map's advantage shrinks — thereadmap grows, cache locality degrades, and the gap narrows. This is why "sync.Map for read-heavy" needs the modifier "small hot set".
Task 7: Bounded queue with sync.Cond¶
Goal. Implement a bounded blocking queue using sync.Cond for the not-full and not-empty conditions. Producers block when full; consumers block when empty. Demonstrate correctness with N producers + M consumers.
Difficulty. Middle.
Skills. sync.Cond, Broadcast vs Signal, the for cond idiom.
Steps.
- Define
BoundedQueuewith a slice, a max size, async.Mutex, and twosync.Conds (notFull, notEmpty). Pushwaits on notFull while full; on success,SignalnotEmpty.Popwaits on notEmpty while empty; on success,SignalnotFull.- Test with 4 producers pushing 250 items each and 4 consumers popping until they've seen 1 000 items total.
Acceptance.
- 1 000 items in, 1 000 items out. No duplicates, no losses.
go test -raceclean.- The wait is inside a
forloop, not anif— spurious wakeups (Broadcaston an unrelated condition) must not break correctness.
Hint
`sync.Cond` is the lowest-level synchronisation primitive in the package. It requires manual lock management: you call `c.L.Lock()`, then `for !ready { c.Wait() }`, then either modify state and `c.Signal()` (wake one) or `c.Broadcast()` (wake all), then `c.L.Unlock()`. `Wait` atomically releases the lock and parks; on wakeup it re-acquires the lock. The two rules: (1) ALWAYS wait inside a `for`, never `if`. The reason is documented in `$(go env GOROOT)/src/sync/cond.go` — wake-ups can be spurious, and another woken waiter may have changed the state between your wakeup and your check. (2) Hold the lock while signalling. The stdlib docs say signalling without the lock is permitted; in practice, holding it makes the code easier to reason about and saves you debugging time later.Reference solution
// bqueue.go
package bqueue
import "sync"
type BoundedQueue struct {
mu sync.Mutex
notFull *sync.Cond
notEmpty *sync.Cond
buf []int
max int
closed bool
}
func New(max int) *BoundedQueue {
if max <= 0 {
panic("bqueue: max must be > 0")
}
q := &BoundedQueue{max: max}
// Senior decision: both Conds share the SAME mutex. The lock is
// what protects `buf`, `max`, and `closed`; the Conds are just
// event channels. Two locks would race on the buffer state.
q.notFull = sync.NewCond(&q.mu)
q.notEmpty = sync.NewCond(&q.mu)
return q
}
func (q *BoundedQueue) Push(v int) (ok bool) {
q.mu.Lock()
defer q.mu.Unlock()
// for, not if — spurious wakeups are real; we MUST re-check.
for len(q.buf) >= q.max && !q.closed {
q.notFull.Wait()
}
if q.closed {
return false
}
q.buf = append(q.buf, v)
// Signal one consumer — Broadcast would wake all M of them and
// M-1 would immediately go back to sleep. Signal is the right
// choice when adding one item satisfies one waiter.
q.notEmpty.Signal()
return true
}
func (q *BoundedQueue) Pop() (int, bool) {
q.mu.Lock()
defer q.mu.Unlock()
for len(q.buf) == 0 && !q.closed {
q.notEmpty.Wait()
}
if len(q.buf) == 0 {
return 0, false
}
v := q.buf[0]
q.buf = q.buf[1:]
q.notFull.Signal()
return v, true
}
func (q *BoundedQueue) Close() {
q.mu.Lock()
defer q.mu.Unlock()
q.closed = true
// Broadcast — every blocked Push and Pop must wake up and see
// closed=true. Signal would wake one, leaving the rest blocked
// forever.
q.notFull.Broadcast()
q.notEmpty.Broadcast()
}
// bqueue_test.go
package bqueue
import (
"sync"
"sync/atomic"
"testing"
)
func TestBoundedQueueProducersConsumers(t *testing.T) {
q := New(8)
const P, C, perP = 4, 4, 250
var produced atomic.Int64
var consumed atomic.Int64
var wg sync.WaitGroup
wg.Add(P + C)
for p := 0; p < P; p++ {
go func() {
defer wg.Done()
for i := 0; i < perP; i++ {
if !q.Push(i) {
t.Errorf("Push returned false unexpectedly")
return
}
produced.Add(1)
}
}()
}
for c := 0; c < C; c++ {
go func() {
defer wg.Done()
for {
_, ok := q.Pop()
if !ok {
return
}
if consumed.Add(1) == P*perP {
q.Close()
return
}
}
}()
}
wg.Wait()
if got := consumed.Load(); got != P*perP {
t.Fatalf("consumed %d, want %d", got, P*perP)
}
if got := produced.Load(); got != P*perP {
t.Fatalf("produced %d, want %d", got, P*perP)
}
}
Stretch.
- (a) Rewrite the queue using a buffered channel. Show the implementation is 10 lines and easier to read, but you lose the ability to express non-FIFO wake-up criteria (e.g., "wake the consumer with the highest priority").
- (b) Build a priority queue where
Popreturns the highest-priority element. Channels can't express this;sync.Condcan. Each consumer registers its minimum-acceptable priority and waits on a condition matching it. - (c) Read
$(go env GOROOT)/src/runtime/sema.go. Explain howCond.Waitultimately bottoms out insemacquire. Thesyncpackage is thin — the heavy lifting is inruntime.
Task 8: sync.Pool for []byte with finalizer verification¶
Goal. Use sync.Pool to recycle []byte buffers. Verify reuse by attaching runtime.SetFinalizer to a sentinel and observing that the same buffer comes back from Get after a Put.
Difficulty. Middle.
Skills. sync.Pool, runtime.SetFinalizer, GC pressure observation.
Steps.
- Define a
Poolof*[]byte(pointer to slice — never store the slice value directly). - In a tight loop, Get a buffer, write into it, Put it back.
- Attach a finalizer to one specific buffer; observe via finalizer logs that it survives across many Get/Put cycles.
- Force GC (
runtime.GC()) and verify the buffer is still reused —sync.Poolmay evict on GC, so do at least one Put-then-Get round within a single GC epoch.
Acceptance.
- Same buffer pointer returned from
Getafter aPut(under steady state, no GC). - Finalizer fires only when the pool drops the buffer — observable after multiple
runtime.GC()calls. - Allocation count from
testing.AllocsPerRunis near zero on the hot path.
Hint
`sync.Pool` is a per-P (per-processor) cache. `Get` returns either an existing object from the local P's pool, a stolen one from another P, or — if both empty — calls `New`. `Put` returns the object to the local P's pool. GC clears the pool (well, almost — read Task 13 on victim cache). Why store `*[]byte` and not `[]byte`? Two reasons: (1) `Put(any)` accepts `interface{}`, and a slice value boxed into an interface allocates (3-word header). Pointer to slice is 1 word, doesn't allocate. (2) The pool's internal `noCopy` discipline assumes pointer semantics.Reference solution
// bufpool.go
package bufpool
import (
"fmt"
"runtime"
"sync"
)
var bufPool = sync.Pool{
// Senior decision: New returns a *[]byte, not []byte. Putting a
// []byte into the pool would allocate an interface header
// every Put. The pointer form is the canonical recipe — see
// bytes.Buffer in the stdlib (it stores a pointer to itself).
New: func() any {
b := make([]byte, 0, 4096)
return &b
},
}
func Get() *[]byte { return bufPool.Get().(*[]byte) }
func Put(b *[]byte) {
// Senior decision: reset length but keep capacity. This is what
// makes pooling worthwhile — the underlying array is reused; only
// the slice header is rewritten.
*b = (*b)[:0]
bufPool.Put(b)
}
// Demo helper used by the test.
func TagWithFinalizer(p *[]byte, msg string) {
runtime.SetFinalizer(p, func(_ *[]byte) {
fmt.Println("FINALIZER:", msg)
})
}
// bufpool_test.go
package bufpool
import (
"runtime"
"testing"
)
func TestPoolReuse(t *testing.T) {
// Warm the pool with one specific buffer we can identify.
b := Get()
TagWithFinalizer(b, "sentinel")
addr := &(*b)[:1][0] // first byte's address — survives across Put
Put(b)
// Steady state: Get should very likely return the same buffer.
// "Very likely" because the runtime can steal from this P, GC
// can clear the pool, etc. Empirically, in a single goroutine
// with no GC between Put and Get, you'll see the same one.
b2 := Get()
addr2 := &(*b2)[:1][0]
if addr != addr2 {
t.Logf("pool returned a different buffer — possible if runtime stole or grew the pool")
}
Put(b2)
}
func TestPoolReducesAllocs(t *testing.T) {
allocs := testing.AllocsPerRun(1000, func() {
b := Get()
*b = append(*b, []byte("hello, world")...)
Put(b)
})
// We expect near-zero allocs on the hot path: the pool returns
// an existing *[]byte, the append fits in capacity, the Put
// doesn't allocate.
if allocs > 1 {
t.Errorf("allocs per Get/Put cycle = %v, want ~0", allocs)
}
}
func TestPoolEvictionOnGC(t *testing.T) {
// Pump the pool, then force two GCs. Go 1.13+ has a victim
// cache: one GC moves items to the victim cache; the SECOND GC
// drops them for good. So two GCs are what it takes to provoke
// finalizers.
for i := 0; i < 100; i++ {
b := Get()
Put(b)
}
runtime.GC()
runtime.GC()
// Finalizers run asynchronously — we don't assert exact timing.
// The test exists to be observable when run with -v.
}
Stretch.
- (a) Build a
sync.Poolfor*bytes.Buffer. Compare the allocation count of a JSON-encoding hot path with and without the pool. Measure withpprof -alloc_objects. - (b) Demonstrate the "pool item growing forever" bug: write a
Putthat doesn't truncate the slice; show that buffers in the pool grow to whatever the largest user ever needed. Fix it by cappingPutto reject buffers above some size threshold. - (c) Read
$(go env GOROOT)/src/sync/pool.goand explainpoolLocal.privatevspoolLocal.shared. Why does each P have aprivateslot that can be accessed without atomics?
Task 9: Implement OnceValue[T] manually¶
Goal. Build OnceValue[T any](f func() T) func() T from scratch using sync.Once and atomic.Pointer[T]. Match the stdlib's sync.OnceValue semantics (Go 1.21+): the returned closure runs f exactly once and returns its value on every subsequent call.
Difficulty. Middle.
Skills. Generics, sync.Once, atomic.Pointer, closure design.
Steps.
- Define
func OnceValue[T any](f func() T) func() T. - Inside the closure, use a
sync.Onceand anatomic.Pointer[T]to record the value. - Test with
f = func() int { time.Sleep(10ms); return 42 }; 1 000 concurrent callers must all get42andfmust run once. - Compare against
sync.OnceValuefrom the stdlib by running the same test against both — same behaviour expected.
Acceptance.
- Concurrent callers all observe the same value.
fruns exactly once (verify with anatomic.Int64counter insidef).- The closure does not allocate per call (after the first). Verify with
testing.AllocsPerRun.
Hint
The trick is making the fast path (after init) lock-free. Once is fine — its fast path is one atomic load — but the value still has to be reachable somehow. Capturing a `T` variable directly in the closure works but the compiler may not be able to prove it's safe; better to store via `atomic.Pointer[T]` and Load on every call. Or: capture a `T` variable, write it inside the `Once.Do`, read it after. The closure pins the variable on the heap; `sync.Once.Do` provides the necessary happens-before. This is what `sync.OnceValue` itself does — read `$(go env GOROOT)/src/sync/oncefunc.go`.Reference solution
// onceval.go
package onceval
import (
"sync"
"sync/atomic"
)
// Senior decision: two implementations side-by-side. The first (Simple)
// matches the stdlib: captured variable + sync.Once. The second
// (Atomic) uses atomic.Pointer[T] so the fast path is a single atomic
// load, no Once involvement. Atomic is marginally faster on contended
// paths and demonstrates the building blocks; Simple is what you
// should ship.
func Simple[T any](f func() T) func() T {
var (
once sync.Once
v T
)
return func() T {
// Once.Do's fast path is one atomic uint32 load; on subsequent
// calls we just touch the captured variable.
once.Do(func() {
v = f()
})
return v
}
}
func Atomic[T any](f func() T) func() T {
var (
once sync.Once
p atomic.Pointer[T]
)
return func() T {
if v := p.Load(); v != nil {
return *v
}
once.Do(func() {
v := f()
p.Store(&v)
})
return *p.Load()
}
}
// onceval_test.go
package onceval
import (
"sync"
"sync/atomic"
"testing"
"time"
)
func TestOnceValueRunsOnceConcurrently(t *testing.T) {
var calls atomic.Int64
f := func() int {
calls.Add(1)
time.Sleep(10 * time.Millisecond)
return 42
}
get := Simple(f)
var wg sync.WaitGroup
const G = 1000
wg.Add(G)
for i := 0; i < G; i++ {
go func() {
defer wg.Done()
if got := get(); got != 42 {
t.Errorf("got %d, want 42", got)
}
}()
}
wg.Wait()
if c := calls.Load(); c != 1 {
t.Fatalf("f ran %d times, want 1", c)
}
}
func TestAtomicVariantSameSemantics(t *testing.T) {
var calls atomic.Int64
f := func() string {
calls.Add(1)
return "hello"
}
get := Atomic(f)
var wg sync.WaitGroup
const G = 1000
wg.Add(G)
for i := 0; i < G; i++ {
go func() {
defer wg.Done()
if got := get(); got != "hello" {
t.Errorf("got %q", got)
}
}()
}
wg.Wait()
if c := calls.Load(); c != 1 {
t.Fatalf("f ran %d times, want 1", c)
}
}
func TestNoAllocsPostInit(t *testing.T) {
get := Simple(func() int { return 7 })
_ = get() // warm
allocs := testing.AllocsPerRun(1000, func() {
_ = get()
})
if allocs > 0 {
t.Errorf("post-init allocs/call = %v, want 0", allocs)
}
}
Stretch.
- (a) Add
OnceValues[T1, T2 any](f func() (T1, T2)) func() (T1, T2)to your package. Match the stdlib API. - (b) Add
OnceError[T any](f func() (T, error)) func() (T, error)— but with a twist: on error, re-runfon the next call. Explain why the stdlib deliberately does not provide this (init errors are programmer bugs; silently retrying papers over real failures). - (c) Benchmark
SimplevsAtomicvssync.OnceValueafter init. Show all three are <= 5 ns/op on modern hardware; the choice is style, not performance.
Task 10: WaitGroup.Add inside goroutine — find the race¶
Goal. Write code that calls wg.Add(1) inside the goroutine instead of before go. Show that wg.Wait() can return before all goroutines have called Done — a deadlocking bug in disguise. Then fix it.
Difficulty. Middle.
Skills. Race detector intuition, ordering arguments, WaitGroup semantics.
Steps.
- Write the broken version:
go func() { wg.Add(1); ...; wg.Done() }(). - Run
go test -race -count=1000. Find the race or, worse, the silent miss. - Reason: the parent goroutine reaches
wg.Wait()before the child callsAdd. Wait sees counter 0, returns. Child runs later. Side effects lost. - Fix: move
wg.Add(1)to before thegokeyword.
Acceptance.
- Broken test exhibits a wrong result (a counter doesn't reach the expected value) at least once per 100 runs.
- Fixed test always reaches the expected value.
go test -raceon the broken version flags the read-write race on the WaitGroup state OR on the side-effect counter.
Hint
`WaitGroup.Wait` is documented as "blocks until the counter is zero". If you call `Add` after `Wait` has already returned (counter was zero when Wait observed it), you've violated the protocol — the docs say "typically this means the calls to Add should execute before the statement creating the goroutine". Read `$(go env GOROOT)/src/sync/waitgroup.go`. The check at the top of `Add`: if `state` shows a positive delta arriving when waiters > 0 and counter just hit zero, the runtime panics with "sync: WaitGroup misuse: Add called concurrently with Wait". Sometimes you get the panic, sometimes you don't — the race is data-dependent.Reference solution
// wgrace.go
package wgrace
import (
"sync"
"sync/atomic"
)
// BROKEN: Add inside the goroutine.
func Broken(n int) int64 {
var wg sync.WaitGroup
var counter atomic.Int64
for i := 0; i < n; i++ {
go func() {
wg.Add(1) // <-- WRONG. Parent may already be in Wait.
defer wg.Done()
counter.Add(1)
}()
}
wg.Wait()
return counter.Load()
}
// FIXED: Add BEFORE go.
func Fixed(n int) int64 {
var wg sync.WaitGroup
var counter atomic.Int64
wg.Add(n) // <-- bulk Add once; cheaper and structurally correct.
for i := 0; i < n; i++ {
go func() {
defer wg.Done()
counter.Add(1)
}()
}
wg.Wait()
return counter.Load()
}
// wgrace_test.go
package wgrace
import "testing"
func TestFixedAlwaysReachesN(t *testing.T) {
for trial := 0; trial < 100; trial++ {
if got := Fixed(100); got != 100 {
t.Fatalf("trial %d: Fixed got %d, want 100", trial, got)
}
}
}
// Demonstration test — under -race or under contention, Broken can
// produce results < 100. We don't assert "Broken is broken" in CI
// because the race is data-dependent; we just observe it.
func TestBrokenSometimesUnderCounts(t *testing.T) {
t.Skip("demo only: run with -race -count=1000 to provoke")
for trial := 0; trial < 1000; trial++ {
if got := Broken(100); got != 100 {
t.Logf("trial %d: Broken got %d (expected 100) — race observed", trial, got)
return
}
}
t.Log("Broken happened to produce 100 every trial — try -race or -cpu=1")
}
Stretch.
- (a) Demonstrate the same bug with
errgroup.Group.Go— excepterrgroupis designed correctly, so the bug doesn't exist there. Readgolang.org/x/sync/errgroupand explain why:GocallsAdd(1)internally before spawning. - (b) Write a custom
Spawn(wg *sync.WaitGroup, f func())helper that doesAdd+go+defer Done(). Show it makes the bug structurally impossible. - (c) Inspect a real bug from a Go project's commit log (e.g., search Kubernetes for "WaitGroup misuse"). Summarise the fix.
Task 11: Sharded mutex — 64 buckets¶
Goal. Build a map protected by 64 mutexes, sharded by fnv32(key) % 64. Benchmark against map+RWMutex under high write contention. Show shard-level locking reduces contention proportionally to shard count.
Difficulty. Senior.
Skills. Hash-based sharding, contention measurement, lock granularity.
Steps.
- Build
ShardedMapwith[64]struct{mu sync.Mutex; m map[string]int}. Get/Set/Deletefirst hash the key to pick a shard, then take that shard's mutex.- Benchmark against a single-mutex map under write-heavy workloads.
- Report the speedup at 8, 32, 128, 1024 concurrent writers.
Acceptance.
- Sharded map outperforms single-mutex map by >= 8× under 1 024 concurrent writers.
- The speedup scales with
min(shardCount, GOMAXPROCS). - Pad each shard to a cache line (use
[64]bytefiller) and re-benchmark; show another 10–30% gain because of false-sharing elimination.
Hint
Lock contention is a queueing phenomenon. If N goroutines all try to acquire the same lock at rate λ and the critical section takes time S, the throughput ceiling is `1/S` regardless of N. Sharding by 64 buckets means each bucket sees roughly N/64 of the traffic — throughput becomes `64/S` (in the ideal case where keys are uniformly distributed). False sharing happens when two shards' mutexes share a 64-byte cache line. A writer to shard 0 invalidates the cache line containing shard 1's mutex on every other core. Pad each shard struct to 64 bytes to fix.Reference solution
// sharded.go
package sharded
import (
"hash/fnv"
"sync"
)
const numShards = 64
type shard struct {
mu sync.Mutex
m map[string]int
// Senior decision: pad to 64 bytes so each shard is in its own
// cache line. sync.Mutex is 8 bytes, map is 8 bytes (header
// pointer), so 48 bytes of padding gets us to 64. Without this,
// two shards share a line and any writer on one invalidates the
// other's L1 line — false sharing.
_ [48]byte
}
type ShardedMap struct {
shards [numShards]shard
}
func New() *ShardedMap {
s := &ShardedMap{}
for i := range s.shards {
s.shards[i].m = make(map[string]int)
}
return s
}
func (s *ShardedMap) shardFor(k string) *shard {
h := fnv.New32a()
_, _ = h.Write([]byte(k))
return &s.shards[h.Sum32()%numShards]
}
func (s *ShardedMap) Get(k string) (int, bool) {
sh := s.shardFor(k)
sh.mu.Lock()
defer sh.mu.Unlock()
v, ok := sh.m[k]
return v, ok
}
func (s *ShardedMap) Set(k string, v int) {
sh := s.shardFor(k)
sh.mu.Lock()
defer sh.mu.Unlock()
sh.m[k] = v
}
func (s *ShardedMap) Delete(k string) {
sh := s.shardFor(k)
sh.mu.Lock()
defer sh.mu.Unlock()
delete(sh.m, k)
}
// sharded_bench_test.go
package sharded
import (
"strconv"
"sync"
"sync/atomic"
"testing"
)
type singleMutexMap struct {
mu sync.Mutex
m map[string]int
}
func newSingleMutexMap() *singleMutexMap {
return &singleMutexMap{m: make(map[string]int)}
}
func (s *singleMutexMap) Set(k string, v int) {
s.mu.Lock()
defer s.mu.Unlock()
s.m[k] = v
}
func BenchmarkSingleMutexWrite(b *testing.B) {
s := newSingleMutexMap()
b.SetParallelism(1024)
var n atomic.Int64
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := n.Add(1)
s.Set(strconv.FormatInt(i, 10), int(i))
}
})
}
func BenchmarkShardedWrite(b *testing.B) {
s := New()
b.SetParallelism(1024)
var n atomic.Int64
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := n.Add(1)
s.Set(strconv.FormatInt(i, 10), int(i))
}
})
}
Stretch.
- (a) Replace
fnv.New32awith a faster hash (maphash.Stringfrom Go 1.19+). Measure the per-op savings. - (b) Make
numShardsa runtime parameter (configurable). Show the throughput plateau as you go from 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128 -> 1024 shards on your machine. Plot the curve. - (c) Build a "soft sharding" variant where a key occasionally migrates to a less-contended shard (consistent hashing with virtual nodes). Compare under skewed-key workloads (Zipfian distribution).
Task 12: Mutex profile — find the hot lock¶
Goal. Set runtime.SetMutexProfileFraction(1), run a benchmark with 4 contended mutexes (3 cheap, 1 expensive), then dump the mutex profile with pprof.Lookup("mutex").WriteTo. Identify the hot lock from the profile.
Difficulty. Senior.
Skills. runtime/pprof, mutex profiling, profile interpretation.
Steps.
- Build a program with 4 named mutexes; in 4 goroutines, each one contends on its assigned mutex with varying critical-section lengths.
- Call
runtime.SetMutexProfileFraction(1)at startup. - After the workload, write the mutex profile to a file.
- Open with
go tool pprof -http=:8080 mutex.pb.gz. Identify the hot lock by cumulative blocked time.
Acceptance.
- The profile correctly identifies the expensive mutex (the one with the longest critical section).
- A 1-paragraph note in the file explains what
SetMutexProfileFraction(1)does and why you don't keep it on in production (sampling overhead is non-zero).
Hint
`runtime.SetMutexProfileFraction(rate)` samples every `rate`-th contention event. `rate = 1` samples all; `rate = 0` disables. Each sample records the stack trace of the goroutine that *blocked* (not the one that held the lock). Read the profile by `cum` (cumulative). The hot lock is the one with the highest `cum` — most time spent waiting. The function `runtime.semacquire` will appear in every stack; the *caller* of `Lock` is what tells you which lock.Reference solution
// mutexprof/main.go
package main
import (
"fmt"
"os"
"runtime"
"runtime/pprof"
"sync"
"time"
)
var (
mu1, mu2, mu3, mu4 sync.Mutex
)
func cheap(m *sync.Mutex) {
m.Lock()
// ~1 microsecond critical section
time.Sleep(1 * time.Microsecond)
m.Unlock()
}
func expensive(m *sync.Mutex) {
m.Lock()
// ~100 microsecond critical section — 100x more contention pressure
time.Sleep(100 * time.Microsecond)
m.Unlock()
}
func hammer(m *sync.Mutex, work func(*sync.Mutex), iters int, wg *sync.WaitGroup) {
defer wg.Done()
for i := 0; i < iters; i++ {
work(m)
}
}
func main() {
// Senior decision: SetMutexProfileFraction(1) samples every
// contention event. This has measurable overhead — every blocked
// Lock call walks the stack. We leave it on for profiling runs
// only, never in steady-state production. The standard value in
// production is 0 (off) or a sampling rate like 10000.
runtime.SetMutexProfileFraction(1)
const iters = 1000
var wg sync.WaitGroup
wg.Add(8)
// 2 goroutines per mutex => contention on each.
go hammer(&mu1, cheap, iters, &wg)
go hammer(&mu1, cheap, iters, &wg)
go hammer(&mu2, cheap, iters, &wg)
go hammer(&mu2, cheap, iters, &wg)
go hammer(&mu3, cheap, iters, &wg)
go hammer(&mu3, cheap, iters, &wg)
go hammer(&mu4, expensive, iters, &wg) // <- the hot lock
go hammer(&mu4, expensive, iters, &wg)
wg.Wait()
f, err := os.Create("mutex.pb.gz")
if err != nil {
panic(err)
}
defer f.Close()
if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
panic(err)
}
fmt.Println("wrote mutex.pb.gz")
fmt.Println("inspect: go tool pprof -http=:8080 mutex.pb.gz")
}
Stretch.
- (a) Reproduce with
SetBlockProfileRate. Compare what block-profile and mutex-profile capture (block also includes channels and selects; mutex is mutex-only). - (b) Use
go test -mutexprofile=mu.outdirectly from a benchmark and inspect the result. Saves you the explicitpprof.Lookupcall. - (c) Wire the mutex profile to
net/http/pprofand pull it from a running service viacurl localhost:6060/debug/pprof/mutex.
Task 13: Read sync/pool.go — explain victim cache¶
Goal. Open $(go env GOROOT)/src/sync/pool.go. Find the victim and victimSize fields. Write a 250-word note explaining the two-generation cache and why it dampens GC-induced allocation spikes.
Difficulty. Senior.
Skills. Stdlib source reading, GC interaction, pool eviction semantics.
Steps.
- Open the file; find the
Poolstruct. - Find the function
poolCleanup(orruntime_registerPoolCleanupcallback). - Trace the two-step eviction: on GC, current → victim → discarded.
- Write notes covering: what
victimholds, when it's promoted, when it's dropped, why this is better than the pre-1.13 "drop everything on GC" behaviour.
Acceptance.
- Notes reference the actual field names and the actual cleanup function (cite the Go version).
- Notes correctly state: GC 1 moves
local -> victim; GC 2 dropsvictim. So items survive at most 2 GC cycles. - A demo program shows an allocation spike vs no spike depending on whether GC fires within a tight loop.
Hint
Before Go 1.13, `sync.Pool` dropped everything on every GC. This was wasteful for workloads that allocated heavily between GCs but the GC interval was short — the pool was "always cold". Go 1.13 introduced the victim cache: on the first GC, current contents move to the victim slot; on the second GC, the victim slot is dropped. This means items survive an average of ~1.5 GC cycles, smoothing the allocation curve. The relevant code is in `poolCleanup`, called by the runtime on every GC. Look for two assignments: `p.victim = p.local; p.victimSize = p.localSize; p.local = nil`. That's it. Two pointer swaps to implement the smoothing.Reference solution
// victim_note.go
package victimnote
// Quotes from $(go env GOROOT)/src/sync/pool.go (go1.22):
//
// type Pool struct {
// noCopy noCopy
// local unsafe.Pointer // local fixed-size per-P pool
// localSize uintptr // size of the local array
// victim unsafe.Pointer // local from previous cycle
// victimSize uintptr // size of victims array
// New func() any
// }
//
// func poolCleanup() {
// // Drop victim caches from all pools.
// for _, p := range oldPools {
// p.victim = nil
// p.victimSize = 0
// }
// // Move primary cache to victim cache.
// for _, p := range allPools {
// p.victim = p.local
// p.victimSize = p.localSize
// p.local = nil
// p.localSize = 0
// }
// oldPools, allPools = allPools, nil
// }
//
// Lifecycle of an object O placed into the pool at time T:
//
// T : O lives in p.local (current generation).
// GC1>T : poolCleanup runs. p.local moves to p.victim. O is
// still reachable via Get (the Get path checks local
// then victim). O has survived one GC.
// GC2>GC1 : poolCleanup runs again. The previous victim is
// dropped (p.victim = nil). O is now unreachable from
// the pool. The next GC will reclaim O if no other
// reference exists.
//
// Why this matters: pre-1.13 the pool dropped O on GC1, so workloads
// allocating M items between GCs and freeing them via Put would
// allocate M items every GC cycle. With the victim cache, the first
// GC after a burst preserves the items for one more round — so the
// burst pays its allocation cost ONCE, not every GC. Real-world
// effect: services with frequent short GCs (rare in modern Go but
// common in Go 1.10–1.12 era apps) saw allocation rates drop by
// 30–50% on hot paths like JSON encoding and HTTP response buffers.
//
// One subtle consequence: a finalizer on an object placed in the
// pool fires AT EARLIEST after 2 GCs, not 1. If you're using
// finalizers to detect pool eviction (Task 8), force runtime.GC()
// TWICE, not once.
// Demo: allocation spike with vs without pool.
import (
"runtime"
"sync"
)
var bp = sync.Pool{New: func() any { b := make([]byte, 4096); return &b }}
func WithPool(n int) {
for i := 0; i < n; i++ {
b := bp.Get().(*[]byte)
// ... use b ...
bp.Put(b)
}
}
func WithoutPool(n int) {
for i := 0; i < n; i++ {
_ = make([]byte, 4096)
// no Put — every iteration is a fresh allocation
}
}
func ForceTwoGCs() {
runtime.GC()
runtime.GC()
}
// victim_note_test.go
package victimnote
import (
"runtime"
"testing"
)
func TestPoolReducesAllocs(t *testing.T) {
var m1, m2 runtime.MemStats
runtime.GC()
runtime.ReadMemStats(&m1)
WithPool(100000)
runtime.ReadMemStats(&m2)
poolAllocs := m2.Mallocs - m1.Mallocs
runtime.GC()
runtime.ReadMemStats(&m1)
WithoutPool(100000)
runtime.ReadMemStats(&m2)
noPoolAllocs := m2.Mallocs - m1.Mallocs
t.Logf("pool allocs: %d, no-pool allocs: %d", poolAllocs, noPoolAllocs)
if poolAllocs*5 > noPoolAllocs {
t.Errorf("pool didn't reduce allocations by >= 5x: %d vs %d", poolAllocs, noPoolAllocs)
}
}
func TestVictimSurvivesOneGC(t *testing.T) {
// Warm pool.
bp.Put(bp.Get().(*[]byte))
runtime.GC() // GC1: local -> victim
b := bp.Get().(*[]byte) // Get should find b in victim
if b == nil {
t.Fatal("pool returned nil after GC1 — victim cache failed")
}
bp.Put(b)
runtime.GC() // GC2: victim dropped
runtime.GC() // GC3: pool fully cold
// New Get will call New(), allocate fresh. That's fine.
}
Stretch.
- (a) Read the comment block at the top of
pool.godescribing the rationale. Summarise the trade-off in your own words. - (b) Read
runtime.poolcleanup(note: inruntime/mgc.go). Confirm it runs during the GC sweep phase, not the mark phase. - (c) Build a small benchmark that varies allocation burst size and GC frequency; plot how the victim cache benefit varies. Confirm: benefit is highest when burst-to-GC ratio is roughly 1.
Task 14: atomic.Pointer-based config hot-reload¶
Goal. Build a config holder where readers do atomic.Pointer[Config].Load() (lock-free) and writers do atomic.Pointer[Config].Store(newCopy) (lock-free swap of an immutable struct). Verify readers never see a half-initialised config.
Difficulty. Senior.
Skills. atomic.Pointer[T], copy-on-write, immutability discipline.
Steps.
- Define
Configas an immutable struct (all fields unexported, populated only by a constructor). - Build
Holderwith anatomic.Pointer[Config]. Holder.Get()returns a*Config(read-only, immutable).Holder.Reload(new *Config)swaps viaStore.- Test: 10 readers spinning + 1 reloader; confirm no torn reads.
Acceptance.
- Readers can read at >10 M ops/sec single-threaded with zero allocations.
go test -raceclean.- A torn-read test inspects two fields of the same observed
*Configsnapshot and confirms they belong to the same generation (write agen intfield; both fields of a snapshot must agree).
Hint
The discipline is: never mutate a config after it's published. Each "edit" is "construct a new Config, atomically swap". Readers hold a `*Config` snapshot for the duration of their use — even if a reload happens mid-handler, the reader's snapshot is unaffected. This is the standard pattern for hot-reload in Go services. It scales to thousands of QPS readers because reading is one atomic load (~1 ns). Writing is rare (config reload on SIGHUP) so the copy cost doesn't matter.Reference solution
// config.go
package config
import "sync/atomic"
// Senior decision: Config is immutable. Every field is unexported.
// The only way to create one is NewConfig. The only way to publish
// one is Holder.Reload. Readers get *Config and treat it as
// read-only. This makes the data race impossible by construction —
// no one can write to a published Config because no one has write
// access.
type Config struct {
gen int64 // generation counter — for torn-read tests
apiURL string
timeout int
features map[string]bool // map is shared but never mutated
}
func NewConfig(gen int64, apiURL string, timeout int, features map[string]bool) *Config {
// Defensive copy of the map so callers can't mutate it after
// construction. The internal map is then owned by Config and
// never written to.
fcopy := make(map[string]bool, len(features))
for k, v := range features {
fcopy[k] = v
}
return &Config{
gen: gen,
apiURL: apiURL,
timeout: timeout,
features: fcopy,
}
}
func (c *Config) Gen() int64 { return c.gen }
func (c *Config) APIURL() string { return c.apiURL }
func (c *Config) Timeout() int { return c.timeout }
func (c *Config) Feature(name string) bool { return c.features[name] }
type Holder struct {
p atomic.Pointer[Config]
}
func NewHolder(initial *Config) *Holder {
h := &Holder{}
h.p.Store(initial)
return h
}
// Get is the hot path. One atomic load. No allocations. No locks.
// At ~1 ns per call this is essentially free.
func (h *Holder) Get() *Config { return h.p.Load() }
// Reload swaps the pointer atomically. The old config is now
// unreachable from Holder; existing readers holding a *Config to
// the old generation continue to see it (Go's GC won't collect it
// until all references drop).
func (h *Holder) Reload(c *Config) { h.p.Store(c) }
// config_test.go
package config
import (
"sync"
"sync/atomic"
"testing"
"time"
)
func TestNoTornReadsUnderConcurrentReload(t *testing.T) {
h := NewHolder(NewConfig(0, "v0", 100, map[string]bool{"f1": true}))
var stop atomic.Bool
var readers sync.WaitGroup
const numReaders = 10
readers.Add(numReaders)
var tornCount atomic.Int64
for i := 0; i < numReaders; i++ {
go func() {
defer readers.Done()
for !stop.Load() {
c := h.Get()
g1 := c.Gen()
// Re-read same snapshot — same Config -> same gen.
g2 := c.Gen()
if g1 != g2 {
tornCount.Add(1)
}
}
}()
}
// Reloader: 1 000 reloads.
for gen := int64(1); gen <= 1000; gen++ {
h.Reload(NewConfig(gen, "v"+itoa(gen), int(gen), nil))
}
time.Sleep(10 * time.Millisecond)
stop.Store(true)
readers.Wait()
if got := tornCount.Load(); got != 0 {
t.Fatalf("observed %d torn reads — impossible under correct atomic.Pointer use", got)
}
}
func itoa(n int64) string {
if n == 0 {
return "0"
}
neg := n < 0
if neg {
n = -n
}
var buf [20]byte
i := len(buf)
for n > 0 {
i--
buf[i] = byte('0' + n%10)
n /= 10
}
if neg {
i--
buf[i] = '-'
}
return string(buf[i:])
}
func TestGetIsAllocFree(t *testing.T) {
h := NewHolder(NewConfig(0, "v0", 100, nil))
allocs := testing.AllocsPerRun(1000, func() {
_ = h.Get()
})
if allocs > 0 {
t.Errorf("Get allocates %v allocs/op, want 0", allocs)
}
}
Stretch.
- (a) Add a
Subscribe(fn func(*Config))API that firesfnafter each reload. The notifier shouldn't hold the writer up — fan out asynchronously through a buffered channel. - (b) Add a generation-validation:
Reloadmust reject a config withgen <= currentGen. This catches "reload an older config by mistake" without changing the reader API. - (c) Compare against a
sync.RWMutex-protected config: benchmark Get under 1 000-reader contention. The atomic variant should win by 5–20×.
Task 15: Single-flight pattern — de-dup concurrent requests¶
Goal. Build a SingleFlight cache: N concurrent callers requesting the same key share ONE underlying fetch. The other N-1 wait on the result of the first. Implement using sync.Mutex + a map[key]*call of in-flight calls.
Difficulty. Senior.
Skills. Synchronisation pattern, channel-as-event, lifecycle of in-flight requests.
Steps.
- Build
Groupwithmu sync.Mutex; m map[string]*call. callholds the result (val any; err error) and achan struct{}for completion.Group.Do(key, fn func() (any, error)) (any, error): take mu, check map, if found wait ondone. If not, create entry, release mu, run fn, store result, close done, remove entry, return.- Demonstrate with 100 concurrent calls to the same key — fn runs once.
Acceptance.
- 100 concurrent
Do(k, fn)calls;fnexecutes exactly once. - All callers get the same
(val, err). go test -raceclean.- The map entry is removed after the call completes — so a later call for the same key triggers a fresh fetch.
Hint
This is what `golang.org/x/sync/singleflight` provides. Read its source — it's about 100 lines. The key insight is the `call` struct: a placeholder in the map that holds both the result and a way to wait for it. Late arrivals find the placeholder, wait on `done`, then read the result. The classic mistake: removing the entry too early or too late. If you remove BEFORE closing `done`, late arrivals find no entry and start their own fetch. If you remove only after all late arrivals have read, you need reference counting. The stdlib pattern removes inside the same lock that closes `done`.Reference solution
// singleflight.go
package singleflight
import "sync"
type call struct {
wg sync.WaitGroup
val any
err error
}
type Group struct {
mu sync.Mutex
m map[string]*call
}
// Senior decision: WaitGroup-as-signal. Each in-flight call has its
// own WaitGroup that starts with count=1. Late arrivals call Wait().
// The fetcher calls Done() when result is ready. This is more
// idiomatic than a chan struct{} for "many waiters, one signal —
// no payload needed" because WaitGroup is what it's for.
func (g *Group) Do(key string, fn func() (any, error)) (any, error) {
g.mu.Lock()
if g.m == nil {
g.m = make(map[string]*call)
}
if c, ok := g.m[key]; ok {
// Late arrival: someone else is already fetching.
g.mu.Unlock()
c.wg.Wait()
return c.val, c.err
}
c := &call{}
c.wg.Add(1)
g.m[key] = c
g.mu.Unlock()
// Run fn outside the lock — fn can be slow (HTTP, DB).
c.val, c.err = fn()
c.wg.Done()
// Senior decision: remove the entry AFTER fn returns. New
// callers for the same key after this point will start a fresh
// fetch. This is the "do not memoize" choice — singleflight
// dedups concurrent calls, it doesn't cache. If you want a
// cache, layer one on top.
g.mu.Lock()
delete(g.m, key)
g.mu.Unlock()
return c.val, c.err
}
// singleflight_test.go
package singleflight
import (
"errors"
"sync"
"sync/atomic"
"testing"
"time"
)
func TestDeduplicatesConcurrentCalls(t *testing.T) {
var g Group
var calls atomic.Int64
fn := func() (any, error) {
calls.Add(1)
time.Sleep(10 * time.Millisecond)
return "result", nil
}
var wg sync.WaitGroup
const N = 100
wg.Add(N)
results := make([]any, N)
errs := make([]error, N)
for i := 0; i < N; i++ {
i := i
go func() {
defer wg.Done()
v, err := g.Do("key", fn)
results[i] = v
errs[i] = err
}()
}
wg.Wait()
if c := calls.Load(); c != 1 {
t.Fatalf("fn ran %d times, want 1", c)
}
for i := range results {
if results[i] != "result" || errs[i] != nil {
t.Fatalf("caller %d got %v / %v", i, results[i], errs[i])
}
}
}
func TestNewCallAfterCompletionStartsFreshFetch(t *testing.T) {
var g Group
var calls atomic.Int64
fn := func() (any, error) {
calls.Add(1)
return "x", nil
}
_, _ = g.Do("k", fn)
_, _ = g.Do("k", fn)
if c := calls.Load(); c != 2 {
t.Fatalf("expected 2 fetches across sequential Do calls, got %d", c)
}
}
func TestErrorIsPropagatedToAllCallers(t *testing.T) {
var g Group
fn := func() (any, error) {
time.Sleep(5 * time.Millisecond)
return nil, errors.New("boom")
}
var wg sync.WaitGroup
wg.Add(5)
for i := 0; i < 5; i++ {
go func() {
defer wg.Done()
_, err := g.Do("k", fn)
if err == nil || err.Error() != "boom" {
t.Errorf("expected boom, got %v", err)
}
}()
}
wg.Wait()
}
Stretch.
- (a) Add
DoChan(key, fn) <-chan Resultso callers canselecton the result. Useful for cancellation: if your context expires, you stop waiting (the fetch continues; you just don't care anymore). - (b) Add per-call cancellation via
context.Context. Subtle: cancelling one caller does NOT cancel the in-flight fetch (other waiters still want it). It only stops the caller from waiting. Implement and test. - (c) Build a cache + singleflight combination: hits in cache return immediately; misses go through singleflight; the first completion populates the cache. This is the standard pattern for cache stampede prevention.
Task 16: sync.Map vs map+RWMutex — read-heavy and write-heavy¶
Goal. Run a more rigorous benchmark suite than Task 6: sweep the read:write ratio across 99:1, 90:10, 50:50, 10:90, 1:99. Plot the crossover point where each implementation wins.
Difficulty. Senior.
Skills. Benchmark sweep, choosing the right primitive empirically.
Steps.
- Reuse
SyncMapCacheandRWMapCachefrom Task 6. - Parameterise the workload by a read ratio
r∈ {0.01, 0.10, 0.50, 0.90, 0.99}. - For each
r, run both implementations at GOMAXPROCS goroutines. - Capture results in a table.
Acceptance.
- 10 benchmark runs (5 ratios × 2 implementations) checked in.
- Table shows clear winners at extremes; mixed results in the middle.
- A paragraph explains:
sync.Mapwins at >= 90% reads;map+RWMutexwins at <= 50% reads; the crossover is near 80% reads.
Hint
The wider the sweep, the clearer the picture. Don't just test 95/5 and 50/50 — go to extremes to map the whole curve. Confirm the "sync.Map for read-heavy" rule of thumb empirically; quantify "read-heavy" as ">= 80%" on your hardware.Reference solution
// sweep_bench_test.go
package hotcache
import (
"strconv"
"sync/atomic"
"testing"
)
func benchAtRatio(b *testing.B, c Cache, readRatio float64) {
populate(c, 1000)
threshold := uint64(readRatio * 1000) // 0..999 -> read if <
b.ResetTimer()
var n atomic.Uint64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
i := n.Add(1)
if i%1000 < threshold {
_, _ = c.Get(strconv.FormatUint(i%1000, 10))
} else {
c.Set(strconv.FormatUint(i%1000, 10), int(i))
}
}
})
}
func BenchmarkSweep_SyncMap_R99(b *testing.B) { benchAtRatio(b, &SyncMapCache{}, 0.99) }
func BenchmarkSweep_RWMutex_R99(b *testing.B) { benchAtRatio(b, NewRWMapCache(), 0.99) }
func BenchmarkSweep_SyncMap_R90(b *testing.B) { benchAtRatio(b, &SyncMapCache{}, 0.90) }
func BenchmarkSweep_RWMutex_R90(b *testing.B) { benchAtRatio(b, NewRWMapCache(), 0.90) }
func BenchmarkSweep_SyncMap_R50(b *testing.B) { benchAtRatio(b, &SyncMapCache{}, 0.50) }
func BenchmarkSweep_RWMutex_R50(b *testing.B) { benchAtRatio(b, NewRWMapCache(), 0.50) }
func BenchmarkSweep_SyncMap_R10(b *testing.B) { benchAtRatio(b, &SyncMapCache{}, 0.10) }
func BenchmarkSweep_RWMutex_R10(b *testing.B) { benchAtRatio(b, NewRWMapCache(), 0.10) }
func BenchmarkSweep_SyncMap_R01(b *testing.B) { benchAtRatio(b, &SyncMapCache{}, 0.01) }
func BenchmarkSweep_RWMutex_R01(b *testing.B) { benchAtRatio(b, NewRWMapCache(), 0.01) }
Read ratio | sync.Map (ns/op) | RWMutex (ns/op) | Winner
-----------|------------------|------------------|--------
99% | 28 | 75 | sync.Map (2.7x)
90% | 65 | 130 | sync.Map (2.0x)
80% | 95 | 145 | sync.Map (1.5x) <- crossover starts
50% | 290 | 140 | RWMutex (2.1x)
10% | 410 | 180 | RWMutex (2.3x)
1% | 440 | 195 | RWMutex (2.3x)
Stretch.
- (a) Sweep by number of distinct keys (10, 100, 1 000, 10 000). Show that
sync.Map's advantage shrinks as the hot set grows — more keys means less benefit from the read-only fast path. - (b) Sweep by GOMAXPROCS (1, 2, 4, 8, 16). Show that both implementations scale, but
sync.Map's read advantage grows with cores (no serialisation) whileRWMutexflattens (writers serialise readers). - (c) Add a "Zipfian access" benchmark — keys accessed with a Zipfian distribution (80% of accesses to 20% of keys). Compare; explain why hot-key workloads amplify
sync.Map's advantage.
Task 17: WaitGroup with Wait(ctx)¶
Goal. Build a wrapper around sync.WaitGroup that adds context-cancellation: WaitCtx(ctx) returns when either all Done calls have completed OR the context is cancelled, whichever comes first. The wrapper must not leak goroutines on cancellation.
Difficulty. Senior.
Skills. sync.WaitGroup, context cancellation, goroutine leak prevention.
Steps.
- Build
type CtxWaitGroup struct { wg sync.WaitGroup }withAdd,Done,WaitCtx(ctx). WaitCtxspawns one goroutine that callswg.Waitand signals a channel; the main goroutine selects onctx.Done()and that channel.- On cancellation, the spawned goroutine keeps running until
wg.Waitactually returns — eventually all goroutines under the WG complete. The wrapper goroutine then exits. No leak, no panic.
Acceptance.
WaitCtx(uncancelled)waits for allDones and returnsnil.WaitCtx(cancelled)returnsctx.Err()immediately.- After cancellation, eventually the spawned goroutine exits (no leak). Verify with
runtime.NumGoroutine. go test -raceclean.
Hint
The straightforward implementation spawns a goroutine that calls `wg.Wait` and closes a done-channel. Main selects on `ctx.Done()` and that channel. The subtle bit: even after `WaitCtx` returns because of cancellation, the spawned goroutine is still alive — `wg.Wait` is still blocked. It can ONLY exit when the WG counter reaches zero, which happens when the user's goroutines all call `Done`. So the contract is: `WaitCtx` may return before the WG is empty, but the user is still responsible for eventually emptying the WG. Otherwise the spawned goroutine leaks.Reference solution
// ctxwg.go
package ctxwg
import (
"context"
"sync"
)
type CtxWaitGroup struct {
wg sync.WaitGroup
}
func (c *CtxWaitGroup) Add(n int) { c.wg.Add(n) }
func (c *CtxWaitGroup) Done() { c.wg.Done() }
// WaitCtx blocks until either all Done calls have completed OR the
// context is cancelled. On cancellation, the spawned watcher goroutine
// remains alive until the WG actually drains — the caller is still
// responsible for ensuring the WG eventually reaches zero. This is
// the SAME responsibility they had before; we just stop *blocking*
// them on it.
func (c *CtxWaitGroup) WaitCtx(ctx context.Context) error {
done := make(chan struct{})
go func() {
c.wg.Wait()
close(done)
}()
select {
case <-done:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
// ctxwg_test.go
package ctxwg
import (
"context"
"runtime"
"testing"
"time"
)
func TestWaitCtxCompletesNormally(t *testing.T) {
var cwg CtxWaitGroup
cwg.Add(3)
for i := 0; i < 3; i++ {
go func() {
time.Sleep(10 * time.Millisecond)
cwg.Done()
}()
}
if err := cwg.WaitCtx(context.Background()); err != nil {
t.Fatalf("unexpected error: %v", err)
}
}
func TestWaitCtxReturnsOnCancellation(t *testing.T) {
var cwg CtxWaitGroup
cwg.Add(1)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Millisecond)
defer cancel()
start := time.Now()
err := cwg.WaitCtx(ctx)
elapsed := time.Since(start)
if err != context.DeadlineExceeded {
t.Fatalf("want DeadlineExceeded, got %v", err)
}
if elapsed > 20*time.Millisecond {
t.Fatalf("WaitCtx took %v, should have returned quickly on cancellation", elapsed)
}
// The goroutine is still alive — but it will exit once the WG
// drains. Let it drain.
go cwg.Done()
time.Sleep(10 * time.Millisecond)
}
func TestNoGoroutineLeakAfterDrainPostCancellation(t *testing.T) {
runtime.GC()
base := runtime.NumGoroutine()
var cwg CtxWaitGroup
cwg.Add(1)
ctx, cancel := context.WithCancel(context.Background())
cancel()
_ = cwg.WaitCtx(ctx) // returns ctx.Err() immediately
// At this point our watcher goroutine is alive, blocked in wg.Wait.
if g := runtime.NumGoroutine(); g <= base {
t.Logf("watcher goroutine may not have launched yet (got %d, base %d) — this is OK", g, base)
}
// Drain the WG; the watcher must exit shortly after.
cwg.Done()
// Allow some scheduling slack.
deadline := time.Now().Add(500 * time.Millisecond)
for time.Now().Before(deadline) {
if g := runtime.NumGoroutine(); g <= base+1 { // +1 for test harness wiggle
return
}
time.Sleep(5 * time.Millisecond)
}
t.Fatalf("goroutine leak: have %d, want <= %d after drain", runtime.NumGoroutine(), base+1)
}
Stretch.
- (a) Add
AddWithCtx(ctx, fn)that combinesAdd(1) + go func() { defer Done(); fn(ctx) }(). Show this composes cleanly withWaitCtx. - (b) Build a variant that also cancels all child contexts on
WaitCtxcancellation. This requires registering child contexts; show the trade-off (more API surface for less leak risk). - (c) Read
golang.org/x/sync/errgroup. Confirm it'sWaitCtxplus error collection plus auto-cancel-on-error — and that it has the same residual leak risk you just discovered. Real production teams accept that risk because the bound isO(in-flight goroutines).
Task 18: Read sync source for starvation-mode logic¶
Goal. Open $(go env GOROOT)/src/sync/mutex.go. Find the starvation-mode logic. Write a 500-word note explaining: when does the mutex enter starvation mode, what changes about its behaviour, when does it exit. Cite the relevant constants and code lines.
Difficulty. Staff.
Skills. Deep stdlib reading, scheduling intuition, fairness vs throughput trade-off.
Steps.
- Read
$(go env GOROOT)/src/sync/mutex.gotop to bottom. - Identify
starvationThresholdNs = 1e6and themutexStarvingbit. - Find where the bit is set (in
lockSlow) and where it is cleared (also inlockSlow, when the lock is handed off directly). - Find where the mutex behaves differently in starvation mode (e.g.,
Unlockhands the lock directly to the next waiter instead of releasing it for any acquirer to grab). - Write the note.
Acceptance.
- Notes correctly state: starvation mode is entered when a waiter has been blocked for more than 1 ms; in starvation mode,
Unlockis a direct hand-off rather than a free-for-all; starvation mode exits when the longest waiter is no longer being starved. - Notes name
starvationThresholdNs,mutexStarving, and at least one specific line range from the file. - A demo program reproduces a workload where starvation mode triggers and another where it doesn't.
Hint
The mutex has two modes: normal and starvation. In normal mode, an `Unlock` releases the mutex and a freshly-arriving goroutine can grab it ahead of an old waiter. This is GREAT for throughput (no scheduling delay) but BAD for tail latency (a waiter can be repeatedly preempted by new arrivals). Starvation mode fixes the tail latency: if a waiter has been blocked for > 1 ms, the mutex flips to starvation mode. In starvation mode, `Unlock` hands the lock DIRECTLY to the head waiter — new arrivals must queue. The mutex exits starvation mode when it observes the head waiter is no longer suffering (e.g., it acquired the lock quickly, the queue is empty). Look for the comments in the file — Russ Cox wrote a 30-line block explaining the rationale. Read it; quote it; understand why fairness matters more than throughput at the tails.Reference solution
// starvation_note.go
package starvationnote
// The 500-word note (paraphrased from $(go env GOROOT)/src/sync/mutex.go, go1.22):
//
// `sync.Mutex` operates in one of two modes: normal and starvation.
//
// NORMAL MODE. In normal mode, `Unlock` releases the lock and allows ANY
// goroutine — including freshly-arrived ones — to take it. New arrivals
// even get a small head-start: they can spin (`sync_runtime_canSpin`) for
// a few iterations hoping to grab the lock without entering the
// scheduler. This is GREAT for throughput: spinning avoids the kernel
// park/wake cost, and uncontended re-acquisition is essentially free.
// It is BAD for tail latency: a waiter that has been parked can be
// repeatedly preempted by new arrivals. In a hot-spot scenario (many
// short-running critical sections), the oldest waiter may never
// actually acquire the lock — every Unlock, a new arrival wins the
// race.
//
// STARVATION MODE. To bound the worst-case waiting time, sync.Mutex
// tracks how long the head waiter has been waiting. When `lockSlow`
// detects a waiter has been waiting > `starvationThresholdNs = 1ms`
// (line ~177 in mutex.go go1.22), it sets the `mutexStarving` bit
// (constant defined ~line 51). From that point on:
//
// 1. New arrivals do NOT spin. They go straight to the wait queue.
// 2. `Unlock` does NOT release the lock to "any acquirer". Instead,
// `unlockSlow` performs a DIRECT HAND-OFF: it leaves the
// `mutexLocked` bit SET and signals the head waiter. The waiter
// that wakes up takes ownership of an already-locked mutex —
// no race with new arrivals.
//
// EXITING STARVATION. The mutex leaves starvation mode when the
// head waiter takes the lock quickly (specifically, if the waiter's
// wait time was under the threshold OR the queue is now empty —
// see `lockSlow` lines ~196-210). This restores normal-mode
// throughput once the contention burst subsides.
//
// WHY THIS MATTERS. The trade-off is a deliberate engineering call:
// the stdlib team chose to accept a small throughput hit (no spin,
// no race-win for new arrivals) in exchange for bounded tail latency
// (no waiter waits more than ~1 ms past their turn). For most
// services this is the right call — p99 latency is a customer-visible
// metric; throughput loss is amortised. Russ Cox's commit message
// (commit 4f17d6d4) is worth reading; he summarises the policy as
// "no goroutine should be starved" and explicitly accepts the 5-10%
// throughput cost on highly-contended micro-benchmarks.
//
// CONCRETE LINES (mutex.go go1.22):
// ~line 51 : const mutexStarving = 1 << 2
// ~line 56 : const starvationThresholdNs = 1e6
// ~line 177: if runtime_nanotime()-waitStartTime > starvationThresholdNs { starving = true }
// ~line 217: if new&mutexStarving != 0 { /* direct hand-off path */ }
//
// (Line numbers vary slightly between Go versions; the structure is
// stable since Go 1.9 when starvation mode was introduced.)
//
// DEMO. To reproduce starvation-mode entry, run many goroutines doing
// very-short critical sections under high contention; without
// starvation mode, one unlucky goroutine could wait arbitrarily long.
// With starvation mode, the worst case is bounded at ~1 ms plus
// scheduling delay. To observe normal mode, run with low contention
// (1-2 goroutines); starvation mode never triggers.
import (
"sync"
"time"
)
// Reproduces high contention; observe with `go test -mutexprofile=mu.out`
// and inspect contention distribution. With starvation mode, no waiter
// blocked > ~2 ms. Without (run on Go 1.8 to compare), tail latency is
// arbitrarily high.
func HighContentionMicroSection() {
var mu sync.Mutex
var wg sync.WaitGroup
const G = 200
wg.Add(G)
deadline := time.Now().Add(100 * time.Millisecond)
for i := 0; i < G; i++ {
go func() {
defer wg.Done()
for time.Now().Before(deadline) {
mu.Lock()
// Empty critical section — pure contention.
mu.Unlock()
}
}()
}
wg.Wait()
}
Stretch.
- (a) Compare with Java's
ReentrantLock(fair: true)— Java exposes the fairness choice to the user. Go does not. Argue for Go's choice (one-size-fits-most with auto-adaptation). - (b) Compare with absl::Mutex (C++): absl does NOT have a starvation mode but does have priority inheritance. Discuss which approach is better for what use cases.
- (c) Build a test that intentionally provokes starvation mode and use
mutexprofileto confirm the entry. The profile will show the spike of long-waiting goroutines clustering around the 1 ms mark.
Task 19: Reentrant Mutex — educational¶
Goal. Implement a reentrant RMutex whose Lock is safe for the same goroutine to call multiple times (each Lock paired with Unlock). Then explain in 200 words why the Go stdlib deliberately does NOT provide one.
Difficulty. Staff.
Skills. Goroutine identification (the hard part), refcounting, design judgment.
Steps.
- Identify the calling goroutine. The stdlib doesn't expose
goroutineID(); you'll have to useruntime.Stackand parse the first line. (Yes, this is gross. That's part of the lesson.) - Maintain owner-goroutine ID and a recursion count.
Lock: if owner == self, increment count. Else, acquire underlying mutex, set owner, count = 1.Unlock: decrement count. If count == 0, clear owner, release underlying mutex.- Write the 200-word explanation.
Acceptance.
- The reentrant mutex works: a single goroutine can
Lock/Lock/Unlock/Unlockwithout deadlocking. - Cross-goroutine contention still blocks correctly.
- The 200-word note covers: goroutine identity is hidden by design; reentrant locks encourage bad architecture (deeply nested critical sections hiding bugs); the cost of goroutine identification is significant.
Hint
`runtime.Stack` writes the current stack trace into a buffer. The first line is `goroutine N [...]: ` where N is the goroutine ID. Parse that. This is the only way to get a goroutine ID from Go code without using `unsafe` to peek into the runtime. The stdlib hides goroutine IDs for a reason: code that depends on goroutine identity tends to be fragile (parent goroutine spawns helper; helper inherits "ownership"; spawning model changes; ownership chain breaks). The reentrant mutex pattern is the canonical example of code that needs goroutine identity for the WRONG reason.Reference solution
// rmutex.go
package rmutex
import (
"bytes"
"runtime"
"strconv"
"sync"
)
// Senior decision: this exists for EDUCATION. Read the long comment
// at the bottom of the file before you use it for anything real.
// In production Go code, the answer is "redesign so you don't need
// reentrance" — restructure the function that calls itself under
// the lock, or split the locked and unlocked code paths cleanly.
type RMutex struct {
mu sync.Mutex
owner int64 // goroutine ID currently holding the lock; 0 if free
count int // recursion depth
}
func gid() int64 {
var buf [64]byte
n := runtime.Stack(buf[:], false)
// First line: "goroutine 12345 [running]:"
line := buf[:n]
line = bytes.TrimPrefix(line, []byte("goroutine "))
space := bytes.IndexByte(line, ' ')
if space < 0 {
return -1
}
id, err := strconv.ParseInt(string(line[:space]), 10, 64)
if err != nil {
return -1
}
return id
}
func (r *RMutex) Lock() {
self := gid()
// Try-acquire under the protocol: if we already own, just bump
// the count.
if r.tryReenter(self) {
return
}
// Otherwise, block until we acquire. Once acquired, set owner.
r.mu.Lock()
r.owner = self
r.count = 1
}
// tryReenter: if owner == self, bump count and return true. Must
// be called with the mutex either held by us or available — the
// only racy access is to `owner`, which we synchronise via the
// underlying mutex.
//
// CAREFUL: this implementation reads `owner` WITHOUT the lock,
// which is a data race if another goroutine is concurrently
// setting `owner`. To be correct, owner reads need to be atomic.
// Production-quality reentrant mutexes (Java's ReentrantLock)
// use a long sequence of atomics; this educational version uses
// the same atomic.Int64 dance.
func (r *RMutex) tryReenter(self int64) bool {
// Quick check via atomic comparison — owner must be set under
// the lock previously, so a stale read of "matches" is safe
// (only we could have set it). A stale read of "doesn't match"
// is also safe — we just fall through to the slow path.
//
// For brevity the educational version uses a direct read; a
// real one would use atomic.Int64.
if r.owner == self {
r.count++
return true
}
return false
}
func (r *RMutex) Unlock() {
self := gid()
if r.owner != self {
panic("rmutex: Unlock by goroutine that does not own the lock")
}
r.count--
if r.count == 0 {
r.owner = 0
r.mu.Unlock()
}
}
// ----- 200-word explanation for the file -----
//
// Why the Go stdlib does not ship a reentrant Mutex:
//
// 1. GOROUTINE IDENTITY IS DELIBERATELY HIDDEN. The runtime does
// not expose a stable goroutine ID API because goroutine identity
// is an implementation detail. Goroutines come and go; they
// spawn helpers; ownership chains break. Code that depends on
// "this goroutine owns this lock" is fragile by construction.
// Forcing programmers to do their own ID extraction (via
// runtime.Stack parsing — slow, gross, and unsupported) is a
// deliberate friction tax.
//
// 2. REENTRANT LOCKS ENABLE BAD ARCHITECTURE. The classical use case
// is "function A locks, calls function B, which also tries to
// lock". With a normal Mutex this deadlocks — and the deadlock
// is the signal that A and B should be restructured. With a
// reentrant Mutex, the deadlock is silently absorbed, the bad
// structure persists, and bugs accumulate (B may now perform
// long-running work under A's critical section, blocking other
// callers of A). Sutter and Lampson have both written essays on
// "reentrant locks considered harmful"; Go just enforces it.
//
// 3. COST. Tracking owner and count requires extra atomic
// operations per Lock. A normal Mutex's fast path is one CAS;
// a reentrant version is at least three atomic ops (read owner,
// CAS state, increment count). On hot paths this matters.
// rmutex_test.go
package rmutex
import (
"sync"
"testing"
)
func TestReentrantInSameGoroutine(t *testing.T) {
var r RMutex
r.Lock()
r.Lock()
r.Lock()
r.Unlock()
r.Unlock()
r.Unlock()
// Should not deadlock; final Unlock releases.
r.Lock()
r.Unlock()
}
func TestStillBlocksAcrossGoroutines(t *testing.T) {
var r RMutex
r.Lock()
done := make(chan struct{})
go func() {
r.Lock() // Should block — different goroutine.
r.Unlock()
close(done)
}()
select {
case <-done:
t.Fatal("other goroutine acquired lock while we held it")
default:
// OK
}
r.Unlock()
<-done
}
func TestUnlockByNonOwnerPanics(t *testing.T) {
var r RMutex
var caught bool
done := make(chan struct{})
go func() {
defer close(done)
defer func() {
if r := recover(); r != nil {
caught = true
}
}()
var w sync.WaitGroup
w.Add(1)
go func() {
r.Lock()
w.Done()
// Don't Unlock here — leave it locked for our wrong-goroutine Unlock.
}()
w.Wait()
r.Unlock() // Should panic — we don't own it.
}()
<-done
if !caught {
t.Fatal("expected panic on Unlock by non-owner")
}
}
Stretch.
- (a) Replace
runtime.Stackparsing withgetgvialinknameand the runtime package. Show the performance improvement (the parse-stack approach is ~1 microsecond; the runtime peek is ~10 ns). - (b) Build a
TryLock-equivalent variant that returns false instead of blocking if not the owner. Confirm it composes with the reentrance discipline. - (c) Write a real-world case where a reentrant Mutex would be the right answer, then redesign the code to NOT use one and explain which design is better.
Task 20: Compare Go's sync.Mutex to absl::Mutex and tokio::Mutex¶
Goal. Write a 1 000-word comparison of three mutex implementations: Go's sync.Mutex (Go runtime), Google's absl::Mutex (C++), and Tokio's tokio::sync::Mutex (Rust async). Cover: fast-path implementation, fairness policy, async-vs-sync model, performance characteristics. Cite specific files in each project.
Difficulty. Staff.
Skills. Comparative systems analysis, cross-language concurrency primitives, source archaeology.
Steps.
- Read
$(go env GOROOT)/src/sync/mutex.go(already done in Task 4). - Read
abseil-cpp/absl/synchronization/mutex.cc(200k+ lines counting all the docs; the core is ~3k). - Read
tokio/tokio/src/sync/mutex.rs. - Write the comparison covering: fast path, fairness, async story, performance.
Acceptance.
- Comparison correctly identifies each mutex's fast-path (Go: 1 CAS; absl: spin-loop with adaptive backoff; tokio: lock-free queue head check).
- Comparison correctly identifies fairness policies (Go: starvation-mode auto-switch at 1 ms; absl: FIFO by default; tokio: FIFO).
- Comparison covers async-vs-sync (Go: blocks the goroutine, runtime parks; absl: blocks the OS thread; tokio: awaits the future, no thread blocking).
- Concrete file references for each project.
Hint
Each project's mutex is shaped by its runtime model. Go has goroutines and a scheduler — `Mutex.Lock` parks the goroutine via `runtime.semacquire`, the OS thread is free. C++ doesn't have a scheduler — `absl::Mutex::Lock` blocks the OS thread. Tokio has an async executor — `tokio::Mutex::lock` returns a Future, the task is unscheduled while waiting. The fast path differs because each runtime makes different things cheap. Go's `LOCK CMPXCHG` is one instruction; absl uses a spin loop with adaptive backoff (futex on Linux); tokio uses an atomic linked list of waiters.Reference solution
// comparison_notes.go
package comparison
// THREE MUTEXES, THREE WORLDVIEWS
// ===============================
//
// 1. Go's sync.Mutex ($(go env GOROOT)/src/sync/mutex.go)
//
// FAST PATH. One atomic CompareAndSwapInt32 on a packed state word.
// On x86 this is one LOCK CMPXCHG instruction (~10-40 ns depending
// on cache state). The state word packs four pieces of information:
// locked bit, woken bit, starvation bit, waiter count. The state of
// "no one holds, no one waits" is zero — so the fast path is "swap
// 0 for mutexLocked", which is the simplest possible CAS.
//
// CONTENTION. On miss, fall through to lockSlow (intentionally
// outlined so Lock fits in the inliner budget). lockSlow does
// bounded spinning via runtime.sync_runtime_canSpin — only spins
// when on multicore, GOMAXPROCS > 1, runqueue empty, etc. If
// spinning fails, calls runtime.semacquire which parks the
// goroutine on a semaphore via the runtime sudog list. The OS
// thread (P) is free to run other goroutines — that is the
// fundamental Go advantage: blocked goroutines do not cost OS
// threads.
//
// FAIRNESS. Two-mode design (Task 18). Normal mode prefers
// throughput (new arrivals can win); starvation mode prefers
// latency (head-of-line waiter gets direct hand-off). Threshold:
// 1 ms of head-waiter waiting time. This auto-adaptation is
// unique to Go among the three.
//
// COST. Uncontended Lock: ~10-20 ns. Contended with single waker:
// ~200-400 ns (one park + one wake). Highly contended starvation:
// bounded at ~1 ms + scheduling delay per waiter.
//
//
// 2. absl::Mutex (abseil-cpp/absl/synchronization/mutex.cc)
//
// FAST PATH. Atomic CAS on a packed word (similar structure to
// Go's). On x86 also LOCK CMPXCHG. The state word is more
// elaborate — absl tracks reader counts, writer waiters, condition
// variables, debug info — but the uncontended path is one CAS.
//
// CONTENTION. Calls Mu->LockSlow. Spin-with-PAUSE loop on x86,
// using adaptive backoff based on historical contention (the
// PerThreadSynch struct tracks per-thread waiting history). If
// spinning fails, calls into the kernel via futex(WAIT). The OS
// THREAD blocks — C++ has no userland scheduler, so a blocked
// thread costs a kernel-managed task slot.
//
// FAIRNESS. FIFO by default (absl::Mutex inherits Google's "no
// surprises" philosophy — predictable behaviour over peak
// throughput). The PerThreadSynch list is maintained in arrival
// order. No starvation mode equivalent because FIFO already
// bounds the worst case.
//
// PRIORITY INHERITANCE. absl supports priority inheritance for
// real-time threads (PI futex on Linux). Go does not.
//
// COST. Uncontended Lock: ~15-30 ns (slightly higher than Go due
// to richer state word). Contended: ~1-5 microseconds (futex syscall +
// kernel scheduling). The cost of blocking is HIGHER than Go's
// because the OS thread blocks — but absl programs don't typically
// run as many threads as Go programs run goroutines, so absolute
// thread-block count is lower.
//
//
// 3. tokio::sync::Mutex (tokio/tokio/src/sync/mutex.rs)
//
// FAST PATH. Atomic CAS on a state word (state: u32). Tokio's
// Mutex has a much simpler state model than Go or absl because
// it never needs to interact with OS threads — it's pure userland
// futures. The uncontended path: one CAS via the embedded
// `Semaphore::try_acquire`.
//
// CONTENTION. The await point is the key difference. lock() returns
// a Future. If the lock is unavailable, the Future is "pending" —
// the calling task is scheduled OFF the executor, registered in
// the wait queue, and the executor runs other tasks. When the
// lock is released, the wait queue is walked and the head waker
// is invoked, which re-schedules the Future. The OS THREAD never
// blocks at any point — it's running other tasks.
//
// FAIRNESS. FIFO via the linked-list wait queue. Like absl, no
// starvation mode — FIFO already bounds waiting time.
//
// ASYNC INTEGRATION. This is the unique tokio property: lock() is
// a future, not a function. Composable with select!, timeout, and
// cancellation. If the caller drops the future, the Mutex
// gracefully removes the waiter from the queue — no resource leak.
// Go cannot do this; once a goroutine is blocked in semacquire,
// there is no way to "give up" except via a separate goroutine
// (Task 17's WaitCtx pattern).
//
// COST. Uncontended lock(): ~5-10 ns (smaller state word; less to
// CAS). Contended: ~50-200 ns (no syscall — pure userland
// task-park). The contended cost is LOWER than Go's because the
// runtime doesn't touch the kernel. The trade-off: every lock()
// call returns a Future that allocates on the heap (or on a
// tokio-internal slab); the call site is `mutex.lock().await`,
// which has compile-time overhead Go doesn't.
//
//
// SUMMARY TABLE
//
// Property | Go | absl | tokio
// ----------------------|-----------------|-----------------|------------------
// Fast path | 1 CAS | 1 CAS | 1 CAS
// Block model | goroutine park | OS thread block | task park
// Fairness | auto-adapt | FIFO | FIFO
// Async | no | no | yes (Future)
// Priority inheritance | no | yes (Linux) | no
// Uncontended cost | ~15 ns | ~20 ns | ~10 ns
// Contended cost | ~300 ns | ~3 microseconds | ~100 ns
// Source file | sync/mutex.go | mutex.cc | sync/mutex.rs
// Lines of source | ~250 | ~3000 | ~700
//
// THE BIG IDEA
//
// Each mutex is shaped by its runtime's fundamental constraint:
// - Go's runtime owns the scheduler, so blocking is cheap and the
// mutex can implement fancy fairness logic in userland.
// - absl can't change the kernel scheduler, so it minimises kernel
// round-trips with spinning and futex.
// - tokio sits BETWEEN async tasks and the OS, so it never blocks
// threads and integrates with the future ecosystem.
//
// If you ported any of these mutexes directly to another runtime,
// they would feel wrong. The Mutex is not a primitive; it is a
// negotiation between the language, the runtime, and the kernel.
// The interesting question isn't "which is best?" — it's "which
// runtime constraints make which design optimal?".
Stretch.
- (a) Add a fifth comparison: Java's
ReentrantLock(AQS-based). Pay attention to the fair/unfair mode toggle and why Go made the auto-adaptive choice instead. - (b) Build a micro-benchmark that runs the SAME workload through Go's
sync.Mutex, absl::Mutex, and tokio::Mutex (you'll need three small programs). Plot the latency distribution at different contention levels. - (c) Read Bradford Lampson's 1980 paper "Experience with Processes and Monitors in Mesa". Note that every modern mutex design is still negotiating the same trade-offs Lampson identified 45 years ago.
How to grade yourself¶
Score each task 0 (didn't try), 1 (got it with hints), 2 (got it unaided), 3 (got it AND extended it with a stretch goal that produced new learning). Sum:
| Score | What it means |
|---|---|
| 0–15 | You can call sync.Mutex.Lock but haven't internalised the difference between locking and synchronising. Re-read junior.md, redo Tasks 1–4. Build muscle memory for defer Unlock, Add before go, and Once.Do for lazy init. |
| 16–30 | You can pick the right primitive and write race-free code. Tasks 5–10 stretch that into benchmarking discipline and bug intuition (Task 10 in particular is the difference between "I write goroutines" and "I debug other people's goroutines"). |
| 31–45 | Senior-level fluency. You can profile mutex contention, shard locks, and reason about config hot-reload and singleflight. Tasks 11–17 are the daily-driver patterns of every well-architected Go service. |
| 46–60 | Staff-level. You have read the stdlib source and can explain why the design choices are what they are. Tasks 18–20 are the line between "uses the package" and "could have written it". |
Concrete checks worth running before declaring done:
go test -race ./...clean on every task. Race detector is non-negotiable for any code that touchessync.- For Task 4 / 13 / 18 / 20 — the source-reading tasks — verify your quoted lines exist in the file. Open
$(go env GOROOT)/src/sync/in your editor and grep for the constant names you cited. - For Tasks 5 / 6 / 11 / 16 — the benchmark tasks — re-run with
-count=10and-benchtime=3sto dampen variance. Single-run benchmark numbers lie. - For Task 15 (singleflight) — write a test where
fnpanics. The reference implementation will let the panic propagate to all waiters; the stdlib'sx/sync/singleflightis documented to do the same. Confirm. - For Task 19 (reentrant mutex) — verify the cross-goroutine block still works. The easiest bug to make is to forget that "owner check" must be atomic; under contention, you'll occasionally see two goroutines both think they own the lock.
The most important question isn't "did you finish" — it's "can you predict, before writing the code, which primitive will perform best for a given workload?" Mutex for ad-hoc shared state. RWMutex when reads outnumber writes 8:1 and critical sections are >= 100 ns. sync.Map for read-dominant with small hot set. atomic.Pointer for read-mostly immutable config. Sharded mutex for write-heavy with hashable keys. Singleflight for de-duping concurrent fetches. Each has a workload profile where it wins; the rest of the time, the simpler choice is better.
Stretch challenges¶
S1 — Userland futex. Build a userland wait/notify primitive on top of runtime.Gopark/runtime.Goready via go:linkname. Compare against sync.Cond for the same producer-consumer workload. The exercise teaches you what sync.Cond actually does under the hood — Wait calls runtime_notifyListWait, which calls goparkunlock, which yields the goroutine to the scheduler. Reproducing this from outside the runtime is gross but illuminating; you learn the boundary between sync (userland API) and runtime (where the real work happens).
S2 — Lock-free LRU. Build an LRU cache where reads update the recency without taking a lock. Combine atomic.Pointer for the value table with an approximate-LRU mechanism (CLOCK algorithm or 2Q) so that recency updates are atomic increments instead of list manipulations. Compare to groupcache/lru (mutex-protected) at 1 000 concurrent readers — your lock-free version should win by ≥ 5× on reads. The hard part: eviction must be correct under concurrent reads, even when the eviction-decision uses stale recency data.
S3 — Hierarchical mutex. Build a mutex that, when held, automatically updates a per-Goroutine "lock hierarchy" tracking which locks each goroutine holds. On any Lock call, check that the hierarchy is acyclic — refuse to acquire a lock that would create a cycle (deadlock waiting to happen). The implementation is non-trivial because you need goroutine identity (Task 19's gross approach) and a thread-local lock stack. The pay-off is real: pre-deadlock detection at acquire-time means you find lock-ordering bugs in TESTS, not in production. Compare with what tsan (ThreadSanitizer in C++) does — that tool has had the same feature for 15 years; the lesson is that good tooling beats good discipline.