Race Detection — Professional Level¶
Table of Contents¶
- Introduction
- Production Case Study: Hidden Race in a Cache
- Production Case Study: Atomic Misuse in Config Reload
- Production Case Study: Sharded Counter Migration
- Race-Free Design Patterns
- sync.Map vs map+Mutex
- Lock-Free Queues in Practice
- CI Discipline at Scale
- Stress Testing
- Migration: Mutex → Atomic
- Diagnosing Production Races
- Cheat Sheet
- Summary
Introduction¶
Professional race work happens at the system level: designing data structures so races are impossible, integrating -race into a multi-stage CI, sharding hot counters, and diagnosing production-only race symptoms. This file is case-study driven.
Production Case Study: Hidden Race in a Cache¶
A team had a cache with a "lazy invalidate" optimisation:
type Cache struct {
    mu      sync.RWMutex
    data    map[string]Entry
    invalid bool
}

func (c *Cache) Get(k string) (Entry, bool) {
    if c.invalid { // intentionally racy "fast check"
        c.refresh()
    }
    c.mu.RLock()
    defer c.mu.RUnlock()
    e, ok := c.data[k]
    return e, ok
}
The author intentionally read invalid without the lock to avoid contention, planning that a stale read would just trigger an unnecessary refresh. Reasonable, except:
- The Go memory model gives no visibility guarantee on invalid without synchronisation.
- The c.invalid read could be reordered with respect to c.refresh, breaking the assumption.
- CPUs can cache invalid indefinitely; the read may never see writes.
The race detector flagged it on the first CI run. The fix: atomic.Bool for invalid, with Load in the fast path and Store(true) from invalidators.
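A minimal sketch of the fix, reusing the case-study names (the Invalidate entry point is an assumption, and refresh is assumed to rebuild data and clear the flag):

type Cache struct {
    mu      sync.RWMutex
    data    map[string]Entry
    invalid atomic.Bool
}

func (c *Cache) Get(k string) (Entry, bool) {
    if c.invalid.Load() { // synchronised fast check, still lock-free
        c.refresh() // assumed to rebuild data and call c.invalid.Store(false)
    }
    c.mu.RLock()
    defer c.mu.RUnlock()
    e, ok := c.data[k]
    return e, ok
}

// Invalidate is a hypothetical entry point for the invalidators.
func (c *Cache) Invalidate() { c.invalid.Store(true) }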
Lesson: "I know this is racy but it's fine" is almost always wrong. The detector exists because human intuition about concurrency is unreliable.
Production Case Study: Atomic Misuse in Config Reload¶
A service had:
var cfg *Config
var cfgVer int64
func Reload(c *Config) {
    cfg = c
    atomic.AddInt64(&cfgVer, 1)
}
func Get() *Config { return cfg }
Intent: increment version atomically, swap config. Race: cfg = c is a non-atomic write to a pointer; concurrent readers may see a torn value or stale pointer.
Worse: even if cfg = c were atomic, the version increment is after the assignment. Without a memory barrier, a reader could see the old cfg and the new cfgVer, breaking any version-based cache invalidation.
Fix: atomic.Pointer[Config]:
var cfg atomic.Pointer[Config]
func Reload(c *Config) { cfg.Store(c) }
func Get() *Config { return cfg.Load() }
The atomic Store is sequentially consistent. Readers see a complete pointer. No version counter needed; the pointer itself is the version.
Production Case Study: Sharded Counter Migration¶
A metrics library exported per-request counters. Profiling showed atomic.AddInt64 on a single hot counter was ~30% of CPU during peak. Cache-line contention.
Solution: a 16-shard counter.
type shard struct {
    n atomic.Int64
    _ [56]byte // pad each shard to its own 64-byte cache line
}

type Counter struct {
    shards [16]shard
}

func (c *Counter) Add(delta int64) {
    // goroutineID() is a stand-in: Go does not expose goroutine IDs, so
    // real code spreads writers with a per-goroutine hash or a cheap
    // pseudo-random shard index.
    idx := goroutineID() & 15
    c.shards[idx].n.Add(delta)
}

func (c *Counter) Value() int64 {
    var sum int64
    for i := range c.shards {
        sum += c.shards[i].n.Load()
    }
    return sum
}
Before: 6M ops/sec, 30% CPU spent in atomic.Add. After: 80M ops/sec, ~5% CPU spent in atomic.Add.
Value() now does 16 loads instead of one, but it is called rarely (every 10s for a Prometheus scrape).

Critical detail: the padding. atomic.Int64 is 8 bytes, so without per-shard padding eight shards pack into a single 64-byte cache line and the contention returns. Pad each shard to a full cache line (64 bytes on x86; golang.org/x/sys/cpu.CacheLinePad gives the size portably) or measure on your target hardware.
Race-Free Design Patterns¶
Confinement¶
Don't share. A goroutine owns its data; values flow through channels. The channel send creates a happens-before edge, transferring ownership.
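A minimal confinement sketch (names are illustrative): one owner goroutine holds the map, and every access flows through channels, so there is no lock to forget.

package main

import "fmt"

type getReq struct {
    key   string
    reply chan int
}

// owner is the only goroutine that ever touches counts.
func owner(incs <-chan string, gets <-chan getReq) {
    counts := make(map[string]int)
    for {
        select {
        case k := <-incs:
            counts[k]++
        case q := <-gets:
            q.reply <- counts[q.key]
        }
    }
}

func main() {
    incs := make(chan string)
    gets := make(chan getReq)
    go owner(incs, gets)

    incs <- "hits"
    incs <- "hits"

    reply := make(chan int)
    gets <- getReq{key: "hits", reply: reply}
    fmt.Println(<-reply) // 2
}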
Immutable snapshots¶
A struct, once published via atomic.Pointer, is never modified. Updaters create a new struct and swap the pointer. Readers are lock-free.
Per-goroutine state¶
Each goroutine has its own scratch space (e.g., a per-worker buffer). Aggregate at the end via channels.
Single-writer principle¶
At most one goroutine writes to a given location. Multiple readers are fine.
Sharding¶
Distribute hot writes across N independent locations.
Copy-on-write maps¶
Treat the map as immutable. Updaters create a new map with the change and atomic.Pointer swap.
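A minimal copy-on-write sketch (type and method names are illustrative): readers take a lock-free snapshot; writers serialise on a mutex, clone, and publish.

package cowmap

import (
    "sync"
    "sync/atomic"
)

type COWMap struct {
    mu  sync.Mutex                     // serialises writers only
    ptr atomic.Pointer[map[string]int] // current immutable snapshot
}

func New() *COWMap {
    c := &COWMap{}
    m := make(map[string]int)
    c.ptr.Store(&m)
    return c
}

func (c *COWMap) Get(k string) (int, bool) {
    m := *c.ptr.Load() // lock-free snapshot read
    v, ok := m[k]
    return v, ok
}

func (c *COWMap) Set(k string, v int) {
    c.mu.Lock()
    defer c.mu.Unlock()
    old := *c.ptr.Load()
    next := make(map[string]int, len(old)+1)
    for key, val := range old { // clone the snapshot
        next[key] = val
    }
    next[k] = v
    c.ptr.Store(&next) // publish atomically; readers switch over
}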
These patterns avoid races by construction. The detector's job becomes finding the rare case where the design was violated.
sync.Map vs map+Mutex¶
sync.Map is optimised for two specific access patterns:
- Each key is written once and read many times.
- Multiple goroutines read, write, and overwrite disjoint sets of keys.
For these patterns it is faster than map+RWMutex because it avoids contention on a single mutex.
For other patterns (uniform read/write mix, frequent overwrite of the same keys), it is slower because of internal copying.
Benchmarks tell the truth; a sketch follows the decision tree below. The library docs are honest about its niche; do not use sync.Map as a default.
A simple decision tree:
| Pattern | Choice |
|---|---|
| Read-mostly, occasional write | sync.Map or atomic.Pointer[map] snapshot |
| Balanced read/write | map + sync.RWMutex |
| Write-heavy, single-writer | map + sync.Mutex |
| Append-only by key | shard then map + Mutex per shard |
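A minimal benchmark sketch for the read-mostly row (package name, key count, and the ~1% write ratio are assumptions; run with go test -bench=.):

package mapbench

import (
    "strconv"
    "sync"
    "testing"
)

func BenchmarkSyncMapReadMostly(b *testing.B) {
    var m sync.Map
    for i := 0; i < 1024; i++ {
        m.Store(strconv.Itoa(i), i)
    }
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        i := 0
        for pb.Next() {
            key := strconv.Itoa(i % 1024)
            if i%100 == 0 { // ~1% writes
                m.Store(key, i)
            } else {
                m.Load(key)
            }
            i++
        }
    })
}

func BenchmarkRWMutexMapReadMostly(b *testing.B) {
    var mu sync.RWMutex
    m := make(map[string]int, 1024)
    for i := 0; i < 1024; i++ {
        m[strconv.Itoa(i)] = i
    }
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        i := 0
        for pb.Next() {
            key := strconv.Itoa(i % 1024)
            if i%100 == 0 {
                mu.Lock()
                m[key] = i
                mu.Unlock()
            } else {
                mu.RLock()
                _ = m[key]
                mu.RUnlock()
            }
            i++
        }
    })
}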
Lock-Free Queues in Practice¶
True lock-free queues (e.g., Michael-Scott) are rarely necessary in Go because channels handle most cases at high throughput. When you do need them:
- Use a battle-tested library: github.com/Workiva/go-datastructures/queue or github.com/yireyun/go-queue.
- Verify under -race and stress tests.
- Profile under realistic load; lock-free isn't always faster than mutex-based at moderate contention.
A reasonable rule: prefer channels until profiling proves them inadequate. Then prefer a vetted library over hand-rolled CAS code.
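For reference, the default choice needs no library at all. A minimal sketch (capacity and worker count are arbitrary): a buffered channel already behaves as a bounded, race-free multi-producer multi-consumer queue.

package main

import (
    "fmt"
    "sync"
)

func main() {
    queue := make(chan int, 1024) // bounded MPMC queue

    var wg sync.WaitGroup
    for w := 0; w < 4; w++ { // four consumers
        wg.Add(1)
        go func() {
            defer wg.Done()
            for v := range queue {
                _ = v // process the item
            }
        }()
    }

    for i := 0; i < 10000; i++ { // single producer
        queue <- i
    }
    close(queue)
    wg.Wait()
    fmt.Println("drained")
}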
CI Discipline at Scale¶
Large teams often see CI race tests as a chore. Treat them as a first-class deliverable:
- Required PR check: cannot merge if go test -race ./... fails.
- Per-package timeouts: don't let a stuck test hide a race detector report.
- Matrix CPU: -cpu=1,4,8 catches scheduling-dependent races.
- Iteration count: -count=3 per PR; -count=200 nightly.
- Quarantine queue: known-flaky tests are quarantined and triaged within a week.
- Race budget: a team owns each package; race reports are bugs assigned automatically.
Integrate with staticcheck, goleak, errcheck for a wider net.
Stress Testing¶
A stress test:
- Runs the suspect code path under -race.
- Iterates many times (-count=1000).
- Varies CPU count (-cpu=1,4,8).
- Uses random schedulers if possible.
- Logs goroutine state on failure.
Example:
func TestCacheStress(t *testing.T) {
    if testing.Short() {
        t.Skip()
    }
    c := NewCache()
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for j := 0; j < 10000; j++ {
                key := fmt.Sprintf("k%d", j%100)
                if j%2 == 0 {
                    c.Set(key, j)
                } else {
                    c.Get(key)
                }
            }
        }(i)
    }
    wg.Wait()
}
Run: go test -race -count=20 -cpu=1,4,8 -run TestCacheStress.
Stress tests are slower; guard them with testing.Short() so quick CI (running go test -short) skips them while nightly CI runs them.
Migration: Mutex → Atomic¶
A common refactor: replace a sync.Mutex around an int with an atomic.Int64.
Before:
type Stats struct {
    mu    sync.Mutex
    count int
}

func (s *Stats) Inc() { s.mu.Lock(); s.count++; s.mu.Unlock() }
func (s *Stats) Get() int { s.mu.Lock(); defer s.mu.Unlock(); return s.count }
After:
type Stats struct {
    count atomic.Int64
}

func (s *Stats) Inc() { s.count.Add(1) }
func (s *Stats) Get() int64 { return s.count.Load() }
Faster, simpler, less contention. But:
- Only works for single-field updates.
- If Stats later gains a second field that must be consistent with count, you cannot atomically update both with atomics; revert to a mutex.
A common pitfall: starting with atomic, growing the struct, forgetting to revert. The detector catches the inconsistency once the second field is added; trust the detector.
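A hypothetical sketch of that growth path (sum is an invented second field; the sync import is assumed): once two fields must stay consistent, the struct reverts to a mutex.

type Stats struct {
    mu    sync.Mutex
    count int64
    sum   int64 // hypothetical second field that must move with count
}

func (s *Stats) Record(v int64) {
    s.mu.Lock()
    s.count++
    s.sum += v
    s.mu.Unlock()
}

func (s *Stats) Mean() float64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.count == 0 {
        return 0
    }
    return float64(s.sum) / float64(s.count)
}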
Diagnosing Production Races¶
A race that does not appear under -race but causes intermittent production bugs is the worst case. Steps:
- Reproduce in test. Collect the suspect call paths.
- Add -race to a staging run. Some races appear only under load.
- Add -race to a stress test specifically targeting the call path.
- Inspect with go tool trace to see goroutine interactions (a capture sketch follows this list).
- Use pprof to see lock contention on suspect mutexes.
- Review code with the formal memory model in mind. Identify every shared variable and the happens-before edge protecting it.
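One way to feed go tool trace, sketched as a test wrapper (file and test names are hypothetical): capture an execution trace around the suspect path, then open it with go tool trace trace.out.

package cache

import (
    "os"
    "runtime/trace"
    "testing"
)

func TestSuspectPathTrace(t *testing.T) {
    f, err := os.Create("trace.out")
    if err != nil {
        t.Fatal(err)
    }
    defer f.Close()
    if err := trace.Start(f); err != nil {
        t.Fatal(err)
    }
    defer trace.Stop()

    // ... drive the suspect call path here ...
}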
Production races usually come from implicit sharing: a closure captures a variable, a global reuses memory, a logger formats a struct field while another goroutine writes it.
Audit for these systematically. A senior engineer's checklist (a minimal example of the slice pitfall follows the list):
- Loop variables captured by goroutines.
- Globals written after init.
- Map fields in shared structs.
- Slices passed by value (header is copied; backing array is shared).
- Loggers, metrics, and tracers reading struct fields.
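A minimal sketch of the slice pitfall from the checklist (names are illustrative): the slice header is copied, but both copies share the same backing array, so the goroutine's write races with the read.

package main

import (
    "fmt"
    "sync"
)

func main() {
    data := []int{1, 2, 3}

    var wg sync.WaitGroup
    wg.Add(1)
    go func(d []int) { // d copies the slice header, not the array
        defer wg.Done()
        d[0] = 99 // writes to the shared backing array
    }(data)

    fmt.Println(data[0]) // unsynchronised read: -race flags this pair
    wg.Wait()
}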
Cheat Sheet¶
| Symptom | Likely cause | Fix |
|---|---|---|
| Counter wrong | Non-atomic increment | atomic.Add |
| Map panic | Concurrent map read/write | RWMutex or sync.Map |
| Stale config | Plain pointer assignment | atomic.Pointer[T] |
| Hot atomic dominates CPU | Cache-line contention | Sharded atomic |
| Loop-var bug | Captured i in closure (pre-Go 1.22 semantics) | i := i shadow |
CI matrix: -race -count=3 -cpu=1,4,8 per PR; -count=200 nightly.
Production audits: run -race in staging, go tool trace for hard cases.
Summary¶
Professional race work is system design plus operational discipline. Build data structures that are race-free by construction (atomic snapshots, per-goroutine state, sharded counters). Integrate -race into a CI matrix that varies CPU and iteration count. Migrate from mutex to atomic deliberately, with awareness that one-field atomics are not a general substitute. Diagnose production-only races by combining -race in staging, pprof, go tool trace, and structured audits of every shared variable. The detector is necessary; the design is sufficient.