Race Detector Deep Dive — Middle Level¶
Table of Contents¶
- Introduction
- Running -race Effectively
- The GORACE Environment Variable
- Race-Only Build Tags
- Race Detector and the Test Cache
- CI Integration Patterns
- Stress Testing Under -race
- Combining -race with -timeout for Deadlocks
- Reading Race Reports at Scale
- Mapping Reports to Synchronisation Bugs
- Race Detector and cgo
- Race Detector and unsafe
- Common Mistakes at Middle Level
- Self-Assessment
- Summary
Introduction¶
At junior level you learned to enable -race, read a report, and fix the obvious cases. At middle level you treat -race as a permanent member of your build and test infrastructure. The shift is from "I ran -race once" to "the entire team's PR pipeline gates on -race every push, with stress tests, halt-on-error, and short feedback loops."
After this file you will:
- Configure -race correctly across go test, go build, and go run for every workflow.
- Tune GORACE for CI vs local development.
- Combine -race with -timeout and -count=N for stress and stability runs.
- Add race-only build tags for assertion code that runs only when -race is on.
- Read a race report and immediately classify the synchronisation gap (missing mutex, missing channel edge, missing atomic).
- Set up a CI job in GitHub Actions, GitLab CI, and CircleCI with race detection plus log archival.
- Stress-test code paths under -race to surface races that single test runs hide.
- Know when -race produces an apparent false positive and what causes it.
This file does not cover TSan internals (professional level) or scheduler-aware tricks for pinning down rare races (senior level). It is the practical toolbox.
Running -race Effectively¶
The four canonical invocations¶
# Run the whole test suite with race detection
go test -race -count=1 -timeout 120s ./...
# Run a specific package
go test -race -count=1 ./internal/queue/
# Run a specific test
go test -race -count=1 -run TestQueue_Enqueue ./internal/queue/
# Run a program under the detector
go run -race ./cmd/server
Always pair with -count=1¶
go test caches results by package. If the cached run passed, the test is not re-executed. -count=1 forces a fresh run. This matters under -race because:
- A flaky race that passed once will not re-run from cache.
- Schedule-sensitive bugs need fresh entropy each invocation.
- CI logs that show "PASS (cached)" are misleading.
The standard team command is go test -race -count=1 -timeout 120s ./... Internalise it. Type it without thinking.
-race on go install¶
go install -race ./cmd/server produces a binary in $GOBIN with detection on. Useful for end-to-end manual testing of a local server. The resulting binary is slower and bigger; do not ship it.
-race on go build¶
Near-identical effect, except the binary lands in the current directory (or wherever -o points) instead of $GOBIN. Useful when you want to run a long-lived process under -race in a development environment.
The GORACE Environment Variable¶
GORACE is a space-separated list of key=value settings that tune the runtime detector. Set it before running the binary or test:
| Key | Default | Meaning |
|---|---|---|
| halt_on_error | 0 | Exit the process the moment the first race fires, instead of continuing to report more. |
| history_size | 1 | Size of the per-goroutine history buffer, 0..7. Each step doubles it. Bigger = more memory but better stack traces for old accesses. |
| log_path | "" | If set, append reports to `<path>.<pid>`. If empty, print to stderr. |
| exitcode | 66 | Exit code on race detection. Override only for tooling reasons. |
| strip_path_prefix | "" | Trim a path prefix from all source paths in reports. Useful for vendor or container paths. |
| atexit_sleep_ms | 1000 | How long to wait at process exit for in-flight reports to flush. |
CI configuration¶
In CI, the standard combination is GORACE="halt_on_error=1 history_size=2".
- halt_on_error=1 makes the log easy to find: the report is the last thing printed.
- history_size=2 is a small bump from the default with negligible cost.
Local debugging¶
When you have a specific race and want richer traces, raise history_size and redirect reports to a file:
GORACE="history_size=7 log_path=./race.log" go test -race -count=10 -run TestName ./...
This logs to ./race.log.PID and runs the test 10 times to increase the chance of hitting the race.
Race-Only Build Tags¶
Some helper code should only compile when the race detector is on. Use the race build tag:
//go:build race
package mypkg
// raceAssert is compiled only under -race.
func raceAssert(cond bool, msg string) {
if !cond {
panic("race assertion: " + msg)
}
}
Provide a stub for non-race builds:
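A minimal no-op twin under the inverse tag (assuming the same package and signature as the race build):

```go
//go:build !race

package mypkg

// raceAssert is a no-op when the race detector is off. The empty body
// is inlined away; keep assertion arguments free of side effects.
func raceAssert(cond bool, msg string) {}
```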
Call from production code:
func (q *Queue) Enqueue(v int) {
	q.mu.Lock()
	raceAssert(q.buf != nil, "Enqueue on uninitialised queue")
	q.buf = append(q.buf, v)
	q.mu.Unlock()
}
Under -race, you get cheap runtime checks for invariants. Without -race, the empty stub is inlined away; keep assertion arguments side-effect free so that nothing of the call survives in non-race builds.
A common race-only pattern: load-bearing assertion¶
//go:build race
func init() {
runtime.SetBlockProfileRate(1) // collect blocking events for diagnosis
}
Or for a package that needs to detect single-thread-access invariants:
Use these sparingly. Race-only logic must never change observable behaviour.
Race Detector and the Test Cache¶
go test caches results keyed by:
- Source hash.
- Build tags.
- Environment variables.
- Test arguments.
The -race flag affects the build tag set (race is in the tag list under -race), so a non-race run and a race run cache separately. Good. But within either mode, a re-run is cached.
How -count=1 defeats the cache¶
The flag -count is documented as "run each test N times." Passing -count explicitly, with any value, bypasses the cache; -count=1 is the convention because it runs each test exactly once while still forcing a fresh execution.
# This may print PASS (cached) on the second run:
go test -race ./...
# This always runs fresh:
go test -race -count=1 ./...
When you want repetition¶
For genuine repeated runs to expose flakiness:
go test -race -count=10 -run TestFlaky ./...
This runs TestFlaky ten times. If the race fires on average once per five runs, ten runs give good odds of catching it. Combine with -failfast to stop on the first failure:
go test -race -count=10 -failfast -run TestFlaky ./...
CI Integration Patterns¶
GitHub Actions¶
name: race
on: [pull_request, push]
jobs:
  race:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      - name: Run tests with race detector
        env:
          GORACE: "halt_on_error=1 history_size=2"
        run: go test -race -count=1 -timeout 5m ./...
GitLab CI¶
race:
  image: golang:1.22
  variables:
    GORACE: "halt_on_error=1 history_size=2"
  script:
    - go test -race -count=1 -timeout 5m ./...
CircleCI¶
version: 2.1
jobs:
  race:
    docker:
      - image: cimg/go:1.22
    environment:
      GORACE: "halt_on_error=1 history_size=2"
    steps:
      - checkout
      - run: go test -race -count=1 -timeout 5m ./...
Two-job pattern¶
Keep a fast test job (no -race) and a slower race job in parallel:
- test runs in 30 seconds and gives quick feedback for typos.
- race runs in 3 minutes and gives a thorough check.
Both must pass to merge.
Archiving race logs¶
When a race fires in CI, you usually want the report stored as an artifact. Configure GORACE="log_path=race-report" and upload race-report.* files at the end of the job:
- name: Run race tests
  env:
    GORACE: "halt_on_error=1 log_path=race-report"
  run: go test -race -count=1 ./...
- name: Upload race reports
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: race-reports
    path: race-report.*
Race in nightly stress¶
A second CI job runs every night with high repetition:
on:
  schedule:
    - cron: '0 3 * * *'
jobs:
  nightly-race:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: go test -race -count=50 -run TestConcurrent ./...
Catches races that single-run jobs miss.
Stress Testing Under -race¶
A race that fires once in a hundred runs is undetectable in normal CI. Stress tests fix this. The simplest pattern:
package queue

import (
	"sync"
	"testing"
)

func TestQueue_Stress(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping stress test in short mode")
	}
	q := NewQueue()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				q.Enqueue(j)
			}
		}()
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				q.Dequeue()
			}
		}()
	}
	wg.Wait()
}
Run with -race:
go test -race -count=1 -run TestQueue_Stress ./...
200 goroutines, 100,000 enqueues racing 100,000 dequeues, fully race-instrumented. Any data race shows up. Pair with -count=10 for extra coverage.
Stress matrix¶
| Variable | Suggested range |
|---|---|
| Number of producers | 1, 4, 16, 64 |
| Number of consumers | 1, 4, 16, 64 |
| Operations per goroutine | 100, 1000, 10000 |
| GOMAXPROCS | 1, 2, num_cpu |
Vary one dimension at a time. Some races appear only with GOMAXPROCS=1 (cooperative scheduling), others only with GOMAXPROCS=N (true parallelism).
Combining -race with -timeout for Deadlocks¶
-race does not catch deadlocks. Use -timeout instead. The default is 10 minutes; for unit tests, set it shorter:
go test -race -count=1 -timeout 30s ./...
If a goroutine deadlocks, the test runner kills the process after the 30 seconds elapse and prints a goroutine dump showing exactly where every goroutine is parked. That dump is your debug output.
A typical mixed test:
func TestDoesNotDeadlock(t *testing.T) {
done := make(chan struct{})
go func() {
ProcessRequest()
close(done)
}()
select {
case <-done:
// success
case <-time.After(2 * time.Second):
t.Fatal("ProcessRequest hung")
}
}
A timeout inside the test itself, plus the outer -timeout, gives two layers of deadlock protection.
Reading Race Reports at Scale¶
A single report is easy. When ten reports fire in one CI run, you need a strategy.
Step 1: Halt on first¶
GORACE=halt_on_error=1 reduces the log to one race. Often that one race is the root cause of the others.
Step 2: Group by site¶
Multiple reports for the same memory address from the same call sites are duplicates. Look for unique (Read at, Write at) pairs.
Step 3: Check the goroutine creator¶
If both racing goroutines come from the same creator, the race is between two instances of the same code. If they come from different creators, the race spans two subsystems.
Step 4: Look at the address¶
The hex address is a heap location. If multiple reports share the same address, they are racing on the same object. Different addresses with the same call sites mean the object is being created repeatedly and each instance has the race (i.e., the race is in the type, not in one instance).
Mapping Reports to Synchronisation Bugs¶
The most common report patterns and their fixes.
Pattern 1: One reader, one writer, no sync¶
Fix: add a sync.Mutex or use a channel. Both accesses must use the same primitive.
Pattern 2: Two writers, no sync¶
Fix: lock both writers. If the writes are independent counters, consider sync/atomic.
Pattern 3: Read with mutex, write without¶
Read at ... by goroutine 8:
q.mu.Lock(); q.buf[i]; q.mu.Unlock()
Previous write at ... by goroutine 7:
q.buf = append(q.buf, x) // <-- no lock
Fix: lock the writer too. Both sides must respect the invariant.
Pattern 4: Channel close vs send¶
Fix: never send on a channel that may be closed. The convention is "only the sender closes." Restructure so close is signaled by a separate mechanism (e.g., context.Done) and the send checks before sending.
Pattern 5: Map without lock¶
Fix: lock all map operations or use sync.Map. Go's runtime also panics with "concurrent map writes" sometimes — same root cause.
Pattern 6: Captured loop variable¶
Fix: pass i as a parameter to the goroutine, or upgrade to Go 1.22+ where loop variables are per-iteration.
Race Detector and cgo¶
-race works with cgo on supported platforms, with caveats:
- Memory accessed only from C code is invisible to TSan.
- Memory passed from Go to C is tracked as a single access at the boundary; TSan does not see what C does to it.
- If C code mutates Go-allocated memory concurrently with Go code, the race is real but the report may be confusing.
Practical advice¶
- For libraries that wrap a C dependency, write a Go-side mutex around all calls into C that touch shared state.
- Avoid passing Go slices or maps into C and modifying them concurrently.
- Test C-heavy libraries with -race and inspect the reports carefully; expect occasional confused output.
Race Detector and unsafe¶
unsafe.Pointer casts and uintptr arithmetic can confuse TSan. Common pitfalls:
- Reinterpreting a byte slice as an int64 slice: TSan sees the access at the byte level, but the size mismatch can produce odd reports.
- Holding uintptr references: not visible to TSan; you lose tracking for those accesses.
- Using unsafe.Slice (Go 1.17+) or unsafe.String (Go 1.20+): TSan understands these.
If a race report looks impossible, check whether the offending memory is touched via unsafe. The fix is usually to remove the unsafe code, not the report.
Common Mistakes at Middle Level¶
Mistake 1: Skipping -count=1 in CI¶
Cached PASS hides flakes. Always force a fresh run.
Mistake 2: Configuring halt_on_error=0 and drowning in noise¶
If twenty races all stem from one root cause, twenty pages of report help no one. halt_on_error=1 is the right default for CI.
Mistake 3: Running -race and -bench together¶
Benchmarks become 5–15x slower, results are useless. Run benchmarks separately, without -race.
Mistake 4: Forgetting to run -race on integration tests¶
go test ./... without -race may miss races in code only exercised by integration suites. Apply -race to all go test invocations.
Mistake 5: Treating a one-off race report as flaky¶
A race that fires once and not again is not flaky — the schedule changed. The bug is still there. Investigate, do not retry.
Mistake 6: Adding t.Skip("races on slow machine")¶
If a test hits a race on a slow machine and passes on a fast one, the race is real on both. Fix the code.
Mistake 7: Believing the binary is identical¶
A -race build differs from a non--race build: instrumented calls, larger binary, different symbol table. Performance benchmarks and binary-size checks must use non--race builds.
Self-Assessment¶
- I can write a Makefile target that runs all tests with -race -count=1 and a sane timeout.
- I can configure GitHub Actions, GitLab CI, or CircleCI to run a race job.
- I know what GORACE=halt_on_error=1 does and when to use it.
- I can write a stress test that hammers a concurrent function from many goroutines.
- I can read a race report and classify the bug in under one minute.
- I know how -race interacts with cgo and unsafe.
- I can write a race-only assertion guarded by //go:build race.
- I understand why benchmarks must run without -race.
Summary¶
At middle level the race detector is no longer something you remember to run; it is the default for tests and stress runs across every workflow. You combine -race with -count=1, -timeout, and GORACE knobs to surface bugs reliably. You wire CI to run a dedicated race job per PR and a nightly stress job for rare races. You read reports quickly by recognising the patterns: missing mutex, missing channel edge, captured loop variable, concurrent map. You also know what -race cannot see — cgo internals, unsafe games, deadlocks, logical races — and you pair it with timeouts, code review, and integration tests to cover those gaps.