Deterministic Testing — Find the Bug¶
A gallery of real flaky-test patterns. Each entry shows the test code, what is wrong, why it fails sometimes, and how to fix it deterministically. Read each before peeking at the diagnosis.
Case 1 — The Disappearing Increment¶
func TestCounter(t *testing.T) {
	var c int
	go func() { c++ }()
	if c != 1 {
		t.Fatal("expected 1")
	}
}
Symptom¶
Fails almost always with expected 1.
Diagnosis¶
The goroutine has not run yet by the time the assertion executes. Even on a fast machine, the scheduler does not promise to switch to the new goroutine before the next line of the parent runs.
There is also a data race: c is written by the goroutine and read by the main goroutine without synchronisation. go test -race flags this.
Fix¶
func TestCounter(t *testing.T) {
	var c int
	done := make(chan struct{})
	go func() {
		c++
		close(done)
	}()
	<-done
	if c != 1 {
		t.Fatal("expected 1")
	}
}
Channel close provides a happens-before edge; the assertion runs strictly after the increment.
Case 2 — The Pretend Sleep¶
func TestStart(t *testing.T) {
	s := NewServer()
	go s.Run()
	time.Sleep(100 * time.Millisecond)
	if err := s.Ping(); err != nil {
		t.Fatal(err)
	}
}
Symptom¶
Passes on the developer's M3 laptop. Fails 5% of the time in CI.
Diagnosis¶
time.Sleep(100ms) is a guess. On the M3 the server starts in 2ms. On a contended CI runner under heavy load, it might take 150ms or more. The sleep is too short on the slow path.
Fix¶
Have Run signal readiness:
func TestStart(t *testing.T) {
	s := NewServer()
	ready := make(chan struct{})
	go s.RunWithReady(ready)
	<-ready
	if err := s.Ping(); err != nil {
		t.Fatal(err)
	}
}
RunWithReady closes ready after it has bound the port and is ready to accept calls. The test waits on that signal, not on a guessed duration.
Case 3 — The Off-by-One WaitGroup¶
func TestSum(t *testing.T) {
	var wg sync.WaitGroup
	var sum int64
	for i := 0; i < 10; i++ {
		go func(v int) {
			wg.Add(1)
			defer wg.Done()
			atomic.AddInt64(&sum, int64(v))
		}(i)
	}
	wg.Wait()
	if sum != 45 {
		t.Fatalf("sum=%d", sum)
	}
}
Symptom¶
Sometimes Wait returns immediately and the sum is 0. Sometimes it works.
Diagnosis¶
wg.Add(1) is inside the goroutine. The parent might reach wg.Wait() before any goroutine has called Add. With the counter at 0, Wait returns instantly. The sync.WaitGroup documentation is explicit about this: calls to Add with a positive delta must happen before the corresponding Wait.
Fix¶
for i := 0; i < 10; i++ {
	wg.Add(1)
	go func(v int) {
		defer wg.Done()
		atomic.AddInt64(&sum, int64(v))
	}(i)
}
wg.Wait()
Add happens-before Wait. Always call Add in the parent.
Case 4 — The Captured Loop Variable¶
func TestParallel(t *testing.T) {
	var results [5]int
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			results[i] = i * i
		}()
	}
	wg.Wait()
	want := [5]int{0, 1, 4, 9, 16}
	if results != want {
		t.Fatalf("got %v want %v", results, want)
	}
}
Symptom¶
An index-out-of-range panic, or an unexpected partial pattern such as [0 0 0 16 16]. Pre-Go 1.22.
Diagnosis¶
In Go 1.21 and earlier, all five goroutines capture the same i variable. By the time a goroutine runs, i may already be 5 (the post-loop value) or some intermediate value. A goroutine observing i == 5 writes to results[5], which panics; the others write results[i] with whatever value i holds at that instant.
Go 1.22+ changed loop variable scoping; this trap is fixed there.
Fix¶
Pass i as an argument so each goroutine captures its own copy.
Case 5 — The Late Send¶
func TestSink(t *testing.T) {
	results := make(chan int)
	go func() {
		results <- compute()
	}()
	select {
	case r := <-results:
		if r != expected {
			t.Fatal("wrong")
		}
	case <-time.After(50 * time.Millisecond):
		t.Fatal("timeout")
	}
}
Symptom¶
Fails intermittently with "timeout" on slow CI.
Diagnosis¶
compute() sometimes takes longer than 50ms in CI. The timeout is too aggressive. The test treats a slow computation as a bug, but slowness is not the property being tested.
Fix¶
Use a generous timeout (5–10 seconds) intended as a "something is wrong" fence, or use t.Deadline() to align with -timeout:
deadline := time.NewTimer(5 * time.Second)
defer deadline.Stop()
select {
case r := <-results:
	if r != expected {
		t.Fatal("wrong")
	}
case <-deadline.C:
	t.Fatal("computation took longer than 5s")
}
Better: use synctest.Run and replace the timer with synctest.Wait. No real timeout needed.
Case 6 — The Phantom Subscription¶
func TestTimer(t *testing.T) {
	clk := clockwork.NewFakeClock()
	triggered := false
	go func() {
		<-clk.After(time.Second)
		triggered = true
	}()
	clk.Advance(2 * time.Second)
	time.Sleep(10 * time.Millisecond) // wait for goroutine
	if !triggered {
		t.Fatal("not triggered")
	}
}
Symptom¶
Fails sometimes — not triggered.
Diagnosis¶
The test calls clk.Advance before the goroutine has subscribed to clk.After. If the goroutine subscribes after the advance, the new timer's fire time is 2s in virtual time, but virtual time is no longer advancing — the timer never fires.
There is also a data race on triggered.
Fix¶
Use BlockUntilContext (or BlockUntil in older clockwork versions) before advancing:
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
done := make(chan struct{})
go func() {
	<-clk.After(time.Second)
	close(done)
}()
clk.BlockUntilContext(ctx, 1)
clk.Advance(2 * time.Second)
<-done
Or use synctest:
synctest.Run(func() {
	done := make(chan struct{})
	go func() {
		time.Sleep(time.Second)
		close(done)
	}()
	time.Sleep(2 * time.Second) // virtual time, mirroring Advance(2s)
	synctest.Wait()
	select {
	case <-done:
	default:
		t.Fatal("not triggered")
	}
})
Inside the bubble, time is virtual: both Sleep calls use a fake clock that advances whenever every goroutine in the bubble is durably blocked, so the two-second sleep completes instantly in real time. synctest.Wait then ensures the woken goroutine has run close(done) before the assertion executes.
Case 7 — The Leaked Worker¶
func TestWorker(t *testing.T) {
	in := make(chan int, 5)
	go worker(in)
	in <- 1
	in <- 2
	in <- 3
	// test ends
}
Symptom¶
Test passes, but goleak.VerifyTestMain reports a leaked goroutine.
Diagnosis¶
worker(in) is for v := range in {...}. The test never closes in. The worker goroutine blocks on <-in forever.
Fix¶
func TestWorker(t *testing.T) {
	in := make(chan int, 5)
	done := make(chan struct{})
	go func() {
		defer close(done)
		worker(in)
	}()
	in <- 1
	in <- 2
	in <- 3
	close(in)
	<-done
}
Close in to terminate the worker, then wait for its exit. No leak.
Case 8 — The Order Assumption¶
func TestLog(t *testing.T) {
	var buf bytes.Buffer
	go log.New(&buf, "", 0).Println("first")
	go log.New(&buf, "", 0).Println("second")
	time.Sleep(50 * time.Millisecond)
	want := "first\nsecond\n"
	if got := buf.String(); got != want {
		t.Fatalf("got %q want %q", got, want)
	}
}
Symptom¶
Sometimes "second\nfirst\n". Sometimes interleaved bytes ("firssecondt\n\n").
Diagnosis¶
Two goroutines write to the same bytes.Buffer (not goroutine-safe) and assume a specific order. Both assumptions are wrong.
Fix¶
Either serialise writes (mutex, channel), or do not assert on order. To assert on the set of lines:
var mu sync.Mutex
var lines []string
var wg sync.WaitGroup
for _, s := range []string{"first", "second"} {
	wg.Add(1)
	go func(s string) {
		defer wg.Done()
		mu.Lock()
		lines = append(lines, s)
		mu.Unlock()
	}(s)
}
wg.Wait()
sort.Strings(lines)
want := []string{"first", "second"}
if !reflect.DeepEqual(lines, want) {
	t.Fatalf("got %v", lines)
}
Case 9 — The Race-Free Flake¶
func TestRingBuffer(t *testing.T) {
	rb := NewRingBuffer(4)
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			rb.Push(v)
		}(i)
	}
	wg.Wait()
	if got := rb.Len(); got != 4 {
		t.Fatalf("got len %d want 4", got)
	}
}
Symptom¶
Passes -race. Fails 30% of the time with got len 3.
Diagnosis¶
The ring buffer is size 4. Ten goroutines push concurrently. The buffer may evict items asynchronously (depending on implementation), or the test's assumption that exactly 4 items remain is wrong. Race detector finds no data race because the buffer uses proper synchronisation.
The bug is in the test's logic: if the buffer defers eviction bookkeeping to a background goroutine, Len may briefly report 3 even after every Push has returned. The assertion "exactly 4 items immediately after Wait" is not a property the buffer promises under concurrent push.
Fix¶
Use synctest.Wait to drive the test to quiescence:
synctest.Run(func() {
	rb := NewRingBuffer(4)
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			rb.Push(v)
		}(i)
	}
	wg.Wait()
	synctest.Wait()
	if got := rb.Len(); got != 4 {
		t.Fatalf("got len %d want 4", got)
	}
})
Or document the property differently: "ring buffer eventually has 4 items" — use assert.Eventually with a generous timeout.
Case 10 — The TestMain Trap¶
var server *Server

func TestMain(m *testing.M) {
	server = NewServer()
	go server.Run()
	os.Exit(m.Run())
}

func TestPing(t *testing.T) {
	if err := server.Ping(); err != nil {
		t.Fatal(err)
	}
}
Symptom¶
TestPing fails sometimes with connection refused.
Diagnosis¶
server.Run starts asynchronously. TestMain calls m.Run() before the server has finished starting. The first test races against server startup.
Fix¶
func TestMain(m *testing.M) {
	server = NewServer()
	ready := make(chan struct{})
	go server.RunWithReady(ready)
	<-ready
	os.Exit(m.Run())
}
Wait for the server to signal readiness before running tests.
Case 11 — The Context Without Cancel¶
func TestProcessor(t *testing.T) {
	p := NewProcessor()
	ctx := context.Background()
	go p.Run(ctx)
	p.Submit(Task{})
	// ...
}
Symptom¶
Tests after this one are slow / occasionally hang.
Diagnosis¶
ctx never cancels. p.Run never exits. The goroutine is leaked across tests; goleak catches it eventually but the symptom is mysterious slowness.
Fix¶
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
done := make(chan struct{})
go func() {
	defer close(done)
	p.Run(ctx)
}()
// ... test body ...
cancel()
<-done
Always cancel and wait.
Case 12 — The Subtest State Leak¶
func TestThings(t *testing.T) {
	counter := 0
	t.Run("one", func(t *testing.T) {
		counter++
		if counter != 1 {
			t.Fatal()
		}
	})
	t.Run("two", func(t *testing.T) {
		counter++
		if counter != 2 {
			t.Fatal()
		}
	})
}
Symptom¶
Passes as written. Starts failing as soon as someone adds t.Parallel() to the subtests.
Diagnosis¶
Subtests run in deterministic order unless they call t.Parallel(). If they do, the order is not guaranteed. The test asserts on counter order — implicit dependency.
Fix¶
Reset state per subtest, or do not rely on order across subtests. If the test really must observe a sequence, do not use subtests.
Case 13 — The Misused time.After¶
func TestRetry(t *testing.T) {
	for i := 0; i < 3; i++ {
		select {
		case <-time.After(100 * time.Millisecond):
			attempt(i)
		case <-time.After(50 * time.Millisecond):
			t.Fatal("timeout")
		}
	}
}
Symptom¶
The "timeout" case fires every time. The 100ms case never runs.
Diagnosis¶
The Go select is non-deterministic when multiple cases become ready simultaneously, but here only one case can ever fire: whichever timer expires first. The 50ms timer always beats the 100ms timer. The durations are backwards — the timeout guarding an operation must be longer than the operation's own wait.
Also, time.After leaks: before Go 1.23, the underlying timer is not released until it fires, so a tight loop accumulates pending timers.
Fix¶
Whatever the intent, replace time.After with an explicit timer and Stop (named timer, not t, to avoid shadowing *testing.T; a ctx is assumed to be in scope):
timer := time.NewTimer(100 * time.Millisecond)
defer timer.Stop()
select {
case <-timer.C:
	attempt(i)
case <-ctx.Done():
	return
}
Or, in synctest.Run, no time.After is needed at all.
Case 14 — The Slow-CI Mystery¶
func TestUpload(t *testing.T) {
	file := makeTestFile(1 << 20)
	start := time.Now()
	err := upload(file)
	elapsed := time.Since(start)
	if err != nil {
		t.Fatal(err)
	}
	if elapsed > 500*time.Millisecond {
		t.Fatalf("too slow: %v", elapsed)
	}
}
Symptom¶
Passes locally. Fails in CI with "too slow: 612ms".
Diagnosis¶
Wall-clock duration assertion. CI runners are slower. The test is asserting on performance, but performance is not the contract being tested; correctness is.
Fix¶
Either:
- Remove the duration assertion entirely if the test is functional.
- Move it to a separate benchmark.
- Use a much higher threshold (5–10× expected) with a clear comment that the value is a "something is broken" fence, not a performance target.
Case 15 — The Closed-Channel Read¶
func TestStream(t *testing.T) {
	out := stream()
	if v, ok := <-out; !ok {
		t.Fatal("expected a value")
	} else if v != "hello" {
		t.Fatalf("got %q", v)
	}
}
Symptom¶
Sometimes fails with "expected a value".
Diagnosis¶
stream() returns a channel. If stream closes the channel before sending, the receive returns zero-value with ok = false. The test reads this as "no value." But the underlying bug might be: stream racily closes before sending its first value, depending on internal goroutine scheduling.
Fix¶
Inspect stream. Likely fix: send first, close after. The test then either passes (sees the value) or fails clearly (sees nothing because stream failed to send). No flake.
Patterns across all 15¶
- Replace time.Sleep with a barrier (cases 1, 2, 6, 14).
- Add proper synchronisation between goroutine and assertion (cases 1, 3, 8).
- Inject the clock or use synctest (cases 2, 5, 6, 14).
- Order goroutine setup correctly (cases 3, 4).
- Drain or close to terminate (cases 7, 10, 11).
- Do not assert on order or wall-clock duration (cases 8, 12, 14).
- Treat every flake as a code bug — fix at the source.
End of bug gallery.