WaitGroup in Tests — Junior Level¶
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concepts
- Real-World Analogies
- Mental Models
- Pros & Cons
- Use Cases
- Code Examples
- Coding Patterns
- Clean Code
- Product Use / Feature
- Error Handling
- Security Considerations
- Performance Tips
- Best Practices
- Edge Cases & Pitfalls
- Common Mistakes
- Common Misconceptions
- Tricky Points
- Test
- Tricky Questions
- Cheat Sheet
- Self-Assessment Checklist
- Summary
- What You Can Build
- Further Reading
- Related Topics
- Diagrams & Visual Aids
Introduction¶
Focus: "I started a goroutine in my test. My assertion ran before it finished. What now?"
A *testing.T does not magically know about your goroutines. The Test... function is just a normal Go function — when it returns, go test moves on to the next one. If the test body spawned a goroutine and did not wait for it, three bad things can happen, in increasing order of pain:
- The goroutine finishes after the assertion. The assertion fails when it should have passed.
- The goroutine finishes after the test function returns. The assertion passed, but a stray goroutine writes to a buffer that the test framework has already reused. The next test fails for an unrelated reason.
- The goroutine panics after the test function returns. The whole go test process dies with --- FAIL against a random test, and the stack trace points nowhere useful.
The cure is to wait inside the test. The classical tool is sync.WaitGroup. The pattern looks like this:
func TestParallelWork(t *testing.T) {
var wg sync.WaitGroup
results := make([]int, 4)
wg.Add(4)
for i := 0; i < 4; i++ {
i := i
go func() {
defer wg.Done()
results[i] = compute(i)
}()
}
wg.Wait()
for i, r := range results {
if r != expected(i) {
t.Errorf("results[%d] = %d, want %d", i, r, expected(i))
}
}
}
Three lines do the real work: wg.Add(4) registers the four pending goroutines, defer wg.Done() decrements once per goroutine, and wg.Wait() blocks the test until every goroutine has finished. Only then do the assertions run. The test is now deterministic — at least about finishing. (Determinism about order of finishing is a different problem; we get to it in the senior file.)
After reading this file you will:
- Know why a test that spawns goroutines almost always needs a barrier
- Be able to write the canonical wg.Add / wg.Done / wg.Wait block from memory
- Understand t.Cleanup and when to put wg.Wait() inside it
- Recognise — and refuse to write — the time.Sleep(100 * time.Millisecond) antipattern
- Know what the race detector reports when Add and Wait race
- Have a first encounter with goleak.VerifyNone(t) and what it protects against
You do not need to know about testing/synctest, fake clocks, or virtual time yet. Those are middle-level tools. You do not need to write your own helper library. We use the standard library only.
Prerequisites¶
- Required: Go 1.21+ (any modern version will do; nothing here is version-gated).
- Required: Comfort with go test, t.Errorf, t.Fatal. If you have written one normal table-driven test, you are ready.
- Required: A basic feel for sync.WaitGroup — see 03-sync-package/02-waitgroups/junior. You should know that Add increments a counter, Done decrements it, and Wait blocks until it hits zero.
- Required: Familiarity with go func() { ... }() and defer. The pattern defer wg.Done() is the first line inside almost every test goroutine.
- Helpful: Awareness of the race detector — go test -race. We will run every example with it.
- Helpful: Some intuition about flaky tests (tests that pass locally and fail in CI).
If go test -race ./... works on your laptop and you have ever debugged a test that fails 1 in 50 runs, you have the right context.
Glossary¶
| Term | Definition |
|---|---|
| Synchronisation barrier | A point in the test code that blocks until a defined event has happened. wg.Wait() is a barrier. So is <-done. |
| sync.WaitGroup | A counter with Add, Done, and Wait. The most common test barrier. |
| Add(n) | Increment the WaitGroup counter by n. Must be called before the corresponding Wait. |
| Done() | Decrement the WaitGroup counter by 1. Equivalent to Add(-1). Almost always written as defer wg.Done(). |
| Wait() | Block until the WaitGroup counter reaches zero. |
| t.Cleanup(f) | A function registered on the testing framework to run after the test ends (including on t.Fatal). Cleanups run in LIFO order. |
| t.Fatal / t.FailNow | Mark the test as failed and stop the current goroutine. Critically: it stops only the goroutine that called it. Calling t.Fatal from a spawned goroutine does not stop the test. |
| Goroutine leak | A goroutine still alive after the test has ended. Survives into the next test, the next package, or the end of the go test process. |
| goleak | A library (go.uber.org/goleak) that fails the test if any goroutine other than the test runner is still alive at teardown. |
| Flaky test | A test that sometimes passes and sometimes fails on identical input, due to timing or scheduling. |
| Start barrier | A pattern where many goroutines wait on a common signal before starting work, used to maximise contention for race detection. |
| Fan-out | One goroutine spawns N workers and waits for all of them. |
| time.Sleep antipattern | Using time.Sleep to "wait for" a goroutine to finish. Always wrong. |
Core Concepts¶
A test goroutine is not "owned" by the test¶
When you write go work() inside a test function, the spawned goroutine is just another goroutine. It is not registered with *testing.T. The test function can return, and that goroutine keeps running. The go test binary does not exit until all Test... functions have finished, but it does not wait for stray goroutines launched inside them.
func TestLeaky(t *testing.T) {
go func() {
time.Sleep(2 * time.Second)
t.Log("hello from the future") // dangerous
}()
// function returns immediately
}
This test "passes" because the body has no assertion. But two seconds later, a goroutine calls t.Log on a *testing.T whose test has long ended. Modern Go (1.16+) catches this exact case and reports panic: Log in goroutine after TestLeaky has completed. The cure is to wait inside the test.
wg.Wait is a memory-ordering boundary, not just a counter¶
wg.Wait does more than count. It establishes a happens-before relationship between every wg.Done() and the code following wg.Wait(). Anything a spawned goroutine wrote to memory before its wg.Done() is visible to the test after wg.Wait() returns — without further locking. This is exactly the property a test assertion needs.
In other words: after wg.Wait(), you may read the slice that goroutines wrote to, with no race, no atomic, no mutex.
Add(n) must happen before Wait¶
If Wait reads the counter and sees zero, it returns immediately. If Add had not been called yet, the counter was zero — even though you intended to spawn N goroutines. The result: Wait returns, the assertion runs, the goroutines run later, the test passes for the wrong reason. Always call Add(N) in the test goroutine before launching any worker.
// CORRECT
wg.Add(len(items))
for _, it := range items {
go func(it Item) {
defer wg.Done()
process(it)
}(it)
}
wg.Wait()
// WRONG — Add races with Wait
for _, it := range items {
go func(it Item) {
wg.Add(1) // (A)
defer wg.Done()
process(it)
}(it)
}
wg.Wait() // (B) may run before (A)
In the wrong version, Wait may run before any goroutine has called Add. The counter is zero, Wait returns, the test asserts on uninitialised state.
A spawned goroutine cannot call t.Fatal¶
The testing package's t.Fatal, t.FailNow, and t.SkipNow use runtime.Goexit to stop the current goroutine. If you call t.Fatal from a spawned goroutine, you stop the spawned goroutine — the test continues running, possibly to a successful end. The Go documentation says:
A test ends when its Test function returns or calls any of the methods FailNow, Fatal, Fatalf, SkipNow, Skip, or Skipf. Those methods, as well as the Parallel method, must be called only from the goroutine running the Test function.
So inside a goroutine you must use t.Errorf, which only sets a failure flag, and then let the goroutine return and wg.Wait finish before the test function decides whether to fail. The full pattern:
func TestParallel(t *testing.T) {
var wg sync.WaitGroup
wg.Add(4)
for i := 0; i < 4; i++ {
i := i
go func() {
defer wg.Done()
if got := compute(i); got != expected(i) {
t.Errorf("compute(%d) = %d, want %d", i, got, expected(i))
}
}()
}
wg.Wait()
}
t.Errorf is safe from any goroutine. t.Fatal is not.
time.Sleep is never the right barrier¶
The most common bad pattern in newcomer tests is a bare time.Sleep standing in for a barrier (Example 6 below shows it in full).
The sleep is "long enough" today on this laptop. Tomorrow it is too short under CI load. The day after, a flaky failure ships to production. Every sleep is either too short (sometimes) or too long (always). Use a barrier instead.
If you cannot avoid waiting on a real-world condition (e.g., a network round-trip in an integration test), use a polling deadline with a hard cap:
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if conditionTrue() {
return
}
time.Sleep(5 * time.Millisecond)
}
t.Fatal("condition never true within 2s")
This is still inferior to a barrier, but it has a definite failure mode — bounded latency and a clear error message — instead of a silent flake.
t.Cleanup is the test's defer¶
Anything you write as defer cleanup() inside the test body could equally well be t.Cleanup(cleanup). The difference: t.Cleanup runs even if t.Fatal is called, in LIFO order, and runs after subtests have finished. For tests that spawn long-running goroutines, the cleanup is the natural place to call cancel() and wg.Wait():
ctx, cancel := context.WithCancel(context.Background())
t.Cleanup(cancel)
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
runUntilCancelled(ctx)
}()
t.Cleanup(wg.Wait)
The first cleanup cancels the context (signalling the goroutine to stop). The second cleanup waits for it. Because cleanups run LIFO, wg.Wait runs after cancel — exactly what we want.
Real-World Analogies¶
Waiting for everyone at the trailhead¶
You and three friends are hiking. The trailhead has a sign: "Group photo here." You agree: everyone parks the car, drops off gear, then meets at the sign. The sign is wg.Wait. Each friend dropping their gear is wg.Done. Setting "four friends" up front is wg.Add(4). If you take the photo before the slowest friend arrives, the photo is wrong. If you forgot to count Sarah in Add, you might take the photo while she is still in the car.
The dinner party host¶
A host (the test) invites guests (goroutines). The host serves dessert (the assertion) after every guest has finished their main course (wg.Done). If the host serves dessert too early, half the guests are still eating. If the host did not count Sarah when setting the table (Add), Sarah never gets dessert and nobody notices.
The arrivals board at an airport¶
The board waits until every flight (goroutine) has landed (Done). Only then does it display "all clear" (Wait returns). The board does not display "all clear" based on a timer — that would be time.Sleep and would be wrong every busy day.
A construction crew with a foreman¶
The foreman (test goroutine) calls Add(10), dispatches 10 workers, and stands in the office (Wait) reading the paper. When the last worker hangs up their hard hat (Done), the foreman closes up shop (assertion runs). The foreman does not walk around guessing whether everyone is done — that would be time.Sleep.
Mental Models¶
Mental model 1: WaitGroup is a countdown latch¶
Think of wg.Add(n) as setting a kitchen timer to n rings. Each Done is one ring. Wait blocks until all rings have happened. The timer cannot un-ring, so once Wait returns, the test moves on permanently.
Mental model 2: Wait is a memory barrier¶
The Go memory model promises that anything written by a goroutine before its Done is visible to the goroutine doing Wait, after Wait returns. So the slice your goroutines wrote to is now safe to read. This is the secret that makes test assertions clean — no atomics, no mutexes, just Wait and then read.
Mental model 3: Goroutines are draft children¶
The test is the parent. The spawned goroutines are the children. The parent must wait for every child before going home. Add is registering the children. Done is each child saying "I'm done." Wait is the parent standing at the door.
If the parent leaves without waiting, the children are orphaned — they run on with no supervision and may set fire to the kitchen (the next test) by writing to memory that has been reused.
Mental model 4: Tests are batch jobs, not streams¶
A test is a one-shot batch: set up, run, assert, tear down. Streaming patterns (publish-subscribe, hot loops) do not fit a test directly. To test a streaming system, you typically run a finite slice of the stream and synchronise on its end with a WaitGroup or a done channel.
Pros & Cons¶
WaitGroup as a test barrier¶
Pros:
- Standard library, no dependencies.
- Zero allocation in the steady-state pattern.
- Establishes happens-before, so no manual locking around shared test state.
- Mirrors production code one-to-one — your test code looks like the code under test.
Cons:
- No timeout. If a goroutine hangs, the test hangs forever. (Solved with WaitTimeout, see middle.md.)
- No error propagation. If a goroutine wants to fail the test, it must use t.Errorf from inside the goroutine, which then triggers a failure flag — not as direct as errgroup.Group.Wait() returning an error.
- The Add/Wait ordering rule is subtle. Easy to get wrong in dynamic-fan-out tests.
Channels as test barriers¶
Pros:
- Naturally combine with select and time.After for a timeout.
- Carry data — the goroutine can report its result through the channel itself.
- Visible in goroutine dumps if you hang: chan receive vs sync.WaitGroup.Wait.
Cons:
- More verbose for plain fan-out.
- Easy to leak: forgetting to close, or sending to a buffered channel nobody drains.
- The pattern grows quickly when fan-in counts are dynamic.
time.Sleep as a test barrier¶
Pros:
- Trivial to write.
Cons:
- Wrong. Will flake. Will be slow. Use anything else.
Use Cases¶
Fan-out test: process a slice in parallel¶
You have a function that runs N tasks in parallel. The test feeds it a fixed input and checks every result.
func TestFanOut(t *testing.T) {
items := []int{1, 2, 3, 4, 5}
results := make([]int, len(items))
var wg sync.WaitGroup
wg.Add(len(items))
for i, it := range items {
i, it := i, it
go func() {
defer wg.Done()
results[i] = it * it
}()
}
wg.Wait()
for i, r := range results {
if want := items[i] * items[i]; r != want {
t.Errorf("results[%d] = %d, want %d", i, r, want)
}
}
}
Background worker test: start, do something, stop¶
You have a server-like worker. The test starts it, sends one event, asserts on the side effect, and shuts it down cleanly.
func TestWorker(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
t.Cleanup(cancel)
w := NewWorker()
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
w.Run(ctx)
}()
t.Cleanup(wg.Wait)
w.Submit(Event{ID: 1})
// wait for the event to be processed
select {
case <-w.Done(1):
case <-time.After(2 * time.Second):
t.Fatal("event 1 not processed within 2s")
}
}
Stress test: hammer a structure from many goroutines¶
The whole point is to maximise concurrent access. Use a start barrier so all goroutines begin at the same instant, increasing contention for the race detector.
func TestConcurrentSet(t *testing.T) {
s := NewSet()
start := make(chan struct{})
var wg sync.WaitGroup
const n = 100
wg.Add(n)
for i := 0; i < n; i++ {
i := i
go func() {
defer wg.Done()
<-start // wait for the gun
s.Add(i)
}()
}
close(start) // fire the gun
wg.Wait()
if got := s.Len(); got != n {
t.Errorf("Len = %d, want %d", got, n)
}
}
(We expand on the start-barrier pattern in senior.md.)
Leak-detection test: catch stray goroutines¶
You suspect a function leaks a goroutine on the error path. Add goleak:
import "go.uber.org/goleak"
func TestParserNoLeak(t *testing.T) {
defer goleak.VerifyNone(t)
_, err := Parse(brokenInput)
if err == nil {
t.Fatal("expected error")
}
}
If Parse leaks, the deferred verifier fails the test with the stack trace of the leaked goroutine.
Code Examples¶
Example 1: the canonical fan-out¶
package wgtest
import (
"sync"
"testing"
)
func square(x int) int { return x * x }
func TestSquareAll(t *testing.T) {
in := []int{1, 2, 3, 4}
out := make([]int, len(in))
var wg sync.WaitGroup
wg.Add(len(in))
for i, x := range in {
i, x := i, x
go func() {
defer wg.Done()
out[i] = square(x)
}()
}
wg.Wait()
want := []int{1, 4, 9, 16}
for i := range want {
if out[i] != want[i] {
t.Errorf("out[%d] = %d, want %d", i, out[i], want[i])
}
}
}
Run with go test -race. The race detector verifies that the writes out[i] = ... do not race with the reads in the assertion — they don't, because wg.Wait() synchronises them.
Example 2: assertion inside the goroutine¶
When the goroutine itself can decide whether the result is correct, you can let it call t.Errorf directly. Do not call t.Fatal.
func TestEachWorker(t *testing.T) {
var wg sync.WaitGroup
wg.Add(4)
for i := 0; i < 4; i++ {
i := i
go func() {
defer wg.Done()
got := square(i)
if got != i*i {
t.Errorf("square(%d) = %d, want %d", i, got, i*i)
}
}()
}
wg.Wait()
}
The t.Errorf calls accumulate. If any goroutine fails, the test fails at the end. The test never hangs because every goroutine eventually returns.
Example 3: t.Cleanup for graceful teardown¶
func TestServer(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
t.Cleanup(cancel) // (1) signal stop
s := NewServer()
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
s.Serve(ctx)
}()
t.Cleanup(wg.Wait) // (2) wait for stop
// body of the test...
}
t.Cleanup callbacks run LIFO. We registered cancel first and wg.Wait second, so the order at teardown is: wg.Wait, cancel. That is wrong. Fix by swapping:
t.Cleanup(wg.Wait) // registered first → runs last
t.Cleanup(cancel) // registered second → runs first
Or, more readable, do it both at once:
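A sketch of the combined form, reusing the cancel and wg names from the example above:

```go
t.Cleanup(func() {
	cancel()  // first: signal the server goroutine to stop
	wg.Wait() // then: wait until it has actually stopped
})
```

One registration, one function body, and the cancel-then-wait order is spelled out in source order rather than reconstructed from LIFO rules.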
I prefer the second form. It puts the order in plain sight.
Example 4: WaitGroup race when Add is inside the goroutine¶
This compiles, runs, and is wrong:
func TestRacyAdd(t *testing.T) {
var wg sync.WaitGroup
for i := 0; i < 4; i++ {
go func() {
wg.Add(1) // (A)
defer wg.Done()
work()
}()
}
wg.Wait() // (B)
}
Under go test -race, this reports a data race between (A) and (B). The fix: move Add to the outer goroutine.
func TestRacyAddFixed(t *testing.T) {
var wg sync.WaitGroup
wg.Add(4) // before any goroutine starts
for i := 0; i < 4; i++ {
go func() {
defer wg.Done()
work()
}()
}
wg.Wait()
}
Example 5: counting on len(items)¶
A common safety habit:
wg.Add(len(items))
for _, it := range items {
it := it
go func() {
defer wg.Done()
process(it)
}()
}
Add(len(items)) is faster than calling Add(1) inside the loop (one atomic operation instead of len(items) of them), but the main reason to write it this way is clarity: the count is right next to the loop that produces the work. If you change the loop, you change the Add.
Example 6: the antipattern (do not ship)¶
func TestBad(t *testing.T) {
state := 0
go func() {
time.Sleep(10 * time.Millisecond)
state = 42
}()
time.Sleep(50 * time.Millisecond) // "wait for it"
if state != 42 {
t.Errorf("state = %d, want 42", state)
}
}
Three bugs at once:
- time.Sleep is the wait.
- state is read from one goroutine and written from another with no synchronisation — a data race.
- The test will pass on a fast machine and flake on a slow one.
The fix: a barrier and either a mutex or pass-the-result-via-channel.
func TestGood(t *testing.T) {
done := make(chan int, 1)
go func() {
done <- 42
}()
select {
case state := <-done:
if state != 42 {
t.Errorf("state = %d, want 42", state)
}
case <-time.After(time.Second):
t.Fatal("timeout waiting for state")
}
}
Coding Patterns¶
Pattern A: classic fan-out¶
Use when you have a fixed-size slice of work and want all results before asserting.
var wg sync.WaitGroup
wg.Add(N)
for i := 0; i < N; i++ {
i := i
go func() {
defer wg.Done()
results[i] = compute(i)
}()
}
wg.Wait()
Pattern B: result channel with timeout¶
Use when one goroutine produces one result and you want a timeout in the test.
done := make(chan T, 1)
go func() {
done <- compute()
}()
select {
case v := <-done:
assertEqual(t, v, want)
case <-time.After(2 * time.Second):
t.Fatal("timeout")
}
Pattern C: t.Cleanup for cancel + wait¶
Use when a goroutine runs for the whole test duration.
ctx, cancel := context.WithCancel(context.Background())
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
runUntilCancelled(ctx)
}()
t.Cleanup(func() {
cancel()
wg.Wait()
})
Pattern D: start barrier¶
Use to maximise concurrency for race testing.
start := make(chan struct{})
var wg sync.WaitGroup
wg.Add(N)
for i := 0; i < N; i++ {
i := i
go func() {
defer wg.Done()
<-start
contended(i)
}()
}
close(start)
wg.Wait()
Pattern E: goleak defer¶
Use whenever you suspect a function may leak a goroutine — per test with a deferred goleak.VerifyNone(t), or package-wide from TestMain.
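A sketch of both forms in a _test.go file, assuming the go.uber.org/goleak module is on the module path (the package and test names are placeholders):

```go
package mypkg_test

import (
	"testing"

	"go.uber.org/goleak"
)

// Per test: fail this test if any goroutine it started is still alive at its end.
func TestNoLeak(t *testing.T) {
	defer goleak.VerifyNone(t)
	// ... exercise the code under test ...
}

// Per package: verify once, after every test in the package has run.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}
```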
Clean Code¶
A clean concurrent test reads almost like a sequential one. The barrier sits at the seam, not scattered through the body.
Smelly:
func TestThing(t *testing.T) {
wg := sync.WaitGroup{}
for _, it := range items {
wg.Add(1)
go func(it Item) {
defer wg.Done()
r := process(it)
mu.Lock()
results = append(results, r)
mu.Unlock()
t.Logf("processed %v", it)
}(it)
}
time.Sleep(50 * time.Millisecond)
wg.Wait()
if len(results) != len(items) {
t.Errorf(...)
}
}
Problems: Add inside the loop; a sleep before Wait; a mutex-and-append dance that a pre-allocated slice would make unnecessary; and t.Logf calls mixed into the worker body — safe from a goroutine, but noise that distracts the reader.
Clean:
func TestThing(t *testing.T) {
results := make([]Result, len(items))
var wg sync.WaitGroup
wg.Add(len(items))
for i, it := range items {
i, it := i, it
go func() {
defer wg.Done()
results[i] = process(it)
}()
}
wg.Wait()
for i, r := range results {
if !valid(items[i], r) {
t.Errorf("results[%d] for %v is invalid: %v", i, items[i], r)
}
}
}
Pre-allocated slice — no lock. Pre-counted Add. One barrier. Assertions afterwards.
Test helpers belong in helper files¶
If a WaitTimeout or goleak-style helper appears in three different _test.go files, lift it into testhelpers_test.go or its own internal package. Tests should describe what is tested, not the plumbing.
Mark helpers with t.Helper()¶
Inside any function that takes a *testing.T and asserts on its behalf, call t.Helper() as the first line. Test failures will then point to the calling test, not the helper.
func WaitTimeout(t *testing.T, wg *sync.WaitGroup, d time.Duration) {
t.Helper()
done := make(chan struct{})
go func() {
wg.Wait()
close(done)
}()
select {
case <-done:
case <-time.After(d):
t.Fatalf("WaitGroup did not finish within %v", d)
}
}
Product Use / Feature¶
Why we test concurrent code at all¶
Concurrent production code earns dedicated tests when:
- Multiple goroutines touch the same state.
- Ordering matters (an event must happen before another).
- Lifecycle matters (something starts, runs, and must stop cleanly).
For these systems a test must do three things: drive concurrency, observe its end, and assert. A WaitGroup barrier is the cheapest "observe its end."
CI flake budgets¶
A common policy: a test that fails more than once in 1000 runs is broken. To check, run:
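One way to measure it; the test name and package path are placeholders:

```shell
go test -race -run TestSuspect -count=1000 ./pkg/under/test
```

-count=1000 reruns the test a thousand times in one invocation, and -race makes scheduling-dependent bugs more likely to surface.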
If you see two or more failures, the test is flaky. Most flakes turn out to be a missing barrier (sleep instead of wg.Wait) or a leaked goroutine that mutates state in the next test. The fixes are in this file.
Server boot tests¶
For services, the first concurrent test is usually "boot up, accept one request, shut down." It looks like this:
func TestServerBoot(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
Serve(ctx)
}()
t.Cleanup(func() {
cancel()
wg.Wait()
})
waitForListen(t, ":8080")
resp, err := http.Get("http://localhost:8080/health")
// ... assertions
}
The waitForListen helper is itself a barrier — typically a poll with a 2-second timeout. Without it, the http.Get races the listener.
Error Handling¶
When a goroutine returns an error¶
A common ask: the spawned goroutine produces an error and the test should fail with that error. Three choices:
Choice 1: t.Errorf inside the goroutine.
Simplest, works for most tests. Cannot use t.Fatal here.
Choice 2: error channel.
errs := make(chan error, N)
for i := 0; i < N; i++ {
go func() {
errs <- work()
}()
}
for i := 0; i < N; i++ {
if err := <-errs; err != nil {
t.Errorf("worker %d: %v", i, err)
}
}
Convenient for collecting and reporting.
Choice 3: errgroup.Group.
import "golang.org/x/sync/errgroup"
var g errgroup.Group
for i := 0; i < N; i++ {
i := i
g.Go(func() error { return work(i) })
}
if err := g.Wait(); err != nil {
t.Fatal(err)
}
The cleanest for "all workers must succeed." Combines WaitGroup with first-error semantics.
Goroutine panics in a test¶
If a spawned goroutine panics, the whole go test process dies. The test framework cannot trap a panic in another goroutine. Recover inside the goroutine if the panic is expected behaviour you are testing:
go func() {
defer wg.Done()
defer func() {
if r := recover(); r != nil {
t.Errorf("worker panicked: %v", r)
}
}()
work()
}()
If the panic is not expected, do not recover. Let it crash the test — the stack trace is exactly what you need.
Timeouts on Wait¶
wg.Wait has no built-in timeout. A goroutine that hangs hangs the test forever. In CI the test binary will eventually kill itself (go test's -timeout flag defaults to 10 minutes), but ten minutes is too long for a fast feedback loop. The WaitTimeout helper (see middle.md) wraps wg.Wait in a select with time.After and fails the test cleanly.
Security Considerations¶
Secrets in goroutines that outlive the test¶
If a goroutine spawned by a test holds a secret in a string and leaks past the test, the secret remains in memory for the rest of the go test binary's life. Cleanup with t.Cleanup and a cancel() ensures the goroutine returns, freeing references.
Test-only listeners on real network ports¶
A test that binds to :8080 and forgets to close the listener leaks not only a goroutine but a port. The next test fails with "address already in use." t.Cleanup(listener.Close) and wg.Wait solve it.
Race detector in CI¶
Run go test -race in CI for every package. It is the closest thing to a security audit for concurrent code. The race detector is heavy at runtime (5–10x slowdown) but inexpensive in CI.
Performance Tips¶
Pre-allocate before fan-out¶
Writing results into a pre-allocated slice, one index per goroutine, is better than appending under a mutex:
var results []int
mu := sync.Mutex{}
// later, in each goroutine: mu.Lock(); results = append(results, r); mu.Unlock()
The pre-allocated form avoids the mutex entirely — each goroutine writes to its own index. The wg.Wait synchronises everything.
Avoid Add(1) in tight loops¶
Add is one atomic instruction; one Add(len(items)) is faster than len(items) separate Add(1)s. For tests this rarely matters, but the readability argument alone is enough — keep the count next to the loop.
Re-use a WaitGroup only on a clear barrier¶
Once Wait has returned, the WaitGroup is reusable, but only if there is a happens-before edge from the previous Wait to the next Add. In tests, that edge is the test body itself — sequential code. So this is fine:
wg.Add(N)
spawn(N)
wg.Wait()
// happens-before edge: continued execution in the test goroutine
wg.Add(M)
spawn(M)
wg.Wait()
What is not fine: Add from one goroutine while another is still inside Wait for the same group. Don't do that in tests; it has no use case.
Parallel subtests¶
t.Parallel lets independent subtests run concurrently. A WaitGroup inside each subtest works fine. Each parallel subtest has its own *testing.T, and as long as the subtests do not share mutable state, their goroutines do not interfere. Just make sure each subtest waits for its own goroutines before its body returns.
func TestParallelGroup(t *testing.T) {
for _, tc := range cases {
tc := tc
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
var wg sync.WaitGroup
wg.Add(len(tc.items))
for _, it := range tc.items {
it := it
go func() {
defer wg.Done()
_ = process(it)
}()
}
wg.Wait()
})
}
}
Best Practices¶
- Call Add(N) in the same goroutine that will call Wait, before spawning workers.
- Use defer wg.Done() as the first line inside the goroutine.
- Capture the loop variable by argument or by an i := i rebinding before Go 1.22; from 1.22 onwards, a per-iteration variable is the default, but the rebinding still documents intent.
- Never use time.Sleep as a barrier. Replace with wg.Wait, a done channel, or a polling deadline with a hard cap.
- Use t.Errorf, not t.Fatal, from inside a spawned goroutine.
- Register t.Cleanup for cancel-and-wait when a goroutine outlives the assertion logic.
- Run go test -race -count=10 locally before pushing.
- Add goleak.VerifyTestMain(m) (or per-test goleak.VerifyNone) when working on code that creates background goroutines.
- Add a WaitTimeout helper to your test package. A test that hangs is worse than a test that fails.
- Keep barriers visible. One wg.Wait at the end of the test body is easier to audit than a dozen done channels.
Edge Cases & Pitfalls¶
Add(0) is legal, Add(negative) is allowed only as Done¶
wg.Add(0) is a no-op. wg.Add(-1) is wg.Done(). Going below zero panics: sync: negative WaitGroup counter. Tests sometimes hit this by calling wg.Done() twice on the same goroutine.
Wait does not reset¶
After Wait returns, the counter is zero. You can Add again and Wait again. There is no Reset. Re-using is fine if you don't interleave the two phases.
Empty WaitGroup waits zero¶
wg.Wait() on a WaitGroup that has never had Add returns immediately. Useful for "wait if I spawned anything" patterns:
var wg sync.WaitGroup
if shouldSpawn {
wg.Add(1)
go func() { defer wg.Done(); work() }()
}
wg.Wait() // no-op if no goroutine was spawned
A buffered channel as a poor man's WaitGroup¶
done := make(chan struct{}, N)
for i := 0; i < N; i++ {
go func() {
defer func() { done <- struct{}{} }()
work()
}()
}
for i := 0; i < N; i++ {
<-done
}
Works, but uses N times more memory and obscures intent. Use a WaitGroup unless you also need to carry data through the channel.
Don't pass WaitGroup by value¶
func spawn(wg sync.WaitGroup) { // BAD: passes a copy
wg.Add(1)
go func() {
defer wg.Done()
...
}()
}
sync.WaitGroup contains a counter. Copying it makes two unrelated counters. Always pass *sync.WaitGroup.
A defer wg.Wait() at function top¶
func TestStuff(t *testing.T) {
var wg sync.WaitGroup
defer wg.Wait() // runs at the end of the test function
wg.Add(1)
go func() { defer wg.Done(); work() }()
// assertions here
}
The assertions run before wg.Wait. The goroutine may not have finished. This pattern is the wrong way to use defer for tests — almost always you want wg.Wait between spawn and assertion. Use t.Cleanup only when the goroutine is supposed to run for the whole test.
Common Mistakes¶
Mistake 1: forgetting Add¶
No Add. Wait returns immediately. Done later panics with "negative WaitGroup counter". A failed test if you're lucky; an undetected race if you aren't.
Mistake 2: Wait before Add¶
Wait may run before the goroutine has called Add. Counter is zero, Wait returns, assertion runs on uninitialised data. The race detector reports this exact pattern under go test -race.
Mistake 3: double Done¶
The defer plus the early Done means two decrements for one increment. The counter goes negative — panic.
Fix: choose one. If using defer, never call Done again.
Mistake 4: passing wg by value¶
See "Don't pass WaitGroup by value" above.
Mistake 5: t.Fatal from a goroutine¶
t.Fatal does runtime.Goexit for the current goroutine. The goroutine returns. defer wg.Done() runs. The test continues. The failure flag is set, so the test does fail at the end, but it does not stop other goroutines from running. If multiple goroutines call t.Fatal, you may get scrambled output. Use t.Errorf.
Mistake 6: time.Sleep instead of Wait¶
Already covered. Worth repeating: every time.Sleep in a test is a bug or a hack.
Mistake 7: forgetting to cancel a context before Wait¶
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
defer wg.Wait() // BAD ORDER: defer is LIFO, so Wait runs first, then cancel
defer runs LIFO. So Wait runs first — but the goroutine is still running, waiting for context cancel. Deadlock. Fix: register Wait first (so it runs last), or wrap both in a single Cleanup.
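A runnable sketch of the corrected registration order, with run standing in for a test body:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

func run() {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup

	defer wg.Wait() // registered FIRST, so it runs LAST
	defer cancel()  // registered SECOND, so it runs FIRST and unblocks the worker

	wg.Add(1)
	go func() {
		defer wg.Done()
		<-ctx.Done() // worker stops when cancel() fires
	}()
	// test body / assertions would go here
}

func main() {
	run()
	fmt.Println("no deadlock")
}
```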
Common Misconceptions¶
"Once I call go, the goroutine has started."¶
The go keyword schedules the goroutine. It does not promise that the goroutine has executed even one line by the time go returns. On a fast machine the new goroutine may run almost immediately; on a busy machine it may not start for milliseconds. Tests cannot rely on it having started — use a barrier.
"If I close the channel, all goroutines stop."¶
Closing a channel stops receivers on that channel (they see the zero value and ok=false). Goroutines that are not receiving on the channel keep running. Closing a channel is not a kill switch.
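A small sketch of what close actually does: the receiver drains the buffer, observes the close, and exits; goroutines not receiving on the channel are untouched (drain is an illustrative helper):

```go
package main

import (
	"fmt"
	"sync"
)

// drain receives from ch until it is closed, then reports how many values
// it saw. close stops THIS loop because it receives on ch; it would not
// stop a goroutine that never touches ch.
func drain(ch <-chan int) int {
	n := 0
	for range ch {
		n++
	}
	return n
}

func main() {
	jobs := make(chan int, 2)
	jobs <- 1
	jobs <- 2
	close(jobs) // receivers still get the two buffered values, then ok=false

	var wg sync.WaitGroup
	wg.Add(1)
	var seen int
	go func() {
		defer wg.Done()
		seen = drain(jobs)
	}()
	wg.Wait()
	fmt.Println("receiver exited after", seen, "jobs")
}
```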
"A WaitGroup is a mutex."¶
A WaitGroup is a counter. It does not protect data. The protection comes from the happens-before edge between Done and Wait. If two goroutines both write to the same memory before their Done, they still race with each other.
"I need a mutex to share state with the test goroutine."¶
You only need a mutex if goroutines are running concurrently. After wg.Wait(), they are not. The test goroutine can read any state the workers wrote, without locking.
"I'll just add time.Sleep(10ms) — that's plenty."¶
It is plenty until it isn't. Under CPU load, under the race detector (5x slowdown), in CI on a shared runner, 10ms can become 200ms. Tests that pass 99 in 100 runs are not stable. Use a barrier.
"Wait returns immediately if no goroutines started."¶
True — but only because the counter is zero. If you intended to spawn goroutines and forgot to call Add, Wait lies to you.
Tricky Points¶
Why Add(positive) must happen before Wait¶
Inside WaitGroup.Add, the implementation does an atomic increment. Inside Wait, it does an atomic load. If the load sees a stale zero, Wait returns. The Go memory model does not allow Wait to wait for an Add that has not happened yet — it has no clairvoyance. Therefore you must ensure the order.
The rule: do not start the first Add(positive) from inside a goroutine launched without a prior Add. In practice this means Add(N) in the caller, before go f().
Why t.Cleanup LIFO matters¶
If your test registers two cleanups:
t.Cleanup(wg.Wait) // registered first → runs last
t.Cleanup(cancel)  // registered second → runs first
cancel runs first (registered second, popped first), then wg.Wait. That is what you want — signal stop, then wait for the stop. Reverse the registration order and the cleanup deadlocks.
Why goleak runs before t.Cleanup¶
goleak.VerifyNone(t) registered as a defer at the start of the test runs before all t.Cleanup callbacks: deferred functions execute as the test function returns, and only after that does the testing package invoke the cleanups. So:
func TestX(t *testing.T) {
defer goleak.VerifyNone(t) // (1)
t.Cleanup(cancel) // (2)
t.Cleanup(wg.Wait) // (3)
...
}
Order at the end: the test body returns → defer (1) runs → cleanup (3) runs → cleanup (2) runs. Per the testing documentation, the test's own deferred functions run first, then cleanups in LIFO order. So goleak.VerifyNone(t) runs before t.Cleanup(wg.Wait), which means it may see the still-running goroutine and report it as a leak.
This is a real gotcha. If you want goleak to run last, use it as a cleanup:
t.Cleanup(func() { goleak.VerifyNone(t) }) // registered first → runs last
t.Cleanup(cancel)
t.Cleanup(wg.Wait)
Or use goleak.VerifyTestMain(m) at the package level, which verifies after the whole test binary's tests are done.
t.Parallel and shared WaitGroup¶
Two subtests with t.Parallel share nothing but the parent test's variables. A shared var wg sync.WaitGroup in the parent that both subtests call Add/Wait on creates a mess. Each subtest should own its own WaitGroup.
Test¶
A short quiz. Answers in the next section.
- What is wrong with this test?
- Why is this WaitGroup pattern racy?
- What happens if a goroutine in a test calls t.Fatal?
- Why does time.Sleep not solve the synchronisation problem?
- What is the correct order of these t.Cleanup registrations to ensure cancel runs before wg.Wait?
  t.Cleanup(cancel)
  t.Cleanup(wg.Wait)
- What does goleak.VerifyNone(t) check?
- Is the following safe? Why?
Tricky Questions¶
- Why does wg.Wait() establish a happens-before edge? Because the implementation contains an atomic store/load pair that the Go memory model treats as a synchronisation event. Every Done synchronises-before any Wait that returns after the counter reached zero.
- Can WaitGroup deadlock? Yes. If Add(N) is called but only M < N goroutines call Done, Wait blocks forever.
- Why doesn't the test framework just wait for all goroutines for me? Because Go has no notion of "test-owned goroutines." A goroutine is anonymous. The runtime cannot know which were spawned for which test. You are the only one who knows; you must wait.
- If I use t.Cleanup(cancel), do I still need wg.Wait? Yes, unless you are certain the goroutine returns synchronously when the context is cancelled. For most production-like code, cancel() only signals the goroutine; the goroutine still needs time to finish whatever it was doing, and wg.Wait is how you observe that finish.
- Why does go test -race slow my test down? The race detector instruments every memory access, adding 5–10x runtime overhead. Tests that pass without -race may flake under -race because the slowdown changes timing. That is the point: -race exposes timing-sensitive bugs.
- What is the difference between WaitGroup and errgroup.Group? errgroup.Group wraps a WaitGroup and adds: (a) g.Go(func() error) for spawning, (b) first-error collection, (c) optional context cancellation on first error. For tests that need the first error, errgroup is cleaner.
- Can I copy a WaitGroup? No — its internal counter is one piece of state shared by all participants. Copying creates two independent counters.
- What happens if I call Wait from multiple goroutines simultaneously? All of them block until the counter hits zero, then all return. Useful for fan-in barriers.
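The multiple-Wait answer can be demonstrated with two WaitGroups: one used as a start gate that several goroutines Wait on together, one to observe completion (releaseAll is an illustrative helper, not from the original):

```go
package main

import (
	"fmt"
	"sync"
)

// releaseAll parks n goroutines on one start.Wait, then releases them all
// with a single start.Done. done confirms that every goroutine ran.
func releaseAll(n int) []bool {
	var start, done sync.WaitGroup
	start.Add(1)
	done.Add(n)
	ran := make([]bool, n)
	for i := 0; i < n; i++ {
		i := i
		go func() {
			defer done.Done()
			start.Wait() // all n goroutines block here together
			ran[i] = true
		}()
	}
	start.Done() // counter hits zero: every Wait returns at once
	done.Wait()
	return ran
}

func main() {
	fmt.Println(releaseAll(3)) // [true true true]
}
```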
Cheat Sheet¶
// Classic fan-out
var wg sync.WaitGroup
wg.Add(N)
for i := 0; i < N; i++ {
i := i
go func() {
defer wg.Done()
results[i] = compute(i)
}()
}
wg.Wait()
// Cleanup pattern for long-running goroutines
ctx, cancel := context.WithCancel(context.Background())
var wg sync.WaitGroup
wg.Add(1)
go func() { defer wg.Done(); run(ctx) }()
t.Cleanup(func() { cancel(); wg.Wait() })
// WaitTimeout helper
func WaitTimeout(t *testing.T, wg *sync.WaitGroup, d time.Duration) {
t.Helper()
done := make(chan struct{})
go func() { wg.Wait(); close(done) }()
select {
case <-done:
case <-time.After(d):
t.Fatalf("WaitGroup did not finish within %v", d)
}
}
// goleak — at start of test
defer goleak.VerifyNone(t)
// goleak — at package level
func TestMain(m *testing.M) { goleak.VerifyTestMain(m) }
Rules:
- Add(N) in the test goroutine, before go.
- defer wg.Done() first line in the goroutine.
- No time.Sleep as a barrier.
- No t.Fatal from a spawned goroutine.
- t.Cleanup(cancel) registered after (so it runs before) t.Cleanup(wg.Wait).
Self-Assessment Checklist¶
- I can write the Add / go / Done / Wait block from memory without looking it up.
- I know why Add must be called before the corresponding Wait.
- I never use time.Sleep as a barrier in a test.
- I use t.Errorf (not t.Fatal) inside spawned goroutines.
- I register t.Cleanup(cancel) after t.Cleanup(wg.Wait) for long-running goroutines.
- I run go test -race before pushing.
- I know what goleak.VerifyNone(t) reports.
- I can rewrite a sleep-based test as a barrier-based one.
Summary¶
Goroutines do not finish for free, and the test framework does not wait for them. The first concurrent-test skill is putting a barrier between spawn and assertion. sync.WaitGroup is the standard barrier: Add(N) in the test goroutine, defer wg.Done() first line in each worker, wg.Wait() before the assertion. This produces a deterministic finish and a clean memory-ordering edge — assertions can read worker state directly. For tests that own a long-running goroutine, t.Cleanup(func() { cancel(); wg.Wait() }) is the canonical teardown. For tests that worry about leaks, defer goleak.VerifyNone(t) is the canonical guard. Above all, never reach for time.Sleep — every sleep is either too short, too slow, or both.
What You Can Build¶
- A WaitTimeout(t, wg, 2*time.Second) helper that fails the test if Wait blocks. (Middle level expands on this.)
- A goleak-integrated TestMain that fails the whole package if any test leaks a goroutine.
- A flaky-test scanner: a CI step that runs go test -race -count=50 and flags any test that fails more than once.
- A "fan-out test harness" that spawns N workers against a target function and validates the results in a single deterministic block.
- A withCancelAndWait(t, run) helper that hides the cancel/wg/cleanup dance behind a single call.
Further Reading¶
- The Go Blog — Go Memory Model — the sync.WaitGroup happens-before rule.
- pkg.go.dev — sync.WaitGroup — the official API doc.
- pkg.go.dev — testing — t.Cleanup, t.Parallel, t.Helper.
- go.uber.org/goleak — leak detection.
- pkg.go.dev — golang.org/x/sync/errgroup — first-error-aware WaitGroup.
- The Go testing book — "Concurrent tests" chapter — historical reference; the patterns still apply.
Related Topics¶
- 02-deterministic-testing — synctest and broader determinism.
- 01-race-detector-deep — using -race to validate barriers.
- 04-mocking-time — replacing real timers, the bigger sibling of "no time.Sleep."
- ../../03-sync-package/02-waitgroups/ — the primitive in production code.
- ../../05-context-package/ — context.WithTimeout as an alternative deadline source.
Diagrams & Visual Aids¶
The barrier pattern (text diagram)¶
Test goroutine Worker goroutines
-------------- -----------------
wg.Add(3)
go worker(0) ----------> [worker 0 running]
go worker(1) ----------> [worker 1 running]
go worker(2) ----------> [worker 2 running]
|
wg.Wait() <---- Done() <---+ (worker 0)
| <---- Done() <---+ (worker 1)
| <---- Done() <---+ (worker 2)
v
(counter == 0, Wait returns)
|
assert(results)
What time.Sleep looks like in the failure case¶
Test goroutine Worker goroutine
-------------- ----------------
go work() ---> [scheduled, not yet running]
time.Sleep(10ms) ...waiting for CPU...
...CPU busy with other tests...
...sleep ends...
assert(state == 42) ...still scheduled, has not run...
FAIL: state == 0
[finally runs] state = 42
t.Cleanup LIFO order¶
Register: t.Cleanup(A) Test body runs
Register: t.Cleanup(B) Test body returns
Register: t.Cleanup(C)
Cleanups run:
C (registered last)
B
A (registered first)
Memory ordering at wg.Wait¶
Worker: ... slice[i] = v wg.Done()
(write) (release fence)
|
happens-before
|
v
Test: wg.Wait() ... read slice[i]
(acquire fence) (sees v)
Anything the worker wrote before Done is visible to the test after Wait.