WaitGroup in Tests — Optimize¶
Tests pay a real cost in CI minutes, developer wait, and feedback latency. Concurrent tests pay extra because the race detector slows execution 5–10x and because barrier-based waits are bounded by real time. This page covers the levers that reduce test latency without reducing coverage: tighter timeouts, fewer goroutines, parallel subtests, virtual time, inlined helpers, and harness reuse.
1. Tight timeouts beat loose ones¶
A 30-second WaitTimeout on a test that normally finishes in 10 ms is "safe" but wasteful. When the goroutine hangs, you wait 30 seconds to learn about it. Multiply by 100 flaky CI runs and you have lost hours.
Pick a timeout that is one order of magnitude above the normal completion time. If the test usually finishes in 10 ms, use 100 ms. If it usually takes 1 s, use 10 s.
WaitTimeout(t, &wg, 100*time.Millisecond) // fast test
WaitTimeout(t, &wg, 5*time.Second) // server test
WaitTimeout(t, &wg, 30*time.Second) // integration test
The trade-off: a CI runner under exceptional load may hit a tight timeout falsely. Mitigate by:
- Using assert.Eventually-style polling, so the test takes only as long as it needs (see the sketch below).
- Setting t.Parallel only for tests that are CPU-isolated (a busy CI runner can starve a single-threaded test).
- Re-running flaky tests with t.Skip("flaky in CI; retry") only as a temporary measure while you fix the timeout.
2. Choose N goroutines for the workload, not for "more is better"¶
Stress tests scale roughly:
- N goroutines × M iterations = work units.
- Race detector slowdown ~ 5–10x.
- Schedule overhead ~ constant + ε × N for moderate N.
Doubling N from 100 to 200 in a race test usually does not find more bugs but doubles the test's wall time. Stick to:
- Sanity test: 4–8 goroutines, 100 iterations.
- Race test: 50–100 goroutines, 1000 iterations, with start barrier.
- Soak test: 100 goroutines, 100,000 iterations — for the nightly suite, not the per-PR suite.
The race detector finds races based on memory access patterns, not goroutine count. The start barrier matters more than raw N.
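As a reference shape, here is a minimal race-test sketch with a start barrier (Counter, Inc, and Value are stand-ins for the code under test): every goroutine parks on the start channel, the test closes it so all workers hit the shared state at roughly the same instant, and the WaitGroup bounds the whole thing.

func TestCounterRace(t *testing.T) {
    const (
        goroutines = 50
        iterations = 1000
    )
    var (
        c     Counter               // shared state under test (illustrative)
        start = make(chan struct{}) // start barrier
        wg    sync.WaitGroup
    )
    wg.Add(goroutines)
    for i := 0; i < goroutines; i++ {
        go func() {
            defer wg.Done()
            <-start // park until every goroutine is ready
            for j := 0; j < iterations; j++ {
                c.Inc()
            }
        }()
    }
    close(start) // release all goroutines at once
    wg.Wait()
    if got := c.Value(); got != goroutines*iterations {
        t.Errorf("count = %d, want %d", got, goroutines*iterations)
    }
}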
3. Use t.Parallel aggressively, but correctly¶
t.Parallel runs subtests concurrently. The wall time becomes max(subtest time), not sum(subtest time). On a project with 100 fast tests, this drops CI time from 100 seconds to 1 second.
Two rules:
- Capture loop variables (or use Go 1.22+).
- Don't share state across parallel subtests.
for _, tc := range cases {
tc := tc
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
// ... use tc
})
}
For concurrent tests within a parallel subtest, the WaitGroup pattern works as usual. The parent's TestX function returns before the parallel subtests' bodies run; pausing them at t.Parallel and resuming them afterwards is the testing framework's responsibility, not yours.
When NOT to parallelise¶
- Tests that bind to a fixed port (parallel runs race on the port).
- Tests that touch a shared file system path.
- Tests that depend on a singleton process.
- Tests that share a database fixture.
For these, serialise via build tags, separate *_test.go files, or -p 1.
4. Replace real time with virtual time¶
testing/synctest (experimental in Go 1.24, behind GOEXPERIMENT=synctest) replaces real clock advancement with virtual time inside a "bubble." A test that uses time.Sleep or time.After for legitimate reasons (testing a timeout, a retry interval) now runs in microseconds instead of seconds.
import "testing/synctest"
func TestRetryBackoff(t *testing.T) {
    synctest.Run(func() {
        client := New(100 * time.Millisecond) // retry interval (constructor is illustrative)
        client.Send(req)
        synctest.Wait() // blocks until every goroutine in the bubble has settled
        if client.Attempts() != 3 { t.Errorf(...) }
    })
}
Without synctest, the test sleeps 300 ms. With synctest, near-instant.
For pre-1.24 codebases, the alternative is dependency injection of clock interfaces (see 04-mocking-time).
5. Inline simple helpers¶
A short helper is a function call away — fast, but slightly more allocation than inlined code. For a WaitTimeout called once per test, this doesn't matter. For a tight stress-test inner loop, it might:
// inline form
done := make(chan struct{})
go func() { wg.Wait(); close(done) }()
select {
case <-done:
case <-time.After(d):
t.Fatal(...)
}
Versus the one-line helper call, WaitTimeout(t, &wg, d).
The helper costs one closure allocation, one channel allocation, and one timer allocation per call; the inline form costs exactly the same. There is no real difference, so readability wins: use the helper.
Where inlining does help: inside a goroutine body that you spawn 10,000 times. A closure that captures t and wg is bigger and slower to schedule than a direct call.
6. Reuse expensive setup with TestMain or sync.Once¶
A test suite that boots a database per test is slow. Boot once per package:
var (
db *sql.DB
setupOnce sync.Once
)
func setup() {
setupOnce.Do(func() {
db = openTestDB()
})
}
func TestX(t *testing.T) {
setup()
// ... use db
}
Or in TestMain:
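A minimal sketch of the TestMain form, reusing the same package-level db and the hypothetical openTestDB helper from above (it also needs the os import):

func TestMain(m *testing.M) {
    db = openTestDB() // boot once for the whole package
    code := m.Run()   // run every test in the package
    db.Close()        // tear down after all tests finish
    os.Exit(code)
}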
With t.Parallel, tests share the DB, so each must work against independent rows or schemas. The setup cost is paid once instead of once per test.
7. Cut barrier overhead in microbenchmarks¶
For benchmarks, b.RunParallel handles the WaitGroup setup internally. Outside of benchmarks, hand-rolled fan-out has measurable startup cost:
BenchmarkFanOut10 50000 32000 ns/op
BenchmarkFanOut100 5000 280000 ns/op
BenchmarkFanOut1000 500 2900000 ns/op
(Approximate numbers from a typical Go 1.22 machine.)
The cost is goroutine spawn + scheduler overhead, not the WaitGroup itself. To reduce:
- Pre-spawn a worker pool that the benchmark feeds.
- Use RunParallel so the framework's internal pool is reused (see the sketch after this list).
- Increase per-goroutine work so the spawn cost amortises.
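A minimal RunParallel sketch (doWork stands in for the workload): the framework spawns one goroutine per GOMAXPROCS and divides b.N among them, so spawn cost is paid a handful of times rather than once per work unit.

func BenchmarkFanOutPooled(b *testing.B) {
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            doWork() // per-iteration workload (illustrative)
        }
    })
}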
For tests, this never matters. For benchmarks, choose the level that reveals what you want to measure.
8. Bound retries in goleak¶
goleak's default retry budget is 20 attempts × 100 ms = 2 seconds. A test whose goroutines are slow to shut down, or one that genuinely leaks, burns up to that whole budget before the verdict:
goleak.VerifyNone(t,
goleak.WithRetryAttempts(5),
goleak.WithRetryInterval(50 * time.Millisecond),
)
5 × 50 ms = 250 ms in the worst case instead of 2 s. Across a package with 100 tests that all exhaust the retry budget, that saves up to 175 seconds per CI run.
The risk: tests with truly slow shutdown (TCP listeners with SO_LINGER) need the longer retry. Profile your shutdown path before tightening.
9. Avoid recreating WaitGroups in loops¶
A loop that creates a fresh WaitGroup per iteration is fine — WaitGroup is cheap. But re-using one across waves avoids the allocation entirely:
var wg sync.WaitGroup
for wave := 0; wave < 100; wave++ {
wg.Add(N)
spawnWave(wave, &wg)
wg.Wait()
}
The savings are nanoseconds per wave; not meaningful for tests. Don't optimise here unless a profiler shows it.
10. Skip race detector for fast iteration¶
Locally, run without -race for tight inner loops:
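go test -run TestFanOut -count=20 ./mypkg
(The test name and package path are illustrative.)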
Use -race for the final pre-PR check:
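go test -race ./...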
The 5–10x slowdown of -race is fine in CI; locally, it slows the feedback loop unnecessarily for tests that aren't focused on concurrency.
11. Reduce per-test setup with subtests¶
A common pattern: each TestX builds a fresh service. With subtests sharing the parent's setup, you build once and run many cases:
func TestService(t *testing.T) {
svc := buildService(t)
t.Run("case1", func(t *testing.T) { ... })
t.Run("case2", func(t *testing.T) { ... })
t.Run("case3", func(t *testing.T) { ... })
}
If cases are independent and idempotent, add t.Parallel. Setup cost amortises across all cases.
12. Use b.ReportAllocs and -race together for diagnosis¶
When optimising a test or helper, enable both:
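go test -bench=. -benchmem -race ./...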
-benchmem reports allocations per operation. Allocations correlate with garbage collector load, which affects concurrent test timing. A helper that allocates a closure per call may slow stress tests perceptibly.
For pure tests (not benchmarks), profile with:
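go test -cpuprofile cpu.out -run TestStress ./mypkg
go tool pprof cpu.out
(The test name and package path are illustrative; -cpuprofile and pprof are standard tooling.)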
The profile usually shows the WaitGroup is the cheapest thing in the test. The bottleneck is the work the goroutines do.
13. Watch out for the start barrier's hidden cost¶
Closing a channel with N receivers does not wake all of them simultaneously — the runtime wakes them in sequence, with a small per-wakeup cost. For N = 100, the wakeup phase takes maybe 50 microseconds. For N = 10,000, it takes ~5 ms.
If your race test relies on simultaneous start across thousands of goroutines, the start barrier's serialisation cost can mask the race. Mitigations:
- Reduce N to the smallest value that consistently finds races (usually 50–200).
- Run the test many times with -count=N rather than scaling N up (example below).
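For example, instead of raising N to 10,000 (the test name and package path are illustrative):
go test -race -run TestCounterRace -count=50 ./mypkg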
14. Use runtime.LockOSThread carefully¶
Pinning a goroutine to an OS thread can be useful for testing thread-local state. The cost: that goroutine can no longer be scheduled on other threads, increasing contention.
For most tests, do not use LockOSThread. It is for cgo-heavy code and signal handlers.
15. Cache test-specific resources¶
A test that, in each subtest, parses the same large file and then forks goroutines wastes the parse time. Hoist:
var parsedOnce sync.Once
var parsed *Doc
func loadDoc(t *testing.T) *Doc {
    t.Helper()
    parsedOnce.Do(func() {
        b, err := os.ReadFile("testdata/big.json")
        if err != nil { t.Fatal(err) }
        parsed, err = Parse(b)
        if err != nil { t.Fatal(err) }
    })
    if parsed == nil {
        t.Fatal("fixture failed to load") // Do ran once and failed in an earlier subtest
    }
    return parsed
}
Subtests call loadDoc(t) and get the cached document. Parse cost is paid once.
16. Profile your test suite with -test.timeout¶
Set a global timeout on the whole test binary:
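go test -timeout 30s ./...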
If the run exceeds 30s, the test binary is killed and prints a goroutine dump showing which tests were still running and where they are stuck. Use this to find slow tests:
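go test -timeout 10s ./...
(The tighter threshold is illustrative; any package exceeding it panics with a stack dump that shows the still-running tests.)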
Anything that fails this is too slow for "small unit test" classification. Either optimise it or move it to an integration suite.
17. Parallel CI sharding¶
For repos with thousands of tests, combine package-level parallelism with a -shard k/N style flag: the test load splits across N CI machines, each machine runs a subset, and the suite finishes in roughly (total time) / N.
The shard helper isn't standard — projects roll their own or use a CI feature. The principle: tests are embarrassingly parallel at the package level, fully parallel at the subtest level. Use both axes.
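One way to roll your own, as a minimal sketch (the flag names and the shardSkip helper are hypothetical; it assumes the standard flag, hash/fnv, and testing packages): hash the test name and skip anything that belongs to a different shard.

var (
    shardIndex = flag.Int("shard-index", 0, "index of this CI worker")
    shardTotal = flag.Int("shard-total", 1, "total number of CI workers")
)

// shardSkip skips the calling test unless its name hashes to this worker's shard.
func shardSkip(t *testing.T) {
    t.Helper()
    h := fnv.New32a()
    h.Write([]byte(t.Name()))
    if h.Sum32()%uint32(*shardTotal) != uint32(*shardIndex) {
        t.Skip("assigned to another shard")
    }
}

Each test calls shardSkip(t) first, and CI passes the two flags to the test binary, e.g. go test ./... -args -shard-index=2 -shard-total=8.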
18. Trade test count for test depth¶
A test that runs 1000 iterations of one scenario is one CI run. A test that runs 100 iterations of one scenario, plus runs -count=10 in CI, is the same total work but distributed across separate runs. The second form gives you statistical evidence that the test is stable across runs, not just that it survives one long run.
A useful CI pattern:
- Per-PR: go test -race -count=3.
- Nightly: go test -race -count=100.
- Weekly: go test -race -count=1000.
The nightly run catches the once-in-100 flakes that the per-PR run misses. The weekly run catches the once-in-10000 flakes that betray deeper bugs.
19. Optimising means leaving tests alone first¶
The biggest "optimisation" of concurrent tests is not breaking them in pursuit of speed. A test that completes in 1 second but is flaky is worse than a test that takes 5 seconds and is rock-solid. Order of operations:
- Eliminate time.Sleep.
- Use synctest or polling deadlines.
- Use parallel subtests where safe.
- Tighten timeouts.
- Cache fixtures.
- Shard the suite.
Steps 1–2 also speed things up by removing dead waits. Steps 3–6 are pure throughput improvements. Never start with step 6 — fix the flakes first.
20. Summary¶
The optimisation toolkit, ordered by typical impact:
| Lever | Typical impact |
|---|---|
| Remove time.Sleep | Tests run as fast as the underlying work |
| t.Parallel everywhere safe | Wall-clock time falls by a factor of nproc |
| synctest for time-based logic | Time-driven tests run in microseconds |
| Cache fixtures in TestMain | Eliminates per-test setup cost |
| Tighten WaitTimeout and goleak retries | Saves seconds per test on the failure path |
| Shard the suite across CI workers | Linear speedup in CI wall time |
| Trade count=1000 per test for count=10 × 100 runs | Same coverage, parallelised |
| Pre-spawn worker pools (benchmarks only) | Eliminates spawn overhead |
Apply these in order. The first three give 90% of the speedup. The rest is fine-tuning for repos that already run their tests in 60 seconds and want them under 10.