Sleep for Synchronization — Senior Level¶
Table of Contents¶
- Introduction
- Observable Quiescence As An API Contract
- Structured Concurrency Versus Sleep-Coupled Spaghetti
- Deterministic Time Control: A Deep Dive Into testing/synctest
- Clock Abstractions At Scale
- Race-Free Coordination Patterns
- Retry And Backoff Done Right
- Jitter: The Mathematics Of Avoiding Thundering Herds
- Eradication: Removing Sleep From A Legacy Test Suite
- Designing APIs That Resist Reintroduction
- Architectural Consequences Of Forbidding Sleep
- Probabilistic Reasoning About Test Flakiness
- Integration Tests: Where The Rules Bend
- Containerised Tests And Wall-Clock Reality
- Race Detector Limitations Around Time
- Goroutine Leak Detection Under Test
- Testing Long-Running Background Jobs
- Distributed Systems And The Sleep Smell
- Reviewing PRs At Scale
- Edge Cases And Advanced Pitfalls
- Common Senior-Level Mistakes
- Test
- Tricky Questions
- Cheat Sheet
- Self-Assessment Checklist
- Summary
- Further Reading
Introduction¶
Focus: "I have led a team. How do I eradicate
time.Sleep-as-synchronisation from a 200K-line codebase without breaking everyone, and how do I design code so the discipline survives turnover?"
At the middle level we catalogued the replacements: WaitGroup, channels, context, Cond, errgroup, fake clocks, synctest. At the senior level the question shifts: not "what do I write instead of time.Sleep?" but "how do I get an entire codebase, team, and CI pipeline to stop reaching for time.Sleep in the first place, and how do I keep them from reaching back?"
This requires:
- Treating observable quiescence — the ability to externally tell that a system is at rest — as a first-class API design property.
- Understanding the runtime semantics of testing/synctest precisely enough to debug it when it misbehaves.
- Designing retry, backoff, and rate-limiting libraries so that callers cannot accidentally re-introduce sleep.
- Building a migration plan: how to remove sleeps from existing tests in priority order, how to measure progress, how to prevent regressions.
- Reasoning about flakiness probabilistically: why a "1% flake" is much worse than it sounds, why retrying does not save you, and how to set policies that align incentives.
After reading this file you will:
- Articulate "observable quiescence" as a property of an API.
- Read the source of testing/synctest and explain bubble semantics, durable blocking, and clock advancement.
- Design retry libraries that are race-free, jittered, deterministic to test, and impossible to misuse.
- Plan and execute a multi-quarter migration to eradicate sleep-based tests from a large codebase.
- Lead code review across a team so the anti-pattern stops accumulating.
- Recognise where the rules genuinely bend (integration tests with real wall-clock dependencies) and where they do not.
Observable Quiescence As An API Contract¶
The single most important senior-level idea about sleep-for-sync is this: the absence of time.Sleep in your tests is downstream of a property of your APIs, not a property of your tests. If the APIs do not let you tell when work is done, the tests will sleep no matter how disciplined the team is.
We call the property observable quiescence: from outside the system, with no inside knowledge of goroutines or internal state, you can definitively answer "is this system at rest right now, or is work still in flight?"
A system has observable quiescence if there exists a single externally-callable function Wait such that after Wait() returns:
- All goroutines spawned by the system have either finished or are parked on an external input.
- All pending callbacks have been invoked.
- All buffered work has been drained or cancelled.
- No timer is about to fire and change state.
If you cannot write such a Wait for your own code, the code lacks observable quiescence and tests against it will be flaky.
Concrete example: a debounce helper¶
type Debouncer struct {
d time.Duration
fn func()
mu sync.Mutex
tmr *time.Timer
}
func (d *Debouncer) Trigger() {
d.mu.Lock()
defer d.mu.Unlock()
if d.tmr != nil {
d.tmr.Stop()
}
d.tmr = time.AfterFunc(d.d, d.fn)
}
How do you test this? A naive author writes:
func TestDebouncer(t *testing.T) {
var calls int32
d := NewDebouncer(50*time.Millisecond, func() { atomic.AddInt32(&calls, 1) })
d.Trigger()
d.Trigger()
d.Trigger()
time.Sleep(100 * time.Millisecond)
if atomic.LoadInt32(&calls) != 1 {
t.Errorf("calls = %d, want 1", calls)
}
}
This test sleeps because Debouncer has no quiescence API. There is no way to ask "is the timer about to fire, and if so, has it fired yet?"
The fix is to add quiescence to the API:
type Debouncer struct {
// ...
fired chan struct{}
}
func (d *Debouncer) Trigger() {
d.mu.Lock()
defer d.mu.Unlock()
if d.tmr != nil {
d.tmr.Stop()
}
d.fired = make(chan struct{})
fired := d.fired
d.tmr = time.AfterFunc(d.d, func() {
d.fn()
close(fired)
})
}
func (d *Debouncer) WaitFired() <-chan struct{} {
d.mu.Lock()
defer d.mu.Unlock()
return d.fired
}
Now the test:
func TestDebouncer(t *testing.T) {
var calls int32
d := NewDebouncer(50*time.Millisecond, func() { atomic.AddInt32(&calls, 1) })
d.Trigger()
d.Trigger()
d.Trigger()
select {
case <-d.WaitFired():
case <-time.After(2 * time.Second):
t.Fatal("debouncer never fired")
}
if atomic.LoadInt32(&calls) != 1 {
t.Errorf("calls = %d, want 1", calls)
}
}
There is still a time.After(2 * time.Second) — but it is a safety timeout, not a synchronisation duration. The test runs as fast as the debouncer actually fires, which inside synctest is zero wall-clock time.
Quiescence as a public contract¶
You have to commit to quiescence in your API. If Debouncer.WaitFired exists but is documented as "private, for tests only", a future maintainer will remove it during a "cleanup" and your tests will regress. Promote quiescence to public API:
// WaitFired returns a channel that is closed when the most recently-scheduled
// debounced call has completed. Useful for graceful shutdown and testing.
func (d *Debouncer) WaitFired() <-chan struct{} { ... }
Many production callers will find legitimate uses for the API too: graceful shutdown, progress reporting, metrics.
When quiescence is hard¶
Some systems have no natural quiescence point. A stream processor that reads from Kafka and writes to Postgres is never "done" until you stop it. For these systems, bounded quiescence is the next best thing: "after the last input was submitted, processing has finished within N milliseconds." Expose a Flush(ctx) or Drain(ctx) method that returns when the in-flight work has cleared, and write tests against that.
The metric: number of tests that block on quiescence vs number that sleep¶
A senior engineer's hand-off to a team includes a measurable goal. The natural metric is:
(tests that wait on a typed quiescence signal) / (tests that wait via time.Sleep)
Track it over time. A healthy codebase trends to infinity (zero sleeps).
Structured Concurrency Versus Sleep-Coupled Spaghetti¶
A second senior-level idea: sleep-as-synchronisation is the natural failure mode of unstructured concurrency. Once you adopt structured concurrency — every goroutine has an explicit parent that knows how to wait for it — the sleeps disappear without active effort.
What is structured concurrency?¶
In structured concurrency, goroutines are spawned within a scope, and the scope is responsible for waiting on them before it returns. errgroup.Group is structured concurrency for Go: spawn N goroutines; g.Wait() is the closing brace of their lifetime.
func processBatch(ctx context.Context, items []Item) error {
g, ctx := errgroup.WithContext(ctx)
for _, it := range items {
it := it
g.Go(func() error { return process(ctx, it) })
}
return g.Wait()
}
This function is statically responsible for all the goroutines it spawned. After it returns, there are no in-flight goroutines associated with items. There is nothing for a test to sleep waiting for; if you have a result, the result is final.
Unstructured concurrency, where sleeps live¶
The opposite is "fire and forget":
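A minimal sketch of the shape (kickOff and its bare process helper are hypothetical names):
func kickOff(items []Item) {
	for _, it := range items {
		it := it
		go process(it) // spawned and forgotten: nobody can wait for these
	}
}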
The function returns instantly. Any caller who wants to know "are we done?" has to poll, sleep, or invent a side-channel. Tests against this function will sleep because there is no other option.
The cure is not "stop using sleep in the test"; the cure is "stop writing unstructured fan-out in production code".
Promoting unstructured to structured¶
Most fan-outs can be promoted to errgroup. The remainder (long-running daemons that should outlive the call) should be encapsulated in a Service type with a Run(ctx) method and a Done() channel:
type Service struct {
done chan struct{}
}
func (s *Service) Run(ctx context.Context) error {
defer close(s.done)
// ... long-running work ...
}
func (s *Service) Done() <-chan struct{} { return s.done }
Now the caller — whether a main or a test — has a structured handle on the service's lifetime.
Nesting structured scopes¶
Structured concurrency composes. A service that spawns child goroutines should manage them with an internal errgroup:
func (s *Service) Run(ctx context.Context) error {
defer close(s.done)
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error { return s.consumer(ctx) })
g.Go(func() error { return s.producer(ctx) })
g.Go(func() error { return s.reaper(ctx) })
return g.Wait()
}
When Run returns, all three sub-goroutines have exited. The outer caller sees a single, atomic "done" signal via <-s.Done(). No layer needs to sleep.
Anti-pattern: orphaned goroutines¶
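The typical offender looks like this (a sketch; the method names are assumed):
func (s *Service) Start() {
	go s.consumer() // no handle, no owner
	go s.producer() // no handle, no owner
}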
This API gives the caller no way to wait for consumer and producer to exit. Tests will sleep. Promote Start to Run(ctx) error and the design issue goes away.
Deterministic Time Control: A Deep Dive Into testing/synctest¶
testing/synctest (experimental in Go 1.24, stable in the standard library as of Go 1.25) is the cornerstone of senior-level deterministic time testing. To use it effectively at scale you must understand its internals, not just its surface API.
The bubble¶
A bubble is a goroutine tree spawned by synctest.Test(t, func(t *testing.T) { ... }). Inside the bubble:
- time.Now() returns the bubble's virtual clock, which starts at midnight UTC on a fixed date.
- time.Sleep(d) parks the goroutine until the bubble's virtual clock advances by d.
- time.NewTimer(d), time.After(d), and time.AfterFunc(d, f) all schedule against virtual time.
- time.NewTicker(d) ticks on virtual time.
Goroutines spawned inside the bubble are members of the bubble. Goroutines spawned by the test that started the bubble are not members (unless explicitly included). The boundary matters because the runtime needs to know which goroutines must be durably blocked before virtual time can advance.
Durable blocking¶
A goroutine is durably blocked if it is parked on an operation that only goroutines inside the bubble can unblock. Concretely:
- <-ch where ch is created inside the bubble: durably blocked.
- <-ch where ch is created outside the bubble (for example, a channel fed by OS-level I/O): not durably blocked.
- mu.Lock() on a mutex used only inside the bubble: durably blocked.
- time.Sleep(d) inside the bubble: durably blocked (virtual time can unblock it).
- Reading from a file descriptor: not durably blocked.
If any bubble goroutine is not durably blocked, virtual time does not advance. This is why synctest works for pure-Go logic but does not work for code that crosses into OS-level I/O.
Advancement rule¶
When every goroutine in the bubble is durably blocked, the runtime:
- Finds the next timer to fire (the smallest time.Until(fire)).
- Advances virtual time to that point.
- Fires the timer.
- Re-runs the scheduler. If the firing goroutine makes progress, it may unblock others, and the cycle repeats.
If there are no pending timers and every goroutine is durably blocked, the bubble has reached a deadlock and synctest fails the test with a clear message.
synctest.Wait()¶
synctest.Wait() blocks the current goroutine until every other bubble goroutine is durably blocked. It is the deterministic equivalent of "wait for the system to be at rest" — a programmatic quiescence barrier.
Use synctest.Wait():
- Before observing state. Without it, you might race with a still-running producer.
- Before a negative assertion. After Wait, if the channel is still empty, you know no event is pending.
- To break a deadlock diagnostic. If Wait itself blocks forever (or fails with "deadlock"), you have a missing producer.
Example: pub-sub with synctest¶
func TestBus_FanOut(t *testing.T) {
synctest.Test(t, func(t *testing.T) {
b := NewBus()
var got1, got2 []string
b.Subscribe("topic", func(m string) { got1 = append(got1, m) })
b.Subscribe("topic", func(m string) { got2 = append(got2, m) })
b.Publish("topic", "a")
b.Publish("topic", "b")
synctest.Wait()
if !reflect.DeepEqual(got1, []string{"a", "b"}) {
t.Errorf("got1 = %v, want [a b]", got1)
}
if !reflect.DeepEqual(got2, []string{"a", "b"}) {
t.Errorf("got2 = %v, want [a b]", got2)
}
})
}
synctest.Wait() ensures that both subscriber goroutines have processed "a" and "b" before the assertion. Without it, the test would race the subscribers.
Example: retry with synctest¶
func TestRetry_VirtualClock(t *testing.T) {
synctest.Test(t, func(t *testing.T) {
var attempts []time.Time
op := func() error {
attempts = append(attempts, time.Now())
return errors.New("fail")
}
start := time.Now()
_ = Retry(op, 3, time.Second)
synctest.Wait()
if len(attempts) != 3 {
t.Fatalf("attempts = %d, want 3", len(attempts))
}
if d := attempts[1].Sub(attempts[0]); d != time.Second {
t.Errorf("gap 1 = %s, want 1s", d)
}
if d := attempts[2].Sub(attempts[1]); d != 2*time.Second {
t.Errorf("gap 2 = %s, want 2s", d)
}
if elapsed := time.Since(start); elapsed != 3*time.Second {
t.Errorf("total = %s, want 3s", elapsed)
}
})
}
The test asserts on virtual durations: exactly 1s, exactly 2s, exactly 3s total. There is no fuzziness because there is no real time involved.
Limitations and gotchas¶
- time.AfterFunc callbacks run on a bubble goroutine. They are scheduled by the runtime, so the callback must be cooperative; if it blocks on something outside the bubble, the bubble loses quiescence.
- Channels created outside the bubble are not durably blocking inside it. A common bug is passing a chan from the test fixture into the bubble and expecting synctest.Wait() to consider receivers on it durably blocked.
- runtime.GC() and runtime.Gosched() are no-ops for synctest's quiescence model. They yield but do not durably block.
- http.DefaultTransport uses net.Dial, which crosses out of the bubble. Real network calls cannot run inside a bubble. Use httptest.Server cautiously; even it spawns OS-level goroutines.
- The bubble's virtual clock starts at a fixed date. If your code asserts on absolute calendar dates, you may have to adjust expectations.
Debugging "stuck" bubbles¶
Symptom: synctest.Test hangs, then fails with "deadlock" or "all goroutines durably blocked but no timer pending".
Diagnostic steps:
- Print runtime.NumGoroutine() before and after entering the bubble.
- Use runtime.Stack to dump all goroutines and inspect what each is blocked on.
- Check for goroutines blocked on external operations (e.g. inside net.Conn.Read).
- Check for tickers that haven't been stopped.
- Check for select { default: ... } busy loops that prevent durable blocking.
A well-instrumented test fixture prints stack traces on deadlock. runtime.SetBlockProfileRate(1) plus pprof.Lookup("block") is your friend.
Clock Abstractions At Scale¶
For codebases that cannot use synctest (older Go, integrations with non-bubbled code, or simulation harnesses), a Clock interface is the alternative. At scale, the design of this interface matters.
Minimal interface¶
type Clock interface {
Now() time.Time
Sleep(time.Duration)
After(time.Duration) <-chan time.Time
NewTimer(time.Duration) Timer
NewTicker(time.Duration) Ticker
AfterFunc(time.Duration, func()) Timer
Until(time.Time) time.Duration
Since(time.Time) time.Duration
}
type Timer interface {
Chan() <-chan time.Time
Stop() bool
Reset(time.Duration) bool
}
type Ticker interface {
Chan() <-chan time.Time
Stop()
Reset(time.Duration)
}
Note Chan() <-chan time.Time rather than a struct field: the standard library's time.Timer exposes C, but we want an interface, and you cannot put a field on an interface. The wrappers do the trivial translation.
Why every time method goes through the clock¶
If even one path bypasses the clock, the abstraction leaks. The most common leak is time.Now() deep inside business logic. Audit ruthlessly:
git grep -nE 'time\.(Now|Sleep|After|NewTimer|NewTicker|AfterFunc|Until|Since|Tick)' -- '*.go' ':!*_test.go'
Every hit should either be on a clock.X(...) call (allowed) or in a type that explicitly accepts a Clock (allowed) or be flagged for refactor.
Wiring at the edge¶
Inject the Clock at the type's constructor:
func NewService(clk clockwork.Clock, ...) *Service {
return &Service{clk: clk, ...}
}
var defaultClock = clockwork.NewRealClock()
func NewServiceDefault(...) *Service {
return NewService(defaultClock, ...)
}
Production code uses NewServiceDefault; tests use NewService with a fake.
Lock-free clock reads¶
For hot paths, clk.Now() has a per-call cost (interface dispatch). If profiling shows it as a hotspot, cache the clock's value at the start of a batch:
func (s *Service) processBatch(items []Item) {
now := s.clk.Now()
for _, it := range items {
s.processItem(it, now)
}
}
This is a real performance win in throughput-critical paths and does not break the abstraction (the batch's logical "now" is a defensible single value).
Composition: clock + scheduler¶
Some code needs more than time.Sleep: it needs to schedule work for a future time. Compose clocks with a scheduler interface:
type Scheduler interface {
Schedule(time.Time, func()) (cancel func())
}
type ClockScheduler struct {
clk Clock
}
func (s *ClockScheduler) Schedule(when time.Time, fn func()) func() {
t := s.clk.AfterFunc(s.clk.Until(when), fn)
return func() { t.Stop() }
}
Tests pass a ManualScheduler that records pending callbacks and lets the test trigger them explicitly.
Race-Free Coordination Patterns¶
A senior engineer should be able to articulate and implement a small library of race-free coordination patterns from memory. Below are the canonical ones, with the failure modes they replace.
Pattern 1: One-shot ready signal¶
Use when: a long-running service needs to announce "I am ready" to its caller.
Failure mode replaced: go server.Run(); time.Sleep(d); client.Do(...).
type Service struct {
	addr  string
	ready chan struct{}
}
func (s *Service) Run() error {
	ln, err := net.Listen("tcp", s.addr)
	if err != nil {
		close(s.ready) // unblock waiters even on failure; they observe the error via Run
		return err
	}
	close(s.ready)
	return s.serve(ln)
}
func (s *Service) WaitReady(ctx context.Context) error {
select {
case <-s.ready:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
Pattern 2: Latch (close once, observe many)¶
Use when: many readers need to know about a single event ("shutdown initiated", "config loaded").
type Latch struct {
once sync.Once
ch chan struct{}
}
func NewLatch() *Latch { return &Latch{ch: make(chan struct{})} }
func (l *Latch) Trigger() { l.once.Do(func() { close(l.ch) }) }
func (l *Latch) Done() <-chan struct{} { return l.ch }
func (l *Latch) Fired() bool {
select {
case <-l.ch:
return true
default:
return false
}
}
Pattern 3: Generation counter for cancellable work¶
Use when: a service receives many "do this" requests and only the most recent matters.
type Generation struct {
mu sync.Mutex
gen uint64
}
func (g *Generation) New() (uint64, func() bool) {
g.mu.Lock()
g.gen++
cur := g.gen
g.mu.Unlock()
return cur, func() bool {
g.mu.Lock()
defer g.mu.Unlock()
return g.gen == cur
}
}
A worker captures its generation at start; before publishing the result, it checks isStill(). If the generation has changed, the result is discarded. No sleeps, no leaks.
Pattern 4: Done-channel with optional error¶
Use when: a goroutine has a single async result that may be an error.
type result struct {
val any
err error
}
func (s *Service) ProcessAsync(input In) <-chan result {
out := make(chan result, 1)
go func() {
v, err := s.process(input)
out <- result{val: v, err: err}
}()
return out
}
The buffer of 1 ensures the goroutine never blocks if the caller loses interest.
Pattern 5: Quiescence flush¶
Use when: a service does background work and you need to know when the queue is empty.
type Worker struct {
queue chan job
inflight sync.WaitGroup
}
func (w *Worker) Submit(j job) {
w.inflight.Add(1)
w.queue <- j
}
func (w *Worker) process() {
for j := range w.queue {
w.do(j)
w.inflight.Done()
}
}
func (w *Worker) Flush() { w.inflight.Wait() }
Flush is the quiescence API. Tests call it after submitting; they never sleep.
Pattern 6: Multi-stage barrier¶
Use when: N goroutines must all reach point X before any proceeds.
type Barrier struct {
	n       int
	release chan struct{}
	cur     int
	mu      sync.Mutex
}
func NewBarrier(n int) *Barrier {
	return &Barrier{
		n:       n,
		release: make(chan struct{}),
	}
}
func (b *Barrier) Wait() {
b.mu.Lock()
b.cur++
if b.cur == b.n {
close(b.release)
}
b.mu.Unlock()
<-b.release
}
(For real multi-cycle barriers, use a more elaborate scheme; this is the one-shot version.)
Pattern 7: Selectable timeout¶
Use when: you want to do something for up to a duration but bail on cancel.
func WaitUntil(ctx context.Context, d time.Duration, ch <-chan struct{}) error {
timer := time.NewTimer(d)
defer timer.Stop()
select {
case <-ch:
return nil
case <-timer.C:
return errors.New("timeout")
case <-ctx.Done():
return ctx.Err()
}
}
This is not time.Sleep. The duration is a bound, not a target. The fast path returns instantly when ch closes.
Retry And Backoff Done Right¶
Retry libraries are sleep-rich production code. A senior engineer designs them so the sleeps are testable, jittered, cancellable, and observable.
Anatomy of a correct retry library¶
type Retrier struct {
clk Clock
maxAttempts int
base, cap time.Duration
jitter float64
classify func(error) bool
onAttempt func(attempt int, err error, next time.Duration)
}
func (r *Retrier) Do(ctx context.Context, op func(context.Context) error) error {
var lastErr error
for i := 0; i < r.maxAttempts; i++ {
if err := ctx.Err(); err != nil {
return errors.Join(lastErr, err)
}
err := op(ctx)
if err == nil {
return nil
}
if !r.classify(err) {
return err
}
lastErr = err
if i+1 == r.maxAttempts {
break
}
next := r.backoff(i)
if r.onAttempt != nil {
r.onAttempt(i, err, next)
}
timer := r.clk.NewTimer(next)
select {
case <-timer.Chan():
case <-ctx.Done():
timer.Stop()
return errors.Join(lastErr, ctx.Err())
}
}
return lastErr
}
func (r *Retrier) backoff(attempt int) time.Duration {
d := r.base << uint(attempt)
if d > r.cap {
d = r.cap
}
if r.jitter > 0 {
f := 1 + (rand.Float64()*2-1)*r.jitter
d = time.Duration(float64(d) * f)
}
return d
}
Notes:
- op accepts the context so each attempt is cancellable.
- The wait between attempts goes through clk.NewTimer + select, not time.Sleep. This makes it both testable (with a fake clock) and cancellable (via ctx.Done).
- classify(err) decides whether the error is retryable. Default is "retry everything"; tests can opt into stricter classifiers.
- onAttempt is a hook for observability. Tests use it to assert exact retry counts and delays.
Retry as a higher-order function vs. a method¶
The example above is method-style. Many libraries (github.com/cenkalti/backoff/v4, etc.) expose backoff.Retry(op, policy) as a free function. Either works; the API style is taste. The senior-level question is whether the library forces correctness:
- Does it accept a context?
- Does it accept a clock?
- Does it provide a hook for observability?
- Does it support classification of errors?
If any answer is no, the library is fighting you and you will see time.Sleep re-emerge in tests.
Bounded vs unbounded retry¶
Unbounded retry (for { op(); sleep }) is a near-universal mistake. If the underlying problem is permanent — bad credentials, missing table, code bug — unbounded retry hot-loops forever and emits an error per attempt. Production logs flood, oncall paged.
Always cap retries by a count or a total deadline:
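A sketch of both caps, reusing the Retrier from above (the field values are illustrative):
// Cap by attempt count.
r := &Retrier{
	clk:         clk,
	maxAttempts: 5,
	base:        100 * time.Millisecond,
	cap:         5 * time.Second,
	classify:    func(error) bool { return true }, // retry everything
}

// Cap by total deadline: bound the whole retry loop with a context.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
err := r.Do(ctx, op)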
Per-attempt timeout¶
Each attempt should have its own timeout, separate from the total retry budget:
err := r.Do(ctx, func(ctx context.Context) error {
attemptCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
return doRequest(attemptCtx)
})
Otherwise a single slow request consumes the entire retry budget.
Idempotency¶
Retry only works for idempotent operations. If your op mutates state and may have partially succeeded, retrying creates duplicates. The library cannot enforce idempotency for you; document the requirement loudly and consider adding a precondition check.
Observability hooks¶
r.onAttempt = func(attempt int, err error, next time.Duration) {
metrics.Inc("retry.attempt", "attempt", strconv.Itoa(attempt))
log.Warn("retrying", "attempt", attempt, "err", err, "next", next)
}
In production this gives you metrics for free; in tests, the hook is the assertion target.
Jitter: The Mathematics Of Avoiding Thundering Herds¶
Backoff without jitter is broken even though it seems to work. Senior engineers know why.
The thundering herd¶
Imagine 10 000 clients all retry against a service that crashed at time t. Without jitter, every client retries at t + 1s. The service comes back at t + 0.5s and is immediately hit with 10 000 requests, knocking it over again. The "fix" propagates the outage.
With jitter, retries are spread over a window, and the service catches up gradually.
Equal jitter¶
Each retry waits uniformly in [delay/2, delay), spreading retries across the upper half of the nominal window.
Full jitter (AWS Architecture Blog recommendation)¶
Each retry waits uniformly in [0, delay). Spreads across the entire interval.
Full jitter is generally better: lower expected delay, better convergence under contention.
Decorrelated jitter¶
Each delay depends on the previous (capped). Smoother behavior under sustained contention.
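The three strategies as code (a sketch; rng is an injected *rand.Rand as discussed under "Testing jittered retries" below, and all delays are assumed positive):
// equalJitter waits in [delay/2, delay).
func equalJitter(rng *rand.Rand, delay time.Duration) time.Duration {
	half := delay / 2
	return half + time.Duration(rng.Int63n(int64(half)))
}

// fullJitter waits uniformly in [0, delay).
func fullJitter(rng *rand.Rand, delay time.Duration) time.Duration {
	return time.Duration(rng.Int63n(int64(delay)))
}

// decorrelatedJitter derives the next delay from the previous one,
// picking uniformly in [base, prev*3) and capping the result.
func decorrelatedJitter(rng *rand.Rand, base, prev, max time.Duration) time.Duration {
	upper := 3 * prev
	if upper <= base {
		upper = base + 1
	}
	d := base + time.Duration(rng.Int63n(int64(upper-base)))
	if d > max {
		d = max
	}
	return d
}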
Testing jittered retries¶
Jitter makes the delay non-deterministic, which sounds incompatible with deterministic tests. The solution: inject the random source.
type Retrier struct {
rng *rand.Rand
// ...
}
func NewRetrier(seed int64, ...) *Retrier {
return &Retrier{rng: rand.New(rand.NewSource(seed)), ...}
}
Tests pass a fixed seed; the random sequence is reproducible. Combined with a fake clock or synctest, the entire retry behavior is deterministic.
Picking a jitter ratio¶
Standard practice:
- For most clients: full jitter.
- For clients that must complete soon (e.g. user-facing latency): equal jitter with a tight cap.
- For very-large fleets (>100k clients): decorrelated jitter to avoid harmonic bursts.
Eradication: Removing Sleep From A Legacy Test Suite¶
When you join a team with 5000 tests and 800 time.Sleep calls in _test.go, the senior question is "how do I reduce that to zero without burning out and without missing tests".
Step 1: Measure¶
git grep -nE 'time\.Sleep\(' -- '*_test.go' | wc -l
git grep -nE 'time\.Sleep\(' -- '*_test.go' | awk -F: '{print $1}' | sort -u | wc -l
You now know how many sleeps and how many files. This is the debt baseline. Track it weekly in a dashboard.
Step 2: Classify¶
Read every time.Sleep and tag it with a replacement bucket:
- WG — WaitGroup join.
- CH — channel notification.
- CTX — context cancellation.
- CLK — fake clock or synctest.
- POLL — Eventually.
- KEEP — legitimate (negative assertion, etc.).
A spreadsheet of file:line → tag is the migration plan.
Step 3: Prioritise¶
Not all sleeps are equal. Prioritise by:
- Flakiness: which sleeps cause the most failed builds? CI metrics tell you. Tackle the noisiest first.
- Length: a 5-second sleep wastes more CI time than a 50ms one. Sort descending.
- Ownership: if one team's package contains 60% of the sleeps, work with that team; their refactor unblocks the most.
- Ease: WG and CH migrations are easier than CLK migrations. Tackle easy ones first to build momentum.
Step 4: Migrate in batches¶
Open one PR per related cluster of sleeps:
- "Replace WaitGroup-shaped sleeps in
pkg/worker(45 sleeps)". - "Introduce
Clockinterface inpkg/cacheand migrate 12 expiration tests". - "Adopt
testing/synctestinpkg/retry(8 tests)".
Each PR should be reviewable in 30 minutes. Bundling more makes review hard and merges riskier.
Step 5: Lint and forbid¶
After migration, add a lint rule (custom go-ruleguard rule, golangci-lint config, or a CI script) that fails the build if a new time.Sleep appears in _test.go outside an allowlisted set.
# pre-merge check
new_sleeps=$(git diff origin/main -- '*_test.go' | grep -E '^\+' | grep -cE 'time\.Sleep\(')
if [ "$new_sleeps" -gt 0 ]; then
echo "new time.Sleep in test files; see docs/no-sleep.md"
exit 1
fi
Pair the check with a documented escape hatch (e.g. a //nosleep:negative-assertion comment that the lint respects).
Step 6: Educate¶
Post a short writeup in your team's docs explaining the rule and the replacements. Include before/after examples from your own codebase. Make the rule discoverable by new hires.
Step 7: Maintain¶
Schedule a recurring 1-day audit (quarterly is plenty) to:
- Re-measure the sleep count.
- Investigate any allowlist additions since the last audit.
- Pick up any new patterns.
A team that does this religiously keeps the codebase clean indefinitely.
Designing APIs That Resist Reintroduction¶
Some APIs invite sleep-based tests. Identify and refactor them.
Invitation 1: APIs that return void¶
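A typical shape (a sketch; the names are hypothetical):
func (c *Cache) InvalidateAsync(key string) {
	go c.invalidate(key) // completion is unobservable from outside
}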
The call returns nothing, so any test must guess when the side effect completes. Fix it by one of:
- Returning a <-chan Result.
- Returning a future-like object (type Op interface { Done() <-chan struct{}; Err() error; Result() Out }).
- Accepting a callback.
Invitation 2: APIs that fire-and-forget callbacks¶
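A typical shape (a sketch; the names are hypothetical):
func (b *Bus) Publish(topic, msg string) {
	for _, sub := range b.subs[topic] {
		go sub(msg) // fire-and-forget callback
	}
}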
A test wants to know when subscribers have run. Either:
- Make Publish synchronous (run subscribers before return). Often the right call.
- Expose a WaitDeliveries method.
- Return a <-chan struct{} from Publish that closes when all subscribers complete.
Invitation 3: APIs that "start" without a "stop"¶
A Start() that spawns goroutines and returns gives callers no way to wait for them. Promote it to Run(ctx context.Context) error with structured shutdown, as in the Service examples earlier.
Invitation 4: APIs that use time.Now() internally¶
A test cannot fake "now" without injecting a clock. Refactor to accept a Clock. The change is mechanical and pays back forever.
Invitation 5: Long-running loops with no quiescence point¶
A test wants to know "did one iteration complete?" but there is no API. Add a hook:
type Service struct {
	interval       time.Duration
	afterIteration func() // optional hook, set by tests to observe loop progress
}
func (s *Service) Run(ctx context.Context) error {
	for {
		s.processOnce()
		if s.afterIteration != nil {
			s.afterIteration()
		}
		select {
		case <-time.After(s.interval):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
Or expose a metric counter and assert on it.
Architectural Consequences Of Forbidding Sleep¶
If your team adopts "no sleep in tests" as a hard rule, downstream effects appear in the code architecture, mostly for the better.
Effect 1: more interfaces, more injection¶
Every clock-touching component grows a Clock parameter. Every async API grows an "I am done" channel. This pushes the codebase toward dependency injection and away from globals. Some teams resist this as "too much ceremony"; the senior counter-argument is "the ceremony is the design".
Effect 2: explicit goroutine ownership¶
Every goroutine acquires an owner that can wait for it. Untracked goroutines disappear because tests cannot pass against them.
Effect 3: structured shutdown¶
Services accept a context and return when it cancels. There are no "kill it with -9" services, because tests demand graceful shutdown.
Effect 4: smaller, more focused tests¶
A test that wakes up at every event runs in microseconds. Engineers write more tests because each is cheap.
Effect 5: better production observability¶
The same channels and hooks that make tests deterministic make production code easier to monitor. Metrics, traces, and dashboards reuse the test surfaces.
Effect 6: forced honesty about time¶
A code path that needs "real" time (a TLS cert that expires at a fixed instant, an HTTP retry-after header) must declare so explicitly. You can no longer hide a time.Now() call deep inside business logic.
Probabilistic Reasoning About Test Flakiness¶
Senior engineers reason about flakiness as a probability, not as a binary "flaky / not flaky" state.
Single-test flake probability¶
Let p be the per-run failure probability of one test. A run that includes n such tests fails with probability 1 - (1-p)^n. For p = 0.001 (a "0.1%" flake) and n = 5000 tests, the run-failure probability is 1 - 0.999^5000 ≈ 0.99. A 0.1% flake rate produces a CI failure 99% of the time.
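A quick runnable check of that arithmetic (using math.Pow):
p, n := 0.001, 5000.0
runFailure := 1 - math.Pow(1-p, n)
fmt.Printf("run failure: %.3f\n", runFailure) // ≈ 0.993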
This is why even "rare" sleep-based flakes are catastrophic at scale. A team that retries on failure simply hides the cost.
Retry policies and stability¶
If CI retries each test once on failure, the per-test failure probability becomes p^2. For p = 0.001, the post-retry rate is 0.000001, and 1 - 0.999999^5000 ≈ 0.005. Retries make builds pass at a hidden cost: they double the runtime of failures and disguise the underlying bug.
Retry policies are an anaesthetic, not a cure. They should be paired with a "if a test failed twice in a row this week, file a bug" rule.
Detecting flakes statistically¶
Track per-test pass/fail history. Compute the rolling 30-day failure rate. Any test with p > 0.01% should be flagged for investigation.
A team with this dashboard catches a re-introduced sleep within days.
The Bayesian view¶
Even when a sleep-based test passes, the expected flakiness imposes a tax. If you observe "passed 999 times, failed once", you cannot pin down the true flake rate: the point estimate is 0.1%, but the 95% confidence interval runs from roughly 0.003% to 0.6%. You are betting on a parameter you cannot measure precisely.
Sleep-free tests, designed correctly, eliminate the timing failure mode outright; there is no flake probability left to estimate.
Integration Tests: Where The Rules Bend¶
Pure unit tests have no excuse for time.Sleep. Integration tests that hit real Postgres, real Kafka, real S3 have a slightly different story: the external system has timing semantics you do not control, and there is sometimes no notification API.
Acceptable patterns¶
- Eventual consistency: assert.Eventually(t, func() bool { return queryRow(...) != nil }, 10*time.Second, 100*time.Millisecond). Bounded poll with a generous total budget.
- Kafka catch-up: produce N messages, then consume until you have seen all N or you hit a 30-second timeout.
- Real HTTP: dial the server with a 5-second timeout; the connection should succeed in milliseconds in the happy path.
Even here, prefer:
- A dedicated test-fixture API on the integration point (e.g. mock.WaitForN(n)).
- A poll with a tight interval.
Reach for fixed-duration sleep only when the external system literally exposes no other observable, and document why.
Test categorisation¶
Tag integration tests separately (//go:build integration). Run them less often (e.g. nightly). Keep their flakiness budget separate from unit tests.
Per-test timeouts¶
Every integration test should have a per-test deadline (t.Deadline() or a context.WithTimeout). A hung external system should not block CI for hours.
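A sketch of deriving a per-test context from the test's deadline (the 5-second headroom is an arbitrary choice):
ctx := context.Background()
if deadline, ok := t.Deadline(); ok {
	var cancel context.CancelFunc
	// leave headroom so cleanup can run before the test binary times out
	ctx, cancel = context.WithDeadline(ctx, deadline.Add(-5*time.Second))
	defer cancel()
}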
Containerised Tests And Wall-Clock Reality¶
Tests that run inside containers (e.g. with Testcontainers-Go) interact with real services with real timing. The senior practice:
Use container-readiness probes¶
Testcontainers-Go has wait strategies (wait.ForLog, wait.ForListeningPort, wait.ForHTTP). These are quiescence APIs for containers. Use them instead of time.Sleep after container.Start().
req := testcontainers.ContainerRequest{
Image: "postgres:16",
ExposedPorts: []string{"5432/tcp"},
WaitingFor: wait.ForLog("database system is ready to accept connections"),
}
WaitingFor blocks Start() until the readiness condition holds. No time.Sleep needed.
Avoid sleeping for "warm-up"¶
A common mistake: start a container, then time.Sleep(5 * time.Second) "to let it warm up". The 5 seconds is a guess and is either too short (on slow CI) or wasteful (on fast CI). Use readiness probes.
Acceptable container sleeps¶
- Health-check intervals — the polling interval inside a wait strategy is a sleep, but it is bounded and lives inside a polling helper.
- Throttle for honest rate limits — if the container is genuinely rate-limited at 10 RPS, your test sleeps to respect the limit.
Race Detector Limitations Around Time¶
The Go race detector (-race) is excellent at finding data races, but it has blind spots around time-related anti-patterns.
What -race catches¶
- Concurrent unsynchronised reads/writes of the same memory location.
- Races involving time.Now() if multiple goroutines write a shared time.Time without synchronisation.
What -race does not catch¶
- Flakiness due to insufficient time.Sleep. A test that sleeps too short and reads stale state is not a data race; it just reads valid memory that is "stale" in domain terms.
- Goroutines that leak across tests. The race detector does not warn about a goroutine that outlives its test, only about races involving it.
- Time-ordering bugs where order matters but no shared memory is involved.
Implication: clean -race runs are not proof of correctness¶
A team that says "we run -race so our concurrent tests are fine" is mistaken about what -race guarantees. The corollary is that sleep-based tests can pass -race and still be flaky on the next run. Removing sleeps is complementary to running -race, not redundant.
Combining tools¶
For maximum coverage:
- Run unit tests with -race -count=10 regularly. The -count=10 runs each test 10 times, surfacing intermittent failures.
- Run with -cpu=1,2,4,8 to vary scheduling.
- Run with GOMAXPROCS=1 periodically; some races only show with serialised scheduling.
- Run with GODEBUG=gctrace=1 if you suspect GC pauses are uncovering races.
A test suite that passes all of the above and contains no time.Sleep is the gold standard.
Goroutine Leak Detection Under Test¶
Closely related to the sleep-for-sync problem: the goroutine leak. A test that sleeps may pass while leaving goroutines running, which the next test inherits.
uber-go/goleak¶
With its TestMain hook, goleak snapshots the goroutine list after the package's tests have run and fails if unexpected goroutines remain. This catches both the "forgot to cancel" and the "sleeping test left a worker behind" cases.
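A minimal package-level hookup (a sketch of goleak's documented entry point):
import (
	"testing"

	"go.uber.org/goleak"
)

func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m) // runs the tests, then fails if goroutines remain
}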
Test-local goleak¶
For finer control:
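A per-test sketch using goleak.VerifyNone (startWorker is a hypothetical helper):
func TestWorker_Shutdown(t *testing.T) {
	defer goleak.VerifyNone(t) // checked when this test returns
	w := startWorker(t)
	w.Stop()
}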
Caveats¶
- Goroutines that exit slowly: if the test ends before they finish, goleak fails even though there is no true leak (it retries briefly before failing to mitigate this).
- Background goroutines started by the test framework itself, which must be ignored explicitly.
Configure goleak with goleak.IgnoreTopFunction(...) to skip known framework noise.
Pairing with no-sleep¶
Once you remove sleeps, tests become deterministic; goleak then reliably catches leaks. Conversely, a sleep-based test makes goleak unreliable: sometimes the leaked goroutine exits before goleak runs, sometimes not.
Testing Long-Running Background Jobs¶
Cron-like jobs (every 5 minutes flush a buffer; every hour rotate logs) are sleep-heavy by nature. Test them with deterministic time.
Pattern: parameterise the interval¶
Production: 5 minutes. Test: 5 milliseconds with a fake clock. The interval is exposed as a configuration, but the test does not rely on the small interval to pass — the test rolls a fake clock forward.
Pattern: separate "tick" from "process"¶
type Job struct {
	clk       Clock
	ticker    Ticker
	OnProcess func() // optional test hook, invoked after each successful iteration
}
func NewJob(clk Clock, interval time.Duration) *Job {
	return &Job{clk: clk, ticker: clk.NewTicker(interval)}
}
func (j *Job) Run(ctx context.Context) error {
	defer j.ticker.Stop() // stop the ticker so it does not leak
	for {
		select {
		case <-j.ticker.Chan():
			if err := j.processOnce(ctx); err != nil {
				return err
			}
			if j.OnProcess != nil {
				j.OnProcess()
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
func (j *Job) ProcessNow(ctx context.Context) error {
return j.processOnce(ctx)
}
Exposing ProcessNow for tests means a test does not need to advance the clock at all to exercise processOnce; it calls it directly. The ticker is only relevant for the scheduling test, which uses a fake clock.
Pattern: assert "scheduled but not yet run"¶
func TestJob_RespectsInterval(t *testing.T) {
clk := clockwork.NewFakeClock()
j := NewJob(clk, time.Minute)
ran := make(chan struct{}, 10)
j.OnProcess = func() { ran <- struct{}{} }
go j.Run(context.Background())
clk.BlockUntil(1) // wait until the goroutine is parked on the timer
select {
case <-ran:
t.Fatal("ran before tick")
default:
}
clk.Advance(time.Minute)
select {
case <-ran:
case <-time.After(2 * time.Second):
t.Fatal("did not run after tick")
}
}
clk.BlockUntil(n) (a clockwork feature) waits until at least n goroutines are blocked on the fake clock. This is the quiescence primitive for the clockwork model.
Distributed Systems And The Sleep Smell¶
In distributed systems, time.Sleep smells different but follows the same anti-pattern shape.
"Sleep until the cluster heals"¶
Tests that bring up a 3-node Raft cluster and then time.Sleep(5 * time.Second) to "wait for leader election" are doing the same thing as a single-machine sleep-for-sync. The fix: query the cluster's election state via an admin API or wait for a LeaderChanged event.
"Sleep until eventual consistency"¶
A test that writes to one replica and reads from another, with a time.Sleep in between, is racy. Use the replica's "synced to LSN X" API or wait for a "replication caught up" event.
Wall-clock dependencies across nodes¶
NTP clock skew, leap seconds, and time-zone offsets are all real concerns. A senior engineer designs distributed systems to be tolerant of clock skew (logical clocks, vector clocks, hybrid logical clocks) precisely so that tests can avoid wall-clock-dependent assertions.
Chaos and time¶
Chaos-engineering tools (Chaos Mesh, Litmus) inject delays, network partitions, etc. Tests that observe behavior under chaos should not themselves rely on time.Sleep to "let the chaos take effect" — the chaos tool exposes its own state.
Reviewing PRs At Scale¶
A senior engineer reviews many PRs per week. A scalable approach to keeping sleep out:
Use a checklist¶
Pre-merge checklist that includes:
- No new time.Sleep in _test.go. Justified exceptions documented with a comment.
- Every spawned goroutine has a clear owner that waits on it.
- Any new clock-dependent code accepts a Clock or runs inside synctest.Test.
- Retry/backoff has jitter, a cap, and a context-cancellable wait.
Use automation¶
Lint rules catch most of the items mechanically; CI fails the PR. Reserve human review for the cases the lint flags as ambiguous (e.g. a negative assertion with time.Sleep).
Comment templates¶
Pre-written review comments save time:
"time.Sleep in tests is a sync anti-pattern. Replace with
<channel/WaitGroup/synctest>. Seedocs/no-sleep.md.""This goroutine has no observable exit. Please return a
<-chan struct{}fromRunor accept acontext.Contextso callers can wait.""
time.Now()here makes the function untestable. Please accept aClockparameter or usetime.Now()only at the API boundary."
Roll-back loud¶
When you find sleeps reintroduced, push back firmly and roll back the change rather than tolerating "we'll fix it later". Merged-but-unfixed almost always hardens into permanence.
Edge Cases And Advanced Pitfalls¶
Edge case: time.Sleep with runtime.Gosched¶
Replacing a sleep with a runtime.Gosched() loop produces a pure busy wait: the CPU stays pegged, which is worse than time.Sleep. Use a channel instead.
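The shape to avoid (a sketch; done is assumed to be an atomic.Bool flag):
for !done.Load() {
	runtime.Gosched() // yields the processor, but the loop still burns a core
}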
Edge case: nanosleep precision¶
On Linux, time.Sleep is implemented via nanosleep(2) or clock_nanosleep(2). Granularity is typically 1ms on most kernels, 100µs on tickless kernels. Sub-microsecond sleeps are impossible. If you need fine-grained timing for a benchmark, use time.Now() polling, not sleep.
Edge case: sleep across timezone change¶
time.Sleep(d) uses the monotonic clock; it is immune to DST and timezone changes. But time.Until(targetWallClockTime) is not; if the system clock jumps backward by an hour during DST end, your "sleep until 3am" call will sleep an extra hour.
Edge case: sleep with LockOSThread¶
In a CGo program with runtime.LockOSThread, the thread is dedicated to the goroutine. time.Sleep parks the goroutine, leaving the thread idle but unable to be reused for other goroutines. In a thread-pool-limited program this is starvation.
Pitfall: sleeping inside init¶
Always wrong. init runs at program startup; sleeping there blocks main from running. Whatever the author was waiting for, restructure.
Pitfall: sleeping in benchmark setup¶
func BenchmarkX(b *testing.B) {
setup()
time.Sleep(time.Second) // let setup "stabilise"
for i := 0; i < b.N; i++ {
// ...
}
}
The benchmark timer is on by default, and the sleep is included. Use b.ResetTimer() after the sleep, or remove the sleep entirely by using a real quiescence API.
Pitfall: assuming time.Sleep returns early on signal¶
In some languages (POSIX C), a signal interrupts nanosleep. Go masks signals at the runtime level; time.Sleep always sleeps the full duration. There is no "early return on signal" behavior.
Pitfall: time.Sleep in a hot-loop closure passed to a library¶
Library APIs that accept a callback may call it many times. A time.Sleep in the callback compounds: 1000 callbacks × 1ms sleep = 1 second of total sleep, often invisible to the author.
Common Senior-Level Mistakes¶
- Accepting "we'll add a test later" as a justification for an untestable API. The API will not get a test later; it will get a sleep-based test later, or no test.
- Tolerating a few sleeps "for pragmatism". Pragmatism is the slippery slope to a thousand sleeps. Hold the line.
- Letting one team set the standard for the others. If one team is allowed sleeps, others will follow. Apply the rule uniformly.
- Designing Clock as an opaque struct rather than an interface. You need to fake it; design for that.
- Using clockwork and synctest in the same test. They conflict. Pick one per test.
- Asserting on exact virtual durations without recognising scheduler effects. Even inside synctest, ordering of independent goroutines is nondeterministic; assert on what must be true, not on incidental orderings.
- Adding the lint rule before migrating. The rule fails every PR until migration is complete. Migrate first, then enforce.
- Forgetting to track flake metrics. A team that removes sleeps but doesn't measure flakiness cannot prove the discipline is working.
- Implementing custom rate limiting or retry instead of using golang.org/x/time/rate or a known retry library. Rolling your own usually means re-implementing the bugs.
- Confusing "deterministic" with "fast". Both are good but they are independent properties. Synctest is both; clockwork is deterministic but not necessarily faster than real time.
Test¶
Build the following deliverables. They cumulatively prove senior-level mastery.
- A Clock interface and clockwork-style fake. Implement Now, Sleep, After, NewTimer, NewTicker, AfterFunc, with Advance and BlockUntil helpers on the fake. Pass a property-based test suite that exercises edge cases (negative durations, concurrent timers, etc.).
- A retry library. Accepts a clock, supports exponential and decorrelated jitter, accepts a context, exposes an OnAttempt hook. Cover with deterministic tests using both synctest and clockwork. Cover ~10 scenarios: success first try, success after N retries, permanent failure, classifier rejection, context cancellation during wait, context cancellation during attempt, zero retries, retries exhausted, jitter distribution, observability hook called correctly.
- A debouncer with observable quiescence. Expose Trigger, Cancel, WaitFired. Test with synctest to assert that triggers within the window collapse to one fire and triggers across windows fire separately.
- A migration plan document. Pretend a 5000-test codebase has 800 sleeps. Write a 2-page plan including measurement, classification, prioritisation, batching, enforcement, and timeline.
- A lint rule. A custom go-ruleguard or golangci-lint rule that flags time.Sleep in _test.go outside an allowlist. Test it on a synthetic codebase with deliberate sleeps and verify it fires only where expected.
Tricky Questions¶
- "My test must run in CI inside Docker. The container's clock jumps when the host hibernates. How do I make it deterministic?" Use synctest so the virtual clock is independent of the container's wall clock. If you cannot, isolate the clock-dependent code behind a Clock interface and inject a controlled fake.
- "How do I test code that depends on time.AfterFunc with a callback that itself spawns goroutines?" Inside synctest, the callback runs on a bubble goroutine; its spawned goroutines are also in the bubble. synctest.Wait() waits for all of them to become durably blocked. The test then observes the state.
- "My retry library passes a context to op. If the context is canceled mid-attempt, should I count that as a retry or as the final error?" As the final error. Cancellation is a hard stop; retrying after cancel violates the cancellation contract.
- "How do I test that a rate limiter allows exactly 10 requests per second?" Inject a fake clock or use synctest. In a 1-second virtual window, call Allow() 11 times and assert that the 11th returns false. Advance virtual time and assert that the next call succeeds.
- "My code uses time.Tick (not NewTicker). Why is that wrong?" time.Tick returns a channel whose underlying *Ticker has no Stop. Before Go 1.23 the ticker leaked for the life of the program (since Go 1.23 the garbage collector can reclaim it, but the intent is still clearer without it). Prefer time.NewTicker and call Stop().
- "What is the right way to test 'this function does NOT make a network call'?" Inject an HTTP client (or RoundTripper) interface. The fake records calls. Assert that no call was made. No sleep needed.
- "My team says synctest is too new and risky. What do I say?" synctest shipped as an experiment in Go 1.24 and is stable in the standard library as of Go 1.25; it is the deterministic primitive for time-based testing. If "too new" is a blocker, fall back to clockwork, but make the case that within a release cycle synctest should be adopted.
- "How do I prove my no-sleep rule is paying off?" Track three metrics: (1) count of time.Sleep in _test.go (should trend to zero); (2) test-suite wall time (should trend down); (3) per-test flake rate (should trend to zero). Plot weekly.
- "My production code has a 30-second timeout that I want to test. How do I do that without waiting 30 seconds?" Either inject a Clock and advance it 30 seconds in 1µs, or use synctest so the 30-second wait is virtual.
- "Is there ever a case where time.Sleep is faster than the alternative?" Rarely. A runtime.Gosched() is faster but pegs CPU. A channel send/receive is comparable in cost to time.Sleep in microbenchmarks. For coarse-grained scheduling (millisecond+), the cost differences are negligible.
Cheat Sheet¶
Eradication SOP¶
- Measure: git grep -nE 'time\.Sleep\(' -- '*_test.go' | wc -l.
- Classify into WG | CH | CTX | CLK | POLL | KEEP.
- Prioritise by flakiness × duration × ease.
- Migrate in small PRs.
- Lint with go-ruleguard.
- Track the metric weekly.
synctest quick reference¶
- Run: synctest.Test(t, func(t *testing.T) { ... }).
- Wait: synctest.Wait() inside the bubble.
- Time: time.Now(), time.Sleep, etc. all use the virtual clock automatically.
- Pitfall: any OS-level blocking inside the bubble breaks quiescence.
Retry library checklist¶
- Accepts context.Context.
- Accepts Clock.
- Caps attempts and total budget.
- Uses full or decorrelated jitter.
- Exposes OnAttempt hook.
- Classifies errors.
Code-review red flags¶
- time.Sleep( in a _test.go.
- go service.Run() with no later <-service.Done().
- time.Now() deep in business logic.
- Retry loops without context, jitter, or cap.
Self-Assessment Checklist¶
You are senior-level on this topic when you can:
- Define "observable quiescence" and identify it (or its absence) in any given API.
- Explain testing/synctest's bubble, durable blocking, and clock advancement semantics precisely.
- Implement a Clock interface with Advance, BlockUntil, and full timer semantics.
- Write a retry library with context, clock, jitter, classification, and observability, fully tested in microseconds.
- Lead an eradication migration on a legacy codebase: measure, classify, batch, lint, maintain.
- Defend the rule in code review without rehashing the argument every time.
- Recognise where the rule legitimately bends (integration tests, real-world rate limits) and where it does not.
- Reason about flakiness as a probability and explain why retries are a coverup.
- Design APIs that resist reintroduction of sleep-as-synchronisation.
- Set up a CI pipeline that catches new sleeps automatically.
Summary¶
At the senior level the question is not "what do I write instead of time.Sleep?" but "how do I make sleep unnecessary, undesirable, and impossible across an entire codebase and team?"
The answer is a stack of design and process disciplines:
- Observable quiescence as an API contract.
- Structured concurrency as the default goroutine spawning style.
- Deterministic time via testing/synctest or a Clock interface.
- Race-free coordination patterns from a small known library.
- Correct retry and backoff with jitter, context, and observability.
- A migration plan to eradicate existing sleeps.
- Linting and code review to prevent re-introduction.
- Probabilistic reasoning about flakiness to justify the investment.
A senior engineer who internalises this stack can join any Go team and, within a quarter, halve their flaky-test count and double their unit-test speed. Within a year, they can produce a test suite where the word Sleep does not appear outside time.Sleep in production rate-limiting code.
The professional file picks this up at the scheduler and runtime level: how the Go scheduler implements time.Sleep, what synctest's implementation looks like under gdb, and how sleep behaves in hot paths under contention.
Further Reading¶
- The testing/synctest package source (src/testing/synctest/) — read it, not just the docs.
- Bryan Mills, "Rethinking Classical Concurrency Patterns" (GopherCon 2018) — much of the senior view comes from this talk.
- Russ Cox, "Notes on Concurrency" — sections on quiescence and structured concurrency.
- Marc Brooker, "Exponential Backoff and Jitter" (AWS Architecture Blog) — the canonical analysis of jitter mathematics.
- "The Tail at Scale" (Dean & Barroso, 2013) — why probabilistic reasoning about latency tails matters.
- github.com/jonboulle/clockwork source code — for an idiomatic Clock implementation.
- github.com/uber-go/goleak — goroutine leak detection.
- github.com/cenkalti/backoff/v4 — battle-tested retry library; read its retry.go.
- Neighboring subsection 06-context-package/04-context-trees/senior.md.
- Neighboring subsection 07-sync-package/05-cond/senior.md.
- Neighboring subsection 12-testing-concurrent-code/03-deterministic-tests/senior.md.