
Sleep for Synchronization — Interview Questions

Format

Questions are graded [J] (junior), [M] (middle), [S] (senior), [P] (professional). Each question includes a model answer and discussion of common wrong answers.


Q1 [J]. What is wrong with this test?

func TestWorker(t *testing.T) {
    w := NewWorker()
    go w.Run()
    time.Sleep(100 * time.Millisecond)
    if w.Status() != "running" {
        t.Fatal("not running")
    }
}

Model answer. Four issues:

  1. The 100ms is a guess; on a slow CI runner the worker may not have started yet.
  2. Even when the test passes, it costs 100ms of wall-clock time it did not need.
  3. The test cannot be reasoned about — it relies on timing rather than on a deterministic event.
  4. There is no synchronization between the goroutine's writes (to w.status) and the main goroutine's read, which is also a data race.

Wrong answer to avoid. "The duration is too short, bump it to 1s." Tuning the sleep does not fix any of the four issues; it papers over them.


Q2 [J]. How would you refactor the previous test to be deterministic?

Model answer. Use a notification channel that the worker closes when it has entered the "running" state:

func TestWorker(t *testing.T) {
    w := NewWorker()
    ready := make(chan struct{})
    w.OnReady(func() { close(ready) })
    go w.Run()
    select {
    case <-ready:
    case <-time.After(2 * time.Second):
        t.Fatal("worker did not become ready in 2s")
    }
    if w.Status() != "running" {
        t.Fatal("not running")
    }
}

The time.After(2 * time.Second) is a safety timeout (a backstop for hung implementations), not a synchronisation duration.


Q3 [J]. What is sync.WaitGroup used for and how does it differ from time.Sleep?

Model answer. sync.WaitGroup is a counter of in-flight tasks. Add(n) increments, Done decrements, Wait blocks until the counter reaches zero. It is the precise primitive for "wait until N goroutines are done".

It differs from time.Sleep in every way:

  • Deterministic: returns exactly when the counter is zero, not after an arbitrary duration.
  • Memory-safe: establishes happens-before from Done to Wait return.
  • Fast: returns immediately when the work is done; no extra wait.

Wrong answer to avoid. "WaitGroup is for big workloads; small ones can use sleep." There is no size threshold; sleep is wrong at every scale.


Q4 [J]. Identify the bug:

var wg sync.WaitGroup
go func() {
    wg.Add(1)
    defer wg.Done()
    work()
}()
wg.Wait()

Model answer. wg.Add(1) is called inside the goroutine. There is a race between wg.Wait() in the main goroutine and the Add(1) in the spawned goroutine. If Wait() runs first (before Add), it sees a counter of 0 and returns immediately, possibly before work() even starts.

The fix: call Add(1) before go:

var wg sync.WaitGroup
wg.Add(1)
go func() {
    defer wg.Done()
    work()
}()
wg.Wait()

Q5 [J]. When is time.Sleep acceptable in production code?

Model answer. A small list:

  • Backoff between retries (where the wait is the point).
  • Throttle/rate-limit (when not using a proper rate limiter).
  • Polling intervals for external systems that have no notification API.
  • Demo / one-off scripts.

It is never acceptable as a synchronisation primitive between goroutines.


Q6 [J]. What is the difference between time.Sleep and time.After?

Model answer. time.Sleep(d) blocks the current goroutine for at least d. time.After(d) returns a channel that receives the current time after at least d; the caller does not block until they receive from the channel. time.After is useful in select statements where you want a timeout alongside other cases.

select {
case x := <-ch:
    use(x)
case <-time.After(d):
    timeout()
}

Q7 [M]. Why is time.Sleep bad in a _test.go file but sometimes OK in production code?

Model answer. In tests, sleep is used as a synchronisation primitive — "wait until the goroutine is done" — for which there is always a deterministic alternative (channel, WaitGroup, synctest). The test version is racy and slow.

In production, time.Sleep is sometimes used as a duration primitive — "wait this long before the next attempt" — which is legitimate when the duration itself is the semantics (backoff, throttle). The problem in tests is the conceptual misuse, not the function itself.


Q8 [M]. Refactor this test to use synctest:

func TestCacheExpires(t *testing.T) {
    c := NewCache(100 * time.Millisecond)
    c.Set("k", "v")
    time.Sleep(150 * time.Millisecond)
    if _, ok := c.Get("k"); ok {
        t.Fatal("entry should have expired")
    }
}

Model answer.

func TestCacheExpires(t *testing.T) {
    synctest.Test(t, func(t *testing.T) {
        c := NewCache(100 * time.Millisecond)
        c.Set("k", "v")
        time.Sleep(150 * time.Millisecond) // virtual time
        if _, ok := c.Get("k"); ok {
            t.Fatal("entry should have expired")
        }
    })
}

The body is nearly identical; inside the bubble, the 150ms sleep is virtual and the test runs in microseconds.


Q9 [M]. What is errgroup and when do you use it?

Model answer. errgroup.Group is WaitGroup + first error + context cancellation. Use it when running N independent tasks and the whole batch should fail (with the first error) if any task fails.

g, ctx := errgroup.WithContext(ctx)
for _, item := range items {
    item := item
    g.Go(func() error { return process(ctx, item) })
}
if err := g.Wait(); err != nil {
    return err
}

g.Wait() returns the first non-nil error; the context is cancelled when any goroutine returns an error, signalling the others to bail.


Q10 [M]. Why is "tuning the sleep duration" never a real fix?

Model answer. Four reasons:

  1. The duration must be longer than the slowest possible run, but the slowest run is unbounded.
  2. Tuning hides the underlying race rather than fixing it; a future slowdown re-introduces the failure.
  3. Every passing run wastes the tuned duration even though the actual work was faster.
  4. When the test does fail, the failure mode is "sleep too short", which leads engineers to bump the sleep further rather than investigate the real cause.

Q11 [M]. Implement a polling helper similar to testify/assert.Eventually.

Model answer.

func Eventually(t *testing.T, total, step time.Duration, what string, cond func() bool) {
    t.Helper()
    deadline := time.Now().Add(total)
    for time.Now().Before(deadline) {
        if cond() {
            return
        }
        time.Sleep(step)
    }
    t.Fatalf("timed out after %s waiting for %s", total, what)
}

The time.Sleep(step) here is bounded and small (~10ms typical), and the function returns as soon as cond() is true. This is the correct use of polling.
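
Hypothetical usage against the Q1 worker (assuming w and its Status method from that example, and that Status is internally synchronised):

Eventually(t, 2*time.Second, 10*time.Millisecond, "worker running",
    func() bool { return w.Status() == "running" })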


Q12 [M]. What happens if you call Wait on a sync.WaitGroup whose counter is 0?

Model answer. Wait returns immediately. This is by design; it is not an error to wait on an empty group.

A common pitfall: if Add is called after Wait starts, the behaviour is undefined (race). Always call Add before the corresponding Wait.


Q13 [M]. Refactor a sleep-based rate limiter test:

func TestLimiter(t *testing.T) {
    l := NewLimiter(2, time.Second)
    if !l.Allow() || !l.Allow() { t.Fatal() }
    if l.Allow() { t.Fatal() }
    time.Sleep(1100 * time.Millisecond)
    if !l.Allow() { t.Fatal() }
}

Model answer. Inject a clock interface:

func TestLimiter(t *testing.T) {
    clk := clockwork.NewFakeClock()
    l := NewLimiter(clk, 2, time.Second)
    if !l.Allow() || !l.Allow() { t.Fatal() }
    if l.Allow() { t.Fatal() }
    clk.Advance(time.Second + time.Millisecond)
    if !l.Allow() { t.Fatal() }
}

Or use synctest:

func TestLimiter(t *testing.T) {
    synctest.Test(t, func(t *testing.T) {
        l := NewLimiter(realClock{}, 2, time.Second)
        if !l.Allow() || !l.Allow() { t.Fatal() }
        if l.Allow() { t.Fatal() }
        time.Sleep(1100 * time.Millisecond)
        if !l.Allow() { t.Fatal() }
    })
}

Either way, the test runs in microseconds.


Q14 [M]. What is the difference between time.Tick and time.NewTicker?

Model answer. time.NewTicker(d) returns a *Ticker with C, Stop, and Reset. time.Tick(d) returns only the channel and discards the *Ticker, so the ticker can never be stopped or reset. Before Go 1.23 such a ticker also leaked for the lifetime of the program; since Go 1.23 unreferenced tickers can be garbage collected, but you still lose Stop and Reset. Prefer NewTicker (with defer t.Stop()) in any non-trivial code.
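
A minimal sketch of the NewTicker pattern just described (the helper name and ctx-based shutdown are illustrative):

func pollEvery(ctx context.Context, d time.Duration, poll func()) {
    t := time.NewTicker(d)
    defer t.Stop() // release the ticker when the loop exits
    for {
        select {
        case <-t.C:
            poll()
        case <-ctx.Done():
            return
        }
    }
}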


Q15 [M]. Explain the bug:

go func() {
    time.Sleep(time.Second)
    cancel()
}()
worker(ctx)

Model answer. The "cancel after one second" goroutine is using time.Sleep as a timer, which is fine, but the surrounding pattern is wrong. The cancellation happens on a fixed schedule independent of the worker's progress. Use context.WithTimeout:

ctx, cancel := context.WithTimeout(parent, time.Second)
defer cancel()
worker(ctx)

This is cleaner, has proper cancellation semantics, and is correctly testable.


Q16 [S]. Define "observable quiescence" and explain why it matters.

Model answer. Observable quiescence is the property that an external observer (a test, a caller, a monitoring tool) can determine that a system is at rest — no pending callbacks, no in-flight goroutines doing useful work, no timers about to fire. A system has observable quiescence if it exposes a function Wait whose return guarantees the rest state.

It matters because tests that cannot observe quiescence must guess (with time.Sleep) when work is done. Designing APIs with quiescence in mind eliminates the need for sleeps entirely.
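
A hedged sketch of what exposing quiescence can look like in an API (the Pool type and its methods are illustrative, one possible shape among many):

// Pool tracks in-flight jobs; Wait returning guarantees observable quiescence,
// provided callers do not Submit concurrently with Wait (see Q12).
type Pool struct {
    wg sync.WaitGroup
}

func (p *Pool) Submit(job func()) {
    p.wg.Add(1) // count the job before its goroutine starts
    go func() {
        defer p.wg.Done()
        job()
    }()
}

// Wait blocks until every submitted job has finished.
func (p *Pool) Wait() { p.wg.Wait() }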


Q17 [S]. How does testing/synctest advance virtual time?

Model answer. When every goroutine in a synctest bubble is "durably blocked" — parked on an operation that can only be unblocked by another bubble goroutine or by the virtual clock — the runtime:

  1. Finds the next-to-fire pending timer in the bubble.
  2. Sets the bubble's virtual clock to that timer's fire time.
  3. Fires the timer, which unblocks the parked goroutine.
  4. Re-runs the scheduler.

If all goroutines are durably blocked and there are no pending timers, the bubble is deadlocked and the test fails.


Q18 [S]. What is "durably blocked" and what is not?

Model answer. A goroutine is durably blocked when its unpark depends only on other bubble goroutines or on the virtual clock. Examples:

  • <-ch on a bubble-created channel: durably blocked.
  • mu.Lock() on a bubble mutex: durably blocked.
  • time.Sleep(d): durably blocked.

Not durably blocked:

  • Reading from a file descriptor.
  • net.Conn.Read.
  • A cgo call.
  • A receive on a channel created outside the bubble.

If any goroutine in the bubble is not durably blocked, the bubble cannot advance virtual time.


Q19 [S]. Design a retry library.

Model answer. The library should:

type Retrier struct {
    Clock       Clock
    Attempts    int
    BaseDelay   time.Duration
    MaxDelay    time.Duration
    Jitter      float64 // fraction
    Classify    func(error) bool
    OnAttempt   func(attempt int, err error, next time.Duration)
}

func (r *Retrier) Do(ctx context.Context, op func(context.Context) error) error {
    var lastErr error
    for i := 0; i < r.Attempts; i++ {
        if err := ctx.Err(); err != nil {
            return errors.Join(lastErr, err)
        }
        err := op(ctx)
        if err == nil { return nil }
        if !r.Classify(err) { return err }
        lastErr = err
        if i+1 == r.Attempts { break }
        next := r.backoff(i)
        if r.OnAttempt != nil { r.OnAttempt(i, err, next) }
        timer := r.Clock.NewTimer(next)
        select {
        case <-timer.Chan():
        case <-ctx.Done():
            timer.Stop()
            return errors.Join(lastErr, ctx.Err())
        }
    }
    return lastErr
}
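
The Clock used above is assumed to have roughly this shape (a sketch; the exact interface is a design choice, made so a fake clock can be injected in tests):

// Assumed clock abstraction: the real implementation delegates to the time
// package, a fake implementation advances manually in tests.
type Timer interface {
    Chan() <-chan time.Time
    Stop() bool
}

type Clock interface {
    Now() time.Time
    NewTimer(d time.Duration) Timer
}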

Critical properties:

  • Accepts a clock (testable).
  • Accepts a context (cancellable).
  • Exponential with jitter.
  • Capped delay.
  • Classifier for non-retryable errors.
  • Observability hook.

Q20 [S]. Explain "full jitter" vs "equal jitter" vs "decorrelated jitter" for backoff.

Model answer.

  • Equal jitter: delay = half + rand(0, half) where half = base * 2^attempt / 2. Spread is half the nominal interval.
  • Full jitter: delay = rand(0, cap) where cap = min(limit, base * 2^attempt). Spread is the entire interval.
  • Decorrelated jitter: delay = rand(base, prev * 3), capped. Smoother under sustained contention.

Full jitter is usually the recommended default (AWS Architecture Blog). It produces the best spread for retries across many clients and the lowest expected delay.
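
A minimal sketch of full jitter under the formulas above (the helper name and the use of math/rand/v2 are illustrative):

// fullJitter caps the nominal exponential delay, then draws uniformly from
// [0, cap) so concurrent retriers spread across the whole interval.
func fullJitter(base, limit time.Duration, attempt int) time.Duration {
    d := base << attempt // base * 2^attempt
    if d <= 0 || d > limit {
        d = limit // cap, and guard against shift overflow
    }
    if d <= 0 {
        return 0
    }
    return rand.N(d) // math/rand/v2: uniform in [0, d)
}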


Q21 [S]. Why is "1% test flake rate" disastrous at scale?

Model answer. For a build with N independent tests each with flake probability p, the build failure probability is 1 - (1-p)^N. For p = 0.01 and N = 500 tests, the build failure rate is 1 - 0.99^500 ≈ 99.3%. Even a 1% per-test flake produces a build failure on virtually every run.

CI retry policies hide the cost by retrying failures; this doubles failed-build runtime and masks the real flake rate. The fix is to eliminate sleeps, which makes per-test flake go to zero (deterministic).


Q22 [S]. How would you migrate a 5000-test codebase with 1200 time.Sleep calls in tests?

Model answer. Six phases:

  1. Discover: enumerate all sleeps with git grep; classify by replacement bucket (WG, CH, CTX, CLK, POLL).
  2. Standardise: publish coding standard + tutorial.
  3. Tool: lint rule, dashboard, pre-merge check.
  4. Migrate: split into per-team batches; allocate 10% engineer-time over 8 weeks.
  5. Enforce: ratchet-style lint; new sleeps blocked; existing have deadline.
  6. Postmortem: measure suite runtime, flake rate, developer sentiment.

Track KPIs: count of sleeps, p99 build time, flake rate.


Q23 [S]. Why does synctest sometimes hang?

Model answer. If any goroutine in the bubble is parked on a non-durable operation (file I/O, network, cgo), the runtime cannot determine that the bubble is quiescent. Virtual time cannot advance. The bubble hangs until the test deadline.

Common causes:

  • The test calls a function that uses net.Dial or os.Open.
  • The test uses a channel created outside the bubble.
  • A finalizer or signal-handler goroutine touches the bubble.

Diagnosis: dump all goroutine stacks; identify which one is parked on non-bubble I/O.


Q24 [S]. Critique this code:

func (s *Service) Start() {
    go s.consumer()
    go s.producer()
    go s.reaper()
}

Model answer. The API gives the caller no way to wait for the three goroutines to exit. Any test calling Start() followed by an assertion will need to time.Sleep. Promote to Run(ctx) error with structured concurrency:

func (s *Service) Run(ctx context.Context) error {
    g, ctx := errgroup.WithContext(ctx)
    g.Go(func() error { return s.consumer(ctx) })
    g.Go(func() error { return s.producer(ctx) })
    g.Go(func() error { return s.reaper(ctx) })
    return g.Wait()
}

Now Run returns when all three exit, and the caller (production or test) has a deterministic handle.


Q25 [S]. What is hedging and why does it help?

Model answer. Hedging issues a duplicate request after a fixed delay, cancelling whichever comes back later. It reduces p99 latency in tail-sensitive fan-outs because a slow backend on one path is bypassed by the duplicate on another.

func hedged(ctx context.Context, hedge time.Duration, do func(context.Context) (Out, error)) (Out, error) {
    ...
}

Hedging is acceptable use of time.After(hedge) because the duration is the SLO boundary (semantically meaningful), not a synchronisation guess.
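
One way the elided body might look (a hedged sketch, not a reference implementation; Out is whatever result type the caller uses):

func hedged(ctx context.Context, hedge time.Duration, do func(context.Context) (Out, error)) (Out, error) {
    type result struct {
        out Out
        err error
    }
    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // cancel whichever attempt is still in flight

    results := make(chan result, 2) // buffered so the losing attempt never blocks
    attempt := func() {
        out, err := do(ctx)
        results <- result{out, err}
    }

    go attempt()
    select {
    case r := <-results:
        return r.out, r.err // primary answered before the hedge delay
    case <-time.After(hedge):
        go attempt() // fire the duplicate request
    }
    r := <-results // first of the two attempts to finish wins
    return r.out, r.err
}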


Q26 [S]. What is a "thundering herd" and how does jitter prevent it?

Model answer. When many clients fail at once (e.g. service crash) and all retry at exactly the same time, the recovered service is hit with a massive simultaneous load and may crash again. This is a thundering herd.

Jitter spreads the retries over an interval, so the recovered service sees a more gradual ramp-up. Full jitter (uniform over [0, base * 2^attempt]) is the standard mitigation.


Q27 [P]. Walk through time.Sleep's implementation in the Go runtime.

Model answer.

  1. User calls time.Sleep(d).
  2. If d <= 0, return.
  3. The runtime reads nanotime() for the start time.
  4. Allocates (or reuses) a *runtime.timer struct with when = nanotime() + d.
  5. Inserts the timer into the current P's local timer heap (4-ary min-heap by when).
  6. Calls gopark with a callback that registers the timer.
  7. The goroutine state becomes Gwaiting with waitReasonSleep.
  8. The scheduler picks another runnable goroutine for this P.
  9. When the runtime's scheduler reaches the timer (via findrunnable checking the timer heap), it fires the timer.
  10. The fire callback calls goready(g), placing the goroutine on the local run queue.
  11. The goroutine is eventually scheduled and resumes after time.Sleep.

Q28 [P]. How would you measure test flakiness across a CI fleet?

Model answer. Three metrics:

  • Per-test failure rate: for each test, percentage of runs that fail (before retries). Compute over a 30-day window.
  • Build retry rate: percentage of builds that needed at least one retry.
  • CI total runtime: p50, p99 of build duration.

Store per-test pass/fail in a database. Compute the metrics nightly. Plot on a dashboard. Alert when any test crosses 1% rolling flake.


Q29 [P]. Design a flake-detector tool.

Model answer. A tool that:

  1. Runs the test suite N times (e.g. 100) with varying GOMAXPROCS and -cpu flags.
  2. Captures per-test pass/fail.
  3. Identifies any test with non-zero failure rate.
  4. Outputs a ranked list of suspect tests.

results := map[string]int{}
for i := 0; i < 100; i++ {
    // go test exits non-zero when any test fails; the error is expected here,
    // and the output is what we parse for failing test names.
    out, _ := exec.Command("go", "test", "-count=1", "./...").CombinedOutput()
    for _, fail := range parseFailures(out) { // parseFailures extracts failing test names
        results[fail]++
    }
}
printRankedByFailures(results) // highest failure count first

Q30 [P]. How does context propagation interact with sleep in a deep call stack?

Model answer. A context.Context flows down a call stack. At the leaf, code that sleeps should sleep cancellably:

select {
case <-time.After(d):
case <-ctx.Done():
    return ctx.Err()
}

If a deep callee uses time.Sleep(d) directly (no select), the context cancellation does not interrupt the sleep. The whole stack waits up to d before noticing the cancel. This is a frequent latency bug.

The fix: every sleep in a context-aware codebase must be wrapped in a select.
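
A reusable helper for this (a sketch; the name sleepCtx is illustrative), using time.NewTimer rather than time.After so the timer is released when cancellation wins:

// sleepCtx sleeps for d but returns early with ctx.Err() if the context is
// cancelled first.
func sleepCtx(ctx context.Context, d time.Duration) error {
    t := time.NewTimer(d)
    defer t.Stop()
    select {
    case <-t.C:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}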


Q31 [P]. What is the security implication of timing-dependent code?

Model answer. Timing side channels. If validateLogin takes 5ms when the user does not exist and 50ms when the user exists with wrong password, an attacker can enumerate valid usernames by measuring response times.

Mitigation:

  • Use constant-time comparison (crypto/subtle.ConstantTimeCompare) for secrets, as sketched after this list.
  • Apply a uniform time.Sleep (or better, a constant-time delay) regardless of outcome.
  • Combine with rate limiting on IP and account.
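
A sketch of the constant-time comparison from the first bullet (the helper name is illustrative):

// tokenMatches compares a presented token with the stored secret in constant
// time, so the duration does not reveal how many leading bytes matched. Note
// that ConstantTimeCompare returns immediately on unequal lengths, so compare
// fixed-length values such as HMACs or hashes.
func tokenMatches(presented, stored []byte) bool {
    return subtle.ConstantTimeCompare(presented, stored) == 1
}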

Q32 [P]. How do you keep the no-sleep rule alive after the engineers who introduced it leave?

Model answer.

  • Automation: lint rules, pre-merge checks, dashboards. The rule must enforce itself, not depend on memory.
  • Documentation: a clear, short guide in the repo. New hires read it.
  • Onboarding: include the rule in the onboarding checklist.
  • Periodic audits: schedule quarterly reviews of the dashboard.
  • Tribal storytelling: keep one or two "memorable" production incidents in shared memory. Postmortems should reference the rule.

Q33 [P]. What is the worst case wakeup latency for time.Sleep on Linux?

Model answer. Unbounded in principle. Practically:

  • Under no load: ~1µs.
  • Under moderate load (GOMAXPROCS busy): ~100µs to 1ms.
  • Under heavy load with GC pauses: 10-100ms.
  • Under extreme load (swap thrashing, scheduler saturation): seconds.

This is the second reason time.Sleep is not a sync primitive: even if the duration were right, the wake time is unbounded.
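
A quick, unscientific way to observe this is to measure how far past the requested duration the sleep actually returns:

start := time.Now()
time.Sleep(time.Millisecond)
fmt.Println("overshoot:", time.Since(start)-time.Millisecond) // grows with load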


Q34 [P]. Implement a deterministic test for time.AfterFunc(d, f).

Model answer.

func TestAfterFunc(t *testing.T) {
    synctest.Test(t, func(t *testing.T) {
        var ran atomic.Int32
        time.AfterFunc(time.Second, func() { ran.Add(1) })
        time.Sleep(2 * time.Second) // virtual
        synctest.Wait()
        if ran.Load() != 1 {
            t.Errorf("ran = %d, want 1", ran.Load())
        }
    })
}

Inside the bubble, time.AfterFunc schedules against virtual time; time.Sleep(2s) advances virtual time past the fire point; synctest.Wait() ensures the callback has completed before the assertion.


Q35 [P]. Critique this rate limiter:

type Limiter struct {
    mu sync.Mutex
    last time.Time
    interval time.Duration
}

func (l *Limiter) Wait() {
    l.mu.Lock()
    elapsed := time.Since(l.last)
    sleep := l.interval - elapsed
    l.mu.Unlock()
    if sleep > 0 {
        time.Sleep(sleep)
    }
    l.mu.Lock()
    l.last = time.Now()
    l.mu.Unlock()
}

Model answer. Bugs:

  1. Not testable: uses time.Now and time.Sleep directly. Inject a clock.
  2. No context: cannot be cancelled.
  3. Race: between releasing the first lock and acquiring the second, multiple goroutines can each observe "enough elapsed" and all proceed simultaneously.
  4. No jitter: thundering herd risk.
  5. Reinventing the wheel: golang.org/x/time/rate.Limiter does this correctly.

Suggested replacement:

limiter := rate.NewLimiter(rate.Every(interval), 1)
if err := limiter.Wait(ctx); err != nil {
    return err // context cancelled or deadline exceeded
}

Q36 [P]. Explain how time.Sleep interacts with runtime.LockOSThread.

Model answer. A goroutine that has called runtime.LockOSThread is pinned to its OS thread. When that goroutine sleeps:

  • The goroutine is parked normally.
  • The OS thread it is locked to is idle (cannot run other goroutines).
  • Other goroutines run on other Ps' threads.

In CGo-heavy programs (e.g. OpenGL or GUI toolkits where threads must own GL contexts), this wastes a thread per sleeping locked goroutine. The fix: avoid sleeping in locked goroutines; use channel-based coordination from a non-locked goroutine.


Q37 [P]. Why doesn't running go test -race catch sleep-based flakiness?

Model answer. The race detector catches data races: concurrent unsynchronised access to the same memory where at least one access is a write, and only races that actually occur during the instrumented run. A sleep-based test that fails because the producer has not run yet is an ordering problem, not a data race: the read simply completes before the write ever happens, so there is no concurrent unsynchronised access for the detector to flag.

The race detector is necessary but not sufficient. Removing sleeps is complementary, not redundant.


Q38 [P]. Postmortem: a deploy fails because a 30s sleep was bumped to 45s but pods took 60s to be ready. What is the root cause?

Model answer. Two compounding causes:

  1. Primary: deploy script uses sleep to "wait for pods ready" instead of a real readiness probe (kubectl rollout status or watch).
  2. Secondary: the chosen duration is a guess that proves wrong under unusual cluster load.

The fix is structural: replace the sleep with a wait-for-ready API. Bumping the duration is a coverup; the next bad day will exceed any chosen number.


Q39 [P]. Compare clockwork and synctest.

Model answer.

  • clockwork: third-party library. A fake Clock interface implementation. Production code must accept a Clock parameter. Works on any Go version. Requires explicit Advance and BlockUntil calls in tests.
  • synctest: standard library (experimental behind GOEXPERIMENT=synctest in Go 1.24, stable with synctest.Test in Go 1.25). Runs the test in a bubble with virtual time. Production code can use time.Now etc. directly; the bubble intercepts. Time advances automatically when all bubble goroutines are durably blocked.

Use clockwork for backwards compatibility or when you cannot bubble (e.g. tests that cross OS boundaries). Use synctest for clean, pure-Go tests on modern Go.


Q40 [P]. Final question — leadership scenario. Your team has been told to "just turn off the lint rule" because a critical deploy is blocked by it. What do you do?

Model answer.

  1. Triage: examine the specific failing diff. Is it a single legitimate sleep (negative assertion, etc.)? If yes, add a //nolint:nosleep // reason annotation and proceed. The lint rule has an explicit escape hatch.
  2. Refuse a global disable: turning off the rule resets the entire programme. Push back firmly.
  3. Document the decision: if a sleep is added under pressure, file a follow-up ticket with deadline.
  4. Postmortem: was the rule too aggressive (false positives causing friction)? Was the team unprepared (training gap)? Update the rule or the training.
  5. Reaffirm: communicate that the rule stands, with explicit allowlist for justified cases.

The point is: the rule is a guard rail, not a wall. When it blocks something legitimate, narrow the rule; when it blocks something illegitimate, hold the line.


Summary

These 40 questions cover the full range from "what is wrong with this test" (junior) through "design a retry library" (senior) to "how do you lead an org-wide programme" (professional). A candidate who can answer the J and M tiers comfortably is hireable for IC roles. A candidate who can answer S is a senior engineer. A candidate who can answer P is a staff/principal-level engineer or engineering manager.

Use these as conversation starters in interviews, not as memorisation drills. The right answer often depends on context, and the discussion of trade-offs is more informative than the final conclusion.