Work Stealing — Interview Questions¶
Table of Contents¶
- Junior Questions
- Middle Questions
- Senior Questions
- Staff / Principal Questions
- Whiteboard Tasks
- Behavioural / Design
Junior Questions¶
Q1: What is work stealing in the Go scheduler?¶
Expected: The runtime keeps a per-P local runqueue. When a P runs out of work, its M tries to steal half of another (randomly chosen) P's runnable goroutines instead of going to sleep. This keeps all CPUs busy without a central coordinator.
Q2: Why per-P queues instead of one global queue?¶
Expected: One global queue would require a global lock on every push and pop. With per-P queues, the owner pushes/pops without any lock. The synchronisation cost is paid only by thieves.
Q3: What does "half steal" mean?¶
Expected: When stealing, the thief takes ceil(N/2) of the victim's runqueue. Not less (the victim would still be overloaded), not more (the victim would run dry too soon). Half is the sweet spot validated by Cilk's analysis.
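The ceil(N/2) arithmetic can be written as a one-liner with integer division; a minimal sketch (the function name is illustrative, not a runtime API):

```go
package main

import "fmt"

// stealCount returns how many Gs a thief takes from a victim queue of
// length n: for non-negative integers, n - n/2 equals ceil(n/2).
func stealCount(n int) int {
	return n - n/2
}

func main() {
	for _, n := range []int{0, 1, 5, 256} {
		fmt.Println(n, "->", stealCount(n)) // 0->0, 1->1, 5->3, 256->128
	}
}
```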
Q4: What is the LRQ and what is its size?¶
Expected: Local Run Queue, per P, 256 slots, a circular array with head/tail indices. The owner pushes onto the tail and pops from the head (FIFO); thieves also pull from the head.
Q5: Where does the scheduler look for work before stealing?¶
Expected: First local LRQ, then global runqueue (GRQ), then netpoll. Stealing is the last resort before parking the M.
Q6: What is GOMAXPROCS?¶
Expected: The number of Ps (logical processors). Default is the number of CPU cores. It bounds the number of Gs running user code simultaneously.
Q7: Will my goroutine stay on the P that spawned it?¶
Expected: Usually starts there, but may be stolen by another P. Code should not assume P-affinity.
Q8: What is runtime.LockOSThread?¶
Expected: Pins a goroutine to its OS thread. Used for OS-thread-specific state (e.g., cgo libraries, OpenGL contexts). A locked G is not stealable; the M is reserved for it.
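A common pattern is to dedicate one goroutine (and therefore one OS thread) to thread-affine work. A minimal sketch, where runOnPinnedThread is our own hypothetical helper, not a standard-library function:

```go
package main

import (
	"fmt"
	"runtime"
)

// runOnPinnedThread runs f on a goroutine locked to its OS thread,
// e.g. for cgo libraries with thread-local state. (Hypothetical helper.)
func runOnPinnedThread(f func()) {
	done := make(chan struct{})
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		f()
		close(done)
	}()
	<-done
}

func main() {
	runOnPinnedThread(func() { fmt.Println("running on a pinned thread") })
}
```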
Middle Questions¶
Q9: Walk me through findRunnable.¶
Expected: 1. Check local LRQ via runqget. 2. Periodically (1 in 61) check GRQ first. 3. Check GRQ via globrunqget. 4. Check netpoll non-blocking. 5. Become spinning if budget allows; attempt up to 4 rounds of stealing across all Ps. 6. Final pre-park checks (timers, GRQ, blocking netpoll). 7. Park the M.
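The ordering above can be modelled with toy slices. This is a hypothetical, heavily simplified sketch (sched, local, global are illustrative names, not runtime APIs; netpoll, spinning, and parking are omitted):

```go
package main

import "fmt"

// sched models one P's view of the world with plain slices of goroutine IDs.
type sched struct {
	tick   int
	local  []int // toy LRQ
	global []int // toy GRQ
}

func (s *sched) findRunnable(victim *sched) (int, bool) {
	s.tick++
	// 1-in-61 rule: occasionally favour the global queue so it cannot starve.
	if s.tick%61 == 0 && len(s.global) > 0 {
		g := s.global[0]
		s.global = s.global[1:]
		return g, true
	}
	if len(s.local) > 0 { // 1. local run queue
		g := s.local[0]
		s.local = s.local[1:]
		return g, true
	}
	if len(s.global) > 0 { // 2. global run queue
		g := s.global[0]
		s.global = s.global[1:]
		return g, true
	}
	// 3. steal the larger half from a victim (last resort before parking)
	if n := len(victim.local); n > 0 {
		take := n - n/2
		s.local = append(s.local, victim.local[:take]...)
		victim.local = victim.local[take:]
		g := s.local[0]
		s.local = s.local[1:]
		return g, true
	}
	return 0, false // no work anywhere: the M would park here
}

func main() {
	thief := &sched{}
	victim := &sched{local: []int{1, 2, 3, 4}}
	g, ok := thief.findRunnable(victim)
	fmt.Println(g, ok, victim.local) // steals half, runs G1, victim keeps [3 4]
}
```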
Q10: What is a spinning M?¶
Expected: An M actively looking for work without a current G. Spinning Ms keep latency low: when new work arrives, a spinning M picks it up with a few atomic operations and no syscall. Without spinners, every piece of new work would pay for waking a parked M via a futex/semaphore syscall, which is orders of magnitude slower.
Q11: What is nmspinning?¶
Expected: An atomic counter of how many Ms are currently spinning. The scheduler enforces nmspinning <= (gomaxprocs - npidle) / 2 to cap CPU burn.
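The cap is plain integer arithmetic; a sketch of the rule (function name is illustrative):

```go
package main

import "fmt"

// maxSpinning caps the number of spinning Ms at half the busy Ps,
// mirroring the nmspinning <= (gomaxprocs - npidle) / 2 rule.
func maxSpinning(gomaxprocs, npidle int) int {
	return (gomaxprocs - npidle) / 2
}

func main() {
	fmt.Println(maxSpinning(8, 2)) // 6 busy Ps allow at most 3 spinners
	fmt.Println(maxSpinning(4, 0)) // 4 busy Ps allow at most 2 spinners
}
```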
Q12: When is wakep called?¶
Expected: Whenever new work is created (goroutine spawned, parked G woken, netpoll Gs ready). It ensures at least one spinning M is alive while work exists. If a spinner is already running, wakep is a fast no-op (atomic load + CAS-fail).
Q13: What is runnext?¶
Expected: A single-slot LIFO cache on each P. A newly spawned G goes into runnext and runs before the LRQ tail. This optimises parent-child cache locality. Thieves only steal runnext on the fourth (last) attempt.
Q14: What is the 1-in-61 rule?¶
Expected: Every 61st schedule() call on a P, the scheduler checks the GRQ before the LRQ. This prevents GRQ starvation when the LRQ always has work.
Q15: What happens when an LRQ fills to 256?¶
Expected: runqputslow is called. It moves half the LRQ (128 Gs) plus the new G to the GRQ in one batch under sched.lock. Slow path; rare in practice.
Q16: How is the LRQ memory-safe for thieves and the owner?¶
Expected: Acquire-release atomics on runqhead (modified by thieves) and runqtail (modified by owner only). The release-store on runqtail synchronises with the acquire-load by thieves; visibility of G pointers is guaranteed.
Q17: How does the runtime pick the victim P?¶
Expected: A cheaprand() seed determines a random starting position; from there, the thief walks all Ps with a coprime stride. Each pass touches every P once before retrying.
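The coprime-stride walk guarantees each pass visits every P exactly once. A sketch under the assumption that the caller picks a stride coprime to n (the runtime precomputes such strides; names here are illustrative):

```go
package main

import "fmt"

// victimOrder enumerates all n Ps exactly once, starting at a random
// offset and stepping by a stride coprime to n.
func victimOrder(n, start, stride int) []int {
	order := make([]int, 0, n)
	pos := start % n
	for i := 0; i < n; i++ {
		order = append(order, pos)
		pos = (pos + stride) % n
	}
	return order
}

func main() {
	// stride 3 is coprime to 8, so all eight Ps appear exactly once
	fmt.Println(victimOrder(8, 3, 3)) // [3 6 1 4 7 2 5 0]
}
```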
Q18: How many steal attempts before parking?¶
Expected: 4 rounds, each walking all GOMAXPROCS - 1 other Ps. If none succeed, the M parks.
Senior Questions¶
Q19: Compare Go's work stealing with Cilk's.¶
Expected: Both use random victim selection and half-steal. Cilk uses LIFO from the owner and FIFO from thieves (depth-first locally, breadth-first globally). Go uses FIFO from both sides, with runnext providing a LIFO override. Cilk has unbounded deques; Go has bounded LRQs (256) with GRQ overflow.
Q20: Why the usleep(3) before stealing runnext?¶
Expected: To give the owner a chance to consume runnext itself. If the thief CAS'd immediately, it would frequently win the race against the owner who was about to execute the G. The 3-μs pause biases toward "owner runs it."
Q21: How does work stealing interact with timers?¶
Expected: In Go 1.14+, each P has its own timer heap. When stealing, the thief also checks the victim's timers; if any have expired, the thief fires them (which may push Gs onto the thief's LRQ). This prevents timer-driven Gs from being starved on busy Ps.
Q22: What is injectglist for?¶
Expected: Distributes a list of newly-runnable Gs back into the system. Used when netpoll returns multiple ready Gs at once, or when a batch of timers fires. It spreads Gs across LRQ and GRQ, and wakes spinners proportional to the batch size.
Q23: How does async preemption (Go 1.14+) help work stealing?¶
Expected: Without preemption, a long-running G could hog its M indefinitely; thieves could still rob its P's LRQ, but the running G itself never yielded. With async preemption, sysmon flags Gs that have run for ~10 ms, the runtime interrupts them at a safepoint and requeues them, so the M becomes available for the next G, possibly a stolen one.
Q24: What if all spinners try to steal from the same P?¶
Expected: They contend on the CAS of runqhead. One succeeds; the rest retry. The retries re-randomise their victim selection, so they spread to other Ps. Sustained CAS contention is rare.
Q25: Is the steal lock-free or wait-free?¶
Expected: Lock-free on the LRQ side. The owner's runqput is wait-free (it only writes runqtail). The thief's runqgrab may retry on CAS failure, so stealing is lock-free but not wait-free. No mutex is held during LRQ stealing.
Q26: How does work stealing scale on NUMA?¶
Expected: Less well than within a NUMA node. Stealing across nodes hits cross-socket memory traffic. The Go runtime is not NUMA-aware; it does not bias stealing toward local nodes. For large NUMA systems, this can be a bottleneck. Mitigation: pin processes per NUMA node, run multiple processes.
Q27: What is the cost (in nanoseconds) of one steal?¶
Expected: For a small batch (~4 Gs), ~25 ns including atomics, copies, and index updates. For larger batches, +0.5 ns per additional G. Compare: park-and-wake an M is ~1 μs.
Q28: When does the runtime skip a P during stealing?¶
Expected: When the P is in idlepMask (known to be idle, LRQ definitely empty); when it is the thief's own P (pp == p2); and when a peek shows the LRQ is empty (head == tail).
Staff / Principal Questions¶
Q29: Design a NUMA-aware version of work stealing.¶
Expected discussion:
- Bias victim selection toward Ps on the same NUMA node.
- Only fall back to cross-node stealing after local-node attempts fail.
- Maintain per-NUMA spinning M pools.
- Cost: complex topology detection at startup.
- Benefit: 2-3× lower cross-socket memory traffic for highly parallel workloads.
- Trade-off: load imbalance if one node is busier; need a smart fallback strategy.
Q30: How would you debug a goroutine that "should be runnable but isn't running"?¶
Expected: 1. GODEBUG=schedtrace=1000 to see queue state. 2. SIGQUIT (kill -3) to dump all goroutine stacks. 3. Look for LockOSThread (G is pinned), cgocall (G is in cgo), gopark (G is parked, not runnable). 4. Check if the G is actually in _Grunnable state (via runtime.GoroutineProfile). 5. Check GOMAXPROCS — maybe set too low. 6. Check sysmon for stuck conditions.
Q31: Should the runtime have priorities?¶
Expected discussion:
- Go's design philosophy: simple, uniform. Priorities add complexity.
- Real-time scheduling needs are typically met by separating critical work into a dedicated OS thread via LockOSThread.
- Some users would benefit from "soft" priority (e.g., GC mark workers vs user Gs). Already implemented for GC.
- The runtime does have priority-like mechanisms (runnext is higher priority than the LRQ tail).
- Adding general priorities would require redesigning the LRQ as a priority queue: O(log n) instead of O(1).
Q32: How does Tokio's work stealing differ from Go's, and why?¶
Expected discussion:
- Tokio is Rust async: tasks yield at explicit .await points and are not preemptible.
- Tokio uses a local queue plus an injection queue, similar to LRQ + GRQ.
- Tokio's reactor (epoll equivalent) is per-worker, not centralised; better for I/O scaling.
- Tokio has an explicit "LIFO slot" similar to runnext.
- Tokio's fairness rule triggers every 31 polls (vs Go's 61); more aggressive.
- The differences are driven by Rust's lack of GC, async/await, and Tokio's I/O-heavy target workload.
Q33: Could the runtime use a lock-free GRQ?¶
Expected: Yes, in theory. Several lock-free queue algorithms exist (Michael-Scott, LCRQ, etc.). The cost would be in memory ordering complexity and ABA hazards. The current sched.lock approach is simple and the GRQ is rarely the hot path (LRQs absorb most traffic). A redesign for lock-free GRQ has been discussed but not implemented.
Q34: What is the theoretical optimality of half-steal?¶
Expected: The Blumofe–Leiserson theorem behind Cilk: for a computation with T₁ total work and T_∞ critical-path length, the expected parallel time on P processors is T_P ≤ T₁/P + O(T_∞). Stealing any constant fraction achieves this bound asymptotically; half is empirically the fastest. The proof rests on a potential-function argument: each round of steal attempts reduces the potential by a constant factor, so O(P · T_∞) steal attempts suffice in expectation.
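The bound reads compactly in notation (a restatement of the theorem above, not new material):

```latex
% Randomized work stealing (Blumofe–Leiserson): expected running time
% on P processors, with T_1 = total work, T_\infty = critical-path length.
T_P \;\le\; \frac{T_1}{P} + O(T_\infty)
% Speedup is near-linear whenever the parallelism T_1 / T_\infty \gg P;
% the expected number of steal attempts is O(P \cdot T_\infty).
```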
Q35: Design a work-stealing scheduler from scratch for a language without GC.¶
Expected discussion:
- LRQ allocation: stack or arena (no malloc per task).
- Task representation: a struct with a function pointer and an arg pointer, no GC roots.
- Stealing: same algorithm.
- Memory reclamation: hazard pointers or epoch-based reclamation for stolen tasks (no GC to clean up).
- Difference from Go: explicit lifecycle management; a harder API.
- Examples: Cilk++, Intel TBB, Rust's Rayon.
Q36: When would you not use the default Go scheduler?¶
Expected discussion:
- Soft real-time with bounded scheduling latency: use LockOSThread, a dedicated thread, and a goroutine-free critical path.
- Very large NUMA machines: process-per-node may scale better.
- Single-thread requirements (legacy OpenGL, etc.): LockOSThread + GOMAXPROCS=1.
- CPU-bound work with strict isolation: kernel cgroups + CPU affinity for the whole process.
- For 99% of code, the default is right.
Whiteboard Tasks¶
T1: Implement a simple work-stealing scheduler in Go¶
Show the candidate can build the data structures:
```go
type Worker struct {
	lrq  [256]Task
	head atomic.Uint32 // advanced by owner pops and by thieves (CAS)
	tail atomic.Uint32 // advanced by the owner only
}

// Push appends at the tail; owner-only, so a plain store suffices.
func (w *Worker) Push(task Task) bool {
	t := w.tail.Load()
	h := w.head.Load()
	if t-h >= 256 {
		return false // full; a real runtime spills half to the global queue
	}
	w.lrq[t%256] = task
	w.tail.Store(t + 1) // release-store publishes the slot to thieves
	return true
}

// Pop takes from the head with a CAS (like Go's runqget), so it is safe
// against concurrent thieves.
func (w *Worker) Pop() (Task, bool) {
	for {
		h := w.head.Load()
		t := w.tail.Load()
		if h >= t {
			return Task{}, false // empty
		}
		task := w.lrq[h%256]
		if w.head.CompareAndSwap(h, h+1) {
			return task, true
		}
		// lost a race with a thief; re-read and retry
	}
}

// Steal moves half of w's queue into another worker's queue.
func (w *Worker) Steal(into *Worker) int {
	for {
		h := w.head.Load() // acquire-load pairs with the owner's release-store
		t := w.tail.Load()
		n := t - h
		n = n - n/2 // take the larger half: ceil(n/2)
		if n == 0 {
			return 0
		}
		if n > 128 {
			continue // inconsistent h/t snapshot; re-read
		}
		it := into.tail.Load()
		for i := uint32(0); i < n; i++ {
			into.lrq[(it+i)%256] = w.lrq[(h+i)%256]
		}
		if w.head.CompareAndSwap(h, h+n) {
			into.tail.Store(it + n) // publish the stolen batch
			return int(n)
		}
		// CAS lost: head moved under us; retry with fresh indices
	}
}
```
Watch for: acquire-release order, ABA, queue full, queue empty edge cases.
T2: Write a program to demonstrate work stealing¶
```go
package main

import (
	"runtime"
	"sync"
	"time"
)

func main() {
	runtime.GOMAXPROCS(4)
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// identifying which P ran this G is hard from user code;
			// observe the distribution via the scheduler trace instead
			time.Sleep(10 * time.Microsecond)
		}()
	}
	wg.Wait()
}
```
Run with GODEBUG=schedtrace=100 to observe LRQ distribution.
T3: Read a go tool trace and identify a steal¶
Given a trace, look at:
- A "GoCreate" event on P0 at time t.
- A "GoStart" event on P2 at time t+1μs.
- That pattern is consistent with a steal (or a wakep handoff): the G was created on P0 but ran on P2.
T4: Calculate expected number of steals¶
For a program with N runnable Gs all initially on P0 and P=8 Ps, how many steals occur?
Answer: Each steal moves the larger half of the fullest queue to an idle P, so the number of busy Ps roughly doubles per round: 1 → 2 → 4 → 8. Balancing takes about ⌈log₂(P)⌉ = 3 rounds and P − 1 = 7 successful steals; the count is driven by P, not N. Failed attempts add overhead, but the Cilk-style analysis bounds expected steal attempts by O(P · T_∞).
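A quick greedy simulation of the cascade, under the simplifying assumption that idle Ps act one at a time and always steal from the fullest P (names are illustrative):

```go
package main

import "fmt"

// stealsToBalance simulates the cascade: all n Gs start on one P, and at
// each step one idle P steals the larger half from the fullest P.
func stealsToBalance(n, p int) int {
	q := make([]int, p)
	q[0] = n
	steals := 0
	for {
		idle, full := -1, 0
		for i, v := range q {
			if v == 0 && idle == -1 {
				idle = i // first idle P found
			}
			if v > q[full] {
				full = i // fullest P so far
			}
		}
		if idle == -1 || q[full] == 0 {
			return steals // every P has work; load is balanced
		}
		take := q[full] - q[full]/2 // ceil(half)
		q[full] -= take
		q[idle] += take
		steals++
	}
}

func main() {
	fmt.Println(stealsToBalance(1024, 8)) // 7: one steal per initially idle P
}
```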
Behavioural / Design¶
B1: A coworker says "let's implement our own work stealing on top of channels for our worker pool." How do you respond?¶
Expected: The scheduler already does work stealing for you. Implementing it on top of channels is reinventing the wheel and almost certainly slower. Show benchmarks: a simple worker pool ranging over a shared channel is typically faster than any user-space stealing layer.
B2: A microservice has 10× CPU usage in production vs staging. GODEBUG=schedtrace shows high spinningthreads. What do you investigate?¶
Expected: 1. Is the workload busier (more goroutines)? Stealers spin only when there is work. 2. Are spinners failing to find work? Check idleprocs over time. Mismatch indicates a runtime bug. 3. Profile CPU. If runtime.findRunnable dominates, something is wrong with stealing. 4. Check for very short-lived Gs (high churn). Batch them. 5. Check for LockOSThread overuse — locked Gs are not stealable but their Ms hold Ps.
B3: You're designing an API that needs predictable scheduling. What guarantees can you make?¶
Expected: Go does not give scheduling guarantees beyond "eventually runnable Gs run." For predictable latency:
- Use LockOSThread + a dedicated goroutine + bounded work.
- Set GOMAXPROCS to leave headroom.
- Measure tail latency, not just averages.
- Consider Rust or a real-time OS if hard guarantees are needed.
B4: How would you teach work stealing to a junior?¶
Expected: Start with the analogy (chefs in a kitchen, library returns). Show the LRQ as a bucket with two openings. Walk through findRunnable decision tree. Avoid memory ordering and CAS until the picture is clear. Run a go tool trace together.
B5: Argue for or against making the LRQ size configurable.¶
Expected:
- For: workloads with extreme bursts could benefit from larger queues (less overflow).
- Against: configurability is a footgun; tuning is hard; 256 is empirically good; a configurable cap would invite over-tuning and fragment shared operational knowledge.
- The runtime team has explicitly rejected this. The position is consistent with Go's "few knobs" philosophy.
End of interview prep. For practical exercises, see tasks.md.