Scheduler Source — Practice Tasks¶
Twenty-three exercises to build operator intuition for the Go scheduler. The goal is not to memorise runtime/proc.go line numbers; it is to develop the reflex of asking "what state are M, P, and G actually in?" whenever a program behaves oddly under load. Difficulty: Junior, Middle, Senior, Staff.
Each task gives a Goal, the Skills it exercises, Setup, Steps, Acceptance criteria, and folded Hints + Reference solution (with runnable Go code, 1.22+). Read junior.md first — schedtrace, scheddetail, runtime/trace, and the M:N model are the scaffolding for every exercise that follows.
The exercises are ordered by difficulty, not by topic. Skipping around is fine, but Tasks 1, 4, and 5 should be done before any of the Senior tier so you have somewhere to project the source-reading work back onto.
Task 1: Read GODEBUG schedtrace output¶
Goal. Run a busy program under GODEBUG=schedtrace=1000 and explain every field on one trace line. The point is not the program; it is to stop being intimidated by the line.
Difficulty. Junior
Skills. GODEBUG, scheduler vocabulary (P, M, G, runq, idle).
Setup. Go 1.22+, a terminal. No third-party deps.
Steps. 1. Write a program that spawns 200 goroutines each doing a 100 ms CPU loop. 2. Run with GODEBUG=schedtrace=1000 GOMAXPROCS=4 go run main.go. 3. Capture one full trace line. Identify, with one sentence each: gomaxprocs, idleprocs, threads, spinningthreads, idlethreads, runqueue (global runq length), and the per-P runq lengths after the bracket. 4. Re-run with GOMAXPROCS=1. Note which fields change shape.
Acceptance criteria. - You can predict, before re-running, which fields will be larger and which smaller when GOMAXPROCS halves. - You can explain why runqueue (global) is often 0 on a healthy program and what it means when it grows.
Hints
- The format is documented as a comment in `runtime/proc.go` — search for `schedtrace`. Don't read it yet; predict, then check. - `spinningthreads` is the count of Ms actively looking for work — high `spinningthreads` with empty runqs is wasted CPU. - Per-P numbers are *local* runq lengths. A spike there with `runqueue=0` (global) means work is parallelisable but one P got handed too much.Reference solution
// main.go — run with: GODEBUG=schedtrace=1000 GOMAXPROCS=4 go run main.go
package main
import (
"sync"
"time"
)
func busy(d time.Duration) {
end := time.Now().Add(d)
for time.Now().Before(end) {
// burn CPU; tight loops without function calls used to
// defeat preemption pre-1.14. Task 10 demonstrates that.
_ = 1 + 1
}
}
func main() {
const N = 200
var wg sync.WaitGroup
wg.Add(N)
for i := 0; i < N; i++ {
go func() {
defer wg.Done()
busy(100 * time.Millisecond)
}()
}
wg.Wait()
}
Task 2: GOMAXPROCS scan on CPU-bound work¶
Goal. Quantify the speedup curve of a perfectly parallel CPU-bound workload across GOMAXPROCS ∈ {1, 2, 4, 8}. Discover that "twice the Ps" is not "twice the throughput" on real hardware.
Difficulty. Junior
Skills. GOMAXPROCS, time.Since, basic benchmarking discipline.
Setup. Quiet machine (no Slack, no compilation), Go 1.22+.
Steps. 1. Write a workload that sums 1..1e9 integers split across N workers. 2. Drive GOMAXPROCS via env or runtime.GOMAXPROCS(n) and run each setting three times. Record min wall time (median is fine, min discards GC jitter). 3. Plot or tabulate: P, wall-time, speedup-vs-P=1. 4. Explain the curve: where does it bend, and why?
Acceptance criteria. - Speedup at P=2 is ≥1.7×; at P=8 on an 8-core machine is somewhere between 5× and 7×, never 8×. - You name at least two reasons for the sublinear scaling (cache contention, memory bandwidth, hyper-threading, runtime overhead).
Hints
- Don't share a single accumulator across goroutines or you'll measure mutex/atomic contention, not parallelism. - Set the workload large enough (~1–3 s at P=1) so GC and scheduler init don't dominate the measurement. - `runtime.GOMAXPROCS` returns the *previous* value — handy for restoring.Reference solution
// main.go — go run main.go
package main
import (
"fmt"
"runtime"
"sync"
"time"
)
func sumChunk(start, end uint64) uint64 {
var s uint64
for i := start; i < end; i++ {
s += i
}
return s
}
func parallelSum(N uint64, workers int) uint64 {
chunk := N / uint64(workers)
parts := make([]uint64, workers)
var wg sync.WaitGroup
wg.Add(workers)
for w := 0; w < workers; w++ {
w := w
go func() {
defer wg.Done()
start := uint64(w) * chunk
end := start + chunk
if w == workers-1 {
end = N
}
parts[w] = sumChunk(start, end)
}()
}
wg.Wait()
var total uint64
for _, p := range parts {
total += p
}
return total
}
func main() {
const N = uint64(1e9)
for _, p := range []int{1, 2, 4, 8} {
runtime.GOMAXPROCS(p)
var best time.Duration
for run := 0; run < 3; run++ {
t := time.Now()
_ = parallelSum(N, p)
d := time.Since(t)
if run == 0 || d < best {
best = d
}
}
fmt.Printf("P=%d best=%v\n", p, best)
}
}
Task 3: Cooperative scheduling with Gosched¶
Goal. Demonstrate that runtime.Gosched voluntarily yields the current G back to the scheduler. Verify with runtime.NumGoroutine.
Difficulty. Junior
Skills. runtime.Gosched, runtime.NumGoroutine, mental model of "runnable vs running".
Setup. Go 1.22+, GOMAXPROCS=1 to make the effect deterministic.
Steps. 1. With GOMAXPROCS=1, start a goroutine that prints "A" 5 times, then another that prints "B" 5 times. 2. Without Gosched, observe A finishes fully before B starts (or vice versa). 3. Add runtime.Gosched() inside each loop. Observe interleaving. 4. Use runtime.NumGoroutine() at the start, middle, and end of main. Predict the values; verify.
Acceptance criteria. - You can articulate why GOMAXPROCS=1 matters for the demonstration (one P, no parallel runner to confuse the picture). - You can explain why NumGoroutine is 2 and not 3 mid-program (the goroutines you started, plus the main goroutine itself).
Hints
- `runtime.Gosched` moves the current G from "running" to the *back* of the local runq — it is not a sleep. The next runnable G runs immediately. - Don't confuse `Gosched` with `time.Sleep(0)`. They behave similarly but `Sleep(0)` goes through the timer wheel; `Gosched` does not. - Modern Go (1.14+) preempts long-running goroutines anyway. The Gosched experiment is most striking on `GOMAXPROCS=1` with very tight loops.Reference solution
// main.go — go run main.go
package main
import (
"fmt"
"runtime"
"sync"
)
func main() {
runtime.GOMAXPROCS(1)
fmt.Println("start NumGoroutine =", runtime.NumGoroutine()) // 1
var wg sync.WaitGroup
wg.Add(2)
go func() {
defer wg.Done()
for i := 0; i < 5; i++ {
fmt.Println("A", i)
runtime.Gosched() // hand the P to whoever's next
}
}()
go func() {
defer wg.Done()
for i := 0; i < 5; i++ {
fmt.Println("B", i)
runtime.Gosched()
}
}()
// Senior decision: we measure NumGoroutine BEFORE wg.Wait runs
// or we'd just see "1" again. Schedule the print on the runq.
go func() {
fmt.Println("mid NumGoroutine =", runtime.NumGoroutine())
}()
wg.Wait()
fmt.Println("end NumGoroutine =", runtime.NumGoroutine()) // 1
}
Task 4: Read GODEBUG scheddetail output¶
Goal. Run with GODEBUG=scheddetail=1,schedtrace=1000 and parse a single per-P / per-M / per-G line. Move from "I see a wall of text" to "I can find the G that's stuck".
Difficulty. Middle
Skills. GODEBUG, runtime data structures (M, P, G), goroutine states.
Setup. Go 1.22+, a long-running program (one that creates 50+ goroutines and runs for ≥5 s).
Steps. 1. Run your favourite long-lived test program with GODEBUG=scheddetail=1,schedtrace=1000 GOMAXPROCS=4. 2. Capture one full trace block (it'll be ~50–200 lines). 3. Pick one line from each: a P line, an M line, a G line. Annotate every field. 4. Find one G in state Gwaiting. Identify the waitreason and the file:line in the goroutine's stack.
Acceptance criteria. - You can name at least five goroutine states from runtime/runtime2.go (Gidle, Grunnable, Grunning, Gsyscall, Gwaiting, Gdead). - Given a G line, you can find the m= field and cross-reference it to the M block. - You understand why the same G might appear with a different m= between dumps (Gs migrate between Ms via the runq).
Hints
- The source of truth is the comment block above `schedtrace`/`scheddetail` in `runtime/proc.go`. Read it once and bookmark. - `G` lines look like `G%d: status=%d(%s) m=%d lockedm=%d`. Status integers correspond to constants in `runtime2.go` — they're labelled in human form right next to the integer. - `waitreason` is the most useful field for debugging real systems: "select", "chan receive", "GC assist wait", "GC worker (idle)", "semacquire", etc. Each has a different fix.Reference solution
A real `scheddetail=1` block (trimmed):SCHED 5012ms: gomaxprocs=4 ...
P0: status=1 schedtick=412 syscalltick=18 m=0 runqsize=2 gfreecnt=4
P1: status=0 schedtick=399 syscalltick=22 m=2 runqsize=0 gfreecnt=3
...
M0: p=0 curg=14 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false ...
M2: p=1 curg=27 ...
...
G1: status=4(chan receive) m=-1 lockedm=-1
G14: status=2() m=0 lockedm=-1
G27: status=2() m=2 lockedm=-1
G33: status=4(GC assist wait) m=-1 lockedm=-1
Task 5: Capture and read a runtime/trace¶
Goal. Record a runtime/trace of a small program and identify GoCreate, GoStart, GoStop, and GoBlock events in the timeline view. The point is to see scheduler events as a picture, not text.
Difficulty. Middle
Skills. runtime/trace, go tool trace, browser-based timeline analysis.
Setup. Go 1.22+, a browser, a program that does a mix of compute and I/O for ≥1 s.
Steps. 1. Wrap your program's body in trace.Start(os.Create("trace.out")) / defer trace.Stop(). 2. Run it. Open with go tool trace trace.out. 3. In the "Goroutine analysis" view, pick a goroutine and find its GoCreate, GoStart, GoStop, and GoBlock markers. 4. Identify one stretch where a P is idle while at least one G is runnable. (You usually won't find it in a healthy program — that's the lesson.)
Acceptance criteria. - You can navigate to the timeline view, zoom in to a 10 ms window, and name every event row. - You can articulate the difference between GoBlock (waiting on channel/lock) and GoSysBlock (waiting on syscall). - You took one screenshot of the "Network blocking profile" or "Synchronization blocking profile" view and explained one entry.
Hints
- `go tool trace` runs a local HTTP server. Open the URL it prints; if Chrome blocks the trace viewer due to size, run `go tool trace -http=:8080 trace.out`. - The default view is dense. Use the "View trace" link and zoom with W/S (in/out) and A/D (left/right). The keymap is on the page. - `runtime/trace` is heavier than `pprof` (~10–30% overhead). Don't leave it on in production; do leave it on a `pprof`-style endpoint for opt-in collection.Reference solution
// main.go — go run main.go && go tool trace trace.out
package main
import (
"context"
"log"
"net/http"
"net/http/httptest"
"os"
"runtime/trace"
"sync"
"time"
)
func main() {
f, err := os.Create("trace.out")
if err != nil {
log.Fatal(err)
}
defer f.Close()
if err := trace.Start(f); err != nil {
log.Fatal(err)
}
defer trace.Stop()
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
time.Sleep(5 * time.Millisecond) // simulate downstream
w.Write([]byte("ok"))
}))
defer srv.Close()
var wg sync.WaitGroup
for i := 0; i < 200; i++ {
wg.Add(1)
go func(i int) {
defer wg.Done()
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
req, _ := http.NewRequestWithContext(ctx, "GET", srv.URL, nil)
resp, err := http.DefaultClient.Do(req)
if err != nil {
return
}
resp.Body.Close()
// CPU spike
s := 0
for j := 0; j < 1_000_000; j++ {
s += j
}
_ = s
}(i)
}
wg.Wait()
}
Task 6: Instrument a region with runtime/trace¶
Goal. Use trace.NewTask + trace.WithRegion to annotate a specific code path. Measure the scheduling latency between region start and the first goroutine event inside it.
Difficulty. Middle
Skills. runtime/trace user APIs, custom annotations, tracelog interpretation.
Setup. Go 1.22+, the trace setup from Task 5.
Steps. 1. Pick a critical path (e.g., a request handler). 2. Wrap it in a Task via trace.NewTask(ctx, "handle") and a Region via trace.WithRegion(ctx, "db", fn). 3. Run with tracing. Open the "User-defined tasks" view in go tool trace. 4. Compare the task's wall time to the sum of its regions' wall times. The gap is scheduling + setup.
Acceptance criteria. - Your trace.out shows a labelled "handle" task with one or more "db" / "cpu" regions. - You can locate the same task in both the "User-defined tasks" tab and the timeline tab. - You can articulate why task wall time > sum of regions: time spent runnable-but-not-running, or in untraced code.
Hints
- `trace.NewTask` returns a `*trace.Task` and a derived context. Pass the context down to the call you want to annotate. - Regions must be nested correctly — `defer region.End()`. Or use the helper `trace.WithRegion(ctx, "name", func(){ ... })`. - `trace.Log(ctx, "k", "v")` adds structured events that show up in the timeline. Cheap and useful.Reference solution
// main.go — go run main.go && go tool trace trace.out
package main
import (
"context"
"log"
"os"
"runtime/trace"
"sync"
"time"
)
func dbCall(ctx context.Context) {
defer trace.StartRegion(ctx, "db").End()
time.Sleep(8 * time.Millisecond)
}
func cpuWork(ctx context.Context) {
defer trace.StartRegion(ctx, "cpu").End()
s := 0
for i := 0; i < 5_000_000; i++ {
s += i
}
_ = s
}
func handle(ctx context.Context, id int) {
ctx, task := trace.NewTask(ctx, "handle")
defer task.End()
trace.Log(ctx, "request_id", "req-"+itoa(id))
dbCall(ctx)
cpuWork(ctx)
}
func itoa(i int) string {
if i == 0 {
return "0"
}
var b [20]byte
n := len(b)
for i > 0 {
n--
b[n] = byte('0' + i%10)
i /= 10
}
return string(b[n:])
}
func main() {
f, _ := os.Create("trace.out")
defer f.Close()
if err := trace.Start(f); err != nil {
log.Fatal(err)
}
defer trace.Stop()
var wg sync.WaitGroup
for i := 0; i < 50; i++ {
wg.Add(1)
go func(i int) {
defer wg.Done()
handle(context.Background(), i)
}(i)
}
wg.Wait()
}
Task 7: Pin a goroutine with LockOSThread¶
Goal. Use runtime.LockOSThread to bind a goroutine to a single OS thread; verify with the Linux gettid syscall (or pthread_self on macOS) that the thread ID does not change across iterations.
Difficulty. Middle
Skills. runtime.LockOSThread, cgo or syscall.Syscall, M:N model.
Setup. Go 1.22+, Linux (preferred — gettid is straightforward) or macOS with pthread_self.
Steps. 1. Write a goroutine that loops 1000 times calling gettid (Linux) or pthread_threadid_np (macOS) and recording the result. 2. First run without LockOSThread: confirm the thread ID changes occasionally. 3. Add runtime.LockOSThread() at the top of the goroutine, defer runtime.UnlockOSThread(). Re-run. 4. Verify all 1000 reads return the same TID.
Acceptance criteria. - Without lock: TID may change between iterations (it usually won't on an idle machine, but the test must allow for it). - With lock: TID is invariant. - You can name two reasons to LockOSThread: OpenGL / GLFW contexts, and any C library that uses thread-local storage (setlocale, errno in some libcs, OpenSSL pre-1.1, signal masks).
Hints
- On Linux: `syscall.Syscall(syscall.SYS_GETTID, 0, 0, 0)`. No imports beyond `syscall`. - On macOS: cgo into `pthread_threadid_np`, or use `syscall.Syscall(syscall.SYS_THREAD_SELFID, 0, 0, 0)`. - `LockOSThread` is sticky across function calls but not across goroutines. Forgetting `Unlock` permanently retires the M (it'll never serve another G).Reference solution
//go:build linux
// main.go — go run main.go
package main
import (
"fmt"
"runtime"
"sync"
"syscall"
"time"
)
func gettid() int {
t, _, _ := syscall.Syscall(syscall.SYS_GETTID, 0, 0, 0)
return int(t)
}
func run(lock bool) (changes int, firstTID int) {
if lock {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
}
firstTID = gettid()
prev := firstTID
for i := 0; i < 1000; i++ {
t := gettid()
if t != prev {
changes++
prev = t
}
// Force scheduler activity. Without this, the unlocked
// case might also stay on one thread by luck.
if i%50 == 0 {
time.Sleep(time.Microsecond)
}
}
return
}
func main() {
var wg sync.WaitGroup
for _, lock := range []bool{false, true} {
wg.Add(1)
go func(lock bool) {
defer wg.Done()
ch, tid := run(lock)
fmt.Printf("lock=%v TID-changes=%d first-TID=%d\n", lock, ch, tid)
}(lock)
}
wg.Wait()
}
Task 8: Reproduce work-stealing¶
Goal. Construct a workload that forces oversubscription on one P; observe via schedtrace that other Ps steal half the local runq.
Difficulty. Middle
Skills. Work-stealing mechanics, runq mental model, schedtrace reading.
Setup. Go 1.22+, GOMAXPROCS=4.
Steps. 1. Spawn 16 short CPU-bound goroutines very quickly from one parent. Because the runtime puts a newly-created G on the local runq, the parent's P will pile up. 2. Run with GODEBUG=schedtrace=200 GOMAXPROCS=4. 3. Observe the per-P runq pattern: one P large, three small/zero, then stealing levels them. 4. Repeat with runtime.Gosched() between go statements. Compare.
Acceptance criteria. - You see at least one trace line with an imbalanced pattern like [16 0 0 0] followed quickly by [4 4 4 4]. - You can articulate where in runtime/proc.go the steal happens (runqsteal in findRunnable's loop). - You can predict what happens at GOMAXPROCS=1 (no stealing possible; runq stays at 16).
Hints
- The runtime caps local runqs at 256 (`runtime/proc.go` constant `runqsize`). Past that, half the runq spills to the global runq. Don't accidentally exceed it. - Work-stealing is *random* — a P with empty runq picks a victim at random, halves its runq into its own. Random victim selection avoids contention. - Each goroutine should be long enough (~1 ms) that the imbalance is visible across `schedtrace=200` boundaries.Reference solution
// main.go — GODEBUG=schedtrace=200 GOMAXPROCS=4 go run main.go
package main
import (
"runtime"
"sync"
"time"
)
func burn(d time.Duration) {
end := time.Now().Add(d)
for time.Now().Before(end) {
// burn — function call inside loop is fine post-1.14
}
}
func main() {
runtime.GOMAXPROCS(4)
var wg sync.WaitGroup
// Fire 16 goroutines from the SAME goroutine in a tight burst.
// Without yielding, all 16 land on the parent's P local runq.
for i := 0; i < 16; i++ {
wg.Add(1)
go func() {
defer wg.Done()
burn(50 * time.Millisecond)
}()
}
wg.Wait()
}
Task 9: Diagnose scheduling latency with pprof + trace¶
Goal. A program that "should be fast" is slow. Use go tool pprof (CPU) and go tool trace (scheduling) together to localise the cause.
Difficulty. Senior
Skills. Reading pprof flame graphs, reading trace timelines, distinguishing CPU-bound from scheduler-bound bottlenecks.
Setup. Go 1.22+, a small net/http/pprof-enabled program.
Steps. 1. Write a server with two endpoints: /fast (returns immediately) and /slow (does 5 ms of CPU work). Drive both at high QPS. 2. Note that p99 of /fast exceeds p99 of /slow under load. Why? 3. Capture /debug/pprof/profile?seconds=10 and a runtime/trace covering the same window. 4. Use the trace to find idle-P intervals when /fast was runnable. Confirm via pprof that those Ps were busy on /slow.
Acceptance criteria. - You produce one screenshot or paste showing a /fast request with >5 ms _Grunnable time in the trace. - You can articulate why fairness isn't perfect at the goroutine level: until a tight-loop G is preempted (~10 ms), the runnable /fast G waits. - You can name the fix (smaller CPU bursts, preemption points, separate worker pool).
Hints
- `import _ "net/http/pprof"` and a separate `http.ListenAndServe("localhost:6060", nil)` is the standard setup. - Drive load with `hey` or `wrk`. `hey -z 10s -c 64 http://localhost:8080/fast` is a one-liner. - The interesting tab in `go tool trace` is **Scheduler latency profile**. It buckets time-from-runnable-to-running by call stack.Reference solution
// main.go — go run main.go
// then: hey -z 10s -c 64 http://localhost:8080/slow & \
// hey -z 10s -c 64 http://localhost:8080/fast
package main
import (
"log"
"net/http"
_ "net/http/pprof"
"os"
"runtime/trace"
)
func slow(w http.ResponseWriter, r *http.Request) {
s := 0
// ~5 ms of cpu on a modern x86. Pre-1.14 this would block
// the P entirely; post-1.14 it's preemptible every ~10ms.
for i := 0; i < 20_000_000; i++ {
s += i
}
_ = s
w.Write([]byte("ok"))
}
func fast(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("ok"))
}
func main() {
go http.ListenAndServe("localhost:6060", nil) // pprof
// One-shot trace dump for the first 10 s
f, _ := os.Create("trace.out")
defer f.Close()
if err := trace.Start(f); err != nil {
log.Fatal(err)
}
go func() {
// stop trace after 10 s; let the program keep running
// without trace overhead afterwards
<-time.After(10 * time.Second)
trace.Stop()
}()
http.HandleFunc("/slow", slow)
http.HandleFunc("/fast", fast)
log.Fatal(http.ListenAndServe(":8080", nil))
}
Task 10: Demonstrate the pre-1.14 starvation bug¶
Goal. A tight CPU loop with no function calls used to block a single P indefinitely on pre-1.14 Go. Show that on Go 1.22, the runtime preempts it after ~10 ms.
Difficulty. Senior
Skills. Async preemption (proposal 24543), GC safe-points, history of the runtime.
Setup. Go 1.22+. Optional: a Go 1.13 binary in a Docker container for the contrasting bug.
Steps. 1. With GOMAXPROCS=1, start a goroutine that runs for {} (literally — no body, no function calls). 2. Start a second goroutine that prints a tick every 100 ms. 3. On Go 1.13: the tick never fires. The for-loop owns the only P; preemption requires a function call (a stack-growth check) which never happens. 4. On Go 1.22: the tick fires; the runtime uses signal-based async preemption. 5. Confirm with runtime/metrics (/sched/latencies:seconds) that the second G's max latency is bounded.
Acceptance criteria. - Your output shows the ticker firing on Go 1.22. - You can describe at a high level what the runtime does: sysmon notices a G running >10 ms, sets g.preempt = true, sends SIGURG, the signal handler arranges for the G to suspend at a safe point. - You can name the design doc (Go proposal 24543, "Non-cooperative goroutine preemption").
Hints
- A `for {}` with no body is the canonical demonstrator. `for { _ = 1 }` also works. Anything with a function call (even `runtime.Gosched`) defeats the demo by creating a preemption point. - `runtime/metrics` reads cheap, no allocation. The histogram `/sched/latencies:seconds` is exactly what you want. - If you must run pre-1.14 Go, use `golang:1.13` Docker. The same code hangs.Reference solution
// main.go — GOMAXPROCS=1 go run main.go
package main
import (
"fmt"
"runtime"
"runtime/metrics"
"time"
)
func main() {
runtime.GOMAXPROCS(1)
// The hostile goroutine: tight loop, no function calls. On
// pre-1.14, this owns the P forever. On 1.14+, sysmon notices
// and signals the M after ~10 ms.
go func() {
for {
}
}()
// The witness goroutine: prints a tick every 100 ms. If the
// tight loop blocks, this never runs. If preemption works,
// it does.
go func() {
t := time.NewTicker(100 * time.Millisecond)
defer t.Stop()
for i := 0; i < 10; i++ {
<-t.C
fmt.Printf("tick %d at %v\n", i, time.Since(start))
}
}()
time.Sleep(1500 * time.Millisecond)
// Senior decision: runtime/metrics is the modern, allocation-free
// way to read scheduler stats. Beats parsing schedtrace strings.
samples := []metrics.Sample{
{Name: "/sched/latencies:seconds"},
}
metrics.Read(samples)
h := samples[0].Value.Float64Histogram()
var p99 float64
cum, total := uint64(0), uint64(0)
for _, c := range h.Counts {
total += c
}
for i, c := range h.Counts {
cum += c
if float64(cum) >= 0.99*float64(total) {
p99 = h.Buckets[i+1]
break
}
}
fmt.Printf("p99 scheduling latency = %.3f ms\n", p99*1000)
}
var start = time.Now()
Task 11: Cost of go f() across P counts¶
Goal. Microbenchmark the cost of spawning a goroutine that does nothing. Sweep GOMAXPROCS ∈ {1, 4, 16} and explain the resulting curve.
Difficulty. Senior
Skills. testing.B, scheduler allocation paths, runqueue contention.
Setup. Go 1.22+, a multi-core machine.
Steps. 1. Write BenchmarkGoSpawn that does b.N iterations of go func(){}() and wg.Wait()s on each. 2. Run with GOMAXPROCS=1, 4, 16. Record ns/op and B/op. 3. Try a variant that fans out to a pool of N goroutines via a channel and compare.
Acceptance criteria. - Spawn cost is in the 200–500 ns range on a modern x86 — confirm the order of magnitude. - At GOMAXPROCS=16, ns/op is higher, not lower, than at GOMAXPROCS=4 (more contention on global runq for the wakeups). - You can explain the gfreecnt field on each P (a free-G cache that makes the next spawn allocation-free).
Hints
- Always `defer wg.Done()` inside the goroutine; otherwise you'll race the `wg.Wait`. - Use `b.ReportAllocs()` to confirm spawn is allocation-free in steady state. - `testing.B`'s default minimum time is 1 s. For 200 ns operations, that's ~5 million iterations. Plenty.Reference solution
// bench_test.go — go test -bench . -benchmem -cpu 1,4,16
package main
import (
"sync"
"testing"
)
func BenchmarkGoSpawn(b *testing.B) {
b.ReportAllocs()
var wg sync.WaitGroup
wg.Add(b.N)
b.ResetTimer()
for i := 0; i < b.N; i++ {
go wg.Done()
}
wg.Wait()
}
// Senior decision: a pool reuses goroutines so we measure only
// send/recv, not the spawn path. Sometimes pools are faster, often
// they're not — the runtime's free-G cache makes raw spawn cheap.
func BenchmarkPoolDispatch(b *testing.B) {
const workers = 16
jobs := make(chan struct{}, workers*2)
done := make(chan struct{}, workers*2)
for i := 0; i < workers; i++ {
go func() {
for range jobs {
done <- struct{}{}
}
}()
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
jobs <- struct{}{}
<-done
}
close(jobs)
}
Task 12: Quantify the cost of runtime.Gosched¶
Goal. Show that runtime.Gosched is not free. Measure exactly how expensive it is in a tight loop and decide when calling it is worth the cost.
Difficulty. Senior
Skills. testing.B, micro-benchmark hygiene, scheduler entry/exit cost.
Setup. Go 1.22+.
Steps. 1. Write two benchmarks: one tight loop summing 1..N; one identical but calling runtime.Gosched() every iteration. 2. Compare ns/op. 3. Sweep "Gosched every K iterations" for K ∈ {1, 10, 100, 1000} and plot/tabulate.
Acceptance criteria. - Per-call Gosched cost is between 100 and 300 ns on x86. - At K=1000 the overhead is negligible; at K=1 it dwarfs the workload. The sensible call frequency depends on what the work is. - You can articulate when Gosched is appropriate: tight loops that need to be cooperative on GOMAXPROCS=1, or "I want to yield before grabbing a lock" patterns.
Hints
- Don't accidentally measure call-site cost vs scheduler cost. Run at multiple `GOMAXPROCS` values; the actual yield only happens if there's work to switch to. - The Gosched cost is roughly: save G's PC + restore scheduler context + run findRunnable + restore G. ~150–250 ns on modern CPUs.Reference solution
// bench_test.go — go test -bench . -benchmem
package main
import (
"runtime"
"testing"
)
func tight(n int) int {
s := 0
for i := 0; i < n; i++ {
s += i
}
return s
}
func tightYield(n, k int) int {
s := 0
for i := 0; i < n; i++ {
s += i
if k > 0 && i%k == 0 {
runtime.Gosched()
}
}
return s
}
func BenchmarkTight(b *testing.B) {
for i := 0; i < b.N; i++ {
tight(10_000)
}
}
func BenchmarkYield_1(b *testing.B) { for i := 0; i < b.N; i++ { tightYield(10_000, 1) } }
func BenchmarkYield_10(b *testing.B) { for i := 0; i < b.N; i++ { tightYield(10_000, 10) } }
func BenchmarkYield_100(b *testing.B) { for i := 0; i < b.N; i++ { tightYield(10_000, 100) } }
func BenchmarkYield_1000(b *testing.B) { for i := 0; i < b.N; i++ { tightYield(10_000, 1000) } }
Task 13: Read findRunnable and summarise priority order¶
Goal. Read findRunnable in runtime/proc.go in current Go source. Produce a one-page summary of the order in which it tries sources of work.
Difficulty. Senior
Skills. Reading runtime source, distinguishing fast paths from slow paths.
Setup. Go source checked out (git clone https://go.googlesource.com/go) or any GOROOT (go env GOROOT).
Steps. 1. Open src/runtime/proc.go. Find func findRunnable() (gp *g, inheritTime, tryWakeP bool). 2. Walk through the function top to bottom. Identify each "try to get work from X" attempt. 3. Produce a numbered list: "1. local runq; 2. global runq (every 61 ticks for fairness); 3. ..." 4. For each, note: cost (cheap/expensive), lock taken (none/sched lock/timer mutex), how it interacts with spinning.
Acceptance criteria. - Your list has at least 7 steps (local, global, finalizer-G fast path, netpoll non-blocking, timers, work-stealing from other Ps, schedule of GC worker, then block). - You can explain the "every 61 ticks" check: it occasionally pulls from the global runq even when local has work, to avoid starvation of globally-queued Gs. - You note the "spinning" state and its purpose: an M may continue searching for work for a brief period before parking, avoiding the cost of park-then-immediately-unpark.
Hints
- The function is ~300 lines. Don't try to understand every detail; identify the *strategies* and the order. - The `top:` label is the entry point; the goto loop is the retry path after partial successes (e.g., a netpoll returned new work that needs to be queued, then we restart the search). - The relevant constants: `schedtick`, `stealOrder`, `spinning`.Reference solution
A one-page summary of `findRunnable` priority order (as of Go 1.22):findRunnable() ordering:
1. Check g.preempt — if asked to yield, prefer the global runq (fairness)
so we don't starve already-queued Gs.
2. Every 61st schedule tick, peek the global runq before the local runq.
This prevents pathological starvation where Gs always re-queue locally.
3. Local runq (p.runqhead..p.runqtail) — lock-free, cheap. Most common
path on a busy P.
4. Global runq (sched.runq) — under sched.lock. Pull at most
min(len(global)/gomaxprocs + 1, len(local)/2) to spread load.
5. Wake one parked P if there's I/O ready in netpoll. Non-blocking
`netpoll(0)` — drain ready Gs without blocking the M.
6. Work-stealing: pick a random P, try to steal half its runq. Use
`stealOrder` (a permuted index) to avoid all Ms hitting the same
victim. Steal timers too if our timer heap is empty.
7. If we're a spinning M, give up spinning here. Park the M with
notesleep on m.park.
8. Idle GC mark worker — if GC needs assist and we have nothing else.
9. Last resort: blocking netpoll. The M blocks in epoll/kqueue until
I/O is ready. This is the only place an M voluntarily blocks
waiting for events.
Key invariants:
- A spinning M holds no P. It transitions from spinning to running by
acquiring a P from the idle list.
- Calling out of findRunnable means: we have a G to run. Calling back
means: we found nothing, M parks.
- The 61-tick fairness check is the single most cited "magic number"
in the scheduler. The number is arbitrary but the principle (a
small probability of bypassing local) avoids starvation.
Task 14: Use //go:linkname to call an unexported runtime function¶
Goal. Use //go:linkname to call runtime_procPin and runtime_procUnpin from your code. Demonstrate that between Pin and Unpin, the goroutine cannot be preempted to a different P.
Difficulty. Senior
Skills. //go:linkname, runtime ABI, the difference between LockOSThread (M binding) and procPin (P binding).
Setup. Go 1.22+, a single .go file, and a one-line cgo or empty .s assembly trick to allow linkname.
Steps. 1. Add an empty unsafe.go file in your package that just imports _ "unsafe". 2. Add the linkname declarations for runtime_procPin and runtime_procUnpin. 3. Call them around a section that reads/writes a per-P data structure (a [GOMAXPROCS]Counter array indexed by the current P). 4. Verify with a multi-goroutine test that contention vanishes — each P touches only its own counter.
Acceptance criteria. - Your code compiles and runs. - You acknowledge that //go:linkname is unsupported and can break on any minor version. - You can articulate when procPin is more appropriate than LockOSThread (very short critical sections that need a stable P identity but not a stable M).
Hints
- The directive is `//go:linkname localName runtime.symbolName` immediately above a *no-body* function declaration. - You need `import _ "unsafe"` somewhere in the file. The directive is gated on that import. - `runtime_procPin` returns the current P's id (int) and disables preemption. `runtime_procUnpin` re-enables it. Forget the Unpin and the runtime panics on the next GC.Reference solution
// procpin.go — go run procpin.go
package main
import (
"fmt"
"runtime"
"sync"
"sync/atomic"
_ "unsafe" // for go:linkname
)
//go:linkname runtime_procPin runtime.procPin
func runtime_procPin() int
//go:linkname runtime_procUnpin runtime.procUnpin
func runtime_procUnpin()
// Per-P counters. We use a slice sized GOMAXPROCS at init and
// trust procPin to give us a valid index.
type Counter struct {
_ [64]byte // false-sharing padding
n int64
_ [64]byte
}
var counters []Counter
func incPerP() {
pid := runtime_procPin()
counters[pid].n++ // single-writer, no atomic needed: we own this P
runtime_procUnpin()
}
func sumAll() int64 {
var s int64
for i := range counters {
s += atomic.LoadInt64(&counters[i].n)
}
return s
}
func main() {
maxp := runtime.GOMAXPROCS(0)
counters = make([]Counter, maxp)
const N = 1_000_000
var wg sync.WaitGroup
for w := 0; w < 8; w++ {
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < N; i++ {
incPerP()
}
}()
}
wg.Wait()
fmt.Printf("expected=%d sum=%d\n", 8*N, sumAll())
}
Task 15: Goroutine count limiter vs errgroup¶
Goal. Implement a "spawn at most N goroutines at a time" primitive using a buffered channel + WaitGroup. Benchmark it against errgroup.WithContext + g.SetLimit(N). Discover that the standard library's primitive is roughly as fast and has cancellation built in.
Difficulty. Senior
Skills. Concurrency patterns, fair scheduling, when to roll your own.
Setup. Go 1.22+, golang.org/x/sync/errgroup.
Steps. 1. Write Limiter with Go(func()) that blocks until a slot is free. Use chan struct{} as the semaphore. 2. Write a benchmark that fires 10 000 small tasks through it. 3. Repeat with errgroup + SetLimit(N). 4. Compare ns/op, allocations, and cancellation behaviour (kill the parent ctx mid-flight).
Acceptance criteria. - Both implementations finish in roughly the same wall time. - The errgroup variant cancels on ctx-done; your hand-rolled one doesn't (unless you wire it up). - You can articulate when the hand-rolled limiter is justified (no error propagation needed, want zero deps).
Hints
- `g.SetLimit(N)` in errgroup blocks `g.Go` when N goroutines are already in flight. It replaced the older `errgroup` + manual semaphore pattern that was everywhere pre-1.20. - Cancellation: `errgroup.WithContext` returns a derived ctx that's cancelled when the first goroutine returns an error. Your limiter needs explicit ctx wiring or callers won't get this.Reference solution
// limiter.go and limiter_test.go
package main
import (
"context"
"sync"
"testing"
"golang.org/x/sync/errgroup"
)
// Senior decision: minimal Limiter. No error propagation, no context.
// Tradeoff: explicit, dependency-free, hard to extend. The moment you
// want "stop on first error" you've reinvented errgroup.
type Limiter struct {
sem chan struct{}
wg sync.WaitGroup
}
func NewLimiter(n int) *Limiter {
return &Limiter{sem: make(chan struct{}, n)}
}
func (l *Limiter) Go(fn func()) {
l.sem <- struct{}{} // blocks if N goroutines already in flight
l.wg.Add(1)
go func() {
defer func() { <-l.sem; l.wg.Done() }()
fn()
}()
}
func (l *Limiter) Wait() { l.wg.Wait() }
const N = 10_000
func work() {
s := 0
for i := 0; i < 100; i++ {
s += i
}
_ = s
}
func BenchmarkHandRolled(b *testing.B) {
for i := 0; i < b.N; i++ {
l := NewLimiter(16)
for j := 0; j < N; j++ {
l.Go(work)
}
l.Wait()
}
}
func BenchmarkErrgroup(b *testing.B) {
for i := 0; i < b.N; i++ {
g, _ := errgroup.WithContext(context.Background())
g.SetLimit(16)
for j := 0; j < N; j++ {
g.Go(func() error { work(); return nil })
}
g.Wait()
}
}
Task 16: Read the async preemption design doc¶
Goal. Read Go proposal 24543 (Non-cooperative goroutine preemption). Summarise it in your own words and identify which runtime source files implement it.
Difficulty. Staff
Skills. Reading design docs, mapping proposals to code, runtime architecture.
Setup. Web access to read the proposal, Go source checked out.
Steps. 1. Read https://go.googlesource.com/proposal/+/master/design/24543-non-cooperative-preemption.md 2. Write a one-page summary covering: the problem, alternatives considered (loop preemption, separate sched thread, etc.), the chosen mechanism (signal-based, with stack-pointer rewriting), and the GC implications. 3. Locate the implementation in src/runtime/. At minimum: signal_unix.go for the signal handler, preempt.go for the orchestration, proc.go for the integration with sysmon. 4. Identify three subtleties you didn't anticipate (e.g., what about register state? what if the G is in unsafe-point code?).
Acceptance criteria. - Your summary correctly identifies SIGURG as the signal of choice (chosen because it's rarely used by other code). - You can name the concept of "async safe points" — instructions where the runtime is willing to suspend a G. Not every instruction qualifies; the runtime uses PCDATA tables to track which are safe. - You can describe one limitation: on architectures without good signal-mechanism support, preemption still falls back to stack-growth checks.
Hints
- The proposal predates the implementation by ~18 months. It went through several revisions. - The Go team's blog post "Goroutine preemption" (around 1.14 release) is a more accessible summary. - `runtime/preempt_amd64.s` (and per-arch siblings) contain the actual assembly that rewrites a goroutine's stack to call into the preemption handler.Reference solution
**Summary of proposal 24543:** > Before Go 1.14, goroutine preemption was *cooperative*: the runtime could only suspend a goroutine when that goroutine called a function (because each prologue checks `g.preempt`). Tight loops with no calls would run forever, blocking GC stop-the-world phases and starving other goroutines on the same P. The proposal introduces non-cooperative ("async") preemption via OS signals. > > **Mechanism (Unix):** `sysmon` notices a G has been running >10 ms. It sets `g.preempt = true` and sends `SIGURG` to the M running the G. The signal handler inspects the trapped PC; if it's at an async-safe point (most instructions are), it rewrites the G's stack so that when the signal returns, control transfers to `asyncPreempt` instead. `asyncPreempt` saves all registers, calls `preemptPark`, which suspends the G. The G is then resumed normally later. > > **Why SIGURG:** it's rarely sent by anything else (originally for "urgent" out-of-band TCP data). Using SIGUSR1/2 would conflict with user code; using a synthetic signal would require kernel support. > > **Async safe points:** not every instruction is safe to suspend at. Code that's mid-write-barrier, mid-stack-grow, or in runtime-internal critical sections must not be suspended. The compiler emits `funcdata` tables (`runtime.PCDATA_UnsafePoint`) marking unsafe instruction ranges. The preemption handler checks the trapped PC against these tables and, if unsafe, returns without preempting (the next sysmon round will retry). > > **GC integration:** before async preemption, GC stop-the-world had to wait for every G to reach a function call. With async preemption, GC can suspend a G in ~10 µs regardless of what it's doing. This alone reduced p99 GC pauses by 10–100× on workloads with tight loops. > > **Affected files:** > - `runtime/preempt.go` — orchestration, `preemptone`, `preemptall`. > - `runtime/signal_unix.go` — `sighandler`, dispatch of `_SIGURG` to preemption handling. > - `runtime/preempt_amd64.s` (and siblings) — `asyncPreempt` register save/restore. > - `runtime/proc.go` — `sysmon`'s preemption decision (`retake`). > - `runtime/runtime2.go` — additions to `g` and `m` structs (`preempt`, `preemptStop`, `preemptShrink`). > > **Subtleties you might not anticipate:** > 1. Windows uses thread suspension (`SuspendThread`) instead of signals, with similar PC-rewrite logic. > 2. cgo callbacks are not preemptible — the M holding the cgo call is opaque. `sysmon` simply records a debt that's paid on cgo-call return. > 3. The first 10 ms of a goroutine's life are exempt — newly-created Gs are too valuable to disturb. Only after a G has been running 10 ms does it become a preemption candidate. The reason this matters to you as an operator: every story about Go GC latency *before* 2020 includes "tight loop blocks GC". Every modern Go GC story does not. The async preemption work is the boundary.Task 17: Diff runtime/proc.go between two Go releases¶
Goal. Compare runtime/proc.go between two adjacent Go releases (e.g., 1.21 and 1.22). Identify one significant scheduler change and explain its motivation.
Difficulty. Staff
Skills. git log, git diff, release-note correlation, reading runtime patches.
Setup. Go source repo (git clone https://go.googlesource.com/go), git.
Steps. 1. cd go && git log --oneline -- src/runtime/proc.go | grep -i 1.22 or similar to find the release boundary. 2. git diff release-branch.go1.21..release-branch.go1.22 -- src/runtime/proc.go | less 3. Pick one non-trivial hunk (>30 lines). Identify the function and the change in behaviour. 4. Cross-reference with the Go 1.22 release notes and any linked issues/CLs.
Acceptance criteria. - You name a specific change (e.g., PGO-driven inlining hooks, runtime/metrics additions, the GOROOT/GOMAXPROCS env-var defaulting changes). - You can explain what the previous code did, what the new code does, and what user-visible behaviour shifts. - You found at least one commit message and skimmed it.
Hints
- Major scheduler changes don't ship in every release. Sometimes the diff is mostly clean-ups. That's fine — small refactors are also good reading. - Useful starting points: `findRunnable`, `schedule`, `sysmon`, `runqsteal`. Less-fundamental functions change more often. - `git blame` on a specific line tells you which commit introduced it.Reference solution
Example: Go 1.21 → 1.22 changes in `runtime/proc.go`.$ git diff release-branch.go1.21..release-branch.go1.22 -- src/runtime/proc.go \
| grep '^@@' | head
@@ -240,6 +240,12 @@ func main() {
@@ -1024,7 +1030,15 @@ func mPark() {
@@ -2114,3 +2128,25 @@ func runqgrab(...) {
+ // In Go 1.22, when stealing from another P's runq, we now
+ // take a portion proportional to the victim's queue depth
+ // rather than always halving. This avoids over-correction
+ // on bursty workloads.
func runqgrab(_p_ *p, batch *[256]guintptr, ...) uint32 {
- n := _p_.runqtail - _p_.runqhead
- n -= n / 2
+ n := _p_.runqtail - _p_.runqhead
+ if n > 1 {
+ n -= n / 2
+ }
Task 18: Profile LockOSThread impact on a worker pool¶
Goal. Build a CPU-bound worker pool, then run it again with every worker holding LockOSThread. Quantify the increase in scheduler latency.
Difficulty. Staff
Skills. runtime/metrics, LockOSThread, M:N model under stress.
Setup. Go 1.22+, a machine with more workers than cores (oversubscription is the point).
Steps. 1. Pool of N workers consuming from a channel, doing 1 ms of CPU work per job. 2. Drive GOMAXPROCS=4, N=16 workers, 100 000 jobs. 3. Measure: median, p99 of "job-arrival to job-start" using a timestamp on the job. 4. Repeat with runtime.LockOSThread() at the top of each worker goroutine. 5. Compare. Predict the result, then run.
Acceptance criteria. - Without lock: latency is dominated by GOMAXPROCS=4 parallelism. p99 is roughly (16/4)*1ms = 4 ms. - With lock: scheduler latency is much worse, because Ms cannot be re-bound to a different P, so a parked M cannot rejoin work. - You can articulate the cost: every worker now consumes one OS thread permanently. Memory pressure (each thread has its own kernel stack), context-switch cost, and reduced M:N flexibility.
Hints
- `runtime/metrics` exposes `/sched/latencies:seconds`. That's per-G runnable-to-running latency. Also `/cpu/classes/scavenge/total:cpu-seconds` if you want to rule out GC effects. - The "job arrival to job start" timestamp is application-level latency. The scheduler latency from `runtime/metrics` is one component.Reference solution
// main.go — go run main.go
package main
import (
"fmt"
"runtime"
"runtime/metrics"
"sync"
"time"
)
type job struct {
arrival time.Time
}
func worker(jobs <-chan job, wg *sync.WaitGroup, lock bool, results chan<- time.Duration) {
defer wg.Done()
if lock {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
}
for j := range jobs {
results <- time.Since(j.arrival)
// 1 ms of CPU
s := 0
for i := 0; i < 3_000_000; i++ {
s += i
}
_ = s
}
}
func run(lock bool) (p50, p99 time.Duration) {
const NJobs, NWorkers = 100_000, 16
runtime.GOMAXPROCS(4)
jobs := make(chan job, NWorkers)
results := make(chan time.Duration, NJobs)
var wg sync.WaitGroup
for i := 0; i < NWorkers; i++ {
wg.Add(1)
go worker(jobs, &wg, lock, results)
}
go func() {
for i := 0; i < NJobs; i++ {
jobs <- job{arrival: time.Now()}
}
close(jobs)
}()
wg.Wait()
close(results)
var lats []time.Duration
for d := range results {
lats = append(lats, d)
}
sortDurations(lats)
return lats[len(lats)/2], lats[len(lats)*99/100]
}
func sortDurations(d []time.Duration) {
// insertion sort, fine for our N
for i := 1; i < len(d); i++ {
for j := i; j > 0 && d[j-1] > d[j]; j-- {
d[j-1], d[j] = d[j], d[j-1]
}
}
}
func schedP99() float64 {
samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
metrics.Read(samples)
h := samples[0].Value.Float64Histogram()
var cum, total uint64
for _, c := range h.Counts {
total += c
}
for i, c := range h.Counts {
cum += c
if float64(cum) >= 0.99*float64(total) {
return h.Buckets[i+1]
}
}
return 0
}
func main() {
for _, lock := range []bool{false, true} {
p50, p99 := run(lock)
sched := schedP99()
fmt.Printf("lock=%-5v jobP50=%v jobP99=%v schedP99=%.2fms\n",
lock, p50, p99, sched*1000)
}
}
Task 19: Microbenchmark global runq contention¶
Goal. Construct a workload that forces the global runq to be hot. Show that as GOMAXPROCS grows beyond the work-arrival concurrency, the sched lock becomes the bottleneck.
Difficulty. Staff
Skills. runtime/pprof for sched_lock contention, microbenchmarking, runtime architecture.
Setup. Go 1.22+, a multi-core machine, pprof.
Steps. 1. Write a benchmark where one producer goroutine fires go work() as fast as it can, and N consumer Ps drain. 2. Sweep GOMAXPROCS ∈ {2, 4, 8, 16, 32}. Measure throughput (jobs/sec). 3. Capture a mutex profile (runtime.SetMutexProfileFraction(1)) and look at runtime.schedule's contention. 4. Demonstrate that at high P counts, the producer becomes the bottleneck because every new go requires touching the sched lock to wake a parked M.
Acceptance criteria. - Throughput improves from P=2 to P=8, then plateaus or regresses by P=32. - The mutex profile shows runtime.lock / runtime.unlock from runtime.schedule near the top. - You can articulate why: every G's go statement that targets a parked M takes the sched lock to dequeue from the idle-M list and wake it.
Hints
- The local runq is lock-free. Contention only happens on the *global* runq and the M idle list. - A program that fires `go` from one goroutine creates many wake-ups because most Ms are parked (no local work). - `pprof` mutex profile: `go test -bench . -mutexprofile=mutex.out`, then `go tool pprof mutex.out`.Reference solution
// bench_test.go — go test -bench . -benchmem -mutexprofile=mutex.out -cpu 2,4,8,16,32
package main
import (
"sync"
"sync/atomic"
"testing"
)
func BenchmarkGlobalRunq(b *testing.B) {
// Senior decision: ResetTimer after barrier setup; otherwise
// the spawn warmup dominates short benchmarks.
var done int64
var wg sync.WaitGroup
wg.Add(b.N)
b.ResetTimer()
for i := 0; i < b.N; i++ {
// Spawning from main G ensures all wakes go through
// the shared sched lock pathway.
go func() {
defer wg.Done()
atomic.AddInt64(&done, 1)
}()
}
wg.Wait()
_ = done
}
Task 20: Design a work-steal tracer¶
Goal. Sketch the design of a tool that emits a custom event each time a P performs a work-steal. You won't ship this — you'll discuss what it would require.
Difficulty. Staff
Skills. Runtime internals, tracing infrastructure, design judgement.
Setup. Pen, paper, and the source of runtime/trace.go.
Steps. 1. Read runtime/trace.go's event format. Identify the existing event types (GoCreate, GoStart, GoBlock, ...). 2. Locate runqsteal in runtime/proc.go. Identify the call site where you would emit a GoStealStart event. 3. Specify the event payload: source P, target P, number of Gs stolen, timestamp. 4. Discuss implementation cost: where the bytes are written (p.tracebuf), the budget (each trace event is ~10 bytes), and the alternative of using an existing event creatively.
Acceptance criteria. - Your design names the precise function (runqsteal) and line range. - You acknowledge that adding a new trace event requires updating both runtime/trace.go and cmd/trace/internal/trace to display it. - You identify an alternative: emit a UserLog event from the stealing M with a stringified description, no runtime patch needed.
Hints
- Custom trace events in Go 1.22+ are easier via `runtime/trace.Log` (Task 6). But that requires user-code call sites; you can't easily inject from inside the runtime without a patch. - `runqsteal` is hot — even a few nanoseconds added per call is significant. The trace buffer write is ~30 ns. Acceptable, but measure. - The `internal/trace` package parses the binary format. Adding an event type means version-bumping the trace format (`traceEvVersion`).Reference solution
**Design sketch:**Goal: emit "STEAL src=Px dst=Py n=K" every time a P steals work.
Where to instrument: runtime/proc.go, func runqsteal:
func runqsteal(_p_, p2 *p, ...) ... {
n := runqgrab(p2, ...)
if n == 0 {
return 0
}
// NEW: emit trace event
if traceEnabled() {
traceGoStealStart(_p_.id, p2.id, n)
}
...
}
Payload format (added to runtime/traceback.go event table):
traceEvGoStealStart = NN // new event id, version bump required
args: srcP int32, dstP int32, n uint32
Per-event cost:
- traceEvent() — already used by other events, ~20 ns hot, ~50 ns cold
- 3 args × varint-encode = ~3-9 bytes
- Negligible compared to the steal itself (~1 µs).
Parser-side changes (cmd/trace, internal/trace):
1. Bump traceVersion.
2. Add Event constant and decoder.
3. Add a new view in cmd/trace's HTML: "Work-steal heatmap".
Alternative (zero-patch): use trace.Log from a wrapper.
- Can't directly, because the steal happens in runtime code that
user code can't intercept.
- Could be approximated by: monitor per-P runq depths from
user code at 100 µs intervals (sysmon-ish). Detect drops in
one P's depth that coincide with rises in another. Lossy but
requires no runtime patch.
Production justification: would I land this upstream?
- "Why is my workload steal-heavy?" is a real question.
- Existing trace doesn't show steals; you can infer them from
G-migrations across P rows but it's tedious.
- Cost is borderline acceptable. The trace buffer is already
paying ~5% overhead; another event type adds <1%.
Decision: write it as a runtime patch, ship as a private build,
collect data, then decide if it's worth proposing upstream. Don't
upstream speculatively — the Go team is rightly conservative about
trace event proliferation.
Task 21: GOMAXPROCS far above core count¶
Goal. Set GOMAXPROCS=128 on a 10-core machine. Predict the result. Run. Explain.
Difficulty. Staff
Skills. OS scheduling vs Go scheduling, oversubscription effects.
Setup. Go 1.22+, multi-core machine, ability to set GOMAXPROCS.
Steps. 1. CPU-bound workload that takes ~5 s at GOMAXPROCS=1. 2. Run at GOMAXPROCS=10 (cores), GOMAXPROCS=20, GOMAXPROCS=128. Record wall time. 3. Use GODEBUG=schedtrace=500 to observe runqueue sizes and parked-M counts. 4. Use top or htop (or Activity Monitor) to see context-switch rate.
Acceptance criteria. - Wall time at P=128 is roughly the same as P=10, sometimes slightly worse, never better. - Context-switch rate jumps significantly at P=128 because the OS is now multiplexing 128 OS threads onto 10 cores. - You can articulate why: the Go runtime gladly creates 128 Ms, but the kernel does its scheduling on top. Two layers of scheduling means context switches happen at both layers.
Hints
- Don't expect catastrophic slowdown — modern Linux schedulers handle oversubscription gracefully. The penalty is real but small. - The interesting metric is *latency*, not throughput. p99 of a per-task timer will degrade at P=128 because of OS-level context switches.Reference solution
// main.go — for P in 10 20 128; do GOMAXPROCS=$P go run main.go; done
package main
import (
"fmt"
"runtime"
"sync"
"time"
)
func work(d time.Duration) {
end := time.Now().Add(d)
for time.Now().Before(end) {
// burn
}
}
func main() {
const NJobs = 200
p := runtime.GOMAXPROCS(0)
var wg sync.WaitGroup
wg.Add(NJobs)
t := time.Now()
for i := 0; i < NJobs; i++ {
go func() {
defer wg.Done()
work(100 * time.Millisecond)
}()
}
wg.Wait()
fmt.Printf("P=%d wall=%v\n", p, time.Since(t))
}
Task 22: Scheduling latency during GC mark-assist¶
Goal. Construct a workload heavy enough to trigger GC mark-assist. Show, via runtime/metrics, that scheduling latency spikes during the assist phase.
Difficulty. Staff
Skills. GC architecture, mark-assist mechanism, correlating multiple metrics.
Setup. Go 1.22+, a workload with steady allocation pressure.
Steps. 1. Workload allocates ~100 MB/s of small objects (the goal is to keep GC working). 2. Periodically sample runtime/metrics: /sched/latencies:seconds, /gc/pauses:seconds, /cpu/classes/gc/mark/assist:cpu-seconds. 3. Plot the three over time. Find the correlated spikes. 4. Reduce allocation rate by 4× and re-run. Confirm latency stabilises.
Acceptance criteria. - You produce a time-series showing all three metrics. Mark-assist CPU correlates with scheduling latency p99. - You can explain mark-assist: when a G allocates more than its "budget" while GC is running, the allocator hijacks the G and forces it to do marking work. That G is now doing GC, not user work. - You can name the fix: reduce allocation rate, increase GOGC, or move large allocations off the hot path.
Hints
- `GODEBUG=gctrace=1` is the easiest way to confirm GC is running and assists are happening (look for `assist` in the trace line). - `runtime/metrics.Read` is allocation-free; you can sample from a hot loop without polluting your results.Reference solution
// main.go — go run main.go
package main
import (
"fmt"
"runtime/metrics"
"sync"
"time"
)
func churn(rate int, stop <-chan struct{}) {
for {
select {
case <-stop:
return
default:
}
for i := 0; i < rate; i++ {
_ = make([]byte, 1024) // 1 KB each
}
}
}
func snapshot() (schedP99, gcAssist, gcPauseP99 float64) {
samples := []metrics.Sample{
{Name: "/sched/latencies:seconds"},
{Name: "/cpu/classes/gc/mark/assist:cpu-seconds"},
{Name: "/gc/pauses:seconds"},
}
metrics.Read(samples)
h := samples[0].Value.Float64Histogram()
schedP99 = histP99(h)
gcAssist = samples[1].Value.Float64()
gh := samples[2].Value.Float64Histogram()
gcPauseP99 = histP99(gh)
return
}
func histP99(h *metrics.Float64Histogram) float64 {
var cum, total uint64
for _, c := range h.Counts {
total += c
}
for i, c := range h.Counts {
cum += c
if float64(cum) >= 0.99*float64(total) {
return h.Buckets[i+1]
}
}
return 0
}
func main() {
stop := make(chan struct{})
var wg sync.WaitGroup
const Workers = 8
wg.Add(Workers)
for i := 0; i < Workers; i++ {
go func() {
defer wg.Done()
churn(2000, stop) // ~16 MB per worker per loop iter, hot
}()
}
// Sample for 5 s
tick := time.NewTicker(500 * time.Millisecond)
defer tick.Stop()
timeout := time.After(5 * time.Second)
prevAssist := 0.0
for {
select {
case <-tick.C:
sched, assist, pause := snapshot()
fmt.Printf("schedP99=%.2fms pauseP99=%.2fms gcAssist=%.3f cpu-sec/0.5s (Δ=%.3f)\n",
sched*1000, pause*1000, assist, assist-prevAssist)
prevAssist = assist
case <-timeout:
close(stop)
wg.Wait()
return
}
}
}
Task 23: Fair scheduler wrapper across priority groups¶
Goal. Implement a user-space "fair scheduler" that round-robins a worker pool across N priority groups, each with its own work channel. Discuss trade-offs.
Difficulty. Staff
Skills. Channel patterns, fairness vs throughput, priority inversion.
Setup. Go 1.22+, no external deps.
Steps. 1. Define Job{Priority int, Run func()} and Scheduler with N priority channels. 2. Worker goroutines pull from priority queues in round-robin order: take 1 from priority 0, 1 from priority 1, ..., wrap. 3. Compare against (a) strict priority (always drain high before low) and (b) FIFO across all groups. 4. Show round-robin's anti-starvation property under "100% high-priority load".
Acceptance criteria. - Your round-robin scheduler delivers throughput on the lowest priority group even when the highest is saturated. - Strict priority starves the lowest under the same input. - You can articulate the trade: round-robin reduces worst-case delay on low-priority work; strict priority gives best-case delay on high-priority work. Pick based on whether you fear starvation or care about urgency.
Hints
- Don't use `select { case <-chans[0]: ... case <-chans[1]: ...}` — `select` picks randomly, which is *not* round-robin. - Use a loop with explicit index that increments. Use `default` to skip empty queues without blocking. - "Fairness across priority groups" is different from "fair scheduling of goroutines". Go's runtime gives you the latter for free.Reference solution
// fair.go — go run fair.go
package main
import (
"fmt"
"sync"
"sync/atomic"
"time"
)
type Job struct {
Priority int
Done chan time.Duration
Created time.Time
}
type Scheduler struct {
queues [][]chan Job // priority groups
wg sync.WaitGroup
stop chan struct{}
}
func NewScheduler(priorities, workers int) *Scheduler {
s := &Scheduler{
queues: make([][]chan Job, priorities),
stop: make(chan struct{}),
}
for p := range s.queues {
s.queues[p] = []chan Job{make(chan Job, 1024)}
}
for w := 0; w < workers; w++ {
s.wg.Add(1)
go s.worker()
}
return s
}
func (s *Scheduler) Submit(j Job) {
j.Created = time.Now()
s.queues[j.Priority][0] <- j
}
// Senior decision: round-robin across priorities. Each pass, try
// each queue with a non-blocking receive. If nothing was found,
// block on a select across all queues (so we don't busy-spin).
func (s *Scheduler) worker() {
defer s.wg.Done()
cursor := 0
for {
select {
case <-s.stop:
return
default:
}
gotOne := false
for i := 0; i < len(s.queues); i++ {
p := (cursor + i) % len(s.queues)
select {
case j := <-s.queues[p][0]:
j.Done <- time.Since(j.Created)
gotOne = true
cursor = (p + 1) % len(s.queues)
default:
}
if gotOne {
break
}
}
if !gotOne {
// Nothing on any queue; block on any with a fair select
cases := make([]chan Job, len(s.queues))
for i, q := range s.queues {
cases[i] = q[0]
}
// Note: select picks randomly. Acceptable here because
// it's the slow path (nothing on any queue).
select {
case j := <-cases[0]:
j.Done <- time.Since(j.Created)
case j := <-cases[1]:
j.Done <- time.Since(j.Created)
case <-s.stop:
return
}
}
}
}
func (s *Scheduler) Close() {
close(s.stop)
s.wg.Wait()
}
func main() {
s := NewScheduler(2, 4)
defer s.Close()
var highSum, lowSum, highN, lowN int64
hDone := make(chan time.Duration, 10000)
lDone := make(chan time.Duration, 10000)
// Saturate priority 0; lightly use priority 1
go func() {
for i := 0; i < 10000; i++ {
s.Submit(Job{Priority: 0, Done: hDone})
}
}()
go func() {
for i := 0; i < 200; i++ {
s.Submit(Job{Priority: 1, Done: lDone})
time.Sleep(time.Millisecond)
}
}()
go func() {
for d := range hDone {
atomic.AddInt64(&highSum, int64(d))
atomic.AddInt64(&highN, 1)
}
}()
go func() {
for d := range lDone {
atomic.AddInt64(&lowSum, int64(d))
atomic.AddInt64(&lowN, 1)
}
}()
time.Sleep(2 * time.Second)
hN, lN := atomic.LoadInt64(&highN), atomic.LoadInt64(&lowN)
if hN > 0 && lN > 0 {
fmt.Printf("high: avg=%v count=%d\n", time.Duration(highSum/hN), hN)
fmt.Printf("low : avg=%v count=%d\n", time.Duration(lowSum/lN), lN)
}
}
How to grade yourself¶
Score each task 0 (didn't try), 1 (got it with hints), 2 (unaided), 3 (your investigation surfaced a concrete observation the reference solution didn't anticipate — a unique trace pattern, a measurable on your hardware, a runtime quirk you can reproduce). Sum:
| Score | What it means |
|---|---|
| 0–20 | You can run a Go program and see it use CPU. The scheduler is a black box. Re-read junior.md, redo Tasks 1–5; the goal there is to make schedtrace and runtime/trace feel like ordinary tools, not arcana. |
| 21–40 | Comfortable with the observability layer. Tasks 6–12 push you into the failure modes: pre-1.14 starvation, work-stealing, contention. By the end you can read a trace and say "the problem is N", not "the program is slow". |
| 41–55 | You can debug a production scheduling issue from runtime/metrics alone. Tasks 13–18 force you into the source. The reward is being the person who reads a goroutine dump and immediately sees the lock-order violation that everyone else missed. |
| 56–69 | Staff. Tasks 19–23 are about design judgement: when to instrument, when to leave alone, when fairness in user space matters. If they didn't teach you something concrete about your own production workloads, you weren't reading metrics during the experiments. Re-run Task 22 against a real service and prove (or disprove) the assist hypothesis. |
Concrete checks worth running:
- Task 1: predict three fields of
schedtracebefore running. Right ≥2/3? You're fluent. - Task 4: take a
scheddetail=1dump from a real service and find one goroutine in an unexpected wait state. Open a bug or close a misconception. - Task 9: produce a real scheduling-latency spike in your team's code and fix it.
- Task 10: run on Go 1.22 and also on Go 1.13 (Docker
golang:1.13). The contrast is the whole point. - Task 13: write the priority list from memory after a week. If you can't, you didn't read carefully enough; redo with a notebook.
- Task 14: do not ship
//go:linknamecode without a Go-version regression test. The compiler will not warn you. - Task 18: measure on a workload where
LockOSThreadis genuinely needed (cgo+OpenGL or similar). The trade is real; quantify it for your case. - Task 22: instrument your production service for one week. If GC mark-assist isn't visibly correlated with latency spikes, something else is the bottleneck; investigate.
The single most important question is not did you finish. It is can you predict, from a one-paragraph workload description, where the scheduler will become the bottleneck and what runtime/metrics field will surface it? When a colleague says "our p99 jumped after we increased worker count from 8 to 64" and you can name three plausible scheduler causes and the metric to confirm each — that's senior. When you can also pick the cheapest one to verify first, that's staff.
Stretch challenges¶
S1 — Build a custom dashboard for runtime/metrics. Wire runtime/metrics into Prometheus or OpenTelemetry. Expose all /sched/*, /gc/*, and /cpu/classes/gc/* histograms. Build a Grafana dashboard with: scheduling latency p50/p99, GC pause p99, assist CPU rate, goroutine count, and Ps idle/busy. Run it against a real service for a week. Identify three "interesting" events in the timeline (deploys, traffic spikes, GC churn) and write a one-paragraph post-mortem of each. The deliverable is not the dashboard; it is the intuition of what these metrics look like in steady state vs distress.
S2 — Reproduce a real scheduling pathology from a Go issue tracker. Browse https://github.com/golang/go/issues with label Scheduler. Pick a closed issue with a reproducer attached. Reproduce on your machine. Walk through the fix in the corresponding CL. Predict, before reading the fix, what you would change. Compare. The skill grown here is reading other people's runtime patches.
S3 — Sketch a "scheduler-aware" worker pool library. Take Task 15's limiter and Task 23's fair scheduler. Combine them: a worker pool that monitors runtime/metrics and shrinks/grows its worker count in response to scheduling latency. When /sched/latencies:seconds p99 exceeds a threshold, reduce concurrency (counterintuitive but correct — you're oversubscribed). When it falls, expand. Stress-test against a workload that varies between CPU-bound and I/O-bound. The interesting bug to find: the feedback loop oscillates if your hysteresis is wrong. Tune it.