GOMAXPROCS — Optimization Exercises¶
Each exercise starts with a baseline measurement, identifies a tuning hypothesis, and asks you to validate or refute it on real hardware. Benchmarks are the spine — never tune without numbers.
Easy¶
Exercise 1 — Confirm NumCPU is the Sweet Spot¶
Setup. A CPU-burning HTTP server, 8-core box, no cgroup limits.
```go
package main

import (
	"crypto/sha256"
	"log"
	"net/http"
	"runtime"
)

// handler burns a fixed slice of CPU per request: 64 SHA-256 writes over a
// 4 KiB buffer, then returns the digest.
func handler(w http.ResponseWriter, r *http.Request) {
	h := sha256.New()
	buf := make([]byte, 4096)
	for i := 0; i < 64; i++ {
		h.Write(buf)
	}
	_, _ = w.Write(h.Sum(nil))
}

func main() {
	log.Printf("GOMAXPROCS=%d", runtime.GOMAXPROCS(0))
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
Target. Sweep GOMAXPROCS ∈ {1, 2, 4, 8, 16, 32} with wrk -t8 -c64 -d20s. Plot throughput. Confirm the peak at GOMAXPROCS=8 (NumCPU) and the regression beyond it. A harness sketch follows the deliverable.
Deliverable. Plot + 50-word interpretation.
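A minimal sweep-harness sketch, assuming the server above is built as ./server and wrk is on PATH (both names are placeholders; adapt to your layout):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	for _, p := range []string{"1", "2", "4", "8", "16", "32"} {
		srv := exec.Command("./server")
		srv.Env = append(os.Environ(), "GOMAXPROCS="+p)
		if err := srv.Start(); err != nil {
			log.Fatal(err)
		}
		time.Sleep(time.Second) // crude warm-up; poll a health endpoint in real use

		out, err := exec.Command("wrk", "-t8", "-c64", "-d20s",
			"http://localhost:8080/").CombinedOutput()
		if err != nil {
			log.Fatalf("wrk: %v\n%s", err, out)
		}
		fmt.Printf("=== GOMAXPROCS=%s ===\n%s\n", p, out)

		_ = srv.Process.Kill()
		_, _ = srv.Process.Wait()
	}
}
```

The same loop, with the list of values swapped, drives every sweep in this section.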
Exercise 2 — Find the Container Default¶
Setup. A Go binary in a Docker container with --cpus=2 on a 64-core host. Go version 1.22.
Target. Without modifying the service source, confirm what GOMAXPROCS resolves to. Compare the Go 1.15, Go 1.22, and Go 1.25 images (Go 1.25 is the first release whose runtime reads the cgroup CPU quota). Document the differences.
Deliverable. A markdown table: Go version vs reported GOMAXPROCS. Recommendation based on the table.
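One approach that leaves the service untouched: a throwaway probe compiled inside each image.

```go
// probe.go — prints what the runtime resolves inside the container.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
}
```

Run it per image, e.g. docker run --rm --cpus=2 -v "$PWD":/src -w /src golang:1.22 go run probe.go, swapping the image tag for each row of the table.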
Exercise 3 — Eliminate a Manual Override¶
Setup. Production code with runtime.GOMAXPROCS(64) near main(). The service runs in a pod with cpu: 2.
Target. Remove the manual override. Confirm via the sweep harness that throughput is equal-or-better and p99 latency drops. Report numbers before and after.
Deliverable. A before/after table and the PR diff (a single line deletion).
Exercise 4 — Add automaxprocs¶
Setup. A service running Go 1.15 (cannot upgrade — third-party dependency). Pod has cpu: 4. Service reports GOMAXPROCS=64.
Target. Add import _ "go.uber.org/automaxprocs". Confirm log line. Confirm GOMAXPROCS drops to 4. Confirm p99 latency improves.
Deliverable. Diff (one import) and metrics before/after.
Exercise 5 — Right-Size for an I/O Service¶
Setup. An HTTP gateway that forwards requests to a backend. Each request: 10 µs of CPU work, 5 ms of backend wait.
```go
// proxy streams the backend response; on error resp is nil, so check first.
func proxy(w http.ResponseWriter, r *http.Request) {
	resp, err := http.Get("http://backend/" + r.URL.Path)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	_, _ = io.Copy(w, resp.Body)
}
```
Target. Run the sweep GOMAXPROCS=1, 2, 4, 8 at 5 000 concurrent connections. Confirm GOMAXPROCS=2 is sufficient (per-request CPU is small; the netpoller handles the wait).
Deliverable. A throughput-vs-GOMAXPROCS curve that is flat beyond 2. Explanation: throughput is capped by the backend, not by the gateway's CPU. At the observed plateau of ~19 K req/s, CPU demand is 19 000 × 10 µs ≈ 0.2 cores, while thousands of goroutines sit parked in the netpoller waiting out the 5 ms. CPU demand is trivial.
Medium¶
Exercise 6 — Tail Latency Headroom¶
Setup. A service with GOMAXPROCS = NumCPU = 8. p99 = 25 ms. GC mark workers occasionally compete with request handlers for cores.
Target. Try GOMAXPROCS = 7. Hypothesis: leaving one core for OS interrupts and GC mark reduces p99 tail. Confirm or refute.
Deliverable. Side-by-side p50, p99, p99.9 measurements. Likely outcome: p50 slightly worse (less parallelism), p99 noticeably better. Total throughput slightly lower.
Exercise 7 — GOGC + GOMAXPROCS Interaction¶
Setup. Service does 200 K allocations per second. GC pauses ~5 ms p99.
Target. Run a 2D sweep: GOMAXPROCS × GOGC:
| | GOGC=100 | GOGC=200 | GOGC=400 |
|---|---|---|---|
| GOMAXPROCS=4 | | | |
| GOMAXPROCS=8 | | | |
| GOMAXPROCS=16 | | | |
For each cell, record throughput, p99, peak heap. Identify the global optimum.
Deliverable. Heatmap + recommended config.
Exercise 8 — Reduce Spinning Overhead¶
Setup. Service with GOMAXPROCS=32 on a 16-core box ("for headroom"). GODEBUG=schedtrace=1000 shows spinningthreads=4-8 continuously.
Target. Reduce GOMAXPROCS to 16. Confirm spinning drops. Measure CPU usage (top, process_cpu_seconds_total). Hypothesis: CPU usage drops by ~5% with no throughput regression.
Deliverable. Before/after CPU usage and throughput.
Exercise 9 — Burst Headroom¶
Setup. Service usually at 30% CPU, but bursts to 95% for 5 seconds every minute. Pod has cpu: 4 limit. CFS throttle counter is non-zero during bursts.
Target. Either raise the limit to 5 (with corresponding GOMAXPROCS) or accept the throttling. Quantify the throttle penalty during bursts; quantify the cost of raising the limit (Kubernetes scheduling).
Deliverable. A short cost-benefit analysis.
Exercise 10 — NUMA Split¶
Setup. Service on a 2-socket box, 16 cores per socket, single process at GOMAXPROCS=32. Throughput is 70% of expected.
Target. Split into two processes, each numactl-pinned to one socket, each GOMAXPROCS=16. Front with HAProxy. Measure total throughput, p99, and numastat cross-socket traffic.
Deliverable. Throughput improvement (expect 15–30%), numastat before/after.
Exercise 11 — Cgo-Aware Sizing¶
Setup. Service makes heavy cgo calls. Each cgo call holds an M for ~5 ms. Service handles 100 RPS. p99 jumps to 50 ms during cgo bursts.
Target. Investigate whether raising GOMAXPROCS helps (cgo holds an M but not a P — so other Gs can run). Hypothesis: no improvement; the bottleneck is cgo concurrency, not parallelism.
Deliverable. A sweep proving the hypothesis. Suggest the real fix: bound cgo concurrency with a semaphore.
Exercise 12 — Disk I/O Parallelism¶
Setup. Service reads many small files. Currently spawns 1000 concurrent goroutines, each doing os.ReadFile. Thread count climbs to 200+.
Target. Each os.ReadFile is a blocking syscall on a regular file (not netpoller-backed). Test whether raising GOMAXPROCS helps. Hypothesis: no. An M blocking in a syscall hands its P back to the scheduler, so Ps are never the scarce resource here; OS threads and disk queue depth are.
Deliverable. Confirm by sweep. The real fix is bounding parallelism with a semaphore — show that.
Hard¶
Exercise 13 — Latency-Critical Service¶
Setup. A trading-system order matcher. p99 latency must be below 1 ms. CPU work per request: 50 µs.
Target. Optimise for tail latency:
- Baseline.
- GOGC=off (no GC); set GOMEMLIMIT for safety.
- Pre-allocate everything in sync.Pool.
- GOMAXPROCS = NumCPU - 2 (headroom for the system).
- Run-to-completion goroutines (no time.Sleep, no blocking).
- Pin the matcher goroutine via runtime.LockOSThread (see the sketch after the deliverable).
Measure at each step.
Deliverable. Stepwise table; final p99 should be < 500 µs.
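For the last step, a hedged sketch of the pinned matcher loop; order and match are placeholders for the real matcher types:

```go
package matcher

import "runtime"

// order and match stand in for the real matcher types.
type order struct{ price, qty int64 }

func match(o order) { _ = o }

// Run pins the hot loop to a dedicated OS thread so the scheduler cannot
// migrate it mid-burst; the loop itself never blocks or allocates.
func Run(orders <-chan order) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for o := range orders {
		match(o) // run to completion
	}
}
```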
Exercise 14 — Multi-Region Replicated Service¶
Setup. Same service runs in 5 regions. Each region has different CPU instance types. Hard-coded GOMAXPROCS=8 works in some, not in others.
Target. Switch to cgroup-aware sizing. Add process_gomaxprocs metric. Generate a per-region report showing GOMAXPROCS resolution for each instance type.
Deliverable. A Grafana dashboard or markdown report showing per-region GOMAXPROCS. Confirm sane values across regions.
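A sketch of the process_gomaxprocs gauge using the Prometheus Go client; the metric name matches the exercise, everything else is an assumption:

```go
package metrics

import (
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
)

// RegisterGOMAXPROCS exposes the resolved value so dashboards can compare
// it against each pod's CPU limit.
func RegisterGOMAXPROCS() {
	g := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "process_gomaxprocs",
		Help: "runtime.GOMAXPROCS(0) as resolved at startup.",
	})
	g.Set(float64(runtime.GOMAXPROCS(0)))
	prometheus.MustRegister(g)
}
```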
Exercise 15 — Sweep Automation¶
Setup. Service's CPU profile has changed three times in the last 6 months. The team manually tunes after each change.
Target. Build a CI job that:
- On every PR touching internal/handlers/, runs a sweep.
- Compares peak throughput to the baseline stored in the repo.
- Updates the baseline if the PR explicitly opts in.
- Fails CI on a > 10% regression.
Deliverable. GitHub Actions workflow + baseline file + sweep script.
Exercise 16 — Workload-Aware Autoscale¶
Setup. Bursty service: 100 RPS steady, 5 000 RPS bursts. Static GOMAXPROCS=8. Tail latency suffers during bursts (queue depth grows).
Target. Build an adaptive controller that raises GOMAXPROCS from 8 to 16 (within cgroup quota) during sustained high latency and lowers back when latency normalises. Add hysteresis. Measure: tail latency during bursts should improve; overhead during steady state should be < 1%.
Deliverable. Adaptive package + tests + before/after p99 measurements.
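A possible shape for the controller, as a sketch only: p99 and quota are assumed inputs, the thresholds are invented, and min is Go 1.21's built-in.

```go
package adaptive

import (
	"runtime"
	"time"
)

// Adjust raises GOMAXPROCS toward quota under sustained high latency and
// lowers it when latency normalises. p99 is an assumed callback returning
// the rolling 99th-percentile latency; quota is the cgroup CPU limit.
func Adjust(p99 func() time.Duration, quota int) {
	const hot, cool = 50 * time.Millisecond, 20 * time.Millisecond
	hotStreak := 0
	for range time.Tick(5 * time.Second) {
		cur := runtime.GOMAXPROCS(0)
		switch lat := p99(); {
		case lat > hot && cur < quota:
			hotStreak++
			if hotStreak >= 3 { // hysteresis: demand 3 consecutive hot samples
				runtime.GOMAXPROCS(min(cur*2, quota))
				hotStreak = 0
			}
		case lat < cool && cur > 8: // floor at the steady-state value
			runtime.GOMAXPROCS(cur / 2)
			hotStreak = 0
		default:
			hotStreak = 0
		}
	}
}
```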
Exercise 17 — Multi-Process Manager¶
Setup. Service on a 64-core, 2-socket bare-metal box. Currently single process at GOMAXPROCS=64.
Target. Build a supervisor that:
- Forks 2 child processes.
- Pins each to one NUMA node via taskset/numactl (see the spawn sketch after the deliverable).
- Sets each child's GOMAXPROCS=32.
- Restarts crashed children.
- Load-balances via an in-process reverse proxy or a front HAProxy.
Compare total throughput, p99, and operational complexity to the single-process baseline.
Deliverable. Supervisor + manifest + comparison report.
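A sketch of the spawning half of such a supervisor; the binary path, port scheme, and restart policy are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// spawn builds one child pinned to a NUMA node.
func spawn(node int) *exec.Cmd {
	cmd := exec.Command("numactl",
		fmt.Sprintf("--cpunodebind=%d", node),
		fmt.Sprintf("--membind=%d", node),
		"./server", fmt.Sprintf("-port=%d", 8081+node))
	cmd.Env = append(os.Environ(), "GOMAXPROCS=32")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd
}

func main() {
	for node := 0; node < 2; node++ {
		go func(node int) {
			for { // restart loop; add backoff before production use
				if err := spawn(node).Run(); err != nil {
					log.Printf("child %d exited: %v; restarting", node, err)
				}
			}
		}(node)
	}
	select {} // supervise forever; a real supervisor would also handle signals
}
```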
Exercise 18 — CFS Throttle Hunt¶
Setup. Fleet of 200 Go services. ~10 of them show occasional CFS throttling. Identifying which ones are affected is manual and slow.
Target. Build an audit tool that:
- Queries Prometheus for container_cpu_cfs_throttled_seconds_total > 0 over the last 7 days (a query sketch follows the deliverable).
- For each affected service, fetches process_gomaxprocs and the pod's cpu limit.
- Reports mismatches.
Deliverable. A Go program that emits a markdown report. Run weekly via cron.
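One way to shape the query step, talking to Prometheus' HTTP API directly; promURL and the exact PromQL are assumptions to adapt:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

const promURL = "http://prometheus:9090/api/v1/query" // placeholder endpoint

func main() {
	q := `sum by (pod) (increase(container_cpu_cfs_throttled_seconds_total[7d])) > 0`
	resp, err := http.Get(promURL + "?query=" + url.QueryEscape(q))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode only the fields the report needs.
	var r struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []any             `json:"value"` // [timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		log.Fatal(err)
	}
	for _, s := range r.Data.Result {
		fmt.Printf("%s throttled %v s over 7d\n", s.Metric["pod"], s.Value[1])
	}
}
```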
Exercise 19 — Memory-Bound Workload¶
Setup. Service does sequential 8 GB memory scans. GOMAXPROCS=16. Throughput is bound by memory bandwidth (roughly 25 GB/s per DDR4-3200 channel).
Target. Confirm that raising GOMAXPROCS does not help (memory bandwidth is the bottleneck). Measure with perf stat -e cycles,stalled-cycles-backend,LLC-loads,LLC-load-misses ./bin.
Deliverable. A short report demonstrating memory-bandwidth saturation. Suggest mitigations: better locality, prefetching, smaller dataset.
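A hedged microbenchmark that makes the plateau visible without the real service: the same 1 GiB scan split across k goroutines stops speeding up once the memory bus saturates (run with GOMAXPROCS ≥ 16):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	buf := make([]byte, 1<<30) // 1 GiB: large enough to defeat the caches
	for _, k := range []int{1, 2, 4, 8, 16} {
		start := time.Now()
		var wg sync.WaitGroup
		chunk := len(buf) / k
		for i := 0; i < k; i++ {
			wg.Add(1)
			go func(part []byte) { // slice argument is evaluated at spawn time
				defer wg.Done()
				var s uint64
				for _, b := range part {
					s += uint64(b)
				}
				_ = s
			}(buf[i*chunk : (i+1)*chunk])
		}
		wg.Wait()
		// Total scanned is always 1 GiB, so GB/s ≈ 1 / elapsed seconds.
		fmt.Printf("goroutines=%d scan=%.1f GB/s\n", k, 1.0/time.Since(start).Seconds())
	}
}
```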
Exercise 20 — End-to-End Tail Tuning¶
Setup. Take any non-trivial Go service. Establish baseline p50, p99, p99.9 at expected load.
Target. Iteratively tune through every knob covered in this section:
- Log GOMAXPROCS at startup.
- Audit the manifest for per-container CPU limits.
- Run a sweep; confirm the default value.
- Run another sweep with GOMAXPROCS = quota - 1.
- Profile allocations; pool the top 3 hotspots.
- Sweep GOGC ∈ {100, 200, 400} at fixed GOMAXPROCS.
- Set GOMEMLIMIT to bound memory.
- Final sweep across all combinations to lock in.
Record measurements at each step.
Deliverable. A final markdown report with the cumulative improvement. Realistic target: 50%+ reduction in p99.9 latency, similar throughput, controlled memory.
Solutions¶
Solution 1¶
Sweep results for an Intel 8-core dev box:
| GOMAXPROCS | Throughput (req/s) | p99 (ms) |
|---|---|---|
| 1 | 1 100 | 70 |
| 2 | 2 200 | 35 |
| 4 | 4 200 | 20 |
| 8 | 8 100 | 12 |
| 16 | 8 000 | 13 |
| 32 | 7 700 | 15 |
Peak at 8 (= NumCPU). Going to 16 costs ~1% of peak throughput; 32 costs ~5%. Sweet spot confirmed.
Solution 2¶
| Go version | GOMAXPROCS reported |
|---|---|
| 1.15 | 64 (cgroup unaware) |
| 1.22 | 64 (still cgroup unaware) |
| 1.25 | 2 (CPU quota detected) |
Recommendation: move to Go ≥ 1.25 for cgroup-aware defaults; on older toolchains, add automaxprocs as belt-and-braces.
Solution 3¶
Before (with the hard-coded 64): throughput 4 000 req/s, p99 35 ms, CFS throttle rate 5%. After (cgroup-aware default of 2): throughput 4 100 req/s, p99 9 ms, throttle rate 0%.
Net: throughput unchanged within noise, p99 reduced ~75%. The PR is a one-line deletion.
Solution 4¶
After adding automaxprocs: log line maxprocs: Updating GOMAXPROCS=4: determined from CPU quota. GOMAXPROCS drops from 64 to 4. p99 improves; CFS throttling drops to 0.
Solution 5¶
Sweep at 5 000 concurrent connections, request = 10 µs CPU + 5 ms wait:
| GOMAXPROCS | Throughput (req/s) | p99 (ms) |
|---|---|---|
| 1 | 9 500 | 30 |
| 2 | 19 000 | 12 |
| 4 | 19 200 | 10 |
| 8 | 19 100 | 10 |
GOMAXPROCS=2 is enough. The netpoller handles 5 000 parked goroutines for free. Higher GOMAXPROCS does not help because the CPU portion is trivial.
Solution 6¶
| Config | p50 | p99 | p99.9 |
|---|---|---|---|
| GOMAXPROCS=8 | 4 ms | 25 ms | 60 ms |
| GOMAXPROCS=7 | 5 ms | 18 ms | 40 ms |
p50 slightly worse, p99 and p99.9 better. Trade-off worth it for latency-sensitive services.
Solution 7¶
Sample heatmap:
| | GOGC=100 | GOGC=200 | GOGC=400 |
|---|---|---|---|
| GMP=4 | tput 12k, p99 30 ms, heap 1 G | 13k, 25 ms, 2 G | 13k, 24 ms, 4 G |
| GMP=8 | 23k, 18 ms, 1 G | 24k, 15 ms, 2 G | 24k, 14 ms, 4 G |
| GMP=16 | 25k, 22 ms, 1 G | 25k, 20 ms, 2 G | 25k, 19 ms, 4 G |
Optimum: GOMAXPROCS=8, GOGC=200 — best p99 at acceptable memory.
Solution 8¶
Before: 32 Ps, 24% CPU usage (spinning + work), 40 K req/s throughput. After: 16 Ps, 18% CPU usage, 41 K req/s throughput.
CPU saved: ~5 percentage points. Throughput marginally better (less scheduler overhead). spinningthreads drops from 4–8 to 0–2.
Solution 9¶
Throttle penalty during bursts: ~30 ms added p99 latency for ~5 seconds per minute.
Cost of cpu: 5 instead of 4: scheduler may need a different node; node capacity must accommodate the spike.
Recommendation: keep cpu: 4 but provision more replicas for burst capacity. Or use Vertical Pod Autoscaler for adaptive limits.
Solution 10¶
Single process, GOMAXPROCS=32: 60k req/s, numastat shows 40% cross-socket access. Two processes, each GOMAXPROCS=16, NUMA-pinned: 78k req/s (+30%), numastat shows < 5% cross-socket.
Expected pattern. Memory locality matters.
Solution 11¶
Sweep with GOMAXPROCS=4, 8, 16, 32 — all give same throughput (~100 RPS) and same p99 (~50 ms during bursts). Cgo holds M, not P; raising P count does not help.
Real fix:
```go
// Bound cgo concurrency with golang.org/x/sync/semaphore.
var sem = semaphore.NewWeighted(20) // at most 20 in-flight cgo calls

// in the handler:
if err := sem.Acquire(ctx, 1); err != nil {
	return err // context cancelled while queued
}
defer sem.Release(1)
C.slow_call()
```
Now cgo concurrency is bounded; p99 stabilises.
Solution 12¶
Sweep confirms: thread count rises proportionally to concurrent file reads, regardless of GOMAXPROCS. Blocking syscalls park Ms; they do not contend for Ps.
Fix:
```go
sem := make(chan struct{}, 8) // at most 8 blocking reads in flight
var wg sync.WaitGroup
for _, p := range paths {
	sem <- struct{}{} // acquire a slot before spawning
	wg.Add(1)
	go func(p string) { // pass p explicitly: safe under pre-1.22 loop-var capture
		defer wg.Done()
		defer func() { <-sem }()
		data, err := os.ReadFile(p)
		if err != nil {
			log.Print(err)
			return
		}
		process(data)
	}(p)
}
wg.Wait()
```
Thread count caps at ~10 (GOMAXPROCS + a few syscall holders).
Solution 13¶
Stepwise tuning for trading matcher:
| Step | p50 | p99 | p99.9 |
|---|---|---|---|
| Baseline | 80 µs | 2 ms | 8 ms |
| GOGC=off + GOMEMLIMIT | 80 µs | 1.2 ms | 4 ms |
| sync.Pool everywhere | 75 µs | 800 µs | 2 ms |
| GOMAXPROCS=NumCPU-2 | 75 µs | 700 µs | 1.5 ms |
| LockOSThread on matcher | 70 µs | 400 µs | 800 µs |
Final p99: 400 µs. Achieved target.
Solution 14¶
Per-region report:
| Region | Instance type | NumCPU | Cgroup quota | GOMAXPROCS resolved |
|---|---|---|---|---|
| us-east-1 | c5.4xlarge | 16 | 8 | 8 |
| eu-west-1 | c6i.4xlarge | 16 | 8 | 8 |
| ap-southeast-1 | c5.2xlarge | 8 | 8 | 8 |
| us-west-2 | c5a.4xlarge | 16 | 8 | 8 |
| ap-northeast-1 | c5.8xlarge | 32 | 8 | 8 |
All regions consistently resolve to 8 (the cgroup quota). No hard-coded overrides survive.
Solution 15¶
GitHub Actions:
```yaml
- name: Sweep
  run: ./scripts/sweep.sh > sweep.json
- name: Compare
  run: |
    python ./scripts/compare.py sweep.json baseline.json --tolerance=0.10
```
Workflow fails if any GOMAXPROCS setting regresses > 10%. compare.py is ~50 lines.
Solution 16¶
Adaptive controller (see tasks.md Solution 13). With hysteresis (3 consecutive samples), bounds [4, 16], cooldown 30 s.
Result: p99 during bursts drops from 100 ms to 30 ms. Steady-state CPU overhead < 0.5%.
Solution 17¶
Supervisor in Go: forks 2 children with taskset, monitors via Wait4, restarts on exit, exposes single port via in-process reverse proxy.
Single process: 80k req/s, p99 18 ms. Two-process NUMA-split: 105k req/s (+30%), p99 14 ms.
Operational complexity rises: deployments must restart both processes, logs are interleaved, metrics are per-process. Worth it on memory-heavy services.
Solution 18¶
Audit tool queries Prometheus, joins with manifest data:
service-foo: GOMAXPROCS=8, limit=2 (mismatch); throttled 3% of the time
service-bar: GOMAXPROCS=4, limit=4 (OK); throttled 1% (a real load issue)
service-baz: GOMAXPROCS=64, limit=none; throttled 0% (noisy-neighbour risk)
Run weekly. Top 3 offenders get an issue auto-filed.
Solution 19¶
perf stat shows stalled cycles dominating and a high LLC miss rate: memory bandwidth is saturated. Raising GOMAXPROCS only adds more demand on the same memory bus; no improvement.
Mitigations: tile the dataset for better cache use; prefetch sparingly; or accept the limit.
Solution 20¶
Stepwise improvement for a sample service:
| Step | p50 | p99 | p99.9 | Tput | Mem |
|---|---|---|---|---|---|
| Baseline | 8 ms | 80 ms | 250 ms | 1000 rps | 1 G |
| +log + audit manifest (no change) | 8 ms | 80 ms | 250 ms | 1000 rps | 1 G |
| +default GOMAXPROCS sweep | 8 ms | 60 ms | 180 ms | 1100 rps | 1 G |
| +sync.Pool | 7 ms | 35 ms | 100 ms | 1200 rps | 1 G |
| +GOGC=200 | 7 ms | 30 ms | 80 ms | 1300 rps | 2 G |
| +GOMEMLIMIT=2 G | 7 ms | 28 ms | 75 ms | 1300 rps | 2 G |
| +GOMAXPROCS=NumCPU-1 | 8 ms | 25 ms | 60 ms | 1280 rps | 2 G |
| Final | 8 ms | 25 ms | 60 ms | 1280 rps | 2 G |
Net: p99 reduced from 80 ms to 25 ms (~70% improvement). p99.9 reduced from 250 ms to 60 ms (~75%). Throughput up 28%. Memory doubled (an acceptable trade).
Wrap-Up¶
Themes across these optimisations:
- Trust the default first. Since Go 1.25 the runtime resolves GOMAXPROCS from the container's CPU quota; on older toolchains, automaxprocs does the same job. Manual overrides should be rare and justified.
- GOMAXPROCS is a small lever. Allocation profile, GC tuning, and per-container limits dominate the latency picture. Tune GOMAXPROCS last.
- Sweep before changing. Without before/after numbers, you are guessing.
- NUMA splits for big iron. Multi-socket boxes benefit from multiple processes, one per socket.
- CFS throttling is loud. A non-zero throttle rate is always worth investigating.
- GOMAXPROCS does not limit I/O concurrency. The netpoller handles parked waiters regardless of P count.
- Adaptive sizing is rare. Most teams set it once and trust it.
These exercises form the practical curriculum for GOMAXPROCS tuning. Working through them — with real numbers — builds the intuition that no amount of reading can replace.