GOMAXPROCS — Optimization Exercises¶
Each exercise starts with a baseline measurement, identifies a tuning hypothesis, and asks you to validate or refute it on real hardware. Benchmarks are the spine — never tune without numbers.
Easy¶
Exercise 1 — Confirm NumCPU is the Sweet Spot¶
Setup. A CPU-burning HTTP server, 8-core box, no cgroup limits.
```go
package main

import (
	"crypto/sha256"
	"log"
	"net/http"
	"runtime"
)

// handler burns a fixed slice of CPU per request: 64 SHA-256 writes over a
// 4 KiB buffer, then returns the digest.
func handler(w http.ResponseWriter, r *http.Request) {
	h := sha256.New()
	buf := make([]byte, 4096)
	for i := 0; i < 64; i++ {
		h.Write(buf)
	}
	_, _ = w.Write(h.Sum(nil))
}

func main() {
	log.Printf("GOMAXPROCS=%d", runtime.GOMAXPROCS(0))
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
Target. Sweep GOMAXPROCS ∈ {1, 2, 4, 8, 16, 32} with wrk -t8 -c64 -d20s. Plot throughput. Confirm the peak at GOMAXPROCS=8 (NumCPU) and the regression beyond it. A harness sketch follows the deliverable.
Deliverable. Plot + 50-word interpretation.
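A minimal sweep-harness sketch, assuming the server above is built as ./server and wrk is on PATH (both names are placeholders; adapt to your layout):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	for _, p := range []string{"1", "2", "4", "8", "16", "32"} {
		srv := exec.Command("./server")
		srv.Env = append(os.Environ(), "GOMAXPROCS="+p)
		if err := srv.Start(); err != nil {
			log.Fatal(err)
		}
		time.Sleep(time.Second) // crude warm-up; poll a health endpoint in real use

		out, err := exec.Command("wrk", "-t8", "-c64", "-d20s",
			"http://localhost:8080/").CombinedOutput()
		if err != nil {
			log.Fatalf("wrk: %v\n%s", err, out)
		}
		fmt.Printf("=== GOMAXPROCS=%s ===\n%s\n", p, out)

		_ = srv.Process.Kill()
		_, _ = srv.Process.Wait()
	}
}
```

The same loop, with the list of values swapped, drives every sweep in this section.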
Exercise 2 — Find the Container Default¶
Setup. A Go binary in a Docker container with --cpus=2 on a 64-core host. Go version 1.22.
Target. Without modifying the service source, confirm what GOMAXPROCS resolves to. Compare the Go 1.15, Go 1.22, and Go 1.25 images (Go 1.25 is the first release whose runtime reads the cgroup CPU quota). Document the differences.
Deliverable. A markdown table: Go version vs reported GOMAXPROCS. Recommendation based on the table.
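One approach that leaves the service untouched: a throwaway probe compiled inside each image.

```go
// probe.go — prints what the runtime resolves inside the container.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
}
```

Run it per image, e.g. docker run --rm --cpus=2 -v "$PWD":/src -w /src golang:1.22 go run probe.go, swapping the image tag for each row of the table.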
Exercise 3 — Eliminate a Manual Override¶
Setup. Production code with runtime.GOMAXPROCS(64) near main(). The service runs in a pod with cpu: 2.
Target. Remove the manual override. Confirm via the sweep harness that throughput is equal-or-better and p99 latency drops. Report numbers before and after.
Deliverable. A before/after table and the PR diff (a single line deletion).
Exercise 4 — Add automaxprocs¶
Setup. A service running Go 1.15 (cannot upgrade — third-party dependency). Pod has cpu: 4. Service reports GOMAXPROCS=64.
Target. Add import _ "go.uber.org/automaxprocs". Confirm log line. Confirm GOMAXPROCS drops to 4. Confirm p99 latency improves.
Deliverable. Diff (one import) and metrics before/after.
Exercise 5 — Right-Size for an I/O Service¶
Setup. An HTTP gateway that forwards requests to a backend. Each request: 10 µs of CPU work, 5 ms of backend wait.
```go
// proxy streams the backend response; on error resp is nil, so check first.
func proxy(w http.ResponseWriter, r *http.Request) {
	resp, err := http.Get("http://backend/" + r.URL.Path)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	_, _ = io.Copy(w, resp.Body)
}
```
Target. Run the sweep GOMAXPROCS=1, 2, 4, 8 at 5 000 concurrent connections. Confirm GOMAXPROCS=2 is sufficient (per-request CPU is small; the netpoller handles the wait).
Deliverable. A throughput-vs-GOMAXPROCS curve that is flat beyond 2. Explanation: throughput is capped by the backend, not by the gateway's CPU. At the observed plateau of ~19 K req/s, CPU demand is 19 000 × 10 µs ≈ 0.2 cores, while thousands of goroutines sit parked in the netpoller waiting out the 5 ms. CPU demand is trivial.
Medium¶
Exercise 6 — Tail Latency Headroom¶
Setup. A service with GOMAXPROCS = NumCPU = 8. p99 = 25 ms. GC mark workers occasionally compete with request handlers for cores.
Target. Try GOMAXPROCS = 7. Hypothesis: leaving one core for OS interrupts and GC mark reduces p99 tail. Confirm or refute.
Deliverable. Side-by-side p50, p99, p99.9 measurements. Likely outcome: p50 slightly worse (less parallelism), p99 noticeably better. Total throughput slightly lower.
Exercise 7 — GOGC + GOMAXPROCS Interaction¶
Setup. Service does 200 K allocations per second. GC pauses ~5 ms p99.
Target. Run a 2D sweep: GOMAXPROCS × GOGC:
| | GOGC=100 | GOGC=200 | GOGC=400 |
|---|---|---|---|
| GOMAXPROCS=4 | | | |
| GOMAXPROCS=8 | | | |
| GOMAXPROCS=16 | | | |
For each cell, record throughput, p99, peak heap. Identify the global optimum.
Deliverable. Heatmap + recommended config.
Exercise 8 — Reduce Spinning Overhead¶
Setup. Service with GOMAXPROCS=32 on a 16-core box ("for headroom"). GODEBUG=schedtrace=1000 shows spinningthreads=4-8 continuously.
Target. Reduce GOMAXPROCS to 16. Confirm spinning drops. Measure CPU usage (top, process_cpu_seconds_total). Hypothesis: CPU usage drops by ~5% with no throughput regression.
Deliverable. Before/after CPU usage and throughput.
Exercise 9 — Burst Headroom¶
Setup. Service usually at 30% CPU, but bursts to 95% for 5 seconds every minute. Pod has cpu: 4 limit. CFS throttle counter is non-zero during bursts.
Target. Either raise the limit to 5 (with corresponding GOMAXPROCS) or accept the throttling. Quantify the throttle penalty during bursts; quantify the cost of raising the limit (Kubernetes scheduling).
Deliverable. A short cost-benefit analysis.
Exercise 10 — NUMA Split¶
Setup. Service on a 2-socket box, 16 cores per socket, single process at GOMAXPROCS=32. Throughput is 70% of expected.
Target. Split into two processes, each numactl-pinned to one socket, each GOMAXPROCS=16. Front with HAProxy. Measure total throughput, p99, and numastat cross-socket traffic.
Deliverable. Throughput improvement (expect 15–30%), numastat before/after.
Exercise 11 — Cgo-Aware Sizing¶
Setup. Service makes heavy cgo calls. Each cgo call holds an M for ~5 ms. Service handles 100 RPS. p99 jumps to 50 ms during cgo bursts.
Target. Investigate whether raising GOMAXPROCS helps (cgo holds an M but not a P — so other Gs can run). Hypothesis: no improvement; the bottleneck is cgo concurrency, not parallelism.
Deliverable. A sweep proving the hypothesis. Suggest the real fix: bound cgo concurrency with a semaphore.
Exercise 12 — Disk I/O Parallelism¶
Setup. Service reads many small files. Currently spawns 1000 concurrent goroutines, each doing os.ReadFile. Thread count climbs to 200+.
Target. Each os.ReadFile is a blocking syscall on a regular file (not netpoller-backed). Test whether raising GOMAXPROCS helps. Hypothesis: no. An M blocking in a syscall hands its P back to the scheduler, so Ps are never the scarce resource here; OS threads and disk queue depth are.
Deliverable. Confirm by sweep. The real fix is bounding parallelism with a semaphore — show that.
Hard¶
Exercise 13 — Latency-Critical Service¶
Setup. A trading-system order matcher. p99 latency must be below 1 ms. CPU work per request: 50 µs.
Target. Optimise for tail latency:
- Baseline.
- GOGC=off (no GC); set GOMEMLIMIT for safety.
- Pre-allocate everything in sync.Pool.
- GOMAXPROCS = NumCPU - 2 (headroom for the system).
- Run-to-completion goroutines (no time.Sleep, no blocking).
- Pin the matcher goroutine via runtime.LockOSThread (see the sketch after the deliverable).
Measure at each step.
Deliverable. Stepwise table; final p99 should be < 500 µs.
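For the last step, a hedged sketch of the pinned matcher loop; order and match are placeholders for the real matcher types:

```go
package matcher

import "runtime"

// order and match stand in for the real matcher types.
type order struct{ price, qty int64 }

func match(o order) { _ = o }

// Run pins the hot loop to a dedicated OS thread so the scheduler cannot
// migrate it mid-burst; the loop itself never blocks or allocates.
func Run(orders <-chan order) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for o := range orders {
		match(o) // run to completion
	}
}
```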
Exercise 14 — Multi-Region Replicated Service¶
Setup. Same service runs in 5 regions. Each region has different CPU instance types. Hard-coded GOMAXPROCS=8 works in some, not in others.
Target. Switch to cgroup-aware sizing. Add process_gomaxprocs metric. Generate a per-region report showing GOMAXPROCS resolution for each instance type.
Deliverable. A Grafana dashboard or markdown report showing per-region GOMAXPROCS. Confirm sane values across regions.
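A sketch of the process_gomaxprocs gauge using the Prometheus Go client; the metric name matches the exercise, everything else is an assumption:

```go
package metrics

import (
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
)

// RegisterGOMAXPROCS exposes the resolved value so dashboards can compare
// it against each pod's CPU limit.
func RegisterGOMAXPROCS() {
	g := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "process_gomaxprocs",
		Help: "runtime.GOMAXPROCS(0) as resolved at startup.",
	})
	g.Set(float64(runtime.GOMAXPROCS(0)))
	prometheus.MustRegister(g)
}
```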
Exercise 15 — Sweep Automation¶
Setup. Service's CPU profile has changed three times in the last 6 months. The team manually tunes after each change.
Target. Build a CI job that:
- On every PR touching internal/handlers/, runs a sweep.
- Compares peak throughput to the baseline stored in the repo.
- Updates the baseline if the PR explicitly opts in.
- Fails CI on a > 10% regression.
Deliverable. GitHub Actions workflow + baseline file + sweep script.
Exercise 16 — Workload-Aware Autoscale¶
Setup. Bursty service: 100 RPS steady, 5 000 RPS bursts. Static GOMAXPROCS=8. Tail latency suffers during bursts (queue depth grows).
Target. Build an adaptive controller that raises GOMAXPROCS from 8 to 16 (within cgroup quota) during sustained high latency and lowers back when latency normalises. Add hysteresis. Measure: tail latency during bursts should improve; overhead during steady state should be < 1%.
Deliverable. Adaptive package + tests + before/after p99 measurements.
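A possible shape for the controller, as a sketch only: p99 and quota are assumed inputs, the thresholds are invented, and min is Go 1.21's built-in.

```go
package adaptive

import (
	"runtime"
	"time"
)

// Adjust raises GOMAXPROCS toward quota under sustained high latency and
// lowers it when latency normalises. p99 is an assumed callback returning
// the rolling 99th-percentile latency; quota is the cgroup CPU limit.
func Adjust(p99 func() time.Duration, quota int) {
	const hot, cool = 50 * time.Millisecond, 20 * time.Millisecond
	hotStreak := 0
	for range time.Tick(5 * time.Second) {
		cur := runtime.GOMAXPROCS(0)
		switch lat := p99(); {
		case lat > hot && cur < quota:
			hotStreak++
			if hotStreak >= 3 { // hysteresis: demand 3 consecutive hot samples
				runtime.GOMAXPROCS(min(cur*2, quota))
				hotStreak = 0
			}
		case lat < cool && cur > 8: // floor at the steady-state value
			runtime.GOMAXPROCS(cur / 2)
			hotStreak = 0
		default:
			hotStreak = 0
		}
	}
}
```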
Exercise 17 — Multi-Process Manager¶
Setup. Service on a 64-core, 2-socket bare-metal box. Currently single process at GOMAXPROCS=64.
Target. Build a supervisor that:
- Forks 2 child processes.
- Pins each to one NUMA node via taskset/numactl (see the spawn sketch after the deliverable).
- Sets each child's GOMAXPROCS=32.
- Restarts crashed children.
- Load-balances via an in-process reverse proxy or a front HAProxy.
Compare total throughput, p99, and operational complexity to the single-process baseline.
Deliverable. Supervisor + manifest + comparison report.
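A sketch of the spawning half of such a supervisor; the binary path, port scheme, and restart policy are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// spawn builds one child pinned to a NUMA node.
func spawn(node int) *exec.Cmd {
	cmd := exec.Command("numactl",
		fmt.Sprintf("--cpunodebind=%d", node),
		fmt.Sprintf("--membind=%d", node),
		"./server", fmt.Sprintf("-port=%d", 8081+node))
	cmd.Env = append(os.Environ(), "GOMAXPROCS=32")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd
}

func main() {
	for node := 0; node < 2; node++ {
		go func(node int) {
			for { // restart loop; add backoff before production use
				if err := spawn(node).Run(); err != nil {
					log.Printf("child %d exited: %v; restarting", node, err)
				}
			}
		}(node)
	}
	select {} // supervise forever; a real supervisor would also handle signals
}
```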
Exercise 18 — CFS Throttle Hunt¶
Setup. Fleet of 200 Go services. ~10 of them show occasional CFS throttling. Identifying which ones are affected is manual and slow.
Target. Build an audit tool that:
- Queries Prometheus for container_cpu_cfs_throttled_seconds_total > 0 over the last 7 days (a query sketch follows the deliverable).
- For each affected service, fetches process_gomaxprocs and the pod's cpu limit.
- Reports mismatches.
Deliverable. A Go program that emits a markdown report. Run weekly via cron.
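One way to shape the query step, talking to Prometheus' HTTP API directly; promURL and the exact PromQL are assumptions to adapt:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

const promURL = "http://prometheus:9090/api/v1/query" // placeholder endpoint

func main() {
	q := `sum by (pod) (increase(container_cpu_cfs_throttled_seconds_total[7d])) > 0`
	resp, err := http.Get(promURL + "?query=" + url.QueryEscape(q))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode only the fields the report needs.
	var r struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []any             `json:"value"` // [timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		log.Fatal(err)
	}
	for _, s := range r.Data.Result {
		fmt.Printf("%s throttled %v s over 7d\n", s.Metric["pod"], s.Value[1])
	}
}
```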
Exercise 19 — Memory-Bound Workload¶
Setup. Service does sequential 8 GB memory scans. GOMAXPROCS=16. Throughput is bound by memory bandwidth (roughly 25 GB/s per DDR4-3200 channel).
Target. Confirm that raising GOMAXPROCS does not help (memory bandwidth is the bottleneck). Measure with perf stat -e cycles,stalled-cycles-backend,LLC-loads,LLC-load-misses ./bin.
Deliverable. A short report demonstrating memory-bandwidth saturation. Suggest mitigations: better locality, prefetching, smaller dataset.
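A hedged microbenchmark that makes the plateau visible without the real service: the same 1 GiB scan split across k goroutines stops speeding up once the memory bus saturates (run with GOMAXPROCS ≥ 16):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	buf := make([]byte, 1<<30) // 1 GiB: large enough to defeat the caches
	for _, k := range []int{1, 2, 4, 8, 16} {
		start := time.Now()
		var wg sync.WaitGroup
		chunk := len(buf) / k
		for i := 0; i < k; i++ {
			wg.Add(1)
			go func(part []byte) { // slice argument is evaluated at spawn time
				defer wg.Done()
				var s uint64
				for _, b := range part {
					s += uint64(b)
				}
				_ = s
			}(buf[i*chunk : (i+1)*chunk])
		}
		wg.Wait()
		// Total scanned is always 1 GiB, so GB/s ≈ 1 / elapsed seconds.
		fmt.Printf("goroutines=%d scan=%.1f GB/s\n", k, 1.0/time.Since(start).Seconds())
	}
}
```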
Exercise 20 — End-to-End Tail Tuning¶
Setup. Take any non-trivial Go service. Establish baseline p50, p99, p99.9 at expected load.
Target. Iteratively tune through every knob covered in this section:
- Log GOMAXPROCS at startup.
- Audit the manifest for per-container CPU limits.
- Run a sweep; confirm the default value.
- Run another sweep with GOMAXPROCS = quota - 1.
- Profile allocations; pool the top 3 hotspots.
- Sweep GOGC ∈ {100, 200, 400} at fixed GOMAXPROCS.
- Set GOMEMLIMIT to bound memory.
- Final sweep across all combinations to lock in.
Record measurements at each step.
Deliverable. A final markdown report with the cumulative improvement. Realistic target: 50%+ reduction in p99.9 latency, similar throughput, controlled memory.
Solutions¶
Solution 1¶
Sweep results for an Intel 8-core dev box:
| GOMAXPROCS | Throughput (req/s) | p99 (ms) |
|---|---|---|
| 1 | 1 100 | 70 |
| 2 | 2 200 | 35 |
| 4 | 4 200 | 20 |
| 8 | 8 100 | 12 |
| 16 | 8 000 | 13 |
| 32 | 7 700 | 15 |
Peak at 8 (= NumCPU). Going to 16 costs ~1% of peak throughput; 32 costs ~5%. Sweet spot confirmed.
Solution 2¶
| Go version | GOMAXPROCS reported |
|---|---|
| 1.15 | 64 (cgroup unaware) |
| 1.22 | 64 (still cgroup unaware) |
| 1.25 | 2 (CPU quota detected) |
Recommendation: move to Go ≥ 1.25 for cgroup-aware defaults; on older toolchains, add automaxprocs as belt-and-braces.
Solution 3¶
Before (with the hard-coded 64): throughput 4 000 req/s, p99 35 ms, CFS throttle rate 5%. After (cgroup-aware default of 2): throughput 4 100 req/s, p99 9 ms, throttle rate 0%.
Net: throughput unchanged within noise, p99 reduced ~75%. The PR is a one-line deletion.
Solution 4¶
After adding automaxprocs: log line maxprocs: Updating GOMAXPROCS=4: determined from CPU quota. GOMAXPROCS drops from 64 to 4. p99 improves; CFS throttling drops to 0.
Solution 5¶
Sweep at 5 000 concurrent connections, request = 10 µs CPU + 5 ms wait:
| GOMAXPROCS | Throughput (req/s) | p99 (ms) |
|---|---|---|
| 1 | 9 500 | 30 |
| 2 | 19 000 | 12 |
| 4 | 19 200 | 10 |
| 8 | 19 100 | 10 |
GOMAXPROCS=2 is enough. The netpoller handles 5 000 parked goroutines for free. Higher GOMAXPROCS does not help because the CPU portion is trivial.
Solution 6¶
| Config | p50 | p99 | p99.9 |
|---|---|---|---|
| GOMAXPROCS=8 | 4 ms | 25 ms | 60 ms |
| GOMAXPROCS=7 | 5 ms | 18 ms | 40 ms |
p50 slightly worse, p99 and p99.9 better. Trade-off worth it for latency-sensitive services.
Solution 7¶
Sample heatmap:
| | GOGC=100 | GOGC=200 | GOGC=400 |
|---|---|---|---|
| GMP=4 | tput 12k, p99 30 ms, heap 1 G | 13k, 25 ms, 2 G | 13k, 24 ms, 4 G |
| GMP=8 | 23k, 18 ms, 1 G | 24k, 15 ms, 2 G | 24k, 14 ms, 4 G |
| GMP=16 | 25k, 22 ms, 1 G | 25k, 20 ms, 2 G | 25k, 19 ms, 4 G |
Optimum: GOMAXPROCS=8, GOGC=200 — best p99 at acceptable memory.
Solution 8¶
Before: 32 Ps, 24% CPU usage (spinning + work), 40 K req/s throughput. After: 16 Ps, 18% CPU usage, 41 K req/s throughput.
CPU saved: ~5 percentage points. Throughput marginally better (less scheduler overhead). spinningthreads drops from 4–8 to 0–2.
Solution 9¶
Throttle penalty during bursts: ~30 ms added p99 latency for ~5 seconds per minute.
Cost of cpu: 5 instead of 4: scheduler may need a different node; node capacity must accommodate the spike.
Recommendation: keep cpu: 4 but provision more replicas for burst capacity. Or use Vertical Pod Autoscaler for adaptive limits.
Solution 10¶
Single process, GOMAXPROCS=32: 60k req/s, numastat shows 40% cross-socket access. Two processes, each GOMAXPROCS=16, NUMA-pinned: 78k req/s (+30%), numastat shows < 5% cross-socket.
Expected pattern. Memory locality matters.
Solution 11¶
Sweep with GOMAXPROCS=4, 8, 16, 32 — all give same throughput (~100 RPS) and same p99 (~50 ms during bursts). Cgo holds M, not P; raising P count does not help.
Real fix:
```go
// Bound cgo concurrency with golang.org/x/sync/semaphore.
var sem = semaphore.NewWeighted(20) // at most 20 in-flight cgo calls

// in the handler:
if err := sem.Acquire(ctx, 1); err != nil {
	return err // context cancelled while queued
}
defer sem.Release(1)
C.slow_call()
```
Now cgo concurrency is bounded; p99 stabilises.
Solution 12¶
Sweep confirms: thread count rises proportionally to concurrent file reads, regardless of GOMAXPROCS. Blocking syscalls park Ms; they do not contend for Ps.
Fix:
```go
sem := make(chan struct{}, 8) // at most 8 blocking reads in flight
var wg sync.WaitGroup
for _, p := range paths {
	sem <- struct{}{} // acquire a slot before spawning
	wg.Add(1)
	go func(p string) { // pass p explicitly: safe under pre-1.22 loop-var capture
		defer wg.Done()
		defer func() { <-sem }()
		data, err := os.ReadFile(p)
		if err != nil {
			log.Print(err)
			return
		}
		process(data)
	}(p)
}
wg.Wait()
```
Thread count caps at ~10 (GOMAXPROCS + a few syscall holders).
Solution 13¶
Stepwise tuning for trading matcher:
| Step | p50 | p99 | p99.9 |
|---|---|---|---|
| Baseline | 80 µs | 2 ms | 8 ms |
| GOGC=off + GOMEMLIMIT | 80 µs | 1.2 ms | 4 ms |
| sync.Pool everywhere | 75 µs | 800 µs | 2 ms |
| GOMAXPROCS=NumCPU-2 | 75 µs | 700 µs | 1.5 ms |
| LockOSThread on matcher | 70 µs | 400 µs | 800 µs |
Final p99: 400 µs. Achieved target.
Solution 14¶
Per-region report:
| Region | Instance type | NumCPU | Cgroup quota | GOMAXPROCS resolved |
|---|---|---|---|---|
| us-east-1 | c5.4xlarge | 16 | 8 | 8 |
| eu-west-1 | c6i.4xlarge | 16 | 8 | 8 |
| ap-southeast-1 | c5.2xlarge | 8 | 8 | 8 |
| us-west-2 | c5a.4xlarge | 16 | 8 | 8 |
| ap-northeast-1 | c5.8xlarge | 32 | 8 | 8 |
All regions consistently resolve to 8 (the cgroup quota). No hard-coded overrides survive.
Solution 15¶
GitHub Actions:
```yaml
- name: Sweep
  run: ./scripts/sweep.sh > sweep.json
- name: Compare
  run: |
    python ./scripts/compare.py sweep.json baseline.json --tolerance=0.10
```
Workflow fails if any GOMAXPROCS setting regresses > 10%. compare.py is ~50 lines.
Solution 16¶
Adaptive controller (see tasks.md Solution 13). With hysteresis (3 consecutive samples), bounds [4, 16], cooldown 30 s.
Result: p99 during bursts drops from 100 ms to 30 ms. Steady-state CPU overhead < 0.5%.
Solution 17¶
Supervisor in Go: forks 2 children with taskset, monitors via Wait4, restarts on exit, exposes single port via in-process reverse proxy.
Single process: 80k req/s, p99 18 ms. Two-process NUMA-split: 105k req/s (+30%), p99 14 ms.
Operational complexity rises: deployments must restart both processes, logs are interleaved, metrics are per-process. Worth it on memory-heavy services.
Solution 18¶
Audit tool queries Prometheus, joins with manifest data:
service-foo: GOMAXPROCS=8, limit=2 (mismatch); throttled 3% of the time
service-bar: GOMAXPROCS=4, limit=4 (OK); throttled 1% (a real load issue)
service-baz: GOMAXPROCS=64, limit=none; throttled 0% (noisy-neighbour risk)
Run weekly. Top 3 offenders get an issue auto-filed.
Solution 19¶
perf stat shows stalled cycles dominating and a high LLC miss rate: memory bandwidth is saturated. Raising GOMAXPROCS only adds more demand on the same memory bus; no improvement.
Mitigations: tile the dataset for better cache use; prefetch sparingly; or accept the limit.
Solution 20¶
Stepwise improvement for a sample service:
| Step | p50 | p99 | p99.9 | Tput | Mem |
|---|---|---|---|---|---|
| Baseline | 8 ms | 80 ms | 250 ms | 1000 rps | 1 G |
| +log + audit manifest (no change) | 8 ms | 80 ms | 250 ms | 1000 rps | 1 G |
| +default GOMAXPROCS sweep | 8 ms | 60 ms | 180 ms | 1100 rps | 1 G |
| +sync.Pool | 7 ms | 35 ms | 100 ms | 1200 rps | 1 G |
| +GOGC=200 | 7 ms | 30 ms | 80 ms | 1300 rps | 2 G |
| +GOMEMLIMIT=2 G | 7 ms | 28 ms | 75 ms | 1300 rps | 2 G |
| +GOMAXPROCS=NumCPU-1 | 8 ms | 25 ms | 60 ms | 1280 rps | 2 G |
| Final | 8 ms | 25 ms | 60 ms | 1280 rps | 2 G |
Net: p99 reduced from 80 ms to 25 ms (~70% improvement). p99.9 reduced from 250 ms to 60 ms (~75%). Throughput up 28%. Memory doubled (an acceptable trade).
Wrap-Up¶
Themes across these optimisations:
- Trust the default first. Since Go 1.25 the runtime resolves GOMAXPROCS from the container's CPU quota; on older toolchains, automaxprocs does the same job. Manual overrides should be rare and justified.
- GOMAXPROCS is a small lever. Allocation profile, GC tuning, and per-container limits dominate the latency picture. Tune GOMAXPROCS last.
- Sweep before changing. Without before/after numbers, you are guessing.
- NUMA splits for big iron. Multi-socket boxes benefit from multiple processes, one per socket.
- CFS throttling is loud. A non-zero throttle rate is always worth investigating.
- GOMAXPROCS does not limit I/O concurrency. The netpoller handles parked waiters regardless of P count.
- Adaptive sizing is rare. Most teams set it once and trust it.
These exercises form the practical curriculum for GOMAXPROCS tuning. Working through them — with real numbers — builds the intuition that no amount of reading can replace.