Worker Pools — Senior Level¶
Table of Contents¶
- Introduction
- Backpressure as a System Property
- Little's Law and Pool Sizing
- Dynamic Resizing
- Work Stealing and Adaptive Scheduling
- Error Propagation Strategies
- Pool Composition: Multi-Stage Pipelines
- sync.Pool Inside Workers
- Long-Running vs Short-Lived Workers
- Memory and Goroutine Leak Forensics
- Observability and SLOs
- Comparison Deep Dive: WaitGroup, errgroup, semaphore
- Real-World Architecture Examples
- Decision Matrix
- Summary
Introduction¶
Focus: "I'm choosing between three worker-pool variants for a system with 10k req/s, P99 SLO of 200 ms, and a 30-instance fleet."
At senior level, you stop writing pools and start choosing them. The question isn't "how do I make a pool?" — that's a 30-line skeleton. The question is: which pool, sized how, sharing what state, instrumented how, and recovering how, given the system I'm in.
This file covers the system-level concerns that separate "knows the pattern" from "owns the pattern." Backpressure as a queueing-theory phenomenon. Little's Law for sizing. Dynamic resizing. Work stealing. Error semantics under partial failure. Memory profiling under load.
Backpressure as a System Property¶
The textbook definition¶
Backpressure is a producer-side slowdown caused by consumer-side saturation. In Go pools: producers block on jobs <- j when the channel is full and all workers are busy. The producer's blocking is the backpressure signal.
Where backpressure breaks down¶
The mechanism only works if every link in the chain respects it:
client → load balancer → web server → pool → DB
^ ^ ^ ^ ^
must must must must must
propagate propagate propagate propagate slow down
slow slow slow slow or
down down down down fail fast
If the DB has a 100-deep queue and the web server times out at 1 second, producers see "fast acceptance" of jobs that will eventually time out. The pool absorbs them but can't make them succeed.
Three backpressure strategies¶
| Strategy | Mechanism | When |
|---|---|---|
| Block | Producer waits on full channel | Internal pipelines, batch jobs |
| Reject | Submit returns "full" error; caller retries | HTTP servers, public APIs |
| Drop | Submit silently discards | Metrics, logs, telemetry |
The choice is a product decision masquerading as a technical one. Reject is correct for HTTP. Drop is correct for non-critical telemetry. Block is correct for in-process pipelines.
Coding the three strategies¶
// Block (default)
pool.jobs <- j
// Reject
select {
case pool.jobs <- j:
default:
return errPoolFull
}
// Drop with metric
select {
case pool.jobs <- j:
case <-ctx.Done():
return ctx.Err()
default:
droppedCounter.Inc()
}
Little's Law and Pool Sizing¶
Little's Law: in a stable system, L = λW, where: - L = average number of items in the system (queue + in-process) - λ = arrival rate (items/sec) - W = average time per item (sec)
Applied to a worker pool:
If you want 1000 req/s and each request takes 200 ms, you need 200 inflight requests at any moment. With workers that can each handle 1 request at a time, that's N = 200 workers.
Three derived rules¶
- Cut latency in half → halve N at the same throughput. Or double throughput at the same N.
- N is independent of arrival pattern. Bursty or smooth, the steady-state count is the same.
- Buffer = burst tolerance. Buffer doesn't change steady-state N; it absorbs bursts so producers don't block during transients.
Worked example¶
Service handles 5,000 req/s. P99 latency is 80 ms (so use that as the planning latency, not P50). Number of workers needed:
Across 4 instances, 100 each. Add 20% headroom: N = 120 per instance.
If each worker costs 8 KiB of stack (large), 120 workers = ~1 MiB. Negligible. The real cost is the 120 downstream connections.
When Little's Law is wrong¶
- Under utilisation > 80%, queue grows non-linearly. Plan for ~70% target utilisation.
- Heavy-tailed latency distributions (rare 10s outliers). P50 sizing collapses; size by P99.
- Coordinated arrivals (cron jobs, retries). Bursts can multiply expected N by 10x.
Dynamic Resizing¶
A static pool is correct most of the time but wasteful or unsafe at the edges. Dynamic resizing scales N with load.
Approach 1: Spawn-on-demand with a max¶
type Pool struct {
jobs chan Job
sem chan struct{} // bounds concurrency
}
func (p *Pool) Submit(j Job) {
select {
case p.sem <- struct{}{}:
go func() {
defer func() { <-p.sem }()
process(j)
}()
case p.jobs <- j: // queue if all workers busy and queue has room
}
}
Goroutines are short-lived. The semaphore caps concurrency. No long-running workers.
Approach 2: Watermark-based scaling¶
const (
minWorkers = 4
maxWorkers = 64
growAt = 0.8 // queue 80% full → add workers
shrinkAt = 0.2
)
func (p *Pool) supervise() {
tick := time.NewTicker(time.Second)
defer tick.Stop()
for range tick.C {
load := float64(len(p.jobs)) / float64(cap(p.jobs))
if load > growAt && p.size() < maxWorkers {
p.addWorker()
} else if load < shrinkAt && p.size() > minWorkers {
p.removeWorker()
}
}
}
The supervisor goroutine watches queue depth and adjusts. addWorker spawns; removeWorker posts a sentinel (or closes a per-worker stop channel).
Approach 3: Per-worker idle timeout¶
Each worker exits after T seconds of idle:
func (p *Pool) work(ctx context.Context) {
defer p.wg.Done()
idle := time.NewTimer(30 * time.Second)
defer idle.Stop()
for {
idle.Reset(30 * time.Second)
select {
case <-ctx.Done():
return
case j, ok := <-p.jobs:
if !ok {
return
}
process(j)
case <-idle.C:
return // no work for 30s, exit
}
}
}
Combined with spawn-on-demand, this gives an auto-scaling pool with bounded max.
Pitfalls of dynamic resizing¶
- Thrashing. Pool grows on a burst, shrinks on idle, grows on next burst. Use hysteresis (different grow/shrink thresholds).
- Per-worker state. If workers hold caches or open connections, churning them costs more than the savings.
- Goroutine leaks. Resizing logic must guarantee
wg.Done()on every exit path. - Metrics drift. Static pools have predictable counter behaviour; dynamic ones have spikes.
The senior judgment call: most pools should not resize dynamically. A correctly sized static pool is simpler and almost always sufficient. Resize only when load varies by 100x within minutes.
Work Stealing and Adaptive Scheduling¶
Vanilla pool: each worker pulls from one shared channel. This is FIFO across workers and fair on average but bad when:
- One job is 100x slower than others (head-of-line blocking on that worker).
- Jobs have locality preferences (e.g., partitioned by user ID for cache reuse).
Work stealing¶
Each worker has its own queue. When a worker's queue is empty, it tries to steal from a busy peer's queue. Go's runtime scheduler does this for goroutines; you rarely need to reimplement it for application-level work.
Library: github.com/panjf2000/ants is a popular goroutine pool with locality features. For most problems, the standard channel pool is fine — re-evaluating only if profiling shows imbalanced worker utilisation.
Sharded pools¶
Sharding by key gives locality without stealing complexity:
type ShardedPool struct {
shards [16]*Pool
}
func (sp *ShardedPool) Submit(key string, j Job) {
h := fnv32(key) % uint32(len(sp.shards))
sp.shards[h].Submit(j)
}
Each shard is an independent pool. Same-key jobs go to the same shard. Useful for stateful workers (e.g., per-user caches).
Error Propagation Strategies¶
Strategy A — Embed in result¶
Pros: no special infrastructure. Consumer decides. Continues on error. Cons: caller must handle every error. Aggregating "did anything fail?" is manual.
Strategy B — errgroup (fail-fast)¶
g, ctx := errgroup.WithContext(ctx)
g.SetLimit(N)
for _, j := range jobs {
j := j
g.Go(func() error { return process(ctx, j) })
}
return g.Wait()
Pros: idiomatic. First error cancels the rest. Cons: you only see one error. Multi-error aggregation needs extra work.
Strategy C — Multi-error aggregation¶
import "errors"
import "go.uber.org/multierr"
var errs error
var mu sync.Mutex
for _, j := range jobs {
j := j
g.Go(func() error {
if err := process(ctx, j); err != nil {
mu.Lock()
errs = multierr.Append(errs, err)
mu.Unlock()
}
return nil // don't fail the group
})
}
g.Wait()
return errs
Or errors.Join (Go 1.20+):
Pros: collect all failures. Cons: doesn't fail-fast. Use when partial completion is OK.
Strategy D — Circuit breaker¶
Workers track failure rate; if it exceeds a threshold, the pool refuses new jobs:
Pros: protects downstream from continuing failure. Cons: complexity. Use only when retry storms are a real risk.
When to choose what¶
| Scenario | Strategy |
|---|---|
| Batch jobs, partial success OK | A or C |
| Atomic-ish operation, all-or-nothing | B |
| Public API with downstream failures | B + circuit breaker |
| Telemetry, fire-and-forget | A (drop or log) |
Pool Composition: Multi-Stage Pipelines¶
A pipeline is N pools chained by channels. Each stage is sized for its own bottleneck.
Three pools, three sizes, three closure responsibilities.
Closure cascade¶
Each stage closes its own output when its input closes:
func parseStage(ctx context.Context, in <-chan Raw) <-chan Parsed {
out := make(chan Parsed)
var wg sync.WaitGroup
for i := 0; i < 8; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for r := range in {
select {
case out <- parse(r):
case <-ctx.Done():
return
}
}
}()
}
go func() { wg.Wait(); close(out) }()
return out
}
The chain shuts down by cascading close() events down the stages.
Pipeline pitfalls¶
- Mismatched sizes. A 100-worker fetch stage feeding an 8-worker parse stage means parse is the bottleneck and fetch sits idle. Size the whole pipeline to the slowest stage.
- Lost cancellation. If stage 2 doesn't propagate ctx, stage 3's cancellation is invisible.
- Buffer cascading. Large buffers between stages hide bottlenecks. Keep buffers small or zero so backpressure is visible in metrics.
sync.Pool Inside Workers¶
sync.Pool reuses allocations across goroutines. Inside a worker that does many short-lived allocations, this can cut GC pressure dramatically.
var bufPool = sync.Pool{
New: func() any { return make([]byte, 0, 64*1024) },
}
func worker(jobs <-chan Job, results chan<- Result) {
for j := range jobs {
buf := bufPool.Get().([]byte)
buf = process(j, buf)
results <- Result{Data: append([]byte(nil), buf...)} // copy out
bufPool.Put(buf[:0])
}
}
Rules¶
- Always copy out before
Put. The pool may give your buffer to another worker. - Reset before
Put.buf[:0]keeps capacity, drops length. - Don't
Puthuge buffers. They consume memory until the next GC cycle drains the pool. sync.Poolis per-P (processor). Cross-P transfers are expensive; keep work local to a P.
When NOT to use¶
- The allocation is cheap (< 100 bytes). Pool overhead dominates.
- The lifetime exceeds the worker. Pool entries can be GC'd at any time.
- The number of items in flight is small. Pool wins on volume.
Long-Running vs Short-Lived Workers¶
| Aspect | Long-running | Short-lived |
|---|---|---|
| Spawn cost | Once | Per task |
| Job dispatch | Channel | go func() |
| Per-worker state | Easy | Awkward |
| Lifecycle | Explicit (Close/Wait) | Implicit (return) |
| Sizing | Static N | Implicit via semaphore |
| Use case | Many small jobs | Few medium jobs with setup |
Long-running wins when you have: - Per-worker setup (open DB conn, build cache). - Hot-loop processing where goroutine creation is a measurable cost. - Need for explicit lifecycle (graceful shutdown, draining).
Short-lived wins when: - Tasks are heterogeneous and rare. - Per-task setup is cheap. - You want simpler code (semaphore + g.Go).
Memory and Goroutine Leak Forensics¶
Common leak: forgot to close jobs¶
Producer panics; jobs is never closed; workers stay in range forever. Heap won't grow noticeably (workers are idle), but runtime.NumGoroutine() plateaus at N+1 forever.
Detect:
// In tests:
before := runtime.NumGoroutine()
runMyPool()
after := runtime.NumGoroutine()
if after > before {
t.Errorf("leaked %d goroutines", after-before)
}
In production: expose runtime.NumGoroutine() as a metric.
Common leak: blocked on send¶
Consumer exits early; workers block on results <- r. Detect via pprof goroutine profile — N workers all blocked on channel send.
Fix: every send guarded by select { case results <- r: case <-ctx.Done(): return }.
Common leak: orphaned context¶
Inner goroutine derives a child context but doesn't cancel() it. The child context's resources leak (timers, listeners). Always defer cancel().
pprof workflow for pool issues¶
go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof
go tool pprof cpu.prof
# (pprof) top
# (pprof) list workerName
Look for: time spent in channel ops, allocations per job, GC pressure.
Observability and SLOs¶
A pool emitting nothing is invisible until it's broken. Senior pools instrument:
| Metric | Type | Use |
|---|---|---|
pool.inflight | Gauge | Current concurrent jobs |
pool.queue_depth | Gauge | len(jobs) |
pool.workers_active | Gauge | Workers not idle |
pool.jobs_total | Counter | Total jobs processed |
pool.jobs_failed_total | Counter | Total errors |
pool.duration_seconds | Histogram | Per-job latency |
pool.queue_wait_seconds | Histogram | Time job spent in queue |
Implementation sketch¶
import "github.com/prometheus/client_golang/prometheus"
var (
inflight = prometheus.NewGauge(prometheus.GaugeOpts{Name: "pool_inflight"})
duration = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "pool_duration_seconds",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 12),
})
)
func work(jobs <-chan Job) {
for j := range jobs {
inflight.Inc()
start := time.Now()
process(j)
duration.Observe(time.Since(start).Seconds())
inflight.Dec()
}
}
SLO mapping¶
If your SLO is "P99 latency < 200 ms," your alerting rule is on the histogram's 0.99 quantile. Pool sizing (Little's Law) directly determines whether the SLO is achievable. Tie metrics to SLOs and you can tell, at a glance, whether the pool is the cause of an alert.
Comparison Deep Dive: WaitGroup, errgroup, semaphore¶
| Feature | sync.WaitGroup | errgroup.Group | x/sync/semaphore |
|---|---|---|---|
| Wait for goroutines | Yes | Yes | No (you wait yourself) |
| Cancellation on first error | No | Yes (with Context) | No |
| Concurrency limit | No | SetLimit | Weighted, native |
| Per-task weight | No | No | Yes |
| Error aggregation | Manual | First only | Manual |
| Best fit | Simple "wait for all" | Fail-fast batch | Variable-cost throttling |
Rule of thumb:
- "I just need to wait for them" →
sync.WaitGroup. - "I want fail-fast on first error" →
errgroup. - "I have variable-cost jobs and need ctx-aware acquire" →
x/sync/semaphore. - "I have many small jobs and need backpressure" → channel-based pool.
You can combine: errgroup plus a channel semaphore plus per-task context.WithTimeout is a very common production stack.
Real-World Architecture Examples¶
Image processor¶
HTTP POST /upload
→ enqueue ImageJob{ID, S3Key}
→ pool of 16 workers (CPU-bound; NumCPU=16)
each worker:
1. Download from S3 (concurrent within worker via separate I/O pool)
2. Resize using libvips (CPU-bound)
3. Upload thumbnail to S3
4. Update DB row
→ return jobID synchronously; client polls /status/{jobID}
Key choices: - Submit returns 202 Accepted; processing is async. - Pool size = NumCPU because resize is CPU-bound. - Per-job timeout = 30 s; if libvips hangs, the worker isn't lost. - Bounded queue depth = 1000; over that, return 429.
Web scraper¶
URL list (1M URLs)
→ producer reads line-by-line, sends to jobs chan
→ pool of 64 workers (I/O-bound; remote service tolerates 64 parallel)
each worker:
1. HTTP GET with 10s timeout
2. Parse HTML
3. Extract structured data
4. Send to results chan
→ consumer batches results, writes 100 at a time to DB
Key choices: - Pool size = 64 because target site rate-limits beyond that. - Per-request timeout = 10 s. - Token bucket limiter at 50 req/s (below the 64 cap, leaves headroom). - Failure mode: log error, continue. Use multierr to surface at the end.
Batch DB writer¶
Stream of insert events
→ buffer up to 1000 rows or 100 ms (whichever first)
→ submit batch to pool of 8 writers
each writer:
1. Acquire DB connection from pool of 16 conns
2. Begin tx
3. Bulk insert
4. Commit
→ emit metrics
Key choices: - Pool size = 8 because DB pool has 16 connections; 8 writers × 1 conn = 8 in-flight. - Batching reduces round trips by 1000x. - Reject (429) if queue depth > 10 batches. - Per-tx timeout = 5 s.
Decision Matrix¶
| Need | Use |
|---|---|
| Bound concurrency, no errors | Worker pool with WaitGroup |
| Fail-fast on first error | errgroup with SetLimit |
| Aggregate all errors | errgroup + multierr / errors.Join |
| Variable per-task cost | x/sync/semaphore |
| Streaming pipeline | Pipeline of pools, ctx-propagated |
| Per-key locality | Sharded pools |
| Auto-scaling load | Spawn-on-demand + semaphore + idle timeout |
| Public API entry | Pool + reject (429) on full queue |
| Background batch | Pool + block on full queue |
| Telemetry | Pool + drop on full queue + counter |
Summary¶
Senior worker-pool work is system design. You apply Little's Law to size the pool. You choose the backpressure strategy (block, reject, drop) based on what kind of system you're in. You pick error semantics (embed, fail-fast, aggregate, circuit break) based on the failure model. You use sync.Pool only when allocation profiling justifies it. You expose metrics that map directly to SLOs. You compose pools into pipelines, sized per stage. You resort to dynamic resizing only when load varies by orders of magnitude.
The small skeleton from junior level is the trunk. Everything in this file is the branches that grow off it for production realities.