Dynamic Worker Scaling — Senior Level¶
Table of Contents¶
- Introduction
- Prerequisites
- Architecture: Control Loops in Production
- AIMD: Additive Increase, Multiplicative Decrease
- PID Controllers for Worker Pools
- Hysteresis as a Stability Mechanism
- Little's Law Driven Sizing
- Integration with Backpressure
- Integration with Circuit Breakers
- Multi-Pool Coordination
- Predictive Scaling
- Workload Modeling
- Cost-Aware Scaling
- Pluggable Autoscaler Architectures
- Coding Patterns
- Clean Code
- Error Handling and Recovery
- Performance Tips
- Best Practices
- Edge Cases and Pitfalls
- Common Mistakes
- Common Misconceptions
- Tricky Points
- Test
- Tricky Questions
- Cheat Sheet
- Self-Assessment Checklist
- Summary
- What You Can Build
- Further Reading
- Related Topics
- Diagrams
Introduction¶
Focus: "How do I design an autoscaler that is stable, predictable, and integrates with backpressure, circuit breakers, and cost controls?"
At middle level you wrote a multi-signal autoscaler with hysteresis. Senior level brings the math and the systems integration. The big themes:
- AIMD — the algorithm behind TCP congestion control, applicable to worker pools
- PID — the control-theory technique for keeping a signal at a set-point
- Little's Law — the analytical baseline for sizing
- Integration — autoscaling does not live alone; it works with backpressure, breakers, rate limits, cost controls
- Stability — when autoscaling causes instability (oscillation, feedback loops, race conditions), how to diagnose and prevent
- Architecture — building autoscalers as composable, testable, swappable components
After this chapter you should be able to:
- Implement AIMD and reason about its convergence properties
- Implement a PID controller for utilization targeting
- Apply Little's Law to estimate baseline pool size and detect saturation
- Wire autoscaling with backpressure, circuit breakers, and rate limiters into a coherent system
- Architect autoscaler components for testability and composability
- Diagnose autoscaler-induced instabilities (oscillation, cascading failures, retry storms)
- Design cost-aware autoscalers that respect budgets
This is the level where you design the autoscaling for a large service, not just write the loop. You make decisions that the team will live with for years.
Prerequisites¶
- You have shipped at least one dynamic pool in production
- You have read and internalized junior and middle chapters
- You are comfortable with basic control theory concepts (feedback, set-point, stability)
- You have used Prometheus, Grafana, distributed tracing
- You have built or maintained systems with circuit breakers and rate limiters
- You can read complex Go code (sync, context, interfaces, generics)
Architecture: Control Loops in Production¶
A worker pool autoscaler is a feedback control loop. Control theory has well-developed vocabulary for this:
- Plant: the system being controlled. Here: the worker pool + downstream system.
- Set-point: the desired value of the controlled variable.
- Process variable: the measured value of the controlled variable.
- Error: difference between set-point and process variable.
- Controller: the function that maps error to control action (resize amount).
- Actuator: the mechanism that applies the control action. Here: Resize / Tune.
The "thermostat" analogy is direct:
| Thermostat | Worker Pool Autoscaler |
|---|---|
| Set-point: 21°C | Set-point: util=70% |
| Process variable: 19°C | Process variable: util=85% |
| Error: -2°C | Error: +15% |
| Controller: thermostat logic | Controller: scaling decision |
| Actuator: heater on/off | Actuator: Resize(n) |
Good control loops:
- Are stable — small disturbances do not cause big oscillations
- Are responsive — react quickly to changes
- Have low overshoot — do not exceed the set-point by much
- Settle — the system reaches steady state
Bad control loops oscillate, overshoot, fail to settle, or chase noise.
Most worker-pool autoscalers are discrete-time bang-bang controllers: at each tick, they take a binary action (grow / shrink / hold). Bang-bang controllers work but can oscillate around the set-point. PID controllers are smoother — we cover them below.
Stability fundamentals¶
A system is stable when small disturbances decay over time. Three properties contribute:
- Negative feedback dominates positive feedback. Growing the pool reduces queue depth (negative feedback). If growing also somehow increases queue depth (e.g., by overloading downstream — positive feedback), the system is unstable.
- Loop gain < 1. If a single tick's action overshoots the necessary correction, you oscillate. Loop gain depends on step size, cooldown, and how strongly signals correlate with size.
- Delay matters. Long feedback delays (signal collection lag + autoscaler tick + resize-takes-effect lag) reduce stability margin. Slower loops have more headroom.
These principles guide all autoscaler design.
AIMD: Additive Increase, Multiplicative Decrease¶
AIMD is the algorithm behind TCP congestion control. It is provably stable under network-like conditions and has nice properties for worker pools too.
Definition¶
- On positive evidence (room to grow): add a small constant N to current size.
- On negative evidence (need to shrink): multiply current size by a factor F < 1.
func AIMD(cur int, sig Signals) int {
switch {
case shouldGrow(sig):
return cur + 1 // additive
case shouldShrink(sig):
return cur - cur/4 // multiplicative: -25%
default:
return cur
}
}
Why does it work?¶
AIMD has two convergence properties:
- Efficiency: the system tends toward maximum utilization (does not under-provision indefinitely).
- Fairness (across multiple AIMD pools sharing a resource): they converge to equal shares.
In a worker pool context, AIMD means:
- Growth is gentle. You add one worker at a time, easing into more load.
- Shrink is aggressive. When you decide capacity is excess, you shrink by 25% (or 50%, or 75%).
The aggressive shrink prevents over-provisioning from persisting. The gentle growth prevents overshoot.
Why not the other way?¶
Multiplicative increase + additive decrease (MIAD): grow fast, shrink slowly. Result: overshooting capacity, slow to react when load drops.
Additive + additive (AIAD): symmetric. Stable but slow in both directions.
Multiplicative + multiplicative (MIMD): violent in both directions. Unstable.
AIMD is the sweet spot for systems with asymmetric cost: cost of over-provisioning > cost of under-provisioning (because we can ramp up quickly when needed).
Wait — in pools, isn't under-provisioning worse?¶
In worker pools, under-provisioning causes latency spikes — sometimes worse than over-provisioning's cost. So you might wonder: should we use MIAD for pools?
In practice, AIMD still works because:
- The autoscaler runs constantly. Additive grow gets you to the right size in seconds.
- Multiplicative shrink only triggers when sustained low signals are observed. Brief dips don't cause shrink.
- The pool re-grows fast on next burst (additive but eager).
If your workload has very sudden, severe bursts (load spikes 10x in milliseconds), then yes — consider MIAD or even MIMD for the rare emergency-grow case. But these are exceptions.
Implementation¶
type AIMDController struct {
GrowStep int // typically 1 or 2
ShrinkFactor float64 // 0.25 = -25%, 0.5 = -50%
GrowAfter Predicate
ShrinkAfter Predicate
}
func (a *AIMDController) Decide(cur int, signals Signals) int {
switch {
case a.GrowAfter(signals):
return cur + a.GrowStep
case a.ShrinkAfter(signals):
shrink := int(float64(cur) * a.ShrinkFactor)
if shrink < 1 { shrink = 1 }
return cur - shrink
default:
return cur
}
}
Predicate is func(Signals) bool. Inject your conditions. Test in isolation.
Variants¶
- AIMD with floor/ceiling. Clamp results.
- AIMD with cooldown. Apply the usual up/down cooldowns between actions, as in earlier chapters.
- AIMD with dampened multiplicative. Shrink by 25% if util very low, by 10% if moderately low.
- MIMD-fallback under saturation. Normal AIMD, except when signal indicates emergency (e.g., wait > 10x SLO), multiplicative grow once.
These variations are common in production. The pure AIMD is a useful baseline.
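The MIMD-fallback variant can be sketched as a decider. The Signals struct and predicate fields here are illustrative, not a fixed API:

```go
// Signals is a minimal signal bundle for illustration.
type Signals struct {
	Util   float64 // utilization, 0..1
	WaitMs float64 // p99 queue wait in milliseconds
}

// EmergencyAIMD is plain AIMD with one extra rule: when the
// Emergency predicate fires (e.g., wait far beyond SLO), grow
// multiplicatively instead of additively.
type EmergencyAIMD struct {
	GrowStep     int
	ShrinkFactor float64 // e.g., 0.25 = -25%
	GrowFactor   float64 // e.g., 1.5 = +50% in an emergency
	Grow         func(Signals) bool
	Shrink       func(Signals) bool
	Emergency    func(Signals) bool
}

func (e *EmergencyAIMD) Decide(cur int, sig Signals) int {
	switch {
	case e.Emergency(sig):
		next := int(float64(cur) * e.GrowFactor)
		if next <= cur {
			next = cur + 1 // always make progress
		}
		return next
	case e.Grow(sig):
		return cur + e.GrowStep
	case e.Shrink(sig):
		shrink := int(float64(cur) * e.ShrinkFactor)
		if shrink < 1 {
			shrink = 1
		}
		return cur - shrink
	default:
		return cur
	}
}
```

The emergency path is evaluated first, so a severe wait-time breach overrides the normal additive step.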
Convergence example¶
Imagine a workload where the "right" pool size is 30. AIMD starts at 8.
- Tick 1: signals say grow. 8 → 9.
- Tick 2: grow. 9 → 10.
- ...
- Tick 23: grow. 30 → 31.
- Tick 24: util drops slightly (we are at right size). Hold at 31.
- Tick 25: low util signal. Shrink: 31 → 31 - 7 = 24.
- Tick 26: util goes up again (we shrank too much). Grow: 24 → 25.
- ...
- Tick 32: grow. 30 → 31.
- ...
Result: pool oscillates around 30, gradually approaching. Stable, slow, predictable.
Compare to pure additive (no multiplicative shrink): each oscillation is +/-1. Tighter oscillation but less responsive to permanent load drops.
Compare to multiplicative both ways: huge swings, hard to predict.
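A toy simulation makes the convergence concrete. The signals here are idealized (grow while below the ideal size, shrink only when more than 25% above it), which is an assumption for illustration:

```go
// simulateAIMD runs AIMD against a fixed ideal size with idealized
// signals. Returns the pool size after the given number of ticks.
func simulateAIMD(start, ideal, ticks int) int {
	cur := start
	for i := 0; i < ticks; i++ {
		switch {
		case cur < ideal:
			cur++ // additive grow
		case cur > ideal+ideal/4:
			cur -= cur / 4 // multiplicative shrink
		}
	}
	return cur
}
```

Starting from 8 the pool walks up to 30 and holds; starting from 60 it shrinks in two multiplicative steps and settles just inside the deadband.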
PID Controllers for Worker Pools¶
PID = Proportional + Integral + Derivative. The classic control technique for keeping a process variable at a set-point.
Definition¶
u(t) = Kp·e(t) + Ki·∫e(t)dt + Kd·de(t)/dt

Where:
- e(t) is the error (set-point - process variable)
- Kp, Ki, Kd are tuning constants
- u(t) is the control output (resize amount)
In Go, discrete form:
type PID struct {
Kp, Ki, Kd float64
setpoint float64
integral float64
lastError float64
lastTime time.Time
primed bool
}
func (p *PID) Update(measured float64, now time.Time) float64 {
err := p.setpoint - measured
if !p.primed {
p.primed = true
p.lastError = err
p.lastTime = now
return p.Kp * err
}
dt := now.Sub(p.lastTime).Seconds()
if dt <= 0 {
return p.Kp * err
}
p.integral += err * dt
derivative := (err - p.lastError) / dt
p.lastError = err
p.lastTime = now
return p.Kp*err + p.Ki*p.integral + p.Kd*derivative
}
Application¶
Set-point: target utilization, e.g., 0.70.
Each tick:
util := computeUtilization()
controlSignal := pid.Update(util, time.Now())
// controlSignal is roughly: how many workers to add/remove
target := cur + int(math.Round(controlSignal))
target = clamp(target, floor, ceiling)
pool.Resize(target)
Tuning¶
The hard part. There is no universal Kp/Ki/Kd. Approaches:
- Ziegler-Nichols. Tune Kp until oscillation, then back off. Set Ki and Kd by formula.
- Manual. Start with small Kp (e.g., 5), Ki=0, Kd=0. Increase until responsive. Add small Ki for steady-state accuracy. Add small Kd if oscillation persists.
- Auto-tuning. Various libraries do this online. Rarely used in worker pools.
There are no universal "typical values" for a worker pool; the exact Kp/Ki/Kd depend on your system. Tune empirically.
Why PID is overkill (usually)¶
PID is for systems where the process variable must be at a precise set-point. Worker pools don't need that — being at 65% utilization vs 75% is fine. A bang-bang controller with hysteresis serves most cases.
PID becomes useful when:
- You need very tight latency control
- You have a clear set-point (e.g., maintain p99 < 100 ms)
- The cost of overshoot is high
- You have time to tune carefully
For most pools, AIMD or simple threshold-based is enough. PID is a tool in the toolbox.
Anti-windup¶
A subtle PID issue: if the integral keeps accumulating during prolonged saturation (we're at ceiling, can't grow more, but error is still positive), the integral term grows without bound. When the situation resolves, the controller massively overshoots in the other direction.
Anti-windup: clamp the integral to a bounded range, or freeze integration while the output is saturated.
This is the kind of detail that turns a paper-correct PID into a production-correct PID.
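Both guards together, as a discrete update step. The saturated flag and the IntegralMax bound are assumptions about how you wire it into the autoscaler:

```go
// PIDAW is a PID update step with two anti-windup guards: the
// integral is clamped to ±IntegralMax, and integration is skipped
// entirely while the caller reports actuator saturation (e.g., the
// pool is pinned at its ceiling).
type PIDAW struct {
	Kp, Ki, Kd  float64
	IntegralMax float64
	integral    float64
	lastError   float64
}

func (p *PIDAW) Update(err, dt float64, saturated bool) float64 {
	if !saturated { // freeze integration during saturation
		p.integral += err * dt
		if p.integral > p.IntegralMax {
			p.integral = p.IntegralMax
		} else if p.integral < -p.IntegralMax {
			p.integral = -p.IntegralMax
		}
	}
	derivative := (err - p.lastError) / dt
	p.lastError = err
	return p.Kp*err + p.Ki*p.integral + p.Kd*derivative
}
```

With both guards, a long stretch at the ceiling leaves the integral bounded, so the controller does not lurch downward once the overload clears.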
Hysteresis as a Stability Mechanism¶
We have used hysteresis (different thresholds for up and down) throughout. Let us look at it more carefully.
The math¶
A system with a single threshold T exhibits chattering when the signal is near T. Crossing T triggers an action; the action reduces the signal back below T; the signal grows again; rinse, repeat.
Hysteresis introduces two thresholds, T_high > T_low. Crossing T_high triggers grow; crossing T_low triggers shrink. The deadband between them is where no action is taken.
The width of the deadband determines how much noise is filtered:
- Narrow (T_high - T_low = 0.1): filters noise within 10% range
- Wide (T_high - T_low = 0.5): filters noise within 50%
The wider the deadband, the more the system can drift before reacting. Trade-off: stability vs responsiveness.
Choosing widths¶
A good rule: deadband width = max(noise amplitude × 3, set-point × 0.3).
If your measured signal has a standard deviation of 0.05 around steady state, deadband of 0.15 (3 sigma) filters 99.7% of noise. If your set-point is 0.7 utilization, deadband of 0.2 (centered on set-point: 0.6-0.8) is reasonable.
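The two-threshold decision itself is tiny. A sketch:

```go
// HysteresisBand maps a signal to an action using two thresholds:
// grow above High, shrink below Low, hold inside the deadband.
type HysteresisBand struct {
	Low, High float64
}

func (h HysteresisBand) Action(signal float64) string {
	switch {
	case signal > h.High:
		return "grow"
	case signal < h.Low:
		return "shrink"
	default:
		return "hold"
	}
}
```

With Low=0.6 and High=0.8 centered on a 0.7 set-point, any signal inside the 0.2-wide deadband produces no action.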
Hysteresis vs cooldown¶
Hysteresis filters by value. Cooldown filters by time. They are complementary:
- Hysteresis without cooldown: no flap on the value axis, but you can resize every tick when signal slowly drifts across the threshold.
- Cooldown without hysteresis: no rapid-fire resize, but you flap as soon as cooldown elapses if signal is right at threshold.
- Both: solid stability.
Production systems always use both.
Adaptive hysteresis¶
If your workload pattern changes (e.g., daytime vs nighttime), fixed hysteresis may be too narrow at one time and too wide at another. Adaptive: track noise level, widen deadband when noise is high.
type AdaptiveBand struct {
	base       float64
	noise      *EWMA // smoothed noise level
	current    float64
	lastSignal float64
}

func (a *AdaptiveBand) Update(signal float64) {
	a.noise.Add(math.Abs(signal - a.lastSignal))
	a.current = a.base + a.noise.Value()*3
	a.lastSignal = signal
}
The deadband widens when noise is high. Useful in production where workload character changes over the day.
Little's Law Driven Sizing¶
We have mentioned Little's Law. Let us use it.
Statement¶
For a stable queueing system: L = λ × W.

- L: average number of items in the system
- λ: throughput (items per second arriving and departing — they are equal in steady state)
- W: average time in system
Worker pool application¶
In a worker pool:
- L_workers: average number of busy workers
- λ: tasks per second
- W_process: average processing time

So: L_workers = λ × W_process.

Example: 200 tasks/s, each taking 150 ms: 200 × 0.15 = 30.

You need at least 30 workers busy on average. With some idle capacity for variance and bursts, you provision 30 to 50 — say, 40.
Floor and ceiling from Little's Law¶
- Steady state: workers = λ × W_process
- Floor: enough workers for minimum throughput. If minimum throughput is 50 tasks/s, floor = 50 × 0.15 = 7.5, round up to 8.
- Ceiling: enough workers for maximum acceptable throughput. If max is 2000 tasks/s, ceiling = 2000 × 0.15 = 300.
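All three numbers fall out of one helper (rounding up; the inputs are the figures from the text):

```go
import "math"

// littleSize applies Little's Law: busy workers = lambda * W,
// rounded up to a whole worker.
func littleSize(tasksPerSec, serviceSec float64) int {
	return int(math.Ceil(tasksPerSec * serviceSec))
}
```

littleSize(200, 0.15) gives the 30-worker steady state; littleSize(50, 0.15) gives the floor of 8; littleSize(2000, 0.15) gives the ceiling of 300.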
Detecting saturation¶
If you observe λ × W_process ≥ pool size:

You are saturated. Tasks are queueing. Grow.

If λ × W_process is well below pool size:

You are over-provisioned. Shrink.
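As a sketch (the 0.5 shrink threshold is an illustrative choice, not a rule):

```go
// saturationCheck compares demand (lambda * W, the busy workers
// needed) against the current pool size.
func saturationCheck(tasksPerSec, serviceSec float64, poolSize int) string {
	demand := tasksPerSec * serviceSec
	switch {
	case demand >= float64(poolSize):
		return "saturated"
	case demand < float64(poolSize)*0.5:
		return "over-provisioned"
	default:
		return "ok"
	}
}
```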
Worked example¶
A service processes 200 tasks/s. Each task takes 100 ms. Pool is at 40 workers.
By Little's Law: 200 × 0.1 = 20 workers busy on average. Pool of 40 means 50% utilization. Good headroom for bursts.
A burst arrives: throughput goes to 400 tasks/s briefly. Little's Law: needs 40 workers. Pool is exactly at limit. If burst is sustained: queue grows.
The autoscaler should grow the pool. Say to 60. New headroom: 33%.
If you know W_process up front, Little's Law lets you choose the autoscaler's thresholds before it ever runs.
When Little's Law breaks down¶
- Non-stationary processes (workload character changes mid-test)
- Highly bursty workloads (steady-state assumptions fail)
- Workloads with feedback (slow downstream causes processing time to grow)
In those cases, Little's Law is a starting point, not a target.
Integration with Backpressure¶
Backpressure and autoscaling are partners. Let us see the full integration.
Layers of pressure¶
In a healthy production system:
- Caller-side: clients implement retries with backoff. If we 503 them, they wait, retry.
- Edge layer (HTTP): rate limiting at the front door. Reject excessive load.
- Pool input: Submit returns error if pool is full.
- Pool itself: autoscaler grows up to ceiling.
- Downstream: circuit breaker if downstream is unhealthy.
Each layer protects the next. The autoscaler is layer 4; backpressure is layers 1-3 and 5.
Submit with backpressure¶
func (s *Service) Handle(req Request) Response {
ctx, cancel := context.WithTimeout(req.Context(), 5 * time.Second)
defer cancel()
err := s.pool.Submit(ctx, func(ctx context.Context) {
s.process(ctx, req)
})
switch err {
case nil:
return Response{Status: 200}
case ErrPoolFull:
return Response{Status: 503, Retry: true}
case context.DeadlineExceeded:
return Response{Status: 504}
default:
return Response{Status: 500}
}
}
The caller sees 503 immediately when the pool is at capacity. The caller decides retry.
Token bucket fronting the pool¶
Sometimes you don't want to expose backpressure at the pool. Instead, fail fast at the edge:
type ServiceTokens struct {
pool *Pool
limiter *rate.Limiter
}
func (s *ServiceTokens) Handle(req Request) Response {
if !s.limiter.Allow() {
return Response{Status: 429}
}
err := s.pool.Submit(...)
// ...
}
The limiter caps tokens/sec. Excess fails immediately with 429. The pool is protected from sudden bursts; the autoscaler adjusts the limiter rate based on pool capacity.
Autoscaler-aware limiter¶
Couple the limiter rate to pool capacity:
func (s *Service) updateLimiter() {
	n := s.pool.Size()
	// assume ~100ms processing: n workers can absorb n/0.1 tasks/s
	tokensPerSec := float64(n) / 0.1
	s.limiter.SetLimit(rate.Limit(tokensPerSec))
}
Called every tick. The limiter rate tracks the pool's actual capacity. As pool grows, limiter relaxes. As pool shrinks, limiter tightens.
This is integrated control: pool size and limiter rate move together.
Integration with Circuit Breakers¶
A circuit breaker monitors a downstream's health. When unhealthy, it short-circuits — calls fail immediately instead of waiting for timeout.
Why combine with autoscaling¶
We saw the failure mode at middle level: slow downstream → workers tied up → queue grows → autoscaler grows pool → more workers slamming slow downstream → downstream collapses.
The circuit breaker stops this:
func (s *Service) work(ctx context.Context, req Request) {
err := s.breaker.Call(func() error {
return s.callDownstream(ctx, req)
})
if err == ErrCircuitOpen {
// fail fast; don't tie up the worker
return
}
// ...
}
When breaker is open, the worker returns immediately. The pool's load is what matters, not downstream's. The autoscaler's signals (wait time, queue depth) reflect actual pool work, not downstream slowness.
Breaker state as a signal¶
You can also use breaker state directly in the autoscaler:
func (a *Autoscaler) shouldGrow() bool {
if a.breaker.IsOpen() {
return false // downstream is sick; don't grow
}
return a.signal.WaitP99() > a.threshold
}
This is a veto: while downstream is sick, do not grow the pool. Doing so would worsen the situation.
Half-open state¶
A breaker in half-open lets a few requests through to test downstream. The autoscaler should also be conservative — grow slowly, if at all, during half-open.
func (a *Autoscaler) growStep() int {
switch a.breaker.State() {
case Open:
return 0
case HalfOpen:
return 1 // tiny test step
case Closed:
return 4 // normal
}
return 0
}
Multi-Pool Coordination¶
Many services have multiple pools (one per downstream, per priority class, per tenant). Coordinating their autoscaling is its own challenge.
Independent autoscalers¶
Simplest: each pool has its own autoscaler. They don't talk to each other.
Pros: simple, isolated. Cons: they may collectively over-grow (each thinks it should grow; total exceeds host capacity).
Shared resource budget¶
Define a global budget: total workers across all pools < N.
type GlobalBudget struct {
mu sync.Mutex
capacity int
used int
}
func (g *GlobalBudget) RequestGrowth(n int) (granted int) {
g.mu.Lock()
defer g.mu.Unlock()
if g.used + n > g.capacity {
granted = g.capacity - g.used
} else {
granted = n
}
g.used += granted
return granted
}
func (g *GlobalBudget) Release(n int) {
g.mu.Lock()
defer g.mu.Unlock()
g.used -= n
if g.used < 0 { g.used = 0 }
}
Each pool's autoscaler asks the budget for growth permission. Across pools, total never exceeds capacity.
Priority-based allocation¶
Higher-priority pools get growth requests granted first. Lower-priority pools wait.
type PrioritizedBudget struct {
capacity int
used int
// queue of pending requests, sorted by priority
}
Production frameworks like Kubernetes Horizontal Pod Autoscaler use similar logic at cluster scale.
Per-tenant fairness¶
If pools serve different tenants, fairness matters. Several models:
- Equal shares. Each tenant guaranteed N/T workers, can use more if others don't.
- Weighted shares. High-paying tenants get more.
- Per-tenant quotas. Each has a max; no inter-tenant priority.
Implementation depends on policy. Track per-tenant utilization; allocate growth to whoever needs it most.
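The equal-shares model, for example, might be sketched like this (the redistribution order is an arbitrary choice):

```go
// equalShares gives each tenant an equal guarantee of the total,
// then redistributes unused guarantee to tenants that want more,
// one worker at a time.
func equalShares(total int, wants []int) []int {
	n := len(wants)
	alloc := make([]int, n)
	share := total / n
	spare := 0
	for i, w := range wants {
		if w < share {
			alloc[i] = w // tenant wants less than its guarantee
			spare += share - w
		} else {
			alloc[i] = share
		}
	}
	// hand spare capacity to tenants still wanting more
	for spare > 0 {
		progressed := false
		for i := range alloc {
			if spare > 0 && alloc[i] < wants[i] {
				alloc[i]++
				spare--
				progressed = true
			}
		}
		if !progressed {
			break
		}
	}
	return alloc
}
```

With 12 workers and wants of (2, 10, 6), each tenant is guaranteed 4; the first tenant's unused 2 are split between the other two.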
Predictive Scaling¶
Reactive autoscaling grows after the load arrives. Predictive grows before.
Time-of-day patterns¶
If your traffic is consistently high at 09:00, pre-warm at 08:55.
type SchedulePoint struct {
Hour int
Min int
Target int
}
func (p *PatternedScaler) tick(now time.Time) {
for _, sp := range p.schedule {
if now.Hour() == sp.Hour && now.Minute() == sp.Min {
p.pool.Resize(sp.Target)
}
}
}
Simple but effective. Many services have stable daily patterns.
Predictive models¶
More sophisticated: use a forecast. Linear regression, ARIMA, neural network. The autoscaler computes the predicted load 5-10 minutes ahead and resizes preemptively.
For most services, a simple weighted-average prediction works:
func (p *Forecast) PredictNext() float64 {
	var sum, wsum float64
	for i, s := range p.samples { // newer samples weigh more
		w := float64(i + 1)
		sum, wsum = sum+s*w, wsum+w
	}
	return sum / wsum
}
Hybrid: reactive + predictive¶
Run both. Predictive sets a baseline target; reactive corrects for unexpected variation.
func (a *Autoscaler) tick() {
predicted := a.forecast.Predict()
reactive := a.measure()
target := max(predicted, reactive)
a.pool.Resize(clamp(target, a.floor, a.ceiling))
}
When prediction is accurate, the pool is right-sized before load arrives. When prediction is wrong, reactive catches up.
Limits of prediction¶
- Sudden events (a marketing email blast, an external incident) cannot be predicted.
- Predictions are sometimes wrong; over-trusting them wastes capacity.
- Complex models cost CPU and engineering time.
For most services, time-of-day schedule + reactive autoscaler suffices. Heavy prediction is worth it only at scale.
Workload Modeling¶
To tune autoscaler, model the workload. Three dimensions matter.
Arrival rate¶
- Mean (λ̄): tasks per second over long window
- Variance: how bursty?
- Pattern: time-of-day, day-of-week
- Spikes: rare, large bursts
Service time¶
- Mean (W̄): average processing time
- Variance: are all tasks similar, or bimodal?
- Tail: how heavy is p99 / p999?
Coupling¶
- Are tasks independent, or do they share resources?
- Downstream effects: does one slow task slow others?
- Burst correlation: do bursts correlate with slow downstream?
A workload model can be as simple as:
type WorkloadModel struct {
MeanArrivalRate float64
PeakArrivalRate float64
MeanServiceMs float64
P99ServiceMs float64
}
From this, derive:
- Steady state pool size: MeanArrivalRate × MeanServiceMs / 1000 × 1.3 (30% headroom)
- Burst pool size: PeakArrivalRate × P99ServiceMs / 1000 (for tail-driven)
- Floor: MeanArrivalRate × 0.3 × MeanServiceMs / 1000
- Ceiling: 2 × PeakArrivalRate × P99ServiceMs / 1000
These are baselines for the autoscaler's floor, ceiling, and thresholds.
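Under the assumption that these rules of thumb fit your workload, they translate directly into methods on the model (struct repeated from above for completeness; rounding up, with a minimum floor of 1):

```go
import "math"

type WorkloadModel struct {
	MeanArrivalRate float64 // tasks per second
	PeakArrivalRate float64
	MeanServiceMs   float64
	P99ServiceMs    float64
}

// SteadyState: mean demand plus 30% headroom.
func (m WorkloadModel) SteadyState() int {
	return int(math.Ceil(m.MeanArrivalRate * m.MeanServiceMs / 1000 * 1.3))
}

// Floor: enough for 30% of mean load, never below one worker.
func (m WorkloadModel) Floor() int {
	f := int(math.Ceil(m.MeanArrivalRate * 0.3 * m.MeanServiceMs / 1000))
	if f < 1 {
		f = 1
	}
	return f
}

// Ceiling: twice peak demand at tail service time.
func (m WorkloadModel) Ceiling() int {
	return int(math.Ceil(2 * m.PeakArrivalRate * m.P99ServiceMs / 1000))
}
```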
Continuous modeling¶
A production-grade autoscaler can update the model continuously:
Each task's start time updates arrival rate (EWMA). Each task's duration updates service time (EWMA, HDR for percentile). The autoscaler decisions use live model values.
This is a form of self-tuning: as workload character changes, the autoscaler adapts thresholds.
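The EWMA used throughout these sketches is small enough to show inline (seeding with the first sample is one reasonable choice):

```go
// EWMA is an exponentially weighted moving average: each sample
// pulls the value a fraction alpha toward itself, so recent
// samples dominate.
type EWMA struct {
	alpha  float64
	value  float64
	seeded bool
}

func NewEWMA(alpha float64) *EWMA { return &EWMA{alpha: alpha} }

func (e *EWMA) Add(sample float64) {
	if !e.seeded {
		e.value, e.seeded = sample, true
		return
	}
	e.value = e.alpha*sample + (1-e.alpha)*e.value
}

func (e *EWMA) Value() float64 { return e.value }
```

With alpha = 0.5, each new sample halves the influence of all history; smaller alpha means smoother but slower tracking.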
Cost-Aware Scaling¶
In cloud environments, every worker has a cost. Cost-aware autoscaling factors this in.
Cost dimensions¶
- Per-worker memory. Tied to stack size (small) + heap usage per task.
- Per-task downstream cost. Each call to a paid downstream API.
- Compute cost. Cloud bill per CPU-second.
- SLO violation cost. Customer impact from missed latency.
Cost-driven decision¶
func (a *Autoscaler) costAwareDecide(signals Signals, cur int) int {
// Estimate cost of growing by 1 worker
workerCost := a.workerHourCost
// Estimate benefit: latency improvement * value-of-improvement
latencyImprovement := a.estimateLatencyImprovement(cur)
benefit := latencyImprovement * a.valuePerLatencyMs
if benefit > workerCost {
return cur + 1
}
return cur
}
The autoscaler grows only when expected benefit exceeds cost. Hard to estimate accurately; in practice, codify as thresholds.
Budget caps¶
Simple: don't grow beyond a budget.
func (a *Autoscaler) checkBudget(targetSize int) int {
	remaining := a.budget.Remaining()
	projectedCost := workerCostPerMonth * float64(targetSize)
	if projectedCost > remaining {
		// log alert, cap to budget
		return int(remaining / workerCostPerMonth)
	}
	return targetSize
}
Production systems often pair autoscaling with explicit budget caps. The autoscaler is allowed to grow as long as budget permits; beyond that, the system degrades gracefully (longer queues, backpressure).
Spot/preemptible workers¶
In some environments, you can mix on-demand and spot workers. Spot is cheaper but can be revoked.
Autoscaler runs two pools:
type DualPool struct {
stable *Pool // on-demand
spot *Pool // spot
}
func (d *DualPool) Submit(task func()) bool {
// try spot first; fall back to stable
if d.spot.Submit(task) {
return true
}
return d.stable.Submit(task)
}
The autoscaler grows the spot pool aggressively when allowed; falls back to stable when spot capacity is constrained.
Pluggable Autoscaler Architectures¶
To build a production autoscaler that you can evolve, design it as composable parts.
Interfaces¶
type Pool interface {
Size() int
Resize(int) error
}
type Signal interface {
Value() float64
Name() string
}
type Decider interface {
Decide(cur int, signals []Signal) (target int, reason string)
}
type Cooldown interface {
AllowUp(now time.Time) bool
AllowDown(now time.Time) bool
RecordResize(now time.Time, dir Direction)
}
The autoscaler is a runner¶
type Autoscaler struct {
Pool Pool
Signals []Signal
Decider Decider
Cooldown Cooldown
Bounds Bounds
Interval time.Duration
Logger *slog.Logger
}
func (a *Autoscaler) Run(ctx context.Context) {
ticker := time.NewTicker(a.Interval)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case now := <-ticker.C:
cur := a.Pool.Size()
target, reason := a.Decider.Decide(cur, a.Signals)
target = a.Bounds.Clamp(target)
if target == cur { continue }
dir := DirectionUp
if target < cur { dir = DirectionDown }
switch dir {
case DirectionUp:
if !a.Cooldown.AllowUp(now) { continue }
case DirectionDown:
if !a.Cooldown.AllowDown(now) { continue }
}
if err := a.Pool.Resize(target); err != nil {
a.Logger.Warn("resize failed", "err", err)
continue
}
a.Cooldown.RecordResize(now, dir)
a.Logger.Info("resized", "from", cur, "to", target, "reason", reason)
}
}
}
Each piece is swappable. Want AIMD instead of threshold? Swap Decider. Want time-of-day cooldown? Swap Cooldown. Want a different signal? Add a new Signal implementation.
Decider implementations¶
type ThresholdDecider struct { /* thresholds */ }
type AIMDDecider struct { /* additive/multiplicative params */ }
type PIDDecider struct { /* Kp, Ki, Kd, setpoint */ }
type ScheduleDecider struct { /* time-of-day schedule */ }
type CompositeDecider struct {
Deciders []Decider
Strategy MergeStrategy // max, min, priority
}
A CompositeDecider combines several. The merge strategy decides how. "Max" grows aggressively (take the largest target); "min" grows conservatively; "priority" uses the first non-zero decision.
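The "max" strategy, as one sketch (interfaces repeated so the example is self-contained; fixedDecider is a test helper, not part of the design):

```go
type Signal interface {
	Value() float64
	Name() string
}

type Decider interface {
	Decide(cur int, signals []Signal) (int, string)
}

// MaxComposite implements the "max" merge strategy: the largest
// proposed target wins, i.e., the most growth-eager child decides.
type MaxComposite struct {
	Children []Decider
}

func (m *MaxComposite) Decide(cur int, signals []Signal) (int, string) {
	best, why := cur, "hold"
	for _, d := range m.Children {
		if t, r := d.Decide(cur, signals); t > best {
			best, why = t, r
		}
	}
	return best, why
}

// fixedDecider always proposes the same target; handy in tests.
type fixedDecider struct {
	target int
	reason string
}

func (f fixedDecider) Decide(int, []Signal) (int, string) { return f.target, f.reason }
```

Note the reason string travels with the winning target, so logs still explain which child drove the resize.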
Testing each piece in isolation¶
Pure functions/structs. No real pool needed for tests:
func TestAIMDDeciderGrowsAdditive(t *testing.T) {
d := &AIMDDecider{GrowStep: 1, ShrinkFactor: 0.25}
target, _ := d.Decide(20, []Signal{
&FakeSignal{Value_: 0.9, Name_: "util"},
})
if target != 21 { t.Errorf("expected 21, got %d", target) }
}
Mock everything. The runner's logic is fully covered by tests on its parts.
Coding Patterns¶
Pattern: Builder for autoscaler¶
type Builder struct {
pool Pool
signals []Signal
decider Decider
cooldown Cooldown
bounds Bounds
interval time.Duration
}
func NewBuilder(pool Pool) *Builder {
return &Builder{
pool: pool,
decider: &ThresholdDecider{},
cooldown: &SimpleCooldown{Up: 3 * time.Second, Down: 60 * time.Second},
bounds: Bounds{Min: 1, Max: 100},
interval: 500 * time.Millisecond,
}
}
func (b *Builder) WithDecider(d Decider) *Builder { b.decider = d; return b }
func (b *Builder) WithSignal(s Signal) *Builder { b.signals = append(b.signals, s); return b }
// ... etc
func (b *Builder) Build() *Autoscaler {
return &Autoscaler{
Pool: b.pool, Signals: b.signals, Decider: b.decider,
Cooldown: b.cooldown, Bounds: b.bounds, Interval: b.interval,
}
}
Builder makes configuration readable and validated. Common pattern in Go for any multi-option struct.
Pattern: pluggable metrics¶
type MetricsSink interface {
GaugeSet(name string, v float64)
CounterInc(name string)
HistogramObserve(name string, v float64)
}
type Autoscaler struct {
Metrics MetricsSink
}
Implementations: Prometheus, OpenTelemetry, stdout, no-op (for tests).
Pattern: middleware around Submit¶
type SubmitMiddleware func(next SubmitFunc) SubmitFunc
func WithMetrics(m MetricsSink) SubmitMiddleware {
return func(next SubmitFunc) SubmitFunc {
return func(task func()) bool {
m.CounterInc("submits")
return next(task)
}
}
}
func WithLogging(l *slog.Logger) SubmitMiddleware { /* ... */ }
func WithTimeout(d time.Duration) SubmitMiddleware { /* ... */ }
Chain middlewares for cross-cutting concerns.
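A small helper can do the chaining (sketch; the ordering convention is a design choice):

```go
type SubmitFunc func(task func()) bool

type SubmitMiddleware func(next SubmitFunc) SubmitFunc

// Chain wraps base in the given middlewares. The first middleware
// listed becomes the outermost layer, so it runs first on Submit.
func Chain(base SubmitFunc, mws ...SubmitMiddleware) SubmitFunc {
	for i := len(mws) - 1; i >= 0; i-- {
		base = mws[i](base)
	}
	return base
}
```

Usage: `submit := Chain(pool.Submit, WithMetrics(m), WithLogging(l))` builds one SubmitFunc with metrics outermost.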
Pattern: observable resize events¶
type ResizeEvent struct {
Old, New int
Reason string
Signals map[string]float64
Timestamp time.Time
}
type Observer interface {
OnResize(e ResizeEvent)
}
Pluggable observers. One logs to stdout; one writes to a database for forensic analysis; one emits a metric.
Pattern: declarative policy¶
autoscaler:
signals:
- type: wait_p99
window: 30s
- type: utilization
window: 10s
decider:
type: aimd
grow_step: 2
shrink_factor: 0.20
cooldown:
up: 3s
down: 60s
bounds:
min: 4
max: 128
Load policy from config; build autoscaler. Lets ops tune without redeploying.
Clean Code¶
- Names match domain. Signal, Decider, Cooldown, Bounds. Not Helper, Manager, Util.
- Each function does one thing. Decide doesn't enforce cooldown; Cooldown does.
- Constants for thresholds at the top. Easy to scan, easy to override.
- Document policy. Why this Kp? Why this cooldown? Future-you will thank you.
- Test interfaces, not implementations. Mock the pool; assert on Resize calls.
- Configuration as data. YAML/JSON > buried magic numbers.
- Composition over inheritance. CompositeDecider wraps Deciders; doesn't subclass them.
Error Handling and Recovery¶
Pool resize failure¶
func (a *Autoscaler) safeResize(target int) {
if err := a.Pool.Resize(target); err != nil {
a.Metrics.CounterInc("resize_errors")
a.Logger.Warn("resize failed", "target", target, "err", err)
// Don't update cooldown; retry next tick
return
}
}
Autoscaler panic¶
func (a *Autoscaler) Run(ctx context.Context) {
defer func() {
if r := recover(); r != nil {
a.Logger.Error("autoscaler panicked", "panic", r, "stack", string(debug.Stack()))
// Restart? Alert? Depend on policy.
}
}()
a.runInner(ctx)
}
Signal collection failure¶
func (a *Autoscaler) safeSignal(s Signal) float64 {
defer func() {
if r := recover(); r != nil {
a.Logger.Warn("signal panic", "signal", s.Name(), "panic", r)
}
}()
return s.Value()
}
Closed pool¶
type Pool interface {
Resize(int) error
}
var ErrPoolClosed = errors.New("pool closed")
// Autoscaler treats this as terminal:
case errors.Is(err, ErrPoolClosed):
a.Logger.Info("pool closed; stopping autoscaler")
return
Performance Tips¶
- Keep decision functions pure; they stay cheap and need no caches or locks.
- Avoid logging at info on every tick if tick rate is high; sample.
- Histograms over sort for p99 (O(buckets) vs O(n log n)).
- Use atomic operations; avoid mutex on hot paths.
- Sample signals; don't measure every task.
- Use time.Ticker, not time.After in a loop (time.After allocates a new timer each iteration).
- Avoid map iteration in decision functions; use struct fields.
- Profile under load. Look for unexpected GC pressure.
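The histogram tip above can be sketched as a fixed-bucket histogram: an approximate p99 in O(buckets) instead of O(n log n) for sorting raw samples. The bucket bounds are an assumption; tune them to your latency range.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Hist is a fixed-bucket histogram. Bounds are ascending upper bounds;
// an implicit final bucket catches overflow.
type Hist struct {
	bounds []float64
	counts []int
	total  int
}

func NewHist(bounds []float64) *Hist {
	return &Hist{bounds: bounds, counts: make([]int, len(bounds)+1)}
}

func (h *Hist) Record(v float64) {
	i := sort.SearchFloat64s(h.bounds, v) // first bound >= v
	h.counts[i]++
	h.total++
}

// Quantile returns the upper bound of the bucket containing rank q.
func (h *Hist) Quantile(q float64) float64 {
	rank := int(q * float64(h.total))
	cum := 0
	for i, c := range h.counts {
		cum += c
		if cum > rank {
			if i < len(h.bounds) {
				return h.bounds[i]
			}
			return math.Inf(1) // landed in the overflow bucket
		}
	}
	return 0
}

func main() {
	h := NewHist([]float64{1, 2, 5, 10}) // e.g. milliseconds
	for i := 0; i < 99; i++ {
		h.Record(0.5)
	}
	h.Record(7)
	fmt.Println(h.Quantile(0.5), h.Quantile(0.99)) // 1 10
}
```

The price is resolution: you learn "p99 is at most 10 ms", not the exact value, which is plenty for an autoscaler threshold.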
Best Practices¶
- Design as composable interfaces (Pool, Signal, Decider, Cooldown).
- Tune empirically. No "right" Kp for a PID. Measure.
- Always have bounds. Floor, ceiling, max-step-per-tick.
- Integrate with backpressure and circuit breakers from day 1.
- Cost-aware: every growth has a cost; weigh against benefit.
- Predictive when patterns are clear; reactive always.
- Multi-pool coordination via global budget.
- Workload model drives initial sizing; live model updates over time.
- Test deciders as pure functions.
- Document policy.
Edge Cases and Pitfalls¶
Pitfall: PID windup¶
Mentioned above. Integral term grows during saturation; overshoots after.
Pitfall: PID without anti-windup is unsafe¶
Combine Kp, Ki, Kd with limits on integral; pause integration during saturation.
Pitfall: AIMD with no minimum step¶
cur - int(float64(cur)*0.25) leaves small pools unchanged: the step truncates to 0 for cur < 4. Always enforce a minimum step of 1.
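A minimal sketch of the fix:

```go
package main

import "fmt"

// shrinkTarget applies multiplicative decrease with a floor of one
// worker per step, so small pools can still shrink (plain int
// truncation would give a step of 0 for cur < 1/factor).
func shrinkTarget(cur int, factor float64) int {
	step := int(float64(cur) * factor)
	if step < 1 {
		step = 1
	}
	return cur - step
}

func main() {
	fmt.Println(shrinkTarget(3, 0.25))   // 2, not 3
	fmt.Println(shrinkTarget(100, 0.25)) // 75
}
```

The floor/ceiling clamp still happens downstream; this only guarantees the step itself is nonzero.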
Pitfall: Multiple autoscalers fighting over global budget¶
If two pools' autoscalers both request budget at the same time, one wins, the other waits. If the winner releases later (shrink), the other could over-grab. Use a proper queue or distributed lock for the budget.
Pitfall: Predictive over-trust¶
Predictive autoscaler grows in anticipation; load doesn't arrive; pool is oversized. Always cap predictive growth at some multiple of reactive growth.
Pitfall: Cost-aware that misses SLO¶
Cost-aware decision says "don't grow"; SLO is breached. Add SLO veto: regardless of cost, grow to meet SLO.
Pitfall: Coupled feedback loops¶
Pool A grows → uses downstream X → X gets slow → Pool A's tasks slow → autoscaler grows Pool A more. Positive feedback. Defense: detect downstream sickness; veto growth.
Pitfall: Hysteresis with adaptive band that shrinks too quickly¶
Adaptive band updates each tick. If noise spikes briefly, band widens. If noise drops, band shrinks too fast, flapping resumes. Use smoothing on the band update itself.
Pitfall: Decider that returns target outside bounds¶
If clamp is only in the runner, a misbehaving decider can pass insane values. Validate inside the decider too (defensive).
Pitfall: Cooldown counted from completion, not from start¶
Tricky to get right. Start the cooldown clock when the resize finishes (after Resize returns), not when it was requested. If spawning many workers takes seconds, counting from the request makes the cooldown effectively shorter than intended.
Common Mistakes¶
- PID without anti-windup.
- AIMD that rounds to zero step.
- Multi-pool autoscalers without coordination.
- Predictive scaler that ignores reactive corrections.
- Hysteresis without cooldown (or vice versa).
- Composite decider with no merge strategy specified.
- Forgetting circuit breaker integration.
- Cost-aware that lets SLO breach to save money.
- Pluggable architecture so abstract that it is unreadable.
- Tests that mock so much they don't test anything real.
Common Misconceptions¶
- "PID is the best controller." Often overkill. Bang-bang + hysteresis is usually enough.
- "Predictive is better than reactive." Only if your model is good. Mostly use both.
- "Cost-aware means cheap." No — it means informed. Sometimes the answer is "spend more."
- "AIMD converges to optimum." Converges to a stable state, not necessarily optimum.
- "More pluggability is better." No — readability matters more than maximum flexibility.
Tricky Points¶
- AIMD's stability depends on the measurement function, not just the algorithm. Garbage in, garbage out.
- PID Kd is sensitive to noise; high Kd amplifies measurement noise into oscillation.
- Multi-pool budget must be released on shrink; missing this leaks budget.
- Predictive models drift; retrain periodically.
- Backpressure-pool coupling: changing pool size mid-burst can affect backpressure thresholds; coordinate them.
- The autoscaler tick is not the only signal; events (deploy, manual override) also trigger resize.
Test¶
Test the decider as a pure function¶
func TestAIMDDecider(t *testing.T) {
d := &AIMDDecider{GrowStep: 1, ShrinkFactor: 0.25, GrowPredicate: func(s Signals) bool {
return s.Util > 0.85
}, ShrinkPredicate: func(s Signals) bool {
return s.Util < 0.30
}}
cases := []struct {
cur int
sig Signals
want int
}{
{cur: 10, sig: Signals{Util: 0.9}, want: 11},
{cur: 100, sig: Signals{Util: 0.2}, want: 75},
{cur: 10, sig: Signals{Util: 0.5}, want: 10},
}
for _, c := range cases {
got, _ := d.Decide(c.cur, c.sig)
if got != c.want { t.Errorf("cur=%d sig=%v want=%d got=%d", c.cur, c.sig, c.want, got) }
}
}
Test integration with a fake pool¶
type FakePool struct {
size int
resizes []int
}
func (p *FakePool) Size() int { return p.size }
func (p *FakePool) Resize(n int) error {
p.resizes = append(p.resizes, n)
p.size = n
return nil
}
func TestAutoscalerLoop(t *testing.T) {
pool := &FakePool{size: 8}
sig := &FakeSignal{Value_: 0.9}
d := &AIMDDecider{...}
a := &Autoscaler{Pool: pool, Signals: []Signal{sig}, Decider: d, Interval: 5 * time.Millisecond, ...}
ctx, cancel := context.WithCancel(context.Background())
go a.Run(ctx)
time.Sleep(50 * time.Millisecond)
cancel()
if pool.size <= 8 { t.Errorf("expected grow, got %d", pool.size) }
}
Test stability under noise¶
func TestStabilityUnderNoise(t *testing.T) {
pool := &FakePool{size: 16}
sig := &NoisySignal{Mean: 0.5, Variance: 0.1}
d := &AIMDDecider{...}
a := &Autoscaler{...}
ctx, cancel := context.WithCancel(context.Background())
go a.Run(ctx)
time.Sleep(2 * time.Second)
cancel()
// Pool should not have resized more than a few times
if len(pool.resizes) > 10 {
t.Errorf("excessive resize under noise: %d", len(pool.resizes))
}
}
Tricky Questions¶
- AIMD vs MIAD: which one for a worker pool? AIMD almost always. MIAD overshoots and stays high.
- PID Kp too high: what happens? System oscillates around the set-point, possibly diverging.
- PID Ki too high: what happens? Slow oscillation; very slow recovery from disturbances.
- What is the role of Kd? Damps oscillation by responding to rate of change. Caveat: amplifies noise.
- How do you detect oscillation? Monitor a resizes-per-minute counter. More than 30/min usually means oscillation.
- What if Little's Law says you need 50 workers but you have 5 CPU cores? Two cases: CPU-bound (cannot usefully exceed cores) or I/O-bound (50 is fine). Diagnose with profiling.
- Why integrate the breaker with the autoscaler? To prevent positive feedback on downstream failure.
- Why cost-aware? Sometimes saving money matters more than latency. Make the trade explicit.
- What is anti-windup? Limiting the PID integral so it doesn't grow unboundedly during saturation.
- Predictive scaling: when does it pay off? When patterns are strong (time-of-day) and the prediction model is cheap. For random workloads: no.
Cheat Sheet¶
// AIMD
target = cur + 1 (grow)
target = cur - cur/4 (shrink)
// PID
err = setpoint - measured
integral += err * dt
deriv = (err - lastErr) / dt
output = Kp*err + Ki*integral + Kd*deriv
clamp(integral, -limit, +limit) // anti-windup
// Little's Law
workers = throughput * service_time
// Integration:
// Pool ↔ Backpressure (Submit returns ErrPoolFull)
// Pool ↔ Breaker (veto growth when open)
// Pool ↔ Limiter (limiter rate tracks pool capacity)
// Multi-pool ↔ Budget (global budget enforces limit)
// Pluggable arch:
// Pool, Signal, Decider, Cooldown, Bounds → Autoscaler
Self-Assessment Checklist¶
- I can implement AIMD with anti-windup-equivalent (minimum step)
- I can implement PID with anti-windup
- I can size a pool from Little's Law
- I can integrate breaker, limiter, backpressure with autoscaler
- I can architect a pluggable autoscaler with composable parts
- I can diagnose oscillation, retry storms, positive feedback
- I can build a predictive scaler with reactive fallback
- I can design cost-aware policies
- I can test the decider in isolation from real pool
- I can design multi-pool budget coordination
Summary¶
Junior taught mechanism. Middle taught policy. Senior teaches systems thinking.
Big themes:
- AIMD: stable algorithm borrowed from TCP
- PID: control-theory technique; sometimes overkill
- Little's Law: analytical baseline for sizing
- Integration: the autoscaler doesn't live alone; combine it with backpressure, breakers, limiters
- Pluggable architecture: composable, testable, swappable parts
- Cost-aware and predictive: layered on top of reactive
- Multi-pool coordination: global budget, fairness
Professional level zooms in on production internals: ants/tunny/pond, distributed coordination, capacity planning math, and the deep operational knowledge you need at scale.
What You Can Build¶
- A production autoscaler with composable signals, deciders, and cooldowns
- A multi-pool service with global budget and fairness
- A predictive scaler with time-of-day patterns and reactive correction
- A cost-aware autoscaler that respects budgets
- A PID-driven utilization controller for tight latency control
Further Reading¶
- "TCP/IP Illustrated" — Stevens, on AIMD
- "Process Dynamics and Control" — for PID
- "Site Reliability Engineering" — Google, chapter on autoscaling
- "Designing Data-Intensive Applications" — Kleppmann, on backpressure
- Papers on AWS Auto Scaling Group internals
- Linux kernel's CPU governor (similar control loop ideas)
Related Topics¶
- Backpressure patterns
- Circuit breaker pattern
- Rate limiting
- Capacity planning
- Distributed coordination
Deep Dive: Building a PID Controller for a Worker Pool¶
We have introduced PID. Let us implement and tune one for a real worker pool.
Target¶
Keep worker utilization at 70%. When utilization drifts up, grow. When it drifts down, shrink. Smooth, no oscillation.
Code¶
package main
import (
"context"
"math"
"sync/atomic"
"time"
)
type PIDController struct {
Kp, Ki, Kd float64
Setpoint float64
integral float64
lastError float64
lastTime time.Time
primed bool
// anti-windup
IntegralMin, IntegralMax float64
}
func (p *PIDController) Step(measured float64, now time.Time) float64 {
err := p.Setpoint - measured
if !p.primed {
p.primed = true
p.lastError = err
p.lastTime = now
return p.Kp * err
}
dt := now.Sub(p.lastTime).Seconds()
if dt <= 0 {
return p.Kp * err
}
p.integral += err * dt
if p.integral > p.IntegralMax { p.integral = p.IntegralMax }
if p.integral < p.IntegralMin { p.integral = p.IntegralMin }
derivative := (err - p.lastError) / dt
p.lastError = err
p.lastTime = now
return p.Kp*err + p.Ki*p.integral + p.Kd*derivative
}
func (p *PIDController) Reset() {
p.integral = 0
p.lastError = 0
p.primed = false
}
Wiring into a pool¶
type PIDAutoscaler struct {
Pool *Pool
Pid *PIDController
Interval time.Duration
Floor int
Ceiling int
Util func() float64
}
func (a *PIDAutoscaler) Run(ctx context.Context) {
ticker := time.NewTicker(a.Interval)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case now := <-ticker.C:
u := a.Util()
control := a.Pid.Step(u, now)
cur := a.Pool.Size()
// Interpret control as a fractional delta in pool size
target := cur + int(math.Round(control * float64(cur) / 10))
if target < a.Floor { target = a.Floor }
if target > a.Ceiling { target = a.Ceiling }
if target != cur {
a.Pool.Resize(target)
}
}
}
}
The control output is multiplied by cur/10 so the delta is proportional to current size — larger pools change in larger steps.
Tuning¶
Starting values:
- Kp = 10 (proportional gain)
- Ki = 0.5 (small integral)
- Kd = 1.0 (small derivative)
- Setpoint = 0.7
Run; observe behavior. If overshoots: reduce Kp. If steady-state error: increase Ki. If oscillates: reduce Kd or Kp.
Ziegler-Nichols method:
- Set Ki = Kd = 0.
- Increase Kp until sustained oscillation. Call this critical Kp = Ku, period T_u.
- Set Kp = 0.6 Ku, Ki = 1.2 Ku / T_u, Kd = 0.075 Ku T_u.
This is a starting point; manual tuning usually follows.
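The classic Ziegler-Nichols gains follow mechanically from the measured Ku and T_u, so they are worth encoding. A one-function sketch (the Ku and Tu values in main are hypothetical):

```go
package main

import "fmt"

// zieglerNichols computes classic PID gains from the critical gain Ku
// (where sustained oscillation starts) and the oscillation period Tu.
func zieglerNichols(ku, tu float64) (kp, ki, kd float64) {
	return 0.6 * ku, 1.2 * ku / tu, 0.075 * ku * tu
}

func main() {
	// Suppose the open-loop experiment found Ku = 20 and Tu = 8s.
	kp, ki, kd := zieglerNichols(20, 8)
	fmt.Printf("%.2f %.2f %.2f\n", kp, ki, kd) // 12.00 3.00 12.00
}
```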
Real-world challenge: integral windup¶
Imagine util is consistently 0.95 (above setpoint of 0.7). The error is -0.25 each tick. Integral grows: -0.25, -0.5, -0.75, ... The control output becomes very negative; the autoscaler aggressively shrinks. But wait — we want to grow, not shrink, when util is high.
Sign error! For a worker pool, higher utilization should mean grow, but with err = Setpoint - measured, high utilization produces a negative error and a shrink. The error should be measured - setpoint = +0.25.
Let me fix: in Step, compute err := measured - p.Setpoint. The proportional, integral, and derivative terms all use err, so the flip is consistent.
With this fix:
- High util → positive error → positive control → grow ✓
- Low util → negative error → negative control → shrink ✓
Now integral windup: if we are at ceiling and can't grow, error stays positive, integral keeps growing. Anti-windup helps: pause integration when at ceiling.
if cur >= a.Ceiling && control > 0 {
	// pause integration during saturation in the grow direction.
	// err and dt live inside the controller, so expose a small
	// method (name hypothetical) that undoes this tick's increment
	// rather than reaching into unexported fields:
	a.Pid.UndoLastIntegration()
}
This is the operational reality of PID. The math is straightforward; the engineering is the corner cases.
When PID is genuinely useful¶
For a worker pool, PID becomes the right choice when:
- You care about precise utilization (e.g., maintaining 70% exactly for cost reasons)
- The signal has measurable trends that PID can follow
- You have time to tune carefully
- The system is slow enough that PID's smoothness helps
For most pools, AIMD or simple threshold suffices. PID is a senior-level capability you may or may not need.
Deep Dive: Implementing AIMD with Anti-Hysteresis¶
Standard AIMD oscillates slightly around the optimal pool size. Sometimes this is fine; sometimes it isn't. Anti-hysteresis is a small modification that reduces oscillation amplitude.
The idea¶
After a multiplicative shrink, the pool is smaller than optimal. Reactive grow brings it back. The cycle repeats; pool oscillates.
If we remember recent oscillations, we can grow back to the post-shrink size faster, or hold longer before shrinking again.
type AntiHystAIMD struct {
GrowStep int
ShrinkFactor float64
History []int // recent target sizes
}
func (a *AntiHystAIMD) Decide(cur int, sig Signals) int {
a.History = append(a.History, cur)
if len(a.History) > 20 {
a.History = a.History[1:]
}
switch {
case shouldGrow(sig):
// Grow toward median of recent history, faster
med := median(a.History)
if med > cur {
return min(cur+a.GrowStep*2, med)
}
return cur + a.GrowStep
case shouldShrink(sig):
return cur - max(1, int(float64(cur)*a.ShrinkFactor))
}
return cur
}
When the pool oscillates around a steady size, the median of history captures that. Growing toward the median is faster than naive additive.
This is a heuristic; production behavior depends on workload. It is most useful for workloads where the optimal size is stable.
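The median helper assumed by AntiHystAIMD might look like this. It works on a copy so the history slice keeps its chronological order.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle element of the sorted copy of xs
// (upper-middle for even lengths); 0 for an empty slice.
func median(xs []int) int {
	if len(xs) == 0 {
		return 0
	}
	s := append([]int(nil), xs...) // copy; don't reorder the history
	sort.Ints(s)
	return s[len(s)/2]
}

func main() {
	fmt.Println(median([]int{12, 9, 11, 12, 10})) // 11
}
```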
Trade-off¶
- Pro: less oscillation, faster recovery from shrink
- Con: slower to adapt when the actual right size changes
For workloads with stable optima, anti-hysteresis is a win. For workloads where the right size varies, it can be worse than pure AIMD.
Deep Dive: Time-Series Forecasting for Predictive Scaling¶
Predictive scaling needs a forecast. Let us look at three approaches.
Linear extrapolation¶
Simplest. Take recent samples; fit a line; extrapolate.
type LinearForecast struct {
samples []point // (time, value)
}
type point struct {
t time.Time
v float64
}
func (f *LinearForecast) Add(t time.Time, v float64) {
f.samples = append(f.samples, point{t, v})
if len(f.samples) > 100 { f.samples = f.samples[1:] }
}
func (f *LinearForecast) Predict(at time.Time) float64 {
if len(f.samples) < 2 { return 0 }
// least-squares regression on (t, v)
var sumT, sumV, sumTV, sumTT float64
base := f.samples[0].t.Unix()
for _, p := range f.samples {
t := float64(p.t.Unix() - base)
sumT += t
sumV += p.v
sumTV += t * p.v
sumTT += t * t
}
n := float64(len(f.samples))
m := (n*sumTV - sumT*sumV) / (n*sumTT - sumT*sumT)
b := (sumV - m*sumT) / n
tt := float64(at.Unix() - base)
return m*tt + b
}
Good for short horizons (next 1-2 minutes). Fails for non-linear trends.
Exponential smoothing (Holt-Winters)¶
Captures trend and seasonality. Built-in to many time-series libraries.
type HoltWinters struct {
alpha, beta, gamma float64
season int // length of seasonal period
level, trend float64
seasonal []float64
}
Implementation is involved; libraries like github.com/montanaflynn/stats or a custom port of statsmodels can help.
Useful for daily patterns: predict tomorrow morning's load based on the last 7 mornings.
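Before reaching for full Holt-Winters, the trend-only variant (Holt's double exponential smoothing) shows the shape of the update; seasonality adds a third smoothed component on top of the same pattern. A minimal sketch:

```go
package main

import "fmt"

// Holt is trend-only exponential smoothing. Alpha smooths the level,
// Beta smooths the trend; full Holt-Winters adds a seasonal term.
type Holt struct {
	Alpha, Beta  float64
	level, trend float64
	primed       bool
}

func (h *Holt) Observe(v float64) {
	if !h.primed {
		h.level, h.primed = v, true
		return
	}
	prev := h.level
	h.level = h.Alpha*v + (1-h.Alpha)*(h.level+h.trend)
	h.trend = h.Beta*(h.level-prev) + (1-h.Beta)*h.trend
}

// Forecast extrapolates the smoothed level along the smoothed trend.
func (h *Holt) Forecast(steps int) float64 {
	return h.level + float64(steps)*h.trend
}

func main() {
	h := &Holt{Alpha: 0.5, Beta: 0.5}
	for v := 1.0; v <= 50; v++ { // a perfectly linear ramp in load
		h.Observe(v)
	}
	fmt.Printf("%.1f\n", h.Forecast(1)) // close to 51
}
```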
Neural networks¶
Overkill for most worker pools, but used in large-scale systems. LSTM or transformer trained on time-series of pool load.
The cost: training infrastructure, model serving, model staleness handling. The benefit: better predictions for complex non-linear workloads.
For 95% of services, linear or exponential smoothing is enough. Reach for neural networks only if the workload truly demands it.
Deep Dive: Multi-Pool Budget Implementation¶
We sketched multi-pool budgets earlier. Now a complete implementation.
type Budget struct {
	mu    sync.Mutex
	total int
	used  int
}
func NewBudget(total int) *Budget {
	return &Budget{total: total}
}
func (b *Budget) Request(want int) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	avail := b.total - b.used
	if avail >= want {
		b.used += want
		return want
	}
	granted := avail
	b.used += granted
	return granted // partial grant
}
func (b *Budget) Release(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= n
	if b.used < 0 {
		b.used = 0
	}
	// blocking callers would need a wait queue here; this version
	// relies on autoscalers simply re-requesting on their next tick
}
func (b *Budget) Available() int {
b.mu.Lock()
defer b.mu.Unlock()
return b.total - b.used
}
Using it¶
type BudgetedAutoscaler struct {
Pool *Pool
Budget *Budget
Bounds Bounds
}
func (a *BudgetedAutoscaler) growBy(want int) int {
granted := a.Budget.Request(want)
if granted == 0 {
return 0
}
cur := a.Pool.Size()
target := cur + granted
if target > a.Bounds.Max {
// give back the extras
a.Budget.Release(target - a.Bounds.Max)
target = a.Bounds.Max
}
a.Pool.Resize(target)
return granted
}
func (a *BudgetedAutoscaler) shrinkBy(n int) {
cur := a.Pool.Size()
target := cur - n
if target < a.Bounds.Min {
n = cur - a.Bounds.Min
target = a.Bounds.Min
}
a.Pool.Resize(target)
a.Budget.Release(n)
}
Each autoscaler asks the budget. The total across pools is bounded.
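Exercising the budget: two pools competing for 15 workers of global headroom. This is a standalone copy of the budget sketch above so it runs on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// Budget hands out worker headroom, granting partially when the
// request exceeds what remains.
type Budget struct {
	mu          sync.Mutex
	total, used int
}

func (b *Budget) Request(want int) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	granted := want
	if avail := b.total - b.used; avail < want {
		granted = avail // partial grant
	}
	b.used += granted
	return granted
}

func (b *Budget) Release(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= n
	if b.used < 0 {
		b.used = 0
	}
}

func main() {
	b := &Budget{total: 15}
	fmt.Println(b.Request(10)) // 10: full grant
	fmt.Println(b.Request(10)) // 5: partial grant, budget exhausted
	b.Release(5)
	fmt.Println(b.Request(10)) // 5: only what the release freed
}
```

The second requester gets a partial grant rather than blocking; its autoscaler grows by what it got and asks again next tick.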
Priority budget¶
Some pools are more important. High-priority pools should get budget before low-priority.
type PriorityBudget struct {
mu sync.Mutex
total int
used int
perPool map[string]*PoolBudget
}
type PoolBudget struct {
name string
priority int
used int
softCap int
}
func (b *PriorityBudget) Request(name string, want int) int {
b.mu.Lock()
defer b.mu.Unlock()
pb := b.perPool[name]
avail := b.total - b.used
// Higher-priority pools may exceed soft cap by reclaiming from lower
if avail < want && pb.priority > 0 {
for _, other := range b.perPool {
if other.priority < pb.priority && other.used > 0 {
reclaim := min(other.used, want - avail)
other.used -= reclaim
b.used -= reclaim // reclaimed capacity returns to the global pool
avail += reclaim
if avail >= want { break }
}
}
}
granted := min(want, avail)
pb.used += granted
b.used += granted
return granted
}
Reclaim from lower-priority pools when needed. The lower-priority pools see their budget reduced and must shrink. This is how Kubernetes resource preemption works at cluster scale.
Deep Dive: Workload Models in Detail¶
A workload model is the basis for sizing. Let us build one.
Components¶
type WorkloadModel struct {
ArrivalRate *EWMA // tasks per second, exponentially smoothed
ServiceTime *EWMA // mean processing time
ServiceTimeP99 *HDR // p99 processing time
BurstFactor float64 // peak / mean ratio
Concurrency *EWMA // observed concurrency
}
func (m *WorkloadModel) Update(arrival float64, service time.Duration) {
m.ArrivalRate.Add(arrival)
m.ServiceTime.Add(service.Seconds())
m.ServiceTimeP99.Record(service)
}
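The EWMA used throughout the model is not shown elsewhere; a minimal sketch. Alpha near 1 tracks quickly; near 0 smooths heavily.

```go
package main

import "fmt"

// EWMA is an exponentially weighted moving average.
type EWMA struct {
	Alpha  float64
	v      float64
	primed bool
}

func (e *EWMA) Add(x float64) {
	if !e.primed {
		e.v, e.primed = x, true
		return
	}
	e.v = e.Alpha*x + (1-e.Alpha)*e.v
}

func (e *EWMA) Value() float64 { return e.v }

func main() {
	rate := &EWMA{Alpha: 0.2}
	for _, x := range []float64{100, 100, 100, 200} { // a burst arrives
		rate.Add(x)
	}
	fmt.Println(rate.Value()) // 120: the spike is damped
}
```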
Deriving sizes¶
func (m *WorkloadModel) Steady() int {
// Little's Law with 30% headroom
return int(m.ArrivalRate.Value() * m.ServiceTime.Value() * 1.3)
}
func (m *WorkloadModel) PeakCapacity() int {
// For p99 latency-sensitive work, size for peak * p99
return int(m.ArrivalRate.Value() * m.BurstFactor * m.ServiceTimeP99.Quantile(0.99))
}
func (m *WorkloadModel) Floor() int {
// 30% of mean traffic
return int(m.ArrivalRate.Value() * 0.3 * m.ServiceTime.Value())
}
func (m *WorkloadModel) Ceiling() int {
// 2× peak as safety margin
return int(m.ArrivalRate.Value() * m.BurstFactor * 2 * m.ServiceTime.Value())
}
These are model-driven autoscaler bounds. As the workload changes, the model updates; the autoscaler's bounds shift.
Live re-tuning¶
If the workload changes character (service time doubles after a code deploy), the model adapts. Within minutes, sizes shift to match.
func (a *Autoscaler) tick() {
// refresh bounds from model
a.Bounds.Min = a.Model.Floor()
a.Bounds.Max = a.Model.Ceiling()
// ... normal decide ...
}
Caveat: live re-tuning can introduce instability if the model itself is noisy. Smooth the model's outputs (EWMA on top of EWMA). Alert on big changes.
Deep Dive: Production Architecture for a Multi-Service Autoscaler¶
We have built autoscalers for one pool. In a real service, you have many pools across many subsystems. How do you architect this?
The autoscaler manager¶
type AutoscalerManager struct {
autoscalers map[string]*Autoscaler
budget *Budget
metrics MetricsSink
logger *slog.Logger
}
func (m *AutoscalerManager) Register(name string, a *Autoscaler) {
m.autoscalers[name] = a
}
func (m *AutoscalerManager) Run(ctx context.Context) {
for name, a := range m.autoscalers {
go func(name string, a *Autoscaler) {
defer func() {
if r := recover(); r != nil {
m.logger.Error("autoscaler panicked", "name", name, "panic", r)
}
}()
a.Run(ctx)
}(name, a)
}
<-ctx.Done()
}
Each autoscaler runs in its own goroutine. The manager owns the lifecycle. Single panic does not crash the whole system.
Reload-on-config-change¶
Production: ops want to retune thresholds without redeploying. Provide a reload API:
func (m *AutoscalerManager) Reload(name string, policy Policy) error {
a, ok := m.autoscalers[name]
if !ok { return ErrNotFound }
return a.UpdatePolicy(policy)
}
func (a *Autoscaler) UpdatePolicy(p Policy) error {
a.policyMu.Lock()
defer a.policyMu.Unlock()
a.policy = p
return nil
}
HTTP endpoint or gRPC method. Live tune. Useful during incidents.
Centralized observability¶
All autoscaler events flow to a single sink:
type Event struct {
Autoscaler string
Type string // "resize_up", "resize_down", "veto", "error"
From, To int
Reason string
Signals map[string]float64
Time time.Time
}
func (m *AutoscalerManager) emit(e Event) {
m.metrics.CounterInc("autoscaler_events")
m.logger.Info("autoscaler event", "name", e.Autoscaler, "type", e.Type, /* ... */)
// also: write to durable store for forensic analysis
}
When an incident happens, ops queries: "show me all autoscaler events in the last 30 minutes." A central event log makes this trivial.
Health and self-monitoring¶
type Health struct {
LastTickAt time.Time
SignalErrs int64
ResizeErrs int64
StuckAtCeil time.Duration
}
func (m *AutoscalerManager) Health(name string) Health {
a := m.autoscalers[name]
return a.health
}
Expose via HTTP /healthz/autoscalers. Alert on:
- time.Since(LastTickAt) > 10 * interval (autoscaler is hung)
- SignalErrs > 0 (signal source failing)
- StuckAtCeil > 5 minutes (need to bump ceiling)
Autoscaler observability is as important as service observability.
Deep Dive: Race Conditions in Multi-Pool Coordination¶
Multi-pool coordination introduces subtle races. Let us look at the common ones.
Race 1: budget overcommit¶
Two autoscalers call Budget.Request(10) simultaneously. Budget has 15 available.
If the budget is not properly locked:
- Both read avail = 15
- Both write used += 10
- used = 20, total = 15: overcommit
Fix: mutex on budget. Atomic check-and-modify is also possible but harder to get right with multiple values.
Our Budget implementation above already holds the mutex across the whole check-and-modify, so this overcommit cannot happen.
Race 2: shrink-release lag¶
Autoscaler A shrinks by 5 and calls Budget.Release(5). Autoscaler B requests budget 0.1 ms later. If B reads avail before A's release lands, B sees stale data and is denied capacity that is actually free.
Fix: never read avail outside the lock; hold the lock through the whole check-and-modify (or use an atomic compare-and-swap).
In our implementation, we hold the lock through the decision. Race avoided.
Race 3: priority preemption while in flight¶
High-priority pool wants to reclaim budget. Low-priority pool is mid-resize. If we preempt, the low-priority resize may complete with surprising state.
Defense: priority preemption only changes the budget, not the pool size directly. The low-priority pool's autoscaler observes its budget shrink and shrinks itself on next tick.
This is eventually consistent — the low-priority pool catches up over time. Correct as long as overcommit windows are short.
Race 4: lifecycle race¶
Autoscaler A is registering with the manager. Autoscaler B is shutting down. Manager's iteration over autoscalers map might see both, or neither, or one.
Fix: register/deregister via channel; manager has a single goroutine doing map mutations.
type Manager struct {
register chan registerReq
deregister chan string
autoscalers map[string]*Autoscaler
}
func (m *Manager) Run(ctx context.Context) {
for {
select {
case <-ctx.Done():
return
case r := <-m.register:
m.autoscalers[r.name] = r.a
case n := <-m.deregister:
delete(m.autoscalers, n)
}
}
}
Single goroutine owns the map. Public methods send on channels. Race-free.
Deep Dive: Diagnostics for Autoscaler Issues¶
When an autoscaler misbehaves, here is the diagnostic playbook.
Symptom: pool flapping¶
Definition: pool size changes more than 30 times per minute.
Diagnose:
1. Plot signal value over time. Does it cross thresholds repeatedly?
2. Plot resize events. Pattern of up-down-up-down?
3. Check deadband. Are thresholds too close?
4. Check cooldown. Is it less than the tick rate?
Fix: widen deadband, lengthen cooldown, smooth signal more.
Symptom: pool stuck at ceiling¶
Definition: pool at max for > 5 minutes despite ongoing high signal.
Diagnose:
1. Verify the signal is actually high (not an autoscaler bug).
2. Check workload: has throughput or service time changed?
3. Check downstream: is downstream the bottleneck?
4. Is the ceiling correct?
Fix:
- If workload exceeded the model: bump the ceiling, alert.
- If downstream is the bottleneck: scale downstream (a different system).
- If it's a signal bug: fix it.
Symptom: pool stays at floor¶
Definition: pool at floor for hours despite some load.
Diagnose:
1. Is the signal actually low? (verify with raw measurements)
2. Is the floor too high?
3. Is the autoscaler running? (last-tick timestamp)
4. Is the shrink cooldown too short, causing instant shrink after any tiny grow?
Fix: lower floor, fix autoscaler, lengthen shrink cooldown.
Symptom: latency spikes despite autoscaling¶
Definition: p99 wait > SLO for sustained periods.
Diagnose:
1. Is the pool at ceiling?
2. Is the autoscaler reacting fast enough? (up cooldown)
3. Is the signal the right one? (e.g., depth misses slow downstream)
4. Is processing time degraded?
Fix: bump ceiling, shorten up cooldown, switch signal, address processing time root cause.
Symptom: high cost¶
Definition: pool consistently at high size during low traffic.
Diagnose:
1. Is the autoscaler shrinking at all?
2. Is the shrink cooldown too long?
3. Is the shrink threshold too low (only triggers at very low signal)?
4. Are workers idle but not exiting? (idle timeout misconfigured)
Fix: adjust shrink config, shorten cooldown, tune idle timeout.
Tooling¶
A small CLI for diagnostics:
$ poolcli status
Pool Size Target P99Wait Util Resizes/min
api 24 24 120ms 0.65 2
email 8 8 30ms 0.20 0
webhook 45 50 800ms 0.92 12 <-- flapping?
$ poolcli flap-detector api 5m
Pool 'api' resized 87 times in last 5 minutes. Likely flapping.
Recent resizes:
12:34:01 up 8 → 12 reason: wait p99 high
12:34:05 down 12 → 9 reason: mean wait low
12:34:09 up 9 → 12 reason: wait p99 high
...
Diagnosis: deadband too narrow (high=200ms, low=50ms is only 4×)
Recommended: widen to high=500ms, low=50ms (10×)
Tools like this are the difference between operating a dynamic pool calmly and being woken up at night.
Deep Dive: Capacity Planning Integration¶
The autoscaler operates within fixed bounds. The bounds come from capacity planning.
Capacity planning inputs¶
- Traffic forecast (next quarter, next year)
- Per-task resource consumption (CPU, memory, downstream calls)
- SLO targets (p99 latency budget)
- Cost constraints
Output: bounds¶
- Floor: enough for minimum expected traffic with headroom
- Ceiling: enough for peak forecast traffic with some safety margin
Capacity planning feedback loop¶
Each quarter, ops reviews autoscaler metrics:
- How often did we hit ceiling? If frequently, bump ceiling.
- How often did floor matter? If never, lower floor.
- What was peak resource consumption? Compare to forecast.
The autoscaler is the runtime adaptation; capacity planning sets the bounds. Together they cover short-term and long-term.
Multi-region capacity¶
Distribute load across regions. Each region has its own autoscaler. Cross-region coordination (rare) handled by a global controller.
type RegionAutoscaler struct {
Region string
Pool *Pool
Global *GlobalCoordinator
}
func (r *RegionAutoscaler) tick() {
decision := r.localDecision()
approved := r.Global.Approve(r.Region, decision)
r.Pool.Resize(approved)
}
The global coordinator looks at all regions' demand and capacity, decides allocation. Usually overkill for in-process pools; relevant for multi-cluster systems.
Deep Dive: Stability Analysis¶
Control theory has formal tools for analyzing stability. Let us apply some to a worker pool.
The linearized model¶
Assume small perturbations around steady state. Let n(t) be pool size, s(t) be signal. The autoscaler is a discrete-time controller that updates n based on s.
At steady state: n* = f(λ, W) where λ is arrival rate, W is service time.
Perturb: n(t) = n* + δn(t), s(t) = s* + δs(t). The autoscaler's response:
δn(t+1) = δn(t) + g · δs(t)
where g is the "gain": how strongly the autoscaler reacts to signal perturbation.
The signal-pool coupling: if the signal is utilization, more workers reduce utilization roughly proportionally (assuming constant arrival rate):
δs(t) = -k · δn(t - d)
where k is a constant and d is the lag between resize and observed signal change.
Substituting:
δn(t+1) = δn(t) - g · k · δn(t - d)
This is a delay difference equation. Its stability depends on g · k and d. The classical result: stable if g · k < 2 sin(π / (2(2d+1))).
For d = 1 (one tick lag): stable if g · k < 2 sin(π/6) = 1.
In practice:
- If your autoscaler reacts very aggressively (high gain), instability.
- If you have long feedback delays, even modest gain causes instability.
- If gain × delay product is too large, oscillation.
This is the math behind the practical advice "use small steps, slow ticks, hysteresis."
Stability margins¶
In practice, you don't compute analytically. You measure:
- Step response: change load suddenly; how does pool size respond?
- Frequency response: drive signal sinusoidally; observe gain at different frequencies.
- Phase margin: how much delay before instability?
Tools: open-loop testing, gradient-descent tuning, perturbation analysis.
For most engineers, the right approach is: implement a known-good shape (AIMD with hysteresis and cooldown), measure in production, and adjust empirically.
Discrete time and quantization¶
Pool sizes are integers. The autoscaler decides in continuous space (PID output) but rounds to integer. This quantization can cause limit cycles — small persistent oscillation that never settles.
Defenses:
- Dead zone: ignore tiny PID outputs.
- Pulse-width-modulation-like behavior: average over multiple ticks before actuating.
- Anti-jitter: filter the PID output through a low-pass before rounding.
In practice, most pools tolerate ±1 worker jitter happily. Rare to over-engineer this.
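The dead-zone defense fits in a few lines. A sketch:

```go
package main

import (
	"fmt"
	"math"
)

// applyDeadZone maps tiny PID outputs to "no change" before rounding,
// which suppresses the ±1-worker limit cycles quantization can cause.
func applyDeadZone(control, threshold float64) int {
	if math.Abs(control) < threshold {
		return 0
	}
	return int(math.Round(control))
}

func main() {
	fmt.Println(applyDeadZone(0.3, 0.5))  // 0: too small, ignored
	fmt.Println(applyDeadZone(1.6, 0.5))  // 2
	fmt.Println(applyDeadZone(-1.4, 0.5)) // -1
}
```

The threshold is a tuning knob: roughly "how big a suggested delta is worth acting on", often around half a worker.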
Deep Dive: Distributed Autoscaling Considerations¶
When the pool is in one process but the workload comes from a distributed system, the autoscaler interacts with cross-service dynamics.
Coordination via shared metric¶
Multiple instances of the same service each have their own pool. Each autoscaler decides independently. They may collectively overcommit.
Mitigation: each instance reports its pool size to a central metric store. The autoscaler reads its own metric and the cluster's total. If cluster total is high, individual instances are more conservative.
func (a *ClusterAwareAutoscaler) growBudget(localTarget int) int {
clusterTotal := a.metrics.Sum("worker_pool_size")
clusterMax := a.config.ClusterMax
if clusterTotal >= clusterMax {
return a.Pool.Size() // don't grow
}
return localTarget
}
This is loose coordination — eventually consistent. Acceptable when overcommit windows are short.
Strict coordination via lease¶
For strict bounds, use a distributed lease. Each pool acquires a lease of N workers from a central authority. Lease has a TTL.
type LeaseAutoscaler struct {
pool *Pool
lease *Lease // implements Acquire(n), Release(n), TTL
}
func (a *LeaseAutoscaler) growBy(n int) error {
if !a.lease.Acquire(n) {
return ErrNoBudget
}
a.pool.Resize(a.pool.Size() + n)
return nil
}
Lease renewal happens periodically. If a process crashes, its lease expires, capacity is freed. This is more complex but provides strict bounds.
In practice, Kubernetes HPA + per-pod soft caps cover most needs.
Heterogeneous instances¶
Different machines have different capacity (CPU, memory). The autoscaler should adapt to local capacity.
func (a *Autoscaler) localCeiling() int {
cpus := runtime.NumCPU()
mem := availableMemory()
cpuCap := cpus * a.cpuFactor // e.g., 50 workers per core
memCap := int(mem / a.memPerWorker)
return min(cpuCap, memCap)
}
Heterogeneous fleets need per-instance ceilings. Don't use a uniform ceiling across all instances.
Deep Dive: Real Production Stack¶
How does dynamic worker scaling fit into a complete production stack? Let us trace one.
The setup¶
A service handles 10k req/s. It has:
- HTTP server (gorilla/mux)
- Worker pool for async tasks
- Database connection pool
- HTTP client to two downstream services
- Redis cache
Worker pool details¶
type Service struct {
pool *Pool // size 16 → 256
db *sql.DB
httpClient *http.Client
redis *redis.Client
breaker *gobreaker.CircuitBreaker
limiter *rate.Limiter
}
Autoscaler config¶
pool:
initial: 32
floor: 16
ceiling: 256
queue_buffer: 4096
autoscaler:
interval: 500ms
decider:
type: composite
parts:
- type: wait_p99_threshold
high: 500ms
cooldown: 3s
- type: util_target
setpoint: 0.7
kp: 5
cooldown: 5s
- type: breaker_veto
states: [open, half_open]
- type: error_rate_veto
threshold: 0.05
cooldown:
up: 3s
down: 60s
bounds:
min: 16
max: 256
Interaction flow¶
Request arrives:
- HTTP server accepts, hands to handler
- Rate limiter checks token bucket; if depleted, return 429
- Handler prepares task, calls pool.Submit(ctx, task)
- Pool's Submit tries to enqueue; if full, returns ErrPoolFull → handler returns 503
- Worker picks up task, calls breaker
- Breaker checks state; if open, fail fast → task error, return to pool
- Downstream call via httpClient
- DB call via sql.DB connection pool
- Cache read/write
- Task completes, worker is free
Meanwhile, autoscaler:
- Every 500 ms, samples wait p99, util, breaker state, error rate
- If wait p99 > 500ms and breaker closed: grow
- If util < 0.4 sustained and queue < 10%: shrink (slow)
- If breaker open: veto growth
- If error rate > 5%: veto growth
Observability¶
Every component emits metrics. Prometheus scrapes. Grafana dashboards:
- Pool dashboard: size, queue, wait, process
- Autoscaler dashboard: signals, resize events, decision reasons
- Breaker dashboard: state, transitions
- Limiter dashboard: token rate, throttled rate
- DB dashboard: connection pool, query rate
Alerts:
- Pool at ceiling > 5 min
- Resize/min > 30 (flapping)
- Breaker open > 1 min
- Limiter throttle rate > 1%
- Wait p99 > SLO
Operations¶
- On-call gets paged on alerts
- Runbook: which dashboard to check first, what to look for, common fixes
- Live tuning via HTTP endpoint
- Capacity review weekly: re-check ceilings, re-tune thresholds
This is one cohesive system. The worker pool is one component; it depends on careful integration with the rest.
Deep Dive: Anti-Patterns in Production Autoscaling¶
Beyond the basic mistakes from junior/middle, here are senior-level anti-patterns.
Anti-pattern: autoscaler that learns too eagerly¶
A model-based autoscaler updates its workload model on every tick. A traffic anomaly (DDoS, deploy gone wrong) is mistaken for "new normal." Model becomes corrupted; autoscaler decisions worsen.
Defense: smooth model updates aggressively; run anomaly detection on signals before they update the model; reset the model when an incident is detected.
Anti-pattern: cascading autoscalers¶
Pool A autoscales based on signal from pool B. Pool B autoscales based on signal from pool C. C autoscales based on its load. A change in C propagates through B to A — but with cumulative delays. Result: ringing.
Defense: avoid coupled signals. Each pool autoscales on its own load. Cross-pool coordination via budget, not signal.
Anti-pattern: autoscaling that hides bugs¶
Autoscaler grows the pool. Throughput increases. The bug (slow code path) is masked. Engineers don't fix it; autoscaler is doing its job.
Defense: monitor per-task metrics, not just aggregate. Tail latency at constant pool size should be tracked. Autoscaling is for variable demand, not as compensation for slow code.
Anti-pattern: autoscaling that fights deploys¶
During a deploy, instances roll. New instance starts at floor; old instance shrinks during drain. The autoscaler on the new instance reacts to startup spike: grows wildly. Old instance shrinks too aggressively (queue is empty).
Defense: special handling at startup (warm-up period before autoscaler kicks in). Special handling at drain (don't shrink while draining; freeze size during drain).
type Autoscaler struct {
    warmupComplete chan struct{}
    drainStarted   chan struct{}
}
func (a *Autoscaler) Run(ctx context.Context) {
    // Don't make decisions until warm-up finishes.
    select {
    case <-a.warmupComplete:
    case <-ctx.Done():
        return
    }
    for {
        // Freeze size once drain begins.
        select {
        case <-a.drainStarted:
            <-ctx.Done()
            return
        case <-ctx.Done():
            return
        default:
        }
        // ... one tick of the normal loop ...
    }
}
Anti-pattern: tuning by superstition¶
"Last week the autoscaler thrashed; let me lower the cooldown." But the actual fix was widening the deadband. Engineer makes the change; thrash returns; engineer makes another change. The whole config drifts.
Defense: each tuning change has a written hypothesis and a measurable success criterion. Track changes over time. Version control the config.
Anti-pattern: ignoring "boring" pools¶
Most pools are not on the critical path. The team focuses on the busy ones. The boring pool gets a default config that is wrong. One day, the boring pool becomes critical (a customer change) and breaks.
Defense: every pool gets the same treatment. Bounds set by capacity planning. Alerts on basic health. Periodic audit.
Deep Dive: Building a Cost Model¶
Cost-aware scaling needs a cost model. Let us build one for AWS EC2.
Inputs¶
- Instance type (determines CPU, memory)
- Instance hourly cost
- Memory per worker
- Downstream cost per call (e.g., S3 GET: $0.0004 per 1000)
- SLO penalty (cost of missing latency target)
The model¶
type CostModel struct {
InstanceHourly float64 // dollars
MemPerWorker int64 // bytes
CallsPerTask int // downstream calls
CostPerCall float64 // dollars
LatencyPenaltyMs float64 // dollars per ms over SLO per request
SLO time.Duration
}
func (m *CostModel) PerWorkerHourly(instanceMem int64) float64 {
    // Fraction of the instance one worker occupies, by memory.
    fraction := float64(m.MemPerWorker) / float64(instanceMem)
    return m.InstanceHourly * fraction
}
func (m *CostModel) TaskCost(downstreamCalls int) float64 {
return float64(downstreamCalls) * m.CostPerCall
}
func (m *CostModel) LatencyPenalty(observedP99 time.Duration) float64 {
excess := observedP99 - m.SLO
if excess <= 0 { return 0 }
return m.LatencyPenaltyMs * float64(excess.Milliseconds())
}
Cost-aware decision¶
func (a *CostAutoscaler) Decide(cur int, sig Signals) int {
currentCost := a.Model.PerWorkerHourly(a.totalMem) * float64(cur)
penaltyNow := a.Model.LatencyPenalty(sig.P99)
// What if we grew by 1?
if sig.P99 > a.Model.SLO {
growCost := a.Model.PerWorkerHourly(a.totalMem)
// Estimate p99 improvement (heuristic)
expectedNewP99 := sig.P99 * time.Duration(cur) / time.Duration(cur+1)
newPenalty := a.Model.LatencyPenalty(expectedNewP99)
if penaltyNow - newPenalty > growCost {
return cur + 1
}
}
// What if we shrunk by 1?
if sig.Util < 0.3 && cur > a.floor {
shrinkSaving := a.Model.PerWorkerHourly(a.totalMem)
// Estimate p99 degradation
expectedNewP99 := sig.P99 * time.Duration(cur) / time.Duration(cur-1)
newPenalty := a.Model.LatencyPenalty(expectedNewP99)
if shrinkSaving > newPenalty - penaltyNow {
return cur - 1
}
}
return cur
}
The cost-benefit comparison is explicit. Grow when latency penalty saved > grow cost. Shrink when grow cost saved > additional latency penalty.
Caveats¶
- The model is approximate. Real p99 vs pool size is non-linear.
- Penalty function is a business decision. What is a missed 100ms SLO worth?
- Cost vs latency may not be linear either.
In practice, cost-aware scaling is rare in worker pools. More common at cluster autoscaling (AWS ASG, Kubernetes HPA). For in-process pools, simple bounds usually suffice.
Deep Dive: Operational Playbook¶
When the autoscaler breaks, what do you do?
Step 1: stabilize¶
If the pool is misbehaving (oscillating, stuck), manual override:
Override the autoscaler for an hour and fix the pool size at 64. That buys time to investigate without traffic pressure.
Step 2: diagnose¶
Look at metrics:
- Resize events
- Signal values
- Cooldown state
- Recent config changes
Common causes:
- Signal source broken (e.g., Prometheus scrape failing)
- Config change (someone tweaked thresholds)
- Workload changed (new tenant, new code path)
- Downstream issue (cascading)
Step 3: hypothesize and test¶
Form a hypothesis. Test with a single tweak. Re-enable autoscaler. Observe.
If the hypothesis is wrong, return to override mode. Try again.
Step 4: document¶
Postmortem template:
- What happened?
- When did it start?
- What was the impact?
- What was the root cause?
- How was it diagnosed?
- How was it fixed?
- How can we prevent it?
The "prevent" section often produces autoscaler improvements: new alerts, new vetoes, refined thresholds.
Step 5: prevent¶
Add detection: an alert for the symptom you saw. Add a unit test if it can be reproduced. Update runbook.
This is the SRE feedback loop. Autoscaler ops is no different.
Deep Dive: Going Beyond Worker Pools¶
The same ideas apply to scaling many things, not just goroutines.
Scaling connection pools¶
A database connection pool with MaxOpenConns. Same autoscaler shape: signal (query queue depth), decision (grow/shrink), bounds. Most production setups don't dynamically scale connection pools (they have hard DB limits), but the math is the same.
Scaling consumer groups¶
A Kafka consumer group with N consumers per partition. As lag grows, add consumers. Autoscaler ticks every minute (Kafka rebalances take seconds; you don't want to thrash). Bounds: 1 to partition count.
Scaling Lambda concurrency¶
AWS Lambda has a per-function concurrency limit. The Lambda runtime autoscales executors within that limit. You configure the limit; the runtime handles the rest. The concepts (signal, cooldown, bounds) are the same.
Scaling thread pools (Java)¶
ThreadPoolExecutor with corePoolSize and maximumPoolSize. Same shape: a queue, a core size, growth on demand, idle expiry. Autoscaling Java thread pools uses identical control loops.
The lesson: dynamic worker scaling is a general control problem. Master it for goroutine pools, and you understand the family of related problems.
Deep Dive: A Reading List for Senior Practitioners¶
Beyond the basics, the senior-level practitioner should read:
- "Concrete Operations on Concurrent Systems" — academic paper, dense
- "Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks" — Chiu and Jain, 1989 (the original AIMD analysis)
- "Practical Control Theory for Engineers" — Levine
- "TCP Congestion Control Demystified" — RFC 5681 and beyond
- AWS Auto Scaling User Guide (the practitioner's guide, especially the scaling policies section)
- Kubernetes HPA documentation and source
- Brendan Burns's "Designing Distributed Systems"
- Heidi Howard's papers on consensus and coordination
- Henrique Gemignani's "Building Adaptive Systems"
For Go specifically:
- The ants source code in detail
- The sync and sync/atomic package documentation
- The Go runtime source (runtime/proc.go) for scheduler interaction
This list takes months. But it covers everything a senior engineer needs to design and operate dynamic scaling at scale.
Deep Dive: Tightly-Coupled vs Loosely-Coupled Components¶
Architecturally, where should the autoscaler logic live?
Tightly coupled¶
Autoscaler is part of the pool. pool.AutoEnable() turns it on. Pool exposes its signals internally.
Pros:
- Simple. One thing to deploy and manage.
- No extra moving parts.
Cons:
- Hard to test autoscaler logic in isolation.
- Hard to swap autoscaler policies.
- Pool code grows.
Loosely coupled¶
Autoscaler is a separate component. Holds a reference to the pool via a Resizer interface. Signals are injected.
Pros:
- Testable.
- Swappable.
- Pool code stays clean.
Cons:
- More components to manage.
- Easy to misconfigure (signal points to wrong pool, etc.).
Recommendation¶
For libraries (e.g., ants): tightly coupled, with hooks. Library provides the pool and Tune; user supplies the autoscaler.
For your services: loosely coupled. Autoscaler is a separate package; depends on Resizer interface; testable.
This is the standard Go pattern: small interfaces, dependency injection.
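A minimal sketch of the loosely-coupled shape — the `Resizer` interface, injected policy, and `fakePool` here are illustrative names, not from any particular library:

```go
package main

import "fmt"

// Resizer is the minimal surface the autoscaler needs from a pool.
type Resizer interface {
	Size() int
	Resize(n int)
}

// Autoscaler depends only on Resizer, so tests can inject a fake.
type Autoscaler struct {
	pool   Resizer
	decide func(cur int) int // injected policy
}

// Tick runs one decision cycle.
func (a *Autoscaler) Tick() {
	cur := a.pool.Size()
	if next := a.decide(cur); next != cur {
		a.pool.Resize(next)
	}
}

// fakePool records resizes for assertions — no goroutines needed.
type fakePool struct{ size int }

func (f *fakePool) Size() int    { return f.size }
func (f *fakePool) Resize(n int) { f.size = n }

func main() {
	p := &fakePool{size: 10}
	a := &Autoscaler{pool: p, decide: func(cur int) int { return cur + 2 }}
	a.Tick()
	fmt.Println(p.size) // 12
}
```

The test double is a handful of lines because the interface is small; that is the payoff of the loose coupling.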
Deep Dive: A Production Case Study¶
A real (anonymized) case from a large company.
The service¶
A pipeline that processes user events. Each event triggers 0-3 downstream calls. Total volume: 50k events/s peak, 5k events/s off-peak.
Initial design¶
Static pool of 200 workers. Sized for peak. Constantly idle off-peak (~5% utilization).
Iteration 1: simple dynamic¶
Wait-time autoscaler. Floor 50, ceiling 500. Cooldown 5s/60s. Threshold p99 wait > 500 ms.
Results:
- Average pool size: 80 (vs 200 static)
- Cost reduction: ~60%
- Latency: p99 went from 200ms to 280ms (still under SLO of 500ms)
Trade-off accepted: lose some latency margin for cost savings.
Iteration 2: AIMD¶
Replace threshold with AIMD. GrowStep 5, ShrinkFactor 0.20.
Results:
- Same average pool size (~80)
- Tighter convergence (less oscillation)
- p99 latency: 250ms (slight improvement)
- Resize events: -40% (fewer changes)
Iteration 3: predictive¶
Add time-of-day predictor. Pre-warm to 150 before each daily peak.
Results:
- Same off-peak size
- Peak-time latency p99: 180ms (down from 250ms)
- Slight cost increase during peaks (predictably oversized for 20 min)
- Trade-off: better tail latency for slight cost
Iteration 4: multi-pool¶
Split into two pools: one for events with 0-1 downstream calls (fast), one for events with 2-3 (slow). Each with its own autoscaler.
Results:
- Fast pool: small (avg 30), fast SLO met
- Slow pool: larger (avg 90), slow SLO met
- Total workers: similar to single pool but better latency
- Operations complexity: +1 dashboard
Iteration 5: live model¶
Workload model updates continuously. Bounds and thresholds shift with traffic character.
Results:
- Less manual tuning needed
- Better adaptation to gradual workload shifts
- Slightly more complex to operate (model drift can be a debugging factor)
Lessons¶
- Each iteration is a measured improvement, not a guess
- Diminishing returns: early iterations big wins; later iterations small
- Some iterations introduce complexity that costs more than they save
- Iteration 3 (predictive) was the biggest latency win for the cost
- Iteration 4 (multi-pool) was the biggest architecture change
This is the journey of a real production dynamic pool. Months of work, big wins, occasional missteps.
Deep Dive: Migration from Single-Pool to Multi-Pool¶
A common architectural pivot. Let us trace it.
Initial state¶
One pool, all task types. Autoscaler tunes based on aggregate wait time. Working but: fast tasks queue behind slow tasks. Tail latency for fast tasks is bad.
Step 1: classify tasks¶
Add classification: each task knows whether it is fast (<50ms expected) or slow.
Step 2: track per-class metrics¶
Two histograms: fast wait, slow wait. Two utilization gauges.
Step 3: split the pool¶
Two pools, two autoscalers. Submit routes by classification.
type ServiceV2 struct {
fastPool *Pool
slowPool *Pool
}
func (s *ServiceV2) Submit(task Task) error {
if task.Fast {
return s.fastPool.Submit(task.Run)
}
return s.slowPool.Submit(task.Run)
}
Step 4: tune separately¶
Fast pool: tight cooldowns, low p99 threshold. Aggressive grow. Slow pool: longer cooldowns, higher p99 threshold. Conservative grow.
Step 5: consider total capacity¶
Total workers across pools: capped by host. Use a shared budget.
budget := NewBudget(256)
fastAutoscaler := &BudgetedAutoscaler{Pool: fastPool, Budget: budget, ...}
slowAutoscaler := &BudgetedAutoscaler{Pool: slowPool, Budget: budget, ...}
Allocate within the budget; total never exceeds 256.
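The Budget type used above is not shown; it could be as simple as a mutex-guarded counter. A hypothetical sketch:

```go
package main

import (
	"fmt"
	"sync"
)

// Budget caps the total number of workers shared by several pools.
type Budget struct {
	mu    sync.Mutex
	total int
	used  int
}

func NewBudget(total int) *Budget { return &Budget{total: total} }

// Acquire reserves n workers; it returns false if the shared cap
// would be exceeded, in which case the caller must not grow.
func (b *Budget) Acquire(n int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used+n > b.total {
		return false
	}
	b.used += n
	return true
}

// Release returns n workers to the shared budget (e.g., after a shrink).
func (b *Budget) Release(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= n
	if b.used < 0 {
		b.used = 0
	}
}

func main() {
	budget := NewBudget(256)
	fmt.Println(budget.Acquire(200)) // true
	fmt.Println(budget.Acquire(100)) // false: 300 would exceed 256
	budget.Release(50)
	fmt.Println(budget.Acquire(100)) // true: 150 + 100 <= 256
}
```

Each BudgetedAutoscaler calls Acquire before growing and Release after shrinking, so the sum of pool sizes never exceeds the cap.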
Step 6: validation¶
Compare before/after: - Fast tail latency: was 800ms p99, now 80ms p99 (10x improvement) - Slow tail latency: was 5s p99, now 4s p99 (small improvement) - Total cost: roughly same - Operations complexity: more (two dashboards, two tunings)
The win is fast-path latency. The slow path remains complex.
When multi-pool is right¶
- Tasks have clearly different latency characteristics
- Latency SLO for fast path is much tighter than slow path
- Volume is large enough to justify the operational overhead
When multi-pool is wrong¶
- Tasks are similar
- Operational overhead outweighs latency wins
- The "fast" and "slow" classification is fuzzy
For our case study, multi-pool was a clear win. Not universal.
Deep Dive: Working with Kubernetes HPA¶
When your service runs in Kubernetes, in-process autoscaling interacts with Horizontal Pod Autoscaler.
What HPA does¶
HPA scales the number of pods (containers running your service). Based on aggregate CPU, memory, or custom metrics.
How it interacts¶
Each pod has its own in-process worker pool with its own autoscaler. HPA scales pods up/down.
- Each pod's autoscaler reacts to local load
- HPA reacts to aggregate load (sum of pods' metrics)
Coordination¶
Two levels of scaling. They should not fight.
Practical rule: in-process autoscaler reacts within seconds; HPA reacts within minutes. Different time scales = different decisions.
- Short bursts: in-process pool absorbs
- Sustained increase: HPA adds pods
Conflict scenarios¶
- HPA adds a pod; the new pod starts at floor. While ramping, it has low utilization. HPA sees low util on the new pod, might want to scale down. Defense: HPA cool-down (--horizontal-pod-autoscaler-downscale-stabilization=5m).
- In-process pool grows aggressively; HPA sees high CPU; HPA adds pods. Now you have more pods than needed. Defense: HPA should look at custom metrics (per-pod), not host CPU.
Custom metrics¶
For best results, export per-pod metrics to HPA:
metrics:
- type: Pods
pods:
metric:
name: worker_pool_utilization
target:
type: AverageValue
averageValue: 70 # target 70% util across pods
HPA scales pods to keep average per-pod utilization at 70%. In-process autoscaler keeps per-pod size right. Both work toward the same metric at different scales.
Pod start-up time¶
A new pod takes 30-60 seconds to start (image pull, init, healthcheck). For traffic spikes within that window, in-process pool is the only thing that helps.
Recommendation¶
- Use HPA for sustained capacity (5+ minutes)
- Use in-process autoscaler for transient bursts (seconds)
- Tune them on different time scales
- Coordinate via custom metrics if possible
Deep Dive: Cost-Benefit of Autoscaling Itself¶
Should you autoscale? Cost-benefit analysis:
Costs¶
- Engineering time to build and maintain
- Operational complexity (dashboards, alerts, runbooks)
- Risk of bugs (autoscaler-induced outages)
- Monitoring overhead
Benefits¶
- Reduced cost during low traffic
- Better latency during high traffic
- Adapts to gradual workload changes
When the math works¶
Autoscaling pays off when:
- Load varies by 2x or more
- Cost of capacity is non-trivial (cloud bill)
- Engineering team has bandwidth to operate it
- Service tier justifies the investment
When it doesn't¶
- Internal tools with predictable, low load: not worth it
- Critical services with tight latency: static (oversized) may be safer
- Teams without observability infra: too risky
A rough heuristic¶
If your monthly compute bill is > $500 and load varies by > 3x, autoscaling probably pays off. Below that, the operational complexity may exceed the savings.
This is a senior-level decision: when not to use a tool you know. The right answer is sometimes "no autoscaling, just size right."
Deep Dive: Composite Signals in Practice¶
Real autoscalers combine multiple signals. Let us examine a specific composite policy.
The signals¶
type CompositeSignal struct {
WaitP99 time.Duration
Utilization float64
QueueDepth float64
ErrorRate float64
DownstreamP99 time.Duration
CPUUsage float64
}
The policy¶
func DecideComposite(s CompositeSignal, cur, floor, ceiling int) (int, string) {
// Hard vetoes
if s.ErrorRate > 0.10 {
return cur, "veto: high error rate"
}
if s.CPUUsage > 0.90 {
if s.WaitP99 > 5 * time.Second {
return cur, "veto: cpu saturated, latency catastrophic, growing hurts"
}
return cur, "veto: cpu saturated"
}
// Strong grow signals
if s.WaitP99 > 2 * time.Second {
return min(cur+5, ceiling), "grow: critical latency"
}
// Normal grow signals
grows := 0
if s.WaitP99 > 500*time.Millisecond { grows++ }
if s.Utilization > 0.85 { grows++ }
if s.QueueDepth > 0.75 { grows++ }
if grows >= 2 {
return min(cur+2, ceiling), "grow: two+ signals"
}
if grows >= 1 && s.DownstreamP99 < 200*time.Millisecond {
// Only grow when downstream is healthy
return min(cur+1, ceiling), "grow: one signal, downstream healthy"
}
// Shrink signals (all-or-nothing)
if s.Utilization < 0.30 &&
s.WaitP99 < 50*time.Millisecond &&
s.QueueDepth < 0.10 {
return max(cur-1, floor), "shrink: all signals low"
}
return cur, "hold"
}
Why this complexity?¶
Each rule encodes operational knowledge:
- Critical latency (2s) → emergency grow. Override hysteresis.
- Two signals high → confirmed; safe to grow.
- One signal high + downstream healthy → likely safe; grow cautiously.
- All signals low → confirmed idle; safe to shrink.
- Error rate or CPU vetoes → growing would worsen things.
Maintainability¶
This policy is complex. It needs:
- Unit tests for each branch
- Documentation explaining why each threshold is set where it is
- Periodic review (every 6 months?)
- Version control
The complexity is justified by the production criticality. For simpler services, a simpler policy suffices.
Testing¶
func TestCompositeDecide(t *testing.T) {
cases := []struct {
name string
sig CompositeSignal
cur int
want int
}{
{
name: "high error rate vetoes",
sig: CompositeSignal{ErrorRate: 0.15, WaitP99: 2 * time.Second},
cur: 20, want: 20,
},
{
name: "critical latency overrides cooldown",
sig: CompositeSignal{WaitP99: 3 * time.Second},
cur: 20, want: 25,
},
{
name: "two signals high",
sig: CompositeSignal{WaitP99: 600*time.Millisecond, Utilization: 0.9},
cur: 20, want: 22,
},
{
name: "shrink when all low",
sig: CompositeSignal{Utilization: 0.2, WaitP99: 20*time.Millisecond, QueueDepth: 0.05},
cur: 20, want: 19,
},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
got, _ := DecideComposite(c.sig, c.cur, 1, 100)
if got != c.want {
t.Errorf("got %d, want %d", got, c.want)
}
})
}
}
Test each branch. Pure functions make this easy.
Deep Dive: When the Autoscaler is Wrong¶
The autoscaler is software. It has bugs. It makes wrong calls. Here is how to handle when the autoscaler is the problem.
Symptom recognition¶
- Suspicious pool size (way too high or too low for known load)
- Repeated resize events
- Recent autoscaler config changes
- Alerts firing despite no obvious workload change
Manual override¶
The fastest fix: take the autoscaler out of the loop.
type Autoscaler struct {
    // manualSize is non-nil when an operator has pinned the pool size.
    manualSize atomic.Pointer[int32] // sync/atomic, Go 1.19+
    Pool       *Pool
}
func (a *Autoscaler) SetManual(size int) {
    s := int32(size)
    a.manualSize.Store(&s)
    a.Pool.Resize(size)
}
func (a *Autoscaler) ClearManual() {
    a.manualSize.Store(nil) // resume automatic control
}
func (a *Autoscaler) Run(ctx context.Context) {
    ticker := time.NewTicker(a.interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if a.manualSize.Load() != nil {
                continue // manual mode; autoscaler dormant
            }
            // normal logic
        }
    }
}
Expose via HTTP or CLI:
curl -X POST /admin/pool/api/manual?size=64
# investigate
curl -X DELETE /admin/pool/api/manual # resume autoscaler
Rollback config¶
If a recent config change caused the issue, roll back. Keep recent configs in version control.
Disable the autoscaler¶
In extreme cases, disable autoscaling entirely. Pool reverts to static at last known good size.
Forensics¶
After stabilization, dig in:
- What signals was the autoscaler seeing?
- What decisions did it make?
- What was the workload doing?
- Is there a code bug, config bug, or workload anomaly?
Event log is critical:
Persist decisions for at least 24 hours. Query them during postmortem.
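One way to keep recent decisions queryable is a bounded in-memory ring buffer. A sketch (a real system might also persist to disk or a log pipeline):

```go
package main

import (
	"fmt"
	"time"
)

// Decision is one autoscaler tick's outcome, kept for forensics.
type Decision struct {
	At     time.Time
	From   int
	To     int
	Reason string
}

// DecisionLog is a fixed-size ring buffer: O(1) append, bounded memory.
type DecisionLog struct {
	buf  []Decision
	next int
	full bool
}

func NewDecisionLog(capacity int) *DecisionLog {
	return &DecisionLog{buf: make([]Decision, capacity)}
}

// Record appends a decision, overwriting the oldest when full.
func (l *DecisionLog) Record(d Decision) {
	l.buf[l.next] = d
	l.next = (l.next + 1) % len(l.buf)
	if l.next == 0 {
		l.full = true
	}
}

// Recent returns retained decisions, oldest first.
func (l *DecisionLog) Recent() []Decision {
	if !l.full {
		return append([]Decision(nil), l.buf[:l.next]...)
	}
	out := append([]Decision(nil), l.buf[l.next:]...)
	return append(out, l.buf[:l.next]...)
}

func main() {
	log := NewDecisionLog(3)
	for i := 0; i < 5; i++ {
		log.Record(Decision{From: 10 + i, To: 11 + i, Reason: "grow"})
	}
	for _, d := range log.Recent() {
		fmt.Printf("%d -> %d (%s)\n", d.From, d.To, d.Reason)
	}
	// Only the last 3 decisions survive: 12->13, 13->14, 14->15.
}
```

Size the buffer from the tick interval: at a 500 ms tick, 24 hours is 172,800 entries, still trivial memory.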
Iteration¶
After diagnosis:
- Fix code if there is a bug
- Adjust thresholds if signals are right but decisions wrong
- Add veto if a new condition needs handling
- Update tests
Re-enable. Watch. Iterate.
Deep Dive: Stress Testing the Autoscaler¶
Before production, stress test. Drive the autoscaler past its limits.
Test 1: linear ramp¶
Increase load from 100 req/s to 5000 req/s over 5 minutes. Watch:
- Pool size: should track linearly
- Tail latency: should stay within SLO
- Resize events: should be smooth, not chunky
Test 2: step¶
Instant jump from 100 to 2000. Watch:
- How fast does the pool grow?
- What is the tail latency during the transient?
- Does it overshoot?
Test 3: sawtooth¶
Repeating: 100 → 2000 over 10s, then 2000 → 100 over 10s. Watch:
- Does the pool keep up?
- Does shrinking work?
- Is there hysteresis-related lag?
Test 4: chaos¶
Random arrivals between 100 and 5000 req/s. Watch:
- Pool size variance
- Latency distribution
- Resize frequency
Test 5: slow downstream¶
Inject a 5x slowdown in the downstream for 60 seconds. Watch:
- Does the pool grow into the slow downstream? (bad)
- Does the breaker veto? (good)
- Does the system recover when the downstream recovers?
Run these tests in a staging environment. They reveal corner cases that production rarely hits cleanly.
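The load shapes above can be expressed as plain rate functions that a load generator samples each second. A hypothetical sketch:

```go
package main

import "fmt"

// A Shape maps elapsed seconds to a target request rate (req/s).
type Shape func(t float64) float64

// Ramp rises linearly from lo to hi over dur seconds, then holds.
func Ramp(lo, hi, dur float64) Shape {
	return func(t float64) float64 {
		if t >= dur {
			return hi
		}
		return lo + (hi-lo)*t/dur
	}
}

// Step jumps from lo to hi at cut seconds.
func Step(lo, hi, cut float64) Shape {
	return func(t float64) float64 {
		if t < cut {
			return lo
		}
		return hi
	}
}

// Sawtooth ramps lo -> hi over the first half of each period,
// then hi -> lo over the second half, repeating.
func Sawtooth(lo, hi, period float64) Shape {
	return func(t float64) float64 {
		phase := t - float64(int(t/period))*period
		half := period / 2
		if phase < half {
			return lo + (hi-lo)*phase/half
		}
		return hi - (hi-lo)*(phase-half)/half
	}
}

func main() {
	ramp := Ramp(100, 5000, 300) // test 1: 100 -> 5000 over 5 minutes
	fmt.Println(ramp(150))       // halfway: 2550
	step := Step(100, 2000, 60)  // test 2
	fmt.Println(step(59), step(61)) // 100 2000
	saw := Sawtooth(100, 2000, 20) // test 3
	fmt.Println(saw(5), saw(15))   // 1050 1050
}
```

A driver goroutine reads the shape once per second and feeds the result into a token bucket or request generator; the same harness then runs all five tests by swapping the shape.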
Deep Dive: Coordinating with Rate Limiters and Token Buckets¶
Rate limiters and autoscalers serve different purposes but interact.
Rate limiter as protector¶
A rate limiter at the edge prevents excessive load from reaching the pool. If your pool can handle 1000 req/s, a limiter at 1200 req/s gives headroom.
The autoscaler grows the pool. As the pool grows, the limiter should grow too — otherwise the autoscaler grew capacity that the limiter denies.
func (s *Service) updateLimit() {
    capacity := s.pool.Size()
    avgService := s.workload.MeanServiceTime()
    // capacity / mean service time ≈ max sustainable rate; admit 80% of it.
    targetRate := float64(capacity) / avgService.Seconds() * 0.8
    s.limiter.SetLimit(rate.Limit(targetRate))
}
Trade-offs¶
- Limiter too tight: 429s while pool has capacity. Wasted capacity.
- Limiter too loose: queue grows beyond pool's ability. Latency spikes.
- Limiter dynamic: tracks pool capacity. More accurate but more complex.
Most systems start with a static limit set conservatively. Dynamic limits are senior-level optimization.
Deep Dive: Choosing Between Reactive, Predictive, and Hybrid¶
The big design choice. Let us compare.
Reactive¶
Standard. Read signal; decide; resize. Lag = signal collection lag + autoscaler tick + resize-effective lag.
Good for unpredictable workloads. Easy to operate. Most production deployments.
Predictive¶
Forecast future load; resize ahead. Lag potentially negative (resize before load).
Good for strongly periodic or trendy workloads. Adds complexity (forecasting infrastructure). Requires good training data.
Hybrid¶
Use predictive for baseline; reactive for corrections.
Best of both for many workloads. Most complex of the three. Engineer carefully.
Decision criteria¶
| Criterion | Reactive | Predictive | Hybrid |
|---|---|---|---|
| Workload predictability | Any | High | High to medium |
| Operational complexity | Low | High | Highest |
| Latency under spike | OK | Best | Best |
| Cost optimization | OK | Best | Best |
| Failure mode | Brief SLO miss | Wrong-prediction harm | Worst case is reactive's |
| Code complexity | Small | Medium | Largest |
For most services: reactive. For services with strong patterns: hybrid. Pure predictive is rare in worker pools.
Deep Dive: A Mental Model for Senior-Level Decisions¶
At senior level, you face design decisions other engineers will follow. A few mental shortcuts:
"Where does this signal come from?"¶
Before adopting any signal, trace its source. A signal that comes from the pool itself is fast and reliable. A signal that comes from Prometheus is delayed (scrape interval) and may fail (network).
Slow signals require slower autoscalers. Unreliable signals require fallbacks.
"What is the worst-case behavior?"¶
Imagine the autoscaler's decision is completely wrong. What is the impact?
If "completely wrong" means "pool at floor when load is at peak," that is an SLO breach. Bad but recoverable.
If "completely wrong" means "pool at ceiling with no work, costing money," that is wasted spend. Bad but recoverable.
If "completely wrong" means "pool growing without bound until OOM kill," that is an outage. Bound everything to prevent this.
"What if the autoscaler is broken?"¶
If the autoscaler stops running, what happens? Pool freezes at current size. Is that survivable?
Yes if your service can tolerate static for a while. No if the workload is highly variable and a stuck pool fails.
Design for the autoscaler being broken: floor and ceiling that are survivable even without dynamic.
"Who maintains this?"¶
The autoscaler is software. Software needs maintenance. Who?
If it's a platform team, design for ease of operation: defaults, dashboards, runbook, alerts.
If it's the service team, keep it simple. Don't over-engineer.
"What is the SLO?"¶
The autoscaler exists to meet an SLO. If there is no SLO, there is no measurable success. Define the SLO before the autoscaler.
p99 latency? p99 wait time? Throughput? Cost per request? Make it explicit.
"What is the kill switch?"¶
Always have a way to disable the autoscaler. Manual override, kill switch, runtime flag. Never trust the autoscaler enough to remove the ability to take it out.
In an incident, you want the autoscaler out of the way. Make that fast.
Deep Dive: Lessons from Operating Autoscalers for Years¶
Some hard-won insights only experience teaches:
Lesson 1: tuning is never done¶
Every six months, workload character drifts. New customers, new code paths, new downstream dependencies. Tune accordingly.
A quarterly review of autoscaler config is reasonable for any production service. Look at:
- Resize event rate
- Time at ceiling
- Time at floor
- Tail latency
- Cost per request
Lesson 2: alerts beat dashboards¶
Dashboards are nice. Alerts wake you up. The alert-to-dashboard ratio matters. Too few alerts: incidents go unnoticed. Too many: alert fatigue.
For autoscalers, three alerts cover most cases:
1. Stuck at ceiling > 5 min
2. Resize/min > 30
3. SLO violation (p99 > target)
Lesson 3: defaults rule¶
Most autoscalers run on defaults. Tune the defaults; teams will get reasonable behavior without thought.
A "platform" team that owns dynamic pooling for the org should provide opinionated defaults that work for 80% of services. The other 20% can override.
Lesson 4: observability for the autoscaler itself¶
The autoscaler has its own state: last-tick time, integral, sample buffer. Expose these. When debugging, you'll want to know why the autoscaler made a specific decision; you can only know that if the state is visible.
Lesson 5: explain decisions¶
Every resize event should include a reason string. Not "resize 12 → 14" but "resize 12 → 14: util 0.92 > 0.85 for 3 ticks". Future-you reads logs; the reason saves hours.
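A small sketch of this, with an illustrative `ResizeEvent` type; the point is that the "why" travels with the "what":

```go
package main

import "fmt"

// ResizeEvent carries its own explanation, so a log line reads
// "resize 12 → 14: util 0.92 > 0.85 for 3 ticks" rather than
// a bare size change.
type ResizeEvent struct {
	From, To int
	Reason   string
}

func (e ResizeEvent) String() string {
	return fmt.Sprintf("resize %d → %d: %s", e.From, e.To, e.Reason)
}

func main() {
	e := ResizeEvent{From: 12, To: 14,
		Reason: "util 0.92 > 0.85 for 3 ticks"}
	fmt.Println(e) // one log line that saves hours later
}
```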
Lesson 6: bound everything¶
Floor, ceiling, max step, max integral, max history size. Every variable that can grow unboundedly will, eventually, hit a problem. Bound proactively.
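The same clamp shape covers most of these bounds; a minimal illustrative helper:

```go
package main

import "fmt"

// clamp bounds a proposed pool size to [floor, ceiling]. The identical
// shape bounds a PID integral, a max step, or a history length.
func clamp(v, floor, ceiling int) int {
	if v < floor {
		return floor
	}
	if v > ceiling {
		return ceiling
	}
	return v
}

func main() {
	fmt.Println(clamp(200, 4, 64)) // runaway target capped at ceiling: 64
	fmt.Println(clamp(0, 4, 64))   // scale-to-zero prevented by floor: 4
}
```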
Lesson 7: practice incidents¶
Run game days. Simulate autoscaler bugs, downstream failures, deploy storms. Make sure the runbook works.
Lesson 8: write down assumptions¶
The autoscaler embeds assumptions: arrival rate is roughly Poisson, service time is mostly stable, downstream is independent. When the assumptions break, the autoscaler misbehaves. Document them so future engineers can revisit them when things change.
Deep Dive: Robust Implementations in Real Codebases¶
A few patterns I have seen in production codebases (sanitized):
Pattern: separate "policy" and "controller"¶
type Policy interface {
NextSize(state State) (size int, reason string)
}
type Controller struct {
Pool *Pool
Policy Policy
Bounds Bounds
Cool *Cooldown
Hooks []Hook // observers
}
Policy is stateless (re-computed each tick). Controller is stateful (tracks last-resize times). The split makes policy testable.
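A runnable sketch of the split, assuming an illustrative `State` and a simple threshold policy; because the policy is a pure function of `State`, it can be unit-tested with table-driven cases and no pool at all:

```go
package main

import "fmt"

// State is the snapshot the controller hands the policy each tick.
type State struct {
	Util float64
	Size int
}

type Policy interface {
	NextSize(s State) (int, string)
}

// ThresholdPolicy is stateless: same State in, same decision out.
type ThresholdPolicy struct{ High, Low float64 }

func (p ThresholdPolicy) NextSize(s State) (int, string) {
	switch {
	case s.Util > p.High:
		return s.Size + 1, fmt.Sprintf("util %.2f > %.2f", s.Util, p.High)
	case s.Util < p.Low:
		return s.Size - 1, fmt.Sprintf("util %.2f < %.2f", s.Util, p.Low)
	}
	return s.Size, "in band"
}

func main() {
	p := ThresholdPolicy{High: 0.85, Low: 0.30}
	size, reason := p.NextSize(State{Util: 0.92, Size: 12})
	fmt.Println(size, reason) // 13 util 0.92 > 0.85
}
```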
Pattern: Hook for side effects¶
Hooks observe; they cannot mutate. Useful for:
- Metrics
- Logging
- Audit trail to database
- Notification (Slack, PagerDuty)
Many hooks can exist; they run in order, each catching errors.
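One way to sketch "each catching errors": run every hook under `recover`, so a panicking hook cannot take down the control loop. The `Hook` signature here is illustrative:

```go
package main

import "fmt"

// Hook observes a resize event; it cannot mutate the decision.
type Hook func(from, to int, reason string)

// runHooks runs hooks in order; a panic in one is logged and the
// remaining hooks still run.
func runHooks(hooks []Hook, from, to int, reason string) {
	for _, h := range hooks {
		func() {
			defer func() {
				if r := recover(); r != nil {
					fmt.Println("hook panicked:", r) // log and continue
				}
			}()
			h(from, to, reason)
		}()
	}
}

func main() {
	hooks := []Hook{
		func(from, to int, _ string) { fmt.Println("metric:", from, "→", to) },
		func(_, _ int, _ string) { panic("audit DB down") }, // misbehaving hook
		func(_, _ int, reason string) { fmt.Println("log:", reason) },
	}
	runHooks(hooks, 12, 14, "util 0.92 > 0.85")
	// the third hook still runs despite the second panicking
}
```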
Pattern: dry-run mode¶
type Controller struct {
DryRun bool
}
func (c *Controller) tick() {
target, reason := c.Policy.NextSize(c.collectState())
if c.DryRun {
c.logger.Info("would resize", "to", target, "reason", reason)
return
}
c.Pool.Resize(target)
}
Run with DryRun = true in shadow mode to validate decisions before going live.
Pattern: progressive rollout¶
type Controller struct {
Effective float64 // 0.0 to 1.0 - fraction of decisions to act on
}
func (c *Controller) tick() {
target, _ := c.Policy.NextSize(c.collectState())
if rand.Float64() > c.Effective {
return // not this tick
}
c.Pool.Resize(target)
}
Effective = 0.1 means act on 10% of decisions. Lets you tentatively turn on a new policy. Increase to 1.0 once confident.
Pattern: chain of responsibility for decisions¶
type Decider interface {
Decide(state State, next Decider) (int, string)
}
// Composed:
chain := Veto(Limit(AIMD(Default())))
target, reason := chain.Decide(state, nil)
Each link gets a chance to short-circuit. The final default produces some baseline. Useful for composing many concerns.
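A runnable sketch of the chain, using a variant that stores the next link in each struct rather than passing it per call; `State` and the concrete links (`vetoDecider`, `limitDecider`, `defaultDecider`) are illustrative:

```go
package main

import "fmt"

type State struct {
	Breaker bool // downstream circuit breaker open
	Util    float64
	Size    int
}

type Decider interface {
	Decide(s State) (int, string)
}

// vetoDecider short-circuits when a veto condition holds.
type vetoDecider struct{ next Decider }

func (d vetoDecider) Decide(s State) (int, string) {
	if s.Breaker {
		return s.Size, "veto: breaker open, hold size"
	}
	return d.next.Decide(s)
}

// limitDecider clamps whatever the rest of the chain proposes.
type limitDecider struct {
	next           Decider
	floor, ceiling int
}

func (d limitDecider) Decide(s State) (int, string) {
	n, reason := d.next.Decide(s)
	if n > d.ceiling {
		return d.ceiling, reason + " (clamped to ceiling)"
	}
	if n < d.floor {
		return d.floor, reason + " (clamped to floor)"
	}
	return n, reason
}

// defaultDecider is the baseline at the end of the chain.
type defaultDecider struct{}

func (defaultDecider) Decide(s State) (int, string) {
	if s.Util > 0.85 {
		return s.Size * 2, "util high, double"
	}
	return s.Size, "hold"
}

func main() {
	chain := vetoDecider{next: limitDecider{next: defaultDecider{}, floor: 4, ceiling: 32}}
	fmt.Println(chain.Decide(State{Util: 0.95, Size: 24})) // doubling clamped to ceiling
	fmt.Println(chain.Decide(State{Breaker: true, Size: 24})) // veto short-circuits
}
```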
Pattern: pluggable signal sources¶
type SignalSource interface {
Sample(ctx context.Context) (float64, error)
Name() string
}
type WaitTimeSource struct { Tracker *WaitTracker }
type UtilSource struct { Pool *Pool }
type PrometheusSource struct { Client *prom.Client; Query string }
Signals can come from anywhere: in-process tracker, Prometheus, external service. Same interface; autoscaler doesn't care.
Deep Dive: Tackling Cold-Start Issues¶
A frequent senior-level concern: cold start. Pool drops to floor; first burst pays a penalty.
Strategies¶
Floor higher than zero¶
Obvious but worth stating: never let the floor be 0 for a production service.
Time-of-day floor¶
Adjust floor based on time:
func (a *Autoscaler) currentFloor(now time.Time) int {
h := now.Hour()
if h >= 9 && h < 18 {
return 16
}
return 4
}
Pre-warm on signal¶
Before scheduled events (a known traffic surge), warm up:
if upcomingEvent := scheduledEvents.Next(now); upcomingEvent != nil && upcomingEvent.In() < 5*time.Minute {
	target := upcomingEvent.ExpectedSize
	a.Pool.Resize(target)
}
Anticipatory grow¶
If load is consistently rising (a linear trend), grow ahead of the trend instead of chasing it.
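A minimal sketch of trend extrapolation; the window, horizon, and `perWorker` capacity figure are illustrative assumptions:

```go
package main

import "fmt"

// projectedLoad fits a simple average slope to recent load samples and
// extrapolates `horizon` ticks ahead. Only growth is anticipated; a
// falling trend is never used to pre-shrink.
func projectedLoad(samples []float64, horizon int) float64 {
	n := len(samples)
	if n < 2 {
		return samples[n-1]
	}
	slope := (samples[n-1] - samples[0]) / float64(n-1)
	if slope < 0 {
		slope = 0
	}
	return samples[n-1] + slope*float64(horizon)
}

func main() {
	// requests/s over the last five ticks: steadily rising
	samples := []float64{100, 110, 120, 130, 140}
	future := projectedLoad(samples, 3) // three ticks ahead
	perWorker := 10.0                   // req/s one worker can absorb
	target := int(future/perWorker) + 1
	fmt.Println(future, target) // 170 18
}
```

A proper least-squares fit is more robust to noisy samples; the two-point slope above is the simplest thing that demonstrates the idea.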
Reserve capacity at startup¶
When a new pod starts, accept a brief overprovisioning:
For the first 60 seconds, never go below warmupSize. After that, normal logic.
Trade-offs¶
All these defenses cost capacity. The right answer depends on:
- How often cold start happens
- How bad the penalty is
- How sensitive your SLO is
For most services, time-of-day floor is the simplest effective defense. Anticipatory grow is the next step for workloads with strong patterns.
Deep Dive: When to Replace Autoscaling with Static¶
After running dynamic autoscaling for a year, you might decide to revert to static. Reasons:
- Workload stabilized. What was variable is now predictable. Static is simpler.
- Operational cost too high. Maintaining the autoscaler costs more than the savings.
- Bugs. Repeated incidents from autoscaler mistakes.
- Cost no object. Reserved instances or unlimited budget makes savings irrelevant.
How to revert safely¶
- Measure current dynamic behavior. What is the steady-state size?
- Choose static size: usually slightly above the average (with headroom).
- Disable autoscaler with manual override at that size.
- Watch for one week. Confirm no SLO breach.
- If good, remove autoscaler code. Static pool.
The lesson¶
Dynamic is not always better. The decision to autoscale is a trade-off. Revisit it periodically. If the trade-off has shifted, change.
Deep Dive: A Note on Generics¶
Go generics (1.18+) let us write more reusable autoscaler components.
type Pool[T any] interface {
Submit(T) error
Resize(int) error
Size() int
}
type Signal[S any] interface {
Sample() S
}
type Decider[S any] interface {
Decide(state S, cur int) int
}
Implementations can be specialized:
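For example, a sketch that instantiates the generic `Decider[S]` with a typed state; `UtilState` and `utilDecider` are illustrative:

```go
package main

import "fmt"

// Decider is the generic interface from above: a decision over a
// typed state S, no interface{} casts required.
type Decider[S any] interface {
	Decide(state S, cur int) int
}

// UtilState is a structured signal; the compiler guarantees a
// utilDecider is never handed the wrong state shape.
type UtilState struct{ Util float64 }

type utilDecider struct{ high float64 }

func (d utilDecider) Decide(s UtilState, cur int) int {
	if s.Util > d.high {
		return cur + 1
	}
	return cur
}

func main() {
	var d Decider[UtilState] = utilDecider{high: 0.85}
	fmt.Println(d.Decide(UtilState{Util: 0.92}, 12)) // 13
	fmt.Println(d.Decide(UtilState{Util: 0.50}, 12)) // 12
}
```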
Use cases: - Pools with typed task arguments (no interface{} casts) - Signal sources that produce structured outputs
For most projects, untyped (interface{}) is simpler. Generics shine when type safety matters or when you have many specialized variants.
Deep Dive: The Future of Dynamic Worker Scaling¶
What is next for dynamic worker scaling?
Machine learning¶
Reinforcement-learning autoscalers. Agent learns the decision policy from rewards (low latency + low cost). Active area of research.
Practical concerns: - Training data quantity and quality - Online learning vs offline - Catastrophic forgetting - Explainability
For most workloads, it is not yet ready. At massive scale (Netflix, AWS internal), it has been explored.
Auction-based scaling¶
Pools bid for resources from a shared budget. Higher demand pools outbid lower. The market reaches equilibrium.
Theoretically elegant. Operationally complex. Rare in practice.
Federation¶
A central autoscaler oversees many pools across services. Coordinates capacity across the whole organization.
Used at very large scale (Google's Borg, Netflix's Atlas). Beyond the scope of in-process autoscaling.
Verifiable autoscalers¶
Formal verification of autoscaler decisions. Proof that under given workload model, autoscaler converges to optimum.
Academic. Maybe useful for safety-critical systems.
Quantum¶
(Joking. There is no quantum autoscaling.)
Diagrams¶
AIMD vs MIAD vs PID convergence
ideal: stable at 30
AIMD: 8 ─ +1 ─ +1 ─ ... ─ 30 ─ (steady) ─ overshoots briefly, returns
MIAD: 8 ─ x2 ─ 16 ─ x2 ─ 32 ─ (overshot) ─ slow shrink to 30
PID: smooth approach to 30, possibly tiny oscillation
step: 8 → 16 → 32 → 16 → 32 → ... (flap if hysteresis poor)
backpressure + autoscaling system
callers
│
▼
┌────────────┐
│ rate limit │ (429 if exceeded)
└─────┬──────┘
▼
┌────────────┐
│ submit │ (503 if full)
└─────┬──────┘
▼
┌────────────┐
│ queue │ ←─ depth signal
└─────┬──────┘
▼
┌────────────┐
│ workers │ ←─ util signal
└─────┬──────┘
▼
┌────────────┐
│ downstream │ ←─ breaker
└────────────┘
│
▼
responses
│
▼ (latency feedback)
autoscaler ──→ resize signals
PID loop
setpoint ──┐
↓
error = setpoint - measured
│
┌───────┼───────┐
▼ ▼ ▼
Kp·e Ki·∫e·dt Kd·de/dt
│ │ │
└───┬───┴───────┘
▼
control output
│
▼
pool size change
│
▼
observed signal → measured (next iteration)
multi-pool coordination
┌───────────────┐
│ Global Budget │
└───────┬───────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌─────┐ ┌─────┐ ┌─────┐
│ p1 │ │ p2 │ │ p3 │
└─────┘ └─────┘ └─────┘
▲ ▲ ▲
autoscaler1 autoscaler2 autoscaler3
│ │ │
└ request growth from budget ┘
(granted partially)
hierarchy of autoscalers
┌───────────────┐
│ Cluster HPA │ (scales pods)
│ (minutes) │
└───────┬───────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────┐ ┌─────┐ ┌─────┐
│ pod │ │ pod │ │ pod │
└──┬──┘ └──┬──┘ └──┬──┘
│ │ │
in-process in-process in-process
autoscaler autoscaler autoscaler
(seconds) (seconds) (seconds)
│ │ │
worker worker worker
pool pool pool
short timescale: in-process responds
long timescale: HPA reacts
control loop in production
signals collected
─────────────────
- wait p99 (every task)
- utilization (busy/live)
- queue depth (atomic)
- error rate (counter ratio)
- downstream p99
- CPU usage (os)
↓
smoothing
─────────
- EWMA per signal
- rolling window for percentiles
- histograms for quantile
↓
decision function
─────────────────
- veto checks
- threshold checks
- AIMD or PID step
- cost adjustments
↓
bounds and cooldown
──────────────────
- clamp to [floor, ceiling]
- check time since last
↓
Resize
──────
- actuate pool
- emit metric
- log decision
↓
feedback (back to signals)
production dashboard layout
┌─────────────────┬─────────────────┬─────────────────┐
│ pool size │ queue depth │ workers busy │
│ (gauge line) │ (gauge line) │ (gauge line) │
├─────────────────┼─────────────────┼─────────────────┤
│ submit/s │ complete/s │ drop/s │
│ (rate line) │ (rate line) │ (rate line) │
├─────────────────┼─────────────────┼─────────────────┤
│ wait p50/99 │ process p50/99 │ error rate │
│ (lines) │ (lines) │ (line) │
├─────────────────┼─────────────────┼─────────────────┤
│ resize up/min │ resize down/min │ reasons │
│ (counter) │ (counter) │ (table) │
└─────────────────┴─────────────────┴─────────────────┘
alerts:
- pool at ceiling > 5 min
- resize/min > 30
- drop/s > 0 for 30s
- wait p99 > SLO for 60s
state machine: autoscaler lifecycle
┌──────┐ register ┌──────────┐
│ idle │ ──────────────→ │ running │
└──────┘ └──────────┘
│ ▲
manual │ │ clear manual
▼ │
┌─────────────┐
│ manual │
│ override │
└─────────────┘
│
drain │
▼
┌─────────────┐
│ draining │
└──────┬──────┘
│
stop ▼
┌──────────┐
│ stopped │
└──────────┘