Capacity Planning & Breakpoint Testing

Take a service that "feels slow under load" and turn it into a defensible number: the request rate at which p99 crosses SLO, the throughput knee where adding load stops adding work, the breakpoint where it falls over — and the one resource that binds all three.


Tier	Load-testing (meta-skill)
Primary domain	Capacity & scalability analysis
Skills exercised	Step-ramp load profiles, throughput-vs-latency curves, Universal Scalability Law, Little's Law, USE method, bottleneck attribution, autoscaling-threshold design, fleet sizing, Go (pprof, `runtime/metrics`)
Interview sections	17 (performance engineering), 22 (scalability & HA), 14 (system design)
Builds on	`load-testing/01-distributed-load-generator` (you need a rig that holds an open-model arrival rate)
Est. effort	2–4 focused days

1. Context

You own a stateless HTTP service — call it orders-api. It does request validation, one Postgres read, one Redis lookup, and one downstream gRPC call, then returns JSON. Capacity planning today is folklore: "one pod handles maybe a few hundred requests a second," autoscaling fires on 70% CPU because that's the template default, and the last incident postmortem said "the service got slow" with no number attached.

Your job is to characterize orders-api under controlled, increasing load until it breaks, and from that produce a capacity model: X req/s per instance at SLO, bound by resource R, with H% headroom, therefore N instances for a Y req/s peak. You will replace "it's slow" with a curve, a knee, a breakpoint, and a named binding resource — numbers you could defend to a staff panel and stake an autoscaling policy on.

2. Goals / Non-goals

Goals
- Find the throughput knee (where latency starts climbing faster than linear) and the breakpoint (where p99 crosses SLO and/or errors begin) for a single instance, expressed in req/s.
- Identify and prove the binding resource at saturation (CPU, memory, IO, lock contention, or a downstream dependency) using the USE method.
- Cross-check the measured curve against Little's Law and the Universal Scalability Law: predicted and measured concurrency should agree.
- Derive a per-instance capacity number at SLO with explicit headroom, then a fleet-sizing recommendation and an autoscaling threshold anchored to the knee, not to a stock 70% CPU rule.
- Demonstrate that removing the proven bottleneck moves the knee in the predicted direction.

Non-goals
- Tuning the application's algorithms. You characterize the service as-is; the one allowed change is the targeted bottleneck-removal experiment (§10.5).
- Multi-service / full-system load testing. Single service, controlled dependencies: isolate the unit before composing the fleet.
- Chaos / failure injection. That's the resilience lab; here the system is healthy and we push it to its honest limit.
- Beating a vendor benchmark. The deliverable is your number and why.

3. Functional requirements

A target service (cmd/target), either orders-api or a faithful stand-in, with a knob that lets you deliberately constrain one resource at a time: e.g. DB connection-pool size (-pool), worker count (-workers), or GOMAXPROCS.
A load profile runner built on the 01-distributed-load-generator rig, driving an open model (fixed arrival rate, not closed-loop "as fast as it answers"). It must support staged step-ramps: hold a rate for a dwell, step up, repeat.
The runner emits, per step: offered rate (req/s), achieved rate (req/s), p50/p90/p99/p999 latency, error rate, and in-flight concurrency.
A resource sampler (cmd/use or sidecar) capturing, per step and aligned to the load timeline: CPU utilization, run-queue/saturation, memory + GC pressure, IO wait, lock/mutex contention (Go mutex profile), and downstream call latency + saturation.
An analysis step (script or notebook) that plots throughput-vs-latency, marks the knee and breakpoint, and computes the Little's Law and USL fits.
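For the Go-runtime part of the resource sampler, one possible starting point is the standard `runtime/metrics` package. The sketch below reads a few scalar runtime metrics in-process. The `Sample` helper and the chosen metric subset are assumptions made for illustration, and OS-level signals such as run-queue length or IO wait would need a separate source.

```go
package main

import (
	"fmt"
	"runtime/metrics"
	"sort"
)

// metricNames is an illustrative subset; metrics.All() lists what the
// running Go version actually supports.
var metricNames = []string{
	"/sched/goroutines:goroutines",       // live goroutines
	"/memory/classes/heap/objects:bytes", // heap occupied by objects
	"/gc/cycles/total:gc-cycles",         // completed GC cycles
	"/sync/mutex/wait/total:seconds",     // cumulative mutex wait (newer Go versions)
}

// Sample reads the scalar metrics above once. Metrics the runtime does not
// support come back as KindBad and are skipped rather than reported as zero.
func Sample() map[string]float64 {
	samples := make([]metrics.Sample, len(metricNames))
	for i, name := range metricNames {
		samples[i].Name = name
	}
	metrics.Read(samples)

	out := make(map[string]float64, len(samples))
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			out[s.Name] = float64(s.Value.Uint64())
		case metrics.KindFloat64:
			out[s.Name] = s.Value.Float64()
		}
	}
	return out
}

func main() {
	m := Sample()
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%-40s %g\n", name, m[name])
	}
}
```

Calling `Sample` once per dwell boundary and storing the deltas of the cumulative counters would give per-step values that line up with the load timeline.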

4. Load & data profile

Profile shape: step-ramp. Start well below capacity (e.g. 50 req/s), step in fixed increments (e.g. +50 req/s) with a ≥ 60 s dwell per step so the system reaches steady state (queues stabilize) before you read the numbers. Continue past the breakpoint — you must see the cliff, not stop at the knee.
Coverage of the six load-test types, each with its purpose:

| Type | Profile | What it answers |
|------|---------|-----------------|
| Smoke | 1–5 req/s, 1 min | Does it even work; baseline latency floor |
| Average | sustained ~expected peak | Behavior at normal load |
| Stress / breakpoint | step-ramp until SLO breach / errors | Where it breaks, and on what resource |
| Spike | flat baseline → instant 10× → back | Recovery after sudden surge |
| Soak | knee-ish rate for ≥ 1 h | Leaks, GC creep, slow resource drift |
| Capacity (this lab's synthesis) | the ramp, read at SLO | The per-instance number you ship |

Request mix: representative read/write ratio (e.g. 90/10). Don't ramp a single trivial endpoint — the knee must reflect real work.
Determinism: fixed seed for payloads and key distribution so two ramps are comparable. Keep the dependency (Postgres/Redis) warm and at fixed size.
One variable at a time: instance size, pool size, and GOMAXPROCS are pinned across a ramp; change exactly one between ramps.
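The step-ramp shape described in this section can be written down as a list of stages before any load is sent. This is a minimal sketch: `Stage` and `StepRamp` are hypothetical names, and the values in `main` are the example figures from the text (50 req/s start, +50 steps, 60 s dwell) with an arbitrary 400 req/s ceiling.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Stage is one step of the ramp: hold RPS for Dwell, then move on.
type Stage struct {
	RPS   int
	Dwell time.Duration
}

// StepRamp builds stages from startRPS up to maxRPS (inclusive) in stepRPS
// increments, each held for the same dwell.
func StepRamp(startRPS, stepRPS, maxRPS int, dwell time.Duration) ([]Stage, error) {
	if startRPS <= 0 || stepRPS <= 0 || maxRPS < startRPS || dwell <= 0 {
		return nil, errors.New("invalid ramp parameters")
	}
	var stages []Stage
	for rps := startRPS; rps <= maxRPS; rps += stepRPS {
		stages = append(stages, Stage{RPS: rps, Dwell: dwell})
	}
	return stages, nil
}

func main() {
	stages, err := StepRamp(50, 50, 400, 60*time.Second)
	if err != nil {
		panic(err)
	}
	var total time.Duration
	for _, s := range stages {
		total += s.Dwell
	}
	fmt.Printf("%d stages, %v total, last stage %d req/s\n",
		len(stages), total, stages[len(stages)-1].RPS)
}
```

Printing the total duration up front is a cheap way to notice that a fine-grained ramp with a long dwell will take longer than the time you have set aside.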

5. Non-functional requirements / SLOs

The SLO is the contract that defines "broken." Pick it first; the breakpoint is where you cross it.

Metric	Target / what to report
Latency SLO (the line)	p99 < 200 ms at the offered rate (state your own; everything below is relative to it)
Knee throughput	The req/s where p99 begins rising faster than linear (latency derivative inflects). Report the number.
Breakpoint throughput	The req/s where p99 first crosses the SLO or error rate exceeds 0.1%. Report the number — this is the hard ceiling.
Binding resource at breakpoint	Named and proven (CPU / mem / IO / lock / downstream), with USE evidence attached.
Per-instance capacity at SLO	Max sustained req/s with p99 < SLO and error rate < 0.1%, held for the dwell.
Headroom target	Operate at ≤ 70% of knee throughput (or a stated target); state the autoscale-out threshold derived from it.
Little's Law agreement	Predicted concurrency `L = λ·W` within ±10% of measured in-flight at ≥ 3 ramp points (below, at, and past the knee).
USL fit	Throughput curve fits the Universal Scalability Law; report contention σ and coherency κ coefficients and the predicted peak.

The goal is not a leaderboard number. It's to find this service's number and explain which resource sets it.
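For reference, the two models named in the table can be written out as follows. The notation is the commonly used form of Gunther's law. Two symbols are assumptions not defined elsewhere in this document: $N$, taken here to be the load level (for example in-flight concurrency), and $\gamma$, the throughput at $N = 1$. In Little's Law, $W$ is the mean residence time, not a percentile.

```latex
\begin{align*}
L &= \lambda W, \qquad \text{e.g. } \lambda = 200\ \text{req/s},\ W = 0.050\ \text{s} \;\Rightarrow\; L = 10 \\
X(N) &= \frac{\gamma N}{1 + \sigma (N - 1) + \kappa N (N - 1)} \\
N^{*} &= \sqrt{\frac{1 - \sigma}{\kappa}} \qquad (\kappa > 0)
\end{align*}
```

Here $\sigma$ is the contention coefficient and $\kappa$ the coherency coefficient from the USL fit row, and $X(N^{*})$ is the predicted peak throughput. With $\kappa = 0$ the curve flattens toward a ceiling instead of turning over.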

6. Architecture constraints & guidance

Open-model load only. Closed-loop generators throttle themselves when the service slows, which hides the breakpoint (offered load silently drops as latency rises). Use the 01 rig's fixed-arrival-rate mode and verify the generator itself is not the bottleneck (it must sustain offered rate with headroom — check its own CPU and the send-timestamp skew).
Isolate the unit under test. Pin the target to known resources (--cpus, --memory in Docker, or a dedicated node) so the curve reflects the service, not a noisy neighbor. Record the exact instance shape — the capacity number is meaningless without it.
Steady state before you read. Latency must plateau within the dwell; if it's still rising when the step ends, the dwell is too short and your knee is an artifact of a growing queue, not capacity.
Instrument both sides. Load side (offered/achieved/latency/concurrency) and resource side (USE signals) on one aligned clock. Prometheus + Grafana, plus Go net/http/pprof for CPU and mutex profiles at saturation.
Coordinated omission: measure latency from intended send time, not from when the generator got around to sending. The 01 lab covers this; if you got it wrong there, every curve here is optimistic.
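The coordinated-omission point above comes down to which timestamp the latency clock starts from. The sketch below is an illustration with made-up timings, not the 01 rig's code: `IntendedSend` and `Demo` are hypothetical helpers, and the 300 ms generator stall and 20 ms service time are invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// IntendedSend returns when request i should leave under a fixed arrival
// rate, independent of how earlier requests behaved.
func IntendedSend(start time.Time, i, rps int) time.Time {
	return start.Add(time.Duration(i) * time.Second / time.Duration(rps))
}

// Demo models one request whose send was delayed by a stalled generator and
// returns the latency measured both ways.
func Demo() (naive, corrected time.Duration) {
	start := time.Unix(0, 0)
	intended := IntendedSend(start, 10, 100)       // scheduled at start+100ms
	actual := intended.Add(300 * time.Millisecond) // generator got to it 300ms late
	done := actual.Add(20 * time.Millisecond)      // service itself took 20ms
	return done.Sub(actual), done.Sub(intended)
}

func main() {
	naive, corrected := Demo()
	fmt.Println("naive:", naive, "corrected:", corrected)
}
```

The naive figure reports only the 20 ms of service time, while the corrected one includes the 300 ms the request spent waiting to be sent, which is the delay a real client arriving on schedule would have experienced.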

7. Data model

The deliverable is a small, tidy results table per ramp — this is the model.

ramp_step:  { offered_rps, achieved_rps, p50_ms, p99_ms, p999_ms,
              err_rate, inflight, cpu_util, sat_runq, mem_mb, gc_pct,
              io_wait, mutex_wait_ms, downstream_p99_ms, downstream_util }

capacity_model (derived, one row):
  { instance_shape, knee_rps, breakpoint_rps, slo_rps_per_instance,
    binding_resource, headroom_pct, autoscale_threshold,
    usl_sigma, usl_kappa, usl_peak_rps }

USE attribution per resource at the breakpoint step:

use_row: { resource, utilization, saturation, errors, verdict }
         -- the resource whose util→~100% OR saturation climbs while
            throughput flattens is the binding one
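The verdict rule in the comment above can be sketched as a comparison between the step before the breakpoint and the breakpoint step itself. Everything here is an assumption for illustration: the `USERow` type, the `BindingCandidates` helper, the thresholds, and the sample rows, which are synthetic and not measurements.

```go
package main

import "fmt"

// USERow mirrors use_row from the data model, minus the verdict.
type USERow struct {
	Resource    string
	Utilization float64 // 0..1
	Saturation  float64 // resource-specific: run-queue length, pool waiters, ...
	Errors      int
}

// BindingCandidates returns resources that, while throughput has flattened,
// either sit at or above utilLimit or show saturation that grew by more than
// satGrowth (a fraction) since the previous step.
func BindingCandidates(prev, cur []USERow, throughputFlat bool, utilLimit, satGrowth float64) []string {
	if !throughputFlat {
		return nil
	}
	before := make(map[string]USERow, len(prev))
	for _, r := range prev {
		before[r.Resource] = r
	}
	var out []string
	for _, r := range cur {
		p, seen := before[r.Resource]
		climbing := seen && r.Saturation > p.Saturation*(1+satGrowth)
		if r.Utilization >= utilLimit || climbing {
			out = append(out, r.Resource)
		}
	}
	return out
}

func main() {
	prev := []USERow{{"cpu", 0.61, 0, 0}, {"db_pool", 0.95, 3, 0}, {"downstream", 0.40, 0, 0}}
	cur := []USERow{{"cpu", 0.64, 0, 0}, {"db_pool", 1.00, 41, 0}, {"downstream", 0.42, 0, 0}}
	fmt.Println(BindingCandidates(prev, cur, true, 0.98, 0.5))
}
```

If more than one resource comes back, the rule has not disambiguated anything on its own and the profiles still have to decide which is cause and which is effect.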

8. Interface contract

Runner config: -target, -start-rps, -step-rps, -dwell, -max-rps, -mix (read/write ratio), -seed, -out results.csv.
Target knobs (for the bottleneck-removal experiment): -pool, -workers, -cpus, GOMAXPROCS.
GET /metrics on both target and runner → Prometheus exposition.
Output: one results.csv per ramp (the ramp_step rows) + a derived capacity_model.json (the single row in §7). Every run reproducible from a committed command line + config.
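One way the runner flags listed above might be parsed with the standard `flag` package is sketched below. The flag names come from the contract; the `Config` type, the `ParseConfig` helper, the default values, and the `http://localhost:8080` address are assumptions made for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"time"
)

// Config holds the runner settings named in the interface contract.
type Config struct {
	Target   string
	StartRPS int
	StepRPS  int
	MaxRPS   int
	Dwell    time.Duration
	Mix      string
	Seed     int64
	Out      string
}

// ParseConfig parses runner flags from args and applies basic sanity checks.
// Defaults are illustrative, not prescribed by the lab.
func ParseConfig(args []string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("runner", flag.ContinueOnError)
	fs.StringVar(&c.Target, "target", "", "base URL of the service under test")
	fs.IntVar(&c.StartRPS, "start-rps", 50, "offered rate of the first step")
	fs.IntVar(&c.StepRPS, "step-rps", 50, "increment between steps")
	fs.IntVar(&c.MaxRPS, "max-rps", 1000, "last offered rate to run")
	fs.DurationVar(&c.Dwell, "dwell", 60*time.Second, "hold time per step")
	fs.StringVar(&c.Mix, "mix", "90/10", "read/write ratio")
	fs.Int64Var(&c.Seed, "seed", 1, "seed for payloads and key distribution")
	fs.StringVar(&c.Out, "out", "results.csv", "per-step results file")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.Target == "" {
		return c, errors.New("-target is required")
	}
	if c.StartRPS <= 0 || c.StepRPS <= 0 || c.MaxRPS < c.StartRPS {
		return c, errors.New("need 0 < start-rps <= max-rps and step-rps > 0")
	}
	return c, nil
}

func main() {
	cfg, err := ParseConfig([]string{"-target", "http://localhost:8080", "-max-rps", "400", "-dwell", "90s"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Logging the fully resolved config next to results.csv keeps each run reproducible from its committed command line, as the contract requires.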

9. Key technical challenges

Reading the latency curve correctly. It has three regimes: a flat floor (latency ≈ service time, queueing negligible), a gentle rise → knee (queues forming; latency now service time + wait), then the cliff (utilization → 1, latency → ∞ in theory, timeouts in practice). The knee is an inflection, not a threshold — find it by the second derivative, not by eyeballing where it "looks bad."
Attributing the bottleneck, not guessing it. High CPU might be effect, not cause — if the real limit is a 20-connection DB pool, threads spin waiting and CPU looks busy. USE (Utilization, Saturation, Errors) per resource disambiguates: the binding resource is the one whose saturation climbs (run queue, pool wait, IO wait) while throughput flattens.
Little's Law as a lie-detector. L = λ·W: in-flight concurrency = arrival rate × residence time. At any steady step, predicted L must match measured in-flight. A mismatch means coordinated omission, a closed-loop generator, or a measurement bug — fix it before you trust the curve.
The USL knee vs the cliff. The Universal Scalability Law predicts that throughput rises, bends (contention σ), then can retrograde (coherency κ) — more load yields less work (lock convoy, GC thrash, retry storms). Distinguish a flat-topped knee (saturation) from a retrograde collapse (negative scaling).
Spike ≠ ramp. A service can pass a slow ramp to 5k req/s yet die on an instant 0→5k spike: cold pools, cold caches, autoscaler lag, and queue overflow all bite at once. Recovery time after the spike is its own number.
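Locating the knee by curvature and the breakpoint by SLO crossing can be sketched over the per-step rows. Because a queueing latency curve is convex throughout, the sketch takes the knee to be the first step whose discrete second difference of p99 exceeds a fraction of the latency floor. That threshold rule, the equal-step assumption, the `Step` type, and both function names are assumptions made here, and the rows in `main` are synthetic.

```go
package main

import "fmt"

// Step is a cut-down ramp_step row.
type Step struct {
	OfferedRPS float64
	P99Ms      float64
	ErrRate    float64 // fraction: 0.001 means 0.1%
}

// Knee returns the index of the first interior step whose second difference
// of p99 exceeds frac times the floor latency (p99 of the first step), or -1.
// It assumes equal offered-rate increments between steps.
func Knee(steps []Step, frac float64) int {
	if len(steps) < 3 {
		return -1
	}
	floor := steps[0].P99Ms
	for i := 1; i < len(steps)-1; i++ {
		d2 := steps[i+1].P99Ms - 2*steps[i].P99Ms + steps[i-1].P99Ms
		if d2 > frac*floor {
			return i
		}
	}
	return -1
}

// Breakpoint returns the index of the first step where p99 reaches the SLO
// or the error rate exceeds maxErr, or -1 if the ramp never got there.
func Breakpoint(steps []Step, sloMs, maxErr float64) int {
	for i, s := range steps {
		if s.P99Ms >= sloMs || s.ErrRate > maxErr {
			return i
		}
	}
	return -1
}

func main() {
	steps := []Step{
		{50, 20, 0}, {100, 21, 0}, {150, 22, 0}, {200, 30, 0},
		{250, 60, 0}, {300, 240, 0.002}, {350, 900, 0.05},
	}
	k := Knee(steps, 0.25)
	b := Breakpoint(steps, 200, 0.001)
	if k >= 0 && b >= 0 {
		fmt.Printf("knee=%.0f req/s breakpoint=%.0f req/s\n",
			steps[k].OfferedRPS, steps[b].OfferedRPS)
	}
}
```

A `Breakpoint` result of -1 means the ramp stopped too early, which the lab treats as an incomplete run rather than as good news.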

10. Experiments to run (break it / tune it)

Record before/after numbers and attach the curve for each.

1. Step-ramp to knee + breakpoint. Ramp orders-api from 50 req/s in +50 steps, 60 s dwell, until p99 crosses SLO and error rate climbs. Plot throughput (achieved req/s) on x vs p50/p99/p999 on y. Mark the knee (latency inflection) and the breakpoint (SLO crossing). Measure: knee_rps, breakpoint_rps, the latency floor, and the multiplier between them.
2. USE bottleneck identification at saturation. At the breakpoint step, sample every resource. Measure: utilization + saturation + errors for CPU, memory/GC, IO, mutex, and the downstream gRPC call. Fill the USE table; name the one resource whose saturation rises while throughput flattens. Capture a pprof CPU + mutex profile as evidence.
3. Little's Law cross-check. At three steps (below knee, at knee, past knee), compute predicted L = λ·W from the achieved rate and the mean residence time (mean latency, not a percentile); compare to measured in-flight. Measure: the % error at each point. >10% error ⇒ investigate (likely coordinated omission or a closed generator).
4. Spike test + recovery. Hold baseline (e.g. 30% of knee), then jump instantly to 10× for 60 s, then back. Measure: peak p99 during the spike, error count, and recovery time (seconds until p99 returns to baseline). Compare to where the same rate sat on the slow ramp.
5. Remove the bottleneck → move the knee. Based on §10.2, relieve the proven binding resource (e.g. raise DB pool 20→60, or GOMAXPROCS 2→4, or scale the downstream). Re-run the ramp. Measure: new knee_rps and breakpoint_rps. They must move in the predicted direction, and a new binding resource should appear (you moved the bottleneck, you didn't delete it).
6. Soak at the knee. Hold knee-ish rate for ≥ 1 h. Measure: p99 drift, RSS trend, GC pause trend, connection/file-descriptor counts. A rising line means the "capacity" number is only valid for a few minutes, which is disqualifying.
7. Derive the capacity model + fleet sizing. From the SLO-bounded rate, produce: per-instance capacity at SLO, headroom-adjusted operating rate, autoscale-out threshold (as a % of knee or a CPU/saturation proxy that correlates with the knee), and N = ceil(peak_rps / slo_rps_per_instance / headroom) for a stated peak, where headroom is the operating fraction (e.g. 0.70 for a 70% target). Measure: write the one-row model in §7.
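The fleet-sizing formula from the last experiment is simple enough to sketch directly. Reading "headroom" as the operating fraction (0.70 for a 70% target) is an assumption about how the formula is meant to be applied; `FleetSize` is a hypothetical helper, and the 5000 req/s peak and 300 req/s per instance in `main` are made-up inputs.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// FleetSize computes N = ceil(peak_rps / slo_rps_per_instance / headroom),
// where operatingFraction is the share of per-instance capacity you plan to
// use (0.70 leaves 30% in reserve).
func FleetSize(peakRPS, sloRPSPerInstance, operatingFraction float64) (int, error) {
	if peakRPS <= 0 || sloRPSPerInstance <= 0 || operatingFraction <= 0 || operatingFraction > 1 {
		return 0, errors.New("inputs must be positive and operatingFraction must be in (0, 1]")
	}
	return int(math.Ceil(peakRPS / sloRPSPerInstance / operatingFraction)), nil
}

func main() {
	n, err := FleetSize(5000, 300, 0.70)
	if err != nil {
		panic(err)
	}
	fmt.Println("instances:", n) // 5000 / 300 / 0.70 ≈ 23.8, rounded up to 24
}
```

Rounding up rather than to the nearest integer matters here: 23 instances would each have to run above the planned operating fraction at peak.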

11. Milestones

Target up with its constraint knob; 01 rig wired in open-model mode; Grafana board showing offered/achieved/p99/in-flight on one timeline.
First clean step-ramp with proper dwell and steady state per step (experiment 1); knee and breakpoint marked on the curve.
USE sampler aligned to the load clock; bottleneck named with pprof evidence (experiment 2); Little's Law check passing within ±10% (experiment 3).
Spike + recovery and soak runs (experiments 4, 6); USL fit computed.
Bottleneck-removal ramp showing the knee moved (experiment 5); capacity model and fleet-sizing note written (experiment 7).

12. Acceptance criteria (definition of done)

A throughput-vs-latency curve, ramped past the breakpoint, with knee and breakpoint marked and their req/s stated.
The binding resource named and proven at the breakpoint (USE table + pprof/iostat/pool-wait evidence), not asserted.
Little's Law cross-check within ±10% at ≥ 3 points; deviations explained.
USL fit reported with σ, κ, and predicted peak; curve overlaid on data.
Spike test recovery time measured; soak shows flat p99/RSS over ≥ 1 h.
Bottleneck-removal ramp shows the knee moved in the predicted direction and identifies the new binding resource.
A one-row capacity model: per-instance req/s at SLO, instance shape, binding resource, headroom %, autoscale threshold, and fleet size N for a stated peak — every number reproducible from a committed command.
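The Little's Law criterion above can be checked mechanically per step. In this sketch `LittleCheck` is a hypothetical helper and the three points in `main` are synthetic. It also assumes a mean latency is recorded per step, which the ramp_step row in §7 does not list explicitly, since Little's Law holds for the mean residence time rather than for p50 or p99.

```go
package main

import (
	"fmt"
	"math"
)

// LittleCheck returns the predicted concurrency L = λ·W, its relative error
// against the measured in-flight count, and whether that error is within tol.
func LittleCheck(achievedRPS, meanLatencySec, measuredInflight, tol float64) (predicted, relErr float64, ok bool) {
	predicted = achievedRPS * meanLatencySec
	if measuredInflight <= 0 {
		return predicted, math.Inf(1), false
	}
	relErr = math.Abs(predicted-measuredInflight) / measuredInflight
	return predicted, relErr, relErr <= tol
}

func main() {
	// Synthetic points: achieved req/s, mean latency (s), measured in-flight.
	points := [][3]float64{
		{100, 0.020, 2.1},
		{200, 0.050, 10.6},
		{250, 0.180, 31.0},
	}
	for _, p := range points {
		pred, relErr, ok := LittleCheck(p[0], p[1], p[2], 0.10)
		fmt.Printf("rate=%.0f predicted=%.1f measured=%.1f err=%.1f%% ok=%v\n",
			p[0], pred, p[2], relErr*100, ok)
	}
}
```

In this made-up data the third point fails the ±10% check, which is the kind of mismatch that would prompt a look at coordinated omission or an accidentally closed-loop generator before trusting the curve.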

13. Stretch goals

Two instance shapes: run the ramp on 2 vCPU and 4 vCPU; show whether capacity scales linearly per core or the binding resource doesn't follow CPU (it usually doesn't past the knee).
Validate fleet sizing for real: drive N instances from the 01 rig at the predicted peak and confirm the fleet holds SLO with the planned headroom.
Autoscaling signal selection: test whether CPU%, in-flight concurrency, or queue depth best tracks the knee, and recommend the autoscale metric with evidence (CPU often lags the true saturation signal).
Closed vs open model side-by-side: run the same target under a closed-loop generator and show how it under-reports the breakpoint.

14. Evaluation rubric

Dimension	Senior bar	Staff bar
Curve reading	Produces a throughput-vs-latency curve and points at the knee	Locates the knee by inflection, separates floor/knee/cliff regimes, ramps past the cliff
Bottleneck attribution	Notices what's saturated	Proves the binding resource via USE + profiles; predicts and confirms the next one after relief
Quantitative models	Knows Little's Law and USL exist	Uses Little's Law as a measurement check (±10%) and fits USL (σ, κ) to predict the peak
Load-test design	Runs a ramp	Picks the right test type per question; runs open-model; defends dwell and steady-state
Capacity output	States a req/s number	Ships a defensible capacity model: per-instance at SLO + headroom + autoscale threshold + fleet size, each justified
Communication	Clear findings note	Could defend every number — knee, breakpoint, binding resource, fleet N — to a staff panel

Staff bar, in one line: you walk out with a capacity model you would stake an on-call rotation and a cloud bill on — and you can show your work.

15. References

Brendan Gregg — The USE Method (Utilization, Saturation, Errors) for systematic bottleneck identification; Systems Performance (Ch. 2).
Little's Law (L = λ·W) — relating concurrency, throughput, and latency at steady state; the load-tester's lie-detector.
Neil Gunther — Universal Scalability Law (contention σ, coherency κ); the throughput-knee and retrograde-scaling model.
Grafana k6 — load-test type taxonomy (smoke / average / stress / spike / soak / breakpoint) and when each applies.
Gil Tene — "How NOT to Measure Latency" (coordinated omission; why open-model matters).
See also: Interview Question/17-performance-engineering/, Interview Question/22-scalability-and-high-availability/, Interview Question/14-system-design/.
Builds on: load-testing/01-distributed-load-generator.