Skip to content

Latency and Throughput — Senior Level

Roadmap: Performance → Latency and Throughput The middle page taught you to read a percentile chart. This page is about the physics underneath it: why p99 of one backend becomes the p50 of a fan-out request, why a queue's response time explodes as utilization passes 70%, and why your load generator has almost certainly been lying to you about tail latency for years.


Table of Contents

  1. Introduction
  2. Prerequisites
  3. Queueing Theory — Why Response Time Is Nonlinear
  4. Variability Kills Tails — Kingman's Approximation
  5. The Tail at Scale — Why Fan-Out Amplifies p99
  6. Tail-Tolerance Techniques — Hedging, Tied Requests, Micro-Partitioning
  7. Coordinated Omission — How Load Tests Hide Latency
  8. Percentile Statistics Pitfalls — Don't Add, Can't Average
  9. USE and RED — Two Methods, Two Vantage Points
  10. Latency Budgets and Backpressure as Latency Protection
  11. Mental Models
  12. Common Mistakes
  13. Test Yourself
  14. Cheat Sheet
  15. Summary
  16. Further Reading
  17. Related Topics

Introduction

Focus: The math and the measurement discipline a senior engineer uses to reason about tails — where almost all real latency pain lives.

By the middle level you can read a latency histogram, talk about p50/p95/p99, and explain why averages are useless for user-facing latency. That makes you literate. The senior jump is mechanistic: you can predict why a system's tail will blow up before it does, you can prove a load test is undercounting, and you can name the specific architectural lever that will pull p99 down.

Three bodies of knowledge make that possible, and they reinforce each other. Queueing theory tells you that response time is not linear in load — it has a vertical asymptote, and you are usually operating closer to it than you think. The Tail at Scale (Dean & Barroso, 2013) tells you that in a fan-out architecture, the rare slow response of one component becomes the common slow response of the aggregate request — so a 1-in-100 event at the leaf becomes a 1-in-2 event at the root. And coordinated omission tells you that the tools most teams use to measure all of this systematically erase the worst latencies, so you are optimizing against a fiction.

This page is those three things at the level where you can do the arithmetic, write the corrective code, and make the call.


Prerequisites

  • Required: You've internalized middle.md — percentiles, histograms, the difference between latency and throughput, and Little's Law as a relationship (L = λ · W).
  • Required: Comfort with basic probability — independence, the meaning of a percentile, and that P(all of N events) = p^N for independent events.
  • Helpful: You've run a load test (wrk, k6, JMeter, Gatling, Vegeta) and stared at a p99 number you didn't fully trust.
  • Helpful: You've operated a service behind a load balancer or service mesh (Envoy, gRPC) and seen a tail-latency incident.

Queueing Theory — Why Response Time Is Nonlinear

Every server is a queue. Requests arrive, wait if the server is busy, get served, and leave. The single most important consequence of treating it as a queue is this: response time does not grow linearly with load — it grows hyperbolically and has a vertical asymptote at 100% utilization. Engineers who reason linearly ("we're at 50%, so doubling traffic is fine") get destroyed by this.

Start with the canonical model, M/M/1: Markovian (Poisson) arrivals at rate λ, Markovian (exponential) service at rate μ, one server. Define utilization ρ = λ/μ (the fraction of time the server is busy; the system is only stable if ρ < 1). The closed-form results:

Utilization        ρ = λ / μ
Avg # in system    L = ρ / (1 − ρ)
Avg response time  W = (1/μ) / (1 − ρ) = Ws / (1 − ρ)      ← Ws = 1/μ = base service time
Avg wait in queue  Wq = ρ · Ws / (1 − ρ)

The term that matters is the 1/(1 − ρ) factor. It is the amplification the queue applies to your base service time. Tabulate it and the danger becomes visceral:

Utilization ρ 1/(1−ρ) Response time vs. service time
0.50 2.0 2× the bare service time
0.70 3.3 3.3×
0.80 5.0
0.90 10.0 10×
0.95 20.0 20×
0.99 100.0 100×

A service whose base latency is 1 ms running at 50% utilization responds in ~2 ms. The same code at 90% responds in ~10 ms; at 99% it responds in ~100 ms. Nothing about the work changed — only the queue. This is why capacity planning that targets 80–90% CPU "to save money" is a tail-latency time bomb: you are buying a 5–10× latency multiplier.

Adding servers helps, but with diminishing geometry. The M/M/c model (c parallel servers, one shared queue — the correct model for a thread pool or a connection pool) replaces the simple factor with the Erlang-C probability P_wait that an arriving request must queue:

Wq = P_wait · Ws / (c · (1 − ρ))      where ρ = λ / (c · μ)   (per-server utilization)

The crucial, counter-intuitive M/M/c result: one big pool of c servers beats c independent pools of one server each, because a single shared queue lets any idle server pick up any waiting request. This is the queueing-theory argument against sharding work into per-core or per-shard queues with no work-stealing — partitioned queues idle while a neighbor's queue backs up. (Go's runtime scheduler embodies the opposite lesson: per-P run queues plus work-stealing, to get partition locality without the partitioned-queue tail.)

Key insight: Response time isn't linear in load — it's Ws / (1 − ρ). The headroom you keep below 100% utilization isn't waste; it is the budget that pays for your tail latency. The difference between 70% and 95% utilization is the difference between a 3× and a 20× latency multiplier on identical code.


Variability Kills Tails — Kingman's Approximation

M/M/1 assumes exponential service times. Real service times are not exponential — they have their own variability, and that variability is the dominant driver of queue-induced latency. The result that captures this is Kingman's formula (the VUT equation), the approximation for the general G/G/1 queue (general arrival distribution, general service distribution, one server):

              ρ        c_a² + c_s²
Wq  ≈   Ws · ─────  ·  ───────────
            1 − ρ          2

         └── Utilization ──┘   └── Variability ──┘

Read the three factors — Kingman calls them V·U·T:

  • T = Ws, the base service Time.
  • U = ρ/(1−ρ), the Utilization amplifier — the same hyperbola as before.
  • V = (c_a² + c_s²)/2, the Variability term, where c_a is the coefficient of variation of inter-arrival times and c_s the coefficient of variation of service times. (c = σ/mean; for an exponential distribution c = 1, which recovers M/M/1.)

The variability term is the senior's lever. It says queue wait scales with the square of the coefficient of variation of your service time. Halve your service-time variance and you roughly halve the queueing delay at any given utilization. Concretely: a service whose requests take a uniform 1 ms (c_s ≈ 0) queues far less than one averaging 1 ms but occasionally taking 50 ms (high c_s), even at identical mean throughput.

This is why these things destroy tails out of proportion to their average cost:

  • GC pauses (a Java stop-the-world or a Go STW phase) inject occasional multi-millisecond service times into an otherwise sub-millisecond distribution — they spike c_s.
  • Cold caches, lock contention, occasional disk hits, TLS handshakes — anything that makes some requests dramatically slower than the median.
  • Slow-path branches (validation failures, retries, large payloads) that share the same server pool as fast requests.

The mitigation strategy follows directly: reduce service-time variance, not just the mean. Isolate slow work onto separate pools so it can't inject variance into the fast path; cap request work; use allocation-light hot paths to avoid GC; tune the GC for pause time over throughput when latency is the SLO (Go's GC, the JVM's ZGC/Shenandoah). You are managing the second moment of the distribution, not the first.

Key insight: At fixed throughput, the thing that determines your tail is service-time variance, and Kingman shows it enters quadratically (c_s²). A latency engineer's job is as much about making service times uniform as making them fast — a predictable 2 ms beats a 1 ms-average that occasionally hits 40 ms.


The Tail at Scale — Why Fan-Out Amplifies p99

Here is the result that reorganizes how you think about distributed latency. Dean & Barroso's observation: in a service that fans out to many backends and waits for all of them, the latency of the slowest backend determines the response — so a rare slow event at the leaf becomes a common slow event at the root.

Make it arithmetic. Suppose each backend independently responds within its p99 latency 99% of the time (i.e., 1% chance of being "slow"). A root request that fans out to N backends and must wait for all N is fast only if every backend is fast:

P(root request is "fast")  = (0.99)^N
P(root hits a slow backend) = 1 − (0.99)^N
Fan-out N P(at least one slow backend)
1 1%
5 4.9%
10 9.6%
100 63%
200 87%

At a fan-out of 100, a per-backend 1-in-100 tail event becomes a 63% chance per root request. The leaf's p99 has become worse than the root's p50. This is not a bug in any one service — it is the statistics of "wait for the slowest of N." Modern services routinely fan out to dozens or hundreds of shards (a search query, a feed assembly, a fan-out write), so this is the normal operating regime, not an edge case.

A second amplifier compounds it: the more components a request touches, the higher the chance of hitting each component's own tail. A request that traverses 5 services in series, each with a 1% chance of a slow response, has a 1 − 0.99^5 ≈ 4.9% chance of a slow leg somewhere — even though no single service degraded.

The structural consequences Dean & Barroso draw:

  • Reducing component-level tail latency is leveraged: shaving the leaf p99 helps the root p50 dramatically because of the exponent.
  • You cannot eliminate tails by averaging more requests — fan-out makes them more likely, not less.
  • The right defenses are architectural, treating tail latency as a property to be tolerated rather than eliminated, exactly as fault-tolerance treats failures.

Key insight: (0.99)^N is the whole story. A 1% per-backend tail is invisible at N=1 and catastrophic at N=100. In any fan-out system, the leaf's p99 is the budget that sets the root's median, which is why "just make the common case fast" is not enough — you must make the tail of every dependency bounded.


Tail-Tolerance Techniques — Hedging, Tied Requests, Micro-Partitioning

If you can't eliminate tails, you tolerate them. Dean & Barroso name a toolkit; each trades a little extra work for a much tighter tail.

Hedged requests. Send the request to one replica; if no response arrives within, say, the p95 latency (a brief deferral, not immediate duplication), send a second copy to another replica and take whichever returns first. The first request covers the common case; the hedge covers the tail. Because you only hedge the slow ~5%, the extra load is small (~5%) but the tail improvement is large — the hedge wins whenever the original was about to be a tail event.

// Hedged request: fire the backup only if the primary is slow.
func hedged(ctx context.Context, call func(context.Context) (Resp, error), hedgeAfter time.Duration) (Resp, error) {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // cancel the loser as soon as we have a winner

    results := make(chan result, 2)
    go func() { r, e := call(ctx); results <- result{r, e} }()

    select {
    case res := <-results:
        return res.r, res.err            // primary returned before the hedge deadline
    case <-time.After(hedgeAfter):       // primary is in the slow tail — hedge it
        go func() { r, e := call(ctx); results <- result{r, e} }()
        res := <-results
        return res.r, res.err            // whichever replica answers first
    }
}

Tied requests. Hedging still wastes work — both copies may run to completion. Tied requests enqueue the request on two replicas simultaneously, but each copy carries the identity of the other; when one replica starts executing, it sends a cancellation to its twin. The tie is broken by which queue drains first (cheap to detect), not by waiting out a timeout, so you get tail protection with minimal duplicated execution — only duplicated queueing. This requires cross-replica cancellation, which is why it lives in the data layer (Google's storage systems) rather than in application code.

Request cancellation / deadline propagation. The dual of hedging: when a response is no longer needed (the winner returned, or the client's deadline passed), cancel the loser so it stops consuming capacity — capacity that would otherwise raise ρ and worsen everyone's tail (per Kingman). In gRPC this is automatic via deadline propagation: a deadline set at the edge flows through every downstream call, and an expired deadline cancels the in-flight RPC chain. Honor ctx.Done() everywhere; a request that keeps working after its deadline is pure tail-inflating waste.

Micro-partitioning. Slice data into many more partitions than machines (say 20× more), so each machine holds dozens of small partitions. When one machine is slow or hot, its partitions are cheaply redistributed in small increments rather than moving one giant shard — and load balances at fine granularity. Fine partitions also make recovery and rebalancing fast, which keeps transient hotspots from becoming sustained tail events.

Good-enough / quorum responses. When fan-out makes "wait for all N" untenable, return once a quorum (or "enough") of backends have replied, accepting slightly stale or incomplete results to escape the slowest backend's tail. A search root that has heard from 95% of its leaves can return rather than wait for the last 5% — trading a sliver of completeness for a bounded p99. The 5% of leaves still in flight are then cancelled.

Key insight: Every tail-tolerance technique is the same bet — spend a little extra capacity or accept a little staleness to cut the dependence on the slowest unit of work. Hedging duplicates ~5% of requests; tied requests duplicate queueing but not execution; quorums drop the slowest few percent of results. All of them attack the (0.99)^N exponent by making the root not wait on its slowest leaf.


Coordinated Omission — How Load Tests Hide Latency

This section will change how you trust every latency number you've ever seen. Coordinated omission is a systematic measurement error, named by Gil Tene, in which a load generator's own slowness causes it to skip sending requests during the exact periods the system is slowest — so the worst latencies are never recorded, and the reported tail is wildly optimistic.

The mechanism, in the common synchronous-loop load generator:

loop:
    t0 = now()
    response = send_request()      // blocks until the response comes back
    record(now() - t0)
    // immediately loop and send the next request

The bug is "immediately loop and send the next request." If the server stalls for 1 second (a GC pause, a lock convoy, a failover), the loop blocks inside send_request() for that whole second. It records exactly one slow sample (~1000 ms). But during that stalled second, at the intended rate of, say, 1000 req/s, 1000 requests that should have been sent were never sent. In the real world those 1000 requests did arrive — they queued behind the stall and each experienced part of that 1-second delay (1000 ms, 999 ms, 998 ms, …, down to ~1 ms). The load generator silently omitted all 1000, recording only the single straggler.

The damage to percentiles is enormous. Tene's canonical illustration: a system that runs at 0 ms for 100 seconds and then stalls for 100 seconds reports a naive p99.99 of essentially 0 ms — because the long stall produced only one giant sample — when the true p99.99, accounting for everyone who waited, is on the order of tens of seconds. Coordinated omission can understate the tail by one to two orders of magnitude. A reported p99 of 5 ms can hide a true p99 of 500 ms.

There are two correct fixes, and good tools implement both:

  1. Constant-rate (open-model) load. Schedule each request at its intended send time (start_time = i / rate) regardless of when the previous one finished. If the response comes back late, the intended-to-actual gap is the real latency that request would have experienced. This is the open model (arrivals independent of completions) versus the buggy closed model (next arrival waits for previous completion). wrk2 (Tene's fork of wrk) does exactly this — you specify a fixed -R rate and it measures from intended schedule.

  2. Latency back-filling correction. When a sample exceeds the expected inter-request interval, synthesize the omitted samples that would have queued during the stall, decaying from the observed latency down to the interval. HdrHistogram's recordValueWithExpectedInterval(value, expectedInterval) does this: it records the real sample and the implied missing ones, reconstructing the true distribution.

// HdrHistogram with coordinated-omission correction built in.
Histogram h = new Histogram(3);                 // 3 significant digits
long expectedIntervalNanos = 1_000_000;         // 1 ms between requests at 1000 req/s

long start = System.nanoTime();
doRequest();
long elapsed = System.nanoTime() - start;
// records the sample AND back-fills the requests omitted during a stall:
h.recordValueWithExpectedInterval(elapsed, expectedIntervalNanos);

Practical rules that follow: - Use wrk2, not wrk; specify a target rate, not "as fast as possible." A "max throughput" run is intrinsically coordinated-omission-prone because there is no intended schedule to measure against. - Drive load below saturation. At ρ ≥ 1 the closed model degenerates entirely; the only meaningful latency numbers come from an open model at a sustainable rate. - Prefer tools and libraries that record into HdrHistogram (or a t-digest) with correction, and report the full distribution, not a single average.

Key insight: The cheap way to measure latency — a loop that sends the next request after the previous returns — cannot see the latencies it causes by stalling, because the load generator stalls in lockstep with the server. Honest latency measurement requires an open model (constant intended rate) and/or back-filling (HdrHistogram). If your tool doesn't do one of these, your tail numbers are fiction.


Percentile Statistics Pitfalls — Don't Add, Can't Average

Percentiles are the right summary for user-facing latency, but they obey rules that violate everyone's arithmetic instincts. Get these wrong and you will report numbers that are not just imprecise but meaningless.

Percentiles don't add. The p99 of a request that traverses A then B is not p99(A) + p99(B). That sum assumes both legs hit their p99 simultaneously on the same request, which is the rare case, not the typical one. For independent legs the true end-to-end p99 is less than the sum of the per-leg p99s (you'd need both tails to coincide), and computing it requires convolving the distributions, not adding quantiles. The only correct way to get an end-to-end percentile is to measure the end-to-end latency directly and take its percentile. Decomposing a budget into per-leg p99s that sum to the target is a useful planning heuristic but is conservative — it over-provisions.

Percentiles can't be averaged. You cannot average the p99 of three machines to get the fleet p99. avg(p99_a, p99_b, p99_c) is not the p99 of the combined population — averaging quantiles is statistically nonsense. If machine A served 1M requests and B served 1K, their p99s carry vastly different weight; and even with equal weight, the quantile of a mixture is not the mean of the components' quantiles. The same applies across time: you cannot average per-minute p99s to get an hourly p99.

So how do you aggregate? You must aggregate the distributions, then compute the percentile from the merged distribution. This is exactly why mergeable sketches exist:

  • HdrHistogram uses fixed, log-linear buckets with a configured precision (significant digits). Two histograms with the same configuration merge by summing bucket countsa.add(b) — and you compute any percentile from the merged counts. Mergeability with bounded relative error is the whole point: each machine emits a histogram, a central system sums them, and the true fleet p99 falls out.
  • t-digest (Dunning) is a mergeable sketch optimized for extreme quantiles — it keeps fine resolution at the tails (p99, p99.9) and coarse resolution in the middle, so it answers p99.9 accurately with little memory. Digests merge by combining and re-clustering their centroids. This is what powers tail-percentile estimates in many metrics backends.
// Correct fleet aggregation: merge distributions, THEN read the percentile.
fleet = HdrHistogram(precision=3)
for each machine m:
    fleet.add(m.histogram)        // sum bucket counts — mergeable by construction
fleet_p99 = fleet.percentile(99)  // computed from merged counts, not from per-machine p99s

Beware the average entirely for latency. The mean is dominated by the body and blind to the tail; a service can hold a 2 ms mean while 1% of users wait 2 seconds. Report distributions and high percentiles (p99, p99.9), and remember that which users hit the tail matters — the tail is often concentrated on the heaviest, most valuable accounts.

Key insight: A percentile is a property of a distribution, so you can only combine percentiles by combining the distributions they came from. Adding per-stage p99s over-counts (they rarely coincide); averaging per-machine p99s is meaningless. Emit mergeable histograms (HdrHistogram) or tail-accurate sketches (t-digest), merge those, and read the percentile from the merged result.


USE and RED — Two Methods, Two Vantage Points

Two complementary methodologies tell you where latency is being created. They answer different questions, and a senior uses both.

USE — Utilization, Saturation, Errors (Brendan Gregg) is a resource-centric checklist. For every resource — CPU, memory, disk, network, and also software resources like thread pools, connection pools, and lock-protected sections — ask:

  • Utilization: what fraction of time is it busy? (ρ from the queueing section.)
  • Saturation: how much queued, waiting work has piled up? (run-queue length, pool wait count, socket backlog.)
  • Errors: what is the error count?

USE is how you localize a bottleneck: walk the resources, find the one that's saturated, and you've found the queue driving your 1/(1−ρ) blow-up. The Saturation signal is the early warning — utilization can read 100% while still keeping up, but growing saturation means the queue is unbounded and latency is heading for the asymptote.

RED — Rate, Errors, Duration (Tom Wilkie) is a request-centric checklist, one set of signals per service:

  • Rate: requests per second (λ).
  • Errors: failed requests per second.
  • Duration: the distribution of request latency — not a mean, a histogram.

RED is how you monitor user-facing health and is the natural fit for microservices and dashboards: every service exports the same three signals, and Duration is your latency SLO surface. In a service mesh, Envoy emits exactly these (upstream_rq_total, upstream_rq_xx, upstream_rq_time histograms) per cluster, giving you RED for free at every hop.

The two compose: RED tells you a service's latency is bad; USE tells you which resource inside it is the cause. A spiking RED Duration on service X, cross-referenced with a saturated thread pool (USE Saturation) inside X, is a complete diagnosis: requests are queueing for threads, and Wq is climbing per Kingman. The USE Saturation signal is the one that lets you act before the RED Duration crosses the SLO — which is the entire point of backpressure.

Key insight: RED watches the request stream (Rate/Errors/Duration) from the user's side; USE watches the resources (Utilization/Saturation/Errors) from the machine's side. Latency symptoms show up in RED Duration; latency causes show up in USE Saturation — and Saturation rising is your only pre-SLO-breach warning.


Latency Budgets and Backpressure as Latency Protection

A latency budget allocates an end-to-end SLO across the components of a request path so that each owner knows their slice. If the user-facing SLO is "p99 ≤ 200 ms," you decompose it down the call tree:

End-to-end p99 budget: 200 ms
├─ Edge / TLS / gateway ........... 15 ms
├─ Auth service .................... 10 ms
├─ Business logic (this service) ... 40 ms
│   ├─ serialization .............. 5 ms
│   └─ business compute ........... 35 ms
├─ Database (p99) ................. 60 ms
├─ Cache lookups (2 × p99) ........ 10 ms
├─ Downstream RPC fan-out (p99) ... 50 ms
└─ Network + headroom ............. 15 ms

Two caveats keep this honest. First — from the percentile section — the budget legs do not literally sum, because per-leg p99s rarely coincide; summing them is a conservative allocation, which is fine for planning (it over-provisions) but means the real end-to-end p99 will usually beat the sum. Second, the budget is also a deadline: propagate it as a gRPC deadline from the edge so each hop knows how much time remains, and any hop that has already blown its slice can fail fast rather than do doomed work. A budget you don't enforce as a deadline is just a spreadsheet.

When demand exceeds capacity, the queueing math is merciless: ρ → 1, Wq → ∞, the queue grows without bound, and latency runs to the asymptote for everyone. The defenses are backpressure and load shedding, and the senior framing is that they are latency-protection mechanisms, not just availability mechanisms.

Backpressure propagates "I'm full" upstream so producers slow down instead of piling work into an unbounded queue. The enabling principle is bounded queues: an unbounded queue converts overload into latency (every request waits behind the backlog) and then into OOM; a bounded queue converts overload into rejection (fast, explicit) so the requests that are admitted keep meeting their SLO. gRPC flow control (HTTP/2 windows) and Go's bounded channels are backpressure primitives — a full channel blocks the producer by design.

Load shedding is the active form: when saturation crosses a threshold, reject excess requests immediately (HTTP 429 / gRPC RESOURCE_EXHAUSTED) rather than admitting them into a growing queue. The latency logic is precise: admitting a request you can't serve within its deadline doesn't help that request and it raises ρ, inflating Wq for every other request (Kingman again). Shedding it early keeps utilization in the sane region of the 1/(1−ρ) curve. Envoy's adaptive concurrency and circuit breakers, and the CoDel/PID-controller admission schemes (drop requests that have already queued past a target latency), are this principle in production. The advanced version sheds by priority — drop the cheap, low-value, or already-late work first, protecting the requests that still have budget.

                 admit                    serve
  request ──▶ [ bounded queue ] ──▶ [ workers (c) ] ──▶ response
                    └─ full / aged past deadline ──▶ 429 / RESOURCE_EXHAUSTED  (shed)

  Principle: keep ρ in the linear region. Shedding the excess
  protects the latency of everything you DO admit.

Key insight: Backpressure and load shedding are latency mechanisms. An unbounded queue trades a throughput problem for a latency catastrophe (then an OOM); a bounded queue plus early shedding keeps utilization off the 1/(1−ρ) cliff, so the requests you admit still hit their SLO. Saying "no" fast is how you keep "yes" fast.


Mental Models

  • Every server is a queue, and response time is Ws / (1 − ρ). Utilization isn't a cost-efficiency number; it's a latency multiplier with a vertical asymptote at 100%. The headroom you keep is the budget that pays for your tail.

  • The tail is a property of variance, not the mean. Kingman's c_s² term enters quadratically: occasional slow requests (GC, contention, cold caches) wreck percentiles far out of proportion to their effect on the average. Make service times uniform, not just fast.

  • (0.99)^N is the law of fan-out. A 1% per-backend tail becomes a 63% root tail at N=100. In any fan-out system the leaf's p99 sets the root's median — so bounding every dependency's tail beats optimizing any one common case.

  • Your load test is lying until proven otherwise. A closed-loop generator omits exactly the latencies it causes. Trust only open-model (constant-rate, wrk2) measurements and/or HdrHistogram back-filling.

  • A percentile is a fact about a distribution. You can't add per-stage p99s (they rarely coincide) and you can't average per-machine p99s (meaningless). Merge the distributions (HdrHistogram / t-digest), then read the percentile.

  • Saying "no" fast is how you keep "yes" fast. Bounded queues + load shedding keep ρ in the linear region. Admitting work you can't serve on time raises everyone's Wq — shedding it protects the latency of the work you keep.


Common Mistakes

  1. Capacity-planning to 90%+ utilization to "save money." That's a 10–20× latency multiplier per the 1/(1−ρ) curve. Latency-sensitive services target 50–70% to keep the queue in its linear region; the headroom is the tail budget.

  2. Optimizing the mean service time while ignoring variance. Kingman shows c_s² drives queue wait quadratically. A change that cuts the mean 1 ms but adds occasional 50 ms spikes makes your tail worse. Isolate variance-injecting work onto its own pool.

  3. Assuming more requests average the tail away. In fan-out, more components make root tails more likely (1 − 0.99^N), not less. You cannot dilute a tail by aggregating; you must bound each dependency or use tail-tolerance.

  4. Trusting wrk/JMeter "max throughput" latency numbers. Closed-loop, saturating load generators suffer coordinated omission and understate the tail by 1–2 orders of magnitude. Use wrk2 with a fixed -R rate below saturation, recording into HdrHistogram.

  5. Summing per-stage p99s for an end-to-end p99. They almost never coincide on the same request, so the sum over-counts. It's an acceptable conservative budget, but the real end-to-end p99 must be measured directly.

  6. Averaging p99s across machines or time windows. Statistically meaningless. Merge the histograms/digests and compute the percentile from the merged distribution.

  7. Using unbounded queues / no load shedding. Overload then turns into unbounded latency for everyone and eventually OOM. Bound every queue and shed (429 / RESOURCE_EXHAUSTED) past a saturation/latency threshold, ideally by priority.

  8. Hedging or retrying without cancellation. A hedge or retry that doesn't cancel the loser doubles the load it was meant to protect against, raising ρ and worsening the very tail you're fighting. Propagate deadlines and honor ctx.Done().


Test Yourself

  1. A service has a 2 ms base service time. By the M/M/1 model, what's its mean response time at 50%, 90%, and 99% utilization? What does this say about capacity targets?
  2. State Kingman's formula and name its three factors. Why does an occasional GC pause hurt the tail more than its contribution to the average would suggest?
  3. Each of 50 backends independently returns within p99 99% of the time. A root request fans out to all 50 and waits for all. Roughly what fraction of root requests hit at least one slow backend? What's the structural lesson?
  4. Explain coordinated omission. Why does a synchronous request loop systematically miss the worst latencies, and what are the two standard corrections?
  5. You have per-machine p99 latencies from 200 hosts and want the fleet p99. Why can't you average them, and what's the correct procedure?
  6. Distinguish USE from RED. Which one warns you before the SLO breaches, and via which signal?
  7. Why is a bounded queue with load shedding a latency-protection mechanism, not just an availability one? Tie it to the queueing math.
Answers 1. `W = Ws / (1 − ρ)`. At ρ=0.5: `2/0.5 = 4 ms`. At ρ=0.9: `2/0.1 = 20 ms`. At ρ=0.99: `2/0.01 = 200 ms`. The same code is 50× slower at 99% than at 50% purely from queueing — so latency-sensitive services must keep utilization low (≈50–70%); the headroom is the tail budget. 2. `Wq ≈ Ws · [ρ/(1−ρ)] · [(c_a² + c_s²)/2]` — base **Time**, **Utilization** amplifier, **Variability** term. A GC pause spikes the *variance* of service time (`c_s`), and Kingman's variability term is quadratic in `c_s`, so it inflates queue wait far more than its small effect on the mean would imply. 3. `1 − 0.99^50 ≈ 1 − 0.605 = 0.395`, about **40%**. A 1-in-100 per-backend tail becomes a ~40% root tail. The lesson: in fan-out, the leaf p99 dominates the root's *typical* latency, so you must bound every dependency's tail or use tail-tolerance (hedging, quorums) — optimizing one common case isn't enough. 4. Coordinated omission: a closed-loop generator blocks inside `send_request()` during a server stall, so it stops issuing the requests that *would* have queued and waited — recording one slow sample instead of the thousands that really suffered. The reported tail is far too optimistic. Corrections: (a) **open-model constant-rate** load measured from intended send time (`wrk2 -R`); (b) **back-filling** the omitted samples (HdrHistogram `recordValueWithExpectedInterval`). 5. Averaging quantiles is statistically invalid — the p99 of a mixture is not the mean of the components' p99s, and hosts carry different request counts (weights). Correct: have each host emit a **mergeable distribution** (HdrHistogram with identical config, or a t-digest), **sum/merge** them, and compute the p99 from the merged distribution. 6. **USE** is resource-centric (Utilization, Saturation, Errors per resource); **RED** is request-centric (Rate, Errors, Duration per service). USE's **Saturation** signal (growing queue/backlog) rises *before* RED's Duration crosses the SLO, so it's your pre-breach warning. 7. An unbounded queue turns overload into ever-growing wait (`ρ → 1 ⇒ Wq → ∞`) for *every* admitted request, then OOM. A bounded queue + early shedding keeps ρ in the linear region of `1/(1−ρ)`, so admitting fewer requests *protects the latency of the ones you admit*. Admitting work you can't serve on time only raises ρ and inflates everyone's `Wq`.

Cheat Sheet

QUEUEING (every server is a queue)
  ρ = λ/μ              utilization (stable iff ρ < 1)
  M/M/1  W  = Ws/(1−ρ)         response time = service time × amplifier
         Wq = ρ·Ws/(1−ρ)       queue wait
  M/M/c  shared pool of c beats c single-server pools (work-stealing wins)
  1/(1−ρ): 0.5→2×  0.7→3.3×  0.8→5×  0.9→10×  0.95→20×  0.99→100×

KINGMAN (G/G/1 — variability is the lever)
  Wq ≈ Ws · ρ/(1−ρ) · (c_a² + c_s²)/2
       └Time┘ └Utilization┘  └Variability┘    c = σ/mean ; exp ⇒ c=1
  Tail ∝ c_s²  → reduce service-time VARIANCE (GC, contention, cold cache)

TAIL AT SCALE (fan-out)
  P(root slow) = 1 − (0.99)^N    N=10→9.6%  N=100→63%  N=200→87%
  tolerate, don't eliminate: hedged reqs (after p95), tied reqs (cancel twin),
  deadline propagation + cancellation, micro-partitioning, good-enough quorums

COORDINATED OMISSION (your load test lies)
  closed loop blocks during stalls → omits the latencies it causes
  FIX: open model, constant rate  → wrk2 -R <rate>  (below saturation)
  FIX: back-fill omitted samples  → HdrHistogram.recordValueWithExpectedInterval

PERCENTILES (facts about distributions)
  don't ADD per-stage p99s (rarely coincide → over-counts; OK as conservative budget)
  can't AVERAGE per-host p99s (meaningless)
  DO: merge distributions then read pctile — HdrHistogram.add / t-digest merge
  t-digest = tail-accurate (p99.9); HdrHistogram = log-linear buckets, mergeable

METHODS
  USE  = Utilization, Saturation, Errors   per RESOURCE (find the bottleneck)
  RED  = Rate, Errors, Duration            per SERVICE  (user-facing health)
  Saturation rises BEFORE Duration breaches → act early

PROTECTION
  latency budget = SLO split across path + propagated as gRPC deadline
  bounded queues + load shedding (429 / RESOURCE_EXHAUSTED) keep ρ off the cliff
  shed by priority: drop cheap/late/low-value first

Summary

  • Every server is a queue, and M/M/1 gives response time Ws / (1 − ρ) — a hyperbola with a vertical asymptote at 100% utilization. Running hot (90–99%) buys a 10–100× latency multiplier on identical code; headroom is the tail budget. M/M/c shows one shared pool beats many private queues.
  • Kingman's G/G/1 approximation (Wq ≈ Ws · ρ/(1−ρ) · (c_a²+c_s²)/2) makes variance the senior's lever: queue wait scales with c_s², so occasional slow requests (GC, contention) destroy tails out of proportion to their mean. Manage the second moment of service time.
  • The Tail at Scale: in fan-out, P(root slow) = 1 − (0.99)^N — a 1% leaf tail is 63% at N=100. The leaf p99 sets the root median, so you tolerate tails (hedged/tied requests, deadline propagation + cancellation, micro-partitioning, good-enough quorums) rather than pretend to eliminate them.
  • Coordinated omission makes closed-loop load tests omit the very latencies they cause, understating the tail by 1–2 orders of magnitude. Measure honestly with an open model (wrk2 -R) below saturation and/or HdrHistogram back-filling.
  • Percentiles are properties of distributions: you can't add per-stage p99s (they rarely coincide) or average per-host p99s (meaningless). Emit mergeable HdrHistograms or tail-accurate t-digests, merge those, then read the percentile.
  • USE localizes the bottleneck resource (Utilization/Saturation/Errors); RED monitors user-facing health (Rate/Errors/Duration). Saturation rising is your only pre-breach warning — which is what backpressure and load shedding act on, keeping ρ in the linear region so the requests you admit still meet their SLO.

You can now do the arithmetic of a tail before it bites, prove a load test is undercounting, and name the architectural lever — hedge, shed, repartition, or buy headroom — that pulls p99 down. The next layer — professional.md — operationalizes this across SLOs, error budgets, and on-call.


Further Reading

  • The Tail at Scale — Jeffrey Dean & Luiz André Barroso, CACM 2013. The foundational paper on fan-out tail amplification and tail-tolerance techniques.
  • How NOT to Measure Latency — Gil Tene (talk + slides). The definitive treatment of coordinated omission and why wrk2/HdrHistogram exist.
  • Designing Data-Intensive Applications — Martin Kleppmann, Ch. 1 (percentiles, tail latency, the amplification of slow requests).
  • Systems Performance — Brendan Gregg. The USE method and queueing-theory-grounded capacity analysis.
  • Performance Modeling and Design of Computer Systems — Mor Harchol-Balter. M/M/1, M/M/c, and the rigorous treatment of variability and tails.
  • John D. C. Little, A Proof for the Queuing Formula L = λW (1961), and Kingman's The single server queue in heavy traffic (1961) — the original results.
  • The HdrHistogram and t-digest project docs — mergeable, tail-accurate distribution recording.
  • Envoy docs on circuit breaking, adaptive concurrency, and outlier detection; gRPC docs on deadlines and deadline propagation.