Circuit Breaker, Bulkhead & Timeout Budgets
One slow dependency should cost you that dependency — not your whole service. Build the three primitives that contain a partial failure (breaker, bulkhead, timeout/retry budget), then prove under load that they turn a cascade into a bounded, fast-failing brownout. Numbers, not vibes.
| Tier | Resilience |
| Primary domain | Cascading-failure isolation |
| Skills exercised | Circuit breakers, bulkhead pools, context deadline propagation, timeout/retry budgets, fallbacks, load-shedding, Go (sony/gobreaker, golang.org/x/sync/semaphore, context) |
| Interview sections | 13 (distributed systems), 22 (scalability & HA), 9 (networking) |
| Est. effort | 3–4 focused days |
1. Context
You own checkout-api. It calls three downstreams synchronously per request: pricing (fast, critical), recommendations (slow-ish, optional), and fraud (medium, critical). It runs on a fixed worker budget — a Go HTTP server with a bounded number of in-flight goroutines and a connection pool of 100 to each downstream.
At 02:00 recommendations GCs badly and its p99 jumps from 40 ms to 8 s. Within ninety seconds checkout-api is fully down, even though recommendations is optional and was never meant to be on the critical path. Every worker is parked waiting on a recommendations call, the connection pool is drained, new requests queue, and the load balancer health check times out. One non-critical, slow dependency took the whole service offline. That is a cascading failure, and it is one of the most common ways distributed systems die.
Your job: reproduce that cascade on purpose, then add circuit breakers, bulkheads, and timeout/retry budgets and prove the service stays serving (degraded but up) while the same dependency is just as broken. You will produce before/after numbers for the cascade, not a description of one.
2. Goals / Non-goals
Goals
- Reproduce a real cascade: inject latency into one downstream and show checkout-api goes from healthy to fully unavailable, and explain the exact exhaustion mechanism (goroutines / connection pool / accept queue).
- Add a circuit breaker (closed/open/half-open) per downstream and show the caller fast-fails instead of parking workers once the dependency is unhealthy.
- Add bulkhead isolation (separate concurrency pools per downstream) and show one bad dependency can no longer starve calls to the healthy ones.
- Implement timeout budgets via context deadline propagation across the call chain, not fixed per-hop timeouts, and show the difference under a slow chain.
- Add a retry budget and demonstrate the retry-storm amplification it prevents.
- Add fallbacks / graceful degradation so optional dependencies failing degrade the response instead of failing the request.

Non-goals
- A general service mesh (Istio/Linkerd/Envoy). You may reference how they do it, but implement the primitives yourself so you understand the state machine.
- Async/queue-based decoupling. That is a different mitigation (see events/). Here the calls are synchronous and request-scoped on purpose.
- Autoscaling your way out. The whole point is a fixed capacity envelope.
3. Functional requirements
- A checkout-api service (cmd/checkout) that, per request, calls three downstream stubs (pricing, recommendations, fraud) and returns a quote. recommendations is optional; the other two are required.
- Fault-injectable downstreams (cmd/downstream) whose per-endpoint latency, error rate, and error type are controllable at runtime (flag/env/HTTP admin endpoint), so you can dial in "slow", "erroring", or "slow + erroring".
- A resilience layer, toggleable by flag so you can run with and without protection on identical load:
  - Circuit breaker per downstream: closed → open on a trip condition, half-open probing, close on probe success. Expose state transitions.
  - Bulkhead per downstream: a bounded concurrency permit pool (semaphore.Weighted or a buffered-channel token pool); reject/shed when full.
  - Timeout budget: a per-request deadline set at ingress and propagated via context.Context to every hop; each hop spends from the remaining budget.
  - Retry budget: retries gated by a token-bucket / ratio cap so retries can never exceed e.g. 10% of base traffic to a downstream.
  - Fallback: on recommendations open-circuit or shed, return a cached / empty recommendation set and serve the request anyway.
- A load driver (cmd/load) that holds a fixed open-model request rate while you inject and remove faults.
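The bulkhead and fallback requirements above can be sketched together with nothing but the standard library. This is a minimal illustration, assuming the buffered-channel token pool variant; the names Bulkhead, ErrBulkheadFull, and recommendationsOrFallback are hypothetical, and the sketch falls back on any error, not only on shed or open-circuit.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBulkheadFull signals that no permit was free; the caller sheds or falls back.
var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead is a bounded permit pool backed by a buffered channel.
type Bulkhead struct{ permits chan struct{} }

func NewBulkhead(size int) *Bulkhead {
	return &Bulkhead{permits: make(chan struct{}, size)}
}

// Do runs fn if a permit is free and rejects immediately otherwise.
// It never queues, so a slow dependency cannot park callers here.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.permits <- struct{}{}:
		defer func() { <-b.permits }()
		return fn()
	default:
		return ErrBulkheadFull
	}
}

// InUse reports permits currently held (a natural Prometheus gauge).
func (b *Bulkhead) InUse() int { return len(b.permits) }

// recommendationsOrFallback returns (items, degraded). On any failure it
// serves an empty set so the checkout request itself still succeeds.
func recommendationsOrFallback(b *Bulkhead, fetch func() ([]string, error)) ([]string, bool) {
	var items []string
	err := b.Do(func() error {
		var fetchErr error
		items, fetchErr = fetch()
		return fetchErr
	})
	if err != nil {
		return []string{}, true
	}
	return items, false
}

func main() {
	b := NewBulkhead(1)
	fetch := func() ([]string, error) { return []string{"sku-1"}, nil }

	b.permits <- struct{}{} // simulate the only permit being held by a stuck call
	items, degraded := recommendationsOrFallback(b, fetch)
	fmt.Println(items, degraded)

	<-b.permits // the stuck call finishes and releases its permit
	items, degraded = recommendationsOrFallback(b, fetch)
	fmt.Println(items, degraded)
}
```

Rejecting in the default branch instead of blocking is the design choice that matters: a full pool costs the caller a channel operation, not a parked goroutine.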
4. Load & data profile
- Base load: sustained 2,000 req/s to checkout-api for runs of ≥ 10 min, open-model (fixed arrival rate, not closed-loop; closed-loop hides cascades because it self-throttles when the service slows).
- Fan-out: each checkout request makes 3 downstream calls; recommendations is ~30% of total downstream call volume.
- Capacity envelope (fixed, documented): checkout-api runs with a known in-flight ceiling and per-downstream pool of 100 connections. Write the envelope down; the cascade math depends on it (Little's Law below).
- Healthy baseline latencies: pricing p50 8 ms, fraud p50 25 ms, recommendations p50 40 ms.
- Fault profile: inject recommendations p99 → 5–8 s (slow), and a separate run at 40% error rate (errors), and a combined run.
Little's Law is the whole lab in one line: in-flight = arrival_rate × latency. At 2,000 req/s, if a dependency's effective latency goes from 0.04 s to 8 s, the in-flight count for that call balloons by 200×. Without isolation, that in-flight demand consumes your shared worker/connection budget and everything behind it starves.
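As a worked instance of that line, the arithmetic below plugs in the figures already stated in this section (2,000 req/s, 40 ms healthy, 8 s broken, a pool of 100). It is a sketch only: it treats the quoted latency as the effective per-call latency, and the function name inFlight is an assumption made for illustration.

```go
package main

import "fmt"

// inFlight applies Little's Law: concurrency = arrival rate x latency.
func inFlight(arrivalPerSec, latencySec float64) float64 {
	return arrivalPerSec * latencySec
}

func main() {
	const rate = 2000.0 // req/s, from the load profile
	const pool = 100.0  // connections per downstream, from the capacity envelope

	healthy := inFlight(rate, 0.040) // recommendations at 40 ms
	broken := inFlight(rate, 8.0)    // recommendations at 8 s

	fmt.Printf("healthy: %.0f in flight against a pool of %.0f\n", healthy, pool)
	fmt.Printf("broken:  %.0f in flight, %.0fx the pool\n", broken, broken/pool)
}
```

Healthy demand is 80 concurrent calls, which fits inside the pool of 100. Broken demand is 16,000, so the pool saturates almost immediately and every further caller waits on it.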
5. Non-functional requirements / SLOs
The contract is about what the caller does while a downstream is broken.
| Metric | Target |
|---|---|
| checkout-api availability while recommendations is fully down (slow or erroring) | ≥ 99% of requests succeed (degraded, no recommendations); the cascade must not happen |
| checkout-api p99 while a non-critical downstream is broken | ≤ 1.5× the healthy-baseline p99 (you fast-fail, you don't park) |
| Fast-fail latency once a breaker is open | < 5 ms to reject/fallback (no network attempt, no timeout wait) |
| Required-dependency (pricing/fraud) success rate while recommendations is broken | ≥ 99.9%, i.e. unaffected; the bulkhead proves isolation |
| Retry-induced extra load on a struggling downstream | ≤ 1.1× base call rate (retry budget caps amplification at ≤ 10%) |
| Recovery: time from downstream-healthy to breaker-closed & full throughput | < 30 s, and measured, with no thundering-herd re-open |
| Without protection (control run) | Reproduce full outage: availability → ~0%, all workers parked. This is a required deliverable, not a failure. |
The win condition is not "no errors." It's: the broken dependency's blast radius is the broken dependency. The service degrades and survives; the SLOs above are how you prove the blast radius was contained.
6. Architecture constraints & guidance
- Go, synchronous calls. This is the realistic and dangerous case. Each downstream call is an HTTP round-trip from a request goroutine.
- Breaker: sony/gobreaker is the canonical choice (closed/open/half-open with ReadyToTrip, Interval, Timeout, MaxRequests for half-open). You may hand-roll one to show you understand the state machine, but then justify your trip math.
- Bulkhead: golang.org/x/sync/semaphore.Weighted per downstream, or a buffered-channel token pool. Take a permit before the call. Plain Acquire blocks until a permit frees up or the context ends, so to fail fast on a full pool use TryAcquire (or Acquire under a very short deadline), then shed or fall back. One pool per dependency, never a single shared pool; sharing defeats the point.
- Timeout budget: set one deadline at ingress (context.WithTimeout) and pass that ctx down every hop. Each downstream client uses ctx, so a hop that runs late shortens the budget for later hops automatically. Compare against the naive "every hop gets a fixed 2 s" design and show how fixed timeouts overshoot a chain.
- Retry budget: a per-downstream token bucket; a retry costs a token, and tokens refill at a fraction of the success rate. When the bucket is empty, do not retry. This is what keeps a struggling downstream from getting retry-stormed into the ground.
- Order of the wrappers matters: breaker → bulkhead → timeout → call. Decide and justify the nesting (e.g. don't burn a bulkhead permit on a call the breaker would reject; don't let a retry re-enter a tripped breaker).
- Instrument everything with Prometheus: per-downstream breaker state, bulkhead permits in use / rejected, retries attempted / budget-denied, in-flight goroutines, connection-pool wait time, and caller p50/p99/p999 + success rate.
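The retry-budget bullet above can be sketched as a small token bucket. This is one possible shape under stated assumptions, not a reference implementation: RetryBudget and its fields are hypothetical names, credits are earned per successful call, and the balance is kept in thousandths of a token so that ten credits of 0.10 add up to exactly one retry without floating-point drift.

```go
package main

import (
	"fmt"
	"sync"
)

// RetryBudget is a token bucket counted in thousandths of a token.
type RetryBudget struct {
	mu          sync.Mutex
	milli       int // current balance
	creditMilli int // earned per successful call; 100 means a 0.10 ratio cap
	maxMilli    int // burst ceiling
}

func NewRetryBudget(creditMilli, maxMilli int) *RetryBudget {
	return &RetryBudget{creditMilli: creditMilli, maxMilli: maxMilli}
}

// OnSuccess deposits a fractional token for one successful call.
func (b *RetryBudget) OnSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.milli += b.creditMilli
	if b.milli > b.maxMilli {
		b.milli = b.maxMilli
	}
}

// TryRetry spends one whole token, or reports false when the budget is empty.
func (b *RetryBudget) TryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.milli < 1000 {
		return false
	}
	b.milli -= 1000
	return true
}

func main() {
	budget := NewRetryBudget(100, 10_000) // 10% ratio, burst of 10 retries
	base, retries, denied := 0, 0, 0
	for i := 0; i < 1000; i++ {
		base++
		if i%2 == 0 { // every other first attempt succeeds
			budget.OnSuccess()
			continue
		}
		if budget.TryRetry() {
			retries++
		} else {
			denied++
		}
	}
	fmt.Printf("base=%d retries=%d denied=%d amplification=%.2f\n",
		base, retries, denied, float64(base+retries)/float64(base))
}
```

Because tokens are earned only by successes, retries can never exceed the ratio cap times the success count, and a downstream that stops succeeding stops receiving retries altogether.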
7. Data model / state
breaker (per downstream):   state ∈ {closed, open, half_open}
                            consecutive_failures, failure_ratio over Interval,
                            opened_at, half_open_inflight
bulkhead (per downstream):  permits_total, permits_in_use, rejected_total
retry_budget (per dep):     tokens (token-bucket), ratio_cap (e.g. 0.10),
                            denied_total
request context:            deadline (absolute), remaining_budget = deadline - now
opened_at + Timeout is when the breaker allows a half-open probe; MaxRequests caps how many probes run concurrently so recovery isn't a thundering herd.
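A hand-rolled version of that state machine might look like the sketch below. It is not the sony/gobreaker API: Breaker, manualClock, newDemoBreaker, and the field names are all hypothetical. Two simplifications are assumed: counters reset on every state change instead of on a timed Interval, and the breaker closes on the first successful probe. MaxProbes caps probes admitted per half-open period, which plays the role of MaxRequests.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const Closed, Open, HalfOpen = "closed", "open", "half_open"

var ErrOpen = errors.New("circuit open")
var errDownstream = errors.New("downstream failed") // demo-only error

// Breaker trips on failure ratio. Counts reset on each state change, a
// simplification of a time-based Interval window.
type Breaker struct {
	mu                         sync.Mutex
	state                      string
	requests, failures, probes int // probes: admitted since entering half-open
	openedAt                   time.Time
	MinRequests, MaxProbes     int           // samples before the ratio counts; probe cap
	FailRatio                  float64       // trip when failures/requests >= FailRatio
	Timeout                    time.Duration // open duration before probing
	Now                        func() time.Time
}

// to switches state and clears counters; callers hold b.mu.
func (b *Breaker) to(s string) {
	b.state, b.requests, b.failures, b.probes = s, 0, 0, 0
	if s == Open {
		b.openedAt = b.Now()
	}
}

// allow decides locally, with no network attempt, whether a call may run.
func (b *Breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == Open && !b.Now().Before(b.openedAt.Add(b.Timeout)) {
		b.to(HalfOpen)
	}
	if b.state == HalfOpen {
		b.probes++
		return b.probes <= b.MaxProbes
	}
	return b.state != Open
}

// record feeds one call result back into the state machine.
func (b *Breaker) record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch {
	case b.state == HalfOpen && err != nil:
		b.to(Open) // failed probe: reopen and restart the Timeout
	case b.state == HalfOpen:
		b.to(Closed) // successful probe: resume normal traffic
	case b.state == Closed:
		b.requests++
		if err != nil {
			b.failures++
		}
		if b.requests >= b.MinRequests && float64(b.failures) >= b.FailRatio*float64(b.requests) {
			b.to(Open)
		}
	}
}

// Do runs fn through the breaker, fast-failing with ErrOpen when rejected.
func (b *Breaker) Do(fn func() error) error {
	if !b.allow() {
		return ErrOpen
	}
	err := fn()
	b.record(err)
	return err
}

type manualClock struct{ t time.Time } // lets the demo advance time without sleeping

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

func newDemoBreaker(c *manualClock) *Breaker {
	return &Breaker{state: Closed, MinRequests: 4, FailRatio: 0.5, Timeout: 10 * time.Second, MaxProbes: 1, Now: c.Now}
}

func main() {
	clock := &manualClock{}
	b := newDemoBreaker(clock)
	for i := 0; i < 4; i++ {
		b.Do(func() error { return errDownstream })
	}
	fmt.Println(b.state, b.Do(func() error { return nil })) // fn is not called while open
	clock.Advance(b.Timeout)
	fmt.Println(b.Do(func() error { return nil }), b.state) // half-open probe
}
```

Injecting the clock through Now keeps the recovery path testable without real sleeps, which helps when sweeping Timeout values.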
8. Interface contract
- POST /checkout → { quote, recommendations?, degraded: bool }. degraded: true when an optional dependency was breaker-open/shed and a fallback was served.
- GET /metrics → Prometheus exposition (all the gauges/counters in §6).
- Downstream admin: POST /admin/fault {endpoint, latency_ms, error_rate, ...} to inject and clear faults at runtime.
- Resilience configured via flags/env: -breaker, -bulkhead-size, -budget-ms, -retry-ratio, -fallback, and a master -protection=off|on for control runs.
9. Key technical challenges
- Reproducing the cascade honestly. You must see the exhaustion: in-flight goroutines climbing, connection-pool acquire-wait climbing, then accept-queue overflow and health-check failure. A cascade you can't measure isn't a cascade.
- Trip thresholds that aren't twitchy. Too sensitive → the breaker opens on normal jitter and you lose availability you didn't need to. Too loose → it opens after the cascade already started. Tune on failure ratio over a window, not a raw count, and justify the numbers.
- Half-open without a thundering herd. When the breaker probes, a flood of queued requests must not all rush the recovering downstream and re-trip it. MaxRequests + jitter; measure the recovery shape.
- Timeout budgets across a chain. Fixed per-hop timeouts sum up: three hops at 2 s each can make a request wait 6 s before failing, well past the client's patience. A propagated budget fails fast and correctly; show the distribution difference.
- Retry storms. Naive retry-on-error adds load exactly when the downstream is already failing, a positive feedback loop that turns a blip into an outage. The budget breaks the loop; you must demonstrate the amplification with and without it (this is the same failure mode as events/05-dlq-and-retry-topology).
- Interaction effects. Breaker + bulkhead + retry + timeout interact. A retry inside a bulkhead double-spends permits; a retry into an open breaker is wasted; a too-short budget makes the breaker trip on your own impatience. Get the composition right and prove it.
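The timeout-budget challenge can be made concrete with a small comparison. The sketch below scales the scenario down to milliseconds so it runs quickly; the helpers hop, fixedPerHop, propagated, and runDemo are hypothetical, and the specific durations (three 20 ms hops, a 100 ms per-hop timeout, a 30 ms request budget) are illustrative assumptions, not values from this lab.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hop simulates one downstream call that takes d unless ctx ends first.
func hop(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// fixedPerHop gives every hop a fresh timeout, so the worst case is the sum.
func fixedPerHop(perHop time.Duration, hops []time.Duration) (time.Duration, error) {
	start := time.Now()
	for _, d := range hops {
		ctx, cancel := context.WithTimeout(context.Background(), perHop)
		err := hop(ctx, d)
		cancel()
		if err != nil {
			return time.Since(start), err
		}
	}
	return time.Since(start), nil
}

// propagated sets one deadline at ingress and hands the same ctx to every hop.
func propagated(budget time.Duration, hops []time.Duration) (time.Duration, error) {
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	for _, d := range hops {
		if err := hop(ctx, d); err != nil {
			return time.Since(start), err
		}
	}
	return time.Since(start), nil
}

// runDemo runs both designs against the same slow chain.
func runDemo() (fixed, budgeted time.Duration, fixedErr, budgetErr error) {
	hops := []time.Duration{20 * time.Millisecond, 20 * time.Millisecond, 20 * time.Millisecond}
	fixed, fixedErr = fixedPerHop(100*time.Millisecond, hops)
	budgeted, budgetErr = propagated(30*time.Millisecond, hops)
	return
}

func main() {
	fixed, budgeted, fixedErr, budgetErr := runDemo()
	fmt.Printf("fixed per-hop: waited %v, err=%v\n", fixed.Round(time.Millisecond), fixedErr)
	fmt.Printf("one budget:    waited %v, err=%v\n", budgeted.Round(time.Millisecond), budgetErr)
}
```

No single hop breaches its own timeout in the fixed design, so the request completes, but only after roughly twice the time the caller was willing to wait. The propagated budget gives up at the deadline and frees the worker.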
10. Experiments to run (break it / tune it)
Record before/after numbers and a dashboard screenshot for each.
1. The cascade (control, no protection). Steady 2,000 req/s healthy. Inject recommendations p99 → 8 s. Measure: time-to-full-outage, in-flight goroutines, connection-pool acquire-wait, and checkout-api availability collapsing to ~0%. Name the exhausted resource. This is your baseline horror.
2. Circuit breaker fast-fail. Same fault, breaker on (no bulkhead yet). Measure: breaker trip time, fast-fail latency (< 5 ms once open), and caller availability holding ≥ 99%. Plot breaker state vs caller success rate.
3. Bulkhead isolation. Same fault, bulkhead on. Hold steady load to all three downstreams. Measure: pricing/fraud success rate and p99 while recommendations is fully broken, and prove they're unaffected (the permit pool for the bad dep fills and sheds; the others never see it).
4. Timeout budget vs fixed timeouts. Build a slow chain (fraud p99 → 1.5 s). Run (a) fixed 2 s-per-hop and (b) one propagated 800 ms request budget. Measure: the p99 waiting time before a request gives up, and how many requests wait far longer than the client's patience under (a).
5. Retry storm. recommendations at 50% error rate. Run (a) naive retry-3x and (b) retry budget capped at 10%. Measure: downstream call-rate amplification factor and whether the retries push the downstream from "degraded" into "down".
6. Half-open recovery tuning. After tripping the breaker, heal the downstream. Sweep Timeout (open duration) and half-open MaxRequests. Measure: recovery time to full throughput, and whether the breaker re-trips (thundering-herd) on first probe. Find the setting that recovers < 30 s with no flap.
7. Everything together, combined fault. recommendations slow + erroring, fraud mildly slow, at full load. Turn the whole stack on. Measure: the §5 SLO table end-to-end and show degraded: true responses served instead of failures.
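Before running the retry-storm experiment it can help to predict the amplification factor. The sketch below assumes attempts fail independently at a fixed error rate and reads "retry-3x" as up to three retries after the first attempt; both are assumptions, and the function names are hypothetical. Real amplification will differ once retries change the downstream's error rate.

```go
package main

import "fmt"

// naiveAmplification returns expected attempts per request when every
// failure is retried, up to maxRetries times: 1 + p + p^2 + ... + p^maxRetries.
func naiveAmplification(errRate float64, maxRetries int) float64 {
	total, term := 0.0, 1.0
	for i := 0; i <= maxRetries; i++ {
		total += term
		term *= errRate
	}
	return total
}

// budgetedAmplification applies a retry budget that limits retries to
// ratioCap of base traffic, so attempts per request never exceed 1 + ratioCap.
func budgetedAmplification(errRate float64, maxRetries int, ratioCap float64) float64 {
	naive := naiveAmplification(errRate, maxRetries)
	if limit := 1 + ratioCap; naive > limit {
		return limit
	}
	return naive
}

func main() {
	for _, p := range []float64{0.5, 1.0} {
		fmt.Printf("error rate %.0f%%: naive %.3fx, budgeted %.3fx\n",
			p*100, naiveAmplification(p, 3), budgetedAmplification(p, 3, 0.10))
	}
}
```

At a 50% error rate the naive policy predicts 1.875 attempts per request, and a fully failing downstream sees 4 attempts per request, while the budget holds both at 1.1.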
11. Milestones
- checkout-api + 3 fault-injectable downstreams + load driver + Prometheus/Grafana board (in-flight, pool-wait, p99, success rate).
- Reproduce the cascade (experiment 1) and write down the exhaustion mechanism. You cannot move on until you've made it fall over and explained why.
- Circuit breaker + fast-fail (experiment 2); breaker-state dashboard.
- Bulkhead isolation (experiment 3) proving the blast radius is contained.
- Timeout budget + retry budget (experiments 4–5); fallbacks for the optional dep.
- Half-open tuning + the combined run (experiments 6–7); findings note.
12. Acceptance criteria (definition of done)
- Cascade reproduced: a control run showing checkout-api going to near-full outage from one slow non-critical downstream, with the exhausted resource named and proven (goroutine count / pool acquire-wait graph).
- Containment proven: the same fault with protection on keeps caller availability ≥ 99% and required-dependency success ≥ 99.9% (bulkhead).
- Fast-fail measured: open-breaker reject/fallback latency < 5 ms.
- Timeout budget beats fixed-per-hop timeouts on a slow chain, with the latency distributions plotted.
- Retry storm demonstrated and tamed: amplification ≤ 1.1× with the budget, and shown to be much worse without it.
- Recovery: breaker closes < 30 s after the downstream heals, with no re-trip flap, on a graph.
- Every number reproducible from a committed command + config; resilience knobs documented with the reasoning behind each chosen value.
13. Stretch goals
- Adaptive / latency-based bulkhead (concurrency limit that shrinks as observed latency rises, à la Netflix concurrency-limits / TCP-Vegas) instead of a fixed permit count, and compare it to the fixed bulkhead under a ramping fault.
- Hedged requests for the critical path (fraud): send a second request after p95 and take the first answer. Measure tail-latency gain vs added load, and the interaction with the retry budget.
- Deadline-aware shedding: drop requests at ingress whose remaining budget is already too small to plausibly succeed, before they consume any permit.
- Breaker on partial signals: trip on rising latency (saturation) before hard errors appear, and measure how much earlier you contain the fault.
- Wire the whole thing into load-testing/02-chaos-and-fault-injection so the fault schedule is driven by the chaos harness rather than manual admin/fault calls.
14. Evaluation rubric
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Cascade understanding | Reproduces the outage | Explains it via Little's Law; names the first resource to exhaust and predicts it before measuring |
| Circuit breaker | Trips and fast-fails | Justifies trip math (ratio-over-window), tunes half-open to avoid thundering-herd, proves recovery shape |
| Bulkhead | Separate pools per dep | Proves isolation with numbers; sizes pools from the capacity envelope, not by guessing |
| Timeout budgets | Uses context deadlines | Propagates a budget across the chain; shows why fixed per-hop timeouts overshoot; deadline-aware shedding |
| Retry safety | Knows retries can amplify | Quantifies the storm; caps it with a budget; knows where retries must not re-enter a tripped breaker |
| Composition | Stacks the primitives | Gets the nesting order right and defends it; reasons about permit/breaker/retry interactions |
| Communication | Clear before/after numbers | Could defend every threshold and curve to a staff panel and tie each to an SLO |
15. References
- Go: context deadline propagation; sony/gobreaker (closed/open/half-open, ReadyToTrip, MaxRequests); golang.org/x/sync/semaphore for bulkheads.
- Michael Nygard, Release It!: circuit breaker, bulkhead, timeout, and the cascading-failure / "integration point" failure modes (the canonical source).
- Google SRE Book: "Handling Overload" and "Addressing Cascading Failures" (retry budgets, deadline propagation, load shedding).
- Netflix Hystrix (history) and concurrency-limits (adaptive bulkheads); Envoy's outlier-detection / circuit-breaking docs for how a mesh does the same job.
- See also: events/05-dlq-and-retry-topology (retry storms & backpressure), load-testing/02-chaos-and-fault-injection (driving the faults), and resilience/02-adaptive-concurrency-and-load-shedding (the adaptive sibling).
- Interview prep: Interview Question/13-distributed-systems/ (cascading failure, timeouts, idempotent retries) and Interview Question/22-scalability-and-high-availability/ (bulkheads, graceful degradation, blast-radius containment).