Tasks
Practice computing and reasoning about failure probability. Global constraints: show your arithmetic; keep availability and unavailability straight (`U = 1 − A`); state every independence assumption you rely on, and flag where it's likely false; quote the nines table when it helps a human feel the result. Several tasks are purely numeric; do them by hand before checking. Worked solutions follow each task; cover them and try first.
Task 1 — System availability of a series chain
A request must pass through: CDN (99.99%), load balancer (99.99%), app server (99.9%), database (99.95%), payment provider (99.9%). All required. Compute system availability and translate it to downtime per year.
Solution
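A minimal sketch of the series-chain arithmetic, assuming the five components fail independently (the helper name `series_availability` is illustrative, not from any library):

```python
import math

MINUTES_PER_YEAR = 525_600

def series_availability(availabilities):
    """Availability of a chain in which every component is required.

    Assumes component failures are independent.
    """
    return math.prod(availabilities)

# CDN, load balancer, app server, database, payment provider
chain = [0.9999, 0.9999, 0.999, 0.9995, 0.999]

a_sys = series_availability(chain)
downtime_min = (1 - a_sys) * MINUTES_PER_YEAR

print(f"A = {a_sys:.4%}")
print(f"downtime = {downtime_min:.0f} min/yr = {downtime_min / 60:.1f} h/yr")
```

Summing the unavailabilities (0.01% + 0.01% + 0.1% + 0.05% + 0.1% = 0.27%) gives nearly the same answer, which is a quick way to check the product by hand.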
~99.73%. Unavailability ≈ 0.27% × 525,600 min/yr ≈ 1,418 min ≈ **~23.6 hours/year**. The two 99.9% services dominate; the next nine comes from fixing *those*, not the already-excellent CDN/LB.
Task 2 — Two replicas (and why not 99% + 99%)
(a) Two independent replicas at 99% each, system works if either works — availability? (b) Three replicas? (c) Explain why you can't add the availabilities.
Solution
(a) `U = 0.01² = 0.0001 → A = 99.99%`.
(b) `U = 0.01³ = 1e-6 → A = 99.9999%`.
(c) Availability is a probability ≤ 1; "99% + 99% = 198%" is meaningless. The system fails only if *all* replicas fail, so you multiply the **unavailabilities** (the small numbers), which collapse fast. Each additional independent 99% copy adds two more nines (2 → 4 → 6).
Task 3 — Mixed topology, find the bottleneck
LB (single, 99.9%) → one-of-two app servers (each 99%) → one-of-two DB replicas (each 99.5%). Compute system availability and name the dominating component.
Solution
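A minimal sketch of the mixed-topology arithmetic, assuming all failures are independent. The helpers `parallel_availability` and `series_availability` are illustrative names: each one-of-two block is reduced to a single availability first, then the blocks are chained in series.

```python
import math

def parallel_availability(availabilities):
    """One-of-N block: down only if every replica is down (independence assumed)."""
    return 1 - math.prod(1 - a for a in availabilities)

def series_availability(availabilities):
    """All-required chain (independence assumed)."""
    return math.prod(availabilities)

lb = 0.999                                        # single load balancer
app_pair = parallel_availability([0.99, 0.99])    # U = 0.01^2
db_pair = parallel_availability([0.995, 0.995])   # U = 0.005^2

a_sys = series_availability([lb, app_pair, db_pair])

print(f"app pair = {app_pair:.4%}, db pair = {db_pair:.4%}")
print(f"system   = {a_sys:.4%}")
```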
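App pair: `U = 0.01² = 0.0001 → A = 99.99%`; DB pair: `U = 0.005² = 0.000025 → A = 99.9975%`; system: `0.999 × 0.9999 × 0.999975 ≈ 0.99888 → 99.89%`. The **single load balancer (99.9%)** dominates: the redundant blocks are near-perfect, so the lone SPOF caps the system. Adding LB redundancy is the highest-value next move.
Task 4 — The correlation floor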
Two replicas, each 99% available. 90% of each replica's downtime is independent, 10% is common-cause (shared deploy + AZ). Compute realistic system availability and compare to the naive independent result.
Solution
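One simple way to model the split, offered as a sketch rather than the only valid model: with probability `p_cc` both replicas are down together; otherwise each replica is down independently with the remaining share of its unavailability. The names `u_ind`, `u_naive`, `u_real`, and `u_three` are illustrative; `p_cc` follows the task's notation.

```python
u_replica = 0.01              # each replica is 99% available
p_cc = 0.10 * u_replica       # common-cause share: takes both replicas down at once
u_ind = 0.90 * u_replica      # independent share, per replica

# Naive result: pretend all downtime is independent.
u_naive = u_replica ** 2

# Realistic result: common cause, OR both replicas fail independently.
u_real = p_cc + (1 - p_cc) * u_ind ** 2

# A third replica only shrinks the independent term; p_cc stays put.
u_three = p_cc + (1 - p_cc) * u_ind ** 3

print(f"naive      A = {1 - u_naive:.4%}")
print(f"realistic  A = {1 - u_real:.4%}")
print(f"3 replicas A = {1 - u_three:.4%}")
```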
A 0.1% common-cause term dragged 99.99% down to **99.89%**: most of the second replica's value evaporated. A third replica would barely help, because `p_cc` is a floor. Fix: put replicas in different AZs and deploy in waves to shrink `p_cc`.
Task 5 — Union bound on a deploy
A release ships changes to 60 services, each with a 0.4% chance of a regression. Bound the probability that at least one regresses, then give the exact independent figure. What does it argue for?
Solution
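A minimal sketch comparing the union bound with the exact figure, assuming regressions in the 60 services are independent (variable names are illustrative):

```python
n_services = 60
p_regression = 0.004   # 0.4% per service

# Union bound: P(at least one) <= sum of the individual probabilities.
union_bound = n_services * p_regression

# Exact value under independence: 1 - P(no service regresses).
exact = 1 - (1 - p_regression) ** n_services

print(f"union bound = {union_bound:.1%}")
print(f"exact       = {exact:.1%}")
```

The bound needs no independence assumption and is always at least as large as the exact value, so it is a safe first estimate when the per-item probabilities are small.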
Union bound: `60 × 0.004 = 0.24`; exact under independence: `1 − 0.996⁶⁰ ≈ 0.214`. That is ~1-in-4 to 1-in-5 even though each service is 99.6% safe. It argues for **staged/decomposed releases** (fewer services per deploy, canary, bake time) so a regression hits a wave, not the fleet.
Task 6 — Rare-per-request becomes common at scale
An endpoint has a 1-in-1,000,000 chance of a fatal edge-case bug per request. It serves 8,000,000 requests/day. Expected hits per day? Probability of at least one hit per day?
Solution
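A minimal sketch of both quantities, assuming each request hits the bug independently with the same probability (variable names are illustrative). The Poisson approximation `1 − e^(−λ)` is included because it is the version that is easy to do by hand.

```python
import math

p_bug = 1e-6                  # per-request probability of the fatal edge case
requests_per_day = 8_000_000

expected_hits = requests_per_day * p_bug

# Exact under independence, and the Poisson approximation with lambda = expected_hits.
p_at_least_one = 1 - (1 - p_bug) ** requests_per_day
p_poisson = 1 - math.exp(-expected_hits)

print(f"expected hits/day = {expected_hits:.1f}")
print(f"P(>=1 hit/day)    = {p_at_least_one:.4%} (Poisson: {p_poisson:.4%})")
```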
Expected hits: `8,000,000 × 10⁻⁶ = 8` per day; P(at least one) `≈ 1 − e⁻⁸ ≈ 99.97%`. "One in a million" is hit ~8 times daily here. At scale, every rare edge case *will* fire, so handle it instead of assuming it won't happen.
Task 7 — MTBF/MTTR and choosing a lever
Service X: MTBF 240 h, MTTR 3 h. (a) Availability? (b) You can either double MTBF or halve MTTR for the same cost — which gives more availability? Compute both.
Solution
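A minimal sketch of the comparison, using the steady-state formula `A = MTBF / (MTBF + MTTR)` (the function name `availability` is illustrative):

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

baseline = availability(240, 3)
double_mtbf = availability(480, 3)     # option 1: fail half as often
halve_mttr = availability(240, 1.5)    # option 2: recover twice as fast

print(f"baseline    = {baseline:.3%}")
print(f"double MTBF = {double_mtbf:.3%}")
print(f"halve MTTR  = {halve_mttr:.3%}")
```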
(a) `A = 240 / (240 + 3) ≈ 98.77%`. (b) Identical *here*: doubling MTBF gives `480 / 483` and halving MTTR gives `240 / 241.5`, both ≈ 99.38%, because `A = MTBF / (MTBF + MTTR)` depends only on the ratio `MTTR/MTBF` and either move halves it. In practice, though, **halving MTTR is far cheaper** (rollback, alerting, automation) than doubling MTBF (fighting entropy and the `p_cc` floor), so pick MTTR.
Task 8 — Build a risk matrix and critique it
Place these on a 3×3 probability×impact matrix, then state one limitation of the placement: (a) flaky CI test, fails ~daily, costs 10 min of dev time; (b) region outage, ~once/3yr, costs $4M; (c) expired TLS cert, ~once/yr if unmonitored, takes site down 30 min.
Solution
| | Low impact | Med impact | High impact |
|---|---|---|---|
| High prob | (a) flaky CI 🟢 | | |
| Med prob | | (c) TLS cert 🟠 | |
| Low prob | | | (b) region outage 🔴? |

**Limitation:** the matrix buckets (b) and (c) by coarse labels, but (b)'s $4M is fat-tailed and ruin-adjacent, so "low prob / high impact = medium-ish" *understates* it. The matrix hides that `0.33/yr × $4M ≈ $1.3M/yr` expected loss dwarfs (c)'s `1/yr × small`. Decide with the numbers, not the cell color.
Task 9 — Find the correlated failure
A team reports: "We run 3 app replicas and 3 DB replicas, all 99.9%, so we're effectively bulletproof." Their setup: all 6 are in us-east-1a, deployed simultaneously by one CI job, reading config from a single config service. List the correlated failure modes and estimate the real ceiling.
Solution
Common-cause modes (each bypasses all the redundancy):

- **Same AZ**: a `us-east-1a` outage kills all 6.
- **Same deploy**: one bad CI rollout ships to all 6 at once.
- **Same config service**: a bad config push poisons all 6.

The independent-replica math (`U = 0.001³`) is fantasy; availability is governed by `p_cc` of {AZ outage, bad deploy, bad config}, which is likely **~99.9% or worse**, not the "nine nines" they imagine. Fix: spread AZs, stage deploys, validate + stage config. Redundancy without independence is theater.
Task 10 — Series chain target
You need the overall request path to hit 99.95%. It has 4 required components in series, currently 99.99%, 99.99%, 99.95%, 99.9%. (a) Current system availability? (b) Does it meet target? (c) Cheapest single change to reach it?
Solution
(a) `0.9999 × 0.9999 × 0.9995 × 0.999 ≈ 0.99830` → 99.83%
(b) No: 99.83% < 99.95%.
(c) The 99.9% component dominates the shortfall, but no single change is enough: even a perfect fourth component leaves `0.9999 × 0.9999 × 0.9995 ≈ 0.99930` → 99.93%. Replace it or add redundancy to reach ~99.99%:
`0.9999 × 0.9999 × 0.9995 × 0.9999 ≈ 0.99920` → 99.92% (still short)
Also lift the 99.95% component to 99.99%:
`0.9999 × 0.9999 × 0.9999 × 0.9999 ≈ 0.99960` → 99.96% ✓
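The search in part (c) can be sketched as a brute-force check over which components to upgrade. This assumes, as the solution above does, that an upgraded component lands at 99.99% and that failures are independent; `UPGRADED`, `with_upgrades`, and the other names are illustrative.

```python
import math
from itertools import combinations

TARGET = 0.9995
UPGRADED = 0.9999          # assumed availability of any component after upgrading
chain = [0.9999, 0.9999, 0.9995, 0.999]

def with_upgrades(chain, indices):
    """Return a copy of the chain with the chosen components lifted to UPGRADED."""
    return [UPGRADED if i in indices else a for i, a in enumerate(chain)]

# Try 0 upgrades, then 1, then 2, ... and stop at the first count that reaches the target.
for k in range(len(chain) + 1):
    hits = [idx for idx in combinations(range(len(chain)), k)
            if math.prod(with_upgrades(chain, idx)) >= TARGET]
    if hits:
        break

# Upper limit for any single fix: the worst component made perfect.
best_single = math.prod(chain[:3])

print(f"upgrades needed = {k}, components (0-based) = {hits}")
print(f"best possible with one change = {best_single:.4%}")
```

With these inputs the only pair that works is the third and fourth components together, which matches the hand calculation.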
Task 11 — Fault tree and SPOF identification
"Checkout fails" if: the DB cluster is down OR the payment provider is down OR (app replica 1 AND app replica 2 are both down). DB-down = 0.001, payment-down = 0.001, each app replica down = 0.02 (independent). Compute P(checkout fails) and identify the SPOFs.
Solution
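A minimal sketch of the fault-tree arithmetic, assuming every basic event is independent (variable names are illustrative). The AND gate multiplies probabilities; the OR gate is one minus the product of the complements.

```python
p_db_down = 0.001
p_payment_down = 0.001
p_app_replica_down = 0.02

# AND gate: both app replicas must be down.
p_app_pair_down = p_app_replica_down ** 2

# OR gate: checkout fails if any branch fails.
p_checkout_fails = 1 - (
    (1 - p_db_down) * (1 - p_payment_down) * (1 - p_app_pair_down)
)

# Rare-event shortcut: sum of the branch probabilities (an upper bound).
p_sum = p_db_down + p_payment_down + p_app_pair_down

print(f"P(app pair down)   = {p_app_pair_down:.4%}")
print(f"P(checkout fails)  = {p_checkout_fails:.4%} (sum of branches: {p_sum:.4%})")
```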
`P(app pair down) = 0.02² = 0.0004`, so `P(checkout fails) = 1 − (0.999 × 0.999 × 0.9996) ≈ 0.0024` (about 0.24%). **SPOFs = the OR branches that fail alone:** the DB cluster and the payment provider each take checkout down by themselves and *dominate* the redundant app pair. Next moves: add redundancy to the DB and add a secondary payment provider, not more app replicas.
Task 12 — Justify or reject a mitigation
Proposal: spend $400k/yr on a hot-standby second region. Region outages occur ~0.25/yr; an outage costs ~$1M in lost revenue + reputational harm, and you estimate failover would cut that loss by 80%. (a) Is it justified on expected value? (b) When might you fund it anyway? (c) What turns it into theater?
Solution
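A minimal sketch of the expected-value comparison for part (a), using only the figures in the task (variable names are illustrative). The break-even line is an added convenience: it shows how large a single outage would have to be before the standby pays for itself on expected value alone.

```python
outages_per_year = 0.25
loss_per_outage = 1_000_000     # dollars
loss_reduction = 0.80           # share of the loss that failover avoids
standby_cost = 400_000          # dollars per year

expected_loss = outages_per_year * loss_per_outage
avoided_loss = expected_loss * loss_reduction
net_benefit = avoided_loss - standby_cost

# Loss per outage at which avoided loss equals the standby cost.
break_even_loss = standby_cost / (outages_per_year * loss_reduction)

print(f"expected loss   = ${expected_loss:,.0f}/yr")
print(f"avoided loss    = ${avoided_loss:,.0f}/yr")
print(f"net benefit     = ${net_benefit:,.0f}/yr")
print(f"break-even loss = ${break_even_loss:,.0f} per outage")
```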
(a) Expected loss is `0.25/yr × $1M = $250k/yr`; failover avoids 80% of it, `$200k/yr`, against a `$400k/yr` cost. On pure expected value it is **not justified** (net ≈ −$200k/yr).
(b) Fund anyway if the downside is **fat-tailed / ruin-class**, e.g. the "$1M" understates true loss (regulatory exposure, churn, a contractual SLA penalty, or a tail where an outage runs days not hours). For ruin-class risk you rationally pay a **tail premium** above EV.
(c) It becomes **theater** if the standby region is never failover-tested, drifts out of config parity, or can't actually take traffic: an untested DR region changes the outcome by ~0. Justification requires *drilled* failover (game days), or the $400k buys a false sense of safety.
See also
- Levels: junior · middle · senior · professional · interview
- Siblings: base rates and expected value, reasoning under uncertainty, estimation under uncertainty
- Related: systems thinking, evaluating tradeoffs objectively, the section root, and the Engineering Thinking overview.