Tasks
Practice problems for reasoning under uncertainty. Global constraints: show your work; for every Bayesian problem, use the natural-frequency method (imagine a concrete population) and state the answer as a probability; round sensibly and call out when a "high-accuracy" detector is actually untrustworthy. Worked answers are in collapsible blocks — try each one before expanding. Concepts come from middle.md and senior.md.
Task 1 — Compute the posterior: monitoring alert
Your alert has a 95% true-positive rate and a 3% false-positive rate. Real incidents occur in 0.2% of minutes. The alert fires. What's the probability of a real incident? Then state what you'd change to make the alert trustworthy.
Answer
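The natural-frequency method can be sketched in a few lines of Python. This is a minimal sketch rather than a prescribed solution: the function name `natural_frequency_posterior` and the imagined population of 1,000,000 minutes are illustrative assumptions, while the three rates come from the task statement.

```python
def natural_frequency_posterior(population, base_rate, tpr, fpr):
    """Count alerts in an imagined population and return the posterior."""
    positives = population * base_rate      # minutes with a real incident
    negatives = population - positives      # quiet minutes
    true_alerts = positives * tpr           # incidents the alert catches
    false_alerts = negatives * fpr          # quiet minutes that fire anyway
    posterior = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, posterior


# Population size is an assumption; the rates are from the task.
true_alerts, false_alerts, posterior = natural_frequency_posterior(
    population=1_000_000, base_rate=0.002, tpr=0.95, fpr=0.03
)
print(f"true alerts:  {true_alerts:,.0f}")
print(f"false alerts: {false_alerts:,.0f}")
print(f"P(incident | alert) = {posterior:.1%}")
```

Rerunning with a smaller `fpr` shows how quickly the posterior improves once false alerts stop outnumbering true ones.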
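Per 1,000,000 minutes: 2,000 contain a real incident, and the alert fires on 1,900 of them (95%). The other 998,000 minutes are quiet, yet 3% of them (29,940) fire anyway. P(incident | alert) = 1,900 / 31,840 ≈ **6%.** Even an alert with a 95% true-positive rate is right only about 1 time in 17. Fix: drive down the **false-positive rate** (multi-signal, longer windows, burn-rate alerting); at a 0.2% base rate, the FP rate dominates.
Task 2 — The rare-disease structure, applied to fraud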
A fraud model flags transactions. True-positive rate 90%, false-positive rate 2%. 0.5% of transactions are actually fraudulent. A transaction is flagged. P(fraud | flagged)? If you auto-block on a flag, what fraction of blocked users are innocent?
Answer
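The counts behind this answer can be laid out explicitly. The sketch below assumes an imagined batch of 100,000 transactions and uses integer arithmetic so every count is exact; the variable names are illustrative, and the rates come from the task statement.

```python
transactions = 100_000                     # imagined population (assumption)

fraud = transactions * 5 // 1000           # 0.5% base rate
legit = transactions - fraud

flagged_fraud = fraud * 90 // 100          # 90% true-positive rate
flagged_legit = legit * 2 // 100           # 2% false-positive rate

p_fraud_given_flag = flagged_fraud / (flagged_fraud + flagged_legit)
innocent_share = 1 - p_fraud_given_flag
innocent_per_fraudster = flagged_legit / flagged_fraud

print(f"flagged: {flagged_fraud} fraud, {flagged_legit} legitimate")
print(f"P(fraud | flagged) = {p_fraud_given_flag:.1%}")
print(f"innocent share of blocks = {innocent_share:.1%}")
print(f"innocent users blocked per fraudster = {innocent_per_fraudster:.1f}")
```

Printing the ratio alongside the posterior makes the cost of auto-blocking concrete: it is a count of affected customers, not only a percentage.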
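Per 100,000 transactions: 500 are fraudulent, of which 450 are flagged (90%); 99,500 are legitimate, of which 1,990 are flagged (2%). P(fraud | flagged) = 450 / 2,440 ≈ **18%**, so **~82% of blocked users are innocent.** Auto-blocking on this flag harms more than 4 legitimate users for every fraudster caught, which is usually unacceptable. Use the flag to *trigger review*, not to block.
Task 3 — Update on odds (the head-math method)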
Prior that a bug is in your code (vs. the library): 30%. New evidence: the same failure reproduces on a fresh checkout of the library's own example. You judge this evidence ~6× more likely if the bug is in the library than in your code. Posterior?
Answer
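The odds form of Bayes' rule (posterior odds = prior odds × likelihood ratio) is short enough to check in code. In this sketch the helper name `update_on_odds` is an assumption. The likelihood ratio is written as 1/6 because the evidence is about 6× more likely under "library" than under "your code", and the hypothesis being tracked is "your code".

```python
def update_on_odds(prior, likelihood_ratio):
    """Convert a probability to odds, apply the LR, and convert back."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Hypothesis: "the bug is in my code". Evidence favors the library 6:1,
# so the likelihood ratio for this hypothesis is 1/6.
posterior = update_on_odds(prior=0.30, likelihood_ratio=1 / 6)
print(f"P(bug in my code | evidence) = {posterior:.1%}")
```

Tracking the hypothesis you care about and inverting the ratio when the evidence points the other way avoids the common slip of multiplying by 6 instead of 1/6.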
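Prior odds that the bug is in your code: 30:70 ≈ 0.43. The evidence is ~6× more likely under "library", so the likelihood ratio for "your code" is 1/6. Posterior odds ≈ 0.43 × 1/6 ≈ 0.071, which is a probability of 0.071 / 1.071 ≈ 7%. Your belief drops from 30% to **~7%**: strong evidence it's the library. Disconfirming evidence (LR < 1) correctly moved you down.
Task 4 — Sequential updating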
A latency spike appears. Prior the new deploy caused it: 40%.
- E1: spike started 3 min after deploy. LR = 4 (toward deploy).
- E2: a service that didn't get the deploy shows the same spike. LR = 0.15 (against).
Final probability the deploy is the cause?
Answer
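Sequential updating is repeated multiplication on the odds scale. The sketch below keeps the intermediate probabilities so each step is visible. The helper names and the `trail` list are illustrative, and treating the two likelihood ratios as simply multiplying assumes the pieces of evidence are conditionally independent given the cause.

```python
def to_odds(probability):
    return probability / (1 - probability)


def to_probability(odds):
    return odds / (1 + odds)


# (label, likelihood ratio toward "the deploy caused it"), from the task.
evidence = [
    ("E1: spike 3 min after deploy", 4.0),
    ("E2: same spike on an un-deployed service", 0.15),
]

odds = to_odds(0.40)                 # prior: 40% the deploy is the cause
trail = [to_probability(odds)]
for label, likelihood_ratio in evidence:
    odds *= likelihood_ratio         # one Bayesian update per piece of evidence
    trail.append(to_probability(odds))
    print(f"{label}: {trail[-1]:.0%}")

final = trail[-1]
```

Because multiplication commutes, the order in which the evidence arrives does not change the final figure under this independence assumption.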
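Prior odds: 0.40 / 0.60 ≈ 0.67. After E1: 0.67 × 4 ≈ 2.67 (about 73%). After E2: 2.67 × 0.15 = 0.40, a probability of 0.40 / 1.40 ≈ **29%.** E1 raised suspicion, but E2 (the same symptom on a service that did not get the deploy) pulled it back below the prior. Look for a shared dependency instead.
Task 5 — What's the real false-positive experience?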
A security scanner has a 1% false-positive rate — "barely any false alarms," says the vendor. Your monorepo scan touches 80,000 code locations, of which ~40 are genuine issues (caught at 100%). How many alerts does an analyst see, and what fraction are noise?
Answer
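The alert volume follows directly from the counts in the task. This is a small sketch with illustrative variable names, and `recall = 1.0` encodes the task's statement that genuine issues are caught at 100%.

```python
locations = 80_000        # code locations scanned
real_issues = 40          # genuine issues among them
fp_rate = 0.01            # vendor's 1% false-positive rate
recall = 1.0              # genuine issues are always caught (per the task)

true_alerts = real_issues * recall
false_alerts = (locations - real_issues) * fp_rate
total_alerts = true_alerts + false_alerts
noise_share = false_alerts / total_alerts

print(f"alerts to triage: {total_alerts:.0f}")
print(f"share that is noise: {noise_share:.0%}")
```

Changing `locations` while holding `real_issues` fixed shows that the noise share is driven by scan volume, not by the vendor's headline rate.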
False alarms: 79,960 clean locations × 1% ≈ 800. Real findings: 40. The analyst wades through **~840 alerts, ~95% noise** (800 / 840), to find the 40 real ones. A "1% FP rate" is brutal at scale because the base rate of real issues is tiny (40 in 80,000, or 0.05%). This is why a "low FP rate" must be judged against volume, not in isolation.
Task 6 — Calibrate this claim
A teammate says: "I'm 99% sure this refactor has no regressions — it passed all unit tests." List the unstated assumptions inflating that 99%, and give a more honest figure with reasoning.
Answer
Assumptions hidden in "99%": tests cover the changed behavior; no integration/contract gaps; prod data resembles test data; no concurrency/timing paths untested; "passed" means meaningful coverage, not just green. Unit-test pass is *one* piece of evidence, not proof. Honest figure: maybe **80–90%**, explicitly conditioned: "high confidence on the tested paths; the untested risk is the integration with X and prod-scale data." The skill is refusing to collapse partial evidence into near-certainty.
Task 7 — Point estimate that lies
Two API designs report the same mean latency of 100ms. Design A: p50=95ms, p99=130ms. Design B: p50=40ms, p99=950ms. Which would you ship, and why is the mean useless here?
Answer
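Tail amplification under fan-out can be estimated with one line of probability. The sketch below assumes each downstream call independently has a 1% chance of landing at or beyond p99. The function name `p_hits_tail` and the fan-out sizes are hypothetical, chosen only to illustrate the trend.

```python
def p_hits_tail(fan_out, tail_fraction=0.01):
    """P(at least one of `fan_out` independent calls lands in the slow tail)."""
    return 1 - (1 - tail_fraction) ** fan_out


# Fan-out sizes are hypothetical examples, not measurements.
for fan_out in (1, 5, 10, 50):
    print(f"fan-out {fan_out:>2}: {p_hits_tail(fan_out):.1%} of requests see a p99-or-worse call")
```

With a request that waits on all of its calls, the slowest call sets the response time, so a tail that affects 1% of single calls reaches a much larger share of user-facing requests.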
Ship **A**. The mean (100ms, identical) hides everything that matters: B's p99 of 950ms means 1% of requests take 950ms or longer, nearly 10× the mean. Real users feel that tail acutely, and it is amplified when a request fans out to several B-like services. The mean is a point estimate that erases variance; **percentiles reveal the experience.** Never compare on mean alone for latency.
Task 8 — Classify the uncertainty
For each, label risk / uncertainty / ignorance and name the right tool: (a) annualized failure rate of a known SSD model; (b) whether a 3-week-old open-source dependency will have a critical CVE this year; (c) an attack vector no one on the team has conceived of.
Answer
- (a) **Risk**: measured distribution → *compute* (redundancy math, EV).
- (b) **Uncertainty** (Knightian): outcome known, probability genuinely unknown → *hedge*: pin versions, isolate, keep it swappable, monitor advisories.
- (c) **Ignorance**: unknown unknown → *detect & recover*: defense-in-depth, observability, bounded blast radius, fast patch path.
Matching the tool to the type is the whole point ([senior.md](senior.md)).
Task 9 — Expected value flips the decision
Two deploys, each 3% chance of failure. Deploy A: failure = 5-min auto-rollback, no data impact. Deploy B: failure = corrupts a financial ledger, multi-day manual recovery, customer trust damage. Same probability — argue the opposite decisions.
Answer
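A rough numeric sketch makes the asymmetry visible. The cost figures here are hypothetical placeholders in engineer-hours, invented purely for illustration. The task gives no costs, only the shared 3% failure probability.

```python
P_FAILURE = 0.03          # from the task: both deploys fail 3% of the time

# Hypothetical costs in engineer-hours (assumptions, not data).
COST_A_HOURS = 1.0        # 5-minute auto-rollback plus a quick check
COST_B_HOURS = 2_000.0    # multi-day ledger recovery across a team


def expected_harm(p_failure, cost_if_failure):
    return p_failure * cost_if_failure


ev_a = expected_harm(P_FAILURE, COST_A_HOURS)
ev_b = expected_harm(P_FAILURE, COST_B_HOURS)
print(f"Deploy A expected harm: {ev_a:.2f} h")
print(f"Deploy B expected harm: {ev_b:.2f} h")
print(f"B is {ev_b / ev_a:,.0f}x worse at the same probability")
```

Expected hours still understate B, since irreversibility and lost customer trust do not convert cleanly into a single number. That is a reason to treat the multiplication as a floor on caution, not a complete decision rule.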
Decide on `P × cost`, not P alone. A: EV of harm ≈ 0.03 × (trivial) → **ship**, it's a two-way door. B: EV of harm ≈ 0.03 × (catastrophic, partly irreversible) → **do not ship as-is**; make it reversible first (dry-run on a copy, transactional + backup, canary on 1% of accounts), reducing either the probability or the cost-if-wrong before committing. Consequence asymmetry, not probability, drives the call.
Task 10 — Why 50/50 is the wrong default
A new hire reasons: "I have no information about whether this feature flag is on in prod, so it's 50/50." Critique this, and show how a base rate changes it.
Answer
50/50 is a *strong claim* of maximum uncertainty, not a neutral non-answer, and it ignores cheap base-rate evidence. If 95% of flags in this service default off and are rarely enabled, the prior that the flag is on is roughly 5%, not 50%. Defaulting to 50/50 is base-rate neglect in disguise. The fix: *find the base rate* (config defaults, deploy history) before assuming ignorance.
Task 11 — Score your calibration
You logged 20 predictions, each of which you called "90% confident." 14 came true. Are you calibrated? Compute a quick Brier contribution and state the correction.
Answer
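The calibration check and the Brier score are both one-liners over a prediction log. In this sketch the lists `forecasts` and `outcomes` stand in for such a log (the names and layout are assumptions); the counts, 20 forecasts at 90% with 14 hits, come from the task.

```python
forecasts = [0.9] * 20                 # stated confidence for each prediction
outcomes = [1] * 14 + [0] * 6          # 1 = came true, 0 = did not

hit_rate = sum(outcomes) / len(outcomes)

# Brier score: mean squared gap between stated probability and outcome.
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(f"stated 90%, observed {hit_rate:.0%}")
print(f"Brier score = {brier:.2f} (0 is perfect; always saying 50% scores 0.25)")
```

Bucketing a real log by stated confidence (60%, 70%, 80%, 90%) and repeating the hit-rate line per bucket gives a full calibration curve.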
14/20 = **70%** actual vs **90%** stated → **overconfident** (the typical engineer error). Per-prediction Brier: correct ones contribute (0.9−1)²=0.01, wrong ones (0.9−0)²=0.81. Mean = (14×0.01 + 6×0.81)/20 = (0.14+4.86)/20 = **0.25**. That is poor: it matches the score of answering 50% on everything, because the squared term punishes confident misses. Correction: when you *feel* 90%, say ~70% until your buckets line up. (Tetlock, *Superforecasting*.)
Task 12 — Communicate it honestly
Rewrite this for a VP without losing credibility: "The migration should be fine, we've tested it a bunch."
Answer
"We're ~80% confident the migration completes cleanly in the 2-hour window. The main risk (the 20%) is row volume on the `orders` table, which we couldn't fully load-test. If we hit it, recovery is a rollback to the pre-migration snapshot: about 30 minutes, no data loss. We'll run a dry-run against a prod-sized copy on Tuesday; if that's red we'll reschedule rather than risk the window." It has: quantified confidence, the named tail, the cost if wrong, a mitigation, and a trigger date, which is exactly the structure from [professional.md](professional.md).
Next: carry these into the sibling topics: base rates and expected value, risk and failure probabilities, and estimation under uncertainty.