Middle
What? Failure probability you can engineer. At this level "risk and failure probabilities" means: combining component reliabilities into a system number (series and parallel), reasoning with MTBF/MTTR, deriving availability = MTBF / (MTBF + MTTR), and choosing deliberately between reducing failure probability and reducing time-to-recover.
How? You compute. Given a topology and per-component numbers, you produce a system availability. You add redundancy and show how it multiplies unavailability down. You measure MTBF and MTTR for your services, and you argue, with arithmetic, that cutting MTTR in half is often cheaper and more effective than chasing one more nine of MTBF.
1. Recap: Series Multiplies Down
For components that all must work (a series chain), availabilities multiply:
A_series = A₁ × A₂ × … × Aₙ
The system is never better than its worst component, and every extra imperfect hop makes it worse. This is the default for a request path: load balancer → app → cache → database → downstream API. (See junior for the worked basics.) Now we attack the other direction.
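A minimal sketch of the series rule, assuming a five-hop request path in which every hop is 99.9% available. The per-hop figure and the name `series_availability` are illustrative, not measurements.

```python
from math import prod

def series_availability(availabilities):
    """Availability of a chain in which every component must work."""
    return prod(availabilities)

# Assumed numbers: five hops (LB, app, cache, database, downstream API),
# each taken to be 99.9% available.
chain = [0.999] * 5
a_system = series_availability(chain)
print(f"{a_system:.5f}")  # about 0.99501, already below any single hop
```

Five "three nines" hops leave the path at roughly 99.5%, which is why each added dependency needs a justification.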
2. Parallel Redundancy: Unavailability Multiplies Down
Redundancy works on the failure side. Define unavailability U = 1 − A. For N identical, independent components in parallel (the system works if at least one works):
U_parallel = (U_single)^N
A_parallel = 1 − (1 − A_single)^N
You multiply the small numbers (the failure probabilities), so they shrink fast.
Worked example — two replicas at 99% each:
U_single = 1 − 0.99 = 0.01
U_parallel = 0.01 × 0.01 = 0.0001
A_parallel = 1 − 0.0001 = 0.9999 → 99.99%
Two 99% replicas give 99.99%: two nines became four nines. Three replicas:
U_parallel = 0.01 × 0.01 × 0.01 = 0.000001
A_parallel = 1 − 0.000001 = 0.999999 → 99.9999%
Each additional replica (if truly independent) adds the single component's nines again: two nines, then four, then six. This is the most powerful lever in reliability, and the one most often undermined by correlation (covered in senior).
The asymmetry to remember: in series you multiply the availabilities (close to 1 → erodes slowly but surely). In parallel you multiply the unavailabilities (small → collapses fast). Series = "every dependency is a liability." Parallel = "every spare is an insurance policy."
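The parallel rule can be checked with a few lines of Python. This is a sketch under the same independence assumption as the text; `parallel_availability` and `nines` are helper names introduced here for illustration.

```python
import math

def parallel_availability(a_single, n):
    """Availability of n identical, independent replicas where one survivor suffices."""
    return 1 - (1 - a_single) ** n

def nines(availability):
    """Count of nines, e.g. 0.99 -> 2 and 0.9999 -> 4."""
    return -math.log10(1 - availability)

for n in (1, 2, 3):
    a = parallel_availability(0.99, n)
    print(n, f"{a:.6f}", round(nines(a), 2))
```

With a 99% component the nines come out at about 2, 4, and 6 for one, two, and three replicas, so each independent copy contributes the single component's two nines again.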
3. Mixed Topologies
Real systems mix both. Reduce them piecewise — collapse each parallel block into a single equivalent availability, then multiply the blocks in series.
Example. A request needs: a load balancer (single, 99.9%) → one of two app servers (each 99%) → one of two database replicas (each 99.5%).
LB block: A = 0.999
App block: U = (1−0.99)² = 0.0001 → A = 0.9999
DB block: U = (1−0.995)² = 0.000025 → A = 0.999975
A_system = 0.999 × 0.9999 × 0.999975 ≈ 0.998875 → ~99.89%
Notice the single load balancer (99.9%) now dominates — the redundant blocks are near-perfect, so the lone component caps the whole system. Redundancy moves the bottleneck to whatever you left un-redundant. Find the single points of failure (SPOFs); they set your ceiling.
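The piecewise reduction for this example can be reproduced as follows. It is a small sketch that assumes independent replicas inside each block; `parallel_block` is an illustrative helper name.

```python
from math import prod

def parallel_block(availabilities):
    """Collapse independent parallel components into one equivalent availability."""
    return 1 - prod(1 - a for a in availabilities)

lb = 0.999                              # single load balancer
app = parallel_block([0.99, 0.99])      # one of two app servers
db = parallel_block([0.995, 0.995])     # one of two database replicas

a_system = lb * app * db                # blocks are in series
print(f"app={app:.6f} db={db:.6f} system={a_system:.6f}")
```

The app and database blocks come out near 0.9999 and 0.999975, so the product sits just below the load balancer's 0.999.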
4. MTBF and MTTR
The nines hide two separate things, and you should track them separately:
- MTBF — Mean Time Between Failures. How often it breaks. (Sometimes MTTF for non-repairable items; for services MTBF is fine.)
- MTTR — Mean Time To Repair/Recover. How long it's down once it breaks.
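The fundamental identity:
Availability = MTBF / (MTBF + MTTR)
Equivalently, downtime fraction ≈ MTTR / MTBF when MTBF ≫ MTTR.
Worked example. A service fails once every 30 days and takes 1 hour to recover:
MTBF = 30 × 24 h = 720 h
MTTR = 1 h
Availability = 720 / (720 + 1) ≈ 0.9986 → 99.86%
The same formula, two different worlds: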
| Scenario | MTBF | MTTR | Availability |
|---|---|---|---|
| Rare but slow | 720 h | 4 h | 99.45% |
| Frequent but fast | 24 h | 2 min | 99.86% |
A service that breaks daily but recovers in 2 minutes beats one that breaks monthly but limps for 4 hours. Users feel MTTR.
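A short sketch of the identity applied to the worked example and the two table rows. The function name `availability` and the scenario labels are chosen here for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from MTBF and MTTR, both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

scenarios = {
    "monthly failure, 1 h recovery": (720, 1),
    "rare but slow": (720, 4),
    "frequent but fast": (24, 2 / 60),  # 2 minutes expressed in hours
}
for name, (mtbf, mttr) in scenarios.items():
    print(f"{name}: {availability(mtbf, mttr):.2%}")
```

The daily-failure service with a 2-minute recovery has the same MTBF/MTTR ratio (720 to 1) as the monthly-failure service with a 1-hour recovery, so the two land on the same availability.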
5. The Cheapest Lever Is Usually MTTR
Because availability depends on the ratio MTBF/MTTR, you can improve it by pushing either term — but they have very different cost curves.
- Raising MTBF (failing less) gets exponentially expensive. Each nine of "never break" demands more redundancy, more testing, more hardening. You fight physics and entropy.
- Lowering MTTR (recovering faster) is mostly engineering you control: faster rollback, better alerting, runbooks, automated failover, smaller blast radius. Halving MTTR roughly halves downtime — directly, cheaply.
This is a core lesson from Google's Site Reliability Engineering: invest heavily in detection and recovery speed, not just prevention. A 10-minute MTTR organization is more available than a 2-hour MTTR organization with the same failure rate — at a fraction of the cost.
Want to cut downtime in half? Two paths:
(a) Double MTBF — hard, expensive, fights entropy.
(b) Halve MTTR — usually cheap: rollback, alerts, automation.
Pick (b) first.
This connects directly to blast radius: if a failure only takes down 1% of users (because you sharded or canaried), your effective impact, and often your recovery time, shrinks. Reducing impact and reducing MTTR are the two underrated dials.
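To put numbers on the two paths, here is a rough sketch that converts MTBF and MTTR into expected downtime per year. The baseline (one failure per 720 hours, 2-hour MTTR) and the 8760-hour year are assumptions chosen for round numbers, not data from a real service.

```python
HOURS_PER_YEAR = 8760  # 365 days, an assumption for round numbers

def downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Expected yearly downtime implied by availability = MTBF / (MTBF + MTTR)."""
    unavailability = mttr_hours / (mtbf_hours + mttr_hours)
    return unavailability * HOURS_PER_YEAR

baseline = downtime_hours_per_year(720, 2.0)         # assumed starting point
double_mtbf = downtime_hours_per_year(1440, 2.0)     # path (a)
halve_mttr = downtime_hours_per_year(720, 1.0)       # path (b)
ten_minute = downtime_hours_per_year(720, 10 / 60)   # 10-minute MTTR

for label, hours in [("baseline", baseline), ("double MTBF", double_mtbf),
                     ("halve MTTR", halve_mttr), ("10-min MTTR", ten_minute)]:
    print(f"{label}: {hours:.1f} h/year")
```

Under these assumptions both paths cut yearly downtime from about 24 hours to about 12, so the choice between them comes down to cost. The 10-minute MTTR case drops it to roughly 2 hours.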
6. Independence Is the Hidden Assumption
Every formula in §2–§3 quietly assumes the components fail independently. The parallel formula U = U₁ × U₂ is only valid if replica A failing tells you nothing about replica B failing.
In reality they often share:
- the same physical rack, AZ, or power feed,
- the same deploy (one bad rollout kills all of them),
- the same upstream dependency (one DNS or config service),
- the same bug, triggered by the same input.
When failures are correlated, your beautiful "four nines from two replicas" evaporates. This is a major reason real systems underperform their paper reliability, and it's important enough that senior devotes a full treatment to it. For now: whenever you multiply unavailabilities, ask "what could make both fail at once?"
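One way to see the effect is a deliberately simple toy model, which is an assumption of this sketch and not a formula from the text: with probability `p_shared` a shared cause (bad deploy, power feed, common dependency) takes both replicas down together, and otherwise they fail independently. The 0.1% shared-cause figure is invented for illustration.

```python
def pair_unavailability(u_single, p_shared):
    """Toy model: a shared cause downs both replicas with probability p_shared;
    otherwise the two replicas fail independently."""
    return p_shared + (1 - p_shared) * u_single ** 2

independent = 1 - pair_unavailability(0.01, 0.0)    # no shared failure mode
correlated = 1 - pair_unavailability(0.01, 0.001)   # assumed 0.1% shared-cause downtime

print(f"{independent:.4%}")  # about 99.99%
print(f"{correlated:.4%}")   # about 99.89%
```

In this model a shared failure mode that is active only 0.1% of the time pulls the pair from four nines to below three, because the shared term is not squared.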
7. A Practical Workflow
When asked "how available is this?", do this:
- Draw the topology. Mark each component series or parallel.
- Label availabilities. Use measured numbers if you have them, estimates if not.
- Collapse parallel blocks with A = 1 − ∏Uᵢ.
- Multiply the series chain. That's your system number.
- Find the dominating term (the worst series component or the lone SPOF). That's where the next nine comes from.
- Sanity-check independence. If the parallel components share a failure mode, your number is optimistic — discount it.
- Split MTBF vs MTTR. Decide which lever is cheaper for the target you need.
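The arithmetic steps of this workflow can be sketched as a small script. The topology dictionary, stage names, and availabilities below are hypothetical inputs. The independence check and the MTBF/MTTR decision remain judgment calls that the code does not make for you.

```python
from math import prod

def collapse(replicas):
    """Equivalent availability of one stage; a single-entry list means no redundancy."""
    return 1 - prod(1 - a for a in replicas)

# Hypothetical topology: keys are series stages, values are parallel replicas.
topology = {
    "load_balancer": [0.999],
    "app": [0.99, 0.99],
    "database": [0.995, 0.995],
}

stage_availability = {name: collapse(r) for name, r in topology.items()}
system = prod(stage_availability.values())

# Rank stages by unavailability: the first entry is where the next nine comes from.
ranked = sorted(stage_availability, key=lambda n: 1 - stage_availability[n], reverse=True)
spofs = [name for name, r in topology.items() if len(r) == 1]

print(f"system={system:.6f} dominating={ranked[0]} spofs={spofs}")
```

If any stage's replicas share a failure mode, treat its collapsed number as an upper bound, not an estimate.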
Where this goes next
- Senior: correlated failure, tail risk / fat tails, the union bound, and FMEA / fault trees.
- Professional: org-level risk registers, reliability budgets, justifying mitigation spend.
- Siblings: base rates and expected value, estimation under uncertainty, reasoning under uncertainty.
- Related: evaluating tradeoffs objectively, systems thinking, the section root, and the Engineering Thinking overview.