Hierarchical Multi-Tenant Quotas & Fairness¶
Enforce nested limits (global → tenant → endpoint → key) across a fleet, and make sure one noisy tenant can't eat the cluster. Build a quota system that is fair under contention and accurate across many nodes — then prove both with numbers, not assertions.
| Tier | Resilience (multi-tenant control plane) |
| Primary domain | Rate/quota management / distributed counting |
| Skills exercised | Hierarchical limits, weighted fair queueing, max-min fairness, deficit round robin, distributed token buckets, Redis + Lua, quota borrowing, SLO tiers, throttle attribution, Go (go-redis + Lua) |
| Interview sections | 7 (caching / Redis), 22 (scalability & HA), 12 (architecture) |
| Est. effort | 4–6 focused days |
| Builds on | resilience/01-rate-limit-algorithm-bake-off (single-limit algorithm internals) |
1. Context¶
You run the API gateway for a multi-tenant SaaS. ~5,000 tenants share one fleet, and traffic is wildly skewed: 20 "whale" tenants drive 80% of load, and a long tail of small tenants each send a trickle. Aggregate is ~1.2M req/s across ~40 gateway nodes. The platform sells tiers — free, standard, enterprise — each with a contractual request budget.
Two things keep breaking. First, a single limit is too coarse: a tenant within its overall budget can still hammer one expensive endpoint (POST /export) and melt a downstream. Second, when a tenant suddenly spikes — a bad deploy, a retry storm, a scrape — it consumes shared capacity and everybody else's latency goes up, even tenants nowhere near their own limit. That's the noisy-neighbor problem, and "just give everyone a hard cap" leaves capacity idle most of the time while still failing under bursts.
resilience/01 taught you one limiter (token bucket vs sliding window vs GCRA) on one node. This project is the hard version: nested limits a request must clear at every level, fairness so spare capacity is shared justly instead of first-come-first-served, and distributed enforcement where the limit is global — not per-node. You will produce a quota control plane and defend its fairness and accuracy under a whale-plus-long-tail load.
2. Goals / Non-goals¶
Goals - Enforce a hierarchy: a request must pass global → per-tenant → per-endpoint → per-key and is rejected at the first level it violates. - Keep a noisy tenant isolated: its spike must not degrade other tenants' success rate or p99 below their SLO, even when global capacity is contended. - Share spare capacity fairly via a real fairness algorithm (weighted fair queueing / max-min / deficit round robin), not arbitrarily. - Enforce limits globally across the fleet with a stated, measured accuracy bound — not 40 independent per-node limiters that allow 40× overage. - Make every throttle explainable: which level fired, which tenant, why.
Non-goals - Building the gateway/proxy itself — you may sit inside one Go service or behind Envoy; the quota engine is the deliverable. - Billing/metering accuracy for invoicing (that's a different, stricter problem). - Per-user auth — tenant_id and api_key arrive trusted from an upstream. - A pretty UI — a /quota/debug JSON endpoint is enough.
3. Functional requirements¶
- A quota engine library (
quota.Decide(ctx, req) → Decision) that a request passes through.reqcarries{tenant_id, endpoint, api_key, weight, tier};Decisionis{allow bool, level string, retry_after time.Duration, reason string}. - Hierarchical limits configured per level. A request consumes from each level it traverses and is denied at the first exhausted level. Denial must name the level (
global/tenant/endpoint/key). - A fairness scheduler for contended capacity: when demand exceeds a level's limit, admit by weighted max-min fairness across tenants (weights from tier), not by arrival order. Implement at least two of {WFQ, max-min, DRR} and make the active one switchable by flag.
- Distributed enforcement in two modes, switchable by flag:
central— every decision reads/writes a shared Redis token bucket via a single atomic Lua script (strict, but Redis is in the hot path).local-reconciled— each node keeps a local bucket sized from its share of the global limit, and a background loop reconciles with Redis everyNms. State the accuracy/latency trade-off you measured for each.- Quota borrowing / bursting: a tenant under its steady limit may borrow from a shared burst pool up to a ceiling, repaid as the pool refills. Soft limits trigger borrowing; hard limits never yield.
- Priority tiers & SLOs:
enterprise > standard > free. Under global pressure, lower tiers are shed first; protected tiers keep their guarantee. - Observability: per-tenant admitted/throttled counters, per-level denial counts, current borrow debt, and a
/quota/debug?tenant=Xendpoint returning the live decision trace for that tenant.
4. Load & data profile¶
- Tenants: 5,000. Tier mix: 50
enterprise, 950standard, 4,000free. - Traffic skew: Zipfian (s≈1.2) over tenants — ~20 whales carry ~80% of volume; the long tail sends < 5 req/s each. This skew is the whole point.
- Aggregate rate: sustained ≥ 1M req/s across the fleet for a ≥ 20-min run; burst experiments push a single tenant to +300k req/s instantaneously.
- Endpoints: at least 3 with different weights —
GET /read(weight 1),POST /write(weight 5),POST /export(weight 50) — so endpoint limits bite before tenant limits for expensive calls. - Fleet: ≥ 8 gateway nodes (Docker) sharing one Redis (single node, then Cluster for the hot-key experiment).
- Generator:
cmd/loadis deterministic given a seed; supports a "noisy neighbor" mode that spikes one named tenant on a schedule. - Traffic model: open-model (fixed offered rate, not closed-loop) so you can watch a victim tenant's success rate move while the attacker spikes.
5. Non-functional requirements / SLOs¶
| Metric | Target |
|---|---|
| Fairness guarantee | Under contention, each tenant's admitted share is within ±5% of its weighted max-min fair share; a noisy tenant cannot exceed its fair share by more than its borrowable burst ceiling |
| Noisy-neighbor isolation | During a single-tenant +300k req/s spike, innocent tenants' success rate stays ≥ 99% and their decision p99 rises by < 10% |
| Enforcement accuracy (central) | Global admitted count within ±0.5% of the configured global limit |
| Enforcement accuracy (local-reconciled) | Stated, bounded overage — target ≤ 5% over the global limit at a 200 ms reconcile interval; report the overage-vs-interval curve |
| Decision latency p99 | < 1 ms added per request in local-reconciled; < 5 ms in central at 1M req/s (Redis in path) |
| Throttle attribution | 100% of denials report the correct firing level and tenant (verified against an oracle counter) |
| Tier protection | Under 100% global overload, enterprise admitted ≥ its reserved floor; free is shed first |
The goal is not a magic number — it's to find the fairness/accuracy/latency trade-off your design makes and prove you control it.
6. Architecture constraints & guidance¶
- Go service(s);
go-redis/v9for Redis. All multi-key counter mutations go through a single Lua script so check-and-decrement is atomic — never GET-then-SET in app code (that's the classic distributed-counting race). - Hierarchy is an ordered list of limiters;
Decideshort-circuits on the first denial. Be deliberate about ordering: cheapest/most-likely-to-deny first. - For
local-reconciled, each node getsglobal_limit / live_nodesas its local budget; the reconcile loop trues up against Redis and redistributes unused budget from idle nodes (this is where fairness across nodes lives). - Fairness scheduler operates on the contended level only — don't pay WFQ cost when capacity is plentiful (fast path: if under limit, admit immediately).
- Burst pool is one shared Redis counter with its own refill rate; borrowing is a conditional decrement in the same Lua call as the tenant bucket.
- Instrument with Prometheus:
quota_admitted_total{tenant,tier},quota_denied_total{level},quota_borrow_debt{tenant}, decision-latency histogram, Redis-call latency. A Grafana board is part of "done".
7. Data model¶
config (static, reloadable):
global_limit rps
tenant_limit[tier] rps # enterprise/standard/free
endpoint_weight[ep] int # read=1 write=5 export=50
burst_pool { size, refill_rps, per_tenant_ceiling }
central mode — Redis keys (mutated only by Lua):
q:global token-bucket {tokens, ts}
q:tenant:{id} token-bucket {tokens, ts, weight}
q:ep:{id}:{endpoint} token-bucket {tokens, ts}
q:key:{api_key} token-bucket {tokens, ts}
q:burst shared pool {tokens, ts}
q:debt:{id} borrow debt
local-reconciled mode (per node):
local bucket per level, sized = share of global; reconcile loop:
every 200ms: push local consumed → Redis, pull new share, redistribute
The token-bucket Lua script takes {key, rate, burst, cost, now} and returns {allowed, remaining, retry_after}; the engine calls it per level and folds the results into one Decision.
8. Interface contract¶
quota.Decide(ctx, Request) (Decision, error)— the engine entry point.Decision{ Allow bool; Level string; RetryAfter time.Duration; Reason string }.- Demo HTTP surface:
POST /api/{endpoint}(headerX-Tenant,X-Api-Key) →200or429withRetry-AfterandX-Quota-Level/X-Quota-Reasonheaders.GET /quota/debug?tenant=X→ live limits, remaining tokens per level, borrow debt, and the last N decisions for that tenant.GET /metrics→ Prometheus exposition.- Engine configured via flags:
-mode {central|local-reconciled},-fairness {wfq|maxmin|drr|none},-reconcile-ms,-burst-pool,-global-rps.
9. Key technical challenges¶
- Nested enforcement without N round-trips. Four levels naively = four Redis calls per request. Fold the hierarchy into one Lua script that checks and decrements all levels atomically and rolls back partial consumption if a later level denies — otherwise you leak tokens on rejected requests.
- Fairness vs throughput. Max-min / WFQ must decide who gets scarce capacity without becoming the bottleneck. The fast path (under limit) must stay allocation-free; fairness logic only runs when a level is contended.
- Distributed accuracy vs latency.
centralis exact but puts Redis in every request path (and creates a hot key for whales).local-reconciledis fast but drifts — overage scales with reconcile interval and node count. You must measure the curve, not hand-wave it. - Hot-tenant contention. A whale's tenant bucket is a single hot Redis key — the cluster's throughput ceiling for that tenant is one key's ops/s. Sharded counters or local-reconciled mode trade accuracy to relieve it.
- Correct attribution under concurrency. When two levels are simultaneously near-empty, the denial must report the level that actually fired, consistently — easy to get wrong when limiters run in parallel.
10. Experiments to run (break it / tune it)¶
Record before/after numbers for each:
- Noisy neighbor, fairness off vs on. Run steady long-tail traffic, then spike tenant
whale-1to +300k req/s. With-fairness none, measure innocent tenants' success rate and p99 as the whale eats the global pool. Switch to-fairness maxminand re-measure. Measure: victim success rate (must recover to ≥ 99%), victim p99 delta, whale's admitted share (should clamp to fair share + burst ceiling). - Distributed accuracy: central vs local-reconciled. Drive a fixed global rate above the limit in both modes. Measure: actual admitted vs configured global limit (overage %), and decision p99. Sweep
-reconcile-ms= 50, 200, 1000 and plot overage vs latency — find the knee. - Quota borrowing under partial load. Most tenants idle, one tenant bursts above its steady limit. Measure: how much it borrows from the pool, that it never exceeds the per-tenant ceiling, and that borrowing collapses to zero (hard limit holds) once the pool drains.
- Hot-tenant contention on the central store. Push one whale's tenant bucket to its Redis-key ops ceiling. Measure: the per-key throughput wall and decision p99 at that key; then relieve it with sharded counters or local-reconciled and re-measure the accuracy cost.
- Whale + long-tail fairness. Run the full Zipfian profile at global overload. Measure: each tenant's admitted share vs its weighted max-min fair share (must be within ±5%); confirm the long tail isn't starved by whales.
- Tier protection under total overload. Offer 150% of global capacity across all tiers. Measure: that
enterprisekeeps its reserved floor andfreeis shed first; quantify each tier's admitted rate. - Throttle-attribution correctness. Run a mixed load where tenant, endpoint, and key limits all fire. Compare engine-reported denial levels against an independent oracle counter. Measure: attribution accuracy (target 100%).
11. Milestones¶
- Single-node engine: hierarchical token buckets + one Lua script per level;
Decideshort-circuits and attributes denials. Prometheus + Grafana board. cmd/loadwith Zipfian tenants + noisy-neighbor mode; first overload run.- Distributed enforcement:
centralLua path, thenlocal-reconciledwith the reconcile loop (experiment 2 curve). - Fairness scheduler (max-min + one of WFQ/DRR); noisy-neighbor containment (experiments 1, 5).
- Burst pool + borrowing + tier protection (experiments 3, 6); attribution proof (experiment 7); findings note.
12. Acceptance criteria (definition of done)¶
- Sustained ≥ 20-min run at ≥ 1M req/s across ≥ 8 nodes; dashboard screenshot attached.
- Noisy-neighbor experiment shows innocent tenants ≥ 99% success and < 10% p99 rise with fairness, and the contrast without it (both plotted).
-
centraladmitted within ±0.5% of global limit;local-reconciledoverage-vs-reconcile-interval curve plotted with the knee identified. - All four hierarchy levels demonstrably enforced, with denials attributing the correct level (oracle-verified, 100%).
- Quota borrowing respects the per-tenant ceiling and the hard limit holds when the pool drains (shown).
- Tier protection: under 150% overload,
enterprisefloor held,freeshed first (numbers shown). - Findings note: the accuracy/latency/fairness trade-off of your design, defensible to a staff panel. Every number reproducible from a committed command + config.
13. Stretch goals¶
- Sharded counters for hot whale keys (split one tenant bucket into K Redis keys, sum on read) and measure the accuracy loss vs the contention relief.
- Predictive borrowing: forecast pool availability from recent demand and pre-allocate, reducing borrow-request latency for bursty tenants.
- Per-endpoint SLOs layered on tiers (e.g.
exportcapped harder thanreadregardless of tier). - Adaptive limits: raise/lower a tenant's effective limit from downstream health signals (load-shed feedback), not just a static config.
- Graceful node loss: when a node dies in
local-reconciled, prove its share is reclaimed by the reconcile loop without a global overage spike.
14. Evaluation rubric¶
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Hierarchical enforcement | Levels enforced; denial names the level | One atomic Lua check across all levels with correct rollback; no token leak on rejection |
| Fairness | Implements a fairness algorithm; noisy tenant is contained | Proves admitted shares match weighted max-min within ±5%; fast path stays allocation-free |
| Distributed accuracy | Knows central is exact and local drifts | Plots the overage-vs-interval curve; recommends a mode for a given SLO with evidence |
| Noisy-neighbor isolation | Shows victims recover with fairness on | Quantifies victim p99/success delta and bounds the whale to fair-share + burst |
| Hot-key handling | Notices the whale's bucket is a hot key | Measures the per-key wall and mitigates it with a stated accuracy cost |
| Tiers & borrowing | Tiers and burst pool work | Tier floors hold under overload; borrowing respects ceilings and hard limits |
| Communication | Clear findings note | Could defend every fairness/accuracy curve to a staff panel |
15. References¶
resilience/01-rate-limit-algorithm-bake-off— token bucket / sliding window / GCRA internals this project composes and distributes.- Demers, Keshav, Shenker — Analysis and Simulation of a Fair Queueing Algorithm (WFQ); Shreedhar & Varghese — Efficient Fair Queueing using Deficit Round Robin.
- Stripe / Cloudflare engineering posts on distributed rate limiting and Redis + Lua token buckets.
go-redisLua scripting (EVALSHA) and Redis Cluster hot-key guidance.- See also:
Interview Question/22-scalability-and-high-availability/andInterview Question/07-caching-and-redis/.