Sharded Multi-Tenant Database Platform¶
Scale a relational workload past one Postgres node by sharding it across many. Build the routing layer, isolate noisy tenants — and then do the hard thing: split a hot shard while it serves live traffic, losing zero rows and zero seconds of uptime. The shard key is a one-way door; defend yours.
| Tier | Staff (horizontal data scaling) |
| Primary domain | Distributed data / multi-tenancy |
| Skills exercised | Sharding, shard-key design, consistent hashing, directory routing, cross-shard fan-out, online resharding (dual-write + backfill + cutover), tenant isolation, fleet-wide schema migration, Go (pgx, jackc/pgxpool) |
| Interview sections | 5 (postgres), 13 (distributed systems), 22 (scalability), 23 (db selection) |
| Est. effort | 6–10 focused days |
1. Context¶
You run the data platform for a B2B SaaS doing ~4 billion rows of tenant data across ~8,000 tenants. The primary Postgres node is at 85% CPU during business hours, the working set no longer fits in RAM, and a single VACUUM on the biggest table now runs into the next maintenance window. Vertical scaling bought you 18 months; it has run out. Read replicas help reads but not the write ceiling, and the write ceiling is the wall you've hit.
Worse: tenants are wildly skewed. Three "whale" tenants account for ~40% of all rows and ~55% of write volume; the long tail is 7,000 tenants under 50k rows each. One whale's batch import currently degrades p99 for everyone on the box — the classic noisy-neighbor failure.
Your mandate is to shard the relational workload horizontally across many Postgres instances behind a routing layer, keep tenants isolated, and — the part that decides whether this is a staff-level result — reshard live: detect a hot shard and split it without downtime, without losing or duplicating a single row under concurrent writes. You will produce numbers and a defended design, not a diagram.
This is also a when-not-to exercise. Sharding is the most expensive architecture decision on this list. Part of the deliverable is proving you'd reach for Citus, Vitess, or simply a bigger box first — and saying exactly where that stops being enough.
2. Goals / Non-goals¶
Goals - Stand up N ≥ 4 Postgres shards behind a Go routing layer that maps any query to the correct shard from a stable shard key, and serve a multi-tenant OLTP workload across them. - Show near-linear write-throughput scaling as shards are added, find where it stops being linear, and explain why. - Implement and prove an online shard split: take a hot shard, split it into two, and cut over live traffic with zero downtime and zero data loss/dupes under concurrent writes. - Enforce tenant isolation so one whale's burst cannot blow the SLO for tenants on other shards (and bound the blast radius for co-resident tenants). - Roll a schema migration across all N shards safely (expand/contract), and report fleet-wide rollout time and failure handling.
Non-goals - Adopting Citus or Vitess as the solution. You will benchmark against them (§13), but the point is to build the routing/resharding logic yourself so you understand what they hide. - Globally-distributed / multi-region placement (that's a separate staff lab). - A general distributed-SQL query planner. Cross-shard queries exist here to be measured and then designed away, not to be made fast. - Strong cross-shard ACID by default. You will deliberately design to not need distributed transactions; the one place you can't, you'll state the guarantee you actually provide.
3. Functional requirements¶
- A router (
cmd/routeror a libraryinternal/shardmap) resolves any request carrying atenant_idto exactly one shard and returns apgxpoolhandle to it. Routing must survive shard add/remove and an in-progress split. - A data API (
cmd/api) exposes tenant-scoped OLTP operations: POST /tenants/{tid}/orders— single-shard write.GET /tenants/{tid}/orders?...— single-shard read.GET /admin/reports/revenue— cross-shard scatter-gather aggregate (exists so you can measure the fan-out tax and then justify avoiding it).- A shard map is the source of truth for key→shard placement. Implement both routing strategies behind one interface and compare them (§8):
- Directory-based (a lookup table:
tenant_id → shard_id, cacheable). - Algorithmic (consistent hashing over a hash ring with virtual nodes).
- A resharder (
cmd/reshard) performs an online split of one source shard into two: snapshot → dual-write → backfill → verify → cutover → decommission, each step resumable and individually reversible before cutover. - A migrator (
cmd/migrate) applies a schema change across all shards with per-shard status, bounded concurrency, and a stop-on-failure mode. - A load harness (
cmd/load) drives a tenant-skewed open-model workload and a chaos hook that can promote a tenant to "whale" mid-run and kill a shard.
4. Load & data profile¶
- Volume: ≥ 4 billion rows total across shards (orders + line items); the largest single table per shard ≥ 200M rows so index and vacuum behavior is realistic, not toy.
- Tenants: 8,000 tenants. Sizes are Zipfian (s ≈ 1.2): 3 whales hold ~40% of rows, ~60 mid-tenants hold ~35%, ~7,900 long-tail tenants hold the rest. This skew is the whole point — it breaks naive hashing.
- Shards: start at 4, scale experiments to 8 and 16. Each shard is its own Postgres instance (separate container, own page cache, own WAL).
- Write/read mix: 30% writes / 70% reads at steady state; whale import bursts push one tenant to 10× its baseline write rate for 5-minute windows.
- Generator:
cmd/genis deterministic given a seed; it can emit the Zipfian tenant distribution and a configurable hot-tenant burst. - Traffic model: open-model (fixed arrival rate, not closed-loop) so lag and tail latency build visibly when a shard saturates.
5. Non-functional requirements / SLOs¶
| Metric | Target |
|---|---|
| Single-shard write p99 (steady state, below ceiling) | < 15 ms |
| Single-shard read p99 (indexed, tenant-scoped) | < 10 ms |
| Router resolution overhead (per request, cache-hit) | < 100 µs added p99; report cache-miss path separately |
| Write-throughput scaling 4→8→16 shards | Near-linear; report the actual slope and name what bends it |
| Cross-shard scatter-gather p99 (revenue report, N shards) | Measure & report; expect ≫ single-shard. Justify why it's an admin-only path |
| Online split — data correctness | After split under live writes: rows_before == rows_after, zero lost, zero duplicate (prove with a checksum diff) |
| Online split — availability | Zero failed writes during cutover; report max write-path latency spike and its duration |
| Noisy-neighbor blast radius | A whale at 10× write rate must not push other shards' p99 above SLO; co-resident tenants' degradation bounded and reported |
| Fleet schema migration (N shards) | Completes with per-shard status; report total wall-clock and behavior on a mid-rollout shard failure |
The target is not a magic number — it's to find your platform's scaling slope and resharding cost, and defend them to a staff panel.
6. Architecture constraints & guidance¶
- N Postgres instances via
docker-compose, each a real separate node (own volume, ownshared_buffers). Pin the version. No managed RDS — you need to see the WAL, the vacuum, the connection limits. - Go +
pgx/pgxpool, one pool per shard. The router owns the pool map; size pools so total connections stay under each shard'smax_connections(this becomes a real constraint at 16 shards — budget it). - Shard map lives in a small coordination store (a dedicated Postgres "catalog" DB or etcd). Routers cache it and watch for version bumps; an in-progress split publishes a "splitting" state the router must honor.
- Idempotency keys on writes from day one — resharding correctness depends on being able to replay safely.
- Instrument with Prometheus: per-shard QPS / p99 / connection-pool saturation / replication or backfill lag, router cache hit-rate, and a per-tenant top-talkers view so you can see the whale.
7. Data model & shard map¶
-- Every shardable table carries the shard key as a leading column.
-- Shard key = tenant_id (see §9 for why, and what it costs).
orders(
tenant_id BIGINT NOT NULL, -- SHARD KEY
order_id BIGINT NOT NULL, -- unique *within* tenant
amount_cents BIGINT NOT NULL,
status TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
PRIMARY KEY (tenant_id, order_id) -- co-located, no cross-shard PK
);
line_items(
tenant_id BIGINT NOT NULL, -- SHARD KEY (co-located with orders)
order_id BIGINT NOT NULL,
sku TEXT, qty INT, price_cents BIGINT,
PRIMARY KEY (tenant_id, order_id, sku),
FOREIGN KEY (tenant_id, order_id) REFERENCES orders(tenant_id, order_id)
);
-- Idempotency ledger (per shard) — makes dual-write + replay safe.
write_log(tenant_id BIGINT, idem_key UUID, applied_at TIMESTAMPTZ,
PRIMARY KEY (tenant_id, idem_key));
Shard map (catalog DB — source of truth):
shards(shard_id INT PK, dsn TEXT, state TEXT); -- state: active | splitting | draining
-- Directory strategy:
tenant_placement(tenant_id BIGINT PK, shard_id INT, moved_at TIMESTAMPTZ);
-- Algorithmic strategy: hash ring with virtual nodes
ring(vnode_hash BIGINT PK, shard_id INT); -- ~256 vnodes/shard for even spread
map_version BIGINT; -- routers watch this; bump on any change
Key design points to defend: - tenant_id leads every PK so a whole tenant lives on one shard → most queries are single-shard, and a tenant moves as one atomic unit during resharding. - No global auto-increment IDs (they'd force a cross-shard sequence). IDs are unique within a tenant, or use ULID/snowflake-style IDs if you need global uniqueness — state which and why. - The whale problem the model can't hide: one whale > one shard's capacity means tenant_id alone is no longer a sufficient shard key. §9 forces the decision (sub-shard a whale by a composite key vs. give it a dedicated shard).
8. Routing / API contract¶
- Router interface (one contract, two implementations):
ShardFor(tenant_id) → (shard_id, *pgxpool.Pool); both directory and consistent-hashing implement it. Compare: directory gives arbitrary placement (you can pin a whale to its own shard) at the cost of a lookup + cache-invalidation problem; consistent hashing gives O(1) stateless routing and minimal key movement on rebalance, but you can't pin a hot tenant without a directory override on top. - During a split,
ShardForconsultsstate: a tenant whose range issplittingroutes writes to both old and new shard (dual-write) and reads to the authoritative side until cutover flips the map version. - Endpoints:
POST /tenants/{tid}/orders→ 200; carriesIdempotency-Key.GET /tenants/{tid}/orders→ single-shard.GET /admin/reports/revenue→ fan-out to all shards, scatter-gather, merge; response includesshards_queried,fanout_ms,slowest_shard.GET /admin/shardmap→ current placement + map version.GET /metrics→ Prometheus.- Config/flags:
-strategy=directory|consistent-hash,-shards,-vnodes,-pool-size-per-shard.
9. Key technical challenges¶
- Shard-key choice is a one-way door.
tenant_id(co-locates a tenant, cheap single-shard reads, easy moves) vs. hash(tenant_id) (even spread, kills range scans, hard to pin a whale) vs. range (range scans + easy splits, but hot-range skew). Pick one, state the workload it optimizes, and name the query class it makes expensive. Then handle the case it breaks: a whale bigger than one shard. - Online resharding without losing a row. The split must be correct under concurrent writes: snapshot a consistent point → dual-write new writes to source and destination → backfill the historical rows (chunked, resumable, idempotent via
write_log) → verify with a per-chunk checksum → cutover by bumpingmap_versionatomically → drain & decommission the old copy. The dangerous windows are (a) writes that land between snapshot and dual-write start, and (b) the cutover instant — design both so no write is lost and no row is double-applied. - Cross-shard queries are a tax, not a feature. A scatter-gather report is bounded by the slowest shard (tail amplification) and N× the connections. Measure it, then justify why it's admin-only / async / served from a pre-aggregated rollup instead of the live path.
- Distributed transactions — designed away. A write that must touch two tenants/shards atomically has no cheap answer (2PC = coordinator + blocking). The staff move is to make the shard key guarantee such writes don't exist; for the one case that genuinely spans shards, state the weaker guarantee you ship (e.g. saga + idempotency, eventual consistency) and why it's acceptable.
- Noisy neighbors. A whale's burst can saturate its shard's CPU, WAL, and connection pool. Bound the blast radius: per-tenant rate limits / pool quotas, and shard placement that isolates whales — and prove other shards stay green.
- Fleet-wide schema migration. A migration must be expand/contract so old and new code coexist while it rolls; a non-concurrent
CREATE INDEXor a rewritingALTERon a 200M-row shard locks writes. Roll with bounded concurrency and handle a shard that fails mid-rollout without leaving the fleet in mixed schema forever.
10. Experiments to run (break it / tune it)¶
Record before/after numbers for each:
- Scaling slope: fix per-tenant rate, run on 4 → 8 → 16 shards. Plot aggregate write throughput. Is it linear? Where does it bend, and what bends it — router CPU, catalog lookups, connection limits, or skew?
- Shard-key bake-off: route the same workload by
tenant_idvshash(tenant_id)vs range. Measure single-shard hit-rate, load balance (per-shard QPS spread), and what each does to the revenue report. - Hot-shard detection + live split: drive a whale to 10× write rate, detect the hot shard, and perform an online split while traffic runs. Measure: failed writes (target 0), max write-latency spike and its duration, and total split wall-clock. Then prove correctness (experiment 4).
- Resharding correctness under concurrency: during the split, run a continuous writer. After cutover, checksum source vs destination:
rows_before == rows_after, zero lost, zero duplicate. Show the diff/SQL. Then kill the resharder mid-backfill and prove it resumes idempotently. - Scatter-gather vs single-shard: revenue report fan-out p99 at 4/8/16 shards vs a single-shard read. Quantify tail amplification (slowest-shard effect). Then serve it from a rollup and re-measure.
- Router overhead: directory (with and without cache) vs consistent-hash resolution latency under load. Report cache hit-rate and the cache-miss p99.
- Noisy neighbor: whale burst on shard A — measure p99 on shards B/C/D before and after adding per-tenant pool quotas / rate limits. Show isolation.
- Fleet migration: roll an expand/contract change (add column + backfill + index
CONCURRENTLY) across all shards. Report wall-clock, write impact per shard, and behavior when one shard fails mid-rollout.
11. Milestones¶
- Compose N shards + catalog; router with directory strategy; data API doing single-shard reads/writes; Prometheus + Grafana per-shard board.
cmd/genZipfian tenants at scale; baseline scaling run (experiment 1) and shard-key bake-off (experiment 2).- Consistent-hashing strategy behind the same interface; router-overhead + scatter-gather measurements (experiments 5, 6).
- Online resharder: dual-write + backfill + verify + cutover; live split with correctness proof (experiments 3, 4). This is the milestone that matters.
- Noisy-neighbor isolation + fleet schema migration (experiments 7, 8); findings note + Citus/Vitess comparison (§13).
12. Acceptance criteria (definition of done)¶
- N ≥ 4 real Postgres shards serving the tenant-scoped API; per-shard dashboard screenshot attached.
- Scaling curve (4/8/16 shards) plotted, with the bend named and proven (pprof / pg_stat / connection-saturation evidence).
- Both routing strategies implemented behind one interface and compared with numbers.
- Online split executed under live writes with zero failed writes, and a checksum diff proving
rows_before == rows_after, zero lost/dupe. - Resharder proven resumable (killed mid-backfill, resumes idempotently).
- Noisy-neighbor: whale at 10× does not breach other shards' SLO, with the isolation mechanism and before/after numbers shown.
- Fleet schema migration rolled across all shards with per-shard status and a documented mid-rollout-failure behavior.
- Findings note: shard-key choice defended, resharding plan defended, and a crisp statement of when you would not shard (and reach for Citus/Vitess or a bigger box instead).
- Every number reproducible from a committed command + config + seed.
13. Stretch goals¶
- Citus / Vitess comparison: run the same workload and the same split on Citus (distributed Postgres) and/or Vitess (sharded MySQL). Where do they beat your hand-rolled router, and what do they cost you in control or ops?
- Whale sub-sharding: for a tenant bigger than one shard, introduce a composite shard key (
tenant_id, bucket) and route a single whale across shards while keeping the long tail ontenant_id. Measure the complexity tax. - Cross-shard saga: implement one genuinely cross-shard operation as a saga with compensation + idempotency; show it converges and bound its window.
- Read-replica blend: add a replica per shard and route reads to it; measure replication lag's effect on read-your-writes and the throughput it buys.
- Automatic rebalancing: turn hot-shard detection into an automated split trigger with a safety budget (never split more than one shard at a time).
14. Evaluation rubric¶
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Shard-key choice | Picks a reasonable key | Defends it against the workload; names the query class it makes expensive and handles the whale-bigger-than-a-shard case |
| Routing layer | Routes correctly | Compares directory vs consistent-hash with numbers; routing survives an in-flight split |
| Online resharding | Can move data with a maintenance window | Live split, zero downtime, zero lost/dupe under concurrent writes, resumable; explains why the protocol is correct |
| Cross-shard queries | Knows fan-out is slow | Quantifies tail amplification; designs the live path to avoid it (rollup/async) |
| Distributed transactions | Avoids them | Designs the shard key so they don't exist; for the unavoidable case, states the exact guarantee shipped and why it's safe |
| Tenant isolation | Notices noisy neighbors | Bounds blast radius and proves other shards stay green under a 10× whale |
| Fleet migration | Migrates one DB | Expand/contract across N shards, handles mid-rollout failure |
| Judgment | Can shard | Knows when NOT to — reaches for Citus/Vitess/a bigger box first and says exactly where that stops being enough |
| Communication | Clear findings note | Could defend the shard key and the resharding plan to a staff panel |
15. References¶
- Designing Data-Intensive Applications — Ch. 6 (Partitioning) & Ch. 7 (Transactions): consistent hashing, rebalancing, range vs hash partitioning.
- Citus documentation — distributed tables, shard rebalancer, reference tables.
- Vitess documentation — VTGate routing, resharding (
MoveTables,SplitClone), vindexes. jackc/pgx&pgxpool— per-shard pooling and connection budgeting.- See also:
Interview Question/05-postgresql-and-sql/— indexing, vacuum, EXPLAIN, schema migration mechanics.Interview Question/13-distributed-systems/— partitioning, consistency, 2PC vs sagas, idempotency.Interview Question/22-scalability-and-high-availability/— horizontal scaling, hot keys, blast-radius isolation.Interview Question/23-database-types-and-selection/— when to shard vs replicate vs adopt Citus/Vitess.