Projects — High-Load & Big-Data Engineering Labs

Hands-on project briefs for senior and staff Go backend engineers.

These are not CRUD toys. Every project makes you stand up a real system (Kafka, ClickHouse, Postgres, Redis, Elasticsearch…), generate large datasets, drive high load, then measure, break, and tune it until you can defend the numbers. The skill being trained is not "can you make it work" — it's "can you make it work at scale, explain why it behaves the way it does, and prove it with data."

Each brief is a TZ (short for the Russian term for a technical specification): a realistic problem, hard requirements, explicit SLOs, a load-and-data profile, a set of experiments to run, and a grading rubric a staff interviewer would recognize.

How to use a project

Read the whole TZ first. Note the SLOs and the dataset scale — they drive every design decision.
Generate data at the stated scale before you optimize anything. Tuning against 1k rows teaches you nothing; the interesting behavior starts at 10M+.
Build the smallest thing that meets the functional requirements, then run the load harness and watch it fall over.
Run the experiments. Each TZ lists "break it / tune it" investigations. Record before/after numbers (throughput, p50/p99/p999, lag, CPU, allocations).
Write a short findings note per project: what you changed, what it bought you, and why. This is the artifact that proves staff-level depth.

You do not need to finish every stretch goal. You do need numbers you can defend. Depth over breadth — one project taken to real scale with a written findings note beats eight half-built repos.

The scale matrix — build every project in 4 stages

"High load" isn't one thing. It has two independent axes — data volume and request rate — and they fail in completely different ways. So every project is a 4-rung ladder. Build Stage 0 correct first (it's your control); then push each axis on its own; then push both together. Don't tune what isn't yet correct.

Stage	Data	Request rate	What it stresses
0 · Simple	small	low	correctness only — the baseline you measure everything else against
1 · Big data	huge	low	storage layout, indexing, memory, compaction, query plans
2 · High RPS	small	very high	connection pools, locking/contention, tail latency, backpressure
3 · Big data + High RPS	huge	very high	the interaction: hot keys in a huge set under concurrency, cache working-set misses, tail amplification — full SLOs

Because the axes are independent, a system can ace Stage 1 and still fall over at Stage 2 (or vice versa). Stage 3 is the production boss fight where the two compound. A project is only "senior/staff done" at Stage 3, with the numbers measured and defended.

Each project instantiates the axes in its own terms (a rate limiter's "big data" = millions of keys; a columnar store's "high RPS" = many concurrent queries). The stages map onto the TZ sections: Functional reqs → 0; Load & data profile → 1; load harness + SLOs → 2; Experiments + full SLOs → 3.
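One way to keep a harness honest about the ladder is to encode the matrix as data and always run the lowest stage that has not yet passed. This is only a sketch under assumed names: `Level`, `Stage`, `ladder`, and `nextStage` are hypothetical and do not come from any TZ. The levels are deliberately qualitative, because each project defines its own concrete numbers.

```go
package main

import "fmt"

// Level is a coarse, qualitative setting for one axis of the scale matrix.
type Level string

const (
	Small    Level = "small"
	Huge     Level = "huge"
	Low      Level = "low"
	VeryHigh Level = "very high"
)

// Stage is one rung of the ladder (hypothetical type for illustration).
type Stage struct {
	ID   int
	Name string
	Data Level // data volume axis
	Rate Level // request rate axis
}

// ladder mirrors the scale matrix, in the order the stages should be run.
var ladder = []Stage{
	{ID: 0, Name: "Simple", Data: Small, Rate: Low},
	{ID: 1, Name: "Big data", Data: Huge, Rate: Low},
	{ID: 2, Name: "High RPS", Data: Small, Rate: VeryHigh},
	{ID: 3, Name: "Big data + High RPS", Data: Huge, Rate: VeryHigh},
}

// nextStage returns the first stage that has not passed yet, so Stage 0 is
// always established before either axis is pushed. ok is false once every
// stage has passed.
func nextStage(passed map[int]bool) (s Stage, ok bool) {
	for _, st := range ladder {
		if !passed[st.ID] {
			return st, true
		}
	}
	return Stage{}, false
}

func main() {
	passed := map[int]bool{0: true} // Stage 0 verified correct
	if s, ok := nextStage(passed); ok {
		fmt.Printf("next: stage %d (%s), data=%s, rate=%s\n", s.ID, s.Name, s.Data, s.Rate)
	}
}
```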

Tiers

labs/ — Data-systems load labs. The core of this library: push a single technology (Kafka, ClickHouse, Postgres indexes, Redis, Elasticsearch) to its limits with big data and learn its failure modes.
senior/ — Own one service end-to-end, with high load baked into the requirements. You design the data model, the API, and the scaling story.
staff/ — Cross-cutting, ambiguous, multi-system problems at scale: consensus, sharding, multi-region, migrations, exactly-once pipelines.

Catalog

`labs/` — Data-Systems Load Labs

#	Project	Trains	Interview sections
01	Kafka throughput & exactly-once	Partitioning, consumer lag, rebalancing, EOS	11, 13, 17
02	ClickHouse OLAP at scale	MergeTree, materialized views, projections, billions of rows	23, 17, 14
03	Postgres indexing & partitioning	B-tree/GIN/BRIN/partial/covering, bloat, partitioning, EXPLAIN	5, 17, 23
04	Redis at scale	Cluster, pipelining, hot keys, eviction, persistence	7, 22, 17
05	Elasticsearch at scale	Sharding, bulk indexing, relevance, query latency	8, 17, 22
06	Database bake-off: analytics under load	Postgres vs ClickHouse vs Mongo, selection	23, 5, 6
07	Connection-pool & DB saturation	Pool sizing, PgBouncer, N+1, lock contention	5, 17, 22
08	Streaming backpressure	Lag, rebalancing storms, exactly-once delivery	11, 13, 2
09	Cache stampede & invalidation	Thundering herd, TTL jitter, single-flight, write strategies	7, 17, 22

`events/` — Event-Engineering Labs

#	Project	Trains	Interview sections
01	CDC pipeline (Debezium)	Postgres WAL → Kafka, snapshot + stream, outbox-vs-CDC	5, 11, 13
02	Stateful windowing processor	Tumbling/sliding/session windows, watermarks, late data, local state	11, 13, 17
03	Event replay & reprojection	Rebuild read models from 1B-event log, time-travel, zero-downtime	11, 12, 13
04	Schema registry & evolution	Avro/Protobuf compat, breaking-change detection, rolling upgrades	10, 11, 12
05	DLQ & retry topology	Poison messages, retry topics, backoff, parking-lot, replay	11, 13, 17
06	Broker bake-off	Kafka vs RabbitMQ vs NATS JetStream under load; selection	11, 13, 23
07	Idempotent inbox/outbox (cross-service)	Exactly-once over HTTP/gRPC, idempotency keys, dedup	10, 11, 13
08	Consumer autoscaling on lag	KEDA-style scaling, 10x spikes, drain time, rebalance cost	11, 19, 22

`resilience/` — Rate-Limiting & Resilience Labs

#	Project	Trains	Interview sections
01	Rate-limit algorithm bake-off	Token/leaky bucket, sliding window log/counter, GCRA; accuracy vs memory vs burst	7, 17, 22
02	Adaptive concurrency & load shedding	AIMD, Little's Law, concurrency-limits, shed before collapse	13, 17, 22
03	Circuit breaker / bulkhead / timeout	Cascading-failure isolation, timeout budgets, half-open probing	13, 22, 9
04	Hierarchical multi-tenant quotas	Per-tenant/endpoint/global fairness, quota borrowing, noisy-neighbor isolation	7, 22, 12

`load-testing/` — How to Test High Load (the meta-skill)

#	Project	Trains	Interview sections
01	Distributed load generator	Open vs closed model, coordinated omission, HdrHistogram, distributed agents	17, 22, 9
02	Chaos & fault injection	Latency/error/kill injection, steady-state hypothesis, blast radius	13, 22, 18
03	Soak & leak hunting	24h endurance, goroutine/memory/fd leaks, pprof over time	1, 17, 18
04	Capacity & breakpoint testing	Stress to breaking point, USE method, Little's Law, the knee	17, 22, 14
05	Go memory & zero-allocation	Escape analysis, sync.Pool, GOGC/GOMEMLIMIT, arenas/mmap, alloc reduction	1, 17, 2
06	Microbenchmarking & benchstat	testing.B pitfalls, benchstat A/B significance, CI perf-regression gates	1, 17, 15
07	Profiling-guided optimization	pprof CPU/heap/block/mutex, flame graphs, runtime/trace, optimize a hot path	17, 1, 18

`observability/` — Logging & Monitoring at scale

#	Project	Trains	Interview sections
01	Centralized logging pipeline	High-volume structured logging: ship→buffer→index→query, sampling, cardinality, backpressure	18, 17, 22
02	Metrics, monitoring & alerting	Prometheus instrumentation, scrape, alerting rules, SLO/error-budget burn, alert fatigue	18, 22, 17

`distributed-patterns/` — Coordination & Distributed-Transaction Patterns

#	Project	Trains	Interview sections
01	Leader election	Lease-based / Raft-lease election, split-brain avoidance, fencing	13, 2, 22
02	Distributed lock with fencing	Redis/etcd locks, fencing tokens, the Redlock debate, liveness vs safety	13, 7, 2
03	Scatter-gather aggregator	Fan a request to N shards/services, aggregate, straggler/timeout handling	13, 22, 9
04	Claim-check	Large payloads via a store + reference through the queue; broker-size limits	11, 13, 20
05	Fan-out / fan-in pipeline	Parallel fan-out then join, bounded concurrency, partial-failure handling	2, 13, 17
06	2PC / 3PC coordinator	Two/three-phase commit, coordinator-failure blocking, recovery log, why it doesn't scale	13, 5, 15
07	Saga: orchestration vs choreography	Build both; compare coupling, failure handling, observability, compensation	11, 12, 13
08	TCC (Try-Confirm-Cancel)	Reservation/confirm/cancel, idempotency, timeout-driven cancellation; payment txns	13, 11, 16

`networking/` — Sockets & Network Performance

#	Project	Trains	Interview sections
01	High-performance TCP socket server	Raw sockets, framing, goroutine-per-conn vs epoll, TCP tuning, zero-copy, C10K→C10M	9, 2, 17

`senior/` — Service builds (high-load baked in)

#	Project	Trains	Interview sections
01	Idempotent double-entry ledger	Postgres txns, idempotency, exactly-once, 10k+ TPS	5, 10, 15, 16
02	Distributed rate limiter	Redis + Lua, token bucket/sliding window, 1M+ req/s	7, 22, 9
03	Durable job queue / scheduler	Go concurrency, retries/backoff, DLQ	2, 11, 15
04	Realtime chat & presence	WebSockets, pub/sub, 100k+ connections, backpressure	2, 7, 9
05	Content-addressed storage (S3-like)	Chunking, dedup, multipart, throughput	5, 20, 16
06	Observability backend	High-cardinality OTLP ingest, Go perf	18, 17
07	Event-driven order/payment service ⭐ flagship	Go · Postgres · Kafka · Outbox · DDD, end-to-end	1, 5, 11, 12, 13, 15, 18
08	API gateway / edge proxy	Reverse proxy, JWT/JWKS, per-route rate limit + breaker, hot-reload, 150k+ req/s	10, 9, 22, 16, 18

`staff/` — Systems at scale

#	Project	Trains	Interview sections
01	Sharded multi-tenant platform	Sharding, routing, live resharding, isolation	5, 13, 22, 23
02	Event-sourced CQRS + Saga + Outbox	Event sourcing, projections, distributed txns	11, 12, 13
03	Raft-backed metadata KV store	Consensus, log replication, leader election	13, 2
04	Multi-region active-active	CRDT/LWW, geo-routing, replication	13, 22, 20
05	Exactly-once streaming pipeline	Kafka → ClickHouse, windowing, stateful processing	11, 13, 17
06	Monolith → services migration	Strangler, dual-write, backfill, zero-downtime	12, 13, 5
07	Mini message broker (Kafka-lite)	Segmented append-only log, partitions, consumer groups, replication	11, 13, 17
08	LSM-tree storage engine	Memtable, SSTables, compaction, WAL, Bloom filters, read/write amplification	5, 17, 23

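Most projects tag 3–5 interview sections; a few tag only two, and the flagship tags seven. Taken together, the tags in this catalog span 20 of the 23 sections: sections 3, 4, and 21 are not tagged by any project listed here.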

Shared tooling conventions

Every TZ assumes these, so they aren't repeated in each brief:

Local infra: docker-compose (or kind for k8s labs) brings up the project's dependencies. Pin versions; record them in your findings.
Dataset generator: a Go program (cmd/gen) that produces data at the target scale with realistic distributions (Zipfian for hot keys, skewed cardinality where it matters). Deterministic via a seed so runs are reproducible.
Load harness: k6, vegeta, or a custom Go driver. Drive a defined traffic profile (open vs closed model; state the model). Capture a full latency histogram, not just an average.
Measurement: p50 / p95 / p99 / p999 latency, throughput, error rate, and the resource cost (CPU, memory, allocations via pprof, disk/network I/O). An average with no tail is not an answer.
Reproducibility: every number in your findings note must come with the command and config that produced it.
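The dataset-generator convention (Zipfian hot keys, deterministic via a seed) can be sketched with the standard library's `math/rand` Zipf generator. This is an illustration only: the function name `genKeys`, the seed, the key-space size, and the skew parameters `s = 1.2` and `v = 1` are assumptions, not values taken from any TZ.

```go
package main

import (
	"fmt"
	"math/rand"
)

// genKeys returns n key IDs in [0, keyspace) drawn from a Zipf distribution,
// so low-numbered keys are "hot". keyspace must be at least 1. The same seed
// always yields the same sequence, which makes runs reproducible.
func genKeys(seed int64, n int, keyspace uint64) []uint64 {
	r := rand.New(rand.NewSource(seed))
	// rand.NewZipf requires s > 1 and v >= 1. Larger s means heavier skew.
	// Both values here are illustrative assumptions.
	z := rand.NewZipf(r, 1.2, 1.0, keyspace-1)
	keys := make([]uint64, n)
	for i := range keys {
		keys[i] = z.Uint64()
	}
	return keys
}

func main() {
	// Hypothetical scale: 1M accesses over a 10M-key space.
	keys := genKeys(42, 1_000_000, 10_000_000)

	counts := make(map[uint64]int)
	for _, k := range keys {
		counts[k]++
	}
	fmt.Printf("distinct keys touched: %d, hits on hottest key (0): %d\n",
		len(counts), counts[0])
}
```

Recording the seed and parameters alongside each result keeps the dataset itself covered by the reproducibility rule.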

Global evaluation rubric

Each TZ has a project-specific rubric; all of them inherit this spine:

Dimension	Senior bar	Staff bar
Correctness under load	Meets functional reqs at target scale	Holds invariants through failure & rebalancing
Measurement rigor	Reports p99 + throughput honestly	Explains the tail, isolates the bottleneck, proves cause
Design judgment	Reasonable architecture for the SLO	Defends trade-offs; knows when not to use the fancy option
Failure handling	Graceful degradation, retries, backpressure	Designs for partial failure, poison data, and recovery
Tuning	Improves a metric with evidence	Pareto-aware; quantifies cost of each gain
Communication	Clear findings note + ADRs	Could present the numbers to a staff panel and survive

A project is "done" when you can put its findings note in front of a staff engineer and defend every number on it.

See the parent Interview Question/ bank for the matching theory.
