Projects — High-Load & Big-Data Engineering Labs
Hands-on project briefs for senior and staff Go backend engineers.
These are not CRUD toys. Every project makes you stand up a real system (Kafka, ClickHouse, Postgres, Redis, Elasticsearch…), generate large datasets, drive high load, then measure, break, and tune it until you can defend the numbers. The skill being trained is not "can you make it work" — it's "can you make it work at scale, explain why it behaves the way it does, and prove it with data."
Each brief is a TZ (техническое задание / technical specification): a realistic problem, hard requirements, explicit SLOs, a load-and-data profile, a set of experiments to run, and a grading rubric a staff interviewer would recognize.
How to use a project
- Read the whole TZ first. Note the SLOs and the dataset scale — they drive every design decision.
- Generate data at the stated scale before you optimize anything. Tuning against 1k rows teaches you nothing; the interesting behavior starts at 10M+.
- Build the smallest thing that meets the functional requirements, then run the load harness and watch it fall over.
- Run the experiments. Each TZ lists "break it / tune it" investigations. Record before/after numbers (throughput, p50/p99/p999, lag, CPU, allocations).
- Write a short findings note per project: what you changed, what it bought you, and why. This is the artifact that proves staff-level depth.
You do not need to finish every stretch goal. You do need numbers you can defend. Depth over breadth — one project taken to real scale with a written findings note beats eight half-built repos.
The scale matrix — build every project in 4 stages
"High load" isn't one thing. It has two independent axes, data volume and request rate, and they fail in completely different ways. So every project is a 4-rung ladder. Get Stage 0 correct first (it's your control), then push each axis on its own, then push both together. Don't tune what isn't yet correct.
| Stage | Data | Request rate | What it stresses |
|---|---|---|---|
| 0 · Simple | small | low | correctness only — the baseline you measure everything else against |
| 1 · Big data | huge | low | storage layout, indexing, memory, compaction, query plans |
| 2 · High RPS | small | very high | connection pools, locking/contention, tail latency, backpressure |
| 3 · Big data + High RPS | huge | very high | the interaction: hot keys in a huge set under concurrency, cache working-set misses, tail amplification — full SLOs |
The two axes are independent: a system can ace Stage 1 and fall over at Stage 2 (or vice versa). Stage 3 is the production boss fight where they compound. A project is only "senior/staff done" at Stage 3 — measured and defended.
Each project instantiates the axes in its own terms (a rate limiter's "big data" = millions of keys; a columnar store's "high RPS" = many concurrent queries). The stages map onto the TZ sections: Functional reqs → 0; Load & data profile → 1; load harness + SLOs → 2; Experiments + full SLOs → 3.
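One way to keep yourself honest about the ordering is to encode the ladder as data and gate each stage on the ones beneath it. The sketch below is only an illustration: the `Stage` type, the `ladder` table, and the `unlocked` function are hypothetical names, not part of any TZ. A stage counts as a prerequisite of another when it pushes no axis the other leaves alone, which reproduces "Stage 0 first, then each axis, then both".

```go
package main

import "fmt"

// Stage is one rung of the ladder; the two booleans are the two axes.
type Stage struct {
	ID      int
	Name    string
	BigData bool // data-volume axis is pushed
	HighRPS bool // request-rate axis is pushed
}

// ladder mirrors the scale matrix table.
var ladder = []Stage{
	{0, "Simple", false, false},
	{1, "Big data", true, false},
	{2, "High RPS", false, true},
	{3, "Big data + High RPS", true, true},
}

// prereq reports whether stage t must pass before stage s is attempted:
// t is a different stage that pushes only axes s also pushes.
func prereq(t, s Stage) bool {
	return t.ID != s.ID &&
		(!t.BigData || s.BigData) &&
		(!t.HighRPS || s.HighRPS)
}

// unlocked returns the stages not yet passed whose prerequisites have all passed.
func unlocked(passed map[int]bool) []Stage {
	var out []Stage
	for _, s := range ladder {
		if passed[s.ID] {
			continue
		}
		ok := true
		for _, t := range ladder {
			if prereq(t, s) && !passed[t.ID] {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	passed := map[int]bool{0: true} // Stage 0 measured and correct
	for _, s := range unlocked(passed) {
		fmt.Printf("next: stage %d (%s)\n", s.ID, s.Name)
	}
}
```

With only Stage 0 passed, both single-axis stages are available and Stage 3 stays locked until both of them pass.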
Tiers
- labs/: Data-systems load labs. The core of this library: push a single technology (Kafka, ClickHouse, Postgres indexes, Redis, ES) to its limits with big data and learn its failure modes.
- senior/: Own one service end-to-end, with high load baked into the requirements. You design the data model, the API, and the scaling story.
- staff/: Cross-cutting, ambiguous, multi-system problems at scale: consensus, sharding, multi-region, migrations, exactly-once pipelines.
Catalog
labs/ — Data-Systems Load Labs
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | Kafka throughput & exactly-once | Partitioning, consumer lag, rebalancing, EOS | 11, 13, 17 |
| 02 | ClickHouse OLAP at scale | MergeTree, materialized views, projections, billions of rows | 23, 17, 14 |
| 03 | Postgres indexing & partitioning | B-tree/GIN/BRIN/partial/covering, bloat, partitioning, EXPLAIN | 5, 17, 23 |
| 04 | Redis at scale | Cluster, pipelining, hot keys, eviction, persistence | 7, 22, 17 |
| 05 | Elasticsearch at scale | Sharding, bulk indexing, relevance, query latency | 8, 17, 22 |
| 06 | Database bake-off: analytics under load | Postgres vs ClickHouse vs Mongo, selection | 23, 5, 6 |
| 07 | Connection-pool & DB saturation | Pool sizing, PgBouncer, N+1, lock contention | 5, 17, 22 |
| 08 | Streaming backpressure | Lag, rebalancing storms, exactly-once delivery | 11, 13, 2 |
| 09 | Cache stampede & invalidation | Thundering herd, TTL jitter, single-flight, write strategies | 7, 17, 22 |
events/ — Event-Engineering Labs
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | CDC pipeline (Debezium) | Postgres WAL → Kafka, snapshot + stream, outbox-vs-CDC | 5, 11, 13 |
| 02 | Stateful windowing processor | Tumbling/sliding/session windows, watermarks, late data, local state | 11, 13, 17 |
| 03 | Event replay & reprojection | Rebuild read models from 1B-event log, time-travel, zero-downtime | 11, 12, 13 |
| 04 | Schema registry & evolution | Avro/Protobuf compat, breaking-change detection, rolling upgrades | 10, 11, 12 |
| 05 | DLQ & retry topology | Poison messages, retry topics, backoff, parking-lot, replay | 11, 13, 17 |
| 06 | Broker bake-off | Kafka vs RabbitMQ vs NATS JetStream under load; selection | 11, 13, 23 |
| 07 | Idempotent inbox/outbox (cross-service) | Exactly-once over HTTP/gRPC, idempotency keys, dedup | 10, 11, 13 |
| 08 | Consumer autoscaling on lag | KEDA-style scaling, 10x spikes, drain time, rebalance cost | 11, 19, 22 |
resilience/ — Rate-Limiting & Resilience Labs
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | Rate-limit algorithm bake-off | Token/leaky bucket, sliding window log/counter, GCRA; accuracy vs memory vs burst | 7, 17, 22 |
| 02 | Adaptive concurrency & load shedding | AIMD, Little's Law, concurrency-limits, shed before collapse | 13, 17, 22 |
| 03 | Circuit breaker / bulkhead / timeout | Cascading-failure isolation, timeout budgets, half-open probing | 13, 22, 9 |
| 04 | Hierarchical multi-tenant quotas | Per-tenant/endpoint/global fairness, quota borrowing, noisy-neighbor isolation | 7, 22, 12 |
load-testing/ — How to Test High Load (the meta-skill)
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | Distributed load generator | Open vs closed model, coordinated omission, HdrHistogram, distributed agents | 17, 22, 9 |
| 02 | Chaos & fault injection | Latency/error/kill injection, steady-state hypothesis, blast radius | 13, 22, 18 |
| 03 | Soak & leak hunting | 24h endurance, goroutine/memory/fd leaks, pprof over time | 1, 17, 18 |
| 04 | Capacity & breakpoint testing | Stress to breaking point, USE method, Little's Law, the knee | 17, 22, 14 |
| 05 | Go memory & zero-allocation | Escape analysis, sync.Pool, GOGC/GOMEMLIMIT, arenas/mmap, alloc reduction | 1, 17, 2 |
| 06 | Microbenchmarking & benchstat | testing.B pitfalls, benchstat A/B significance, CI perf-regression gates | 1, 17, 15 |
| 07 | Profiling-guided optimization | pprof CPU/heap/block/mutex, flame graphs, runtime/trace, optimize a hot path | 17, 1, 18 |
observability/ — Logging & Monitoring at scale
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | Centralized logging pipeline | High-volume structured logging: ship→buffer→index→query, sampling, cardinality, backpressure | 18, 17, 22 |
| 02 | Metrics, monitoring & alerting | Prometheus instrumentation, scrape, alerting rules, SLO/error-budget burn, alert fatigue | 18, 22, 17 |
distributed-patterns/ — Coordination & Distributed-Transaction Patterns
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | Leader election | Lease-based / Raft-lease election, split-brain avoidance, fencing | 13, 2, 22 |
| 02 | Distributed lock with fencing | Redis/etcd locks, fencing tokens, the Redlock debate, liveness vs safety | 13, 7, 2 |
| 03 | Scatter-gather aggregator | Fan a request to N shards/services, aggregate, straggler/timeout handling | 13, 22, 9 |
| 04 | Claim-check | Large payloads via a store + reference through the queue; broker-size limits | 11, 13, 20 |
| 05 | Fan-out / fan-in pipeline | Parallel fan-out then join, bounded concurrency, partial-failure handling | 2, 13, 17 |
| 06 | 2PC / 3PC coordinator | Two/three-phase commit, coordinator-failure blocking, recovery log, why it doesn't scale | 13, 5, 15 |
| 07 | Saga: orchestration vs choreography | Build both; compare coupling, failure handling, observability, compensation | 11, 12, 13 |
| 08 | TCC (Try-Confirm-Cancel) | Reservation/confirm/cancel, idempotency, timeout-driven cancellation; payment txns | 13, 11, 16 |
networking/ — Sockets & Network Performance
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | High-performance TCP socket server | Raw sockets, framing, goroutine-per-conn vs epoll, TCP tuning, zero-copy, C10K→C10M | 9, 2, 17 |
senior/ — Service builds (high-load baked in)
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | Idempotent double-entry ledger | Postgres txns, idempotency, exactly-once, 10k+ TPS | 5, 10, 15, 16 |
| 02 | Distributed rate limiter | Redis + Lua, token bucket/sliding window, 1M+ req/s | 7, 22, 9 |
| 03 | Durable job queue / scheduler | Go concurrency, retries/backoff, DLQ | 2, 11, 15 |
| 04 | Realtime chat & presence | WebSockets, pub/sub, 100k+ connections, backpressure | 2, 7, 9 |
| 05 | Content-addressed storage (S3-like) | Chunking, dedup, multipart, throughput | 5, 20, 16 |
| 06 | Observability backend | High-cardinality OTLP ingest, Go perf | 18, 17 |
| 07 | Event-driven order/payment service ⭐ flagship | Go · Postgres · Kafka · Outbox · DDD, end-to-end | 1, 5, 11, 12, 13, 15, 18 |
| 08 | API gateway / edge proxy | Reverse proxy, JWT/JWKS, per-route rate limit + breaker, hot-reload, 150k+ req/s | 10, 9, 22, 16, 18 |
staff/ — Systems at scale
| # | Project | Trains | Interview sections |
|---|---|---|---|
| 01 | Sharded multi-tenant platform | Sharding, routing, live resharding, isolation | 5, 13, 22, 23 |
| 02 | Event-sourced CQRS + Saga + Outbox | Event sourcing, projections, distributed txns | 11, 12, 13 |
| 03 | Raft-backed metadata KV store | Consensus, log replication, leader election | 13, 2 |
| 04 | Multi-region active-active | CRDT/LWW, geo-routing, replication | 13, 22, 20 |
| 05 | Exactly-once streaming pipeline | Kafka → ClickHouse, windowing, stateful processing | 11, 13, 17 |
| 06 | Monolith → services migration | Strangler, dual-write, backfill, zero-downtime | 12, 13, 5 |
| 07 | Mini message broker (Kafka-lite) | Segmented append-only log, partitions, consumer groups, replication | 11, 13, 17 |
| 08 | LSM-tree storage engine | Memtable, SSTables, compaction, WAL, Bloom filters, read/write amplification | 5, 17, 23 |
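Most projects tag three interview sections, with a range of two to seven. As listed here, the catalog tags 20 of the 23 sections; sections 3, 4, and 21 are not tagged by any project above.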
Shared tooling conventions
Every TZ assumes these, so they aren't repeated in each brief:
- Local infra: `docker-compose` (or `kind` for k8s labs) brings up the dependency. Pin versions; record them in your findings.
- Dataset generator: a Go program (`cmd/gen`) that produces data at the target scale with realistic distributions (Zipfian for hot keys, skewed cardinality where it matters). Deterministic via a seed so runs are reproducible.
- Load harness: k6, vegeta, or a custom Go driver. Drive a defined traffic profile (open vs closed model; state the model). Capture a full latency histogram, not just an average.
- Measurement: p50 / p95 / p99 / p999 latency, throughput, error rate, and the resource cost (CPU, memory, allocations via `pprof`, disk/network I/O). An average with no tail is not an answer.
- Reproducibility: every number in your findings note must come with the command and config that produced it.
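For the dataset-generator convention, a minimal sketch of a seeded Zipfian key stream might look like the following. It uses the standard library's `math/rand` Zipf generator; the function name `keyStream`, the skew parameters (`s = 1.2`, `v = 1`), and the key-space size are assumptions picked for illustration, not values any TZ prescribes.

```go
package main

import (
	"fmt"
	"math/rand"
)

// keyStream returns n key IDs in [0, keys) drawn from a Zipf distribution.
// The same seed always yields the same sequence, so runs are reproducible.
// keys must be >= 1. The skew (s=1.2, v=1) is an illustrative assumption.
func keyStream(seed int64, keys uint64, n int) []uint64 {
	r := rand.New(rand.NewSource(seed))
	z := rand.NewZipf(r, 1.2, 1, keys-1) // requires s > 1 and v >= 1; values fall in [0, keys-1]
	out := make([]uint64, n)
	for i := range out {
		out[i] = z.Uint64()
	}
	return out
}

func main() {
	const (
		seed = 42        // record this in the findings note
		keys = 1_000_000 // hypothetical key-space size
		n    = 1_000_000 // hypothetical number of generated accesses
	)

	// Count how much traffic lands on the ten hottest key IDs.
	hot := 0
	for _, k := range keyStream(seed, keys, n) {
		if k < 10 {
			hot++
		}
	}
	fmt.Printf("share of accesses hitting the 10 hottest keys: %.1f%%\n",
		100*float64(hot)/float64(n))
}
```

Low key IDs are the hot ones here; mapping IDs to real keys (user IDs, tenant IDs) is left to the generator, and the seed belongs next to the command in the findings note.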
Global evaluation rubric
Each TZ has a project-specific rubric; all of them inherit this spine:
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Correctness under load | Meets functional reqs at target scale | Holds invariants through failure & rebalancing |
| Measurement rigor | Reports p99 + throughput honestly | Explains the tail, isolates the bottleneck, proves cause |
| Design judgment | Reasonable architecture for the SLO | Defends trade-offs; knows when not to use the fancy option |
| Failure handling | Graceful degradation, retries, backpressure | Designs for partial failure, poison data, and recovery |
| Tuning | Improves a metric with evidence | Pareto-aware; quantifies cost of each gain |
| Communication | Clear findings note + ADRs | Could present the numbers to a staff panel and survive |
A project is "done" when you can put its findings note in front of a staff engineer and defend every number on it.
See the parent Interview Question/ bank for the matching theory.