System Design Roadmap

Roadmap: https://roadmap.sh/system-design

A single, logically ordered learning path: Foundations → Networking → Compute & API → Data → Async & Coordination → Building Blocks → Reliability & Ops → Security & Governance → Specialized → Capstone. Every topic follows TEMPLATE.md (9 files: junior, middle, senior, professional, interview, tasks, find-bug, optimize, specification).

Companion roadmaps (not duplicated here):

- Distributed Systems — consensus, replication, sharding, sagas, CRDTs, service mesh, tracing
- Architecture / DDD — bounded contexts, aggregates, hexagonal, event storming
- Computer Science — OS, networking internals, DB internals

Foundations

1. Introduction

1.1 What is System Design?
1.2 How to Approach System Design
1.3 Functional vs Non-Functional Requirements
1.4 Key Characteristics — scalability, availability, reliability, maintainability
1.5 Numbers Every Engineer Should Know

2. Trade-offs Framework

2.1 CAP Theorem
2.2 PACELC
2.3 Consistency vs Availability — weak / eventual / strong, fail-over, replication

3. Capacity Estimation

3.1 QPS
3.2 Storage
3.3 Bandwidth
3.4 Latency Budgets
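
The four quantities in this section reduce to simple arithmetic. The sketch below is only an illustration: the function names, the `peak_factor` multiplier, and every input value are hypothetical assumptions, not figures taken from this roadmap.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(daily_active_users, requests_per_user_per_day,
                      write_ratio, bytes_per_write, bytes_per_response,
                      peak_factor=2.0):
    """Rough QPS, storage, and bandwidth numbers from a few assumed inputs."""
    daily_requests = daily_active_users * requests_per_user_per_day
    average_qps = daily_requests / SECONDS_PER_DAY
    # Peak traffic is modeled as a flat multiple of the average (an assumption).
    peak_qps = average_qps * peak_factor
    # Only the write fraction of requests adds stored bytes.
    storage_bytes_per_day = daily_requests * write_ratio * bytes_per_write
    # Egress is approximated as average QPS times the response size.
    egress_bytes_per_second = average_qps * bytes_per_response
    return {
        "average_qps": average_qps,
        "peak_qps": peak_qps,
        "storage_bytes_per_day": storage_bytes_per_day,
        "egress_bytes_per_second": egress_bytes_per_second,
    }


def fits_latency_budget(hop_latencies_ms, budget_ms):
    """True if the summed per-hop latencies stay within the end-to-end budget."""
    return sum(hop_latencies_ms.values()) <= budget_ms


if __name__ == "__main__":
    # Hypothetical workload: 1M daily users, 10 requests each, 10% writes.
    print(estimate_capacity(1_000_000, 10, 0.1, 1_000, 2_000))
    print(fits_latency_budget({"lb": 5, "app": 40, "db": 30}, budget_ms=100))
```

Summing hop latencies treats the hops as strictly sequential; parallel calls would need the maximum of the parallel branch instead.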

4. Back-of-Envelope

4.1 Number Tables
4.2 Fermi Estimation

Networking

5. Networking & Protocols

5.1 OSI & TCP/IP
5.2 TCP vs UDP
5.3 TLS & HTTPS
5.4 HTTP Evolution — HTTP/1.1, HTTP/2, HTTP/3, QUIC
5.5 WebSockets
5.6 Server-Sent Events
5.7 Long-Polling & Streaming
5.8 Network Proxies & NAT

6. Domain Name System

6.1 DNS Resolution Flow
6.2 Record Types
6.3 DNS Load Balancing
6.4 DNS Caching & TTL
6.5 GeoDNS & Anycast

7. Content Delivery Networks

7.1 Pull CDN
7.2 Push CDN
7.3 Cache Invalidation
7.4 Edge Locations
7.5 CDN Security

8. Load Balancers

8.1 LB vs Reverse Proxy
8.2 Load Balancing Algorithms
8.3 Layer 4 Load Balancing
8.4 Layer 7 Load Balancing
8.5 Health Checks & Failover
8.6 Horizontal Scaling
8.7 Global Server Load Balancing

9. Communication

9.1 HTTP
9.2 TCP
9.3 UDP
9.4 RPC
9.5 gRPC
9.6 REST
9.7 GraphQL
9.8 Idempotent Operations — HTTP method semantics (mechanics → §18.1)

Compute & API

10. Application Layer

10.1 Microservices
10.2 Monolith vs Microservices
10.3 Service Discovery
10.4 API Composition
10.5 Stateless Design
10.6 Service Mesh (intro)

11. API Design at Scale

11.1 API Gateway (canonical home for gateway patterns) — routing, aggregation, offloading
11.2 REST Design at Scale
11.3 GraphQL Federation
11.4 gRPC & Streaming
11.5 Versioning & Deprecation
11.6 Pagination & Filtering
11.7 Idempotency & Retries — API request dedup (mechanics → §18.1)
11.8 Webhooks
11.9 Backends for Frontend (BFF)

Data

12. Databases

Data models / types (each topic explains the model, trade-offs & representative engines — engine internals live in their own roadmaps: Redis, MongoDB, PostgreSQL, Elasticsearch):

12.1 Relational (RDBMS) — PostgreSQL, MySQL
12.2 Key-Value — Redis, DynamoDB, etcd
12.3 Document — MongoDB, Couchbase
12.4 Wide-Column — Cassandra, ScyllaDB, HBase, Bigtable
12.5 Column-Oriented (OLAP) — ClickHouse, Druid, Pinot
12.6 Graph — Neo4j, JanusGraph
12.7 Time-Series — InfluxDB, TimescaleDB, Prometheus
12.8 Search Engine — Elasticsearch, OpenSearch, Solr
12.9 Vector — pgvector, Milvus, Pinecone, Weaviate
12.10 NewSQL / Distributed SQL — CockroachDB, Spanner, Vitess, TiDB

Cross-cutting concepts:

12.11 Replication
12.12 Sharding & Partitioning
12.13 Indexing
12.14 Transactions & Isolation
12.15 Denormalization
12.16 SQL Tuning
12.17 SQL vs NoSQL
12.18 OLTP vs OLAP
12.19 Polyglot Persistence
12.20 Choosing a Database — decision framework

13. Storage Systems

Low-level storage only — database data models live in §12.

13.1 Object vs Block vs File
13.2 Distributed File Systems — GFS, HDFS
13.3 Blob Storage — S3-like
13.4 LSM-Trees vs B-Trees — RocksDB, LevelDB
13.5 Data Warehouse vs Data Lake
13.6 File Formats — Parquet, ORC, Iceberg

14. Caching

14.1 Cache-Aside
14.2 Write-Through
14.3 Write-Behind
14.4 Refresh-Ahead
14.5 Eviction Policies
14.6 Types of Caching — client, CDN, web, DB, application
14.7 Cache Invalidation
14.8 Cache Stampede & Hot Keys

15. Data Streaming & Big Data

15.1 Batch Processing — MapReduce
15.2 Apache Spark
15.3 Stream Processing
15.4 Apache Kafka
15.5 Lambda vs Kappa Architecture
15.6 Data Lake & Warehouse
15.7 Change Data Capture
15.8 ETL vs ELT

Async & Coordination

16. Asynchronism

16.1 Message Queues
16.2 Task Queues
16.3 Back Pressure
16.4 Dead-Letter Queues
16.5 Delivery Guarantees

17. Background Jobs

17.1 Event-Driven
17.2 Schedule-Driven
17.3 Returning Results
17.4 Retries & Idempotency — job re-runs (mechanics → §18.1)

18. Concurrency & Coordination

18.1 Idempotency Keys (canonical: idempotency & exactly-once mechanics — referenced by §9.8 HTTP, §11.7 API, §17.4 jobs)
18.2 Leases & Fencing
18.3 Exactly-Once Semantics
18.4 Optimistic vs Pessimistic Locking
18.5 Coordination Services — ZooKeeper, etcd, Consul

Building Blocks

19. Building Blocks

Use vs Build: §19 is the canonical home for building each component from scratch. Other sections use them as ready components and link here — no algorithm is re-taught:

- message queue (use §16.1 ↔ build §19.6)
- blob store (use §13.3 ↔ build §19.8)
- search/typeahead (type §12.8 ↔ build §19.9)
- pub-sub (pattern §21.13 ↔ build §19.7)
- distributed lock (concept §18.4 ↔ build §19.11)

19.1 Rate Limiter (canonical home) — token bucket, leaky bucket, fixed window, sliding-window log/counter, distributed rate limiting
19.2 Consistent Hashing — hash ring, virtual nodes
19.3 Unique ID Generator — UUID, Snowflake, ticket server
19.4 Distributed Key-Value Store — quorum, vector clocks
19.5 Distributed Cache — sharding, eviction, hot keys
19.6 Distributed Message Queue — delivery, ordering, DLQ
19.7 Pub-Sub System — topics, fan-out, retention
19.8 Blob / Object Store — chunking, metadata, lifecycle
19.9 Distributed Search / Typeahead — inverted index, trie
19.10 Distributed Task Scheduler — cron at scale, leasing
19.11 Distributed Lock — fencing tokens, Redlock
19.12 Distributed Logging — ingestion, indexing, sampling
19.13 Sharded Counters / Leaderboard — write contention
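
As one illustration of the algorithms named under 19.1, here is a minimal single-process token bucket. It is a sketch under assumptions, not the roadmap's reference implementation: the class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical, and it does not cover the distributed case, which would need shared state.

```python
import time


class TokenBucket:
    """Single-process token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so tests can control time
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill lazily based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    allowed = sum(bucket.allow() for _ in range(20))
    print(f"{allowed} of 20 back-to-back requests allowed")
```

Refilling lazily inside `allow` avoids a background timer; the trade-off is that the bucket state is only accurate at the moment of a request.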

Reliability & Operations

20. Reliability Patterns

20.1 Circuit Breaker
20.2 Bulkhead
20.3 Retry
20.4 Throttling — server-side load shedding angle (algorithms → §19.1; see also §40.6 Load Shedding)
20.5 Health Endpoint Monitoring
20.6 Leader Election
20.7 Compensating Transaction
20.8 Deployment Stamps & Geodes
20.9 Queue-Based Load Leveling

21. Cloud Design Patterns

21.1 Strangler Fig — pattern definition (org-scale application → §36.2)
21.2 Sidecar
21.3 Ambassador
21.4 Anti-Corruption Layer
21.5 CQRS
21.6 Event Sourcing
21.7 Materialized View
21.8 Pipes and Filters
21.9 External Config Store
21.10 Valet Key
21.11 Claim Check
21.12 Competing Consumers
21.13 Publisher/Subscriber — pattern (build a pub-sub system → §19.7)

Note: gateway routing, aggregation, and offloading are covered in §11.1; Backends for Frontend is covered in §11.9.

22. Performance Antipatterns

22.1 Busy Database
22.2 Busy Frontend
22.3 Chatty I/O
22.4 Extraneous Fetching
22.5 Improper Instantiation
22.6 Monolithic Persistence
22.7 Noisy Neighbor
22.8 Synchronous I/O
22.9 Retry Storm
22.10 No Caching

23. Monitoring

23.1 Health Monitoring
23.2 Availability Monitoring
23.3 Performance Monitoring
23.4 Security Monitoring
23.5 Usage Monitoring
23.6 Instrumentation
23.7 Visualization & Alerts

24. Observability

24.1 Logs, Metrics, Traces
24.2 SLO / SLI / Error Budgets
24.3 RED & USE Methods
24.4 Distributed Tracing
24.5 Metrics Pipelines
24.6 Log Aggregation
24.7 Alerting & On-Call

25. Chaos Engineering

25.1 Failure Modes
25.2 Fault Injection
25.3 Game Days
25.4 Resilience Testing
25.5 Blast Radius & Recovery

26. Deployment & Infrastructure

26.1 Containers & Docker
26.2 Kubernetes Orchestration
26.3 Deployment Strategies — blue-green, canary, rolling
26.4 CI/CD Pipelines
26.5 Infrastructure as Code
26.6 Multi-Region Deployment
26.7 Disaster Recovery
26.8 Autoscaling

Security & Governance

27. Security at Scale

27.1 Authentication
27.2 Authorization — RBAC, ABAC
27.3 OAuth2 & OIDC
27.4 JWT & Tokens
27.5 Encryption at Rest & in Transit
27.6 Secrets Management
27.7 DDoS Mitigation
27.8 WAF & API Security
27.9 Rate Limiting for Abuse — bot / DDoS / login-abuse angle (algorithms → §19.1)

28. Data Privacy & Compliance

28.1 PII & Data Classification
28.2 GDPR & Right to Be Forgotten
28.3 Data Residency
28.4 Audit Logging
28.5 Encryption Key Lifecycle

29. Multi-Tenancy & SaaS

29.1 Tenant Isolation Models
29.2 Data Partitioning per Tenant
29.3 Noisy-Neighbor Mitigation
29.4 Per-Tenant Scaling & Limits
29.5 Tenant Onboarding & Config

Specialized

30. Geospatial Systems

30.1 Geohashing
30.2 Quadtrees
30.3 S2 & H3
30.4 Proximity Search
30.5 Map Tiling & Routing

31. ML & Recommendation Systems

31.1 Recommendation Architecture
31.2 Feature Store
31.3 Candidate Generation
31.4 Ranking & Scoring
31.5 Online vs Offline Inference
31.6 A/B Testing & Feedback Loops

Capstone

32. Classic Problems

URL shortener · Twitter timeline · WhatsApp/chat · YouTube/Netflix · Uber dispatch · Dropbox sync · Instagram feed · Stack Overflow · ad click counter · payment system · web crawler · recommendation engine · key-value store · Google Docs collab editor · proximity/Maps · Ticketmaster booking · notification system · live streaming · distributed job scheduler · stock exchange · S3 object storage · online judge · distributed analytics counter

33. Real-World Architectures

Google Spanner · Facebook TAO · Amazon DynamoDB · Netflix stack · Apache Kafka · Apache Cassandra · Redis internals · Discord realtime · Slack messaging · Uber/Lyft dispatch

34. Interview Playbook

34.1 RESHADED Framework
34.2 Requirements Clarification
34.3 Capacity Estimation in the Interview
34.4 API Design Step
34.5 High-Level Design
34.6 Data Model & Storage Choice
34.7 Deep Dives & Bottlenecks
34.8 Trade-offs & Wrap-up
34.9 Common Mistakes
34.10 Mock Interview Walkthroughs

Staff / Principal

Beyond building systems: org-scale judgment, evolution over time, cost, and sociotechnical design. Each topic in this part can carry a fifth tier file, staff.md, above professional.md.

35. Architecture Decision-Making

35.1 Architecture Decision Records (ADRs)
35.2 RFC Process
35.3 Evolutionary Architecture
35.4 Fitness Functions
35.5 Tech Radar
35.6 Build vs Buy
35.7 Trade-off Analysis Frameworks

36. Large-Scale Migrations

36.1 Monolith to Microservices
36.2 Strangler Fig at Scale
36.3 Zero-Downtime Migration
36.4 Expand-Contract Pattern
36.5 Dual-Write & Backfill
36.6 Data Migration at Scale
36.7 Deprecation Strategy

37. Sociotechnical & Org Design

37.1 Conway's Law
37.2 Team Topologies
37.3 Platform Engineering / IDP
37.4 Ownership & Boundaries
37.5 Cognitive Load

38. Cost & Efficiency (FinOps)

38.1 Cost Modeling
38.2 Capacity Planning
38.3 Efficiency as a Feature
38.4 Hardware-Aware Design
38.5 Performance Economics

39. Global / Multi-Region Architecture

39.1 Active-Active Architecture
39.2 Data Sovereignty & Residency
39.3 Geo-Routing
39.4 Global Consistency
39.5 Conflict Resolution
39.6 Follow-the-Sun

40. SRE & Reliability Engineering

40.1 Error Budgets
40.2 SLO Ownership
40.3 Incident Management
40.4 Postmortems
40.5 Toil Reduction
40.6 Load Shedding
40.7 Graceful Degradation

41. Performance Engineering & Tail Latency

41.1 Tail Latency — p99 / p999
41.2 Coordinated Omission
41.3 Hedged Requests
41.4 Backpressure (deep)
41.5 Queueing Theory — Little's Law
41.6 Universal Scalability Law
41.7 Amdahl's Law

42. Data Governance & Contracts

42.1 Schema Registry
42.2 Data Contracts
42.3 Data Lineage
42.4 Data Quality
42.5 Master Data Management
42.6 Privacy by Design