System Design Roadmap

  • Roadmap: https://roadmap.sh/system-design

A single, logically ordered learning path: Foundations → Networking → Compute & API → Data → Async & Coordination → Building Blocks → Reliability & Ops → Security & Governance → Specialized → Capstone. Every topic follows TEMPLATE.md (9 files: junior, middle, senior, professional, interview, tasks, find-bug, optimize, specification).

Companion roadmaps (not duplicated here):

  • Distributed Systems — consensus, replication, sharding, sagas, CRDTs, service mesh, tracing
  • Architecture / DDD — bounded contexts, aggregates, hexagonal, event storming
  • Computer Science — OS, networking internals, DB internals


Foundations

1. Introduction

  • 1.1 What is System Design?
  • 1.2 How to Approach System Design
  • 1.3 Functional vs Non-Functional Requirements
  • 1.4 Key Characteristics — scalability, availability, reliability, maintainability
  • 1.5 Numbers Every Engineer Should Know

2. Trade-offs Framework

  • 2.1 CAP Theorem
  • 2.2 PACELC
  • 2.3 Consistency vs Availability — weak / eventual / strong, fail-over, replication

3. Capacity Estimation

  • 3.1 QPS
  • 3.2 Storage
  • 3.3 Bandwidth
  • 3.4 Latency Budgets
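
As a rough illustration of how the estimates in §3 chain together, the sketch below derives QPS, storage, and bandwidth from a handful of inputs. Every input (daily active users, requests per user, payload size, peak factor, retention) is a hypothetical assumption chosen for easy arithmetic, not a figure from this roadmap, and latency budgets are left out.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Workload:
    """Hypothetical inputs; replace with the numbers for your own system."""
    daily_active_users: int
    requests_per_user_per_day: int
    avg_payload_bytes: int
    peak_factor: float        # peak traffic relative to the daily average
    retention_days: int


def estimate(w: Workload) -> dict:
    """Back-of-envelope QPS, storage, and bandwidth for one workload."""
    requests_per_day = w.daily_active_users * w.requests_per_user_per_day
    avg_qps = requests_per_day / SECONDS_PER_DAY
    peak_qps = avg_qps * w.peak_factor
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        # Assumes every request persists exactly one payload.
        "storage_bytes": requests_per_day * w.avg_payload_bytes * w.retention_days,
        # Bandwidth at peak, in bytes per second.
        "peak_bandwidth_bytes_per_s": peak_qps * w.avg_payload_bytes,
    }


if __name__ == "__main__":
    example = Workload(
        daily_active_users=1_000_000,
        requests_per_user_per_day=10,
        avg_payload_bytes=1_000,
        peak_factor=2.0,
        retention_days=365,
    )
    for name, value in estimate(example).items():
        print(f"{name}: {value:,.0f}")
```

Keeping the inputs in one structure makes it easy to rerun the estimate when an assumption changes, which is usually the point of the exercise.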

4. Back-of-Envelope

  • 4.1 Number Tables
  • 4.2 Fermi Estimation

Networking

5. Networking & Protocols

  • 5.1 OSI & TCP/IP
  • 5.2 TCP vs UDP
  • 5.3 TLS & HTTPS
  • 5.4 HTTP Evolution — HTTP/1.1, HTTP/2, HTTP/3, QUIC
  • 5.5 WebSockets
  • 5.6 Server-Sent Events
  • 5.7 Long-Polling & Streaming
  • 5.8 Network Proxies & NAT

6. Domain Name System

  • 6.1 DNS Resolution Flow
  • 6.2 Record Types
  • 6.3 DNS Load Balancing
  • 6.4 DNS Caching & TTL
  • 6.5 GeoDNS & Anycast

7. Content Delivery Networks

  • 7.1 Pull CDN
  • 7.2 Push CDN
  • 7.3 Cache Invalidation
  • 7.4 Edge Locations
  • 7.5 CDN Security

8. Load Balancers

  • 8.1 LB vs Reverse Proxy
  • 8.2 Load Balancing Algorithms
  • 8.3 Layer 4 Load Balancing
  • 8.4 Layer 7 Load Balancing
  • 8.5 Health Checks & Failover
  • 8.6 Horizontal Scaling
  • 8.7 Global Server Load Balancing

9. Communication

  • 9.1 HTTP
  • 9.2 TCP
  • 9.3 UDP
  • 9.4 RPC
  • 9.5 gRPC
  • 9.6 REST
  • 9.7 GraphQL
  • 9.8 Idempotent Operations — HTTP method semantics (mechanics → §18.1)

Compute & API

10. Application Layer

  • 10.1 Microservices
  • 10.2 Monolith vs Microservices
  • 10.3 Service Discovery
  • 10.4 API Composition
  • 10.5 Stateless Design
  • 10.6 Service Mesh (intro)

11. API Design at Scale

  • 11.1 API Gateway (canonical home for gateway patterns) — routing, aggregation, offloading
  • 11.2 REST Design at Scale
  • 11.3 GraphQL Federation
  • 11.4 gRPC & Streaming
  • 11.5 Versioning & Deprecation
  • 11.6 Pagination & Filtering
  • 11.7 Idempotency & Retries — API request dedup (mechanics → §18.1)
  • 11.8 Webhooks
  • 11.9 Backends for Frontend (BFF)

Data

12. Databases

Data models / types (each topic explains the model, trade-offs & representative engines — engine internals live in their own roadmaps: Redis, MongoDB, PostgreSQL, Elasticsearch):

  • 12.1 Relational (RDBMS) — PostgreSQL, MySQL
  • 12.2 Key-Value — Redis, DynamoDB, etcd
  • 12.3 Document — MongoDB, Couchbase
  • 12.4 Wide-Column — Cassandra, ScyllaDB, HBase, Bigtable
  • 12.5 Column-Oriented (OLAP) — ClickHouse, Druid, Pinot
  • 12.6 Graph — Neo4j, JanusGraph
  • 12.7 Time-Series — InfluxDB, TimescaleDB, Prometheus
  • 12.8 Search Engine — Elasticsearch, OpenSearch, Solr
  • 12.9 Vector — pgvector, Milvus, Pinecone, Weaviate
  • 12.10 NewSQL / Distributed SQL — CockroachDB, Spanner, Vitess, TiDB

Cross-cutting concepts:

  • 12.11 Replication
  • 12.12 Sharding & Partitioning
  • 12.13 Indexing
  • 12.14 Transactions & Isolation
  • 12.15 Denormalization
  • 12.16 SQL Tuning
  • 12.17 SQL vs NoSQL
  • 12.18 OLTP vs OLAP
  • 12.19 Polyglot Persistence
  • 12.20 Choosing a Database — decision framework

13. Storage Systems

Low-level storage only — database data models live in §12.

  • 13.1 Object vs Block vs File
  • 13.2 Distributed File Systems — GFS, HDFS
  • 13.3 Blob Storage — S3-like
  • 13.4 LSM-Trees vs B-Trees — RocksDB, LevelDB
  • 13.5 Data Warehouse vs Data Lake
  • 13.6 File Formats — Parquet, ORC, Iceberg

14. Caching

  • 14.1 Cache-Aside
  • 14.2 Write-Through
  • 14.3 Write-Behind
  • 14.4 Refresh-Ahead
  • 14.5 Eviction Policies
  • 14.6 Types of Caching — client, CDN, web, DB, application
  • 14.7 Cache Invalidation
  • 14.8 Cache Stampede & Hot Keys
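
To make the first pattern in this list concrete, here is a minimal cache-aside sketch (§14.1). The `CacheAside` class, its `load` and `store` callbacks, and the in-process dictionary are all hypothetical stand-ins for a real cache and database. It has no TTL, eviction, or stampede protection.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Cache-aside: the application, not the cache, talks to the backing store."""

    def __init__(
        self,
        load: Callable[[str], Optional[str]],
        store: Callable[[str, str], None],
    ) -> None:
        self._load = load      # hypothetical read from the source of truth
        self._store = store    # hypothetical write to the source of truth
        self._cache: Dict[str, str] = {}
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]        # hit: serve from the cache
        self.misses += 1
        value = self._load(key)            # miss: read the source of truth
        if value is not None:
            self._cache[key] = value       # populate for later reads
        return value

    def put(self, key: str, value: str) -> None:
        self._store(key, value)            # write the source of truth first
        self._cache.pop(key, None)         # then invalidate rather than update
```

Invalidating on write, rather than updating the cached value, is one common choice. It trades an extra miss for a smaller window in which the cache and the store disagree.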

15. Data Streaming & Big Data

  • 15.1 Batch Processing — MapReduce
  • 15.2 Apache Spark
  • 15.3 Stream Processing
  • 15.4 Apache Kafka
  • 15.5 Lambda vs Kappa Architecture
  • 15.6 Data Lake & Warehouse
  • 15.7 Change Data Capture
  • 15.8 ETL vs ELT

Async & Coordination

16. Asynchronism

  • 16.1 Message Queues
  • 16.2 Task Queues
  • 16.3 Back Pressure
  • 16.4 Dead-Letter Queues
  • 16.5 Delivery Guarantees

17. Background Jobs

  • 17.1 Event-Driven
  • 17.2 Schedule-Driven
  • 17.3 Returning Results
  • 17.4 Retries & Idempotency — job re-runs (mechanics → §18.1)

18. Concurrency & Coordination

  • 18.1 Idempotency Keys (canonical: idempotency & exactly-once mechanics — referenced by §9.8 HTTP, §11.7 API, §17.4 jobs)
  • 18.2 Leases & Fencing
  • 18.3 Exactly-Once Semantics
  • 18.4 Optimistic vs Pessimistic Locking
  • 18.5 Coordination Services — ZooKeeper, etcd, Consul
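
The idempotency-key idea in §18.1 can be sketched in a few lines: remember the result of an operation under a caller-supplied key and replay it on duplicates. The `IdempotentExecutor` name and its in-memory dictionary are assumptions for illustration. A real deployment would need a durable shared store, key expiry, and handling for concurrent in-flight requests, none of which appear here.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs an operation at most once per key and replays the stored result."""

    def __init__(self) -> None:
        # In-memory only: results are lost on restart and not shared across nodes.
        self._results: Dict[str, Any] = {}

    def execute(self, key: str, operation: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]   # duplicate request: replay, do not re-run
        result = operation()            # an exception here leaves the key unused
        self._results[key] = result     # record only after the operation succeeds
        return result
```

Because the result is stored only after success, a failed attempt can be retried with the same key and will run again.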

Building Blocks

19. Building Blocks

Use vs Build: §19 is the canonical home for building each component from scratch. Other sections use these components as ready-made parts and link here, so no algorithm is taught twice: message queue (use §16.1 ↔ build §19.6), blob store (use §13.3 ↔ build §19.8), search/typeahead (data model §12.8 ↔ build §19.9), pub-sub (pattern §21.13 ↔ build §19.7), distributed lock (concept §18.4 ↔ build §19.11).

  • 19.1 Rate Limiter (canonical home) — token bucket, leaky bucket, fixed window, sliding-window log/counter, distributed rate limiting
  • 19.2 Consistent Hashing — hash ring, virtual nodes
  • 19.3 Unique ID Generator — UUID, Snowflake, ticket server
  • 19.4 Distributed Key-Value Store — quorum, vector clocks
  • 19.5 Distributed Cache — sharding, eviction, hot keys
  • 19.6 Distributed Message Queue — delivery, ordering, DLQ
  • 19.7 Pub-Sub System — topics, fan-out, retention
  • 19.8 Blob / Object Store — chunking, metadata, lifecycle
  • 19.9 Distributed Search / Typeahead — inverted index, trie
  • 19.10 Distributed Task Scheduler — cron at scale, leasing
  • 19.11 Distributed Lock — fencing tokens, Redlock
  • 19.12 Distributed Logging — ingestion, indexing, sampling
  • 19.13 Sharded Counters / Leaderboard — write contention
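
As one example of a building block from this list, the sketch below shows a single-process token bucket (§19.1). The `TokenBucket` class and its parameter names are hypothetical. The clock is injectable so the behaviour can be checked deterministically. Distributed rate limiting is out of scope for this sketch.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second` tokens/s."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity          # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill lazily on each call, capped at the bucket capacity.
        self._tokens = min(
            self.capacity, self._tokens + elapsed * self.refill_per_second
        )
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Refilling lazily inside `allow` avoids a background timer. The class is not thread-safe as written, so concurrent callers would need a lock around `allow`.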

Reliability & Operations

20. Reliability Patterns

  • 20.1 Circuit Breaker
  • 20.2 Bulkhead
  • 20.3 Retry
  • 20.4 Throttling — server-side load shedding angle (algorithms → §19.1; see also §40.6 Load Shedding)
  • 20.5 Health Endpoint Monitoring
  • 20.6 Leader Election
  • 20.7 Compensating Transaction
  • 20.8 Deployment Stamps & Geodes
  • 20.9 Queue-Based Load Leveling

21. Cloud Design Patterns

  • 21.1 Strangler Fig — pattern definition (org-scale application → §36.2)
  • 21.2 Sidecar
  • 21.3 Ambassador
  • 21.4 Anti-Corruption Layer
  • 21.5 CQRS
  • 21.6 Event Sourcing
  • 21.7 Materialized View
  • 21.8 Pipes and Filters
  • 21.9 External Config Store
  • 21.10 Valet Key
  • 21.11 Claim Check
  • 21.12 Competing Consumers
  • 21.13 Publisher/Subscriber — pattern (build a pub-sub system → §19.7)

Gateway routing, aggregation, and offloading are covered in §11.1; Backends for Frontend lives in §11.9.

22. Performance Antipatterns

  • 22.1 Busy Database
  • 22.2 Busy Frontend
  • 22.3 Chatty I/O
  • 22.4 Extraneous Fetching
  • 22.5 Improper Instantiation
  • 22.6 Monolithic Persistence
  • 22.7 Noisy Neighbor
  • 22.8 Synchronous I/O
  • 22.9 Retry Storm
  • 22.10 No Caching

23. Monitoring

  • 23.1 Health Monitoring
  • 23.2 Availability Monitoring
  • 23.3 Performance Monitoring
  • 23.4 Security Monitoring
  • 23.5 Usage Monitoring
  • 23.6 Instrumentation
  • 23.7 Visualization & Alerts

24. Observability

  • 24.1 Logs, Metrics, Traces
  • 24.2 SLO / SLI / Error Budgets
  • 24.3 RED & USE Methods
  • 24.4 Distributed Tracing
  • 24.5 Metrics Pipelines
  • 24.6 Log Aggregation
  • 24.7 Alerting & On-Call

25. Chaos Engineering

  • 25.1 Failure Modes
  • 25.2 Fault Injection
  • 25.3 Game Days
  • 25.4 Resilience Testing
  • 25.5 Blast Radius & Recovery

26. Deployment & Infrastructure

  • 26.1 Containers & Docker
  • 26.2 Kubernetes Orchestration
  • 26.3 Deployment Strategies — blue-green, canary, rolling
  • 26.4 CI/CD Pipelines
  • 26.5 Infrastructure as Code
  • 26.6 Multi-Region Deployment
  • 26.7 Disaster Recovery
  • 26.8 Autoscaling

Security & Governance

27. Security at Scale

  • 27.1 Authentication
  • 27.2 Authorization — RBAC, ABAC
  • 27.3 OAuth2 & OIDC
  • 27.4 JWT & Tokens
  • 27.5 Encryption at Rest & in Transit
  • 27.6 Secrets Management
  • 27.7 DDoS Mitigation
  • 27.8 WAF & API Security
  • 27.9 Rate Limiting for Abuse — bot / DDoS / login-abuse angle (algorithms → §19.1)

28. Data Privacy & Compliance

  • 28.1 PII & Data Classification
  • 28.2 GDPR & Right to Be Forgotten
  • 28.3 Data Residency
  • 28.4 Audit Logging
  • 28.5 Encryption Key Lifecycle

29. Multi-Tenancy & SaaS

  • 29.1 Tenant Isolation Models
  • 29.2 Data Partitioning per Tenant
  • 29.3 Noisy-Neighbor Mitigation
  • 29.4 Per-Tenant Scaling & Limits
  • 29.5 Tenant Onboarding & Config

Specialized

30. Geospatial Systems

  • 30.1 Geohashing
  • 30.2 Quadtrees
  • 30.3 S2 & H3
  • 30.4 Proximity Search
  • 30.5 Map Tiling & Routing

31. ML & Recommendation Systems

  • 31.1 Recommendation Architecture
  • 31.2 Feature Store
  • 31.3 Candidate Generation
  • 31.4 Ranking & Scoring
  • 31.5 Online vs Offline Inference
  • 31.6 A/B Testing & Feedback Loops

Capstone

32. Classic Problems

URL shortener · Twitter timeline · WhatsApp/chat · YouTube/Netflix · Uber dispatch · Dropbox sync · Instagram feed · Stack Overflow · ad click counter · payment system · web crawler · recommendation engine · key-value store · Google Docs collab editor · proximity/Maps · Ticketmaster booking · notification system · live streaming · distributed job scheduler · stock exchange · S3 object storage · online judge · distributed analytics counter

33. Real-World Architectures

Google Spanner · Facebook TAO · Amazon DynamoDB · Netflix stack · Apache Kafka · Apache Cassandra · Redis internals · Discord realtime · Slack messaging · Uber/Lyft dispatch

34. Interview Playbook

  • 34.1 RESHADED Framework
  • 34.2 Requirements Clarification
  • 34.3 Capacity Estimation in the Interview
  • 34.4 API Design Step
  • 34.5 High-Level Design
  • 34.6 Data Model & Storage Choice
  • 34.7 Deep Dives & Bottlenecks
  • 34.8 Trade-offs & Wrap-up
  • 34.9 Common Mistakes
  • 34.10 Mock Interview Walkthroughs

Staff / Principal

Beyond building systems — org-scale judgment, evolution over time, cost, and sociotechnical design. Each topic can carry a 5th staff.md tier above professional.md.

35. Architecture Decision-Making

  • 35.1 Architecture Decision Records (ADRs)
  • 35.2 RFC Process
  • 35.3 Evolutionary Architecture
  • 35.4 Fitness Functions
  • 35.5 Tech Radar
  • 35.6 Build vs Buy
  • 35.7 Trade-off Analysis Frameworks

36. Large-Scale Migrations

  • 36.1 Monolith to Microservices
  • 36.2 Strangler Fig at Scale
  • 36.3 Zero-Downtime Migration
  • 36.4 Expand-Contract Pattern
  • 36.5 Dual-Write & Backfill
  • 36.6 Data Migration at Scale
  • 36.7 Deprecation Strategy

37. Sociotechnical & Org Design

  • 37.1 Conway's Law
  • 37.2 Team Topologies
  • 37.3 Platform Engineering / IDP
  • 37.4 Ownership & Boundaries
  • 37.5 Cognitive Load

38. Cost & Efficiency (FinOps)

  • 38.1 Cost Modeling
  • 38.2 Capacity Planning
  • 38.3 Efficiency as a Feature
  • 38.4 Hardware-Aware Design
  • 38.5 Performance Economics

39. Global / Multi-Region Architecture

  • 39.1 Active-Active Architecture
  • 39.2 Data Sovereignty & Residency
  • 39.3 Geo-Routing
  • 39.4 Global Consistency
  • 39.5 Conflict Resolution
  • 39.6 Follow-the-Sun

40. SRE & Reliability Engineering

  • 40.1 Error Budgets
  • 40.2 SLO Ownership
  • 40.3 Incident Management
  • 40.4 Postmortems
  • 40.5 Toil Reduction
  • 40.6 Load Shedding
  • 40.7 Graceful Degradation

41. Performance Engineering & Tail Latency

  • 41.1 Tail Latency — p99 / p999
  • 41.2 Coordinated Omission
  • 41.3 Hedged Requests
  • 41.4 Backpressure (deep)
  • 41.5 Queueing Theory — Little's Law
  • 41.6 Universal Scalability Law
  • 41.7 Amdahl's Law
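
Two of the laws named above reduce to one-line formulas, sketched here as small functions. Little's Law (§41.5) relates average concurrency to arrival rate and latency, L = λ · W. Amdahl's Law (§41.7) bounds speedup as 1 / ((1 - p) + p / n). The function and parameter names are hypothetical, and the inputs in the usage note are illustrative only.

```python
def littles_law_concurrency(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in the system."""
    if arrival_rate_per_s < 0 or avg_latency_s < 0:
        raise ValueError("rate and latency must be non-negative")
    return arrival_rate_per_s * avg_latency_s


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup of a fixed-size workload on `workers` workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

For instance, a service receiving 200 requests per second at 50 ms average latency holds about 10 requests in flight on average, and a workload that is 90% parallelizable speeds up by roughly 5.26x on 10 workers.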

42. Data Governance & Contracts

  • 42.1 Schema Registry
  • 42.2 Data Contracts
  • 42.3 Data Lineage
  • 42.4 Data Quality
  • 42.5 Master Data Management
  • 42.6 Privacy by Design