Senior Level (501–750)
Focus: Distributed systems depth, geo-distribution, advanced data modeling, complex system designs (Uber, Netflix-scale), trade-off justification, ambiguous requirements, real-time pipelines, advanced consistency.
For whom: 5–8 years of experience, senior / L5 engineer. Time per question: 45–60 minutes; write a trade-off justification.
🧬 Advanced Distributed Systems (501–540)
- Walk through the Raft consensus algorithm step by step.
- Walk through the Paxos consensus algorithm step by step.
- Compare Raft, Paxos, ZAB, and Multi-Paxos.
- What is a quorum-based read/write protocol (e.g., Dynamo)?
- What is hinted handoff in Cassandra/Dynamo?
- What is read repair?
- What is anti-entropy, and how do Merkle trees support it?
- What is a gossip protocol?
- What is a vector clock and why use it?
- What is a Lamport timestamp?
- What is a hybrid logical clock (HLC)?
- What is a CRDT and what types exist (G-counter, OR-Set, LWW)?
- How would you build a collaborative editor with CRDTs vs OT?
- What is operational transformation (OT)?
- What is the difference between a version vector (causal ordering) and a wall-clock timestamp?
- What is the difference between leader-based and leaderless replication?
- What is "leader stickiness" and when does it hurt?
- What is a witness replica?
- What is chain replication?
- What is primary-backup replication?
- What is the FLP impossibility result?
- What is the Two Generals problem?
- What is the Byzantine Generals problem?
- What is BFT consensus (PBFT, Tendermint)?
- What is a distributed snapshot (Chandy-Lamport)?
- What is Spanner's TrueTime and how does it enable external consistency?
- How does Google Spanner achieve strong consistency at global scale?
- How does CockroachDB implement Spanner-like consistency without atomic clocks?
- How does DynamoDB handle global tables?
- How does Cassandra handle multi-datacenter replication?
- What are the trade-offs of synchronous cross-region writes?
- What is the role of a metadata service (e.g., ZooKeeper)?
- How would you build a distributed lock service?
- How does etcd implement leader leases?
- How would you implement leader election with Redis (Redlock controversy)?
- What is the split-brain problem in leader election and how do you avoid it?
- What are fencing tokens and why are they needed for distributed locks?
- What is the difference between a strong leader and a weak leader?
- What is leader piggy-backing for heartbeats?
- Why is "exactly-once" actually "effectively-once with idempotency"?
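For the CRDT question in this section, a grow-only counter (G-Counter) is often the simplest type to reason about. The following is a minimal sketch, not a production design; the class name `GCounter` and the node IDs `"a"` and `"b"` are hypothetical. It shows why merging by element-wise maximum lets replicas converge regardless of merge order or repetition.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter: one monotonically increasing slot per node."""
    node_id: str
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        # A node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge no matter how often or in what order they merge.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
a.merge(b)  # a repeated merge changes nothing
print(a.value(), b.value())  # 5 5
```

A useful follow-up in an interview answer is what this sketch leaves out: decrements (which need a PN-Counter built from two G-Counters) and removal of slots for retired nodes.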
Geo-Distribution & Multi-Region (541–570)
- How do you architect for multi-region active-active?
- How do you architect for multi-region active-passive?
- What is RPO vs RTO?
- How do you replicate data across continents with bounded lag?
- How do you handle time-zone-sensitive workloads globally?
- What is the "read local, write global" pattern?
- What is a "follow-the-sun" architecture?
- How do you reroute traffic during a regional failover?
- What is GeoDNS and what are its limits?
- How do you implement consistent global IDs (Snowflake, ULID)?
- How does Twitter Snowflake assign IDs?
- What are the trade-offs of using a centralized ID service vs local IDs?
- How would you design a globally distributed counter?
- How would you implement a globally distributed rate limiter?
- What is the cost of cross-region writes in terms of latency?
- How would you replicate a write-heavy timeline to multiple regions?
- What is a follower read in Spanner-style systems?
- How do you handle clock skew across regions?
- What is the role of NTP/PTP in distributed systems?
- How would you design a multi-region disaster recovery plan?
- What is chaos failover testing across regions?
- How do you test for region-failover correctness?
- What is "graceful brownout" of a region?
- What is Anycast routing for global services?
- How does CloudFront/Akamai route a user to the nearest PoP?
- How does multi-region S3 replication work?
- How would you serve dynamic content from the edge (Cloudflare Workers)?
- What is the "eventual reachability" property in geo-distributed messaging?
- How would you design a global chat with read-after-write consistency?
- How do you handle data residency / data sovereignty laws?
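For the global ID questions in this section, a Snowflake-style layout can be sketched as below. It assumes the commonly described split of a millisecond timestamp, a 10-bit worker ID, and a 12-bit per-millisecond sequence. The epoch value, the class name `SnowflakeLikeGenerator`, and the `decode` helper are hypothetical choices for illustration, not Twitter's implementation.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch; any fixed past instant works
WORKER_BITS = 10
SEQ_BITS = 12
MAX_SEQ = (1 << SEQ_BITS) - 1


class SnowflakeLikeGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock  # injectable so the behavior can be tested
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Clock skew: refusing is safer than risking duplicate IDs.
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.seq = (self.seq + 1) & MAX_SEQ
                if self.seq == 0:
                    # Sequence exhausted in this millisecond: spin until the next one.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.seq = 0
            self.last_ms = now
            return (((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                    | (self.worker_id << SEQ_BITS)
                    | self.seq)


def decode(id_):
    """Split an ID back into (timestamp_ms, worker_id, sequence)."""
    seq = id_ & MAX_SEQ
    worker = (id_ >> SEQ_BITS) & ((1 << WORKER_BITS) - 1)
    ts = (id_ >> (WORKER_BITS + SEQ_BITS)) + EPOCH_MS
    return ts, worker, seq
```

Because the timestamp occupies the high bits, IDs from one worker sort by creation time, while IDs from different workers are only roughly time-ordered, bounded by clock skew between them.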
Scalability Deep Dive (571–610)
- How would you scale a "likes" counter to handle 1M writes/sec?
- How would you scale a feed-fanout system for 100M users?
- Compare push vs pull vs hybrid fanout for a social feed.
- How would you handle the "celebrity problem" in fanout?
- How would you build a typeahead serving 100K QPS?
- How would you scale a notification system to 1B users?
- How would you scale a real-time leaderboard to 10M players?
- Walk through scaling MySQL from 1 to 10M users.
- Walk through scaling Postgres for a write-heavy workload.
- How would you scale a write-heavy time-series system (IoT)?
- How would you build a metrics ingestion system at the scale of Datadog?
- How would you architect a logs pipeline at the scale of Splunk?
- How would you design Pinterest's image pipeline?
- How would you design Instagram's feed for 500M MAU?
- How would you scale a chat system to 100M concurrent connections?
- How would you architect WhatsApp's E2E messaging?
- How would you design a global content moderation pipeline?
- How would you scale Stripe's payment ledger?
- How would you design Plaid-style financial data aggregation?
- How would you design Uber's geo-index (S2/H3)?
- How would you design Lyft's matching engine?
- How would you design DoorDash's dispatch service?
- How would you scale Airbnb's calendar / availability service?
- How would you scale Booking.com's pricing engine?
- How would you design Amazon's "Prime Now" inventory service?
- How would you architect a TikTok-scale recommendation feed?
- How would you design YouTube's view counter?
- How would you scale Reddit's voting/comments?
- How would you scale a video transcoding pipeline?
- How would you build a CDN like Cloudflare from scratch?
- How would you design a content moderation queue at scale?
- How would you build a fraud detection pipeline at scale?
- How would you design a real-time bidding (RTB) ad-auction system?
- How would you design Google AdSense at a high level?
- How would you scale an analytics dashboard for billions of events?
- How would you design a multi-tenant log search (per-customer isolation)?
- How would you re-architect a monolith handling 50K RPS into microservices?
- How would you remove a single hot DB shard during peak traffic?
- How would you design a system to handle Black Friday spikes?
- How would you handle a 100x traffic spike with no advance warning?
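For the high-write counter questions in this section (likes, views, votes), one common starting point is a sharded counter: writes are spread over several sub-counters so no single row or key becomes hot, and reads sum the shards. The sketch below is in-memory only; the class name `ShardedCounter` and the dictionary standing in for a key-value store are assumptions made for illustration.

```python
import random


class ShardedCounter:
    def __init__(self, num_shards=16, rng=None):
        self.num_shards = num_shards
        self.rng = rng or random.Random()
        # Stands in for rows keyed by (counter key, shard index) in a real store.
        self.shards = {}

    def increment(self, key, amount=1):
        # A random shard spreads concurrent writers across independent rows.
        shard = self.rng.randrange(self.num_shards)
        slot = (key, shard)
        self.shards[slot] = self.shards.get(slot, 0) + amount

    def total(self, key):
        # Reads cost num_shards lookups; in practice the sum is usually cached.
        return sum(self.shards.get((key, s), 0) for s in range(self.num_shards))


counter = ShardedCounter(num_shards=4)
for _ in range(1000):
    counter.increment("post:1")
print(counter.total("post:1"))  # 1000
```

The trade-off to justify in an answer is write throughput against read cost and freshness: more shards absorb more writes but make exact reads more expensive, which is why displayed totals are often approximate or periodically rolled up.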
🧰 Advanced Data Pipelines (611–640)
- Compare Lambda vs Kappa architecture.
- When would you avoid Lambda architecture?
- How would you design a real-time analytics dashboard?
- How would you reconcile real-time and batch metrics?
- How would you build a near-real-time feature store?
- What is Apache Beam and where does it fit?
- What is exactly-once in Flink and how is it implemented?
- What is Flink checkpointing and savepointing?
- How does Spark Structured Streaming differ from DStreams?
- What is the difference between a data lake, a data warehouse, and a lakehouse?
- How would you architect an Iceberg/Delta lakehouse?
- What is the medallion architecture (bronze/silver/gold)?
- What is reverse ETL?
- What is CDC (Debezium) at scale?
- What is the outbox pattern combined with Debezium?
- What is Kafka tiered storage?
- What is Pulsar and how does it differ from Kafka?
- What is the role of a schema registry?
- How do you handle schema evolution across producers and consumers?
- How would you design a clickstream pipeline (Snowplow-like)?
- How would you architect an A/B testing platform end-to-end?
- How would you design an experiment platform with metrics + guardrails?
- How would you build a marketing event funnel pipeline?
- How would you build an attribution pipeline (multi-touch)?
- How would you architect a search-relevance feedback loop?
- How would you architect a vector search system (RAG-friendly)?
- How would you architect a real-time anomaly detector?
- How would you architect an ML feature pipeline (offline + online)?
- How would you architect a model-serving platform with shadow traffic?
- How would you architect a cost-aware data lifecycle (hot/warm/cold tiers)?
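For the outbox question in this section, the core idea is that the business row and the event row are written in the same local transaction, and a separate relay publishes the events afterwards. The sketch below uses SQLite from the standard library; the table names, `place_order`, and the polling `relay_once` are hypothetical. In a Debezium-based setup, a CDC connector tailing the outbox table would take the place of the polling relay.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
""")


def place_order(order_id):
    # One transaction: both rows commit together or neither does.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "status": "placed"})),
        )


def relay_once(publish):
    """Publish pending events in insertion order; returns how many were sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        # A crash after publish but before the UPDATE causes a re-send,
        # so delivery is at-least-once and consumers must be idempotent.
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A short usage example: calling `place_order(1)` and then `relay_once(print)` prints the single pending event, and a second `relay_once(print)` finds nothing left to send.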
🎯 Trade-offs & Architecture Choices (641–680)
- When would you pick microservices over a modular monolith and vice versa?
- When would you pick GraphQL over REST?
- When would you pick gRPC over REST/GraphQL?
- When would you pick eventual consistency over strong consistency?
- When would you pick at-least-once over at-most-once?
- When is exactly-once worth the cost?
- When would you build vs buy?
- When is a SaaS DB (Snowflake) better than self-managed Postgres?
- When is Postgres "good enough" instead of a NoSQL DB?
- When is JSON in Postgres better than Mongo?
- When is a bus (Kafka) better than a request/response API?
- When is HTTP/3 worth deploying?
- When does a service mesh add too much overhead?
- When is server-side rendering better than client-side?
- When is Redis a bad choice as a primary store?
- When does sharding become unavoidable?
- When is denormalization the right call?
- When should you avoid microservices entirely?
- When is a monolith actually the better long-term choice?
- Trade-offs of synchronous vs asynchronous APIs?
- Trade-offs of REST vs event-driven?
- Trade-offs of optimistic vs pessimistic locking at scale?
- Trade-offs of JWT vs opaque tokens?
- Trade-offs of JWT vs session in distributed services?
- Trade-offs of leader-based vs leaderless replication?
- Trade-offs of per-tenant DB vs shared DB?
- Trade-offs of synchronous vs asynchronous replication?
- Trade-offs of B-tree vs LSM-tree storage engines?
- Trade-offs of row-store vs column-store?
- Trade-offs of row-level vs document-level versioning?
- Trade-offs of soft deletes vs hard deletes?
- Trade-offs of in-DB vs out-of-DB joins?
- Trade-offs of schema-on-read vs schema-on-write?
- Trade-offs of async fanout vs synchronous notification?
- Trade-offs of pre-computing aggregates vs computing on read?
- Trade-offs of caching at edge vs at origin?
- Trade-offs of self-hosted Kafka vs MSK/Confluent?
- Trade-offs of using ELB vs nginx vs Envoy?
- Trade-offs of using a queue vs a stream?
- Trade-offs of static partitioning vs dynamic partitioning?
Advanced Security (681–710)
- How do you design end-to-end encryption (Signal protocol)?
- How do you implement key rotation at scale?
- How does Zero Trust architecture work?
- How would you design SSO with OIDC across services?
- What is mTLS and how do you operate it at scale?
- How would you design secret management (HashiCorp Vault patterns)?
- How do you handle PII in a logging pipeline?
- How do you design field-level encryption for a SaaS DB?
- How do you implement audit logs that are tamper-evident?
- How do you implement data redaction for support staff?
- How do you design a permission system (RBAC vs ABAC vs ReBAC)?
- How would you design a Google Zanzibar-style authorization service?
- How do you implement API rate limiting per tenant?
- How do you mitigate DDoS at the edge?
- How do you implement a WAF with a CDN provider?
- How do you design account takeover prevention?
- How do you design device fingerprinting?
- How do you implement OAuth token revocation?
- How do you implement secure cookie strategies (HttpOnly, SameSite, Secure)?
- How do you implement CSP for a complex SaaS app?
- What is supply-chain security and how do you mitigate supply-chain risk (SLSA, sigstore)?
- How do you handle CVE response across hundreds of microservices?
- What is an HSM and when do you need one?
- How do you implement client-side encryption for S3 data?
- How do you design a data lake with column-level access control?
- How do you handle GDPR right-to-be-forgotten across services?
- How do you design SOC2-compliant audit pipelines?
- How do you implement HIPAA-grade auditing?
- How do you securely share data with third parties (S3 presigned, JIT access)?
- How do you design a bug bounty / responsible disclosure intake system?
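For the tamper-evident audit log question in this section, one common building block is a hash chain, where each entry commits to the hash of the previous one. The sketch below is a minimal illustration; the entry layout, the `GENESIS` constant, and the function names are assumptions. On its own a chain only detects edits if the latest hash is also anchored somewhere an attacker cannot rewrite, such as a separate system or a signed checkpoint.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first entry's "prev"


def _digest(prev_hash, event):
    # Canonical JSON so the same event always hashes to the same value.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log, event):
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _digest(prev, event)})


def verify(log):
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"actor": "alice", "action": "login"})
append(log, {"actor": "alice", "action": "export"})
print(verify(log))  # True
log[0]["event"]["actor"] = "mallory"
print(verify(log))  # False
```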
🧪 Performance & Internals (711–740)
- Walk through what happens during a Postgres query (parser → planner → executor).
- How does Postgres VACUUM work and why is it necessary?
- How does PostgreSQL handle MVCC tuple bloat?
- What is a heap-only tuple (HOT) update?
- How does the MySQL InnoDB clustered index differ from the Postgres heap?
- What is the LSM tree and how does compaction work?
- How does Cassandra's read path work (memtable, SSTable, bloom filter)?
- How does a Bloom filter improve read performance?
- How does a cuckoo filter differ from a Bloom filter?
- How does a B+ tree differ from a B-tree?
- What is a fractal tree index?
- What is the Linux page cache and how does it interact with a database?
- What is "fsync" and how does it affect durability?
- What is a journaling file system?
- What is RDMA and when does it matter?
- What is io_uring and what problem does it solve?
- What is zero-copy networking?
- What is kernel bypass (DPDK, XDP)?
- How does TCP slow start and congestion control affect tail latency?
- What is BBR vs CUBIC congestion control?
- How does QUIC improve over TCP?
- What is the cost of DNS lookups in microservices?
- What is JIT compilation and how does V8/JVM warm-up affect deployments?
- What is JVM GC tuning (G1 vs ZGC vs Shenandoah)?
- What is Go GC behavior under high allocation rate?
- What is heap fragmentation and how do you mitigate it?
- What is the difference between profiling, tracing, and sampling?
- What is a flame graph and how do you read it?
- What is the role of perf and eBPF in production diagnostics?
- How would you debug a production-only memory leak?
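For the Bloom filter questions in this section, a small implementation helps make the read-path benefit concrete: a negative answer is definite, so the store can skip a disk lookup, while a positive answer only means "possibly present". The sketch below is an assumption-laden illustration; the sizes, the class name `BloomFilter`, and the use of SHA-256 with double hashing to derive bit positions are choices made here, not how any particular database does it.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step avoids a zero stride
        # Double hashing: derive k positions from two base hashes.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True
```

Deletions are not supported in this form, because clearing a bit could also remove evidence of other items; that limitation is one of the usual points of comparison with cuckoo filters.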
🧯 Failure & Recovery (741–750)
- How would you respond to a global outage of a critical dependency?
- How would you handle a runaway query taking down the DB?
- How do you design a "kill switch" for a feature?
- How do you handle a cache layer outage gracefully?
- How do you handle a write outage when reads must continue?
- What is a circuit-breaker fallback strategy for payments?
- How do you design a degraded-read mode?
- How do you design a degraded-write mode?
- How do you safely roll back a schema change that already replicated?
- How do you design a "fire drill" / disaster simulation calendar?
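For the circuit-breaker question in this section, the state machine itself is small enough to sketch. The version below is a simplified, single-threaded illustration with hypothetical names and thresholds: it opens after a number of consecutive failures, fails fast to a fallback while open, and lets one trial call through after a cool-down. What the fallback should do for payments (queue, decline, or route to a secondary provider) is the design decision the question is really asking about.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: fail fast without touching the dependency
            # Cool-down elapsed: half-open, allow this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a charge request as `breaker.call(charge, lambda: "queued")`, where `charge` and the `"queued"` result are placeholders for the real payment call and its degraded-mode response.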