🟠 Senior Level (501–750)


Focus: Distributed systems depth, geo-distribution, advanced data modeling, complex system designs (Uber, Netflix-scale), trade-off justification, ambiguous requirements, real-time pipelines, advanced consistency.

For whom: 5–8 years of experience, senior / L5 engineer. Time per question: 45–60 minutes; write a trade-off justification.

🧬 Advanced Distributed Systems (501–540)

Walk through the Raft consensus algorithm step by step.
Walk through the Paxos consensus algorithm step by step.
Compare Raft, Paxos, ZAB, and Multi-Paxos.
What is a quorum-based read/write protocol (e.g., Dynamo)?
What is hinted handoff in Cassandra/Dynamo?
What is read repair?
What is anti-entropy, and how do Merkle trees support it?
What is a gossip protocol?
What is a vector clock and why use it?
What is a Lamport timestamp?
What is a hybrid logical clock (HLC)?
What is a CRDT and what types exist (G-counter, OR-Set, LWW)?
How would you build a collaborative editor with CRDTs vs OT?
What is operational transformation (OT)?
How does a version vector (causal ordering) differ from a wall-clock timestamp?
What is the difference between leader-based and leaderless replication?
What is "leader stickiness" and when does it hurt?
What is a witness replica?
What is chain replication?
What is primary-backup replication?
What is the FLP impossibility result?
What is the Two Generals problem?
What is the Byzantine Generals problem?
What is BFT consensus (PBFT, Tendermint)?
What is a distributed snapshot (Chandy-Lamport)?
What is Spanner's TrueTime and how does it enable external consistency?
How does Google Spanner achieve strong consistency at global scale?
How does CockroachDB implement Spanner-like consistency without atomic clocks?
How does DynamoDB handle global tables?
How does Cassandra handle multi-datacenter replication?
What are the trade-offs of synchronous cross-region writes?
What is the role of a metadata service (e.g., ZooKeeper)?
How would you build a distributed lock service?
How does etcd implement leader leases?
How would you implement leader election with Redis (Redlock controversy)?
What is the split-brain problem in leader election and how do you avoid it?
What are fencing tokens and why are they needed for distributed locks?
What is the difference between a strong leader and a weak leader?
What is leader piggy-backing for heartbeats?
Why is "exactly-once" actually "effectively-once with idempotency"?
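
For the CRDT question in this block, it can help to have a concrete shape in mind before discussing trade-offs. The following is a minimal sketch of a G-counter (grow-only counter) under simplifying assumptions: state-based replication, string replica IDs, and no network layer. The class and method names are illustrative, not taken from any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one non-decreasing slot per replica."""

    replica_id: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever bumps its own slot, and only upward.
        if amount < 0:
            raise ValueError("a G-counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The logical value is the sum over all replica slots.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


a = GCounter("a")
b = GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
b.merge(a)  # merging the same state again changes nothing
```

After the merges both replicas hold the same per-replica slots. A useful follow-up in an interview is why decrements need a second counter (a PN-counter) rather than a negative increment.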

🌐 Geo-Distribution & Multi-Region (541–570)

How do you architect for multi-region active-active?
How do you architect for multi-region active-passive?
What is RPO vs RTO?
How do you replicate data across continents with bounded lag?
How do you handle time-zone-sensitive workloads globally?
What is the "read local, write global" pattern?
What is "follow-the-sun" architecture?
How do you reroute traffic during a regional failover?
What is GeoDNS and what are its limits?
How do you implement consistent global IDs (Snowflake, ULID)?
How does Twitter Snowflake assign IDs?
What are the trade-offs of using a centralized ID service vs local IDs?
How would you design a globally distributed counter?
How would you implement a globally distributed rate limiter?
What is the cost of cross-region writes in terms of latency?
How would you replicate a write-heavy timeline to multiple regions?
What is a follower read in Spanner-style systems?
How do you handle clock skew across regions?
What is the role of NTP/PTP in distributed systems?
How would you design a multi-region disaster recovery plan?
What is chaos failover testing across regions?
How do you test for region-failover correctness?
What is "graceful brownout" of a region?
What is Anycast routing for global services?
How does CloudFront/Akamai route a user to the nearest PoP?
How does multi-region S3 replication work?
How would you serve dynamic content from the edge (Cloudflare Workers)?
What is the "eventual reachability" property in geo-distributed messaging?
How would you design a global chat with read-after-write consistency?
How do you handle data residency / data sovereignty laws?
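
For the global ID questions in this block, a small sketch can anchor the discussion of bit budgets and clock behavior. The code below assumes the commonly described Snowflake layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit per-millisecond sequence). The `SnowflakeGenerator` class, the injectable `clock_ms` parameter, and the `decode` helper are hypothetical names for illustration, and the epoch constant is an assumption: any fixed past instant works.

```python
import threading
import time
from typing import Callable

EPOCH_MS = 1288834974657  # assumed custom epoch; any fixed past instant works
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


def _wall_clock_ms() -> int:
    return time.time_ns() // 1_000_000


class SnowflakeGenerator:
    def __init__(self, worker_id: int, clock_ms: Callable[[], int] = _wall_clock_ms):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock_ms = clock_ms
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = self.clock_ms()
            if now < self.last_ms:
                # Clock skew: refusing is safer than issuing a duplicate.
                raise RuntimeError("clock moved backwards; refusing to issue IDs")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self.clock_ms()
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )


def decode(snowflake: int) -> tuple[int, int, int]:
    """Split an ID back into (timestamp_ms, worker_id, sequence)."""
    sequence = snowflake & MAX_SEQUENCE
    worker_id = (snowflake >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    timestamp_ms = (snowflake >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS
    return timestamp_ms, worker_id, sequence
```

The IDs sort roughly by creation time and need no coordination at generation time, but worker IDs still have to be assigned uniquely, which is where a metadata service or static configuration usually comes in.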

📈 Scalability Deep Dive (571–610)

How would you scale a "likes" counter to handle 1M writes/sec?
How would you scale a feed-fanout system for 100M users?
Compare push vs pull vs hybrid fanout for a social feed.
How would you handle the "celebrity problem" in fanout?
How would you build a typeahead serving 100K QPS?
How would you scale a notification system to 1B users?
How would you scale a real-time leaderboard to 10M players?
Walk through scaling MySQL from 1 to 10M users.
Walk through scaling Postgres for a write-heavy workload.
How would you scale a write-heavy time-series system (IoT)?
How would you build a metrics ingestion system at the scale of Datadog?
How would you architect a logs pipeline at the scale of Splunk?
How would you design Pinterest's image pipeline?
How would you design Instagram's feed for 500M MAU?
How would you scale a chat system to 100M concurrent connections?
How would you architect WhatsApp's E2E messaging?
How would you design a global content moderation pipeline?
How would you scale Stripe's payment ledger?
How would you design Plaid-style financial data aggregation?
How would you design Uber's geo-index (S2/H3)?
How would you design Lyft's matching engine?
How would you design DoorDash's dispatch service?
How would you scale Airbnb's calendar / availability service?
How would you scale Booking.com's pricing engine?
How would you design Amazon's "Prime Now" inventory service?
How would you architect a TikTok-scale recommendation feed?
How would you design YouTube's view counter?
How would you scale Reddit's voting/comments?
How would you scale a video transcoding pipeline?
How would you build a CDN like Cloudflare from scratch?
How would you design a content moderation queue at scale?
How would you build a fraud detection pipeline at scale?
How would you design a real-time bidding (RTB) ad-auction system?
How would you design Google AdSense at a high level?
How would you scale an analytics dashboard for billions of events?
How would you design a multi-tenant log search (per-customer isolation)?
How would you re-architect a monolith handling 50K RPS into microservices?
How would you remove a single hot DB shard during peak traffic?
How would you design a system to handle Black Friday spikes?
How would you handle a 100x traffic spike with no advance warning?

🧰 Advanced Data Pipelines (611–640)

Compare Lambda vs Kappa architecture.
When would you avoid Lambda architecture?
How would you design a real-time analytics dashboard?
How would you reconcile real-time and batch metrics?
How would you build a near-real-time feature store?
What is Apache Beam and where does it fit?
What is exactly-once in Flink and how is it implemented?
What is Flink checkpointing and savepointing?
How does Spark Structured Streaming differ from DStreams?
What is a data lake vs a data warehouse vs a lakehouse?
How would you architect an Iceberg/Delta lakehouse?
What is the medallion architecture (bronze/silver/gold)?
What is reverse ETL?
What is CDC (Debezium) at scale?
What is the outbox pattern combined with Debezium?
What is Kafka tiered storage?
What is Pulsar and how does it differ from Kafka?
What is the role of a schema registry?
How do you handle schema evolution across producers and consumers?
How would you design a clickstream pipeline (Snowplow-like)?
How would you architect an A/B testing platform end-to-end?
How would you design an experiment platform with metrics + guardrails?
How would you build a marketing event funnel pipeline?
How would you build an attribution pipeline (multi-touch)?
How would you architect a search-relevance feedback loop?
How would you architect a vector search system (RAG-friendly)?
How would you architect a real-time anomaly detector?
How would you architect an ML feature pipeline (offline + online)?
How would you architect a model-serving platform with shadow traffic?
How would you architect a cost-aware data lifecycle (hot/warm/cold tiers)?

🎯 Trade-offs & Architecture Choices (641–680)

When would you pick microservices over a modular monolith and vice versa?
When would you pick GraphQL over REST?
When would you pick gRPC over REST/GraphQL?
When would you pick eventual consistency over strong consistency?
When would you pick at-least-once over at-most-once?
When is exactly-once worth the cost?
When would you build vs buy?
When is a SaaS DB (Snowflake) better than self-managed Postgres?
When is Postgres "good enough" instead of a NoSQL DB?
When is JSON in Postgres better than Mongo?
When is a bus (Kafka) better than a request/response API?
When is HTTP/3 worth deploying?
When does a service mesh add too much overhead?
When is server-side rendering better than client-side?
When is Redis a bad choice as a primary store?
When does sharding become unavoidable?
When is denormalization the right call?
When should you avoid microservices entirely?
When is a monolith actually the better long-term choice?
Trade-offs of synchronous vs asynchronous APIs?
Trade-offs of REST vs event-driven?
Trade-offs of optimistic vs pessimistic locking at scale?
Trade-offs of JWT vs opaque tokens?
Trade-offs of JWT vs session in distributed services?
Trade-offs of leader-based vs leaderless replication?
Trade-offs of per-tenant DB vs shared DB?
Trade-offs of synchronous vs asynchronous replication?
Trade-offs of B-tree vs LSM-tree storage engines?
Trade-offs of row-store vs column-store?
Trade-offs of row-level vs document-level versioning?
Trade-offs of soft deletes vs hard deletes?
Trade-offs of in-DB vs out-of-DB joins?
Trade-offs of schema-on-read vs schema-on-write?
Trade-offs of async fanout vs synchronous notification?
Trade-offs of pre-computing aggregates vs computing on read?
Trade-offs of caching at edge vs at origin?
Trade-offs of self-hosted Kafka vs MSK/Confluent?
Trade-offs of using ELB vs nginx vs Envoy?
Trade-offs of using a queue vs a stream?
Trade-offs of static partitioning vs dynamic partitioning?

🔐 Advanced Security (681–710)

How do you design end-to-end encryption (Signal protocol)?
How do you implement key rotation at scale?
How does Zero Trust architecture work?
How would you design SSO with OIDC across services?
What is mTLS and how do you operate it at scale?
How would you design secret management (HashiCorp Vault patterns)?
How do you handle PII in a logging pipeline?
How do you design field-level encryption for a SaaS DB?
How do you implement audit logs that are tamper-evident?
How do you implement data redaction for support staff?
How do you design a permission system (RBAC vs ABAC vs ReBAC)?
How would you design a Google Zanzibar-style authorization service?
How do you implement API rate limiting per tenant?
How do you mitigate DDoS at the edge?
How do you implement a WAF with a CDN provider?
How do you design account takeover prevention?
How do you design device fingerprinting?
How do you implement OAuth token revocation?
How do you implement secure cookie strategies (HttpOnly, SameSite, Secure)?
How do you implement CSP for a complex SaaS app?
What is supply-chain security and how do you mitigate supply-chain risk (SLSA, Sigstore)?
How do you handle CVE response across hundreds of microservices?
What is an HSM and when do you need one?
How do you implement client-side encryption for S3 data?
How do you design a data lake with column-level access control?
How do you handle GDPR right-to-be-forgotten across services?
How do you design SOC2-compliant audit pipelines?
How do you implement HIPAA-grade auditing?
How do you securely share data with third parties (S3 presigned, JIT access)?
How do you design a bug bounty / responsible disclosure intake system?

🧪 Performance & Internals (711–740)

Walk through what happens during a Postgres query (parser → planner → executor).
How does Postgres VACUUM work and why is it necessary?
How does PostgreSQL handle MVCC tuple bloat?
What is a heap-only tuple (HOT) update?
How does MySQL InnoDB's clustered index differ from the Postgres heap?
What is an LSM tree and how does compaction work?
How does Cassandra's read path work (memtable, SSTable, bloom filter)?
How does a Bloom filter improve read performance?
What is a Cuckoo filter vs a Bloom filter?
How does a B+ tree differ from a B-tree?
What is a fractal tree index?
What is the Linux page cache and how does it interact with a database?
What is "fsync" and how does it affect durability?
What is a journaling file system?
What is RDMA and when does it matter?
What is io_uring and what problem does it solve?
What is zero-copy networking?
What is kernel bypass (DPDK, XDP)?
How does TCP slow start and congestion control affect tail latency?
What is BBR vs CUBIC congestion control?
How does QUIC improve over TCP?
What is the cost of DNS lookups in microservices?
What is JIT compilation, and how does V8/JVM warm-up affect deployment?
What is JVM GC tuning (G1 vs ZGC vs Shenandoah)?
What is Go GC behavior under high allocation rate?
What is heap fragmentation and how do you mitigate it?
What is the difference between profiling, tracing, and sampling?
What is a flame graph and how do you read it?
What is the role of perf and eBPF in production diagnostics?
How would you debug a production-only memory leak?
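
For the Bloom filter questions in this block, a minimal sketch may help when reasoning about why a filter can skip disk reads but never confirm presence. It uses the standard sizing formulas for bit count and hash count, and derives the hash positions from one SHA-256 digest via double hashing. That hashing choice, the `BloomFilter` class, and the `might_contain` method name are illustrative assumptions, not how any specific database implements it.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2))
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bloom = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(1000):
    bloom.add(f"row-{i}")
```

In an LSM-style read path, a `False` answer lets the reader skip an SSTable entirely, while a `True` answer still requires the actual lookup.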

🧯 Failure & Recovery (741–750)

How would you respond to a global outage of a critical dependency?
How would you handle a runaway query taking down the DB?
How do you design a "kill switch" for a feature?
How do you handle a cache layer outage gracefully?
How do you handle a write outage when reads must continue?
What is a circuit-breaker fallback strategy for payments?
How do you design a degraded-read mode?
How do you design a degraded-write mode?
How do you safely roll back a schema change that already replicated?
How do you design a "fire drill" / disaster simulation calendar?
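
For the circuit-breaker and degraded-mode questions in this block, a small state machine is often enough to frame the answer. The sketch below assumes a consecutive-failure threshold and a fixed reset timeout. The `CircuitBreaker` class, its parameters, and the injectable `clock` are hypothetical names for illustration, and the fallback is whatever degraded behavior the caller supplies.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast: do not touch the struggling dependency at all.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            # A failed half-open trial, or reaching the threshold, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a payment-provider request as `breaker.call(charge, enqueue_for_retry)`, where both function names are placeholders. Whether queueing is an acceptable fallback for payments is exactly the trade-off the question is probing.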
