Professional Level (751–1000)
Focus: Staff/Principal/Distinguished-level questions covering multi-year thinking, organizational architecture, build-vs-buy, cost optimization, complex migrations, internals of databases & runtimes, technical leadership, RFC processes, and cross-team trade-offs.
For whom: Staff/Principal/L6+ engineers. Time per question: 1–2 hours; produce a written answer in RFC format.
Staff+ Architecture & Leadership (751–790)
- How would you write a 1-pager design doc for a $10M migration?
- How do you run an architecture review board (ARB)?
- How do you mentor 4 senior engineers across 3 teams?
- How do you build technical strategy for a 50-engineer org?
- What is a "tech radar" and how would you maintain one?
- How do you decide which legacy system to sunset first?
- How do you propose moving from on-prem to cloud (3-year plan)?
- How would you justify a $2M infra bill to non-engineering stakeholders?
- How do you run a blameless post-mortem for a $1M outage?
- How do you balance KTLO (keep-the-lights-on) vs new features?
- How do you build an internal developer platform (IDP)?
- How do you measure DevEx and improve it?
- How do you set technical KPIs for a platform team?
- How do you sell a refactor to product leadership?
- How do you build consensus across 5 conflicting team leads?
- How do you choose between "rewrite" and "incremental refactor"?
- How do you manage tech debt as a portfolio?
- How would you set up a guild model for cross-cutting concerns (security, perf)?
- How do you onboard a Staff engineer hire on day 1?
- How do you run a "tech week" or innovation sprint?
- How do you publish RFCs effectively across 200 engineers?
- How do you maintain code health across 1,000+ services?
- What is "architecture as code" and how would you adopt it?
- How do you design a paved-road framework for new services?
- How do you decide what becomes a paved road vs a guardrail?
- How do you manage a multi-cloud governance plan?
- How do you choose between Kubernetes and serverless company-wide?
- How do you set platform SLOs that map to product SLOs?
- How do you negotiate SLAs with internal partner teams?
- How do you migrate authentication org-wide without an outage?
- How do you design a deprecation policy with hard cut-off dates?
- How do you communicate a deprecation across 1,000 client teams?
- How would you architect a platform for rapid acquisition integration?
- How would you stand up engineering after an M&A?
- How do you handle "two architectures" post-merger?
- How would you re-platform a 15-year-old monolith of 10M lines?
- How would you design a "language consolidation" plan (e.g., kill Python, standardize on Go)?
- How would you reduce p99 latency org-wide by 30% in 12 months?
- How would you reduce cloud spend by 25% without harming reliability?
- How would you design a FinOps practice from scratch?
Extreme Scale Designs (791–830)
- Design Google Search infrastructure at high level.
- Design Google Maps backend with real-time traffic.
- Design Google Photos at billion-user scale.
- Design Gmail for a billion users with spam + search.
- Design Google Calendar with multi-region availability.
- Design Google Docs sync engine.
- Design Google AdWords auction.
- Design Google AdSense placement system.
- Design Facebook News Feed at 3B MAU.
- Design Facebook Live video at scale.
- Design Instagram Stories at scale.
- Design WhatsApp delivery + read receipts globally.
- Design Messenger end-to-end encrypted group chat at scale.
- Design Apple iMessage E2E delivery.
- Design iCloud Photos sync engine.
- Design Spotify's music recommendation pipeline.
- Design Spotify's collaborative playlists at scale.
- Design Netflix's Open Connect CDN.
- Design Netflix's Chaos Monkey-style resilience platform.
- Design Netflix's recommendation pipeline (offline + online).
- Design YouTube's video upload + transcoding pipeline.
- Design YouTube's view-count denormalization pipeline.
- Design YouTube's content moderation pipeline.
- Design TikTok's For You Page at hundreds of millions of users.
- Design Twitter timeline at 500M users.
- Design Twitter's full-text search across all tweets.
- Design Twitter's trending topics.
- Design Reddit's voting + comment ranking at scale.
- Design Pinterest's home-feed personalization.
- Design LinkedIn's "People You May Know" graph service.
- Design LinkedIn's feed ranking pipeline.
- Design Stripe's idempotency layer.
- Design Stripe's payment-routing engine.
- Design Square's POS reliability under flaky internet.
- Design Robinhood's order matching at scale.
- Design Coinbase's trade engine.
- Design Binance-style global exchange (cross-region matching).
- Design Cloudflare Workers edge runtime.
- Design Vercel/Netlify deployment pipeline.
- Design GitHub Actions runner orchestration at billion-job scale.
Database & Storage Internals (831–870)
- Walk through Postgres MVCC and tuple visibility checks.
- Walk through Postgres WAL replication internals.
- Explain logical replication slots in Postgres.
- Explain Postgres' MVCC GC and why VACUUM matters.
- How does Postgres' planner choose between Hash, Merge, and Nested Loop joins?
- How do pg_partman and native partitioning work?
- What is Citus and how does it shard Postgres?
- What is YugabyteDB and how does it differ from Spanner?
- What is CockroachDB's transaction layer (Range + Raft + KV)?
- What is FoundationDB's deterministic simulation testing?
- Walk through MySQL InnoDB redo log + undo log + buffer pool.
- Walk through MySQL group replication.
- How does Vitess shard MySQL at scale?
- How does Aurora separate compute from storage?
- How does DynamoDB handle consistency for global secondary indexes?
- How does DynamoDB partition resizing work?
- How does Cassandra repair work (full vs incremental vs subrange)?
- How does ScyllaDB outperform Cassandra (shard-per-core)?
- How does Redis Cluster handle resharding?
- How do Redis Streams with consumer groups compare to Kafka?
- How does Kafka KRaft replace ZooKeeper?
- How does Kafka Tiered Storage work?
- Explain Pulsar's segmented architecture (BookKeeper).
- Explain etcd's MVCC + Raft layering.
- Explain ZooKeeper's ZAB protocol.
- Walk through ClickHouse's MergeTree and how it scales analytical reads.
- Walk through Druid's segment-based architecture.
- Walk through Pinot's real-time + offline duality.
- Walk through Snowflake's storage-compute separation.
- Walk through BigQuery's Dremel execution model.
- Walk through Spanner's TrueTime + Paxos groups.
- Walk through Dgraph's GraphQL + Badger LSM stack.
- Walk through Neo4j's graph traversal engine.
- Walk through TimescaleDB's hypertable + chunk design.
- Walk through InfluxDB's TSI + TSM file format.
- Walk through MongoDB replica set election.
- Walk through MongoDB shard balancer.
- Walk through Elasticsearch's segment merging and refresh.
- Walk through OpenSearch's cross-cluster replication.
- Walk through HBase's region splits and HDFS interaction.
Specialized Systems (871–910)
- Design a high-frequency trading order matching engine (microsecond latency).
- Design a market data fan-out system (millions of subscribers).
- Design a distributed key-value store with linearizable reads.
- Design a globally consistent counter (without single bottleneck).
- Design a globally distributed graph database.
- Design a globally distributed time-series database.
- Design an ad-bidding RTB system at 10M QPS.
- Design a bid-request fan-out across 50 DSPs in <50ms.
- Design a pixel/event tracking pipeline at 1B events/day.
- Design a programmatic ads attribution engine.
- Design a real-time recommendation system using vector search.
- Design a multi-armed bandit feature serving system.
- Design a federated learning training pipeline.
- Design a model registry + serving platform like SageMaker.
- Design a feature store like Feast at scale.
- Design a vector DB (Pinecone-like) with billion-vector recall.
- Design a generative-AI inference serving platform.
- Design a streaming ETL platform like dbt + Materialize.
- Design an IoT ingestion system at 10M devices.
- Design a smart-home control plane at country scale.
- Design a connected-car telemetry pipeline.
- Design a maritime AIS tracking system.
- Design a flight-tracking system.
- Design a satellite imagery indexing pipeline.
- Design a real-time multiplayer game backend (FPS).
- Design an MMO server architecture (zones, instances).
- Design a turn-based puzzle game backend with replay.
- Design a streaming live-game telemetry system.
- Design a CDN with custom DDoS scrubbing.
- Design a distributed code-execution sandbox (LeetCode/Replit-style).
- Design a CI/CD platform for 1000 teams (GitLab-scale).
- Design a build cache shared across an org.
- Design a remote build execution service (Bazel RBE).
- Design a binary artifact store (Artifactory-scale).
- Design a multi-tenant Kubernetes platform (per-team quotas).
- Design an internal serverless platform (Knative-based).
- Design a cloud cost-attribution service.
- Design a chargeback/showback system across business units.
- Design a multi-region object storage with strong read-after-write.
- Design a content-addressable storage for backups (Borg-style).
Migration, Modernization & Org Design (911–940)
- Plan a strangler-fig migration of a 10-year monolith.
- Plan a database engine migration from Oracle to Postgres at 50TB scale.
- Plan a storage migration from on-prem NAS to S3 with zero downtime.
- Plan a search migration from Solr to Elasticsearch.
- Plan a queue migration from RabbitMQ to Kafka.
- Plan a config-store migration (env vars → Vault → dynamic config).
- Plan a runtime migration from Java 8 to Java 21 fleet-wide.
- Plan a Python 2 → 3 migration across 800 services.
- Plan a Node.js LTS upgrade fleet-wide.
- Plan a Kubernetes version upgrade across 500 clusters.
- Plan a fleet-wide remediation of a dependency vulnerability.
- Plan a "remove TLS 1.0/1.1" rollout across thousands of services.
- Plan a CDN provider migration with no user-visible regression.
- Plan a cloud provider migration (AWS → GCP) for a stateful app.
- Plan a "ship every commit to prod" transformation for a slow team.
- Plan a trunk-based development adoption across an org.
- Plan a feature-flag system rollout org-wide.
- Plan a SOC2 program from zero to certification.
- Plan an ISO 27001 readiness program.
- Plan a HIPAA-compliant fork of an existing platform.
- Plan a PCI-DSS scope reduction strategy.
- Plan a data classification + tagging program at the warehouse level.
- Plan a permissions audit across the org.
- Plan a "secure by default" framework rollout.
- Plan a service-ownership review across an entire org.
- Plan a "you build it, you run it" rollout for product teams.
- Plan a centralized observability rollout.
- Plan a centralized incident-response (IR) program.
- Plan a multi-tenant cost-fairness mechanism.
- Plan a "freeze + harden" period after major outages.
Theory, Internals, Open Problems (941–970)
- Explain the trade-off space of CRDT vs OT for collaborative editing.
- Explain the trade-offs between gossip and direct broadcast.
- Explain how Cassandra picks coordinator and replicas.
- Explain how DynamoDB enforces its per-partition throughput limits and the 1 MB Query/Scan response limit, and how to design around them.
- Explain how Spanner's commit-wait works.
- Explain how Calvin (deterministic) differs from Spanner.
- Explain how Aurora's quorum (4/6) impacts read latency.
- Explain how PolarDB's shared-storage architecture works.
- Explain how TiDB's TSO (timestamp oracle) works.
- Explain how YugabyteDB places leaders for global tables.
- Explain how Materialize maintains incremental views.
- Explain how RocksDB's compaction styles (level vs universal) trade off.
- Explain how LevelDB's iterator works at byte level.
- Explain how Bigtable's tablet servers split.
- Explain how Manhattan/F1 layered storage works.
- Explain how Megastore differs from Spanner.
- Explain how Percolator (incremental indexing) works.
- Explain how Borg/Omega/Kubernetes pod scheduling differ.
- Explain how Mesos' two-level scheduling works.
- Explain how Kubernetes scheduler computes affinity/anti-affinity.
- Explain how Linux cgroups v2 enforces resource limits.
- Explain how container CPU throttling can hide tail latency bugs.
- Explain how TCP_INFO + epoll affect connection diagnostics.
- Explain why JVM safepoints can dominate p99 latency.
- Explain how Go's GMP scheduler interacts with syscalls.
- Explain how Rust's async runtime (tokio) drives I/O.
- Explain how Erlang/OTP supervisors map onto distributed reliability.
- Explain how Akka's actor model fits cluster sharding.
- Explain how gRPC HTTP/2 multiplexing affects connection counts.
- Explain how QUIC's 0-RTT introduces replay risk.
Open-ended Architecture & Vision (971–1000)
- How will AI-augmented infra change platform engineering in 5 years?
- How would you architect for an AI-first product where every API call hits an LLM?
- How do you design an LLM gateway with caching, routing, and cost control?
- How would you design a privacy-preserving ML pipeline (differential privacy)?
- How would you architect for federated identity across business partners?
- How would you architect for a "universal" customer record (CDP)?
- How would you design an open-data platform sharing terabytes daily?
- How do you architect an Apple-like privacy posture inside an ad-tech business?
- How would you architect for sub-second global config propagation?
- How would you architect for offline-first mobile with eventual sync?
- How would you architect a CRDT-backed shared database for offline apps?
- How would you design an at-edge personalization engine?
- How would you design real-time translation in a chat app?
- How would you architect a low-bandwidth assistant for emerging markets?
- How would you design a payments platform for hyperinflation economies?
- How would you design a system surviving 90% datacenter loss?
- How would you design an air-gapped variant of a SaaS product?
- How would you design data sovereignty per-customer in a global SaaS?
- How would you architect for sustainability (carbon-aware scheduling)?
- How would you design infra for a 100-person crisis response team during a disaster?
- How would you architect an interplanetary delay-tolerant network application?
- How would you design a graceful end-of-life for a deprecated product with 1M users?
- How would you architect a system that intentionally costs less than $0.01 per user per year?
- How would you architect a system optimized for "boring" reliability over novelty?
- How would you architect a system with no on-call rotation?
- How would you design a self-healing platform for a small ops team?
- How would you design a system whose top requirement is auditability for regulators?
- How would you make an existing system 10x cheaper without functional changes?
- How would you design a "platform as a product" with measurable customer value?
- If you could redesign the internet's DNS today, what would you change and why?