🔴 Professional Level (751–1000)

Focus: Staff/Principal/Distinguished-level questions — multi-year thinking, organizational architecture, build-vs-buy, cost optimization, complex migrations, internals of databases & runtimes, technical leadership, RFC processes, cross-team trade-offs.

For whom: Staff/Principal/L6+ engineer. Time per question: 1–2 hours; produce a written answer in RFC format.

🎖️ Staff+ Architecture & Leadership (751–790)

How would you write a 1-pager design doc for a $10M migration?
How do you run an architecture review board (ARB)?
How do you mentor 4 senior engineers across 3 teams?
How do you build technical strategy for a 50-engineer org?
What is a "tech radar" and how would you maintain one?
How do you decide which legacy system to sunset first?
How do you propose moving from on-prem to cloud (3-year plan)?
How would you justify a $2M infra bill to non-engineering stakeholders?
How do you run a blameless post-mortem for a $1M outage?
How do you balance KTLO (keep-the-lights-on) vs new features?
How do you build an internal developer platform (IDP)?
How do you measure DevEx and improve it?
How do you set technical KPIs for a platform team?
How do you sell a refactor to product leadership?
How do you build consensus across 5 conflicting team leads?
How do you choose between "rewrite" and "incremental refactor"?
How do you manage tech debt as a portfolio?
How would you set up a guild model for cross-cutting concerns (security, perf)?
How do you onboard a Staff engineer hire on day 1?
How do you run a "tech week" or innovation sprint?
How do you publish RFCs effectively across 200 engineers?
How do you maintain code health across 1,000+ services?
What is "architecture as code" and how would you adopt it?
How do you design a paved-road framework for new services?
How do you decide what becomes a paved road vs a guardrail?
How do you manage a multi-cloud governance plan?
How do you choose between Kubernetes and serverless company-wide?
How do you set platform SLOs that map to product SLOs?
How do you negotiate SLAs with internal partner teams?
How do you migrate authentication org-wide without an outage?
How do you design a deprecation policy with hard cut-off dates?
How do you communicate a deprecation across 1,000 client teams?
How would you architect a platform for rapid acquisition integration?
How would you stand up engineering after an M&A?
How do you handle "two architectures" post-merger?
How would you re-platform a 15-year-old monolith of 10M lines?
How would you design a "language consolidation" plan (e.g., kill Python, standardize on Go)?
How would you reduce p99 latency org-wide by 30% in 12 months?
How would you reduce cloud spend by 25% without harming reliability?
How would you design a FinOps practice from scratch?

🌌 Extreme Scale Designs (791–830)

Design Google Search infrastructure at a high level.
Design Google Maps backend with real-time traffic.
Design Google Photos at billion-user scale.
Design Gmail for a billion users with spam + search.
Design Google Calendar with multi-region availability.
Design Google Docs sync engine.
Design Google AdWords auction.
Design Google AdSense placement system.
Design Facebook News Feed at 3B MAU.
Design Facebook Live video at scale.
Design Instagram Stories at scale.
Design WhatsApp delivery + read receipts globally.
Design Messenger end-to-end encrypted group chat at scale.
Design Apple iMessage E2E delivery.
Design iCloud Photos sync engine.
Design Spotify's music recommendation pipeline.
Design Spotify's collaborative playlists at scale.
Design Netflix's Open Connect CDN.
Design Netflix's Chaos Monkey-style resilience platform.
Design Netflix's recommendation pipeline (offline + online).
Design YouTube's video upload + transcoding pipeline.
Design YouTube's view-count denormalization pipeline.
Design YouTube's content moderation pipeline.
Design TikTok's For You Page at hundreds of millions of users.
Design Twitter timeline at 500M users.
Design Twitter's full-text search across all tweets.
Design Twitter's trending topics.
Design Reddit's voting + comment ranking at scale.
Design Pinterest's home-feed personalization.
Design LinkedIn's "People You May Know" graph service.
Design LinkedIn's feed ranking pipeline.
Design Stripe's idempotency layer.
Design Stripe's payment-routing engine.
Design Square's POS reliability under flaky internet.
Design Robinhood's order matching at scale.
Design Coinbase's trade engine.
Design Binance-style global exchange (cross-region matching).
Design Cloudflare Workers edge runtime.
Design Vercel/Netlify deployment pipeline.
Design GitHub Actions runner orchestration at billion-job scale.

⚙️ Database & Storage Internals (831–870)

Walk through Postgres MVCC and tuple visibility checks.
Walk through Postgres WAL replication internals.
Explain logical replication slots in Postgres.
Explain Postgres' MVCC GC and why VACUUM matters.
How does Postgres' planner choose between Hash, Merge, and Nested Loop joins?
How do pg_partman and native partitioning work?
What is Citus and how does it shard Postgres?
What is YugabyteDB and how does it differ from Spanner?
What is CockroachDB's transaction layer (Range + Raft + KV)?
What is FoundationDB's deterministic simulation testing?
Walk through MySQL InnoDB redo log + undo log + buffer pool.
Walk through MySQL group replication.
How does Vitess shard MySQL at scale?
How does Aurora separate compute from storage?
How does DynamoDB handle consistency for global secondary indexes?
How does DynamoDB partition resizing work?
How does Cassandra repair work (full vs incremental vs subrange)?
How does ScyllaDB outperform Cassandra (shard-per-core)?
How does Redis Cluster handle resharding?
How do Redis Streams and consumer groups compare to Kafka?
How does Kafka KRaft replace ZooKeeper?
How does Kafka Tiered Storage work?
Explain Pulsar's segmented architecture (BookKeeper).
Explain etcd's MVCC + Raft layering.
Explain ZooKeeper's ZAB protocol.
Walk through ClickHouse's MergeTree and how it scales analytical reads.
Walk through Druid's segment-based architecture.
Walk through Pinot's real-time + offline duality.
Walk through Snowflake's storage-compute separation.
Walk through BigQuery's Dremel execution model.
Walk through Spanner's TrueTime + Paxos groups.
Walk through Dgraph's GraphQL + Badger LSM stack.
Walk through Neo4j's graph traversal engine.
Walk through TimescaleDB's hypertable + chunk design.
Walk through InfluxDB's TSI + TSM file format.
Walk through MongoDB replica set election.
Walk through MongoDB shard balancer.
Walk through Elasticsearch's segment merging and refresh.
Walk through OpenSearch's cross-cluster replication.
Walk through HBase's region splits and HDFS interaction.

🌐 Specialized Systems (871–910)

Design a high-frequency trading order matching engine (microsecond latency).
Design a market data fan-out system (millions of subscribers).
Design a distributed key-value store with linearizable reads.
Design a globally consistent counter (without single bottleneck).
Design a globally distributed graph database.
Design a globally distributed time-series database.
Design an ad-bidding RTB system at 10M QPS.
Design a bid-request fan-out across 50 DSPs in <50ms.
Design a pixel/event tracking pipeline at 1B events/day.
Design a programmatic ads attribution engine.
Design a real-time recommendation system using vector search.
Design a multi-armed bandit feature serving system.
Design a federated learning training pipeline.
Design a model registry + serving platform like SageMaker.
Design a feature store like Feast at scale.
Design a vector DB (Pinecone-like) with billion-vector recall.
Design a generative-AI inference serving platform.
Design a streaming ETL platform like dbt + Materialize.
Design an IoT ingestion system at 10M devices.
Design a smart-home control plane at country scale.
Design a connected-car telemetry pipeline.
Design a maritime AIS tracking system.
Design a flight-tracking system.
Design a satellite imagery indexing pipeline.
Design a real-time multiplayer game backend (FPS).
Design an MMO server architecture (zones, instances).
Design a turn-based puzzle game backend with replay.
Design a streaming live-game telemetry system.
Design a CDN with custom DDoS scrubbing.
Design a distributed code-execution sandbox (LeetCode/Replit-style).
Design a CI/CD platform for 1,000 teams (GitLab-scale).
Design a build cache shared across an org.
Design a remote build execution service (Bazel RBE).
Design a binary artifact store (Artifactory-scale).
Design a multi-tenant Kubernetes platform (per-team quotas).
Design an internal serverless platform (Knative-based).
Design a cloud cost-attribution service.
Design a chargeback/showback system across business units.
Design a multi-region object storage with strong read-after-write.
Design a content-addressable store for backups (BorgBackup-style).

🧪 Migration, Modernization & Org Design (911–940)

Plan a strangler-fig migration of a 10-year monolith.
Plan a database engine migration from Oracle to Postgres at 50TB scale.
Plan a storage migration from on-prem NAS to S3 with zero downtime.
Plan a search migration from Solr to Elasticsearch.
Plan a queue migration from RabbitMQ to Kafka.
Plan a config-store migration (env vars → Vault → dynamic config).
Plan a runtime migration from Java 8 to Java 21 fleet-wide.
Plan a Python 2 → 3 migration across 800 services.
Plan a Node.js LTS upgrade fleet-wide.
Plan a Kubernetes version upgrade across 500 clusters.
Plan a fleet-wide remediation of dependency vulnerabilities.
Plan a "remove TLS 1.0/1.1" rollout across thousands of services.
Plan a CDN provider migration with no user-visible regression.
Plan a cloud provider migration (AWS → GCP) for a stateful app.
Plan a "ship every commit to prod" transformation for a slow team.
Plan a trunk-based development adoption across an org.
Plan a feature-flag system rollout org-wide.
Plan a SOC 2 program from zero to a completed attestation report.
Plan an ISO 27001 readiness program.
Plan a HIPAA-compliant fork of an existing platform.
Plan a PCI DSS scope reduction strategy.
Plan a data classification + tagging program at the warehouse level.
Plan a permissions audit across the org.
Plan a "secure by default" framework rollout.
Plan a service-ownership review across an entire org.
Plan a "you build it, you run it" rollout for product teams.
Plan a centralized observability rollout.
Plan a centralized incident-response (IR) program.
Plan a multi-tenant cost-fairness mechanism.
Plan a "freeze + harden" period after major outages.

🧠 Theory, Internals, Open Problems (941–970)

Explain the trade-off space of CRDT vs OT for collaborative editing.
Explain the trade-offs between gossip and direct broadcast.
Explain how Cassandra picks a coordinator and replicas.
Explain how DynamoDB enforces its size and throughput limits (400 KB per item, 1 MB per Query/Scan page, per-partition throughput caps) and how to design around them.
Explain how Spanner's commit-wait works.
Explain how Calvin (deterministic) differs from Spanner.
Explain how Aurora's quorum model (4/6 write, 3/6 read) impacts write and read latency.
Explain how PolarDB's shared-storage architecture works.
Explain how TiDB's TSO (timestamp oracle) works.
Explain how YugabyteDB places leaders for global tables.
Explain how Materialize maintains incremental views.
Explain how RocksDB's compaction styles (level vs universal) trade off.
Explain how LevelDB's iterator works at byte level.
Explain how Bigtable's tablet servers split.
Explain how Manhattan/F1 layered storage works.
Explain how Megastore differs from Spanner.
Explain how Percolator (incremental indexing) works.
Explain how Borg/Omega/Kubernetes pod scheduling differ.
Explain how Mesos' two-level scheduling works.
Explain how Kubernetes scheduler computes affinity/anti-affinity.
Explain how Linux cgroups v2 enforces resource limits.
Explain how container CPU throttling can hide tail latency bugs.
Explain how TCP_INFO + epoll affect connection diagnostics.
Explain why JVM safepoints can dominate p99 latency.
Explain how Go's GMP scheduler interacts with syscalls.
Explain how Rust's async runtime (Tokio) drives I/O.
Explain how Erlang/OTP supervisors map onto distributed reliability.
Explain how Akka's actor model fits cluster sharding.
Explain how gRPC's HTTP/2 multiplexing affects connection counts.
Explain how QUIC's 0-RTT introduces replay risk.

🧭 Open-ended Architecture & Vision (971–1000)

How will AI-augmented infra change platform engineering in 5 years?
How would you architect for an AI-first product where every API call hits an LLM?
How do you design an LLM gateway with caching, routing, and cost control?
How would you design a privacy-preserving ML pipeline (differential privacy)?
How would you architect for federated identity across business partners?
How would you architect for a "universal" customer record (CDP)?
How would you design an open-data platform sharing terabytes daily?
How do you architect an Apple-like privacy posture inside an ad-tech business?
How would you architect for sub-second global config propagation?
How would you architect for offline-first mobile with eventual sync?
How would you architect a CRDT-backed shared database for offline apps?
How would you design an at-edge personalization engine?
How would you design real-time translation in a chat app?
How would you architect a low-bandwidth assistant for emerging markets?
How would you design a payments platform for hyperinflation economies?
How would you design a system that survives the loss of 90% of its datacenters?
How would you design an air-gapped variant of a SaaS product?
How would you design data sovereignty per-customer in a global SaaS?
How would you architect for sustainability (carbon-aware scheduling)?
How would you design infra for a 100-person crisis response team during a disaster?
How would you architect an interplanetary delay-tolerant network application?
How would you design a graceful end-of-life for a deprecated product with 1M users?
How would you architect a system that intentionally costs less than $0.01 per user per year?
How would you architect a system optimized for "boring" reliability over novelty?
How would you architect a system with no on-call rotation?
How would you design a self-healing platform for a small ops team?
How would you design a system whose top requirement is auditability for regulators?
How would you make an existing system 10x cheaper without functional changes?
How would you design a "platform as a product" with measurable customer value?
If you could redesign the internet's DNS today, what would you change and why?
