Change-Data-Capture Pipeline (Postgres WAL → Kafka)¶
Tail the Postgres write-ahead log and turn every committed row change into an ordered Kafka event — through a ≥100M-row initial snapshot and a 20–50k changes/s write storm. Then prove it: no loss, no out-of-order per key, and a replication slot that never silently eats your disk. Finally, build the same thing with a transactional Outbox and explain which one you'd ship.
| Tier | Lab (event-engineering) |
| Primary domain | CDC / data integration |
| Skills exercised | Postgres logical replication (WAL, pgoutput/wal2json), replication slots & WAL retention, Debezium / Kafka Connect, snapshot↔stream cutover, per-key ordering, idempotent downstream, transactional Outbox, Go (twmb/franz-go consumer), Prometheus |
| Interview sections | 5 (Postgres & SQL), 11 (messaging), 13 (distributed systems) |
| Est. effort | 4–6 focused days |
1. Context¶
You run the platform team at a company whose system of record is a single large Postgres database. Three new consumers want the data as it changes: a search indexer, an analytics warehouse loader, and a cache invalidator. Today they all poll WHERE updated_at > ? on a hot table — and that polling is both slow and lossy (it misses deletes, it double-reads under clock skew, and it hammers the primary).
You're going to replace polling with Change Data Capture: read committed row changes straight from the Postgres WAL via logical replication and stream them to Kafka with Debezium, so every INSERT/UPDATE/DELETE becomes an ordered event keyed by primary key. The catch is that the source table already has ≥100M rows (so you must snapshot it before you can stream), the application writes 20–50k changes/s (so the connector can fall behind and a slow consumer can grow the WAL until the disk fills), and the schema changes under you (ALTER TABLE mid-stream).
Then you'll build the transactional Outbox alternative and compare them head-to-head. The deliverable is not "CDC works" — it's a findings note that states, with numbers, when CDC beats Outbox and when it doesn't, and a proof that your pipeline loses nothing and reorders nothing under failure.
2. Goals / Non-goals¶
Goals - Stand up logical replication on Postgres and a Debezium Postgres connector on Kafka Connect that snapshots a ≥100M-row table and then streams live changes to per-table Kafka topics. - Characterize CDC lag (commit-in-Postgres → event-in-Kafka) as a function of write rate, and find the rate at which the connector can no longer keep up. - Make the danger of an unconsumed replication slot concrete: stall the pipeline and watch pg_wal grow toward the disk limit; show the guardrails. - Build a downstream Go consumer that applies changes idempotently so that Debezium's at-least-once delivery yields an exactly-once effect in the sink, and prove it through connector restarts mid-snapshot and mid-stream. - Implement the transactional Outbox pattern and compare it to log-based CDC on latency, coupling, ordering, and operational cost.
Non-goals - A managed CDC service (Confluent Cloud, AWS DMS, Striim). Run Debezium + Kafka Connect yourself so you see the slot, the offsets, and the snapshot knobs. - Multi-table foreign-key-consistent CDC into a transactional sink (that's the CQRS/Saga staff lab). Here, per-key ordering per table is the contract. - Schema registry ergonomics (Avro/Protobuf compatibility rules) — covered in events/04. Use JSON or a fixed Avro schema; focus on the WAL→Kafka mechanics. - Bi-directional / multi-master replication. Source is a single Postgres primary.
3. Functional requirements¶
- A source loader (
cmd/seed) populatespublic.accounts(or similar) with ≥100M rows, including some wide rows (apayloadcolumn ≥ 4 KB that crosses the TOAST threshold) so you exercise TOAST behavior. - A write generator (
cmd/churn) applies a configurable mix ofINSERT/UPDATE/DELETEat a target rate (default 20k/s, sustainable to 50k/s) with a Zipfian key distribution, deterministic given a seed. - Logical replication is configured (
wal_level=logical), a replication slot and publication exist, and a Debezium Postgres connector (pluginpgoutput, with awal2jsonvariant tried for comparison) snapshots the table then streams CDC events to topiccdc.public.accounts, keyed by primary key. - A sink consumer (
cmd/sink,twmb/franz-go) consumes the CDC topic and maintains a downstream replica (a second Postgres schema or a Redis hash), applying each change idempotently — a re-delivered event must not corrupt state. Deletes (Debezium tombstones,null-value records) remove the key. - The connector and sink survive restarts: on restart they resume from offsets / LSN, not from zero, and never re-snapshot unless told to.
- A chaos hook (
cmd/chaos) can: stall the sink (to grow the slot), kill the Connect worker mid-snapshot, kill it mid-stream, and issue anALTER TABLEwhile changes are flowing. - An Outbox variant (
cmd/outbox-*): the app writes business rows and anoutboxrow in the same transaction; a relay publishes outbox rows to Kafka and marks them sent. Same downstream sink consumes them.
4. Load & data profile¶
- Initial snapshot: source table ≥ 100M rows, ≥ 40 GB on disk including TOAST. Report snapshot wall-clock and the lock/consistency mode used.
- Wide rows: ~10% of rows carry a
payload≥ 4 KB (forces out-of-line TOAST storage). Measure how anUPDATEthat doesn't touch the TOAST column emits CDC —REPLICA IDENTITYand unchanged-toast handling matter here. - Write rate: sustained 20k changes/s, with a stress run to 50k/s. Mix ≈ 70% UPDATE / 20% INSERT / 10% DELETE. Open-model generator (fixed commit rate, not "as fast as Debezium drains") so lag can build and be observed.
- Key distribution: primary key access is Zipfian (s≈1.1) so some keys are hot — this stresses per-key ordering and compaction, not just raw volume.
- Run length: a steady-state streaming run of ≥ 30 minutes at target rate, plus a separate full-snapshot run timed end to end.
- Schema change: at least one
ALTER TABLE ... ADD COLUMN(nullable, fast) and one wider change (ALTER TYPE/ addNOT NULLwith default) applied mid-stream; record what the connector emits and whether the sink survives.
5. Non-functional requirements / SLOs¶
| Metric | Target |
|---|---|
| Steady-state CDC lag (commit→Kafka) at 20k/s, below ceiling | p99 < 2 s, bounded and flat (not monotonically rising) |
| Connector throughput ceiling | Find & report changes/s the connector sustains; name the bound (snapshot chunking? single-slot decode? Connect task? sink) |
| Initial-snapshot duration (≥100M rows) | Reported with method; streaming must resume from the consistent point, zero gap, zero overlap |
| Per-key ordering | For every primary key, sink applies changes in commit order; prove with a monotonic per-key version check |
| Delivery / idempotency | Debezium is at-least-once; sink effect is exactly-once: after replay/restart, downstream replica == source for sampled keys, zero corruption |
| Replication-slot safety | pg_wal bounded by max_slot_wal_keep_size; an indefinitely stalled slot must not fill the disk — alert fires first |
| Restart recovery | After killing Connect mid-snapshot and mid-stream, pipeline resumes with no loss and no duplicate-induced corruption |
| Schema-change handling | ALTER TABLE ADD COLUMN mid-stream causes no pipeline halt and no lost changes; new field appears in events after the DDL |
The point is not a magic lag number — it's to find your connector's sustainable rate, explain what bounds it, and prove the correctness invariants survive failure.
6. Architecture constraints & guidance¶
- Bring up Postgres (≥ 15), Kafka (KRaft, pinned version), and Kafka Connect + Debezium via
docker-compose. Pin every version in your findings. - Postgres must run with
wal_level=logical. Use thepgoutputplugin (in-tree, no extension) as the baseline; optionally comparewal2json. - Set
max_slot_wal_keep_sizeso a stalled slot is capped — then demonstrate what happens when it caps (slot invalidated, source protected) vs when it's unset (disk fills). This trade-off is the heart of the lab. - Choose
REPLICA IDENTITYdeliberately:DEFAULT(PK only inbefore),FULL(whole old row — bigger WAL, but UPDATE/DELETE carry fullbefore). Measure the WAL-size and event-size cost ofFULL. - Snapshot mode: try Debezium
initial(snapshot then stream) and reason aboutincrementalsnapshots (signal-table driven, chunked, resumable). State which you chose and why for a 100M-row table. - Sink consumer in Go with
twmb/franz-go. Apply changes inside a DB transaction guarded by a dedup key ((source_lsn)or(table, pk, source_ts)) so re-delivery is a no-op. Keep one consumer per partition to preserve per-key order; never parallelize within a key. - Topic for the table should be log-compacted if the sink is a "latest state" replica (keyed by PK, tombstones delete) — note the interaction between compaction and replayability.
- Instrument with Prometheus: connector lag (
debezium_metrics_MilliSecondsBehindSource), Kafka consumer-group lag, snapshot rows remaining, slot retained bytes (pg_replication_slots.wal_status/pg_current_wal_lsn() - restart_lsn), sink apply rate, dedup-hit rate, p50/p99/p999 commit→apply.
7. Data model¶
source: public.accounts
account_id BIGINT PRIMARY KEY
balance BIGINT
status TEXT
payload TEXT -- ~10% ≥ 4 KB → TOAST out-of-line
version BIGINT -- bumped on every UPDATE; used to prove per-key order
updated_at TIMESTAMPTZ
-- REPLICA IDENTITY {DEFAULT | FULL} chosen & justified
Debezium CDC envelope (per change):
{ op: c|u|d|r, before, after, source:{ lsn, txId, ts_ms, table }, ts_ms }
key = { account_id } -- guarantees per-key partition routing
value = null -- tombstone, emitted after a delete
sink replica (idempotent apply):
replica.accounts(account_id PK, balance, status, payload, version, src_lsn)
-- apply rule: UPSERT only if incoming source.lsn > stored src_lsn
-- delete: remove row on tombstone / op=d
applied_lsn(slot TEXT PK, last_lsn BIGINT) -- optional progress ledger
outbox variant:
outbox(id BIGSERIAL PK, aggregate_id BIGINT, payload JSONB,
created_at, published_at NULL) -- written in the business txn
source.lsn (or the Debezium (lsn, txId) pair) is the ordering and dedup authority end to end: it is monotonic per key, lets the sink discard stale or re-delivered changes, and is the only number that proves no loss/no reorder.
8. Interface contract¶
- Connector: Debezium Postgres connector config (JSON) checked in —
slot.name,publication.name,plugin.name,snapshot.mode,table.include.list,tombstones.on.delete, key/value converters. - Kafka topic:
cdc.public.accounts, keyed byaccount_id, partitioned so a key always lands on one partition. Tombstone =nullvalue after delete. - Sink:
cmd/sinkflags —-topic,-group,-dedup-key,-apply-batch,-stall(to simulate a slow consumer for the slot-growth experiment). GET /replica/{account_id}→{ account_id, balance, version, src_lsn }for source-vs-sink diffing.GET /metrics→ Prometheus exposition (lag, slot bytes, dedup hits, apply p99).- Outbox relay:
cmd/outbox-relayflags —-batch,-poll, and a publish-then-mark loop you can also drive via logical decoding of the outbox table (note both relay strategies).
9. Key technical challenges¶
- Snapshot↔streaming cutover with zero gap and zero overlap. Debezium reads a consistent snapshot at an LSN, then streams from exactly that LSN. If your understanding of the boundary is wrong you'll either drop changes that landed during the snapshot or double-apply them. Prove the seam is clean.
- The replication slot is a loaded gun. An unconsumed (or slow) slot pins
restart_lsn, so Postgres cannot recycle WAL —pg_walgrows until the disk fills and the primary goes down. You must show the failure and the guardrail (max_slot_wal_keep_size, monitoring, alerting before the disk dies). - At-least-once → exactly-once effect. Debezium can re-deliver after a Connect crash or rebalance. Correctness lives entirely in the sink: idempotent UPSERT guarded by
src_lsnmonotonicity, deletes via tombstones, all in one DB txn. - Per-key ordering under a hot Zipfian key. Ordering is only guaranteed within a partition; a hot key concentrates load on one partition/consumer and can become the straggler. You must not "fix" lag by reordering within a key.
- TOAST &
REPLICA IDENTITY. An UPDATE that doesn't change a TOASTed column emits an unchanged-toast placeholder, not the value;REPLICA IDENTITY DEFAULTgives you only the PK inbefore. Naive sinks corrupt wide rows here. - Schema change mid-stream.
ALTER TABLEflows through the WAL; the connector must keep going and emit the new shape. With a registry-less JSON pipeline, prove the sink tolerates added fields and doesn't halt the connector. - Outbox vs CDC trade-off, defended. Outbox gives you app-controlled payloads and no logical-replication ops burden, at the cost of a dual-write-in-one-txn discipline, a polling/decoding relay, and table bloat. Quantify, don't hand-wave.
10. Experiments to run (break it / tune it)¶
Record before/after numbers and a one-paragraph "why" for each:
- CDC lag vs write rate. Sweep churn rate 5k → 20k → 35k → 50k/s. Plot commit→Kafka p99 lag. Find the rate where lag stops being flat and starts climbing — that's the connector ceiling. Name what bounds it.
- Stall the consumer → slot/WAL growth. Pause the sink (or the Connect task) with churn running. Watch
restart_lsnfreeze and slot-retained bytes /pg_walclimb. Run it twice: withmax_slot_wal_keep_sizeunset (disk heads toward full) and set (slot invalidated, primary protected). Report the alert lead time. - Snapshot duration vs table size. Snapshot at 10M, 50M, 100M rows. Plot wall-clock and rows/s. Compare
initialvsincrementalsnapshot, and the effect of snapshot chunk size and parallelism on lock/contention on the source. REPLICA IDENTITY& TOAST cost.DEFAULTvsFULL: measure WAL bytes/s and CDC event size at 20k/s. Then UPDATE a non-TOAST column on wide rows and verify the sink reconstructs the row correctly (no clobberedpayload).- Schema change mid-stream. Apply
ALTER TABLE ADD COLUMNand a heavier change while streaming. Confirm zero connector halt, zero lost changes, and that post-DDL events carry the new field. Note any error and how you'd handle it. - Failover correctness. During a 30-min run: (a) kill the Connect worker mid-snapshot; (b) kill it mid-stream; (c) trigger a consumer-group rebalance. After each, prove
replica == sourcefor a sample of keys and that per-keyversion/src_lsnis monotonic — no loss, no corruption from re-delivery. - Outbox vs CDC bake-off. Same workload through the Outbox relay. Compare: end-to-end p99 latency, source write amplification, ordering guarantees, ops surface (slot management vs outbox table bloat/cleanup), and coupling (who owns the event shape). Produce a recommendation table.
- Compaction & replay. With a log-compacted topic, kill and rebuild the sink from the topic. Show that a fresh consumer reaches correct latest state, and reason about what compaction destroys for full-history replay.
11. Milestones¶
- Compose up: Postgres (logical), Kafka, Connect+Debezium;
cmd/seedto 100M rows; Prometheus + a Grafana board for lag / slot bytes / apply rate. - Connector snapshots + streams to
cdc.public.accounts;cmd/sinkapplies idempotently; first source-vs-sink diff is clean at low rate. - Lag-vs-rate sweep and snapshot-duration runs (experiments 1, 3); ceiling named.
- Slot-growth experiment with and without the cap (experiment 2); guardrail proven.
REPLICA IDENTITY/TOAST and schema-change runs (experiments 4, 5).- Chaos/failover correctness run (experiment 6); diff proof committed.
- Outbox variant + bake-off (experiments 7, 8); findings note with recommendation.
12. Acceptance criteria (definition of done)¶
- ≥ 30-min steady-state run at ≥ 20k changes/s with flat CDC lag; dashboard screenshot attached, lag p99 number stated.
- Initial-snapshot duration for ≥ 100M rows reported, with the snapshot mode and consistency boundary explained; streaming resumes with no gap, no overlap (show the LSN seam).
- Connector throughput ceiling reported with the bottleneck named and evidenced (decode? Connect task? sink apply? slot single-threading?).
- Slot-safety experiment shown both ways: uncapped slot grows
pg_waltoward the limit, andmax_slot_wal_keep_sizecaps it before the disk dies, with the alert firing first. Both timelines committed. - After killing Connect mid-snapshot and mid-stream plus a rebalance,
replica == sourcefor sampled keys with zero corruption; per-keyversion/src_lsnproven monotonic (show the SQL/diff). - TOAST/
REPLICA IDENTITYbehavior demonstrated: an UPDATE not touching the TOASTed column does not clobberpayloadin the sink. -
ALTER TABLEmid-stream handled with no connector halt and no lost changes. - Outbox-vs-CDC comparison table with defended numbers (latency, write amplification, ordering, ops cost, coupling) and a stated recommendation.
- Every number reproducible from a committed command + config.
13. Stretch goals¶
- Incremental snapshot under load: trigger Debezium's signal-table-driven incremental snapshot of a new table while streaming continues; measure the impact on lag and prove no interleaving errors.
- Multi-table transaction grouping: enable Debezium transaction metadata and show how a downstream could reconstruct source-transaction boundaries.
- Heterogeneous sink: add a second sink (Redis or Elasticsearch) on the same topic; show fan-out with independent offsets and zero impact on the SQL sink.
- Outbox via logical decoding: publish the outbox table itself through CDC instead of polling — collapsing both patterns into one — and discuss the trade.
- Slot-failover drill: promote a Postgres replica and show what happens to the logical slot (it doesn't follow by default); design the recovery.
14. Evaluation rubric¶
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Snapshot/stream cutover | Snapshots then streams without obvious gaps | Explains and proves the LSN seam; knows initial vs incremental trade-offs at 100M rows |
| CDC lag analysis | Reports a lag number at a rate | Names and proves the connector ceiling; knows the next bottleneck |
| Slot / WAL safety | Knows an unconsumed slot retains WAL | Demonstrates disk-fill risk and the cap, with monitoring + alert lead time |
| Delivery correctness | Idempotent apply in the happy path | Exactly-once effect survives mid-snapshot + mid-stream kills + rebalance; explains why it's correct |
| Ordering | Knows ordering is per-partition/key | Handles hot-key skew without breaking per-key order; proves monotonicity |
| TOAST / replica identity | Aware deletes/updates carry before | Picks REPLICA IDENTITY with measured cost; sink never clobbers wide rows |
| CDC vs Outbox judgment | Builds both, lists pros/cons | Recommends one per context with quantified latency/ops/coupling evidence |
| Communication | Clear findings note | Could defend every lag curve and the slot timeline to a staff panel |
15. References¶
- Postgres docs: logical replication, replication slots,
wal_level=logical,REPLICA IDENTITY,max_slot_wal_keep_size, TOAST storage. - Debezium docs: Postgres connector, snapshot modes (
initial/incremental+ signaling),pgoutputvswal2json, tombstones, transaction metadata. - Kafka Connect: offsets, task restarts, converters, log compaction.
- Designing Data-Intensive Applications — Ch. 11 (change data capture & derived data); Ch. 5 (replication, leaders & logs).
- Microservices outbox / dual-write problem write-ups (transactional outbox, polling-publisher vs log-tailing relay).
twmb/franz-goconsumer-group example.- See also:
Interview Question/11-messaging-and-event-streaming/andInterview Question/05-postgresql-and-sql/.