Dead-Letter Queues & Retry Topology¶
A downstream is flaky and some of your messages are poison. Build the retry topology that lets transient failures heal, parks the poison, keeps the happy path fast under load — and prove that nothing is silently lost or stuck retrying forever. Find the failure rate where your DLQ grows without bound.
| Tier | Lab (event-engineering) |
| Primary domain | Messaging reliability / consumer design |
| Skills exercised | Non-blocking retry topics, exponential backoff + jitter, parking-lot DLQ, idempotency, transient-vs-permanent error classification, head-of-line blocking, Go (twmb/franz-go) |
| Interview sections | 11 (messaging), 13 (distributed systems), 17 (performance) |
| Est. effort | 3–5 focused days |
1. Context¶
You own a Kafka consumer that processes orders and calls a downstream service (payments, inventory, an email API) for each message. Reality intrudes:
- The downstream is transiently flaky — timeouts, 503s, brief partitions. These succeed on retry seconds later.
- A fraction of messages are poison: malformed payloads, references to deleted entities, a bug that will always throw on this exact message. No amount of retrying fixes them.
- During an incident the downstream goes fully down, and every in-flight message starts failing at once.
The naive consumer retries each message in place until it succeeds. That works in a demo and collapses in production: one poison message blocks its partition forever (head-of-line blocking), and a downstream outage turns into an ever-growing retry spin that pins CPU and stalls every partition behind it.
Your job is to design a retry + dead-letter topology that separates these three failure modes, recovers the transient ones without blocking the happy path, parks the permanent ones for human/automated inspection, and never loses or infinitely re-delivers a message. You will produce numbers — throughput under failure, DLQ growth curves, ordering impact — not opinions.
2. Goals / Non-goals¶
Goals - Build a non-blocking retry pipeline using tiered delay/retry topics, and show it sustains throughput while a blocking in-place retry consumer does not. - Implement exponential backoff with jitter and a hard max-retries → DLQ (parking lot) boundary; prove nothing retries forever. - Make processing idempotent so a message redelivered across retry tiers (or reprocessed after a crash) never double-applies its effect. - Classify transient vs permanent errors and route them differently (retry vs straight-to-DLQ); measure the cost of getting that classification wrong. - Provide DLQ inspection + replay (manual and automated) and prove replay is correct and idempotent. - Quantify the cost of retries on overall throughput and on ordering.
Non-goals - Building the downstream service. Use a stub whose failure rate and latency are injectable by flag. - Exactly-once produce semantics across the whole pipeline — that's the events/07-idempotent-inbox-outbox and staff/05 labs. Here, at-least-once delivery + idempotent processing is the contract. - A managed retry framework (Spring Retry, etc.). Build the topology yourself so you see the topics, offsets, and ordering trade-offs.
3. Functional requirements¶
- A producer (
cmd/producer) emits messages to topicordersat a configurable rate, with a configurable share of poison messages (deterministic given a seed, so the same messages are always poison). - A processor (
cmd/processor) consumesorders, calls the downstream stub per message, and on failure routes via the retry topology rather than blocking. It supports two modes by flag: blocking(baseline) — retry in place on the main topic, blocking the partition until success or give-up.non-blocking— publish failures to the next retry tier and commit the original offset, freeing the partition.- Tiered retry topics with increasing delay, e.g.
orders.retry.5s→orders.retry.30s→orders.retry.5m. A retry consumer waits out the tier's delay, reprocesses, and on repeated failure promotes the message to the next tier. - After
max_retries, the message goes to a DLQ (orders.DLQ) with its full failure metadata (original topic/partition/offset, attempt count, last error, first-seen timestamp) in headers. - A downstream stub (
cmd/downstreamor in-process) with injectable transient failure rate, poison detection, and latency, plus anoutageswitch that fails 100% for a window. - A DLQ tool (
cmd/dlq) to inspect parked messages and replay them (selectively or in bulk) back ontoordersafter a fix. - Processing is idempotent: applying a message N times has the same effect as applying it once (dedup keyed by a business/message id).
4. Load & data profile¶
- Throughput: sustain ≥ 50k msg/s through
ordersfor a ≥ 20-minute run (tune to your hardware; state your number and hold it constant across runs). - Transient failure rate: sweep 5% → 15% → 30% of messages fail transiently on first attempt, then succeed within 1–3 attempts. This is the primary load knob.
- Poison rate: 1% baseline of messages are permanently poison; in the DLQ-growth experiment sweep poison 0.1% → 1% → 5% → 10% to find where the DLQ outpaces drain.
- Message size: 512 B JSON order with a unique
order_id(idempotency key). - Generator:
cmd/gen/ producer flag deterministic via seed; the sameorder_ids are poison every run so DLQ contents are reproducible. - Traffic model: open-model producer (fixed send rate, not "as fast as the consumer drains") so retry backlog and DLQ growth are observable, not masked.
5. Non-functional requirements / SLOs¶
| Metric | Target |
|---|---|
| Happy-path throughput under 15% transient failure (non-blocking) | ≥ 90% of zero-failure throughput; state and defend the number |
| Head-of-line blocking from a single poison message (non-blocking) | Zero — poison message must not stall its partition |
| Transient recovery success | ≥ 99.9% of transient failures eventually succeed within the retry tiers (not in DLQ) |
| Time-to-DLQ for a poison message | Bounded and known: Σ tier delays; report it (e.g. ~5m35s for 5s+30s+5m) |
| DLQ growth at 1% poison, steady state | Bounded — DLQ arrival rate ≈ poison rate × input rate; not super-linear |
| Retries-forever | Impossible by construction — prove max_retries is hard |
| End-to-end p99 for a once-failed-then-succeeded message | < (first retry delay + tier processing); report it |
| Loss | Zero — every produced message ends in success or DLQ; success + DLQ == produced |
The point is not a magic throughput number — it's to find where your topology stops being free (failure rate, poison rate, ordering) and explain each curve.
6. Architecture constraints & guidance¶
- Kafka via
docker-compose(KRaft mode, pinned version). Topics:orders,orders.retry.5s,orders.retry.30s,orders.retry.5m,orders.DLQ. - Go client:
twmb/franz-go(good batching control, header access, per-record control). Keep producer, processor, retry consumers, and DLQ tool as separate binaries so you can scale and kill them independently. - Non-blocking retry mechanics: a retry consumer must honor the tier's delay without blocking the partition. Two valid designs — pick one and defend it:
- Sleep-to-delay: the retry consumer reads a record, sleeps until
record.timestamp + tier_delay, then processes. Pause partition fetch while sleeping so you don't buffer unboundedly; one consumer per tier topic. - Time-bucketed topics: delay is encoded by which topic the message sits in; a scheduler promotes due messages. (More plumbing, finer control.)
- Backoff: exponential base with full jitter (
sleep = random(0, base·2^attempt)), capped. State why jitter matters here (de-synchronizing a retry storm — see §9). - Idempotency: dedup on
order_idin the downstream store inside one transaction; a redelivered message is a no-op, not a double-apply. - Instrument everything with Prometheus: per-topic consume rate, per-tier retry rate, DLQ arrival rate + total, attempt-count histogram, head-of-line stall time, end-to-end latency p50/p99/p999, partition lag per topic.
7. Data model¶
order: { order_id string, customer_id string, amount int64, ts int64, payload []byte }
retry/DLQ headers (carried on every retried/parked record):
x-orig-topic string // "orders"
x-orig-partition int32
x-orig-offset int64
x-attempts int32 // incremented each tier hop
x-first-seen int64 // ms epoch of first failure
x-last-error string // classified reason
x-error-class string // "transient" | "permanent"
downstream idempotency ledger:
applied(order_id PK, applied_at TIMESTAMPTZ) -- dedup guard, checked in the apply txn
A message's "attempt budget" is len(retry_tiers); exceeding it routes to orders.DLQ with the headers above intact, so the parked record is self-describing for inspection and replay.
8. Interface contract¶
- Processor flags:
-mode={blocking|non-blocking},-max-retries,-tiers=5s,30s,5m,-backoff-base,-backoff-cap,-jitter={none|full},-classify={on|off}. - Downstream stub flags:
-fail-rate,-poison-detect,-latency,-outage-window(e.g. fail 100% for 60s starting at T+5m). - DLQ tool:
dlq inspect [--filter=error-class=permanent]lists parked messages with headers;dlq replay [--ids=… | --all | --since=…]republishes toorders. GET /metrics→ Prometheus exposition for every binary.
9. Key technical challenges¶
- Head-of-line blocking. In-place retry holds the partition offset until the message succeeds. One poison message → the whole partition stalls and lag on good messages climbs without bound. The non-blocking topology must commit the original offset and move the failure aside. Prove the difference.
- Transient vs permanent classification. Retrying a permanent error wastes the entire tier budget and delays parking by
Σ tier delays. Sending a transient error straight to DLQ loses recoverable work. Build a classifier (error type, HTTP class, validation result) and measure the cost of misclassification both ways. - Retry storms / amplification. When the downstream goes fully down, every message fails and floods the retry tiers at once. Without jitter, they all come due at the same instant and hammer the (still-recovering) downstream in a synchronized wave — a self-inflicted DoS. Full jitter spreads the reload; a circuit breaker in front of the downstream stops feeding it entirely while it's open. See the sibling
resilience/03-circuit-breaker-bulkhead-timeout— retry storms are exactly what circuit breaking exists to prevent. - Ordering. Moving a failed message to a retry topic means later messages for the same key pass it. For keyed/ordered streams this breaks per-key ordering. Decide explicitly: sacrifice global ordering (fast), or hold a per-key "parking lock" so later messages for a failed key also wait (correct, slower). Measure both.
- DLQ unboundedness. DLQ is a queue, not a black hole. If poison arrival rate exceeds your replay/triage rate, it grows forever. Find the poison rate at which it diverges; design alerting on DLQ arrival-rate, not just total.
- Idempotent replay. Replaying the DLQ re-injects messages that may have partially applied before failing. Without the dedup ledger, replay double-applies. Prove replay is a no-op for already-applied work.
10. Experiments to run (break it / tune it)¶
Record before/after numbers for each:
- Blocking vs non-blocking under load. At 15% transient failure + 1% poison, run both modes. Measure: happy-path throughput, per-partition lag, and head-of-line stall time (how long a good message waits behind a poison one). Show blocking mode's lag climbing monotonically while non-blocking stays flat.
- Failure-rate sweep. Non-blocking mode at transient failure 5% → 15% → 30%. Measure: throughput, retry-topic volume, and the multiplier — at 30% transient, how many total broker messages per input message (amplification factor)?
- Backoff schedule tuning. Sweep tier delays (
5s,30s,5mvs1s,5s,30svs10s,1m,10m) andjitter={none|full}. Measure: transient recovery success %, mean time-to-success, and whether short tiers cause re-hammering. - Poison-rate → DLQ growth. Sweep poison 0.1% → 1% → 5% → 10% with a fixed triage/replay rate. Measure: DLQ arrival rate and total over time; find the poison rate where DLQ growth outpaces drain and grows unbounded. Plot it.
- Ordering impact. Keyed stream (orders per
customer_id). Run with retry topics and measure: how often a later message for a key is processed before an earlier failed one (out-of-order rate). Then enable per-key parking lock and re-measure ordering vs throughput cost. - DLQ replay correctness. Let the DLQ fill with 1% poison; "fix" the downstream (poison now succeeds); replay
--all. Prove: every replayed message applies exactly once (appliedledger has no duplicates), andsuccess + remaining_DLQ == produced. - Retry storm. At T+5m trigger a 60s full downstream outage. Measure: the synchronized retry wave when it recovers (with
jitter=none), then repeat withjitter=fulland with a circuit breaker in front. Show the breaker/jitter flattening the reload spike and preventing amplification. - Misclassification cost. Run
-classify=off(everything retries) vson. Measure: wasted retry volume on permanent errors and the delay-to-park difference (Σ tier delaysvs immediate DLQ).
11. Milestones¶
- Compose cluster + topics; producer with injectable poison; blocking-mode processor; Prometheus + Grafana board for per-topic rate/lag/DLQ.
- Non-blocking tiered retry topology + max-retries→DLQ; experiment 1 (the head-of-line proof).
- Backoff+jitter and error classification; experiments 2–3, 8.
- Idempotent apply + DLQ inspect/replay tool; experiment 6 (replay correctness).
- Ordering, DLQ-growth, and retry-storm investigations; experiments 4, 5, 7; findings note.
12. Acceptance criteria (definition of done)¶
- Side-by-side run proving non-blocking holds flat lag while blocking mode's lag climbs under one poison message (dashboard screenshot).
- No-loss invariant shown:
success + DLQ == producedexactly, after a run with mixed transient + poison + a downstream outage (show the counts). -
max_retriesproven hard: no message exceeds the tier budget; time-to-DLQ equalsΣ tier delaysand is reported. - DLQ-growth curve plotted across poison rates, with the divergence point named.
- Replay proven idempotent: replay
--all, thenappliedledger has zero duplicateorder_ids (show the SQL/diff). - Retry-storm experiment shows jitter + circuit breaker flattening the synchronized reload (before/after spike).
- Ordering trade-off measured both ways (out-of-order rate vs per-key-lock throughput cost).
- Every number reproducible from a committed command + config + seed.
13. Stretch goals¶
- Adaptive max-retries: vary the attempt budget by
x-error-class(more patience for transient, immediate park for permanent). - Auto-replay with guardrails: a controller that drains the DLQ at a capped rate once the downstream is healthy, with a kill-switch if failures recur.
- DLQ alerting on arrival-rate (not total) + a "DLQ growing unbounded" SLO burn alert; show it firing in the poison-sweep.
- Bloom-filter dedup instead of the
appliedtable; measure memory vs the correctness window and the false-positive risk on replay. - Per-tenant DLQs / fair triage so one noisy producer can't starve another's replay budget (ties into
resilience/04-hierarchical-multitenant-quotas).
14. Evaluation rubric¶
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Topology design | Builds tiered retry + DLQ that works | Defends non-blocking over blocking with head-of-line numbers; justifies tier delays |
| Head-of-line blocking | Knows in-place retry blocks the partition | Proves it with lag curves; shows non-blocking eliminates it |
| Error classification | Splits transient vs permanent | Measures misclassification cost both ways; tunes the boundary |
| No-loss / no-spin | Messages end in success or DLQ | Proves success+DLQ==produced and max_retries hard through an outage |
| Retry storms | Adds backoff + jitter | Explains synchronized reload; pairs jitter with a circuit breaker and measures the flattened spike |
| Ordering | Notices retry topics reorder a key | Quantifies out-of-order rate; chooses (and defends) sacrifice-vs-lock |
| Idempotent replay | Replay works in happy path | Proves replay is a no-op for partially-applied work |
| Communication | Clear findings note | Could defend every curve (DLQ growth, storm reload) to a staff panel |
15. References¶
- Kafka docs: consumer offset management, headers,
pause/resumefetch. - Uber Engineering — "Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka" (the canonical tiered-retry-topic pattern).
- Confluent — "Error Handling Patterns for Apache Kafka" (retry topics, DLQ).
- AWS Builders' Library — "Timeouts, retries, and backoff with jitter" (full jitter and why synchronized retries form a storm).
- Designing Data-Intensive Applications — Ch. 11 (stream processing, at-least-once + idempotence).
twmb/franz-goconsumer + producer + record-header examples.- See also: sibling
resilience/03-circuit-breaker-bulkhead-timeout(retry storms ↔ circuit breaking) andsenior/03-durable-job-queue(retries/backoff/DLQ in a non-Kafka queue). - See also:
Interview Question/11-messaging-and-event-streaming/.