Durable Job Queue & Scheduler¶
Build a background-job system that runs each job exactly once at a time, never loses one across a worker crash, and drains a backlog of millions without a single double-run. Then find the throughput ceiling where claim contention — not your handlers — becomes the bottleneck, and prove it.
| Tier | Senior (own a service end-to-end) |
| Primary domain | Async task processing / messaging |
| Skills exercised | Worker pools, lease/claim semantics, idempotency, retries + backoff, scheduling, DLQ, graceful shutdown, Go (pgx SKIP LOCKED or go-redis streams) |
| Interview sections | 2 (concurrency), 11 (messaging), 15 (testing) |
| Est. effort | 4–6 focused days |
1. Context¶
Every product eventually grows a "do it later" pile: send the email, transcode the video, charge the card, rebuild the search index, fan out the webhook. The naive version is a goroutine and a time.Sleep, and it works until the process restarts mid-job and the charge is silently lost — or runs twice. You're now the engineer who owns the job queue that the whole company enqueues onto, and the contract is brutal in its simplicity: a job that was accepted gets run, and a job that's running isn't also running somewhere else.
In this project you build a durable, Postgres- or Redis-backed job system in Go with worker pools, retries with backoff, scheduling/delays, priorities, and a dead-letter queue — then you load it to 10k+ jobs/s with a backlog of millions and prove the guarantees hold while a worker is being killed under load. You will produce numbers and a job-loss test, not a happy-path demo.
2. Goals / Non-goals¶
Goals - Implement reliable claim/lease semantics so a queued job is run by exactly one worker at a time — SELECT ... FOR UPDATE SKIP LOCKED (Postgres) or Redis Streams consumer groups (XREADGROUP / XCLAIM). - Deliver at-least-once + idempotent handlers, and articulate honestly why "exactly-once execution" is an aspiration the system approximates, not a guarantee it provides. - Retries with exponential backoff + jitter, a max_attempts cap, and routing to a dead-letter queue on exhaustion. - Visibility timeout / lease renewal so long jobs aren't stolen mid-run, and crashed-worker jobs are reclaimed. - Scheduled/delayed jobs (run_at) and recurring cron jobs. - Priority queues, graceful shutdown (drain in-flight, requeue nothing twice), and poison-job containment. - Observability: queue depth, oldest-job age, throughput, retry rate, DLQ rate.
Non-goals - A managed queue (SQS, Cloud Tasks, Temporal). Build it so you own the claim loop and see every knob. - Cross-region / multi-datacenter replication (that's a staff concern). - A general workflow engine with DAGs and signals — single jobs only, no child-job orchestration beyond enqueue.
3. Functional requirements¶
- An enqueue API (
Enqueue(ctx, job)): accepts a job type, JSON payload, optionalrun_at, priority, and a caller-supplied idempotency key; returns a job ID. Enqueue is durable before it returns (committed row /XADDack). - A worker pool (
cmd/worker) with N goroutines per process and M processes, each claiming jobs, running the registered handler, and acking success or scheduling a retry. - A handler registry:
Register("send_email", fn)wherefnisfunc(ctx, payload) error. Returningnilacks; returning an error retries; returning a sentinelErrDropdead-letters immediately. - Retries: on failure, reschedule with exponential backoff + jitter up to
max_attempts, then move to the DLQ with the last error recorded. - Scheduler: delayed jobs become visible at
run_at; a cron registrar enqueues recurring jobs without double-firing across multiple worker processes (leader/lock orINSERT ... ON CONFLICT). - Lease renewal: a long-running job heartbeats to extend its lease; a worker that dies lets the lease expire and another worker reclaims the job.
- Graceful shutdown: on SIGTERM, stop claiming, finish in-flight jobs within a drain deadline, and leave the rest cleanly claimable by others.
- A chaos hook (
cmd/chaos) cankill -9a worker mid-job, flap a downstream dependency, and inject poison payloads.
4. Load & data profile¶
- Throughput: sustain ≥ 10,000 jobs/s enqueue + process end-to-end for ≥ 20 minutes; report the ceiling and what bounds it.
- Backlog: pre-seed ≥ 5,000,000 queued jobs and measure drain time and drain throughput from a cold start.
- Job mix: deliberately heterogeneous — 70% fast (1–5 ms), 20% slow (200 ms–2 s, exercise lease renewal), 5% flapping (fail then succeed), 5% poison (always fail → DLQ).
- Duration skew: at least one job class runs > visibility timeout so you must implement lease renewal correctly or it gets double-run.
- Concurrency: ≥ 8 worker processes × ≥ 64 workers each (≥ 512 concurrent claimers) hitting one backend — this is where claim contention shows up.
- Generator:
cmd/genis deterministic given a seed; payloads carry an embeddedseqso the harness can detect loss and duplication exactly.
5. Non-functional requirements / SLOs¶
| Metric | Target |
|---|---|
| Sustained processing throughput (mixed job mix, durable enqueue) | Find & report the ceiling; name the bound (claim contention? DB IOPS? handler CPU? lock waits?) |
| Claim latency p99 (worker asks for next job → has a leased job) | < 10 ms at 80% of throughput ceiling |
| Enqueue p99 (durably committed) | < 25 ms |
| At-least-once guarantee | Zero lost jobs after kill -9 mid-process: every accepted seq is completed or in DLQ |
| Double-run rate under chaos | Effectively zero — with idempotent handlers, observed effect of each seq is applied once; raw re-deliveries are counted and bounded |
Scheduled-job lateness p99 (fire delay past run_at) | < 1 s at 1M-deep backlog |
| Backlog drain throughput (5M cold backlog) | Report jobs/s and total drain time |
| DLQ correctness | Poison jobs land in DLQ after exactly max_attempts, never silently dropped |
| Graceful-shutdown drain | 100% of in-flight jobs finish or cleanly requeue; none lost, none acked twice |
The point isn't a magic jobs/s number — it's to find your backend's ceiling, explain it, and prove the loss/double-run invariants hold around it.
6. Architecture constraints & guidance¶
- Backend (pick one, justify it):
- Postgres +
pgx: onejobstable, claim viaSELECT ... FOR UPDATE SKIP LOCKED LIMIT $batch.SKIP LOCKEDis the whole trick — concurrent claimers step over each other's locked rows instead of blocking. Watch index bloat and dead-tuple churn on a high-update table. - Redis Streams +
go-redis:XADDto enqueue, consumer group viaXREADGROUP, reclaim stuck entries withXAUTOCLAIM/XPENDING,XACKon success. Delayed jobs ride a ZSET keyed byrun_at, swept into the stream. - Worker is a separate binary from any enqueuer so you can scale and kill workers independently.
- Batch claims (e.g. 50–500 rows per claim) to amortize round-trips; this is the single biggest throughput lever and the source of the contention ceiling.
- Run the backend via
docker-compose, version pinned. For Postgres, tuneautovacuumaggressively — a hot job table dies without it. - Instrument with Prometheus: enqueue rate, claim rate, ack rate, in-flight count, queue depth, oldest-job age, retry rate, DLQ rate, claim p99, lease renewals, reclaims.
7. Data model¶
Postgres (SKIP LOCKED) variant:
jobs(
id BIGSERIAL PK,
queue TEXT NOT NULL, -- e.g. 'default','email','critical'
type TEXT NOT NULL, -- handler key
payload JSONB NOT NULL,
idempotency_key TEXT, -- caller-supplied; UNIQUE per (queue,key)
priority INT NOT NULL DEFAULT 0,-- higher runs first
status TEXT NOT NULL DEFAULT 'ready', -- ready|running|done|dead
run_at TIMESTAMPTZ NOT NULL DEFAULT now(), -- delayed/scheduled visibility
attempts INT NOT NULL DEFAULT 0,
max_attempts INT NOT NULL DEFAULT 25,
leased_until TIMESTAMPTZ, -- visibility timeout / lease expiry
locked_by TEXT, -- worker id holding the lease
last_error TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
)
-- claim hot path:
CREATE INDEX ON jobs (queue, priority DESC, run_at)
WHERE status = 'ready'; -- partial index: only claimable rows
dead_jobs(... same shape ..., died_at TIMESTAMPTZ, final_error TEXT)
cron(name PK, spec TEXT, type TEXT, payload JSONB, next_run TIMESTAMPTZ)
Claim query (the core of the whole system):
WITH cte AS (
SELECT id FROM jobs
WHERE status = 'ready' AND run_at <= now() AND queue = $1
ORDER BY priority DESC, run_at
FOR UPDATE SKIP LOCKED
LIMIT $2
)
UPDATE jobs j SET status='running', attempts=attempts+1,
leased_until=now()+$3, locked_by=$4
FROM cte WHERE j.id = cte.id
RETURNING j.id, j.type, j.payload, j.attempts;
Redis Streams variant: stream jobs:{queue} + consumer group workers; XADD enqueue, XREADGROUP ... COUNT n BLOCK, XACK on done; XAUTOCLAIM for idle > visibility-timeout entries; a ZSET delayed:{queue} swept by a ticker into the stream when run_at <= now(); dead:{queue} stream for the DLQ.
8. Interface contract¶
Client / handler API (Go):
type Job struct {
Type string
Payload []byte
Queue string // default "default"
Priority int
RunAt time.Time // zero = now
MaxAttempts int // 0 = registry default
IdempotencyKey string // optional dedup on enqueue
}
func (c *Client) Enqueue(ctx context.Context, j Job) (id int64, err error)
type Handler func(ctx context.Context, payload []byte) error
func (w *Worker) Register(jobType string, h Handler) // ErrDrop → straight to DLQ
func (w *Worker) Run(ctx context.Context) error // blocks until ctx canceled, then drains
Operational endpoints: - GET /metrics → Prometheus exposition (all gauges/counters from §6). - GET /stats?queue=default → { ready, running, dead, oldest_age_s, attempts_p99 }. - POST /dlq/{id}/requeue → move a dead job back to ready with attempts reset. - Worker flags/env: -queues, -concurrency, -claim-batch, -visibility-timeout, -drain-timeout, -backend=pg|redis.
9. Key technical challenges¶
- One run at a time. The claim must be atomic: select-and-lease in a single transaction, or two workers grab the same job.
SKIP LOCKEDgives you this without a global lock; Redis gives it via consumer-group ownership. Get this wrong and your double-run test fails immediately. - At-least-once is the real contract. A worker can finish a job and die before acking — the job will be re-delivered. The only honest answer is idempotent handlers (dedup on idempotency key /
seq). Treat "exactly-once execution" as a property of the effect, achieved by the handler, not the queue. - Lease renewal vs. visibility timeout. Too short and long jobs get stolen and double-run; too long and a crashed worker's jobs sit dead for minutes. Heartbeating leases is the fix — and it must be correct under clock skew.
- The claim-contention ceiling. At 512 concurrent claimers on one Postgres table, throughput stops scaling with workers and starts being bound by row locks, index updates, and autovacuum. Bigger claim batches help — until they cause uneven work distribution and longer leases. Find the knee.
- Poison jobs. A job that always panics or always fails must not wedge a worker or spin the retry loop hot. Cap attempts, catch panics, route to DLQ, and rate-limit retry churn.
- Graceful shutdown without double-acking. Drain in-flight on SIGTERM, but a job whose lease you've already lost must not be acked by the dying worker.
10. Experiments to run (break it / tune it)¶
Record before/after numbers for each:
- Throughput vs. worker count: sweep total concurrency 16 → 64 → 256 → 512 →
- Plot processed jobs/s. Find where adding workers stops helping — that's the claim-contention ceiling. Name the bound (lock waits / IOPS / vacuum).
- Claim-batch sweep:
claim-batch= 1, 10, 50, 200, 1000 at fixed concurrency. Throughput and claim p99 vs batch size; find the knee and explain the latency/fairness cost of big batches. - Job-loss test (the headline): drive steady load,
kill -9a worker mid-process repeatedly. Afterward prove every acceptedseqisdoneor in DLQ — zero lost, and that idempotent handlers applied each effect once. Show the SQL/diff. - Backlog drain: pre-seed 5M jobs, start cold, measure drain throughput and total time. Compare to steady-state throughput — why is it different?
- Flapping dependency: point 5% of jobs at a dependency that fails for 60 s then recovers. Show backoff+jitter spreading retries (no thundering herd) and jobs completing after recovery, none hitting DLQ prematurely.
- DLQ growth under poison: inject 5% always-failing jobs. Confirm each lands in DLQ after exactly
max_attempts, retry CPU stays bounded, and healthy throughput is unaffected. - Scheduled-job latency at depth: schedule jobs for
now()+5swhile a 1M backlog churns. Measure fire-delay p99 pastrun_at— does the scheduler starve behind the backlog? - Graceful-shutdown correctness: SIGTERM a worker with 200 in-flight jobs. Prove every in-flight job either completed or was cleanly reclaimed by another worker — none lost, none acked twice.
- Lease-renewal proof: run jobs longer than the visibility timeout without renewal (expect double-runs), then with renewal (expect none). Show the delta.
11. Milestones¶
- Backend up via compose;
jobstable / stream; enqueue + single-worker claim loop withSKIP LOCKED; Prometheus + a Grafana board for depth/throughput/age. - Worker pool + handler registry + retries with backoff + jitter + DLQ.
- Visibility timeout + lease renewal + reclaim of crashed-worker jobs (experiment 9).
- Scheduler (
run_at) + cron without double-fire; priorities; graceful drain. - Load harness: 10k+ jobs/s, 5M backlog seed, chaos hooks. Run the headline job-loss and contention experiments (1, 3, 4); write the findings note.
- Remaining experiments (2, 5–8); ceiling curve with the bottleneck proven.
12. Acceptance criteria (definition of done)¶
- Sustained ≥ 20-min run at ≥ 10k jobs/s with bounded, flat queue depth; dashboard screenshot attached.
- Throughput ceiling reported with the bottleneck named and proven (pg lock-wait stats /
pg_stat_activity/ pprof / IOPS evidence). - Job-loss test passes: after repeated
kill -9mid-process under load, every acceptedseqisdoneor in DLQ — zero lost (show the SQL/diff). - No double-run effect: idempotency proven — each
seq's effect applied once despite re-deliveries; raw re-delivery count reported. - Lease renewal proven on jobs longer than the visibility timeout (experiment 9).
- Retries follow exponential backoff + jitter; poison jobs reach DLQ after exactly
max_attempts. - Scheduled-job lateness p99 < 1 s at 1M backlog; cron never double-fires.
- Graceful shutdown drains in-flight with zero loss and zero double-ack.
- 5M-job backlog drain throughput + time reported.
- Every number reproducible from a committed command + config + seed.
13. Stretch goals¶
- Both backends: implement Postgres and Redis Streams behind the same interface; compare throughput ceiling, claim p99, and operational pain.
- Fairness / weighted queues: prevent one tenant's flood from starving others (per-queue concurrency caps or weighted claiming); measure tail fairness.
- Adaptive backoff: circuit-break a job type whose dependency is down instead of retrying every instance; measure retry-CPU saved.
- Batched acks / claims pipelining to push the ceiling higher; quantify the gain and any correctness cost.
- Unique jobs: dedup enqueue so an identical pending job isn't queued twice (
ON CONFLICTon idempotency key); prove under concurrent enqueuers. - Exactly-once aspiration writeup: consume-process-produce with the ack and the side effect in one transaction; document precisely where it still can't be exactly-once and why at-least-once + idempotency is the right model.
14. Evaluation rubric¶
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Claim/lease correctness | One-run-at-a-time holds in the happy path | Holds through kill -9, lease expiry, and clock skew; explains why SKIP LOCKED / consumer-group ownership is correct |
| Delivery semantics | Knows it's at-least-once; handlers are idempotent | Articulates the exactly-once boundary precisely; proves no observable double-effect |
| Throughput analysis | Reports a jobs/s ceiling | Names and proves the claim-contention bottleneck; knows the next one and how to push past it |
| Retries & DLQ | Backoff + jitter + max-attempts + DLQ work | Tunes backoff for a flapping dependency without herds; contains poison without hurting healthy throughput |
| Long jobs | Implements visibility timeout | Lease renewal correct; demonstrates the double-run it prevents |
| Scheduling | Delayed + cron jobs fire | No double-fire across processes; bounded lateness under deep backlog |
| Shutdown | Drains in-flight on SIGTERM | Zero loss and zero double-ack proven under chaos |
| Communication | Clear findings note | Could defend every curve and the loss-test to a staff panel |
15. References¶
- Postgres:
FOR UPDATE SKIP LOCKEDsemantics; partial indexes; autovacuum tuning for high-churn tables. jackc/pgxdocs; prior art:riverqueue/river,contribsys/faktory,vgarvardt/gue(PostgresSKIP LOCKEDjob queues in Go).- Redis: Streams + consumer groups (
XADD/XREADGROUP/XACK/XAUTOCLAIM),go-redis(redis/go-redis); prior art:hibiken/asynq. - Designing Data-Intensive Applications — Ch. 11 (message brokers, delivery guarantees), Ch. 8 (process pauses, leases, clock skew).
- Retry backoff: AWS "Exponential Backoff and Jitter" (full-jitter).
- See also:
Interview Question/02-concurrency/,Interview Question/11-messaging-and-event-streaming/, andInterview Question/15-testing-and-quality/(loss/idempotency test design).