Event-Sourced Order System: CQRS + Saga + Outbox¶
Make the append-only event log the only source of truth, derive every read from it, and orchestrate a multi-service order/payment/inventory workflow with sagas — then live with the consequences at billions of events: replay cost, projection lag, schema evolution, and the eventual consistency your UI can no longer hide from.
| Tier | Staff (architecture / distributed) |
| Primary domain | Event sourcing & distributed transactions |
| Skills exercised | Event sourcing, CQRS, optimistic concurrency, snapshots, event versioning/upcasting, sagas & process managers, transactional outbox, idempotent projections, Go (pgx, Kafka) |
| Interview sections | 11 (messaging & event streaming), 12 (architecture), 13 (distributed systems) |
| Est. effort | 5–8 focused days |
1. Context¶
You own the order domain at a marketplace processing ~3M orders/day across order, payment, and inventory services. The current design is a CRUD orders table with a status column that three services race to update; nobody can answer "what did this order look like at 14:03?", a refund bug last quarter was unauditable, and a new finance read-model means another team wants its own denormalized view of the same data without taking a lock on your write path.
You're going to rebuild the order lifecycle as a fully event-sourced system. State is a fold over an append-only event log. Writes go through aggregates that emit events under optimistic concurrency; reads are served by independently scaled CQRS projections that the UI must tolerate as eventually consistent. Cross-service workflows (reserve stock → charge card → confirm, or compensate) run as sagas. Events leave the building through a transactional outbox.
This is a staff lab because the hard part isn't the happy path — it's replay at scale, versioning a schema while billions of old events sit immutable on disk, and knowing when event sourcing is the wrong answer. You will produce numbers, not opinions.
2. Goals / Non-goals¶
Goals - Build an event store (append-only, per-aggregate streams) holding ≥ 500M events and exercise it to billions via replay. - Enforce correctness on the write side with optimistic concurrency keyed on stream version; measure the conflict rate on a hot aggregate. - Serve reads from ≥ 2 independent projections kept in sync with bounded lag, and rebuild one from the full log on demand. - Bound replay cost with snapshots; quantify the saving. - Evolve the event schema with upcasting without rewriting history. - Run a saga / process manager across order/payment/inventory with compensation on a mid-workflow failure. - Publish via a transactional outbox; keep projections idempotent.
Non-goals - A full event-store product (EventStoreDB/Marten). Build the mechanics on Postgres (+ Kafka for transport) so you see the knobs. - UI work beyond a thin read API and a staleness probe. - Global multi-region (that's staff/04-multi-region-active-active).
3. Functional requirements¶
- A command side (
cmd/orderd) accepts commands (PlaceOrder,AddItem,Pay,Cancel), loads the aggregate by folding its stream, validates, and appends new events under an expected version. - An append API rejects a write whose
expected_versionno longer matches the stream head (optimistic-concurrency conflict → caller retries). - A projector (
cmd/projector) consumes the event log and maintains read models: at minimumorder_summary(UI) andfinance_daily(rollup). Each projection tracks its own checkpoint and is independently restartable. - A read API (
cmd/queryd) serves projections and exposes the projection'sas_ofposition so callers can reason about staleness. - A saga (
cmd/saga) drivesOrderPlaced → ReserveStock → ChargePayment → ConfirmOrder, with compensations (ReleaseStock,RefundPayment) when a downstream step fails or times out. - An outbox relay (
cmd/outbox) publishes appended events to Kafka exactly once per event from the relay's perspective (at-least-once delivery + dedup downstream). - A replay tool (
cmd/replay) rebuilds any projection from event 0 (or from the latest snapshot) and reports throughput and total time. - A chaos hook can kill the payment service mid-saga and inject a malformed payload to force a poison-message path.
4. Load & data profile¶
- Event volume: seed ≥ 500M events; demonstrate replay tooling against ≥ 2B (replayed, not all resident as distinct aggregates).
- Aggregates: ~50M order streams; average 8–12 events/stream, but a Zipfian hot tail — a few "house account" aggregates accrue 10k+ events each (this is what makes optimistic concurrency bite and snapshots matter).
- Command throughput: drive ≥ 5,000 commands/s sustained for ≥ 20 min, with a deliberately hot aggregate receiving a concentrated share to provoke version conflicts.
- Generator:
cmd/genis deterministic given a seed; emits a realistic mix of place/add/pay/cancel and the occasional already-stale command. - Traffic model: open-model (fixed command rate) so projection lag can build and be observed, not hidden by a closed-loop.
5. Non-functional requirements / SLOs¶
| Metric | Target |
|---|---|
| Command append p99 (cold aggregate, fold + append) | < 25 ms at 80% of command-throughput ceiling |
| Command append p99 (hot aggregate, 10k+ events, with snapshot) | < 40 ms; without snapshot, report the blowup |
| Optimistic-concurrency conflict rate (hot aggregate) | Measured & explained; retry policy keeps effective success ≥ 99% |
| Projection lag at steady state (rate below ceiling) | Bounded and flat; p99 event→projected < 2 s |
| Read-after-write staleness window | Measured distribution (p50/p99/max); UI strategy stated |
| Projection rebuild throughput (replay) | ≥ 50k events/s/projector; report total time for the full log |
| Snapshot saving | Replay time with vs without snapshots, same projection, quantified |
| Saga correctness under failure | After mid-saga payment kill: no stock leaked, no double charge; system reaches a consistent terminal state (confirmed or fully compensated) |
The point isn't a magic number — it's to find your system's numbers (conflict rate, lag, replay time, staleness window) and explain them.
6. Architecture constraints & guidance¶
- Event store on Postgres: a single
eventstable, append-only, with a per-stream monotonic version and a global sequence. Usepgx(v5) with the binary protocol; batch appends in one transaction per command. - Transport via Kafka for the outbox → projector/saga path. The DB is the source of truth; Kafka is derived and replayable from it.
- Keep command, query, projector, saga, and outbox as separate binaries so you can scale and kill them independently — that separation is the CQRS story.
- No reads off the write tables in the hot path. If the UI reads from
events, you've defeated the exercise. - Instrument with Prometheus: command rate, append p50/p99, conflict rate, per-projection lag (global-seq head − checkpoint), event→projected latency, outbox backlog, saga in-flight/compensating counts.
7. Data model¶
-- Append-only event store; per-aggregate streams + global ordering.
events (
global_seq BIGINT GENERATED ALWAYS AS IDENTITY, -- global order for projectors
stream_id UUID NOT NULL, -- aggregate id (one order)
version INT NOT NULL, -- per-stream, 1..N (gapless)
event_type TEXT NOT NULL, -- "OrderPlaced", ...
event_schema INT NOT NULL DEFAULT 1, -- payload schema version (upcasting)
payload JSONB NOT NULL,
metadata JSONB NOT NULL, -- correlation_id, causation_id, actor
occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (stream_id, version) -- optimistic concurrency guard
);
CREATE UNIQUE INDEX ON events (global_seq);
-- Append is: INSERT ... where version = expected+1; a duplicate (stream_id,version)
-- raises unique violation => concurrency conflict => caller reloads & retries.
snapshots ( -- bound replay cost per aggregate
stream_id UUID PRIMARY KEY,
version INT NOT NULL, -- snapshot is valid up to this version
state JSONB NOT NULL, -- folded aggregate state
taken_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Read models (CQRS). Each owns its checkpoint; independently rebuildable.
order_summary ( -- UI projection
order_id UUID PRIMARY KEY, status TEXT, total_cents BIGINT,
item_count INT, updated_seq BIGINT
);
finance_daily (day DATE, gross_cents BIGINT, refunds_cents BIGINT, PRIMARY KEY(day));
projection_checkpoint (name TEXT PRIMARY KEY, last_seq BIGINT NOT NULL);
outbox ( -- written in the SAME tx as events
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
global_seq BIGINT NOT NULL, topic TEXT, key BYTEA, value BYTEA,
published_at TIMESTAMPTZ -- NULL until relay confirms
);
-- Saga / process manager state.
saga_instance (
saga_id UUID PRIMARY KEY, order_id UUID NOT NULL,
step TEXT NOT NULL, -- reserving|charging|confirming|compensating|done
state JSONB NOT NULL, -- reservation_id, payment_id, attempts
timeout_at TIMESTAMPTZ, updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
The idempotency rule for projectors: apply only events with global_seq > checkpoint.last_seq and advance the checkpoint in the same transaction as the projected write. Re-delivery is then a no-op.
8. Interface contract¶
Command API (cmd/orderd)
POST /orders { items:[...] } -> 201 { order_id, version }
POST /orders/{id}/pay { amount_cents, idempotency_key } -> 200 { version } | 409 (version conflict)
POST /orders/{id}/cancel -> 200 { version }
expected_version (or If-Match); a stale write returns 409 and the body states the current head version. Query API (cmd/queryd)
GET /orders/{id} -> { order_id, status, total_cents, item_count, as_of_seq }
GET /finance/daily/{d} -> { day, gross_cents, refunds_cents, as_of_seq }
GET /staleness -> { write_head_seq, projection_seq, lag_events, lag_ms }
Event schema (versioned, upcastable)
// event_schema = 1
{ "type":"OrderPlaced", "order_id":"…", "customer_id":"…", "items":[{"sku":"…","qty":2}] }
// event_schema = 2 (added currency; v1 upcasts to v2 by defaulting currency="USD")
{ "type":"OrderPlaced", "order_id":"…", "customer_id":"…", "currency":"USD", "items":[…] }
upcast(raw) -> latestEvent stage runs before any fold. Old bytes on disk are never rewritten. 9. Key technical challenges¶
- Optimistic concurrency on a hot aggregate. Concentrated writes to one stream collide on
(stream_id, version). You must measure the conflict rate, choose a retry policy (bounded retries + jitter), and decide whether the hot "house account" should be modeled differently (split aggregate, or a counter that isn't event-sourced). Knowing when the aggregate boundary is wrong is the staff signal. - Replay cost vs snapshots. Folding a 10k-event stream on every command is the latency killer. Snapshot cadence (every N events / on size) trades write amplification against replay cost — find the knee.
- Projection lag & idempotency. Re-delivery, restarts, and rebuilds must not double-apply. The checkpoint-in-same-transaction discipline is the only thing standing between you and a corrupted read model.
- Eventual consistency the UI must tolerate. Read-after-write can show stale state. You must measure the staleness window and pick a strategy (read-your-writes via command-returned version, optimistic UI, or sticky read).
- Schema evolution without rewriting history. Upcasting must be correct for every historical version and cheap enough to run on the full-log replay path.
- Saga compensation correctness. A failure between "stock reserved" and "payment charged" must leave no leaked reservation and no orphan charge. Compensations are themselves events and must be idempotent.
10. Experiments to run (break it / tune it)¶
Record before/after numbers for each:
- Command throughput & conflict rate. Drive 5k cmd/s with a hot aggregate taking 20% of traffic. Plot append p99 and the optimistic-concurrency conflict rate vs hot-key concentration. At what concentration does retry storm dominate?
- Snapshot saving. Take a 10k-event aggregate. Measure command p99 and replay time without snapshots, then with snapshots every {100, 1k, 5k} events. Plot replay time and snapshot write-amplification; pick the cadence.
- Projection lag under write load. Ramp command rate to the ceiling and measure per-projection lag (events and ms). Where does the projector fall behind, and is it CPU, DB write, or single-partition ordering?
- Replay to rebuild a projection. Drop
order_summaryand rebuild it from the full log; report events/s and total time with vs without snapshots as the read source. Does the system stay available (old projection serving) during rebuild? - Event-version upcasting. Ship
event_schema=2(addcurrency). Prove oldv1events fold correctly via upcasting, then replay the full log through the new upcaster and confirm read models are identical for unaffected fields. - Saga compensation under injected failure. Kill the payment service after stock is reserved but before charge confirms. Prove the saga compensates (
ReleaseStock) and reaches a consistent terminal state. Show no leaked reservation, no double charge via the event log. - Read-after-write staleness window. Issue a command, then immediately poll the read API; record time-to-visibility distribution (p50/p99/max). Compare a naive read vs read-your-writes using the command-returned version.
11. Milestones¶
- Event store on Postgres (append + optimistic version);
cmd/orderdfolds and appends;cmd/gendeterministic load; Prometheus board for append p99 + conflicts. - Projector + two read models with checkpoint-in-tx idempotency;
queryd; first lag-under-load run (experiment 3). - Snapshots + replay tool; snapshot-cadence and rebuild experiments (2, 4).
- Outbox relay to Kafka; saga across order/payment/inventory with compensation; chaos run (experiment 6).
- Event versioning/upcasting (experiment 5); staleness measurement (7); findings note.
12. Acceptance criteria (definition of done)¶
- Event store holds ≥ 500M events; append enforces optimistic concurrency (show a 409 on a stale write and the gapless
(stream_id, version)). - Two independent projections served by
queryd, each rebuildable from the log; rebuild of one while the other serves traffic, with throughput reported. - Conflict-rate curve for the hot aggregate, with the chosen retry policy and its measured effective success rate.
- Snapshot experiment: replay time and command p99 with vs without snapshots, knee identified.
- Upcasting demonstrated across a real
event_schemabump; full-log replay through the new upcaster verified. - Saga chaos: payment killed mid-workflow → no leaked stock, no double charge, consistent terminal state, shown from the event log.
- Staleness window distribution reported, with the UI strategy named.
- Every number reproducible from a committed command + config.
13. Stretch goals¶
- Competing-consumer projectors: partition the log by
stream_idand run N projector instances; measure ordering guarantees and lag at N×. - Snapshot-on-read vs background snapshotter: compare write amplification and tail latency.
- Choreography vs orchestration: re-implement the saga as choreography (services react to events directly) and compare failure-mode clarity.
- Bi-temporal queries: "what did order X look like as of
?" served by folding to a point in time — and the cost of doing it at scale. - Outbox vs CDC: replace the polling outbox relay with logical-decoding CDC (cross-reference
events/01-cdc-pipeline-debezium) and compare publish latency.
14. Evaluation rubric¶
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Event-sourcing fundamentals | Commands→events, fold to state | Defends append-only as source of truth; knows when ES is the wrong choice (CRUD-shaped domain, no audit need, high-cardinality hot aggregate) and says so |
| Optimistic concurrency | Detects version conflicts | Measures conflict rate; chooses retry policy and re-models the hot aggregate when retries don't pay |
| CQRS / projections | Separate read model exists | Multiple independently-scaled projections, rebuilt from log with bounded lag; explains the eventual-consistency contract to the UI |
| Snapshots | Knows snapshots speed replay | Quantifies the cadence knee; balances write-amp vs replay cost |
| Versioning | Adds a field | Upcasts every historical version correctly without rewriting history; proves it on full-log replay |
| Sagas | Happy-path orchestration | Compensation is correct and idempotent under mid-workflow failure; no leaked/double effects, proven from the log |
| Outbox / idempotency | Publishes events | Checkpoint-in-transaction idempotent projections; survives re-delivery and rebuild |
| Communication | Clear findings note | Could defend every curve — and the decision not to event-source a sub-domain — to a staff panel |
15. References¶
- Fowler: Event Sourcing, CQRS; Vernon: Implementing Domain-Driven Design (aggregates, sagas/process managers).
- Designing Data-Intensive Applications — Ch. 11 (stream processing, derived state), Ch. 12 (the future of data systems).
- Microsoft patterns: Transactional Outbox, Saga, Compensating Transaction.
pgxv5 (binary protocol, batch); Kafka transactions/idempotent producer.- See also:
events/03-event-replay-and-reprojection(replay & reprojection at scale),senior/07-event-driven-order-payment-service(the service-level version of this domain),events/07-idempotent-inbox-outbox. - Interview prep:
Interview Question/11-messaging-and-event-streaming/,Interview Question/12-architecture/,Interview Question/13-distributed-systems/.