Event Schema Registry & Contract Evolution¶
Events are an API — and most teams ship them with no compiler. Stand up a schema registry, encode events with Avro/Protobuf, then evolve the contract under a fleet of producers and consumers running mixed versions — with zero downtime and zero broken consumers. Prove which changes are safe, which are not, and why.
| Tier | Lab (event-engineering) |
| Primary domain | Event contracts / schema governance |
| Skills exercised | Schema registry, Avro/Protobuf, compatibility modes, wire format, serde caching, rolling upgrades, CI contract checks, Go (hamba/avro, linkedin/goavro, protobuf) |
| Interview sections | 10 (API design), 11 (messaging), 12 (architecture) |
| Est. effort | 3–5 focused days |
1. Context¶
You own the orders event stream at a company where 40+ services in 6 languages consume it. Last quarter a backend team added a non-nullable field to the OrderPlaced event "because it's just a struct change," and three downstream consumers crashed in a deserialization loop at 02:00 — the analytics pipeline, the fraud scorer, and the notification fan-out. The post-mortem conclusion: the event payload is a public API with no contract enforcement and no review gate.
Your job in this lab is to make event schemas a governed contract. You will run a schema registry (Confluent Schema Registry, or build a minimal compatible one), encode events with Avro and Protobuf, and then deliberately evolve schemas — add, remove, rename, retype fields — across a live fleet of producers and consumers on mixed versions. You will produce a table of which change is safe under which compatibility mode, backed by runs, not by the docs. And you will prove the registry never becomes the throughput bottleneck.
2. Goals / Non-goals¶
Goals - Run a schema registry and put compatibility enforcement in the produce path: an incompatible schema registration is rejected, not silently accepted. - Implement the Confluent wire format (magic byte 0x00 + 4-byte big-endian schema id + payload) and a Go serde with a schema-id cache so the registry is hit once per schema, not once per message. - Build a mixed-version fleet (N producers, M consumers on different schema versions) and survive a rolling upgrade with zero broken consumers and zero consumer lag spike attributable to schema handling. - Empirically map every change class (add optional, add required, remove, rename, retype) to each compatibility mode (BACKWARD, FORWARD, FULL, *_TRANSITIVE) and state the upgrade order (consumers-first vs producers-first) each mode forces. - Replay historical events written under old schemas with new code, and read new events with old code — both must succeed where the mode promises it.
Non-goals - Building a UI/admin console for the registry — CLI + HTTP is enough. - Cross-format migration (Avro → Protobuf on the same topic). Pick a format per topic; compare them on separate topics. - A full streaming app — the consumer just deserializes and asserts; correctness of business logic is out of scope, contract correctness is the whole point.
3. Functional requirements¶
- A registry (
cmd/registryif self-built, else Confluent in compose) exposing the standard REST surface:POST /subjects/{subject}/versions,GET /schemas/ids/{id},POST /compatibility/subjects/{subject}/versions/{v}, and per-subject compatibility-mode config (PUT /config/{subject}). - A producer (
cmd/producer) that, given a schema file, registers it (which must pass the subject's compatibility check), caches the returned id, and emits framed records (magic + id + payload) to Kafka. - A consumer (
cmd/consumer) that reads the 5-byte header, fetches the writer schema by id (cached), and deserializes against its own reader schema. On a schema it cannot resolve, it does not crash the group — it routes to a quarantine path and increments a metric. - Both Avro and Protobuf code paths, switchable by topic/flag, over identical logical events so the comparison is apples-to-apples.
- A CI gate (
cmd/compat-check+ a Makefile target) that, given a proposed schema and the registered latest version(s), exits non-zero on an incompatible change before merge. - A chaos/upgrade driver (
cmd/rollout) that runs a mixed-version fleet and performs a rolling upgrade in a chosen order while load is live.
4. Load & data profile¶
- Topology: N = 24 producer instances and M = 48 consumer instances (split across ≥2 consumer groups), so a rolling upgrade always leaves old and new schema versions live simultaneously for minutes, not seconds.
- Throughput: sustained ≥ 200k events/s aggregate on the hot produce path; single run ≥ 20 minutes. Event payload ~512 B (an
Orderwith ~12 fields, a nestedLineItem[], and an enumstatus). - Schema churn: at least 6 schema versions of the
Ordersubject created during the lab, exercising each change class at least once. - Historical corpus: ≥ 500M events persisted across ≥3 distinct schema versions on disk/topic, for the replay experiment (mix versions within one partition so a single consumer pass sees them interleaved).
- Generator:
cmd/genis deterministic given a seed; it can emit any registered schema version on demand so you can synthesize mixed-version streams.
5. Non-functional requirements / SLOs¶
| Metric | Target |
|---|---|
| Registry lookup p99 (warm schema-id cache) | < 1 ms (in-process cache hit; effectively a map lookup) |
| Registry lookup p99 (cold, cache miss → HTTP round-trip) | < 25 ms; must occur ≤ once per schema id per process |
| Serde + framing overhead on produce hot path | < 8 µs/event median, < 30 µs p99 over a raw-bytes baseline; report both |
| Sustained produce throughput with serde + cache on | ≥ 200k events/s aggregate; serde must not be the bottleneck (prove it) |
| Cache-miss amplification under N=24 producers cold-start | Registry QPS bounded by #processes × #distinct schemas, not by event rate |
| Broken consumers during a rolling upgrade | Zero — deserialize_errors_total == 0 for any mode-compliant change |
| Consumer lag during rolling upgrade | No sustained rise attributable to schema handling; bounded and recovers |
| Historical replay across schema change | 100% of ≥500M old-format events deserialize with new reader schema (mode-permitting) |
The point is not a magic latency — it's that the registry sits off the hot path after warm-up, and that every compatibility claim is backed by a run where
deserialize_errors_total == 0(or a clearly-explained nonzero you predicted).
6. Architecture constraints & guidance¶
- Registry: Confluent Schema Registry via
docker-compose(pin the version), or a self-built minimal registry. If you self-build, you must implement compatibility checking yourself — that is the interesting part. - Kafka: 3 brokers, KRaft mode, pinned version. The registry is the subject of the lab, not Kafka.
- Go clients:
- Avro:
hamba/avro/v2(fast, good schema-resolution support) orlinkedin/goavro(canonical, schema-evolution-aware). State which and why. - Protobuf:
google.golang.org/protobufwith generated types; for registry integration use the Confluent Protobuf serde framing (message-index prefix). - Confluent serde behavior in Go: model it on
confluentinc/confluent-kafka-goserde, but a hand-rolled framing codec is encouraged so you see the bytes. - Caching: two caches per process — schema → id (write path) and id → parsed schema (read path). Both unbounded-but-tiny (schemas are few); a cold miss is the only time you touch the registry. Instrument hit ratio.
- Instrument with Prometheus: registry RPS, cache hit ratio, serde latency histogram (separate from network),
deserialize_errors_total{subject,reason}, per-consumer reader-schema version, produce throughput, consumer lag.
7. Data model¶
The logical event (rendered in both Avro and Protobuf):
Order v1 (Avro):
{ order_id string, account_id long, status enum{PLACED,PAID,SHIPPED},
amount_cents long, currency string, ts long,
items array<{ sku string, qty int, price_cents long }> }
The wire frame on every Kafka record value (Confluent format):
0 1 5
+--------+-----------------+------------------------------------+
| 0x00 | schema id (BE) | Avro/Protobuf-encoded payload |
+--------+-----------------+------------------------------------+
magic 4-byte uint32 (Protobuf adds a varint message-index array here)
The registry's subject/version/id model:
subjects(subject TEXT, compatibility TEXT) -- e.g. orders-value, BACKWARD
versions(subject, version INT, schema_id INT) -- ordered history per subject
schemas(schema_id INT PK, schema_text TEXT, format TEXT)-- global, dedup by canonical form
The writer schema (id on the wire) and the reader schema (compiled into the consumer) are generally different versions — Avro resolves between them. That gap is the entire game.
8. Interface contract¶
- Registry (Confluent-compatible subset):
POST /subjects/{subject}/versions→{ "id": N }(registers; 409/422 on incompatible).GET /schemas/ids/{id}→{ "schema": "...", "schemaType": "AVRO|PROTOBUF" }.POST /compatibility/subjects/{subject}/versions/{v}→{ "is_compatible": bool }.PUT /config/{subject}{ "compatibility": "BACKWARD|FORWARD|FULL|...TRANSITIVE" }.- Producer/consumer flags:
-format=avro|protobuf,-schema=path,-subject,-compat,-reader-version,-rate,-instances. cmd/compat-check -subject orders-value -proposed new.avsc→ exit 0/1 + a human-readable diff of what broke.GET /metrics→ Prometheus exposition on every binary.
9. Key technical challenges¶
- Keeping the registry off the hot path. A naive serde fetches the schema per message and turns the registry into a single point of throughput collapse. The schema-id cache is load-bearing — and a cold fleet of 24 producers must not stampede the registry. Quantify the with/without-cache delta.
- Compatibility direction is not intuitive.
BACKWARDmeans new reader can read old data → upgrade consumers first.FORWARDmeans old reader can read new data → upgrade producers first. Getting this backwards is exactly how the 02:00 outage happened. You must derive the order, not memorize it. - Avro defaults vs Protobuf field presence. Avro tolerates a removed field on read only if the reader has a default; "add a required field" is unsafe precisely because old data has no value for it. Protobuf's "everything optional, field numbers are the contract" model permits different changes — and forbids others (reusing a field number is catastrophic). Compare them head-to-head.
- Mixed-version steady state. During a rolling upgrade, both directions are exercised at once: new consumers read old producers' data and old consumers read new producers' data, for the whole rollout window. Only
FULLsurvives arbitrary ordering — and that's the trade-off you must articulate. - Reading the past. Historical replay is forward/backward compatibility under a microscope: 500M records written by long-dead schema versions must still decode with today's reader, or your event log is a liability, not an asset.
10. Experiments to run (break it / tune it)¶
Record before/after numbers for each.
- Serde + registry overhead, cache on vs off. Disable the schema-id cache and measure produce throughput, registry RPS, and serde p99; re-enable it. Report the throughput cliff and the registry-RPS reduction. Measure: events/s, registry RPS, serde latency histogram, cache hit ratio.
- Cold-fleet stampede. Cold-start all 24 producers simultaneously against an empty cache. Measure: peak registry RPS, p99 of first-event latency, and confirm steady-state registry RPS ≈ 0 once warm.
- Change-class × compatibility-mode matrix. For each change — add optional field (with default), add required field (no default), remove field, rename field, change type (e.g.
int→long,int→string) — attempt to register underBACKWARD,FORWARD, andFULL. Measure: registration accepted/rejected, and then a live read test proving it actually decodes (or the exact deserialization error). Fill the matrix from runs, not docs. - Upgrade-order proof per mode. Under
BACKWARD, do a rolling upgrade producers-first (the wrong order) and show old consumers breaking; then consumers-first and show zero errors. Reverse the experiment forFORWARD. Measure:deserialize_errors_totalover the rollout window for each order. - Mixed-version live rollout. With N=24/M=48, run a 15-minute rolling upgrade under each mode while producing ≥200k/s. Measure: error count, lag curve, and the duration both versions coexist. Show
FULLsurvives arbitrary order. - Avro vs Protobuf head-to-head. Same logical change set on two topics. Measure: payload size, serde CPU/latency, and which change classes each format permits (e.g. Protobuf field-number reuse vs Avro rename-by-alias).
- Historical replay across a schema change. Replay ≥500M events spanning ≥3 writer versions through a single new-reader consumer. Measure: % decoded, error count by writer version, and throughput vs a single-version replay.
- CI gate enforcement. Submit a known-breaking schema through
cmd/compat-checkin a Makefile/CI target. Measure: non-zero exit, and the produced diff naming the offending field — then confirm the registry would also have rejected it (defense in depth).
11. Milestones¶
- Compose up: Kafka (KRaft) + registry; producer/consumer with the 5-byte frame; Prometheus + a Grafana board for registry RPS, cache hit ratio, serde latency.
- Schema-id caches both directions; experiment 1–2 (overhead + stampede) with numbers; prove registry is off the hot path warm.
- Compatibility matrix (experiment 3) for Avro;
cmd/compat-check+ CI gate (experiment 8). - Rolling-upgrade order proofs and mixed-version live rollout (experiments 4–5) under BACKWARD/FORWARD/FULL.
- Protobuf path + format bake-off (experiment 6); historical replay (experiment 7); findings note.
12. Acceptance criteria (definition of done)¶
- The 5-byte Confluent frame is implemented and decoded from raw bytes (show a hexdump of one record annotated: magic, id, payload).
- Warm registry-lookup p99 < 1 ms and serde overhead reported against a raw-bytes baseline; cache hit ratio > 99.99% at steady state.
- A completed change-class × mode matrix, each cell backed by a registration result and a live decode test (not just "the docs say so").
- A rolling upgrade in the correct order per mode with
deserialize_errors_total == 0, and the wrong order shown breaking on purpose, both with screenshots. - A 15-min mixed-version live rollout at ≥200k/s under FULL with zero broken consumers and a flat (recovering) lag curve.
- Historical replay of ≥500M events across ≥3 writer versions: 100% decode under the mode that permits it, with per-version counts.
- CI gate rejects a breaking change with a useful diff; every number is reproducible from a committed command + config.
13. Stretch goals¶
- Self-built registry with full compatibility checking (Avro schema resolution
- Protobuf descriptor diff) and
*_TRANSITIVEmodes; diff it against Confluent's verdicts on the same schema pairs. - Schema references (a shared
Addresstype referenced by multiple subjects) and the compatibility implications of evolving a referenced schema. - Soft-delete + alias-based rename in Avro: show a "rename" done safely via
aliases+ default, surviving both directions. - Contract testing across repos: a consumer publishes its reader schema; the producer's CI checks compatibility against all registered reader schemas (consumer-driven contracts), not just the latest.
- Wire-format fuzzing: feed truncated/garbage frames and prove the consumer quarantines rather than crashes (DLQ tie-in to lab
events/05).
14. Evaluation rubric¶
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Compatibility modes | Knows backward/forward/full exist | Derives the upgrade order each forces and proves it with a broken-then-fixed rollout |
| Wire format | Uses a serde library | Implements + reads the magic-byte/id frame from raw bytes; explains schema-id resolution |
| Hot-path performance | Adds a schema-id cache | Proves the registry is off the hot path; quantifies cache-on/off cliff and bounds cold-fleet RPS |
| Change safety | Knows "add required = bad" | Produces the full change×mode matrix from runs; explains Avro defaults vs Protobuf field-number semantics |
| Format judgment | Can use Avro or Protobuf | Recommends one per use case with measured size/CPU/evolution trade-offs |
| Evolution under load | Survives a single rollout | Survives mixed-version live rollout + 500M-event historical replay; defends FULL's cost |
| Communication | Clear findings note | Could defend the matrix and the outage post-mortem to a staff panel |
15. References¶
- Confluent: "Schema Registry Concepts", "Schema Evolution and Compatibility", and the wire-format spec (magic byte + schema id framing).
- Avro spec — schema resolution rules (reader vs writer schema, defaults, aliases).
- Protocol Buffers — "Updating A Message Type" (field numbers, reserved, presence).
hamba/avro,linkedin/goavro,google.golang.org/protobuf,confluentinc/confluent-kafka-go(serde framing) docs.- Designing Data-Intensive Applications — Ch. 4 (encoding & evolution).
- See also:
Interview Question/10-api-design/(events-as-API, versioning, contract evolution) andInterview Question/11-messaging-and-event-streaming/(schema registry, compatibility, replay).