Pub/Sub — Professional¶

Focus: staff/principal-level decisions. Pub/Sub at scale is not a library; it is an architectural commitment about who owns truth, who owns latency, and who owns replay. The Go code is the easy part. The hard parts are idempotency under retries, schema evolution across teams, and the operational shape of a system whose most important property is what happens when a broker partition stops responding for 90 seconds at 03:00. Opinionated where the field agrees, explicit about trade-offs where it does not.

1. Pub/Sub as a system primitive¶

Pub/Sub is one member of a family of inter-service primitives. Conflating them is the most common architectural mistake; you cannot retrofit one into another after the call sites are written.

Primitive	Coupling	Delivery model	Consumer knowledge	Typical use
RPC (gRPC, HTTP)	Tight: caller knows callee	Sync request/response	1 known callee	"Charge this card and tell me the result"
Work queue (SQS, Rabbit work)	Loose on identity, tight on contract	Async, exactly-one-consumer	N anonymous workers	"Resize this image" — any free worker
Pub/Sub topic (Kafka, NATS)	Loose: publisher does not know subscribers	Async, fan-out, at-least-once	0..N independent subs	"An order was placed" — accounting, search, fraud, audit all react
Event bus (in-process)	Same binary	Async, fan-out, lossy	0..N handlers	Domain events, internal cache invalidations
Event log (Kafka with retention)	Loose, with replay	Append-only, position-addressable	Anyone, at any past offset	Event sourcing, late-joining consumers

Four distinctions matter operationally:

Who knows whom. RPC requires the caller to know endpoints; Pub/Sub requires the publisher to know nothing beyond the topic. The instant a publisher learns "subscriber X needs this", the abstraction has leaked.
Delivery cardinality. Work queues are 1-of-N; topics are 1-to-all-subscribers; consumer groups are 1-of-N within a group. Mixing them is how payment-processed gets consumed twice by email-sender.
Time semantics. RPC has one deadline. Pub/Sub has eventual delivery. A successful Publish means the broker accepted the message — not that any consumer saw it.
Truth ownership. The broker is not the system of record. The producer's database is. Treating a successful publish as proof the side effect occurred is the most common production failure.

A useful rule: RPC when the caller needs the result inline; work queue when one party needs work done; Pub/Sub when several independent parties need to know something happened. The third case is the broadest and the most abused — every "let's just emit an event" needs a written list of consumers.

2. Quantitative cost analysis¶

Numbers below are Go 1.22, amd64, NATS JetStream 2.10 / Kafka 3.7 on a 10 GbE LAN.

2.1 In-process baseline¶

BenchmarkChannelSendRecv      50 ns/op      0 B/op  0 allocs   (buffered chan)
BenchmarkBrokerFanout4       210 ns/op    128 B/op  1 alloc    (in-proc, 4 subs)
BenchmarkBrokerFanout64     2300 ns/op    128 B/op  1 alloc    (in-proc, 64 subs)

In-process Pub/Sub is ~50 ns per delivery per subscriber. Negligible against any I/O.

2.2 Brokered baseline¶

NATS Core, single subject, sync publish:                  ~80 µs round-trip
NATS JetStream, AckExplicit, durable consumer:           ~600 µs publish + ack
Kafka, acks=all, batch.size=16KB, linger.ms=5:           ~3 ms p50, ~12 ms p99
Kafka, acks=1, batch full of small msgs:                 ~250 µs amortized
Google Pub/Sub, gRPC streaming pull:                       ~5 ms p50, ~40 ms p99

The 10 000× gap between in-process and brokered is the load-bearing fact of this chapter. A brokered topic costs as much as a database write — because under durability it is a database write.

2.3 Serialization and the publish budget¶

Format	Encode (1 KB)	Decode	Wire size vs JSON	Schema
`encoding/json`	~1.8 µs	~3.5 µs	1.0×	Untyped
`vmihailenco/msgpack`	~0.6 µs	~0.9 µs	0.7×	Untyped
`protoc-gen-go`	~0.3 µs	~0.5 µs	0.5×	`.proto`
Avro (`hamba/avro`)	~0.4 µs	~0.6 µs	0.45×	`.avsc` + registry
FlatBuffers	~0.2 µs	0 µs (in-place)	0.6×	`.fbs`

For >100 k msg/s per topic, protobuf or Avro pay for themselves in bytes and CPU. For <1 k/s control events, JSON is fine. Mixing within one topic is a maintenance disaster — one format per topic, versioned. At 50 k events/s the producer budget is ~1 µs/event synchronous (struct + marshal + compression) plus an asynchronous ack pipeline. Beyond that means unbatched publishes or synchronous acks — both fixable.

3. Backpressure models — pull vs push, prefetch tuning¶

Backpressure answers "what does a slow consumer do to a fast producer?" Get it wrong and the broker is the unbounded queue.

Pull (Kafka, JetStream pull). Consumer requests N messages; backpressure is automatic.

cl, _ := kgo.NewClient(
    kgo.SeedBrokers("kafka-0:9092"),
    kgo.ConsumerGroup("orders-fraud"),
    kgo.ConsumeTopics("orders"),
    kgo.FetchMaxBytes(1<<20),                    // bytes-based prefetch
    kgo.FetchMaxWait(50*time.Millisecond),
)
for {
    fetches := cl.PollFetches(ctx)               // backpressure lives here
    fetches.EachRecord(func(r *kgo.Record) { handle(r) })
}

Push (NATS Core, RabbitMQ basic.consume). Broker delivers as fast as it can; backpressure is the in-flight cap plus TCP flow control.

sub, _ := js.PullSubscribe("orders.*", "fraud",
    nats.MaxAckPending(256),
    nats.AckWait(30*time.Second),
)

Prefetch tuning — the only formula you need. Apply Little's Law to the consumer: prefetch ≈ 2·λ·W. A 50 ms handler at 200 msg/s needs prefetch ≈ 20. Setting it to 1000 creates a 50-second head-of-line block when one message is slow.

Pathology	Cause	Fix
Head-of-line stall	Slow message holds N-1 others	Lower prefetch; separate slow topic
Idle consumer with backlog	Prefetch = 1 starves network	Raise to `2λW`
Memory blowup	Prefetch holds large messages in RAM	Bytes-based prefetch
Uneven partition load	Static assignment	Cooperative-sticky balancer

Per-key concurrency. Process N keys concurrently within a partition while preserving per-key order:

type KeyedWorker struct {
    queues map[string]chan *kgo.Record
    mu     sync.Mutex
}
func (k *KeyedWorker) Dispatch(r *kgo.Record) {
    k.mu.Lock(); defer k.mu.Unlock()
    ch, ok := k.queues[string(r.Key)]
    if !ok {
        ch = make(chan *kgo.Record, 32)
        k.queues[string(r.Key)] = ch
        go k.worker(ch)
    }
    ch <- r
}

This preserves the only ordering guarantee that matters (per-key) and exploits the easy parallelism (cross-key).

4. Idempotency at the subscriber — dedup store, idempotency key¶

At-least-once is the default for every production broker. Every consumer must be idempotent. The publisher cannot fix this for you.

Duplicate source	Mitigated by
Producer retry on transient error	Idempotent producer (Kafka `enable.idempotence=true`)
Broker rebalance	Cooperative rebalance; shorter session timeouts
Consumer crashed before committing offset	Idempotent handler
Network partition double-ack	Idempotent handler
Operator-driven replay	Idempotent handler

Idempotency lives in the consumer. Every event must carry a stable producer-assigned identifier — either a UUID v7 (monotonic, embeds timestamp) or a deterministic hash of business state (sha256(order_id || version)). The second is stronger: two duplicate publishes of the same logical event collapse to one key even when the producer retried after a timeout that had succeeded.

A dedup store records "I have processed key X" with expiry:

Store	Lookup latency	Best for
In-memory LRU + bloom	<1 µs	Stateless consumers, window ~minutes
Redis `SET key NX EX <ttl>`	~0.3 ms	Multi-instance consumers, window ~hours
DB table with unique constraint on `idempotency_key`	~1 ms	Already writing to a DB; same tx

The third is the most powerful — the side effect and the dedup are atomic:

func handleOrderCreated(ctx context.Context, db *pgxpool.Pool, ev Event) error {
    tx, err := db.Begin(ctx)
    if err != nil { return err }
    defer tx.Rollback(ctx)

    _, err = tx.Exec(ctx,
        `INSERT INTO processed_events (idempotency_key, event_type, processed_at)
         VALUES ($1, $2, now()) ON CONFLICT (idempotency_key) DO NOTHING`,
        ev.IdempotencyKey, ev.Type)
    if err != nil { return err }

    if _, err := tx.Exec(ctx,
        `INSERT INTO orders (id, customer_id, total_cents) VALUES ($1,$2,$3)
         ON CONFLICT (id) DO NOTHING`,
        ev.OrderID, ev.CustomerID, ev.TotalCents); err != nil { return err }

    return tx.Commit(ctx)
}

If the consumer crashes between ack and DB commit, the next delivery re-runs the transaction and the ON CONFLICT makes both steps no-ops. The handler is idempotent by construction, not by inspection. This is the only pattern that survives audit. Retention is bounded — typically 7 days; operators replaying older data use a separate consumer group.

5. Outbox pattern revisited — DB → broker reliable publish¶

The most common production bug in event-driven services: the database transaction commits and the publish fails (or vice versa). They are not in the same transaction. The outbox pattern makes the broker eventually-consistent with the database; the database is truth.

// WRONG — two systems, no atomicity
func placeOrder(ctx context.Context, db *pgxpool.Pool, broker Publisher, o Order) error {
    if err := db.QueryRow(ctx, "INSERT INTO orders ...").Scan(&o.ID); err != nil { return err }
    return broker.Publish(ctx, "order.created", o) // crash here = silent loss
}

The outbox table:

CREATE TABLE outbox (
    id              BIGSERIAL PRIMARY KEY,
    aggregate_type  TEXT NOT NULL,
    aggregate_id    TEXT NOT NULL,
    event_type      TEXT NOT NULL,
    payload         JSONB NOT NULL,
    headers         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at    TIMESTAMPTZ
);
CREATE INDEX outbox_unpublished_idx ON outbox(id) WHERE published_at IS NULL;

Producer code is one transaction; a separate publisher reads unpublished rows and marks them published only after broker ack:

func (p *OutboxPublisher) tick(ctx context.Context) error {
    rows, _ := p.db.Query(ctx,
        `SELECT id, event_type, payload, headers FROM outbox
         WHERE published_at IS NULL ORDER BY id LIMIT 200
         FOR UPDATE SKIP LOCKED`)
    defer rows.Close()
    for rows.Next() {
        var id int64; var typ string; var payload, hdr []byte
        rows.Scan(&id, &typ, &payload, &hdr)
        if err := p.broker.Publish(ctx, typ, payload, hdr); err != nil { return err }
        p.db.Exec(ctx, `UPDATE outbox SET published_at = now() WHERE id = $1`, id)
    }
    return nil
}

FOR UPDATE SKIP LOCKED lets multiple publishers coexist without coordinating. ORDER BY id preserves per-aggregate order because aggregate events are written in one transaction.

Approach	Latency	Operational cost
Polling outbox publisher	50–500 ms	Low — a goroutine
CDC (Debezium reading WAL)	10–50 ms	High — Kafka Connect, schema registry, WAL slot management

Polling is sufficient for most services; CDC is for shops already running Kafka Connect. Prune the table daily (published_at < now() - interval '7 days'); partition by day above tens of millions of rows.

6. CloudEvents spec — header conventions, traceparent¶

Every team that emits events invents its own envelope; the result is data, payload, body, event across services. CloudEvents (CNCF) is the only widely adopted standardization. Adopt it on day one.

{
  "specversion": "1.0",
  "id": "e1b8e7e2-3a4d-4f8a-9b2c-1234567890ab",
  "source": "/orders/api",
  "type": "com.example.order.created.v1",
  "datacontenttype": "application/json",
  "subject": "order/42",
  "time": "2026-05-28T14:23:09Z",
  "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
  "data": { "order_id": 42, "customer_id": 1001, "total_cents": 4999 }
}

specversion, id, source, type are mandatory. type should be <reverse-dns>.<aggregate>.<verb>.<version>. The v1 suffix is not optional — the day you need v2, every consumer must be able to filter. Without a version, you are renaming the topic.

traceparent carries the W3C trace context. Set it on every publish:

func publishWithTrace(ctx context.Context, broker Publisher, ev *cloudevents.Event) error {
    if sc := trace.SpanFromContext(ctx).SpanContext(); sc.IsValid() {
        ev.SetExtension("traceparent",
            fmt.Sprintf("00-%s-%s-%02x", sc.TraceID(), sc.SpanID(), sc.TraceFlags()))
    }
    return broker.Publish(ctx, ev)
}

On the consumer side, restore the trace context before invoking the handler:

func consume(ctx context.Context, ev *cloudevents.Event, handler Handler) error {
    if tp, ok := ev.Extensions()["traceparent"].(string); ok {
        if sc, err := parseTraceparent(tp); err == nil {
            ctx = trace.ContextWithRemoteSpanContext(ctx, sc)
        }
    }
    ctx, span := tracer.Start(ctx, "consume."+ev.Type())
    defer span.End()
    return handler(ctx, ev)
}

The result is a single trace spanning publisher → broker → all consumers. Headers carry metadata (routing, dedup keys, trace IDs); payload carries the business event. Mixing them — payload.trace_id or headers.customer_email — couples consumers to fields they should not need.

7. Schema registry — Confluent, Apicurio, Buf¶

Without a registry, schemas live in tribal knowledge. The day a producer renames customer_id to customer_uuid, every consumer breaks silently.

Registry	Format support	Best for
Confluent Schema Registry	Avro, JSON Schema, Protobuf	Kafka shops, hosted or self-host
Apicurio	Avro, JSON Schema, Protobuf, GraphQL, OpenAPI, AsyncAPI	Self-hosted, broader formats
Buf Schema Registry	Protobuf only	Go services with Protobuf; strong CI

Compatibility modes (Confluent semantics):

Mode	Producers can	Consumers must
Backward	Add optional fields; remove required	Read new with old code
Forward	Remove optional; add required	Read old with new code
Full	Both	Both
None	Anything	Cope

Backward is the safe default; Full is required for public events outside your control. Confluent's wire format prefixes the payload with a magic byte and 4-byte schema ID:

[0x00][schema_id:int32 BE][avro_payload...]

The consumer fetches schema schema_id (cached locally), decodes, runs. The consumer can read every version ever written with zero coordination. The single most valuable feature is the CI breaking-change check:

- run: buf breaking --against 'https://github.com/example/protos.git#branch=main'

A PR that breaks consumers' decoders is rejected before merge. No registry without this CI gate is worth the operational burden of the registry itself.

8. Multi-tenancy — topic namespacing, ACLs, rate limits¶

A multi-tenant Pub/Sub system is one broker cluster shared by N tenants. Without discipline, one tenant's runaway producer drowns the others.

Strategy	Topic name	Pros	Cons
Per-tenant topic	`tenant-42.orders.created`	Hard isolation, per-tenant ACL	Topic explosion (>10 k tenants)
Per-tenant partition	`orders.created`, key=`tenant_id:order_id`	One topic, per-tenant order	Bad tenant slows partition
Tenant prefix on header	`orders.created`, header `tenant=42`	Simple, scalable	Easy to forget filter; weak isolation

The practical compromise: per-tenant topic for high-value tenants (>1% of traffic), shared with header filter for the long tail. Kafka degrades around 200 k partitions per cluster.

ACLs enforce that a producer can write only its own tenant's topics; a consumer can read only its own. The broker enforces this, not the application:

kafka-acls --add --allow-principal User:tenant-42 \
    --operation Write --topic 'tenant-42.*' --resource-pattern-type prefixed

Quotas throttle a noisy tenant at the broker, not the consumer. Kafka has per-client-ID byte rates; NATS has account-level limits; Google Pub/Sub has per-topic quotas. Set per-tenant quotas at contracted throughput plus 2× burst. The 100th-percentile tenant should not consume >5% of cluster bandwidth. The schema registry must also be tenant-aware — Confluent RBAC and Apicurio group scoping handle this.

9. Observability deep dive — consumer lag, redelivery rate, p99 end-to-end¶

Metric	Type	Why it matters
`publish_latency_seconds`	Histogram	Producer-side health; spikes precede consumer spikes
`consumer_lag_messages` / `consumer_lag_seconds`	Gauge	The single most important metric — backlog growth
`consumer_redelivery_rate`	Counter ratio	Idempotency / poison-pill health
`end_to_end_latency_seconds`	Histogram	`consumer_received_time - event_time` — what users feel

var (
    publishLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name: "pubsub_publish_seconds",
        Buckets: prometheus.ExponentialBuckets(0.0001, 4, 12),
    }, []string{"topic"})
    consumerLag = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "pubsub_consumer_lag"}, []string{"topic", "group", "partition"})
    e2eLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name: "pubsub_end_to_end_seconds",
        Buckets: prometheus.ExponentialBuckets(0.001, 4, 14),
    }, []string{"topic", "group"})
)

Lag in messages vs seconds. Brokers report messages-behind natively. Users feel seconds. If the producer slows down, messages shrink while seconds grow. Alert on seconds, plot both.

# Backlog growing
deriv(kafka_consumergroup_lag{group="orders-fraud"}[10m]) > 0
  and kafka_consumergroup_lag{group="orders-fraud"} > 1000

# End-to-end p99 SLA
histogram_quantile(0.99,
  sum(rate(pubsub_end_to_end_seconds_bucket[5m])) by (le, topic, group)) > 5.0

# Redelivery storm — poison-pill or bad handler
rate(pubsub_redeliveries_total[5m]) > 10

traceparent headers (§6) give per-event traces across publisher and all consumers. Add span attributes for broker offset and consumer group; investigators find every consumer's view of one event in one query.

10. Failure modes — split-brain, broker partition, replay storms¶

Broker partition / split-brain. A Kafka controller failover or NATS JetStream quorum loss produces a window where producers and consumers see inconsistent metadata. Producers get NotLeaderForPartition and retry; consumers may receive duplicates across rebalance; committed offsets in the window may not survive. Defense is operational: idempotent producers, idempotent consumers (§4), and operators monitoring under_replicated_partitions and controller_health.

Replay storms. An operator runs kafka-consumer-groups --reset-offsets --to-earliest to debug a one-message bug; the group re-processes 30 days; the database is overwhelmed; downstream services slow. Has happened at every Kafka-using company. Defenses:

Rate-limited replay. Consumer-side throttle per partition.
Separate consumer group for replays. Never reset the live group.
Replay-aware handlers. Some side effects (sending email) must not be replayed:

const replayCutoff = "2026-05-25T00:00:00Z"
func (h *EmailHandler) Handle(ctx context.Context, ev *Event) error {
    cutoff, _ := time.Parse(time.RFC3339, replayCutoff)
    if ev.Time.Before(cutoff) { return nil } // replay; skip side effect
    return h.send(ctx, ev)
}

Poison pill. A malformed event makes the handler throw, redeliver, re-throw forever. The partition stops. Bounded redelivery → dead-letter topic:

func (c *Consumer) handle(ctx context.Context, ev *Event) error {
    if ev.DeliveryCount > 5 {
        return c.dlq.Publish(ctx, "orders.created.dlq", ev,
            map[string]string{"reason": "max_redeliveries"})
    }
    return c.handler(ctx, ev)
}

A DLQ nobody reads is a leak; assign owners and SLAs.

Zombie consumers. A consumer loses network past the session timeout; the broker rebalances; network recovers; the zombie tries to commit offsets it no longer owns. With cooperative rebalance and idempotent commits, this is a no-op; with eager rebalance and naive code, you get phantom commits and redeliveries. Use kgo.CooperativeStickyBalancer() or partition.assignment.strategy=cooperative-sticky.

Slow-fast skew. One instance in a group is on a slow VM; its partitions back up while others run fine. Monitor per-partition lag, not just per-group. Alert when one partition's lag is >5× the median.

11. Security — encryption at rest, TLS, mTLS, signed events¶

Every broker supports TLS. The defaults in Kafka, NATS, Rabbit, and Redis are plaintext; the burden is on you to flip the switch. mTLS authenticates the client to the broker with a cert — preferable to passwords because rotation is automated by cert-manager.

tlsCfg := &tls.Config{
    Certificates: []tls.Certificate{cert},
    RootCAs:      caPool,
    MinVersion:   tls.VersionTLS13,
}
cl, _ := kgo.NewClient(
    kgo.SeedBrokers("kafka-0:9093"),
    kgo.DialTLSConfig(tlsCfg),
    kgo.SASL(scram.Auth{User: "...", Pass: "..."}.AsSha512Mechanism()),
)

Encryption at rest. Filesystem-level (LUKS, dm-crypt) protects against disk theft, not a compromised broker. Application-level encryption is the strongest answer for PII — encrypt the payload before publish; only consumers with the key can read. Use envelope encryption: a per-event data key encrypts the payload; the data key is KMS-wrapped and carried in headers.

type EncryptedPayload struct {
    EncryptedKey []byte `json:"k"` // KMS-wrapped DEK
    Nonce        []byte `json:"n"`
    Ciphertext   []byte `json:"c"` // AES-GCM
}

Signed events. For events crossing trust boundaries (B2B, partner integrations), payloads should be signed by the producer's key; consumers verify before processing. Signature lives in headers (signature: <base64>, signature-key-id: <kid>); the public key is fetched from a JWKS endpoint. An unsigned event from a "trusted" partner is one DNS hijack away from compromise.

PII / right-to-be-forgotten. GDPR/CCPA require deletion on request; long-retention topics conflict. Two answers: crypto-shredding (encrypt PII with a per-subject key; deletion = throwing away the key; ciphertext remains in the log but is irrecoverable), or short retention + dedicated PII topics. Crypto-shredding is the only approach handling 7-year audit retention for non-PII fields with month-scale PII retention in the same stream.

12. Anti-patterns at scale + closing¶

Anti-pattern	Symptom	Fix
Publish-success = side-effect-success	Consumers never react; producer DB has the row	Outbox pattern (§5)
No consumer idempotency	Duplicate emails, double charges	Idempotency key + dedup store (§4)
Schemas in tribal knowledge	Renaming a field silently breaks consumers	Registry with CI breaking-change gate (§7)
One topic per "thing that happened"	Topic count in tens of thousands	Aggregate-typed topics; subject filtering
Prefetch "as high as possible"	Head-of-line blocking	Tune to `2λW` (§3)
No DLQ	Poison pill stalls a partition forever	Bounded redelivery → DLQ → operator review
Sync publish on every request	Producer p99 mirrors broker p99	Batched async; outbox for durability
Cross-service ordering assumptions	"But A's event arrived before B's" — no	Order is per-partition; design for eventual consistency
Commit before handler success	Lost messages on crash	Commit after side effect commits
No `traceparent` propagation	Debugging takes hours	Propagate W3C trace context (§6)
Plaintext broker traffic	Sniffable PII; broker pwn = company pwn	TLS mandatory; mTLS preferred (§11)
Long-retention PII topics	GDPR deletion needs log rewrite	Crypto-shredding or short retention
Replay against the live group	Replay flood overwhelms downstream	Separate replay group; rate-limit; time-cutoff handlers (§10)
No per-tenant quotas in shared cluster	One tenant's bug drowns every other tenant	Broker-level quotas (§8)
Broker as system of record	Lost a topic = lost the business	DB is truth; broker is transport
Headers bleeding into payload	Consumers coupled to producer routing concerns	Headers for routing/metadata; payload for the event
Synchronous chained Pub/Sub	`A → B → C → D`, each blocking	Use RPC for sync; Pub/Sub is for fan-out

The deepest anti-pattern: using Pub/Sub as glue between services that should not be coupled at all. A "let's just emit an event" without a written subscriber list is technical debt that accrues silently — three months later, deleting the event breaks something nobody remembers. Every event type needs a documented contract, set of consumers, and deprecation policy. The pattern is not the discipline.

Closing principles¶

A Pub/Sub system is a contract: producers promise events that mean what they say; consumers promise idempotent reaction; the broker promises at-least-once delivery and best-effort order. Honor all three.

The producer's database is truth; the broker is transport. If the broker loses messages, replay from the database. If the database loses rows, no broker can save you. Outbox is not optional — it is what makes the producer-side honest.
Idempotency is the consumer's only correctness invariant. At-least-once is the only guarantee that scales; every consumer must be designed for it. Idempotency key (§4.2) and dedup store (§4.3) are not nice-to-have observability — they are the only thing between you and duplicate charges.
Schemas are contracts; CI is enforcement. A schema registry without a breaking-change CI gate is a wiki page nobody reads. With it, an entire class of incident becomes a PR comment.
Observe lag in seconds, not just messages. Users feel latency; messages-behind lies during throughput dips. End-to-end histograms tell you what the customer sees.
Replay is a first-class operation. Design topics with replay in mind: idempotent handlers, time-cutoffs on side effects, separate consumer groups for replays. The day you need to debug a week-old bug, you won't have time to add these.
Security is the default, not the upgrade. Plaintext broker traffic, unsigned cross-trust events, and infinite-retention PII topics are open issues with a clock.
Multi-tenancy requires broker-enforced isolation. ACLs, quotas, per-tenant topics or partitions are the only defense against the noisy-neighbor incident that takes down every customer at once.

Get these right and Pub/Sub becomes invisible: producers publish, consumers react, traces light up across the system, the dashboard says lag is 12 seconds and stable. Get them wrong and the broker is the on-call incident — every consumer at lag 800 000, the producer still emitting because the database is fine, and the only fix is to drain the topic with a parallel consumer group while you figure out which deploy introduced the poison pill that has been replaying for two hours.