Fan-In — Professional Level¶
Table of Contents¶
- Introduction
- Production Case Study: Log Aggregator
- Production Case Study: Multi-Region Health Check
- Production Case Study: Search Aggregator
- Designing for Operability
- Stable Merge Variants
- Hot-Reload Producers
- Multi-Tenant Fan-In
- Backpressure Strategies under Outage
- Testing in CI and Staging
- Cost Modelling
- Migration Story
- Cheat Sheet
- Summary
Introduction¶
At the professional level, fan-in is part of a system. You operate it, scale it, debug it in production, and pay its costs. The merge function itself is unchanged from the senior level; what differs is the operational rigour around it. This file is a series of production case studies and the design decisions inside each.
Production Case Study: Log Aggregator¶
A central log shipper runs on each machine. It tails N log sources (stdout, stderr, journald, file tailers, an HTTP receiver) and ships every line to a remote collector.
Architecture:
[stdout tailer] ─┐
[stderr tailer] ─┤
[journald reader] ─┼──▶ Merge ──▶ batcher ──▶ HTTP shipper ──▶ collector
[file tailer 1] ─┤
[file tailer 2] ─┤
[HTTP listener] ─┘
Each producer goroutine emits parsed log records on its own channel. A merge fans them into one. A batcher accumulates records until it has 100 of them or 200ms has elapsed, then POSTs the batch.
Engineering decisions:
- Buffer per producer: 1024 records, so short consumer hiccups do not block log emission.
- Drop-on-full: high-volume sources (file tailers) use `select { case ch <- rec: default: counters.Drops.Inc() }`, so operators can see drops in metrics (see the sketch after this list).
- Hot-add producers: a control plane can add a new file tailer without restart. The merge is the supervisor pattern from senior.md.
- Order: cross-source order is irrelevant. Within a single source, order is preserved by the producer (one goroutine per source).
- Telemetry: each source tagged with a label. Per-label counters expose imbalance.
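A minimal sketch of that drop-on-full emission path, with `record` and an atomic counter standing in for the service's real record type and labelled metrics:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// record stands in for the shipper's parsed log record type.
type record struct{ line string }

// drops stands in for the per-source drop counter exposed to operators.
var drops atomic.Int64

// emit sends rec if the per-producer buffer has room; otherwise it drops
// the record and counts the drop instead of stalling the tailer.
func emit(ch chan<- record, rec record) {
	select {
	case ch <- rec:
	default:
		drops.Add(1)
	}
}

func main() {
	ch := make(chan record, 2) // 1024 in the case study; 2 here to force drops
	for i := 0; i < 4; i++ {
		emit(ch, record{line: fmt.Sprintf("line %d", i)})
	}
	fmt.Println("dropped:", drops.Load()) // prints 2: the buffer held only 2
}
```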
Failure modes seen in production:
- A misconfigured file tailer hits a 1 MB log line and stalls. The merge keeps running because other sources continue. The tailer is restarted by the supervisor.
- The HTTP collector goes down. The batcher retries with backoff. Producer buffers fill. Drops start. Operator alerts. Recovery: collector returns; buffers drain; drop counter stops climbing.
- A new producer is added but never closes its channel on shutdown. The merge stays alive past the process's exit signal. Fix: every producer registers a Close hook with the supervisor.
Production Case Study: Multi-Region Health Check¶
A health-checking service probes endpoints across regions. Each region has 50-200 probes, run by goroutines that emit results on a per-region channel. Merges produce a per-region stream and a global stream.
Architecture:
region eu-west: [200 probes] ─▶ Merge ─▶ regional alerter
region us-east: [180 probes] ─▶ Merge ─▶ regional alerter
region ap-south:[150 probes] ─▶ Merge ─▶ regional alerter
regional streams ──▶ Merge ──▶ global dashboard
Engineering decisions:
- Layered merging: per-region first, then a global merge. This keeps each regional fan-in manageable (≤200 inputs) and lets each region drain independently.
- Stable region boundaries: producers are static (one per probe). No hot-add at the region level.
- Cancel-fast: when the dashboard disconnects, a ctx cancellation cascades back through the global merge. The probes themselves run on an independent ctx, because they must keep emitting to the regional alerter even when the dashboard is down.
This case shows how layered fan-in matches the system's natural hierarchy.
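A sketch of this layering, assuming a generic `merge` in the shape the earlier levels describe (reproduced here so the example is self-contained; `Result` is a hypothetical probe result type):

```go
package health

import (
	"context"
	"sync"
)

// Result is a hypothetical probe result.
type Result struct {
	Region, Endpoint string
	Healthy          bool
}

// merge fans any number of inputs into one output. Forwarders stop on ctx
// cancellation; the output closes once every input is drained.
func merge[T any](ctx context.Context, inputs ...<-chan T) <-chan T {
	out := make(chan T)
	var wg sync.WaitGroup
	wg.Add(len(inputs))
	for _, in := range inputs {
		go func(in <-chan T) {
			defer wg.Done()
			for v := range in {
				select {
				case out <- v:
				case <-ctx.Done():
					return
				}
			}
		}(in)
	}
	go func() { wg.Wait(); close(out) }()
	return out
}

// globalStream layers the merges: one per region (≤200 inputs each), then a
// global merge over the regional streams. Teeing each regional stream to its
// alerter is omitted for brevity.
func globalStream(ctx context.Context, regions map[string][]<-chan Result) <-chan Result {
	regional := make([]<-chan Result, 0, len(regions))
	for _, probes := range regions {
		regional = append(regional, merge(ctx, probes...))
	}
	return merge(ctx, regional...)
}
```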
Production Case Study: Search Aggregator¶
A search query is dispatched to N backends in parallel. Each backend emits hits on its own channel. A merge feeds the UI's "live results" component.
Architecture:
query ──▶ dispatcher ──▶ ┌─ backend 1 ─▶ ch1 ─┐
├─ backend 2 ─▶ ch2 ─┼──▶ Merge (with deadline) ──▶ UI
└─ backend N ─▶ chN ─┘
Engineering decisions:
- Per-query merge: created on each request, destroyed when the request completes.
- Deadline ctx: 200ms total. Slow backends are preempted; their results are silently dropped.
- Order by relevance: each hit carries a score. The merge does not re-sort; the UI sorts a sliding window of hits.
- Goroutine accounting: each merge spawns N+1 goroutines for the duration of the request. At 1000 QPS with N=8 and the 200ms deadline, at most ~200 requests are in flight at once, so roughly 1800 goroutines steady-state. Fine.
Failure modes:
- A backend becomes slow (P99 = 5s instead of 50ms). Queries time out. The slow-backend's forwarder is cancelled when ctx fires, but the backend itself keeps computing the response, eventually emitting a result that is discarded. Fix: pass ctx into the backend RPC so it cancels server-side.
- Per-query goroutine spawn cost: each merge allocates N+1 channels (`make(chan T)`) and spawns N+1 goroutines. At 10K QPS that is roughly 10K × (N+1), on the order of 100K channel allocations/sec. A profile confirms the cost is small (~0.3% of CPU); acceptable.
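A sketch of the per-query wiring, assuming the generic `merge` from the previous case study's sketch; `Hit` and the backend function signature are assumptions, and `context` and `time` imports are elided:

```go
// Hit is a hypothetical scored search result.
type Hit struct {
	Score float64
	Doc   string
}

// handleQuery fans N backends into one hit stream under a 200ms deadline.
// Passing ctx into each backend call is the server-side-cancellation fix
// named in the failure modes above.
func handleQuery(parent context.Context, q string,
	backends []func(context.Context, string) <-chan Hit) <-chan Hit {

	ctx, cancel := context.WithTimeout(parent, 200*time.Millisecond)
	channels := make([]<-chan Hit, len(backends))
	for i, b := range backends {
		channels[i] = b(ctx, q) // each backend emits hits on its own channel
	}
	merged := merge(ctx, channels...)

	out := make(chan Hit)
	go func() {
		defer cancel() // release the timer once the stream is done
		defer close(out)
		for h := range merged {
			select {
			case out <- h: // the UI consumes this and sorts a sliding window
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}
```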
Designing for Operability¶
Production fan-in must expose:
- `pending` per input: backlog size if the input is buffered.
- `emitted_total` per input: counter, label = input id.
- `dropped_total` per input: counter, label = input id (drop-on-full).
- `merge_latency` histogram: time from input send to merged emit.
- `active_inputs` gauge: dynamic merges only.
Without these, you cannot diagnose imbalance, slow producers, or drop storms.
Logging: log only structural events (input added, input closed). Do not log every value — at production rates that is a megabyte per second of log noise.
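A sketch of an instrumented forwarder using the Prometheus Go client; the metric names mirror the list above, while the `fanin_` prefix and input-id labelling are assumptions. The pending gauge and latency histogram are omitted for brevity:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	emitted = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "fanin_emitted_total", Help: "Values forwarded, per input.",
	}, []string{"input"})
	dropped = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "fanin_dropped_total", Help: "Values dropped on full buffer, per input.",
	}, []string{"input"})
	activeInputs = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "fanin_active_inputs", Help: "Live inputs on dynamic merges.",
	})
)

// forward copies in to out, recording per-input telemetry. Structural
// events (start/stop) adjust the gauge; every value bumps a counter.
// This variant applies the drop-on-full policy on a congested output.
func forward[T any](id string, in <-chan T, out chan<- T) {
	activeInputs.Inc()
	defer activeInputs.Dec()
	for v := range in {
		select {
		case out <- v:
			emitted.WithLabelValues(id).Inc()
		default:
			dropped.WithLabelValues(id).Inc()
		}
	}
}
```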
Stable Merge Variants¶
Production code occasionally needs partial ordering:
- Per-key ordering: events for one user must be in order, but cross-user order is irrelevant. Partition by key into separate sub-pipelines, merge their outputs at the end.
- Time-window ordering: emit values in 1-second buckets, sorted within each bucket. Buffer 1s, sort, emit (sketched below).
- Watermark-based: each input emits "watermark" tokens advertising its current time; the merge holds back values until all inputs have advanced past their timestamp. This is the Apache Beam / Flink approach.
Build these with care; they have non-trivial bug surfaces. For most teams, an unordered merge plus a downstream sort is good enough.
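A sketch of the time-window variant referenced above, assuming events carry a timestamp (`Event` is hypothetical) and `sort` and `time` imports; it buffers the merged stream for one second, sorts each bucket, and emits it:

```go
// Event is a hypothetical timestamped value coming out of the merge.
type Event struct {
	TS      time.Time
	Payload string
}

// windowSort re-emits the unordered merged stream in 1-second buckets,
// sorted by timestamp within each bucket. It is a sketch, not a watermark
// scheme: an event that arrives after its bucket flushed lands in the next
// bucket rather than being reordered globally.
func windowSort(in <-chan Event, out chan<- Event) {
	defer close(out)
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	var bucket []Event
	flush := func() {
		sort.Slice(bucket, func(i, j int) bool { return bucket[i].TS.Before(bucket[j].TS) })
		for _, e := range bucket {
			out <- e
		}
		bucket = bucket[:0]
	}

	for {
		select {
		case e, ok := <-in:
			if !ok {
				flush() // emit whatever remains, then stop
				return
			}
			bucket = append(bucket, e)
		case <-ticker.C:
			flush()
		}
	}
}
```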
Hot-Reload Producers¶
Production fan-ins sometimes need to add/remove producers at runtime: a new shard is provisioned, a file is rotated, a connection is replaced.
Pattern: a manager goroutine owns the merge and listens on a control channel.
```go
type Manager[T any] struct {
	out  chan T
	ctrl chan ctrlMsg[T]
	ctx  context.Context
}

type ctrlMsg[T any] struct {
	op  ctrlOp
	id  string
	src <-chan T
}

type ctrlOp int

const (
	opAdd ctrlOp = iota
	opRemove
)
```
The manager spawns and tears down forwarders in response to ctrl messages. Each forwarder has its own ctx, which the manager cancels on remove.
This is essentially a runtime-flexible supervisor. It is non-trivial but unavoidable for long-lived merges.
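A sketch of that run loop, extending the types above; it assumes `context` and `sync` imports and that the manager alone closes `out`:

```go
func (m *Manager[T]) run() {
	cancels := map[string]context.CancelFunc{} // one cancel per live forwarder
	var wg sync.WaitGroup
	for {
		select {
		case <-m.ctx.Done(): // tear everything down, then close out
			for _, cancel := range cancels {
				cancel()
			}
			wg.Wait()
			close(m.out)
			return
		case msg := <-m.ctrl:
			switch msg.op {
			case opAdd:
				fctx, cancel := context.WithCancel(m.ctx)
				cancels[msg.id] = cancel
				wg.Add(1)
				go func(src <-chan T) { // forwarder: copy src into out
					defer wg.Done()
					for v := range src {
						select {
						case m.out <- v:
						case <-fctx.Done():
							return
						}
					}
				}(msg.src)
			case opRemove:
				if cancel, ok := cancels[msg.id]; ok {
					cancel() // forwarder exits on its next send attempt
					delete(cancels, msg.id)
				}
			}
		}
	}
}
```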
Multi-Tenant Fan-In¶
A SaaS service may run one merge per tenant, scaled to thousands of merges. Issues:
- Goroutine count: 1000 tenants × 10 inputs × 2 (forwarder + closer) = 20K goroutines. Manageable but watch heap.
- Per-tenant resource limits: cap producers per tenant to prevent one bad tenant from starving others.
- Separation: each tenant's merge has its own metrics labels. Aggregate with `sum by (tenant_id)` in PromQL.
- Quotas: drop or rate-limit producers when a tenant exceeds its plan (see the sketch below).
Design for the 99th-percentile tenant; the median is fine.
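A sketch of the per-tenant cap applied at producer-registration time, reusing the hot-reload `Manager`; `maxInputsPerTenant`, `ErrTenantQuota`, and the caller-owned counts map are assumptions (`errors` import elided):

```go
var ErrTenantQuota = errors.New("tenant at producer quota")

const maxInputsPerTenant = 10

// addProducer registers a new input unless the tenant is at its cap, so one
// bad tenant cannot monopolise forwarder goroutines. counts is owned by the
// caller (e.g. the control plane) and tracks live inputs per tenant.
func addProducer[T any](m *Manager[T], counts map[string]int,
	tenant, id string, src <-chan T) error {

	if counts[tenant] >= maxInputsPerTenant {
		return ErrTenantQuota
	}
	counts[tenant]++
	m.ctrl <- ctrlMsg[T]{op: opAdd, id: id, src: src}
	return nil
}
```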
Backpressure Strategies under Outage¶
When the consumer dies or slows by 10x, there are four options:
- Block (default): producers slow to consumer rate. Memory bounded. Latency rises.
- Drop oldest: producer keeps the current value, discards older buffered ones. Implement with a ring buffer per input (see the sketch at the end of this section).
- Drop newest: producer skips the new value. Implement with a `select` whose `default` case drops.
- Spill to disk: large buffers backed by a persistent queue (e.g., go-disk-queue); drains back to memory when the consumer recovers.
Choice depends on data semantics:
- Logs: drop oldest (recent lines matter more).
- Metrics: drop newest (we already saw most of the data).
- Audit events: never drop; spill to disk.
Document the choice. Operators must know what your service does under load.
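A sketch of the drop-oldest ring buffer referenced above. It is not goroutine-safe on its own, so it assumes a single owner: the producer pushes, and a pump goroutine pops into the merge input at whatever rate the consumer allows:

```go
// ring keeps the newest len(buf) values; writing to a full ring evicts the
// oldest value instead of blocking the producer.
type ring[T any] struct {
	buf  []T
	head int // index of oldest value
	n    int // number of live values
}

func newRing[T any](capacity int) *ring[T] { return &ring[T]{buf: make([]T, capacity)} }

func (r *ring[T]) push(v T) {
	if r.n == len(r.buf) { // full: overwrite the oldest, advance head
		r.buf[r.head] = v
		r.head = (r.head + 1) % len(r.buf)
		return
	}
	r.buf[(r.head+r.n)%len(r.buf)] = v
	r.n++
}

func (r *ring[T]) pop() (T, bool) {
	var zero T
	if r.n == 0 {
		return zero, false
	}
	v := r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.n--
	return v, true
}
```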
Testing in CI and Staging¶
Production-grade tests:
- Functional CI tests: unit tests run with `-race`.
- Load tests in staging: 10x production rate to stress the merge.
- Chaos tests: kill a producer mid-stream, expect graceful continuation.
- Goroutine leak tests: `goleak.VerifyNone` after every test (see the sketch below).
- Memory tests: `runtime.ReadMemStats` before/after a 1M-message run, assert no leak.
Failed tests should produce a complete report: stack of every live goroutine, channel capacities, last 100 emitted values.
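A sketch of the leak test, assuming go.uber.org/goleak and a `merge` like the earlier sketches:

```go
package fanin

import (
	"context"
	"testing"

	"go.uber.org/goleak"
)

// TestMergeNoLeak drains a small merge, then asserts no goroutine outlives
// the test. On failure goleak prints every live goroutine's stack, which is
// most of the report described above.
func TestMergeNoLeak(t *testing.T) {
	defer goleak.VerifyNone(t)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	a, b := make(chan int), make(chan int)
	go func() { defer close(a); a <- 1 }()
	go func() { defer close(b); b <- 2 }()

	for range merge(ctx, a, b) {
		// drain fully so every forwarder exits before VerifyNone runs
	}
}
```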
Cost Modelling¶
A merge has three costs:
- Per-value CPU: ~150 ns/value on a single core.
- Per-value latency: ~1-10 µs depending on scheduler.
- Per-input goroutine memory: ~8 KB stack per forwarder, plus channel buffers.
For a service emitting 1M values/sec across 100 inputs:
- CPU: 1M × 150 ns = 150 ms CPU/sec, i.e. 15% of one core. Fine.
- Latency: P50 1µs, P99 10µs. Fine for most use cases.
- Memory: 100 forwarders × 8 KB = 800 KB stacks; 100 channels × 1 KB buffer = 100 KB. Total ~1 MB. Fine.
If you scale to 10,000 inputs at 100M values/sec, costs scale linearly. The dominant cost becomes the single output channel, which caps at ~10-20M sends/sec on one CPU. Layer the merge into a tree at this scale.
Migration Story¶
A team migrating from a synchronous loop to a merge-based design follows this rough order:
- Identify N producers — usually loops over a slice of resources.
- Wrap each producer in a goroutine that emits on a channel.
- Replace the synchronous `for _, r := range resources` loop with `merge(ctx, channels...)`.
- Add `errgroup` if any producer can fail.
- Add metrics on every input.
- Add a `goleak` test.
- Roll out to canary; watch goroutine count and latency.
A common pitfall: forgetting to close channels on producer error. The fix is `defer close(out)` in every producer goroutine, regardless of how it exits; a sketch follows.
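A sketch of steps 2 to 4 combined; `Resource`, `Item`, and the `Next` fetch method are assumptions about the pre-migration code:

```go
package migrate

import (
	"context"
	"errors"
	"io"

	"golang.org/x/sync/errgroup"
)

// Item is a hypothetical value produced by each resource.
type Item struct{ Payload string }

// Resource stands in for whatever the old synchronous loop iterated over.
type Resource interface {
	Next(ctx context.Context) (Item, error) // returns io.EOF when exhausted
}

// wrapProducers gives each resource its own goroutine and channel.
// defer close(out) is the pitfall fix: the channel closes on every exit
// path, error or not. errgroup cancels ctx on the first failure.
func wrapProducers(ctx context.Context, resources []Resource) ([]<-chan Item, *errgroup.Group) {
	g, ctx := errgroup.WithContext(ctx)
	channels := make([]<-chan Item, len(resources))
	for i, r := range resources {
		out := make(chan Item, 256) // cheat-sheet default per-input buffer
		channels[i] = out
		r := r
		g.Go(func() error {
			defer close(out)
			for {
				item, err := r.Next(ctx)
				if errors.Is(err, io.EOF) {
					return nil // clean end of this resource
				}
				if err != nil {
					return err
				}
				select {
				case out <- item:
				case <-ctx.Done():
					return ctx.Err()
				}
			}
		})
	}
	return channels, g
}
```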
Cheat Sheet¶
| Production decision | Default |
|---|---|
| Merge buffer | unbuffered output, per-input buffer of 256 |
| Drop policy | drop newest for telemetry; block for everything else |
| Hot-add producers | supervisor pattern |
| Layering | tree at >1000 inputs |
| Per-input metrics | always |
| Goroutine leak test | always |
Summary¶
Professional fan-in is operational. The merge function is small; the system around it — supervisors, metrics, drop policy, layered topology, multi-tenant safety — is large. Real production case studies show the same pattern applied with different operability rigour: log shippers, health checkers, search aggregators. Every decision (buffer size, drop policy, error semantics) is documented and observable.