Tee-Channel — Senior Level¶
Table of Contents¶
- Introduction
- Internal Mechanics of the Nil-Channel Trick
- Fairness, Scheduling, and select Bias
- Tee Under High Throughput
- Lock-Free Multi-Output Variants
- Tee in a Streaming System
- Exactly-Once vs At-Least-Once Semantics
- Observability and Drop Accounting
- Choosing Between Tee, Hub, and Ring Buffer Fanout
- Production Failure Modes
- Composing Tee With Other Combinators at Scale
- Memory Behaviour Under Sustained Load
- Race Conditions to Watch For
- Tee in Long-Running Daemons
- Summary
Introduction¶
At senior level the question is not "how does tee work" or "which variant should I pick" but "what are the failure modes when I run this at production scale, and how do I reason about them?"
Tee is built from primitives — select, chan, close — whose semantics are well-defined but subtle under load. This file unpacks the parts the middle-level summary glossed over: runtime mechanics of select, fairness guarantees (and lack thereof), throughput ceilings, integration with non-channel back-ends (ring buffers, message brokers), and the operational behaviour you should expect when tee is in the hot path of a high-volume pipeline.
Internal Mechanics of the Nil-Channel Trick¶
A select statement compiles to a call into runtime.selectgo (see src/runtime/select.go). At call time the runtime:
- Walks the list of cases.
- For each case, evaluates the channel expression. A nil channel produces an entry with no ready slot and no pending slot; it is effectively pruned.
- Performs a lock-acquisition pass over the non-nil channels in pseudo-random order.
- If any channel has a ready send/receive partner, that case is taken immediately.
- If none is ready, the goroutine is parked on the wait-queues of all non-nil channels.
The key consequence: setting a channel variable to nil between iterations is not a runtime hack; it is the documented and supported way to dynamically remove a case from a select without rewriting the statement.
The cost is a single pointer write. There is no allocation, no lock, no scheduling.
The alternative — restructuring the code to use two different select statements ("first send is one select, second send is another") — would compile to two distinct selectgo calls and prevent the runtime from atomically considering both branches as ready in the first iteration. The nil-channel form is strictly more efficient.
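For concreteness, here is a minimal sketch of the symmetric tee shape the rest of this file assumes. The exact signature is an assumption (the hybrid example later in this file passes a context rather than a done channel), but the nil-channel inner loop is the part under discussion:

```go
// Tee delivers every value from in to both outputs. The nil-channel trick
// removes an output from the select once it has received the current value,
// so each value is sent exactly once per output, in whichever order the
// consumers become ready.
func Tee[T any](done <-chan struct{}, in <-chan T) (<-chan T, <-chan T) {
	out1 := make(chan T)
	out2 := make(chan T)
	go func() {
		defer close(out1)
		defer close(out2)
		for v := range in {
			a, b := out1, out2 // local copies we can set to nil
			for i := 0; i < 2; i++ {
				select {
				case <-done:
					return
				case a <- v:
					a = nil // pruned: this case cannot fire again this iteration
				case b <- v:
					b = nil
				}
			}
		}
	}()
	return out1, out2
}
```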
Compiler-level note¶
go build -gcflags="-m" rarely tells you much about select cases; they live in the runtime, not the compiler's inliner. To understand performance, profile with go test -bench=. -benchmem and go tool pprof against a tee benchmark. The hot path is runtime.selectgo and runtime.chansend.
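A sketch of the kind of benchmark meant here, assuming the Tee helper sketched above (adapt it to your own signature); run it with go test -bench=Tee -benchmem and feed the profile to go tool pprof:

```go
import "testing"

// BenchmarkTee pushes b.N values through a tee and drains both outputs.
func BenchmarkTee(b *testing.B) {
	done := make(chan struct{})
	defer close(done)
	in := make(chan int)
	out1, out2 := Tee(done, in)
	go func() {
		defer close(in)
		for i := 0; i < b.N; i++ {
			in <- i
		}
	}()
	b.ReportAllocs()
	// One value arrives on each output per produced value.
	for range out1 {
		<-out2
	}
}
```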
Fairness, Scheduling, and select Bias¶
select picks uniformly at random among ready cases. The implementation uses a Knuth shuffle on the case order each call. This means:
- Over a long run, both outputs receive the value "first" roughly half the time.
- For any single value, which output gets it first is unpredictable.
- There is no static priority; you cannot rely on declaration order.
If you need priority — "always prefer the audit branch when both are ready" — you must encode it explicitly. The idiom:
```go
// Priority: try outA first; if it would block, fall through to select.
select {
case <-done:
	return
case outA <- v:
	// delivered to A first
default:
	// A would block; fair race for who gets it first
	select {
	case <-done:
		return
	case outA <- v:
	case outB <- v:
	}
}
```
This form gives outA a head start when it is ready, while still preventing the goroutine from blocking on a slow outA if outB is also ready. The asymmetry is documented and intentional.
A subtler fairness concern: if one consumer is always faster, the symmetric tee will tend to deliver to it first in each iteration. The second send then waits for the slower consumer. Over time, the system reaches steady state at the slower consumer's rate. There is no bias accumulating against either branch; the only "bias" is that both branches are always served, in some order, before the next value is read.
Tee Under High Throughput¶
Channel operations in Go cost on the order of 50–100 ns each on commodity x86_64 hardware. A symmetric tee performs:
- 1 receive from in
- 2 sends (one per output), each going through selectgo
Add scheduling overhead and goroutine wake-up costs. A realistic budget is 150–250 ns per value for the tee itself, end-to-end, when both consumers are CPU-bound and ready. That is roughly 4–6 million values/sec through a single tee, with one CPU core dedicated to the tee goroutine and one to each consumer.
Bottlenecks at higher rates:
- Goroutine wake-up storms. Every send wakes the consumer goroutine. If consumers re-park immediately (because their next operation also blocks), you incur two scheduler transitions per value.
- Cache line bouncing. A channel's hchan struct lives in one cache line; sender and receiver write to the same lines. Under heavy contention this becomes a memory-system bottleneck.
- GC pressure from payload values. Tee allocates nothing per value, but the channel runtime copies each value three times (once in on the receive, once out per output). For 4 KB payloads at 1 M/sec you are moving 12 GB/sec of bytes — measurable.
Mitigations:
- Buffer aggressively for batch-oriented pipelines. A 1024-deep buffer turns each operation into an amortised lock acquisition rather than a per-value scheduler transition (see the sketch after this list).
- Send pointers, not values, for payloads larger than a cache line. Trade aliasing risk for throughput.
- Place the tee on a dedicated runtime.LockOSThread() goroutine if you need predictable latency under load. Rarely needed; verify with pprof first.
- Replace tee with a lock-free ring fanout when throughput exceeds 10 M/sec. See the next section.
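A compilable sketch of the first two mitigations. The Event type and the 1024 capacities are illustrative, not prescriptive; the tee loop itself (the one sketched in the nil-channel section) stays unchanged and only the channel construction moves:

```go
// Buffered channels carrying pointers: each send copies an 8-byte pointer
// instead of a 4 KB struct, and the 1024-deep buffers amortise scheduler
// hand-offs across bursts instead of paying one park/unpark per value.
type Event struct {
	Payload [4096]byte
}

func newTeePlumbing() (in, out1, out2 chan *Event) {
	in = make(chan *Event, 1024)
	out1 = make(chan *Event, 1024)
	out2 = make(chan *Event, 1024)
	return in, out1, out2
}
```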
A 4 M/sec tee is more than sufficient for almost every workload short of a market-data fanout or a network packet capture pipeline.
Lock-Free Multi-Output Variants¶
Past a few million events per second, the channel-based tee hits its ceiling. The alternative is a single producer, multiple consumer (SPMC) ring buffer where each consumer maintains its own read cursor.
```
+------------------+
|  ring            |   producer writes monotonically;
|  [.][.][.][.]    |   producer's write cursor: W
+------------------+
     ^        ^
     |        |
     A        B        consumers' read cursors RA, RB;
                        each consumer reads independently;
                        the slowest consumer determines when W
                        can wrap and reclaim slots
```
Sketch:
```go
import (
	"runtime"
	"sync/atomic"
)

// Ring is a single-producer, two-consumer ring buffer. len(buf) must be a
// power of two so that mask == len(buf)-1 works as a cheap modulo.
type Ring[T any] struct {
	buf    []T
	mask   uint64
	w      atomic.Uint64 // producer write cursor
	ra, rb atomic.Uint64 // per-consumer read cursors
}

// Publish spins until the slowest consumer has caught up to within capacity,
// then writes the slot and advances the write cursor.
func (r *Ring[T]) Publish(v T) {
	for {
		wpos := r.w.Load()
		slow := min(r.ra.Load(), r.rb.Load()) // Go 1.21+ builtin min
		if wpos-slow < uint64(len(r.buf)) {
			r.buf[wpos&r.mask] = v
			r.w.Store(wpos + 1)
			return
		}
		runtime.Gosched()
	}
}

// ConsumeA returns the next unread value for consumer A, or ok=false if the
// ring is currently empty.
func (r *Ring[T]) ConsumeA() (T, bool) {
	pos := r.ra.Load()
	if pos < r.w.Load() {
		v := r.buf[pos&r.mask]
		r.ra.Store(pos + 1)
		return v, true
	}
	var zero T
	return zero, false
}
```
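The sketch above omits construction and the consumer loop. A minimal continuation under the same assumptions:

```go
// NewRing allocates a ring whose capacity must be a power of two so that
// index&mask replaces a modulo on every access.
func NewRing[T any](capacity int) *Ring[T] {
	if capacity <= 0 || capacity&(capacity-1) != 0 {
		panic("ring capacity must be a power of two")
	}
	return &Ring[T]{buf: make([]T, capacity), mask: uint64(capacity - 1)}
}

// drainA shows the consumer-A side: spin with Gosched when empty. The exit
// condition (a sentinel value or a separate done flag) is omitted; see the
// caveats below.
func drainA[T any](r *Ring[T], handle func(T)) {
	for {
		if v, ok := r.ConsumeA(); ok {
			handle(v)
			continue
		}
		runtime.Gosched()
	}
}
```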
Properties:
- Throughput is bounded by atomic counter writes (~5 ns) and a single load, often above 100 M/sec on modern CPUs.
- Backpressure is naturally preserved: the producer waits for the slowest consumer.
- Each new consumer adds one atomic counter; no goroutine hop on the hot path.
- Memory: O(capacity * sizeof(T)), fixed at startup.
Caveats:
- Spin-waiting wastes CPU; use exponential backoff or runtime.Gosched() between spins.
- Cancellation is awkward; you typically add a sentinel value or a separate done flag.
- Capacity matters. A power-of-two len(buf) enables fast masking; too small a buffer leaves the producer spinning.
- For mixed-pace consumers a small SPMC ring is worse than tee because the fast consumer spins on w.Load() after every read.
This is library-level work. Crank's disruptor-go and smallnest/ringbuffer implement variants. Use only if you have measured channel-based tee as the bottleneck. Otherwise the operational simplicity of channels beats the throughput of a ring.
Tee in a Streaming System¶
Tee fits into Go-internal pipelines naturally. Tee across distributed systems requires a different vocabulary:
- Kafka with two consumer groups is the canonical distributed tee. Group A and group B each see every message. Throughput, durability, and per-consumer-group offset tracking come for free. The Go-side tee at most copies one local stream into both groups' producers.
- NATS subjects with multiple subscribers are similar but in-memory.
- AWS Kinesis fanout lets multiple stream consumers read the same shard.
When tee straddles the local/distributed boundary, the Go-side tee usually feeds a write to a broker:
```go
toKafka, toIndex := Tee(ctx, events)
g.Go(func() error { return shipToKafka(ctx, toKafka) })
g.Go(func() error { return updateIndex(ctx, toIndex) })
```
The tee goroutine is local; the actual duplication of work across machines happens downstream via the broker's own fanout. This is the canonical hybrid pattern.
Failure modes specific to the hybrid:
- Kafka write blocks. The librdkafka-backed producer may apply backpressure; the tee's outA send blocks; the symmetric tee then blocks outB. If the index branch must proceed during Kafka outages, you need the lossy asymmetric variant on the Kafka branch, with monitoring.
- Index falls behind Kafka. Same shape, reversed.
- Both branches partition independently. Allowed by tee; ensure the consumer side handles out-of-order delivery if your business contract requires synchronised state.
Exactly-Once vs At-Least-Once Semantics¶
Tee delivers exactly once to each output on a clean close of in, and at most once per output when done cancellation interrupts it. Let us unpack:
Clean shutdown (close in)¶
The for v := range in loop drains every value. For each value, the inner loop iterates exactly twice, delivering to both outputs. Both outputs receive exactly the same number of values as in sent. No duplication, no loss.
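A quick check of that contract, assuming the Tee helper sketched in the nil-channel section:

```go
import (
	"sync"
	"testing"
)

// On a clean close of in, both outputs must see exactly the values sent.
func TestTeeCleanClose(t *testing.T) {
	in := make(chan int)
	out1, out2 := Tee(nil, in) // nil done: the cancellation case can never fire
	go func() {
		defer close(in)
		for i := 0; i < 1000; i++ {
			in <- i
		}
	}()
	var n1, n2 int
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for range out1 {
			n1++
		}
	}()
	go func() {
		defer wg.Done()
		for range out2 {
			n2++
		}
	}()
	wg.Wait()
	if n1 != 1000 || n2 != 1000 {
		t.Fatalf("out1=%d out2=%d, want 1000 each", n1, n2)
	}
}
```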
Cancellation (close done)¶
If the inner loop is between sends when done fires:
- 0 sends completed so far: neither output receives v. Outputs remain equal in count.
- 1 send completed: one output (whichever the select picked first) received v; the other did not. Outputs differ by 1.
- 2 sends completed: both received v; the loop already exited normally.
So cancellation may produce a one-value discrepancy between outputs. This matters for some consumers:
- A consumer that just logs values does not care.
- A consumer that maintains an aggregate (count, sum) sees a 1-off skew if checked precisely at shutdown.
- A consumer that does idempotent upsert into a database does not care.
- A consumer that does non-idempotent writes (append-only log) is at risk if the producer is restarted and replays — but the issue then is replay, not tee.
To eliminate the skew, defer cancellation until in is drained:
```go
// DrainThenCancel sits between a source and a tee. Cancelling it stops the
// forwarding goroutine and closes its output; a downstream tee then finishes
// by draining in normally, so its two outputs cannot skew.
func DrainThenCancel[T any](src <-chan T) (in <-chan T, cancel func()) {
	out := make(chan T)
	done := make(chan struct{})
	cancelOnce := sync.OnceFunc(func() { close(done) }) // Go 1.21+
	go func() {
		defer close(out)
		for {
			select {
			case <-done:
				return
			case v, ok := <-src:
				if !ok {
					return
				}
				select {
				case <-done:
					return
				case out <- v:
				}
			}
		}
	}()
	return out, cancelOnce
}
```
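Hypothetical wiring under those assumptions; produce, Event, and the Tee helper are stand-ins, not this file's API:

```go
import "context"

// Cancel the wrapper, not the tee. With a nil done the tee can only finish by
// draining in, so its two outputs can never skew by one value.
func wire(ctx context.Context) (<-chan Event, <-chan Event, func()) {
	src := produce(ctx) // assumed upstream stage emitting Event values
	in, cancel := DrainThenCancel(src)
	out1, out2 := Tee(nil, in)
	return out1, out2, cancel
}
```

The trade-off: with a nil done, a consumer that stops reading blocks the tee indefinitely, so this wiring assumes every consumer drains to EOF.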
The point: tee's cancellation contract is correct for the common case (drop in-flight is acceptable). When it is not, you wrap the input with a stage that delays cancellation.
Observability and Drop Accounting¶
A tee in production should be visible. Suggested instrumentation:
- Counters: tee.in.values, tee.out_a.values, tee.out_b.values. In a strict tee, all three should match within one value (the difference is in-flight). A persistent drift on one output is a sign of a stuck consumer.
- Counter: tee.dropped if using the lossy variant. Alert on rate, not absolute count.
- Gauges: tee.buffer_a.depth, tee.buffer_b.depth for buffered variants. Rising depth means the consumer is falling behind.
- Histogram: tee.deliver_latency_seconds measuring time from receive on in to second successful send. Tail latency here equals the slow consumer's pacing.
Implementation note: do not measure inside the tee body if you care about performance. Sample (1-in-N) or measure end-to-end via consumer-side timestamps embedded in payloads.
For Prometheus, register the metrics at startup and update them inside the tee goroutine via local atomics that are batch-flushed every N values to reduce contention.
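A sketch of that batch-flushed approach with the Prometheus client. The metric name follows the list above (underscored, as Prometheus requires); flushEvery is arbitrary, and a plain local variable suffices because only the tee goroutine writes it:

```go
import "github.com/prometheus/client_golang/prometheus"

// Registered once at startup; the hot path never touches it per value.
var teeInValues = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "tee_in_values_total",
	Help: "Values received on the tee input.",
})

func init() { prometheus.MustRegister(teeInValues) }

const flushEvery = 1024

func instrumentedTee[T any](done <-chan struct{}, in <-chan T) (<-chan T, <-chan T) {
	out1, out2 := make(chan T), make(chan T)
	go func() {
		defer close(out1)
		defer close(out2)
		var pending float64
		defer func() { teeInValues.Add(pending) }() // flush the tail on exit
		for v := range in {
			if pending++; pending >= flushEvery {
				teeInValues.Add(pending)
				pending = 0
			}
			a, b := out1, out2
			for i := 0; i < 2; i++ {
				select {
				case <-done:
					return
				case a <- v:
					a = nil
				case b <- v:
					b = nil
				}
			}
		}
	}()
	return out1, out2
}
```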
Choosing Between Tee, Hub, and Ring Buffer Fanout¶
Practical decision table:
| Property | Tee | Hub | SPMC ring |
|---|---|---|---|
| N (consumers) | 2 | any | any |
| Dynamic subscribe | no | yes | usually no |
| Per-consumer overflow policy | manual via variant | yes | manual |
| Throughput ceiling | ~5 M/sec | ~2 M/sec | 50+ M/sec |
| Lock-free | yes | no (RWMutex) | yes |
| Code size | ~15 lines | ~200 lines | ~100 lines + atomics |
| Operability | excellent | good | requires care |
Decision flow:
- N=2 and N is fixed → tee.
- N=3 and unlikely to grow → chained tee (two tees, one fed by an output of the other).
- N variable, subscribers come and go → hub.
- Throughput exceeds 5 M/sec → SPMC ring, only after measurement.
The choice is rarely tee vs hub on first principles; it is "we built tee, it now needs to grow, do we extend or rewrite?" Tee is so cheap that extending is usually fine; you only rewrite once the dynamism or throughput pressures justify the operational complexity of a hub.
Cross-reference: 05-concurrency-patterns/06-broadcast-pattern/senior.md covers the sharded hub implementation.
Production Failure Modes¶
A non-exhaustive list of how tee misbehaves in production, in rough order of frequency:
1. One consumer leaks goroutines, slows over time, eventually stalls everything¶
Symptom: latency on a different service goes up. Cause: the degraded consumer's branch blocks, and the symmetric tee then stalls the healthy branch as well. Fix: lossy asymmetric tee on the suspect branch, with drop counter and alerts. Capacity-plan based on burst, not steady-state.
2. Producer cancels via context but consumers never see EOF¶
Cause: the producer goroutine panicked (or returned on ctx.Done()) without closing in, so the tee never leaves range in and its deferred close(out1)/close(out2) never run. Fix: install recover in the producer so it closes its own output channel before crashing; tee will then see range in exit and close cleanly.
3. Memory grows unboundedly¶
Cause: buffered tee with a buffer too large; consumer permanently slower than producer. Fix: shrink buffer; switch to lossy if the slow consumer is acceptable to drop from; otherwise re-architect the slow consumer.
4. Inconsistent value counts between two sinks¶
Cause: cancellation mid-flight, or the lossy variant in use without monitoring. Fix: drain-on-shutdown wrapper; ensure cancellation only fires after in is closed.
5. Tee goroutine pinned to one core under load¶
Cause: heavy tee on a single goroutine processing 5 M/sec; that goroutine is the bottleneck. Fix: shard the input across multiple tees by hash, or move to SPMC ring.
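A sketch of the sharding fix, assuming the Tee helper from the nil-channel section; k, keyOf, and the 1024 buffer are illustrative:

```go
// ShardedTee splits a hot input across k tees by a hash of each value's key.
// Ordering is preserved only within a key; consumers handle per-shard outputs.
func ShardedTee[T any](done <-chan struct{}, in <-chan T, k int, keyOf func(T) uint64) (outA, outB []<-chan T) {
	shards := make([]chan T, k)
	outA = make([]<-chan T, k)
	outB = make([]<-chan T, k)
	for i := range shards {
		shards[i] = make(chan T, 1024)
		outA[i], outB[i] = Tee(done, shards[i])
	}
	go func() {
		defer func() {
			for _, s := range shards {
				close(s)
			}
		}()
		// Router ignores done for brevity; in production select on it here too.
		for v := range in {
			shards[keyOf(v)%uint64(k)] <- v
		}
	}()
	return outA, outB
}
```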
6. panic: send on closed channel inside tee¶
Cause: someone outside cast the directional output back to bidirectional and closed it. Fix: do not do that. Audit channel types.
7. Deadlock at shutdown¶
Cause: the shutdown sequence closes done and then waits for consumers, but consumers wait for EOF on outputs that only close once the tee goroutine exits, and that goroutine is stuck on an unbuffered send that is not selected against <-done. (Rare; a tee whose every send sits in a select with <-done, and whose defer close(...) runs on exit, cannot wedge this way.) Fix: ensure defer close(...) is the first line in the goroutine body and never send outside a select that includes <-done.
Composing Tee With Other Combinators at Scale¶
In a large pipeline, tee rarely stands alone. Senior-level skill is recognising the natural compositions and the load they put on the runtime.
Tee + orDone on consumers¶
orDone adds one goroutine per call. In a pipeline with 10 tees and 20 consumers, the goroutine count is 10 (tees) + 20 (orDones) + 20 (consumers) = 50, plus the producer chain. This is fine. The point is to know the order of magnitude so you can spot leaks.
Tee + errgroup¶
Each g.Go(...) is one goroutine. Tee's internal goroutine is separate (it does not return an error). Lifecycle is governed by the context. Cancellation flows from any error to all goroutines via the shared ctx. The combination is idiomatic and operationally clean.
Tee + bridge¶
bridge (the next combinator in this directory) flattens a chan chan T into a chan T. Composing tee after bridge is natural for pipelines that have sub-streams per partition. Memory and goroutine cost are additive.
Tee + rate-limiter¶
If you place a rate limiter on one branch, that branch becomes the slow one and paces the producer. Tee + rate-limiter is the right pattern for "log every event but only ship 100/sec to the slow sink." The rate-limited branch effectively becomes the lossy variant via the limiter's drop policy.
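A sketch of the lossy, rate-limited branch using golang.org/x/time/rate; shipToSlowSink, Event, and the dropped counter are assumed names:

```go
import (
	"context"
	"sync/atomic"

	"golang.org/x/time/rate"
)

var dropped atomic.Int64 // feed this into a tee.dropped-style metric

// Values beyond 100/sec are shed instead of pacing the producer.
func rateLimitedBranch(ctx context.Context, in <-chan Event) error {
	lim := rate.NewLimiter(rate.Limit(100), 100) // 100/sec with a burst of 100
	for v := range in {
		if !lim.Allow() {
			dropped.Add(1)
			continue
		}
		if err := shipToSlowSink(ctx, v); err != nil {
			return err
		}
	}
	return nil
}
```

Swapping Allow for Wait(ctx) turns this back into a pacing branch that slows the producer instead of dropping.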
Memory Behaviour Under Sustained Load¶
A tee in steady state has constant memory: one goroutine stack (2-8 KB), two channels (a few hundred bytes), plus whatever the buffer holds (buf * sizeof(T)). No allocations per value.
Under burst load, what can grow:
- The output channel buffer, up to its capacity.
- Each consumer's downstream state.
- Goroutine stacks of consumers, if they recurse deeply.
The tee itself does not grow. If your process memory grows when tee is in the path, the cause is downstream. Use pprof heap to confirm; the tee's bytes should be negligible.
GC interaction:
- Values transiting tee are short-lived. They allocate (if at all) in the producer, transit two channels, and become garbage in the consumer.
- For pointer payloads, the lifetime is bounded by the slowest consumer. A slow consumer that pins many large values delays GC of those objects.
- A buffered tee holds references to buffered values until consumed; large buffers + large payloads = sustained memory.
Tune GOGC and GOMEMLIMIT based on the slowest-consumer-bound working set, not on tee itself.
Race Conditions to Watch For¶
Tee is mutex-free, so the obvious race conditions don't apply. The subtle ones:
Aliased payload mutation¶
Both consumers receive the same pointer or slice header. If one mutates, the other observes. go test -race catches concurrent writes but not the design-level race of "we both expected to own this object."
External close of an output¶
If a misbehaving caller closes an output channel from outside, the next send from inside tee panics. -race won't catch this; it is a data-flow bug, not a memory race.
Done channel closed twice¶
Closing a closed channel panics. If multiple goroutines call close(done), you need sync.Once:
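A minimal sketch of that guard; sync.OnceFunc on Go 1.21+ is the shorter spelling:

```go
import "sync"

// makeCancel returns a done channel plus a cancel function that is safe to
// call from multiple goroutines, multiple times.
func makeCancel() (<-chan struct{}, func()) {
	done := make(chan struct{})
	var once sync.Once
	return done, func() { once.Do(func() { close(done) }) }
}
```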
The race detector does not flag a potential double-close; it surfaces only as a panic when it actually happens at runtime.
Subscriber/output mismatch¶
A consumer reading from a while another reads from b is fine. A consumer that occasionally drains from a and occasionally from b (e.g., a unified select) does not break tee, but does change the semantics: each value's "first" delivery is now ambiguous from the consumer's perspective.
Tee in Long-Running Daemons¶
Many tee uses are in services that run for months. Long-tail concerns:
- Goroutine leak. A tee goroutine leaked at startup is invisible until the process is restarted. Periodic goroutine-count check is a useful canary.
- Memory creep. A buffered tee whose consumer is gradually slower over time accumulates memory at the rate of the consumer's lag. If lag is bounded, fine; if not, OOM eventually.
- Cancellation forgotten. A tee whose done is never closed and whose in never closes runs forever. Wire done to signal.Notify-driven shutdown.
- runtime.GC and the pacer. Heavy tee throughput drives short-lived allocations through the GC; tuning GOGC may help.
- Profiling under load. pprof itself adds load. In hot paths, sample sparingly.
A daemon with a healthy tee shows:
- Constant goroutine count over time.
- Constant memory after warmup.
- Throughput matching the slowest consumer.
- No "send on closed channel" panics in logs.
If any of these drift, investigate.
Summary¶
Senior-level tee is the same fifteen lines you wrote at junior, plus everything you learned about how the runtime executes them and what happens when production traffic exposes their bounds.
The key insights:
- The nil-channel trick is not magic; it is a documented runtime feature with one-pointer-write cost.
- select fairness is uniform but unpredictable per value; deterministic priority requires an explicit default fallthrough.
- Tee tops out around 5 M/sec on commodity hardware; past that, SPMC rings or a sharded hub are better tools.
- Cancellation is at-most-once on each output; if exactness matters, wrap the input to drain before cancelling.
- Composition with orDone, errgroup, and bridge is additive in goroutines and natural in semantics.
- Memory is constant in time for the tee itself; growth is always downstream.
- Observability is the difference between "tee works in dev" and "tee is operable in prod" — instrument it.
Tee is not a deep pattern. It is a tiny composition of primitives. Senior mastery is about knowing exactly which primitive limits exactly which production scenario, and reaching for the right alternative without overshooting.