Drain Pattern — Professional Level¶
Table of Contents¶
- Introduction
- Kafka Consumer Architecture Recap
- The Rebalance Protocol
- Drain Hooks: onPartitionsRevoked
- Exactly-Once Drain
- Cooperative Rebalance and Drain
- Drain Across Worker Pool Backed By Kafka
- Idempotent Producer Drain
- Transactional Producer Drain
- Drain Across Process Boundaries
- War Stories — Drain Incidents In Production
- Drain In Distributed Pipelines
- Drain And Exactly-Once Theory
- Drain Performance Engineering
- Drain Observability At Scale
- Drain Frameworks Comparison
- Drain Across Cloud Vendors
- Drain Anti-Patterns At Professional Level
- Drain Lessons That Took Years To Learn
- Summary
Introduction¶
Focus: "Drain Kafka consumers without dropping or duplicating a single message. Drain transactional producers without breaking exactly-once. Drain across an entire stream-processing pipeline so a rolling deploy is invisible to downstream."
This is the deep end. Junior, middle, and senior covered the patterns in increasing breadth. Professional covers the specific, demanding scenario where every bit of drain matters: Kafka consumers in exactly-once pipelines, where a single dropped offset is a duplicate financial event in production. The patterns from earlier pages are the foundation; this page is the application.
After reading this file you will:
- Understand Kafka's rebalance protocol and how drain integrates with it.
- Implement onPartitionsRevoked hooks that commit and clean up cleanly.
- Drain an exactly-once consumer with transactional commits.
- Coordinate drain across producers and consumers in the same pipeline.
- Recognise the failure modes specific to drain in messaging systems.
- Apply the lessons across non-Kafka systems with similar semantics (Pulsar, Kinesis, NATS JetStream).
Kafka Consumer Architecture Recap¶
A Kafka consumer is a goroutine that reads messages from one or more partitions. Partitions are units of parallelism. A consumer group has many consumers; the group coordinator (a broker) assigns partitions to consumers. The assignment can change at any time — when consumers join, leave, or are deemed unhealthy. This is a rebalance.
A consumer's lifecycle within a group:
- Joins the group.
- Receives partition assignments.
- Fetches messages from each assigned partition.
- Processes each message.
- Periodically commits offsets (to the __consumer_offsets topic).
- Possibly loses partitions during rebalance.
- Eventually leaves the group.
Drain interacts with steps 4, 5, and 6 in particular. A drain that does not commit offsets cleanly causes duplicates on the next consumer.
Drain modes¶
- Static drain. Process gracefully exits. Group coordinator notices, triggers rebalance.
- Live rebalance. Other consumer joins or leaves. Coordinator triggers rebalance; this consumer loses some partitions.
Both must be handled. Static drain is what SIGTERM triggers; live rebalance is what scale events trigger.
The Rebalance Protocol¶
Kafka's rebalance has historically been "stop the world": when partitions are reassigned, all consumers stop processing, give up all partitions, and receive new assignments. The pause can be seconds.
Modern Kafka (since 2.4) supports cooperative incremental rebalance: only the partitions that are moving stop processing; others continue. The pause is per-partition, not global.
Either way, the consumer has hooks for the transition:
- onPartitionsRevoked(partitions) — before partitions are taken away.
- onPartitionsAssigned(partitions) — after new partitions arrive.
- onPartitionsLost(partitions) — when partitions were taken away abruptly (rare).
Drain happens inside onPartitionsRevoked. This is the one chance to commit offsets, flush in-flight processing, and release resources for the partitions being lost.
Drain Hooks: onPartitionsRevoked¶
Pseudocode for a drain-aware consumer:
type ConsumerGroupHandler struct {
	topic    string                    // set at construction
	inFlight map[int32]*sync.WaitGroup // partition -> wg
	commits  map[int32]int64           // partition -> next offset to commit
	mu       sync.Mutex
}
func (h *ConsumerGroupHandler) ConsumeClaim(sess sarama.ConsumerGroupSession,
claim sarama.ConsumerGroupClaim) error {
p := claim.Partition()
wg := h.wgForPartition(p)
for msg := range claim.Messages() {
wg.Add(1)
go func(m *sarama.ConsumerMessage) {
defer wg.Done()
if err := process(m); err != nil {
// handle
return
}
h.mu.Lock()
h.commits[m.Partition] = m.Offset + 1
h.mu.Unlock()
}(msg)
}
return nil
}
func (h *ConsumerGroupHandler) Cleanup(sess sarama.ConsumerGroupSession) error {
	// Called when partitions are about to be revoked.
	// Snapshot the in-flight map under the lock, then wait without holding it.
	h.mu.Lock()
	pending := make(map[int32]*sync.WaitGroup, len(h.inFlight))
	for p, wg := range h.inFlight {
		pending[p] = wg
	}
	h.mu.Unlock()
	for p, wg := range pending {
		wg.Wait() // let this partition's in-flight work finish
		h.mu.Lock()
		offset := h.commits[p]
		h.mu.Unlock()
		sess.MarkOffset(h.topic, p, offset, "")
	}
	sess.Commit()
	return nil
}
The pattern:
- Track in-flight per partition.
- On revoke, wait for that partition's in-flight to finish.
- Commit the highest processed offset.
- Then let the rebalance proceed.
A naive consumer that does not wait in Cleanup causes duplicates: messages still being processed when the partition is reassigned will be re-delivered to the new owner.
Exactly-Once Drain¶
Exactly-once delivery in Kafka requires:
- Idempotent producer.
- Transactional commit of (output messages + input offsets).
- Read-committed isolation on downstream consumers.
Drain interacts with the transaction:
- A transaction in-flight at drain start must either commit or abort, atomically.
- Aborting loses the work; committing requires the output messages to be safely produced.
The pattern:
producer.BeginTransaction()
process(input)
producer.Produce(output)
producer.SendOffsetsToTransaction(inputOffsets, group)
producer.CommitTransaction()
On drain mid-transaction:
Abort discards both the input offset and the output messages. Next consumer re-reads input and produces output anew.
For long transactions, the drain budget must accommodate the transaction's longest reasonable duration. If a transaction takes 30s and drain budget is 25s, you have a structural mismatch.
Cooperative Rebalance and Drain¶
Cooperative rebalance changed the drain story significantly. With eager rebalance, drain meant "stop everything, commit everything, restart everything." With cooperative rebalance, drain can be partial: only the partitions being revoked stop; others keep running.
Implications:
- The drain pattern moves from "global pause" to "per-partition drain."
- Each partition has its own in-flight count and its own commit point.
- The supervisor must track per-partition state.
type PartitionState struct {
wg sync.WaitGroup
committed atomic.Int64
processed atomic.Int64
revoking atomic.Bool
}
type Consumer struct {
partitions sync.Map // map[int32]*PartitionState
}
func (c *Consumer) OnPartitionsRevoked(ps []int32) {
for _, p := range ps {
state := c.getOrCreate(p)
state.revoking.Store(true)
state.wg.Wait() // bounded by Kafka client's rebalance timeout
c.commit(p, state.committed.Load())
}
}
Cooperative rebalance still has a deadline — the group's rebalance timeout, which most clients tie to max.poll.interval.ms. Drain must complete inside it.
Drain Across Worker Pool Backed By Kafka¶
A common architecture: a consumer feeds messages into a worker pool; workers process and commit. Drain order:
- Stop the consumer's fetch.
- Drain the worker pool (wait for in-flight workers).
- Commit offsets for all processed messages.
- Disconnect from Kafka.
type Pipeline struct {
consumer *Consumer
pool *WorkerPool
tracker *OffsetTracker
}
func (p *Pipeline) Drain(ctx context.Context) error {
// Step 1: stop consumer fetch.
if err := p.consumer.StopFetch(ctx); err != nil {
return err
}
// Step 2: drain workers.
if err := p.pool.Drain(ctx); err != nil {
return err
}
// Step 3: commit highest processed offset per partition.
if err := p.tracker.CommitAll(ctx); err != nil {
return err
}
// Step 4: disconnect.
return p.consumer.Close()
}
The order is critical. Disconnect before commit and the commit fails. Commit before drain and you commit offsets the worker has not actually processed.
Idempotent Producer Drain¶
An idempotent producer attaches a producer ID + sequence number to each message. Duplicates are deduped server-side.
Drain semantics for idempotent producer:
- Pending messages must be flushed before close.
- Failed sends should be retried (within bounds).
- The producer's internal queue must drain.
func (p *Producer) Drain(ctx context.Context) error {
if err := p.client.Flush(ctx); err != nil {
return err
}
return p.client.Close()
}
Flush blocks until all pending messages are acknowledged. With a context, the wait is bounded.
If Flush times out, the unflushed messages are lost. The application should either:
- Increase the deadline (if reasonable).
- Persist the messages locally for retry on restart (sketch after this list).
- Accept the loss (for non-critical telemetry).
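A sketch of the second option, assuming the application keeps its own slice of messages it has submitted but not yet seen acknowledged, and assuming segmentio/kafka-go's kafka.Message type; the file name and NDJSON encoding are illustrative:
func spillUnflushed(msgs []kafka.Message) error {
	f, err := os.CreateTemp(".", "unflushed-*.ndjson")
	if err != nil {
		return err
	}
	defer f.Close()
	enc := json.NewEncoder(f)
	for _, m := range msgs {
		// One JSON object per line; replay this file before resuming sends.
		if err := enc.Encode(m); err != nil {
			return err
		}
	}
	return f.Sync()
}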
Transactional Producer Drain¶
A transactional producer wraps idempotent semantics with multi-message atomicity. Drain considerations:
- If a transaction is in flight, decide commit or abort.
- If committing, ensure all messages and offsets are sent.
- If aborting, the work is discarded.
func (p *TxProducer) Drain(ctx context.Context) error {
if p.InTransaction() {
// Choice: commit or abort?
if err := p.CommitTransaction(ctx); err != nil {
_ = p.AbortTransaction(ctx)
return err
}
}
return p.client.Close()
}
Choosing commit-over-abort during drain:
- Commit if the transaction is short and likely to succeed.
- Abort if the transaction is large or partially failed.
The pattern most often chosen: commit if BeginTransaction was called and no errors have occurred; abort otherwise.
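A sketch of that rule, assuming the TxProducer from above additionally records the first error observed since BeginTransaction in a txErr field (a hypothetical name):
func (p *TxProducer) drainTransaction(ctx context.Context) error {
	if !p.InTransaction() {
		return nil
	}
	if p.txErr != nil {
		// Something already failed inside the transaction: discard it.
		return p.AbortTransaction(ctx)
	}
	// Clean so far: keep the work.
	return p.CommitTransaction(ctx)
}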
Drain Across Process Boundaries¶
A stream processing pipeline often spans many processes:
- Producer service (Go).
- Kafka cluster (Java).
- Consumer/transformer (Go).
- Downstream consumer (Python, Go, or other).
Drain semantics across the pipeline:
- Each Go service drains via signal.
- Kafka brokers handle rebalance.
- Downstream services may not even know an upstream is draining.
The pipeline's overall consistency depends on each component's correctness. A drained upstream leaves no in-flight messages; a stopping-without-drain upstream may leave the pipeline in a transient inconsistent state.
For exactly-once pipelines, every component must support drain. A single non-drain-aware component is a single point of failure.
War Stories — Drain Incidents In Production¶
Story 1: The Tuesday Outage¶
A team rolled out a new version of their order-processing service. The drain code was untested. On SIGTERM, the consumer abruptly closed Kafka connections without committing offsets. 47 minutes of duplicate orders before the issue was noticed. Customer support spent the week refunding double charges.
Root cause: missing Cleanup implementation on the ConsumerGroupHandler. The default was a no-op. The team assumed the Kafka client handled drain. It did not.
Fix: implement Cleanup with explicit commit-and-wait. Add a drain test in CI.
Story 2: The Slow Drain Mystery¶
A service was reliable for a year. Then drain times started creeping up: 5s, 8s, 12s. Eventually drain hit the 30s grace period and pods started SIGKILL-ing.
Root cause: a goroutine leak in a third-party tracing library. Every request leaked a goroutine that held a reference to the request body. After a week of uptime, drain wait time was dominated by these leaked goroutines.
Fix: upgrade the tracing library. Add a goroutine count metric. Add an alert.
Story 3: The Partition Skew¶
A consumer service had 12 partitions across 6 consumers, balanced 2-2. A rolling deploy started: consumer 1 drained, consumer 2 took over its partitions. But consumer 2 still had its original 2 partitions in flight when the assignment changed. It briefly had 4 partitions to drain — 2x its capacity.
The drain took 60 seconds. Pods SIGKILL-ed. 30 minutes of message backlog.
Root cause: drain deadline assumed steady-state in-flight count, not the spike during rebalance.
Fix: budget drain time for rebalance-induced spikes. Reduce per-partition work. Use cooperative rebalance.
Story 4: The Transaction Trap¶
A team enabled exactly-once on a service. Drain logic was unchanged. On the first deploy after enabling EOS, drain caused transaction aborts. 10% of transactions discarded their work.
Root cause: drain code aborted in-flight transactions. The team had not changed this when enabling EOS.
Fix: check transaction state in drain; commit if possible, abort if not.
Story 5: The Cascading Cancel¶
A service's drain context was passed to all downstream calls. Downstream calls had their own 30s timeouts, but the drain deadline was 25s; when the drain deadline fired, every call's context was cancelled at once. All in-flight calls failed simultaneously.
Root cause: drain context too aggressive. The downstream had not been told about drain.
Fix: separate drain context from request context. Use shorter sub-deadlines per call.
Drain In Distributed Pipelines¶
A distributed pipeline (Apache Flink-like, Kafka Streams-like) drains stage by stage:
- Source connector stops fetching.
- First stage drains.
- Second stage drains.
- Sink connector commits final state.
Each stage's drain depends on the previous. The pipeline's drain time is the sum of stage drain times.
For Go pipelines built on Kafka:
type Stage interface {
Process(ctx context.Context, in <-chan Message, out chan<- Message) error
Drain(ctx context.Context) error
}
type Pipeline struct {
stages []Stage
}
func (p *Pipeline) Drain(ctx context.Context) error {
for _, s := range p.stages {
if err := s.Drain(ctx); err != nil {
return err
}
}
return nil
}
Sequential drain. Total time is the sum. For long pipelines, this can exceed grace period; mitigations:
- Drain stages in parallel where state is independent (sketch after this list).
- Reduce per-stage work.
- Increase grace period.
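A sketch of the first mitigation using golang.org/x/sync/errgroup; it assumes the Stage interface above and that the stages being drained share no state:
import "golang.org/x/sync/errgroup"

func (p *Pipeline) DrainParallel(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, s := range p.stages {
		s := s // capture loop variable
		g.Go(func() error { return s.Drain(ctx) })
	}
	// Total drain time becomes the max of stage times, not the sum.
	return g.Wait()
}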
Drain And Exactly-Once Theory¶
Exactly-once semantics (EOS) in distributed systems requires:
- Idempotent operations.
- Atomic commit of effects.
- Recovery from failure that respects atomicity.
Drain is one source of "failure": a process exiting voluntarily. If drain mid-transaction is treated as a crash, the system must recover correctly.
Properly designed:
- Transactions are bounded (sub-second to a few seconds).
- Each transaction is independent.
- Recovery replays from the last committed transaction.
- Drain attempts to complete in-flight transactions; aborts otherwise.
EOS is hard. Drain in EOS is harder. The senior + professional levels of drain are mostly about not breaking EOS.
Drain Performance Engineering¶
For high-throughput services, drain becomes a performance problem:
- 100k messages/sec consumer. 10s drain. 1M messages in-flight.
- Each message has its own goroutine. 1M goroutines.
- Drain wait group has 1M entries.
Performance considerations:
- WaitGroup operations are atomic; at 1M concurrent, the cache line bounces.
- Goroutine stacks consume memory (2KB each = 2GB for 1M).
- GC pressure during drain spikes.
Mitigations:
- Use a worker pool with a bounded queue, not goroutine-per-message.
- Track in-flight via a sharded counter (sketch after this list).
- Pre-allocate buffers; reuse via sync.Pool.
- Test drain at peak throughput, not at idle.
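A minimal sketch of a sharded in-flight counter; the shard count and cache-line padding are illustrative. Each worker adds to its own shard, so the hot atomic is not shared:
type ShardedCounter struct {
	shards [64]struct {
		n atomic.Int64
		_ [56]byte // pad to a full cache line to avoid false sharing
	}
}

// Add is called with the worker's index as the shard.
func (c *ShardedCounter) Add(shard int, delta int64) {
	c.shards[shard&63].n.Add(delta)
}

// Sum is what drain polls until it reaches zero.
func (c *ShardedCounter) Sum() int64 {
	var total int64
	for i := range c.shards {
		total += c.shards[i].n.Load()
	}
	return total
}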
A drain that is fast at low load and slow at peak load is a drain bug waiting to be a production incident.
Drain Observability At Scale¶
For a fleet of 1000 pods, drain observability requires (a metrics sketch follows the list):
- Per-pod drain duration metric.
- Aggregate drain duration histogram (across fleet).
- Drain failures counter.
- Drain force-cancellation counter.
- Goroutines-at-drain-start gauge.
- Memory-at-drain-start gauge.
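A sketch of the core instruments with prometheus/client_golang; the metric names are illustrative, not a standard:
var (
	drainDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "drain_duration_seconds",
		Help:    "Time from SIGTERM to clean exit.",
		Buckets: prometheus.LinearBuckets(1, 2, 15), // 1s .. 29s
	})
	drainFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "drain_failures_total",
		Help: "Drains that returned an error.",
	})
	drainForced = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "drain_force_cancellations_total",
		Help: "Drains that hit the deadline and force-cancelled work.",
	})
)

func init() { prometheus.MustRegister(drainDuration, drainFailures, drainForced) }
One caveat: the process exits right after drain, so the final samples survive only if something scrapes them before exit or they are pushed (for example, to a Pushgateway).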
Alert on:
- P99 drain duration > 80% of grace period for 5 min.
- Force-cancellation rate > 0.1% of drains.
- Drain failures > 0.
Dashboards:
- Drain duration over time, by service.
- Drain duration vs version (catch regressions).
- Drain duration vs load.
- Goroutine count delta during drain.
At fleet scale, the dashboards become essential. Individual pod inspection is impossible.
Drain Frameworks Comparison¶
| Framework | Drain primitive | Strengths | Weaknesses |
|---|---|---|---|
| confluent-kafka-go | Rebalance callbacks, manual commits | Mature, used in prod globally | Cgo dependency |
| segmentio/kafka-go | Reader.Close | Pure Go, clean API | Less feature-rich |
| sarama | Consumer groups, Cleanup | Widely used | Quirky API |
| franz-go | Cooperative rebalance callbacks, Close | Modern, best performance | Newer |
| pulsar-client-go | Built-in graceful Close | Apache Pulsar support | Niche |
For new projects, franz-go or confluent-kafka-go are the strongest. Both support drain via standard Go patterns.
Drain Across Cloud Vendors¶
- AWS MSK / Kinesis. Drain is similar to Kafka. Kinesis has different semantics; the shard model affects drain order.
- GCP Pub/Sub. Drain is subscription.Receive ending. Acks must complete.
- Azure Service Bus. Drain via Close on the client; pending receives finish.
- NATS JetStream. Subscription.Drain() is the canonical primitive.
- Apache Pulsar. Consumer.Close() includes drain semantics.
Each cloud has its own naming and quirks. The principle (stop intake, wait, commit, close) is universal.
Drain Anti-Patterns At Professional Level¶
Anti-pattern: Disabling drain to "fix" a deploy¶
A deploy fails. The team disables the drain test "temporarily" to ship. The disable lasts months. Drain rots.
Mitigation: never disable drain tests. If failing, fix the drain.
Anti-pattern: Drain that depends on a remote service being up¶
Drain code that calls a downstream during drain may fail if the downstream is also draining or unreachable.
Mitigation: drain should be local. Communicate drain via in-process state.
Anti-pattern: Skipping commit on drain timeout¶
If drain times out, the team's code may skip the commit "to avoid blocking longer." Result: duplicate processing on next consumer.
Mitigation: always commit what you can, even partial. Better to commit some than none.
Anti-pattern: Drain order coupling¶
Component A's drain reaches into component B. Component B's drain reaches into A. Cycle.
Mitigation: each component drains itself only. Supervisor coordinates.
Anti-pattern: Drain via "kill the goroutine"¶
There is no "kill goroutine" in Go. Code that pretends to kill goroutines (via panics or other tricks) leaks state.
Mitigation: use cooperative cancellation via context.
Anti-pattern: Drain budget without margin¶
Drain budget exactly equals grace period. A slight slowdown causes SIGKILL.
Mitigation: always leave 5-10s margin.
Anti-pattern: Untested drain¶
The most common anti-pattern. Drain code that has never been triggered.
Mitigation: drain test in CI as a deploy gate.
Drain Lessons That Took Years To Learn¶
A list of hard-won lessons:
- Drain budget is a function of P99 latency under load. Not P50.
- Readiness propagation matters more than people think. A 2s sleep saves 10s of drain time.
- Goroutine leaks make drain slower over time. Track and alert.
- Drain order depends on dependency direction, not on construction order. Reverse-construction is a heuristic; verify against dependencies.
- Drain context must be fresh, not derived from cancelled parent. This is the #1 drain bug.
- Drain tests catch more bugs than any other test. Invest in them.
- Drain is the cheapest reliability investment. ROI is enormous.
- Drain quality predicts overall engineering quality. Use it as a barometer.
- Drain failures are interesting incidents. They reveal architectural assumptions.
- Drain teaching scales. One senior who teaches drain can lift a team.
Summary¶
Drain at the professional level is about specific, high-stakes scenarios: Kafka exactly-once, transactional producers, distributed pipelines. The patterns from earlier pages are the foundation; this page is the demanding application.
If you have absorbed this, you can:
- Design drain into a new Kafka consumer service.
- Audit an existing service for drain bugs.
- Lead an incident review on a drain failure.
- Mentor a team into drain-quality discipline.
- Evaluate frameworks and libraries for drain support.
- Advocate at the org level for drain quality.
The remaining files (specification, interview, tasks, find-bug, optimize) provide reference, practice, and assessment. Use them to consolidate.
Drain is now part of your toolkit. Go build systems that drain well.
Extended Section: A Real Kafka Drain Implementation¶
Below is a more complete, working Kafka consumer drain implementation using segmentio/kafka-go. It demonstrates the patterns from this page in code.
package main
import (
"context"
"encoding/json"
"errors"
"log"
"os"
"os/signal"
"sync"
"sync/atomic"
"syscall"
"time"
"github.com/segmentio/kafka-go"
)
type Event struct {
ID string `json:"id"`
Body string `json:"body"`
}
type Consumer struct {
	reader      *kafka.Reader
	processed   sync.Map // map[string]struct{} for idempotency
	inFlight    sync.WaitGroup
	draining    atomic.Bool
	fetchCtx    context.Context
	fetchCancel context.CancelFunc
	workCtx     context.Context // deliberately not derived from the signal context
	workCancel  context.CancelFunc
}
func NewConsumer(brokers []string, topic, group string) *Consumer {
return &Consumer{
reader: kafka.NewReader(kafka.ReaderConfig{
Brokers: brokers,
GroupID: group,
Topic: topic,
MinBytes: 10e3,
MaxBytes: 10e6,
}),
}
}
func (c *Consumer) Start(ctx context.Context, workers int) {
	// fetchCtx derives from the signal context: a signal stops fetching.
	// workCtx is fresh: in-flight work must survive the signal and is
	// bounded by Drain's deadline instead.
	c.fetchCtx, c.fetchCancel = context.WithCancel(ctx)
	c.workCtx, c.workCancel = context.WithCancel(context.Background())
msgs := make(chan kafka.Message, 64)
// fetcher
go func() {
defer close(msgs)
for {
m, err := c.reader.FetchMessage(c.fetchCtx)
if err != nil {
if errors.Is(err, context.Canceled) {
return
}
log.Printf("fetch: %v", err)
continue
}
select {
case <-c.fetchCtx.Done():
return
case msgs <- m:
}
}
}()
// workers
for i := 0; i < workers; i++ {
c.inFlight.Add(1)
go c.worker(i, msgs)
}
}
func (c *Consumer) worker(id int, msgs <-chan kafka.Message) {
defer c.inFlight.Done()
for m := range msgs {
if err := c.process(c.workCtx, m); err != nil {
log.Printf("worker %d: process: %v", id, err)
continue
}
if err := c.reader.CommitMessages(c.workCtx, m); err != nil {
log.Printf("worker %d: commit: %v", id, err)
}
}
}
func (c *Consumer) process(ctx context.Context, m kafka.Message) error {
var e Event
if err := json.Unmarshal(m.Value, &e); err != nil {
return err
}
if _, dup := c.processed.LoadOrStore(e.ID, struct{}{}); dup {
return nil // already processed
}
// simulated work
select {
case <-time.After(50 * time.Millisecond):
case <-ctx.Done():
return ctx.Err()
}
return nil
}
func (c *Consumer) Drain(ctx context.Context) error {
	c.draining.Store(true)
	c.fetchCancel() // stop fetching new messages
	done := make(chan struct{})
	go func() {
		c.inFlight.Wait()
		close(done)
	}()
	select {
	case <-done:
		c.workCancel()
		return c.reader.Close()
	case <-ctx.Done():
		c.workCancel() // deadline hit: force-cancel remaining work
		_ = c.reader.Close()
		return ctx.Err()
	}
}
func main() {
ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer cancel()
c := NewConsumer([]string{"localhost:9092"}, "events", "demo")
c.Start(ctx, 8)
<-ctx.Done()
log.Println("draining")
dctx, dcancel := context.WithTimeout(context.Background(), 25*time.Second)
defer dcancel()
if err := c.Drain(dctx); err != nil {
log.Printf("drain: %v", err)
}
log.Println("exit")
}
Walk through this carefully. Note:
- Two contexts: fetchCtx (derived from the signal context; cancelled on signal or drain, which stops the fetcher) and workCtx (deliberately fresh, so in-flight workers survive the signal; it is force-cancelled only if drain times out).
- The fetcher closes msgs on exit; workers range over it.
- Workers commit after process; failures don't commit (the next consumer will retry).
- Drain stops fetching, waits for workers to drain msgs, then closes the reader.
- The idempotency map filters duplicates from at-least-once delivery.
This is real, working, drain-aware Kafka consumer code. Adapt to your topic, group, and processing logic.
Extended Section: Drain In Pulsar¶
Pulsar's Go client has a similar shape. Drain is via Consumer.Close().
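A minimal sketch, assuming the apache/pulsar-client-go API; the URL, topic, and subscription names are placeholders:
client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
if err != nil {
	log.Fatal(err)
}
defer client.Close()

consumer, err := client.Subscribe(pulsar.ConsumerOptions{
	Topic:            "events",
	SubscriptionName: "demo",
})
if err != nil {
	log.Fatal(err)
}

// ... receive with consumer.Receive(ctx), process, then consumer.Ack(msg) ...

// Drain: stop receiving, let in-flight handlers ack, then Close,
// which flushes pending acks before returning.
consumer.Close()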
For finer control, track in-flight as in the Kafka example. Pulsar's pattern is "ack each message after processing; Close flushes pending acks." Similar to Kafka but with different APIs.
Extended Section: Drain In NATS JetStream¶
NATS JetStream has built-in Drain.
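A minimal sketch with nats.go (the subject and handler are placeholders). Drain stops new deliveries, waits for pending handlers, then removes the subscription:
nc, err := nats.Connect(nats.DefaultURL)
if err != nil {
	log.Fatal(err)
}

js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}

sub, err := js.Subscribe("events", func(m *nats.Msg) {
	// process, then acknowledge
	_ = m.Ack()
})
if err != nil {
	log.Fatal(err)
}

// Per-subscription drain; nc.Drain() drains the whole connection.
if err := sub.Drain(); err != nil {
	log.Print(err)
}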
The NATS library is the cleanest example of drain-as-first-class. Other libraries should aspire to this.
Extended Section: Drain In Kinesis¶
Kinesis is shard-based; drain considers the shard iterator.
On drain:
- Stop calling GetRecords for new batches.
- Process the last fetched batch.
- Checkpoint via DynamoDB (if using KCL pattern).
Kinesis lacks Kafka's built-in offset commit; you implement checkpointing. Drain must checkpoint cleanly.
Extended Section: Drain In SQS¶
SQS is simpler:
Drain:
- Stop calling ReceiveMessage.
- Process and delete the current batch.
- Done.
SQS has no offset; deletion is the only state. Drain is the most straightforward of the message systems.
Extended Section: Drain Across All Of Them¶
Common pattern:
type StreamConsumer interface {
Fetch(ctx context.Context) (Message, error)
Ack(ctx context.Context, m Message) error
Close() error
}
type Drainer struct {
c StreamConsumer
inFlight sync.WaitGroup
cancel context.CancelFunc
}
func (d *Drainer) Drain(ctx context.Context) error {
d.cancel() // stop fetching
done := make(chan struct{})
go func() { d.inFlight.Wait(); close(done) }()
select {
case <-done:
return d.c.Close()
case <-ctx.Done():
_ = d.c.Close()
return ctx.Err()
}
}
This abstraction works for Kafka, Pulsar, NATS, Kinesis, SQS. The specific Fetch and Ack implementations differ; the drain wrapper does not.
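For completeness, a sketch of the intake loop that pairs with Drainer. It assumes the cancel field guards Fetch, and acks use a fresh context so a cancelled fetch context cannot doom the ack:
func (d *Drainer) Run(ctx context.Context, handle func(Message) error) {
	ctx, cancel := context.WithCancel(ctx)
	d.cancel = cancel
	for {
		m, err := d.c.Fetch(ctx)
		if err != nil {
			return // cancelled by Drain, or a fatal fetch error
		}
		d.inFlight.Add(1)
		go func(m Message) {
			defer d.inFlight.Done()
			if handle(m) == nil {
				_ = d.c.Ack(context.Background(), m)
			}
		}(m)
	}
}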
Extended Section: A Professional-Level Checklist¶
For any messaging-driven service, the checklist before production:
- Consumer drain stops fetch and waits for in-flight.
- Workers commit / ack only after successful processing.
- Drain calls Close after all workers exit.
- Idempotency mechanism handles duplicates.
- Transactions (if any) commit or abort cleanly on drain.
- Rebalance hooks implemented for Kafka.
- Drain budget exceeds longest reasonable transaction.
- Drain budget less than grace period.
- Drain test in CI with simulated load.
- Drain metrics emitted; alerts configured.
- Goroutine leak test in CI.
- Drain duration tracked across deploys.
Twelve items. Each takes hours of engineering. The cumulative payoff is a service that ships fearlessly.
Closing Thoughts At Professional Level¶
Drain at the professional level is the discipline of getting messaging right. Kafka, with its rebalance protocol and exactly-once semantics, is the most demanding case. The patterns extend to every other streaming system.
If you can drain a Kafka consumer cleanly with EOS, you can drain anything. The patterns from this page are reusable; the discipline transfers.
Master this, and you can build systems that handle billions of messages per day with no data loss, even across constant deploys, scale events, and rebalances. That is the engineering bar.
Welcome to the deep end. The water is fine.
Appendix A: Drain Behaviour Across Kafka Library Versions¶
A short reference. As Kafka libraries evolve, drain semantics evolve too.
sarama¶
Older versions of sarama had bugs in ConsumerGroup.Close — it could hang on shutdown. Modern versions (1.40+) handle drain properly via Cleanup callbacks.
confluent-kafka-go¶
Built on librdkafka. Drain via consumer.Close(). Cgo means goroutines can leak across the boundary; pay attention to runtime.LockOSThread usage.
kafka-go¶
Pure Go. Drain via reader.Close(). Cleaner than sarama for simple cases.
franz-go¶
Newer, supports cooperative rebalance. Drain via client.Close() and explicit lifecycle methods.
Choose based on your needs. For most new projects in 2026, franz-go is a strong default.
Appendix B: Drain Behaviour Across Pulsar Versions¶
The Apache Pulsar Go client has evolved similarly:
- Older versions had a buggy Close.
- Modern versions (0.10+) drain pending acks before exit.
Always upgrade to the latest minor; drain behaviour improves.
Appendix C: A Performance Note On Drain¶
At very high throughput (100k+ msg/sec), drain may be the bottleneck of pod cycling:
- A 10s drain × 100 pods × 1 deploy/day = 1000s of "drain time" per day.
- At 100k msg/sec per pod, that is 100M messages "in drain" per day.
If drain wastes capacity (idle workers waiting), the throughput cost is real. Optimisations:
- Parallel drain across components.
- Shorter drain budgets when safe.
- Pre-emptive shedding of low-priority work.
This is the senior+ thinking at high throughput.
Appendix D: Drain Failure Modes Recap For Kafka¶
Specific failure modes when drain goes wrong on Kafka:
- Duplicate processing. Offset not committed before disconnect.
- Stuck rebalance. Consumer takes too long in Cleanup.
- Lag spike. Drain interrupts mid-batch; partition stalled.
- Coordinator timeout. Drain longer than session.timeout.ms.
- Producer error on close. Buffered messages not flushed.
- Transaction abort. EOS work discarded.
- Idempotency loss. Producer ID resets; sequence number conflicts.
Each has a specific mitigation. Drain code must be aware of them.
Appendix E: A Final Long Worked Example¶
A complete service: HTTP API accepting orders, Kafka consumer processing them, Kafka producer emitting events, Postgres persisting state. Drain across all of them.
package main
import (
"context"
"database/sql"
"encoding/json"
"errors"
"log"
"net/http"
"os"
"os/signal"
"sync"
"sync/atomic"
"syscall"
"time"
"github.com/segmentio/kafka-go"
_ "github.com/lib/pq"
)
type Order struct {
ID string `json:"id"`
Amount float64 `json:"amount"`
}
type App struct {
	db          *sql.DB
	consumer    *kafka.Reader
	producer    *kafka.Writer
	inFlight    sync.WaitGroup
	draining    atomic.Bool
	fetchCtx    context.Context
	fetchCancel context.CancelFunc
	workCtx     context.Context // fresh; not derived from the signal context
	workCancel  context.CancelFunc
}
func (a *App) Start(ctx context.Context, workers int) {
	a.fetchCtx, a.fetchCancel = context.WithCancel(ctx)
	// workCtx is deliberately fresh: in-flight orders must not be cancelled
	// by the signal that cancels ctx; Drain bounds them instead.
	a.workCtx, a.workCancel = context.WithCancel(context.Background())
	msgs := make(chan kafka.Message, 64)
	go a.fetcher(msgs)
	for i := 0; i < workers; i++ {
		a.inFlight.Add(1)
		go a.worker(a.workCtx, i, msgs)
	}
}
func (a *App) fetcher(msgs chan<- kafka.Message) {
defer close(msgs)
for {
m, err := a.consumer.FetchMessage(a.fetchCtx)
if err != nil {
return
}
select {
case <-a.fetchCtx.Done():
return
case msgs <- m:
}
}
}
func (a *App) worker(ctx context.Context, id int, msgs <-chan kafka.Message) {
defer a.inFlight.Done()
for m := range msgs {
var o Order
if err := json.Unmarshal(m.Value, &o); err != nil {
log.Printf("[%d] decode: %v", id, err)
continue
}
if err := a.processOrder(ctx, o); err != nil {
log.Printf("[%d] process: %v", id, err)
continue
}
if err := a.consumer.CommitMessages(ctx, m); err != nil {
log.Printf("[%d] commit: %v", id, err)
}
}
}
func (a *App) processOrder(ctx context.Context, o Order) error {
tx, err := a.db.BeginTx(ctx, nil)
if err != nil {
return err
}
defer func() { _ = tx.Rollback() }()
if _, err := tx.ExecContext(ctx,
"INSERT INTO orders (id, amount) VALUES ($1, $2) ON CONFLICT DO NOTHING",
o.ID, o.Amount); err != nil {
return err
}
if err := tx.Commit(); err != nil {
return err
}
evt, _ := json.Marshal(o)
return a.producer.WriteMessages(ctx, kafka.Message{
Key: []byte(o.ID),
Value: evt,
})
}
func (a *App) Drain(ctx context.Context) error {
	a.draining.Store(true)
	a.fetchCancel()
	done := make(chan struct{})
	go func() { a.inFlight.Wait(); close(done) }()
	select {
	case <-done:
	case <-ctx.Done():
		a.workCancel() // deadline hit: force-cancel remaining work
		return ctx.Err()
	}
	a.workCancel()
	if err := a.consumer.Close(); err != nil {
		log.Printf("consumer close: %v", err)
	}
	if err := a.producer.Close(); err != nil {
		log.Printf("producer close: %v", err)
	}
	return a.db.Close()
}
func (a *App) Handler() http.Handler {
mux := http.NewServeMux()
mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
if a.draining.Load() {
w.WriteHeader(http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
})
return mux
}
func main() {
ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer cancel()
db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
if err != nil {
log.Fatal(err)
}
consumer := kafka.NewReader(kafka.ReaderConfig{
Brokers: []string{"localhost:9092"},
GroupID: "orders",
Topic: "orders",
})
producer := &kafka.Writer{
Addr: kafka.TCP("localhost:9092"),
Topic: "order_events",
}
app := &App{db: db, consumer: consumer, producer: producer}
app.Start(ctx, 8)
srv := &http.Server{Addr: ":8080", Handler: app.Handler()}
go func() {
err := srv.ListenAndServe()
if !errors.Is(err, http.ErrServerClosed) {
log.Printf("http: %v", err)
}
}()
<-ctx.Done()
log.Println("draining")
start := time.Now()
dctx, dcancel := context.WithTimeout(context.Background(), 25*time.Second)
defer dcancel()
app.draining.Store(true)
time.Sleep(2 * time.Second)
if err := srv.Shutdown(dctx); err != nil {
log.Printf("http shutdown: %v", err)
}
if err := app.Drain(dctx); err != nil {
log.Printf("drain: %v", err)
}
log.Printf("drained in %s", time.Since(start))
}
This is approximately 130 lines. It is a real, working, drain-aware service with Kafka, Postgres, and HTTP. Read it carefully. Identify each pattern from this page. Adapt to your stack.
Final Words¶
You have reached the end of professional-level drain. The patterns are demanding; the discipline is real. Drain at this level is not "graceful shutdown" — it is the foundation of exactly-once messaging in production.
If you can implement this code from memory, with all its subtleties, you have mastered the drain pattern in Go. There is always more to learn (specific cloud quirks, specific framework bugs), but the foundation is solid.
Go ship systems that drain cleanly. The world is full of buggy shutdown code; be the engineer who fixes it.
Appendix F: Drain In Multi-Datacenter Kafka¶
Some organisations run MirrorMaker or equivalent to replicate between Kafka clusters across datacenters. Drain interacts:
- Replicators are themselves consumers + producers.
- Their drain must commit offsets in both clusters.
- A drain that fails in the middle may leave clusters out of sync.
Best practices:
- Replicators have their own drain budgets.
- Monitoring tracks replication lag during drain.
- Lag spikes during deploys are normal; alerts have buffer.
For services running across DCs, drain considers the slower DC's grace period as the bound.
Appendix G: Drain And Schema Evolution¶
Schema evolution in messaging systems (Avro, Protobuf) interacts with drain:
- A new schema version may require both producer and consumer changes.
- During rolling deploy, old and new versions coexist.
- A consumer draining mid-rollout may have a mix of schema versions in flight.
Drain code must handle both schemas. Test with mixed-version traffic.
Appendix H: Drain And Schema Registry¶
If using a schema registry:
- Drain does not need to disconnect from the registry (it is rarely a bottleneck).
- But if registry calls are slow, drain may be delayed.
A drain that calls the registry should bound the call. Cached schemas are immune.
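A sketch of bounding that call (getSchema is a hypothetical stand-in for your registry client):
func schemaDuringDrain(parent context.Context, id int) ([]byte, error) {
	// Never let a slow registry stall drain: bound the lookup.
	ctx, cancel := context.WithTimeout(parent, 2*time.Second)
	defer cancel()
	return getSchema(ctx, id) // hypothetical client call
}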
Appendix I: Drain In Stateful Stream Processors¶
A stateful stream processor (Kafka Streams-like) holds state in RocksDB or similar. Drain must:
- Stop processing input.
- Wait for state writes to complete.
- Flush RocksDB write buffers.
- Snapshot state for next instance.
- Commit input offsets.
- Disconnect.
The state snapshot can be large. Drain budget must accommodate the time to write it.
Mitigations:
- Continuous incremental snapshots (so the final snapshot is small).
- Stand-by replicas (so drain does not need a full snapshot).
- Streaming state migration (next instance reads from this one).
These are advanced patterns; relevant for high-state services.
Appendix J: Drain Cohort¶
A "drain cohort" is a group of pods drained together. Useful for batch deploys.
In Kubernetes, define a PodDisruptionBudget.
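A minimal manifest sketch (the name and labels are placeholders):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  maxUnavailable: 10%
  selector:
    matchLabels:
      app: my-service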
The deploy never disrupts more than 10% at a time. The drain cohort is 10% of the fleet.
For large fleets (100+ pods), cohort drain is essential. Single-pod-at-a-time deploys take too long.
Appendix K: Drain Health Models¶
A pod's health during drain has nuances:
- Liveness: still alive (responding to liveness probes).
- Readiness: not ready for new traffic.
- Healthiness for the cluster: still contributing (e.g., processing in-flight).
During drain, the pod should be: alive yes, ready no, healthy yes-until-drain-complete.
Implementation (sketch after this list):
- Liveness handler: always 200 unless catastrophic.
- Readiness handler: 503 during drain.
- Custom "cluster health" handler: 200 until in-flight is zero, then 503.
A control plane that watches all three can orchestrate sophisticated drain scenarios.
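A sketch of the three handlers; draining and inFlight stand for the same process-wide state used elsewhere on this page:
func registerProbes(mux *http.ServeMux, draining *atomic.Bool, inFlight *atomic.Int64) {
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // alive unless the process is gone
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			w.WriteHeader(http.StatusServiceUnavailable) // no new traffic
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/clusterhealthz", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() && inFlight.Load() == 0 {
			w.WriteHeader(http.StatusServiceUnavailable) // drain complete
			return
		}
		w.WriteHeader(http.StatusOK) // still contributing
	})
}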
Appendix L: Drain Vs Restart Strategy¶
In Kubernetes, Recreate vs RollingUpdate deploy strategies interact with drain:
- Recreate: all old pods drain, then all new pods start. Drain happens in parallel.
- RollingUpdate: one pod drains, one starts, repeat. Drain happens serially.
For exactly-once messaging, RollingUpdate is safer. For stateless services, Recreate may be faster.
Senior engineers choose based on workload characteristics.
Appendix M: Drain And Service Mesh Retries¶
A service mesh that retries requests on 5xx may inadvertently amplify load during drain:
- Pod A drains; returns 503 for new requests.
- Mesh retries on 503 to pod B.
- Pod B is also draining; returns 503.
- Mesh retries to pod C.
If many pods drain simultaneously, this cascades.
Mitigations:
- Mesh respects Retry-After headers.
- Drain returns a non-retryable status (e.g., 503 with no retry header, or 410 Gone).
- Mesh has a retry budget per source pod.
Appendix N: Drain Across Bounded Contexts¶
In a service composed of multiple bounded contexts (sub-services within one binary):
- Each bounded context has its own drain.
- The supervisor coordinates them.
- Inter-context calls are bounded by the drain context.
This is essentially monorepo-internal microservices, drained as a unit.
Appendix O: Drain Of Lambda Functions¶
AWS Lambda functions have their own drain semantics:
- A function invocation is short-lived (max 15 minutes).
- Drain is per-invocation, not per-process.
- The runtime exposes lifecycle events: init, invoke, shutdown.
For Go on Lambda, the snippet below is illustrative pseudocode: real extensions register with the Lambda Extensions API over HTTP (RegisterExtension is not an actual runtime helper), and aws-lambda-go offers a SIGTERM hook via lambda.WithEnableSIGTERM:
runtime.RegisterExtension("graceful-shutdown", func(event LifecycleEvent) {
if event.Type == "shutdown" {
// flush logs, close connections
}
})
Lambda extensions get up to 2 seconds during shutdown. Drain accordingly.
Appendix P: Drain Of Knative Services¶
Knative auto-scales pods. Drain interacts with auto-scaling:
- Scale-down drain follows the standard pattern.
- Scale-to-zero is a long drain (no requests for a while, then drain).
- Cold start after scale-from-zero has no drain — it starts fresh.
For Knative, drain semantics are the same as Kubernetes. The auto-scaler is the orchestrator.
Appendix Q: Drain Beyond Messaging¶
Beyond Kafka and similar, drain applies to:
- WebSocket gateways.
- Long-poll HTTP servers.
- gRPC streaming.
- Server-sent events.
- Custom TCP servers.
The patterns are the same. The protocol-specific details differ.
Appendix R: Drain In Distributed Lock Holders¶
A service holding distributed locks (etcd, Redis Redlock) must release them on drain.
type LockedService struct {
locks []*Lock
}
func (s *LockedService) Drain(ctx context.Context) error {
var wg sync.WaitGroup
for _, l := range s.locks {
l := l
wg.Add(1)
go func() { defer wg.Done(); _ = l.Release(ctx) }()
}
done := make(chan struct{})
go func() { wg.Wait(); close(done) }()
select {
case <-done:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
Release in parallel; total time is one RPC roundtrip.
Appendix S: Drain Of Caching Layers¶
A write-back cache flushes dirty entries on drain.
func (c *Cache) Drain(ctx context.Context) error {
c.mu.RLock()
defer c.mu.RUnlock()
for k, v := range c.dirty {
if err := c.backend.Set(ctx, k, v); err != nil {
return err
}
}
return nil
}
Drain time is proportional to the dirty set size. For large caches, this can be slow.
Mitigations:
- Write-through (always write to backend) avoids drain flush.
- Bounded dirty set size.
- Background flush worker drains the set continuously (sketch below).
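A sketch of the third mitigation, assuming the Cache fields above and []byte values:
func (c *Cache) flushLoop(ctx context.Context) {
	t := time.NewTicker(time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			// Swap the dirty set out under the lock; flush outside it.
			c.mu.Lock()
			dirty := c.dirty
			c.dirty = make(map[string][]byte, len(dirty))
			c.mu.Unlock()
			for k, v := range dirty {
				if err := c.backend.Set(ctx, k, v); err != nil {
					c.mu.Lock()
					c.dirty[k] = v // re-queue for the next tick
					c.mu.Unlock()
				}
			}
		}
	}
}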
Appendix T: Drain Of Search Indices¶
A service with an in-memory search index (Bleve, Tantivy) must persist the index on drain.
Index persist can be slow (gigabytes). Drain budget must accommodate.
For very large indices, replicate continuously to a stand-by; drain becomes a no-op.
Appendix U: Drain Of Background Schedulers¶
A background scheduler (e.g., a job runner that polls a queue) drains by stopping the poll and waiting for in-flight jobs.
type Scheduler struct {
stop chan struct{}
inFlight sync.WaitGroup
}
func (s *Scheduler) Drain(ctx context.Context) error {
close(s.stop)
done := make(chan struct{})
go func() { s.inFlight.Wait(); close(done) }()
select {
case <-done:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
The pattern repeats. Different sources, same shape.
Appendix V: Drain Across Service Discovery¶
A service registered in service discovery (Consul, etcd) must deregister on drain.
func (s *Service) Drain(ctx context.Context) error {
if err := s.discovery.Deregister(ctx); err != nil {
return err
}
// then drain workload
return s.workload.Drain(ctx)
}
Deregister early so new traffic stops. Then drain.
For platforms that auto-deregister on connection close (e.g., Consul via TTL), this is automatic. For those that require explicit deregister, it is a drain step.
Appendix W: Drain Of Background Tasks¶
A "background tasks" library (queue jobs, cron, etc.) drain via standard patterns.
type Tasks struct {
tasks chan Task
workers sync.WaitGroup
closed atomic.Bool
}
func (t *Tasks) Drain(ctx context.Context) error {
if !t.closed.CompareAndSwap(false, true) {
return nil
}
close(t.tasks)
done := make(chan struct{})
go func() { t.workers.Wait(); close(done) }()
select {
case <-done:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
Same pattern as worker pools, with tasks instead of jobs.
Appendix X: Drain Of CRDTs¶
A service replicating CRDTs (conflict-free replicated data types) must:
- Stop accepting writes.
- Propagate pending writes to replicas.
- Achieve eventual consistency before drain ends.
The drain must accommodate the slowest replica's catch-up. For globally-distributed CRDTs, this can be seconds.
Appendix Y: Drain In Edge Compute¶
Edge platforms (Cloudflare Workers, Fastly) have minimal drain:
- Each invocation is isolated.
- No persistent state between invocations.
- Drain is per-invocation, not process-level.
For Go compiled to WASM on the edge, drain is mostly about flushing the current request. The platforms handle cluster-level drain.
Appendix Z: Drain Of Stream Joins¶
A stream join (Kafka Streams-like join) drains by:
- Flushing the join state.
- Committing both source streams' offsets.
- Persisting the join state for resumption.
Joins are expensive to drain. Snapshotting can take seconds for large state. Plan budget accordingly.
Appendix AA: Drain In Multi-Tenant Stream Processors¶
A processor handling N tenants in one binary drains by:
- Stopping intake for all tenants.
- Draining each tenant's pipeline (parallel where independent).
- Committing per-tenant state.
Per-tenant timing varies. The drain budget is governed by the slowest tenant.
Appendix BB: Drain Cost Model¶
A simple cost model for drain:
- Drain time (sec) × pods drained × deploys/day = engineer-time wasted per day.
- For 100 pods, 10s drain, 1 deploy/day: 1,000 sec/day ≈ 17 min/day ≈ 100 hours/year.
If you cost that time at $200 per hour, that is roughly $20,000/year for drain. Across 50 services, about $1,000,000/year.
Optimising drain to 5s saves half. The investment in drain pays back quickly.
Appendix CC: Drain Of Service Mesh Sidecars¶
Service mesh sidecars (Envoy, Linkerd) drain via their own primitives:
- Envoy: terminationDrainDuration.
- Linkerd: linkerd2-proxy graceful shutdown.
These run as separate containers in the pod. Both must drain inside the grace period.
Common pitfall: the application drains in 10s; the sidecar takes 25s; the pod takes 35s. The orchestrator's 30s grace period is exceeded.
Mitigation: align sidecar drain time with application drain time.
Appendix DD: Drain Across Cloud Providers Compared¶
| Provider | Drain mechanism | Default grace | Configurable? |
|---|---|---|---|
| Kubernetes | SIGTERM + terminationGracePeriodSeconds | 30s | Yes, per pod |
| Nomad | SIGINT + kill_timeout | 5s | Yes, per task |
| ECS | SIGTERM + stopTimeout | 30s | Yes, per task |
| Cloud Run | SIGTERM + grace period | 10s | Limited |
| Lambda | Extensions + 2s | 2s | No |
| Cloud Functions | Similar to Lambda | Small | No |
For Go services, target the platform's default grace; adjust if your workload needs more.
Appendix EE: Drain As Documentation¶
A well-drained service is self-documenting. The drain code reveals:
- Which goroutines exist.
- Which resources are owned.
- What the dependencies are.
- How long shutdown should take.
Reading a service's Drain function is a fast way to understand the service. Make it readable.
Appendix FF: Drain In Open Source Projects¶
A few open-source Go projects with exemplary drain code:
- etcd — distributed key-value store; drain across raft consensus.
- caddy — HTTP server; clean drain via Server.Shutdown.
- prometheus — drains its scrape loops cleanly.
- tempo (Grafana traces) — drains the trace ingestion pipeline.
- loki (Grafana logs) — drains log ingestion.
Read their drain code. Each is a master class in the patterns from this and earlier pages.
Appendix GG: A Final Walkthrough¶
Let us close with a walkthrough of an idealised production drain, second by second.
T=0. SIGTERM arrives. signal.NotifyContext's goroutine cancels the root context. main unblocks.
T=0.001. app.draining.Store(true). Readiness flips to 503.
T=2.0. time.Sleep(2 * time.Second) ends. LB has noticed; new traffic stops.
T=2.001. drainCtx created with 23s deadline (25s total minus 2s spent).
T=2.002. srv.Shutdown(drainCtx) called. Listener closes. Active handlers continue.
T=2.500. Last handler finishes. srv.Shutdown returns nil.
T=2.501. pool.Drain(drainCtx) called. Job channel closes.
T=4.200. All workers exit. pool.Drain returns nil.
T=4.201. producer.Flush(drainCtx) called.
T=4.500. Producer buffer flushed. producer.Flush returns nil.
T=4.501. producer.Close() called.
T=4.502. db.Close() called.
T=4.600. All connections closed. db.Close returns nil.
T=4.601. main logs drain complete in 4.6s and returns.
T=4.602. Go runtime exits with code 0.
T=4.700. Orchestrator notices process exited, marks pod terminated.
Total: 4.7 seconds. Well under the 30s grace period. No data loss. No 5xx. The deploy is invisible to customers.
Multiply by 100 pods across the org, four times a day: 400 of these per day. None of them paged anyone.
That is the goal. That is what professional drain looks like.
Appendix HH: Drain Is A Practice¶
Drain is a practice, like writing good tests or doing code reviews. It is not a one-time achievement. The patterns evolve. New libraries emerge. New failure modes appear.
Keep practising. Keep learning. Keep building services that drain cleanly. The cumulative effect on your team — and your career — is enormous.
Appendix II: Closing The Professional Chapter¶
This is the last technical content in this file. The remaining pages — specification, interview, tasks, find-bug, optimize — are reference, practice, and assessment. Use them to consolidate.
If you have absorbed this professional page, you are equipped to:
- Lead drain quality at an organisation.
- Handle the most demanding drain scenarios (EOS Kafka, complex pipelines).
- Audit any Go service for drain bugs.
- Mentor others in drain discipline.
- Contribute drain patterns to open source.
That is the bar. The work is yours.
Appendix JJ: A Final Long Reflection¶
Drain, as I write this, is one of the most underappreciated patterns in production Go. Every team that has built a serious Go service has hit a drain bug. Every senior engineer who has been on-call for a Go service has been paged at 2 AM because drain went wrong.
The good news: drain is teachable. The patterns are finite. The tooling exists. A team that invests in drain for a quarter reaps the benefits for years.
The hard part is the discipline. Drain is not glamorous. It does not appear in feature demos. It does not appear in performance benchmarks. It does not appear in promotion documents — unless you make it appear.
Make it appear. Talk about drain in design reviews. Write drain into your team's bar for "production-ready." Mentor newer engineers into drain awareness. Be the senior engineer who treats drain as first-class.
A team that does this is a team that ships fast and sleeps well. Be that team.
Appendix KK: Drain And The Future¶
What will change in the next 5 years?
- More language-level support for graceful shutdown (Go itself may evolve here).
- More framework-level drain (every framework will have it).
- More platform-level drain (orchestrators will get smarter).
- More observability tooling for drain.
But the patterns will remain. Stop intake, wait for in-flight, close downstream, bounded by deadline. That is the eternal recipe. Future systems will make it easier; the recipe is universal.
Master the recipe now. The tooling catches up.
Appendix LL: Last Words¶
You have completed the four-level journey: junior, middle, senior, professional. Drain is now in your toolkit.
Go build. Go ship. Go drain cleanly.
The work continues. The systems we leave behind us are better for it.
Appendix MM: A Last Practical Note¶
The most useful single thing you can do tomorrow: audit one service for drain quality.
Pick a service. Spend 30 minutes. Apply the 10-point checklist from the senior page. Note what is missing. Write a follow-up task.
That single audit, repeated weekly, lifts a service's drain quality over a quarter. Repeated org-wide, lifts the whole org.
Start tomorrow. Pick a service. Spend 30 minutes.
The rest follows.
Appendix NN: Deep Dive — Kafka Rebalance Internals¶
To drain Kafka consumers correctly, you need to understand the rebalance protocol at the broker level.
The group coordinator¶
Each consumer group has a coordinator — one of the brokers, chosen via a hash of the group ID. The coordinator:
- Tracks group membership.
- Assigns partitions to members.
- Tracks heartbeats and session timeouts.
- Initiates rebalances when membership changes.
A consumer's "drain" interacts with the coordinator in two ways:
- When the consumer's session expires (no heartbeat), the coordinator marks it dead and rebalances.
- When the consumer cleanly leaves the group (a LeaveGroup request), the coordinator rebalances immediately.
A drain should do (2) — call LeaveGroup explicitly. Most Go libraries do this in Close().
Generation IDs¶
Each rebalance produces a new generation ID. Consumers that miss a heartbeat may have a stale generation and be rejected. Drain interacts:
- Long drain may cause heartbeats to lag.
- The consumer is then booted from the group.
- Any commits with the stale generation are rejected.
Mitigation: keep heartbeats running during drain. Most libraries do this on a background goroutine that runs until Close.
Static vs dynamic membership¶
Static membership (Kafka 2.3+) lets a consumer keep its partition assignment across restarts (up to session.timeout.ms). A draining consumer:
- Shuts down without sending LeaveGroup (static members skip it, so the coordinator does not rebalance immediately).
- A replacement with the same group.instance.id reclaims the assignment within the session timeout.
This is useful for stateful consumers (stream processors) where the new instance can pick up from the same state.
Range vs round-robin vs sticky assignment¶
The assignment strategy affects drain:
- Range: contiguous partitions per consumer. Drain triggers contiguous reassignment.
- Round-robin: spread across consumers. Drain triggers wider reassignment.
- Sticky: minimise movement on rebalance. Drain only moves the leaving consumer's partitions.
For drain, sticky is best — least disruption. Use it when available.
Cooperative rebalance protocol¶
The cooperative rebalance protocol (CooperativeStickyAssignor in Java) breaks rebalance into two phases:
- Revoke phase: each consumer revokes only the partitions it is about to lose. Others continue.
- Assign phase: new assignments are applied.
The Go ecosystem is catching up: franz-go supports cooperative; older libraries default to eager.
Cooperative changes drain semantics:
- Drain may revoke only some partitions, not all.
- The consumer must drain those partitions specifically.
- Other partitions continue uninterrupted.
This requires per-partition drain logic.
Appendix OO: Per-Partition Drain Implementation¶
For cooperative rebalance, drain is per-partition.
type PartitionRunner struct {
partition int32
in chan kafka.Message
wg sync.WaitGroup
committed atomic.Int64
}
type Consumer struct {
mu sync.Mutex
runners map[int32]*PartitionRunner
commit func(p int32, offset int64) error
}
func (c *Consumer) OnPartitionsRevoked(parts []int32) error {
for _, p := range parts {
c.mu.Lock()
r := c.runners[p]
delete(c.runners, p)
c.mu.Unlock()
if r == nil {
continue
}
close(r.in)
r.wg.Wait()
if err := c.commit(p, r.committed.Load()); err != nil {
log.Printf("commit %d: %v", p, err)
}
}
return nil
}
func (c *Consumer) OnPartitionsAssigned(parts []int32) error {
c.mu.Lock()
defer c.mu.Unlock()
for _, p := range parts {
r := &PartitionRunner{
partition: p,
in: make(chan kafka.Message, 16),
}
c.runners[p] = r
r.wg.Add(1)
go c.runPartition(r)
}
return nil
}
func (c *Consumer) runPartition(r *PartitionRunner) {
defer r.wg.Done()
for msg := range r.in {
_ = c.process(msg)
r.committed.Store(msg.Offset + 1)
}
}
The per-partition runner has its own goroutine, channel, and offset state. Revocation closes the channel; the runner drains its buffer and exits; the offset is committed.
This is more code than the simple consumer, but it handles cooperative rebalance correctly.
Appendix PP: Drain Order Within A Kafka Consumer¶
Inside a single Kafka consumer, drain order matters:
- Stop fetcher. No new messages enter the system.
- Drain per-partition queues. Each partition's in-flight work completes.
- Commit offsets per partition. The highest processed offset is recorded.
- Heartbeat one last time. The coordinator sees us as alive until the last moment.
- LeaveGroup. Tell the coordinator we are leaving cleanly.
- Close connection. TCP-level shutdown.
Skipping any step risks duplicates or lag.
Appendix QQ: Drain Error Recovery¶
Drain can fail. Strategies:
- Idempotent re-drain. Calling drain again after a failure tries again.
- Partial-state recovery. Note which components drained; skip them on retry.
- Force-cleanup. If clean drain fails, close hard.
func (s *Service) DrainWithRetry(ctx context.Context, retries int) error {
for i := 0; i < retries; i++ {
if err := s.Drain(ctx); err == nil {
return nil
}
}
s.ForceClose()
return errors.New("drain failed; force-closed")
}
For most services, one drain attempt is enough. Retries are useful for transient errors (network blip during commit).
Appendix RR: Drain And End-to-End Latency¶
Drain affects end-to-end latency:
- During drain, in-flight requests have an upper bound (the drain deadline).
- Slow requests may be force-cancelled.
- The P99 of "drain-affected" requests is higher.
Track P99 latency by deploy time. A spike during deploys indicates drain pressure.
Appendix SS: Drain And SLA Compliance¶
Service-Level Agreements often specify uptime (99.9%, 99.99%). Drain affects uptime:
- A drain that returns 5xx counts as downtime.
- A drain that drops requests counts as downtime.
- Frequent deploys with drain issues accumulate downtime.
Compute the impact: drain 5xx rate × deploy frequency × pod-time = uptime loss.
For a 99.95% SLA, this matters. Drain quality is part of compliance.
Appendix TT: Drain And Cost Optimisation¶
Drain consumes capacity:
- A drain pod is not serving full traffic.
- The replacement is starting up.
- For minutes, the cluster has reduced capacity.
For cost-optimised deployments (running near capacity), drain time directly affects ability to deploy. If drain is 30s and you cannot tolerate any capacity loss, you cannot deploy.
Mitigation: keep drain short; or over-provision modestly during deploys.
Appendix UU: Drain And Auto-Scaling¶
Auto-scaling interacts with drain:
- Scale-down drains existing pods.
- Scale-up replaces drained capacity quickly.
- Rapid scale changes can amplify drain pressure.
For latency-sensitive services, scale changes should be slow. For throughput services, fast scale changes are fine.
Drain budget must accommodate scale-down events. If you scale down 50% of pods at once, drain handles 2x normal load briefly.
Appendix VV: Drain Comparison Across Languages — Long Version¶
| Language | Drain primitive | Notes |
|---|---|---|
| Go | signal.NotifyContext + context.WithTimeout | Native; idiomatic |
| Rust | tokio::signal + select! macros | Async-first |
| Java | Runtime.addShutdownHook + ExecutorService.shutdown | JVM-heavy |
| Python (asyncio) | signal.add_signal_handler + Server.wait_closed | Coroutine-based |
| Node.js | process.on('SIGTERM', ...) + server.close() | Callback-based |
| C# | IHostedService.StopAsync | Framework-level |
| Erlang/Elixir | OTP supervisor shutdown spec | First-class |
| Ruby | Signal.trap + various library-specific | Less standardised |
| PHP | Less relevant; FPM handles per-request | n/a for long-running |
Go's drain story is solid. Erlang is even cleaner. Java/.NET are more verbose but capable.
When evaluating a polyglot stack, ensure each language is drainable. A single weak link is a single point of failure.
Appendix WW: Drain And Performance Testing¶
A performance test that excludes drain misses key data. Include drain in the test:
# Load for 60 seconds.
vegeta attack -duration 60s -rate 1000 < targets > results
sleep 30
kill -TERM $SERVER_PID
wait $SERVER_PID
# Verify exit code 0
# Verify zero 5xx in last 10 seconds of result file
The "last 10 seconds" is the drain window. Anomalies there are drain bugs.
Appendix XX: Drain And Chaos Engineering¶
Inject random SIGTERM during sustained load:
for i in {1..100}; do
./service &
PID=$!
SLEEP_TIME=$(( (RANDOM % 30) + 5 ))
sleep $SLEEP_TIME
kill -TERM $PID
wait $PID
if [ $? -ne 0 ]; then
echo "iter $i failed"
exit 1
fi
done
100 iterations of random drain. Any failure is a bug.
This is a poor man's chaos test. Production teams use tools like Chaos Mesh or Litmus for more sophisticated fault injection.
Appendix YY: Drain Of Background Schedulers — Cron-like¶
A cron-like scheduler in Go drains by:
- Stopping the tick loop.
- Letting any running job complete (or force-cancel on deadline).
- Persisting any state (last-run-time).
type Cron struct {
jobs []*Job
stop chan struct{}
running sync.WaitGroup
}
func (c *Cron) Run(ctx context.Context) {
t := time.NewTicker(time.Minute)
defer t.Stop()
for {
select {
case <-c.stop:
return
case <-ctx.Done():
return
case now := <-t.C:
for _, j := range c.jobs {
if j.Due(now) {
c.running.Add(1)
go func(j *Job) { defer c.running.Done(); j.Run(ctx) }(j)
}
}
}
}
}
func (c *Cron) Drain(ctx context.Context) error {
close(c.stop)
done := make(chan struct{})
go func() { c.running.Wait(); close(done) }()
select {
case <-done:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
Cron drain is straightforward. The complexity is in long-running jobs that may exceed drain budget.
Appendix ZZ: Drain Of Cron Jobs With Long Runners¶
If a cron job takes 5 minutes and drain budget is 25 seconds, you have a problem. Options:
- Persist the long job's state; resume on next pod.
- Run long jobs in a separate pod with longer grace.
- Skip the long job during drain.
Choose based on the job's criticality. A daily report can be skipped; a billing run cannot.
Appendix AAA: Drain Library Survey¶
Open-source Go libraries for drain helpers:
- golang.org/x/sync/errgroup — error groups.
- go.uber.org/fx — dependency injection with lifecycle hooks.
- github.com/oklog/run — concurrent actors.
- github.com/heptiolabs/healthcheck — health endpoints.
None of them are full drain frameworks. They provide pieces. Compose them.
For org-wide drain, build your own framework on top of these.
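As one illustration of composing the pieces (a sketch, not a framework), github.com/oklog/run can tie a signal handler and an HTTP server into one coordinated shutdown; run.SignalHandler assumes a reasonably recent version of that library:
package main

import (
	"context"
	"net/http"
	"os"
	"syscall"
	"time"

	"github.com/oklog/run"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	var g run.Group
	// Actor 1: the HTTP server; its interrupt function drains with a bound.
	g.Add(func() error {
		return srv.ListenAndServe()
	}, func(error) {
		ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
		defer cancel()
		_ = srv.Shutdown(ctx)
	})
	// Actor 2: signal handling; this actor returns when SIGTERM/SIGINT arrives.
	g.Add(run.SignalHandler(context.Background(), os.Interrupt, syscall.SIGTERM))
	_ = g.Run() // the first actor to exit interrupts the others
}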
Appendix BBB: Drain Framework Anti-Patterns¶
Some framework anti-patterns:
- Framework that requires import _ "github.com/.../init". Init runs at import time; no control.
- Framework that catches SIGTERM automatically. Then your main cannot control drain.
- Framework that prints to stdout. Pollutes logs.
- Framework that doesn't expose a drain-duration metric. Cannot tune.
- Framework that doesn't support custom drain logic. Cannot adapt.
A good drain framework is opt-in, configurable, observable. Evaluate before adopting.
Appendix CCC: Building A Production Drain Library¶
The pseudocode for a production drain library:
package drain
import (
"context"
"sync"
"time"
)
type Drainable interface {
Name() string
Drain(ctx context.Context) error
}
type Manager struct {
mu sync.Mutex
items []Drainable
timeout time.Duration
onStart []func()
onEnd []func(time.Duration, error)
}
func New(timeout time.Duration) *Manager {
return &Manager{timeout: timeout}
}
func (m *Manager) Register(d Drainable) {
m.mu.Lock()
defer m.mu.Unlock()
m.items = append(m.items, d)
}
func (m *Manager) OnStart(fn func()) {
m.mu.Lock()
defer m.mu.Unlock()
m.onStart = append(m.onStart, fn)
}
func (m *Manager) OnEnd(fn func(time.Duration, error)) {
m.mu.Lock()
defer m.mu.Unlock()
m.onEnd = append(m.onEnd, fn)
}
func (m *Manager) Drain() error {
	m.mu.Lock()
	items := append([]Drainable(nil), m.items...) // snapshot under the lock; Register may race
	onStart, onEnd := m.onStart, m.onEnd
	m.mu.Unlock()
	for _, fn := range onStart {
		fn()
	}
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), m.timeout)
	defer cancel()
	var firstErr error
	for i := len(items) - 1; i >= 0; i-- { // reverse registration order
		if err := items[i].Drain(ctx); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	elapsed := time.Since(start)
	for _, fn := range onEnd {
		fn(elapsed, firstErr)
	}
	return firstErr
}
50 lines. Add metrics integration, tracing, structured logging — you have a production library.
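Usage might look like this, with a hypothetical adapter wrapping an http.Server as a Drainable:
type httpDrainable struct{ srv *http.Server }

func (h httpDrainable) Name() string                    { return "http" }
func (h httpDrainable) Drain(ctx context.Context) error { return h.srv.Shutdown(ctx) }

m := drain.New(25 * time.Second)
m.Register(httpDrainable{srv: srv})
m.OnEnd(func(d time.Duration, err error) { log.Printf("drain took %s (err=%v)", d, err) })
// on SIGTERM:
_ = m.Drain()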
Appendix DDD: Drain And Operating System Limits¶
OS limits affect drain:
- File descriptor limits (ulimit -n).
- Process count limits.
- Memory limits.
A pod hitting limits may fail to drain (cannot open file to write snapshot). Pre-flight check these.
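A pre-flight check is cheap. A Linux-oriented sketch using the syscall package (the 1024 threshold is an arbitrary example):
// checkFDHeadroom warns when the process is close to its fd limit,
// which would make snapshot writes during drain fail.
func checkFDHeadroom() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return
	}
	if rl.Cur < 1024 {
		log.Printf("low fd limit: %d; drain snapshots may fail", rl.Cur)
	}
}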
Appendix EEE: Drain And Container Runtimes¶
Container runtime (containerd, Docker, CRI-O) controls signal delivery:
- Most runtimes forward SIGTERM to PID 1 in the container.
- PID 1 must handle signals correctly.
A Go binary at PID 1 handles SIGTERM via signal.NotifyContext. But shell scripts wrapping the binary may not forward signals. Always run Go binaries as PID 1 in containers, not via wrapper scripts.
If you need a wrapper script, use exec to replace the shell with the binary:
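#!/bin/sh
# wrapper.sh: do any setup here, then replace the shell with the binary.
# (/app/service is an illustrative path.)
exec /app/service "$@"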
exec replaces the shell with the Go binary; signals go directly to Go.
Appendix FFF: Drain And Init Systems¶
In multi-process pods, an init system manages process supervision:
- tini is the most common in containers.
- dumb-init is similar.
- Some use s6-overlay.
These forward signals correctly and reap zombie processes. Use one of them if you have multiple processes in a pod.
For single-process pods (the typical case), no init system is needed.
Appendix GGG: Drain And Cgroup v2¶
Cgroups manage resource limits. On cgroup v2, the OOM killer may target the entire group, killing all processes at once. This interacts with drain:
- An OOM in one process kills the whole pod.
- Drain has no time to run.
Mitigation: monitor memory; drain pre-emptively.
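A sketch of pre-emptive drain, assuming cgroup v2's memory.current file at its usual mount point and a hypothetical triggerDrain callback:
// watchMemory polls the cgroup v2 usage file and triggers drain
// before the OOM killer can act. Path and 90% threshold are assumptions.
func watchMemory(limit uint64, triggerDrain func()) {
	for range time.Tick(5 * time.Second) {
		b, err := os.ReadFile("/sys/fs/cgroup/memory.current")
		if err != nil {
			return
		}
		cur, _ := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
		if cur > limit*9/10 { // above 90% of the limit: drain pre-emptively
			triggerDrain()
			return
		}
	}
}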
Appendix HHH: Drain And Linux Capabilities¶
Drain code that closes file descriptors does not need elevated privileges. Drain code that sends signals to other processes (for example, across a process group) may.
Audit your drain code for privilege requirements. Run with the minimum.
Appendix III: Drain And Networking¶
Drain interacts with networking at several layers:
- TCP: open connections close; SYN packets get RST after listener closes.
- HTTP: in-flight requests complete; new requests get connection refused.
- HTTP/2: GOAWAY frames signal "no new streams."
- gRPC: similar to HTTP/2 plus framing.
- WebSocket: close frames signal "goodbye."
Each layer has its own "I'm draining" mechanism. Use the highest one available; it is the most informative.
Appendix JJJ: Drain Of Long-Lived TCP Connections¶
A TCP server with long-lived connections (e.g., a chat server) drains by:
- Stopping Accept.
- Sending an application-level "shutdown" message to each connection.
- Waiting for clients to disconnect.
- Force-closing remaining connections.
type ChatServer struct {
ln net.Listener
conns sync.Map
wg sync.WaitGroup
}
func (s *ChatServer) Drain(ctx context.Context) error {
_ = s.ln.Close()
s.conns.Range(func(_, c any) bool {
_, _ = c.(net.Conn).Write([]byte("SHUTDOWN\n"))
return true
})
done := make(chan struct{})
go func() { s.wg.Wait(); close(done) }()
select {
case <-done:
return nil
case <-ctx.Done():
s.conns.Range(func(_, c any) bool {
_ = c.(net.Conn).Close()
return true
})
<-done
return ctx.Err()
}
}
Application-level shutdown messages let clients reconnect to another pod. Force-close is the deadline fallback.
Appendix KKK: Drain Of WebSocket Hubs¶
A WebSocket hub broadcasts to subscribers. Drain:
- Stop accepting new subscriptions.
- Send "going-away" close frame to each.
- Wait for clients to disconnect (with a deadline).
- Force-close.
type Hub struct {
mu sync.RWMutex
conns map[*websocket.Conn]struct{}
draining atomic.Bool
}
func (h *Hub) Drain(ctx context.Context) error {
	h.draining.Store(true)
	h.mu.RLock()
	for c := range h.conns {
		_ = c.WriteControl(
			websocket.CloseMessage,
			websocket.FormatCloseMessage(websocket.CloseGoingAway, "server shutdown"),
			time.Now().Add(time.Second),
		)
	}
	h.mu.RUnlock()
	// Poll until clients disconnect; force-close on the deadline or on ctx.
	deadline := time.After(5 * time.Second)
	for {
		h.mu.RLock()
		count := len(h.conns)
		h.mu.RUnlock()
		if count == 0 {
			return nil
		}
		select {
		case <-deadline: // clients had their window; force-close
			h.closeAll()
			return nil
		case <-ctx.Done():
			h.closeAll()
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}

func (h *Hub) closeAll() {
	h.mu.Lock()
	for c := range h.conns {
		_ = c.Close()
	}
	h.mu.Unlock()
}
5-second client-disconnect window. After that, force-close. The client is responsible for reconnecting.
Appendix LLL: Drain Of Server-Sent Events¶
SSE is one-way streaming over HTTP. Drain:
- Send a "shutting down" event.
- Close the response.
- The client reconnects per SSE spec.
func (s *SSEHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
if s.draining.Load() {
fmt.Fprintln(w, "event: shutdown\ndata: bye\n")
return
}
// normal SSE handling
}
SSE makes drain easy because of built-in reconnection.
Appendix MMM: Drain Of UDP Servers¶
UDP is connectionless. There is nothing to "wait for in-flight."
Drain for UDP servers is mostly: stop reading; let outstanding processing finish.
type UDPServer struct {
conn *net.UDPConn
wg sync.WaitGroup
}
func (s *UDPServer) Drain(ctx context.Context) error {
_ = s.conn.Close() // breaks ReadFromUDP loop
done := make(chan struct{})
go func() { s.wg.Wait(); close(done) }()
select {
case <-done:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
Simpler than TCP because no per-connection state.
Appendix NNN: Drain Of QUIC Servers¶
QUIC (HTTP/3) drains via:
- Listener.Close stops new connections.
- Active connections get a CONNECTION_CLOSE frame with code 0 (no error).
- Streams can finish or be cancelled.
Use the library's CloseGracefully if available.
Appendix OOO: Drain Of gRPC With Bidirectional Streaming¶
Bidirectional gRPC streams are the trickiest:
- Server sends and receives concurrently.
- Drain must complete or cancel both directions.
func (s *Server) Chat(stream pb.Chat_ChatServer) error {
ctx := stream.Context()
send := make(chan *pb.Msg, 16)
	go func() {
		for {
			msg, err := stream.Recv()
			if err != nil {
				return
			}
			select {
			case send <- msg:
			case <-ctx.Done(): // don't leak this goroutine when the handler returns
				return
			}
		}
	}()
for {
select {
case <-ctx.Done():
return status.FromContextError(ctx.Err()).Err()
case <-s.draining:
return status.Error(codes.Unavailable, "draining")
case msg := <-send:
if err := stream.Send(transform(msg)); err != nil {
return err
}
}
}
}
A draining channel on the server signals "close streams now." The handler returns; gRPC closes the stream.
Appendix PPP: Drain Of Service With Many Concurrent Streams¶
For a service with thousands of concurrent streams, draining all at once is heavy:
- Each stream sends a close frame (CPU).
- The kernel handles thousands of TCP closes.
- Memory pressure spikes.
Mitigation: throttle close rate.
sem := make(chan struct{}, 10)
var wg sync.WaitGroup
for _, s := range streams {
	sem <- struct{}{}
	wg.Add(1)
	go func(s *Stream) {
		defer wg.Done()
		defer func() { <-sem }()
		s.Close()
	}(s)
}
wg.Wait() // don't finish the drain while closes are still in flight
10 concurrent closes at a time. Smoother resource usage.
Appendix QQQ: Drain Telemetry For Streaming Services¶
Specific metrics for streaming:
- streams_active (gauge).
- streams_drain_duration_seconds (histogram).
- streams_force_closed_total (counter).
- streams_goaway_sent_total (counter).
A dashboard with these tells you streaming drain health.
Appendix RRR: Drain Of Database Drivers¶
Database drivers in Go:
- database/sql.DB.Close blocks until in-use connections return.
- Some drivers have cancellation; others don't.
- Connection pools may have their own drain.
Pattern (a sketch): close with a bounded wait.
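done := make(chan struct{})
go func() {
	_ = db.Close() // blocks until in-use connections are returned to the pool
	close(done)
}()
select {
case <-done:
case <-ctx.Done():
	log.Printf("db close timed out: %v", ctx.Err())
}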
If db.Close hangs, you have a connection leak — a worker holding a connection without returning it.
Appendix SSS: Drain And Long Queries¶
A long SQL query during drain:
- Blocks the connection.
- May exceed drain budget.
- Cancellation depends on driver support.
Set statement_timeout (Postgres) or equivalent to bound query duration. With a 10s statement timeout, any query running longer is killed by the database, and the drain budget is safe.
Appendix TTT: Drain And Database Transactions¶
A transaction in flight at drain:
- Holds row locks.
- Blocks other transactions.
- Rolls back on context cancel (if driver supports).
Mitigation: short transactions. Drain attempts to commit; aborts cleanly on cancel.
Appendix UUU: Drain And Connection Pools¶
Connection pools (database/sql, custom HTTP pools) drain by:
- Stopping new acquires.
- Waiting for returns.
- Closing idle connections.
For database/sql, the relevant knobs (a sketch; these are real database/sql methods):
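// During drain, shrink the pool so connections close as soon as they return.
db.SetMaxIdleConns(0)              // keep no idle connections around
db.SetConnMaxLifetime(time.Second) // retire returned connections almost immediately
db.SetConnMaxIdleTime(time.Second) // and idle ones too (Go 1.15+)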
These accelerate connection close. Useful for fast drain.
Appendix VVV: Drain Pattern Visualisation¶
Time →
intake   ████████████░░░░░░░░░░░░░░░░░░ ← drops at SIGTERM
http     ████████████████░░░░░░░░░░░░░░ ← drains until handlers done
workers  ░░██████████████████░░░░░░░░░░ ← drains queue
producer ░░░░░░░░░░██████████░░░░░░░░░░ ← flushes buffer
db       ░░░░░░░░░░░░░░██████░░░░░░░░░░ ← closes connections
metrics  ░░░░░░░░░░░░░░░░░░██░░░░░░░░░░ ← flushes last
         ▲ SIGTERM         ↑ process exit    ▲ grace period ends
Each component drains at its own pace. The sum fits in the grace period.
Appendix WWW: A Lengthy Final Reflection¶
We have covered a lot. Let me close with a reflection on why drain matters at this professional level.
When I was a junior engineer, I shipped my first Go service. It had no drain. Every deploy dropped a handful of requests. I did not even notice — they were buried in normal noise.
A senior engineer at the company saw the deploy logs and pulled me aside. He showed me the 5xx spikes on the dashboard. "Each one of these," he said, "is a customer who saw an error. You wouldn't accept that in your own code if you knew."
That conversation changed me. I learned drain. I implemented it across the service. I wrote tests. The next deploy was clean. The deploy after that, too. The 5xx noise during deploys disappeared.
Eight years later, I mentor junior engineers the same way. The drain pattern is a teaching opportunity. It is also a litmus test for engineering maturity.
If you are reading this far, you have invested significant time in learning drain. That investment pays off for years. Every clean deploy. Every uninterrupted customer session. Every quiet Sunday evening when your phone does not vibrate.
Drain is the discipline of leaving cleanly. In a world that often celebrates "move fast and break things," drain is the opposite ethos: move fast, but leave cleanly. Build systems that gracefully retire.
That is the professional view. That is what this page has been about.
Appendix XXX: One More Final Thought¶
The patterns in this file will not all apply to your service today. Most will. Some will not.
The discipline is to read with judgement. Pick the patterns you need. Skip the ones you don't. Return when your service has grown into them.
Drain is a journey, not a destination. Your service evolves; its drain evolves. The patterns are tools; the wisdom is in using the right tool at the right time.
You have the tools. Use them well.
Appendix YYY: Truly Final Words¶
Welcome to professional-level drain. The patterns are yours. The discipline is yours. The systems you build will be better for it.
The remaining files in this section — specification, interview, tasks, find-bug, optimize — are reference material. Use them as you need.
Build well. Drain well. Ship well.
Go.
Appendix ZZZ: A Long Technical Deep Dive — Sarama Consumer Group Drain¶
Let us deep-dive into draining a sarama.ConsumerGroup consumer, which remains one of the most-used Kafka libraries in Go.
package main
import (
"context"
"errors"
"log"
"os"
"os/signal"
"sync"
"syscall"
"time"
"github.com/IBM/sarama"
)
type Handler struct {
mu sync.Mutex
inFlight sync.WaitGroup
process func(context.Context, *sarama.ConsumerMessage) error
}
func (h *Handler) Setup(s sarama.ConsumerGroupSession) error {
log.Printf("setup: claims=%v", s.Claims())
return nil
}
func (h *Handler) Cleanup(s sarama.ConsumerGroupSession) error {
log.Printf("cleanup: waiting for in-flight")
done := make(chan struct{})
go func() {
h.inFlight.Wait()
close(done)
}()
select {
case <-done:
log.Printf("cleanup: in-flight complete")
case <-time.After(15 * time.Second):
log.Printf("cleanup: timeout, returning anyway")
}
return nil
}
func (h *Handler) ConsumeClaim(s sarama.ConsumerGroupSession,
c sarama.ConsumerGroupClaim) error {
for msg := range c.Messages() {
h.inFlight.Add(1)
go func(m *sarama.ConsumerMessage) {
defer h.inFlight.Done()
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := h.process(ctx, m); err != nil {
log.Printf("process: %v", err)
return
}
s.MarkMessage(m, "")
}(msg)
}
return nil
}
func main() {
ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer cancel()
cfg := sarama.NewConfig()
cfg.Version = sarama.V3_5_0_0
cfg.Consumer.Group.Rebalance.Strategy = sarama.NewBalanceStrategyRoundRobin()
cfg.Consumer.Offsets.Initial = sarama.OffsetOldest
client, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "demo", cfg)
if err != nil {
log.Fatalf("new: %v", err)
}
h := &Handler{
process: func(ctx context.Context, m *sarama.ConsumerMessage) error {
select {
case <-time.After(100 * time.Millisecond):
case <-ctx.Done():
return ctx.Err()
}
log.Printf("processed: p=%d o=%d", m.Partition, m.Offset)
return nil
},
}
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
defer wg.Done()
for {
if err := client.Consume(ctx, []string{"events"}, h); err != nil {
if errors.Is(err, sarama.ErrClosedConsumerGroup) || ctx.Err() != nil {
return
}
log.Printf("consume: %v", err)
}
}
}()
<-ctx.Done()
log.Printf("draining")
dctx, dcancel := context.WithTimeout(context.Background(), 25*time.Second)
defer dcancel()
closeDone := make(chan struct{})
go func() {
_ = client.Close()
close(closeDone)
}()
select {
case <-closeDone:
case <-dctx.Done():
log.Printf("close timeout")
}
wg.Wait()
log.Printf("clean exit")
}
Walk through this:
- Handler.Setup runs once per claim assignment.
- Handler.Cleanup runs once per claim revocation; this is where drain logic lives.
- Handler.ConsumeClaim processes each message; spawns a goroutine per message.
- The inFlight wait group tracks message processing.
- Cleanup waits up to 15 seconds for in-flight work to complete; longer than that and the rebalance moves on.
- MarkMessage records the offset; sarama commits periodically.
- The main loop catches the signal, calls client.Close, and waits for the consume loop to exit.
Note the 15-second timeout in Cleanup: the revocation work must fit within cfg.Consumer.Group.Rebalance.Timeout (60s by default in sarama), not the 10s Session.Timeout; the session stays alive while the member rebalances. Tune the Cleanup wait and the rebalance timeout together.
Appendix AAAA: Drain With Sarama's SyncProducer And AsyncProducer¶
sarama.SyncProducer: each Send blocks until ack. Drain is simple — just call Close().
sarama.AsyncProducer: messages queued; ack arrives later. Drain:
producer.AsyncClose() // closes the input side; flushes, then closes Successes and Errors
var wg sync.WaitGroup
wg.Add(2)
go func() {
	defer wg.Done()
	for range producer.Successes() {
	}
}()
go func() {
	defer wg.Done()
	for err := range producer.Errors() {
		log.Printf("produce: %v", err)
	}
}()
wg.Wait()
AsyncClose is non-blocking; drain the Successes and Errors channels yourself, and do it concurrently: a backlog on one channel can otherwise stall the flush and deadlock the drain.
Appendix BBBB: Drain With franz-go¶
franz-go is a modern Go Kafka library. Drain:
cl, _ := kgo.NewClient(/* opts */)
defer cl.Close() // synchronous, drains correctly
cl.PollFetches(ctx) // returns on ctx cancel
Close is well-behaved: drains, commits offsets, leaves group. Pass a context-aware PollFetches for the consume loop.
For producers, cl.ProduceSync(ctx, ...) blocks; cl.Produce(ctx, ..., cb) is async with callback. Drain via cl.Flush(ctx).
Appendix CCCC: Drain Of Transactional Producer With franz-go¶
cl, _ := kgo.NewClient(
kgo.TransactionalID("my-app"),
kgo.RequiredAcks(kgo.AllISRAcks()),
)
if err := cl.BeginTransaction(); err != nil { ... }
cl.Produce(ctx, &kgo.Record{...}, nil)
if err := cl.EndTransaction(ctx, kgo.TryCommit); err != nil { ... }
Drain: if a transaction is open, attempt commit. On failure, abort.
func drainTx(ctx context.Context, cl *kgo.Client) error {
	if cl.InTransaction() {
		if err := cl.EndTransaction(ctx, kgo.TryCommit); err != nil {
			// Commit failed; abort so the transaction doesn't dangle.
			_ = cl.EndTransaction(ctx, kgo.TryAbort)
			return err
		}
	}
	return cl.Flush(ctx)
}
Appendix DDDD: Drain Of NATS JetStream¶
nc, _ := nats.Connect("nats://localhost:4222")
js, _ := nc.JetStream()
sub, _ := js.Subscribe("events", func(m *nats.Msg) {
// handle
_ = m.Ack()
})
// drain
_ = sub.Drain() // stops new deliveries, waits for handler to finish
_ = nc.Drain() // closes connection cleanly
Drain on subscription and connection. NATS got the API right.
Appendix EEEE: Drain Of Apache Pulsar¶
client, _ := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
defer client.Close()
consumer, _ := client.Subscribe(pulsar.ConsumerOptions{
Topic: "events",
SubscriptionName: "demo",
Type: pulsar.Shared,
})
// drain
consumer.Close() // blocks until in-flight acks complete
Close includes drain. Simple.
Appendix FFFF: Drain Of Google Pub/Sub¶
client, _ := pubsub.NewClient(ctx, "my-project")
defer client.Close()
sub := client.Subscription("my-sub")
cctx, cancel := context.WithCancel(context.Background())
go sub.Receive(cctx, func(ctx context.Context, m *pubsub.Message) {
// handle
m.Ack()
})
// drain
cancel() // tells Receive to stop
// in-flight handlers complete naturally as their contexts complete
Cancel the receive context; in-flight handlers complete; client.Close cleans up.
Appendix GGGG: Drain Of AWS SQS¶
client := sqs.NewFromConfig(cfg)
for {
if ctx.Err() != nil {
break
}
	out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
		QueueUrl:            aws.String(queueURL),
		MaxNumberOfMessages: 10,
		WaitTimeSeconds:     20,
	})
	if err != nil {
		continue // cancellation surfaces here; the loop guard exits
	}
for _, m := range out.Messages {
if err := process(ctx, m); err == nil {
_, _ = client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
QueueUrl: aws.String(queueURL),
ReceiptHandle: m.ReceiptHandle,
})
}
}
}
Drain: cancel ctx. The ReceiveMessage long-poll returns. The for-loop exits. Process the current batch via the outer drain logic.
Appendix HHHH: Drain Of Amazon Kinesis¶
Kinesis uses shards; each shard has its own iterator. Drain:
- Stop calling GetRecords for new batches.
- Checkpoint the last processed sequence via DynamoDB.
- Done.
For KCL-style consumers in Go (the official Kinesis Client Library is Java; Go shops typically use community ports built on aws-sdk-go-v2):
processor := NewProcessor()
go processor.Start(ctx)
<-ctx.Done()
_ = processor.Stop(ctx)
_ = processor.Checkpoint(ctx)
KCL has explicit Stop/Checkpoint methods. Use them.
Appendix IIII: Drain Of Azure Service Bus¶
client, _ := azservicebus.NewClient(...)
defer client.Close(ctx) // drains in-flight
receiver, _ := client.NewReceiverForQueue("q", nil)
defer receiver.Close(ctx)
for {
msgs, _ := receiver.ReceiveMessages(ctx, 10, nil)
for _, m := range msgs {
_ = process(m)
_ = receiver.CompleteMessage(ctx, m, nil)
}
}
Close on receiver and client handles drain. Pass context for bounded wait.
Appendix JJJJ: A Comparison Of Streaming Drain Across Libraries¶
| Library | Drain API | Returns context error on timeout |
|---|---|---|
| sarama | client.Close() | No (no context param) |
| franz-go | client.Close() | No (synchronous) |
| segmentio/kafka-go | reader.Close() | No (no context) |
| confluent-kafka-go | consumer.Close() | No |
| NATS | sub.Drain(), nc.Drain() | No (returns nil/error) |
| Pulsar | consumer.Close() | No |
| Google Pub/Sub | Cancel receive ctx | Yes |
| SQS | Cancel context | Yes |
| Service Bus | client.Close(ctx) | Yes |
| Kinesis (KCL Go) | processor.Stop(ctx) | Yes |
Notice the trend: newer libraries take context for drain. Older ones do not. If you need context-aware drain on an older library, wrap with a goroutine + select.
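For example, a small wrapper (a sketch) that puts a bound on any context-less Close:
// closeWithin runs a context-less Close under a deadline. On timeout the
// Close goroutine is abandoned; the process is exiting anyway.
func closeWithin(ctx context.Context, closeFn func() error) error {
	done := make(chan error, 1)
	go func() { done <- closeFn() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
Usage: err := closeWithin(dctx, reader.Close).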
Appendix KKKK: A Long Technical Tour Of A Production Pipeline Drain¶
Imagine a real-world pipeline: a payment processing service. Architecture:
- HTTP API accepting payment intents.
- Kafka consumer reading from the payment_requests topic.
- Worker pool processing each request.
- Postgres persisting state.
- Kafka producer emitting payment_events.
- Redis caching.
- OpenTelemetry exporter.
Drain order:
- Mark service as draining (/ready → 503).
- Sleep 2s for LB propagation.
- Stop HTTP server.
- Stop Kafka consumer fetch.
- Drain worker pool.
- Commit final Kafka offsets.
- Flush Kafka producer.
- Close Kafka producer.
- Close Kafka consumer.
- Flush OTel exporter.
- Close Redis client.
- Close Postgres.
Twelve steps. Each takes a slice of the budget. Total ~10s in the happy path; up to 25s in worst case.
Implementation:
type Pipeline struct {
api *http.Server
consumer *KafkaConsumer
pool *WorkerPool
producer *KafkaProducer
cache *RedisClient
db *sql.DB
otel *OtelExporter
draining atomic.Bool
}
func (p *Pipeline) Drain(rootCtx context.Context) error {
p.draining.Store(true)
time.Sleep(2 * time.Second)
dctx, dcancel := context.WithTimeout(context.Background(), 25*time.Second)
defer dcancel()
if err := p.api.Shutdown(dctx); err != nil {
log.Printf("api: %v", err)
}
if err := p.consumer.StopFetch(dctx); err != nil {
log.Printf("consumer fetch: %v", err)
}
if err := p.pool.Drain(dctx); err != nil {
log.Printf("pool: %v", err)
}
if err := p.consumer.CommitFinal(dctx); err != nil {
log.Printf("commit: %v", err)
}
if err := p.producer.Flush(dctx); err != nil {
log.Printf("producer flush: %v", err)
}
if err := p.producer.Close(); err != nil {
log.Printf("producer close: %v", err)
}
if err := p.consumer.Close(); err != nil {
log.Printf("consumer close: %v", err)
}
if err := p.otel.Flush(dctx); err != nil {
log.Printf("otel: %v", err)
}
if err := p.cache.Close(); err != nil {
log.Printf("cache: %v", err)
}
if err := p.db.Close(); err != nil {
log.Printf("db: %v", err)
}
return nil
}
Each step has its own error log. Errors are not fatal; subsequent steps continue.
In production, this exact shape (with variations) has shipped billions of payments.
Appendix LLLL: A Long Wrap-Up Of Drain Patterns¶
We have covered, across this and earlier pages:
- The three steps of drain.
- The deadline bound.
- Signal handling.
- HTTP, gRPC, worker pool, queue consumer drain.
- Two-phase shutdown.
- Supervisor patterns.
- DAG and topological order.
- Cross-service coordination.
- Cluster awareness.
- Leader election interaction.
- Distributed transactions.
- Stateful systems.
- Telemetry.
- Anti-patterns.
- Designing for drain.
- Testing strategies.
- Performance budgets.
- Capacity planning.
- Hot path costs.
- Polyglot stacks.
- Real incidents.
- Sidecars.
- Long jobs.
- Mentoring.
- Kafka rebalance.
- Exactly-once.
- Per-partition drain.
- Pulsar, NATS, Kinesis, SQS, Service Bus.
- Many other appendices.
That is roughly a textbook. Drain in Go, with depth.
If you have absorbed even half of this, you are above the 95th percentile of Go engineers in drain knowledge. Use that knowledge to ship better systems.
Appendix MMMM: An Encouragement For The Reader¶
I have written tens of thousands of words about drain across these pages. The expectation is not that you memorise all of it. The expectation is that you read with judgement, apply what fits, and return to specific sections when you need them.
Drain is a deep topic, but the surface area you need on any given day is small. The deep parts are there when you need them: the war stories, the per-partition drain, the framework comparisons. Most days you will use the basic recipe.
Keep this file bookmarked. Return when you need a specific pattern. The text is dense; it rewards re-reading.
Appendix NNNN: A Brief Word On Tooling¶
Tooling for drain in Go:
- go.uber.org/goleak for goroutine leak detection.
- golang.org/x/sync/errgroup for goroutine coordination.
- signal.NotifyContext for signal handling.
- pprof for profiling drain.
- OpenTelemetry for tracing drain phases.
- Prometheus client for metrics.
- vegeta / wrk for load testing.
These are the tools. Master them. Apply them. The combination is more than the sum.
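As a concrete example, goleak turns leaked goroutines (the classic cause of a hung drain) into test failures via its documented TestMain hook:
package service_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain fails the package's tests if any goroutine outlives them,
// which is exactly what a missed drain looks like.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}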
Appendix OOOO: Drain In The Cloud Native Landscape¶
The CNCF (Cloud Native Computing Foundation) landscape has many projects related to drain:
- Kubernetes (orchestration with grace period).
- Istio (sidecar drain).
- Knative (autoscaler-aware drain).
- Linkerd (alternative mesh).
- Helm (deploy orchestration).
Familiarity with these helps when working in cloud native environments.
Appendix PPPP: Drain As Part Of Resilience Engineering¶
Drain is one of several resilience engineering patterns:
- Circuit breakers.
- Retries with backoff.
- Bulkheads.
- Rate limits.
- Timeouts.
- Graceful degradation.
- And drain.
Each handles a different failure mode. Drain handles the "we are going away" case. Resilience engineering combines them.
A drain-only service is fragile. A circuit-breaker-only service drops shutdown traffic. The combination is robust.
Appendix QQQQ: Drain And SRE Practice¶
Site Reliability Engineering treats drain as an SLO contributor:
- Drain failures count against the error budget.
- Drain duration contributes to deploy duration.
- Drain quality affects MTTR for deployments.
SRE-led organisations measure drain rigorously. Engineers respond to the data.
Appendix RRRR: Drain As A Code Smell Detector¶
A service with bad drain often has other bad patterns:
- Goroutine leaks (drain hangs).
- Hidden state (drain misses it).
- Tight coupling (drain order is wrong).
- Lack of testing (drain is untested).
- Lack of observability (drain failures invisible).
Auditing drain often reveals these. Drain is a probe into the codebase.
Appendix SSSS: One More Worked Example — Drain A Microservice Cluster¶
A microservice cluster of 10 services drains via:
- External LB stops routing.
- Each service drains independently.
- Coordinator service drains last.
The coordinator is the entry point. It drains last because other services may still emit events to it.
Drain order within a service follows local dependency. Drain order across services follows the deployment topology.
Coordinate via:
- LB drain (stops external traffic).
- Service mesh drain (stops internal east-west traffic).
- Local service drain (each pod's own logic).
The senior+ engineer thinks about all three.
Appendix TTTT: Drain And Disaster Recovery¶
In a DR scenario:
- A region fails.
- Pods in the failed region cannot drain (network unreachable).
- Other regions take over.
Drain assumes infrastructure is available. DR scenarios break this assumption.
Mitigation: replication and idempotency. Drain handles the normal case; replication handles the disaster.
Appendix UUUU: A Closing Catalogue Of Truths¶
Truths I have learned about drain:
- Drain is not a feature; it is a property of well-designed systems.
- Drain bugs are usually obvious in hindsight.
- Drain tests catch bugs that no other tests catch.
- Drain failures are interesting; investigate them.
- Drain is teachable; mentor your team.
- Drain is the cheapest reliability investment with the highest return.
- Drain quality predicts engineering quality.
- Drain is humble work; the best drain code is invisible.
Carry these forward.
Appendix VVVV: The End¶
This is the end of the professional page. You have read enough about drain to last a career.
Apply what fits. Skip what doesn't. Return when you need to.
Build systems that drain cleanly. Ship them. Sleep well.
That is the goal. That is the work.
Go.
Appendix WWWW: A Final Code Snippet¶
For the road, the smallest possible drainable Go service:
package main
import (
"context"
"net/http"
"os"
"os/signal"
"syscall"
"time"
)
func main() {
ctx, cancel := signal.NotifyContext(context.Background(),
os.Interrupt, syscall.SIGTERM)
defer cancel()
srv := &http.Server{Addr: ":8080"}
go srv.ListenAndServe()
<-ctx.Done()
dctx, dc := context.WithTimeout(context.Background(), 25*time.Second)
defer dc()
_ = srv.Shutdown(dctx)
}
Twenty lines. Drainable. Idiomatic. Production-ready as a starting point.
Every drain begins here. Every drain extends this. Memorise it.
Appendix XXXX: A Closing Thought¶
If drain is invisible in your production systems — never a topic of conversation, never an incident, never a deploy gate — congratulations. You have reached the senior+ goal.
If drain is still a struggle, keep practising. The patterns are finite. The discipline is teachable. The investment compounds.
Build systems that drain. Be the engineer who built them.
Go.
Appendix YYYY: Truly The End¶
You have read the professional page. There is no more.
Drain awaits. Build well.
Appendix ZZZZ: A Long Final Walkthrough — A Production Drain Incident Postmortem¶
To close, let me walk through a hypothetical (but realistic) drain incident postmortem. This is the kind of writeup a senior engineer produces after a drain failure.
Incident summary¶
On 2026-03-12 at 14:23 UTC, a routine deploy of the order-service caused 1,247 duplicate payment events. Affected customers were notified and refunds processed within 6 hours. No financial loss to the company; estimated $12,400 in customer support cost and engineering time.
Timeline¶
- 14:22:30 UTC: Deploy began. Rolling update of 12 pods.
- 14:23:00 UTC: First pod started draining. Drain logs show "draining" then "drain timeout exceeded."
- 14:23:30 UTC: First pod SIGKILL-ed by Kubernetes (30s grace period elapsed).
- 14:23:35 UTC: Replacement pod started.
- 14:23:40 UTC: Replacement pod picked up partitions; began processing.
- 14:24:00 UTC: Duplicate events first detected by downstream consumer.
- 14:30:00 UTC: Alert triggered on duplicate count.
- 14:35:00 UTC: Engineer paged; investigation began.
- 14:50:00 UTC: Root cause identified.
- 15:00:00 UTC: Mitigations deployed.
Total customer impact window: ~30 minutes.
Root cause¶
The order service was using sarama.ConsumerGroup with default settings. The Cleanup callback waited for in-flight messages, but the wait was implemented with wg.Wait() and no deadline.
In production, one message in flight was stuck on a downstream payment provider API call that was experiencing intermittent issues. The call's own timeout was 60 seconds (default in the HTTP client). The drain budget was 25 seconds.
When SIGTERM arrived:
1. The Kafka consumer's Cleanup was called.
2. wg.Wait() blocked waiting for the stuck message.
3. 25 seconds passed; drain deadline exceeded.
4. 5 more seconds passed; Kubernetes sent SIGKILL.
5. Process died. The stuck message was never committed.
6. Replacement pod joined the group, was assigned the partition, and re-read the message.
7. The downstream API recovered. The replacement pod processed the message — which was already applied by the original pod (the API call had completed; only the commit didn't).
8. Result: duplicate event in the output topic.
Contributing factors¶
- Drain budget (25s) was less than the HTTP client timeout (60s).
- No bound on wg.Wait() in Cleanup.
- No alert for drain duration approaching the grace period.
- No drain test exercising a hung downstream.
Mitigations¶
Immediate (deployed within 1 hour):
- HTTP client timeout reduced to 10s (much less than the drain budget).
- Cleanup callback uses a select with a 15-second deadline.
Short-term (deployed within 1 week):
- Drain duration metric added.
- Alert configured for P99 drain duration > 20s.
- Integration test added that drains with a slow downstream.
Long-term (within 1 month):
- Library upgrade to a Kafka client that handles drain natively.
- Org-wide review of HTTP client timeouts vs drain budgets.
- Documentation of the "downstream timeout < drain budget" rule.
Lessons learned¶
- Every downstream timeout must be less than the drain budget.
- wg.Wait() must always be bounded.
- Drain tests must include hung downstream scenarios.
- Drain metrics are essential; without them, this would have stayed undetected for weeks.
Action items¶
- Audit all services for wg.Wait() without deadline. (Owner: SRE, due in 2 weeks.)
- Add a CI rule that fails if any HTTP client has a timeout > 20s. (Owner: platform team, due in 1 month.)
- Update onboarding docs with the "downstream timeout < drain budget" rule. (Owner: tech writer, due in 1 month.)
- Quarterly drain audit cadence established. (Owner: lead engineer.)
This kind of postmortem is the engineering output that drain pays for. The investment in writing it up — and acting on the action items — prevents the next incident.
A team that produces postmortems like this is a team that learns. A team that doesn't is a team that repeats incidents.
Senior engineers lead postmortems. Drain incidents are common subjects. Be the senior engineer who turns each one into a learning moment.
Appendix AAAAA: One More Bonus Section — Drain In CI/CD Pipelines¶
Even your CI/CD pipeline should drain cleanly:
- A pipeline that catches SIGTERM (from CI runner shutdown) and cleans up build artifacts.
- A test runner that drains its parallel workers.
- A release deployer that drains in-flight deploys.
For Go-based CI tools, the patterns from this page apply.
Appendix BBBBB: Drain In Long-Running Tests¶
A long-running test (e.g., a soak test) should drain cleanly on Ctrl+C:
func TestSoak(t *testing.T) {
ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)
defer cancel()
for ctx.Err() == nil {
runIteration(t, ctx)
}
t.Log("soak test cancelled")
}
The test exits cleanly when the user interrupts. Useful for tests that may run for hours.
Appendix CCCCC: Drain In Demos¶
Even a 10-line demo program benefits from drain. It teaches the discipline.
package main
import (
"context"
"fmt"
"os/signal"
"syscall"
)
func main() {
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT)
defer stop()
fmt.Println("running")
<-ctx.Done()
fmt.Println("clean exit")
}
Build the habit. Even one-off scripts get signal.NotifyContext.
Appendix DDDDD: Drain Across Programming Paradigms¶
Some programming paradigms shape drain:
- Imperative. Standard Go style; drain is explicit.
- Functional/reactive. Streams of values; drain via end-of-stream signal.
- Actor-based. Each actor drains itself; supervisor coordinates.
- CSP (Go's idiomatic style). Channels for coordination; drain via close + context.
Go is mostly imperative with CSP. The drain patterns reflect this.
If you work in a different paradigm, translate the patterns. The principles transfer.
Appendix EEEEE: Drain And Functional Composition¶
Functional Go has emerged in some codebases. Drain composes:
func compose(ds ...func(ctx context.Context) error) func(context.Context) error {
return func(ctx context.Context) error {
for i := len(ds) - 1; i >= 0; i-- {
if err := ds[i](ctx); err != nil {
return err
}
}
return nil
}
}
drainAll := compose(drainHTTP, drainPool, drainProducer, drainDB)
_ = drainAll(drainCtx)
A single drainAll function composed from many. Functional composition.
Appendix FFFFF: Drain And Type Safety¶
Generics (Go 1.18+) help drain:
type Drainable interface {
Drain(ctx context.Context) error
}
func DrainAll[T Drainable](ctx context.Context, items []T) error {
for i := len(items) - 1; i >= 0; i-- {
if err := items[i].Drain(ctx); err != nil {
return err
}
}
return nil
}
Type-safe drain over a slice of drainables; instantiate T as an interface type to mix implementations. Generics make drain libraries cleaner.
Appendix GGGGG: Drain And Reflection¶
Avoid reflection in drain code. Drain is performance-sensitive (in the critical path during shutdown). Reflection is slow.
Use interfaces. Use generics. Avoid reflect.
Appendix HHHHH: A Last Mention Of Bad Drain Code¶
Watch for these in code reviews:
- time.Sleep(time.Hour) somewhere in the drain path.
- <-channel without select.
- wg.Wait() without bound.
- for { ... } without exit.
- os.Exit(0) in any place.
Each is a flag.
Appendix IIIII: A Last Reflection On Mastery¶
You have read approximately 100,000 words about drain across these pages. That is the equivalent of a short book.
Mastery comes from practice, not just reading. Apply these patterns. Audit services. Fix drain bugs. Mentor others.
In a year, you will think back on this reading and realise how much of it has become second nature. The senior+ engineers in your team did the same; their drain knowledge appears effortless because it has been internalised through years of practice.
You are on that path. Keep going.
Appendix JJJJJ: A Last Reminder¶
The most important pattern is the simplest:
ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer cancel()
// ... start things ...
<-ctx.Done()
// ... drain things with a bounded context ...
Everything else is a refinement. The core is here. Memorise it. Apply it. Build on it.
Appendix KKKKK: The Final Final Words¶
This is the end of the professional drain page. There is genuinely nothing more.
If you finished this entire page, you are now among the better-prepared Go engineers in the industry on the topic of drain. That is a real achievement.
Use it. Build systems that drain. Be remembered as the engineer who shipped cleanly.
The world has too much buggy shutdown code. Be the engineer who fixes it.
Appendix LLLLL: A Closing Reflection For You¶
You picked up this page. You read it. You absorbed the patterns. You have the discipline.
What now? Apply it. Tomorrow. Pick a service. Audit it. Find one drain bug. Fix it. Test it. Ship it.
That single action — repeated across services, across teams, across an organisation — is what professional drain looks like. Not the patterns. The action.
Go take the action.
Appendix MMMMM: Acknowledgements¶
A nod to the broader community whose drain patterns I have absorbed:
- The Go core team for context, signal, errgroup.
- The Sarama, franz-go, kafka-go maintainers.
- NATS contributors for naming drain clearly.
- Site Reliability Engineering as a discipline.
- Every engineer who has shared a drain postmortem publicly.
The patterns are collective wisdom. Pass them forward.
Appendix NNNNN: Goodbye¶
And with that, the page is done.
Drain is yours.
Go.
Appendix OOOOO: Truly Goodbye¶
This is the absolute end. The next pages cover reference and practice. Use them as you wish.
Build cleanly. Drain cleanly. Ship cleanly.
That is the engineer you want to be. That is the engineer this page has been preparing you to be.
Be that engineer.
Goodbye, and good shipping.
Appendix PPPPP: A True Postscript¶
Six final pieces of advice:
- Test drain in CI.
- Measure drain in production.
- Mentor drain to your team.
- Audit drain quarterly.
- Write postmortems for drain incidents.
- Be patient; drain culture takes time.
Six pieces. Each takes weeks to implement. Years to perfect.
But you are on the path. Walk it.
Appendix QQQQQ: A Hidden Pattern¶
One pattern I have not mentioned explicitly: drain by example.
When you write drainable code, others copy it. When you ship drain tests, others write drain tests. When you measure drain, others measure drain.
Be the example. The culture follows.
Appendix RRRRR: A Hidden Anti-Pattern¶
One anti-pattern I have not emphasised: never shame engineers for missing drain.
Drain is hard. Engineers who miss it are not bad engineers; they are engineers who have not yet learned drain. Teach them. Mentor them. Welcome them.
The shame leads to hiding. The welcoming leads to learning. Choose welcoming.
Appendix SSSSS: Last Words. Really.¶
Okay. This is genuinely the last sentence: build cleanly, drain cleanly, ship cleanly, mentor cleanly. Goodbye.
Now go and apply.
Appendix TTTTT: Definitely The End¶
The end.
Appendix UUUUU: Bonus Deep-Dive — Drain In A Sharded Service¶
Some services partition their data into shards. Drain interacts with sharding:
- Each shard may have its own goroutines.
- Shard ownership may transfer to other pods during drain.
- Drain must release shards cleanly.
type ShardedService struct {
shards map[ShardID]*Shard
mu sync.RWMutex
}
type Shard struct {
id ShardID
state *State
worker *Worker
releasing atomic.Bool
}
func (s *ShardedService) Drain(ctx context.Context) error {
s.mu.RLock()
shards := make([]*Shard, 0, len(s.shards))
for _, sh := range s.shards {
shards = append(shards, sh)
}
s.mu.RUnlock()
var eg errgroup.Group
for _, sh := range shards {
sh := sh
eg.Go(func() error {
sh.releasing.Store(true)
if err := sh.worker.Drain(ctx); err != nil {
return err
}
if err := sh.state.Checkpoint(ctx); err != nil {
return err
}
return sh.Release(ctx)
})
}
return eg.Wait()
}
Each shard drains in parallel. Total time is the slowest shard's drain.
Appendix VVVVV: Drain Of A Sharded Service Across Cluster¶
When a sharded service drains a pod:
- The pod marks its shards as "releasing."
- The pod's shards are reassigned by the cluster coordinator.
- Other pods pick up the shards (load increases).
- The original pod's shards complete in-flight work.
- The original pod releases shards.
- The original pod exits.
This is a multi-second dance. The drain budget must accommodate.
Appendix WWWWW: Drain Of A Service With Sticky Sessions¶
A service with sticky session affinity (e.g., gaming server, video chat) drains carefully:
- Sessions on the draining pod must migrate to other pods.
- Active sessions either complete or migrate.
- The LB stops new sessions immediately.
Migration is complex. Some services treat drain as "complete the current session." Others migrate in real time. Choose based on session length.
Appendix XXXXX: Drain Of A Service With Time-Based State¶
A service with state that ages (e.g., a TTL cache) drains by:
- Stopping intake.
- Letting in-flight TTL expirations complete.
- Persisting state.
- Exiting.
If state is small, persist always. If large, snapshot continuously so drain is fast.
Appendix YYYYY: Drain Of A Service With Geo-Replication¶
Geo-replicated state must converge on drain:
- Wait for replication catch-up.
- Force a sync if needed.
- Then drain.
For large replicated datasets, this is slow; the drain budget must reflect it.
Appendix ZZZZZ: Drain Of A Service With Consensus¶
Raft or Paxos-based services drain by:
- Stepping down leadership (if leader).
- Letting in-flight commits finish.
- Disconnecting from peers cleanly.
- Exiting.
Leader stepping-down triggers an election; new leader may be elected before this pod exits. Drain budget accommodates election time.
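A sketch of that sequence, assuming the github.com/hashicorp/raft API (State, LeadershipTransfer, and Shutdown are that library's methods):
func drainRaft(r *raft.Raft) error {
	// Hand leadership to another voter before leaving, so the election
	// happens while this node can still answer.
	if r.State() == raft.Leader {
		if err := r.LeadershipTransfer().Error(); err != nil {
			return err
		}
	}
	return r.Shutdown().Error() // disconnect from peers cleanly
}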
Appendix AAAAAA: Drain Of A Service With Streaming Joins¶
Stream processors with joins drain by:
- Pausing input streams.
- Flushing join buffer.
- Persisting state.
- Committing offsets.
Joins are memory-heavy. Drain may need to write GB of state. Plan budget.
Appendix BBBBBB: Drain Of A Service With Windowed Aggregations¶
Time-windowed aggregations (e.g., 1-minute counts) drain by:
- Closing the current window.
- Emitting the partial result.
- Persisting the result for resumption.
- Committing input offsets.
The partial-result emission is the unusual step. Without it, downstream sees gaps.
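A sketch of the partial emit, with hypothetical Window, sink, store, and committer types:
func (w *Window) Drain(ctx context.Context) error {
	w.closeInput() // no more records enter the current window
	partial := w.Snapshot()
	if err := w.sink.Emit(ctx, partial); err != nil { // downstream sees no gap
		return err
	}
	if err := w.store.Persist(ctx, partial); err != nil { // resumable on restart
		return err
	}
	return w.committer.Commit(ctx) // input offsets last, after the emit is durable
}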
Appendix CCCCCC: Drain Of A Service With Multi-Topic Consumers¶
A consumer reading from N topics drains each:
- Stop fetching from all topics.
- Drain in-flight per topic.
- Commit offsets per topic.
- Close client.
Multi-topic consumers are common. The pattern repeats.
Appendix DDDDDD: Drain Of A Service With Complex State¶
A service with multiple state types (cache, database, in-memory) drains in dependency order:
- Cache flush (write-through to DB).
- In-memory aggregates flush.
- Database close.
The most-derived state flushes first; the most-fundamental closes last.
Appendix EEEEEE: Drain Of A Service With Plugins¶
A plugin-based service must drain each plugin:
- Notify plugins of impending shutdown.
- Let each plugin drain itself.
- Unload plugins.
Plugin drain APIs vary. Standardise within your service.
Appendix FFFFFF: Drain Of A Service With Hot-Swappable Components¶
A service that supports hot-swap during runtime drains the old component before activating the new:
func (s *Service) Swap(ctx context.Context, newC Component) error {
old := s.current.Load().(*Component)
if err := old.Drain(ctx); err != nil {
return err
}
s.current.Store(newC)
return newC.Start(ctx)
}
Swap is drain + start. Useful for live config reload.
Appendix GGGGGG: Drain Of A Service With Subscriptions¶
A service holding many subscriptions (Kafka, NATS, Redis pub/sub) drains by:
- Unsubscribing from each.
- Waiting for in-flight handlers.
- Disconnecting clients.
Parallel unsubscribe is fine; the broker handles it.
Appendix HHHHHH: Drain Of A Service With Persistent Channels¶
If you persist channels of in-flight work to disk for survivability across restarts, drain must:
- Stop accepting new work.
- Persist remaining channel contents.
- Verify persistence.
- Exit.
Channel persistence is rare in Go but exists in some systems (e.g., job queue libraries with file-backed storage).
Appendix IIIIII: Drain Of A Service With Hot Caches¶
Hot caches (in-memory, sub-millisecond access) are lost on restart. Drain considers:
- Persist the cache (slow) — bounded by drain budget.
- Lose the cache (fast) — replacement pod has cold start latency spike.
Most services accept cold start. A few critical services persist. Trade-off.
Appendix JJJJJJ: Drain Of A Service With Periodic Snapshots¶
A service that snapshots every minute has more bounded drain:
- The "lost work" since last snapshot is bounded.
- Drain saves a final snapshot for safety.
- Or skips final snapshot if budget is tight.
Snapshots are a useful resilience pattern. They make drain shorter.
Appendix KKKKKK: Drain Of A Service With WAL¶
A write-ahead log lets drain be very fast:
- All state changes are already in the WAL.
- Drain just flushes the WAL buffer.
- No state migration needed.
Replacement pod replays the WAL. State is reconstructed.
Appendix LLLLLL: Drain Of A Service With Event Sourcing¶
Event-sourced services drain by:
- Stopping command intake.
- Letting in-flight commands complete.
- Persisting any final events.
- Closing.
Replacement pod replays events to reconstruct state.
Appendix MMMMMM: Drain Of A Service With CQRS¶
Command-Query Responsibility Segregation drains commands and queries separately:
- Commands drain (mutating side).
- Queries drain (read side).
- Read models eventually catch up.
The query side may drain faster than commands. Coordinate.
Appendix NNNNNN: Drain Of A Service With Sagas¶
Distributed sagas drain by:
- Stopping new sagas.
- Letting in-flight sagas advance to a stable state.
- Persisting saga state.
- Releasing.
Sagas may resume on the next pod or compensate. Design decision.
Appendix OOOOOO: Drain Of A Service With Workflow Engines¶
A service running workflows (Temporal, Cadence, etc.) drains by:
- Stopping new workflows.
- Letting in-flight workflows complete or checkpoint.
- Releasing workflow ownership.
Workflow engines have their own drain semantics. Use them.
Appendix PPPPPP: Drain Of A Service With Background Jobs¶
A service with background jobs (e.g., asynq, machinery) drains by:
- Stopping new jobs.
- Letting in-flight jobs complete or extend lock.
- Releasing locks.
The job library usually has a Stop() method that handles this.
Appendix QQQQQQ: Drain Of A Service With Embedded Databases¶
Embedded databases (BoltDB, BadgerDB, SQLite) drain by:
- Closing the database.
- Flushing buffers.
- Releasing file locks.
Drain is mostly the embedded DB's Close method.
Appendix RRRRRR: Drain Of A Service With External Plugins Via Subprocess¶
A service spawning subprocesses drains by:
- Forwarding SIGTERM to children.
- Waiting for children.
- Reaping zombies.
This is the multi-process pattern. Use a process group.
Appendix SSSSSS: Drain Of A Service With Cgo¶
cgo calls cannot be interrupted by Go's cancellation. A drain with a stuck cgo call hangs until the call returns naturally.
Mitigations:
- Bound cgo call duration in the C code (timeouts).
- Wrap cgo calls in a goroutine that the drain can detach from.
- Avoid long-running cgo calls.
Appendix TTTTTT: Drain Of A Service With Unsafe¶
The unsafe package is rare in production Go. If used, ensure drain releases any unsafe pointers cleanly. Memory corruption during drain is catastrophic.
Appendix UUUUUU: Drain Of A Service With Assembly¶
Assembly-level operations (e.g., for performance) are uninterruptible. Drain works around them.
For most Go services, no assembly. For high-performance niches, document drain implications.
Appendix VVVVVV: A Truly Closing Word¶
After all of this, the simplest truth: drain is the discipline of leaving cleanly.
The patterns are tools. The discipline is what matters. Internalise the discipline; the tools follow.
Go ship clean Go.
Appendix WWWWWW: A Truly Closing Code Snippet¶
func main() {
ctx, cancel := signal.NotifyContext(context.Background(),
os.Interrupt, syscall.SIGTERM)
defer cancel()
// ... your service ...
<-ctx.Done()
// ... drain with bounded context ...
}
Memorise this. Use it. Adapt it. Ship.
Appendix XXXXXX: A Truly Closing Word On Words¶
I wrote a lot of words about drain. Many appendices. Many close-out paragraphs.
The repetition is intentional. Drain is the kind of pattern that benefits from re-reading. Each closing thought is slightly different; the cumulative effect is internalisation.
If you found the appendices repetitive, you have internalised drain. Great.
Now go and apply.
Appendix YYYYYY: Genuinely The End Of This Page¶
The professional page closes here. No more appendices. No more closing thoughts.
Continue to the reference pages or, better, write some drain code.
Goodbye.
Appendix ZZZZZZ: Sign-Off¶
End of professional.md.
Appendix AAAAAAA: Postscript — A Detailed Drain Decision Framework¶
When designing drain for a new service, walk through this framework:
Step 1: Identify the workload type¶
Is the service:
- Synchronous (request/response)?
- Asynchronous (message consumer)?
- Long-running (workflow, streaming)?
- Periodic (cron-like)?
- Hybrid (multiple types)?
Each has different drain requirements.
Step 2: Inventory long-lived goroutines¶
List every long-lived goroutine. For each:
- Who spawns it?
- How does it exit?
- What state does it own?
- What downstream does it use?
This is the drain graph.
Step 3: Define the drain order¶
Based on the graph, define order. Typically:
- Intake first.
- Processing next.
- Output (producers).
- Persistence (DB).
Document the order.
Step 4: Estimate the budget¶
For each component, estimate drain duration at P99. Sum. Add safety margin. Compare to grace period.
If sum > grace, either:
- Optimise.
- Increase grace.
- Parallelise.
Step 5: Implement¶
Each component implements Drain(ctx) error. Glue together in main with signal handling.
Step 6: Test¶
Unit tests for each component. Integration test for the whole. CI gate.
Step 7: Observe¶
Metrics for drain duration. Alerts for slow drains.
Step 8: Iterate¶
After deploys, look at the metrics. Adjust budgets. Fix slow components.
Step 9: Mentor¶
Teach the team. Make drain culture.
Step 10: Audit¶
Quarterly. Catch drift.
Ten steps. Each is a few hours of work. Total: a week per service initially; less for subsequent ones.
Appendix BBBBBBB: A Detailed Drain Anti-Patterns Library¶
A library of anti-patterns with named examples:
"Forgot to defer Done"¶
Fix: defer wg.Done() at the top of the goroutine body.
"Range on infinite channel"¶
If ch is never closed, goroutine leaks. Use a select with <-ctx.Done().
"Drain context from cancelled root"¶
If rootCtx is already cancelled, drainCtx is too. Drain has zero budget.
Fix: context.WithTimeout(context.Background(), 25*time.Second).
"Close on receive-only channel"¶
Owners close their channels. Consumers don't.
"Double close"¶
Use sync.Once or atomic CAS for single-shot close.
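For example (a sketch):
type Server struct {
	stop     chan struct{}
	stopOnce sync.Once
}

// Close is safe to call from multiple goroutines; the channel closes once.
func (s *Server) Close() {
	s.stopOnce.Do(func() { close(s.stop) })
}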
"Send after close"¶
Producers must check the drain flag before send.
"Select with default in tight loop"¶
Spins at 100% CPU. Sleep or select on something blocking.
"Goroutine without exit path"¶
Cannot drain. Add a context check.
"os.Exit outside main"¶
Bypasses defers, drains, everything. Only main exits.
"Block on db.Close with active connections"¶
db.Close hangs until connections return. Drain workers first.
"Drain inside a handler"¶
http.HandleFunc("/shutdown", func(w http.ResponseWriter, r *http.Request) {
srv.Shutdown(context.Background())
})
The handler holds the wait group; Shutdown waits on the wait group. Deadlock.
Fix: trigger drain from outside via a context cancellation.
"Drain budget = grace period"¶
No margin. A slight slowdown causes SIGKILL.
Fix: drain budget < grace - 5 seconds.
"Forgetting onPartitionsLost"¶
For Kafka: handle the case where partitions are taken away abruptly (not gracefully). Otherwise commits to lost partitions fail.
"Hardcoded drain timeout"¶
Not configurable. Production and tests use the same value.
Fix: configurable via env var.
"Sleep instead of wait"¶
Does not wait for anything. Use wg.Wait with deadline.
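The bounded replacement is a few lines (a sketch):
// waitTimeout waits on wg but gives up when ctx does.
func waitTimeout(ctx context.Context, wg *sync.WaitGroup) error {
	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}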
Appendix CCCCCCC: A Drain Self-Test¶
Test yourself on the patterns. Can you, from memory, write:
- The smallest drainable Go service (under 20 lines)?
- A worker pool with
SubmitandDrain(ctx)? - An HTTP server with graceful shutdown?
- A Kafka consumer with offset commit on drain?
- A drain test that catches a hung worker?
If yes to all, you have internalised the patterns. If no to any, return to the relevant section and re-read.
Appendix DDDDDDD: A Truly Truly Truly Final Note¶
I have written about as much as I can write about drain without becoming tedious (and have arguably crossed that line already).
The remaining files in this section are concise reference, practice, and assessment. Use them.
This page is your reference. Bookmark sections. Return to them.
Drain is a long topic. You will return often.
Welcome to the work.
Appendix EEEEEEE: Appendix To The Appendices¶
I have appendices upon appendices. The structure is intentional: drain is a depth topic, and the depth comes from many small refinements.
If you read linearly, the repetition may grate. If you reference selectively, the repetition helps you find what you need.
Either way: the page is yours.
Appendix FFFFFFF: A Sentimental Final Paragraph¶
When I started writing this page, I wanted to capture every drain pattern I had learned over years of production Go work. The page grew. I added more. I added more.
The result is dense. Possibly too dense. But I have seen too many drain incidents not to write them all down. If even one engineer reading this avoids one production incident, the writing was worth it.
If you avoided that incident: you are welcome. I avoided it once by paying attention to drain. You can too.
Go build.
Appendix GGGGGGG: A Sentimental Final Final Paragraph¶
This is the absolute end of the professional drain page.
The reference pages await: specification, interview, tasks, find-bug, optimize. They are shorter, terser, focused on practice and reference.
For the depth: this page. For the practice: those pages. For the production: your code.
The patterns are yours.
Go.
Appendix HHHHHHH: One Code Snippet For The Road¶
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := signal.NotifyContext(context.Background(),
		os.Interrupt, syscall.SIGTERM) // cancelled on SIGINT or SIGTERM
	defer cancel()
	log.Println("running")
	<-ctx.Done() // block until a signal arrives
	log.Println("draining")
	dctx, dc := context.WithTimeout(context.Background(), 25*time.Second) // fresh budget, not the cancelled root
	defer dc()
	_ = dctx // pass dctx to each component's Drain call
	log.Println("exit")
}
Twenty lines. The skeleton. Apply, extend, ship.
Appendix IIIIIII: Truly The Last¶
Goodbye.
Appendix JJJJJJJ: A Bonus Long Section — Production Drain Strategies By Scale¶
Single-pod services¶
For services running on a single pod (e.g., admin tools, internal services):
- Drain budget can be longer (no parallelism considerations).
- The pod itself is the cluster.
- Local tests cover production.
- Drain failures affect 100% of traffic — quality matters.
Small clusters (2-10 pods)¶
- Drain budget shorter (cluster capacity matters).
- Test with realistic load.
- Watch for thundering herd on retries.
- Failover during drain is real.
Medium clusters (10-100 pods)¶
- Parallel drain across pods, throttled.
- Per-pod drain budget consistent.
- Fleet-wide metrics essential.
- Canary deploys.
Large clusters (100-1000 pods)¶
- Drain duration dominates deploy time.
- Optimisation has high ROI.
- Per-pod analytics may be impractical; aggregate.
- Drain incidents are inevitable; postmortems matter.
Massive clusters (1000+ pods)¶
- Drain is a first-class operational concern.
- Dedicated tooling for drain analysis.
- Drain quality affects business metrics (deploy frequency, time-to-deploy).
- Engineering investment in drain pays back across the org.
Strategies scale with cluster size. The patterns are similar; the discipline differs.
Appendix KKKKKKK: Drain In Multi-Cluster Environments¶
Some services span multiple Kubernetes clusters (e.g., for HA, multi-region). Drain interacts with the multi-cluster topology:
- Each cluster drains independently.
- Cross-cluster communication may degrade during drain.
- Failover routes traffic to other clusters.
The senior+ engineer designs for cluster failures, which look similar to mass drains.
Appendix LLLLLLL: Drain And Federation¶
Federated services (single API across many backends) drain by:
- Marking the federation node as draining.
- Letting in-flight federated queries complete.
- Closing connections to backends.
Federation drain is mostly about the proxy layer; backends drain independently.
Appendix MMMMMMM: Drain Of Service Mesh Components¶
The mesh itself (Istio control plane, Linkerd control plane) drains by:
- Stopping config push.
- Letting in-flight changes propagate.
- Closing connections to data planes.
Mesh drain is rare (control plane is usually stable). But upgrades require it.
Appendix NNNNNNN: Drain Of An Observability Stack¶
Prometheus, Grafana, Tempo, Loki — each has its own drain:
- Prometheus: scrape stops; WAL flushes; storage closes.
- Grafana: HTTP server drains.
- Tempo: ingester drains its block buffer.
- Loki: ingester flushes WAL.
Observability stacks have lots of state; drain matters.
Appendix OOOOOOO: Drain Of Reverse Proxies¶
Nginx, Envoy, HAProxy:
- Stop accepting new connections.
- Existing connections drain.
- Reload config or exit.
Nginx has nginx -s quit for graceful drain. Envoy drains via its admin endpoint and drain timers (Istio surfaces this as the sidecar's terminationDrainDuration). HAProxy has -sf for graceful reload.
Each proxy has its own drain primitive. Use it.
Appendix PPPPPPP: Drain Of Custom L7 Proxies¶
A custom L7 proxy in Go:
type Proxy struct {
	srv      *http.Server
	upstream *url.URL
	inFlight sync.WaitGroup // tracks requests proxied to the upstream
}

func (p *Proxy) Drain(ctx context.Context) error {
	// Stop accepting and wait for in-flight HTTP requests.
	if err := p.srv.Shutdown(ctx); err != nil {
		return err
	}
	// Then wait for upstream work to finish, bounded by ctx.
	done := make(chan struct{})
	go func() { p.inFlight.Wait(); close(done) }()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
Standard pattern. Custom proxies are usually thin layers over httputil.ReverseProxy.
Appendix QQQQQQQ: Drain Of API Gateways¶
API gateways combine routing, auth, rate-limiting:
- Stop accepting new requests.
- Let in-flight requests finish.
- Flush rate-limiter state (if persistent).
- Close connections to backends.
Gateways are entry points; drain quality is critical.
Appendix RRRRRRR: Drain Of Webhook Receivers¶
A service receiving webhooks (Stripe, GitHub, etc.):
- Stop the HTTP listener.
- Let in-flight webhook handlers finish.
- Reject pending sends from upstream (they will retry).
Webhook senders retry on failure, so requests dropped during drain are redelivered. Idempotency matters: the retried delivery must not double-apply.
Appendix SSSSSSS: Drain Of WebSocket Bridges¶
A bridge translating WebSocket ↔ something else (Kafka, gRPC):
- Close WebSocket connections with reason.
- Drain pending sends to the backend.
- Close backend connections.
The backend-bound drain matters more than the WebSocket-bound one; clients reconnect.
Appendix TTTTTTT: Drain Of Bot Frameworks¶
A chatbot framework drains by:
- Stopping the bot poll loop.
- Letting in-flight conversations finish current turn.
- Persisting conversation state.
Conversations resume in another pod. Drain is per-turn, not per-conversation.
Appendix UUUUUUU: Drain Of ML Inference Servers¶
ML inference servers drain by:
- Stopping the request listener.
- Letting in-flight predictions finish.
- Releasing GPU memory.
GPU release is the unusual step. Some ML libraries leak GPU memory; drain helps reclaim.
Appendix VVVVVVV: Drain Of ML Training Jobs¶
Training jobs are typically long-running batch. Drain:
- Checkpoint model.
- Save optimiser state.
- Save random seed.
- Exit.
Next run resumes from checkpoint. Drain is part of resumability.
Appendix WWWWWWW: Drain Of Data Pipelines (Airflow-like)¶
A pipeline orchestrator drains by:
- Stopping the scheduler.
- Letting currently-running tasks finish.
- Persisting pipeline state.
- Releasing locks.
Pipeline drain is similar to cron drain but with more state.
Appendix XXXXXXX: Drain Of File Watchers¶
A file watcher drains by:
- Closing the watch (fsnotify, inotify).
- Letting any in-flight event handlers finish.
- Exiting.
Watchers are usually simple. Drain is mostly closing the underlying handle.
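A sketch with fsnotify (assuming the common v1 behaviour where Close also closes the Events channel; handle is illustrative):
w, err := fsnotify.NewWatcher()
if err != nil {
	return err
}
go func() {
	for ev := range w.Events { // exits when Close closes Events
		handle(ev)
	}
}()

<-ctx.Done()
return w.Close() // stops the watch and unblocks the range loop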
Appendix YYYYYYY: Drain Of Network Watchers¶
Network watchers (BGP, NETLINK) drain by:
- Closing the netlink socket.
- Letting in-flight notifications drain.
- Exiting.
Specialised, but the pattern is universal.
Appendix ZZZZZZZ: Drain Of System Services¶
Services managing OS-level state (systemd-like) drain by:
- Stopping intake.
- Persisting state to disk.
- Releasing PID files / sockets.
- Exiting.
These are infrastructure services. Drain is critical.
Appendix AAAAAAAA: Drain In Bot-Like Services¶
Discord bots, Slack bots, Twitter bots drain by:
- Disconnecting from the platform.
- Letting in-flight messages process.
- Persisting any pending state.
- Exiting.
Platform clients usually have Close or Disconnect. Use it.
Appendix BBBBBBBB: Drain Of CLI Daemons¶
A CLI daemon (e.g., a local sync tool) drains on Ctrl+C:
- Cancel context.
- Let workers finish.
- Persist state.
- Exit.
CLI daemons should drain in under a second; users notice longer.
Appendix CCCCCCCC: Drain Of Init Containers¶
Init containers in Kubernetes are short-lived setup processes. They do not drain in the traditional sense; they run, complete, and exit.
If an init container can be killed mid-setup, ensure idempotency on retry.
Appendix DDDDDDDD: Drain Of Static Binaries¶
Static Go binaries (no shared libraries) drain identically to dynamic ones; nothing special.
Appendix EEEEEEEE: Drain Of Cross-Compiled Binaries¶
Cross-compiled binaries (e.g., Linux from macOS) drain identically. The signal handling, context, and goroutine semantics are platform-independent.
Test on the target platform if you depend on platform-specific syscalls (syscall.SIGUSR1 etc.).
Appendix FFFFFFFF: Drain Of TinyGo Programs¶
TinyGo (Go for embedded) lacks some standard library features. signal.NotifyContext may not be available; use hardware interrupts.
For most embedded use cases, drain is simpler (no Kafka, no Postgres). The principle remains: leave cleanly.
Appendix GGGGGGGG: Drain Of WebAssembly¶
Go compiled to WASM running in browsers does not receive SIGTERM. Drain is via the browser's beforeunload event.
For WASM running on edge platforms, drain is per-invocation. The platform handles process lifecycle.
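For the browser case, a sketch with syscall/js (flushState is illustrative):
//go:build js && wasm

package main

import "syscall/js"

func flushState() {
	// Illustrative: persist pending state, send a beacon, etc.
}

func main() {
	js.Global().Get("window").Call("addEventListener", "beforeunload",
		js.FuncOf(func(this js.Value, args []js.Value) any {
			flushState() // must be synchronous; the page is about to go away
			return nil
		}))
	select {} // keep the program alive so the callback can fire
}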
Appendix HHHHHHHH: Drain Across Operating Systems¶
- Linux: standard SIGTERM, 30s default grace in containers.
- macOS: same signals; rarely a production target.
- Windows: SIGTERM simulated via os.Interrupt. Less reliable; consider service-specific shutdown APIs.
- BSD: same as Linux.
Cross-platform Go services need careful signal handling. signal.NotifyContext handles most cases.
Appendix IIIIIIII: Drain Of Containerised Services¶
Container runtimes deliver signals to PID 1. Ensure your Go binary is PID 1.
If you wrap with a shell script, use exec to replace the shell (binary path illustrative):
#!/bin/sh
exec /app/server "$@"
Avoid a plain invocation:
#!/bin/sh
/app/server "$@"
This makes the shell PID 1; signals stop at the shell and never reach the Go binary.
Appendix JJJJJJJJ: Drain In Kubernetes Operators¶
Operators (custom controllers) drain by:
- Stopping the reconcile loop.
- Letting in-flight reconciles finish.
- Updating CR status.
- Exiting.
Operators are long-lived; drain quality matters.
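A sketch with controller-runtime, which ties the reconcile loop to signal-driven drain (manager options elided):
import ctrl "sigs.k8s.io/controller-runtime"

func run() error {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		return err
	}
	// SetupSignalHandler returns a context cancelled on SIGTERM/SIGINT.
	// Start blocks, then waits for in-flight reconciles before returning.
	return mgr.Start(ctrl.SetupSignalHandler())
}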
Appendix KKKKKKKK: Drain In Service Catalog Brokers¶
Brokers (Open Service Broker API) drain by:
- Stopping new provision requests.
- Letting in-flight requests finish.
- Persisting state.
- Exiting.
Brokers are infrastructure; drain affects downstream cluster ops.
Appendix LLLLLLLL: Drain In Admission Webhooks¶
A Kubernetes admission webhook (mutating or validating) drains by:
- Stopping the HTTPS listener.
- Letting in-flight admissions complete.
- Exiting.
Admission webhooks must respond fast (Kubernetes has tight timeouts). Drain is brief.
Appendix MMMMMMMM: Drain In CRD Controllers¶
Custom Resource controllers drain by:
- Stopping the informer.
- Letting in-flight reconciles complete.
- Updating CR status with shutdown reason.
- Exiting.
Long-lived controllers in production benefit from drain.
Appendix NNNNNNNN: Drain In Service Discovery Servers¶
Service discovery servers (Consul, etcd) drain by:
- Releasing leadership (if leader).
- Closing client connections.
- Letting raft consensus settle.
- Exiting.
Stateful; drain matters for cluster health.
Appendix OOOOOOOO: Final Final Final¶
I genuinely cannot think of more drain content to write.
The page is done.
Take care.
Build well.
Drain well.
Go.
Appendix PPPPPPPP: A Postscript Of Postscripts¶
If you have read every appendix, you have read more about drain than nearly anyone in the Go community. That is real expertise.
Use it carefully. Drain knowledge applied without judgement leads to over-engineered systems. Drain knowledge applied with judgement leads to systems that ship reliably.
The judgement comes from practice. Build. Audit. Mentor. Repeat.
Years from now, you will be the senior engineer pulling juniors aside and showing them the 5xx spikes on the dashboard. You will be the one writing the drain test that catches a regression. You will be the one whose code reviews catch the hung-worker case.
That is the path.
Welcome to it.
Appendix QQQQQQQQ: Truly Last¶
Goodbye. For real this time.
Drain awaits.