Decorator Pattern — Optimize¶
1. Goal of this file¶
This file is about when a naïve decorator is slow or wasteful, and when the fix is worth shipping. Junior taught the three shapes (interface, function/middleware, embedding). Middle taught the variants — generic decorators, ordering, recovery, stateful wrappers. Optimize is about the cases where a textbook decorator chain shows up in a CPU or allocation profile and you have to do something about it.
The honest envelope: most decorator chains are built once at startup (Chain(handler, Logging, Auth, Recover)), called per request at hundreds to thousands of QPS, and never measured. At those frequencies, each layer costs ~1-2 ns of interface dispatch and zero allocations. A 10-layer chain costs ~15 ns per request against ~100,000 ns of actual work. Nobody notices.
It becomes visible when:
- The chain is rebuilt per request instead of once at startup.
- A middleware allocates per call because of closure escape, typically defer, time.Now() captured, or buffer construction.
- The chain is 10+ layers deep and runs in a tight RPC inner loop.
- A recovery middleware's defer runs on every request even though panics are rare.
- A cache decorator's sync.Mutex contends under concurrent reads.
- A logging decorator does fmt.Sprintf of every field, every call.
- A metrics decorator computes WithLabelValues(...) per call on a HistogramVec.
- A rate limiter uses a mutex when one atomic counter would do.
- A chain reload rebuilds the world on every config change.
- An embedding-based decorator forwards 10 methods through interface dispatch.
- A trace decorator creates a span per call regardless of sampling decision.
- A buffered writer is sized for the wrong typical payload.
Baseline you need to beat. From middle.md §13:
BenchmarkDirectCharge-8 500000000 2.10 ns/op 0 B/op 0 allocs/op
BenchmarkOneDecorator-8 300000000 3.41 ns/op 0 B/op 0 allocs/op
BenchmarkFiveDecorators-8 100000000 12.50 ns/op 0 B/op 0 allocs/op
BenchmarkMiddlewareChain5-8 80000000 14.20 ns/op 0 B/op 0 allocs/op
A direct call is 2 ns. Each extra interface decorator adds ~2 ns. Five decorators ≈ 12 ns. That's the budget — most optimizations in this file fight for the difference between "15 ns and 0 allocs" and "5 µs and 8 allocs", which usually means killing per-request chain construction, closure escapes, or middleware that does expensive work on the happy path.
Structure of the file:
- Lifecycle wins (§3–§5): build chain once, kill closure escapes, devirtualize with PGO.
- Per-middleware wins (§6–§10): trim recovery defer, swap mutex for RWMutex or sync.Map, switch to structured logging, pre-compute metric labels, atomic rate-limit counter.
- Architecture wins (§11–§14): atomic.Pointer for hot-swap config, direct delegation over embedding, sampled tracing, properly sized buffers.
- Cost-benefit framing (§15).
2. Table of Contents¶
- Goal of this file
- Table of Contents
- Exercise 1: Chain built per-request — build once at startup
- Exercise 2: Closure capture in middleware allocating per call
- Exercise 3: Deep middleware chain — PGO devirtualization
- Exercise 4: Recovery middleware's defer — eliminate via stack discipline
- Exercise 5: Cache decorator using mutex — switch to sync.Map or RWMutex
- Exercise 6: Logging decorator with fmt.Sprintf — structured logger
- Exercise 7: Metrics decorator with HistogramVec — pre-compute labels
- Exercise 8: Rate limiter using mutex — atomic-based counter
- Exercise 9: Middleware chain rebuilt on config change — atomic.Pointer hot swap
- Exercise 10: Embedding decorator forwarding 10 methods — direct delegation
- Exercise 11: Trace decorator creating span per call — sampling
- Exercise 12: Buffered writer decorator with small buffers — size for typical payload
- When NOT to optimize
- The optimization checklist
- Summary
3. Exercise 1: Chain built per-request — build once at startup¶
Scenario¶
An HTTP handler is wrapped in a Chain helper inside the request handler. Each request rebuilds the entire chain — five closures, five interface boxings, five slice elements. The handler runs at 5k req/s; the chain construction allocates 320 B per request.
Before¶
package server
import (
"log"
"net/http"
"time"
)
type Middleware func(http.Handler) http.Handler
func Chain(h http.Handler, mws ...Middleware) http.Handler {
for i := len(mws) - 1; i >= 0; i-- {
h = mws[i](h)
}
return h
}
func Logging(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
next.ServeHTTP(w, r)
log.Printf("%s %s took %v", r.Method, r.URL.Path, time.Since(start))
})
}
func Recover(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
defer func() {
if rec := recover(); rec != nil {
http.Error(w, "internal error", 500)
}
}()
next.ServeHTTP(w, r)
})
}
func Auth(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.Header.Get("Authorization") == "" {
http.Error(w, "unauthorized", 401)
return
}
next.ServeHTTP(w, r)
})
}
// Trace and Metrics omitted for brevity — same shape.
var apiHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
})
// Anti-idiom: chain built per request.
func handle(w http.ResponseWriter, r *http.Request) {
h := Chain(apiHandler, Logging, Recover, Auth, Trace, Metrics)
h.ServeHTTP(w, r)
}
Benchmark¶
func BenchmarkPerRequestChain(b *testing.B) {
b.ReportAllocs()
req := httptest.NewRequest("GET", "/", nil)
req.Header.Set("Authorization", "Bearer xxx")
rec := httptest.NewRecorder()
b.ResetTimer()
for i := 0; i < b.N; i++ {
handle(rec, req)
}
}
11 allocations per request: one per middleware closure (5), one for the slice backing array (1), one per http.HandlerFunc wrapper (5).
After
Build the chain once at startup. Reuse the `http.Handler` value for every request.
package server
import "net/http"
var serveChain http.Handler
func init() {
serveChain = Chain(apiHandler, Logging, Recover, Auth, Trace, Metrics)
}
func handle(w http.ResponseWriter, r *http.Request) {
serveChain.ServeHTTP(w, r)
}
func RegisterRoutes(mux *http.ServeMux) {
chain := Chain(apiHandler, Logging, Recover, Auth, Trace, Metrics)
mux.Handle("/api", chain)
}
apiChain := Chain(apiHandler, Logging, Recover, Auth)
adminChain := Chain(adminHandler, Logging, Recover, Auth, AdminOnly)
mux.Handle("/api", apiChain)
mux.Handle("/admin", adminChain)
// Instead of choosing Auth or NoAuth per request, use one Auth that no-ops
// when the route is whitelisted.
func Auth(skipPaths []string) Middleware {
skipSet := make(map[string]struct{}, len(skipPaths))
for _, p := range skipPaths { skipSet[p] = struct{}{} }
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if _, skip := skipSet[r.URL.Path]; skip {
next.ServeHTTP(w, r)
return
}
// ... real auth ...
})
}
}
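The difference between the two shapes can be checked without an HTTP server. The following is a minimal sketch, assuming a hypothetical `passThrough` middleware that only forwards (so any allocation measured comes from chain construction, not from middleware work); `measure` and `hoisted` are illustrative names, and the exact counts depend on the Go version and escape analysis.

```go
package main

import (
	"fmt"
	"net/http"
	"testing"
)

type Middleware func(http.Handler) http.Handler

func Chain(h http.Handler, mws ...Middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// passThrough is a hypothetical stand-in for Logging/Auth/Recover: it only
// forwards, so it does no per-request work of its own.
func passThrough(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
	})
}

var terminal = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {})

// hoisted is built once, at package initialization.
var hoisted = Chain(terminal, passThrough, passThrough, passThrough)

// measure returns the average allocations per call for both shapes.
func measure() (perRequest, once float64) {
	perRequest = testing.AllocsPerRun(1000, func() {
		// Anti-idiom: the chain is rebuilt on every call.
		Chain(terminal, passThrough, passThrough, passThrough).ServeHTTP(nil, nil)
	})
	once = testing.AllocsPerRun(1000, func() {
		hoisted.ServeHTTP(nil, nil)
	})
	return perRequest, once
}

func main() {
	perRequest, once := measure()
	fmt.Printf("chain per request: %.0f allocs/op, chain built once: %.0f allocs/op\n", perRequest, once)
}
```

`testing.AllocsPerRun` works outside `go test`, which makes it a convenient first check before writing a full benchmark.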
4. Exercise 2: Closure capture in middleware allocating per call¶
Scenario¶
A middleware captures the request's start time and the response writer in a deferred closure to log latency. Before Go 1.13 every defer heap-allocated its defer record; Go 1.13 moved most records to the stack and Go 1.14 added open-coded defers, but later versions can still allocate when the defer sits in a loop or when escape analysis cannot keep the closure on the stack. Each request pays one closure allocation (~80 B).
Before¶
package middleware
import (
"log"
"net/http"
"time"
)
func LoggingLatency(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
defer func() {
// Closure captures start, r, w — escapes to heap.
log.Printf("%s %s took %v (size=%d)",
r.Method, r.URL.Path, time.Since(start), responseSize(w))
}()
next.ServeHTTP(w, r)
})
}
func responseSize(w http.ResponseWriter) int {
if sw, ok := w.(*sizingResponseWriter); ok { return sw.size }
return 0
}
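`responseSize` relies on a `sizingResponseWriter` that this exercise does not show. A minimal sketch of what it might look like, assuming it only needs to count body bytes passed to `Write` and that some outer layer has already wrapped the writer; `demoSize` is a hypothetical helper for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// sizingResponseWriter is an assumed shape: it wraps the real writer and
// counts the body bytes that pass through Write.
type sizingResponseWriter struct {
	http.ResponseWriter
	size int
}

func (s *sizingResponseWriter) Write(p []byte) (int, error) {
	n, err := s.ResponseWriter.Write(p)
	s.size += n
	return n, err
}

// responseSize reports 0 when the writer was never wrapped.
func responseSize(w http.ResponseWriter) int {
	if sw, ok := w.(*sizingResponseWriter); ok {
		return sw.size
	}
	return 0
}

// demoSize writes a fixed body through the wrapper and reports the count.
func demoSize() int {
	sw := &sizingResponseWriter{ResponseWriter: httptest.NewRecorder()}
	sw.Write([]byte("hello"))
	sw.Write([]byte(", world"))
	return responseSize(sw)
}

func main() {
	fmt.Println(demoSize())
}
```

Note that the wrapper itself is one more allocation per request unless it is pooled or embedded in a struct the request already owns.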
Benchmark¶
func BenchmarkLoggingLatency(b *testing.B) {
b.ReportAllocs()
h := LoggingLatency(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
}))
req := httptest.NewRequest("GET", "/", nil)
log.SetOutput(io.Discard)
b.ResetTimer()
for i := 0; i < b.N; i++ {
h.ServeHTTP(httptest.NewRecorder(), req)
}
}
The 5 allocs: the closure (~48 B), fmt.Sprintf formatting buffer (~64 B), the time.Duration → string conversion, the recorder's headers map, the r.URL.Path escape.
After
Eliminate the deferred closure by computing latency inline after `next.ServeHTTP`. The `defer` is only needed if you want logging to fire even when the inner panics, and that's the recovery middleware's job, not the logging middleware's.
package middleware
import (
"log"
"net/http"
"time"
)
func LoggingLatency(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
next.ServeHTTP(w, r)
// No defer — no closure allocation. log.Printf still allocates its
// format buffer, but the middleware's contribution drops to zero.
log.Printf("%s %s took %v", r.Method, r.URL.Path, time.Since(start))
})
}
import (
"net/http"
"os"
"strconv"
"time"
)
func LoggingLatency(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
next.ServeHTTP(w, r)
elapsed := time.Since(start)
var buf [128]byte
b := append(buf[:0], r.Method...)
b = append(b, ' ')
b = append(b, r.URL.Path...)
b = append(b, " took "...)
b = strconv.AppendInt(b, elapsed.Nanoseconds(), 10)
b = append(b, "ns\n"...)
os.Stderr.Write(b)
})
}
5. Exercise 3: Deep middleware chain — PGO devirtualization¶
Scenario¶
An internal RPC service stacks twelve layers: eleven middlewares (tracing, metrics, recovery, auth, authz, rate-limit, request-ID, timeout, logging, retry, circuit-breaker) plus the terminal handler. At 50k QPS, each request walks all twelve interface dispatches. Each dispatch is ~2 ns; twelve of them is ~24 ns. With PGO, the compiler can devirtualize the hottest path: it emits a direct call to the concrete handler, guarded by a check that the dynamic type matches.
Before¶
package middleware
import "net/http"
// 11 middlewares plus the terminal handler, each an http.Handler. The chain is built once at boot.
var chain http.Handler // 12 layers deep
func init() {
chain = Tracing(Metrics(Recovery(Auth(Authz(RateLimit(RequestID(Timeout(Logging(Retry(CircuitBreaker(terminal)))))))))))
}
func Handle(w http.ResponseWriter, r *http.Request) {
chain.ServeHTTP(w, r)
}
Benchmark¶
func BenchmarkDeepChain(b *testing.B) {
b.ReportAllocs()
req := httptest.NewRequest("GET", "/rpc", nil)
req.Header.Set("Authorization", "Bearer xxx")
rec := httptest.NewRecorder()
b.ResetTimer()
for i := 0; i < b.N; i++ {
Handle(rec, req)
}
}
810 ns per request, of which roughly 25 ns is dispatch overhead and the remainder is per-middleware work (header lookups, time stamps, atomic increments).
After
Use Go 1.21+ profile-guided optimization. Capture a profile under representative load, feed it back into the build.
# 1. Run benchmark with CPU profile.
go test -bench=BenchmarkDeepChain -cpuprofile=cpu.pgo
# 2. Build with PGO enabled.
go build -pgo=cpu.pgo -o server ./cmd/server
type FlatHandler struct {
tracer *Tracer
metrics *MetricsRecorder
auth *Authenticator
authz *Authorizer
rateLimit *RateLimiter
retry *RetryConfig
breaker *CircuitBreaker
terminal http.Handler
}
func (h *FlatHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
// Inline what each middleware did, in order. One interface dispatch (terminal).
span := h.tracer.Start(r.Context(), r.URL.Path)
defer span.End()
if err := h.auth.Verify(r); err != nil {
http.Error(w, "unauthorized", 401)
return
}
if !h.rateLimit.Allow(r) {
http.Error(w, "rate limited", 429)
return
}
// ... etc ...
h.terminal.ServeHTTP(w, r)
}
6. Exercise 4: Recovery middleware's defer — eliminate via stack discipline¶
Scenario¶
A recovery middleware installs a deferred closure that calls recover() on every request to catch handler panics. Panics are rare (< 1 per million requests in a stable service), but the defer fires on every request. With open-coded defers (Go 1.14+) the cost is small (~5 ns) but non-zero, and the closure for the recover function escapes in some patterns. At 100k QPS, 5 ns/request is 500 µs/sec of CPU: not a lot, but visible on a CPU profile.
Before¶
package middleware
import (
"log"
"net/http"
"runtime/debug"
)
func Recovery(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
defer func() {
if rec := recover(); rec != nil {
log.Printf("panic: %v\n%s", rec, debug.Stack())
http.Error(w, "internal server error", 500)
}
}()
next.ServeHTTP(w, r)
})
}
Benchmark¶
func BenchmarkRecovery(b *testing.B) {
b.ReportAllocs()
h := Recovery(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
}))
req := httptest.NewRequest("GET", "/", nil)
b.ResetTimer()
for i := 0; i < b.N; i++ {
h.ServeHTTP(httptest.NewRecorder(), req)
}
}
5 ns is the open-coded defer; the rest is the handler write. Open-coded defers (Go 1.14+) are cheap, but they still emit defer-tracking code in the function prologue/epilogue. The deferred closure does not escape because Go's compiler proves that recover doesn't capture the function pointer beyond the call.
The honest framing. The "before" here is already efficient. The discussion below is about when even 5 ns matters — typically only when you have many short-lived nested handlers in an inner RPC path.
After
**Path A: Recover at the outermost layer only.** If you have a deep chain (Exercise 3) and the inner middlewares can't panic except in ways your outermost recovery catches anyway, put recovery only at the top. The inner middlewares pay zero defer cost. Only the outermost layer's defer fires per request. Net savings: 5 ns × (chain depth - 1) per request.
**Path B: Use the HTTP server's built-in panic handling.** The `net/http` server already recovers panics in handler goroutines, logs a stack trace, and aborts the connection. The client sees a dropped connection rather than a 500 response, so this fits only when you need neither a custom error body nor custom logging. In that case, don't add recovery middleware at all.
// Trust http.Server's built-in panic protection.
// http.Server logs the panic and aborts the connection automatically.
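Path A can be sketched as follows. This is a minimal illustration, assuming a hypothetical `RequestID` inner middleware and helper names (`statusFor`, `demo`) that are not part of the original; the point is only that a single outermost `Recovery` still turns a panic raised anywhere below it into a 500.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// Recovery is the only layer in the chain that installs a defer.
func Recovery(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				log.Printf("panic: %v", rec)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// RequestID is a hypothetical inner middleware with no defer of its own.
func RequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Request-ID", "demo")
		next.ServeHTTP(w, r)
	})
}

// statusFor runs h behind Recovery(RequestID(...)) and returns the status.
func statusFor(h http.Handler) int {
	rec := httptest.NewRecorder()
	Recovery(RequestID(h)).ServeHTTP(rec, httptest.NewRequest("GET", "/", nil))
	return rec.Code
}

// demo returns the status for a healthy handler and for a panicking one.
func demo() (healthy, panicked int) {
	log.SetOutput(io.Discard) // keep the demo quiet
	fine := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	boom := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic("boom")
	})
	return statusFor(fine), statusFor(boom)
}

func main() {
	healthy, panicked := demo()
	fmt.Println(healthy, panicked)
}
```

The trade-off is that a 500 can only be written if no inner layer has already sent headers; a panic after the first byte of the body still produces a truncated response.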
7. Exercise 5: Cache decorator using mutex — switch to sync.Map or RWMutex¶
Scenario¶
A CachedRepo decorator wraps a database repository with a TTL cache. The cache uses a sync.Mutex to protect a map. Under concurrent reads — the common case for a hot cache — every reader contends on the same mutex. Throughput is bottlenecked by lock acquisition.
Before¶
package users
import (
"context"
"sync"
"time"
)
type Repo interface {
Get(ctx context.Context, id int) (User, error)
}
type CachedRepo struct {
Inner Repo
TTL time.Duration
mu sync.Mutex
entries map[int]cacheEntry
}
type cacheEntry struct {
user User
expires time.Time
}
func (c *CachedRepo) Get(ctx context.Context, id int) (User, error) {
c.mu.Lock()
if e, ok := c.entries[id]; ok && time.Now().Before(e.expires) {
c.mu.Unlock()
return e.user, nil
}
c.mu.Unlock()
u, err := c.Inner.Get(ctx, id)
if err != nil { return User{}, err }
c.mu.Lock()
c.entries[id] = cacheEntry{user: u, expires: time.Now().Add(c.TTL)}
c.mu.Unlock()
return u, nil
}
Benchmark¶
func BenchmarkCachedRepoParallel(b *testing.B) {
b.ReportAllocs()
c := &CachedRepo{
Inner: &fakeRepo{},
TTL: 5 * time.Minute,
entries: map[int]cacheEntry{},
}
// Pre-warm
for i := 0; i < 100; i++ {
c.Get(context.Background(), i)
}
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
i := 0
for pb.Next() {
c.Get(context.Background(), i%100)
i++
}
})
}
240 ns/op under parallel load on an 8-core machine. The per-call cost is mostly lock contention — sync.Mutex.Lock runs ~25 ns uncontended, but under 8-way concurrent traffic the contention can push it to 200+ ns.
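The benchmark references a `fakeRepo` that is not shown. A minimal sketch, assuming it only needs to count calls: it also confirms that once the cache is warm, concurrent readers never reach the inner repository, so the benchmark above is measuring lock cost rather than backend cost. `innerCallsAfterWarmup` is a hypothetical helper.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type User struct{ ID int }

type Repo interface {
	Get(ctx context.Context, id int) (User, error)
}

// fakeRepo is a hypothetical inner repository that counts how often it is hit.
type fakeRepo struct{ calls atomic.Int64 }

func (f *fakeRepo) Get(ctx context.Context, id int) (User, error) {
	f.calls.Add(1)
	return User{ID: id}, nil
}

type cacheEntry struct {
	user    User
	expires time.Time
}

// CachedRepo mirrors the mutex-based "before" version.
type CachedRepo struct {
	Inner   Repo
	TTL     time.Duration
	mu      sync.Mutex
	entries map[int]cacheEntry
}

func (c *CachedRepo) Get(ctx context.Context, id int) (User, error) {
	c.mu.Lock()
	if e, ok := c.entries[id]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.user, nil
	}
	c.mu.Unlock()
	u, err := c.Inner.Get(ctx, id)
	if err != nil {
		return User{}, err
	}
	c.mu.Lock()
	c.entries[id] = cacheEntry{user: u, expires: time.Now().Add(c.TTL)}
	c.mu.Unlock()
	return u, nil
}

// innerCallsAfterWarmup warms 100 keys, then reads them from 8 goroutines.
func innerCallsAfterWarmup() int64 {
	inner := &fakeRepo{}
	c := &CachedRepo{Inner: inner, TTL: 5 * time.Minute, entries: map[int]cacheEntry{}}
	ctx := context.Background()
	for i := 0; i < 100; i++ {
		c.Get(ctx, i) // 100 misses
	}
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Get(ctx, i%100) // all hits
			}
		}()
	}
	wg.Wait()
	return inner.calls.Load()
}

func main() {
	fmt.Println("inner calls:", innerCallsAfterWarmup())
}
```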
After
**Path A: RWMutex (when reads dominate).**
type CachedRepo struct {
Inner Repo
TTL time.Duration
mu sync.RWMutex
entries map[int]cacheEntry
}
func (c *CachedRepo) Get(ctx context.Context, id int) (User, error) {
c.mu.RLock()
e, ok := c.entries[id]
c.mu.RUnlock()
if ok && time.Now().Before(e.expires) {
return e.user, nil
}
u, err := c.Inner.Get(ctx, id)
if err != nil { return User{}, err }
c.mu.Lock()
c.entries[id] = cacheEntry{user: u, expires: time.Now().Add(c.TTL)}
c.mu.Unlock()
return u, nil
}
type CachedRepo struct {
Inner Repo
TTL time.Duration
entries sync.Map // map[int]*cacheEntry
}
func (c *CachedRepo) Get(ctx context.Context, id int) (User, error) {
if v, ok := c.entries.Load(id); ok {
e := v.(*cacheEntry)
if time.Now().Before(e.expires) {
return e.user, nil
}
}
u, err := c.Inner.Get(ctx, id)
if err != nil { return User{}, err }
c.entries.Store(id, &cacheEntry{user: u, expires: time.Now().Add(c.TTL)})
return u, nil
}
import (
"context"
"strconv"
"sync"
"time"
"golang.org/x/sync/singleflight"
)
type CachedRepo struct {
Inner Repo
TTL time.Duration
entries sync.Map
sf singleflight.Group
}
func (c *CachedRepo) Get(ctx context.Context, id int) (User, error) {
if v, ok := c.entries.Load(id); ok {
e := v.(*cacheEntry)
if time.Now().Before(e.expires) {
return e.user, nil
}
}
// Only one goroutine calls Inner.Get(id) at a time.
key := strconv.Itoa(id)
v, err, _ := c.sf.Do(key, func() (interface{}, error) {
u, err := c.Inner.Get(ctx, id)
if err != nil { return nil, err }
c.entries.Store(id, &cacheEntry{user: u, expires: time.Now().Add(c.TTL)})
return u, nil
})
if err != nil { return User{}, err }
return v.(User), nil
}
8. Exercise 6: Logging decorator with fmt.Sprintf — structured logger¶
Scenario¶
A logging decorator formats every field with fmt.Sprintf("user=%d action=%s amount=%d", userID, action, amount). The Sprintf allocates a buffer, walks the format string, boxes each argument into an interface{}. At thousands of logs per second, this is the largest source of allocations in the service.
Before¶
package middleware
import (
"context"
"log"
)
type LoggingCharger struct {
Inner Charger
}
func (l *LoggingCharger) Charge(ctx context.Context, userID int, action string, amount int) error {
log.Printf("user=%d action=%s amount=%d started", userID, action, amount)
err := l.Inner.Charge(ctx, userID, action, amount)
if err != nil {
log.Printf("user=%d action=%s amount=%d failed err=%v", userID, action, amount, err)
return err
}
log.Printf("user=%d action=%s amount=%d ok", userID, action, amount)
return nil
}
Benchmark¶
func BenchmarkLoggingChargerFmt(b *testing.B) {
b.ReportAllocs()
log.SetOutput(io.Discard)
l := &LoggingCharger{Inner: &noopCharger{}}
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = l.Charge(context.Background(), 42, "purchase", 1000)
}
}
Each log.Printf allocates: the format buffer (64 B), one interface{} box per argument (3 × 16 B), and the time.Now().String() for the log timestamp.
After
Use a zerolog-style structured logger that appends typed values without reflection.
package middleware
import (
"context"
"github.com/rs/zerolog"
)
type LoggingCharger struct {
Inner Charger
Logger zerolog.Logger
}
func (l *LoggingCharger) Charge(ctx context.Context, userID int, action string, amount int) error {
l.Logger.Info().
Int("user", userID).
Str("action", action).
Int("amount", amount).
Msg("started")
err := l.Inner.Charge(ctx, userID, action, amount)
if err != nil {
l.Logger.Error().
Int("user", userID).
Str("action", action).
Int("amount", amount).
Err(err).
Msg("failed")
return err
}
l.Logger.Info().
Int("user", userID).
Str("action", action).
Int("amount", amount).
Msg("ok")
return nil
}
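If adding a third-party logger is not an option, the standard library's `log/slog` (Go 1.21+) offers a comparable typed path. The sketch below assumes the same `Charger` shape as above and uses `Logger.LogAttrs` with typed `slog.Attr` constructors, which the slog documentation describes as the more efficient variant; `noopCharger` and `demo` are hypothetical helpers, and no allocation figures are claimed here.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

type Charger interface {
	Charge(ctx context.Context, userID int, action string, amount int) error
}

// noopCharger is a hypothetical inner implementation that always succeeds.
type noopCharger struct{}

func (noopCharger) Charge(context.Context, int, string, int) error { return nil }

type LoggingCharger struct {
	Inner  Charger
	Logger *slog.Logger
}

func (l *LoggingCharger) Charge(ctx context.Context, userID int, action string, amount int) error {
	// Typed attrs: no format string to parse, no key/value pairing at runtime.
	l.Logger.LogAttrs(ctx, slog.LevelInfo, "started",
		slog.Int("user", userID), slog.String("action", action), slog.Int("amount", amount))
	if err := l.Inner.Charge(ctx, userID, action, amount); err != nil {
		l.Logger.LogAttrs(ctx, slog.LevelError, "failed",
			slog.Int("user", userID), slog.String("action", action),
			slog.Int("amount", amount), slog.Any("err", err))
		return err
	}
	l.Logger.LogAttrs(ctx, slog.LevelInfo, "ok",
		slog.Int("user", userID), slog.String("action", action), slog.Int("amount", amount))
	return nil
}

// demo logs one successful charge into a buffer and inspects the JSON lines.
func demo() (lines int, hasUser bool) {
	var buf bytes.Buffer
	l := &LoggingCharger{Inner: noopCharger{}, Logger: slog.New(slog.NewJSONHandler(&buf, nil))}
	_ = l.Charge(context.Background(), 42, "purchase", 1000)
	out := buf.String()
	return strings.Count(out, "\n"), strings.Contains(out, `"user":42`)
}

func main() {
	lines, hasUser := demo()
	fmt.Println(lines, hasUser)
}
```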
import (
"log/slog"
"os"
)
logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))
logger.Info("started",
"user", userID,
"action", action,
"amount", amount)
9. Exercise 7: Metrics decorator with HistogramVec — pre-compute labels¶
Scenario¶
A metrics decorator records request latency to a Prometheus HistogramVec. Each call does histogram.WithLabelValues(method, status).Observe(elapsed). WithLabelValues hashes the label values and looks up the matching child histogram in an internal, lock-protected map on every observation. At high QPS this is a measurable fraction of the per-request cost.
Before¶
package middleware
import (
"net/http"
"strconv"
"time"
"github.com/prometheus/client_golang/prometheus"
)
var requestLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Buckets: prometheus.DefBuckets,
}, []string{"method", "status"})
type sizingResponseWriter struct {
http.ResponseWriter
status int
}
func (s *sizingResponseWriter) WriteHeader(c int) {
s.status = c
s.ResponseWriter.WriteHeader(c)
}
func Metrics(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sw := &sizingResponseWriter{ResponseWriter: w, status: 200}
start := time.Now()
next.ServeHTTP(sw, r)
// Per-call: hash label values, look up histogram, observe.
requestLatency.WithLabelValues(r.Method, strconv.Itoa(sw.status)).Observe(time.Since(start).Seconds())
})
}
Benchmark¶
func BenchmarkMetricsMiddleware(b *testing.B) {
b.ReportAllocs()
h := Metrics(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
}))
req := httptest.NewRequest("GET", "/", nil)
b.ResetTimer()
for i := 0; i < b.N; i++ {
h.ServeHTTP(httptest.NewRecorder(), req)
}
}
The allocations: the sizingResponseWriter, the strconv.Itoa result, the WithLabelValues lookup's intermediate slice, and the Observe value-receiver copy.
After
Pre-compute the observers for the common (method, status) pairs at startup. Look up the histogram object directly via a small in-memory map.
package middleware
import (
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
)
var requestLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Buckets: prometheus.DefBuckets,
}, []string{"method", "status"})
// Pre-resolved observers keyed by (method, status). The status is folded into
// buckets to keep the key space small: 2xx, 3xx, 4xx, 5xx.
var observers = map[string]prometheus.Observer{}
func init() {
methods := []string{"GET", "POST", "PUT", "DELETE", "PATCH", "HEAD", "OPTIONS"}
statusBuckets := []string{"2xx", "3xx", "4xx", "5xx"}
for _, m := range methods {
for _, s := range statusBuckets {
observers[m+s] = requestLatency.WithLabelValues(m, s)
}
}
}
func statusBucket(code int) string {
switch code / 100 {
case 2: return "2xx"
case 3: return "3xx"
case 4: return "4xx"
case 5: return "5xx"
}
return "unknown"
}
func Metrics(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sw := &sizingResponseWriter{ResponseWriter: w, status: 200}
start := time.Now()
next.ServeHTTP(sw, r)
// O(1) map lookup, no string conversion of the status code.
if obs, ok := observers[r.Method+statusBucket(sw.status)]; ok {
obs.Observe(time.Since(start).Seconds())
} else {
// Cold path — uncommon method or status.
requestLatency.WithLabelValues(r.Method, statusBucket(sw.status)).Observe(time.Since(start).Seconds())
}
})
}
// Lookup keyed by [method][bucket] (two-level map, no string alloc).
var observersByMethod = map[string]map[string]prometheus.Observer{}
func init() {
methods := []string{"GET", "POST", "PUT", "DELETE", "PATCH", "HEAD", "OPTIONS"}
statusBuckets := []string{"2xx", "3xx", "4xx", "5xx"}
for _, m := range methods {
observersByMethod[m] = map[string]prometheus.Observer{}
for _, s := range statusBuckets {
observersByMethod[m][s] = requestLatency.WithLabelValues(m, s)
}
}
}
func Metrics(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sw := &sizingResponseWriter{ResponseWriter: w, status: 200}
start := time.Now()
next.ServeHTTP(sw, r)
if byStatus, ok := observersByMethod[r.Method]; ok {
if obs, ok := byStatus[statusBucket(sw.status)]; ok {
obs.Observe(time.Since(start).Seconds())
return
}
}
// Cold fallback.
})
}
var swPool = sync.Pool{
New: func() any { return &sizingResponseWriter{} },
}
func Metrics(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sw := swPool.Get().(*sizingResponseWriter)
sw.ResponseWriter = w
sw.status = 200
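defer func() {
sw.ResponseWriter = nil // drop the reference so the pool does not pin this request's writer
swPool.Put(sw)
}()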
start := time.Now()
next.ServeHTTP(sw, r)
// ... lookup and observe ...
})
}
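The two-level lookup and its cold fallback can be exercised without the Prometheus module. This is a sketch under stated assumptions: `observer` is a stand-in for `prometheus.Observer`, and `countingObserver`, `table`, `newTable`, and `demo` are hypothetical names used only to make the hot and cold paths observable.

```go
package main

import "fmt"

// observer is a stand-in for prometheus.Observer so the sketch needs no
// third-party module.
type observer interface{ Observe(float64) }

type countingObserver struct{ n int }

func (c *countingObserver) Observe(float64) { c.n++ }

func statusBucket(code int) string {
	switch code / 100 {
	case 2:
		return "2xx"
	case 3:
		return "3xx"
	case 4:
		return "4xx"
	case 5:
		return "5xx"
	}
	return "unknown"
}

// table holds pre-resolved observers keyed by [method][bucket].
type table struct {
	byMethod map[string]map[string]observer
	cold     *countingObserver // stands in for the WithLabelValues fallback
}

func newTable() *table {
	t := &table{byMethod: map[string]map[string]observer{}, cold: &countingObserver{}}
	for _, m := range []string{"GET", "POST", "PUT", "DELETE", "PATCH", "HEAD", "OPTIONS"} {
		t.byMethod[m] = map[string]observer{}
		for _, s := range []string{"2xx", "3xx", "4xx", "5xx"} {
			t.byMethod[m][s] = &countingObserver{}
		}
	}
	return t
}

// record takes the hot path when both keys are pre-resolved, else the cold one.
func (t *table) record(method string, status int, seconds float64) {
	if byStatus, ok := t.byMethod[method]; ok {
		if obs, ok := byStatus[statusBucket(status)]; ok {
			obs.Observe(seconds)
			return
		}
	}
	t.cold.Observe(seconds)
}

// demo records two hot observations and two that fall through.
func demo() (hot, cold int) {
	t := newTable()
	t.record("GET", 200, 0.01)
	t.record("GET", 204, 0.02)
	t.record("PROPFIND", 207, 0.03) // unknown method
	t.record("GET", 101, 0.01)      // 1xx maps to "unknown"
	return t.byMethod["GET"]["2xx"].(*countingObserver).n, t.cold.n
}

func main() {
	hot, cold := demo()
	fmt.Println(hot, cold)
}
```

Whatever shape the table takes, the fallback branch has to record something; silently dropping observations for unexpected methods or 1xx codes skews the histogram.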
10. Exercise 8: Rate limiter using mutex — atomic-based counter¶
Scenario¶
A rate-limit decorator uses a sync.Mutex around a token-bucket counter. Each call locks, decrements, unlocks. Under high QPS the mutex becomes the throughput limit — the rate limiter itself is the bottleneck, not the rate it's enforcing.
Before¶
package middleware
import (
"context"
"errors"
"sync"
"time"
)
type RateLimit struct {
Inner Charger
mu sync.Mutex
tokens int
capacity int
refillAt time.Time
rate int // tokens per second
}
var ErrRateLimited = errors.New("rate limited")
func (r *RateLimit) Charge(ctx context.Context, amount int) error {
r.mu.Lock()
now := time.Now()
elapsed := now.Sub(r.refillAt)
refill := int(elapsed.Seconds()) * r.rate
if refill > 0 {
r.tokens += refill
if r.tokens > r.capacity { r.tokens = r.capacity }
r.refillAt = now
}
if r.tokens <= 0 {
r.mu.Unlock()
return ErrRateLimited
}
r.tokens--
r.mu.Unlock()
return r.Inner.Charge(ctx, amount)
}
Benchmark¶
func BenchmarkRateLimitMutex(b *testing.B) {
b.ReportAllocs()
r := &RateLimit{
Inner: &noopCharger{},
tokens: 1_000_000_000,
capacity: 1_000_000_000,
refillAt: time.Now(),
rate: 1_000_000,
}
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
_ = r.Charge(context.Background(), 100)
}
})
}
180 ns/op under parallel load on 8 cores. The mutex is the bottleneck — uncontended mutex is 25 ns, but with 8 cores hammering the same lock it's 180 ns.
After
Token bucket using `atomic.Int64`. A single CAS replaces the lock.
package middleware
import (
"context"
"errors"
"sync/atomic"
"time"
)
type RateLimit struct {
Inner Charger
tokens atomic.Int64 // current token count
capacity int64
rate int64 // tokens added per second
lastNs atomic.Int64 // last refill time, Unix nanoseconds (wall clock, not monotonic)
}
func NewRateLimit(inner Charger, capacity int64, ratePerSec int64) *RateLimit {
r := &RateLimit{
Inner: inner,
capacity: capacity,
rate: ratePerSec,
}
r.tokens.Store(capacity)
r.lastNs.Store(time.Now().UnixNano())
return r
}
func (r *RateLimit) Charge(ctx context.Context, amount int) error {
nowNs := time.Now().UnixNano()
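lastNs := r.lastNs.Load()
// Refill: tokens to add = rate * (now - last) / 1e9
refill := r.rate * (nowNs - lastNs) / 1_000_000_000
// Advance lastNs only once a whole token has accrued, and only in the
// goroutine that wins the CAS. An unconditional Swap would discard elapsed
// time whenever calls arrive faster than one token interval.
if refill > 0 && r.lastNs.CompareAndSwap(lastNs, nowNs) {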
// Add refill, cap at capacity.
for {
cur := r.tokens.Load()
next := cur + refill
if next > r.capacity { next = r.capacity }
if r.tokens.CompareAndSwap(cur, next) { break }
}
}
// Take one token if available.
for {
cur := r.tokens.Load()
if cur <= 0 { return ErrRateLimited }
if r.tokens.CompareAndSwap(cur, cur-1) { break }
}
return r.Inner.Charge(ctx, amount)
}
import "golang.org/x/time/rate"
type RateLimit struct {
Inner Charger
limiter *rate.Limiter
}
func (r *RateLimit) Charge(ctx context.Context, amount int) error {
if !r.limiter.Allow() {
return ErrRateLimited
}
return r.Inner.Charge(ctx, amount)
}
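The take-one-token CAS loop from the atomic version is the part that must stay correct under contention. A minimal sketch that isolates it, assuming a hypothetical `bucket` type with no refill and a helper `grantedUnderContention` that hammers it from several goroutines:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type bucket struct{ tokens atomic.Int64 }

// take removes one token if any remain. The loop retries only when another
// goroutine changed the counter between Load and CompareAndSwap.
func (b *bucket) take() bool {
	for {
		cur := b.tokens.Load()
		if cur <= 0 {
			return false
		}
		if b.tokens.CompareAndSwap(cur, cur-1) {
			return true
		}
	}
}

// grantedUnderContention counts how many take calls succeed in total.
func grantedUnderContention(capacity int64, goroutines, attempts int) int64 {
	b := &bucket{}
	b.tokens.Store(capacity)
	var granted atomic.Int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < attempts; i++ {
				if b.take() {
					granted.Add(1)
				}
			}
		}()
	}
	wg.Wait()
	return granted.Load()
}

func main() {
	// 8 goroutines x 1000 attempts against 500 tokens.
	fmt.Println(grantedUnderContention(500, 8, 1000))
}
```

A plain `tokens.Add(-1)` followed by a sign check would be cheaper but can drive the counter negative under load; the CAS loop never hands out more tokens than the bucket holds.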
11. Exercise 9: Middleware chain rebuilt on config change — atomic.Pointer hot swap¶
Scenario¶
A middleware stack is parameterized by config (rate limits, auth providers, feature flags). When config changes, the service rebuilds the chain and replaces it. The naïve implementation uses a mutex around the chain pointer; readers (request handlers) acquire it on every request. Under high QPS the read-lock dominates.
Before¶
package server
import (
"net/http"
"sync"
)
type Server struct {
mu sync.RWMutex
chain http.Handler
}
func (s *Server) Reload(cfg Config) {
newChain := buildChain(cfg)
s.mu.Lock()
s.chain = newChain
s.mu.Unlock()
}
func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
s.mu.RLock()
h := s.chain
s.mu.RUnlock()
h.ServeHTTP(w, r)
}
Benchmark¶
func BenchmarkRWMutexChain(b *testing.B) {
b.ReportAllocs()
s := &Server{chain: buildChain(defaultCfg)}
req := httptest.NewRequest("GET", "/", nil)
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
s.ServeHTTP(httptest.NewRecorder(), req)
}
})
}
120 ns/op. Most of that is the RLock + RUnlock pair, which involves atomic counters and (under contention) memory barriers.
After
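Use `atomic.Pointer` with a small holder struct (`atomic.Pointer[handlerHolder]`), so the stored value is always a pointer to a concrete type. The read path is a single atomic load: no lock, no counter, no memory barrier beyond what the CPU provides.
package server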
import (
"net/http"
"sync/atomic"
)
type Server struct {
chain atomic.Pointer[handlerHolder]
}
type handlerHolder struct {
h http.Handler
}
func NewServer(cfg Config) *Server {
s := &Server{}
s.chain.Store(&handlerHolder{h: buildChain(cfg)})
return s
}
func (s *Server) Reload(cfg Config) {
s.chain.Store(&handlerHolder{h: buildChain(cfg)})
}
func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
s.chain.Load().h.ServeHTTP(w, r)
}
type Server struct {
chain atomic.Pointer[handlerHolder]
building atomic.Bool
}
func (s *Server) Reload(cfg Config) error {
if !s.building.CompareAndSwap(false, true) {
return errors.New("already reloading")
}
defer s.building.Store(false)
newChain := buildChain(cfg) // expensive
s.chain.Store(&handlerHolder{h: newChain})
return nil
}
type Router struct {
routes sync.Map // map[string]*atomic.Pointer[handlerHolder]
}
func (r *Router) ServeHTTP(w http.ResponseWriter, req *http.Request) {
v, ok := r.routes.Load(req.URL.Path)
if !ok { http.NotFound(w, req); return }
v.(*atomic.Pointer[handlerHolder]).Load().h.ServeHTTP(w, req)
}
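The swap behaviour can be checked end to end in a few lines. This is a sketch, assuming a hypothetical `statusHandler` in place of `buildChain(cfg)` so that the chain in effect is visible from the response code; `demo` is likewise an illustrative helper.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

type handlerHolder struct{ h http.Handler }

type Server struct {
	chain atomic.Pointer[handlerHolder]
}

func NewServer(h http.Handler) *Server {
	s := &Server{}
	s.chain.Store(&handlerHolder{h: h})
	return s
}

// Reload publishes a fully built chain; in-flight requests keep the old one.
func (s *Server) Reload(h http.Handler) {
	s.chain.Store(&handlerHolder{h: h})
}

func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.chain.Load().h.ServeHTTP(w, r)
}

// statusHandler is a hypothetical stand-in for buildChain(cfg).
func statusHandler(code int) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(code)
	})
}

// demo serves one request before and one after a reload.
func demo() (before, after int) {
	s := NewServer(statusHandler(http.StatusOK))
	req := httptest.NewRequest("GET", "/", nil)

	rec := httptest.NewRecorder()
	s.ServeHTTP(rec, req)
	before = rec.Code

	s.Reload(statusHandler(http.StatusServiceUnavailable))

	rec = httptest.NewRecorder()
	s.ServeHTTP(rec, req)
	after = rec.Code
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before, after)
}
```

Because `NewServer` stores a holder before returning, `Load()` never yields nil on the request path; a zero-value `Server` would need a nil check.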
12. Exercise 10: Embedding decorator forwarding 10 methods — direct delegation¶
Scenario¶
A storage interface has 10 methods. A logging decorator embeds the interface and overrides only Read — the other 9 methods are forwarded through the embedded interface. Each forwarded call costs an extra interface dispatch (~2 ns). On a hot path that exercises all 10 methods, the overhead adds up.
Before¶
package storage
import (
"context"
"log"
)
type Storage interface {
Read(ctx context.Context, key string) ([]byte, error)
Write(ctx context.Context, key string, value []byte) error
Delete(ctx context.Context, key string) error
List(ctx context.Context, prefix string) ([]string, error)
Lock(ctx context.Context, key string) error
Unlock(ctx context.Context, key string) error
Subscribe(ctx context.Context, prefix string) (<-chan Event, error)
Stats(ctx context.Context) (Statistics, error)
Snapshot(ctx context.Context) error
Restore(ctx context.Context, id string) error
}
type LoggingStorage struct {
Storage // embedded — methods are promoted
Log *log.Logger
}
// Only Read is decorated. The other 9 forward through embedded Storage.
func (l *LoggingStorage) Read(ctx context.Context, key string) ([]byte, error) {
l.Log.Printf("Read: %s", key)
return l.Storage.Read(ctx, key)
}
Benchmark¶
func BenchmarkEmbeddedForward(b *testing.B) {
b.ReportAllocs()
inner := &diskStorage{...}
l := &LoggingStorage{Storage: inner, Log: log.New(io.Discard, "", 0)}
ctx := context.Background()
b.ResetTimer()
for i := 0; i < b.N; i++ {
// Hot path uses Write (not decorated, forwarded).
_ = l.Write(ctx, "key", []byte("value"))
}
}
The 62 ns is the inner disk-storage Write plus two interface dispatches: one to call Write on *LoggingStorage (which promotes from Storage), and one inside that promoted method to call Storage.Write on the embedded interface.
After
If only one method needs decoration, drop the embedding and forward the other 9 methods by hand. The forwarding itself is not what saves the dispatch, though: embedding an *interface* incurs the dispatch, while holding a *concrete pointer* doesn't. The "verbose" version has *one fewer interface dispatch per non-decorated call* because the wrapper holds a concrete type:
type LoggingStorage struct {
inner *diskStorage // concrete type — no interface dispatch
Log *log.Logger
}
// Forward 9 methods explicitly. Each is a direct call.
func (l *LoggingStorage) Write(ctx context.Context, key string, value []byte) error {
return l.inner.Write(ctx, key, value)
}
func (l *LoggingStorage) Delete(ctx context.Context, key string) error {
return l.inner.Delete(ctx, key)
}
// ... etc for 7 more ...
func (l *LoggingStorage) Read(ctx context.Context, key string) ([]byte, error) {
l.Log.Printf("Read: %s", key)
return l.inner.Read(ctx, key)
}
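If the wrapper has to work with more than one backend, a generic variant keeps the inner type fixed per instantiation. Whether the compiler turns the forwarded call into a direct call depends on the Go version and the type argument, so benchmark it:
type LoggingStorage[S Storage] struct {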
inner S
Log *log.Logger
}
func (l *LoggingStorage[S]) Write(ctx context.Context, key string, value []byte) error {
return l.inner.Write(ctx, key, value)
}
// ... etc ...
func (l *LoggingStorage[S]) Read(ctx context.Context, key string) ([]byte, error) {
l.Log.Printf("Read: %s", key)
return l.inner.Read(ctx, key)
}
13. Exercise 11: Trace decorator creating span per call — sampling¶
Scenario¶
A tracing decorator creates a span for every request, even though only 1 % of traces are kept. Span creation involves generating a UUID, allocating attributes, and notifying the exporter, typically ~500 ns per span. At 100k QPS, this is 50 ms/sec of CPU spent on spans that are discarded.
Before¶
package middleware
import (
"net/http"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/trace"
)
var tracer = otel.Tracer("server")
func Tracing(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), r.URL.Path,
trace.WithAttributes(
attribute.String("method", r.Method),
attribute.String("user_agent", r.UserAgent()),
),
)
defer span.End()
next.ServeHTTP(w, r.WithContext(ctx))
})
}
Benchmark¶
func BenchmarkTracing(b *testing.B) {
b.ReportAllocs()
h := Tracing(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
}))
req := httptest.NewRequest("GET", "/api", nil)
b.ResetTimer()
for i := 0; i < b.N; i++ {
h.ServeHTTP(httptest.NewRecorder(), req)
}
}
The full span pipeline: UUID gen, attribute slice alloc, exporter call (no-op when sampled out, but still invoked), context.WithValue chain.
After
Sample at the decorator level, *before* paying span creation costs. Only create spans for sampled requests.
package middleware
import (
"hash/maphash"
"net/http"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/trace"
)
var (
tracer = otel.Tracer("server")
sampleRate = uint64(100) // 1 in 100
seed = maphash.MakeSeed()
)
func shouldSample(r *http.Request) bool {
// Use the trace-parent header for distributed consistency; fall back to a hash.
if tp := r.Header.Get("traceparent"); tp != "" {
// Decisions are propagated by the W3C trace context header.
return tracecontextSampled(tp)
}
// Local decision: hash a stable request identifier.
var h maphash.Hash
h.SetSeed(seed)
h.WriteString(r.URL.Path)
h.WriteString(r.RemoteAddr)
return h.Sum64()%sampleRate == 0
}
func Tracing(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if !shouldSample(r) {
next.ServeHTTP(w, r)
return
}
ctx, span := tracer.Start(r.Context(), r.URL.Path,
trace.WithAttributes(
attribute.String("method", r.Method),
attribute.String("user_agent", r.UserAgent()),
),
)
defer span.End()
next.ServeHTTP(w, r.WithContext(ctx))
})
}
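To change the rate without a restart, store it in an atomic. This variant replaces the package-level sampleRate above and needs the sync/atomic import:
var sampleRate atomic.Uint64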
func init() { sampleRate.Store(100) }
func SetSampleRate(rate uint64) { sampleRate.Store(rate) }
func shouldSample(r *http.Request) bool {
rate := sampleRate.Load()
if rate <= 1 { return true }
// ... hash logic ...
return h.Sum64()%rate == 0
}
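The `tracecontextSampled` helper is called above but not shown. A minimal sketch of one possible implementation, assuming the W3C `traceparent` layout of `version-traceid-parentid-flags` with the sampled bit as the lowest bit of the flags byte. Treating a malformed header as "not sampled" is a policy choice made for this sketch, and the ID fields are not validated as hex here.

```go
package main

import (
	"fmt"
	"strconv"
)

// tracecontextSampled reports whether a traceparent header value carries the
// sampled flag. Hypothetical implementation of the helper used above.
//
// Expected layout (55 bytes for version 00):
//   2 hex version, '-', 32 hex trace-id, '-', 16 hex parent-id, '-', 2 hex flags
func tracecontextSampled(tp string) bool {
	// Fixed offsets avoid splitting the string, so the happy path does not allocate.
	if len(tp) < 55 || tp[2] != '-' || tp[35] != '-' || tp[52] != '-' {
		return false
	}
	flags, err := strconv.ParseUint(tp[53:55], 16, 8)
	if err != nil {
		return false
	}
	return flags&0x01 != 0 // bit 0 is the sampled flag
}

func main() {
	fmt.Println(tracecontextSampled("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")) // true
	fmt.Println(tracecontextSampled("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00")) // false
	fmt.Println(tracecontextSampled("garbage"))                                                  // false
}
```

Honoring the upstream decision keeps a distributed trace either fully recorded or fully dropped, which is usually what you want from a sampling decorator.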
14. Exercise 12: Buffered writer decorator with small buffers — size for typical payload¶
Scenario¶
A bufio.Writer wraps a network connection. The default buffer size (4 KB) was chosen for terminal I/O; the actual payload is large JSON responses (~64 KB typical). Each write flushes the buffer multiple times, doing a syscall per 4 KB. Increasing the buffer to the typical payload size eliminates the flushes.
Before¶
package server
import (
"bufio"
"encoding/json"
"net/http"
)
func WriteJSON(w http.ResponseWriter, v any) error {
bw := bufio.NewWriter(w) // default 4 KB buffer
enc := json.NewEncoder(bw)
if err := enc.Encode(v); err != nil {
return err
}
return bw.Flush()
}
Benchmark¶
func BenchmarkWriteJSONDefault(b *testing.B) {
b.ReportAllocs()
payload := generateLargeJSON() // ~64 KB
b.ResetTimer()
for i := 0; i < b.N; i++ {
w := &discardResponseWriter{}
_ = WriteJSON(w, payload)
}
}
17 allocations — the bufio buffer (4 KB), the encoder's internal buffer, and 15 flushes that each allocate a small follow-up buffer.
After
Size the buffer for the typical payload. Pool the buffer for reuse.
package server
import (
"bufio"
"encoding/json"
"net/http"
"sync"
)
const typicalPayloadSize = 64 * 1024 // 64 KB — matches observed median
var writerPool = sync.Pool{
New: func() any {
return bufio.NewWriterSize(nil, typicalPayloadSize)
},
}
func WriteJSON(w http.ResponseWriter, v any) error {
bw := writerPool.Get().(*bufio.Writer)
bw.Reset(w)
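defer func() {
// Drop the reference to w (and any unflushed bytes) before pooling,
// so the pool does not keep the ResponseWriter alive.
bw.Reset(nil)
writerPool.Put(bw)
}()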
enc := json.NewEncoder(bw)
if err := enc.Encode(v); err != nil {
return err
}
return bw.Flush()
}
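An alternative is to encode into a pooled bytes.Buffer and hand the response writer a single Write. This variant needs the bytes import:
var bufPool = sync.Pool{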
New: func() any {
b := bytes.NewBuffer(make([]byte, 0, typicalPayloadSize))
return b
},
}
func WriteJSON(w http.ResponseWriter, v any) error {
buf := bufPool.Get().(*bytes.Buffer)
buf.Reset()
defer bufPool.Put(buf)
if err := json.NewEncoder(buf).Encode(v); err != nil {
return err
}
_, err := w.Write(buf.Bytes())
return err
}
15. When NOT to optimize¶
The honest framing: most decorator chains should not be optimized. The pattern is cheap. The wins exist only when:
| Condition | Threshold to bother |
|---|---|
| Decorator call frequency | > 10k calls/sec sustained |
| Profile shows middleware methods in top 5 % CPU | Yes |
| Allocation profile shows closures/maps from middleware in top 10 | Yes |
| Chain rebuilding shows in alloc profile | Yes (Exercise 1 — almost always worth fixing) |
| Cache decorator's mutex contention > 10 % in mutex profile | Yes |
| Logging is on the hot path with fmt.Sprintf | Yes |
| The "fix" doesn't change the public API or break correctness | Yes |
| You can write a regression test | Yes |
| The fix survives a Go version bump | Probably yes |
If you can't tick most of those, don't optimize. The decorator chains in chi, gorilla/mux, the net/http standard library, and gRPC interceptors are all "naïve" by the standards of this file; they ship because the simple version is good enough.
Specific anti-patterns to avoid:
| Anti-pattern | Why it's bad |
|---|---|
| Pre-allocating every chain at startup "for speed" when chains are stable already | The chain is already built once; you're optimizing nothing |
| Flattening every middleware chain | Loses composability and test isolation for ~1 µs that wasn't on the profile |
| Switching every cache from sync.Mutex to sync.Map | sync.Map is slower for write-heavy or low-cardinality workloads |
| Removing recovery middleware to save 5 ns | One panic in production wipes out a year of "savings" |
| Aggressive sampling that loses error traces | The whole point of tracing is to debug failures — sample tail-latency at minimum |
| sync.Pool for every middleware buffer | Pool overhead matches the savings below ~10k QPS |
| PGO for a service that handles 100 QPS | The CI complexity exceeds the value |
| Hand-rolling structured loggers | Use slog or zerolog; don't reinvent |
| Atomic pointer for chain that reloads daily | RWMutex's read-side is fine at low reload rates |
| Pre-computing metric labels for unbounded label sets | Cardinality explosion — the metric setup is broken in a different way |
The default answer to "can we make this decorator faster?" is no, it's fine. The yes cases are narrow and benchmark-justified.
16. The optimization checklist¶
Before shipping any optimization from this file:
- Baseline benchmark exists (the unoptimized decorator chain).
- Optimized benchmark shows ≥ 2× improvement OR saves ≥ 1 allocation per call.
- pprof confirms the optimization targets a real hot spot (top 5 % CPU or top 10 allocs).
- mutex.prof confirms contention reduction for lock-related changes.
- The new code passes the same tests as the old.
- -gcflags=-m shows no unexpected escapes (especially for closure changes).
- -race is clean (especially for cached chains, atomic pointers, lock-free counters).
- Documentation explains the assumption the optimization makes ("chain is static", "config reloads via atomic swap", "logs at INFO+").
- CI regression test (benchstat) compares against the baseline.
- Code review has signed off on the trade-off (especially for API-shape changes like flattening or PGO).
- The "When NOT to do this" condition from the relevant exercise has been checked.
- If the optimization changes panic behavior (Exercise 4), verify the recovery contract is preserved at the outer layer.
If any item is missing, the optimization isn't ready.
17. Summary¶
A decorator chain in Go is already fast: ~2 ns per layer and zero allocations per call. Most optimizations in this file save 10-1000 ns and 1-5 allocations. That matters at 10k QPS. It does not matter at 100 QPS.
The wins worth shipping cluster in eight areas:
- Build the chain once at startup, not per request (Exercise 1): 19× faster, removes 11 allocations. Pure win when the chain is static. Almost always worth fixing if you find it.
- Eliminate closure escapes in middleware (Exercise 2): 3× faster, removes a heap allocation per request. Pure win when defer isn't needed for correctness.
- sync.Map or RWMutex for read-heavy cache decorators (Exercise 5): 4-8× faster under parallel load. Pure win when reads dominate writes.
- Structured logger instead of fmt.Printf (Exercise 6): 11× faster, zero allocations. Pure win that also gives you structured logs for free.
- Pre-compute metric label observers at startup (Exercise 7): 3-8× faster, zero allocations. Pure win when label cardinality is bounded.
- Atomic counter instead of mutex for rate limiters (Exercise 8): 5× faster under contention. Real win for a single shared limiter.
- atomic.Pointer for chain hot-swap (Exercise 9): 4× faster than RWMutex. Pure win when reads dominate reloads.
- Buffered writers sized for typical payload (Exercise 12): 7× faster, 15 fewer allocations. Pure win when payload size is known.
The wins that don't always pay off:
- PGO devirtualization (Exercise 3) — 1.5-2× faster, requires profile-build pipeline; only worth it at scale.
- Eliminate recovery middleware's defer (Exercise 4) — 2× faster, but the safety cost is real.
- Sampled tracing (Exercise 11) — 17× faster on average, but loses 99 % of trace detail.
- Direct delegation over embedding (Exercise 10) — 2× faster, costs 30 lines of maintenance per decorator.
- Flattening deep chains (Exercise 3) — 3× faster, kills composability and testability.
Always benchmark. Always check -race. Always check mutex.prof for lock changes. Always confirm the optimization survives a Go version bump. Most production codebases need none of these optimizations; the decorator pattern is fine as written in junior.md and middle.md.
Further reading¶
- Go 1.21+ PGO: https://go.dev/doc/pgo
- log/slog: https://pkg.go.dev/log/slog
- zerolog: https://pkg.go.dev/github.com/rs/zerolog
- atomic.Pointer[T]: https://pkg.go.dev/sync/atomic#Pointer
- golang.org/x/sync/singleflight: https://pkg.go.dev/golang.org/x/sync/singleflight
- golang.org/x/time/rate: https://pkg.go.dev/golang.org/x/time/rate
- OpenTelemetry sampling: https://opentelemetry.io/docs/concepts/sampling/
- benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
- Escape analysis: https://github.com/golang/go/wiki/CompilerOptimizations
- Sibling: middle.md — variant choices
- Sibling: junior.md — the baseline shape
- Related: ../03-strategy-pattern/optimize.md — same file shape for strategy
- Inspiration (zero-allocation HTTP): https://github.com/valyala/fasthttp
- Inspiration (decorator-rich stdlib): bufio, gzip, tls, httputil