Rate Limiter — Middle Level¶
Table of Contents¶
- Introduction
- Anatomy of `rate.Limiter`: Lazy Fill in Detail
- `Allow` vs `Wait` vs `Reserve` — When Each Belongs
- Per-Tenant Limiters: Map, Eviction, Lifetimes
- HTTP Middleware in Practice
- Leaky Bucket as a Real Channel Queue
- Sliding-Window Counter — the Practical Compromise
- Channel-Based vs `x/time/rate` — Honest Benchmarks
- Testing with `synctest` and Fake Clocks
- Observability — Metrics, Logs, Alerts
- Anti-Patterns
- Cheat Sheet
- Summary
Introduction¶
Junior gave you the vocabulary and the two basic shapes — channel-and-ticker, and rate.Limiter. Middle level digs into the production realities:
- How `rate.Limiter`'s lazy fill is actually implemented (it matters when you have to debug it).
- The right tool: `Allow` vs `Wait` vs `Reserve` are not interchangeable.
- Per-tenant limiter maps that do not leak memory.
- HTTP middleware that handles client identity, `Retry-After`, and observability.
- The sliding-window-counter approximation — the algorithm most production limiters actually use.
- Realistic benchmarks. Channels are not always slower; `rate.Limiter` is not always faster.
By the end you should be comfortable choosing the right algorithm, deploying a limiter to production, and explaining its observable behaviour from metrics.
Anatomy of rate.Limiter: Lazy Fill in Detail¶
rate.Limiter is not a goroutine and not a ticker. It is a small struct with a mutex:
// Conceptual; the real fields are unexported.
type Limiter struct {
mu sync.Mutex
limit Limit // rate (events per second)
burst int // bucket capacity
tokens float64 // current tokens (fractional)
last time.Time // when we last computed
}
On each call to Allow/Wait/Reserve:
elapsed = now - last
tokens = min(burst, tokens + elapsed * limit)
last = now
if tokens >= 1 {
tokens--; return true (or schedule wait)
} else {
return false (or compute delay)
}
This is "lazy fill": the bucket is recomputed only when someone looks at it. There is no background goroutine, no ticker, no garbage collection pressure. The cost of a check is a mutex acquire and a handful of floating-point ops — single-digit nanoseconds.
Why fractional tokens?¶
If the rate is 100/s and the elapsed time is 7 ms, the refill is 0.7 tokens. Storing as a float lets the limiter pace exactly. Integer tokens would round and accumulate error.
What TokensAt lets you do¶
The `TokensAt(t time.Time)` method computes the bucket state at any given time, which is how `Reserve` predicts delays: "given the current tokens and rate, when will I have one?"
Lazy fill implies: you cannot "drip slowly"¶
If the rate is 1/s and you call Allow() exactly once per second, the bucket has at most 1 token, never grows. If you go idle for 60 s the bucket caps at burst, not at 60 × rate. The cap is the only memory the limiter has of its history.
Allow vs Wait vs Reserve — When Each Belongs¶
These three methods look interchangeable in the docs. They are not.
Allow() — non-blocking yes/no¶
if !lim.Allow() {
http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
metrics.IncDeny()
return
}
Use when:
- You want immediate rejection (HTTP 429, drop event from stream).
- The caller's latency budget cannot tolerate a wait.
- You have a fallback (queue, cache, deferred processing).
Pitfall: if you call Allow in a tight loop without backoff, you spin.
Wait(ctx) — blocking with cancellation¶
Use when:
- You can afford to pace (background worker, polite client).
- The caller's context already has the right timeout.
- The work is idempotent or one-shot — no value in failing early.
Pitfall: under heavy load, every caller schedules its own timer. Many callers + tight rate = many goroutines parked in timers. Cap concurrent waiters with a semaphore.
Reserve() — manual control¶
r := lim.Reserve()
if !r.OK() {
return errors.New("rate limit would be exceeded by more than burst")
}
delay := r.Delay()
if delay > maxAcceptableDelay {
r.Cancel() // give the token back
return ErrTooSlow
}
time.Sleep(delay)
doWork()
Use when:
- You want to introspect the wait time and decide.
- You want to cancel a reservation — `Allow` and `Wait` cannot do that.
- You need to log the wait or surface it as a metric.
Pitfall: Cancel() only refunds the token if it has not yet been "used." If you sleep and then cancel, the token is already gone.
Decision table¶
| Need | Method |
|---|---|
| Reject immediately on burst | Allow |
| Pace politely, respect ctx | Wait |
| Decide based on predicted delay | Reserve |
| Refund a token if path bails out | Reserve + Cancel |
| Single-shot deadline-aware op | Wait |
Per-Tenant Limiters: Map, Eviction, Lifetimes¶
A naive per-tenant map:
var (
mu sync.Mutex
limiters = map[string]*rate.Limiter{}
)
func limiterFor(key string) *rate.Limiter {
mu.Lock()
defer mu.Unlock()
l, ok := limiters[key]
if !ok {
l = rate.NewLimiter(rate.Limit(10), 5)
limiters[key] = l
}
return l
}
This works for a fixed set of tenants. For unbounded keys (IPs, API tokens), the map grows forever. Eviction is the missing piece.
Pattern 1: TTL on last-use¶
Wrap the limiter with a lastAccess timestamp; sweep periodically.
type entry struct {
lim *rate.Limiter
seen atomic.Int64 // unix nano
}
type LimiterMap struct {
mu sync.Mutex
m map[string]*entry
r rate.Limit
b int
}
func (lm *LimiterMap) Get(key string) *rate.Limiter {
lm.mu.Lock()
defer lm.mu.Unlock()
e, ok := lm.m[key]
if !ok {
e = &entry{lim: rate.NewLimiter(lm.r, lm.b)}
lm.m[key] = e
}
e.seen.Store(time.Now().UnixNano())
return e.lim
}
func (lm *LimiterMap) Sweep(maxIdle time.Duration) {
cutoff := time.Now().Add(-maxIdle).UnixNano()
lm.mu.Lock()
defer lm.mu.Unlock()
for k, e := range lm.m {
if e.seen.Load() < cutoff {
delete(lm.m, k)
}
}
}
Run Sweep from a goroutine every minute. Note that an evicted-then-returning client starts over with a fresh, full bucket; pick maxIdle long enough that this occasional extra burst is acceptable — it usually is.
Pattern 2: LRU with bounded size¶
Use hashicorp/golang-lru/v2:
cache, _ := lru.New[string, *rate.Limiter](100_000)
get := func(key string) *rate.Limiter {
if l, ok := cache.Get(key); ok {
return l
}
l := rate.NewLimiter(r, b)
cache.Add(key, l)
return l
}
Memory is bounded by the cache size. A returning evicted client gets a fresh bucket — slightly more lenient than a strict TTL, but usually fine.
Pattern 3: Sharded maps for hot paths¶
A single mutex becomes a bottleneck at very high QPS. Shard by hash of the key:
const shards = 64
type ShardedMap struct {
	shards [shards]struct {
		mu sync.Mutex
		m  map[string]*rate.Limiter
	}
}
// NewShardedMap initialises the inner maps; without this,
// the first write in Get would panic on a nil map.
func NewShardedMap() *ShardedMap {
	s := &ShardedMap{}
	for i := range s.shards {
		s.shards[i].m = map[string]*rate.Limiter{}
	}
	return s
}
func (s *ShardedMap) Get(key string) *rate.Limiter {
h := fnv.New32a()
h.Write([]byte(key))
idx := h.Sum32() % shards
sh := &s.shards[idx]
sh.mu.Lock()
defer sh.mu.Unlock()
l, ok := sh.m[key]
if !ok {
l = rate.NewLimiter(rate.Limit(10), 5)
sh.m[key] = l
}
return l
}
64 shards × O(1) lookup = much less contention than a single global map.
HTTP Middleware in Practice¶
A production rate-limit middleware does more than reject:
func RateLimit(lm *LimiterMap, keyFn func(*http.Request) string) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
key := keyFn(r)
lim := lm.Get(key)
res := lim.Reserve()
if !res.OK() {
w.Header().Set("Retry-After", "1")
http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
metrics.Deny.WithLabelValues(key).Inc()
return
}
delay := res.Delay()
if delay > 0 {
// Caller would have to wait; treat as a soft reject.
res.Cancel()
w.Header().Set("Retry-After", strconv.Itoa(int(delay.Seconds())+1))
http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
metrics.Deny.WithLabelValues(key).Inc()
return
}
metrics.Allow.WithLabelValues(key).Inc()
next.ServeHTTP(w, r)
})
}
}
Highlights:
- The key function is injected — IP, token, user ID, whatever your auth tells you.
- The `Retry-After` header tells well-behaved clients when to come back.
- Metric labels include the key so you can find the noisy tenant.
- `Reserve` + `Cancel` lets us treat "would have to wait" as a reject — the handler never blocks.
Choosing the key¶
| Key | Use when | Gotcha |
|---|---|---|
| Source IP | Anonymous traffic | NAT collapses many users into one IP |
| `X-Forwarded-For` (first) | Behind a trusted proxy | Spoofable if proxy not trusted |
| API token | Authenticated | Cardinality risk if tokens are short-lived |
| User ID | Per-user fairness | Requires auth |
| Route + key | Per-endpoint limits | Map cardinality explodes |
httprate and ulule/limiter¶
Two community middlewares worth knowing:
- `go-chi/httprate` — minimal, integrates with `chi`. Has `Limit`, `LimitByIP`, `LimitByRealIP`. Great for small services.
- `ulule/limiter` — supports multiple stores (memory, Redis, BoltDB). Use when you need distributed limiting in HTTP-middleware shape.
Both implement the same shape as the example above. Pick one rather than rolling your own for HTTP; roll your own for non-HTTP paths.
Leaky Bucket as a Real Channel Queue¶
Token bucket allows bursts. Sometimes you want the opposite: smooth out a bursty input into a steady stream. That is leaky bucket, and it maps cleanly onto a channel queue.
type LeakyBucket struct {
queue chan func()
rate time.Duration
quit chan struct{}
}
func NewLeakyBucket(capacity int, ratePerSecond int) *LeakyBucket {
lb := &LeakyBucket{
queue: make(chan func(), capacity),
rate: time.Second / time.Duration(ratePerSecond),
quit: make(chan struct{}),
}
go lb.run()
return lb
}
func (lb *LeakyBucket) run() {
t := time.NewTicker(lb.rate)
defer t.Stop()
for {
select {
case <-t.C:
select {
case fn := <-lb.queue:
fn() // drain one request
default: // nothing to drain
}
case <-lb.quit:
return
}
}
}
var ErrOverflow = errors.New("leaky bucket full")
func (lb *LeakyBucket) Submit(fn func()) error {
select {
case lb.queue <- fn:
return nil
default:
return ErrOverflow
}
}
- Capacity = how many requests can wait in the bucket.
- Rate = how fast the bucket drains.
- Overflow = `Submit` returns an error if the queue is full.
The output rate never exceeds ratePerSecond, regardless of the input pattern. This is what you want when you are protecting a downstream that cannot absorb bursts (e.g., a third-party API with strict pacing requirements).
The trade-off vs token bucket: latency. A burst of 100 requests at 10/s settles in 10 s; the last request waits 10 s. Token bucket would let them all through in <1 s but then refuse the next ones.
Sliding-Window Counter — the Practical Compromise¶
Sliding-window log is exact but stores every timestamp (O(limit) per key). Fixed window is cheap but boundary-sloppy. The compromise — sliding-window counter — keeps just two integers per key and interpolates.
type SlidingCounter struct {
mu sync.Mutex
window time.Duration
limit int
prevCount int
currCount int
currStart time.Time
}
func (sc *SlidingCounter) Allow() bool {
sc.mu.Lock()
defer sc.mu.Unlock()
now := time.Now()
elapsed := now.Sub(sc.currStart)
if elapsed >= sc.window {
// Roll the window.
if elapsed >= 2*sc.window {
sc.prevCount = 0
} else {
sc.prevCount = sc.currCount
}
sc.currCount = 0
sc.currStart = now.Truncate(sc.window)
elapsed = now.Sub(sc.currStart)
}
// Weighted count: fraction of prev window still in scope + curr.
weight := float64(sc.window-elapsed) / float64(sc.window)
estimate := float64(sc.prevCount)*weight + float64(sc.currCount)
if estimate >= float64(sc.limit) {
return false
}
sc.currCount++
return true
}
The estimate is prev × (1 − fraction_of_curr_window_used) + curr. At the boundary (elapsed=0) it equals prev; at the next boundary (elapsed=window) it equals curr. The estimate is smooth; the error vs exact sliding-window-log is typically <0.3%.
This is the algorithm Cloudflare popularised. Used by redis_rate, ulule/limiter, and most production rate-limiters at scale.
Memory vs accuracy trade-offs¶
| Algorithm | State per key | Accuracy | Best for |
|---|---|---|---|
| Fixed window | 1 int | Boundary doubling | Cheap dashboards |
| Sliding-window log | up to limit timestamps | Exact | Low-traffic, audit needs |
| Sliding-window counter | 2 ints + 1 timestamp | Within ~0.3% | High-traffic production |
| Token bucket | 1 float + 1 timestamp | Exact for token model | General case |
| Leaky bucket | queue (length capacity) | Exact pacing | Strict shaping |
Channel-Based vs x/time/rate — Honest Benchmarks¶
Run on Apple M2, Go 1.22, single goroutine, no contention:
BenchmarkChanLimiter-8 12000000 105 ns/op 0 B/op 0 allocs/op
BenchmarkRateLimiter-8 38000000 31 ns/op 0 B/op 0 allocs/op
Under 8 concurrent goroutines the ordering is the same — `rate.Limiter` still wins on both throughput and allocations. Why?
- No channel-op overhead (no goroutine, no signalling).
- No background ticker (the channel limiter pays for the ticker even when idle).
- Lazy fill is cheap: a mutex, a few floats.
When can channel limiters be faster?
- When the limiter is shared across goroutines and the ticker happens to align with consumption — but this is fragile.
- When you actually need queue semantics (leaky bucket), where comparing to `rate.Limiter` is apples-to-oranges.
Default: rate.Limiter. Reach for channels only when you need leaky-bucket queueing.
Testing with synctest and Fake Clocks¶
Real-time tests are flaky. Three strategies, from best to worst:
Strategy 1: testing/synctest (Go 1.24+)¶
//go:build go1.24
func TestLimiterBurst(t *testing.T) {
synctest.Run(func() {
lim := rate.NewLimiter(rate.Every(100*time.Millisecond), 3)
ctx := context.Background()
start := time.Now()
for i := 0; i < 6; i++ {
_ = lim.Wait(ctx)
}
elapsed := time.Since(start)
// 3 burst (free) + 3 paced at 100 ms each.
if elapsed != 300*time.Millisecond {
t.Errorf("want 300 ms, got %v", elapsed)
}
})
}
Virtual time advances whenever all goroutines in the bubble are blocked. The test runs in microseconds even though the virtual elapsed time is 300 ms. (In Go 1.24 `testing/synctest` is experimental and gated behind `GOEXPERIMENT=synctest`; Go 1.25 stabilises it as `synctest.Test`.)
Strategy 2: clock injection¶
rate.Limiter doesn't expose a clock parameter, but for your own limiters:
type Clock interface{ Now() time.Time }
type realClock struct{}
func (realClock) Now() time.Time { return time.Now() }
type fakeClock struct{ t time.Time }
func (f *fakeClock) Now() time.Time { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }
Pass the clock to your limiter constructor. In tests, advance it manually.
Strategy 3: high tolerance¶
If you must test against time.Now(), allow a healthy fudge:
if elapsed < 400*time.Millisecond || elapsed > 600*time.Millisecond {
t.Errorf("want ~500 ms, got %v", elapsed)
}
This is the worst option — flakes are still possible on CI machines under load.
Observability — Metrics, Logs, Alerts¶
A rate limiter without metrics is a black box. Surface these:
var (
allow = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "ratelimit_allow_total",
Help: "Allowed requests by key.",
}, []string{"key", "route"})
deny = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "ratelimit_deny_total",
Help: "Rejected requests by key.",
}, []string{"key", "route"})
waitSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "ratelimit_wait_seconds",
Help: "Time spent waiting on the limiter, by key.",
Buckets: []float64{.001, .01, .1, 1, 10},
}, []string{"key"})
)
)
Metrics to track¶
- Allow / deny counters — total and per key/route.
- Wait time histogram — for `Wait` users, the distribution of delay.
- Limiter cardinality — the number of unique keys held in the map. Alert if it grows without bound.
- Sweep duration / evictions — for TTL maps.
Alerts to set¶
- `rate(ratelimit_deny_total[5m]) > rate(ratelimit_allow_total[5m]) * 0.05` for more than 10 min: over 5% rejection sustained.
- Cardinality above the expected ceiling: the map is leaking.
- Wait p99 far above the expected pacing interval: the limiter is over-saturated.
Log on rejection — sparingly¶
if !lim.Allow() {
if rand.Intn(100) == 0 { // sample 1%
log.Warn("rate limit rejected", "key", key, "route", route)
}
return reject
}
A full log line per rejection during an attack will flood your logs.
Anti-Patterns¶
- Allocating a limiter per request. A fresh bucket every call. Useless.
- Sharing a single limiter across unrelated tenants. One tenant's spike rejects everyone.
- Using `time.Tick` instead of `time.NewTicker`. The ticker can never be stopped — a silent leak.
- Forgetting to evict idle limiters. Unbounded memory growth.
- `Wait` with an unbounded queue of callers. A backlog of millions of goroutines parked in timers never resolves.
- Rejecting silently. No `Retry-After`, no `429`, no metrics — clients spin forever.
- Hand-rolled sliding-window log at scale. O(limit) per key per request adds up.
- Mixing `Allow` and `Wait` for the same handler depending on conditions. Operator confusion; inconsistent metrics.
- Limiter at the wrong layer. A per-process limiter behind 10 replicas = 10× the intended rate. Use distributed limiting.
Cheat Sheet¶
// Choose method by intent.
lim.Allow() // drop on overflow
lim.Wait(ctx) // pace politely
lim.Reserve() // introspect + cancel
// Per-tenant with TTL eviction.
type entry struct { lim *rate.Limiter; seen atomic.Int64 }
sweep := func() { /* delete stale */ }
// HTTP middleware uses Reserve + Cancel + Retry-After.
res := lim.Reserve()
if !res.OK() || res.Delay() > 0 { res.Cancel(); reject() }
// Leaky bucket = channel queue + ticker drain.
queue := make(chan func(), cap)
// Sliding-window counter:
// estimate = prev * (1 - elapsed/window) + curr
// Metrics: allow, deny, wait, cardinality.
// Test with synctest or a clock interface, not time.Sleep.
Summary¶
rate.Limiter uses lazy fill: a mutex, a float, a timestamp. That is why it is fast and goroutine-free. Allow, Wait, and Reserve are the three methods; pick by whether the caller can drop, wait, or wants control.
Per-tenant limiters need eviction — TTL, LRU, or sharded maps. HTTP middleware should set Retry-After, label metrics by key, and use Reserve to avoid blocking handlers. Leaky bucket maps onto a channel queue when you need strict pacing. Sliding-window counter is the practical compromise between accuracy and memory, used by most production systems.
Tests should use testing/synctest or an injected clock. Real-time tests flake. Observability is non-negotiable: counters, histograms, and a cardinality alert.
Done well, a rate limiter is invisible — until you turn it off and watch the dependency burn.