Rate Limiter — Professional Level¶
Table of Contents¶
- Introduction
- The Theoretical Frame: Stochastic Arrivals and Queueing
- GCRA — Generic Cell Rate Algorithm
- Adaptive Limiters and Load-Aware Throttling
- Limiter Integration with the Scheduler
- Anti-Patterns at Scale
- Choosing Between Implementations
- Summary
Introduction¶
This level is for engineers designing rate-limiting systems — not just integrating one. You will think about:
- The queueing-theoretic basis. Rate limiting is a special case of admission control.
- GCRA — the algorithm behind redis_rate and most production limiters at scale.
- Adaptive limiters that respond to downstream health, not just a fixed rate.
- How a limiter interacts with the Go scheduler, the OS, and your service's tail latency.
- When a limiter is the wrong answer and a circuit breaker, a queue, or pre-emption fits better.
If you are choosing what to deploy at a billion-request-per-day service, this is where the calculus is.
The Theoretical Frame: Stochastic Arrivals and Queueing¶
Rate limiting is admission control viewed through queueing theory. Model your service as an M/M/1 queue:
- Arrivals at rate λ (Poisson).
- Service rate μ.
- Utilisation ρ = λ/μ.
When ρ → 1, expected wait time E[W] → ∞. Rate limiting bounds λ to keep ρ < 1, preventing queue blow-up.
Implications¶
- The limit must be below downstream capacity. Set rate ≈ 70% × μ. The 30% headroom absorbs jitter, retries, and partial failures.
- Burst budget = jitter tolerance. A burst of size B means you can absorb B / λ seconds of jitter without rejection. Size it from observed arrival variance.
- Token bucket is the dual of leaky bucket. Both enforce the same average rate; they differ in how they handle the queue. Token bucket "lends" capacity (burst absorbs at arrival); leaky bucket "schedules" capacity (burst absorbs at departure).
- Sliding-window counter is an approximation of a fluid model. The "fluid" leaks at rate λ continuously; the counter discretises this into a two-window estimate. The approximation error is bounded by the variance of arrivals within a single window — typically < 1% for well-mixed traffic.
Little's Law in practice¶
L = λ × W. If you cap arrival rate λ at 100/s and average wait is 50 ms, the average number of concurrent waiters is L = 100 × 0.05 = 5. This is the cardinality of goroutines parked in Wait(). Use it to size your concurrency budget and to predict the memory cost of Wait.
If W grows unboundedly, L grows unboundedly — your limiter is a memory leak. The fix: cap W with a context timeout, or use Allow instead of Wait.
GCRA — Generic Cell Rate Algorithm¶
GCRA was designed for cell shaping in ATM (Asynchronous Transfer Mode) networks. It is mathematically equivalent to a leaky bucket, but stores one timestamp per key instead of a level counter.
The state¶
One field per key: the Theoretical Arrival Time (TAT) — the earliest instant at which the next request conforms.
The check¶
On an arrival at time now: clamp TAT up to now; reject if TAT − now exceeds the burst tolerance τ; otherwise admit and advance TAT by the emission interval T.
Three lines. One field. That is the entire algorithm.
Why it scales¶
- One field in Redis — minimal storage.
- One Lua call — atomic update.
- No counter loops — does not iterate keys or timestamps.
- No boundary problem — TAT is continuous.
redis_rate uses GCRA. Stripe's rate limiter, Cloudflare's, and many others do too.
GCRA in Go (in-memory)¶
```go
type GCRA struct {
	mu    sync.Mutex
	tat   time.Time     // theoretical arrival time
	rate  time.Duration // T: emission interval
	burst time.Duration // tau: burst tolerance
}

func NewGCRA(eventsPerSec int, burst int) *GCRA {
	T := time.Second / time.Duration(eventsPerSec)
	return &GCRA{
		rate: T,
		// tau = (burst-1)*T: a burst of exactly `burst` events may
		// arrive back-to-back, matching rate.Limiter's burst semantics.
		burst: T * time.Duration(burst-1),
	}
}

func (g *GCRA) Allow() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	now := time.Now()
	if g.tat.Before(now) {
		g.tat = now
	}
	if g.tat.Sub(now) > g.burst {
		return false
	}
	g.tat = g.tat.Add(g.rate)
	return true
}
```
rate.Limiter and this GCRA are observably equivalent for Allow-style checks. rate.Limiter also supports Wait, Reserve, and weighted reservations — GCRA doesn't natively.
Adaptive Limiters and Load-Aware Throttling¶
A fixed-rate limiter sets the rate at design time. But downstream capacity is rarely constant — it depends on time of day, on what other services are doing, on garbage collection pauses, on hardware variability.
AIMD — Additive Increase, Multiplicative Decrease¶
Borrowed from TCP congestion control:
```go
type AIMD struct {
	mu      sync.Mutex
	rate    float64
	minRate float64
	maxRate float64
}

// Called on success.
func (a *AIMD) OnSuccess() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.rate = math.Min(a.maxRate, a.rate+1.0)
}

// Called on a downstream failure (timeout, 5xx, etc).
func (a *AIMD) OnFailure() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.rate = math.Max(a.minRate, a.rate*0.5)
}
```
Each success bumps the rate by +1; each failure halves it. The system oscillates around a rate at which a small, steady fraction of requests fail. TCP uses this; so do many service meshes.
Gradient-based limiters (Netflix Concurrency Limits)¶
Sample latency. Compute the gradient of latency over time. If latency is rising, decrease the limit; if falling, increase. Library: netflix/concurrency-limits (Java; Go ports exist).
This is a concurrency limiter, not strictly a rate limiter, but the same math applies. The key insight: tail latency is the best signal of saturation. CPU and memory are lagging indicators.
Why adaptive?¶
A fixed limiter is brittle: too low in good times (lost throughput); too high in bad times (cascade failures). Adaptive limiters self-tune. The cost is complexity and a risk of oscillation.
When to adapt:
- High-volume services where capacity headroom matters.
- Polyglot dependencies where capacity is hard to know.
- Services with significant traffic-pattern variance.

When not to adapt:
- Outbound clients to a third party (their published rate is the contract).
- Cost protection (you want a fixed ceiling).
- Compliance/audit paths where deterministic behaviour matters.
Limiter Integration with the Scheduler¶
rate.Limiter.Wait parks a goroutine on a timer. At scale this matters.
Timer fan-out¶
Each Wait call that blocks creates a Go time.Timer. The runtime maintains timers in a per-P heap. At 50,000 concurrent waiters across 8 P's, that is 6,250 timers per heap — insertions and pop-mins cost O(log n). Not free.
Profile this¶
Heavy time spent in the runtime's timer maintenance (timerproc in older Go releases; the per-P timer code since Go 1.14) means your waiters are clogging the scheduler. Solutions:
- Use Allow instead of Wait on the hot path.
- Cap the number of concurrent waiters with a semaphore in front of the limiter.
- Move the limiter to an edge (e.g., the load balancer) so the application never blocks.
Tail latency¶
Wait adds bounded delay, but bounded != predictable: under sudden load, each admitted request waits roughly one emission interval (the reciprocal of the configured rate) per waiter ahead of it. If your service's p99 budget is 100 ms and your limiter is configured at 100 ev/s with burst 1, expect ~10 ms of wait per request in steady state — and a request queued behind ten waiters burns the entire 100 ms budget at saturation.
Always profile p99 with the limiter on a synthetic load. A perfectly-tuned median is meaningless if p99 doubles.
Goroutine count¶
Each blocked Waiter is a goroutine. At 10,000 sustained Waiters, that is 10,000 goroutines × ~8 KB = 80 MB. Manageable, but the GC pressure of scanning their stacks is real on a hot path.
Anti-Patterns at Scale¶
- Local-only limiter at fleet scale. N replicas × R rate each = N×R total. You meant R. Use Redis or coordinate.
- Redis hop on every request without batching. Pipeline reads. Batch checks for low-priority traffic.
- Single global Redis key for everything. All replicas hit one shard. Throughput ceiling = single-shard throughput. Shard by tenant.
- Adaptive limiter without an upper bound. Bug in the success signal? Rate climbs to infinity. Always cap maxRate.
- Rate limit on the LB and in the app and in the client. Three limiters compounding silently. Pick one layer for each dimension.
- Wait in a fan-out path. 1 user request × 100 fan-out = 100 limiter waits. Sequential = 100× tail latency. Parallel = thundering herd at the limiter. Solution: rate-limit at the fan-in, not the fan-out.
- Burst sized to "feel generous". Burst should be measured: capture arrival variance and set burst to the 99th percentile of observed spikes.
- Limiting the wrong dimension. Per-IP behind a corporate NAT punishes 10,000 users in one office. Per-token doesn't help anonymous traffic. Choose the dimension that maps to the abuse you fear.
- No Retry-After. Clients retry instantly, multiplying your reject rate by 10×.
- Rejection slower than success. Attackers prefer rejected requests if they tie up your CPU. Always make reject paths cheap.
Choosing Between Implementations¶
| Need | Tool |
|---|---|
| In-process, simple, single instance | rate.Limiter |
| In-process leaky-bucket (FIFO queueing) | Channel + ticker |
| HTTP middleware, single instance | chi/httprate or roll-your-own around rate.Limiter |
| Multi-tenant HTTP | ulule/limiter or hierarchical custom |
| Distributed counters | redis_rate (GCRA) |
| Strict pacing across fleet | Redis Lua leaky bucket |
| Adaptive to downstream | AIMD or Netflix concurrency-limits port |
| Per-method gRPC limiting | Interceptor with per-method rate.Limiter map |
| Outbound API client | rate.Limiter embedded in client; Wait per call |
| Cost guard (no overage allowed) | Distributed quota counter + circuit breaker |
When not to use a rate limiter¶
- Bounded queue suffices. A buffered channel of size N is admission control by another name. Use it if the consumer naturally paces itself.
- Concurrency cap suffices. semaphore.Weighted for "at most 50 in flight" — cheaper and simpler.
- Circuit breaker is the right answer. If your dependency fails, stopping calls entirely (breaker open) is better than slowing them.
- The downstream has its own limiter. Respect Retry-After; don't double-limit.
- The request pattern is naturally bounded. A worker pool of 10 reading from a queue cannot exceed the queue's drain rate.
Summary¶
Rate limiting is admission control. Choose rate.Limiter (lazy-fill token bucket) for in-process, redis_rate (GCRA) for distributed, channel-and-ticker leaky bucket when you need FIFO pacing, and adaptive (AIMD or gradient-based) when downstream capacity varies.
GCRA — one timestamp, three lines, no boundary problem — is the algorithm production systems converge on. Understand it; you will see it in redis_rate, in service meshes, and in API gateways.
Always profile under load. Wait interacts with the Go scheduler's timer heap; tail latency suffers under saturation. Layer limiters cheap-first. Plan for limiter failure: open, closed, local-fallback, or circuit-break. Pick the failure mode before the outage.
The hardest part is not the code — it is choosing where to enforce, how to choose the rate, and how to fail safely. Code is twenty lines; design is the entire problem.