
Rate Limiter — Optimize

A walkthrough of common performance issues in rate limiter code, with profiling tools, benchmark numbers, and rewrites. Every section pairs a slow version with a fast one and explains the gap.


1. Channel-based limiter vs rate.Limiter

Slow

type ChanLim struct {
    tokens chan struct{}
}

func NewChanLim(r, b int) *ChanLim {
    cl := &ChanLim{tokens: make(chan struct{}, b)}
    for i := 0; i < b; i++ {
        cl.tokens <- struct{}{}
    }
    go func() {
        t := time.NewTicker(time.Second / time.Duration(r))
        defer t.Stop()
        for range t.C {
            select {
            case cl.tokens <- struct{}{}:
            default:
            }
        }
    }()
    return cl
}

func (c *ChanLim) Allow() bool {
    select {
    case <-c.tokens:
        return true
    default:
        return false
    }
}

Fast

lim := rate.NewLimiter(rate.Limit(r), b)
lim.Allow()

Benchmark (Apple M2, Go 1.22, single goroutine)

BenchmarkChanLim-8     12000000   105 ns/op   0 B/op   0 allocs/op
BenchmarkRateLimiter-8 38000000    31 ns/op   0 B/op   0 allocs/op

Why

  • ChanLim spawns a background goroutine and a ticker (memory + scheduler overhead).
  • Each Allow does a non-blocking receive on a channel; channel ops cost roughly 50-100 ns.
  • The background ticker fires whether or not anyone is consuming.

rate.Limiter uses lazy fill: a mutex acquire and a few float ops. The constant factor is 3× to 4× better, plus no background goroutine.
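
For intuition, lazy fill amounts to roughly this (a simplified sketch of the idea, not the actual x/time/rate implementation):

type lazyBucket struct {
    mu     sync.Mutex
    tokens float64   // current token count
    last   time.Time // when tokens was last updated
    rate   float64   // tokens added per second
    burst  float64   // maximum tokens
}

func (b *lazyBucket) Allow() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    now := time.Now()
    // Refill from elapsed time since the previous call; no ticker needed.
    b.tokens = math.Min(b.burst, b.tokens+now.Sub(b.last).Seconds()*b.rate)
    b.last = now
    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}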

When the channel version is right

Only when you actually need the queueing/shaping semantics of a leaky bucket. For pure token-bucket, rate.Limiter wins.


2. Per-key map: mutex contention

Slow

var (
    mu sync.Mutex
    m  = map[string]*rate.Limiter{}
)

func get(key string) *rate.Limiter {
    mu.Lock()
    defer mu.Unlock()
    l, ok := m[key]
    if !ok {
        l = rate.NewLimiter(100, 50)
        m[key] = l
    }
    return l
}

At 50,000 req/s, every lookup serializes on one global mutex. Under contention, p99 latency on get spikes into the milliseconds.

Fast: sharded map

const shards = 64

type ShardedLimMap struct {
    shards [shards]struct {
        mu sync.Mutex
        m  map[string]*rate.Limiter
    }
}

func NewShardedLimMap() *ShardedLimMap {
    s := &ShardedLimMap{}
    for i := range s.shards {
        s.shards[i].m = make(map[string]*rate.Limiter)
    }
    return s
}

func (s *ShardedLimMap) Get(key string) *rate.Limiter {
    h := fnv.New32a()
    h.Write([]byte(key))
    sh := &s.shards[h.Sum32()%shards]
    sh.mu.Lock()
    defer sh.mu.Unlock()
    l, ok := sh.m[key]
    if !ok {
        l = rate.NewLimiter(100, 50)
        sh.m[key] = l
    }
    return l
}

With uniformly distributed keys, 64 shards cut contention by roughly 64×. A sync.Map can do even better for read-mostly workloads (sketched below), but eviction is awkward: Range gives no consistent snapshot, and you still need per-key last-seen bookkeeping.

Benchmark (8 goroutines, 1 M keys, lookup-heavy)

BenchmarkSingleMap-8    3000000   415 ns/op
BenchmarkShardedMap-8  18000000    62 ns/op
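
A sync.Map sketch for the read-mostly case (same 100/s rate and burst of 50 assumed):

var limiters sync.Map // map[string]*rate.Limiter

func getLimiter(key string) *rate.Limiter {
    // Fast path: lock-free read for keys that already exist.
    if l, ok := limiters.Load(key); ok {
        return l.(*rate.Limiter)
    }
    // Slow path: LoadOrStore may build a limiter that loses the race
    // and is discarded; that waste is cheap and keeps readers lock-free.
    l, _ := limiters.LoadOrStore(key, rate.NewLimiter(100, 50))
    return l.(*rate.Limiter)
}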

3. Eviction sweep: iteration cost

Slow

func sweep(m map[string]*entry, cutoff int64) {
    for k, e := range m {
        if e.seen.Load() < cutoff {
            delete(m, k)
        }
    }
}

For 10 M entries: ~500 ms holding the mutex. All requests block for half a second.

Fast: chunked sweep

func sweep(m map[string]*entry, cutoff int64, batchSize int) {
    var batch []string
    for k, e := range m {
        if e.seen.Load() < cutoff {
            batch = append(batch, k)
            if len(batch) >= batchSize {
                break
            }
        }
    }
    for _, k := range batch {
        delete(m, k)
    }
}

Run repeatedly with batchSize = 10,000. Each pass holds the mutex for ~5 ms. Total work is the same; latency tail is much better.
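
A driver for the chunked sweep might look like this (a sketch: it assumes the map is guarded by a mutex owned by the caller and that entry.seen holds UnixNano timestamps):

func startSweeper(mu *sync.Mutex, m map[string]*entry, ttl time.Duration) {
    go func() {
        t := time.NewTicker(100 * time.Millisecond)
        defer t.Stop()
        for range t.C {
            cutoff := time.Now().Add(-ttl).UnixNano()
            mu.Lock()
            sweep(m, cutoff, 10_000) // one short batch per tick
            mu.Unlock()
        }
    }()
}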

Better: LRU with O(1) eviction

hashicorp/golang-lru/v2 evicts the LRU item on insertion. No scan. O(1) per access.

cache, _ := lru.New[string, *rate.Limiter](100_000)
get := func(k string) *rate.Limiter {
    if l, ok := cache.Get(k); ok {
        return l
    }
    l := rate.NewLimiter(100, 50)
    cache.Add(k, l)
    return l
}

Memory is bounded. Eviction cost is amortized over inserts.


4. Redis hot key contention

Slow

res, _ := limiter.Allow(ctx, "global", redis_rate.PerSecond(1000))

Every request hits the same Redis key. The Redis shard handling that key tops out at ~100 K ops/s (one Redis instance). Beyond that, Redis becomes the bottleneck.

Fast: sharded keys

shardIdx := rand.Intn(10)
key := fmt.Sprintf("global:shard%d", shardIdx)
res, _ := limiter.Allow(ctx, key, redis_rate.PerSecond(100)) // 100/s × 10 shards = 1000/s

Each shard handles 1/N of the budget. Throughput scales with shard count. Loose enforcement: any single shard can be over-budget temporarily, but the average is correct.

Caveat

Sharding the limiter loosens correctness. A bursty client can pick a "fresh" shard repeatedly. For strict global limits, use a single key on a dedicated Redis cluster; for soft limits, sharding is fine.
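
Wrapped as a helper, the shard pick and the budget split stay in one place (a sketch; shardCount and the even split are assumptions):

const shardCount = 10

// allowGlobal spreads a global budget across shardCount Redis keys.
// Enforcement is approximate: each shard tracks 1/N of the budget independently.
func allowGlobal(ctx context.Context, limiter *redis_rate.Limiter, totalPerSec int) (*redis_rate.Result, error) {
    key := fmt.Sprintf("global:shard%d", rand.Intn(shardCount))
    return limiter.Allow(ctx, key, redis_rate.PerSecond(totalPerSec/shardCount))
}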


5. Lua script overhead

Slow

script := `
    local current = redis.call("INCR", KEYS[1])
    if current == 1 then
        redis.call("EXPIRE", KEYS[1], ARGV[2])
    end
    if current > tonumber(ARGV[1]) then
        return 0
    end
    return 1
`
rdb.Eval(ctx, script, []string{key}, limit, window)

Eval ships the script every time. Network bytes + Redis parsing on every call.

Fast: EvalSha with cached script

var script = redis.NewScript(`...same...`)

// First call: EVALSHA returns NOSCRIPT, so Run falls back to EVAL,
// which also loads the script into Redis's script cache.
// Subsequent calls: EVALSHA only.
script.Run(ctx, rdb, []string{key}, limit, window)

go-redis's NewScript handles EVALSHA with fallback to EVAL on NOSCRIPT. It saves shipping the script body (a couple hundred bytes here) on every call.

For high QPS, this matters; for moderate QPS, it doesn't.


6. Pipelining Redis calls

Slow

for _, key := range keys {
    res, _ := limiter.Allow(ctx, key, lim)
    handle(key, res)
}

N requests = N round-trips. At 1 ms RTT and 100 keys, 100 ms total.

Fast: pipeline

pipe := rdb.Pipeline()
results := make([]*redis.Cmd, len(keys))
for i, key := range keys {
    results[i] = pipe.Eval(ctx, script, []string{key}, limit, window)
}
_, err := pipe.Exec(ctx)

One round-trip for all 100 calls. 1 ms total.
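
Reading the results afterwards (a sketch; handle here is assumed to take the key and a boolean, and each queued Cmd carries its own error):

for i, cmd := range results {
    allowed, err := cmd.Int64()
    if err != nil {
        continue // decide per policy: fail open or fail closed
    }
    handle(keys[i], allowed == 1)
}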

Limitation: not all rate-limit libraries support pipelining out of the box. redis_rate v10 has AllowN and AllowAtMost for taking several tokens in one call, but that is per key; to check many keys in one round-trip with custom Lua, you must pipeline manually.


7. Wait vs Allow on hot paths

Slow

func process(ctx context.Context, items []Item) {
    var wg sync.WaitGroup
    for _, item := range items {
        item := item
        wg.Add(1)
        go func() {
            defer wg.Done()
            lim.Wait(ctx)
            do(item)
        }()
    }
    wg.Wait()
}

10,000 items × Wait each = 10,000 goroutines parked in timers. At rate 100/s, the tail goroutine waits 100 s.

Fast: sequential with Wait

func process(ctx context.Context, items []Item) {
    for _, item := range items {
        if err := lim.Wait(ctx); err != nil {
            return
        }
        do(item)
    }
}

One goroutine. Constant memory. Same total elapsed time.

Faster: bounded concurrency

var wg sync.WaitGroup
sem := make(chan struct{}, 10) // at most 10 in flight
for _, item := range items {
    sem <- struct{}{} // acquire a slot
    wg.Add(1)
    go func(item Item) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot
        if err := lim.Wait(ctx); err != nil {
            return
        }
        do(item)
    }(item)
}
wg.Wait()

Combines rate limit (overall pacing) with concurrency cap (max 10 in flight). Best of both.


8. Reserve instead of Allow for "would have to wait" rejection

Slow (with extra delay)

if !lim.Allow() {
    reject()
    return
}
do() // proceed

If the bucket has 0.9 tokens and the request needs 1, Allow says no. But if the caller could wait 50 ms cheaply, that would be better.

Fast: introspect with Reserve

r := lim.Reserve()
if !r.OK() {
    reject()
    return
}
d := r.Delay()
if d > 100*time.Millisecond {
    r.Cancel()
    reject()
    return
}
time.Sleep(d)
do()

Lets you accept short waits, reject long ones. More throughput, predictable tail.

Cost: more code, more thinking. Use only when the latency target justifies it.
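
If the pattern appears in more than one handler, it is worth wrapping once (a sketch; allowWithin is a hypothetical helper, and a production version would also honor ctx cancellation instead of a bare Sleep):

func allowWithin(lim *rate.Limiter, maxWait time.Duration) bool {
    r := lim.Reserve()
    if !r.OK() {
        return false // the request can never be satisfied (exceeds burst)
    }
    d := r.Delay()
    if d > maxWait {
        r.Cancel() // give the reserved token back
        return false
    }
    time.Sleep(d)
    return true
}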


9. Removing the per-call mutex

rate.Limiter takes a mutex on every call. At 10 M ops/s aggregate, that mutex shows up in profiles.

Faster: lock-free GCRA with atomic CAS

type FastGCRA struct {
    tat    atomic.Int64 // nanoseconds since epoch
    rate   int64        // ns per token (T)
    burst  int64        // tau in ns
}

func (g *FastGCRA) Allow() bool {
    for {
        now := time.Now().UnixNano()
        old := g.tat.Load()
        tat := old
        if tat < now {
            tat = now // bucket was idle; start counting from now
        }
        if tat-now > g.burst {
            return false // would exceed the burst allowance
        }
        newTat := tat + g.rate
        if g.tat.CompareAndSwap(old, newTat) {
            return true
        }
        // CAS failed: another goroutine updated tat; retry with fresh values.
    }
}

No mutex. Under low contention, faster than rate.Limiter. Under high contention, CAS retries can spin — sometimes worse than a mutex.

Profile both before choosing. The constant-factor savings is small (10-20 ns) compared to typical request processing time. Only worth it on extreme hot paths.
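
A contended micro-benchmark for that comparison might look like this (a sketch; pair it with BenchmarkAllow from section 12 and compare on your own hardware):

func BenchmarkGCRAContended(b *testing.B) {
    // ~1,000,000 tokens/s with a 1 ms burst allowance.
    g := &FastGCRA{rate: int64(time.Microsecond), burst: int64(time.Millisecond)}
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            g.Allow()
        }
    })
}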


10. Reduce allocations in error paths

Slow

func handler(w http.ResponseWriter, r *http.Request) {
    if !lim.Allow() {
        log.Printf("rate limited: ip=%s path=%s", r.RemoteAddr, r.URL.Path)
        http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
        return
    }
    do()
}

Under attack, log.Printf allocates a buffer per call. Logs flood. CPU is wasted on logging instead of legitimate traffic.

Fast: sample logging

var counter atomic.Int64
func handler(w http.ResponseWriter, r *http.Request) {
    if !lim.Allow() {
        if counter.Add(1)%1000 == 0 {
            log.Printf("rate limited: ip=%s path=%s (sampled 1/1000)", r.RemoteAddr, r.URL.Path)
        }
        http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
        return
    }
    do()
}

Reduces logging cost to 1/1000. Adjust the sampling rate by observed attack volume.
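
An alternative to fixed 1/N sampling is to cap log volume by time with a second, dedicated limiter (a sketch; one line per second with a burst of five is an arbitrary choice):

var logLim = rate.NewLimiter(1, 5) // at most ~1 rejected-request log line/s

func handler(w http.ResponseWriter, r *http.Request) {
    if !lim.Allow() {
        if logLim.Allow() {
            log.Printf("rate limited: ip=%s path=%s (log-limited)", r.RemoteAddr, r.URL.Path)
        }
        http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
        return
    }
    do()
}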

Even faster: precomputed response

var tooManyRequestsBody = []byte("Too Many Requests\n")

w.Header().Set("Content-Type", "text/plain")
w.Header().Set("Retry-After", "1")
w.WriteHeader(http.StatusTooManyRequests)
w.Write(tooManyRequestsBody)

http.Error formats a string and writes headers — small per-call cost, but it adds up at attack scale. A precomputed response is the cheapest possible reject path.


11. Profile-guided choices

Run a load test. Capture profiles:

go test -bench BenchmarkHandler -cpuprofile cpu.prof
go tool pprof cpu.prof

Look for:

  • Time in the runtime's timer machinery → too many Wait waiters → switch to Allow or cap concurrency.
  • Time in runtime.lock → mutex contention → shard the limiter map.
  • Time in redis.(*Client).Do → Redis is the bottleneck → pipeline or shard keys.
  • Time in log.Printf or json.Marshal on the reject path → cache and pre-format.

Allocations:

go test -bench BenchmarkHandler -memprofile mem.prof -benchmem
go tool pprof -alloc_space mem.prof

A clean rate-limit check should be 0 allocs/op (after warmup). If you see allocations, find the cause: a fmt.Sprintf building a key, formatting a time.Duration, a header write.
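
For the key-building case specifically: fmt.Sprintf allocates for the formatted string and for boxing its arguments, while plain concatenation allocates once for the result. A sketch (userID is assumed to already be a string):

// Allocates several times: Sprintf boxes its arguments and builds a new string.
key := fmt.Sprintf("user:%s:path:%s", userID, r.URL.Path)

// Cheaper: one allocation for the concatenated result.
key = "user:" + userID + ":path:" + r.URL.Path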


12. Benchmark suite to keep

Maintain a benchmark file for the limiter you deploy. Run on every change:

func BenchmarkAllow(b *testing.B) {
    lim := rate.NewLimiter(rate.Limit(1_000_000), 1000)
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            lim.Allow()
        }
    })
}

func BenchmarkPerKey(b *testing.B) {
    m := NewShardedLimMap()
    keys := []string{"a", "b", "c", "d"} // hot keys
    b.RunParallel(func(pb *testing.PB) {
        i := 0
        for pb.Next() {
            m.Get(keys[i%len(keys)]).Allow()
            i++
        }
    })
}

Compare across commits. Regressions become visible.
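
benchstat (golang.org/x/perf/cmd/benchstat) turns that comparison into statistics rather than eyeballing:

go test -bench . -count 10 > old.txt
# apply the change, then:
go test -bench . -count 10 > new.txt
benchstat old.txt new.txt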


Cheat Sheet

Pick rate.Limiter over channel-based unless you need leaky-bucket queueing.
Shard per-key maps once you exceed ~10K keys.
Use LRU eviction (hashicorp/golang-lru/v2) over manual TTL sweep.
Pipeline Redis calls for batch operations.
Reserve+Cancel for "short wait OK, long wait reject" semantics.
Sample logs on the reject path.
Precompute the reject response body.
Profile before optimizing. Allocate effort by data, not intuition.

Summary

The default rate.Limiter is fast: 31 ns/op, 0 allocations. Most rate-limit code can stay at this level. Optimization matters at high QPS with high key cardinality (per-key maps), at scale across the fleet (Redis), or under pathological reject load (attacks).

The right answer is almost always: profile first. A measured 0.5 ms in the runtime's timer code is a real signal; a hand-tuned lock-free GCRA on a non-hot path is premature. Trust rate.Limiter and redis_rate until your traces say otherwise.