Exponential Backoff — Specification
Table of Contents
- Overview
- Notation
- Schedule Formulas
- Jitter Variants
- Cap and Limit Semantics
- Context Integration
- Library API Reference
- Standard-Library Primitives
- Random Source Selection
- Timer Semantics
- Error Classification
- Retry-After Header
- gRPC Service Config Schema
- HTTP Status Code Retryability Table
- Defaults Reference
- Glossary
- Summary Table
Overview
This file is a reference for the precise formulas, APIs, and protocol details related to exponential backoff in Go. It is not a tutorial; it is a lookup table. Use it when you need a specific formula or API signature.
Notation
Throughout this document:
- n: attempt number (0-indexed unless noted).
- B: base delay.
- F: growth factor (multiplier); 2 unless noted.
- C: max-delay cap.
- M: max-attempts cap.
- T: total-elapsed-time cap (deadline).
- r: randomisation factor (symmetric randomisation only).
- U[a, b]: uniform random distribution on [a, b].
- delay(n): the wait after attempt n fails and before attempt n + 1.
- prev: the delay used in the previous retry (for decorrelated jitter).
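Schedule Formulas
Exponential (no jitter)
delay(n) = B * 2^n
For factor F:
delay(n) = B * F^n
Total elapsed (sum of delay(0) through delay(k-1), uncapped)
elapsed(k) = B * (2^k - 1)
For factor F: elapsed(k) = B * (F^k - 1) / (F - 1). Example: B = 100ms and k = 5 give 100 + 200 + 400 + 800 + 1600 = 3100ms. This counts sleep time only. The attempts themselves add to wall-clock time.
Truncated exponential (capped)
delay(n) = min(C, B * 2^n)
From attempt n = ceil(log2(C/B)) onward, the delay plateaus at C. For example, with B = 100ms and C = 5s the plateau starts at n = 6, because 100ms * 2^5 = 3.2s < 5s < 6.4s.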
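Jitter Variants
Full Jitter
delay(n) = U[0, min(C, B * 2^n)]
Expected value: min(C, B * 2^n) / 2.
Implementation in Go (B and C are time.Duration, n is an int, math/rand is imported):
```go
// ceil = min(C, B*2^n); comparing B with C>>n avoids overflowing B<<n.
ceil := C
if B <= C>>n { ceil = B << n }
var delay time.Duration
if ceil > 0 { delay = time.Duration(rand.Int63n(int64(ceil))) } // Int63n panics on n <= 0
```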
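Equal Jitter
temp = min(C, B * 2^n)
delay(n) = temp/2 + U[0, temp/2]
Equivalent: delay(n) = U[temp/2, temp].
Expected value: 3 * temp / 4.
```go
temp := C
if B <= C>>n { temp = B << n } // temp = min(C, B*2^n) without overflow
half := temp / 2
delay := half
if temp > 0 { delay += time.Duration(rand.Int63n(int64(temp - half))) } // Int63n panics on n <= 0
```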
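Decorrelated Jitter
delay(n) = U[B, min(C, 3 * prev)], with prev = B before the first retry.
Implementation requires stateful tracking of prev.
```go
upper := prev * 3
if upper > C || upper < 0 { upper = C } // clamp; < 0 catches overflow
delay := B
if span := upper - B; span > 0 { delay += time.Duration(rand.Int63n(int64(span))) } // Int63n panics on n <= 0
prev = delay
```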
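Symmetric Randomisation (cenkalti style)
delay(n) = interval(n) * U[1 - r, 1 + r]
Here r is the randomisation factor (default 0.5) and interval(n) = B * F^n.
For r = 0.5, delay(n) lies in [0.5 * interval(n), 1.5 * interval(n)].
In cenkalti/backoff, MaxInterval caps interval(n) before randomisation, so a randomised delay can exceed MaxInterval by up to a factor of 1 + r.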
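Cap and Limit Semantics
Max-delay cap (C)
The maximum value any single delay can take. In this document's formulas, the cap is applied after the exponential calculation and before jitter:
capped(n) = min(C, B * F^n)
delay(n) = jitter(capped(n))
With full or equal jitter, the result therefore never exceeds C. Symmetric randomisation, which can scale a capped interval up by a factor of 1 + r, can exceed it.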
Max-attempts cap (M)
The maximum number of attempts, including the first. After M failed attempts (the first plus M-1 retries), the operation gives up.
Max-elapsed-time cap (T)
The maximum cumulative wall-clock time, measured from the first attempt. Once T has elapsed, the operation gives up even if attempts remain.
In cenkalti/backoff, this is MaxElapsedTime (default 15 minutes).
Context deadline
The absolute deadline carried in a context.Context. It overrides all other caps: once the context is cancelled or its deadline passes, the operation gives up immediately.
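Context Integration
Cancellation check
Before each attempt, return ctx.Err() if it is non-nil (context.Canceled or context.DeadlineExceeded). During sleep, select on ctx.Done() alongside the timer, as sleepCtx below does.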
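Deadline check
Before sleeping, optionally clip the sleep to the time remaining:
```go
if deadline, ok := ctx.Deadline(); ok {
	if remaining := time.Until(deadline); remaining < requestedSleep {
		requestedSleep = remaining // the next attempt will have no time left; consider giving up instead
	}
}
```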
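Cancellable sleep
```go
// sleepCtx waits for d or until ctx is done, whichever comes first.
// It returns ctx.Err() if the context ended the wait, and nil otherwise.
func sleepCtx(ctx context.Context, d time.Duration) error {
	if d <= 0 {
		return nil
	}
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```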
Library API Reference
cenkalti/backoff/v4
```go
type Operation func() error
type Notify func(error, time.Duration)

func Retry(o Operation, b BackOff) error
func RetryNotify(o Operation, b BackOff, n Notify) error
func RetryNotifyWithTimer(o Operation, b BackOff, n Notify, t Timer) error

type BackOff interface {
	NextBackOff() time.Duration
	Reset()
}

const Stop time.Duration = -1

type ExponentialBackOff struct {
	InitialInterval     time.Duration // default 500ms
	RandomizationFactor float64       // default 0.5
	Multiplier          float64       // default 1.5
	MaxInterval         time.Duration // default 60s
	MaxElapsedTime      time.Duration // default 15min
	Stop                time.Duration // default Stop
	Clock               Clock
}

func NewExponentialBackOff(opts ...ExponentialBackOffOpts) *ExponentialBackOff
func WithMaxRetries(b BackOff, max uint64) BackOff
func WithContext(b BackOff, ctx context.Context) BackOffContext
func Permanent(err error) error
```
hashicorp/go-retryablehttp
```go
type Client struct {
	HTTPClient      *http.Client
	Logger          interface{} // Logger or LeveledLogger
	RetryWaitMin    time.Duration // default 1s
	RetryWaitMax    time.Duration // default 30s
	RetryMax        int           // default 4
	RequestLogHook  RequestLogHook
	ResponseLogHook ResponseLogHook
	CheckRetry      CheckRetry
	Backoff         Backoff
	ErrorHandler    ErrorHandler
}

func NewClient() *Client
func (c *Client) Get(url string) (*http.Response, error)
func (c *Client) Post(url, bodyType string, body interface{}) (*http.Response, error)
func (c *Client) Do(req *Request) (*http.Response, error)

type CheckRetry func(ctx context.Context, resp *http.Response, err error) (bool, error)
type Backoff func(min, max time.Duration, attemptNum int, resp *http.Response) time.Duration
```
sony/gobreaker
```go
type State int

const (
	StateClosed State = iota
	StateHalfOpen
	StateOpen
)

type Settings struct {
	Name          string
	MaxRequests   uint32
	Interval      time.Duration
	Timeout       time.Duration
	ReadyToTrip   func(Counts) bool
	OnStateChange func(name string, from State, to State)
	IsSuccessful  func(err error) bool
}

type Counts struct {
	Requests             uint32
	TotalSuccesses       uint32
	TotalFailures        uint32
	ConsecutiveSuccesses uint32
	ConsecutiveFailures  uint32
}

type CircuitBreaker struct { /* opaque */ }

func NewCircuitBreaker(st Settings) *CircuitBreaker
func (cb *CircuitBreaker) Execute(req func() (interface{}, error)) (interface{}, error)
func (cb *CircuitBreaker) State() State
func (cb *CircuitBreaker) Counts() Counts
func (cb *CircuitBreaker) Name() string

var ErrTooManyRequests = errors.New("too many requests")
var ErrOpenState = errors.New("circuit breaker is open")
```
golang.org/x/time/rate
```go
type Limit float64

const Inf = Limit(math.MaxFloat64)

type Limiter struct { /* opaque */ }

func NewLimiter(r Limit, b int) *Limiter
func (lim *Limiter) Allow() bool
func (lim *Limiter) AllowN(now time.Time, n int) bool
func (lim *Limiter) Wait(ctx context.Context) error
func (lim *Limiter) WaitN(ctx context.Context, n int) error
func (lim *Limiter) Reserve() *Reservation
func (lim *Limiter) ReserveN(now time.Time, n int) *Reservation
func (lim *Limiter) Tokens() float64
func (lim *Limiter) Burst() int
func (lim *Limiter) Limit() Limit
func (lim *Limiter) SetBurst(b int)
func (lim *Limiter) SetLimit(newLimit Limit)
```
Standard-Library Primitives
time.Duration
Duration constants:
```go
const (
	Nanosecond  Duration = 1
	Microsecond          = 1000 * Nanosecond
	Millisecond          = 1000 * Microsecond
	Second               = 1000 * Millisecond
	Minute               = 60 * Second
	Hour                 = 60 * Minute
)
```
Multiplication: 5 * time.Second is 5_000_000_000 nanoseconds. An untyped constant multiplies directly, but an int variable needs a conversion: time.Duration(n) * time.Second.
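time.Sleep
Blocks the calling goroutine for at least d. It cannot be cancelled, so retry loops that honour a context should use a timer selected against ctx.Done() instead (see sleepCtx).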
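time.Timer
```go
type Timer struct {
	C <-chan Time
	// contains unexported fields
}

func NewTimer(d Duration) *Timer
func (t *Timer) Stop() bool
func (t *Timer) Reset(d Duration) bool
```
Properties:
- t.C receives one value after d elapses.
- t.Stop() returns false if the timer has already expired or been stopped.
- t.Reset(d) reuses the timer. Before Go 1.23, call it only on a stopped or expired timer whose channel has been drained (receive from t.C if Stop returned false and the value was not yet received). From Go 1.23, Stop and Reset guarantee that no stale value is received afterwards.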
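time.After
Returns a channel that receives the current time after d. It cannot be stopped early. Before Go 1.23, the underlying timer was not garbage-collected until it fired, so calling time.After on every loop iteration could accumulate live timers.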
time.NewTicker
Fires repeatedly every d. It suits fixed-interval polling and is rarely a fit for backoff, whose interval changes on every attempt.
Random Source Selection
Choosing between math/rand and crypto/rand
| Property | math/rand | crypto/rand |
|---|---|---|
| Speed | ~10ns per call | ~1µs per call |
| Determinism | reproducible with seed | non-reproducible |
| Cryptographic strength | no | yes |
| Concurrent safety (top-level functions) | yes | yes |
| Concurrent safety (per-instance) | no (*rand.Rand) | yes (rand.Reader) |
For jitter, use math/rand (or math/rand/v2 on Go 1.22+). Jitter only needs to spread clients apart, not to be unpredictable to an attacker, and math/rand is much cheaper.
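Seeding (math/rand, pre-Go 1.20)
Before Go 1.20, the global source started from a fixed seed, so programs called rand.Seed(time.Now().UnixNano()) once at startup to avoid every process producing the same jitter sequence. In Go 1.20+, the global source is seeded randomly and rand.Seed is deprecated.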
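Per-goroutine source
A *rand.Rand created with rand.New(rand.NewSource(seed)) is not safe for concurrent use. Give each goroutine its own instance, wrap a shared one with a sync.Mutex, or hand instances out through a sync.Pool.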
math/rand/v2 (Go 1.22+)
Its top-level functions are safe for concurrent use and automatically seeded, and there is no global Seed function. Recommended for new code.
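Generating a random duration
```go
// Uniform on [0, ceil); requires ceil > 0.
delay := time.Duration(rand.Int63n(int64(ceil)))
// Uniform on [a, b); requires b > a.
delay = a + time.Duration(rand.Int63n(int64(b-a)))
```
rand.Int63n panics when its argument is <= 0, so guard zero and negative ranges before calling it.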
Timer Semantics
time.NewTimer vs time.After
| Aspect | time.NewTimer | time.After |
|---|---|---|
| Returns | *time.Timer | <-chan time.Time |
| Stoppable | yes | no |
| Leaks in loops | no (with Stop) | before Go 1.23, each timer lives until it fires |
| Use in retry | preferred | avoid |
Idiomatic cancellable sleep
```go
func sleepCtx(ctx context.Context, d time.Duration) error {
	if d <= 0 {
		return nil
	}
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```
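Reusing a timer
```go
// Create the timer once, then stop and drain it so the first Reset starts clean.
t := time.NewTimer(0)
defer t.Stop()
if !t.Stop() {
	<-t.C // discard the value the expired timer already sent (needed before Go 1.23)
}
for /* each retry */ {
	t.Reset(nextDelay) // t.C is empty here: drained above or received below
	select {
	case <-t.C:
	case <-ctx.Done():
		return ctx.Err()
	}
}
```
Reusing one Timer avoids allocating a new timer on every iteration. From Go 1.23, Stop and Reset discard stale values themselves, so the initial drain is unnecessary but harmless.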
Error Classification
Retryable conditions
| Source | Examples |
|---|---|
| Network | dial tcp: i/o timeout, connection refused, connection reset |
| HTTP | 408, 425, 429, 500, 502, 503, 504 |
| gRPC | UNAVAILABLE, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED |
| Database | connection lost, deadlock detected |
Non-retryable conditions
| Source | Examples |
|---|---|
| HTTP | 400, 401, 403, 404, 405, 409 (unless retried after re-reading state), 410, 422 |
| gRPC | INVALID_ARGUMENT, NOT_FOUND, ALREADY_EXISTS, PERMISSION_DENIED, FAILED_PRECONDITION |
| Parsing | malformed JSON, validation errors |
| Auth | invalid credentials |
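Idiomatic predicate
```go
// isRetryable is deliberately coarse: any net.Error (timeouts, refused or
// reset connections) counts as transient, but caller cancellation does not.
func isRetryable(err error) bool {
	if err == nil || errors.Is(err, context.Canceled) {
		return false
	}
	var netErr net.Error
	return errors.As(err, &netErr)
}

// isRetryableHTTPStatus matches the retryable HTTP codes listed in this
// document. 501 and 505 are excluded because retrying cannot fix them.
func isRetryableHTTPStatus(code int) bool {
	switch code {
	case 408, 425, 429, 500, 502, 503, 504:
		return true
	}
	return false
}
```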
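Retry-After Header
Format
Two valid formats (RFC 9110):
- delay-seconds: a non-negative integer number of seconds, e.g. Retry-After: 120
- HTTP-date: an absolute time, e.g. Retry-After: Wed, 21 Oct 2015 07:28:00 GMT
Parsing
```go
// parseRetryAfter returns false for empty or malformed values; past dates yield 0.
func parseRetryAfter(h string) (time.Duration, bool) {
	h = strings.TrimSpace(h)
	if h == "" {
		return 0, false
	}
	if s, err := strconv.Atoi(h); err == nil {
		if s < 0 { // delay-seconds must be non-negative
			return 0, false
		}
		return time.Duration(s) * time.Second, true
	}
	if t, err := http.ParseTime(h); err == nil {
		if d := time.Until(t); d > 0 {
			return d, true
		}
		return 0, true // the date has already passed: retry now
	}
	return 0, false
}
```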
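Usage
When honouring Retry-After:
```go
if d, ok := parseRetryAfter(resp.Header.Get("Retry-After")); ok {
	// Give up if the server asks us to wait past our own deadline. ctx.Err()
	// is still nil at this point, so return an explicit error.
	if deadline, hasDeadline := ctx.Deadline(); hasDeadline && d > time.Until(deadline) {
		return fmt.Errorf("server Retry-After %v exceeds remaining deadline: %w", d, context.DeadlineExceeded)
	}
	// Optionally add up to 10% jitter; Int63n panics on n <= 0, so guard small d.
	if j := int64(d) / 10; j > 0 {
		d += time.Duration(rand.Int63n(j))
	}
	return sleepCtx(ctx, d)
}
```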
gRPC Service Config Schema
Retry policy
```json
{
  "methodConfig": [
    {
      "name": [{"service": "<svc>", "method": "<method>"}],
      "timeout": "<duration>",
      "retryPolicy": {
        "maxAttempts": <int>,
        "initialBackoff": "<duration>",
        "maxBackoff": "<duration>",
        "backoffMultiplier": <float>,
        "retryableStatusCodes": ["<code>", ...]
      }
    }
  ]
}
```
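Throttling
Throttling is configured by the top-level "retryThrottling" object, which has two fields: "maxTokens" (<int>) and "tokenRatio" (<float>). maxTokens is the bucket size, and the bucket starts full. Each failed RPC costs 1 token, and each successful RPC adds tokenRatio tokens, up to maxTokens. Retries are suppressed while the token count is at or below maxTokens / 2.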
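The throttling rule is a small token bucket. This is a minimal sketch of the semantics described above, following the gRPC retry design (gRFC A6). The type throttle and its methods are hypothetical names, and this is not grpc-go's implementation.

```go
package main

import "sync"

// throttle tracks a token count that starts at maxTokens and stays within
// [0, maxTokens]: failed RPCs subtract 1 and successful RPCs add tokenRatio.
// Retries are allowed only while the count is above maxTokens/2.
type throttle struct {
	mu         sync.Mutex
	tokens     float64
	maxTokens  float64
	tokenRatio float64
}

func newThrottle(maxTokens int, tokenRatio float64) *throttle {
	m := float64(maxTokens)
	return &throttle{tokens: m, maxTokens: m, tokenRatio: tokenRatio}
}

func (t *throttle) recordSuccess() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.tokens += t.tokenRatio
	if t.tokens > t.maxTokens {
		t.tokens = t.maxTokens
	}
}

func (t *throttle) recordFailure() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.tokens--
	if t.tokens < 0 {
		t.tokens = 0
	}
}

func (t *throttle) retryAllowed() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.tokens > t.maxTokens/2
}
```

With maxTokens = 10 and tokenRatio = 0.1, five failures in a row push the count down to 5 and suspend retries. It then takes one success per 0.1 tokens to climb back above the threshold.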
Hedging policy
```json
{
  "methodConfig": [
    {
      "name": [{"service": "<svc>"}],
      "hedgingPolicy": {
        "maxAttempts": <int>,
        "hedgingDelay": "<duration>",
        "nonFatalStatusCodes": ["<code>", ...]
      }
    }
  ]
}
```
A method config may set retryPolicy or hedgingPolicy, but not both.
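Loading
```go
const cfg = `{...}`
conn, err := grpc.NewClient(target,
	grpc.WithDefaultServiceConfig(cfg),
	grpc.WithTransportCredentials(creds), // e.g. insecure.NewCredentials() for local testing
)
if err != nil {
	return err
}
defer conn.Close()
```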
HTTP Status Code Retryability Table
| Code | Name | Retryable? | Notes |
|---|---|---|---|
| 200-299 | 2xx Success | n/a | success |
| 300-399 | 3xx Redirect | n/a | follow redirect; not retry |
| 400 | Bad Request | no | malformed input |
| 401 | Unauthorized | no (unless re-auth) | |
| 402 | Payment Required | no | |
| 403 | Forbidden | no | |
| 404 | Not Found | no | |
| 405 | Method Not Allowed | no | |
| 406 | Not Acceptable | no | |
| 408 | Request Timeout | yes | |
| 409 | Conflict | sometimes | retry after re-read |
| 410 | Gone | no | |
| 413 | Payload Too Large | no | |
| 414 | URI Too Long | no | |
| 415 | Unsupported Media Type | no | |
| 422 | Unprocessable Entity | no | |
| 425 | Too Early | yes | |
| 429 | Too Many Requests | yes | honour Retry-After |
| 500 | Internal Server Error | yes | |
| 501 | Not Implemented | no | |
| 502 | Bad Gateway | yes | |
| 503 | Service Unavailable | yes | honour Retry-After |
| 504 | Gateway Timeout | yes | |
| 505 | HTTP Version Not Supported | no | |
Defaults Reference
Recommended defaults
```
MaxAttempts: 3-5
Base: 100-200ms
MaxDelay: 5s
MaxElapsedTime: 30s
Strategy: FullJitter
Budget rate: 0.1 * normal RPS
Budget burst: 2 * rate
Breaker: 50% failures over 20 requests
Idempotency TTL: 24 hours
```
cenkalti/backoff defaults
```go
DefaultInitialInterval     = 500 * time.Millisecond
DefaultRandomizationFactor = 0.5
DefaultMultiplier          = 1.5
DefaultMaxInterval         = 60 * time.Second
DefaultMaxElapsedTime      = 15 * time.Minute
```
hashicorp/go-retryablehttp defaults
RetryWaitMin: 1s
RetryWaitMax: 30s
RetryMax: 4
gRPC defaults
Stripe API recommended
(From Stripe's published guidance.)
AWS SDK v2 defaults
Glossary
| Term | Definition |
|---|---|
| Attempt | One execution of the operation. |
| Retry | A repeated attempt after a failure. |
| Backoff | The delay-policy between retries. |
| Base delay | Delay before the first retry. |
| Cap | Maximum delay. |
| Jitter | Random variation applied to the delay. |
| Full jitter | U[0, cap]. |
| Equal jitter | cap/2 + U[0, cap/2]. |
| Decorrelated jitter | U[base, min(cap, prev*3)]. |
| Permanent error | Should not be retried. |
| Transient error | Should be retried. |
| Thundering herd | Synchronised retries overwhelming a service. |
| Retry budget | System-wide retry rate cap. |
| Idempotency key | Client-generated unique ID for deduplication. |
| Circuit breaker | Fail-fast pattern for known-bad dependencies. |
| Bulkhead | Per-dependency concurrency limit. |
| Hedging | Speculative duplicate requests. |
| Deadline propagation | Forwarding the deadline to downstream calls. |
| Token bucket | Rate-limiting algorithm: tokens replenish at a rate. |
Summary Table
| Parameter | Range | Recommended |
|---|---|---|
| MaxAttempts | 2-10 | 3-5 |
| Base | 10ms-1s | 100ms |
| MaxDelay | 1s-1min | 5s |
| Total deadline | 1s-1h | 5-30s |
| Factor | 1.5-3 | 2 |
| Strategy | none/full/equal/decorr | full |
| Budget rate | 1-50% RPS | 10% |
| Budget burst | 1-5x rate | 2x |
| Breaker threshold | 30-70% | 50% |
| Breaker window | 30s-5min | 60s |
| Breaker timeout | 10s-2min | 30s |
| Idempotency TTL | 1h-7d | 24h |
These are starting points. Tune from data.