Exponential Backoff — Senior Level¶
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- The Math of Thundering Herd
- Queueing-Theory View
- The AWS Architecture Blog Formulas
- Simulating Strategies
- Retry Storms Across Tiers
- Multiplicative Amplification
- Deadline Propagation in Distributed Systems
- The "Hedged Request" Pattern
- Idempotency Deep Dive
- Idempotency Keys In Practice
- Exactly-Once Semantics Myths
- Retry Coordination Across Replicas
- Circuit Breakers and Backoff Interaction
- Adaptive Retry Strategies
- Retry Budgets at Scale
- The Google SRE Approach
- Code Examples
- Architecture Patterns
- Performance Considerations
- Edge Cases and Pitfalls
- Common Mistakes
- Tricky Questions
- Cheat Sheet
- Self-Assessment Checklist
- Summary
- What You Can Build
- Further Reading
- Related Topics
- Diagrams and Visual Aids
Introduction¶
Focus: "Why does retry policy have to be a system-design decision, not just a coding decision? How do retries cause outages? When are they the wrong tool?"
At the junior level you wrote the loop. At the middle level you added jitter, context, and a budget. At the senior level you stop thinking about your retry loop and start thinking about your system's retry behaviour.
The shift is from local correctness to global correctness. A retry that is correct in isolation can be catastrophic at scale. Three engineers each writing perfectly correct retry loops can compose into a disaster because their retries multiply across tiers. A retry policy that worked at 1,000 RPS can melt down at 100,000 RPS because the herd has a different shape.
This file is about that shift. Specifically:
- The math of thundering herd, from the perspective of queueing theory, including how to model recovery dynamics.
- The AWS Architecture Blog formulas in detail, with derivations and simulation results.
- Retry storms across tiers — when service A retries B, which retries C, the count multiplies. We will quantify the explosion.
- Deadline propagation — passing a deadline through gRPC metadata, HTTP headers, and request scopes; cutting off retries when downstream has already given up.
- Idempotency at scale — keys, fingerprints, deduplication windows, semantic vs syntactic idempotency.
- Coordination between retry and circuit breaker — the subtle interaction that can either help or hurt.
- Adaptive retry strategies — Netflix's "concurrency limit" approach, Google's "retry budget per server" approach.
After reading this file you will:
- Be able to explain to a VP why "just retry more" is not the answer to flakiness.
- Be able to estimate the worst-case retry load on a dependency given client population and policy.
- Know the trade-offs between client-side and server-side retry coordination.
- Understand deadline propagation well enough to architect it across many services.
- Distinguish syntactic from semantic idempotency and know when each is enough.
- Choose between fixed retry budget, adaptive concurrency limit, and circuit breaker for a given workload.
This is architecture, not implementation. The code examples in this file are illustrative; the real value is in the reasoning. The professional file translates this reasoning into a real production codebase.
Prerequisites¶
- Complete the junior and middle files.
- Familiarity with the basics of queueing theory: arrival rate, service rate, utilisation, Little's Law. We will not need a full course, but enough that "M/M/1 queue" is a recognisable phrase.
- Knowledge of golang.org/x/time/rate's Limiter.
- Familiarity with context.Context deadline propagation and gRPC metadata.
- Some exposure to system-design topics: load balancers, service meshes, control plane vs data plane.
- Familiarity with the concept of a circuit breaker (we will cover the retry interaction, not the breaker itself in full).
Glossary¶
| Term | Definition |
|---|---|
| Thundering herd | A surge of correlated requests after a brief service interruption, often caused by synchronised client retries. |
| Retry storm | A multiplicative cascade where each tier of services retries against the next, amplifying the original failure into a catastrophic load increase downstream. |
| Hedging | Sending the same request to multiple replicas and using the first response. Reduces tail latency at the cost of duplicate work. |
| Adaptive concurrency limit | A per-server limit on in-flight requests that adjusts based on observed latency, used by Netflix's concurrency-limits library. |
| Bulkhead | A per-dependency concurrency limit at the client. Failures in one dependency do not consume capacity from another. |
| Deduplication window | The time during which the server remembers idempotency keys to detect duplicates. |
| Semantic idempotency | The operation has the same business effect when repeated, regardless of whether the implementation is bit-identical. (E.g. "create user" with the same email returns the existing user.) |
| Syntactic idempotency | The operation produces the same bytes when repeated. (E.g. PUT with the same body.) |
| Tail latency | The 95th, 99th, or 99.9th percentile of latency. Often dominated by retries. |
| MTTR | Mean Time To Recovery. The expected time for a failed system to return to service. |
| Retry budget | A bound on retry traffic, expressed either as a rate or as a ratio of retries to total requests. |
| Hedging budget | Like a retry budget but for hedged duplicate requests. |
| Deadline propagation | Passing a deadline from a request's origin through to all downstream calls, so they all give up by the same time. |
| Speculative retry | Sending a duplicate request before the original has timed out, to hedge against slow replicas. |
The Math of Thundering Herd¶
Let us formalise thundering herd. Consider:
- N clients, each issuing requests at rate λ per second.
- Each client retries a failed request with policy P.
- The server can serve μ requests per second.
In steady state, server load is N * λ. Suppose μ > N * λ * 1.2 — we have 20% headroom. The system is stable.
Now an event causes the server to drop requests for T seconds. During this window, each client sees a failure. With policy P, each client schedules retries.
The retry traffic from a single client during the recovery window is a train of impulses:
r_c(t) = Σ_i δ(t - t_i)
where t_i are the scheduled retry times for that client, drawn from P.
The aggregate retry rate at time t is the sum over all N clients:
R(t) = Σ_c Σ_i δ(t - t_{c,i})
For pure exponential without jitter, the retries are concentrated at the same instants for all clients: t = base, t = base + 2*base, t = base + 6*base, .... So R(t) is a series of impulses of magnitude N.
For full jitter, the first retry from each client is uniformly distributed in [0, base]. So R(t) over the first retry window is N / base per second (roughly uniform). For attempt 2 retries the window is [0, 2*base], and so on.
Why the impulse is dangerous¶
The server has buffer capacity B (in requests). If R(t) * dt > B + μ * dt over a short window dt, requests are dropped.
Without jitter, the impulse magnitude is N. If N > B, requests are dropped. For a moderately loaded service with B = 1000 and N = 100,000, this is guaranteed.
With full jitter at base D, the per-instant rate is N / D. For N = 100,000 and D = 100ms, the rate is 1,000,000 requests per second of retry traffic — which sounds bad, but it is spread over the recovery window. Per-instant burst capacity is the bottleneck, not total rate.
The buffer absorbs the spike. The system survives.
The flat-load assumption¶
The above analysis assumes the buffer is the bottleneck. Another bottleneck is steady-state CPU. If the server can serve μ = 100,000 RPS sustained, and the retry traffic adds Δ to the normal load, the system is stable only if μ > N * λ + Δ.
With pure exponential, each client fires its retries in lockstep pulses spread over the time it takes to exhaust the schedule. For 5 attempts with base = 100ms the delays sum to 3100ms, so the added average rate is Δ ≈ 5N / 3.1 retries per second.
With full jitter the total number of retries and the overall window are comparable, so the average added rate is roughly the same.
So steady-state retry load is the same regardless of jitter. The difference is the peak. Jitter trades peak for flat. The server's bottleneck dictates which matters.
For most servers the bottleneck is peak: queues fill faster than they drain. Hence jitter.
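To make the peak-versus-average distinction concrete, here is a small, self-contained sketch (names and parameters are illustrative) that bins the first retry of 100,000 clients into 1ms slots. Without jitter they all land in a single slot; with full jitter the peak slot carries only a tiny fraction of the herd.
package main

import (
    "fmt"
    "math/rand"
    "time"
)

func main() {
    const (
        n    = 100_000                // clients retrying
        base = 100 * time.Millisecond // first-retry window with full jitter
        bin  = time.Millisecond       // histogram resolution
    )
    bins := make([]int, int(base/bin))
    for i := 0; i < n; i++ {
        // Full jitter: the first retry is uniform in [0, base).
        t := time.Duration(rand.Int63n(int64(base)))
        bins[int(t/bin)]++
    }
    peak := 0
    for _, c := range bins {
        if c > peak {
            peak = c
        }
    }
    fmt.Printf("no jitter:   peak = %d retries in one instant\n", n)
    fmt.Printf("full jitter: peak ≈ %d retries per %v bin\n", peak, bin)
}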
Queueing-Theory View¶
A more formal treatment uses queueing theory. Model the server as an M/M/1 queue with arrival rate λ and service rate μ. Utilisation is ρ = λ/μ.
The average number of requests in the system (queued plus in service) is ρ / (1 - ρ). As ρ → 1, it goes to infinity. Tail latency grows as 1 / (1 - ρ).
Now consider a sudden bump in arrival rate due to retries. If ρ briefly exceeds 1, the queue grows. After the bump, the queue takes time to drain. Drain time is approximately queue_size / (μ - λ_normal).
This is why thundering herd is so dangerous: even if the herd is brief, the queue takes much longer to drain than the herd lasts.
Concrete numbers¶
- Server: μ = 100,000 RPS.
- Normal load: λ = 80,000 RPS (ρ = 0.8).
- Headroom: 20,000 RPS.
- Herd: 200,000 RPS for 100ms = 20,000 extra requests.
The headroom absorbs only 2,000 of those requests during the 100ms window, so the queue grows by roughly 18,000. Drain time ≈ 18,000 / 20,000 ≈ 0.9 seconds.
So a 100ms herd causes roughly a one-second window of elevated latency. Larger herds cause longer windows. A herd that exceeds capacity by 5× takes correspondingly longer to drain; one that repeats before the queue has drained produces an effectively permanent backlog.
Jitter spreads the herd, reducing the peak and thus the queue growth. This is the formal justification.
The AWS Architecture Blog Formulas¶
The 2015 AWS Architecture Blog post by Marc Brooker formalised the three jitter strategies. The exact formulas:
Exponential (no jitter):
sleep = min(cap, base * 2^attempt)
Full jitter:
sleep = random_between(0, min(cap, base * 2^attempt))
Equal jitter:
temp = min(cap, base * 2^attempt); sleep = temp/2 + random_between(0, temp/2)
Decorrelated jitter:
sleep = min(cap, random_between(base, prev * 3))
Where prev is the previous delay (initialised to base).
The blog also defined two metrics for comparing strategies:
- Completion time: the time at which all clients have either succeeded or exhausted retries.
- Total work: the total number of retries fired across all clients.
The simulation ran 100 clients retrying against a single bottleneck with base = 10ms, cap = 10s, maxAttempts = 100. Results:
| Strategy | Completion time | Total work |
|---|---|---|
| Exponential | very long | low (synchronised) |
| Full jitter | shortest | low |
| Equal jitter | medium | medium |
| Decorrelated | medium | lowest |
Full jitter wins on completion time. Decorrelated wins on total work (fewer retries). Equal is in the middle. Pure exponential without jitter is the worst on completion time (because synchronised retries keep failing into the recovering service).
The takeaway: for most workloads, full jitter is the right choice. Decorrelated jitter is a reasonable alternative if you specifically want to minimise total retry traffic.
The deeper insight¶
Why does full jitter win on completion time but decorrelated win on total work? Because:
- Full jitter is aggressive in sending retries — many clients try early.
- Decorrelated is patient — each delay is drawn from a range anchored to the previous delay, so the schedule ramps up more conservatively.
Aggressive retries find the recovery point faster (good) but fire more retries (bad). Patient retries fire fewer retries (good) but miss early recovery (slightly slower completion).
If recovery is fast: full jitter wins. If recovery is slow: decorrelated is more efficient.
For real workloads, recovery is usually fast (seconds, not minutes), so full jitter is the default.
Simulating Strategies¶
A real engineer should be able to simulate these strategies and verify the theory. Here is a complete simulation in Go:
package main
import (
"fmt"
"math/rand"
"time"
)
type result struct {
completionTime time.Duration
retries int
}
func simulate(strategy string, clients int, base, cap time.Duration, maxAttempts int, recovery time.Duration) result {
var r result
for c := 0; c < clients; c++ {
t := time.Duration(0)
prev := base
retries := 0
for attempt := 0; attempt < maxAttempts; attempt++ {
retries++
if t >= recovery {
// success on this attempt
if t > r.completionTime {
r.completionTime = t
}
r.retries += retries
break
}
var d time.Duration
switch strategy {
case "exponential":
d = base * time.Duration(1<<attempt)
case "full":
ccap := base * time.Duration(1<<attempt)
if ccap > cap {
ccap = cap
}
d = time.Duration(rand.Int63n(int64(ccap)))
case "equal":
ccap := base * time.Duration(1<<attempt)
if ccap > cap {
ccap = cap
}
d = ccap/2 + time.Duration(rand.Int63n(int64(ccap/2)))
case "decorrelated":
upper := prev * 3
if upper > cap {
upper = cap
}
span := upper - base
if span > 0 {
d = base + time.Duration(rand.Int63n(int64(span)))
} else {
d = base
}
prev = d
}
if d > cap {
d = cap
}
t += d
}
}
return r
}
func main() {
clients := 1000
base := 10 * time.Millisecond
cap := 10 * time.Second
recovery := 5 * time.Second
maxAttempts := 100
for _, s := range []string{"exponential", "full", "equal", "decorrelated"} {
r := simulate(s, clients, base, cap, maxAttempts, recovery)
fmt.Printf("%-14s completion=%v total_retries=%d\n", s, r.completionTime, r.retries)
}
}
Running it prints completion time and total retry count for each strategy (the numbers vary with the random seed). Keep one caveat in mind when reading the output: this toy only replays each client's retry schedule against a fixed recovery time; it does not model contention at the server. Pure exponential therefore looks deceptively cheap here — every client marches through the same deterministic schedule — yet that lockstep is exactly the synchronised spike that overwhelms a real, capacity-limited server. To reproduce the AWS blog's ordering (exponential slowest to complete, full jitter fastest, decorrelated with the least total work), extend the simulation with a server that can absorb only a bounded number of requests per time slice and count colliding attempts as failures.
This is a useful exercise: run this in your terminal, vary clients, base, recovery, see how each strategy responds. The intuition you build here is more useful than memorising the formulas.
Retry Storms Across Tiers¶
So far we have considered a single client tier hitting a single server. Real systems have many tiers: A calls B calls C calls D. Each tier may retry.
This causes multiplicative amplification. If each tier retries 3 times, a failure at D causes:
- 1 call from A.
- 3 retries from A to B.
- Each of those produces 3 retries from B to C = 9.
- Each of those produces 3 retries from C to D = 27.
So a single user-initiated request becomes 27 calls to D. If 1,000 users hit at the failure window, D sees 27,000 calls.
This is a retry storm. It is the single biggest reason high-traffic systems suffer cascading outages.
The countermeasures¶
Three approaches:
- Retry only at the boundary. A single retry layer at the edge; inner tiers do not retry. Simple and effective. The downside: a transient failure deep in the stack always propagates up.
- Retry budgets at each tier. Each tier limits its retry rate. Storms get bounded but not eliminated.
- Deadline propagation. Each tier passes the deadline to the next. Tiers near deadline expiry skip retries.
Most production systems use deadline propagation + retry at the boundary. Inner services do not retry; they propagate the deadline. The outermost service retries within the deadline budget. This bounds the total work multiplicatively.
A concrete example¶
User request has 5-second deadline at A.
- A calls B at t=0. Deadline propagates as 5s.
- B calls C at t=100ms. Deadline propagates as 4.9s.
- C calls D at t=200ms. Deadline propagates as 4.8s.
- D fails at t=250ms.
- C sees error. Should C retry? No — only A retries. C surfaces the error.
- B sees error. Surfaces to A.
- A retries with remaining budget (~4.7s).
- A calls B again. The cycle repeats but with a tighter deadline.
If A retries 3 times within 5s, A makes 3 calls to B. B makes 1 call per A's call (no retry). So 3 calls to D total, not 27.
This is the architecture of a sane retry system at scale.
Multiplicative Amplification¶
A formal model. Let:
- r_i be the maximum number of retries at tier i.
- f_i be the failure rate of tier i's downstream (i.e. the probability that tier i+1 returns an error).
The expected number of calls from tier i to tier i+1, per incoming call at tier i, is the expected number of attempts:
attempts_i = 1 + f_i + f_i² + … + f_i^r_i = Σ_{k=0..r_i} f_i^k
The expected number of calls to tier n per user request is the product across the tiers above it:
calls_n = Π_{i=1..n-1} Σ_{k=0..r_i} f_i^k
For three tiers each with 3 retries and 50% failure rate:
(1 + 0.5 + 0.25 + 0.125)³ = 1.875³ ≈ 6.6 calls to the deepest tier per user request.
For 90% failure rate (system in distress):
(1 + 0.9 + 0.81 + 0.729)³ = 3.439³ ≈ 41 calls per user request.
The amplification increases with failure rate. Exactly when the system is most stressed, retries make it worse.
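The arithmetic is easy to check with a few lines of Go. This sketch computes the expected per-tier attempts and the three-tier product for the two failure rates above.
package main

import "fmt"

// attemptsPerCall returns the expected number of downstream attempts made for
// one incoming call, given `retries` retries and independent failure probability f.
func attemptsPerCall(retries int, f float64) float64 {
    total, p := 0.0, 1.0
    for k := 0; k <= retries; k++ {
        total += p // attempt k+1 happens with probability f^k
        p *= f
    }
    return total
}

func main() {
    for _, f := range []float64{0.5, 0.9} {
        a := attemptsPerCall(3, f)
        fmt.Printf("f=%.0f%%: %.3f attempts per tier, %.1f calls to the deepest of 3 tiers\n",
            100*f, a, a*a*a)
    }
}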
Why retries at every tier are dangerous¶
If f_i increases due to overload, the retry amplification increases, which increases load, which increases f_i. This is a positive feedback loop — the system tips over.
Adaptive retry (Netflix's concurrency-limits, Google's adaptive throttling) breaks the loop by reducing retries when failure rate rises. But the simpler architecture — retry only at the boundary — never enters the loop.
Deadline Propagation in Distributed Systems¶
Deadline propagation is the discipline of passing a deadline from request origin to all downstream services. Each service trims its own deadline based on the parent's.
How deadlines propagate¶
HTTP: a header carries the deadline. Conventions vary:
- X-Deadline with an absolute Unix timestamp.
- Grpc-Timeout style: a relative timeout in milliseconds.
- Custom company-specific headers.
gRPC: The deadline is part of the request context. ctx, _ := context.WithTimeout(parent, 5*time.Second); client.Method(ctx, req). The deadline is automatically sent as grpc-timeout metadata.
Internal Go: context.WithTimeout and pass ctx through your function chain.
How services use deadlines¶
When a service receives a request, it reads the deadline from the metadata, builds a context.WithDeadline(ctx, deadline), and uses that context for all downstream calls.
func (s *Server) handle(w http.ResponseWriter, r *http.Request) {
deadline := parseDeadline(r.Header)
ctx, cancel := context.WithDeadline(r.Context(), deadline)
defer cancel()
result, err := s.process(ctx)
// ...
}
If a downstream call would take longer than the remaining deadline, it can be skipped — the user has already given up.
How retries interact with deadlines¶
The retry loop should:
- Check ctx.Err() before each attempt.
- Clip the sleep to the remaining deadline.
- Avoid scheduling a retry that cannot complete in the remaining time.
The last point is subtle. Suppose the remaining deadline is 200ms, the next sleep is 100ms, and the operation typically takes 150ms. After the sleep there is 100ms left, less than the operation's typical time. Should we retry?
Two schools of thought:
- Pessimistic: Skip the retry. The remaining time is too short.
- Optimistic: Retry anyway. The operation might be faster this time.
The optimistic approach is more common. The downside is that you spend the remaining time on a likely-failing call and surface the deadline error rather than the original error.
A middle ground: skip the retry only if remaining < expected_p99_latency. Requires measurement.
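A minimal sketch of that middle ground (package and parameter names are illustrative; expectedP99 is assumed to come from your own latency measurements):
package retryx

import (
    "context"
    "time"
)

// shouldRetry reports whether another attempt is worth scheduling given the
// remaining deadline. With no deadline, retry freely.
func shouldRetry(ctx context.Context, expectedP99 time.Duration) bool {
    deadline, ok := ctx.Deadline()
    if !ok {
        return true
    }
    return time.Until(deadline) >= expectedP99
}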
gRPC's deadline propagation¶
gRPC has built-in deadline propagation. The deadline travels in metadata, and downstream services automatically respect it.
For HTTP, you usually need to implement it yourself. Many companies have an internal HTTP framework that does this — passing X-Deadline or similar.
If your services do not propagate deadlines, retries become almost-free at lower tiers (they keep retrying after the user has timed out). The cost is borne entirely by your dependencies.
The "Hedged Request" Pattern¶
Hedging is an alternative to retry for tail-latency reduction. Instead of waiting for failure, you send a duplicate request before the original times out.
Algorithm¶
- Send request to replica A.
- Wait d (a fraction of the deadline).
- If A has not responded, send the same request to replica B.
- Use whichever response arrives first.
- Cancel the other.
Why hedging?¶
For latency problems (not error problems), hedging is more effective than retry. A slow replica can complete normally; you do not wait for the typical timeout.
When to hedge¶
- The operation is idempotent.
- Tail latency is the dominant SLO concern.
- You have at least two replicas.
- The cost of duplicate work is acceptable.
When not to hedge¶
- The operation is not idempotent.
- The dependency is rate-limited; doubling requests halves your effective rate.
- You have only one replica.
- You are already at capacity.
Hedging with retry¶
You can combine: hedge first, then retry if both fail. Most large-scale systems do this.
type HedgingClient struct {
Replicas []string
Delay time.Duration
}
func (c *HedgingClient) Do(ctx context.Context, req Request) (Response, error) {
type result struct {
resp Response
err error
}
// Cancel the losers once a winner is chosen (or once we give up).
ctx, cancel := context.WithCancel(ctx)
defer cancel()
out := make(chan result, len(c.Replicas))
for i, replica := range c.Replicas {
go func(i int, r string) {
if i > 0 {
t := time.NewTimer(c.Delay * time.Duration(i))
select {
case <-t.C:
case <-ctx.Done():
t.Stop()
return
}
}
resp, err := callReplica(ctx, r, req)
select {
case out <- result{resp, err}:
case <-ctx.Done():
}
}(i, replica)
}
for i := 0; i < len(c.Replicas); i++ {
select {
case r := <-out:
if r.err == nil {
return r.resp, nil
}
case <-ctx.Done():
return Response{}, ctx.Err()
}
}
return Response{}, errors.New("all replicas failed")
}
This is hedging with sequential staggered delays. Replica 1 fires after Delay, replica 2 after 2*Delay. The first success wins.
Hedging budget¶
A hedging budget (like a retry budget) limits the rate of duplicate requests. Common ratios: 1-5% of normal traffic. Above that, hedging itself is a load problem.
golang.org/x/time/rate works for hedging budgets too.
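A sketch of gating hedges on a token bucket; the 50-per-second rate and burst of 100 are illustrative, not recommendations.
package hedgebudget

import (
    "context"

    "golang.org/x/time/rate"
)

// Allow roughly 50 hedged duplicates per second with a burst of 100.
var hedgeBudget = rate.NewLimiter(50, 100)

// maybeHedge fires the duplicate request only if the budget allows it.
// The primary request is never blocked by the budget.
func maybeHedge(ctx context.Context, send func(context.Context)) {
    if !hedgeBudget.Allow() {
        return // budget exhausted: rely on the primary request alone
    }
    go send(ctx)
}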
Idempotency Deep Dive¶
Idempotency is the property that doing something twice has the same effect as doing it once. It is the cornerstone of safe retry.
Syntactic vs semantic idempotency¶
Syntactic idempotency: the request bytes are identical, and the server does the same operation. Examples: PUT /users/42 with the same JSON body.
Semantic idempotency: the intended business effect is the same, but the bytes may differ. Examples: "create user with email X" returns the same user regardless of how many times you call it.
Most production systems aim for semantic idempotency. Syntactic is too brittle (one whitespace change breaks it).
Idempotency keys¶
A client-generated unique identifier attached to a request. The server records the key after processing. If the same key arrives again, the server returns the cached response without redoing the work.
Example flow (Stripe-style):
- Client generates UUID v4 as idempotency key.
- Client sends POST /charge with an Idempotency-Key: <uuid> header.
- Server processes, charges the customer, and records {key: <uuid>, response: <data>} in storage.
- Client retries due to a network error.
- Server sees same key in storage, returns cached response.
The client may safely retry as long as the same key is used. The server guarantees no duplicate side effect.
Storage for idempotency keys¶
The simplest approach is a database table:
CREATE TABLE idempotency_keys (
key VARCHAR PRIMARY KEY,
response JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
Insert with ON CONFLICT DO NOTHING. The first insert wins; the loser reads the cached response.
For scale, use Redis with TTL:
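A sketch with the go-redis client used elsewhere in this file (the idem: prefix and 24h TTL are illustrative):
// rdb is a *redis.Client (github.com/redis/go-redis/v9).
// SET idem:<key> <response> NX EX 86400: only the first writer wins, and Redis
// expires the entry after 24 hours.
created, err := rdb.SetNX(ctx, "idem:"+key, respBytes, 24*time.Hour).Result()
if err != nil {
return err
}
if !created {
// A response is already cached for this key: it is a duplicate request,
// so read it back with GET and return it instead of redoing the work.
}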
The TTL bounds storage. After 24 hours, the key is forgotten. Retries within 24h are deduplicated; later retries become real second-write attempts.
Deduplication windows¶
How long should the server remember keys? Trade-off:
- Longer window: safer (catches very delayed retries).
- Shorter window: less storage, faster lookup.
Common values: 24 hours for human-facing operations, 1 hour for machine-to-machine.
For financial systems, days or weeks. The cost of a duplicate charge dwarfs the cost of storage.
What if the key is reused intentionally?¶
Some clients reuse idempotency keys across different operations. The server must detect this and reject — typically with 409 Conflict. Otherwise, the second client gets the first client's response.
Stripe's API does this: if you reuse a key with a different body, you get an error.
What about request-body fingerprinting?¶
Some systems use a hash of the request body as the implicit idempotency key. This works but has caveats:
- Two clients with the same body get the same response. Usually fine but sometimes surprising.
- Computation cost on every request.
- Body must be deterministic — no timestamps in the body.
For these reasons, explicit idempotency keys are usually preferred.
Idempotency Keys In Practice¶
A realistic implementation in Go.
Server-side handler¶
package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"time"
"github.com/redis/go-redis/v9"
)
type ChargeRequest struct {
Amount int64 `json:"amount"`
Currency string `json:"currency"`
}
type ChargeResponse struct {
ID string `json:"id"`
Status string `json:"status"`
Time time.Time `json:"time"`
}
type Handler struct {
Redis *redis.Client
}
func (h *Handler) Charge(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
key := r.Header.Get("Idempotency-Key")
if key == "" {
http.Error(w, "missing Idempotency-Key", http.StatusBadRequest)
return
}
// Try to get cached response.
cached, err := h.Redis.Get(ctx, "idem:"+key).Bytes()
if err == nil {
w.Header().Set("Content-Type", "application/json")
w.Write(cached)
return
} else if !errors.Is(err, redis.Nil) {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
// First time we see this key. Do the work.
var req ChargeRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
resp, err := h.doCharge(ctx, req)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
// Cache the response.
respBytes, _ := json.Marshal(resp)
h.Redis.Set(ctx, "idem:"+key, respBytes, 24*time.Hour)
w.Header().Set("Content-Type", "application/json")
w.Write(respBytes)
}
func (h *Handler) doCharge(ctx context.Context, req ChargeRequest) (ChargeResponse, error) {
// ... actual side-effectful work ...
return ChargeResponse{ID: "ch_abc", Status: "succeeded", Time: time.Now()}, nil
}
Two race conditions to worry about:
- Two concurrent requests with the same key. The first sees redis.Nil and starts work. The second also sees redis.Nil (the first has not finished) and starts work too. Both succeed; the side effect is doubled.
Fix: use SETNX (or SET NX EX) to atomically claim the key:
ok, err := h.Redis.SetNX(ctx, "idem:"+key+":lock", "1", 30*time.Second).Result()
if err != nil { /* ... */ }
if !ok {
// Another request is processing. Wait, poll, or return 409.
return
}
defer h.Redis.Del(ctx, "idem:"+key+":lock")
This is essentially a distributed lock keyed on the idempotency key.
- Process crashes between work and cache write. The work is done but the cache is empty. The next retry redoes the work, doubling the side-effect.
Fix: store the in-progress marker first, then the result, with the marker only releasing on success. Or use a database transaction that writes both the side effect and the cache row.
This is hard. Production systems use proven patterns (Stripe's open-source idempotency library is a good reference).
Client-side helper¶
type IdempotentClient struct {
HTTP *http.Client
}
func (c *IdempotentClient) Charge(ctx context.Context, key string, req ChargeRequest) (ChargeResponse, error) {
body, _ := json.Marshal(req)
httpReq, _ := http.NewRequestWithContext(ctx, "POST", "/charge", bytes.NewReader(body))
httpReq.Header.Set("Idempotency-Key", key)
httpReq.Header.Set("Content-Type", "application/json")
resp, err := c.HTTP.Do(httpReq)
if err != nil {
return ChargeResponse{}, err
}
defer resp.Body.Close()
var out ChargeResponse
json.NewDecoder(resp.Body).Decode(&out)
return out, nil
}
The client must reuse the same key when retrying. Typically:
key := uuid.NewString()
var result ChargeResponse
err := retrier.Do(ctx, func(ctx context.Context) error {
var err error
result, err = client.Charge(ctx, key, req)
return err
})
The key is generated once, outside the retry loop. Each retry reuses it.
Exactly-Once Semantics Myths¶
A common confusion: "Will idempotency keys give me exactly-once semantics?"
Strictly speaking, no. The most you can achieve in a distributed system is at-least-once + idempotency = effectively-once. The operation may be processed many times (at-least-once), but because it is idempotent, the effect is the same as once.
True exactly-once (the operation is processed exactly once, never more or less) is impossible without consensus (Paxos, Raft) or a transactional database underneath. Even then it usually applies only to a specific kind of write.
For most application-level retries: aim for at-least-once + idempotency. That is what Stripe, Square, and most payments APIs implement.
Retry Coordination Across Replicas¶
If your service has many replicas, each retrier in each replica is independent. There is no coordination between them.
Why does this matter? Because the aggregate retry rate is the sum across all replicas. A retry budget configured per-replica is too generous in aggregate.
Three approaches to coordination:
Approach 1: per-replica budgets divided¶
If you have 10 replicas and want a global retry rate of 100 RPS, configure each replica's budget at 10 RPS. Crude but works.
Approach 2: shared budget via a coordinator¶
A central rate limiter (e.g. Redis-based) that all replicas consult. Each retry attempt fetches a token from the central limiter.
Cost: every retry has a coordinator round-trip. Usually not worth it.
Approach 3: probabilistic admission¶
Each replica randomly admits retries with probability p. p is tuned (or adapted) so that across all replicas the aggregate rate is bounded.
This is essentially what Google's adaptive throttling does. Each client maintains a count of accepts and requests; when the ratio drops, the client probabilistically drops its own requests.
Circuit Breakers and Backoff Interaction¶
A circuit breaker is a stateful client-side protector that fails fast when a dependency is known to be down. Three states:
- Closed: normal operation, requests go through.
- Open: dependency is failing, requests are rejected immediately without trying.
- Half-open: the breaker is testing whether the dependency has recovered.
When combined with retry, the interaction is subtle.
Naive composition (problematic)¶
if !breaker.Allow() { return ErrCircuitOpen }
for attempt := 0; attempt < N; attempt++ {
err := op()
breaker.Record(err)
if err == nil { return nil }
time.Sleep(...)
}
Each retry inside the loop records into the breaker. If 5 retries all fail, the breaker sees 5 failures — opening faster than intended.
Worse: when the breaker opens mid-loop, subsequent retries see "circuit open" and fail. The retry sequence is interrupted by the breaker.
Better composition¶
The breaker should be checked once per top-level operation, and the retry loop's failures count as one breaker event (or zero):
if !breaker.Allow() { return ErrCircuitOpen }
err := retrier.Do(ctx, op)
breaker.Record(err) // record only the final outcome
return err
This way, a single user-visible failure increments the breaker by one regardless of retry count.
Open-circuit retry¶
When the breaker is open, the retrier should not retry. Trying to call into an open breaker is wasted work.
Better: the breaker's Allow is checked in the retry loop, and ErrCircuitOpen is treated as permanent (not retryable).
err := retrier.Do(ctx, func(ctx context.Context) error {
if !breaker.Allow() {
return retry.MarkPermanent(ErrCircuitOpen)
}
err := op(ctx)
breaker.Record(err)
return err
})
Library options¶
Popular Go circuit-breaker libraries:
- sony/gobreaker — the classic implementation.
- mercari/go-circuitbreaker — a modern alternative.
Both compose with custom retry policies. The professional file works a real integration.
Adaptive Retry Strategies¶
Fixed retry policies have a problem: they cannot tell when the dependency is healthy versus stressed. An adaptive strategy adjusts based on observed behaviour.
Netflix's concurrency-limits¶
The concurrency-limits library (Java; Go ports exist) uses TCP-like dynamics: each client tracks observed RTT and concurrency, and adjusts a per-server concurrency limit dynamically.
Algorithm sketch:
- Start with a low concurrency limit.
- Probe higher (TCP's congestion avoidance).
- On failure or rising latency, reduce the limit (TCP's congestion-window cut).
- The result: each client autonomously discovers the right level.
Adaptive concurrency limits are not strictly a retry strategy, but they bound retry traffic effectively: if the dependency is stressed, the limit drops, so retries are denied.
Google's adaptive throttling (SRE book)¶
Each client tracks its own ratio of accepts to total requests. If the ratio drops below a threshold (e.g. 50%), the client probabilistically drops some of its outgoing requests before even sending them.
P(reject) = max(0, (requests - K × accepts) / (requests + 1))
where K is a tuning constant (typically 2). Before sending a request, the client drops it locally with probability P(reject) — when the client is being rejected a lot, it self-throttles.
This works because if many clients all do this independently, they coordinate without communicating. Each client backs off proportionally; the aggregate load drops.
Adaptive retry timeouts¶
Another adaptive idea: track the p99 latency of successful requests. Set the per-call timeout to 2 * p99. If p99 rises, timeouts loosen. If p99 drops, timeouts tighten.
This avoids the common bug of "the timeout is too short during normal operation but too long during stress".
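A sketch of the idea: keep a sliding window of observed latencies and derive the per-call timeout from it. The window size, the 10-sample minimum, and the 2× multiplier are illustrative assumptions.
package timeoutx

import (
    "sort"
    "sync"
    "time"
)

type AdaptiveTimeout struct {
    mu      sync.Mutex
    samples []time.Duration
    max     int
}

func New(window int) *AdaptiveTimeout { return &AdaptiveTimeout{max: window} }

// Observe records the latency of a successful request.
func (a *AdaptiveTimeout) Observe(d time.Duration) {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.samples = append(a.samples, d)
    if len(a.samples) > a.max {
        a.samples = a.samples[1:]
    }
}

// Timeout returns 2× the current p99, or the fallback until enough samples exist.
func (a *AdaptiveTimeout) Timeout(fallback time.Duration) time.Duration {
    a.mu.Lock()
    defer a.mu.Unlock()
    if len(a.samples) < 10 {
        return fallback
    }
    sorted := append([]time.Duration(nil), a.samples...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    p99 := sorted[len(sorted)*99/100]
    return 2 * p99
}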
Retry Budgets at Scale¶
A retry budget configured at 10% (i.e. retries should be no more than 10% of total traffic) is a common Netflix and Google approach.
Calculating the budget¶
If your service does 1000 RPS, the budget is 100 RPS of retries.
golang.org/x/time/rate.Limiter with rate = 100, burst = 200 implements this. Each retry consumes a token.
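A sketch of the 10% budget in front of a three-attempt loop; the numbers mirror the example above, and the backoff sleep is omitted for brevity.
package budgeted

import (
    "context"

    "golang.org/x/time/rate"
)

// 100 retries per second with a burst of 200 — a 10% budget for ~1000 RPS.
var retryBudget = rate.NewLimiter(100, 200)

func doWithBudget(ctx context.Context, op func(context.Context) error) error {
    err := op(ctx) // the first attempt never consumes budget
    for attempt := 1; attempt < 3 && err != nil; attempt++ {
        if !retryBudget.Allow() {
            return err // budget exhausted: surface the original error
        }
        err = op(ctx)
    }
    return err
}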
Per-server vs global¶
A global retry budget at the load-balancer level is harder to implement but more robust. A per-server budget is easier but each server has a small share.
Most companies do per-server, calibrated such that aggregate replicas * per_server_budget equals the global target.
When to deny¶
Two options when the budget is exhausted:
- Deny new retries. The original request still goes through; only retries are blocked.
- Deny new requests entirely. When retries pile up, the system is in trouble; backpressure all requests.
Option 1 is more common. Option 2 is more aggressive but safer.
Observability¶
Track:
- Retry rate (retries / second).
- Retry ratio (retries / total requests).
- Budget-denial rate.
- Per-dependency retry ratios.
When the budget-denial rate is non-zero, you have a real problem — the dependency is failing, the budget is doing its job. Alert on this.
The Google SRE Approach¶
Google's SRE book chapter "Handling Overload" describes their retry philosophy in detail. Key points:
- Retry only at the highest layer. Inner services do not retry. This avoids amplification.
- Use adaptive throttling at the client. When the server is overwhelmed, each client throttles independently using the formula above.
- Reject load early. Better to fail fast than queue indefinitely.
- Use deadline propagation. Every RPC has a deadline; downstream services respect it.
The combination is powerful: retries are bounded to one tier, throttling adjusts dynamically, and deadlines prevent zombie work.
In Go specifically, this maps to:
- gRPC retry policy at the edge service.
- rate.Limiter for client-side budgets.
- context.Context for deadline propagation.
- concurrency-limits-go (or similar) for adaptive concurrency.
The professional file integrates all of these.
Code Examples¶
Example 1: deadline-propagating HTTP client¶
package httpclient
import (
"context"
"net/http"
"strconv"
"time"
)
type Client struct {
HTTP *http.Client
}
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
// Inject deadline header.
if deadline, ok := ctx.Deadline(); ok {
req.Header.Set("X-Deadline-Unix-Ms", strconv.FormatInt(deadline.UnixMilli(), 10))
}
return c.HTTP.Do(req.WithContext(ctx))
}
func ParseDeadline(r *http.Request) (time.Time, bool) {
h := r.Header.Get("X-Deadline-Unix-Ms")
if h == "" {
return time.Time{}, false
}
ms, err := strconv.ParseInt(h, 10, 64)
if err != nil {
return time.Time{}, false
}
return time.UnixMilli(ms), true
}
// Server middleware.
func WithDeadline(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
deadline, ok := ParseDeadline(r)
if ok {
ctx, cancel := context.WithDeadline(r.Context(), deadline)
defer cancel()
r = r.WithContext(ctx)
}
next.ServeHTTP(w, r)
})
}
This is the deadline-propagation pattern in HTTP. Every service propagates the header; every server reads it; every retry respects it.
Example 2: hedged request¶
package hedge
import (
"context"
"errors"
"sync"
"time"
)
type Hedger struct {
Delay time.Duration
Targets []string
Caller func(ctx context.Context, target string) (any, error)
}
func (h *Hedger) Do(ctx context.Context) (any, error) {
type result struct {
val any
err error
}
out := make(chan result, len(h.Targets))
ctx, cancel := context.WithCancel(ctx)
var wg sync.WaitGroup
// Defer order matters: cancel must run before wg.Wait so the losing goroutines
// unblock immediately instead of delaying the winner's return.
defer wg.Wait()
defer cancel()
for i, t := range h.Targets {
wg.Add(1)
go func(i int, target string) {
defer wg.Done()
if i > 0 {
tm := time.NewTimer(h.Delay * time.Duration(i))
select {
case <-tm.C:
case <-ctx.Done():
tm.Stop()
return
}
}
val, err := h.Caller(ctx, target)
select {
case out <- result{val, err}:
case <-ctx.Done():
}
}(i, t)
}
var lastErr error
for i := 0; i < len(h.Targets); i++ {
select {
case r := <-out:
if r.err == nil {
return r.val, nil
}
lastErr = r.err
case <-ctx.Done():
return nil, ctx.Err()
}
}
if lastErr == nil {
lastErr = errors.New("no results")
}
return nil, lastErr
}
Reads almost like the middle-level retrier but with parallel attempts at staggered intervals.
Example 3: an adaptive retry budget¶
package budget
import (
"context"
"math/rand"
"sync"
"sync/atomic"
"time"
)
type Adaptive struct {
mu sync.Mutex
accepts int64
requests int64
K float64
WindowReset time.Duration
lastReset time.Time
}
func New(K float64, window time.Duration) *Adaptive {
return &Adaptive{K: K, WindowReset: window, lastReset: time.Now()}
}
func (a *Adaptive) Record(success bool) {
atomic.AddInt64(&a.requests, 1)
if success {
atomic.AddInt64(&a.accepts, 1)
}
}
func (a *Adaptive) Allow(ctx context.Context) bool {
a.mu.Lock()
defer a.mu.Unlock()
if time.Since(a.lastReset) > a.WindowReset {
atomic.StoreInt64(&a.accepts, 0)
atomic.StoreInt64(&a.requests, 0)
a.lastReset = time.Now()
}
requests := atomic.LoadInt64(&a.requests)
accepts := atomic.LoadInt64(&a.accepts)
if requests == 0 {
return true
}
// Google SRE formula: reject with probability (requests - K*accepts) / (requests + 1).
rejectProb := (float64(requests) - a.K*float64(accepts)) / (float64(requests) + 1)
if rejectProb <= 0 {
return true
}
// Probabilistic admission: drop this request with probability rejectProb.
return rand.Float64() >= rejectProb
}
This is the Google adaptive-throttling formula. Each call records success or failure; admission probability depends on the recent ratio.
Example 4: retry with circuit breaker¶
package retrybreaker
import (
"context"
"errors"
"time"
"github.com/sony/gobreaker"
"yourmodule/retry"
)
type Client struct {
Breaker *gobreaker.CircuitBreaker
Retrier *retry.Retrier
}
func New(breakerName string, p retry.Policy) *Client {
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: breakerName,
MaxRequests: 5,
Interval: 60 * time.Second,
Timeout: 30 * time.Second,
ReadyToTrip: func(counts gobreaker.Counts) bool {
return counts.ConsecutiveFailures > 10
},
})
return &Client{Breaker: cb, Retrier: retry.New(p)}
}
func (c *Client) Do(ctx context.Context, op func(context.Context) error) error {
return c.Retrier.Do(ctx, func(ctx context.Context) error {
_, err := c.Breaker.Execute(func() (interface{}, error) {
return nil, op(ctx)
})
if errors.Is(err, gobreaker.ErrOpenState) {
return retry.MarkPermanent(err)
}
return err
})
}
The breaker rejects when open. The retrier sees ErrOpenState as permanent, so it does not retry.
Architecture Patterns¶
Pattern 1: edge-only retries¶
Diagram:
user ──→ edge (retries ×3) ──→ A (no retry) ──→ B (no retry) ──→ C (no retry)
Only the edge service retries. Internal services propagate errors with deadlines. Total retry amplification: linear (no multiplication).
Pattern 2: bulkhead-per-dependency¶
A failure in payments cannot consume capacity meant for DB. Per-dependency concurrency limits.
Pattern 3: combined retry + circuit-breaker + bulkhead + budget¶
All four protections compose. Bulkhead caps concurrency; breaker fails fast on known-bad; retrier handles transient; budget caps retry rate.
This is what Netflix's Hystrix (now archived but influential) implemented. Modern equivalents: Resilience4j (Java), go-resiliency, cenkalti/backoff + a few extras.
Pattern 4: deadline-clipped fan-out¶
parent (deadline 5s)
├── child A (deadline 4s) → ...
├── child B (deadline 4s) → ...
└── child C (deadline 4s) → ...
Each child gets a portion of the parent's budget. Tail children do not waste time after parent has timed out.
Performance Considerations¶
At scale, retry-policy code itself can become a hot path. A few notes:
Allocations¶
Each time.NewTimer allocates. In a tight loop, reuse with Reset. time.Sleep does not allocate.
A sync.Pool for *rand.Rand avoids global-rand contention.
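A sketch of the pooled-generator pattern for codebases that have not yet moved to math/rand/v2 (package name and seeding are illustrative):
package jitterpool

import (
    "math/rand"
    "sync"
    "time"
)

var rngPool = sync.Pool{
    New: func() any {
        return rand.New(rand.NewSource(time.Now().UnixNano()))
    },
}

// jitter draws from a pooled *rand.Rand instead of the locked global source.
func jitter(max time.Duration) time.Duration {
    if max <= 0 {
        return 0
    }
    r := rngPool.Get().(*rand.Rand)
    defer rngPool.Put(r)
    return time.Duration(r.Int63n(int64(max)))
}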
Locking¶
The global math/rand (Go 1.20+) uses a single internal lock. With many goroutines all hitting it, contention can show up at >100k QPS.
Solutions:
- math/rand/v2 is lock-free for top-level functions.
- Per-goroutine *rand.Rand via sync.Pool.
- Atomic counters where possible.
Context allocation¶
context.WithTimeout allocates. In a hot retry loop, this matters.
Sometimes you can hoist the ctx, cancel := context.WithTimeout(...) outside the loop and pass the same ctx to all attempts. The deadline applies to all attempts collectively.
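A sketch of the hoisted variant — one deadline and one allocation for the whole attempt sequence (parent and op are assumed to exist):
// One deadline covers all attempts; allocated once, not per attempt.
ctx, cancel := context.WithTimeout(parent, 2*time.Second)
defer cancel()

var err error
for attempt := 0; attempt < 3; attempt++ {
    if err = op(ctx); err == nil {
        break
    }
}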
Hot-path observability¶
If your retry policy emits a metric on every attempt, that metric backend better be fast. A locked Prometheus counter is fine; a synchronous remote-write is not.
Edge Cases and Pitfalls¶
- Clock skew across services. If services have different clocks, an absolute deadline header (X-Deadline-Unix-Ms) is misleading. Use relative timeouts (X-Timeout-Ms) and recompute the deadline at each hop.
- Time zone issues in HTTP date parsing. Retry-After can be either an HTTP-date (UTC) or seconds.
- gRPC deadline propagation is per-RPC. A streaming RPC with WithTimeout(5*time.Second) cancels mid-stream after 5s; the stream is left half-finished.
- Idempotency keys that include timestamps. If your client includes a timestamp in the key, retries with new timestamps lose deduplication.
- rand.Int63n panics on n <= 0. Always guard.
- Breaker false positives. A breaker that opens on transient failures denies retries that would have succeeded. Tune thresholds carefully.
- Retry-induced cascading failure. Confirmed: retries can make a problem worse. Always measure during incidents.
Common Mistakes¶
- Retrying at every tier (multiplicative amplification).
- No deadline propagation.
- Wrong idempotency key scope (per-attempt instead of per-operation).
- Recording every retry into the breaker (opens too fast).
- Mixing time.After into a loop.
- Per-replica budget too high; aggregate too large.
- Forgetting to honour Retry-After.
- Sleeping past the parent deadline.
- Using crypto/rand for jitter.
- Not measuring retry rate; flying blind.
Tricky Questions¶
Q1. A user request has 5-second deadline. Edge service retries 3 times. Each call takes ~1 second. What is the worst-case wall-clock at the dependency? A: 3 calls × 1s = 3s of dependency time, plus retry sleeps (~3s with base=1s). Approximately at the deadline boundary.
Q2. Why is "retry at every tier" bad? A: Multiplicative amplification. 3 retries × 3 tiers = 27 inner calls per user request.
Q3. Two requests with the same idempotency key arrive simultaneously. What happens? A: Race. Need a lock keyed on the idempotency key to ensure only one processes.
Q4. What is the difference between hedging and retry? A: Hedging sends duplicates before the original times out, for latency. Retry waits for failure, for correctness.
Q5. Why does Google's adaptive throttling work without coordination between clients? A: Each client independently observes its own accept ratio and throttles proportionally. The aggregate effect approximates global coordination.
Q6. If a breaker is open, should the retrier retry? A: No. Treat circuit-open as permanent.
Q7. What is a retry budget's failure mode? A: A persistent failure causes the budget to deny all retries. Real failures are no longer retried — which is correct, but operators should be alerted.
Q8. How should an HTTP client honour Retry-After? A: Use max(Retry-After, computed_backoff). The header is a server-suggested floor.
Cheat Sheet¶
Architecture rules:
- retry at edge only
- propagate deadlines everywhere
- idempotency keys for non-idempotent operations
- bulkheads per dependency
- retry budget + breaker + adaptive limit
Multiplicative amplification:
total_calls = (1 + f + f² + … + f^r)^n_tiers
3 tiers × 3 retries × 50% failure rate ≈ 6.6× amplification
Deadline propagation:
HTTP: X-Deadline-Unix-Ms or Grpc-Timeout
gRPC: built into metadata
Each hop: WithDeadline(parent, hopDeadline)
Idempotency:
- client generates key once, reuses on retry
- server caches response by key, returns on duplicate
- 24h TTL typical
- lock during in-flight processing
Hedging:
- send to replica 2 after delay
- first wins; cancel the other
- hedging budget bounds duplicate rate
Self-Assessment Checklist¶
- I can explain thundering herd with a queueing-theory argument.
- I can compute multiplicative amplification across tiers.
- I can architect a system with edge-only retries and deadline propagation.
- I understand idempotency keys, deduplication windows, and the race on simultaneous duplicates.
- I know when to use hedging vs retry.
- I can describe Google's adaptive throttling formula.
- I can compose a retry policy with a circuit breaker without breaker-thrashing.
- I know why exactly-once is impossible without consensus, and what at-least-once + idempotency gives instead.
Summary¶
At senior level, retry is no longer a coding decision; it is an architecture decision. The patterns that matter are: edge-only retries, deadline propagation, idempotency keys, bulkheads, retry budgets, circuit breakers, hedging, and adaptive throttling. Composed correctly, they protect dependencies from retry storms while delivering reliable user-facing requests.
The math behind it — queueing theory, multiplicative amplification, queue-drain dynamics — turns intuitions into estimates. A senior engineer should be able to predict the load impact of a retry policy change without rolling it out.
The professional file translates this architectural picture into a concrete Go codebase, using libraries like cenkalti/backoff, sony/gobreaker, golang.org/x/time/rate, and gRPC interceptors.
What You Can Build¶
- A service-mesh sidecar that propagates deadlines and budgets.
- A multi-region payments client with idempotency keys and hedging.
- A control-plane that adjusts retry budgets per dependency based on observed failure rates.
- A monitoring dashboard that shows retry amplification across tiers.
Further Reading¶
- AWS Architecture Blog, "Exponential Backoff And Jitter" (Marc Brooker, 2015).
- Google SRE Book, Chapter 21 "Handling Overload" — the canonical reference on adaptive throttling.
- Google SRE Book, Chapter 22 "Addressing Cascading Failures".
- "The Tail At Scale" (Dean & Barroso, 2013) — the original hedging-request paper.
- Netflix concurrency-limits library README.
- Stripe's "Designing Robust and Predictable APIs with Idempotency".
- Marc Brooker's blog posts on retries.
Related Topics¶
- Circuit breakers — pair with backoff for resilient clients.
- Rate limiting — companion topic.
- Deadline propagation — also in this section.
- Idempotency patterns — design topic in API/REST sections.
- Queueing theory — background math.
Diagrams and Visual Aids¶
Multiplicative amplification¶
user
│ (1 request)
▼
A ── 3 retries ──→ B ── 3 retries ──→ C ── 3 retries ──→ D
(9 retries) (27 calls)
A 3-tier system with 3 retries per tier produces 27 calls to D per user request.
Edge-only retries¶
Only the edge retries. Total D calls per user: 3 (one per edge attempt).
Deadline propagation¶
user (deadline 5s)
│
▼ deadline header: 5s
edge (4.9s remaining)
│
▼ deadline header: 4.9s
A (4.8s remaining)
│
▼ deadline header: 4.8s
B (4.7s remaining)
Each hop trims the deadline and forwards it. When a hop has used most of the budget, downstream hops give up early.
Hedged request¶
client
│
├── replica 1: send at t=0
│
└── replica 2: send at t=d (only if 1 not done)
│
winner cancels loser
The second replica is fired only if the first is slow. First response wins.
This is the architecture of a senior-level retry system. The professional file translates it into code.
Appendix A: Queueing Theory Refresher¶
Senior-level retry analysis depends on queueing theory. Here is the minimum you need.
M/M/1 queue¶
The simplest queue model. Single server, exponentially distributed arrival times, exponentially distributed service times.
Parameters:
- λ — arrival rate (requests per second).
- μ — service rate (requests served per second).
- ρ = λ/μ — utilisation.
Key results:
- Mean queue length: Lq = ρ² / (1 - ρ).
- Mean wait time: Wq = Lq / λ = ρ / (μ (1 - ρ)).
- As ρ → 1, both go to infinity.
The crucial insight: as utilisation approaches 1, latency does not grow linearly — it grows hyperbolically. At ρ = 0.5, Wq = 1 / μ. At ρ = 0.9, Wq = 9 / μ. At ρ = 0.99, Wq = 99 / μ.
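The hyperbolic blow-up is easy to see numerically; a tiny sketch:
package main

import "fmt"

func main() {
    // Wq = ρ / (μ (1 - ρ)); print the multiplier of 1/μ for a few utilisations.
    for _, rho := range []float64{0.5, 0.7, 0.9, 0.99} {
        fmt.Printf("rho=%.2f  Wq=%.1f/mu\n", rho, rho/(1-rho))
    }
}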
Why retries push you toward ρ = 1¶
In a stable system, ρ < 1. Retries add to λ. If retries add 30% to load, ρ rises by 30%. If the system was at ρ = 0.7, it goes to ρ = 0.91. Latency more than triples.
If the original was a transient blip and you retry, you have amplified the latency problem.
M/D/1 and M/G/1¶
More realistic models use deterministic or general service-time distributions. Results are similar; the key insight (latency grows hyperbolically near saturation) is universal.
Implications for retry policy¶
- Retries are most expensive when the system is most stressed (the high-ρ regime, where adding load multiplies latency).
- Backing off gives the system time to drain queues (ρ drops, latency stabilises).
- Jitter spreads the retry impulse across the queue's drain time, avoiding peak overload.
This is the math that motivates exponential backoff + jitter + adaptive throttling.
Appendix B: The TCP-Like View of Retry¶
TCP's congestion control is a retry-with-backoff system. Specifically:
- Slow start: exponentially grow window size after success.
- Congestion avoidance: linearly grow window after a threshold.
- Fast retransmit: retry immediately on triple-ACK.
- Multiplicative decrease: halve window on packet loss.
The same ideas apply to application-level retries:
- Adaptive concurrency: grow concurrency limit on success, shrink on failure.
- Probing: periodically attempt a higher rate to test capacity.
- Backoff: slow growth after recovering from failure.
Netflix's concurrency-limits library explicitly draws on TCP's algorithm. The intuition: TCP has spent decades refining congestion control; we should reuse the math.
TCP Vegas-style retry¶
A retry strategy that monitors RTT (round-trip time) and slows down when RTT rises (indicating queueing):
type VegasRetrier struct {
baseRTT time.Duration
target int // target queue depth
limit int
}
func (v *VegasRetrier) AfterCall(rtt time.Duration) {
// time.Duration cannot be multiplied by an untyped float; convert explicitly.
if rtt < time.Duration(float64(v.baseRTT)*1.1) {
v.limit++ // RTT near baseline: grow
} else if rtt > time.Duration(float64(v.baseRTT)*1.5) {
v.limit-- // RTT well above baseline: queueing detected, shrink
}
}
This is sketch, not production code. But it shows how RTT can inform retry-budget adjustment.
Appendix C: Real Postmortems and What They Teach¶
Studying postmortems is the fastest way to internalise retry hazards. Three classics:
AWS S3 outage 2017¶
A typo during operational work disabled too many S3 servers in us-east-1. Clients across the world saw 5xx. They retried. The retry traffic delayed recovery for hours.
Lesson: retry policy must be tuned for the worst case — not the median. A region outage is in the tail of the distribution but is real.
GitHub October 2018 outage¶
A network partition caused a database failover. Replication caught up incorrectly. The "fix" took 24 hours. During that time, many clients retried.
Lesson: retries that fire forever (no MaxElapsedTime) generate noise for hours after the user has given up. Always cap.
Stripe payment delay incident¶
A spike in legitimate traffic plus a database hiccup caused payments to fail. Clients retried with idempotency keys. Each retry was a real call into the database, because the keys were not yet cached at the retry-side. Database load doubled.
Lesson: Idempotency keys cache responses at the server; they do not reduce server load during initial retries. Combine with client-side circuit breakers.
Common patterns across postmortems¶
- The original failure was small.
- Retries amplified it.
- The system stayed broken longer than the original failure would have lasted.
- Disabling retries (temporarily) helped recovery.
This pattern repeats. It is why retries need budgets, breakers, and adaptive limits.
Appendix D: Why "Disable Retries" Is Sometimes The Fix¶
During an active outage, an unusual fix is to disable client retries. Reason: retries are amplifying the load.
Many companies have a "kill switch" in their config that sets maxAttempts = 0 (or 1, no retries). When the SRE team sees retry-induced overload, they flip the switch.
For this to work:
- The retry policy must be configurable at runtime (via etcd, Consul, etc.).
- Operators must know how to flip the switch.
- The system must degrade gracefully when retries are off (users see errors, not a melted service).
Designing for the kill-switch is a senior-level discipline.
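A minimal sketch of such a switch — the config watcher (etcd, Consul, …) is assumed to exist and simply calls SetMaxAttempts when the control-plane value changes:
package killswitch

import "sync/atomic"

// maxAttempts is read on every retry decision; 1 means "no retries".
var maxAttempts atomic.Int32

func init() { maxAttempts.Store(3) }

// SetMaxAttempts is called by the config watcher when the value changes.
func SetMaxAttempts(n int32) { maxAttempts.Store(n) }

// MaxAttempts is consulted by the retry loop on each operation.
func MaxAttempts() int { return int(maxAttempts.Load()) }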
Appendix E: The "Jitter Then Retry After" Pattern¶
A nuance: when a server returns 429 Retry-After: 30, the client should both honour the header and add jitter. Otherwise all 429'd clients retry at exactly t+30s.
delay, ok := parseRetryAfter(resp)
if ok {
// Add up to 20% jitter on top of Retry-After (guard rand.Int63n against n <= 0).
if fifth := int64(delay) / 5; fifth > 0 {
delay += time.Duration(rand.Int63n(fifth))
}
}
This is a subtle but important pattern: server-suggested delays are also susceptible to thundering herd. The Retry-After header is a coordination signal; jitter desynchronises the response.
Appendix F: Distributed Idempotency Storage¶
Idempotency keys at scale need fast storage. Options:
Redis with TTL¶
- Pros: fast, simple, scales horizontally.
- Cons: Redis is in-memory; failure means losing in-progress idempotency state.
DynamoDB¶
- Pros: persistent, scales well.
- Cons: per-request cost.
Database table with PK = idempotency key¶
INSERT INTO idempotency_keys (key, response, expires_at)
VALUES ($1, $2, NOW() + INTERVAL '24 hours')
ON CONFLICT (key) DO NOTHING;
If insert succeeds: process. If it fails: read existing.
- Pros: ACID, uses existing infrastructure.
- Cons: load on the primary database.
For most companies, Redis + DynamoDB (or equivalent) is the way. Redis is the cache; DynamoDB is the source of truth.
Cleanup¶
Idempotency entries should be cleaned after their TTL. Redis handles this automatically with EX. Other stores need a sweeper.
If you do not clean, the store grows forever. For high-volume APIs this matters.
Appendix G: Idempotency Key Lifecycle¶
A complete lifecycle picture:
- Generate. Client generates uuid.NewString() once and stores it.
- Send. Client includes an Idempotency-Key: <uuid> header.
- Lookup. Server checks the idempotency store.
- First time: server claims the key (atomic SETNX), processes the request, stores the response, releases the claim.
- Subsequent retry (same key): server returns the cached response.
- TTL: After 24 hours, the key is forgotten.
- Client gives up: the client must not generate a new key for the same operation. If it does, the server cannot deduplicate.
Failure modes:
- Race during processing: two clients with the same key arrive simultaneously. The first to claim wins; the second waits or fails with 409.
- Process crash: the claim is released only after processing. If the process crashes mid-processing, the claim expires (TTL on the claim). The next retry can re-attempt.
- Server-side bug: Returns a different response on retry with the same key. Server bug; investigate.
Client-side hash as a fallback¶
What if the client does not include an idempotency key? Some servers compute sha256(method + url + body) as an implicit key. This catches exact-duplicate retries but not equivalent-semantic retries (different timestamps in the body, different request IDs, etc.).
Explicit keys are better. Implicit hashes are a "best effort" fallback.
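A sketch of the implicit-key computation (only worth doing when the body is deterministic; the package name is illustrative):
package implicitkey

import (
    "crypto/sha256"
    "encoding/hex"
)

// fingerprint derives a best-effort idempotency key from the request itself.
// It only deduplicates byte-identical retries.
func fingerprint(method, url string, body []byte) string {
    h := sha256.New()
    h.Write([]byte(method))
    h.Write([]byte(url))
    h.Write(body)
    return hex.EncodeToString(h.Sum(nil))
}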
Appendix H: gRPC Retry Policy in Practice¶
gRPC has built-in retry support via service config:
{
"methodConfig": [{
"name": [{"service": "myco.Service"}],
"retryPolicy": {
"maxAttempts": 4,
"initialBackoff": "0.1s",
"maxBackoff": "5s",
"backoffMultiplier": 2,
"retryableStatusCodes": ["UNAVAILABLE"]
}
}]
}
The client reads this config (from a control plane, file, or DNS TXT) and applies it transparently.
Retry throttling¶
gRPC's retry policy includes a retry throttling mechanism: a token bucket per channel. Each successful call adds tokens; each failed retry decrements. When tokens drop below threshold, retries are temporarily disabled.
The config:
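A sketch of the stanza — it sits at the top level of the service config, alongside methodConfig (field names per the gRPC service-config spec; values illustrative):
{
  "methodConfig": [ ... ],
  "retryThrottling": {
    "maxTokens": 10,
    "tokenRatio": 0.1
  }
}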
This is gRPC's adaptive retry budget. It is a per-channel mechanism, so each connection adapts independently.
Deadline propagation¶
gRPC's deadline is part of the metadata. A client's context.WithTimeout(parent, 5*time.Second) is automatically converted to grpc-timeout metadata. The server reads it and applies a context with the same deadline.
Downstream calls (server → other gRPC services) inherit the deadline. So a 5-second deadline at the entry propagates automatically.
This is one reason gRPC is popular for distributed systems: deadline propagation is built-in.
Idempotency in gRPC¶
gRPC has no built-in idempotency keys; that is application-level. Conventions:
- An Idempotency-Key header in metadata.
- A request_id field in the proto definition.
- Server-side deduplication via Redis/database.
When gRPC retry is wrong¶
gRPC's retry policy is per RPC. It does not span streaming RPCs or composite operations. For those, you implement retry at the application level.
Streaming RPCs are particularly tricky: a stream can fail mid-way. Retrying means restarting the stream from the beginning. If state was sent already, you may double-send. Use sequence numbers or idempotent ingest.
Appendix I: Service Mesh Retries¶
A service mesh (Istio, Linkerd, Consul) deploys a sidecar proxy alongside each service. The sidecar handles network concerns: TLS, routing, retries, circuit breaking.
When the sidecar retries, your application code does not. This is a clean separation:
- App: business logic, no retry.
- Sidecar: retry policy.
Configuration in Istio:
apiVersion: networking.istio.io/v1
kind: VirtualService
spec:
http:
- retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure
The mesh handles retries transparently. The app sees a successful or final-failure response.
Mesh vs library retries¶
If you have a mesh, do not also retry in the application. Otherwise: 3 mesh attempts × 3 app attempts = up to 9 calls per request. Pick one layer.
Choose mesh-level when:
- You want policy changes without redeploying apps.
- You have a centralised platform team.
Choose library-level when:
- You need application-specific logic (e.g. retry only on certain error codes).
- You do not have a mesh.
Mesh deadline propagation¶
Service meshes propagate deadlines via standard HTTP/gRPC headers. You get this for free.
Appendix J: Backpressure as an Alternative¶
Sometimes retry is the wrong abstraction. Backpressure — explicitly slowing down upstream traffic — is more direct.
Idea: instead of "fail and retry", "communicate to the upstream that you cannot keep up". The upstream slows down its rate.
Mechanisms:
- HTTP 429 with Retry-After. Explicit backpressure.
- gRPC RESOURCE_EXHAUSTED. Same idea.
- Reactive streams. Library-level backpressure (Project Reactor, RxJava).
- Window-based protocols. TCP, HTTP/2, etc.
For a system where the bottleneck is well-known, backpressure is more efficient than retry. The upstream slows; the downstream catches up; equilibrium restored.
For a system with unpredictable failure modes, retry is necessary because backpressure cannot represent "transient blip".
Combined: backpressure for predictable overload, retry for unpredictable failures.
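As a sketch of the server side of this contract, here is an HTTP middleware that sheds load with 429 + Retry-After instead of letting requests queue; the function name, ceiling, and Retry-After value are illustrative:
import (
	"net/http"
	"sync/atomic"
)

// shedLoad rejects requests beyond maxInFlight with explicit backpressure
// rather than queueing them behind a saturated handler.
func shedLoad(maxInFlight int64, next http.Handler) http.Handler {
	var inFlight int64
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inFlight, 1) > maxInFlight {
			atomic.AddInt64(&inFlight, -1)
			w.Header().Set("Retry-After", "1") // tell well-behaved clients when to come back
			http.Error(w, "overloaded", http.StatusTooManyRequests)
			return
		}
		defer atomic.AddInt64(&inFlight, -1)
		next.ServeHTTP(w, r)
	})
}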
Appendix K: Concurrency-Limits-Go Walkthrough¶
Netflix's concurrency-limits has a Go port. A brief tour.
Concept¶
Each client maintains a limit — the maximum number of in-flight requests it will send. The limit adjusts based on observed RTT:
- RTT close to baseline: increase the limit (probe higher).
- RTT rising: decrease the limit (back off).
This is essentially TCP's algorithm applied to RPC.
Use¶
import "github.com/platinummonkey/go-concurrency-limits/limit"
import "github.com/platinummonkey/go-concurrency-limits/limiter"
vegas := limit.NewVegasLimit("my-service", 20, /*registry*/ nil, /*opts*/...)
lim := limiter.NewDefaultLimiter(vegas, time.Second, 1*time.Second, 1*time.Second, 10, /*strategy*/ nil)
listener, ok := lim.Acquire(ctx)
if !ok {
// limit reached; do not call
return ErrThrottled
}
err := op(ctx)
listener.OnSuccess() // or OnDropped / OnIgnore
The pattern: acquire a slot, do the work, notify the limiter.
If the limit is reached, the call is throttled — before it enters the dependency. The dependency is protected.
Composing with retry¶
Adaptive concurrency caps in-flight requests. Retry resends failed requests. They compose:
err := retrier.Do(ctx, func(ctx context.Context) error {
listener, ok := lim.Acquire(ctx)
if !ok {
return retry.MarkPermanent(ErrThrottled)
}
err := op(ctx)
if err == nil {
listener.OnSuccess()
} else {
listener.OnDropped()
}
return err
})
The limiter bounds concurrency; the retrier handles transient failures. The dependency is protected on both axes.
Appendix L: The Cost of Wrong Retries¶
Some failure modes from wrong retry policy:
- Retry storm: Total dependency load spikes 10× during incidents.
- Latency explosion: Tail latency goes from 100ms to 30s because retries are stacked.
- Cost explosion: Paid APIs (AWS, Stripe, Twilio) charge per call; retries multiply the bill.
- Duplicate side effects: Non-idempotent retries create duplicates — extra orders, double charges.
- Distributed-system inconsistency: Retried writes may go to different replicas, creating split-brain.
Each of these is a real production incident pattern. A senior engineer recognises them before they happen.
Appendix M: The Story of a Real Incident (Hypothetical Composite)¶
A reconstruction of a typical retry-induced incident, illustrating the dynamics.
Setup¶
- Service A is a public API.
- Service B is an internal microservice that A calls.
- Service C is a database that B calls.
- Each tier retries 3 times with full jitter, 100ms base, 5s cap.
Day 1, 14:00¶
A network blip causes C to drop 50% of B's connections for 30 seconds.
B, as C's client, sees errors and retries. With 3 retries each, B sends up to 4 calls per request to C.
14:01¶
C recovers. But B is still sending 4 calls per request (the retries from the blip are still in flight). C, just recovered, is hit with 4× normal load.
C cannot handle 4× load. Latency rises. Some calls time out. C's ρ approaches 1.
14:02¶
B's calls to C are now timing out. B's retry budget is consumed. B's clients (which are A's calls to B) see errors and retry.
A is now sending 4 calls per user request to B.
14:03¶
Net effect: 16 calls to C per user request. Latency is in the seconds. Some users see errors.
14:05¶
SRE team responds. They flip the retry-disabled switch. Suddenly only 1 call per request reaches C. Load drops. C catches up. B catches up. A catches up.
14:10¶
Everything is normal. The original blip lasted 30 seconds; the incident lasted 10 minutes.
Postmortem findings¶
- The retries amplified a 30-second issue into a 10-minute outage.
- Lack of deadline propagation meant retries continued after users had given up.
- Lack of per-tier retry budget meant amplification was multiplicative.
- The kill switch was the only effective tool.
Action items¶
- Add retry budgets at each tier.
- Add deadline propagation between services.
- Restrict retries to edge service (A) only.
- Add adaptive throttling at A's clients.
This is a textbook story. Variations of it have happened at every company that operates real distributed systems.
Appendix N: A Decision Framework for Retry Policy¶
Use this when designing retry for a new service.
Step 1: What is the operation?¶
- Read or write?
- Idempotent or not?
- User-facing or background?
- Internal or external dependency?
Step 2: Where will the retry live?¶
- Application code (specific to one call).
- HTTP client wrapper (all calls to a host).
- gRPC interceptor (all calls on a channel).
- Service mesh sidecar.
- API gateway / edge.
Generally, the more general the layer, the better: one policy covers many call sites. Avoid hand-rolled per-call retry loops.
Step 3: What is the failure mode?¶
- Transient network errors (always retry).
- Server overload (retry with caution; respect 429).
- Persistent server errors (do not retry; surface).
- Auth failures (do not retry without re-auth).
Step 4: What is the latency budget?¶
- User-facing: max 3-5 seconds total.
- Background: minutes are fine.
- Async job: hours are fine (with persistence).
Step 5: What is the failure tolerance?¶
- Critical writes: failure is not acceptable (use idempotency keys, retry generously).
- Read fan-out: some failure is acceptable (retry less, maybe hedge).
- Best-effort: failure is acceptable (1-2 retries).
Step 6: What is the dependency's capacity?¶
- Plenty of headroom: retries are cheap.
- At capacity: retries hurt; tune budget tight.
- Paid per-call: retries are expensive.
Step 7: What is the team's observability?¶
- Good: tune carefully, alert on retry rate.
- Poor: be conservative, lots of buffer.
Step 8: What is the operational maturity?¶
- New team: simple policy.
- Mature team: adaptive policy.
After answering these, you have a specific retry policy to write.
Appendix O: The Backpressure-Retry Hybrid¶
A pattern used by some sophisticated systems: combine backpressure with retry to handle both bounded and unbounded overload.
Idea¶
- The server returns 429 Retry-After: N when overloaded.
- The client respects the Retry-After (backpressure).
- If Retry-After is missing, the client uses local exponential backoff (retry).
So the server prefers to signal load; the client defaults to local backoff.
Implementation¶
func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
	// Note: for retries to be safe, a request with a body must be replayable (set req.GetBody).
	var lastErr error
	for attempt := 0; attempt < c.maxAttempts; attempt++ {
		resp, err := c.http.Do(req)
		if err == nil && resp.StatusCode != 429 && resp.StatusCode < 500 {
			return resp, nil // success or a non-retryable status
		}
		var d time.Duration
		if err != nil {
			lastErr = err
			d = c.fullJitter(attempt)
		} else {
			lastErr = fmt.Errorf("server status %d", resp.StatusCode)
			wait, hasWait := parseRetryAfter(resp)
			resp.Body.Close()
			if hasWait && wait > 0 {
				// Jitter the server-suggested delay by up to ~20% to desynchronise clients.
				d = wait + time.Duration(rand.Int63n(int64(wait)/5+1))
			} else {
				d = c.fullJitter(attempt)
			}
		}
		if attempt == c.maxAttempts-1 {
			break
		}
		if err := sleepCtx(ctx, d); err != nil {
			return nil, err
		}
	}
	return nil, lastErr
}
The server gets to suggest the backoff when it knows the right value. The client uses local backoff otherwise.
Appendix P: A Recap of Numbers To Remember¶
For senior-level conversations, internalise these:
- Latency growth near saturation: hyperbolic, 1 / (1 - ρ).
- Multiplicative amplification: product over tiers of 1 / (1 - f).
- Retry budget heuristic: 10% of total traffic.
- Idempotency window: 24 hours typical.
- Hedging budget: 1-5% of total traffic.
- Adaptive throttling formula: P_reject = max(0, (requests - K * accepts) / (requests + 1)) with K = 2.
- Default backoff: 100ms base, 5s cap, 3-5 attempts, full jitter.
- Default per-call timeout: match expected p99 latency, then double.
- Default total deadline: 5s for user-facing, more for background.
Quote these in design reviews.
Appendix Q: A Production-Adjacent Senior Retrier¶
A complete retrier with everything a senior engineer cares about: jitter, budget, breaker, deadline, idempotency, observability.
package retry
import (
"context"
"errors"
"fmt"
"math/rand"
"sync/atomic"
"time"
"github.com/sony/gobreaker"
"golang.org/x/time/rate"
)
// Policy is the parameter struct for Retrier.
type Policy struct {
MaxAttempts int
Base time.Duration
MaxDelay time.Duration
Budget *rate.Limiter
Breaker *gobreaker.CircuitBreaker
OnAttempt func(attempt int, err error)
OnGiveUp func(attempts int, err error)
OnBudgetDenied func(lastErr error)
OnBreakerDenied func()
}
// Stats captures runtime counters.
type Stats struct {
Attempts int64
Retries int64
Successes int64
GaveUp int64
BudgetDenied int64
BreakerDenied int64
ContextErrors int64
}
// Retrier composes a policy and statistics.
type Retrier struct {
Policy Policy
Stats Stats
}
// New builds a Retrier with sane defaults.
func New(p Policy) *Retrier {
if p.MaxAttempts <= 0 {
p.MaxAttempts = 3
}
if p.Base <= 0 {
p.Base = 100 * time.Millisecond
}
if p.MaxDelay <= 0 {
p.MaxDelay = 5 * time.Second
}
return &Retrier{Policy: p}
}
// Do runs op until success, terminal failure, or budget/breaker denial.
func (r *Retrier) Do(ctx context.Context, op func(context.Context) error) error {
var lastErr error
for attempt := 0; attempt < r.Policy.MaxAttempts; attempt++ {
atomic.AddInt64(&r.Stats.Attempts, 1)
if attempt > 0 {
atomic.AddInt64(&r.Stats.Retries, 1)
}
if err := ctx.Err(); err != nil {
atomic.AddInt64(&r.Stats.ContextErrors, 1)
return fmt.Errorf("retry context: %w", err)
}
if r.Policy.Breaker != nil {
if state := r.Policy.Breaker.State(); state == gobreaker.StateOpen {
atomic.AddInt64(&r.Stats.BreakerDenied, 1)
if r.Policy.OnBreakerDenied != nil {
r.Policy.OnBreakerDenied()
}
return fmt.Errorf("circuit open: %w", lastErr)
}
}
err := op(ctx)
if r.Policy.OnAttempt != nil {
r.Policy.OnAttempt(attempt, err)
}
if err == nil {
atomic.AddInt64(&r.Stats.Successes, 1)
return nil
}
var perm *Permanent
if errors.As(err, &perm) {
return perm.Err
}
lastErr = err
if attempt >= r.Policy.MaxAttempts-1 {
break
}
if r.Policy.Budget != nil && !r.Policy.Budget.Allow() {
atomic.AddInt64(&r.Stats.BudgetDenied, 1)
if r.Policy.OnBudgetDenied != nil {
r.Policy.OnBudgetDenied(lastErr)
}
return fmt.Errorf("%w: %v", ErrBudgetExhausted, lastErr)
}
delay := r.fullJitter(attempt)
if deadline, ok := ctx.Deadline(); ok {
remaining := time.Until(deadline)
if remaining < delay {
delay = remaining
}
}
if err := sleepCtx(ctx, delay); err != nil {
atomic.AddInt64(&r.Stats.ContextErrors, 1)
return err
}
}
atomic.AddInt64(&r.Stats.GaveUp, 1)
if r.Policy.OnGiveUp != nil {
r.Policy.OnGiveUp(r.Policy.MaxAttempts, lastErr)
}
return fmt.Errorf("after %d attempts: %w", r.Policy.MaxAttempts, lastErr)
}
func (r *Retrier) fullJitter(attempt int) time.Duration {
cap := r.Policy.Base * time.Duration(1<<attempt)
if cap > r.Policy.MaxDelay || cap < 0 {
cap = r.Policy.MaxDelay
}
if cap <= 0 {
return 0
}
return time.Duration(rand.Int63n(int64(cap)))
}
func sleepCtx(ctx context.Context, d time.Duration) error {
if d <= 0 {
return nil
}
t := time.NewTimer(d)
defer t.Stop()
select {
case <-t.C:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
// Permanent wraps an error to signal "do not retry".
type Permanent struct{ Err error }
func (p *Permanent) Error() string { return p.Err.Error() }
func (p *Permanent) Unwrap() error { return p.Err }
// MarkPermanent returns err wrapped so Do will not retry.
func MarkPermanent(err error) error { return &Permanent{Err: err} }
// ErrBudgetExhausted indicates the retry budget was empty.
var ErrBudgetExhausted = errors.New("retry budget exhausted")
This is the policy you would ship as internal/retry. With observability hooks (OnAttempt, OnGiveUp, etc.) you can plug in Prometheus, OpenTelemetry, structured logs.
Appendix R: Observability Hooks¶
At senior level, every retry decision should be observable. Recommended metrics:
retry_attempts_total{client, op} - counter
retry_successes_total{client, op} - counter
retry_giveups_total{client, op} - counter
retry_budget_denied_total{client, op} - counter
retry_breaker_denied_total{client, op} - counter
retry_attempt_at_success - histogram (1, 2, 3 ...)
retry_delay_seconds - histogram
From these you compute:
- Retry rate: retry_attempts_total - retry_successes_total - retry_giveups_total.
- Retry ratio: retry_attempts_total / total_requests.
- Average attempt count: histogram_average(retry_attempt_at_success).
Alert thresholds:
- Retry ratio > 5% — investigate.
- Budget-denial rate > 0 — incident.
- Average attempts > 1.5 — flakiness rising.
We will see Prometheus integration in the professional file.
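As a preview, a minimal sketch of wiring the Appendix Q hooks into prometheus/client_golang counters; metric and label names follow the list above, and retry.Policy refers to the Appendix Q package:
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	attempts = promauto.NewCounterVec(prometheus.CounterOpts{Name: "retry_attempts_total"}, []string{"client", "op"})
	giveups  = promauto.NewCounterVec(prometheus.CounterOpts{Name: "retry_giveups_total"}, []string{"client", "op"})
)

// instrument fills in the observability hooks of a retry.Policy for one client/op pair.
func instrument(client, op string, p retry.Policy) retry.Policy {
	p.OnAttempt = func(attempt int, err error) { attempts.WithLabelValues(client, op).Inc() }
	p.OnGiveUp = func(n int, err error) { giveups.WithLabelValues(client, op).Inc() }
	return p
}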
Appendix S: Tracing Across Retries¶
Distributed tracing (OpenTelemetry) needs special handling for retries. Each attempt is a separate span; they should share a parent span (the overall retry operation).
import "go.opentelemetry.io/otel"
tracer := otel.Tracer("retry")
func (r *Retrier) Do(ctx context.Context, op func(context.Context) error) error {
ctx, parentSpan := tracer.Start(ctx, "retry.Do")
defer parentSpan.End()
for attempt := 0; attempt < r.Policy.MaxAttempts; attempt++ {
attemptCtx, attemptSpan := tracer.Start(ctx, fmt.Sprintf("retry.attempt.%d", attempt))
err := op(attemptCtx)
attemptSpan.End()
if err == nil {
return nil
}
// ...
}
return nil
}
In your tracing UI, you see a parent retry.Do span with one child span per attempt (retry.attempt.0, retry.attempt.1, ...).
Each attempt's own work (HTTP call, DB query) is a sub-span under the attempt span. This makes debugging far easier.
Appendix T: A Senior-Level Reading List¶
Beyond this file:
- Release It! by Michael Nygard — the canonical reference on stability patterns. Chapters on retry, timeout, circuit breaker.
- Building Microservices by Sam Newman — chapter on resilience.
- Designing Data-Intensive Applications by Martin Kleppmann — chapter on fault tolerance, idempotency.
- The Site Reliability Workbook — practical SRE content, including retry budgets.
- AWS Architecture Blog posts on jitter and timeouts (Marc Brooker).
- Google SRE Book (free online) — overload, cascading failures, throttling.
- Patterns of Distributed Systems by Unmesh Joshi — patterns of retries, idempotency, consensus.
Each of these will deepen your understanding past what this file covers.
Appendix U: A Quick Reference for Design Reviews¶
When reviewing a teammate's retry code, ask:
- "What is the max attempts?"
- "What is the cap?"
- "Is there jitter? Which strategy?"
- "Is context propagated?"
- "Is the deadline clipped?"
- "Is there a retry budget?"
- "Does the operation have an idempotency key?"
- "Is the breaker open-circuit treated as permanent?"
- "Is
Retry-Afterhonoured?" - "Are metrics emitted?"
- "What error types are retryable?"
- "What error types are permanent?"
- "Where else in the stack does retry happen?" (Avoid multiplication.)
- "What is the worst-case wall clock?"
- "What happens during a complete outage of the dependency?"
If most answers are vague, push back. Retry code is one of the most operationally consequential parts of a codebase.
Appendix V: Common Anti-Patterns Specific to Senior Code¶
Some anti-patterns appear only in code written by mid-engineers who think they have learned senior tricks. Watch for:
Anti-pattern 1: hedging without idempotency¶
Hedged requests duplicate the call. If the operation is not idempotent, hedging creates duplicate effects.
Anti-pattern 2: breaker that records every retry as failure¶
Each retry inside a single operation should count as one failure for the breaker, or zero. Not five.
Anti-pattern 3: per-replica budget calibrated for single-replica deployment¶
If you deploy 10 replicas of a service, each with a 100 RPS retry budget, the aggregate is 1000 RPS. Calibrate aggregate, not per-replica.
Anti-pattern 4: adaptive throttling with no floor¶
The adaptive formula can drop admission to 0%. If you have no minimum admission rate, the system never recovers. Always set a floor (e.g. minimum 10% admission).
Anti-pattern 5: idempotency keys generated per-attempt¶
for attempt := 0; attempt < 5; attempt++ {
key := uuid.NewString() // wrong: generated each attempt
err := client.Do(ctx, key, req)
// ...
}
The key must be generated once and reused across attempts.
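The corrected shape, for contrast (client.Do is the same hypothetical call as above):
key := uuid.NewString() // right: generated once, before the loop
for attempt := 0; attempt < 5; attempt++ {
	err := client.Do(ctx, key, req) // every attempt carries the same key
	// ...
}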
Anti-pattern 6: deadline computed from local clock when servers have skew¶
Absolute deadline propagation fails with clock skew. Use relative deadlines (timeouts) and recompute the absolute at each hop.
Anti-pattern 7: streaming gRPC retry that drops state¶
A streaming RPC that retries from scratch loses state. The server may double-process.
Anti-pattern 8: retry inside a database transaction¶
Long retries hold transaction locks for long. Either move retries outside the transaction or use short-lived locks.
Anti-pattern 9: full-jitter clamped to zero floor¶
If the jitter draws 0, you retry immediately. Sometimes the operation needs a moment to recover. Consider equal jitter or a small floor.
Anti-pattern 10: budget that does not reset¶
A budget that decrements but never refills is just a counter. Make sure your rate.Limiter is the right shape: tokens replenish at a rate.
Appendix W: Recap¶
The senior-level shift is from local correctness to global correctness. The retry policy is not just code; it is a system-design contract that interacts with deadlines, idempotency, breakers, budgets, and architecture choices.
The math (queueing theory, multiplicative amplification) lets you predict outcomes. The patterns (edge-only retries, deadline propagation, hedging, adaptive throttling) let you build resilient systems. The postmortems (AWS S3, GitHub, others) show what happens when these go wrong.
The professional file translates all of this into production code: cenkalti/backoff integration, gRPC interceptors, OpenTelemetry tracing, Prometheus metrics, and an end-to-end client wrapping all the layers.
Final Exercises (Senior)¶
Pick three or four; solutions in tasks.md.
- Compute the worst-case load amplification for 4 tiers with 5 retries each and 70% failure rate.
- Implement Google's adaptive throttling formula in Go.
- Design an idempotency-key scheme for a hypothetical payments service.
- Write a hedge package with hedging budget and cancellation.
- Combine retry + breaker + budget + bulkhead into one package.
- Trace a 3-tier system through a 30-second blip; estimate amplification.
- Argue for or against retry in a specific deployment.
- Write a kill-switch mechanism that disables retries at runtime.
If you can do these, you can architect retry policy at staff-engineer level.
Appendix X: Modelling A Real System — A Worked Example¶
Let us model retry behaviour in a hypothetical e-commerce system. The components:
- Web tier (A): Public HTTP API, behind a load balancer.
- Order service (B): Internal microservice creating orders.
- Inventory service (C): Internal microservice checking stock.
- Database (D): Backs B and C.
A user adds an item to cart. The flow:
- Browser → A: POST /add-to-cart.
- A → B: CreateCartItem(...).
- B → C: CheckStock(...).
- C → D: SELECT stock FROM items WHERE id=?.
- B → D: INSERT INTO cart_items ....
Five hops. Each hop has a 50ms median latency, 200ms p99. The database has 200,000 connections, 50,000 RPS capacity.
Normal load: 10,000 users per second. Each user triggers 2 reads to C, 1 write to B. Database load: 30,000 RPS. Utilisation ρ = 0.6. Latency is fine.
Scenario: a 30-second slow database¶
Database GC pause causes 50% of queries to slow to 1-second latency for 30 seconds.
Without retries (good): B and C surface slow queries as errors. Users see "try again" page. The system is slow but recovering. After 30s, latency returns to normal.
Without retries (bad): some queries succeed slowly. A's request handler waits up to its 5s deadline. Users see slow page loads but no errors.
With retries at every tier:
- C makes up to 3 attempts per query against D. Effective load on D: 3 × normal = 90,000 RPS. D is now over capacity.
- B retries calls to C. B is sending 3× normal C requests, which become 9× normal D requests. Effective: 270,000 RPS. D is way over capacity.
- A retries calls to B. 3× B = 9× C = 27× D. Effective: 810,000 RPS. D crashes.
Even with full jitter spreading the retries in time, the total work is set by the attempt counts. With 3 tiers making up to 3 attempts each, the amplification is 3 × 3 × 3 ≈ 27×. The system is crushed.
Scenario: edge-only retries¶
Now A retries 3 times. B and C do not retry. With deadline propagation, A's 5s budget is shared.
Per user request:
- Attempt 1: A → B → C → D × 3 = 3 calls to D.
- Attempt 1 fails. Wait jittered ~200ms.
- Attempt 2: another 3 calls to D.
- Attempt 2 fails. Wait jittered ~600ms.
- Attempt 3: another 3 calls to D. Maybe succeeds.
Total per user: up to 9 calls to D. With 10,000 user requests per second hitting the failure window, D sees up to ~90,000 RPS against its 50,000 RPS capacity. D is briefly over capacity but recovers.
This is a 3× factor instead of 27×. Edge-only retries reduce amplification 9-fold.
Scenario: edge-only retries + retry budget¶
A has a retry budget of 10% of normal traffic. Normal: 10,000 RPS. Retry budget: 1,000 RPS.
When the failure starts, A's clients all retry. The budget exhausts within seconds. New retries are denied. Users see errors faster but the system load stays bounded.
Total extra D load: 1,000 retries × 3 calls = 3,000 RPS. Database is fine.
Scenario: + circuit breaker at A¶
A has a per-dependency breaker on calls to B. When error rate exceeds 50%, the breaker opens.
When the failure starts, A's calls to B fail. After ~10 failures, the breaker opens. A returns "service unavailable" without trying B.
D load drops to near zero (only the half-open probes).
After 30s, A's breaker tries half-open probes. The database has recovered. Probes succeed. Breaker closes. Normal traffic resumes.
This is the ideal outcome of a sophisticated retry/breaker policy. Failure is localised, contained, and recovered without operator intervention.
Summary of scenarios¶
| Configuration | D load multiplier | Outcome |
|---|---|---|
| No retries | 1× | slow but recoverable |
| Retries everywhere | 27× | catastrophic |
| Edge-only retries | 3× | rough but recoverable |
| Edge + budget | 1.3× | smooth |
| Edge + budget + breaker | ~1× | invisible to most users |
This is the architectural progression a senior engineer pushes for.
Appendix Y: Where The Pros Disagree¶
A senior engineer should know the open debates in retry-policy design.
Debate 1: client-side vs server-side retry¶
Some prefer server-side retry (the server retries internal operations transparently; clients see only the final outcome). Reasons:
- Servers have more context (transaction state, idempotency keys).
- Clients are simpler.
- Centralised policy.
Others prefer client-side retry. Reasons:
- Client knows the user's deadline; server does not.
- Server cannot retry network failures it never saw.
- Decentralised: clients can have different policies.
In practice, both: server retries internal transient infrastructure (DB connection blips); client retries network and 5xx from server. They cover different failure modes.
Debate 2: retry vs hedging¶
Both reduce tail latency from slow replicas. Retry waits for failure; hedging duplicates aggressively.
Hedging arguments: aggressive, faster for latency-dominated workloads, requires idempotency.
Retry arguments: conservative, no duplicates, works for any operation.
Real systems do both: hedging for read fan-out, retry for writes.
Debate 3: full jitter vs decorrelated jitter¶
Full jitter is AWS's default and the most common. Decorrelated jitter is sometimes preferred when you do not keep an attempt counter (it needs only the previous delay) or when minimising total retry work.
The difference is small. Pick full jitter unless you have a specific reason.
Debate 4: fixed budget vs adaptive throttling¶
Fixed budget (rate.Limiter at 10% of traffic) is simple. Adaptive throttling (Netflix-style, Google-style) reacts to observed conditions.
Adaptive throttling is more sophisticated but harder to tune. For most systems, fixed budget is enough.
Debate 5: per-attempt timeout vs total deadline¶
Per-attempt timeout: each call has its own 5s limit. Total deadline: 5s across all attempts.
Each has a place. Per-attempt is necessary so individual stuck calls do not consume the whole budget. Total deadline is the user-facing SLO. Use both.
Debate 6: retry on connection-refused vs not¶
connection-refused could be a load balancer routing to a drained instance (retry helps) or a misconfiguration (retry does not help). Most clients retry; you may decide otherwise based on context.
Debate 7: should 502 Bad Gateway be retried¶
502 is technically "the upstream returned an invalid response". Some treat it as transient (load balancer hiccup); others treat it as permanent (the upstream is broken). Most retry it.
Debate 8: should client retry across multiple replicas¶
If you have multiple replicas, should the retry pick a different replica on retry? Yes — same replica that just failed is the worst pick.
Standard pattern: round-robin or random selection from a healthy pool, with the previously-failed replica temporarily de-prioritised.
These debates do not have universal answers. A senior engineer reads context — workload, dependencies, team — and chooses.
Appendix Z: The Senior-Level Mindset Shift¶
The transition from middle to senior is mostly mental. At junior and middle level you ask "is my retry code correct?". At senior level you ask "what is the retry behaviour of my entire system, and is that what I want?".
Concretely:
- Stop optimising your retry helper in isolation. Start measuring how much retry traffic your service generates.
- Stop thinking of retries as a coding pattern. Start thinking of them as a system-design contract.
- Stop adding retries reflexively. Start adding them only where they buy real reliability.
- Stop assuming each retry is independent. Start assuming retries cascade and amplify.
This mindset is what distinguishes senior engineers from middle. It is not about more code or fancier policies — it is about understanding the consequence of policies at scale.
When you can describe your system's retry behaviour in a 5-minute design review and predict its load impact under failure, you have crossed into senior territory.
Appendix AA: One More Worked Calculation¶
A senior interviewer might ask: "Your service does 10,000 RPS normally. The dependency typically has 1% transient failure rate. You retry 3 times. What is the load on the dependency under normal conditions?"
Solution:
Normal calls: 10,000 RPS.
- First-attempt failures: 10,000 × 0.01 = 100 RPS.
- Each failure retries up to 3 times, and each retry has the same 1% failure rate.
- Retry 1: 100 calls; 100 × 0.99 ≈ 99 succeed, 100 × 0.01 = 1 retries again.
- Retry 2: ~1 call; ~0.99 succeeds, ~0.01 retries again.
Total dependency load: 10,000 + 100 + 1 + 0 ≈ 10,101 RPS.
About 1% additional load from retries. Easy.
Now the dependency fails completely for 30 seconds. What is the load during that window?
Every call fails. Every call retries 3 times.
Per second: 10,000 first attempts + 10,000 retry 1 + 10,000 retry 2 + 10,000 retry 3 = 40,000 RPS.
But wait — the retries are delayed by exponential backoff. Let us model:
- t=0: 10,000 failures. Retry scheduled for t=jitter (avg 50ms).
- t=0.05: 10,000 retries fail. Schedule retry 2 for t=0.05 + jitter (avg 100ms).
- t=0.15: 10,000 retries fail. Schedule retry 3 for t=0.15 + jitter (avg 200ms).
- t=0.35: 10,000 retries fail. Give up.
Plus the new requests arriving every second.
Total RPS during the failure window:
- Steady: 10,000 first attempts + retries from previous seconds.
- After 0.35s, all retries from t=0 are done. New requests at t=1s are starting their retry pipeline.
In steady-state during the 30-second outage:
- 10,000 RPS first attempts
- ~10,000 RPS retry 1 (delayed by ~50ms, so overlapping)
- ~10,000 RPS retry 2
- ~10,000 RPS retry 3
- Total: ~40,000 RPS
So during a 100% failure window, dependency load is 4× normal. Still bounded; the dependency can survive if it has 4× capacity headroom.
This is the kind of back-of-envelope calculation a senior engineer does during incident planning.
Appendix BB: Going Beyond Backoff — Other Strategies¶
Exponential backoff with jitter is the workhorse. Other strategies exist:
Linear backoff¶
delay = base * attempt. Grows slower than exponential. Suitable when failures are usually short and you want to retry more aggressively.
Fibonacci backoff¶
delay(n) = delay(n-1) + delay(n-2). Between linear and exponential. Slightly smoother growth.
Constant backoff with jitter¶
delay = uniform(base/2, 3*base/2). Same expected delay each retry, just jittered. Suitable when you have no a priori reason for exponential growth.
Token bucket only¶
No fixed schedule; just a budget. Retry as fast as possible while the budget allows. Aggressive; only suitable when the dependency can absorb load.
Custom curves¶
A library might support a curve function: delay(attempt) = f(attempt). Implementations include logarithmic growth, capped polynomial, or even reinforcement-learning-tuned curves.
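A sketch of these curves as plain functions from attempt number to delay; the jitter from earlier chapters would still be layered on top, and base is assumed to be positive:
import (
	"math/rand"
	"time"
)

func linearBackoff(base time.Duration, attempt int) time.Duration {
	return base * time.Duration(attempt+1) // base, 2*base, 3*base, ...
}

func fibonacciBackoff(base time.Duration, attempt int) time.Duration {
	a, b := 1, 1
	for i := 0; i < attempt; i++ {
		a, b = b, a+b
	}
	return base * time.Duration(a) // base, base, 2*base, 3*base, 5*base, ...
}

func constantJitteredBackoff(base time.Duration) time.Duration {
	// Uniform in [base/2, 3*base/2): same expected delay on every retry.
	return base/2 + time.Duration(rand.Int63n(int64(base)))
}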
In practice, exponential with jitter dominates. The other strategies have niches.
Appendix CC: Multi-Region Considerations¶
Retry policy interacts with multi-region deployments.
Pinning vs failover¶
A client pinned to one region retries to that region. Suitable when latency matters.
A client with regional failover tries another region on persistent failure. Suitable when reliability matters.
Combining: retry within region first; on maxAttempts exhausted, fail over to another region.
Geographic jitter¶
For traffic across regions, jitter has another role: desynchronising regions. If all regions retry at the same instant after a global event, they all hit the failover region simultaneously.
Adding a per-region offset to the jitter avoids this.
A small randomised offset per region (e.g. 100ms) ensures regions are out of phase.
Cross-region replication delay¶
If your retry is to a write that replicates across regions, retries during replication delay can produce inconsistencies. Use idempotency keys to deduplicate.
Failover trigger¶
When the primary region is down, clients failover. If they all failover simultaneously, the secondary region is hit by 2× load. Same thundering-herd problem, different scale.
Adaptive failover (probabilistic, jittered) mitigates this.
Appendix DD: Building A Retry SDK¶
A senior engineer often is asked to build an internal retry SDK that all services in their company use. The design considerations:
- API simplicity. retry.Do(ctx, op) should be enough for 90% of cases.
- Configurability. Power users need to override every knob.
- Defaults. Sensible defaults so non-experts get correct behaviour.
- Observability. Metrics by default; tracing optional.
- Documentation. Examples for common cases.
- Tested. Property-based tests for jitter; integration tests for retry behaviour.
- Versioned. Breaking changes are painful at scale.
A typical company has 1-3 retry libraries:
- A general-purpose one (retry.Do for any function).
- An HTTP-specific one (wraps net/http).
- A gRPC-specific one (interceptor).
They share a core policy implementation. The wrappers add domain-specific defaults and integrations.
If you build this, you will revisit it every year. Workloads change; defaults need tuning.
Appendix EE: Final Thoughts¶
At senior level, retry policy is one of the most consequential parts of your system. A bad policy causes outages. A good policy invisibly saves users from countless transient blips.
The middle file taught you the mechanics. The senior file taught you the architecture. The professional file will teach you the integration — taking these ideas into a real Go codebase with libraries, observability, and operational maturity.
If you understand thundering herd, deadline propagation, idempotency, retry budgets, breakers, and adaptive throttling — and you can compose them — you have everything you need to operate a high-availability system. Move to the professional file when ready.
Appendix FF: Failure Injection Testing¶
A senior engineer designs failure injection into their system. You need to prove your retry policy works before relying on it in production.
Chaos engineering basics¶
Netflix's Chaos Monkey randomly kills instances. The system must survive. Same idea applies to retries: inject failures and observe retry behaviour.
For retry-specific testing:
- HTTP middleware that injects errors. Return 503 for X% of requests for Y seconds.
- Database wrapper that injects timeouts. Random delays on Z% of queries.
- Network-level chaos. tc (traffic control) on Linux to add packet loss.
The retry policy should: handle bounded failure rate, recover when failure stops, not cause cascading damage during failure.
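A sketch of the first mechanism above: an http.RoundTripper that injects 503s for a fraction of requests during a fixed window. The type and field names are illustrative; install it as the Transport of the client under test and assert that the retrier recovers once the window closes.
import (
	"io"
	"math/rand"
	"net/http"
	"strings"
	"time"
)

// faultTripper returns a synthetic 503 for `rate` of requests while the
// injection window is open; otherwise it forwards to the real transport.
type faultTripper struct {
	next  http.RoundTripper
	rate  float64
	start time.Time
	dur   time.Duration
}

func (f *faultTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	if time.Since(f.start) < f.dur && rand.Float64() < f.rate {
		return &http.Response{
			StatusCode: http.StatusServiceUnavailable,
			Status:     "503 injected fault",
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("injected fault")),
			Request:    req,
		}, nil
	}
	return f.next.RoundTrip(req)
}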
Property tests¶
Property-based testing is well-suited to retry:
- Property: "If the operation eventually succeeds within the deadline, the retrier returns success."
- Property: "The retrier returns within
maxAttempts * (maxDelay + max_op_latency)." - Property: "Total wall clock is bounded by parent deadline + epsilon."
- Property: "Jittered delays are within their advertised range."
Use testing/quick or gopter for these.
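A sketch of the last property against the Appendix Q retrier, using testing/quick; the test lives in the same retry package so it can reach the unexported fullJitter:
import (
	"testing"
	"testing/quick"
	"time"
)

func TestJitterWithinAdvertisedRange(t *testing.T) {
	r := New(Policy{Base: 100 * time.Millisecond, MaxDelay: 5 * time.Second})
	property := func(attempt uint8) bool {
		d := r.fullJitter(int(attempt % 16)) // keep the exponent in a sane range
		return d >= 0 && d < r.Policy.MaxDelay
	}
	if err := quick.Check(property, nil); err != nil {
		t.Fatal(err)
	}
}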
Mutation testing¶
Insert a bug (if attempt > 100 to allow infinite retries; remove the jitter; forget defer t.Stop()). Run your test suite. If tests still pass, your tests are insufficient.
This catches sloppy tests that pass for the wrong reasons.
Appendix GG: Capacity Planning With Retries¶
When provisioning capacity, account for retries.
Steady-state formula¶
Capacity needed (RPS): capacity ≈ normal_load × retry_multiplier.
For normal operation the multiplier is about 1.05 (5% retry overhead); for incidents it is 2-4 depending on policy.
Provision for the larger of the two.
Worked example¶
- Normal load: 50,000 RPS.
- Retry overhead in normal: 5%. So 52,500 RPS sustained.
- During a 30-second blip: 4× amplification. Peak: 200,000 RPS for 30 seconds.
Provision for sustained 60,000 RPS (some headroom) plus a burst capacity of 200,000 RPS for 30 seconds.
Even if your infrastructure scales elastically (Kubernetes HPA, auto-scaling groups), the 30-second burst is hard to absorb: autoscaling takes minutes. Either provision the burst capacity always, or add a retry budget that caps the burst.
Bin-packing constraints¶
Some workloads are bursty. Retry burst plus normal traffic burst combined can exceed budget.
Capacity planning for retry-heavy workloads needs to look at the joint distribution of normal traffic and failure events, not the marginal.
Appendix HH: Worked Read-Pattern Optimisations¶
For read-heavy fan-out workloads, a single request to A fans out into many parallel reads (say 30 dependencies). If any read fails and A retries the whole request, then with a 1% per-call failure rate about 26% of A's requests (1 - 0.99^30) fail at least one call.
Approaches:
- Retry only the failed reads. 1% retry traffic instead of 26% full-request retries.
- Hedge slow reads. First read at t=0; if not done by t=p99, send second to a different replica.
- Cache aggressively. Tolerate stale reads to avoid the slow dependency.
- Async pre-fetch. Issue reads in parallel; if any fails, retry it in isolation.
The fan-out pattern interacts with retry. Edge-only retries are not enough if the edge is a fan-out; you may need per-leaf retries with very tight budgets.
This is one of the cases where naive "retry at the edge" guidance breaks down.
Appendix II: Senior-Level Production Review Checklist¶
Before a service goes to production, audit:
- All dependencies accept and propagate a context.Context.
- All dependencies have a per-call timeout.
- All idempotent dependencies have retry configured.
- All non-idempotent dependencies use idempotency keys.
- Retries use full jitter (or documented alternative).
- Each retrier has a maxAttempts cap.
- Each retrier has a maxDelay cap.
- Each retrier has a budget (rate limiter).
- Each retrier respects Retry-After.
- Each dependency has a circuit breaker.
- Breaker thresholds are set conservatively.
- Breaker open is treated as permanent by retry.
- Deadlines are propagated to downstream services.
- Metrics are emitted: retry rate, attempt-at-success histogram, budget denial rate.
- Alerts are configured on retry-rate and budget-denial spikes.
- A kill switch exists to disable retries at runtime.
- Documentation covers worst-case latency.
- Tests cover: success, retry-then-success, exhausted, deadline-exceeded, budget-denied, breaker-open.
- Chaos tests confirm retry behaviour under simulated failures.
- Capacity is provisioned for 4× normal load during failure windows.
If all 20 are green, the service is ready. If any are red, escalate.
Appendix JJ: Retry Across Asynchronous Boundaries¶
Retry inside a synchronous request handler is the simplest case. Asynchronous workflows have different concerns.
Background jobs¶
A job system (Sidekiq-style, or task queue like SQS, Pub/Sub) processes jobs. Failed jobs retry on a schedule the queue manages.
The queue's retry differs from in-request retry:
- Delays can be much longer (minutes, hours, days).
- Idempotency is mandatory because the queue may retry arbitrarily many times.
- Visibility timeouts must exceed expected job latency or you process duplicates.
- Exponential backoff is still common: 1m, 5m, 25m, 2h, etc.
Event-driven¶
Kafka consumers retry message processing on failure. The retry pattern is:
- Process message.
- On failure, send to a "retry topic" with backoff metadata.
- A worker consumes the retry topic with delay.
- After N retries, send to a dead-letter topic.
This is exponential backoff implemented at the queue/topic level, not in code.
Workflow engines¶
Temporal, Cadence, etc., have built-in retry policies for each activity:
activityOptions := workflow.ActivityOptions{
RetryPolicy: &temporal.RetryPolicy{
InitialInterval: 1 * time.Second,
BackoffCoefficient: 2,
MaximumInterval: 10 * time.Second,
MaximumAttempts: 5,
},
}
The workflow engine implements the retry. Your code only declares the policy. Far cleaner for long-running workflows.
Appendix KK: Retry With Distributed Locks¶
Some operations require coordination — only one client should be doing work X at a time. Distributed locks (Redis, Zookeeper, etcd) coordinate.
Retry interaction:
- Client tries to acquire lock. Fails (held by another).
- Client wants to retry.
With exponential backoff:
- Attempt 1: try lock, fail. Wait 100ms.
- Attempt 2: try lock, fail. Wait 200ms.
- ...
If the lock-holder is slow, all retriers stack up. Each retry blocks waiting.
Alternative: use a blocking lock with a deadline. lock.AcquireUntil(ctx, deadline). The lock service notifies you when the lock is free.
For distributed coordination, blocking primitives plus a deadline is usually better than polling with exponential backoff.
Appendix LL: Designing For Operability¶
Beyond correctness, retry code must be operable: a stressed SRE at 2 AM should be able to understand and intervene.
Operability principles:
- Clear metric names. retry_attempts_total{service="orders"} is better than retries.
- Useful dashboard. Show retry rate, success-at-attempt-N, budget consumption.
- Documented runbooks. "If retry budget exhausted, do X."
- Kill switches. Disable retries without redeploying.
- Graceful degradation. If retries fail, the user gets a clear error.
The professional file dives into this; the senior level is about recognising that operability is part of the system.
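A minimal kill-switch sketch for the kill-switches principle above: a process-wide toggle the retrier consults before scheduling another attempt, flipped at runtime through an admin endpoint. The names are hypothetical; the retrier would call retryKill.Allow() just before sleeping and return the last error immediately when retries are off.
import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// killSwitch disables retries process-wide without a redeploy.
type killSwitch struct{ disabled atomic.Bool }

func (k *killSwitch) Allow() bool          { return !k.disabled.Load() }
func (k *killSwitch) SetDisabled(v bool)   { k.disabled.Store(v) }

var retryKill killSwitch

// adminHandler lets an operator flip the switch: /admin/retries?enabled=false
func adminHandler(w http.ResponseWriter, r *http.Request) {
	retryKill.SetDisabled(r.URL.Query().Get("enabled") == "false")
	fmt.Fprintf(w, "retries enabled: %v\n", retryKill.Allow())
}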
Appendix MM: Common Misunderstandings From Other Senior Engineers¶
A few subtleties even seasoned engineers get wrong:
"We have a 3-retry budget, so we're safe"¶
Per-call budgets do not protect against system-wide retry storms. You need a rate budget, not a per-call cap.
"Idempotency keys mean we can retry forever"¶
No. Idempotency keys make retries safe for the operation. They do not protect the dependency from load. You still need budgets and breakers.
"Adaptive throttling solves everything"¶
Adaptive throttling reacts to observed conditions, but it has parameters that must be tuned. Untuned, it can over-throttle (denying legitimate traffic) or under-throttle (not catching the spiral). Test under chaos before relying on it.
"Full jitter is the only correct choice"¶
For thundering herd, yes. But specific contexts (rate-limit contracts, ordered processing) need different strategies. Know when full jitter is suboptimal.
"We always need exponential backoff"¶
For dependencies you call frequently, linear backoff may be better. Exponential's "let the system breathe" benefit is small when the call frequency is high anyway.
"Retries are always safe inside a circuit breaker"¶
A breaker that records every retry as a failure can open after one retry sequence. Aggregate retries into one breaker event.
"Retry-After always wins"¶
If the server suggests a 60-second retry-after and your deadline is 5 seconds, do not wait 60s. Surface the error.
Each of these is the kind of mistake a 5-year engineer makes. Recognise them.
Appendix NN: Senior-Level Interview Questions¶
Practice for promotion interviews:
- "Walk me through what happens when a popular cache dies. What does your retry policy do?"
- "Estimate the load multiplier when a database has 200ms latency increase. Three tiers, 3 retries each."
- "Design an idempotency-key scheme for a stripe-like API. Handle race conditions."
- "When would you choose hedging over retry?"
- "Describe an adaptive throttling algorithm."
- "What is deadline propagation and why does it matter?"
- "How do you test a retry policy?"
- "When would you disable retries in production?"
- "Critique this retry code: [show 10 lines]."
- "Design a retry policy for a payments service."
- "What is the worst case latency for your retry policy with N attempts?"
- "What is the difference between syntactic and semantic idempotency?"
- "What is a retry storm?"
- "How do retries interact with circuit breakers?"
- "How do you observe retry behaviour in production?"
For each, you should be able to talk for 3-5 minutes with concrete numbers.
Appendix OO: A Note On Reading Open-Source Retry Code¶
To deepen your understanding, read the source of major retry libraries. Recommendations:
- github.com/cenkalti/backoff/v4 — the most popular Go retry library. Clean, well-documented. Read this first.
- github.com/aws/aws-sdk-go-v2/aws/retry — the AWS SDK's retry. More features (token bucket, classification), more complex.
- github.com/grpc/grpc-go — the built-in transparent retry and retry throttling. Subtle.
- github.com/sony/gobreaker — circuit breaker. Pair with a retry for production.
- github.com/platinummonkey/go-concurrency-limits — adaptive concurrency limits (Netflix port).
For each, ask:
- What is their default policy?
- How do they classify retryable errors?
- How do they handle context?
- How is their jitter implemented?
- How do they expose metrics?
- How do they integrate with circuit breakers?
Reading these libraries crystallises the patterns we have discussed. The professional file dives deeper into the most important ones.
Appendix PP: Final Senior-Level Checklist¶
Before claiming senior-level expertise on retry:
- You can explain thundering herd to a peer in 2 minutes with a worked example.
- You can compute multiplicative amplification across tiers.
- You can design an idempotency-key scheme with race-condition handling.
- You understand and can explain Google's adaptive throttling formula.
- You know when to use hedging versus retry.
- You can compose retry + breaker + bulkhead + budget into a single client.
- You can read postmortems and identify retry-related failures.
- You understand deadline propagation across services.
- You can advise a team on retry policy in a design review.
- You can audit a service's retry behaviour using metrics and traces.
- You know what a "kill switch" is and have implemented one.
- You can write property-based tests for retry policies.
If all 12 are confident yeses, you are senior on this topic. If some are weak, re-read the corresponding appendix and try the exercises.
The professional file builds on all of this with concrete code, real libraries, and operational depth. Read it next.
Closing¶
Senior-level retry is about understanding consequences. Every retry policy choice has system-wide implications, not just local ones. The math (queueing theory, amplification, recovery dynamics) lets you predict the consequences. The patterns (edge-only, deadline propagation, budgets, breakers, hedging, adaptive throttling) let you compose policies that handle real workloads.
The next file makes this concrete in Go.
Appendix QQ: Modelling Recovery After A Failure¶
When a system fails, retries start. When it recovers, retries stop firing (because operations succeed). But the transition is delicate. Let us model it.
State after failure ends¶
Just as the dependency recovers, many clients have retries pending. Their next scheduled retry is at various points in the next few seconds (full-jitter distribution).
The arrival pattern over time:
t = recovery_time: baseline traffic
t = recovery + jitter_1: baseline + retries from first round
t = recovery + jitter_2: baseline + retries from second round
...
The total traffic profile depends on how long the failure lasted, what fraction of clients are still retrying, and the jitter strategy.
Modelling with arrival rate function¶
Let R(t) be the retry arrival rate. For a failure window [0, T] with N clients, full jitter spreads each round's retries uniformly over its backoff window: if N_k clients are waiting on attempt k with cap c_k, that round contributes roughly N_k / c_k RPS while its window is open, and R(t) is the sum over the rounds currently open.
The total R(t) + baseline(t) is what the server must handle.
Critical observation¶
If the total R(t) + baseline(t) exceeds server capacity even briefly, queues form. Queues drain slowly (queue-drain dynamics). The system stays slow for longer than the spike.
This is why provisioning for the peak matters more than the average.
Numerical example¶
- 100,000 clients, all retrying with full jitter, base 100ms, cap 5s.
- After failure (lasting 1 second), retries spread over the next 5 seconds.
- Average rate: 100,000 / 5 = 20,000 RPS over the recovery window.
- Plus baseline of 50,000 RPS.
- Total: 70,000 RPS for 5 seconds.
If the server's capacity is 60,000 RPS, the system is overloaded for 5 seconds after the failure. The 1-second failure becomes a 5-second incident.
If the cap were 10s instead, the rate would be 10,000 RPS over 10 seconds — 60,000 RPS total. Just at capacity. The incident is twice as long but the overload is gone.
Tuning cap is therefore a trade-off between rapid recovery and overload risk.
Appendix RR: Bayesian Backoff¶
A more sophisticated approach: estimate the failure probability of the dependency in real time, and choose retry delays to maximise expected information.
If you observe a string of failures, the probability of a transient versus permanent failure shifts. Early failures are likely transient; many consecutive failures suggest permanent.
In a Bayesian framework:
- Prior: P(transient) = 0.9, P(permanent) = 0.1.
- After 1 failure: P(transient) drops slightly.
- After 5 failures: P(transient) drops a lot.
When P(transient) becomes small enough, retrying is wasted.
Practical implementations are rare because the math is complex and the gains are marginal compared to a fixed cap. But the intuition is useful: the longer you retry, the less likely the next retry succeeds. Fixed maxAttempts is a simple approximation.
Appendix SS: The Mathematics of Hedging¶
Hedging budget calculation:
If you hedge after d (a fraction of the typical p99), the duplicate-request rate is approximately (1 - F(d)) where F is the cumulative distribution of latency.
For d = p95, duplicate rate is 5%. For d = p99, duplicate rate is 1%. For d = p90, duplicate rate is 10%.
Choosing d is a trade-off between tail-latency improvement and load multiplier.
Real-world tuning: pick d such that hedging duplicates are <5% of total traffic. Then the tail-latency improvement is significant but the load multiplier is bounded.
Hedging vs retry comparison¶
For latency-dominated workloads:
- Retry: wait for p99, then retry. Total: 2 × p99.
- Hedging: wait for p95, hedge, take first. Total: p95 + (hedge latency or the rest of the original).
Hedging wins for tail latency. Retry wins for total work.
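A sketch of the mechanics, to make the trade-off concrete. hedgedDo is a hypothetical helper, op must be idempotent, and a fuller version would prefer the first success rather than the first result:
import (
	"context"
	"time"
)

// hedgedDo runs op immediately and, if it has not finished after hedgeDelay
// (typically ~p95), launches a second copy. The first result wins; the loser
// is cancelled through the shared context.
func hedgedDo(ctx context.Context, hedgeDelay time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make(chan error, 2)
	go func() { results <- op(ctx) }()

	timer := time.NewTimer(hedgeDelay)
	defer timer.Stop()

	select {
	case err := <-results:
		return err // primary finished before the hedge fired
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		go func() { results <- op(ctx) }() // hedge, ideally routed to a different replica
	}
	select {
	case err := <-results:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}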
Appendix TT: Sample Decision Tree¶
When designing retry for a new endpoint, walk through:
Is the operation idempotent?
├── Yes → proceed to retry design
└── No → can we use idempotency keys?
├── Yes → use keys, then proceed
└── No → do not retry, surface errors
Is the dependency rate-limited?
├── Yes → respect Retry-After, low retry budget
└── No → standard exponential
Is the operation user-facing?
├── Yes → 3-5 attempts, <5s total deadline
└── No → 10-20 attempts, long deadline
Is the dependency a critical path?
├── Yes → circuit breaker + retry + budget
└── No → simple retry
Is the dependency in a different region?
├── Yes → consider regional failover after retry exhaust
└── No → retry to same target
Is the operation a write?
├── Yes → must have idempotency
└── No (read) → consider hedging too
Walking through this for each endpoint takes ~5 minutes and prevents most retry-design mistakes.
Appendix UU: Composing Retry With Other Resilience Patterns¶
The full resilience stack includes:
- Timeout — bound per-call latency.
- Retry — handle transient failures.
- Circuit breaker — fail fast on known-bad.
- Bulkhead — isolate dependencies.
- Rate limiter — protect server.
- Backpressure — signal upstream to slow down.
- Cache — avoid the dependency.
- Fallback — degrade gracefully.
Composition order matters. From inside out: timeout → retry → rate limiter → bulkhead → circuit breaker → cache → fallback.
Reading inside-out: the inner call has a timeout; the retrier wraps the timeout; the rate limiter caps outbound rate; the bulkhead limits concurrency; the breaker fails fast on known-bad; the cache avoids the dependency; the fallback degrades.
Most systems do not need all of these. Pick based on workload.
Why this order¶
- Timeout must be innermost or you cannot bound the inner call.
- Retry must wrap timeout: retry resends after a timeout.
- Rate limiter caps the call rate including retries.
- Bulkhead bounds concurrent calls including retries.
- Breaker monitors aggregate health.
- Cache short-circuits before the dependency.
- Fallback is the last resort when everything else fails.
This is the canonical ordering. Deviations need justification.
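A sketch of what this nesting looks like in Go, treating each layer as a decorator over a call of type func(context.Context) error; only the two innermost layers are shown, and Retrier refers back to Appendix Q:
import (
	"context"
	"time"
)

type Op func(context.Context) error

// withTimeout bounds each individual attempt; it sits innermost.
func withTimeout(d time.Duration, op Op) Op {
	return func(ctx context.Context) error {
		ctx, cancel := context.WithTimeout(ctx, d)
		defer cancel()
		return op(ctx)
	}
}

// withRetry wraps the timed call so every attempt gets a fresh per-attempt timeout.
func withRetry(r *Retrier, op Op) Op {
	return func(ctx context.Context) error { return r.Do(ctx, op) }
}

// Rate limiter, bulkhead, breaker, cache, and fallback wrap further out in the
// same decorator shape; each layer only sees the layer directly inside it.
func buildCall(retrier *Retrier, callDependency Op) Op {
	return withRetry(retrier, withTimeout(2*time.Second, callDependency))
}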
Appendix VV: A Long Worked Example¶
Let us design retry for a hypothetical "payments-api" service from scratch.
Requirements¶
- Public API accepting POST /payments.
- Backed by a payment processor (Stripe).
- Must not double-charge.
- 99.9% uptime SLO.
- Median latency 200ms; p99 1s; user deadline 5s.
- Stripe occasionally has 503s; rarely (during incidents) has minute-long outages.
Design¶
Idempotency: Client provides Idempotency-Key header. Server records in Redis with 24h TTL.
Lock: During processing, server holds a Redis lock keyed on the idempotency key (60s TTL).
Retry to Stripe: wrapping our calls to Stripe:
- 3 attempts.
- 200ms base, 5s cap.
- Full jitter.
- Honour Stripe's Retry-After header.
- Treat 4xx (except 429) as permanent.
- Treat network errors as transient.
Budget: rate.NewLimiter(rate.Limit(100), 200) — 100 retries/sec to Stripe, burst 200.
Breaker: sony/gobreaker with threshold 50% failures over 60 seconds, open for 30 seconds.
Deadline: Client passes 5s deadline; we propagate via context.WithTimeout.
Metrics:
- payments_attempts_total{result} counter.
- payments_duration_seconds histogram.
- payments_retries_total counter.
- payments_budget_denied_total counter.
- payments_breaker_state gauge.
Alerts:
- Retry rate > 5% — warn.
- Budget denied rate > 1% — page.
- Breaker open > 60s — page.
Tests:
- Happy path: succeed first try.
- Transient: fail twice, succeed third.
- Permanent: fail with 4xx, surface immediately.
- Exhausted: fail all 3 attempts.
- Deadline: parent ctx cancelled mid-retry.
- Budget: 200 simultaneous failures, some get denied.
- Breaker: persistent failure trips breaker.
- Idempotency: duplicate request returns cached response.
- Race: simultaneous duplicate requests, only one processes.
Code skeleton¶
package payments
import (
"bytes"
"context"
"errors"
"fmt"
"io"
"net/http"
"time"
"github.com/redis/go-redis/v9"
"github.com/sony/gobreaker"
"golang.org/x/time/rate"
"yourmodule/retry"
)
type Service struct {
HTTP *http.Client
Redis *redis.Client
Retrier *retry.Retrier
}
func NewService(rdb *redis.Client) *Service {
breaker := gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: "stripe",
MaxRequests: 5,
Interval: 60 * time.Second,
Timeout: 30 * time.Second,
ReadyToTrip: func(c gobreaker.Counts) bool {
return c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) > 0.5
},
})
p := retry.Policy{
MaxAttempts: 3,
Base: 200 * time.Millisecond,
MaxDelay: 5 * time.Second,
Budget: rate.NewLimiter(100, 200),
Breaker: breaker,
}
return &Service{
HTTP: &http.Client{Timeout: 10 * time.Second},
Redis: rdb,
Retrier: retry.New(p),
}
}
func (s *Service) Charge(ctx context.Context, key string, body []byte) ([]byte, error) {
cached, err := s.Redis.Get(ctx, "idem:"+key).Bytes()
if err == nil {
return cached, nil
} else if !errors.Is(err, redis.Nil) {
return nil, err
}
locked, err := s.Redis.SetNX(ctx, "idem:"+key+":lock", "1", 60*time.Second).Result()
if err != nil {
return nil, err
}
if !locked {
return nil, errors.New("conflict: request in progress")
}
defer s.Redis.Del(ctx, "idem:"+key+":lock")
var result []byte
err = s.Retrier.Do(ctx, func(ctx context.Context) error {
var err error
result, err = s.callStripe(ctx, body)
return err
})
if err != nil {
return nil, err
}
s.Redis.Set(ctx, "idem:"+key, result, 24*time.Hour)
return result, nil
}
func (s *Service) callStripe(ctx context.Context, body []byte) ([]byte, error) {
req, _ := http.NewRequestWithContext(ctx, "POST", "https://api.stripe.com/v1/charges", bytes.NewReader(body))
resp, err := s.HTTP.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode == 429 || resp.StatusCode >= 500 {
// transient
return nil, fmt.Errorf("stripe status %d", resp.StatusCode)
}
if resp.StatusCode >= 400 {
return nil, retry.MarkPermanent(fmt.Errorf("stripe status %d", resp.StatusCode))
}
return io.ReadAll(resp.Body)
}
Production-ready shape. Missing: full metrics, tracing, error structuring. The professional file fills these in.
Operations runbook¶
When the alerts fire:
- Retry rate alarm: check Stripe status page. If Stripe is degraded, this is expected; let it ride. If not, investigate.
- Budget denied alarm: spike in retry traffic. Identify the cause. May need to disable retries temporarily.
- Breaker open alarm: Stripe is failing. Check Stripe; if recoverable, monitor breaker state.
- All three firing simultaneously: major Stripe incident. Switch to degraded mode (queue payments for retry later).
This is a senior-level design. The professional file shows the implementation in full.
Appendix WW: The "Why Now" of Retry Patterns¶
Many of the patterns in this file (jitter, retry budgets, adaptive throttling, hedging) became widely known after 2010. Why now?
Three factors:
- Scale. Pre-2010, most services had thousands of clients, and a thundering herd was a nuisance. Post-2010, services have millions of clients, and a thundering herd is an outage.
- Distributed systems went mainstream. Microservices made every call a distributed call, and retries multiplied across tiers.
- SRE discipline. Google's SRE book (2016) formalised many of these patterns, and companies adopted them.
The result: retry policy went from "for loop with sleep" to a major sub-discipline of software engineering.
Knowing the history helps. The patterns are not arbitrary; they evolved to solve specific failures, and the next class of failure will probably produce a new pattern in turn.
Appendix XX: A Senior's Bookshelf¶
For continued growth past this file:
- Release It! — Michael Nygard. Stability patterns including retry, timeout, circuit breaker. Essential.
- Site Reliability Engineering — Google SRE. Chapter on overload. Essential.
- The Site Reliability Workbook — Google SRE. Practical exercises.
- Designing Distributed Systems — Brendan Burns. Patterns for distributed systems, including some that involve retries.
- Patterns of Distributed Systems — Unmesh Joshi. Patterns and trade-offs.
- Database Internals — Alex Petrov. Background on consistency and replication.
- Concurrency in Go — Katherine Cox-Buday. Background on Go's concurrency primitives.
- AWS Architecture Blog — ongoing posts on resilience.
- Marc Brooker's blog (brooker.co.za) — deep posts on backoff, timeouts, congestion.
- Google's research blog — papers on distributed systems.
Build a habit of reading from one of these every week. Patterns compound.
Appendix YY: A Final Compendium¶
For quick reference, the senior-level patterns:
Anti-patterns to avoid¶
- Retry at every tier.
- No deadline propagation.
- Per-attempt idempotency keys.
- Breakers that record every retry.
- Per-replica budgets that aggregate too high.
- Ignoring Retry-After.
- Sleeping past parent deadline (see the sketch after this list).
- Using crypto/rand for jitter.
- No metrics, flying blind.
- No kill switch.
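Two of these items pair up: sleeping past the parent deadline is the anti-pattern, and clipping sleeps to the remaining deadline (in the pattern list that follows) is the fix. A minimal sketch of the clipping helper, with an illustrative name, living next to the retrier:
package retry

import (
	"context"
	"time"
)

// sleepWithin sleeps for d, but refuses to start a sleep that would outlive the
// parent deadline, and wakes early if the context is cancelled mid-sleep.
func sleepWithin(ctx context.Context, d time.Duration) error {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < d {
		// Sleeping longer than the caller is willing to wait only burns the
		// remaining budget; fail fast instead.
		return context.DeadlineExceeded
	}
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}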
Patterns to use¶
- Edge-only retries.
- Deadline propagation everywhere.
- Per-operation idempotency keys.
- Single breaker event per retry sequence.
- Aggregate retry budgets across replicas.
- Honour and jitter Retry-After.
- Clip sleeps to remaining deadline.
- Use math/rand (or math/rand/v2).
- Emit metrics on every retry decision.
- Build a kill switch into the policy (sketch below).
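A minimal sketch of that kill switch, assuming it lives in the yourmodule/retry package from the skeleton; the names are illustrative, and the wiring that flips the flag (config watcher, feature flag, admin endpoint) is left out.
package retry

import "sync/atomic"

// retriesDisabled, when set, turns every retry loop into a single attempt.
var retriesDisabled atomic.Bool

// SetRetriesDisabled is what an operator (or a config reload) flips during a
// retry-induced incident.
func SetRetriesDisabled(disabled bool) { retriesDisabled.Store(disabled) }

// retryAllowed is consulted by the retrier before scheduling another attempt.
func retryAllowed() bool { return !retriesDisabled.Load() }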
Numbers to remember¶
- Default retry: 3-5 attempts, 100ms base, 5s cap.
- Default budget: 10% of normal RPS.
- Default idempotency TTL: 24 hours.
- Default hedging budget: 1-5% duplicates.
- Default breaker threshold: 50% failures over 60 seconds.
- Default total deadline: 5 seconds for user-facing requests.
- Default per-call timeout: 2 × p99.
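As a sketch, the same defaults expressed with the Policy and breaker types used in the skeleton; normalRPS and the exact values are illustrative starting points, not prescriptions.
package payments

import (
	"time"

	"github.com/sony/gobreaker"
	"golang.org/x/time/rate"

	"yourmodule/retry"
)

const normalRPS = 1000 // illustrative steady-state request rate for this client

const (
	defaultIdempotencyTTL = 24 * time.Hour
	defaultTotalDeadline  = 5 * time.Second // user-facing end-to-end deadline
)

var defaultPolicy = retry.Policy{
	MaxAttempts: 4,                      // within the 3-5 range
	Base:        100 * time.Millisecond, // default base delay
	MaxDelay:    5 * time.Second,        // default cap
	// Retry budget: roughly 10% of normal RPS, with a modest burst allowance.
	Budget: rate.NewLimiter(rate.Limit(normalRPS/10), normalRPS/5),
	Breaker: gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:     "default",
		Interval: 60 * time.Second, // failure-counting window
		ReadyToTrip: func(c gobreaker.Counts) bool {
			// 50% failures over the window, with a minimum sample size.
			return c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) >= 0.5
		},
	}),
}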
Skills to develop¶
- Reading postmortems and identifying retry-related failures.
- Estimating load amplification in a hypothetical system.
- Designing idempotency for a new endpoint.
- Auditing retry behaviour via metrics.
- Tuning policy parameters from observed behaviour.
- Operating during a retry-induced incident.
If you can do all of the above, this file has done its job.
Closing (Final)¶
Senior-level retry is not about knowing the algorithm. It is about understanding the system the algorithm lives in. The math, the patterns, the postmortems, the trade-offs — all of these turn the simple for loop from junior level into the resilient client infrastructure that powers high-availability services.
The professional file translates everything in this file into a complete Go implementation: cenkalti/backoff integration, gRPC interceptors, OpenTelemetry tracing, Prometheus metrics, end-to-end testing, deployment runbooks. Move to it when you have internalised this file.
That is the senior level. The next stop is professional — where theory becomes the production codebase that runs your company's services every minute of every day.