Exponential Backoff — Hands-On Tasks¶
18 graded exercises. Each comes with a problem statement, a solution sketch, and "extra credit" extensions.
Task 1: Implement basic exponential backoff (Junior)¶
Problem¶
Write a function Retry(op func() error, maxAttempts int, base time.Duration) error that retries op up to maxAttempts times, doubling the wait each time. Return nil on success or the wrapped final error.
Solution¶
package retry
import (
"fmt"
"time"
)
func Retry(op func() error, maxAttempts int, base time.Duration) error {
var lastErr error
for attempt := 0; attempt < maxAttempts; attempt++ {
err := op()
if err == nil {
return nil
}
lastErr = err
if attempt < maxAttempts-1 {
time.Sleep(base * time.Duration(1<<attempt))
}
}
return fmt.Errorf("after %d attempts: %w", maxAttempts, lastErr)
}
Extra credit¶
- Add a maxDelay parameter and cap the sleep.
- Add a factor parameter for multipliers other than 2.
Task 2: Add context support (Junior/Middle)¶
Problem¶
Modify Retry to accept context.Context. Before each attempt and during sleep, check for cancellation. If cancelled, return ctx.Err().
Solution¶
func Retry(ctx context.Context, op func(context.Context) error, maxAttempts int, base, maxDelay time.Duration) error {
var lastErr error
for attempt := 0; attempt < maxAttempts; attempt++ {
if err := ctx.Err(); err != nil {
return err
}
err := op(ctx)
if err == nil {
return nil
}
lastErr = err
if attempt < maxAttempts-1 {
d := base * time.Duration(1<<attempt)
if d > maxDelay || d < 0 {
d = maxDelay
}
t := time.NewTimer(d)
select {
case <-t.C:
case <-ctx.Done():
t.Stop()
return ctx.Err()
}
}
}
return fmt.Errorf("after %d attempts: %w", maxAttempts, lastErr)
}
Extra credit¶
- Clip the sleep to the remaining deadline if context has one.
- Wrap the final error so errors.Is(err, context.DeadlineExceeded) works.
Task 3: Implement full jitter (Middle)¶
Problem¶
Modify your retry to use full jitter. Each delay is uniformly distributed on [0, min(maxDelay, base * 2^attempt)).
Solution¶
import "math/rand"
func nextDelay(attempt int, base, maxDelay time.Duration) time.Duration {
cap := base * time.Duration(1<<attempt)
if cap > maxDelay || cap < 0 {
cap = maxDelay
}
if cap <= 0 {
return 0
}
return time.Duration(rand.Int63n(int64(cap)))
}
Extra credit¶
- Add equal and decorrelated jitter as alternatives.
- Make the strategy configurable.
Task 4: Implement equal jitter (Middle)¶
Problem¶
Implement equal jitter: delay = cap/2 + U[0, cap/2].
Solution¶
func equalJitter(attempt int, base, maxDelay time.Duration) time.Duration {
cap := base * time.Duration(1<<attempt)
if cap > maxDelay || cap < 0 {
cap = maxDelay
}
half := cap / 2
if half <= 0 {
return 0
}
return half + time.Duration(rand.Int63n(int64(half)))
}
Extra credit¶
- Compare the average delay of full vs equal jitter empirically (run 1000 samples for each, take the mean).
Task 5: Implement decorrelated jitter (Middle/Senior)¶
Problem¶
Implement decorrelated jitter. Each delay is U[base, prev*3], capped at maxDelay. Initialise prev = base.
Solution¶
type Decorrelated struct {
Base time.Duration
MaxDelay time.Duration
prev time.Duration
}
func (d *Decorrelated) Next() time.Duration {
if d.prev == 0 {
d.prev = d.Base
}
upper := d.prev * 3
if upper > d.MaxDelay || upper < 0 {
upper = d.MaxDelay
}
span := upper - d.Base
if span <= 0 {
d.prev = d.Base
return d.Base
}
delay := d.Base + time.Duration(rand.Int63n(int64(span)))
d.prev = delay
return delay
}
Extra credit¶
- Add a Reset() method that returns to the initial state.
- Make the multiplier (3) configurable.
Task 6: Retry an HTTP GET (Middle)¶
Problem¶
Write a function GetWithRetry(ctx, url) ([]byte, error) that retries the GET with exponential backoff and full jitter. Retry on 5xx and 429. Surface 4xx (except 429) immediately.
Solution¶
func GetWithRetry(ctx context.Context, url string) ([]byte, error) {
var body []byte
err := Retry(ctx, func(ctx context.Context) error {
req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
if err != nil {
return MarkPermanent(err)
}
resp, err := http.DefaultClient.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode == 429 || (resp.StatusCode >= 500 && resp.StatusCode <= 599) {
return fmt.Errorf("status %d", resp.StatusCode)
}
if resp.StatusCode >= 400 {
return MarkPermanent(fmt.Errorf("status %d", resp.StatusCode))
}
b, err := io.ReadAll(resp.Body)
if err != nil {
return err
}
body = b
return nil
}, 5, 100*time.Millisecond, 5*time.Second)
return body, err
}
Extra credit¶
- Honour the Retry-After header.
- Parse it for both seconds and HTTP-date formats.
Task 7: Implement Permanent error wrapper (Middle)¶
Problem¶
Add a Permanent error type. Wrapping an error with MarkPermanent(err) should cause the retry loop to surface immediately without retrying.
Solution¶
type Permanent struct{ Err error }
func (p *Permanent) Error() string { return p.Err.Error() }
func (p *Permanent) Unwrap() error { return p.Err }
func MarkPermanent(err error) error { return &Permanent{Err: err} }
// In Retry:
var perm *Permanent
if errors.As(err, &perm) {
return perm.Err
}
Extra credit¶
- Add a Retryable wrapper for the opposite (force a retry even if the classifier says no).
- Detect cycles (Permanent(Permanent(err))).
Task 8: Add a retry budget (Senior)¶
Problem¶
Add a retry budget using golang.org/x/time/rate.Limiter. If the budget is empty when a retry is about to fire, surface a special error ErrBudgetExhausted.
Solution¶
import "golang.org/x/time/rate"
var ErrBudgetExhausted = errors.New("retry budget exhausted")
func RetryWithBudget(ctx context.Context, op func(context.Context) error, maxAttempts int, base, maxDelay time.Duration, budget *rate.Limiter) error {
var lastErr error
for attempt := 0; attempt < maxAttempts; attempt++ {
if err := ctx.Err(); err != nil {
return err
}
err := op(ctx)
if err == nil {
return nil
}
lastErr = err
if attempt < maxAttempts-1 {
if !budget.Allow() {
return fmt.Errorf("%w: %v", ErrBudgetExhausted, lastErr)
}
d := fullJitter(attempt, base, maxDelay)
// sleep with context...
}
}
return fmt.Errorf("after %d attempts: %w", maxAttempts, lastErr)
}
Extra credit¶
- Allow per-operation budgets keyed on a string label.
- Emit a metric on budget denial.
Task 9: Integrate a circuit breaker (Senior)¶
Problem¶
Use github.com/sony/gobreaker to wrap each retry attempt. If the breaker is open, surface immediately without retrying.
Solution¶
import "github.com/sony/gobreaker"
func RetryWithBreaker(ctx context.Context, op func(context.Context) error, breaker *gobreaker.CircuitBreaker, maxAttempts int, base, maxDelay time.Duration) error {
var lastErr error
for attempt := 0; attempt < maxAttempts; attempt++ {
if err := ctx.Err(); err != nil {
return err
}
_, err := breaker.Execute(func() (interface{}, error) {
return nil, op(ctx)
})
if err == nil {
return nil
}
if errors.Is(err, gobreaker.ErrOpenState) {
return err
}
lastErr = err
if attempt < maxAttempts-1 {
d := fullJitter(attempt, base, maxDelay)
// sleep with context...
}
}
return fmt.Errorf("after %d attempts: %w", maxAttempts, lastErr)
}
Extra credit¶
- Avoid breaker thrash: only record one breaker event per retry sequence.
- Add metrics on breaker state changes.
Task 10: Simulate retry strategies (Senior)¶
Problem¶
Write a program that simulates 1000 clients retrying against a server that recovers at t = 5s. Compare full, equal, decorrelated, and no-jitter strategies. Report total time-to-completion and total retries.
Solution¶
package main
import (
"fmt"
"math/rand"
"time"
)
func simulate(strategy string, clients int, base, cap time.Duration, recovery time.Duration, maxAttempts int) (completionTime time.Duration, totalRetries int) {
for c := 0; c < clients; c++ {
var t time.Duration
var prev time.Duration
for attempt := 0; attempt < maxAttempts; attempt++ {
totalRetries++
if t >= recovery {
if t > completionTime {
completionTime = t
}
break
}
var d time.Duration
switch strategy {
case "none":
d = base * time.Duration(1<<attempt)
case "full":
cc := base * time.Duration(1<<attempt)
if cc > cap { cc = cap }
d = time.Duration(rand.Int63n(int64(cc)))
case "equal":
cc := base * time.Duration(1<<attempt)
if cc > cap { cc = cap }
d = cc/2 + time.Duration(rand.Int63n(int64(cc/2)))
case "decorr":
if prev == 0 { prev = base }
upper := prev * 3
if upper > cap { upper = cap }
span := upper - base
if span <= 0 {
d = base
} else {
d = base + time.Duration(rand.Int63n(int64(span)))
}
prev = d
}
if d > cap { d = cap }
t += d
}
}
return
}
func main() {
for _, s := range []string{"none", "full", "equal", "decorr"} {
ct, tr := simulate(s, 1000, 100*time.Millisecond, 5*time.Second, 5*time.Second, 50)
fmt.Printf("%-7s completion=%v retries=%d\n", s, ct, tr)
}
}
Extra credit¶
- Plot the distribution of retry arrivals over time for each strategy.
- Simulate with variable recovery times.
Task 11: Hedged HTTP request (Senior)¶
Problem¶
Write a function HedgedGet(ctx, urls, hedgeDelay) ([]byte, error) that sends a GET to the first URL, and if no response in hedgeDelay, sends to the second URL too. Return the first successful response.
Solution¶
func HedgedGet(ctx context.Context, urls []string, hedgeDelay time.Duration) ([]byte, error) {
type result struct {
body []byte
err error
}
out := make(chan result, len(urls))
ctx, cancel := context.WithCancel(ctx)
defer cancel()
for i, url := range urls {
go func(i int, url string) {
if i > 0 {
t := time.NewTimer(hedgeDelay * time.Duration(i))
select {
case <-t.C:
case <-ctx.Done():
t.Stop()
return
}
}
req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
resp, err := http.DefaultClient.Do(req)
if err != nil {
select {
case out <- result{nil, err}:
case <-ctx.Done():
}
return
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
select {
case out <- result{body, err}:
case <-ctx.Done():
}
}(i, url)
}
var lastErr error
for i := 0; i < len(urls); i++ {
select {
case r := <-out:
if r.err == nil {
return r.body, nil
}
lastErr = r.err
case <-ctx.Done():
return nil, ctx.Err()
}
}
return nil, lastErr
}
Extra credit¶
- Add a hedging budget.
- Add metrics: hedged requests count, hedge wins.
Task 12: Implement idempotency-key handler (Senior)¶
Problem¶
Server-side: write an HTTP handler that processes a request with Idempotency-Key header. On first request, do the work and cache the response. On subsequent requests with the same key, return the cached response without redoing the work. Use Redis.
Solution¶
import "github.com/redis/go-redis/v9"
type Handler struct {
Redis *redis.Client
Op func(ctx context.Context, body []byte) ([]byte, error)
}
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
key := r.Header.Get("Idempotency-Key")
if key == "" {
http.Error(w, "missing key", http.StatusBadRequest)
return
}
cached, err := h.Redis.Get(ctx, "idem:"+key).Bytes()
if err == nil {
w.Write(cached)
return
}
locked, err := h.Redis.SetNX(ctx, "idem:"+key+":lock", "1", 60*time.Second).Result()
if err != nil {
http.Error(w, err.Error(), 500)
return
}
if !locked {
http.Error(w, "in progress", http.StatusConflict)
return
}
defer h.Redis.Del(ctx, "idem:"+key+":lock")
// Re-check the cache: another request may have stored the response
// between our Get miss and acquiring the lock.
if cached, err := h.Redis.Get(ctx, "idem:"+key).Bytes(); err == nil {
w.Write(cached)
return
}
body, _ := io.ReadAll(r.Body)
result, err := h.Op(ctx, body)
if err != nil {
http.Error(w, err.Error(), 500)
return
}
h.Redis.Set(ctx, "idem:"+key, result, 24*time.Hour)
w.Write(result)
}
Extra credit¶
- Detect key reuse with different bodies; return 409.
- Use a proper lock library for distributed correctness.
Task 13: Build a kill switch (Senior)¶
Problem¶
Add a global kill switch that disables retries. Implement via env var or atomic boolean. Check before each retry.
Solution¶
var retriesDisabled atomic.Bool
func init() {
if os.Getenv("RETRIES_DISABLED") == "true" {
retriesDisabled.Store(true)
}
}
func RetryIfEnabled(ctx context.Context, op func(context.Context) error, maxAttempts int) error {
if retriesDisabled.Load() {
return op(ctx)
}
return Retry(ctx, op, maxAttempts, /* ... */)
}
// HTTP endpoint to toggle:
func handleKillSwitch(w http.ResponseWriter, r *http.Request) {
if r.URL.Query().Get("disabled") == "true" {
retriesDisabled.Store(true)
w.Write([]byte("retries disabled"))
} else {
retriesDisabled.Store(false)
w.Write([]byte("retries enabled"))
}
}
Extra credit¶
- Make it per-dependency (kill switch for stripe only, not internal services).
- Integrate with feature-flag service.
Task 14: Add Prometheus metrics (Professional)¶
Problem¶
Instrument the retry helper with Prometheus counters and histograms. Emit: total attempts, attempt-at-success histogram, budget denial count.
Solution¶
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
attempts = promauto.NewCounterVec(
prometheus.CounterOpts{Name: "retry_attempts_total"},
[]string{"outcome"},
)
attemptAtSuccess = promauto.NewHistogram(prometheus.HistogramOpts{
Name: "retry_attempt_at_success",
Buckets: []float64{1, 2, 3, 5, 10},
})
budgetDenied = promauto.NewCounter(prometheus.CounterOpts{
Name: "retry_budget_denied_total",
})
)
func RetryWithMetrics(...) error {
for attempt := 0; attempt < maxAttempts; attempt++ {
err := op(ctx)
if err == nil {
attempts.WithLabelValues("success").Inc()
attemptAtSuccess.Observe(float64(attempt + 1))
return nil
}
attempts.WithLabelValues("failure").Inc()
// ...
if !budget.Allow() {
budgetDenied.Inc()
// ...
}
}
attempts.WithLabelValues("giveup").Inc()
return /* ... */
}
Extra credit¶
- Add per-operation labels.
- Build a dashboard.
Task 15: Add OpenTelemetry spans (Professional)¶
Problem¶
Wrap the retry helper so that each attempt produces a child span under a parent "retry.Do" span. Each attempt's error should be recorded on its span.
Solution¶
import (
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
)
var tracer = otel.Tracer("retry")
func RetryWithTracing(ctx context.Context, op func(context.Context) error, /* ... */) error {
ctx, parent := tracer.Start(ctx, "retry.Do")
defer parent.End()
for attempt := 0; attempt < maxAttempts; attempt++ {
attemptCtx, attemptSpan := tracer.Start(ctx, fmt.Sprintf("attempt.%d", attempt+1))
err := op(attemptCtx)
if err != nil {
attemptSpan.RecordError(err)
}
attemptSpan.End()
if err == nil {
parent.SetAttributes(attribute.Int("success_attempt", attempt + 1))
return nil
}
// ... retry logic ...
}
return /* ... */
}
Extra credit¶
- Add attributes for retry policy parameters.
- Add tracing across an HTTP call.
Task 16: Integrate cenkalti/backoff (Professional)¶
Problem¶
Migrate your custom retry to github.com/cenkalti/backoff/v4. Configure with full-jitter-like behaviour using RandomizationFactor = 1.0, and add context support.
Solution¶
import "github.com/cenkalti/backoff/v4"
func RetryWithBackoff(ctx context.Context, op func() error) error {
b := backoff.NewExponentialBackOff()
b.InitialInterval = 100 * time.Millisecond
b.MaxInterval = 5 * time.Second
b.MaxElapsedTime = 30 * time.Second
b.RandomizationFactor = 1.0
bWithCtx := backoff.WithContext(b, ctx)
return backoff.RetryNotify(op, bWithCtx, func(err error, d time.Duration) {
// log/metric
})
}
Extra credit¶
- Combine with WithMaxRetries.
- Use Permanent to short-circuit non-retryable errors.
Task 17: Honour Retry-After header (Professional)¶
Problem¶
When the server returns 429 or 503 with a Retry-After header, use the header's value (with optional jitter) instead of your computed backoff.
Solution¶
func ParseRetryAfter(h string) (time.Duration, bool) {
if h == "" { return 0, false }
if s, err := strconv.Atoi(h); err == nil {
return time.Duration(s) * time.Second, true
}
if t, err := http.ParseTime(h); err == nil {
return time.Until(t), true
}
return 0, false
}
// In retry loop:
if d, ok := ParseRetryAfter(resp.Header.Get("Retry-After")); ok && d > 0 {
// add some jitter (rand.Int63n panics on a non-positive argument)
if j := int64(d) / 10; j > 0 {
d += time.Duration(rand.Int63n(j))
}
// give up early if the wait would blow past the deadline
if deadline, hasDeadline := ctx.Deadline(); hasDeadline && d > time.Until(deadline) {
return fmt.Errorf("Retry-After %v exceeds deadline: %w", d, context.DeadlineExceeded)
}
sleep = d
} else {
sleep = fullJitter(...)
}
Extra credit¶
- Cap max value (don't sleep > 1 hour even if header says so).
- Test both seconds and HTTP-date formats.
Task 18: End-to-end resilient HTTP client (Professional)¶
Problem¶
Build a resilient HTTP client that combines retry, full jitter, context, retry budget, circuit breaker, Prometheus metrics, and OpenTelemetry tracing. Use it to call a flaky test server.
Solution sketch¶
The full code is the end-to-end example in professional.md. Build:
- A Client struct with *http.Client, backoff.BackOff, *rate.Limiter, *gobreaker.CircuitBreaker.
- A Get(ctx, url) ([]byte, error) method.
- Inside: tracing span, then retry loop, then breaker, then HTTP call.
- Emit metrics throughout.
Extra credit¶
- Test with an httptest.NewServer that returns 503 randomly.
- Tune parameters and observe via metrics.
- Inject a kill switch.
Final Project: Build a full retry SDK¶
Combine all the above into a single library:
- Package: retry.
- API: New(opts ...Option) *Retrier; r.Do(ctx, op) error.
- Options: WithMaxAttempts, WithBase, WithMaxDelay, WithJitter, WithBudget, WithBreaker, WithMetrics.
- Tests: 100% branch coverage.
- Docs: README, examples, godoc.
This is a 1-2 day project for a senior engineer. It is also a great portfolio piece.
Solutions checklist¶
You have completed the tasks when you can:
- Write Tasks 1-7 (junior/middle) from memory.
- Implement all three jitter strategies in 10 minutes.
- Integrate retry with budget, breaker, and context.
- Test your helper with chaos (random failures).
- Emit Prometheus metrics with proper labels.
- Add OpenTelemetry tracing.
- Honour Retry-After.
- Build a kill switch.
If all checked: you are production-ready on retries.