
Handle, Don't Just Check — Middle Level

Table of Contents

  1. Introduction
  2. The Six Decisions, Revisited
  3. The "Decide or Surface" Question, At Every Layer
  4. Retry Mechanics: Backoff, Jitter, Idempotency
  5. Transforming at API Boundaries
  6. The Log-or-Return Rule, In Detail
  7. Recovery Strategies: Fallback, Cache, Degraded Mode
  8. The errWriter Pattern and Errors as State
  9. Errors Across Goroutines
  10. Errors and Context Cancellation
  11. Sentinel + Custom Error Type Patterns
  12. Anti-Patterns That Look Like Handling
  13. Code Review Heuristics
  14. Worked Example: Order Processor
  15. Summary
  16. Further Reading

Introduction

Focus: "Why?" and "When?"

At junior level you learned the decision menu and the "happy path stays straight" idiom. Middle level is where those rules meet a real codebase: a service with five layers, three external dependencies, two goroutine pools, and a public HTTP API. The question shifts from "what do I do at one error site?" to "who in this five-layer chain is responsible for which decision, and where do the rules change?"

This file is the answer set: how to assign responsibility for errors layer by layer, when to retry vs. when to give up, how to translate at boundaries, and how to write recovery patterns that do not silently leak failures.


The Six Decisions, Revisited

A reminder of the menu, with middle-level commentary:

  • Recover: returns a fallback. Good for cache misses, missing config files with defaults, optional features. Be explicit: a comment // missing file is OK; use defaults makes intent visible.
  • Retry: only for idempotent ops on transient errors. Both conditions matter; a retry on a non-idempotent op is a bug, not a fix.
  • Transform: re-express the error in the next layer's language. Storage error → domain error. Domain error → HTTP status. Each translation simplifies for the next reader.
  • Surface: return err — the laziest correct answer. Always pair with a wrap that adds new context, not just the same word.
  • Log: owns the error; nothing else needs to know. Used at boundaries (handler) or in fire-and-forget code paths.
  • Abort: panic — for invariants that must hold, or programmer errors. Never for ordinary failure.

A handler is any function that picks one of these explicitly. A checker is one that always picks "Surface" without thinking.
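
The Recover row in code: a minimal sketch of the explicit-default idiom, where LoadConfig, Config, defaultConfig, and parseConfig are illustrative names, not a real API:

func LoadConfig(path string) (Config, error) {
    data, err := os.ReadFile(path)
    if errors.Is(err, fs.ErrNotExist) {
        return defaultConfig(), nil // missing file is OK; use defaults
    }
    if err != nil {
        return Config{}, fmt.Errorf("read config %s: %w", path, err)
    }
    return parseConfig(data)
}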


The "Decide or Surface" Question, At Every Layer

A typical Go service has roughly four layers:

  ┌──────────────────┐
  │  Transport       │  HTTP/gRPC handlers
  ├──────────────────┤
  │  Application     │  Use cases, command handlers
  ├──────────────────┤
  │  Domain          │  Business rules
  ├──────────────────┤
  │  Infrastructure  │  DB, queue, external API
  └──────────────────┘

Each layer has different information and different responsibility. The handling decision changes with the layer:

  • Infrastructure knows the driver/protocol error. Best default: retry transient errors (idempotent ops only); transform driver errors into sentinels; surface the rest.
  • Domain knows business invariants and sentinels. Best default: surface to the caller; do not log; do not retry (no protocol info).
  • Application knows use-case context and transactional intent. Best default: map sentinels to domain results; choose retry policy at the use-case level.
  • Transport knows the request, user identity, and response format. Best default: translate sentinels to status codes; log once; never panic.

A common smell: the wrong layer making the decision. A storage adapter that retries non-idempotent calls. A domain service that logs. An HTTP handler that swallows. Watch for these in code review.


Retry Mechanics: Backoff, Jitter, Idempotency

A retry is the most "interesting" handling decision because it is the most often misused. Three rules:

  1. Idempotency is mandatory. GET, PUT, DELETE on a known ID — fine. POST that creates a new resource — usually not. Wrap non-idempotent ops in an idempotency key, or do not retry (the key idea is sketched after this list).

  2. Backoff is mandatory. A tight retry loop on a downed downstream service is a self-inflicted DDoS. Standard pattern is exponential backoff with jitter: each retry waits longer, plus a random component to break thundering herds.

  3. A budget is mandatory. Retry forever and the caller's request times out anyway, but with a longer trace. Cap attempts; cap total time; honour the parent context's deadline.
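
The idempotency-key idea from rule 1, sketched. The client picks one key per logical operation and reuses it on every attempt, so the server can deduplicate. Idempotency-Key is a common header convention rather than a universal standard; uuid.NewString is from github.com/google/uuid; createURL, payload, and client (an *http.Client) are illustrative:

key := uuid.NewString() // one key per logical create, reused across retries

createOrder := func(ctx context.Context) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, createURL, bytes.NewReader(payload))
    if err != nil {
        return err
    }
    req.Header.Set("Idempotency-Key", key) // same key on attempt 1 and attempt N
    resp, err := client.Do(req)
    if err != nil {
        return err // transport error: possibly transient
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 500 {
        return fmt.Errorf("create order: status %d", resp.StatusCode)
    }
    return nil
}

With the key fixed, the POST becomes safe to hand to a retry helper like the one below.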

A correct retry helper:

func Retry(ctx context.Context, attempts int, base time.Duration,
    op func(context.Context) error, retryable func(error) bool) error {

    var err error
    for i := 0; i < attempts; i++ {
        if err = op(ctx); err == nil {
            return nil
        }
        if !retryable(err) {
            return err
        }
        if i == attempts-1 {
            break // don't wait after the final attempt
        }
        // Exponential backoff with full jitter: wait in [0, base*2^i)
        d := time.Duration(rand.Int63n(int64(base * (1 << i))))
        select {
        case <-time.After(d):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("after %d attempts: %w", attempts, err)
}

Three things this helper gets right:

  • It checks retryable(err) so you only retry the kinds you should.
  • It uses select { case ... <-ctx.Done() } so a cancelled context aborts the wait.
  • It wraps the final error with the attempt count so logs explain why the request took so long.

Common mistakes in custom retry helpers:

  • time.Sleep instead of select-on-context. The wait blocks past the deadline.
  • No retryable predicate — retries ErrInvalidArgument forever.
  • Linear (not exponential) backoff — does not relieve a struggling service.
  • No jitter — synchronised retries from many clients hammer the upstream simultaneously.

Transforming at API Boundaries

Every API layer has its own vocabulary for failure:

  • HTTP: 4xx/5xx status codes, problem-details JSON
  • gRPC: codes.NotFound, codes.PermissionDenied, etc.
  • Domain (between packages): sentinel errors and typed errors
  • User-facing CLI: a short message and an exit code

The transform decision happens at each boundary. Internal vocabulary leaks are a smell:

// Bad: transport leaks SQL error to client
http.Error(w, "sql: no rows in result set", 500)

// Good: transport translates
if errors.Is(err, sql.ErrNoRows) {
    http.NotFound(w, r)
    return
}

A typical translation table for a CRUD service:

Domain sentinel       HTTP   gRPC
ErrNotFound           404    NotFound
ErrAlreadyExists      409    AlreadyExists
ErrInvalidArgument    400    InvalidArgument
ErrPermissionDenied   403    PermissionDenied
ErrUnauthenticated    401    Unauthenticated
anything else         500    Internal

Implementation:

func httpStatus(err error) int {
    switch {
    case errors.Is(err, ErrNotFound):
        return http.StatusNotFound
    case errors.Is(err, ErrAlreadyExists):
        return http.StatusConflict
    case errors.Is(err, ErrInvalidArgument):
        return http.StatusBadRequest
    case errors.Is(err, ErrPermissionDenied):
        return http.StatusForbidden
    case errors.Is(err, ErrUnauthenticated):
        return http.StatusUnauthorized
    default:
        return http.StatusInternalServerError
    }
}

The boundary owns this map. Adding a new domain sentinel means updating the map exactly once.
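
The same map in gRPC vocabulary: a sketch assuming the sentinels above and the status and codes packages from google.golang.org/grpc:

func grpcError(err error) error {
    switch {
    case err == nil:
        return nil
    case errors.Is(err, ErrNotFound):
        return status.Error(codes.NotFound, "not found")
    case errors.Is(err, ErrAlreadyExists):
        return status.Error(codes.AlreadyExists, "already exists")
    case errors.Is(err, ErrInvalidArgument):
        return status.Error(codes.InvalidArgument, "invalid argument")
    case errors.Is(err, ErrPermissionDenied):
        return status.Error(codes.PermissionDenied, "permission denied")
    case errors.Is(err, ErrUnauthenticated):
        return status.Error(codes.Unauthenticated, "unauthenticated")
    default:
        return status.Error(codes.Internal, "internal error") // never leak internals
    }
}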


The Log-or-Return Rule, In Detail

The rule again: log OR return, not both. Why is breaking it bad?

Each log.Printf("op failed: %v", err) is structured noise. Five layers logging the same error means five entries with similar but not identical wording. Operators learn to scroll past the noise. The actual moment of decision — where it failed and why — is buried.

Two rules-of-thumb to keep the rule honest:

  1. Logs are owned by the boundary. HTTP middleware logs. Worker recovery logs. Background timers log. Internal layers do not log; they wrap and return.

  2. If you have to log inside a layer, you have a reason. "Best-effort cache flush failed" is a reason. "Made a debugging note" is not. Document the reason in the log line itself: log.Printf("best-effort cache flush failed (continuing): %v", err).

Counter-example: a worker pool without a top-level handler logs because it has nowhere to surface to. The worker is the boundary.
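
Rule 1 in code: a sketch of the error-returning handler convention plus a single logging wrapper. The appHandler signature is a common convention, not part of net/http:

type appHandler func(http.ResponseWriter, *http.Request) error

func logErrors(h appHandler) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if err := h(w, r); err != nil {
            log.Printf("%s %s: %v", r.Method, r.URL.Path, err) // the one log line
            http.Error(w, "internal error", http.StatusInternalServerError)
        }
    }
}

Inner layers wrap and return; this wrapper is the only place the error meets the log.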


Recovery Strategies: Fallback, Cache, Degraded Mode

When you decide to recover instead of surfacing, the recovery strategy itself has a small taxonomy:

  • Static default: missing config → use baked-in defaults.
  • Cached value: downstream API down → serve the last successful response.
  • Stale read: DB primary unreachable → read from a (possibly stale) replica.
  • Reduced feature: recommendation service down → return a generic feed, not a personalised one.
  • Skip the step: optional analytics are fire-and-forget → log and continue.

Recovery is rarely "do nothing"; it is choosing a degraded behaviour. The user gets something, even if not the best something. This is what "graceful degradation" means in production.
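
A sketch of the cached-value row, under the assumption that stale data beats no data for this endpoint (Snapshot and fetchLive are illustrative):

type cachedClient struct {
    mu       sync.Mutex
    lastGood Snapshot
    ok       bool
}

func (c *cachedClient) Get(ctx context.Context) (Snapshot, error) {
    s, err := fetchLive(ctx)
    if err == nil {
        c.mu.Lock()
        c.lastGood, c.ok = s, true // refresh the fallback on every success
        c.mu.Unlock()
        return s, nil
    }
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.ok {
        return c.lastGood, nil // degraded: stale but usable
    }
    return Snapshot{}, fmt.Errorf("fetch snapshot: %w", err) // nothing cached yet
}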

A pattern: recover, but mark the response.

type RecResp struct {
    Items   []Item
    Stale   bool
    Reason  string // optional, for debug/observability
}

func recommend(ctx context.Context, userID int) (RecResp, error) {
    items, err := personaliser.Recommend(ctx, userID)
    if err != nil {
        items, err2 := generic.Recommend(ctx)
        if err2 != nil {
            return RecResp{}, fmt.Errorf("recommend %d: personalised %v; generic %w", userID, err, err2)
        }
        return RecResp{Items: items, Stale: true, Reason: "personaliser unavailable"}, nil
    }
    return RecResp{Items: items}, nil
}

The caller learns the response is degraded; observability picks up Stale=true rates. Recovery without observability is silent breakage.


The errWriter Pattern and Errors as State

Long sequences of operations that all return the same error type get tedious:

if _, err := w.Write(a); err != nil { return err }
if _, err := w.Write(b); err != nil { return err }
if _, err := w.Write(c); err != nil { return err }
if _, err := w.Write(d); err != nil { return err }

Rob Pike's "Errors are values" essay introduces the errWriter pattern: capture the error in a struct field, no-op subsequent calls, check at the end:

type errWriter struct {
    w   io.Writer
    err error
}

func (e *errWriter) write(p []byte) {
    if e.err != nil {
        return
    }
    _, e.err = e.w.Write(p)
}

func writeAll(w io.Writer, blocks ...[]byte) error {
    ew := &errWriter{w: w}
    for _, b := range blocks {
        ew.write(b)
    }
    return ew.err
}

Every ew.write checks the prior state and either proceeds or no-ops. The final return ew.err is the single handling site: N checks collapse into one decision point.

The same idea generalises: any state machine that should stop on first error can hold an err field and short-circuit.

type Parser struct {
    in  *bufio.Scanner
    err error
}

func (p *Parser) Token() string {
    if p.err != nil {
        return ""
    }
    if !p.in.Scan() {
        p.err = p.in.Err() // nil at plain EOF: running out of input is not an error
        return ""
    }
    return p.in.Text()
}

func (p *Parser) Err() error { return p.err }

Pattern: errors are state, not control flow. No panic/recover, no exception simulation; just a sticky field.


Errors Across Goroutines

A go f() is fire-and-forget. The launching goroutine cannot recover a panic that happens inside f, and it cannot read an error returned by f. You must build the error path back yourself.

Two standard tools:

errgroup.Group (golang.org/x/sync/errgroup)

import "golang.org/x/sync/errgroup"

func fanOut(ctx context.Context, ids []int) error {
    g, ctx := errgroup.WithContext(ctx)
    results := make([]Result, len(ids))
    for i, id := range ids {
        i, id := i, id // capture loop variables (not needed from Go 1.22)
        g.Go(func() error {
            r, err := fetch(ctx, id)
            if err != nil {
                return fmt.Errorf("fetch %d: %w", id, err)
            }
            results[i] = r
            return nil
        })
    }
    return g.Wait()
}

errgroup collects the first error and cancels the group's context. Other goroutines see ctx.Done() and stop. g.Wait() returns that first error.

Channels for explicit collection

type result struct {
    id  int
    err error
}

ch := make(chan result, len(ids))
for _, id := range ids {
    go func(id int) {
        ch <- result{id, work(id)}
    }(id)
}
var firstErr error
for i := 0; i < len(ids); i++ {
    r := <-ch
    if r.err != nil && firstErr == nil {
        firstErr = r.err
    }
}

Use channels when you need all errors, not just the first; pair with errors.Join (Go 1.20+) to combine.
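
The same collection loop, keeping every failure instead of only the first — a sketch using errors.Join:

var errs []error
for i := 0; i < len(ids); i++ {
    if r := <-ch; r.err != nil {
        errs = append(errs, fmt.Errorf("id %d: %w", r.id, r.err))
    }
}
err := errors.Join(errs...) // nil when errs is empty; errors.Is/As still see each joined error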

Always recover panics in goroutines you spawn

go func() {
    defer func() {
        if r := recover(); r != nil {
            log.Printf("worker panic: %v\n%s", r, debug.Stack())
        }
    }()
    work()
}()

A panic in an unrecovered goroutine crashes the entire process. This is one of the most common causes of Go production crashes. Every go you write should have a recovery, unless you explicitly want crash-on-panic semantics for that worker.


Errors and Context Cancellation

context.Context introduces two special errors: context.Canceled and context.DeadlineExceeded. Both are expected, not failures. Treat them as a successful "stop" signal:

if err := op(ctx); err != nil {
    if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
        return nil // not an error: caller asked us to stop
    }
    return fmt.Errorf("op: %w", err)
}

Or, more often, surface them but not as alarms:

if err := op(ctx); err != nil {
    if ctx.Err() != nil {
        return ctx.Err() // surface canceled/deadline as-is
    }
    return fmt.Errorf("op: %w", err) // wrap real failures
}

Why does this matter? A monitoring dashboard that counts "errors" should not flag context cancellations as alarming. Filter by errors.Is(err, context.Canceled) and exclude.
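
In metric-recording code, that filter looks something like this (errCounter is a hypothetical counter, not a specific metrics API):

if err != nil &&
    !errors.Is(err, context.Canceled) &&
    !errors.Is(err, context.DeadlineExceeded) {
    errCounter.Inc() // count only real failures, not client hang-ups
}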

select patterns make cancellation explicit:

select {
case res := <-resultCh:
    return res, nil
case <-ctx.Done():
    return zero, ctx.Err() // zero: the zero value of your result type
}

Worker goroutines that receive a cancelled context should clean up and return — not panic, not log "weird state".
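
A sketch of such a worker loop (Job and handle are illustrative):

func worker(ctx context.Context, jobs <-chan Job) error {
    for {
        select {
        case <-ctx.Done():
            return ctx.Err() // clean stop: the caller asked us to quit
        case j, ok := <-jobs:
            if !ok {
                return nil // channel closed: no more work
            }
            if err := handle(ctx, j); err != nil {
                return fmt.Errorf("job %v: %w", j.ID, err)
            }
        }
    }
}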


Sentinel + Custom Error Type Patterns

Two complementary patterns for layering decisions:

Sentinels for kind

var (
    ErrNotFound      = errors.New("not found")
    ErrAlreadyExists = errors.New("already exists")
    ErrInvalidInput  = errors.New("invalid input")
)

Used with errors.Is. The boundary translates each into a status code.

Custom types for data

type ValidationError struct {
    Field  string
    Reason string
}

func (e *ValidationError) Error() string {
    return fmt.Sprintf("validation: %s: %s", e.Field, e.Reason)
}

Used with errors.As. The boundary inspects fields:

var ve *ValidationError
if errors.As(err, &ve) {
    http.Error(w, fmt.Sprintf("invalid %s: %s", ve.Field, ve.Reason), 400)
    return
}

A typical service uses sentinels for kinds (small finite set) and custom types when you need to carry data (multi-field, structured).
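
The two patterns also compose: a custom type can satisfy errors.Is for a sentinel by implementing an Is method, so callers check the kind with errors.Is and dig out fields with errors.As. A sketch, assuming the ErrInvalidInput sentinel above:

// Is makes errors.Is(err, ErrInvalidInput) true for any *ValidationError.
func (e *ValidationError) Is(target error) bool {
    return target == ErrInvalidInput
}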

Anti-pattern: a single megastruct

type AppError struct {
    Code    int
    Message string
    Cause   error
    Stack   []byte
    Trace   string
    UserID  int
    // ... 20 more fields
}

Dumping every possible piece of metadata into one struct couples every layer to the same shape. Prefer many small types or sentinels + wrapping.


Anti-Patterns That Look Like Handling

Anti-pattern 1: Log and rethrow chains

if err != nil {
    log.Printf("foo failed: %v", err)
    return err
}

Five layers of this and your log is full of "foo failed: bar failed: baz failed: i/o timeout". Pick a layer.

Anti-pattern 2: Swallow with _

data, _ := io.ReadAll(resp.Body)

If ReadAll fails, data is whatever was read so far — possibly empty. The error is gone. The bug is invisible until users complain about empty responses.

Anti-pattern 3: Always wrap, never inspect

return fmt.Errorf("op: %w", err)  // every layer

Mechanical wrapping without thinking is the same as mechanical checking. Wrap when the next layer cannot reconstruct the context. Skip the wrap when it adds nothing.

Anti-pattern 4: Generic catch-all

defer func() {
    if r := recover(); r != nil {
        log.Println("something went wrong")
    }
}()

A panic with no detail in the log is worse than a crash. Always include the panic value and stack: log.Printf("panic %v\n%s", r, debug.Stack()).

Anti-pattern 5: Ignoring Close() errors

defer file.Close()

For files you wrote, Close may report buffer-flush errors. Silently dropping them means "your data is on disk" can be a lie. Pattern:

err := write(file)
if cerr := file.Close(); err == nil {
    err = cerr // report the Close (flush) error only if the write itself succeeded
}
return err

Anti-pattern 6: Re-panic with stripped info

if err := work(); err != nil {
    panic(err.Error())
}

Converts a typed error into a string. The recovery side cannot errors.Is or errors.As anymore. If you must escalate to panic, panic with the original error.

Anti-pattern 7: Returning errors from Close-only methods

If your "Close-like" method's only failure mode is "could not flush", consider whether the caller can do anything with that information. Often the caller cannot. Either way, document what the error means.


Code Review Heuristics

A short list of questions to ask in PR review:

  1. Does this if err != nil block make a decision, or is it a reflex? If reflex, ask the author what they expect to happen.
  2. Is the wrap message informative? "load user 42" yes; "error" no.
  3. Is this layer logging and returning? Pick one.
  4. Is this retry actually safe? Idempotent + transient + bounded?
  5. Does this recover log debug.Stack()? Without it, where the panic happened is lost.
  6. Are domain sentinels translated at the boundary? Or do storage errors leak to clients?
  7. Are context cancellations distinguished from real failures? Otherwise alarms fire on every shutdown.
  8. Are spawned goroutines protected by a recover? A panic kills the process.
  9. Is the happy path at the left margin? If indentation grows past 2 levels, look hard.
  10. Are Close() errors captured for files you wrote to? If so, where does that error end up?

A reviewer checklist catches more bugs than any tool.


Worked Example: Order Processor

Putting it all together — an order processor with all six decisions visible:

package orders

import (
    "context"
    "database/sql"
    "errors"
    "fmt"
    "log"
    "time"
)

var (
    ErrOrderNotFound = errors.New("order not found")
    ErrAlreadyPaid   = errors.New("already paid")
)

type Service struct {
    db      *sql.DB
    payment Payment
}

type Payment interface {
    Charge(ctx context.Context, orderID string, amount int) error
}

func (s *Service) ProcessOrder(ctx context.Context, orderID string) error {
    // 1. Fetch — surface or transform DB error
    order, err := s.fetchOrder(ctx, orderID)
    if err != nil {
        return err // already wrapped/transformed inside fetchOrder
    }

    // 2. Idempotency — recover (do nothing, return success)
    if order.Paid {
        return nil // already paid; this is success, not error
    }

    // 3. Charge — retry transient, surface permanent
    if err := s.chargeWithRetry(ctx, order); err != nil {
        return fmt.Errorf("process order %s: %w", orderID, err)
    }

    // 4. Mark paid — log on failure, do not surface
    if err := s.markPaid(ctx, orderID); err != nil {
        // We took the money. The reconciler will fix this row.
        // Surfacing now would tell the caller "failed" after success.
        log.Printf("WARN: could not mark order %s paid: %v", orderID, err)
    }
    return nil
}

func (s *Service) fetchOrder(ctx context.Context, id string) (Order, error) {
    row := s.db.QueryRowContext(ctx, "SELECT id, amount, paid FROM orders WHERE id=?", id)
    var o Order
    if err := row.Scan(&o.ID, &o.Amount, &o.Paid); err != nil {
        if errors.Is(err, sql.ErrNoRows) {
            return Order{}, ErrOrderNotFound // transform: sql -> domain
        }
        return Order{}, fmt.Errorf("fetch order %s: %w", id, err) // surface
    }
    return o, nil
}

func (s *Service) chargeWithRetry(ctx context.Context, o Order) error {
    return Retry(ctx, 3, 100*time.Millisecond,
        func(ctx context.Context) error {
            return s.payment.Charge(ctx, o.ID, o.Amount)
        },
        IsTransient,
    )
}

func (s *Service) markPaid(ctx context.Context, id string) error {
    _, err := s.db.ExecContext(ctx, "UPDATE orders SET paid=1 WHERE id=?", id)
    return err
}

type Order struct {
    ID     string
    Amount int
    Paid   bool
}

func IsTransient(err error) bool {
    // Real code: check for connection reset, 503, deadline.
    return false
}

Walk through the decisions:

  • fetchOrder: transform sql.ErrNoRows to ErrOrderNotFound; surface others with context.
  • ProcessOrder: the early return on Paid is a recover (idempotent skip).
  • chargeWithRetry: retry transient, surface permanent (the helper handles both).
  • markPaid: log if it fails; do not surface, because the money is already taken.
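
The IsTransient stub is where the retry policy lives. A sketch of what it might check; the right predicate is service-specific, and these signals are common but not exhaustive:

func IsTransient(err error) bool {
    var ne net.Error
    if errors.As(err, &ne) && ne.Timeout() {
        return true // network timeout: worth another attempt
    }
    // Connection-level failures often clear up on retry.
    return errors.Is(err, syscall.ECONNRESET) || errors.Is(err, syscall.ECONNREFUSED)
}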

The HTTP handler that calls ProcessOrder then translates:

func chargeHandler(s *Service) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        id := r.PathValue("id")
        err := s.ProcessOrder(r.Context(), id)
        switch {
        case err == nil:
            w.WriteHeader(http.StatusOK)
        case errors.Is(err, ErrOrderNotFound):
            http.Error(w, "order not found", http.StatusNotFound)
        case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded):
            // client gave up; do not flag as our error
            return
        default:
            log.Printf("ProcessOrder %s: %v", id, err)
            http.Error(w, "internal error", http.StatusInternalServerError)
        }
    }
}

Each layer makes its decision in its own language. No layer logs and returns. Cancellations are handled separately. Internal details never reach the client.


Summary

Middle-level Go error handling is layer-aware. Each layer has different information and different responsibility, and the right decision depends on which layer you are in. Retries belong where idempotency is known. Translations belong at boundaries. Logging belongs at the layer that owns the error — exactly one. Recovery is a strategy, not a no-op: cache, fallback, degraded mode, with observability so silent failures cannot hide. The errWriter pattern collapses long sequences of checks into one decision point. Goroutines need their own recover and their own error-collection plumbing. Context cancellation is an expected stop signal, not an alarm.

Most of what makes a Go service feel solid is the discipline of these middle-level patterns — applied consistently, every PR.


Further Reading