Handle, Don't Just Check — Senior Level¶
Table of Contents¶
- Introduction
- Error Handling as an Architectural Property
- Building a Boundary-Aware Error Strategy
- Idempotency at Scale
- Circuit Breakers and Bulkheads
- Degraded Mode Design
- Error Budgets and SLOs
- Cross-Service Error Propagation
- Saga and Compensation Patterns
- Observability of Handling Decisions
- Architectural Anti-Patterns
- Worked Example: Multi-Service Checkout
- Cheney vs. Exception-Based Languages
- Code Review at Senior Level
- Summary
- Further Reading
Introduction¶
Focus: "How to optimize?" and "How to architect?"
At senior level, "handle, don't just check" is not a coding style; it is a system property. The question is no longer "did this PR write the right if err != nil block" but "does this service produce errors that the next service over can act on" and "can the on-call engineer in 2026 actually do something with this log line at 3 AM."
The decisions that shape error handling at architectural level are made before code is written: where errors are translated, where retries happen, what counts as a 'transient' failure, how the system degrades when one dependency goes dark. Get these right and individual if err != nil lines are mostly mechanical. Get them wrong and no amount of careful local handling will save you.
This file is about the architectural side of Cheney's principle: how to build services where every error has an owner and every failure has a graceful response.
Error Handling as an Architectural Property¶
Three properties separate a system that handles errors from one that merely checks them at scale:
- Owners. Every error has exactly one layer responsible for deciding what to do. No error is logged twice; no error is swallowed; no error is surfaced through five layers without a single decision.
- Vocabularies. Each layer speaks its own language for failure (driver errors → sentinels → domain errors → status codes). Translation happens at boundaries deliberately, not by accident.
- Failure modes. The system has a documented response to every dependency outage: cache miss → degraded mode; auth provider down → cached tokens; analytics down → fire-and-forget.
When you join a service that gets these three right, you can debug any failure in minutes. When you join one that gets them wrong, every incident is a forensic project.
Building a Boundary-Aware Error Strategy¶
A boundary is anywhere the meaning of "error" changes. Common boundaries:
| Boundary | Translation |
|---|---|
| Storage adapter → domain | sql.ErrNoRows → ErrNotFound |
| Domain → application | Wrap with use-case context |
| Application → transport | Sentinel → HTTP status / gRPC code |
| Transport → user | Generic message; details to log |
| Worker → scheduler | Result + metric |
| Service A → Service B | gRPC status.Status with code + message |
Senior teams document the error model for the service: a small list of public sentinels, the rules for when to wrap, and the boundary mapping table. New developers refer to this document; PR review enforces it.
Example: Order Service Error Model
PUBLIC SENTINELS (returned from public methods):
ErrOrderNotFound -> 404, NotFound
ErrOrderAlreadyPaid -> 409, AlreadyExists
ErrInvalidAmount -> 400, InvalidArgument
ErrPaymentDeclined -> 402, FailedPrecondition
WRAPPING RULES:
- Storage adapters: wrap driver errors with "op: %w" and translate
sql.ErrNoRows / driver.ErrBadConn / pgx-specific to sentinels above.
- Domain layer: never wrap; sentinels pass through unchanged.
- Application layer: wrap with use case ("ChargeOrder %s: %w").
- Transport: do not wrap; map sentinel → status code.
LOGGING:
- Top-level handler logs once, structured, with request_id and trace_id.
- All other layers return without logging.
- Worker recovery logs panic + stack and continues / restarts.
Three pages of conventions; saves three months of confusion.
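In code, the mapping half of that document collapses into one table-driven function, so no transport handler invents its own translation. A minimal sketch, assuming the sentinels live in an order package:
package order
import (
	"errors"
	"net/http"
	"google.golang.org/grpc/codes"
)
// Public sentinels from the error model above.
var (
	ErrOrderNotFound    = errors.New("order not found")
	ErrOrderAlreadyPaid = errors.New("order already paid")
	ErrInvalidAmount    = errors.New("invalid amount")
	ErrPaymentDeclined  = errors.New("payment declined")
)
// statusFor is the boundary mapping table in one place. Transport
// handlers call it instead of rolling their own switch.
func statusFor(err error) (httpCode int, grpcCode codes.Code) {
	switch {
	case errors.Is(err, ErrOrderNotFound):
		return http.StatusNotFound, codes.NotFound
	case errors.Is(err, ErrOrderAlreadyPaid):
		return http.StatusConflict, codes.AlreadyExists
	case errors.Is(err, ErrInvalidAmount):
		return http.StatusBadRequest, codes.InvalidArgument
	case errors.Is(err, ErrPaymentDeclined):
		return http.StatusPaymentRequired, codes.FailedPrecondition
	default:
		return http.StatusInternalServerError, codes.Internal
	}
}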
Idempotency at Scale¶
Retry is the most common failure-recovery technique in distributed systems, and idempotency is its prerequisite. A senior service is built to be retry-safe; idempotency is not bolted on.
Patterns to make operations idempotent¶
- Idempotency keys. The client sends a unique key with each request. The server stores keys → results. A duplicate key returns the stored result without redoing the work. Stripe, Square, and many financial APIs work this way.
- Conditional updates. "Update if version = 5" — a second update with the same version matches nothing and fails harmlessly (see the sketch after this list).
- Upserts with deterministic IDs. Generate the resource ID on the client (a UUID); repeated inserts with the same ID are no-ops.
- Idempotent state transitions. "Mark paid" is naturally idempotent — applying it twice has the same result as once. "Add 5 dollars" is not — guard it with a transaction or a version check.
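A minimal sketch of the conditional-update pattern, assuming an orders table with a version column; ErrStaleVersion is a hypothetical sentinel:
func markPaid(ctx context.Context, db *sql.DB, orderID string, version int) error {
	// The WHERE clause is the guard: a repeated or concurrent update
	// carrying a stale version matches zero rows and changes nothing.
	res, err := db.ExecContext(ctx,
		`UPDATE orders SET status = 'paid', version = version + 1
		 WHERE id = $1 AND version = $2`,
		orderID, version)
	if err != nil {
		return fmt.Errorf("mark paid %s: %w", orderID, err)
	}
	if n, err := res.RowsAffected(); err == nil && n == 0 {
		return ErrStaleVersion // hypothetical sentinel: nothing matched
	}
	return nil
}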
Implementation: idempotency-key middleware¶
type IdempotencyStore interface {
Get(ctx context.Context, key string) (Response, bool, error)
Put(ctx context.Context, key string, resp Response) error
}
func IdempotencyMiddleware(s IdempotencyStore) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
key := r.Header.Get("Idempotency-Key")
if key == "" {
next.ServeHTTP(w, r)
return
}
if cached, ok, err := s.Get(r.Context(), key); err == nil && ok {
writeCached(w, cached)
return
}
// capture response, then store on success
recorder := newRecorder(w)
next.ServeHTTP(recorder, r)
if recorder.status < 500 {
_ = s.Put(r.Context(), key, recorder.snapshot())
}
})
}
}
The middleware turns retries into safe operations. The server can then retry from any layer without worrying about double-charging the customer.
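Hypothetical wiring, assuming a Redis-backed store; NewRedisIdempotencyStore and chargeHandler are illustrative names, and anything satisfying IdempotencyStore works:
store := NewRedisIdempotencyStore(rdb, 24*time.Hour) // hypothetical constructor
mux := http.NewServeMux()
mux.HandleFunc("/charge", chargeHandler)
log.Fatal(http.ListenAndServe(":8080", IdempotencyMiddleware(store)(mux)))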
Why this matters for handling¶
Without idempotency, you cannot retry. Without retry, transient errors must be surfaced as failures. Surfacing every transient failure is itself a failure mode — the user sees a 500 for what was a 100ms blip. Idempotency converts a class of "must surface" errors into "can recover".
Circuit Breakers and Bulkheads¶
Two classic patterns from Michael Nygard's Release It!, both about not making a bad situation worse.
Circuit breaker¶
A breaker has three states:
| State | Behaviour |
|---|---|
| Closed | Calls flow normally; failures are counted. |
| Open | After N failures in window, calls fail immediately without hitting the dependency. |
| Half-open | After cooldown, allow a probe call. Success → closed. Failure → open. |
The point: stop hammering a downstream that is already struggling. A retry without a breaker turns a struggling service into a dead one.
type Breaker struct {
mu sync.Mutex
state int // closed=0, open=1, half=2
fails int
opened time.Time
th int // failure threshold
cool time.Duration // cooldown before half-open
}
func (b *Breaker) Do(op func() error) error {
b.mu.Lock()
if b.state == 1 {
if time.Since(b.opened) > b.cool {
b.state = 2 // half-open
} else {
b.mu.Unlock()
return ErrBreakerOpen
}
}
b.mu.Unlock()
err := op()
b.mu.Lock()
defer b.mu.Unlock()
if err != nil {
b.fails++
if b.fails >= b.th {
b.state = 1
b.opened = time.Now()
}
return err
}
b.fails = 0
b.state = 0
return nil
}
Real implementations (Sony's gobreaker, Hystrix) handle metrics and concurrency more carefully, but the shape is this.
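Hypothetical usage, pairing the breaker with a degraded-mode fallback; authClient and verifyFromCache are assumed helpers:
err := authBreaker.Do(func() error {
	return authClient.Verify(ctx, token)
})
if errors.Is(err, ErrBreakerOpen) {
	// The breaker already made the decision: auth is known-down,
	// so fall back instead of waiting on another timeout.
	return verifyFromCache(token)
}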
Bulkhead¶
Isolate failure domains. If service A and service B are both behind one connection pool, A going slow saturates the pool and B suffers too. Separate pools (separate "bulkheads") keep failures contained.
// Bad: shared pool
sharedClient := http.Client{Transport: sharedTransport}
// Good: bulkheads, one bounded pool per dependency
authClient := http.Client{Transport: &http.Transport{MaxConnsPerHost: 10}}
paymentClient := http.Client{Transport: &http.Transport{MaxConnsPerHost: 50}}
When the auth service goes slow, payment requests are unaffected.
Where these fit¶
These patterns convert otherwise-unhandled cascades into handled failures: an open breaker is a decision — "we know auth is down, we will not even try". The caller gets ErrBreakerOpen and decides what to do (degraded mode, error to user, retry later).
Degraded Mode Design¶
A senior service has explicit modes:
| Mode | Description |
|---|---|
| Normal | All dependencies healthy, full feature set. |
| Reduced | Some dependencies degraded; fall back to caches, defaults, generic responses. |
| Read-only | Database write path failing; serve reads only. |
| Maintenance | Operator-flipped; small static page. |
Each mode is implemented as a set of fallback decisions in the handler layer, often gated by feature flags or health checks.
func recommendHandler(deps *Deps) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
if deps.RecommenderHealth.IsDegraded() {
renderJSON(w, deps.GenericFeed.Latest()) // recover: degraded
return
}
items, err := deps.Recommender.For(r.Context(), userID(r))
if err != nil {
log.Printf("recommender error: %v", err)
renderJSON(w, deps.GenericFeed.Latest()) // recover: emergency
return
}
renderJSON(w, items)
}
}
Two recovery paths: one proactive (the breaker says we are degraded), one reactive (the call failed). Both end at the same fallback so the user gets something either way.
The architectural rule: prefer "users see something less" to "users see an error". Most non-transactional endpoints can degrade; transactional ones (payments, account changes) generally must fail loudly.
Error Budgets and SLOs¶
A service-level objective (SLO) is a target for reliability: 99.9% of requests succeed, p99 latency < 300ms. The complement is the error budget: 0.1% of requests are allowed to fail. The error budget is what lets you risk anything.
Why SLOs matter for handling¶
Once you have an error budget, "handle the error" becomes a budget decision:
- Burning budget fast? Make handling more conservative — bigger circuit breaker thresholds, less aggressive retries, slower deploys.
- Plenty of budget left? Take more risk — feature flags, canary rollouts, even chaos experiments.
The decision of whether to surface an error to the user depends on what doing so costs in error-budget terms. A 5% transient failure rate that retries to 0.5% real failure is a budget choice.
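The arithmetic is small enough to sketch. Assuming a 99.9% availability SLO, a burn rate of 1.0 means the budget is being consumed exactly as fast as it accrues:
const slo = 0.999 // 99.9% of requests succeed
// burnRate compares the observed failure ratio to the budget (1 - slo).
// Sustained values above 1.0 call for conservative handling; values
// well below it leave room for risk.
func burnRate(failed, total float64) float64 {
	if total == 0 {
		return 0
	}
	return (failed / total) / (1 - slo)
}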
Implementation: SLI middleware¶
Service-Level Indicators are the metrics that feed the SLO:
func sliMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
rec := &statusRecorder{ResponseWriter: w, status: 200}
start := time.Now()
next.ServeHTTP(rec, r)
elapsed := time.Since(start)
success := rec.status < 500
sliCounter.WithLabelValues(strconv.FormatBool(success)).Inc()
sliLatency.WithLabelValues(r.URL.Path).Observe(elapsed.Seconds())
})
}
Errors that are expected (4xx) are not budget-burning; errors that are internal (5xx) are. The middleware enforces this distinction.
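The statusRecorder is the usual small wrapper; a minimal version (handlers that never call WriteHeader keep the 200 default set above):
type statusRecorder struct {
	http.ResponseWriter
	status int
}
// WriteHeader captures the code before delegating, so the middleware
// can classify the response after the handler returns.
func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}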
Cross-Service Error Propagation¶
In a distributed system, the question "where to handle?" extends across processes. A gRPC client calling a gRPC server: which side handles which error?
Standard pattern¶
| Failure | Handled by |
|---|---|
| Network connectivity | Client retries (server didn't see the call). |
| Server returned Unavailable | Client retries. |
| Server returned InvalidArgument | Client surfaces to its caller — fixing input is upstream. |
| Server returned Internal | Client may retry once, then surface. Server logged the cause. |
| Client deadline exceeded | Client surfaces as timeout to its caller. |
The server records the cause; the client records the request. Each side knows half of the story.
gRPC error mapping¶
import (
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
)
func (s *Server) GetUser(ctx context.Context, req *pb.GetUserReq) (*pb.User, error) {
u, err := s.repo.Get(ctx, req.Id)
if err != nil {
switch {
case errors.Is(err, ErrUserNotFound):
return nil, status.Error(codes.NotFound, "user not found")
case errors.Is(err, ErrPermissionDenied):
return nil, status.Error(codes.PermissionDenied, "")
default:
log.Printf("GetUser %d: %v", req.Id, err)
return nil, status.Error(codes.Internal, "internal error")
}
}
return toProto(u), nil
}
The boundary is the gRPC handler. Domain errors stop being domain errors and become status.Status. Internal details never cross.
Client side¶
u, err := client.GetUser(ctx, &pb.GetUserReq{Id: 42})
if err != nil {
s, _ := status.FromError(err)
switch s.Code() {
case codes.NotFound:
return ErrUserNotFound // re-translate at this side's domain
case codes.Unavailable:
return ErrTransient // retry candidate
default:
return fmt.Errorf("get user 42: %w", err)
}
}
The client re-translates gRPC codes back into its own domain vocabulary. Each service has its own dialect; the wire is the lingua franca.
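The retry half of the table above can live in one client-side wrapper. A sketch, assuming the wrapped call is idempotent:
func withRetry(ctx context.Context, attempts int, call func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil {
			return nil
		}
		// Only Unavailable is a retry candidate; everything else
		// surfaces immediately, per the table above.
		if s, ok := status.FromError(err); !ok || s.Code() != codes.Unavailable {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(100<<i) * time.Millisecond):
		}
	}
	return err
}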
Saga and Compensation Patterns¶
In a distributed system that crosses transactional boundaries, you cannot wrap five microservice calls in one transaction. The standard answer is a saga: a sequence of local transactions, each with a compensating action that undoes it.
Pattern¶
1. Reserve inventory → on failure: nothing to undo
2. Charge payment → on failure: release inventory
3. Ship → on failure: refund payment, release inventory
4. Mark complete → on failure: try again (idempotent)
Each step is an operation; each operation has an explicit compensator. The orchestrator (or a choreography) knows which compensator to run based on which step failed.
Implementation sketch¶
type Step struct {
Do func(ctx context.Context) error
Undo func(ctx context.Context) error
}
func RunSaga(ctx context.Context, steps []Step) error {
var done []Step
for _, s := range steps {
if err := s.Do(ctx); err != nil {
// run compensators in reverse order
for i := len(done) - 1; i >= 0; i-- {
if cerr := done[i].Undo(ctx); cerr != nil {
log.Printf("compensation failed: %v", cerr)
// record for manual reconciliation
}
}
return fmt.Errorf("saga failed at step %d: %w", len(done), err)
}
done = append(done, s)
}
return nil
}
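Hypothetical wiring for the checkout steps above; inv and pay are assumed service clients:
err := RunSaga(ctx, []Step{
	{
		Do:   func(ctx context.Context) error { return inv.Reserve(ctx, orderID) },
		Undo: func(ctx context.Context) error { return inv.Release(ctx, orderID) },
	},
	{
		Do:   func(ctx context.Context) error { return pay.Charge(ctx, orderID) },
		Undo: func(ctx context.Context) error { return pay.Refund(ctx, orderID) },
	},
})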
The handling decision at each step is not "surface the error" — it is "compensate then surface". Surfacing an error after a partial commit is a form of swallowing: the user gets back a 500, but the world is in an inconsistent state.
Saga vs 2PC¶
Two-phase commit guarantees atomicity but requires a coordinator and is not available across heterogeneous services. Sagas accept eventual consistency in exchange for autonomy — each service handles its own errors locally and the saga handles the rollback.
Observability of Handling Decisions¶
A handled error should leave a trace. Three ways to make handling visible:
1. Structured logs with the decision recorded¶
slog.Info("recovered",
"decision", "fallback",
"reason", "personaliser_unavailable",
"user_id", userID,
"error_kind", "timeout",
)
Now you can query: "how often did we fall back yesterday because of personaliser timeouts?"
2. Metrics on decisions¶
// Counter labelled by decision
errorDecisions.WithLabelValues("retry", "transient").Inc()
errorDecisions.WithLabelValues("recover", "fallback").Inc()
errorDecisions.WithLabelValues("surface", "domain").Inc()
A dashboard of decisions by kind tells you whether your retry policy is doing useful work or simply hiding a real outage.
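With Prometheus, errorDecisions is a CounterVec keyed by the two labels used above; the metric name is an assumption:
var errorDecisions = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "error_decisions_total",
		Help: "Error-handling decisions by decision and reason.",
	},
	[]string{"decision", "reason"},
)
func init() {
	prometheus.MustRegister(errorDecisions)
}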
3. Traces with span events¶
span.AddEvent("retry", trace.WithAttributes(
attribute.String("reason", "transient"),
attribute.Int("attempt", 2),
))
When investigating a slow trace, the events tell you the request retried twice — that 800ms came from waiting, not from work.
Why decisions need observability¶
A retry that always succeeds on the second try is fine in normal times. But if 30% of your calls retry, the upstream is trending bad and your latency has doubled silently. Without visible decisions, you only see the symptoms.
Architectural Anti-Patterns¶
1. Catch-all middleware that hides domain meaning¶
Every error or panic that reaches the catch-all becomes a 500, regardless of whether it was a NotFound, a validation failure, or a genuine bug. Translation must happen before the catch-all recovery.
2. Retry policies set by copy-paste¶
Every team copies the same retry helper but adjusts the constants (3 retries, 100ms backoff). Stack three such services in a call chain and the attempts multiply: up to 3 × 3 × 3 = 27 downstream calls for one user request, each with backoff; the latency budget becomes many seconds. Multiplicative retries are a known killer.
Solution: enforce one retry layer per request graph. Inner services do not retry; outermost does.
3. Logging the same error in every layer¶
Already discussed; it deserves repeating because it scales catastrophically. Fifty engineers logging "from habit" produce a log volume no SRE team can triage.
4. Generic user-facing error messages¶
"internal server error" is fine as the body of a 500; the status code already tells the user the problem is server-side. But a bare "error" on a 4xx tells the user nothing; they cannot fix input they are not told about.
5. No degraded mode¶
Every dependency is required. A 50ms blip in a non-critical service surfaces as a user-visible failure. The architecture has no graceful degradation.
6. Sharing connection pools across criticality¶
Auth and search share a pool. Search slows; auth waits behind it; logins fail. Bulkheads exist for this.
7. Custom panic handlers that re-throw¶
A worker recovers a panic, "logs it", then exits the goroutine, silently losing the worker. Or it re-panics on a goroutine the parent does not own, crashing the process from an unexpected place. Both are silent breakage; a safer shape is sketched below.
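A shape that avoids both failure modes: recover, log once with the stack, then restart deliberately. A sketch; the restart-forever policy is an assumption:
func runWorker(ctx context.Context, work func(context.Context) error) {
	go func() {
		defer func() {
			if p := recover(); p != nil {
				// Log once with the stack, then restart deliberately
				// instead of silently losing the worker.
				log.Printf("worker panic: %v\n%s", p, debug.Stack())
				if ctx.Err() == nil {
					runWorker(ctx, work)
				}
			}
		}()
		for ctx.Err() == nil {
			if err := work(ctx); err != nil {
				log.Printf("worker: %v", err) // handled: log and continue
			}
		}
	}()
}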
Worked Example: Multi-Service Checkout¶
A realistic distributed checkout, with handling decisions explicit:
[Frontend] -> [Checkout] -> [Inventory] -> [Payment] -> [Fulfillment]
| | | |
| local DB Stripe API local DB
| | | |
+-- saga orchestrator with compensators ---+
Decisions per service:
| Service | Decisions |
|---|---|
| Frontend | Retry on Unavailable. Surface validation errors immediately. Show degraded UI on full outage. |
| Checkout (orchestrator) | Run saga. On partial failure → run compensators → surface. Log once with saga ID. |
| Inventory | Reserve is idempotent (key = order ID). Retry on transient. Sentinel ErrOutOfStock. |
| Payment | Idempotency key required (Stripe API). Retry on transient. Sentinel ErrCardDeclined. |
| Fulfillment | Mark in DB; idempotent on order ID. Retry until success. |
Failure modes:
| Failure | Response |
|---|---|
| Inventory service down | Frontend shows "checkout temporarily unavailable" — no compensation needed (no work started). |
| Inventory reserved, Payment service times out | Compensate: release reservation. User sees "payment failed, please retry". |
| Payment succeeded, Fulfillment service down | Saga retries Fulfillment. The payment is done; the user is told "your order is confirmed". The saga keeps trying in the background. |
| Inventory + Payment succeed, Fulfillment fails permanently | Compensate: refund payment, release inventory. User sees "we could not complete your order; refund issued". |
Each failure is handled, not just checked. The saga has a decision for every step's failure; the orchestrator has a single owner for the whole flow; each service translates its own errors at its boundary.
That kind of design — failure modes designed before the happy path is fully built — is what senior-level error handling looks like.
Cheney vs. Exception-Based Languages¶
A frequent question: isn't Java's try/catch easier?
| Aspect | Go (errors as values) | Java (exceptions) |
|---|---|---|
| Failure visibility | Explicit in every signature | Hidden unless declared in a checked throws clause |
| Default handling | Forced to think (no implicit propagation) | Default propagation up the stack |
| Cost | Free per check; allocations per wrap | Stack capture per throw; ~µs |
| Ease of "ignore" | _ = err makes it visible | try { ... } catch (Exception ignored) {} makes it nearly invisible |
| Layered translation | Manual but uniform | Often skipped; original exception bubbles to the boundary |
| Recovery decisions | Local, value-based | Often global, in a single catch-all |
Cheney's argument: forcing the writer to say something at every error site makes lazy handling visible. The verbosity is the feature, because it surfaces decisions that exception-based code hides.
The cost: there are more lines on the page. Mature Go developers stop seeing them; new developers find them noisy. The middle path is the discipline of this topic — make the content of those lines say something useful.
A pithy summary: Java handles errors at the catch site; Go handles them at the throw site. The throw site is closer to the cause, has more context, and is harder to copy-paste-and-forget. That is the case for the value-based model.
Code Review at Senior Level¶
A senior reviewer reads error handling with three lenses:
1. Layer responsibility¶
Does the layer making this decision have the right information to make it? A storage adapter retrying based on HTTP status codes is misplaced; the policy belongs at the application layer.
2. System-wide consistency¶
Does this PR follow the team's published error model? New sentinel? Documented in the model? Mapped to a status code?
3. Failure-mode coverage¶
What does this code do when its dependencies are down? When the request times out? When the upstream returns malformed data? Are those modes tested?
A senior PR comment looks like:
"This retries on any error. Should we restrict to transient? Otherwise we will retry validation errors forever and exhaust the budget. Also: the wrap message says 'failed' — could you say what it failed to do, with the entity ID?"
Specific. Names a layer responsibility. Suggests an alternative.
Summary¶
At senior level, "handle errors gracefully" is an architectural property: every error has an owner, every boundary translates, every dependency has a documented failure mode. Idempotency converts surface-only errors into recoverable ones. Circuit breakers and bulkheads prevent local failures from cascading. Degraded mode keeps users served. SLOs and error budgets turn handling into a measurable discipline. Sagas extend the same rules across service boundaries with explicit compensation. Observability of decisions — not just outcomes — is what separates a service that fails loudly and clearly from one that fails confusingly. The keystroke-level lessons of junior and middle level apply at every line; the architecture is what makes them sum to a debuggable, operable system.
Further Reading¶
- Release It! — Michael Nygard (circuit breakers, bulkheads, stability patterns)
- Site Reliability Engineering — Google — error budgets and SLOs
- Stripe — Designing robust and predictable APIs with idempotency
- Saga Pattern
- gRPC — Standard error model
- Sony's gobreaker — production circuit breaker
- The Twelve-Factor App — admin processes / disposability
- Hystrix — How it works — historical reference for breaker design
- OpenTelemetry — Error handling guidelines