Error Handling Basics — Senior Level¶
Table of Contents¶
- Introduction
- Error Architecture as a Design Decision
- The Error Domain
- Layered Error Strategies
- Error Modes vs Failure Modes
- Designing Error APIs for Libraries
- The Cost of Wrapping
- Errors and Concurrency
- Errors and Context Cancellation
- Errors and Distributed Systems
- Telemetry: Errors as Signals
- Error Wrapping and Information Hiding
- Debugging Production Errors
- Architecture Patterns
- Anti-Patterns at Scale
- Summary
- Further Reading
Introduction¶
Focus: "How to optimize?" and "How to architect?"
At senior level, error handling is no longer a per-function concern. It is a system property. You design how errors flow across packages, services, layers, retries, and humans. The decisions you make at this level affect availability, debuggability, and the on-call rotation's quality of life.
This file is about the architecture of error handling. Not the keystrokes — those are second nature now — but the strategy.
Error Architecture as a Design Decision¶
A senior engineer answers four questions for every system they build:
- What can fail? — Enumerate failure modes. Disk full, network partition, malformed input, dependency timeout, race condition, OOM.
- Who handles each failure? — A cache miss is recovered locally; a database outage propagates to the caller; OOM crashes the process.
- What does the user see? — 4xx vs 5xx vs retry vs degraded UI vs nothing.
- What does the operator see? — A log line with full detail, a metric that fires, a trace span tagged as error.
Most systems have an implicit answer to these questions, and that implicit answer is usually wrong: "we wrap with %w and return up." That works until the system hits 99.9% availability and you start hunting milliseconds.
A senior engineer makes the answer explicit.
The Error Domain¶
Define a small, intentional set of kinds of errors that your system recognizes. Examples:
var (
ErrNotFound = errors.New("not found")
ErrConflict = errors.New("conflict")
ErrInvalidInput = errors.New("invalid input")
ErrUnauthorized = errors.New("unauthorized")
ErrRateLimited = errors.New("rate limited")
ErrUpstreamFailure = errors.New("upstream failure")
ErrInternal = errors.New("internal")
)
Each kind maps to: - An HTTP status code (404, 409, 400, 401, 429, 502, 500). - A retry policy (no, no, no, no, yes-with-backoff, yes, no). - A user-facing message ("not found", "already exists", input details, "log in", "slow down", "try again", "we are looking into it"). - A monitoring rule (which of these warrant alerting?).
Without this domain you have a thousand unique strings each meaning roughly the same thing, and your handlers turn into walls of strings.Contains.
With it, you write:
switch {
case errors.Is(err, ErrNotFound):
return 404
case errors.Is(err, ErrInvalidInput):
return 400
default:
return 500
}
And every layer of the system speaks the same vocabulary.
Layered Error Strategies¶
A typical Go service has four layers and four error strategies:
| Layer | Strategy |
|---|---|
| Storage (DB, cache, fs) | Translate driver-specific errors into the domain (sql.ErrNoRows → ErrNotFound). |
| Domain (business logic) | Use only domain errors. Never expose sql.* upward. |
| Transport (HTTP/gRPC handlers) | Translate domain errors into protocol responses. |
| Edge (CDN, gateway, client) | Surface a small set of statuses and messages. Hide internals. |
This is error translation as a layer responsibility. Each layer takes errors from the layer below and re-expresses them in its own dialect.
Why? Because tomorrow you might swap PostgreSQL for MongoDB. The domain code should not change. The translation layer changes — and that's it.
Error Modes vs Failure Modes¶
Subtle but important distinction:
- An error mode is something the function explicitly returns: parse failed, not found.
- A failure mode is what happens to the system when that error occurs at runtime: latency spike, retry storm, alert fires, on-call paged.
You design error modes; failure modes happen to you. Bad error handling is the bridge:
| Error mode | Bad failure mode |
|---|---|
| Database timeout | Caller retries, retries pile up, DB gets DDoS'd by its own clients. |
| Validation error | Caller logs at ERROR level, log volume explodes, log infra falls over. |
| Conflict (409) | Caller treats as transient, retries forever, livelock. |
| Not found | Caller proceeds with nil, panics later in a deeply nested call. |
A senior engineer designs error handling to prevent the failure modes, not just to communicate the error mode. Circuit breakers, exponential backoff, log sampling, retry budgets — these are the tools.
Designing Error APIs for Libraries¶
If you publish a library, your error values are part of your API contract. Renaming or repurposing them is a breaking change. Three solid patterns:
Pattern 1: Sentinel + Is¶
package fs
var ErrNotExist = errors.New("file does not exist")
func Open(path string) (*File, error) {
// ...
if !exists(path) {
return nil, fmt.Errorf("open %q: %w", path, ErrNotExist)
}
// ...
}
Callers use errors.Is(err, fs.ErrNotExist). Wrapping with %w lets you add context without breaking the sentinel check.
Pattern 2: Typed errors + As¶
type ValidationError struct {
Field string
Message string
}
func (e *ValidationError) Error() string {
return fmt.Sprintf("validation: %s: %s", e.Field, e.Message)
}
Callers use errors.As(err, &ve) to extract structured data. Use this when callers need fields, not just identity.
Pattern 3: Errors as enum (kind)¶
type Kind int
const (
KindNotFound Kind = iota + 1
KindConflict
KindInvalid
)
type Error struct {
Kind Kind
Op string // operation
Path string
Err error
}
Used by the standard library (fs.PathError, net.OpError, *os.LinkError). One struct, many kinds.
Rule: pick one of these and stick with it inside a package. Mixing them confuses callers.
The Cost of Wrapping¶
Each fmt.Errorf("%w: %v", ...) call: - Allocates a *fmt.wrapError struct (24 bytes). - Walks the format string once. - Stores the wrapped error pointer. - Costs roughly 100-200 ns on modern hardware.
In a steady-state web service this is invisible. In a hot loop processing a million events per second, it can be the difference between meeting and missing latency targets. Two mitigations:
- Pre-allocate sentinels at package level — they cost nothing per call.
- Wrap at boundaries, not inside loops — wrap once at the top of the operation, not on every iteration.
Profile before optimizing. Wrapping is rarely the dominant cost.
Errors and Concurrency¶
Errors and goroutines have a natural friction: a goroutine that returns nothing has nowhere to put its error. Three patterns:
Pattern 1: Channel of errors¶
errCh := make(chan error, len(jobs))
for _, j := range jobs {
go func(j Job) { errCh <- process(j) }(j)
}
var firstErr error
for range jobs {
if err := <-errCh; err != nil && firstErr == nil {
firstErr = err
}
}
Pattern 2: errgroup¶
g, ctx := errgroup.WithContext(ctx)
for _, j := range jobs {
j := j
g.Go(func() error { return process(ctx, j) })
}
if err := g.Wait(); err != nil {
return err
}
golang.org/x/sync/errgroup cancels the shared context on the first error. Standard practice for fan-out work.
Pattern 3: Aggregate with errors.Join¶
var errs []error
var mu sync.Mutex
var wg sync.WaitGroup
for _, j := range jobs {
wg.Add(1)
go func(j Job) {
defer wg.Done()
if err := process(j); err != nil {
mu.Lock()
errs = append(errs, err)
mu.Unlock()
}
}(j)
}
wg.Wait()
if err := errors.Join(errs...); err != nil {
return err
}
Use when you want all failures, not just the first.
Errors and Context Cancellation¶
Two specific errors deserve named handling:
Whenever a long operation runs under a context.Context, it must: 1. Stop early when the context is done. 2. Return ctx.Err() (or wrap it). 3. Not be confused with a "real" failure — context cancellation is an expected outcome, not an alert-worthy one.
Best practice: at the top of any handler, check errors.Is(err, context.Canceled) and treat it as success-equivalent for monitoring. Otherwise every user closing their browser tab pages your on-call.
Errors and Distributed Systems¶
Network calls return many errors that look the same but mean very different things:
| Error | Retry? | User impact |
|---|---|---|
| Connection refused (cold start) | Yes | None |
| 503 Service Unavailable | Yes (with backoff) | Maybe |
| 429 Too Many Requests | Yes (longer backoff) | Throttle |
| 504 Gateway Timeout | Yes (idempotent only) | Elevated latency |
| 500 Internal Server Error | No (often non-idempotent) | Reflect to user |
| Connection reset mid-request | Yes (idempotent only) | Possible duplicate |
| TLS handshake failure | No (config issue) | Outage |
A retry helper that does not distinguish these will either retry forever (livelock) or never retry (poor availability). Senior engineers encode the distinction:
type retryable interface {
Retryable() bool
}
func shouldRetry(err error) bool {
var r retryable
if errors.As(err, &r) {
return r.Retryable()
}
return errors.Is(err, ErrUpstreamFailure) || isTimeout(err)
}
Telemetry: Errors as Signals¶
An error's lifecycle in production is more than its return:
- Error happens — function returns it.
- Error is observed — some code calls
.Error()for the first time, often the logger. - Error is recorded — log line written, metric incremented, trace span tagged.
- Error is alerted on — SLO burn rate, error budget, threshold rules.
Senior systems wire these explicitly: - Each domain error has a metric label. - Each unhandled error is logged once at the boundary. - Traces tag the error span with otel.RecordError(err). - Alert thresholds are tied to kinds (e.g., 5xx rate > X), not raw counts.
Error Wrapping and Information Hiding¶
Wrapping leaks. fmt.Errorf("query users: %w", err) exposes err.Error() if anyone calls .Error() on the result. If err is pq: relation "users" does not exist, that string now reaches whoever calls .Error(). If that "whoever" is the HTTP response, you just leaked your schema.
Two strategies:
Strategy A: Always log, never expose¶
func handler(w http.ResponseWriter, r *http.Request) {
if err := s.do(r); err != nil {
log.Printf("internal: %v", err)
http.Error(w, "internal error", 500)
}
}
The log gets full detail; the user gets a bland message.
Strategy B: Tagged errors¶
type publicError struct{ msg string; cause error }
func (e *publicError) Error() string { return e.msg }
func (e *publicError) Unwrap() error { return e.cause }
Error() returns only the safe part. Internal Unwrap() exposes the rest for logging.
Debugging Production Errors¶
A production error log line should answer five questions:
- What — the error message.
- Where — the operation, often via wrapping context.
- When — timestamp.
- Who — the request ID, user ID, trace ID.
- Why — the chain of causes (unwrapped).
Tools: - fmt.Errorf("op: %w", err) — chain of causes. - Structured logging (slog, zap, logrus) — stable fields. - Trace IDs in every log line — to correlate. - runtime/debug.Stack() — for diagnostic situations only, not normal errors.
Architecture Patterns¶
Pattern: Error boundary¶
A single layer translates all errors into the protocol response. Inside the boundary, errors flow naturally; outside, only sanitized data.
Pattern: Result envelope¶
Wrap every domain operation in a result type that carries (value, error, metadata). Useful for APIs that need correlation IDs, retry hints, etc.
Pattern: Saga / compensation¶
In a multi-step transaction, an error at step N triggers compensating actions for steps 1..N-1. Errors are inputs to the rollback engine, not just diagnostics.
Pattern: Dead-letter queue¶
Errors during async processing get the message moved to a DLQ for later inspection, instead of retrying forever.
Anti-Patterns at Scale¶
- Generic
errors.New("error")— useless when 200 of these come from 200 callers. if err != nil { return err }only — no context, no kind, no telemetry. Errors arrive at the top with one-word messages.- Sentinel addiction — defining 200 sentinels with no semantic grouping. Use kinds (a small enum) instead.
- String matching on error messages — fragile, breaks on locale/version.
- Logging on every layer — log amplification. Log once at the boundary.
- Conflating timeout with failure — context cancellation is not an outage; do not page on it.
Summary¶
At senior level, error handling becomes a system design discipline. You define an error domain, layer translation strategies, separate error modes from failure modes, integrate with telemetry, and design for distributed-system realities. The senior question is not "did I check the error?" but "does my service degrade gracefully when this entire class of errors becomes common?"