Fail-Fast: Professional

Focus: staff/principal decisions. Fail-Fast is the simplest pattern in this catalogue and one of the most likely to get a service paged. The hard parts are not if !valid { return err }. They are: which boundary owns validation, how much CPU to budget, what happens when 200k req/s simultaneously fail-fast against a downstream that just came back up, and why a "harmless" log-and-continue cost a customer six hours of duplicate billing.

1. Fail-Fast as a system primitive

Fail-Fast is rejection at the boundary. The producer (caller) and the consumer (your code) share a contract; the moment the contract is violated, you stop. That decision differentiates Fail-Fast from every adjacent resilience pattern.

Primitive	When it acts	What it protects	Failure model
Fail-Fast	Before work starts	CPU, downstream, data integrity	Reject with structured error
Retry	After transient failure	Throughput under partial outage	Exponential backoff + jitter
Circuit breaker	After repeated failures	Downstream from saturation	Open / half-open / closed
Bulkhead	Continuously	One tenant from another	Resource pool partitioning
Fallback	After failure	User experience	Cached / degraded response
Timeout	After elapsed budget	Caller from hanging	Deadline propagation

Four distinctions:

Preventive vs reactive. Fail-Fast says "this can never succeed." Retry says "this might succeed if we try again." Mixing them, that is retrying a request that can never succeed, produces retry storms (§8).
Local vs statistical. Fail-Fast rejects this request on this input. A breaker rejects based on the failure rate of a peer. Both coexist: validate input, check breaker, call.
Composes upward. A handler that fails fast on bad JSON saves codec, validator, ORM, database, and downstream from touching the request. Savings multiply with depth.
Contract, not defence. Validation in a private function is mostly waste; validation at the boundary is the contract. Internal if x == nil { panic } lines are programmer-error assertions, not Fail-Fast.

The rule: Fail-Fast at every trust boundary; trust within. HTTP handler, gRPC server, queue consumer, CLI entry — each is a trust boundary that owns validation. Past it, business code assumes correctness. Defensive checks inside a service are duplication and a symptom of a leaky boundary.

2. Cost analysis

Go 1.24, amd64, Linux 6.8. Validation appears in every request; cost matters even when individually small.

sentinel: if s == "" { ... }                       ~2 ns
regexp.MatchString (compiled, short input)         ~200 ns
regexp.MatchString (uncompiled per call)           ~50 µs   (never)
go-playground/validator on 10-field struct         ~3 µs
protoc-gen-validate on 10-field proto              ~500 ns  (generated)
JSON Schema (santhosh-tekuri, compiled, 20 keys)   ~1 µs
JSON Schema (interpreted per call)                 ~30 µs
json.Unmarshal of 1 KB                             ~10 µs
proto.Unmarshal of 1 KB                            ~2 µs

A validator tag check costs ~3 µs. Against a 10 ms DB call, irrelevant. Against a 5 ns lookup-table hit on the hot path, 600x overhead. Validation cost is only invisible when downstream work dominates.

Early-exit savings. Every rejected request avoids the work it would have caused:

Stage	Cost if executed	Saved
JSON decode + struct validation	13 µs	paid
Auth check	50 µs	saved
DB transaction	200 µs	saved
Business logic + writes	~5 ms	saved
Downstream gRPC	10 ms	saved
Total per rejected request	~15 ms	saved

At 100 k req/s with a 1% bad-input rate, fail-fast rejects 1,000 requests per second and saves about 15 s of CPU+IO every second. That is the difference between healthy and saturated during an attack or a buggy client.
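The arithmetic behind that estimate can be reproduced in a few lines. This is a sketch only: the stage costs are copied from the table above, and the function names are made up for illustration. The four saved stages sum to 15.25 ms, so 1,000 rejections per second avoid roughly 15 s of work per second.

```go
package main

import "time"

// stageCosts lists the work a rejected request never reaches, using the
// figures from the table above: auth, DB transaction, business logic, gRPC.
var stageCosts = []time.Duration{
	50 * time.Microsecond,
	200 * time.Microsecond,
	5 * time.Millisecond,
	10 * time.Millisecond,
}

// savedPerRequest sums the cost of every skipped stage.
func savedPerRequest(costs []time.Duration) time.Duration {
	var total time.Duration
	for _, c := range costs {
		total += c
	}
	return total
}

// savedPerSecond returns the work avoided each wall-clock second. The result
// can exceed one second because the avoided work would have run in parallel
// across cores, connections, and downstream services.
func savedPerSecond(reqPerSec, badFraction float64, perRequest time.Duration) time.Duration {
	rejectedPerSec := reqPerSec * badFraction
	return time.Duration(rejectedPerSec * float64(perRequest))
}
```

Calling `savedPerSecond(100_000, 0.01, savedPerRequest(stageCosts))` gives the per-second saving for the scenario in the text; changing the bad-input fraction shows how quickly the saving grows during an attack.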

Where validation belongs. Each layer fails fast on the constraints it owns: edge gateway (schema/size/rate), ingress handler (request struct + auth), application (business invariants), domain (constructor-enforced invariants), infrastructure (physical constraints). Duplicating edge checks in the domain wastes CPU; skipping domain invariants because "the handler already checked" produces corrupt state when a new caller is added. Each layer's validations are its public contract — not a backstop for the layer above.
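A constructor-enforced invariant, the domain-layer case in the list above, might look like the following sketch. The `Email` type, `NewEmail`, and the deliberately simplistic rule are illustrative assumptions, not a recommended email policy.

```go
package main

import (
	"errors"
	"strings"
)

// ErrInvalidEmail is a hypothetical sentinel for this sketch.
var ErrInvalidEmail = errors.New("invalid email")

// Email is meant to be obtained only through NewEmail, so domain code holding
// a non-zero Email can assume the check already ran. The field is unexported
// to keep other packages from building one by hand.
type Email struct{ value string }

// NewEmail is the single place where the invariant is enforced. The rule
// (exactly one '@' with text on both sides, at most 254 bytes) stands in for
// a real validation policy.
func NewEmail(raw string) (Email, error) {
	s := strings.TrimSpace(raw)
	at := strings.IndexByte(s, '@')
	if len(s) > 254 || at <= 0 || at == len(s)-1 || strings.Count(s, "@") != 1 {
		return Email{}, ErrInvalidEmail
	}
	return Email{value: strings.ToLower(s)}, nil
}

// String returns the normalised address.
func (e Email) String() string { return e.value }
```

The handler still validates the request shape; the constructor protects the domain when a new caller (a batch job, a second API) bypasses that handler.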

3. Schema-driven validation

Hand-written if chains rot. Every new field is two edits — once in the struct, once in the validator — and the second is what people forget. Schema-driven validation moves the contract into a declarative artifact both sides of the wire understand.

Schema	Wire format	Cost	Use case
JSON Schema	JSON, YAML	~1 µs compiled	REST APIs, OpenAPI
protoc-gen-validate (PGV)	Protobuf	~500 ns	gRPC services
buf/protovalidate	Protobuf, CEL-based	~1 µs	New gRPC services
FlatBuffers verifier	FlatBuffers	Near-zero	Game state, mobile sync
Avro	Avro	~5 µs	Kafka, event streams
CUE	Any	~10 µs	Config validation

var compiled *jsonschema.Schema

func init() {
    c := jsonschema.NewCompiler()
    _ = c.AddResource("user.json", strings.NewReader(userSchemaJSON))
    s, err := c.Compile("user.json")
    if err != nil { panic(fmt.Errorf("user schema: %w", err)) }
    compiled = s
}

func validateUser(raw json.RawMessage) error {
    var v any
    if err := json.Unmarshal(raw, &v); err != nil {
        return fmt.Errorf("malformed JSON: %w", err)
    }
    if err := compiled.Validate(v); err != nil {
        return fmt.Errorf("schema violation: %w", err)
    }
    return nil
}

Three rules:

Compile once, reuse forever. Per-request compilation is a 30x regression by the §2 figures (~1 µs compiled vs ~30 µs interpreted), and nobody catches it until production p99 doubles.
Validate before unmarshalling into your typed struct. Otherwise you re-implement the same checks with if.
The schema is the API. Generate OpenAPI and SDKs from it.

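protovalidate for gRPC. Rules live as annotations in the .proto file and are evaluated at runtime by a CEL-based validator (~1 µs in the table above). Its predecessor, protoc-gen-validate, generates the validator code instead (~500 ns for a 10-field message). A protovalidate example: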

message CreateUserRequest {
  string email = 1 [(buf.validate.field).string.email = true];
  int32 age = 2 [(buf.validate.field).int32 = {gte: 13, lte: 120}];
  string display_name = 3 [(buf.validate.field).string = {min_len: 1, max_len: 64}];
}

Wire as a gRPC interceptor (§5) and every RPC fails fast without per-handler boilerplate. For Kafka/Pulsar, Avro + Schema Registry is the only sane way to evolve event payloads across hundreds of consumers without coordinated deploys. Do not hand-roll a schema-shaped problem. Hand-written validators drift from docs immediately and become the source of auth-bypass CVEs.

4. Circuit breakers as runtime Fail-Fast

A circuit breaker is Fail-Fast applied to a peer service. States: Closed (normal, failures counted), Open (reject immediately with sentinel error), Half-Open (single probe; on success close, on failure re-open with cooldown). Same pattern: stop early, report cheaply, let the system recover.

var paymentsBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "payments",
    MaxRequests: 1,
    Interval:    10 * time.Second,
    Timeout:     30 * time.Second,
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures >= 5 ||
            (c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) >= 0.5)
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        slog.Warn("breaker.state", "name", name, "from", from, "to", to)
        breakerStateChanges.WithLabelValues(name, to.String()).Inc()
    },
})

func charge(ctx context.Context, req ChargeRequest) (*Receipt, error) {
    res, err := paymentsBreaker.Execute(func() (any, error) {
        return paymentsClient.Charge(ctx, req)
    })
    if err != nil {
        // Open rejects every call; half-open rejects calls beyond MaxRequests.
        if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {
            return nil, ErrPaymentsUnavailable
        }
        return nil, err
    }
    return res.(*Receipt), nil
}

hystrix-go (Go port of Netflix Hystrix) is unmaintained; use sony/gobreaker or failsafe-go (generics-aware).

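Sizing. Default start: 50% failure over a 10 s window of at least 20 requests, 30 s cooldown. Note that gobreaker's Interval clears the counts every 10 s while the breaker is closed, so it is a fixed window rather than a rolling one.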

Knob	Too low	Too high
Failure threshold	Trips on noise; fake outages	Slow to trip; downstream saturates
Window length	Misses bursty failures	Stays open after recovery
Cooldown	Probes too often; thrashes downstream	Slow recovery

Where breakers belong: Redis cache, read replicas, third-party APIs, internal microservices — yes. Same-process functions — no. Database primary — usually no (degrading to read-only is application-specific). Breakers without a fallback are noise — they convert "5xx from downstream" into "5xx from breaker." The point is to do something better when open: cached data, queued work, ask to retry.
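A fallback for a read path can be as small as a last-known-good cache. The sketch below assumes the fetch function is already wrapped by a breaker; `ErrBreakerOpen` is a local stand-in for the breaker's open-state sentinel so the example needs no third-party import, and every name in it is hypothetical. It is not suitable for writes such as charges, where serving stale data is wrong.

```go
package main

import (
	"errors"
	"sync"
)

// ErrBreakerOpen stands in for the breaker library's "open" sentinel.
var ErrBreakerOpen = errors.New("breaker open")

// lastGood remembers the most recent successful value per key.
type lastGood struct {
	mu   sync.Mutex
	vals map[string]string
}

func newLastGood() *lastGood { return &lastGood{vals: make(map[string]string)} }

// fetchWithFallback calls fetch and refreshes the cache on success. When the
// breaker is open it serves the last good value and reports stale=true. Any
// other error, or an open breaker with nothing cached, is returned unchanged.
func (c *lastGood) fetchWithFallback(key string, fetch func() (string, error)) (val string, stale bool, err error) {
	v, err := fetch()
	if err == nil {
		c.mu.Lock()
		c.vals[key] = v
		c.mu.Unlock()
		return v, false, nil
	}
	if errors.Is(err, ErrBreakerOpen) {
		c.mu.Lock()
		cached, ok := c.vals[key]
		c.mu.Unlock()
		if ok {
			return cached, true, nil
		}
	}
	return "", false, err
}
```

Returning the `stale` flag lets the handler mark the response as degraded instead of pretending it is fresh.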

5. Distributed Fail-Fast

In a mesh, Fail-Fast is system policy, not per-service concern.

gRPC interceptor validation:

func ValidateUnaryInterceptor(v *protovalidate.Validator) grpc.UnaryServerInterceptor {
    return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, h grpc.UnaryHandler) (any, error) {
        msg, ok := req.(proto.Message)
        if !ok { return h(ctx, req) }
        if err := v.Validate(msg); err != nil {
            validationFailures.WithLabelValues(info.FullMethod).Inc()
            return nil, status.Errorf(codes.InvalidArgument, "validation failed: %v", err)
        }
        return h(ctx, req)
    }
}

s := grpc.NewServer(grpc.ChainUnaryInterceptor(
    recoveryInterceptor, loggingInterceptor, authInterceptor,
    ValidateUnaryInterceptor(validator),  // BEFORE handler
    metricsInterceptor,
))

The same rules can run in a UnaryClientInterceptor on the producer, catching errors before the network call.

Service mesh policy. Istio, Linkerd, Consul Connect push validation, rate limiting, and breaking into sidecars: single source of truth in CRDs, polyglot enforcement, SRE owns policy, product owns logic. Cost: ~500 µs per sidecar hop. Fine for most workloads; intolerable for low-latency in-cluster RPC. Many teams keep validation in-process and put rate limiting / breaking at the mesh.

API gateways. At the edge, a gateway (Envoy, Kong, AWS API Gateway) fails fast before traffic enters the cluster: TLS, size limits, schema, JWT, rate limits. Rejecting at the edge is far cheaper than rejecting inside the cluster, because the request never crosses a namespace, a sidecar, or a service. The edge is the first and most important Fail-Fast boundary; everything else is defence in depth.

Backpressure as Fail-Fast. 429 / ResourceExhausted with Retry-After is the load-shedding pattern: "I cannot serve this now; do not waste cycles." Without it, services degrade gracelessly under overload.
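One way to shed load in-process is a non-blocking concurrency limit in front of the handler. This is a minimal sketch with hypothetical names (`limiter`, `shed`); the limit and the Retry-After value are assumptions a real service would tune or derive adaptively.

```go
package main

import "net/http"

// limiter is a fixed pool of in-flight slots backed by a buffered channel.
type limiter struct{ slots chan struct{} }

func newLimiter(maxInFlight int) *limiter {
	return &limiter{slots: make(chan struct{}, maxInFlight)}
}

// tryAcquire never blocks: it either takes a slot or reports that none is free.
func (l *limiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot taken by a successful tryAcquire.
func (l *limiter) release() { <-l.slots }

// shed rejects requests beyond the limit with 429 and a Retry-After hint
// instead of letting them queue behind work the server cannot finish.
func shed(l *limiter, retryAfterSeconds string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.tryAcquire() {
			w.Header().Set("Retry-After", retryAfterSeconds)
			http.Error(w, "overloaded, retry later", http.StatusTooManyRequests)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}
```

Wrapping a mux as `shed(newLimiter(256), "1", mux)` keeps at most 256 requests in flight; everything beyond that bounces in microseconds rather than timing out after seconds.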

6. Idempotency keys: Fail-Fast on duplicate

A request is idempotent if executing twice has the same effect as once. Most aren't, and retries from networks, mobile clients, and queue redelivery produce duplicates. The fix: an idempotency key. Client generates a UUID per logical op; server stores (key) → result and fails fast on the second arrival.

type IdempotencyStore interface {
    Get(ctx context.Context, key string) (resp []byte, ok bool, err error)
    Put(ctx context.Context, key string, resp []byte, ttl time.Duration) error
    Lock(ctx context.Context, key string, ttl time.Duration) error
}

func (s *Server) CreateCharge(ctx context.Context, req *ChargeRequest) (*ChargeResponse, error) {
    key := req.IdempotencyKey
    if key == "" {
        return nil, status.Error(codes.InvalidArgument, "idempotency_key required")
    }
    if cached, ok, err := s.store.Get(ctx, key); err != nil {
        return nil, err
    } else if ok {
        return decodeResponse(cached), nil  // Fail-Fast on duplicate
    }
    if err := s.store.Lock(ctx, key, 30*time.Second); err != nil {
        if errors.Is(err, ErrConflict) {
            return nil, status.Error(codes.AlreadyExists, "in-flight duplicate")
        }
        return nil, err
    }
    resp, err := s.process(ctx, req)
    if err != nil { return nil, err }
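    if err := s.store.Put(ctx, key, encodeResponse(resp), 24*time.Hour); err != nil {
        // The charge already succeeded; failing the request would invite a retry.
        // Log loudly: without the record, the next retry is not deduplicated.
        slog.Error("idempotency.put", "key", key, "err", err)
    }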
    return resp, nil
}

Three correctness conditions:

Key in a location you reliably extract. Stripe uses an Idempotency-Key header; PayPal uses a PayPal-Request-Id header; some APIs carry the key in the body. Choose one; audit.
Stored response is the final response, status code included.
TTL exceeds the maximum client retry window. A client that retries for 60 s against a 30 s TTL gets double charges. Stripe keeps keys for at least 24 hours.

Store	Latency	Use when
Redis `SETNX` + TTL	~500 µs	Most services
DynamoDB conditional	~5 ms	Financial; must survive cache loss
Postgres unique constraint	~2 ms	Single-database services

For payments, the idempotency record must commit in the same transaction as the business write. Otherwise a key can outlive a rolled-back write, and the retry is rejected as a duplicate of work that never happened. Postgres with UNIQUE(tenant_id, idempotency_key) solves this for free.

Retry budget. Idempotency keys make retries safe; retry budgets make them bounded. A retry budget fails fast on retries themselves once retry rate exceeds (say) 10% — the classic sign of a downstream in distress. Idempotency + retry budget + circuit breaker is the canonical retry-safety triad. Pick any two and incidents teach you to want the third.
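A retry budget can be sketched as a ratio check between retries and first attempts. The type and method names below are hypothetical, and the counters are cumulative for brevity; a production version would use a sliding window or decaying counters so that old traffic does not fund new retries.

```go
package main

import "sync"

// retryBudget permits retries only while they stay under a fixed fraction
// of the first attempts seen so far.
type retryBudget struct {
	mu       sync.Mutex
	ratio    float64 // e.g. 0.1 allows one retry per ten requests
	requests int
	retries  int
}

func newRetryBudget(ratio float64) *retryBudget { return &retryBudget{ratio: ratio} }

// recordRequest counts a first attempt.
func (b *retryBudget) recordRequest() {
	b.mu.Lock()
	b.requests++
	b.mu.Unlock()
}

// allowRetry reports whether one more retry fits in the budget and, if it
// does, spends it. A false result means the caller should fail fast.
func (b *retryBudget) allowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if float64(b.retries+1) > b.ratio*float64(b.requests) {
		return false
	}
	b.retries++
	return true
}
```

With a 10% ratio and 20 recorded requests, two retries are allowed and the third is refused, which is the point where the client stops adding load to a struggling downstream.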

7. Observability

A service that fails fast is invisible until you measure rejection. Signals: rejection rate, by reason, with blast radius.

Metric	Type	Why
`validation_failures_total{endpoint,reason}`	Counter	Spike = bad deploy or attack
`validation_duration_seconds{endpoint}`	Histogram	Regression catches slow regex
`breaker_state{name}`	Gauge	Live downstream health
`breaker_state_changes_total{name,to}`	Counter	Flap detection
`idempotency_duplicates_total{endpoint}`	Counter	Client retry rate
`requests_rejected_total{reason}`	Counter	All fail-fast reasons unified
`error_budget_remaining{slo}`	Gauge	Burn rate

Most useful chart: rejection rate by reason, stacked. Each reason has a different remediation; aggregated "errors" hides the diagnosis.

Blast radius. Label by tenant and client version (bounded cardinality):

validationFailures.With(prometheus.Labels{
    "endpoint": "/v1/charges",
    "reason":   "email_format",
    "tenant":   tenantID,         // < 10k tenants
    "version":  clientVersion,    // a few SDK versions
}).Inc()

Never label by raw user ID or full URL path with embedded IDs. Bucket aggressively.
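Bucketing a raw path into a bounded label might look like this. It is a heuristic sketch with made-up function names; where the router exposes the matched route template, using that template as the label is the more reliable option.

```go
package main

import "strings"

// routeLabel collapses path segments that look like identifiers into ":id"
// so the metric label set stays bounded.
func routeLabel(path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if looksLikeID(s) {
			segs[i] = ":id"
		}
	}
	return strings.Join(segs, "/")
}

// looksLikeID treats all-digit segments and UUID-shaped segments
// (36 characters, four hyphens) as identifiers. This is a heuristic.
func looksLikeID(s string) bool {
	if s == "" {
		return false
	}
	allDigits := true
	for _, r := range s {
		if r < '0' || r > '9' {
			allDigits = false
			break
		}
	}
	if allDigits {
		return true
	}
	return len(s) == 36 && strings.Count(s, "-") == 4
}
```

Version segments such as `v1` survive because they are not purely numeric, so `/v1/charges/12345/refunds` and `/v1/charges/67890/refunds` land in the same series.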

Error budget burndown. If SLO is "99.9% of valid requests succeed," rejected-as-invalid requests should not burn budget. Compute SLO over (total - rejected_for_user_error). 4xx is contractual; 5xx is your fault. Burn budget only on 5xx.
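The SLO arithmetic described above fits in one function. The name `availability` and its parameters are illustrative; in practice the three counts would come from the `requests_rejected_total` style counters rather than function arguments.

```go
package main

// availability returns the fraction of valid requests that succeeded.
// rejected4xx are requests refused for caller error and are removed from the
// denominator; failed5xx are server-side failures and count against the SLO.
func availability(total, rejected4xx, failed5xx int) float64 {
	valid := total - rejected4xx
	if valid <= 0 {
		return 1 // no valid traffic, so nothing could fail
	}
	return float64(valid-failed5xx) / float64(valid)
}
```

For 1,000 requests with 100 rejected as invalid and 9 server failures, the result is 891/900 = 0.99; counting the rejections as failures would have reported 0.891 and burned budget on the clients' mistakes.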

Structured errors:

type ValidationError struct {
    Field   string `json:"field"`
    Code    string `json:"code"`     // "required", "format", "range"
    Message string `json:"message"`
    Got     any    `json:"got,omitempty"`
}

Structured errors let clients react programmatically and let logs aggregate by code. "Wrong format" without the field name is unactionable.

8. Failure modes

Silent acceptance of invalid input:

// BAD — error swallowed
result, _ := strconv.Atoi(input)
charge(result)  // input was "abc"; charged 0 cents

// BAD — error logged, work continues
n, err := strconv.Atoi(input)
if err != nil { log.Println("bad input", err) }
charge(n)  // same outcome, with a forensic breadcrumb

Logging is not handling. Every _ = and every if err != nil { log... } without return is a future incident. errcheck in CI catches the swallows; log-and-continue needs code review.

Retry storms after Fail-Fast. Downstream comes back after a 30 s outage. Every client retries simultaneously. Original throughput was 10 k req/s; current is original + everything queued. Downstream falls over again. Cycle repeats.

Layer	Defence
Client	Full-jitter exponential backoff
Client	Retry budget (§6)
Server	Token bucket rate limiter
Server	Adaptive concurrency limiter
Mesh	Outlier detection ejecting failing instances

Full jitter: sleep = random(0, min(cap, base * 2^attempt)). The random component must span the whole interval; a small jitter added to a fixed delay shifts the herd without dispersing it.
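A full-jitter delay function, sketched with an explicit cap on the exponential term. The function name and the choice of `math/rand` are assumptions for illustration; the cap keeps late attempts from sleeping for minutes.

```go
package main

import (
	"math/rand"
	"time"
)

// fullJitter returns a random delay in [0, min(maxDelay, base*2^attempt)).
func fullJitter(base, maxDelay time.Duration, attempt int) time.Duration {
	if base <= 0 || maxDelay <= 0 {
		return 0
	}
	ceiling := base
	// Double step by step and stop at the cap; for realistic caps this also
	// keeps the multiplication from overflowing.
	for i := 0; i < attempt && ceiling < maxDelay; i++ {
		ceiling *= 2
	}
	if ceiling > maxDelay {
		ceiling = maxDelay
	}
	return time.Duration(rand.Int63n(int64(ceiling)))
}
```

A retry loop would call `time.Sleep(fullJitter(100*time.Millisecond, 10*time.Second, attempt))` between attempts. Because the whole interval is random, clients that failed at the same instant spread across the window instead of returning together.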

Alerts on validation spikes. A 50x spike is rarely a flood of bad users. Almost always: a stricter schema you just deployed, a client deploying a new field, a schema rolled forward without dual validation, or an attacker probing endpoint shape. Alert on rate of change. Runbook: "Was anything deployed in the last 30 minutes? Roll back."

Panic on user input:

// BAD — panics on bad JSON; takes down the worker
func handle(msg []byte) {
    var req Request
    if err := json.Unmarshal(msg, &req); err != nil {
        panic(err)  // crashed the consumer
    }
}

panic is for programmer errors. Bad user input is expected. A panicking handler is a DoS primitive. recover is a parachute, not a strategy; the fix is to return error.
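The error-returning version of that handler, with a consumer loop that survives bad messages, could look like this sketch. `Request`, `ErrMissingID`, and `consume` are hypothetical; a real consumer would dead-letter the rejected message and increment a rejection counter rather than just counting.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Request is a stand-in payload for this sketch.
type Request struct {
	ID string `json:"id"`
}

// ErrMissingID is a hypothetical validation sentinel.
var ErrMissingID = errors.New("id required")

// handle rejects bad input by returning an error; it never panics on data
// that came from outside the process.
func handle(msg []byte) error {
	var req Request
	if err := json.Unmarshal(msg, &req); err != nil {
		return fmt.Errorf("malformed message: %w", err)
	}
	if req.ID == "" {
		return ErrMissingID
	}
	return nil
}

// consume keeps going after a bad message instead of taking the worker down.
func consume(msgs [][]byte) (accepted, rejected int) {
	for _, m := range msgs {
		if err := handle(m); err != nil {
			rejected++
			continue
		}
		accepted++
	}
	return accepted, rejected
}
```

One malformed message now costs one rejection, not the whole partition's throughput.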

Half-validation. Validating each present field is not the same as validating the request makes sense. email == "" && phone == "" slips past per-field checks; schema validators express "at least one required" — hand-written ones miss it.
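The gap between per-field and whole-request validation is easy to see in code. In this sketch the `Contact` type, the error names, and the toy email rule are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"strings"
)

// Contact is a stand-in request with two optional fields.
type Contact struct{ Email, Phone string }

var (
	ErrEmailFormat     = errors.New("email: format")
	ErrNoContactMethod = errors.New("email or phone: at least one required")
)

func validateContact(c Contact) error {
	// Per-field rule: only checks a field that is present, so an empty
	// request passes this step.
	if c.Email != "" && !strings.Contains(c.Email, "@") {
		return ErrEmailFormat
	}
	// Cross-field rule: the request as a whole must make sense.
	if c.Email == "" && c.Phone == "" {
		return ErrNoContactMethod
	}
	return nil
}
```

Dropping the second check leaves a validator that accepts `Contact{}`, which is exactly the half-validation case described above.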

9. Security

Fail-Fast is input sanitisation. Every Fail-Fast check is a security check; every absent check is a possible CVE.

OWASP API Security Top 10 item (2019 and 2023 lists)	Fail-Fast control
Broken Object Level Authorization (BOLA)	Resource-ID validation includes auth
Excessive Data Exposure	Response schema validation strips fields
Lack of Resources & Rate Limiting	429 at the edge before work
Mass Assignment	Allowlist field binding
Improper Inventory Management	Schema is the inventory; reject unknown versions
Unsafe Consumption of APIs	Validate downstream responses too

Denying service to invalid input before resource use. Validation must happen before expensive operations:

// BAD — DB query first, validation second
func search(q string) ([]Result, error) {
    results, err := db.Search(q)  // attacker sends 100 KB regex; DB melts
    if err != nil { return nil, err }
    if len(results) > 1000 { return nil, errors.New("too many") }
    return results, nil
}

// GOOD — Fail-Fast first
func search(q string) ([]Result, error) {
    if len(q) > 256 { return nil, ErrQueryTooLong }
    if !validQueryChars(q) { return nil, ErrInvalidQuery }
    return db.Search(q)
}

Validate cheap, then policy, then expensive work. Order is itself a security property; reordering for "cleanliness" routinely opens DoS holes.

Mass assignment. Never bind the public payload directly to the persistence struct:

// VULNERABLE
var u User
_ = json.NewDecoder(r.Body).Decode(&u)
db.Save(&u)  // attacker set IsAdmin: true

// SAFE — explicit allowlist DTO
type UpdateUserRequest struct {
    DisplayName *string `json:"display_name,omitempty"`
    Email       *string `json:"email,omitempty"`
}

DTO and entity are different types on purpose. Every team learns this the day a customer becomes an admin via curl.
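Decoding into the DTO and copying fields across explicitly might be sketched as follows. The `User` entity, `decodeUpdate`, and `applyTo` are hypothetical names. Calling `DisallowUnknownFields` is a stricter choice than the text requires: it rejects a payload that names `is_admin` rather than silently ignoring the field.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// User is the persistence entity; IsAdmin must never be client-settable.
type User struct {
	ID          int64
	DisplayName string
	Email       string
	IsAdmin     bool
}

// UpdateUserRequest mirrors the allowlist DTO above.
type UpdateUserRequest struct {
	DisplayName *string `json:"display_name,omitempty"`
	Email       *string `json:"email,omitempty"`
}

// decodeUpdate parses the payload into the DTO and rejects unknown fields.
func decodeUpdate(body []byte) (UpdateUserRequest, error) {
	var req UpdateUserRequest
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	err := dec.Decode(&req)
	return req, err
}

// applyTo copies only the allowlisted fields that were actually sent.
func (req UpdateUserRequest) applyTo(u *User) {
	if req.DisplayName != nil {
		u.DisplayName = *req.DisplayName
	}
	if req.Email != nil {
		u.Email = *req.Email
	}
}
```

Pointer fields distinguish "not sent" from "sent as empty", so a partial update leaves untouched fields alone.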

Size limits everywhere. http.MaxBytesReader(w, r.Body, 1<<20), multipart.Reader.ReadForm(maxMemory), gRPC MaxRecvMsgSize, every queue consumer. Missing limits is a top cause of "OOM at 03:00."

Constant-time comparison. Fail-Fast's natural shape — return on first mismatch — is a timing attack on secret comparison:

if subtle.ConstantTimeCompare([]byte(api.Key), []byte(provided)) == 0 {
    return ErrUnauth
}

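Cost: negligible for a short key. Bug: credential leak. ConstantTimeCompare returns immediately when the lengths differ, so it hides the content of the secret but not its length.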

10. Schema evolution

A schema lives forever. Every backwards-incompatible change is a coordinated multi-team deploy. Every change is breaking unless proven otherwise.

Change	Safe?
New optional field	Yes
New enum value	Mostly — needs default branch on consumer
New required field	No
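Field type widening (int32 → int64)	Conditional: protobuf is wire-compatible but old readers truncate large values; JSON parsers may lose precision above 2^53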
Field removal	No
Field rename	No (remove + add)

Deprecation warnings:

func (s *Server) Charge(ctx context.Context, req *ChargeRequest) (*ChargeResponse, error) {
    if req.LegacyCurrency != "" {
        deprecatedFieldUsage.WithLabelValues("Charge", "legacy_currency").Inc()
        slog.Warn("deprecated field used",
            "method", "Charge", "field", "legacy_currency",
            "tenant", tenantFromCtx(ctx))
        if req.Currency == "" {
            req.Currency = req.LegacyCurrency
        }
    }
    // ...
}

Three phases, mandatory:

Announce. slog.Warn + Prometheus counter on every use. One release cycle minimum.
Document. Update SDK docs and OpenAPI/protovalidate annotations. Email paid-API customers.
Remove. Only when the counter has been zero for two release cycles across all environments — staging included.

Skipping any phase breaks a consumer you don't know about.

Dual validation during migration:

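func validate(raw json.RawMessage) error {
    // Decode once; the compiled schemas validate decoded values, not raw bytes.
    var v any
    if err := json.Unmarshal(raw, &v); err != nil {
        return fmt.Errorf("malformed JSON: %w", err)
    }
    if err := schemaV2.Validate(v); err == nil { return nil }
    if err := schemaV1.Validate(v); err == nil {
        v1Acceptance.Inc()
        return nil
    }
    return ErrInvalid
}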

Accept under either schema; count v1 acceptances; remove v1 once the counter has been zero for two release cycles, the same bar as field removal. Strict mode is the destination, not the start.

Schema Registry as governance. Confluent / AWS Glue / Apicurio enforce compatibility modes:

Mode	Default for
`BACKWARD`	Event streams — new consumers read old events
`FORWARD`	When you must roll producers first
`FULL`	Safest; slows evolution
`NONE`	"We accept production incidents"

Choose explicitly; do not accept defaults.

11. Anti-patterns at scale

Anti-pattern	Symptom	Fix
Log-and-continue on validation error	Silent corruption	Return error; reject
`result, _ := op()`	Wrong arithmetic; phantom transactions	Handle every error
Blanket catch returning 200	Clients believe writes succeeded that didn't	Return 4xx/5xx with reason
`recover()` as flow control	Crashes hidden; bugs never fixed	Recover only at top, log + 500
Validation in middleware and handler	CPU doubled; rules drift	One owner; tests assert layout
Panic on user input	DoS vector	Return `error`
Regex compiled per request	p99 spikes	`regexp.MustCompile` at init
`json.Unmarshal` into `map[string]any`	Silent acceptance of anything	Typed struct or schema
Validation after partial work	Half-applied state on error	Validate first, mutate second
"Be liberal in what you accept"	Backwards-incompatibility bugs	Strict at boundary; deprecation cycle
Idempotency key in header but not body	Logging strips it; replays missed	Define one location; lint
Breaker without fallback	"5xx from breaker" instead of "5xx from downstream"	Implement degraded mode
No size limits	OOM, GC thrash	`MaxBytesReader`, `MaxRecvMsgSize`
Bind public payload to entity	Mass assignment; admin escalation	Separate DTO and entity
Validation in domain only, not boundary	Wasted CPU through layers	Validate at boundary, trust within
Error message echoing field value (PII)	Logs full of emails, tokens	Hash or redact
Sentinel `==` not `errors.Is`	Wrapped errors slip through	`errors.Is`/`As` everywhere
Different validation server vs SDK	Server rejects what SDK accepted	Generate both from one schema
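The sentinel-comparison row in the table is worth a concrete look, because the bug is silent. In this sketch `ErrNotFound` and `lookup` are hypothetical; the point is that wrapping with `%w` defeats `==` while `errors.Is` still matches.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel.
var ErrNotFound = errors.New("not found")

// lookup wraps the sentinel with context, as most layered code does.
func lookup(id string) error {
	return fmt.Errorf("lookup %q: %w", id, ErrNotFound)
}

// isNotFoundByEquality is the anti-pattern: it misses wrapped errors.
func isNotFoundByEquality(err error) bool { return err == ErrNotFound }

// isNotFound walks the wrap chain and matches the sentinel at any depth.
func isNotFound(err error) bool { return errors.Is(err, ErrNotFound) }
```

A handler that maps "not found" to 404 with the equality check starts returning 500 the day someone adds context to the error.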

The deepest anti-pattern: Fail-Fast as defensive coding instead of contract. Defensive coding sprinkles if x == nil everywhere for a false sense of safety. Contract-based Fail-Fast names the boundary, places the check there, documents the rule, and trusts the rest of the codebase. The first produces slow, brittle, untestable code; the second produces a system where you can reason about where errors come from.

12. Closing principles

Fail-Fast is rejection at the boundary. The defining trade is strict-now vs lenient-now-debug-later — "later" is always more expensive.

Early loud failures are cheap; late silent failures are expensive. Validation at the handler: 3 µs and a structured 4xx. The same bad input reaching the database: constraint violation, half-applied transaction, angry customer, postmortem. The gap is measured in orders of magnitude.
The boundary owns the contract. Each trust boundary fails fast on the rules it owns. Inside, business code trusts its inputs. Defensive duplication is a symptom of a leaky boundary.
Schemas are the contract. Hand-written if chains drift from docs immediately. JSON Schema, protovalidate, Avro — pick the technology that fits the wire format, generate validators, treat the schema as the API.
Validation, breaker, retry budget, idempotency are one system. Each is Fail-Fast applied to a different failure mode. Skipping one produces the incident that teaches you to want it.
Log-and-continue is not handling. Every if err != nil { log... } without return is a future incident. _ = err requires a comment. errcheck in CI catches the swallows; log-and-continue still needs code review.
panic is for programmer errors. Bad user input is expected and gets error. A handler that panics on input is a DoS primitive.
Validate cheap, then policy, then expensive work. Order is itself a security property. Body size before parse, auth before query, rate limit before computation.
Schema evolution requires dual validation. Deprecate over three release cycles; remove only when telemetry shows zero use. Day-one strict breaks the long tail of clients you cannot reach.
Multi-tenant means tenant-scoped errors and metrics. Label by tenant (bounded cardinality) to know blast radius without paging.
Observe rejection, not just success. Rate by reason, broken out by tenant and client. Burn budget on 5xx, not 4xx.

Get these right and Fail-Fast is invisible: invalid input bounces with a structured error, downstreams are insulated, retries are safe. Get them wrong and the on-call is a worker crashing on bad JSON, a retry storm taking down a recovered service, a customer billed twice because the idempotency key was in a stripped header, and a mass-assignment bug making attackers into admins. Fail-Fast is the cheapest pattern to write and one of the easiest to operate carelessly. Rejection at the boundary is the contract; the rest of the system is built on the assumption that the boundary did its job.