Fail-Fast: Professional
Focus: staff/principal decisions. Fail-Fast is the simplest pattern in this catalogue and one of the most likely to get a service paged. The hard parts are not `if !valid { return err }`. They are: which boundary owns validation, how much CPU to budget, what happens when 200k req/s simultaneously fail-fast against a downstream that just came back up, and why a "harmless" log-and-continue cost a customer six hours of duplicate billing.
1. Fail-Fast as a system primitive
Fail-Fast is rejection at the boundary. The producer (caller) and the consumer (your code) share a contract; the moment the contract is violated, you stop. That decision differentiates Fail-Fast from every adjacent resilience pattern.
| Primitive | When it acts | What it protects | Failure model |
|---|---|---|---|
| Fail-Fast | Before work starts | CPU, downstream, data integrity | Reject with structured error |
| Retry | After transient failure | Throughput under partial outage | Exponential backoff + jitter |
| Circuit breaker | After repeated failures | Downstream from saturation | Open / half-open / closed |
| Bulkhead | Continuously | One tenant from another | Resource pool partitioning |
| Fallback | After failure | User experience | Cached / degraded response |
| Timeout | After elapsed budget | Caller from hanging | Deadline propagation |
Four distinctions:
- Preventive vs reactive. Fail-Fast says "this can never succeed." Retry says "this might succeed if we try again." Mixing produces retry storms (§8).
- Local vs statistical. Fail-Fast rejects this request on this input. A breaker rejects based on the failure rate of a peer. Both coexist: validate input, check breaker, call.
- Composes upward. A handler that fails fast on bad JSON saves codec, validator, ORM, database, and downstream from touching the request. Savings multiply with depth.
- Contract, not defence. Validation in a private function is mostly waste; validation at the boundary is the contract. Internal `if x == nil { panic }` lines are programmer-error assertions, not Fail-Fast.
The rule: Fail-Fast at every trust boundary; trust within. HTTP handler, gRPC server, queue consumer, CLI entry — each is a trust boundary that owns validation. Past it, business code assumes correctness. Defensive checks inside a service are duplication and a symptom of a leaky boundary.
2. Cost analysis
Go 1.24, amd64, Linux 6.8. Validation appears in every request; cost matters even when individually small.
| Check | Approx. cost |
|---|---|
| Sentinel: `if s == "" { ... }` | ~2 ns |
| `regexp.MatchString` (compiled, short input) | ~200 ns |
| `regexp.MatchString` (uncompiled per call) | ~50 µs (never do this) |
| `go-playground/validator` on 10-field struct | ~3 µs |
| `protoc-gen-validate` on 10-field proto | ~500 ns (generated) |
| JSON Schema (`santhosh-tekuri`, compiled, 20 keys) | ~1 µs |
| JSON Schema (interpreted per call) | ~30 µs |
| `json.Unmarshal` of 1 KB | ~10 µs |
| `proto.Unmarshal` of 1 KB | ~2 µs |
A validator tag check costs ~3 µs. Against a 10 ms DB call, irrelevant. Against a 5 ns lookup-table hit on the hot path, 600x overhead. Validation cost is only invisible when downstream work dominates.
Early-exit savings. Every rejected request avoids the work it would have caused:
| Stage | Cost if executed | On rejection |
|---|---|---|
| JSON decode + struct validation | 13 µs | paid |
| Auth check | 50 µs | saved |
| DB transaction | 200 µs | saved |
| Business logic + writes | ~5 ms | saved |
| Downstream gRPC | 10 ms | saved |
| Total per rejected request | ~15 ms | saved |
At 100 k req/s with a 1% bad-input rate, that is 1,000 rejections per second at ~15 ms each: fail-fast saves about 15 s of CPU+IO per second, the difference between healthy and saturated during an attack or a buggy client.
Where validation belongs. Each layer fails fast on the constraints it owns: edge gateway (schema/size/rate), ingress handler (request struct + auth), application (business invariants), domain (constructor-enforced invariants), infrastructure (physical constraints). Duplicating edge checks in the domain wastes CPU; skipping domain invariants because "the handler already checked" produces corrupt state when a new caller is added. Each layer's validations are its public contract — not a backstop for the layer above.
3. Schema-driven validation
Hand-written if chains rot. Every new field is two edits — once in the struct, once in the validator — and the second is what people forget. Schema-driven validation moves the contract into a declarative artifact both sides of the wire understand.
| Schema | Wire format | Cost | Use case |
|---|---|---|---|
| JSON Schema | JSON, YAML | ~1 µs compiled | REST APIs, OpenAPI |
| protoc-gen-validate (PGV) | Protobuf | ~500 ns | gRPC services |
| buf/protovalidate | Protobuf, CEL-based | ~1 µs | New gRPC services |
| FlatBuffers verifier | FlatBuffers | Near-zero | Game state, mobile sync |
| Avro | Avro | ~5 µs | Kafka, event streams |
| CUE | Any | ~10 µs | Config validation |
var compiled *jsonschema.Schema
func init() {
c := jsonschema.NewCompiler()
_ = c.AddResource("user.json", strings.NewReader(userSchemaJSON))
s, err := c.Compile("user.json")
if err != nil { panic(fmt.Errorf("user schema: %w", err)) }
compiled = s
}
func validateUser(raw json.RawMessage) error {
var v any
if err := json.Unmarshal(raw, &v); err != nil {
return fmt.Errorf("malformed JSON: %w", err)
}
if err := compiled.Validate(v); err != nil {
return fmt.Errorf("schema violation: %w", err)
}
return nil
}
Three rules:
- Compile once, reuse forever. Per-request compilation is a 30x to 250x regression by the §2 figures, and nobody catches it until production p99 doubles.
- Validate before unmarshalling into your typed struct. Otherwise you re-implement the same checks with `if`.
- The schema is the API. Generate OpenAPI and SDKs from it.
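protovalidate for gRPC. Rules are declared as field annotations in the `.proto` and evaluated at runtime through CEL, roughly 1 µs for a 10-field message by the table above. (PGV, its predecessor, generates reflection-free validator code at ~500 ns.) Example annotations: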
message CreateUserRequest {
string email = 1 [(buf.validate.field).string.email = true];
int32 age = 2 [(buf.validate.field).int32 = {gte: 13, lte: 120}];
string display_name = 3 [(buf.validate.field).string = {min_len: 1, max_len: 64}];
}
Wire as a gRPC interceptor (§5) and every RPC fails fast without per-handler boilerplate. For Kafka/Pulsar, Avro + Schema Registry is the only sane way to evolve event payloads across hundreds of consumers without coordinated deploys. Do not hand-roll a schema-shaped problem. Hand-written validators drift from docs immediately and become the source of auth-bypass CVEs.
4. Circuit breakers as runtime Fail-Fast
A circuit breaker is Fail-Fast applied to a peer service. States: Closed (normal, failures counted), Open (reject immediately with sentinel error), Half-Open (single probe; on success close, on failure re-open with cooldown). Same pattern: stop early, report cheaply, let the system recover.
var paymentsBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: "payments",
MaxRequests: 1,
Interval: 10 * time.Second,
Timeout: 30 * time.Second,
ReadyToTrip: func(c gobreaker.Counts) bool {
return c.ConsecutiveFailures >= 5 ||
(c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) >= 0.5)
},
OnStateChange: func(name string, from, to gobreaker.State) {
slog.Warn("breaker.state", "name", name, "from", from, "to", to)
breakerStateChanges.WithLabelValues(name, to.String()).Inc()
},
})
func charge(ctx context.Context, req ChargeRequest) (*Receipt, error) {
res, err := paymentsBreaker.Execute(func() (any, error) {
return paymentsClient.Charge(ctx, req)
})
if err != nil {
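// ErrOpenState: breaker is open. ErrTooManyRequests: half-open and the probe slot is already taken.
if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {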
return nil, ErrPaymentsUnavailable
}
return nil, err
}
return res.(*Receipt), nil
}
hystrix-go (Go port of Netflix Hystrix) is unmaintained; use sony/gobreaker or failsafe-go (generics-aware).
Sizing. Default start: 50% failure over a 10 s rolling window of at least 20 requests, 30 s cooldown.
| Knob | Too low | Too high |
|---|---|---|
| Failure threshold | Trips on noise; fake outages | Slow to trip; downstream saturates |
| Window length | Misses bursty failures | Stays open after recovery |
| Cooldown | Probes too often; thrashes downstream | Slow recovery |
Where breakers belong: Redis cache, read replicas, third-party APIs, internal microservices — yes. Same-process functions — no. Database primary — usually no (degrading to read-only is application-specific). Breakers without a fallback are noise — they convert "5xx from downstream" into "5xx from breaker." The point is to do something better when open: cached data, queued work, ask to retry.
5. Distributed Fail-Fast
In a mesh, Fail-Fast is system policy, not per-service concern.
gRPC interceptor validation:
func ValidateUnaryInterceptor(v *protovalidate.Validator) grpc.UnaryServerInterceptor {
return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, h grpc.UnaryHandler) (any, error) {
msg, ok := req.(proto.Message)
if !ok { return h(ctx, req) }
if err := v.Validate(msg); err != nil {
validationFailures.WithLabelValues(info.FullMethod).Inc()
return nil, status.Errorf(codes.InvalidArgument, "validation failed: %v", err)
}
return h(ctx, req)
}
}
s := grpc.NewServer(grpc.ChainUnaryInterceptor(
recoveryInterceptor, loggingInterceptor, authInterceptor,
ValidateUnaryInterceptor(validator), // BEFORE handler
metricsInterceptor,
))
The same rules can also run in a `grpc.UnaryClientInterceptor` on the producer, catching errors before the network call.
Service mesh policy. Istio, Linkerd, Consul Connect push validation, rate limiting, and breaking into sidecars: single source of truth in CRDs, polyglot enforcement, SRE owns policy, product owns logic. Cost: ~500 µs per sidecar hop. Fine for most workloads; intolerable for low-latency in-cluster RPC. Many teams keep validation in-process and put rate limiting / breaking at the mesh.
API gateways. At the edge, a gateway (Envoy, Kong, AWS API Gateway) fails fast before traffic enters the cluster: TLS, size limits, schema, JWT, rate limits. Rejecting at the edge is 100x cheaper than inside the cluster — no namespace traversal, no sidecar, no service load. The edge is the first and most important Fail-Fast boundary; everything else is defence in depth.
Backpressure as Fail-Fast. 429 / ResourceExhausted with Retry-After is the load-shedding pattern: "I cannot serve this now; do not waste cycles." Without it, services degrade gracelessly under overload.
6. Idempotency keys: Fail-Fast on duplicate
A request is idempotent if executing twice has the same effect as once. Most aren't, and retries from networks, mobile clients, and queue redelivery produce duplicates. The fix: an idempotency key. Client generates a UUID per logical op; server stores (key) → result and fails fast on the second arrival.
type IdempotencyStore interface {
Get(ctx context.Context, key string) (resp []byte, ok bool, err error)
Put(ctx context.Context, key string, resp []byte, ttl time.Duration) error
Lock(ctx context.Context, key string, ttl time.Duration) error
}
func (s *Server) CreateCharge(ctx context.Context, req *ChargeRequest) (*ChargeResponse, error) {
key := req.IdempotencyKey
if key == "" {
return nil, status.Error(codes.InvalidArgument, "idempotency_key required")
}
if cached, ok, err := s.store.Get(ctx, key); err != nil {
return nil, err
} else if ok {
return decodeResponse(cached), nil // Fail-Fast on duplicate
}
if err := s.store.Lock(ctx, key, 30*time.Second); err != nil {
if errors.Is(err, ErrConflict) {
return nil, status.Error(codes.AlreadyExists, "in-flight duplicate")
}
return nil, err
}
resp, err := s.process(ctx, req)
if err != nil { return nil, err }
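// The charge already happened, so a failed Put must be loud: later retries would no longer be deduplicated.
if err := s.store.Put(ctx, key, encodeResponse(resp), 24*time.Hour); err != nil {
slog.ErrorContext(ctx, "idempotency.put_failed", "key", key, "err", err)
}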
return resp, nil
}
Three correctness conditions:
- Key in a location you reliably extract. Stripe uses the `Idempotency-Key` header; PayPal uses the `PayPal-Request-Id` header; some APIs carry the key in the body. Choose one; audit.
- Stored response is the final response, status code included.
- TTL exceeds the maximum client retry window. A client that retries for 60 s against a 30 s TTL gets double charges. Stripe uses 24 hours.
| Store | Latency | Use when |
|---|---|---|
| Redis `SETNX` + TTL | ~500 µs | Most services |
| DynamoDB conditional | ~5 ms | Financial; must survive cache loss |
| Postgres unique constraint | ~2 ms | Single-database services |
For payments, the idempotency record must commit in the same transaction as the business write, or keys survive for rolled-back transactions. Postgres with UNIQUE(tenant_id, idempotency_key) solves this for free.
Retry budget. Idempotency keys make retries safe; retry budgets make them bounded. A retry budget fails fast on retries themselves once retry rate exceeds (say) 10% — the classic sign of a downstream in distress. Idempotency + retry budget + circuit breaker is the canonical retry-safety triad. Pick any two and incidents teach you to want the third.
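A retry budget can be sketched as two counters and a ratio. The version below is a deliberately simplified assumption: `RetryBudget` and its methods are hypothetical names, and the counters never decay, whereas a production budget would use a sliding window or token bucket.

```go
package main

import (
	"fmt"
	"sync"
)

// RetryBudget permits retries only while they stay under a fixed
// ratio of first attempts. Counters never decay in this sketch.
type RetryBudget struct {
	mu       sync.Mutex
	ratio    float64 // e.g. 0.1: retries may add at most 10% load
	requests int
	retries  int
}

func NewRetryBudget(ratio float64) *RetryBudget { return &RetryBudget{ratio: ratio} }

// RecordRequest counts a first attempt.
func (b *RetryBudget) RecordRequest() {
	b.mu.Lock()
	b.requests++
	b.mu.Unlock()
}

// AllowRetry reports whether one more retry fits the budget and
// spends it if so. A false result means: fail fast, do not retry.
func (b *RetryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if float64(b.retries+1) > b.ratio*float64(b.requests) {
		return false
	}
	b.retries++
	return true
}

func main() {
	b := NewRetryBudget(0.1)
	for i := 0; i < 100; i++ {
		b.RecordRequest()
	}
	allowed := 0
	for i := 0; i < 50; i++ {
		if b.AllowRetry() {
			allowed++
		}
	}
	fmt.Println("retries allowed out of 50 attempts:", allowed)
}
```

The budget is shared per downstream, not per request, so one distressed dependency cannot multiply its own load through retries.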
7. Observability
A service that fails fast is invisible until you measure rejection. Signals: rejection rate, by reason, with blast radius.
| Metric | Type | Why |
|---|---|---|
| `validation_failures_total{endpoint,reason}` | Counter | Spike = bad deploy or attack |
| `validation_duration_seconds{endpoint}` | Histogram | Catches regressions such as a slow regex |
| `breaker_state{name}` | Gauge | Live downstream health |
| `breaker_state_changes_total{name,to}` | Counter | Flap detection |
| `idempotency_duplicates_total{endpoint}` | Counter | Client retry rate |
| `requests_rejected_total{reason}` | Counter | All fail-fast reasons unified |
| `error_budget_remaining{slo}` | Gauge | Burn rate |
Most useful chart: rejection rate by reason, stacked. Each reason has a different remediation; aggregated "errors" hides the diagnosis.
Blast radius. Label by tenant and client version (bounded cardinality):
validationFailures.With(prometheus.Labels{
"endpoint": "/v1/charges",
"reason": "email_format",
"tenant": tenantID, // < 10k tenants
"version": clientVersion, // a few SDK versions
}).Inc()
Never label by raw user ID or full URL path with embedded IDs. Bucket aggressively.
Error budget burndown. If SLO is "99.9% of valid requests succeed," rejected-as-invalid requests should not burn budget. Compute SLO over (total - rejected_for_user_error). 4xx is contractual; 5xx is your fault. Burn budget only on 5xx.
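The arithmetic behind "compute SLO over (total - rejected_for_user_error)" is small enough to show directly. In this sketch `availability` is a hypothetical helper, and it assumes the 5xx count covers failures among valid requests only.

```go
package main

import "fmt"

// availability computes the SLI over valid requests only: requests
// rejected as the caller's fault (4xx) are removed from the denominator.
// Assumes failed5xx counts failures among the remaining valid requests.
func availability(total, rejected4xx, failed5xx int) float64 {
	valid := total - rejected4xx
	if valid <= 0 {
		return 1 // nothing valid was asked of us, so nothing was missed
	}
	return float64(valid-failed5xx) / float64(valid)
}

func main() {
	// 1,000,000 requests, 50,000 rejected as invalid, 500 server errors.
	fmt.Printf("%.4f%%\n", 100*availability(1_000_000, 50_000, 500)) // prints 99.9474%
}
```

Counting the 50,000 rejections in the denominator would report 99.95%, slightly flattering the service; counting them as failures would report 94.95% and page someone for a client bug.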
Structured errors:
type ValidationError struct {
Field string `json:"field"`
Code string `json:"code"` // "required", "format", "range"
Message string `json:"message"`
Got any `json:"got,omitempty"`
}
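Structured errors let clients react programmatically and let logs aggregate by code. "Wrong format" without the field name is unactionable. Populate `Got` only for non-sensitive fields; echoing emails or tokens is the PII anti-pattern in §11.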
8. Failure modes
Silent acceptance of invalid input:
// BAD — error swallowed
result, _ := strconv.Atoi(input)
charge(result) // input was "abc"; charged 0 cents
// BAD — error logged, work continues
n, err := strconv.Atoi(input)
if err != nil { log.Println("bad input", err) }
charge(n) // same outcome, with a forensic breadcrumb
Logging is not handling. Every `_ =` and every `if err != nil { log... }` without `return` is a future incident. `errcheck` in CI catches the swallows; log-and-continue needs code review.
Retry storms after Fail-Fast. Downstream comes back after a 30 s outage. Every client retries simultaneously. Original throughput was 10 k req/s; current is original + everything queued. Downstream falls over again. Cycle repeats.
| Layer | Defence |
|---|---|
| Client | Full-jitter exponential backoff |
| Client | Retry budget (§6) |
| Server | Token bucket rate limiter |
| Server | Adaptive concurrency limiter |
| Mesh | Outlier detection ejecting hot instances |
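Full jitter: sleep = random(0, min(cap, base * 2^attempt)). The cap bounds the worst-case wait. The random term must span the whole interval: adding a small jitter to a fixed exponential delay shifts the herd but does not disperse it.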
Alerts on validation spikes. A 50x spike is rarely a flood of bad users. Almost always: a stricter schema you just deployed, a client deploying a new field, a schema rolled forward without dual validation, or an attacker probing endpoint shape. Alert on rate of change. Runbook: "Was anything deployed in the last 30 minutes? Roll back."
Panic on user input:
// BAD — panics on bad JSON; takes down the worker
func handle(msg []byte) {
var req Request
if err := json.Unmarshal(msg, &req); err != nil {
panic(err) // crashed the consumer
}
}
`panic` is for programmer errors. Bad user input is expected. A panicking handler is a DoS primitive. `recover` is a parachute, not a strategy; the fix is to return an `error`.
Half-validation. Validating each present field is not the same as validating the request makes sense. email == "" && phone == "" slips past per-field checks; schema validators express "at least one required" — hand-written ones miss it.
9. Security
Fail-Fast is input sanitisation. Every Fail-Fast check is a security check; every absent check is a possible CVE.
| OWASP API Top 10 item (2019 and 2023 editions) | Fail-Fast control |
|---|---|
| Broken Object Level Authorization (BOLA) | Resource-ID validation includes auth |
| Excessive Data Exposure | Response schema validation strips fields |
| Lack of Resources & Rate Limiting | 429 at the edge before work |
| Mass Assignment | Allowlist field binding |
| Improper Inventory Management | Schema is the inventory; reject unknown versions |
| Unsafe Consumption of APIs | Validate downstream responses too |
Denying service to invalid input before resource use. Validation must happen before expensive operations:
// BAD — DB query first, validation second
func search(q string) ([]Result, error) {
results, err := db.Search(q) // attacker sends 100 KB regex; DB melts
if err != nil { return nil, err }
if len(results) > 1000 { return nil, errors.New("too many") }
return results, nil
}
// GOOD — Fail-Fast first
func search(q string) ([]Result, error) {
if len(q) > 256 { return nil, ErrQueryTooLong }
if !validQueryChars(q) { return nil, ErrInvalidQuery }
return db.Search(q)
}
Validate cheap, then policy, then expensive work. Order is itself a security property; reordering for "cleanliness" routinely opens DoS holes.
Mass assignment. Never bind the public payload directly to the persistence struct:
// VULNERABLE
var u User
_ = json.NewDecoder(r.Body).Decode(&u)
db.Save(&u) // attacker set IsAdmin: true
// SAFE — explicit allowlist DTO
type UpdateUserRequest struct {
DisplayName *string `json:"display_name,omitempty"`
Email *string `json:"email,omitempty"`
}
DTO and entity are different types on purpose. Every team learns this the day a customer becomes an admin via curl.
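The step from DTO to entity can be sketched as follows. The `User` fields and the `applyUpdate` helper are assumptions made for illustration; the point is that only allowlisted fields are copied, and `DisallowUnknownFields` turns an attempted `is_admin` into a rejection rather than a silent drop.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// User is the persistence entity (fields assumed for this sketch).
type User struct {
	DisplayName string
	Email       string
	IsAdmin     bool // never settable from the public payload
}

// UpdateUserRequest is the allowlist DTO.
type UpdateUserRequest struct {
	DisplayName *string `json:"display_name,omitempty"`
	Email       *string `json:"email,omitempty"`
}

// applyUpdate decodes into the DTO, rejects unknown fields, and only
// then copies the allowed fields onto the entity.
func applyUpdate(u *User, body string) error {
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	var req UpdateUserRequest
	if err := dec.Decode(&req); err != nil {
		return fmt.Errorf("invalid update: %w", err)
	}
	if req.DisplayName != nil {
		u.DisplayName = *req.DisplayName
	}
	if req.Email != nil {
		u.Email = *req.Email
	}
	return nil
}

func main() {
	u := User{DisplayName: "old"}
	err := applyUpdate(&u, `{"display_name":"Ada"}`)
	fmt.Println(err, u.DisplayName)
	err = applyUpdate(&u, `{"is_admin":true}`)
	fmt.Println(err != nil, u.IsAdmin)
}
```

Pointer fields distinguish "not sent" from "sent as empty", so a partial update cannot blank a field the client never mentioned.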
Size limits everywhere. `http.MaxBytesReader(w, r.Body, 1<<20)`, `multipart.Reader.ReadForm(maxMemory)`, gRPC `MaxRecvMsgSize`, every queue consumer. Missing limits are a top cause of "OOM at 03:00."
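Constant-time comparison. Fail-Fast's natural shape, return on first mismatch, is a timing side channel when the value being compared is a secret: response time reveals how many leading bytes matched. Compare tokens, MACs, and API keys with `crypto/subtle.ConstantTimeCompare` rather than `==` or `bytes.Equal`.
Cost: microseconds. Bug: credential leak.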
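A minimal sketch of a constant-time secret check with the standard library. `tokenEqual` is a hypothetical helper; hashing both sides first is one common way to equalise lengths, since `subtle.ConstantTimeCompare` returns immediately when lengths differ.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// tokenEqual compares two secrets without returning early on the
// first mismatching byte. Hashing first gives both inputs the same
// length, so the comparison does not leak the secret's length either.
func tokenEqual(got, want string) bool {
	g := sha256.Sum256([]byte(got))
	w := sha256.Sum256([]byte(want))
	return subtle.ConstantTimeCompare(g[:], w[:]) == 1
}

func main() {
	fmt.Println(tokenEqual("s3cret-token", "s3cret-token"))
	fmt.Println(tokenEqual("s3cret-tokem", "s3cret-token"))
}
```

Fail-Fast still applies around the comparison: reject a missing or oversized token early, then compare the candidate in constant time.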
10. Schema evolution
A schema lives forever. Every backwards-incompatible change is a coordinated multi-team deploy. Every change is breaking unless proven otherwise.
| Change | Safe? |
|---|---|
| New optional field | Yes |
| New enum value | Mostly — needs default branch on consumer |
| New required field | No |
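| Field type widening (int32 → int64) | JSON: yes (mind 2^53 in JavaScript clients). Protobuf: treat as no; the varint wire format is compatible, but generated types change and old readers truncate large values |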
| Field removal | No |
| Field rename | No (remove + add) |
Deprecation warnings:
func (s *Server) Charge(ctx context.Context, req *ChargeRequest) (*ChargeResponse, error) {
if req.LegacyCurrency != "" {
deprecatedFieldUsage.WithLabelValues("Charge", "legacy_currency").Inc()
slog.Warn("deprecated field used",
"method", "Charge", "field", "legacy_currency",
"tenant", tenantFromCtx(ctx))
if req.Currency == "" {
req.Currency = req.LegacyCurrency
}
}
// ...
}
Three phases, mandatory:
- Announce. `slog.Warn` + Prometheus counter on every use. One release cycle minimum.
- Document. Update SDK docs and OpenAPI/protovalidate annotations. Email paid-API customers.
- Remove. Only when the counter has been zero for two release cycles across all environments, staging included.
Skipping any phase breaks a consumer you don't know about.
Dual validation during migration:
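func validate(raw json.RawMessage) error {
// Decode once. Assumes schemaV1/schemaV2 are compiled like `compiled` in §3, which validates decoded values, not raw bytes.
var v any
if err := json.Unmarshal(raw, &v); err != nil {
return fmt.Errorf("malformed JSON: %w", err)
}
if err := schemaV2.Validate(v); err == nil { return nil }
if err := schemaV1.Validate(v); err == nil {
v1Acceptance.Inc()
return nil
}
return ErrInvalid
}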
Accept under either schema; count v1 acceptances; remove when zero for a release cycle. Strict mode is the destination, not the start.
Schema Registry as governance. Confluent / AWS Glue / Apicurio enforce compatibility modes:
| Mode | Typical use |
|---|---|
| `BACKWARD` | Event streams: new consumers read old events |
| `FORWARD` | When you must roll producers first |
| `FULL` | Safest; slows evolution |
| `NONE` | "We accept production incidents" |
Choose explicitly; do not accept defaults.
11. Anti-patterns at scale
| Anti-pattern | Symptom | Fix |
|---|---|---|
| Log-and-continue on validation error | Silent corruption | Return error; reject |
| `result, _ := op()` | Wrong arithmetic; phantom transactions | Handle every error |
| Blanket catch returning 200 | Clients believe writes succeeded that didn't | Return 4xx/5xx with reason |
| `recover()` as flow control | Crashes hidden; bugs never fixed | Recover only at top, log + 500 |
| Validation in middleware and handler | CPU doubled; rules drift | One owner; tests assert layout |
| Panic on user input | DoS vector | Return error |
| Regex compiled per request | p99 spikes | regexp.MustCompile at init |
| `json.Unmarshal` into `map[string]any` | Silent acceptance of anything | Typed struct or schema |
| Validation after partial work | Half-applied state on error | Validate first, mutate second |
| "Be liberal in what you accept" | Backwards-incompatibility bugs | Strict at boundary; deprecation cycle |
| Idempotency key in header but not body | Logging strips it; replays missed | Define one location; lint |
| Breaker without fallback | "5xx from breaker" instead of "5xx from downstream" | Implement degraded mode |
| No size limits | OOM, GC thrash | MaxBytesReader, MaxRecvMsgSize |
| Bind public payload to entity | Mass assignment; admin escalation | Separate DTO and entity |
| Validation in domain only, not boundary | Wasted CPU through layers | Validate at boundary, trust within |
| Error message echoing field value (PII) | Logs full of emails, tokens | Hash or redact |
| Sentinel `==` not `errors.Is` | Wrapped errors slip through | `errors.Is`/`As` everywhere |
| Different validation server vs SDK | Server rejects what SDK accepted | Generate both from one schema |
The deepest anti-pattern: Fail-Fast as defensive coding instead of contract. Defensive coding sprinkles if x == nil everywhere for a false sense of safety. Contract-based Fail-Fast names the boundary, places the check there, documents the rule, and trusts the rest of the codebase. First produces slow, brittle, untestable code; second produces a system where you can reason about where errors come from.
12. Closing principles
Fail-Fast is rejection at the boundary. The defining trade is strict-now vs lenient-now-debug-later — "later" is always more expensive.
- Early loud failures are cheap; late silent failures are expensive. Validation at the handler: 3 µs and a structured 4xx. The same bad input reaching the database: constraint violation, half-applied transaction, angry customer, postmortem. Six orders of magnitude.
- The boundary owns the contract. Each trust boundary fails fast on the rules it owns. Inside, business code trusts its inputs. Defensive duplication is a symptom of a leaky boundary.
- Schemas are the contract. Hand-written `if` chains drift from docs immediately. JSON Schema, protovalidate, Avro: pick the technology that fits the wire format, generate validators, treat the schema as the API.
- Validation, breaker, retry budget, idempotency are one system. Each is Fail-Fast applied to a different failure mode. Skipping one produces the incident that teaches you to want it.
- Log-and-continue is not handling. Every `if err != nil { log... }` without `return` is a future incident. `_ = err` requires a comment. CI lints both.
- `panic` is for programmer errors. Bad user input is expected and gets `error`. A handler that panics on input is a DoS primitive.
- Validate cheap, then policy, then expensive work. Order is itself a security property. Body size before parse, auth before query, rate limit before computation.
- Schema evolution requires dual validation. Deprecate over three release cycles; remove only when telemetry shows zero use. Day-one strict breaks the long tail of clients you cannot reach.
- Multi-tenant means tenant-scoped errors and metrics. Label by tenant (bounded cardinality) to know blast radius without paging.
- Observe rejection, not just success. Rate by reason, broken out by tenant and client. Burn budget on 5xx, not 4xx.
Get these right and Fail-Fast is invisible: invalid input bounces with a structured error, downstreams are insulated, retries are safe. Get them wrong and the on-call is a worker crashing on bad JSON, a retry storm taking down a recovered service, a customer billed twice because the idempotency key was in a stripped header, and a mass-assignment bug making attackers into admins. Fail-Fast is the cheapest pattern to write and one of the easiest to operate carelessly. Rejection at the boundary is the contract; the rest of the system is built on the assumption that the boundary did its job.
Further reading
- Dave Cheney, Don't just check errors, handle them gracefully; Rob Pike, Errors are values
- Eric Allman, The Robustness Principle Reconsidered (Postel's law as anti-pattern)
- `bufbuild/protovalidate`, `santhosh-tekuri/jsonschema`, `sony/gobreaker`, `failsafe-go/failsafe-go`
- Netflix `concurrency-limits` (adaptive concurrency limiting)
- AWS Architecture Blog, Exponential Backoff And Jitter
- Stripe Engineering, Designing robust and predictable APIs with idempotency
- OWASP API Security Top 10; Marc Brooker, Timeouts, retries, and backoff with jitter
- Confluent Schema Registry (compatibility modes)
- Tyler Treat, You Cannot Have Exactly-Once Delivery