State — Professional

Focus: staff/principal-level decisions. A finite state machine is a wire protocol the system speaks with itself. The runtime cost is small; the design cost is real; the operational cost — once ten million entities are in flight — is what you actually get paid to manage. Opinionated where the field agrees, explicit about trade-offs where it does not.

1. FSM as a system primitive

The State pattern in code is the same idea as Paxos in distributed consensus, an actor's behavior in Erlang, and a Temporal workflow: a function (state, event) -> (state', actions). The shape is invariant; only the substrate changes.

Primitive	State lives in	Transition trigger	Durability	Failure model
In-process FSM (this pattern)	Pointer/string field on an object	Method call	In-memory until persisted	Crash loses state unless serialized
Event-sourced aggregate	An append-only event log	Command produces events	Log is the source of truth	Replay rebuilds state on restart
Workflow engine (Temporal, Cadence)	History persisted by the engine	Signal, timer, activity completion	Engine guarantees durability across crashes	Engine handles retries; workflow code stays deterministic
Actor model (Erlang/OTP, Akka)	Per-actor mailbox + private state	Message receive	Volatile unless the actor persists	Supervisor restart; blank slate by default
CSP / process calculi (Go channels, occam)	Goroutine's position in its code	Channel send/receive	None — the PC is the state	Crash loses everything
Replicated state machine (Paxos, Raft)	Log of commands applied in order on every replica	Consensus-confirmed command	Log is durable on a quorum	Replicas converge given the same log

Read all six rows as one sentence: the system has a state, an input arrives, the state changes, side effects fire. The rows differ on durability, replication, determinism, and what crosses the boundary on a transition. A staff engineer's job is to know which substrate the problem deserves before reaching for any of them.
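
The shared shape can be sketched as a pure Go function. This is a minimal illustration, not a prescribed design: the order states, event names, and action strings are hypothetical, and actions are returned as data so that the caller decides when to perform them.

```go
package main

// State and Event are small closed sets; the names are illustrative only.
type State string
type Event string

const (
    Pending State = "Pending"
    Paid    State = "Paid"
    Shipped State = "Shipped"

    PaymentReceived    Event = "PaymentReceived"
    ShipmentDispatched Event = "ShipmentDispatched"
)

// Transition is the pure core: (state, event) -> (state', actions).
// It performs no I/O; actions are returned as data for the caller to run.
// ok is false when the event is not legal in the current state.
func Transition(s State, e Event) (next State, actions []string, ok bool) {
    switch {
    case s == Pending && e == PaymentReceived:
        return Paid, []string{"send_receipt"}, true
    case s == Paid && e == ShipmentDispatched:
        return Shipped, []string{"notify_customer"}, true
    }
    return s, nil, false
}
```

Each substrate in the table wraps a function of this shape and differs only in where the state is kept and who runs the returned actions.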

Three observations hold across all six:

Determinism is the load-bearing property. A pure FSM is trivially deterministic. Once a transition does I/O, determinism is gone and you need either idempotency or a replay-aware engine.
States must be finite and small. A "state" with a free-form blob is a record, not a state. Ten states is healthy; a hundred is a smell; a thousand is a defect.
Transitions are events, not method names. Pay(), Ship() look like methods, but in an event-sourced world they are recorded facts named in the past tense: PaymentRequested, ShipmentDispatched. Methods rot; events persist.

Lamport's The Part-Time Parliament (1998) frames Paxos as a replicated state machine — every replica is an FSM; the protocol ensures all replicas see the same input sequence. Hewitt's actor model (1973) frames a system as a soup of FSMs that communicate by message. Both predate microservices by decades; the math was already there.

2. Runtime cost analysis

Per-transition CPU and allocation matter on data planes (per-packet protocols, per-tick game loops). They are irrelevant on control planes (one order's lifecycle). Numbers below are Go 1.22, amd64, warm cache.

2.1 Interface dispatch

type State interface {
    Handle(*Machine, Event) State
    Name() string
}

func (m *Machine) Send(ev Event) {
    next := m.state.Handle(m, ev)   // interface dispatch
    if next != m.state { m.state = next }
}

m.state.Handle(...) is an indirect call: two loads (itab + data), one method-pointer load from the itab, then CALL CX. ~1-2 ns when one concrete state dominates and the branch predictor hits; ~3-5 ns under a mixed workload. The itab is resolved when a concrete state is converted to the interface, not on each call: statically known conversions use an itab emitted at compile time, while dynamic conversions go through runtime.getitab, which pays a one-time hash lookup (~50 ns) and caches the result.

2.2 State-function form (Rob Pike's lexer)

type stateFn func(*Machine) stateFn
func (m *Machine) run() { for s := initial; s != nil; { s = s(m) } }

A direct call through a function value: one load, one indirect CALL. No itab, no method-pointer dereference. ~1 ns — effectively the cost of a CALL. To avoid closure-per-state allocation (~30 ns each), keep state functions top-level — never closures capturing variables.

2.3 Table lookup

A literal []Transition linear scan is O(N) but cache-friendly. For ≤30-40 entries it often beats a hashed map by avoiding the hash. A map[struct{From, Event}]StateID is constant-time ~15-20 ns. For maximum throughput with small dense alphabets, encode as a 2D array:

var table [numStates][numEvents]StateID
next := table[currentState][event]   // ~1 ns, no hashing

TCP's connection state machine is small enough for this encoding: with N=11 states and M=8 event classes, the table has 88 entries.

2.4 Atomic swap vs mutex

A single goroutine owning the FSM (events arrive on a channel) needs no synchronization. Concurrent transitions need a mutex or atomic swap.

Form	Uncontested cost	Contended cost	Caveat
`sync.Mutex` around `Handle`	~25 ns	Hundreds of ns in the futex queue	Serializes `Handle` body — bad if `Handle` does I/O
`atomic.Pointer[State]` + CAS loop	~10 ns	Spins; better than parking for fast transitions	`Handle` may be called twice on CAS retry — must be pure
Channel-owned single goroutine	~50 ns send + recv	Backpressure via bounded channel	Idiomatic Go; ordering preserved

Rule: atomic swap for read-mostly state lookups; mutex for transitions with side effects; channel-driven loop when in doubt.
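
The channel-driven form can be sketched as follows. This is a minimal illustration under stated assumptions: the Idle/Running/Done states, the Start/Finish events, and the Machine type are hypothetical and independent of the Machine used elsewhere on this page.

```go
package main

// Event and State are hypothetical; any small comparable types work.
type Event int
type State int

const (
    Idle State = iota
    Running
    Done
)

const (
    Start Event = iota
    Finish
)

// step is the pure transition function; unknown pairs leave the state unchanged.
func step(s State, e Event) State {
    switch {
    case s == Idle && e == Start:
        return Running
    case s == Running && e == Finish:
        return Done
    }
    return s
}

// Machine is owned by exactly one goroutine, so the state needs no lock.
type Machine struct {
    events chan Event
    done   chan State
}

// NewMachine starts the owning goroutine. The bounded channel provides
// backpressure: senders block once `buffer` events are queued.
func NewMachine(buffer int) *Machine {
    m := &Machine{events: make(chan Event, buffer), done: make(chan State, 1)}
    go func() {
        state := Idle
        for ev := range m.events { // events are applied in arrival order
            state = step(state, ev)
        }
        m.done <- state // report the final state once the channel is closed
    }()
    return m
}

// Send enqueues an event; Close stops the loop and returns the final state.
func (m *Machine) Send(ev Event) { m.events <- ev }
func (m *Machine) Close() State  { close(m.events); return <-m.done }
```

Because only the owning goroutine touches the state, the transition function may stay a plain function with no atomics, at the price of one channel hop per event.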

2.5 Memory: closures vs structs vs IDs

Encoding	Per-state cost	Per-transition cost	Notes
State as closure (`stateFn` capturing vars)	1 heap alloc per closure	0 if reused	Avoid; use top-level functions
State as singleton struct (`&Pending{}`)	1 alloc on creation, shared globally	0	Standard form
State as named ID (`StateID = uint8`)	0 (one byte per machine)	0 (table lookup)	Best for table-driven FSMs

The singleton trick:

var pendingState = &Pending{}
var paidState    = &Paid{}

Each state is allocated once for the process; every Machine shares the pointer. State objects carry behavior; per-entity data lives on the Machine as a blackboard.

3. Distributed FSMs

When one entity's state is split across services, the FSM lives in no process. It lives in the consensus of several. Under partition, two services disagree about the state of order #1234; everything below follows from that fact.

3.1 Consistency models

Model	Read sees	Implementation	Use when
Linearizable	Latest committed, globally ordered	Single leader, sync replication (Raft, Paxos)	Money, inventory, access control
Sequential	Some prefix; monotonic per client	Causal replication	Collaboration, chat
Eventual	Convergence later	CRDTs, gossip	Telemetry, presence, caches
Read-your-writes	Your writes are visible to your reads	Sticky sessions	User dashboards
Bounded staleness	Up to N seconds old	Async replication with monitor	Analytics, audit

A replicated FSM is linearizable only if every transition goes through one authoritative log. Allow two services to advance the FSM independently and you have a CRDT problem — possibly unsolvable if transitions don't commute.

3.2 The CAP angle

A replicated state machine is the canonical CP system: under partition, the minority side cannot transition (sacrificing availability), but every committed transition is consistent. AP variants (Dynamo-style stores) cannot maintain a single canonical state for non-commutative transitions like Pay then Refund. For FSMs where order matters, CP is the only honest choice — build on Raft (etcd, Consul) or a database with serializable transactions.

3.3 Sagas as cross-service FSMs

A saga (Garcia-Molina and Salem, Sagas, 1987; Pat Helland's Life beyond Distributed Transactions, 2007, argues for the same style of coordination without distributed transactions) is an FSM whose states span services. The order service's OrderShipping FSM invokes commands on payments (Charge) and inventory (Reserve); each may fail, requiring compensating transitions. A typical state set: Created → InventoryReserved → Charged → Confirmed, with branches into ChargeFailed → Compensating → Failed.

Two engineering rules: compensations are not inverses (refund is a new transaction, not an undo), and every saga transition is idempotent (the transition UUID doubles as a downstream idempotency key). Hand-rolled sagas are reasonable for two or three steps; beyond that, adopt Temporal (§6).
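
A hand-rolled saga of the small kind described above might look like the sketch below. It is an in-process illustration only: Step, RunSaga, and ErrDeclined are hypothetical names, the Do and Compensate functions stand in for calls to other services, and nothing here persists progress between steps.

```go
package main

import "errors"

// ErrDeclined is a hypothetical failure a step might return.
var ErrDeclined = errors.New("charge declined")

// Step pairs a forward action with its compensation. Both are stand-ins
// for calls to other services.
type Step struct {
    Name       string
    Do         func() error
    Compensate func() error // a new forward action (e.g. refund), not an undo
}

// RunSaga executes steps in order. On the first failure it runs the
// compensations of the already-completed steps in reverse order and returns
// the names of the compensated steps together with the original error.
func RunSaga(steps []Step) (compensated []string, err error) {
    for i, s := range steps {
        if err = s.Do(); err == nil {
            continue
        }
        for j := i - 1; j >= 0; j-- {
            if cerr := steps[j].Compensate(); cerr != nil {
                // A failed compensation needs retry or human attention;
                // here it is only attached to the returned error.
                err = errors.Join(err, cerr)
                continue
            }
            compensated = append(compensated, steps[j].Name)
        }
        return compensated, err
    }
    return nil, nil
}
```

A production version would also record each completed step durably and pass the transition UUID to every downstream call, so that a retry after a crash is recognized as a duplicate.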

3.4 Where the FSM "lives"

Pattern	Authoritative state	Propagation	Cost
Single-owner aggregate	One service owns end-to-end	Others use commands / read-only	Low — classic DDD
Outbox + event log	Owner writes state + events in one tx; consumers project	At-least-once, idempotent consumers	Moderate — outbox + CDC or polling
Replicated log (Raft/Kafka)	The log is the FSM	All consumers replay the same sequence	High — operational burden of a distributed log

Default to the first. Move to the second only when read scaling demands it; to the third only when the FSM truly belongs to no one service.

4. Event sourcing vs current-state

Current-state stores state as a field. Event sourcing stores the sequence of events and derives current state by folding.

Concern	Current-state	Event-sourced
Read latency	One row, immediate	Fold N events, may snapshot
Write cost	Update one row	Append one event
Audit	Needs a separate table	Built in — events are the audit
Schema evolution	Alter table, migrate rows	New event type; old events stay readable
"Why is the state X?"	Logs only if disciplined	Trivial — read the events
Storage	O(entities)	O(transitions); grows forever
Replay under new logic	No	Yes
Time-travel queries	No	Yes
Operational simplicity	High	Moderate to low
Snapshotting needed	No	Yes for replay performance

Current-state is the right default. Event sourcing earns its complexity when at least two of the following hold: audit is a product requirement (finance, healthcare); replay under new logic has commercial value; the entity has many transitions over a long life; multiple downstream projections need different shapes of the same data.
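
Deriving current state by folding can be sketched as below. The Event kinds and the Order fields are hypothetical and exist only to show the shape of the fold; a real aggregate would carry more data and validate each event.

```go
package main

// Event is a recorded fact; the kinds below are hypothetical.
type Event struct {
    Kind   string
    Amount int64 // minor units; only meaningful for some kinds
}

// Order is the current state derived from the events.
type Order struct {
    Status   string
    Paid     int64
    Refunded int64
}

// Apply is the fold step: it derives the next state from one event and
// never performs I/O, so replaying the same events yields the same state.
func (o Order) Apply(e Event) Order {
    switch e.Kind {
    case "OrderPlaced":
        o.Status = "Pending"
    case "PaymentReceived":
        o.Status, o.Paid = "Paid", o.Paid+e.Amount
    case "RefundIssued":
        o.Status, o.Refunded = "Refunded", o.Refunded+e.Amount
    }
    return o
}

// Fold rebuilds current state from a starting point (the zero value or a
// snapshot) plus the events recorded after it.
func Fold(start Order, events []Event) Order {
    for _, e := range events {
        start = start.Apply(e)
    }
    return start
}
```

Folding from a saved intermediate state plus the later events gives the same result as folding everything from the zero value, which is the property that makes periodic snapshots safe.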

4.1 Snapshotting

Folding 10 M events per entity is unworkable. Snapshot periodically:

func loadAggregate(id string) Aggregate {
    snap := snapshotStore.Latest(id)            // {Version, State []byte}
    events := eventStore.Since(id, snap.Version)
    agg := decode(snap.State)
    for _, e := range events { agg = agg.Apply(e) }
    return agg
}

Choose N (snapshot every N events) so average load time stays under a budget (say, 50 ms). Treat snapshots as a cache, not a source of truth — they assume the schema in effect when taken; schema changes require migration or replay-from-events.

5. Hierarchical state machines (HSMs)

A flat FSM with twenty states often hides a hierarchy. UML 2.5 statecharts (Harel, 1987) formalize this with composite states, history pseudostates, and orthogonal regions.

Concept	Meaning	Why it matters
Composite state	Contains sub-states	Common entry actions for a group
Initial pseudostate	Default sub-state on entry	"Where you start" inside a composite
History pseudostate (H, H*)	Re-enter the last active sub-state (deep variant recurses)	"Resume where you left off"
Orthogonal regions	Concurrent sub-states (AND-states)	`Connected AND Authenticated` simultaneously
Entry/exit actions	Run on every boundary crossing	DRY for setup/teardown
Internal transition	No entry/exit	Side effects without leaving
Guards	Boolean condition gating a transition	Conditional routing

5.1 By hand in Go

type State interface {
    Enter(m *Machine); Exit(m *Machine)
    Handle(m *Machine, ev Event) State
}

type CompositeConnected struct{}
func (CompositeConnected) Enter(m *Machine) { m.openSocket() }
func (CompositeConnected) Exit(m *Machine)  { m.closeSocket() }
func (c CompositeConnected) Handle(m *Machine, ev Event) State {
    if ev == EvDisconnect { return Disconnected{} }
    return c
}

type Authenticating struct{ CompositeConnected }   // embed parent
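func (a Authenticating) Handle(m *Machine, ev Event) State {
    switch ev {
    case EvAuthOK:   return Authenticated{}
    case EvAuthFail: return Disconnected{}
    }
    // Delegate up. If the parent did not transition it returns itself;
    // in that case stay in the sub-state instead of becoming the parent.
    if next := a.CompositeConnected.Handle(m, ev); next != State(a.CompositeConnected) {
        return next
    }
    return a
}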

Embedding gives sub-states access to parent behavior; the walk to find a handler is explicit. Go has no built-in "send this event to enclosing states until one consumes it" — that walk is the price of hand-rolled HSMs.
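
One way to make that walk generic is to give each state a parent pointer and a handler table, then offer the event to each enclosing state in turn. The sketch below is an illustration under assumptions: Node, Dispatch, and the state and event names are hypothetical, and entry/exit actions are left out.

```go
package main

// Event and the node names used with it are hypothetical.
type Event string

// Node is one state in the hierarchy. Handlers maps an event to the name of
// the target state; Parent is nil for top-level states.
type Node struct {
    Name     string
    Parent   *Node
    Handlers map[Event]string
}

// Dispatch offers the event to the current state, then to each enclosing
// state in turn, and returns the first target found. handled is false when
// no state on the path to the root consumes the event. current must be
// non-nil.
func Dispatch(current *Node, ev Event) (target string, handled bool) {
    for n := current; n != nil; n = n.Parent {
        if t, ok := n.Handlers[ev]; ok {
            return t, true
        }
    }
    return current.Name, false
}
```

With this layout a sub-state lists only the events it treats specially, and an event such as a disconnect is declared once on the composite.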

5.2 Libraries and when hierarchy helps

Library	Hierarchical	Notes
`looplab/fsm`	No	Most popular; table-driven, flat FSMs
`qmuntal/stateless`	Yes (parent states, entry/exit)	Best Go option for HSMs; port of .NET Stateless
`cocoonspace/fsm`	No	Small; simple cases

For orthogonal regions, run multiple FSMs side by side via shared context. A flat FSM with N states and M events has up to N×M transition-table entries; a two-level HSM declares common transitions once at the composite level. For a 20-state FSM with 5 universal events, the flat form needs 100 rows for those events and the hierarchical form needs 5; with about 20 events in total (400 cells), that is roughly a quarter fewer transitions and test rows. Hierarchy trades one-time cognitive cost for recurring testing savings.

6. Temporal/Cadence patterns

Temporal and its predecessor Cadence (Uber) implement durable state machines: the workflow is Go code, the engine persists the workflow's history, and on crash the engine replays the history through the same code. Code is the FSM; history is the durability.

6.1 The deterministic replay constraint

Workflow code must be deterministic given the same history:

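// imports: "errors", "time", "go.temporal.io/sdk/workflow"
func OrderWorkflow(ctx workflow.Context, orderID string) error {
    // Activities need a timeout; the one-minute value here is illustrative.
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: time.Minute,
    })
    var payment PaymentResult
    if err := workflow.ExecuteActivity(ctx, ChargeCard, orderID).Get(ctx, &payment); err != nil {
        return err
    }
    if err := workflow.ExecuteActivity(ctx, ReserveInventory, orderID).Get(ctx, nil); err != nil {
        // Wait for the compensation; returning first could abandon the refund.
        if cerr := workflow.ExecuteActivity(ctx, RefundCharge, payment.ChargeID).Get(ctx, nil); cerr != nil {
            return errors.Join(err, cerr)
        }
        return err
    }
    // Wait for shipping too, so its failure surfaces as the workflow's result.
    return workflow.ExecuteActivity(ctx, ShipOrder, orderID).Get(ctx, nil)
}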

The function runs many times: the original execution, every worker restart, every replay-debugging session. Forbidden: time.Now, rand.Intn, os.Getenv, time.Sleep, direct HTTP/DB, native goroutines, unstable map iteration that affects branching. Replace them, in the same order, with workflow.Now; workflow.SideEffect for random values; workflow.SideEffect or workflow inputs for environment reads; workflow.Sleep; activities; workflow.Go; and iteration over pre-sorted slices.

The reframing that helps: workflow code is the FSM's transition function; activities are I/O; the engine is the durable substrate. Anything not pure logic must cross the activity boundary.

6.2 The history abstraction

Each workflow has a history: a sequence of events the engine durably records (WorkflowExecutionStarted, ActivityTaskScheduled, ActivityTaskCompleted, TimerStarted, WorkflowExecutionSignaled, ...). On replay the engine matches each workflow.ExecuteActivity call positionally against the next ActivityTaskCompleted and returns the recorded result. When code outpaces history, the engine schedules new work. Same idea as event sourcing (§4), refined for orchestration.
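
The positional matching can be illustrated with a toy replay loop. This is not Temporal's implementation or API: History, Context, and Execute are hypothetical names, the history lives in memory, and only completed activity results are recorded.

```go
package main

// History is a hypothetical in-memory stand-in for the engine's durable log
// of completed activity results.
type History struct {
    results []string
}

// Context tracks the replay position for one execution of the workflow code.
type Context struct {
    hist     *History
    pos      int
    Executed int // how many activities actually ran during this execution
}

func NewContext(h *History) *Context { return &Context{hist: h} }

// Execute returns the recorded result when history already covers this call
// position; otherwise it runs the activity and appends the result.
func (c *Context) Execute(activity func() string) string {
    if c.pos < len(c.hist.results) {
        r := c.hist.results[c.pos] // replay: do not run the activity again
        c.pos++
        return r
    }
    r := activity()
    c.Executed++
    c.hist.results = append(c.hist.results, r)
    c.pos++
    return r
}
```

Running the same workflow function twice against one History executes each activity once and returns identical results the second time, which is why the code between Execute calls has to be deterministic.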

6.3 When Temporal earns its keep

Sagas with more than two or three steps and durable state between them; long-running workflows (subscription billing, document signing, fraud reviews); workflows with timers, retries, and external signals arriving over hours; cross-team workflows where retry/timeout/visibility policies must be uniform. Don't reach for it on per-request FSMs, anything bounded to one transaction, or in-process FSMs (TCP, parsers). The engine has real operational cost; it fits problems where durability is non-negotiable.

7. Observability deep dive

An FSM is among the easiest things to instrument well and the most often instrumented badly. What you usually want is not "current state count by name" (a gauge); it is "duration spent in each state" (a histogram).

7.1 Per-state duration histograms

stateLatency := prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "fsm_state_duration_seconds",
    Buckets: prometheus.ExponentialBuckets(0.1, 4, 12),   // 100 ms .. ~4.9 days (0.1 * 4^11 s)
}, []string{"machine", "state"})

func (m *Machine) Transition(ev Event) {
    enteredAt := m.lastEnter
    prev := m.state
    m.state = prev.Handle(m, ev)
    if m.state != prev {
        stateLatency.WithLabelValues(m.kind, prev.Name()).
            Observe(time.Since(enteredAt).Seconds())
        m.lastEnter = time.Now()
    }
}

Buckets must span the timescales of interest. For an FSM whose Shipped → Delivered step takes days, 12 exponential buckets at base 4 starting from 100 ms top out at about 4.9 days (0.1 × 4^11 s); add a 13th bucket if longer durations must be resolved. The default prometheus.DefBuckets (5 ms to 10 s) puts every observation longer than 10 s into the +Inf bucket.

Canonical questions: p50 of Charging should be seconds; p99 tens of seconds; p999 is the long tail where stuck workflows hide.

7.2 Stuck workflow alerts

# Entities older than 1 hour in a state that should resolve in minutes
max by (state) (
  time() - fsm_entity_state_entered_at{state=~"Charging|Reserving|Refunding"}
) > 3600

Any state whose normal duration is in seconds is stuck once occupied for an order of magnitude longer. The alert must be per-state — terminal states (Delivered) are expected to be old.

Complementary: alert on rate(fsm_transitions_out_total[15m]) / fsm_entities_in_state falling below a threshold. Liveness, not just lateness.
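
Where the check runs as a batch job rather than an alert rule, the per-state comparison can be sketched in Go. Entity, StuckEntities, and the limits map are hypothetical; times are Unix seconds to mirror the time() arithmetic in the rule above.

```go
package main

// Entity is a hypothetical row: which state it is in and since when.
type Entity struct {
    ID          string
    State       string
    EnteredUnix int64 // Unix seconds at which the entity entered State
}

// StuckEntities returns the IDs of entities that have occupied a state for
// longer than that state's limit. States without a limit (terminal states
// such as Delivered) are never reported.
func StuckEntities(entities []Entity, limitSeconds map[string]int64, nowUnix int64) []string {
    var stuck []string
    for _, e := range entities {
        limit, ok := limitSeconds[e.State]
        if ok && nowUnix-e.EnteredUnix > limit {
            stuck = append(stuck, e.ID)
        }
    }
    return stuck
}
```

Keeping the limits in a per-state map makes the "terminal states are expected to be old" rule explicit: a state is exempt simply by having no entry.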

7.3 W3C Trace Context across transitions

A single entity's lifecycle spans many requests, goroutines, and processes. Persist W3C Trace Context with each transition so the lifecycle renders as one trace:

func (m *Machine) Send(ctx context.Context, ev Event) {
    ctx, span := tracer.Start(ctx, "fsm.transition",
        trace.WithAttributes(
            attribute.String("entity.id", m.id),
            attribute.String("state.from", m.state.Name()),
            attribute.String("event", ev.Name())))
    defer span.End()
    m.state = m.state.Handle(m, ev)
    span.SetAttributes(attribute.String("state.to", m.state.Name()))
    m.persistTransition(ctx, Transition{ /* includes TraceParent from span */ })
}

When the next transition resumes on another worker, the stored TraceParent becomes the parent of its span. The trace tree shows the entity's lifetime as one logical operation across many workers and many days.
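
The stored value is the traceparent header string: version, trace ID, parent span ID, and flags, joined by hyphens. The sketch below validates and renders version 00 only; TraceParent and ParseTraceParent are hypothetical helpers, and in a service that already uses OpenTelemetry the propagation.TraceContext propagator would normally do this work.

```go
package main

import (
    "errors"
    "strings"
)

// TraceParent holds the fields of a W3C traceparent header, version 00.
type TraceParent struct {
    TraceID  string // 32 lowercase hex characters
    ParentID string // 16 lowercase hex characters
    Flags    string // 2 lowercase hex characters; "01" means sampled
}

var ErrBadTraceParent = errors.New("malformed traceparent")

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
    if len(s) != n {
        return false
    }
    for _, c := range s {
        if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
            return false
        }
    }
    return true
}

// String renders the value to store alongside a transition.
func (t TraceParent) String() string {
    return "00-" + t.TraceID + "-" + t.ParentID + "-" + t.Flags
}

// ParseTraceParent validates a stored value before it is used as a parent.
// All-zero trace and parent IDs are invalid and are rejected.
func ParseTraceParent(s string) (TraceParent, error) {
    p := strings.Split(s, "-")
    if len(p) != 4 || p[0] != "00" ||
        !isLowerHex(p[1], 32) || !isLowerHex(p[2], 16) || !isLowerHex(p[3], 2) ||
        strings.Trim(p[1], "0") == "" || strings.Trim(p[2], "0") == "" {
        return TraceParent{}, ErrBadTraceParent
    }
    return TraceParent{TraceID: p[1], ParentID: p[2], Flags: p[3]}, nil
}
```

Validating on read matters because the value may sit in storage for days before the next worker picks it up.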

7.4 Structured logs

slog.InfoContext(ctx, "fsm transition",
    "entity_id", m.id, "kind", m.kind,
    "state_from", prev.Name(), "state_to", m.state.Name(),
    "event", ev.Name(), "duration", time.Since(m.lastEnter),
    "actor", actorFromContext(ctx),
    "trace_id", trace.SpanContextFromContext(ctx).TraceID().String())

These eight fields cover most diagnostic queries. Log every transition; storage is cheap; "what happened to order #1234?" becomes one query.

8. Failure modes & recovery

A real FSM does I/O on transitions — DB write, queue publish, API call. Any can fail partway. The recovery model is not optional.

8.1 Partial action failure

Pay does three things atomically from the user's view: charge the card, mark the order paid, publish OrderPaid. They are not atomic in the system.

Failure point	DB	Card	Queue	Recovery
Charge call timed out	Pending	Possibly charged	No event	Reconcile via charge ID lookup
Charge succeeded, DB write failed	Pending	Charged	No event	Retry; idempotency prevents double-charge
DB succeeded, queue publish failed	Paid	Charged	No event	Outbox dispatcher retries
All three succeeded	Paid	Charged	Published	Done

The third row is why the outbox pattern exists: write the event into the same DB transaction as the state change; a separate dispatcher publishes to the queue. DB is the source of truth; queue is best-effort distribution.
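
The outbox mechanics can be sketched in memory. This is an illustration, not a database-backed implementation: Store, MarkPaid, Dispatch, and ErrQueueDown are hypothetical, and a mutex stands in for the single database transaction that makes the state change and the event row commit together.

```go
package main

import (
    "errors"
    "sync"
)

// ErrQueueDown is a hypothetical error a publisher might return.
var ErrQueueDown = errors.New("queue unavailable")

// Store is an in-memory stand-in for one database. The mutex plays the role
// of a transaction: state and outbox change together or not at all.
type Store struct {
    mu     sync.Mutex
    state  map[string]string
    outbox []string // pending events, oldest first
}

func NewStore() *Store { return &Store{state: map[string]string{}} }

// MarkPaid records the state change and the event in one critical section.
func (s *Store) MarkPaid(orderID string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.state[orderID] = "Paid"
    s.outbox = append(s.outbox, "OrderPaid:"+orderID)
}

// Dispatch publishes pending events in order and removes each one only after
// publish succeeds. A failure leaves the remainder for the next pass, so
// delivery is at-least-once and consumers must be idempotent. Holding the
// lock while publishing is a simplification for this sketch.
func (s *Store) Dispatch(publish func(event string) error) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    for len(s.outbox) > 0 {
        if err := publish(s.outbox[0]); err != nil {
            return err
        }
        s.outbox = s.outbox[1:]
    }
    return nil
}

// Pending reports how many events still await publication.
func (s *Store) Pending() int {
    s.mu.Lock()
    defer s.mu.Unlock()
    return len(s.outbox)
}
```

A queue outage then costs latency rather than correctness: the order is already Paid, and the event waits in the outbox until a later dispatch pass succeeds.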

8.2 Reconciliation loops

For external state the FSM doesn't fully control:

func reconcileCharges(ctx context.Context) error {
    rows, err := db.QueryContext(ctx, `
        SELECT order_id, charge_id FROM orders
        WHERE state = 'Charging' AND updated_at < NOW() - INTERVAL '5 minutes'`)
    if err != nil {
        return err
    }
    defer rows.Close()
    for rows.Next() {
        var orderID, chargeID string
        if err := rows.Scan(&orderID, &chargeID); err != nil {
            return err
        }
        actual, err := paymentClient.GetCharge(ctx, chargeID)
        if err != nil {
            continue // provider unreachable: leave the order for the next pass
        }
        switch actual.Status {
        case "succeeded": fsm.Load(orderID).Send(ctx, EvChargeConfirmed)
        case "failed":    fsm.Load(orderID).Send(ctx, EvChargeFailed)
        }
    }
    return rows.Err()
}

Two rules: reconciliation must be idempotent (sending EvChargeConfirmed twice is a no-op past Charging), and must have a horizon (only entities stuck > 5 min, not all of them).

8.3 Eventually-consistent FSMs

When the FSM's state is the aggregate of several services' partial views, eventual consistency is the correctness model. A subscription's Active state requires a successful charge (payments), a provisioned account (identity), and an entitlement (billing). Each confirms asynchronously; the FSM gathers and transitions only when all three arrive — a gather-and-transition pattern.

type SubscriptionFSM struct {
    State                                       string
    PaymentOK, AccountOK, EntitlementOK         bool
}
func (s *SubscriptionFSM) onConfirmation(svc string) {
    switch svc {
    case "payment":     s.PaymentOK = true
    case "account":     s.AccountOK = true
    case "entitlement": s.EntitlementOK = true
    }
    if s.PaymentOK && s.AccountOK && s.EntitlementOK { s.State = "Active" }
}

If a fact never arrives, a timeout handler advances to PendingVerification where humans intervene.

8.4 Compensating actions on entry

When the FSM enters an error state, encode compensation in the entry hook so it is co-located with the failure semantics:

type ChargeFailed struct{}
func (ChargeFailed) Enter(m *Machine) {
    if m.data.ReservationID != "" {
        m.inventory.Release(m.ctx, m.data.ReservationID)
    }
    m.notifications.Send(m.ctx, "charge_failed", m.data.CustomerID)
}

Compensation is part of the FSM's contract: entering ChargeFailed is defined as releasing the reservation. Without it, "charge failed but inventory stayed reserved" becomes a recurring ops ticket.

9. Schema evolution at scale

Adding a state to an FSM with 10 M in-flight entities is a migration, not a code change.

9.1 Compatibility matrix

Change	Safe?	Notes
Add terminal state	Yes	New code; future transitions can reach it
Add intermediate state	With care	New code must handle entities in the old path
Remove state	After no entity has been there for a retention window	Tombstone the handler, then delete
Rename state	Treat as remove + add	Persist both names; route reads to either
Change a transition target	Risky	In-flight entities follow the old graph until they exit the affected state
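Add event	With care	Deploy the handler before any producer sends it; older code should reject the unknown event explicitly rather than silently ignore it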
Remove event	After no producer sends it	Tombstone the handler

9.2 Double-write during migration

When the storage shape changes, write to both during the migration:

func (m *Machine) persistTransition(ctx context.Context, t Transition) error {
    return m.db.RunInTx(ctx, func(tx *sql.Tx) error {
        if _, err := tx.ExecContext(ctx,
            `UPDATE orders SET state = $1 WHERE id = $2`, t.To, t.EntityID); err != nil {
            return err
        }
        _, err := tx.ExecContext(ctx,
            `INSERT INTO state_transitions (entity_id, from_state, to_state, event, at)
             VALUES ($1, $2, $3, $4, $5)`,
            t.EntityID, t.From, t.To, t.Event, t.OccurredAt)
        return err
    })
}

Double-write continues until every reader migrates and the new shape is fully populated. A separate deploy then stops writing the old shape; the old column drops after a retention window.

9.3 Shadow execution

For changes to transition logic, run the new FSM in shadow — compute what it would do, log divergence, commit the old result:

func (m *Machine) Send(ctx context.Context, ev Event) {
    oldNext := m.state.Handle(m, ev)
    if shadowEnabled {
        if shadow := newFSM(m.data).Handle(ev); shadow.Name() != oldNext.Name() {
            slog.WarnContext(ctx, "fsm shadow divergence",
                "entity_id", m.id, "event", ev.Name(),
                "old_target", oldNext.Name(), "new_target", shadow.Name())
        }
    }
    m.state = oldNext
    m.persist(ctx)
}

Run shadow for at least one full lifecycle of the longest-lived entity. Any divergence the change was not meant to produce is a bug; zero unexplained divergence over that window is the evidence that cutover is safe.

9.4 Flag-gated rollout

Flip the live behavior under a feature flag partitioned by entity (1% → 10% → 50% → 100%). The flag check happens at entity creation and is sticky for the entity's lifetime — an entity must not cross between graph versions mid-flight, or it will encounter states the cohort's graph doesn't define.
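
Sticky assignment can be sketched by hashing the entity ID into a bucket once, at creation, and storing the result on the entity. GraphVersion, Entity, bucket, and NewEntity are hypothetical names; FNV-1a is used only because it is in the standard library and stable across processes.

```go
package main

import "hash/fnv"

// GraphVersion identifies which transition graph an entity follows.
type GraphVersion int

const (
    GraphV1 GraphVersion = 1
    GraphV2 GraphVersion = 2
)

// Entity stores its graph version; the field is written once at creation.
type Entity struct {
    ID    string
    Graph GraphVersion
}

// bucket maps an entity ID to a stable value in [0, 100).
func bucket(id string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(id)) // Write on a hash.Hash never returns an error
    return h.Sum32() % 100
}

// NewEntity assigns the graph version from the rollout percentage in effect
// at creation time. Later changes to the percentage do not affect it.
func NewEntity(id string, rolloutPercent uint32) Entity {
    v := GraphV1
    if bucket(id) < rolloutPercent {
        v = GraphV2
    }
    return Entity{ID: id, Graph: v}
}
```

Every later transition reads Entity.Graph rather than the live flag, so raising the percentage from 10 to 50 only affects entities created afterwards.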

10. Security

FSMs encode who can do what when. Mistakes show up as authorizing the wrong actor for a transition, or authorizing on the wrong dimension (state instead of event).

10.1 Authorize the event, not the state

"Customer can read an order in Shipped" authorizes by state and leaks intent. The right model: "customer can send CancelOrder on their own order, provided the FSM accepts it from the current state."

type Authorization struct {
    Event EventName
    Roles []string
    Guard func(actor Principal, m *Machine) bool
}

var policies = map[EventName]Authorization{
    "Pay":         {Roles: []string{"customer"},   Guard: ownsOrder},
    "RefundOrder": {Roles: []string{"agent"},      Guard: hasRefundPermission},
    "FreezeOrder": {Roles: []string{"fraud_team"}, Guard: nil},
}

func (m *Machine) Send(ctx context.Context, ev Event) error {
    pol, ok := policies[ev.Name()]
    if !ok { return ErrUnauthorized }   // no policy for this event: deny by default
    actor := principalFromContext(ctx)
    if !hasAnyRole(actor, pol.Roles)            { return ErrUnauthorized }
    if pol.Guard != nil && !pol.Guard(actor, m) { return ErrForbidden }
    next := m.state.Handle(m, ev)
    if next == m.state { return ErrInvalidTransition }
    m.state = next
    return nil
}

State-based authorization conflates "you may know it is shipped" with "you may transition it." Different questions.

10.2 Audit log of who sent which event

Every transition deserves an audit row.

Column	Purpose
`transition_id`	UUID; idempotency anchor
`entity_id`, `entity_kind`	Which entity
`actor_id`, `actor_kind`	`user` / `service` / `system`
`event`	Event name
`state_from`, `state_to`	The transition
`occurred_at`	Wall clock
`trace_id`	Link to the distributed trace
`outcome`	`applied` / `rejected_invalid` / `rejected_auth`
`reason`	Optional human note

Append-only, immutable, separate retention from operational logs. For regulated FSMs the audit log is the legal record. AWS CloudTrail is the same pattern at scale.

10.3 Sensitive states

Frozen, Suspended, UnderInvestigation carry extra rules:

Entry to a sensitive state requires elevated authorization.
Reads of a sensitive state may require their own authorization (a "frozen" signal can tip off a fraudster — show others a neutral "processing" indicator).
Transitions out of a sensitive state are at least as privileged as transitions in.

10.4 Replay attacks on event channels

Events arriving over a network can be replayed. Defenses are the same as for commands: idempotency keys per event, signed envelopes, max-age timestamps. The idempotency key naturally maps to transition_id — a duplicate event finds the prior transition and short-circuits.
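
The three defenses can be combined in one receiver, sketched below with the standard library. The Envelope layout, the newline-joined signing input, and the in-memory seen map are assumptions made for illustration; a real system would use a durable dedup store and a properly specified envelope format.

```go
package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "errors"
    "strconv"
)

// Envelope is a hypothetical wire format for one event.
type Envelope struct {
    TransitionID string // idempotency key
    Event        string
    SentUnix     int64  // sender's clock, Unix seconds
    Signature    string // hex HMAC-SHA256 over the other fields
}

var (
    ErrBadSignature = errors.New("bad signature")
    ErrTooOld       = errors.New("event older than max age")
    ErrDuplicate    = errors.New("transition already applied")
)

// Sign computes the signature under a shared key. Fields are joined with
// newlines, which assumes that no field contains one.
func Sign(key []byte, e Envelope) string {
    mac := hmac.New(sha256.New, key)
    mac.Write([]byte(e.TransitionID + "\n" + e.Event + "\n" + strconv.FormatInt(e.SentUnix, 10)))
    return hex.EncodeToString(mac.Sum(nil))
}

// Receiver checks signature, age, and duplicates before accepting an event.
type Receiver struct {
    Key       []byte
    MaxAgeSec int64
    seen      map[string]bool // stand-in for a durable dedup store
}

// Accept returns nil exactly once per valid TransitionID.
func (r *Receiver) Accept(e Envelope, nowUnix int64) error {
    if !hmac.Equal([]byte(Sign(r.Key, e)), []byte(e.Signature)) {
        return ErrBadSignature
    }
    if nowUnix-e.SentUnix > r.MaxAgeSec {
        return ErrTooOld
    }
    if r.seen[e.TransitionID] {
        return ErrDuplicate
    }
    if r.seen == nil {
        r.seen = map[string]bool{}
    }
    r.seen[e.TransitionID] = true
    return nil
}
```

Callers will usually treat ErrDuplicate as success for the sender, since the transition it names has already been applied.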

11. Testing at scale

Beyond unit tests, an FSM with thousands of transitions needs property-level assurances.

11.1 Property-based testing

func TestNoEntityEverReachesInvalidState(t *testing.T) {
    valid := map[StateID]bool{Pending: true, Paid: true, Shipped: true,
                              Delivered: true, Cancelled: true, Refunded: true}
    f := func(events []EventID) bool {
        m := New()
        for _, e := range events {
            m.Send(e)
            if !valid[m.State()] { return false }
        }
        return true
    }
    if err := quick.Check(f, &quick.Config{MaxCount: 100000}); err != nil {
        t.Fatal(err)
    }
}

100k random sequences explore the graph more aggressively than handwritten tests. Properties worth checking: state always in declared set; no unreachable state ever reached; every reachable state reached by some sequence; terminal states absorbing; idempotent events truly idempotent.

gopter (https://github.com/leanovate/gopter) is stronger than testing/quick — shrinks failing inputs, supports constrained generators, gives reproducible seeds.
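
The two reachability properties do not need random inputs when the transitions are available as a table; a breadth-first walk answers them exactly. The sketch below assumes a hypothetical Table type keyed by state and event names.

```go
package main

import "sort"

// Table maps a state to its outgoing transitions: event name -> next state.
// The state and event names used with it are hypothetical.
type Table map[string]map[string]string

// Reachable returns every state reachable from start, sorted by name,
// using a breadth-first walk over the table.
func Reachable(t Table, start string) []string {
    seen := map[string]bool{start: true}
    queue := []string{start}
    for len(queue) > 0 {
        s := queue[0]
        queue = queue[1:]
        for _, next := range t[s] {
            if !seen[next] {
                seen[next] = true
                queue = append(queue, next)
            }
        }
    }
    out := make([]string, 0, len(seen))
    for s := range seen {
        out = append(out, s)
    }
    sort.Strings(out)
    return out
}

// Unreachable lists declared states that no event sequence can reach.
func Unreachable(t Table, start string, declared []string) []string {
    reach := map[string]bool{}
    for _, s := range Reachable(t, start) {
        reach[s] = true
    }
    var dead []string
    for _, s := range declared {
        if !reach[s] {
            dead = append(dead, s)
        }
    }
    return dead
}
```

A test that fails when Unreachable returns anything catches a state that was declared but never wired into the graph.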

11.2 Model checking with TLA+ / Alloy

For correctness-critical FSMs (payments, consensus, locks), model the spec in TLA+ and check invariants exhaustively. A two-step example:

VARIABLES state, charged, refunded
Init   == state = "Pending" /\ charged = FALSE /\ refunded = FALSE
Pay    == state = "Pending" /\ state' = "Paid"     /\ charged'  = TRUE /\ UNCHANGED refunded
Refund == state = "Paid"    /\ state' = "Refunded" /\ refunded' = TRUE /\ UNCHANGED charged
Next   == Pay \/ Refund
Inv    == charged \/ ~refunded     \* cannot refund what was not charged

TLC explores every reachable state and verifies the invariant. The implementation can be correct against the wrong spec; the spec is the contract. Alloy is the lighter alternative for finite scopes.

11.3 Chaos engineering on stuck workflows

Inject failure deliberately: kill a worker mid-transition (verify reconciliation); delay a downstream past the FSM's timeout (verify the timeout state and compensation); drop a fraction of events (verify retries and idempotency); replay an old event from DLQ (verify rejection as duplicate). toxiproxy plus a transition-killer sidecar is enough tooling to flush out obvious gaps.

12. Anti-patterns at scale

Anti-pattern	Symptom	Fix
Distributed FSM with no idempotent transitions	Duplicate charges, double refunds, drift	Idempotency key on every transition; dedup store; retries assume at-least-once
Soft states that drift	A "state" computed from fields; different services compute it differently	Store the state name explicitly; reconcile divergent views via a single owner
States defined in code AND in the database	Schema has a `status` enum; code has a `State` interface; they disagree	One source of truth — generate code from schema or vice versa
State machine used as workflow engine without checkpointing	Pay-then-ship-then-notify in one function; crashes leave inconsistency	Adopt Temporal/Cadence, or persist between every step with retries and reconciliation
Polymorphic state objects stored as JSON	New state field breaks unmarshalling of in-flight entities	Store the state name; reconstitute behavior from a registry
FSM that swallows invalid events	Bugs hide because invalid events are no-ops	Distinguish "no-op because already done" from "no-op because illegal" — the second is an error
Catch-all `Failed` state	Every error path goes to the same state; recovery needs manual review	Specific failure states per category (`ChargeFailed`, `ReservationFailed`) with their own recovery
Side effects scattered across `Enter`, `Handle`, and callers	Auditing is painful	Side effects in `Enter` only; `Handle` is pure (state → state); the machine drives I/O
Per-entity workers	One goroutine per active entity; OOM at scale	Single pool consumes events from a shared queue; load, apply, persist, release
No retention on transitions table	500 GB audit table; queries time out	Partition by month; archive partitions older than retention
Global enum of states across all FSM kinds	One enum with 400 values	One state enum per FSM; the catalog is the architecture

The deepest anti-pattern: using state machines as workflow engines without the engine. The shape is identical (events, transitions, side effects, durability) and the code is appealingly small at first. It becomes a maintenance disaster the first time a process crashes mid-saga. If the workflow has more than two non-trivial steps, more than a minute between them, or slow external systems, use Temporal/Cadence/Conductor. The pattern is not the engine.

13. Closing principles

A finite state machine is a contract. Honor it at every boundary.

States are nouns; events are verbs. A state named with a verb (Processing, Paying) is usually two states pretending to be one. Split it (PaymentPending, PaymentConfirmed). Naming is the architecture.
The transition is the audit record. Every transition deserves a row with from, to, event, actor, trace_id, occurred_at. Without it, "what happened to this entity?" is archaeology. With it, the FSM tells its own story.
Determinism is non-negotiable on durable FSMs. Once an FSM survives restarts, its transition function must be a pure function of (state, event, context) — no clocks, no randomness, no network reads inside the transition. Move I/O outside or behind activity boundaries.
Idempotency is the price of distribution. Two processes that can advance the same entity force every event to carry an idempotency key and every transition to check a dedup store. Exactly-once delivery is a marketing term; at-least-once + idempotent receiver is the real implementation.
The graph is documentation. A renderable diagram generated from the transition table — not redrawn by hand — is the single most useful artifact for onboarding, design review, and incident analysis. If no current diagram exists, the FSM has already drifted.
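
Generating the diagram from the table can be as small as the sketch below, which emits Graphviz DOT. Transition and DOT are hypothetical names, and state and event names are assumed to be plain identifiers that need no quoting or escaping.

```go
package main

import (
    "sort"
    "strings"
)

// Transition is one row of a hypothetical transition table.
type Transition struct {
    From, Event, To string
}

// DOT renders the table as a Graphviz digraph. Rows are sorted so that the
// output is stable and diffs in review show only real graph changes.
func DOT(name string, rows []Transition) string {
    sorted := append([]Transition(nil), rows...)
    sort.Slice(sorted, func(i, j int) bool {
        a, b := sorted[i], sorted[j]
        if a.From != b.From {
            return a.From < b.From
        }
        return a.Event < b.Event
    })
    var sb strings.Builder
    sb.WriteString("digraph " + name + " {\n")
    for _, r := range sorted {
        sb.WriteString("  " + r.From + " -> " + r.To + " [label=\"" + r.Event + "\"];\n")
    }
    sb.WriteString("}\n")
    return sb.String()
}
```

Writing the output to a file in CI and failing the build when it differs from the committed copy keeps the diagram from drifting away from the code.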

Get those right and the State pattern becomes invisible: the code reads as a description of the legal lives an entity can have; the runtime tells you which life each entity is currently living; the audit log tells you the history that brought every entity there.