
Break-glass & Bypass — Senior Level

The middle page showed you what a break-glass procedure looks like and when to pull it. This page is about the architecture underneath it: how time-boxed credentials are actually minted, why the audit trail must live somewhere the break-glass identity cannot reach, and the human-systems dynamics by which a once-rare emergency override quietly becomes Tuesday.


Table of Contents

  1. Introduction
  2. Prerequisites
  3. Glossary
  4. Core Concept 1 — The Access Spectrum: Standing vs JIT vs Break-glass
  5. Core Concept 2 — Time-Boxed Credentials: The STS Pattern
  6. Core Concept 3 — Dual Control, Sealed Secrets, and "the Policy Is the Approver"
  7. Core Concept 4 — The Integrity-of-Audit Problem
  8. Core Concept 5 — Fail-Safe vs Fail-Secure Under Degradation
  9. Core Concept 6 — Normalization of Deviance
  10. Core Concept 7 — The Feedback Loop to Gate Design
  11. Core Concept 8 — Incident-Response Integration
  12. Core Concept 9 — Compliance-Grade Break-glass
  13. Real-World Examples
  14. Mental Models
  15. Common Mistakes
  16. Test Yourself
  17. Cheat Sheet
  18. Summary
  19. Further Reading
  20. Related Topics

Introduction

Focus: The architecture and human-factors of the controlled exception — how a senior engineer designs an override that is fast enough to use in a crisis, safe enough that misuse is bounded, and honest enough that every use is provable after the fact.

By the middle level you can describe a break-glass procedure: a documented, audited, last-resort way to bypass a gate when the gate itself is blocking incident recovery. You know it must be logged and reviewed. That makes you safe to operate one someone else built.

The senior jump is that you now design the mechanism, and the design problem is genuinely hard because it sits at the collision of three constraints that pull in opposite directions. It must be fast — a break-glass that takes twenty minutes of approvals is useless during a revenue-burning SEV1, and people will route around it. It must be bounded — the grant has to be the minimum privilege, for the minimum time, with a blast radius you can reason about. And it must be tamper-evident — because the same elevated access that lets a responder fix production is the access that could let a malicious or panicking actor erase the record of what they did.

Get any one of those wrong and the failure mode is specific. Too slow and you get shadow bypasses — people sharing root credentials in a Slack DM because the sanctioned path was painful. Too broad and a single break-glass becomes a standing skeleton key. Not tamper-evident and your audit log is worthless precisely in the case it exists for. And underneath all of it runs a quieter failure that no IAM policy can stop: normalization of deviance — the slow process by which a rare, alarming override becomes routine, the alarm stops firing in people's heads, and the boundary of "acceptable risk" silently migrates. This page is the architecture and the dynamics, together, because they are the same problem.


Prerequisites

  • Required: You've internalized middle.md — what a break-glass procedure is, why it must be rare, logged, and reviewed, and the basic "documented last resort" framing.
  • Required: A working model of cloud IAM — roles vs users, assume-role, temporary credentials, and the principle of least privilege.
  • Required: You've worked an incident and felt the tension between "follow the process" and "the site is down right now."
  • Helpful: Familiarity with how an audit log is collected and stored (centralized logging, immutability, retention) and what an auditor asks for.
  • Helpful: Exposure to a postmortem culture — blameless review, the difference between human error and system design.

Glossary

| Term | Meaning |
| --- | --- |
| Standing access | Privilege that is always present on an identity: always-on, used or not. The default and the worst baseline. |
| JIT access | Just-In-Time access: privilege granted on request, approved, time-boxed, then auto-revoked. The steady-state pattern for elevated work. |
| Break-glass | A pre-authorized, self-service emergency grant of high privilege, used when the normal approver is unavailable or time is critical. The price is heavy, mandatory logging and post-hoc review. |
| Time-boxed credential | A credential valid only for a short, fixed window (e.g., an STS session of 1 hour), after which it is useless without re-issue. |
| Dual control / two-person rule | A control requiring two distinct people to jointly authorize an action, so no single actor can act alone. |
| Sealed secret | A high-value credential split or sealed (e.g., via Shamir secret sharing) so retrieval requires reconstruction and necessarily emits an alert. |
| WORM | Write-Once-Read-Many storage: data can be written but not altered or deleted before its retention expires. |
| Tamper-evident | A log or store where any alteration is detectable (and ideally impossible), even when attempted by the very identity whose actions it records. |
| Fail-open / fail-closed | Whether a control, when its enforcement system is unavailable, defaults to allowing (open) or blocking (closed) the action. |
| Normalization of deviance (Vaughan) | The social process by which repeated successful rule-breaking redefines the deviant act as normal and acceptable. |
| Bypass rate | The frequency of break-glass / override use per gate over time; the leading indicator of a slipping boundary. |
| Emergency change | A change made outside the normal change-management process to resolve or prevent an incident; a regulated category in ITIL/SOX/SOC2. |

Core Concept 1 — The Access Spectrum: Standing vs JIT vs Break-glass

Break-glass is not a standalone idea; it is the far end of an access spectrum, and choosing it correctly means understanding what sits before it. There are three points on the line, ordered by how much trust they place in the moment of use versus in advance.

Standing access is privilege that simply exists on an identity, all the time. The admin group your on-call engineer is permanently in. It is the cheapest to operate and the worst security posture: the credential is a target 24/7, the blast radius of a compromised account is the full grant, and there is no signal — no request, no approval, no clock — to tell anyone the privilege is being used versus merely held. Standing access should be reserved for the genuinely unavoidable (the automation that must run continuously) and minimized everywhere else.

Just-In-Time (JIT) access replaces "always have it" with "get it when you need it, lose it when you're done." The lifecycle is explicit:

request  →  approve  →  time-boxed grant  →  auto-revoke
   |           |              |                    |
  who/why   second person   1h STS session     credential
  + ticket   approves       least-privilege    expires; role
                            role               membership removed

JIT is the steady-state pattern for almost all elevated work. The key word is time-boxed: the grant expires on its own, so the failure mode of "forgot to revoke" is designed out. Tools like AWS IAM Identity Center, Microsoft Entra Privileged Identity Management (PIM, formerly Azure AD PIM), GCP's IAM Conditions + privileged access requests, and third-party brokers (Teleport, StrongDM, sudo session brokers) can all implement this loop. Crucially, JIT has a human in the approval path: someone other than the requester says yes before the grant is minted.

Break-glass is what you reach for when that human-in-the-loop is the problem. It is a pre-authorized, self-service emergency grant: the approval was given in advance, in policy, so that at 3 a.m. during a SEV1 — when the approver is asleep, unreachable, or the latency of finding one would itself extend the outage — the responder can grant themselves the access without a synchronous approval. The trade is brutal and deliberate: in exchange for skipping the live approval, break-glass demands heavy, unavoidable logging, immediate alerting (everyone knows it was pulled, instantly), and a mandatory post-hoc review of every single use.

Key insight: JIT and break-glass differ in when the authorization happens, not whether it happens. JIT authorizes at the moment of use (a live approver). Break-glass authorizes in advance, by policy, and substitutes total transparency and after-the-fact review for the missing live approver. You use break-glass precisely and only when waiting for a live approver is worse than the documented risk of self-service. If your "break-glass" gets used when an approver was available, it isn't break-glass — it's just JIT with the safety removed.

The spectrum is a decision tree: prefer JIT; fall back to break-glass only when the live-approval latency is itself unacceptable; never default to standing access for elevated privilege.
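
The decision tree can be written down as a small function. The following is a minimal Python sketch rather than a reference implementation; `AccessPath`, `choose_access_path`, and the two latency parameters are hypothetical names introduced only for illustration.

```python
from enum import Enum


class AccessPath(Enum):
    NONE = "no elevation needed"
    JIT = "just-in-time grant with a live approver"
    BREAK_GLASS = "pre-authorized self-service grant, reviewed after the fact"


def choose_access_path(
    needs_elevation: bool,
    approver_reachable: bool,
    expected_approval_wait_s: float,
    max_tolerable_wait_s: float,
) -> AccessPath:
    """Prefer JIT; fall back to break-glass only when waiting is itself unacceptable."""
    if not needs_elevation:
        return AccessPath.NONE
    if approver_reachable and expected_approval_wait_s <= max_tolerable_wait_s:
        return AccessPath.JIT
    # No live approver in time: the approval given in advance, by policy, applies.
    return AccessPath.BREAK_GLASS
```

Note that standing access is not a possible return value: under this sketch, elevated privilege is always either requested or broken out, never simply held.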


Core Concept 2 — Time-Boxed Credentials: The STS Pattern

The mechanical heart of any modern break-glass is the short-lived credential. The credential you mint must (a) be high enough privilege to fix the emergency, (b) expire on its own so a forgotten cleanup can't leave a skeleton key behind, and (c) be issued only under conditions that prove this is a real, attributable emergency.

On AWS the canonical implementation is STS AssumeRole against a dedicated break-glass role. The architecture has three parts that matter:

  1. A dedicated role, used for nothing else, so every credential it ever mints is unambiguously a break-glass event in the logs.
  2. A short session duration (MaxSessionDuration), so the grant is time-boxed by construction.
  3. A trust/condition policy that refuses to mint the credential unless real-world preconditions hold: MFA present, and (ideally) an active incident referenced.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AssumeBreakGlassOnlyWithMfaAndIncident",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "Bool":        { "aws:MultiFactorAuthPresent": "true" },
        "NumericLessThan": { "aws:MultiFactorAuthAge": "300" },
        "StringLike":  { "sts:RoleSessionName": "breakglass-INC-*" }
      }
    }
  ]
}

# A responder breaks glass: the session name MUST carry the incident id (attribution),
# the session is short-lived, and MFA was used within the last 5 minutes.
aws sts assume-role \
  --role-arn      arn:aws:iam::444455556666:role/BreakGlassAdmin \
  --role-session-name breakglass-INC-4821-byashin \
  --duration-seconds 3600 \
  --serial-number arn:aws:iam::111122223333:mfa/byashin \
  --token-code    492817

The role itself (its permission policy) should still be scoped to the emergency domain, not literally *:* "god mode," unless the failure class genuinely requires it. A break-glass that grants account-wide admin when the recurring emergency is "the deploy pipeline is wedged" is over-privileged; mint a role that can restart the pipeline and roll back, and nothing else.
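
To make the three conditions in the trust policy above concrete, here is a rough Python sketch of the same checks. It is not how AWS evaluates policies internally; `may_assume_break_glass` is a hypothetical helper, and `fnmatchcase` is used only as an approximation of `StringLike` wildcard matching.

```python
from fnmatch import fnmatchcase
from typing import Optional

MAX_MFA_AGE_S = 300                      # mirrors aws:MultiFactorAuthAge < 300
SESSION_NAME_PATTERN = "breakglass-INC-*"  # mirrors the sts:RoleSessionName pattern


def may_assume_break_glass(
    session_name: str, mfa_present: bool, mfa_age_s: Optional[int]
) -> bool:
    """Return True only if every precondition of the trust policy holds."""
    if not mfa_present or mfa_age_s is None:
        return False  # no MFA, no credential
    if not mfa_age_s < MAX_MFA_AGE_S:
        return False  # MFA is stale: re-authenticate first
    # The session name must carry an incident reference for attribution.
    return fnmatchcase(session_name, SESSION_NAME_PATTERN)
```

All three conditions are conjunctive, as in the policy: a fresh MFA with an ad-hoc session name is refused just as firmly as a correct session name without MFA.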

The same pattern exists on every major platform:

| Platform | Time-boxed mechanism | Conditions you attach |
| --- | --- | --- |
| AWS | sts:AssumeRole with MaxSessionDuration (≤ 12h, set short) | MFA present + age, session-name = incident id, source IP/VPCE |
| Azure | PIM "eligible" role + activation with expiry | Justification text, ticket number, MFA, approver (or none for break-glass), max activation duration |
| GCP | IAM Condition with request.time expiry + privileged access grant | expiry timestamp, justification, conditional bindings |
| Kubernetes | Short-lived RoleBinding created by a broker, TTL-reaped | Tied to an incident object / annotation, reconciled away after TTL |

Key insight: Time-boxing is the single most important property, because it converts "remember to revoke" — a human task that will eventually be skipped — into "expires automatically" — a property of the system that cannot be skipped. The most common chronic failure of break-glass is not abuse; it is a grant that quietly never expired and became standing access nobody is tracking. A credential with a 1-hour TTL cannot become that.
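
The difference between "remember to revoke" and "expires automatically" fits in a few lines. The sketch below is illustrative Python with a hypothetical `Grant` type; real platforms enforce expiry inside the credential service, not in application code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    principal: str
    role: str
    issued_at: datetime
    ttl: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + self.ttl

    def is_valid(self, now: datetime) -> bool:
        # Validity is a pure function of time: there is no revoke step to forget.
        return self.issued_at <= now < self.expires_at


issued = datetime(2026, 6, 22, 3, 14, 9, tzinfo=timezone.utc)
grant = Grant("byashin", "BreakGlassAdmin", issued, timedelta(hours=1))
```

Because the object is frozen and validity depends only on the clock, nothing a responder forgets to do can extend the grant past its window.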


Core Concept 3 — Dual Control, Sealed Secrets, and "the Policy Is the Approver"

The single-actor self-service grant in Concept 2 is appropriate for most break-glass — speed matters and the logging is the control. But the highest-privilege glass (the root account, the KMS key-deletion path, the production database master credential) warrants more friction, because the blast radius of misuse is catastrophic. Three patterns add it.

Dual control / the two-person rule. For the highest tier, require two distinct humans to jointly break the glass. This deliberately reintroduces a human approver even in the emergency path — the reasoning being that for a small set of nuclear actions, even a SEV1 should not let one person act alone. The implementation is usually a credential or signing operation that requires two approvals (e.g., an SSM/automation runbook that needs two named approvers, or an AWS root credential held under M-of-N control). The cost is latency, so reserve dual control for the genuinely catastrophic set, not the everyday glass — over-applying it is how you train people to route around the system.

Sealed-secret break-glass. Sometimes the credential that must be protected can't be a clean IAM role — it's the AWS root user, or a bootstrap secret. Here the pattern is to split the secret so no one person holds it, and make retrieval itself the alarm. Shamir Secret Sharing (the mechanism behind Vault's unseal keys) splits a secret into N shares of which any M reconstruct it; you distribute shares to different people, and reconstruction requires a quorum to deliberately come together. The secret is sealed in a vault, and any retrieval fires a high-severity alert to everyone out-of-band — so the act of breaking the glass is loud and collective by construction.

                 Shamir split (M-of-N), e.g. 2-of-3
   root password ─┬─► share 1  →  Security lead     (vault A)
                  ├─► share 2  →  Eng director       (vault B)
                  └─► share 3  →  Offline escrow     (safe)
   reconstruction (any 2 shares) ──► triggers PagerDuty SEV1 alert
                                 ──► writes immutable retrieval record
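
For intuition, the M-of-N split in the diagram can be sketched in Python as a toy Shamir implementation over a prime field. This is a teaching sketch under stated assumptions, not something to protect a real root credential with (use a vetted implementation for that); the function names, the field modulus, and the `alert` callback are all illustrative choices.

```python
import secrets

# Field modulus: 2**127 - 1 is a Mersenne prime, large enough for a short secret.
PRIME = 2**127 - 1


def split_secret(secret: int, threshold: int, num_shares: int) -> list:
    """Split `secret` into `num_shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must fit in the field")
    if not 1 <= threshold <= num_shares:
        raise ValueError("need 1 <= threshold <= num_shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares: list, alert) -> int:
    """Lagrange-interpolate the polynomial at x=0; the alert fires before anything else."""
    alert(f"break-glass secret reconstructed from shares {[x for x, _ in shares]}")
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


alerts = []
root_secret = int.from_bytes(b"example-root", "big")
shares = split_secret(root_secret, threshold=2, num_shares=3)
recovered = reconstruct([shares[0], shares[2]], alerts.append)
```

Placing the alert call at the top of `reconstruct` is the point of the design: in this sketch there is no code path that yields the secret without first emitting the signal.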

"The policy is the approver." For automated emergency paths — where a human can't be the approver because the response must be machine-speed — the approval is encoded as policy and evaluated by a trusted system. An auto-remediation that can break a gate (e.g., automatically roll back a deploy when error rate crosses a threshold) is break-glass with the policy as the pre-authorized approver: the conditions under which the machine may bypass are written down, version-controlled, reviewed like code, and every firing is logged and reviewed exactly as a human break-glass would be. The policy is the approver, the code review is the dual control, and the audit applies identically.

Key insight: The amount of friction in a break-glass should scale with the blast radius of the privilege it grants, not be uniform. One-person self-service for the everyday glass (speed, with logging as the control); two-person or sealed-secret for the nuclear set (the rare actions where even an emergency shouldn't let one actor act alone). Uniform friction is a design smell: too little on the dangerous paths, too much on the common ones — and too much is what trains people to invent shadow bypasses.
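
As an illustration of "the policy is the approver", an automated rollback bypass might be encoded roughly as follows. This is a hedged Python sketch: `RollbackPolicy`, its thresholds, and the audit-entry shape are assumptions made for the example, not values from any real system.

```python
from dataclasses import dataclass, field


@dataclass
class RollbackPolicy:
    """Version-controlled conditions under which automation may bypass the gate."""

    max_error_rate: float = 0.05     # assumed threshold, for illustration only
    min_sample_requests: int = 1000  # avoid firing on statistically thin data
    audit_log: list = field(default_factory=list)

    def approves_bypass(self, error_rate: float, requests: int, deploy_id: str) -> bool:
        approved = (
            requests >= self.min_sample_requests
            and error_rate > self.max_error_rate
        )
        if approved:
            # Every firing is recorded, exactly as a human break-glass would be.
            self.audit_log.append(
                {
                    "deploy": deploy_id,
                    "error_rate": error_rate,
                    "requests": requests,
                    "approver": "policy:RollbackPolicy",
                }
            )
        return approved


policy = RollbackPolicy()
fired = policy.approves_bypass(error_rate=0.12, requests=5000, deploy_id="d-9f3a2")
```

Changing either threshold is a code change, so it goes through review like any other change; that review is where the second pair of eyes comes from.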


Core Concept 4 — The Integrity-of-Audit Problem

Here is the problem that separates a serious break-glass design from a checkbox one, and it is the one most often missed: a break-glass that grants admin may grant the power to alter the very logs that record the break-glass.

Walk the attack. A break-glass grants account-wide admin. The audit trail — who broke glass, what they did — is written to a log store in the same account. An actor (malicious, or merely panicking and covering a mistake) breaks glass, does something they shouldn't, and then, with the same admin grant, deletes the CloudTrail trail, empties the S3 bucket, or rewrites the log objects. The record of the bypass is destroyed by the bypass. Your audit trail is now worthless in exactly the scenario it exists to cover. This is not hypothetical; "disable logging" is a standard early step in real cloud intrusions.

The defense has a single principle with several mechanisms: the audit trail must be tamper-evident and out-of-band — written somewhere the break-glass identity cannot modify.

  1. Separation of the log destination. Ship audit logs to a separate account (or separate trust domain) whose access is not covered by the break-glass grant. On AWS this is the organization CloudTrail → a dedicated, locked Logging account pattern: the trail in the workload account writes to an S3 bucket owned by a different account, and the break-glass admin role in the workload account has no permission there.

  2. Immutability at the store (WORM). Make the log objects physically un-deletable for their retention period. S3 Object Lock in Compliance mode means that no identity, not even the account root user, can delete or overwrite a protected object version before its retention period expires. The log is write-once. Append-only / WORM is the substrate.

  3. Append-only and chained. Where possible, logs are append-only and integrity-chained (each entry hashes the previous), so any deletion or edit in the middle is detectable as a broken chain — tamper-evident even where you can't make it tamper-proof.

  4. Separation of duties for logging. The people who can administer the logging system are a different set from the people who can break glass on the workloads. No identity should be able to both perform a privileged action and control the record of it.

   Workload account (444455556666)              Logging account (777788889999)
   ┌───────────────────────────────┐            ┌──────────────────────────────────┐
   │  BreakGlassAdmin role          │            │  S3 bucket  (Object Lock,          │
   │  (admin HERE only)             │  CloudTrail│             COMPLIANCE mode)       │
   │                                ├───────────►│  append-only · WORM · immutable    │
   │  CANNOT touch the bucket  ✗────┼──────╳     │  break-glass identity has NO access │
   │  in the logging account        │            │  separate admins, separate keys    │
   └───────────────────────────────┘            └──────────────────────────────────┘
            actor breaks glass here                 ...but the record lives HERE,
            and acts                                 where the actor can't reach it

The same architecture generalizes off AWS: ship to a separate SIEM/log sink the operational identity can't administer (Splunk/Chronicle/an append-only audit service), enable immutable retention, and keep the logging admins organizationally distinct.
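
The "append-only and chained" mechanism is easy to demonstrate. Below is a minimal Python sketch of a hash-chained log in which each entry commits to the one before it; the entry layout and function names are illustrative assumptions, not a standard format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(chain: list) -> bool:
    """Recompute every link; an edited or removed middle entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"actor": "byashin", "action": "AssumeRole"})
append_entry(log, {"actor": "byashin", "action": "RollbackDeploy"})
append_entry(log, {"actor": "byashin", "action": "PurgeQueue"})
```

A chain on its own detects edits and deletions in the middle but not truncation of the tail, which is one reason the latest hash still has to be stored somewhere the break-glass identity cannot reach.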

Key insight: An audit log that the audited party can edit is not an audit log — it is a courtesy. The integrity property that makes break-glass trustworthy is not "we log it"; it is "the record lives somewhere the breaker of glass provably cannot reach." If you can break glass and then delete the evidence with the same grant, you have a logging feature, not an audit control. Out-of-band + WORM + separation-of-duties is the non-negotiable core.


Core Concept 5 — Fail-Safe vs Fail-Secure Under Degradation

A subtle senior question: what does the gate itself do when it is broken? Break-glass is the controlled answer, but you have to first decide the gate's failure direction.

When the system that enforces a gate becomes unavailable — the policy service is down, the signing API is unreachable, the scanner can't run — the gate has two possible default behaviors, and the choice is a deliberate risk decision, not an accident:

  • Fail-open (fail-safe-for-availability): if the gate can't evaluate, allow the action. Optimizes for availability; the cost is that a broken enforcement plane silently means no enforcement.
  • Fail-closed (fail-secure): if the gate can't evaluate, block the action. Optimizes for safety; the cost is that an outage in the gate becomes an outage in your ability to ship — including your ability to ship the fix for the outage.

The terms come from physical security and are worth keeping straight: a fail-safe lock unlocks on power loss (safe for people — you can get out of a burning building); a fail-secure lock stays locked on power loss (secure for assets — the vault stays shut). For software gates, the right default depends on what the gate protects. A gate guarding the deployment of a security patch failing closed could trap you in a vulnerable state; a gate guarding access to customer PII failing open could be a breach.

The senior insight is the relationship between these and break-glass: break-glass is the controlled fail-open. Rather than designing the gate to silently fail open (where a degraded enforcement plane invisibly disables your protection — the worst of both worlds, because you don't even know you're unprotected), you design it to fail closed by default, with break-glass as the explicit, logged, attributable path to open it. That way the open state is never silent: it is always a deliberate act, recorded, alerted, and reviewed.

   Gate enforcement degraded / unavailable
        ├── silently fail-open      → unprotected AND unaware  ✗ (worst case)
        ├── silently fail-closed    → safe but you're stuck, may worsen the incident
        └── fail-closed + break-glass → safe by default; opening is explicit,
                                        logged, attributed, reviewed   ✓

Key insight: "Fail open" and "break-glass" are easy to conflate but are opposites in honesty. A silent fail-open removes the control without anyone deciding to; break-glass keeps the control closed and forces a deliberate, recorded human act to open it. Design gates to fail closed, and make break-glass the only door — so that every time the protection is bypassed, someone chose it and the system knows who.
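
A gate that fails closed and opens only through a recorded override can be sketched in a few lines of Python. The names here (`EnforcementUnavailable`, `BreakGlass`, `evaluate_gate`) are hypothetical, and the sketch deliberately covers only the degraded-enforcement case; whether break-glass may also override a functioning gate that said no is a separate policy decision.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class EnforcementUnavailable(Exception):
    """The system behind the gate (policy service, scanner, signer) is unreachable."""


@dataclass(frozen=True)
class BreakGlass:
    actor: str
    incident_id: str
    justification: str


def evaluate_gate(check: Callable[[], bool], override: Optional[BreakGlass], audit: list) -> bool:
    """Fail closed when enforcement is down; open only via a recorded break-glass."""
    try:
        return check()  # normal path: the gate itself decides
    except EnforcementUnavailable:
        if override is None:
            audit.append({"decision": "deny", "reason": "enforcement unavailable"})
            return False  # never a silent open
        audit.append(
            {
                "decision": "allow",
                "reason": "break-glass",
                "actor": override.actor,
                "incident": override.incident_id,
                "justification": override.justification,
            }
        )
        return True


def policy_service_down() -> bool:
    raise EnforcementUnavailable("policy service timed out")


audit_trail: list = []
blocked = evaluate_gate(policy_service_down, None, audit_trail)
opened = evaluate_gate(
    policy_service_down,
    BreakGlass("byashin", "INC-4821", "rollback needed, policy service down"),
    audit_trail,
)
```

Both outcomes leave an audit entry, so the degraded state is visible whether or not anyone chose to open the gate.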


Core Concept 6 — Normalization of Deviance

No IAM policy can save a break-glass from the human dynamic that ultimately kills it. The mechanism is normalization of deviance, named by sociologist Diane Vaughan in her study of the Challenger disaster, and it is the single most important human-factors concept at this level.

The idea: when a rule is broken and nothing bad happens, the rule-breaking is quietly reclassified, from "a deviation we got away with" to "an acceptable practice." Each successful deviation makes the next one feel more normal, and the boundary of "acceptable risk" migrates outward, one unremarkable step at a time. Vaughan called it "the slow, steady slippage of standards." Before Challenger, NASA had repeatedly launched shuttles with O-ring erosion that violated the original spec; each flight that survived made the erosion seem like an expected, tolerable condition rather than the danger sign it was, until the launch where it wasn't tolerable. The Columbia loss seventeen years later was, by the CAIB's own analysis, the same dynamic with foam strikes: a known anomaly, repeatedly survived, reclassified as routine.

The translation to break-glass is exact. The first time someone bypasses the gate, it feels significant — a real exception, discussed, slightly uncomfortable. The site didn't fall over. The tenth time, it's a known move. The fiftieth time, "just break glass" is a documented step in someone's personal runbook, the discomfort is gone, and the gate it bypasses has effectively ceased to exist while still appearing, on paper, to be enforced. The boundary slipped, invisibly, and no one decided to disable the control — they just stopped noticing they were defeating it.

The engineering countermeasures all attack the invisibility and frictionlessness that let deviance normalize. Make each bypass:

  • Visible — every break-glass fires a loud, public, out-of-band signal (a channel post, a page, an entry on a dashboard the whole team sees). Deviance normalizes in the dark; sunlight is the antidote.
  • Effortful — the bypass requires a deliberate, slightly inconvenient act (justification text, an incident reference, MFA). Not so painful that people avoid it (that creates shadow bypasses — Concept 3), but never frictionless. A frictionless override is one nobody notices using.
  • Reviewed — every use gets a mandatory post-hoc review (Concept 8). The certainty of being asked "why did you break glass?" keeps the act conscious.

And the leading indicator: bypass-rate-per-gate over time. A break-glass used twice a year is a healthy emergency valve. The same break-glass used twice a week is not a sign of frequent emergencies — it is the instrument reading of a boundary that has slipped. Rising bypass rate is the canary; it tells you the deviance is normalizing before the incident that proves it.

This leads to the governing principle: fix the gate, not the human. When a gate is bypassed repeatedly, the senior response is not "remind people not to bypass it" — that fights a structural force with a memo and loses. The repeated bypass is data that the gate is wrong: too slow, too flaky, too strict for reality, or guarding the wrong thing. Chronic break-glass is a defect in the gate, and you treat it as a bug to be fixed, not a discipline problem to be scolded.

Key insight: Normalization of deviance means a break-glass cannot be made safe by a one-time design; it decays under use because success at rule-breaking is self-reinforcing. The only durable defense is to make every single bypass visible, effortful, and reviewed, and to watch the bypass rate as the leading indicator of a slipping boundary. The day "break glass" stops feeling like an event is the day your gate quietly died — and the bypass-rate chart will have shown it to you weeks earlier.
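
Bypass-rate-per-gate is simple to compute from break-glass events. The following Python sketch compares the trailing window with the one before it; the 90-day window, the doubling factor, the gate names, and the event dates are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def bypass_counts(events, now, window=timedelta(days=90)):
    """Count break-glass uses per gate inside the trailing window (cutoff, now]."""
    cutoff = now - window
    return Counter(gate for gate, ts in events if cutoff < ts <= now)


def rising_gates(events, now, window=timedelta(days=90), factor=2.0):
    """Gates whose bypass count at least doubled versus the previous window."""
    current = bypass_counts(events, now, window)
    previous = bypass_counts(events, now - window, window)
    return sorted(g for g, n in current.items() if n >= factor * max(previous[g], 1))


now = datetime(2026, 6, 22)
events = [
    ("deploy-approval", datetime(2026, 2, 10)),  # previous window
    ("deploy-approval", datetime(2026, 4, 2)),
    ("deploy-approval", datetime(2026, 5, 11)),
    ("deploy-approval", datetime(2026, 6, 1)),
    ("deploy-approval", datetime(2026, 6, 22)),
    ("image-scan", datetime(2026, 1, 15)),       # previous window
    ("image-scan", datetime(2026, 5, 5)),
]
```

With this data, `deploy-approval` goes from 1 to 4 uses and is flagged, while `image-scan` stays flat at 1 and is not: a per-gate signal rather than an org-wide one.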


Core Concept 7 — The Feedback Loop to Gate Design

Break-glass telemetry is one of the richest signals you have about whether your gates are right, and wiring it back into gate design (the subject of 05 — Gate Design: Speed vs Safety) is what closes the loop between exception and rule.

The loop:

   break-glass / bypass events ──► telemetry (rate, gate, reason, who, when)
                                        │
                                        ▼
                                  periodic gate review
                                        │
                 ┌──────────────────────┼──────────────────────┐
                 ▼                       ▼                       ▼
        the gate is wrong         the root cause is real     it was a bad month
        → fix / relax / redesign  → fix the underlying       → note it, no change,
          the gate                  flakiness/slowness         keep watching

The hard part — and the senior judgment — is distinguishing "the gate is wrong" from "we had an unusually bad month." A spike in bypasses during a single multi-day major incident is not evidence the gate is broken; it's evidence you had a bad incident. The data that separates the two:

  • Trend, not level. A sustained rise in bypass rate across many incidents and many people indicts the gate. A one-time spike correlated with a single root cause does not.
  • Distribution across people. If one person bypasses constantly and no one else does, that's a coaching or access-design issue for that path. If everyone bypasses, the gate is wrong for the team.
  • The stated reasons cluster. When the review records (Concept 8) keep citing the same reason — "the integration test suite is flaky," "the gate takes 40 minutes and the rollback can't wait" — the cluster is the root cause, and it's actionable.
  • Same gate, repeatedly. A specific gate with a high and rising bypass rate while others stay near zero is a gate-level defect, not an org-wide culture problem.

When the data says the gate is wrong, the response is concrete: relax the threshold, make the check faster, fix the flaky test that forces the bypass, split a slow blocking gate into a fast blocking part and a slower async part, or add an automated fast-path so the case that needs break-glass becomes a normal supported flow. When the data says the root cause is real (the bypasses are masking a recurring production problem), you fix that. When it's a bad month, you record it and keep watching — resisting the urge to either over-correct or rationalize.

Key insight: Every break-glass is a small, expensive experiment your organization just ran on its own gate. The telemetry from those experiments is the highest-signal input to gate design you will ever get — it's real engineers, under real pressure, voting with their actions on whether the gate's speed/safety trade is correct. A break-glass mechanism with no feedback loop into gate review is wasting its most valuable output.
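
The distinguishing signals can be turned into a first-pass triage of review records. This Python sketch is deliberately crude, and the record schema, thresholds, and output labels are all assumptions; it is a starting point for the periodic gate review, not a substitute for the judgment described above.

```python
from collections import Counter


def classify_bypasses(records, min_people=3, cluster_share=0.5):
    """records: dicts with 'actor', 'incident', 'reason' keys (hypothetical schema)."""
    if not records:
        return "no data"
    people = {r["actor"] for r in records}
    incidents = {r["incident"] for r in records}
    top_reason, top_count = Counter(r["reason"] for r in records).most_common(1)[0]
    if len(incidents) == 1:
        # A spike tied to one incident is a bad incident, not a bad gate.
        return "single incident: note it and keep watching"
    if len(people) == 1:
        return "one person: coaching or access-design issue"
    if len(people) >= min_people and top_count / len(records) >= cluster_share:
        # Many people, many incidents, one recurring reason: the gate is wrong.
        return f"gate defect: recurring reason '{top_reason}'"
    return "inconclusive: keep watching"


team_wide = [
    {"actor": "a", "incident": "INC-1", "reason": "flaky integration tests"},
    {"actor": "b", "incident": "INC-2", "reason": "flaky integration tests"},
    {"actor": "c", "incident": "INC-3", "reason": "gate too slow"},
    {"actor": "a", "incident": "INC-4", "reason": "flaky integration tests"},
]
verdict = classify_bypasses(team_wide)
```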


Core Concept 8 — Incident-Response Integration

A break-glass that exists only as an IAM role and a wiki page will be improvised under pressure, and improvisation during a SEV1 is how unsafe bypasses get invented in the heat of the moment. The senior practice is to integrate break-glass into incident response so the sanctioned fast path is the obvious one when the pressure is highest.

Pre-planned emergency-change procedures in the runbook. The incident runbook should contain, explicitly, the answer to "during a SEV1, here is the sanctioned way to move fast" — the exact break-glass to pull, the command, the role, the conditions, and the bound on what it grants. The goal is that a stressed responder at 3 a.m. follows a prepared fast path rather than inventing one. People under acute stress do not design good controls; they execute the procedure in front of them. Put a safe procedure in front of them.

This is also why the break-glass must genuinely be fast. If the sanctioned path is slow or fiddly, a panicking responder will find a faster, unsanctioned one: sharing a credential, using a personal admin token, disabling the gate at the source. The existence of a fast, sanctioned path is itself a safety control: it removes the incentive to improvise. (This is the speed side of the three-way speed/safety/honesty tension from the Introduction.)

Mandatory post-incident review of every break-glass use. This is the keystone of the whole design and the enforcement mechanism for everything in Concept 6. Every single break-glass use gets reviewed afterward — not optionally, not "if something looked wrong," but always. The review is:

  • Blameless — the question is never "who screwed up?" but "what made breaking glass necessary, and was the action appropriate?" Blame drives the behavior underground and destroys the telemetry; psychological safety is what keeps people using the sanctioned path and reporting honestly.
  • Mandatory — the certainty of review is what keeps each bypass a conscious act and starves normalization of deviance. If review is optional, it stops happening, and the deviance normalizes.

A concrete review record makes the evidence automatic and feeds Concept 7:

# break-glass review record — INC-4821
incident:        INC-4821
severity:        SEV1 (checkout 5xx spike, ~$/min revenue impact)
broke_glass:     byashin
role_assumed:    BreakGlassAdmin (acct 444455556666)
session:         breakglass-INC-4821-byashin  (STS, 3600s, MFA age 41s)
window:          2026-06-22T03:14:09Z → 03:41:50Z  (last action; 3600s session expires 04:14:09Z)
why_glass:       on-call approver paged but unresponsive 11 min; revenue burning
actions_taken:   - rolled back deploy d-9f3a2 to d-7c118 via pipeline
                 - cleared stuck SQS DLQ (412 msgs) blocking order workers
gate_bypassed:   deploy-approval gate (05) — required 2 sign-offs, none available
appropriate?:    YES — action was minimal, scoped, and resolved the incident
audit_verified:  YES — CloudTrail in logging acct 777788889999 (Object Lock) intact
follow_ups:
  - JIRA OPS-2210: add 'auto-approve rollback during active SEV1' fast-path to gate 05
  - JIRA OPS-2211: fix DLQ-drain runbook so this needs no admin next time
reviewer:        on-call manager + security  (blameless review, 2026-06-22)
bypass_rate:     deploy-approval gate now 4 break-glass / 90d (was 1 / 90d) ← watch

Note two fields in particular: follow_ups feeds the gate-design loop (Concept 7), and bypass_rate surfaces the leading indicator (Concept 6) right in the record.
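One way to make "mandatory" a system property rather than a habit is to refuse to close a review until the record is complete. The sketch below is a minimal illustration under assumptions: the field names mirror the INC-4821 record above, `reviewer` is assumed to be a list of names, and `can_close_review` is a hypothetical helper, not part of any real tool.

```python
# Hypothetical field names, mirroring the INC-4821 record above.
REQUIRED_FIELDS = (
    "incident", "broke_glass", "role_assumed", "session", "window",
    "why_glass", "actions_taken", "gate_bypassed", "appropriate",
    "audit_verified", "follow_ups", "reviewer",
)


def _is_blank(value) -> bool:
    # False is a legitimate answer (e.g. appropriate: NO), so emptiness
    # is tested explicitly instead of relying on truthiness.
    return value is None or value == "" or value == [] or value == {}


def missing_fields(record: dict) -> list[str]:
    """Return required fields that are absent or empty, in declaration order."""
    return [name for name in REQUIRED_FIELDS if _is_blank(record.get(name))]


def can_close_review(record: dict) -> bool:
    """A review may close only when it is complete and not self-reviewed."""
    if missing_fields(record):
        return False
    # Assumption: reviewer is a list of names; the actor must not be in it.
    return record["broke_glass"] not in record["reviewer"]
```

Wired into whatever stores the records, a check like this keeps an incomplete or self-reviewed record open, which is what makes the review hard to skip quietly.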

Key insight: The mandatory, blameless post-incident review is the load-bearing control of the entire system. It is what converts break-glass from "a hole in your security" into "a managed, observable, self-correcting exception process." Skip it and you keep the risk of break-glass while throwing away every benefit — the learning, the telemetry, and the conscious-act discipline that holds back normalization of deviance.


Core Concept 9 — Compliance-Grade Break-glass

Auditors do not object to break-glass; they expect it. SOC 2, SOX (for financially relevant systems), PCI-DSS, ISO 27001, and NIST SP 800-53 all contemplate emergency access and emergency change; they just require that it be controlled. Knowing exactly what they ask for lets you design the evidence to be automatic rather than reconstructed in a panic before the audit.

What every one of these frameworks wants from emergency access reduces to four things:

  1. A pre-defined, documented procedure. The break-glass process exists in writing before it's used — who may use it, under what conditions, what it grants. (This is your runbook procedure from Concept 8.)
  2. Restricted access. Break-glass is limited to a defined, minimal set of authorized people, with least privilege on the grant itself. (The dedicated, scoped role from Concept 2.)
  3. Logging. Every use is recorded — who, what, when, why — in a trustworthy, tamper-evident store. (The out-of-band WORM audit from Concept 4.)
  4. Post-hoc review / approval. Every emergency use is reviewed (and, in stricter regimes, retroactively approved) after the fact. (The mandatory review from Concept 8.)

Mapping the specifics:

| Framework | Relevant control area | What it expects for emergency access |
|---|---|---|
| SOC 2 | CC6 (logical access), CC8 (change mgmt) | Emergency access restricted, logged, reviewed; emergency changes follow a defined process and are reviewed after the fact |
| SOX | ITGC: change & access controls | Emergency changes documented, approved (often retroactively), segregation of duties between who acts and who approves/reviews |
| PCI-DSS | Req. 7 (least privilege), Req. 10 (logging) | Access by business need; all privileged actions logged to a tamper-evident store with defined retention; logs reviewed |
| ISO 27001 | A.5/A.8 (access control, logging) | Privileged access managed and time-bound; event logging protected from tampering; review of privileged activity |
| NIST SP 800-53 | AC-6 (least privilege), AU-9 (protection of audit info), CM-3 (change control) | Emergency access procedures; audit information protected from the very users it records; emergency changes documented and reviewed |

The design move that makes audits painless: make the evidence a byproduct of the mechanism, not a manual artifact. If breaking glass automatically (a) uses a dedicated scoped role, (b) writes an immutable CloudTrail record to a locked logging account, and (c) opens a review-record template that must be completed, then your SOC 2 evidence for "emergency access is controlled, logged, and reviewed" is generated by using the system correctly. The auditor asks for the last year of break-glass events and reviews; you query the logging account and the review-record store. Nothing was reconstructed; nothing was backfilled. Compliance becomes a read against a system that was honest by construction. NIST AU-9 (Protection of Audit Information), which requires protecting audit information from unauthorized access, modification, and deletion, is notably the formal version of Concept 4: the standards themselves encode the integrity-of-audit principle.
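To make "the audit is a query" concrete, here is a minimal sketch that joins break-glass sessions found in CloudTrail records against the review-record store and lists any use that still lacks a completed review. It assumes records have already been loaded from the logging account as dictionaries; the `reviews` shape (`session`, `completed`) and both function names are hypothetical. It also assumes the role has no IAM path, so its ARN ends in `:role/BreakGlassAdmin`.

```python
def break_glass_sessions(cloudtrail_records, role_name="BreakGlassAdmin"):
    """Collect session names from AssumeRole events targeting the break-glass role."""
    sessions = set()
    for rec in cloudtrail_records:
        params = rec.get("requestParameters") or {}
        role_arn = params.get("roleArn", "")
        if rec.get("eventName") == "AssumeRole" and role_arn.endswith(f":role/{role_name}"):
            sessions.add(params.get("roleSessionName", "<unnamed>"))
    return sessions


def unreviewed_sessions(cloudtrail_records, reviews):
    """Return break-glass sessions that have no completed review, sorted."""
    reviewed = {r["session"] for r in reviews if r.get("completed")}
    return sorted(break_glass_sessions(cloudtrail_records) - reviewed)
```

An empty result is the evidence an auditor wants: every recorded use has a matching completed review, and nobody had to assemble that by hand.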

Key insight: Compliance-grade break-glass is not a heavier process bolted onto the engineering one — it is the same well-designed mechanism (scoped role, out-of-band immutable log, mandatory review) with the evidence falling out automatically. If you have to prepare for the break-glass portion of an audit by gathering and reconstructing records, your mechanism is under-instrumented. The well-built version makes the audit a query, because every required control is enforced by the system rather than asserted by a human.


Real-World Examples

1. AWS root account break-glass (the canonical sealed-secret case). The AWS account root user can do things no IAM role can (close the account, change the support plan, certain billing actions), so it can't be replaced by JIT roles. The standard pattern: set a long random root password and a root MFA, seal both — password in a vault under M-of-N control, the MFA device physically secured — remove all root access keys, and turn on a CloudTrail alarm that pages security the instant root is used at all. Root becomes pure break-glass: retrievable only by a deliberate, quorum act, used a handful of times a year (e.g., to recover a locked-out org), and loud every single time. This is Concepts 3 (sealed secret), 4 (the alarm and out-of-band log), and 9 (a documented, restricted, logged, reviewed emergency procedure) in one artifact.

2. The deploy-gate bypass that revealed a flaky test (the deviance/feedback case). A team's pre-deploy gate required the integration suite to pass. The suite was ~6% flaky, so engineers learned to "break glass and deploy" when a known-flaky test failed. For a quarter, nobody minded, because the deploys were fine. The bypass-rate dashboard, though, showed that gate climbing from ~1 break-glass/quarter to several per week (Concept 6's leading indicator). The gate review (Concept 7) read the clustered reasons in the review records (every one cited the same flaky test), diagnosed "the gate is wrong, and the boundary is slipping," and fixed the test, not the engineers. Bypass rate fell back to near zero. This is the "fix the gate, not the human" principle, executed off real telemetry.

3. CloudTrail to a locked logging account surviving an intrusion (the integrity-of-audit case). In real cloud-intrusion patterns, an attacker who gains admin in a workload account routinely tries to disable CloudTrail and wipe the log bucket to cover their tracks. Organizations that ship the org trail to a separate logging account with S3 Object Lock in Compliance mode find that this step fails: the workload-account admin (or break-glass identity) has no permission in the logging account, and even root there cannot delete a locked object before retention expires. The forensic record of the entire intrusion survives because it lived out-of-band and immutable. This is Concept 4 not as theory but as the thing that decides whether your incident has a timeline or a blank space where the attacker erased it.

4. The painful bypass that bred a shadow path (the anti-pattern, lived). A company made its production-access break-glass so heavyweight — a form, a manager's synchronous approval, a security review, during the incident — that responders during a major outage couldn't get access in time. So they did the human thing under pressure: someone with standing admin pasted credentials into the incident Slack channel, and the team used those. The sanctioned control was so safe it was unusable, so the actual behavior was the least safe option imaginable: shared, unlogged, unattributed admin in a chat channel. The lesson (Concepts 3 and 8): a break-glass that's too painful doesn't make people safe — it makes them route around it, and the shadow path is always worse than the one you designed.
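The root-usage alarm in example 1 boils down to a predicate over CloudTrail records. The sketch below is modeled on the widely used root-activity metric filter (identity type is `Root`, no `invokedBy`, event type is not `AwsServiceEvent`); the function name is hypothetical, and how a match reaches a pager (CloudWatch alarm, EventBridge rule, or similar) is left out.

```python
def is_root_activity(record: dict) -> bool:
    """True when a CloudTrail record shows the root user acting directly."""
    identity = record.get("userIdentity") or {}
    return (
        identity.get("type") == "Root"
        # invokedBy is set when an AWS service acts on the account's behalf.
        and "invokedBy" not in identity
        # Service-generated events are not a human using root.
        and record.get("eventType") != "AwsServiceEvent"
    )
```

Because root is sealed and used only a handful of times a year, any `True` here is worth a page; there is no baseline of routine root activity to tune against.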


Mental Models

  • The access spectrum is a dial, not a switch. Standing → JIT → break-glass is ordered by when authorization happens (always / at use / in advance). Start at JIT, not at the leftmost point: prefer JIT, drop to break-glass only when live approval is too slow, and never default to standing for elevated privilege.

  • Time-boxing converts a human task into a system property. "Remember to revoke" is a task humans skip; "expires in one hour" is a guarantee they can't. Every break-glass grant should die on its own. The grant that quietly never expired is the most common real failure.

  • The auditor and the breaker of glass must live in different rooms. If the identity that can break glass can also reach the log of the break-glass, the log is a courtesy, not a control. Out-of-band + WORM + separation-of-duties is the whole game for audit integrity.

  • Break-glass is the controlled fail-open. A silent fail-open removes the control with nobody deciding; break-glass keeps the control closed and forces a deliberate, recorded act to open it. Fail closed, and make break-glass the only door.

  • A break-glass decays under use. Normalization of deviance means success at rule-breaking is self-reinforcing — the override that felt like an event becomes Tuesday. Visible + effortful + reviewed is what keeps it an event; the bypass-rate chart is the canary that shows the boundary slipping first.

  • Chronic break-glass is a bug report on the gate. Repeated bypass is not a discipline problem to be scolded; it's data that the gate is wrong. Fix the gate, not the human — and the review records tell you which gate and why.
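The bypass-rate canary can be sketched as a small classifier over per-period break-glass counts for one gate. This is an illustration under assumptions, not a tuned detector: the `factor` and `sustain` defaults are arbitrary, the baseline is the median of older periods, and it looks only at trend, not at the other signals (spread across people, clustered reasons).

```python
from statistics import median


def classify_bypass_series(counts: list[int], factor: float = 2.0, sustain: int = 3) -> str:
    """Classify per-period break-glass counts for a single gate.

    'trend': the latest `sustain` periods all exceed factor x baseline
    'spike': some period exceeds it, but the latest periods do not all do so
    'quiet': no period exceeds it
    """
    if len(counts) <= sustain:
        raise ValueError("need more history than the sustain window")
    # Baseline comes from older periods only; floor of 1 avoids a zero threshold.
    baseline = max(1.0, median(counts[:-sustain]))
    threshold = factor * baseline
    if all(c > threshold for c in counts[-sustain:]):
        return "trend"
    if any(c > threshold for c in counts):
        return "spike"
    return "quiet"
```

For example, `classify_bypass_series([1, 0, 1, 1, 4, 6, 9])` returns `"trend"` (the gate is probably wrong), while `classify_bypass_series([0, 1, 0, 12, 1, 0, 1])` returns `"spike"` (more likely a bad month).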


Common Mistakes

  1. Admin break-glass with self-modifiable logs. Granting admin in the same trust domain that holds the audit trail, so the breaker can delete the evidence with the same grant. The single most dangerous design error. Ship logs out-of-band to a locked account with WORM/Object Lock (Concept 4).

  2. Shared break-glass credentials. A single root password or admin key passed around (or, worse, pasted into a channel during an incident). Destroys attribution — you can prove glass was broken but not by whom. Use per-person assume-role with the session named for the actor and incident.

  3. Break-glass that never expires. A grant with no TTL, or a role membership added "for the incident" and never removed, silently becoming standing access. Time-box everything (short STS sessions / TTL'd bindings) so expiry is automatic, not a remembered chore (Concept 2).

  4. No post-use review. Logging the break-glass but never reviewing it. Keeps all of break-glass's risk while discarding its entire benefit — the learning, the telemetry, and the conscious-act discipline that holds back normalization of deviance. Review must be mandatory and blameless (Concept 8).

  5. A bypass so painful people route around it. Over-frictioned break-glass (synchronous approvals during the incident, multi-step forms) drives responders to shadow paths that are far less safe. Match friction to blast radius; the everyday glass must be genuinely fast (Concepts 3, 8).

  6. Treating chronic bypass as a people problem. Responding to a rising bypass rate with reminders and scolding instead of fixing the gate. You're fighting a structural force (normalization of deviance) with a memo. The repeated bypass is the bug report; fix the gate (Concepts 6, 7).

  7. Silent fail-open instead of break-glass. Designing the gate to quietly allow when its enforcement plane is down — so you're unprotected and unaware. Fail closed and make break-glass the explicit, logged door to open it (Concept 5).

  8. Over-privileged glass. Minting account-wide admin for an emergency that only needs to restart a pipeline. The blast radius of misuse is the grant, not the need. Scope the break-glass role to the failure class it actually addresses (Concept 2).
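Several of these mistakes are mechanical enough to catch with a lint over the break-glass configuration before it ships. The sketch below assumes a hypothetical config dictionary; every key name (`workload_account`, `log_account`, `log_object_lock`, and so on) is invented for illustration and would map onto whatever your infrastructure-as-code actually exposes.

```python
def lint_break_glass(cfg: dict) -> list[str]:
    """Return findings for common break-glass design mistakes (empty list = clean)."""
    findings = []
    if cfg.get("log_account") == cfg.get("workload_account"):
        findings.append("audit log lives in the same account as the grant")
    if not cfg.get("log_object_lock"):
        findings.append("audit log is not immutable (no WORM / Object Lock)")
    if cfg.get("shared_credential"):
        findings.append("shared credential destroys attribution")
    if not cfg.get("max_session_seconds"):
        findings.append("grant has no expiry")
    if "*" in cfg.get("allowed_actions", []):
        findings.append("grant is wildcard admin, not scoped to a failure class")
    if not cfg.get("review_required"):
        findings.append("post-use review is not mandatory")
    if cfg.get("gate_fail_mode") == "open":
        findings.append("gate fails open silently")
    return findings
```

A lint like this cannot judge whether the friction matches the blast radius or whether chronic bypass is being treated as a people problem; those two mistakes still need the review records and a human reading them.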


Test Yourself

  1. Place standing access, JIT, and break-glass on the access spectrum. What single dimension orders them, and when do you correctly choose break-glass over JIT?
  2. Write the three properties a time-boxed break-glass credential must have, and explain why time-boxing specifically prevents the most common real-world break-glass failure.
  3. Describe the integrity-of-audit problem in one sentence, then name the three architectural mechanisms that defend against it.
  4. A colleague proposes designing the deploy gate to "fail open if the policy service is down, so we're never blocked." What's the failure mode, and what's the better design?
  5. Define normalization of deviance and connect it to break-glass. What is the single best leading indicator that it's happening, and what does a rising value tell you?
  6. A specific gate's bypass rate has tripled over two quarters; one other gate spiked once during a single major incident. Which is "the gate is wrong" and which is "a bad month," and what evidence distinguishes them?
  7. Why is the mandatory, blameless post-incident review called the load-bearing control of the whole system? What breaks if it's optional?
Answers

  1. They're ordered by when authorization happens: standing access is authorized always (privilege is just present), JIT is authorized at the moment of use (a live approver says yes, then a time-boxed grant), and break-glass is authorized in advance by policy (no live approver; used when waiting for one is worse than the documented risk). Choose break-glass over JIT precisely when the latency of finding a live approver is itself unacceptable (3 a.m. SEV1, approver unreachable). If an approver was available, you should have used JIT.
  2. (a) High enough privilege to fix the emergency; (b) short, fixed expiry (time-boxed); (c) issued only under proven conditions (MFA + incident reference). Time-boxing prevents the most common real failure, a grant that quietly never gets revoked and becomes untracked standing access, by converting "remember to revoke" (a human task that will eventually be skipped) into "expires automatically" (a system property that can't be).
  3. A break-glass that grants admin may grant the power to alter the very logs that record the break-glass. Defenses: (a) out-of-band log destination: ship to a separate account/trust domain the break-glass identity can't touch (CloudTrail → locked logging account); (b) immutability/WORM: S3 Object Lock Compliance mode so not even root can delete before retention; (c) separation of duties for logging: the people who can break glass are not the people who administer the logs.
  4. Silent fail-open means a degraded enforcement plane invisibly disables the control: you're unprotected and unaware, the worst case. Better design: fail closed by default, with break-glass as the explicit, logged, attributed door to open it, so the open state is never silent; it's always a deliberate, recorded act.
  5. Normalization of deviance (Vaughan) is the process by which repeated successful rule-breaking gets reclassified from "deviation we got away with" to "acceptable practice," migrating the boundary of acceptable risk outward step by step. For break-glass: each unremarkable bypass makes the next feel normal until "break glass" stops feeling like an event and the gate is effectively dead while appearing enforced. Best leading indicator: bypass-rate-per-gate over time. A rising value tells you the boundary is slipping, before the incident that proves it.
  6. The gate whose bypass rate tripled and stayed elevated over two quarters is "the gate is wrong" (a trend across time/people indicts the gate). The one that spiked once during a single major incident is "a bad month" (a one-time spike correlated with a single root cause). Distinguishing evidence: trend vs level, distribution across people (everyone vs one person), whether the stated reasons in review records cluster on the same cause, and whether it's the same gate repeatedly while others stay near zero.
  7. Because it is what converts break-glass from "a hole in your security" into a managed, observable, self-correcting exception process: it produces the learning, the telemetry that feeds gate design (Concept 7), and the conscious-act discipline that holds back normalization of deviance (Concept 6). If it's optional, it stops happening: you keep all of break-glass's risk and lose every benefit, and the deviance normalizes unchecked because being asked "why did you break glass?" was the thing keeping each use a conscious act.

Cheat Sheet

THE ACCESS SPECTRUM (prefer JIT; break-glass only when live approval is too slow)
  standing   always-on privilege            ✗ default for elevated; target 24/7, no use signal
  JIT        request→approve→time-box→revoke ✓ steady-state; LIVE approver at use
  break-glass pre-authorized self-service    ✓ when live-approval latency is unacceptable;
                                               trade = heavy log + alert + mandatory review

TIME-BOXED CREDENTIAL (AWS STS pattern)
  aws sts assume-role --role-arn .../BreakGlassAdmin \
    --role-session-name breakglass-INC-<id>-<user>   ← attribution
    --duration-seconds 3600                          ← time-boxed (auto-expire)
  trust policy conditions: MFA present + MFA age, session-name = incident id
  role permissions: SCOPED to the failure class, not *:* unless truly needed
  Azure → PIM activation w/ expiry · GCP → IAM Condition request.time · K8s → TTL'd RoleBinding

HIGHEST-PRIVILEGE GLASS (scale friction to blast radius)
  dual control / two-person rule   nuclear actions — even SEV1 needs 2 people
  sealed secret (Shamir M-of-N)    root/bootstrap — retrieval = quorum + alarm
  "policy is the approver"         automated emergency paths — code-reviewed, logged like human

INTEGRITY OF AUDIT  (the part everyone misses)
  problem: admin glass can DELETE the log of the glass
  fix:  out-of-band destination  (CloudTrail → separate LOCKED logging account)
        immutable / WORM          (S3 Object Lock COMPLIANCE — not even root deletes)
        separation of duties      (glass-breakers ≠ log-admins)
  rule: a log the audited party can edit is a courtesy, not a control

FAIL DIRECTION
  fail-open    allow when gate can't evaluate   (availability; silent loss of control = worst)
  fail-closed  block when gate can't evaluate   (safety; may worsen incident)
  break-glass = the CONTROLLED fail-open → fail closed by default, glass is the only door

HUMAN FACTORS (no IAM policy fixes this)
  normalization of deviance  success at rule-breaking → reclassified as normal (Vaughan)
  countermeasure:  every bypass VISIBLE + EFFORTFUL + REVIEWED
  leading indicator:  bypass-rate-per-gate over time  (rising = boundary slipping)
  principle:  fix the GATE, not the human — chronic break-glass = a defect in the gate

FEEDBACK LOOP → break-glass telemetry → gate review → fix/relax gate | fix root cause | bad month
  gate-wrong vs bad-month:  TREND not level · spread across people · reasons cluster · same gate

REVIEW (load-bearing control)  every use → MANDATORY + BLAMELESS post-hoc review
COMPLIANCE  SOC2/SOX/PCI/ISO/NIST want: pre-defined procedure · restricted access · logging · review
            → make evidence a BYPRODUCT of the mechanism; audit becomes a query, not a scramble
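The STS pattern in the cheat sheet can also be sketched programmatically. The helper below only builds and validates the keyword arguments for an STS AssumeRole call (with boto3 they would be passed as `sts.assume_role(**params)`); it makes no network call. The function name and the 3600-second ceiling are this sketch's assumptions; the 900-second floor and the session-name character set follow the STS API's documented limits.

```python
import re

# STS limits RoleSessionName to 2-64 characters from this set.
_SESSION_NAME = re.compile(r"[\w+=,.@-]{2,64}", re.ASCII)


def break_glass_params(role_arn, incident_id, user, mfa_serial, mfa_code,
                       duration_seconds=3600):
    """Build AssumeRole arguments with attribution and a hard time box."""
    # Session name carries attribution: which incident, which person.
    session_name = f"breakglass-{incident_id}-{user}"
    if not _SESSION_NAME.fullmatch(session_name):
        raise ValueError(f"invalid session name: {session_name!r}")
    # 900s is the STS minimum; the 3600s ceiling is a policy choice of this sketch.
    if not 900 <= duration_seconds <= 3600:
        raise ValueError("duration_seconds must be between 900 and 3600")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": duration_seconds,
        "SerialNumber": mfa_serial,  # MFA device identifier
        "TokenCode": mfa_code,       # current MFA code
    }
```

Validating client-side is a convenience only; the enforcement that matters stays in the role's trust policy, which the caller cannot edit.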

Summary

  • Break-glass is the far end of an access spectrum — standing → JIT → break-glass — ordered by when authorization happens. Prefer JIT (a live approver at the moment of use); fall back to break-glass (pre-authorized by policy) only when waiting for a live approver is worse than the documented risk; never default to standing access for elevated privilege.
  • The mechanism's heart is the time-boxed credential (STS AssumeRole against a dedicated, scoped break-glass role, with MFA and incident conditions). Time-boxing matters most because it turns "remember to revoke" into "expires automatically," designing out the most common real failure: a grant that quietly became standing access.
  • Friction scales with blast radius — one-person self-service for everyday glass, dual control or sealed secrets (Shamir M-of-N) for the nuclear set, and "the policy is the approver" for automated emergency paths. Uniform friction is a smell.
  • The integrity-of-audit problem is the part most designs miss: admin break-glass can delete the record of the break-glass. The audit trail must be out-of-band (separate locked account), immutable (WORM / Object Lock), and separated by duty. A log the audited party can edit is not a control.
  • The gate should fail closed, with break-glass as the controlled fail-open — so bypassing the protection is never silent, always a deliberate, recorded act.
  • Normalization of deviance (Vaughan) means a break-glass decays under use: rule-breaking that succeeds gets reclassified as normal. Keep every bypass visible, effortful, and reviewed; watch bypass-rate-per-gate as the leading indicator; and fix the gate, not the human — chronic break-glass is a bug in the gate, and its telemetry is the highest-signal input to gate design.
  • The mandatory, blameless post-incident review is the load-bearing control: it produces the learning, the feedback loop, and the conscious-act discipline that holds the whole system together. Built well, it also makes compliance (SOC2/SOX/PCI/ISO/NIST emergency-change controls) a query rather than a scramble, because every required control is a byproduct of the mechanism.

The next layer — professional.md — is about operating break-glass across an organization: the policy, the governance, the cross-team standards, and the audits that prove it works.


Further Reading

  • The Challenger Launch Decision — Diane Vaughan. The origin and definitive treatment of normalization of deviance; the canonical case study for why repeated successful rule-breaking redefines acceptable risk.
  • Columbia Accident Investigation Board (CAIB) Report, Vol. I — the second NASA case, explicitly diagnosing the same dynamic (foam strikes reclassified as routine); short and devastating on organizational drift.
  • Site Reliability Engineering and The SRE Workbook (Google) — incident response, on-call, emergency access, and the blameless postmortem culture that the review process depends on.
  • AWS guidance on break-glass / emergency access and JIT access — dedicated break-glass roles, root-account protection, and the CloudTrail-to-logging-account architecture (and S3 Object Lock for WORM audit).
  • Microsoft Entra Privileged Identity Management (formerly Azure AD PIM) and GCP IAM Conditions / privileged access documentation, the JIT-with-expiry implementations on the other two major clouds.
  • NIST SP 800-53 (AC-6 least privilege, AU-9 protection of audit information, CM-3 change control) and the SOC 2 Trust Services Criteria — the formal encoding of restricted, logged, reviewed emergency access (AU-9 is the integrity-of-audit principle in standard form).
  • For operating these controls at organizational scale — policy, governance, and audit — see professional.md.