Break-glass & Bypass — Professional Level

The senior page taught you how to build one break-glass mechanism. This page is about owning the org's break-glass program, where "should we allow an admin merge?" stops being a per-PR question and becomes a question about your control design, your normalization-of-deviance risk, your audit posture, and whether your strict gates are actually survivable.


Table of Contents

  1. Introduction
  2. Prerequisites
  3. Glossary
  4. Core Concept 1 — The Thesis: Break-glass Is What Makes Strict Gates Survivable
  5. Core Concept 2 — Program Design: One Sanctioned Path, Not Per-Team Improvisation
  6. Core Concept 3 — The Metrics That Run the Program
  7. Core Concept 4 — The Normalization-of-Deviance Fight
  8. Core Concept 5 — Compliance: Making Break-glass Auditor-Ready
  9. Core Concept 6 — Incident-Response Integration
  10. Core Concept 7 — The Culture: Blameless but Mandatory Review
  11. War Stories
  12. Decision Frameworks
  13. Mental Models
  14. Common Mistakes
  15. Test Yourself
  16. Cheat Sheet
  17. Summary
  18. Further Reading
  19. Related Topics

Introduction

Focus: Owning the break-glass program across an org, where the emergency fast-path is a designed control — a security, compliance, and incident-response concern, not a backdoor.

The senior page framed break-glass as a mechanism: a pre-authorized escape hatch with logging and review. At the professional level the same mechanism shows up in different meetings: a SEV1 where a required check is flaking and the fix is stuck behind the gate that's supposed to protect production; a quarterly ops review where one gate has a 40% bypass rate and nobody noticed; a SOC2 audit asking "show me your emergency-change procedure and the evidence it's followed"; a postmortem where the root cause is "we always admin-merge on Fridays" and an unreviewed change took down checkout.

None of these are new concepts — they're the mechanism from the earlier tiers, now multiplied by an org, a regulator, and a culture that drifts. The hardest thing here is not building a break-glass button; it's running a program where the safe path is the easy path, so teams don't delete your gates or invent unsafe workarounds behind your back. The central, counter-intuitive truth: a sanctioned bypass is not a weakness in your gates — it is the load-bearing control that lets the gates stay strict. This page is the pragmatic, battle-tested layer for the person who owns that.


Prerequisites

  • Required: senior.md — the break-glass mechanism: pre-authorization, time-boxing, out-of-band logging, mandatory review.
  • Required: You've designed or operated a required-checks gate (05 — Gate Design) and branch protection (02).
  • Helpful: You've run an incident as IC or been the approver on an emergency change.
  • Helpful: You've sat through a SOC2/SOX/PCI audit, or owned the evidence for one.

Glossary

  • Break-glass: A pre-designed, sanctioned procedure to bypass a normal control in an emergency, leaving an audit trail and triggering mandatory review. Named for the "in case of emergency, break glass" fire-alarm box.
  • Bypass: Any path around a gate. Sanctioned bypass = break-glass. Unsanctioned bypass = someone deleting branch protection or disabling a check — the failure mode break-glass exists to prevent.
  • JIT (just-in-time) access: Privilege granted for a bounded window on request, then auto-revoked — as opposed to standing access that's always on.
  • Glass rate (bypass rate): Break-glass uses per gate per unit time, normalized by total merges/deploys through that gate. The program's leading indicator.
  • Normalization of deviance: Diane Vaughan's term (from the Challenger analysis): a once-exceptional deviation becomes routine and accepted because "nothing bad happened last time," until it does.
  • Fail-open / fail-closed: When a gate's infrastructure is unavailable, fail-open lets traffic/merges through; fail-closed blocks them. A policy choice, distinct from break-glass.
  • Emergency change: In ITIL/SOC2/SOX/PCI language, a change made outside the normal approval flow to restore service or mitigate risk; auditors expect it to be controlled, logged, and reviewed.
  • Tamper-evident log: A record an actor with elevated privileges cannot silently alter — typically because it's written out-of-band to a sink the actor can't reach.

Core Concept 1 — The Thesis: Break-glass Is What Makes Strict Gates Survivable

Start from the failure you're actually preventing. Imagine a gate with no sanctioned bypass — required reviews, required green checks, no admin override — and a SEV1 in progress where the gate is in the way (the fix is blocked on a flaky check, or the only reviewer is asleep). Under enough pressure, exactly one of three things happens:

  1. Someone with admin rights deletes the control — disables branch protection, force-pushes, turns off the required check — to ship the fix. The gate is now off, often org-wide, and frequently stays off because re-enabling it is nobody's job at 3 a.m.
  2. Someone invents an unsafe workaround — a side branch that skips CI, a manual deploy that bypasses the pipeline, a shared admin token passed around in Slack.
  3. The org concludes the gate is "unrealistic" and lobbies to weaken or remove it permanently — so the gate that mattered 99% of the time is gone because of the 1%.

All three are worse than a sanctioned bypass, because all three are unobservable and uncontrolled. The deviation still happened; you just lost the log, the scoping, and the review.

This is the thesis, and it should change how you argue for break-glass to leadership and to skeptical security teams: break-glass is not the hole in your gate; it is the pressure-relief valve that keeps the gate intact. A boiler without a relief valve doesn't stay safe — it explodes at the weakest seam. The sanctioned fast-path channels the inevitable emergency through a route you designed, instrumented, and can review, instead of through whatever the most stressed engineer improvises.

The principle: make the safe path the easy path. If breaking glass is easier than deleting branch protection — one documented command, a known approver, automatic logging — people use it, and you keep observability over every deviation. If it's harder (undocumented, requires a ticket and a 2-hour wait), they route around it, and you're blind. Your program's success is measured by whether the sanctioned path out-competes the unsanctioned ones.

A corollary that staff engineers under-internalize: a gate's strictness and its break-glass quality are complements, not substitutes. The stricter the gate, the better your break-glass needs to be — because strictness raises the pressure on the relief valve. Teams that try to run strict gates without investing in break-glass get the normalization-of-deviance failure (Concept 4) for free.


Core Concept 2 — Program Design: One Sanctioned Path, Not Per-Team Improvisation

The single biggest lever is consolidation: one well-known, documented emergency procedure for the org, not forty teams each inventing their own. Per-team improvisation is how you end up with one team that admin-merges, one that has a shared bot token, one that disables checks, and one that genuinely has no path and panics. You cannot audit, measure, or improve forty different unofficial procedures. You can audit, measure, and improve one.

A well-designed program has five non-negotiable properties:

| Property | What it means | Why it matters |
| Single documented procedure | One runbook, linked from the incident process, that says exactly how to break glass | Removes improvisation; the procedure is the same whether it's 2 p.m. or 3 a.m. |
| Pre-authorized actors + scopes | A defined, small set of people who may break glass, and which gates they may bypass | A break-glass right is not a god-mode admin; scope it to the gate, not the kingdom |
| Time-boxed JIT grants | The elevated privilege is granted on request and auto-expires (minutes to hours) | No standing god-accounts; the window of risk is bounded by construction |
| Tamper-evident out-of-band logging | The use is logged to a sink the actor cannot alter | An audit trail the actor can edit is not an audit trail (see War Story 3) |
| Mandatory blameless post-use review | Every use is reviewed, no exceptions, without blame | This is the feedback loop and the deterrent against casual use |

A few design details that separate a real program from a checkbox:

Pre-authorization, not on-the-fly grants. Deciding who can break glass during an incident is too late and too political. Decide it in advance: a named on-call role, a small set of senior engineers, the IC. The emergency is not the time to debate authorization.

Scope the bypass to the gate, not to admin. The most common mistake is implementing break-glass as "grant the user org-admin for an hour." That over-grants wildly — now they can also rewrite history, delete repos, and edit logs. Scope the grant to exactly the gate being bypassed: a bypass token for the merge gate, a one-deploy approval override, a single time-boxed exception in the policy engine. (This is why mature setups use a policy/access broker that can mint narrow, expiring grants rather than toggling roles.)

Time-box by construction, not by promise. "Remember to turn it off after" is not a control; it's a hope (War Story 1 is exactly this failure). The grant must auto-expire. JIT-access tooling (e.g., a privileged-access broker, or a short-TTL token) makes the window self-closing.
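To make "scoped to the gate" and "time-boxed by construction" concrete, here is a minimal sketch of a narrow, self-expiring grant. The `BypassGrant` class, `mint_grant` helper, gate names, and the 30-minute default are all hypothetical illustrations, not the API of any real access broker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BypassGrant:
    """Hypothetical narrow grant: one actor, one gate, one expiry."""
    actor: str
    gate: str
    expires_at: datetime

    def allows(self, actor: str, gate: str, now: datetime) -> bool:
        # Valid only for the named actor, the named gate, and before expiry.
        return actor == self.actor and gate == self.gate and now < self.expires_at


def mint_grant(actor: str, gate: str, now: datetime, ttl_minutes: int = 30) -> BypassGrant:
    # The expiry is fixed at mint time, so nobody has to remember to revoke it.
    return BypassGrant(actor, gate, now + timedelta(minutes=ttl_minutes))


now = datetime(2024, 1, 6, 3, 0, tzinfo=timezone.utc)
grant = mint_grant("oncall-alice", "merge-gate:payments", now)
later = now + timedelta(minutes=10)
print(grant.allows("oncall-alice", "merge-gate:payments", later))   # True
print(grant.allows("oncall-alice", "deploy-gate:payments", later))  # False: wrong gate
print(grant.allows("oncall-alice", "merge-gate:payments", now + timedelta(minutes=31)))  # False: expired
```

The grant carries no role at all, only a gate name and a deadline, which is the difference from "org-admin for an hour."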

Log out-of-band. If the break-glass actor has rights over the system that stores the log, the log is not tamper-evident. The log must go somewhere the actor cannot reach: a separate append-only store, a SIEM, a different cloud account, an immutable bucket with object-lock. (War Story 3 is the lesson that forces this.)
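One way to illustrate "tamper-evident" is a hash chain, where each entry commits to the one before it. This is a swapped-in technique for illustration, not something the procedure above prescribes, and it complements rather than replaces the out-of-band sink: an actor who can rewrite the whole store could recompute the chain unless the latest hash is anchored somewhere they cannot reach. The function names and record fields are assumptions.

```python
import hashlib
import json


def append_entry(log: list[dict], record: dict) -> None:
    # Each entry commits to the previous entry's hash, forming a chain.
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"record": record, "prev": prev_hash, "hash": entry_hash})


def verify_chain(log: list[dict]) -> bool:
    # Recompute every hash from the start; any edited record breaks the chain.
    prev_hash = "0" * 64
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"actor": "alice", "gate": "merge-gate", "justification": "SEV1 hotfix"})
append_entry(log, {"actor": "bob", "gate": "deploy-gate", "justification": "rollback"})
print(verify_chain(log))  # True

log[0]["record"]["justification"] = "routine change"  # a silent after-the-fact edit
print(verify_chain(log))  # False
```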

Make using it auto-file the review. The single best mechanism: breaking glass automatically opens the post-use review ticket, pre-filled with who/what/when/why. If the review is a separate manual step, it gets skipped. If it's an automatic side-effect of the bypass, it can't be.
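The auto-filing idea can be sketched as a single entry point that writes the audit record and opens the review ticket in the same call. The in-memory lists below are stand-ins for a real audit sink and ticket tracker, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BreakGlassSystem:
    """In-memory stand-ins for the audit sink and the ticket tracker (both hypothetical)."""
    audit_log: list = field(default_factory=list)
    review_tickets: list = field(default_factory=list)

    def break_glass(self, actor: str, gate: str, justification: str, incident: str) -> dict:
        record = {
            "actor": actor,
            "gate": gate,
            "justification": justification,
            "incident": incident,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self.audit_log.append(record)
        # The review is opened by the same call that performs the bypass,
        # so there is no separate manual step to forget.
        ticket = {"id": len(self.review_tickets) + 1, "status": "open", **record}
        self.review_tickets.append(ticket)
        return ticket


system = BreakGlassSystem()
ticket = system.break_glass("alice", "merge-gate:payments", "flaky required check", "INC-1234")
print(ticket["id"], ticket["status"])  # 1 open
```

Because there is no code path that bypasses without filing, "100% of uses reviewed" starts from a complete list rather than from memory.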

The org-design lens: you are not building a feature; you are establishing a standard. The deliverable is a documented procedure, a small set of pre-authorized actors, a JIT grant mechanism that auto-expires, an out-of-band log, and an auto-filed review — and then the much harder work of getting every team to use that instead of their own. A program that exists only on paper while teams keep admin-merging is worse than no program, because you now believe you have control you don't.


Core Concept 3 — The Metrics That Run the Program

You cannot run a break-glass program on vibes; you run it on a small set of metrics reviewed in a regular forum. The metrics turn "are people abusing this?" into a fact and, more importantly, turn the glass rate into a signal about your gates.

The leading indicator: glass rate per gate. Count break-glass uses per gate, normalized by throughput, over a rolling window. The per-gate breakdown is the whole point — an aggregate glass rate hides the signal. The insight that makes this powerful:

A gate with a high glass rate is not a discipline problem; it is a mis-designed gate. If people bypass a specific gate constantly, the gate is too slow, too flaky, or blocking the wrong thing. The fix is almost never "tell people to stop bypassing" — it's to feed that gate back to gate-design (05) and fix the root cause. The glass rate is your gates' pain, made visible.

This reframes break-glass metrics from a security surveillance tool into a gate-quality tool. The single most valuable thing a high glass rate tells you is which gate to fix next.
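As a rough sketch of the per-gate calculation, the function below divides break-glass uses by throughput for each gate. The gate names and numbers are made up for illustration; the point is only that the aggregate figure hides the outlier.

```python
from collections import Counter


def glass_rate_per_gate(bypasses: list[str], throughput: dict[str, int]) -> dict[str, float]:
    """bypasses: one gate name per break-glass use; throughput: total merges/deploys per gate."""
    uses = Counter(bypasses)
    return {
        gate: uses[gate] / total
        for gate, total in throughput.items()
        if total > 0  # skip gates with no traffic in the window
    }


bypasses = ["integration-tests"] * 35 + ["two-reviews"] * 2
throughput = {"integration-tests": 100, "two-reviews": 200, "lint": 300}

rates = glass_rate_per_gate(bypasses, throughput)
aggregate = len(bypasses) / sum(throughput.values())

print(f"{aggregate:.3f}")            # 0.062: the org-wide number looks unremarkable
print(max(rates, key=rates.get))     # integration-tests: the gate to look at next
```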

The metric set the program actually runs on:

| Metric | What it tells you | Healthy direction |
| Glass rate per gate | Which gates are mis-designed / too painful | Low and falling; investigate any outlier gate |
| Time-to-review | Whether reviews are actually happening | Within SLA (e.g., < 2 business days); rising = process rotting |
| % of uses reviewed | Whether "mandatory" is real | 100%; anything less means the deterrent is gone |
| Exception / waiver age | Whether temporary bypasses became permanent | Old waivers are debt; nothing should be "temporarily" exempt for 9 months |
| Repeat actors / repeat gates | Normalization-of-deviance early warning | Same person+gate repeatedly = it's becoming routine |
| Unsanctioned bypass count | Whether people route around the program | Should trend to zero as the sanctioned path wins |
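Two of these metrics, "% of uses reviewed" and overdue reviews, can be sketched directly from the use records. The record schema is hypothetical, and the SLA below counts calendar days rather than business days to keep the example short.

```python
from datetime import datetime, timedelta


def review_health(uses: list[dict], now: datetime, sla: timedelta = timedelta(days=2)) -> dict:
    """Each use has 'used_at' and an optional 'reviewed_at' (hypothetical schema)."""
    reviewed = [u for u in uses if u.get("reviewed_at") is not None]
    overdue = [
        u for u in uses
        if u.get("reviewed_at") is None and now - u["used_at"] > sla
    ]
    pct = 100.0 * len(reviewed) / len(uses) if uses else 100.0
    return {"pct_reviewed": pct, "overdue": len(overdue)}


now = datetime(2024, 3, 10, 12, 0)
uses = [
    {"used_at": datetime(2024, 3, 1, 9, 0), "reviewed_at": datetime(2024, 3, 2, 9, 0)},
    {"used_at": datetime(2024, 3, 4, 9, 0), "reviewed_at": datetime(2024, 3, 5, 9, 0)},
    {"used_at": datetime(2024, 3, 5, 9, 0), "reviewed_at": None},   # past the SLA
    {"used_at": datetime(2024, 3, 10, 9, 0), "reviewed_at": None},  # still inside the SLA
]
health = review_health(uses, now)
print(health)  # {'pct_reviewed': 50.0, 'overdue': 1}
```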

A couple of subtleties:

Watch the trend, not just the level. A rising glass rate (the first derivative) is a leading indicator of either a degrading gate or creeping normalization, both worth catching before the SEV1. An accelerating rise (the second derivative) is the stronger smell. A flat low rate is fine.
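A small sketch of telling "rising" apart from "accelerating," using consecutive differences over a weekly series. The helper names and sample data are illustrative assumptions, and a real dashboard would likely smooth the series first.

```python
def differences(series: list[float]) -> list[float]:
    # Consecutive differences: a discrete stand-in for a derivative.
    return [later - earlier for earlier, later in zip(series, series[1:])]


def is_accelerating(rates: list[float]) -> bool:
    """True when the latest change is positive and larger than the change before it."""
    slope = differences(rates)
    acceleration = differences(slope)
    if not acceleration:
        return False  # need at least three data points
    return slope[-1] > 0 and acceleration[-1] > 0


weekly_rate = [0.02, 0.02, 0.03, 0.05, 0.09]  # illustrative weekly glass rate for one gate
print(is_accelerating(weekly_rate))   # True: each weekly jump is bigger than the last
print(is_accelerating([1, 2, 3, 4]))  # False: rising, but at a constant pace
print(is_accelerating([2, 2, 2]))     # False: flat
```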

Distinguish glass rate from unsanctioned-bypass rate. A falling sanctioned glass rate is only good if the unsanctioned-bypass rate isn't rising to meet it. If sanctioned uses drop while admin-merges climb, you haven't fixed anything; you've pushed the deviation back into the dark. Track both. The healthy sequence is that sanctioned use rises first as it displaces unsanctioned bypasses, and then both fall as the gates get fixed.

Review them in a standing forum. The metrics do nothing in a dashboard nobody opens. Put glass rate per gate, time-to-review, and waiver age on the agenda of an existing operational/quality review (the same one that reviews SLOs and incident trends). The act of a senior leader asking "why is the deploy gate at 30% glass rate?" every month is what keeps the program alive.


Core Concept 4 — The Normalization-of-Deviance Fight

This is the staff engineer's hardest, most human job, and it's where most break-glass programs quietly die. Diane Vaughan coined normalization of deviance analyzing the Challenger disaster: O-ring erosion was a known deviation from spec, but because it hadn't caused a catastrophe yet, each launch with erosion made the next one feel acceptable, until the deviation was the norm and the catastrophe arrived.

The software version is mundane and lethal: "we always admin-merge on Fridays." It starts as a genuine emergency. Nothing bad happens. Next Friday, with a tight deadline, someone admin-merges again — "we did it last week, it was fine." Within a quarter it's culture: the gate is theater, everyone bypasses it routinely, and the one time the bypassed change is actually broken, it ships straight to production and causes a SEV1 (War Story 2). The deviation didn't announce itself; it became invisible by being routine.

Your job is to fight this structurally, because willpower and reminders don't scale. The levers:

Visibility makes deviance un-routine. The single most effective antidote is a dashboard that makes every bypass visible and a forum where the numbers are discussed. Deviance normalizes in the dark; it cannot normalize when "the payments team broke glass 11 times this month" is on a slide every senior engineer sees. You're not shaming — you're denying the deviation its camouflage.

Mandatory review of every use is the deterrent. Not because review is punitive (it must be blameless — Concept 7), but because friction-with-attention is what keeps a bypass exceptional. If breaking glass is free and unobserved, it normalizes instantly. If every single use auto-files a review someone will read, the routine "eh, just admin-merge" never gets comfortable. The mandate is the point: skip-able review is no review.

Fix the root-cause gate fast so people don't need to bypass. This is the deepest lever and the one that ties back to Concept 3. The reason "we always admin-merge on Fridays" takes hold is almost always that the gate genuinely hurts — a slow check, a flaky required test, a release-freeze that's too blunt. People normalize the bypass because the alternative is worse for them. Remove the pain and the bypass stops being attractive. The most durable defense against normalization of deviance is making the compliant path not suck. A high glass rate that you fix at the gate drains the swamp; a high glass rate you police just moves it.

Leadership air-cover for "we slowed down to fix the gate." Engineers normalize bypassing because shipping is rewarded and the gate is in the way. The counter has to come from leadership: explicit, visible support for "we paused feature work to fix the flaky check that everyone was bypassing." If a staff engineer says "stop admin-merging" while leadership keeps rewarding the people who admin-merge to hit dates, the staff engineer loses. The cultural lever requires the political one.

The hard truth: normalization of deviance is the default trajectory of any bypass that is easy, unobserved, and relieves real pain. Left alone, every break-glass program drifts there. Preventing it is not a one-time setup — it's a standing practice: visibility, mandatory review, fast root-cause fixes, and leadership air-cover, applied continuously. The day you stop watching the glass rate is the day Fridays start.


Core Concept 5 — Compliance: Making Break-glass Auditor-Ready

Here is the part that surprises engineers: a well-run break-glass program is not a compliance liability — it is a compliance asset. Every serious framework — SOC 2, SOX (ITGC change management), PCI-DSS, ISO 27001 — assumes emergencies happen and requires you to have a controlled way to handle them. The control they're testing is "emergency change management," and a sanctioned break-glass program is exactly how you pass it.

The auditor's mental model for an emergency change is a four-part test, and a good break-glass program satisfies all four by construction:

| Auditor asks | The control | Break-glass property that satisfies it |
| Is there a documented procedure? | A defined emergency-change process | The single documented runbook (Concept 2) |
| Is access restricted to authorized people? | Least-privilege, segregation of duties | Pre-authorized actors + scoped, time-boxed JIT grants |
| Is every use logged? | Complete, tamper-evident audit trail | Out-of-band immutable logging |
| Is every use reviewed/approved after the fact? | Post-implementation review of emergency changes | Mandatory blameless review, ideally auto-filed |

Documented + restricted + logged + reviewed = the emergency-change control is satisfied. That sentence is worth memorizing; it's the exact shape SOC2/SOX/PCI auditors are looking for. The thing that fails an audit is not "you have a break-glass path" — it's "you have undocumented admin merges with no log and no review," i.e., the unsanctioned bypass your program exists to replace. Ad-hoc bypass is the finding; sanctioned break-glass is the remediation.

The evidence pipeline is the real deliverable. Auditors don't accept "we have a procedure"; they sample. They'll pick a quarter, ask "show me every emergency change," and expect, per use: who did it, what was bypassed, when, why (the justification), and the completed post-use review. If your evidence is "let me grep Slack and the GitHub admin audit log and reconstruct it," you'll spend the audit doing archaeology and probably get a finding for incompleteness. Build the pipeline so the evidence is a query, not an excavation:

  • Break-glass use writes a structured record (actor, gate, timestamp, justification, incident link) to the immutable log automatically.
  • That record auto-links to the review ticket.
  • A standing report can list, for any time range, every use and its review status.
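The three bullets above can be sketched as one query over structured records. The field names mirror the list (actor, gate, timestamp, justification, incident link), while the `review_status` field and the function name are assumptions made for the example.

```python
from datetime import datetime


def emergency_change_report(records: list[dict], start: datetime, end: datetime) -> list[dict]:
    """Return every break-glass use in [start, end), oldest first, with its review status."""
    return [
        {
            "actor": r["actor"],
            "gate": r["gate"],
            "timestamp": r["timestamp"],
            "justification": r["justification"],
            "incident": r["incident"],
            # A use with no linked review is surfaced, not silently dropped.
            "review_status": r.get("review_status", "missing"),
        }
        for r in sorted(records, key=lambda r: r["timestamp"])
        if start <= r["timestamp"] < end
    ]


records = [
    {"actor": "bob", "gate": "deploy-gate", "timestamp": datetime(2024, 2, 20),
     "justification": "rollback", "incident": "INC-7", "review_status": "signed-off"},
    {"actor": "alice", "gate": "merge-gate", "timestamp": datetime(2024, 1, 15),
     "justification": "flaky check", "incident": "INC-3"},
    {"actor": "carol", "gate": "merge-gate", "timestamp": datetime(2024, 4, 2),
     "justification": "hotfix", "incident": "INC-9", "review_status": "signed-off"},
]
report = emergency_change_report(records, datetime(2024, 1, 1), datetime(2024, 4, 1))
for row in report:
    print(row["timestamp"].date(), row["actor"], row["review_status"])
```

Anything reported as "missing" is the gap an auditor's sample would find, so it is better to find it first.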

When that pipeline exists, the audit interaction is: "here is every emergency change last quarter, each with its justification and signed-off review." That's a clean result (War Story 5), and it's almost entirely a function of having built logging + auto-filed review into the mechanism months earlier — the same machinery that serves the metrics in Concept 3.

The reframe for skeptical security/GRC partners: an organization with no break-glass program doesn't have fewer emergency changes — it has the same emergency changes happening invisibly, which is precisely the uncontrolled-change finding auditors hate. Sanctioning, scoping, logging, and reviewing the fast-path is how you convert an audit liability into an audit asset. You're not adding risk; you're making existing risk legible.


Core Concept 6 — Incident-Response Integration

Break-glass and incident response are the same problem viewed from two angles: a SEV1 is the canonical reason to break glass, and an unsanctioned bypass is a frequent cause of the next SEV1. So the sanctioned fast-path has to live inside the incident runbook, not in a separate security doc nobody reads at 3 a.m.

Bake the fast-path into the runbook. When responders are mid-incident, cognitive load is maxed and improvisation is dangerous. The runbook must say, in the response steps, exactly how to break glass for the relevant gates — the command, the approver, the fact that it auto-logs. The whole value of a designed control evaporates if responders have to find it under pressure. The goal is that breaking glass during a SEV1 is a known step, not a creative act.

The core tension: "restore service fast" vs "don't make it worse." This is the genuine hard part. Incident response rewards speed — every minute of downtime costs — so the pressure is to bypass everything and ship. But the gates exist because skipping them is how you turn a SEV1 into a SEV0 (the emergency fix that itself was broken; War Story 2 in spirit). The resolution is not "always bypass" or "never bypass"; it's deliberate, scoped bypass:

  • Bypass the gate that's in the way of the fix (the flaky check, the slow approval), not every gate.
  • Keep the gates that are cheap and protect against making it worse — a fast smoke test, a syntax/lint check, a canary. These cost seconds and catch the "fix" that's a typo.
  • Have the bypass itself reviewed by a second person in-incident when feasible — the IC or a second responder eyeballs the change. The fastest way to extend an outage is a solo, unreviewed, panic "fix."
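A deliberately simple sketch of "bypass what's in the way, keep the cheap safety nets." The `Gate` fields, the 120-second threshold, and the gate names are hypothetical; real decisions will depend on the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    seconds: int          # typical time the gate takes to clear
    blocking_fix: bool    # failing or stuck for reasons unrelated to the fix


def plan_bypass(gates: list[Gate], cheap_seconds: int = 120) -> dict[str, list[str]]:
    plan: dict[str, list[str]] = {"bypass": [], "keep": []}
    for gate in gates:
        # Bypass only what is in the way: a gate that is wrongly blocking the fix,
        # or one too slow for the incident. Cheap safety nets stay on.
        in_the_way = gate.blocking_fix or gate.seconds > cheap_seconds
        plan["bypass" if in_the_way else "keep"].append(gate.name)
    return plan


gates = [
    Gate("flaky-integration-check", 600, blocking_fix=True),
    Gate("second-reviewer-approval", 7200, blocking_fix=False),
    Gate("smoke-test", 30, blocking_fix=False),
    Gate("lint", 10, blocking_fix=False),
]
plan = plan_bypass(gates)
print(plan["bypass"])  # ['flaky-integration-check', 'second-reviewer-approval']
print(plan["keep"])    # ['smoke-test', 'lint']
```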

Pre-approved emergency changes lower the in-the-moment cost. For known incident classes, pre-authorize specific emergency changes so they don't need fresh approval mid-fire: "rolling back to the last known-good deploy is pre-approved during any SEV1," "scaling up replicas is pre-approved," "enabling the kill-switch flag is pre-approved." Pre-approval is break-glass thinking applied ahead of time — it shrinks the set of decisions you have to make while the building is on fire, and it keeps the risky improvisation budget for the genuinely novel situations.
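Pre-approval can be as plain as a lookup table agreed in advance. The sketch below assumes all three example actions are pre-approved for SEV1 only; the action identifiers and the function name are hypothetical.

```python
# Agreed ahead of time, outside any incident (illustrative contents).
PRE_APPROVED = {
    "SEV1": {"rollback_to_last_known_good", "scale_up_replicas", "enable_kill_switch"},
}


def needs_fresh_approval(severity: str, action: str) -> bool:
    # Anything not on the pre-approved list still goes through in-incident approval.
    return action not in PRE_APPROVED.get(severity, set())


print(needs_fresh_approval("SEV1", "rollback_to_last_known_good"))  # False
print(needs_fresh_approval("SEV1", "hotfix_schema_migration"))      # True
print(needs_fresh_approval("SEV3", "scale_up_replicas"))            # True
```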

The integration discipline: the SEV1 fast-path belongs in the incident runbook, scoped (bypass what's in the way, keep the cheap safety nets), and pre-approved for known cases. The failure mode is a break-glass procedure that's technically correct but socially invisible during incidents, so responders improvise an unsanctioned bypass anyway — and now you have an uncontrolled change and an ongoing incident. Rehearse it; a break-glass path nobody has used in a drill won't be used correctly in a real SEV1.


Core Concept 7 — The Culture: Blameless but Mandatory Review

The whole program rests on a cultural stance that has to be held precisely, because it's easy to get wrong in either direction. The stance: a break-glass use is data, not a crime — but the review is mandatory, every time. Blameless and mandatory. Drop either half and the program fails.

Blameless, because punishment drives it underground. If breaking glass gets you a stern talking-to or a ding on your review, the rational engineer stops breaking glass — and starts quietly working around the gate instead, because the unsanctioned path has no review to punish them. Punishing sanctioned bypass selects for unsanctioned bypass. The review must treat a break-glass use as a signal ("what made the compliant path unusable?") not a transgression. The questions are about the system: Was the gate too slow? Was the check flaky? Was there no other reviewer? Each blameless review feeds Concept 3's metrics and Concept 4's root-cause fixes.

Mandatory, because optional review normalizes deviance instantly. Blameless does not mean optional. Every use is reviewed — no "it was obviously fine, skip it." The mandate is the deterrent against casual use (Concept 4) and the evidence the auditor samples (Concept 5). Blameless-but-mandatory is the only stable point: blameless keeps it honest, mandatory keeps it exceptional.

Reward the disclosure, not the heroics. The cultural keystone, and the thing leadership has to actively model: "I broke glass and filed the follow-up" must be celebrated over "I quietly worked around it." The engineer who used the sanctioned path and disclosed it did exactly the right thing — they gave you observability over a deviation that was going to happen regardless. Treating them as suspect, while the engineer who silently disabled a check skates by unseen, is the single fastest way to kill the program. You want a culture where disclosing a bypass is normal and respected, because every disclosure is a data point you'd otherwise be blind to.

The cultural reframe: you are optimizing for observability of deviation, not zero deviation (which is a fantasy under real production pressure). The engineer who breaks glass and tells you is your ally; the one who quietly routes around the gate is the actual risk. Blameless review keeps allies honest; mandatory review keeps deviation exceptional; rewarding disclosure keeps the sanctioned path winning. Get the culture wrong and no amount of tooling saves you — people route around tools they're punished for using correctly.


War Stories

  1. The team with no break-glass that force-pushed over main. A service team ran strict branch protection: required reviews, required green CI, no admin override, no sanctioned bypass. During a Saturday-night outage, the fix was blocked: a required check was failing for an unrelated flake and the on-call couldn't merge. Under pressure, an engineer with org-admin deleted the branch protection rule to force the fix through, force-pushed over main, and shipped. The fix worked, but branch protection stayed off for eleven days (nobody owned re-enabling it), during which two unreviewed changes landed, one of which caused a second incident. The lesson the org took: a strict gate without a sanctioned fast-path doesn't stay strict under pressure; it gets deleted under pressure, often org-wide and often permanently. They built a scoped, time-boxed break-glass that bypassed the specific check without touching the protection rule. The relief valve was cheaper than the explosion.

  2. The Friday admin-merge culture that caused a SEV1. A payments team's gate required two reviews. Reviews were slow on Fridays (reviewers heads-down for the week's end), so to hit a deadline someone admin-merged "just this once." Nothing broke. It became routine: within a quarter, "Friday admin-merge" was an understood, unwritten practice, used dozens of times, never reviewed. Then one Friday an admin-merged change had a subtle bug in fee calculation that would have been caught in review; it shipped straight to production and mischarged customers for six hours, a SEV1 and a refund exercise. The postmortem root cause wasn't the bug; it was the normalization of deviance that let an unreviewed change reach prod. The fix had two parts: a dashboard that made every bypass visible in the weekly ops review (so "Friday admin-merge" couldn't hide), and fixing the actual gate (the review SLA) so people didn't need to bypass on Fridays. Visibility plus root-cause fix; policing alone would have just moved it.

  3. The break-glass admin who edited the audit log. An org built what looked like a clean break-glass: pre-authorized, scoped, logged. But the implementation granted org-admin for the window, and the log was written to a store the org-admin role could modify. During a contentious incident, the break-glass holder made a change that contributed to the outage and, to avoid blame, edited the audit log to soften what they'd done. It came out in the postmortem, and the trust damage was severe. Two lessons, both now standard: (1) scope the grant to the gate, never to god-mode; the break-glass right should bypass the one control, not grant the power to rewrite history and edit logs; (2) the log must be out-of-band and tamper-evident, written to a sink the break-glass actor cannot reach (separate account, append-only/object-locked store, SIEM). An audit trail the actor can edit is not an audit trail.

  4. The high glass rate that was really a flaky check. A quarterly review surfaced that one gate, the integration-test required check on a particular service, had a 35% break-glass rate, by far the org's highest. The instinct in the room was "that team has a discipline problem." A staff engineer pushed back and investigated the gate instead. The integration check was flaky: it failed roughly 1 in 3 runs for infrastructure reasons unrelated to the code, so engineers broke glass to get legitimate changes through a check that lied. The real fix was to the gate (quarantine and repair the flaky test, add a retry, fix the test infra), after which the glass rate for that gate fell to near zero on its own. The lesson: a high glass rate per gate is a signal about the gate, not the people. The break-glass metric had done its real job: pointing at the gate that needed fixing (feeding straight back to 05 — Gate Design).

  5. The clean SOC2 audit. Going into a SOC2 Type II audit, an org's break-glass program had been running for a year: one documented procedure, scoped JIT grants, out-of-band immutable logging, and an auto-filed mandatory review per use. The auditor sampled a quarter and asked for every emergency change. The team ran one report: each break-glass use, with actor, gate, timestamp, justification, incident link, and the signed-off post-use review, complete for every single use. The auditor's emergency-change-management control passed with no findings, in one meeting. The contrast the team had seen before: a sister org with ad-hoc admin merges spent the audit reconstructing events from Slack and the platform audit log, and took a finding for incomplete change records. The clean result wasn't audit-week heroics; it was the evidence pipeline (Concept 5) built into the mechanism months earlier: documented + restricted + logged + reviewed, by construction.


Decision Frameworks

Standing access vs JIT vs break-glass — which to grant?

| Need | Mechanism | Why |
| Routine, frequent, low-risk action by a defined role | Standing access (scoped) | JIT friction isn't worth it for the everyday; scope it tight |
| Sensitive action, needed sometimes, by known people | JIT access (request → bounded window → auto-revoke) | No always-on privilege; window of risk is bounded |
| Bypassing a gate in an emergency, rare, must be reviewed | Break-glass (pre-auth + scoped + logged + mandatory review) | The deviation is exceptional and must be observable + reviewed |
| Anyone needs org-admin "just in case" | None of the above: redesign | Standing god-access is the anti-pattern all three exist to avoid |

Is this gate's glass rate telling us to fix the gate?

| Glass rate on a gate | Likely meaning | Action |
| Near zero, stable | Gate is well-designed and bypass is genuinely exceptional | Leave it; this is the goal |
| Moderate but only during real incidents | Working as intended; emergencies happen | Confirm via review justifications; no gate change |
| High, steady, across many people | Gate is too slow / flaky / blocking the wrong thing | Fix the gate (05); don't police the people |
| Rising over time | Degrading gate or creeping normalization | Investigate now; leading indicator of both failure modes |
| High for one person/team only | Possibly local normalization of deviance | Blameless conversation + check if their context differs |
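The table can be read as a small decision function. The thresholds (10% for "high," 1% for "near zero") and the precedence of the checks are assumptions chosen for the sketch, not values the framework specifies.

```python
def recommend_action(rate: float, rising: bool, distinct_actors: int,
                     high: float = 0.10, near_zero: float = 0.01) -> str:
    """Map a gate's glass-rate profile to an action (illustrative thresholds)."""
    if rising:
        return "investigate now"
    if rate >= high:
        # High and concentrated in one actor points at local habit, not the gate.
        if distinct_actors <= 1:
            return "blameless conversation"
        return "fix the gate"
    if rate <= near_zero:
        return "leave it"
    return "confirm via review justifications"


print(recommend_action(0.35, rising=False, distinct_actors=12))  # fix the gate
print(recommend_action(0.005, rising=False, distinct_actors=1))  # leave it
print(recommend_action(0.04, rising=True, distinct_actors=3))    # investigate now
```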

Break-glass review SLA + required fields.

| Element | Standard | Rationale |
|---|---|---|
| Review opened | Automatically, at time of use | Manual filing gets skipped; auto-file makes it non-optional |
| Review SLA | Within ~2 business days | Fresh enough to be accurate; slipping SLA = process rotting |
| % reviewed | 100%, no exceptions | "Mandatory" with exceptions isn't mandatory |
| Required fields | actor, gate bypassed, timestamp, justification, incident/ticket link, follow-up action | Exactly what the auditor samples and what feeds the metrics |
| Tone | Blameless: "what made the safe path unusable?" | Punishment drives bypass underground |
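A review record can be checked mechanically against these standards. The sketch below assumes a plain dict per review with hypothetical key names, treats the SLA as exactly two business days, and counts only weekends as non-business days (no holiday calendar).

```python
from datetime import date, timedelta

# Hypothetical key names for the required fields listed above.
REQUIRED_FIELDS = (
    "actor", "gate_bypassed", "timestamp",
    "justification", "incident_link", "follow_up_action",
)
SLA_BUSINESS_DAYS = 2


def add_business_days(start: date, days: int) -> date:
    """Advance `days` weekdays from `start`, skipping Saturdays and Sundays."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days -= 1
    return current


def review_problems(record: dict, used_on: date, reviewed_on, today: date):
    """Return a list of problems; an empty list means the review meets the standard."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)]
    due = add_business_days(used_on, SLA_BUSINESS_DAYS)
    if reviewed_on is None:
        if today > due:
            problems.append("review overdue")
    elif reviewed_on > due:
        problems.append("reviewed after SLA")
    return problems
```

A glass break on a Friday is due the following Tuesday under this rule. Running the check over every open review is one way to feed the time-to-review and % reviewed metrics without anyone compiling them by hand.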

Fail-open vs fail-closed under incident (the gate infrastructure itself is down — distinct from break-glass):

| Situation | Choose | Why |
|---|---|---|
| Security/safety gate (auth check, prod-deploy approval) | Fail-closed | "Can't verify it's safe" must mean "don't proceed" |
| Availability-critical path where the gate is advisory | Fail-open (with alarm) | Don't let a down checker cause an outage it can't prevent |
| The gate is the safety control (the thing protecting prod) | Fail-closed + break-glass | Don't silently open; force the sanctioned, logged bypass instead |
| Any fail-open | Must alarm loudly + be temporary | A silent fail-open is an invisible, normalized bypass |
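The distinction between "the check said no" and "the checker could not answer" is easiest to see in code. This is a minimal sketch under assumed names: `CheckerUnavailable`, `FailMode`, and `evaluate_gate` are hypothetical, and logging stands in for a real paging or alerting integration.

```python
import logging
from enum import Enum

log = logging.getLogger("gates")


class FailMode(Enum):
    CLOSED = "fail-closed"
    OPEN_WITH_ALARM = "fail-open"


class CheckerUnavailable(Exception):
    """Raised by a check when the gate infrastructure itself cannot answer."""


def evaluate_gate(name, check, fail_mode):
    """Return True if the change may proceed through this gate."""
    try:
        # Normal path: the checker is up and its verdict is final.
        return bool(check())
    except CheckerUnavailable:
        if fail_mode is FailMode.OPEN_WITH_ALARM:
            # Never fail open silently: this must be loud and temporary.
            log.critical("gate %s is DOWN and failing open; page the owner", name)
            return True
        # Fail closed: the only way through is the sanctioned, logged bypass.
        log.error("gate %s is DOWN and failing closed; use break-glass in an emergency", name)
        return False
```

Note that a checker returning a real "no" is never overridden by the fail mode. Fail-open applies only when the infrastructure cannot produce a verdict at all.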

When to add a sanctioned bypass vs relax the gate.

| Signal | Add a sanctioned bypass (break-glass) | Relax / redesign the gate |
|---|---|---|
| The gate is right 99% of the time; emergencies are the 1% | ✅ Keep it strict, add the relief valve | |
| The gate has a high steady glass rate (it hurts routinely) | | ✅ The gate is mis-designed; fix it (05) |
| The check is flaky / slow (lies or blocks legit work) | | ✅ Fix the check; bypass just masks it |
| The risk the gate guards is genuinely critical | ✅ Strict gate + scoped break-glass | ❌ Don't weaken a gate that's earning its keep |
| People are already routing around it unsanctioned | ✅ Sanction + instrument the path they're using | And investigate why; usually the gate hurts |

Mental Models

  • Break-glass is the pressure-relief valve, not the hole. A boiler without a relief valve doesn't stay safe — it explodes at the weakest seam. The sanctioned fast-path channels the inevitable emergency through a route you designed and can review, instead of through whatever the most stressed engineer improvises.

  • The glass rate is your gates' pain, made visible. A gate with a high steady glass rate isn't a discipline problem — it's a mis-designed gate telling you which one to fix next. Police the gate, not the people.

  • You're optimizing for observability of deviation, not zero deviation. Under real production pressure, deviation will happen. The engineer who breaks glass and tells you is your ally; the one who quietly routes around the gate is the risk. Make the sanctioned path out-compete the unsanctioned one.

  • Normalization of deviance is the default trajectory. Any bypass that's easy, unobserved, and relieves real pain drifts toward "we always do this." Visibility, mandatory review, and fixing the painful gate are what hold the line — continuously, not once.

  • A sanctioned, logged, reviewed bypass is a compliance asset; an ad-hoc one is the finding. Documented + restricted + logged + reviewed = the emergency-change control auditors are testing. Having no program doesn't mean fewer emergencies; it means invisible ones.

  • Blameless keeps it honest; mandatory keeps it exceptional. Drop blameless and bypass goes underground; drop mandatory and it normalizes. The only stable point is both.


Common Mistakes

  1. Running strict gates with no sanctioned bypass. The gate doesn't survive the first real emergency — it gets deleted under pressure (often org-wide, often permanently) or routed around invisibly. The stricter the gate, the better its break-glass must be. Strictness without a relief valve is how you lose the gate entirely.

  2. Per-team improvisation instead of one program. Forty unofficial procedures can't be audited, measured, or improved. Consolidate to one documented procedure, then do the hard work of getting every team onto it.

  3. Implementing break-glass as "grant org-admin for an hour." That over-grants wildly — now the actor can also rewrite history and edit logs (War Story 3). Scope the grant to the one gate being bypassed, not to god-mode.

  4. Time-boxing by promise instead of construction. "Remember to turn it off" is a hope, not a control, and it is how branch protection ends up off for eleven days (War Story 1). The grant must auto-expire.

  5. Logging where the actor can edit the log. An audit trail the break-glass holder can alter is not an audit trail. Log out-of-band to a sink they can't reach (separate account, append-only/object-locked store, SIEM).

  6. Treating a high glass rate as a discipline problem. It's almost always a gate problem — a flaky or slow check people bypass to get legit work through (War Story 4). Investigate and fix the gate; policing the people just moves the deviation into the dark.

  7. Punishing sanctioned bypass. Dinging people for breaking glass selects for quietly working around the gate, which has no review to punish. Reward "I broke glass and filed the follow-up" over "I quietly worked around it" — you want the disclosure.

  8. Optional review. Blameless-but-optional normalizes deviance instantly and fails the audit. Every use, reviewed — ideally auto-filed at time of use so it can't be skipped.

  9. A break-glass path that's invisible during incidents. If it's not in the runbook, responders improvise an unsanctioned bypass under pressure. Bake the SEV1 fast-path into the incident runbook and rehearse it.
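Mistake 5 turns on what "tamper-evident" means in practice. One common way to get that property is a hash chain, sketched below as an illustration; it is a swapped-in technique, not something the program description prescribes, and it complements rather than replaces an out-of-band sink. The entry fields are hypothetical. The chain only helps if the latest hash is also stored somewhere the actor cannot reach, because anyone who can rewrite the whole list can recompute every hash.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash, entry):
    # Canonical JSON (sorted keys) so the same entry always hashes the same way.
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log, entry):
    """Append an entry whose hash commits to the entire history before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev": prev_hash, "hash": _digest(prev_hash, entry)})


def verify(log):
    """Return True only if no record has been altered, removed, or reordered."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True
```

Editing any earlier entry changes its digest, which breaks the link to every record after it, so a quiet after-the-fact edit shows up as a failed `verify`.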


Test Yourself

  1. A peer argues that adding a sanctioned admin-override "weakens our gates and is a security hole." Make the counter-argument in one or two sentences, using the relief-valve framing.
  2. One gate shows a 35% break-glass rate; every other gate is near zero. What's your first hypothesis, and what do you do — and what do you explicitly not do?
  3. Name the five non-negotiable properties of a well-designed break-glass program.
  4. Why must the break-glass grant be scoped to the gate rather than implemented as "org-admin for an hour"? Give the two distinct risks that "org-admin for an hour" creates.
  5. An auditor asks you to demonstrate emergency-change management for SOC2. What four properties of your program satisfy the control, and what does the evidence pipeline need to produce per use?
  6. Explain why a break-glass review must be both blameless and mandatory. What specifically fails if you drop each half?
  7. "We always admin-merge on Fridays" has become routine. Name the concept, and give the three structural levers (not "tell people to stop") you'd use to fight it.
  8. During a SEV1, the fix is blocked by a gate. Should you bypass all gates to move fast? Explain the "restore service vs don't make it worse" resolution.

Answers

  1. Break-glass is the **pressure-relief valve, not the hole**: without a sanctioned fast-path, the inevitable emergency forces people to *delete* the gate (often org-wide and permanently) or route around it invisibly, and both are *worse* than a scoped, logged, reviewed bypass. The relief valve is what lets the gate *stay* strict.
  2. First hypothesis: **the gate is mis-designed** (likely a flaky or slow required check people bypass to ship legit work), not a discipline problem. **Do:** investigate the gate itself and fix the root cause (feed it to [gate design](../05-gate-design-speed-vs-safety/professional.md)); typically the glass rate then falls on its own. **Don't:** tell that team to "stop bypassing" or treat it as a behavior issue; policing just pushes the deviation into the dark.
  3. (1) A single **documented procedure**; (2) **pre-authorized actors + scopes**; (3) **time-boxed JIT grants** that auto-expire; (4) **tamper-evident out-of-band logging**; (5) **mandatory blameless post-use review** (ideally auto-filed).
  4. Scope it to the gate so the grant bypasses *only the one control*. "Org-admin for an hour" creates two distinct risks: (a) the actor can do *far more* than bypass the gate (delete repos, rewrite history); and (b) critically, the actor gains rights over the **audit log itself** and can edit it (War Story 3), so the trail isn't tamper-evident.
  5. Documented + restricted + logged + reviewed: a **documented procedure**, **restricted** (pre-authorized, scoped, time-boxed) access, **out-of-band immutable logging**, and **mandatory post-use review**. The evidence pipeline must produce, *per use*: actor, gate bypassed, timestamp, justification, incident/ticket link, and the completed/signed-off review, delivered as a *query*, not a Slack archaeology dig.
  6. **Blameless** so people don't go underground: punishing sanctioned bypass selects for *unsanctioned* bypass (which has no review to punish). **Mandatory** so the bypass stays exceptional: optional review normalizes deviance instantly and fails the audit. Drop blameless → bypass goes dark; drop mandatory → bypass becomes routine.
  7. **Normalization of deviance** (Vaughan). Structural levers: (a) **visibility**: a dashboard + standing forum that denies the deviation its camouflage; (b) **mandatory review of every use** as the deterrent; (c) **fix the root-cause gate** (here, the Friday review-SLA pain) so people don't *need* to bypass, plus leadership air-cover for slowing down to fix it.
  8. **No: bypass only what's in the way of the fix, and keep the cheap safety nets.** Resolution: scope the bypass to the blocking gate (the flaky check, the slow approval), but *keep* the gates that cost seconds and prevent making it worse (fast smoke test, lint, canary), and get the in-incident change a second pair of eyes when feasible. "Bypass everything" is how a SEV1 becomes a SEV0; pre-approve known emergency changes (rollback, scale-up) so they're not even a decision.

Cheat Sheet

THE THESIS
  Break-glass = pressure-relief valve, NOT a hole.
  No sanctioned path → people DELETE the gate or route around it invisibly (worse).
  Stricter gate → BETTER break-glass needed. Make the safe path the easy path.

PROGRAM = 5 NON-NEGOTIABLES
  ONE documented procedure (org-wide, not per-team)
  pre-authorized actors + scopes (scope to the GATE, not org-admin)
  time-boxed JIT grants (auto-expire by construction, not by promise)
  tamper-evident OUT-OF-BAND log (actor can't edit it)
  mandatory blameless review (auto-filed at time of use)

METRICS (run them in a standing ops/quality review)
  glass rate PER GATE      ← leading indicator; high rate = MIS-DESIGNED gate
  time-to-review           ← within SLA (~2 biz days); rising = rotting
  % reviewed               ← 100% or "mandatory" is a lie
  waiver/exception age     ← old "temporary" exemptions are debt
  unsanctioned-bypass count← should fall as sanctioned path wins

HIGH GLASS RATE ON A GATE → FIX THE GATE, not the people (→ 05 gate design)

NORMALIZATION OF DEVIANCE ("we always admin-merge on Fridays")
  levers: visibility + mandatory review + FIX THE PAINFUL GATE + leadership air-cover
  default trajectory of any easy/unobserved/pain-relieving bypass

COMPLIANCE  documented + restricted + logged + reviewed
  = SOC2/SOX/PCI emergency-change control SATISFIED
  ad-hoc admin merge = the FINDING; sanctioned break-glass = the remediation
  evidence = a QUERY (actor/gate/time/justification/link/review), not archaeology

INCIDENT INTEGRATION
  bake the SEV1 fast-path INTO the runbook; rehearse it
  bypass what's IN THE WAY, keep cheap safety nets (smoke/lint/canary)
  pre-approve known emergency changes (rollback, scale-up)

CULTURE
  blameless (or it goes underground) AND mandatory (or it normalizes)
  reward "I broke glass + filed follow-up" > "I quietly worked around it"

Summary

  • Break-glass is the control that makes strict gates survivable, not a security hole. Without a sanctioned fast-path, the inevitable emergency forces teams to delete the gate (often org-wide and permanently) or invent unsafe, invisible workarounds — both strictly worse than a scoped, logged, reviewed bypass. Make the safe path the easy path so it out-competes the unsanctioned ones.
  • The program is a standard, not a feature: one documented procedure, pre-authorized actors with gate-scoped (not org-admin) grants, time-boxed JIT that auto-expires, tamper-evident out-of-band logging, and mandatory blameless review — ideally auto-filed at time of use.
  • Glass rate per gate is the leading indicator and a gate-quality signal: a high steady rate means the gate is mis-designed, and the fix is to the gate (05), not to the people. Run glass rate, time-to-review, % reviewed, and waiver age in a standing review.
  • Normalization of deviance is the default trajectory of any easy, unobserved, pain-relieving bypass. Fight it structurally — visibility, mandatory review, fast root-cause gate fixes, and leadership air-cover — continuously.
  • A well-run program is a compliance asset: documented + restricted + logged + reviewed = the SOC2/SOX/PCI emergency-change control, with the evidence as a query. Ad-hoc bypass is the finding; sanctioned break-glass is the remediation.
  • Integrate it with incident response — bake the SEV1 fast-path into the runbook, bypass only what's in the way (keep the cheap safety nets), and pre-approve known emergency changes. Hold the culture blameless and mandatory, and reward disclosure over silent workarounds.

You can now own the break-glass program as an org-level control — the relief valve that keeps your gates strict, your deviations observable, and your auditors satisfied. The remaining tier — interview.md — consolidates the topic into the questions that probe whether someone actually understands all of this.


Further Reading

  • Site Reliability Engineering (Google) — Managing Incidents & Postmortem Culture — the blameless, runbook-driven incident model the SEV1 fast-path lives inside.
  • The SRE Workbook — Incident Response & On-Call — operationalizing the runbook, pre-approved changes, and escalation.
  • Diane Vaughan, The Challenger Launch Decision — the foundational analysis of normalization of deviance; the lens for the "we always admin-merge on Fridays" failure.
  • AICPA SOC 2 Trust Services Criteria & SOX ITGC change-management guidance — what auditors mean by emergency-change control (documented + restricted + logged + reviewed).
  • PCI-DSS Requirement 6 (change management) — emergency-change expectations in a regulated payments context.
  • interview.md — the whole topic distilled into interview-grade questions and model answers.