Memory-Safety Mechanisms: Professional Level
Topic: Memory-Safety Mechanisms
Focus: Hardware-enforced memory safety (ARM MTE memory tagging, CHERI capabilities, fat pointers), the economics of spatial vs temporal protection at silicon scale, bounds-check elimination by optimizers, and the industry-wide migration to memory-safe languages: the data, the strategy, and how to lead it.
Table of Contents
- Introduction
- Prerequisites
- Glossary
- Core Concepts
- Real-World Analogies
- Mental Models
- Code Examples
- Pros & Cons
- Use Cases
- Coding Patterns
- Best Practices
- Edge Cases & Pitfalls
- Common Mistakes
- Tricky Points
- Test Yourself
- Cheat Sheet
- Summary
- Further Reading
- Related Topics
- Diagrams & Visual Aids
Introduction
Focus: When you have billions of lines of C/C++ you can't rewrite, and memory-safety bugs still account for ~70% of severe CVEs despite software discipline and sanitizers, what do you do? The professional answer has two prongs: move the enforcement into the hardware (MTE, CHERI) and migrate to memory-safe languages where you can, leading both with evidence.
By this level you understand the bug taxonomy, the detection/mitigation tooling, and the language designs. The remaining problem is scale and economics. Sanitizers are too slow for production. Hardened allocators only raise exploit cost. Rust rewrites are expensive and can't touch the entire existing C/C++ estate at once. The two frontiers that change the cost curve are:
- Hardware-enforced safety. If the silicon checks every memory access against a tag or a capability, you get safety (or near-safety) at production-affordable cost, without rewriting in a new language. Two designs lead: ARM MTE (Memory Tagging Extension: a probabilistic, low-overhead lock-and-key on every allocation, shipping in production phones) and CHERI (capabilities: fat pointers carrying hardware-enforced bounds and permissions, giving deterministic spatial safety and, with an added revocation scheme, temporal safety, in research/early-product silicon like Arm Morello).
- The industry migration. The strategy is backed by hard data: Microsoft's ~70%, Chromium's ~70%, and Google's finding that the vast majority of memory bugs are in new or recently-modified code. What actually moves the needle is "safe by default for new code": stop adding C/C++ where you can, write new code in Rust/safe languages, and watch the memory-bug rate fall. Android did exactly this and saw memory-safety vulnerabilities drop from ~76% of total (2019) toward ~24% (2024) as the fraction of new memory-safe code rose, with essentially zero memory-safety CVEs in the new Rust code.
🎓 Why this matters at professional level: You are now the person who sets the org's memory-safety strategy: which mitigations to enable in the production toolchain, whether to adopt MTE, how to phase a Rust migration, how to argue the ROI to leadership, and how to measure success. This requires understanding the mechanisms well enough to know their guarantees and costs, and the data well enough to make the economic case. This page is that toolkit.
This page covers MTE and CHERI in mechanism detail, fat pointers, the deep reason temporal safety resists cheap hardware enforcement, bounds-check elimination, and the migration playbook with its evidence base. It is the capstone of the topic's mechanisms; interview.md and tasks.md exercise all four tiers.
Prerequisites
- Required (senior level): Rust ownership/borrowing, managed-runtime GC safety and its leaks (Unsafe/cgo/data races), ARC.
- Required (middle level): shadow memory, redzones, quarantine, guard pages, hardened allocators, spatial-cheap/temporal-expensive.
- Helpful: virtual memory, MMU, TLB, cache lines, pointer representation (64-bit pointers don't use all 64 bits — top-byte-ignore).
- Helpful: familiarity with security-mitigation history (ASLR, DEP/NX, stack canaries, CFI) and the attacker-defender cost dynamic.
- Helpful: experience leading a migration or toolchain rollout across a large codebase.
You do not need: microarchitecture/silicon design depth; this is the systems-software view of hardware features.
Glossary
| Term | Definition |
|---|---|
| MTE (Memory Tagging Extension) | ARMv8.5+ feature: every 16-byte memory granule and every pointer carry a 4-bit tag; a mismatch on access faults. Probabilistic lock-and-key. |
| Memory tagging / lock-and-key | Tag the allocation (lock) and the pointer (key); the hardware checks key==lock on every access. Catches OOB and UAF when tags differ. |
| Granule | The unit of tagged memory in MTE: 16 bytes share one tag. |
| Top-Byte-Ignore (TBI) | ARM feature where the top byte of a 64-bit pointer is ignored for addressing — MTE stores its 4-bit key there. |
| CHERI | Capability Hardware Enhanced RISC Instructions: pointers become unforgeable 128-bit capabilities carrying base, length, permissions, and a validity tag. |
| Capability | A hardware-protected fat pointer: address + bounds + permissions + a 1-bit validity tag (kept out-of-band) that the CPU checks; can't be forged or widened. |
| Provenance | The lineage of a pointer/capability — where its authority came from. CHERI enforces that you can only narrow authority, never invent it. |
| Fat pointer | A pointer that carries bounds (base+length) alongside the address, enabling per-access spatial checks. CHERI capabilities are hardware fat pointers. |
| Monotonic non-increase of authority | CHERI rule: derived capabilities can only have ≤ the authority of their parent — you can shrink bounds/permissions, never grow them. |
| Bounds-check elimination (BCE) | Compiler optimization that removes a bounds check when it can prove the index is always in range. |
| ASLR / DEP(NX) / CFI | Classic exploit mitigations: randomize layout / make data non-executable / restrict indirect branches. Probabilistic or partial; not memory safety. |
| Memory-safe language (MSL) | A language that guarantees memory safety by default (Rust, Go, Java, C#, Swift, Python, JS, …). |
| "Safe by default for new code" | The migration strategy: write new/changed code in MSLs while leaving stable old code in place, since bugs concentrate in new code. |
| Vulnerability density | Memory-safety bugs per unit of code per unit of age — empirically highest in new/recently-changed code. |
| Sync-tag / async-tag mode | MTE checking modes: synchronous (precise fault at the access, higher cost) vs asynchronous (cheaper, imprecise — fault reported later). |
Core Concepts
1. ARM MTE: Probabilistic Lock-and-Key, Production-Affordable
MTE turns memory safety into a hardware tag match. Memory is divided into 16-byte granules; each granule carries a 4-bit tag stored in separate tag memory. Pointers carry a matching 4-bit tag in their unused top byte (via Top-Byte-Ignore). On every load/store, the hardware compares the pointer's tag (key) against the granule's tag (lock). Mismatch → fault.
How it catches the bug classes:
- Spatial (OOB): the allocator gives adjacent allocations different tags. An overflow from allocation A (tag 5) into allocation B (tag 9) accesses a granule tagged 9 with a pointer tagged 5 → mismatch → fault.
- Temporal (UAF): on `free`, the allocator re-tags the freed granules to a new value. The old dangling pointer still carries the old tag; using it now mismatches → fault. This is the key win: MTE provides temporal protection cheaply, which sanitizers and guard pages struggle to do in production.
Why "probabilistic": only 4 bits = 16 possible tags. With ~1/16 chance, a wild access lands on a granule that happens to share the pointer's tag and is missed. So MTE is not a proof of safety — it's a high-probability detector. But across a fleet of billions of devices, a 15/16 catch rate makes most bugs surface (and, as a mitigation, makes exploitation unreliable — an attacker must win the tag lottery repeatedly). It ships in production (Pixel 8+ offers MTE; Android's "Advanced Protection" enables it), at single-digit-percent overhead in async mode — orders of magnitude cheaper than ASan, making it viable in production, not just testing.
MTE's two modes matter operationally: synchronous faults precisely at the bad access (great for debugging, higher cost) and asynchronous batches checks for lower overhead but imprecise blame (great for fleet hardening). Many deployments run async in production and sync in testing.
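The lock-and-key check described above can be modeled in a few lines of Rust. This is a toy simulation under stated assumptions, not how real hardware or any real allocator is implemented: `TaggedHeap`, `TaggedPtr`, and `fresh_tag` are hypothetical names, the heap is a bump allocator, and tags come from a small deterministic cycle so the behaviour is reproducible (a real allocator picks tags randomly, which is where the 1/16 collision comes from).

```rust
/// Size of one tagged granule in bytes, as in MTE.
const GRANULE: usize = 16;

/// A simulated pointer: an offset plus a 4-bit "key".
/// (Real MTE keeps the key in the pointer's ignored top byte.)
#[derive(Clone, Copy, Debug, PartialEq)]
struct TaggedPtr {
    addr: usize,
    tag: u8,
}

/// A simulated heap that stores one 4-bit "lock" per granule.
struct TaggedHeap {
    tags: Vec<u8>,
    next_granule: usize,
    seed: u8,
}

impl TaggedHeap {
    fn new(granules: usize) -> Self {
        TaggedHeap { tags: vec![0; granules], next_granule: 0, seed: 1 }
    }

    /// Pick a tag in 1..=15 that differs from `avoid`.
    /// Deterministic on purpose; this toy never collides, real MTE can.
    fn fresh_tag(&mut self, avoid: u8) -> u8 {
        loop {
            self.seed = self.seed % 15 + 1; // cycles through 1..=15
            if self.seed != avoid {
                return self.seed;
            }
        }
    }

    /// Bump-allocate `size` bytes rounded up to whole granules, tagging
    /// them differently from the granule immediately before.
    fn malloc(&mut self, size: usize) -> Option<TaggedPtr> {
        let n = (size + GRANULE - 1) / GRANULE;
        let start = self.next_granule;
        if n == 0 || start + n > self.tags.len() {
            return None;
        }
        let neighbour = if start > 0 { self.tags[start - 1] } else { 0 };
        let tag = self.fresh_tag(neighbour);
        for g in start..start + n {
            self.tags[g] = tag;
        }
        self.next_granule = start + n;
        Some(TaggedPtr { addr: start * GRANULE, tag })
    }

    /// Re-tag the freed granules so stale pointers stop matching.
    fn free(&mut self, p: TaggedPtr, size: usize) {
        let n = (size + GRANULE - 1) / GRANULE;
        let start = p.addr / GRANULE;
        let new_tag = self.fresh_tag(p.tag);
        for g in start..start + n {
            self.tags[g] = new_tag;
        }
    }

    /// The comparison the hardware performs on every load/store.
    fn check(&self, p: TaggedPtr, offset: usize) -> Result<(), String> {
        let g = (p.addr + offset) / GRANULE;
        match self.tags.get(g) {
            Some(&lock) if lock == p.tag => Ok(()),
            Some(&lock) => Err(format!("tag fault: key {} vs lock {}", p.tag, lock)),
            None => Err("address outside simulated heap".to_string()),
        }
    }
}
```

In this model, allocating two adjacent blocks and calling `check` one byte past the first block reports a tag fault, and calling `check` through the old pointer after `free` reports one as well. Those are the spatial and temporal cases from the list above.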
2. CHERI: Deterministic Safety via Unforgeable Capabilities
CHERI is the more radical, deterministic design. It replaces integer pointers with capabilities: 128-bit objects that bundle the address with its bounds (base + length), permissions (read/write/execute), and a separate 1-bit validity tag kept out-of-band in tagged memory. The hardware enforces three unbreakable rules:
- Bounds are checked on every dereference — access outside [base, base+length) faults. Deterministic spatial safety; no 1/16 gap.
- Capabilities are unforgeable — you cannot fabricate a valid capability from integer arithmetic. The validity tag is cleared the moment you do non-capability operations on the bits. (Provenance is enforced by hardware.)
- Authority only narrows (monotonic non-increase) — a capability derived from another can have equal or less authority (smaller bounds, fewer permissions), never more. You can hand out a restricted view; you cannot escalate.
Because every pointer carries its own bounds, CHERI is essentially hardware fat pointers enforced by the CPU. This gives deterministic spatial safety to unmodified C/C++ semantics (you recompile for CHERI; pointers become capabilities), eliminating buffer overflows and OOB by construction. Temporal safety is harder even for CHERI (a capability can still point at freed-then-reused memory), but CHERI enables efficient temporal-safety schemes (e.g. CHERIvoke / Cornucopia-style sweeping revocation: invalidate all capabilities to freed regions before reuse, made cheap because capabilities are findable in memory via their tags).
CHERI is realized in research/early silicon — notably Arm Morello (an experimental CHERI-enabled aarch64 board) and the CHERI-RISC-V efforts. It's not yet mainstream, but it represents the "deterministic memory safety in hardware for legacy C/C++" endgame. The cost is real: ~doubled pointer size (cache/memory pressure), ISA and toolchain changes, and an ecosystem port.
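The three rules can be made concrete with a small software model. This is a sketch under assumptions, not the CHERI encoding: real capabilities are compressed hardware values whose tag lives out-of-band, whereas here `Capability`, `Perms`, `Fault`, and the `valid` field are illustrative names, and a non-monotonic derivation is modeled as producing an untagged (invalid) capability.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Perms {
    read: bool,
    write: bool,
}

const RW: Perms = Perms { read: true, write: true };

/// Toy capability: bounds + permissions + a validity flag that stands in
/// for the out-of-band hardware tag bit.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Capability {
    base: usize,
    len: usize,
    perms: Perms,
    valid: bool,
}

#[derive(Debug, PartialEq)]
enum Fault {
    Tag,
    Bounds,
    Permission,
}

impl Capability {
    /// A root capability, as the allocator or OS would hand out.
    fn root(base: usize, len: usize) -> Self {
        Capability { base, len, perms: RW, valid: true }
    }

    /// Bits assembled from plain integers never carry a valid tag.
    fn forged(base: usize, len: usize) -> Self {
        Capability { base, len, perms: RW, valid: false }
    }

    /// Derive a capability with the requested bounds and permissions.
    /// Any attempt to grow authority yields an invalid capability.
    fn narrow(&self, base: usize, len: usize, perms: Perms) -> Capability {
        let parent_end = self.base.saturating_add(self.len);
        let within = base >= self.base
            && base.checked_add(len).map_or(false, |end| end <= parent_end);
        let perms_ok = (!perms.read || self.perms.read)
            && (!perms.write || self.perms.write);
        Capability { base, len, perms, valid: self.valid && within && perms_ok }
    }

    /// Checks applied to a load at `offset`; returns the address on success.
    fn check_load(&self, offset: usize) -> Result<usize, Fault> {
        self.check(offset, self.perms.read)
    }

    /// Checks applied to a store at `offset`.
    fn check_store(&self, offset: usize) -> Result<usize, Fault> {
        self.check(offset, self.perms.write)
    }

    fn check(&self, offset: usize, permitted: bool) -> Result<usize, Fault> {
        if !self.valid {
            return Err(Fault::Tag);
        }
        if !permitted {
            return Err(Fault::Permission);
        }
        if offset >= self.len {
            return Err(Fault::Bounds);
        }
        Ok(self.base + offset)
    }
}
```

A read-only view made with `narrow` rejects stores, a `narrow` call that asks for wider bounds than its parent produces a capability that faults on use, and nothing built with `forged` can ever be dereferenced. All three outcomes are deterministic in this model.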
3. MTE vs CHERI: The Spectrum of Hardware Safety
| Dimension | MTE | CHERI |
|---|---|---|
| Guarantee | Probabilistic (1/16 miss) | Deterministic spatial |
| Mechanism | tag match (lock/key) | bounds+perms in capability |
| Temporal safety | yes (re-tag on free), probabilistic | needs revocation scheme, deterministic when applied |
| Pointer size | unchanged (tag in spare byte) | ~2× (128-bit capability) |
| Deployment | production today (ARMv8.5+, Pixel) | research/early (Morello) |
| Overhead | single-digit % (async) | moderate, mostly memory/cache |
| Compat | recompile + tagging allocator | recompile for CHERI ABI |
The professional reading: MTE is the pragmatic near-term win — affordable, shipping, dramatically raises the bar — at the cost of being probabilistic. CHERI is the principled long-term answer — deterministic, comprehensive — at the cost of not being here yet at scale. They are not mutually exclusive; the industry is pursuing both.
4. Why Temporal Safety Resists Cheap Hardware Too
The senior-level theme — temporal is harder than spatial — persists into hardware. Spatial safety needs only local info: the bounds of the object you're accessing (MTE's adjacent-tag trick, CHERI's per-pointer bounds). Temporal safety needs a global, time-varying fact: "has this exact object been freed and the memory reused?"
- MTE handles it probabilistically by re-tagging on free — but tag reuse (only 16 values) means a freed-then-reallocated region can collide tags.
- CHERI handles it deterministically only with an added revocation sweep: before reusing freed memory, find and invalidate every capability that points into it (feasible because capabilities are tagged and thus findable — impossible with raw integer pointers). That sweep has real cost, though research shows it can be amortized to small overhead.
The durable lesson: any time you see disproportionate machinery (GC, quarantine, MTE re-tagging, CHERI revocation), it is buying temporal safety. Spatial safety is comparatively cheap everywhere — software or silicon.
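A minimal sketch of the revocation idea, assuming a toy memory in which every word is explicitly either data or a capability (the role the hardware tag bit plays). `Word` and `revoke` are hypothetical names, and this is not the CHERIvoke or Cornucopia algorithm, only the core sweep they build on.

```rust
/// One word of simulated memory. The enum discriminant stands in for the
/// hardware tag that distinguishes capabilities from plain data.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Word {
    Data(u64),
    Cap { base: usize, len: usize, valid: bool },
}

/// Sweep memory and invalidate every valid capability whose bounds overlap
/// the freed region [freed_base, freed_base + freed_len).
/// Returns the number of capabilities revoked.
fn revoke(memory: &mut [Word], freed_base: usize, freed_len: usize) -> usize {
    let freed_end = freed_base.saturating_add(freed_len);
    let mut revoked = 0;
    for word in memory.iter_mut() {
        // Plain data is skipped even if its value looks like an address:
        // only words known to be capabilities are candidates.
        if let Word::Cap { base, len, valid } = word {
            let end = base.saturating_add(*len);
            let overlaps = *base < freed_end && freed_base < end;
            if *valid && overlaps {
                *valid = false;
                revoked += 1;
            }
        }
    }
    revoked
}
```

The `Word::Data` arm is the point of the example: with raw integer pointers every word would be `Data`, and the sweep would have nothing it could safely invalidate.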
5. Bounds-Check Elimination: Why Spatial Safety Is Often Free
A frequent objection to safe languages is "bounds checks cost performance." In practice they often cost nothing, because optimizing compilers perform bounds-check elimination (BCE): when the compiler can prove an index is in range, it removes the check.
- A `for i in 0..arr.len()` loop: the compiler knows `i < len` by construction → no per-iteration check.
- Range analysis / induction-variable reasoning proves many derived indices safe.
- Hoisting: a check that's invariant across a loop is done once before the loop.
- Rust and modern JITs (HotSpot, V8) are aggressive here; Go has improved BCE substantially over releases.
The practical implication for a professional: don't disable bounds checks reflexively for performance. Measure first. The checks that survive optimization are usually on genuinely-dynamic indices where you want the safety, and the cost is a well-predicted branch. The narrative "safe = slow" is largely obsolete; the real cost of safety today is GC pauses/footprint and Rust authoring effort, not bounds checks.
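When a hot loop indexes two slices, a common way to help the optimizer is to establish the length relationship once, before the loop. The sketch below shows that hoisting pattern; `dot` and `dot_iter` are illustrative names, and whether the per-iteration checks are actually removed depends on the compiler version and optimization level, so inspect the generated assembly or profile rather than assume.

```rust
/// Dot product over the common prefix of two slices.
fn dot(a: &[i64], b: &[i64]) -> i64 {
    let n = a.len().min(b.len());
    // One up-front re-slice per input. After this, both slices are known
    // to have exactly `n` elements, which gives the optimizer what it
    // needs to drop the checks on `a[i]` and `b[i]` inside the loop.
    let (a, b) = (&a[..n], &b[..n]);
    let mut sum = 0;
    for i in 0..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Iterator form: no indexing in the source, so no bounds checks to remove.
fn dot_iter(a: &[i64], b: &[i64]) -> i64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Both functions stay fully safe Rust. The indexed version still panics rather than reading out of bounds if the reasoning were ever wrong, which is the property worth keeping.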
6. The Migration Playbook: Backed by Data
The hardware story protects existing C/C++. The language story prevents new bugs. The evidence base that drives strategy:
- ~70% of severe CVEs are memory-safety bugs (Microsoft, Chromium — independently).
- Bugs concentrate in new/recently-changed code. Google's analysis showed memory-safety vulnerability density is highest in young code and decays as code ages and is hardened. This is the strategic linchpin.
- Android's results: as the proportion of new code written in memory-safe languages rose, the fraction of memory-safety vulnerabilities fell from ~76% (2019) to ~24% (2024) — without rewriting the old code, because the new code (where bugs are born) stopped producing them. New Rust code shipped with near-zero memory-safety CVEs.
This yields the "safe by default for new code" strategy and the migration playbook:
- Stop the bleeding: mandate memory-safe languages for new components and new high-risk surfaces (parsers, network-facing code handling untrusted input). You get most of the benefit because that's where most new bugs are.
- Harden the legacy in place: production mitigations (`_FORTIFY_SOURCE`, hardened allocator, stack canaries, CFI, enable MTE where the hardware supports it), plus sanitizers + fuzzing in CI.
- Rewrite selectively, by risk: rewrite the highest-exposure legacy components (the ones touching attacker-controlled data) into Rust/MSL, not as a big-bang rewrite. Interop carefully across the FFI boundary (the new attack surface).
- Measure: track memory-safety CVE fraction, new-code MSL percentage, and bug density by code age. Report the trend to leadership; the curve is the ROI story.
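The "Measure" step reduces to a few ratios tracked per period. The following is a minimal sketch of those calculations, assuming you already have per-year counts from your own tracker; `YearStats` and its fields are a hypothetical schema, and no real data is embedded here.

```rust
/// Per-year counts pulled from a vulnerability tracker and a code-volume
/// report. Field names are illustrative, not a real schema.
struct YearStats {
    year: u32,
    memory_safety_cves: u32,
    total_cves: u32,
    new_loc_msl: u64,
    new_loc_total: u64,
}

/// `part` as a percentage of `whole`; `None` when there is nothing to divide by.
fn percent(part: u64, whole: u64) -> Option<f64> {
    if whole == 0 {
        None
    } else {
        Some(100.0 * part as f64 / whole as f64)
    }
}

/// Share of the year's CVEs that are memory-safety bugs.
fn cve_share_pct(s: &YearStats) -> Option<f64> {
    percent(u64::from(s.memory_safety_cves), u64::from(s.total_cves))
}

/// Share of the year's new lines written in memory-safe languages.
fn msl_new_code_pct(s: &YearStats) -> Option<f64> {
    percent(s.new_loc_msl, s.new_loc_total)
}

/// (year, memory-safety CVE share) pairs, skipping years with no CVEs.
fn cve_share_trend(history: &[YearStats]) -> Vec<(u32, f64)> {
    history
        .iter()
        .filter_map(|s| cve_share_pct(s).map(|pct| (s.year, pct)))
        .collect()
}

/// True when the share never rises from one reported year to the next.
/// Trivially true for zero or one data point.
fn cve_share_is_falling(history: &[YearStats]) -> bool {
    cve_share_trend(history).windows(2).all(|w| w[1].1 <= w[0].1)
}
```

Reporting `cve_share_trend` next to `msl_new_code_pct` keeps the outcome metric and the activity metric on the same chart, which is what the leadership conversation needs.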
CISA and the NSA now publish formal guidance ("memory-safe roadmaps") recommending exactly this, and the White House ONCD has urged the move. As a professional you're expected to translate this into a concrete plan for your codebase.
7. Mitigations Are Not Safety: Know the Difference
A professional must not conflate exploit mitigations with memory safety:
- ASLR, DEP/NX, stack canaries, CFI, hardened allocators make exploitation harder/less reliable. They are probabilistic or partial, and the attacker-defender arms race routinely bypasses each (info leaks defeat ASLR, JIT-spray defeats DEP, etc.). The bug is still there.
- Memory safety (safe languages, deterministic CHERI) makes the bug class impossible or always-caught. There's nothing to bypass.
- MTE sits in between: a strong, cheap, probabilistic mitigation that's close enough to safety to meaningfully change the economics, but is not a deterministic guarantee.
Communicating this distinction — "we have many mitigations, but the only thing that removes the bug class is memory-safe code" — is part of the senior-to-principal security conversation.
Real-World Analogies
- MTE = colored wristbands at a festival. Each zone (allocation) is one of 16 colors; your wristband (pointer) must match the zone you enter. Wander into the wrong zone and security stops you, unless you happen to be wearing that zone's color (1/16). Cheap to run at the gate, catches most gate-crashers, not foolproof.
- CHERI capability = a tamper-proof access pass with the exact rooms printed on it. Your pass literally encodes "rooms 100–110, read-only," hardware-verified, and you physically cannot edit it to add rooms (unforgeable, narrow-only). Walk one door too far and the reader denies you, deterministically, every time.
- Temporal safety = the hard part of badge revocation. Issuing a correct badge (spatial) is easy. Making absolutely sure every copy of an old badge stops working the instant an employee leaves (temporal) requires sweeping the whole building to collect them (CHERI revocation). That is the expensive part, in any building.
- "Safe by default for new code" = stop digging. You're in a hole (legacy C). You don't have to fill the whole hole tomorrow, but stop digging it deeper. New code in safe languages stops adding to the problem, and that alone bends the curve, because most new bugs are in new dirt.
- Mitigation vs safety = a better lock vs no door. A pick-resistant lock (ASLR/canary/MTE) slows burglars but can be picked. A wall where the door used to be (memory-safe language) can't be picked because there's no door. Mitigations buy time; safety removes the opening.
Mental Models
Model 1: Hardware safety moves the check from instrumentation to silicon. Sanitizers pay ~2× because software checks shadow memory. MTE/CHERI make the CPU do the tag/bounds check as part of the load/store pipeline — single-digit-% or moderate cost. Same idea (check every access), radically cheaper enforcer. That cost collapse is what makes production safety feasible.
Model 2: Probabilistic vs deterministic is the central hardware trade. MTE (probabilistic, cheap, shipping) and CHERI (deterministic, costlier, emerging) are two points on one curve. Choose by threat model and timeline: MTE now to bend the curve; CHERI as the eventual deterministic floor.
Model 3: Strategy follows the bug-age data. Because bugs concentrate in new code, "new code must be safe" captures most of the benefit for a fraction of the cost of a full rewrite. Lead with this; it's the difference between an affordable plan and an impossible one.
Model 4: Two prongs, not one. No single lever solves an industry-scale C/C++ estate. You simultaneously harden legacy (mitigations + MTE) and prevent new bugs (MSLs), and you measure the trend to prove it's working. Anyone selling a single silver bullet (just Rust / just MTE / just sanitizers) is wrong.
Code Examples
Conceptual/defensive. These illustrate mechanisms and tooling, not exploits.
MTE conceptually: tag-on-alloc, retag-on-free
```text
malloc(size):
    p   = allocate(size)
    tag = random_4bit_other_than(tag_of_neighbor)  // different tag from neighbors
    color_granules(p, size, tag)                   // set "lock" in tag memory
    return p_with_tag(p, tag)                      // pointer carries "key" in top byte

free(p):
    size    = allocation_size(p)                   // looked up from allocator metadata
    new_tag = random_4bit()                        // RE-TAG the freed region
    color_granules(p, size, new_tag)               // old pointers now mismatch -> UAF caught
    actually_free(p)

load/store *q:
    if tag_of(q) != granule_tag_at(q): FAULT       // hardware does this every access
```
This is why MTE catches both adjacent-overflow (neighbor has a different tag) and use-after-free (region re-tagged on free) — cheaply, in hardware, probabilistically (1/16 collision).
Enabling MTE-style hardening in a real toolchain
```sh
# Compile with stack and heap tagging (ARMv8.5+ target, hardware-gated):
clang -target aarch64-linux-android \
    -fsanitize=memtag-stack -fsanitize=memtag-heap \
    -march=armv8.5-a+memtag app.c -o app
# At runtime the kernel/allocator must enable MTE (sync or async mode).
# This is a PRODUCTION mitigation, unlike -fsanitize=address (testing only).
```
Bounds-check elimination: the optimizer removes the cost
```rust
// Rust: in optimized builds the compiler can prove every index is in range,
// so the per-iteration bounds check is typically elided.
fn sum(a: &[u32]) -> u64 {
    let mut s = 0u64;
    for i in 0..a.len() {   // i < a.len() holds by construction
        s += a[i] as u64;   // still checked in the source; the redundant check is optimized away
    }
    s
}

// The idiomatic iterator version never indexes at all, so there is no
// bounds check to eliminate, and it is just as safe:
fn sum_iter(a: &[u32]) -> u64 {
    a.iter().map(|&x| x as u64).sum()
}
```
Migration: a risk-ranked plan as code review policy
```text
POLICY (enforced in CI + review):
  new network-facing parser?           -> MUST be Rust (or other MSL). Block C/C++.
  new code in existing C/C++ module?   -> allowed, but +ASan/UBSan CI, +fuzz target
  touching attacker-controlled input?  -> highest priority for Rust rewrite
  legacy untouched & low-exposure?     -> leave; harden via _FORTIFY_SOURCE+MTE+CFI

METRIC dashboards:
  - % of new LOC in memory-safe languages   (target: up and to the right)
  - memory-safety CVE fraction over time    (target: down)
  - bug density by code age                 (validates "new code" hypothesis)
```
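If the policy is to be enforced mechanically, the table can be expressed as a small decision function. This is one possible encoding, not a prescribed one: `Change`, `Decision`, and `review` are hypothetical names, and the order in which the rules are tested is an assumption, since the policy above does not say which rule wins when several apply.

```rust
/// What the merge gate should do with a proposed change.
#[derive(Debug, PartialEq)]
enum Decision {
    RequireMemorySafeLanguage,
    PrioritizeForRewrite,
    AllowWithSanitizersAndFuzzing,
    LeaveAndHarden,
}

/// Facts about a change, as a reviewer or CI script would record them.
/// Field names are illustrative.
struct Change {
    new_component: bool,
    network_facing_parser: bool,
    handles_untrusted_input: bool,
    modifies_native_code: bool,
}

/// Apply the policy rows top to bottom; the first matching row wins.
fn review(change: &Change) -> Decision {
    if change.new_component && change.network_facing_parser {
        Decision::RequireMemorySafeLanguage
    } else if change.handles_untrusted_input {
        Decision::PrioritizeForRewrite
    } else if change.modifies_native_code {
        Decision::AllowWithSanitizersAndFuzzing
    } else {
        Decision::LeaveAndHarden
    }
}
```

Keeping the rules in one function makes the precedence explicit and reviewable, which matters more than the particular order chosen here.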
Pros & Cons
Hardware safety (MTE / CHERI):
- ✅ Production-affordable enforcement of safety without a language rewrite — protects the existing C/C++ estate.
- ✅ MTE gives temporal protection cheaply (the historically hard part) and ships today; CHERI gives deterministic spatial safety.
- ❌ MTE is probabilistic (1/16 miss); CHERI isn't mainstream yet and roughly doubles pointer size.
- ❌ Both need toolchain/OS/allocator support and ARM-class (or CHERI) hardware.
Migration to memory-safe languages:
- ✅ Removes whole bug classes from new code, where most bugs are born; proven to bend the CVE curve (Android).
- ✅ Incremental and risk-ranked — no big-bang rewrite required.
- ❌ FFI boundary becomes the new attack surface; interop is subtle.
- ❌ Rewriting hot legacy is costly; org-wide language adoption needs investment in skills and tooling.
Use Cases
- Mobile/consumer fleets on modern ARM → enable MTE in production (async) and testing (sync); single biggest cheap win for legacy native code today.
- Security-critical legacy C/C++ you must keep → harden (`_FORTIFY_SOURCE`, hardened allocator, CFI, MTE) and rewrite the highest-exposure components in Rust.
- Greenfield systems software → start in Rust (or an MSL); you skip the migration entirely.
- Research / future-proofing / highest-assurance → track and pilot CHERI/Morello for deterministic safety on legacy semantics.
- Org strategy / leadership → adopt CISA/NSA "memory-safe roadmap" framing; instrument the metrics that prove progress.
Coding Patterns
```text
TOOLCHAIN (production, ARM):
  -march=armv8.5-a+memtag  -fsanitize=memtag-stack,memtag-heap
  link an MTE-aware allocator; run async in prod, sync in CI repro
  keep -D_FORTIFY_SOURCE=3 -fstack-protector-strong
       -mbranch-protection=standard   (PAC/BTI, the AArch64 counterpart of x86 -fcf-protection)

CI (still essential at the boundary):
  ASan+UBSan, MSan, libFuzzer, TSan; sanitizers find what tags miss in test

FFI DISCIPLINE (Rust<->C, the new surface):
  - narrow, audited extern boundary; validate all sizes/pointers at the seam
  - run Miri on the Rust side and sanitizers across the boundary; treat it like `unsafe`

GOVERNANCE:
  - "new high-risk code must be memory-safe" as a merge gate
  - dashboard: %MSL new code, CVE fraction, bug density by age
```
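The FFI discipline above, a narrow boundary that validates pointers and sizes at the seam, might look like the following in Rust. It is a sketch under assumptions: `count_digits`, `count_digits_safe`, and `MAX_INPUT` are hypothetical, the size limit is arbitrary, and the export attribute that a real C-callable symbol needs is left out so the snippet compiles on any edition.

```rust
use std::slice;

/// Largest input the boundary accepts. An illustrative limit, not a standard.
const MAX_INPUT: usize = 64 * 1024;

/// C-ABI entry point (hypothetical). Returns the number of ASCII digit
/// bytes in the buffer, or -1 if the arguments are rejected.
///
/// # Safety
/// If `ptr` is non-null it must point to `len` readable bytes that stay
/// valid and unmodified for the duration of the call. This part of the
/// contract cannot be checked from Rust and must be audited on the C side.
pub unsafe extern "C" fn count_digits(ptr: *const u8, len: usize) -> i64 {
    // Validate everything that CAN be validated before touching memory.
    if ptr.is_null() || len > MAX_INPUT {
        return -1;
    }
    // The single unsafe operation at the seam: trusting (ptr, len).
    let bytes: &[u8] = unsafe { slice::from_raw_parts(ptr, len) };
    // From here on, only safe Rust sees the data.
    count_digits_safe(bytes)
}

/// All real logic lives behind the boundary, in safe code that can be
/// unit-tested and fuzzed without any FFI involved.
fn count_digits_safe(bytes: &[u8]) -> i64 {
    bytes.iter().filter(|b| b.is_ascii_digit()).count() as i64
}
```

The design choice is to keep the `unsafe` surface to one line with a written contract, and push every decision that does not need raw pointers into `count_digits_safe`.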
Best Practices
- Enable MTE in production where the hardware allows — it's the cheapest large reduction in exploitability available for native code, and it covers the temporal gap.
- Don't disable bounds checks for speed without measurement. BCE removes most; the survivors are usually exactly where you want the check. The "safe = slow" story is largely outdated.
- Drive migration with the bug-age data. Mandate memory-safe languages for new and high-risk code first — that's where the ROI is, per Android's results.
- Harden legacy and prevent new bugs simultaneously — it's two prongs, not a choice. Mitigations buy time; MSLs remove the class.
- Treat the FFI boundary as the new critical surface. When you introduce Rust into C/C++ (or vice versa), the interop seam inherits all the old risks; audit it like `unsafe`.
- Distinguish mitigation from safety in every security conversation. ASLR/canaries/CFI/MTE raise cost; only safe code/CHERI remove the bug class. Leadership decisions depend on this distinction.
- Measure and report the trend. The memory-safety CVE fraction over time is your proof the strategy works; make it a tracked metric.
Edge Cases & Pitfalls
- MTE's 1/16 tag collision means a determined or lucky wild access can be missed; it's a strong mitigation, not a proof. Don't market it as "we're now memory-safe."
- MTE async mode gives imprecise blame — great for fleet hardening, frustrating for root-causing; reproduce in sync mode.
- CHERI doubles pointer size, pressuring caches and memory bandwidth and breaking code that assumes `sizeof(void*) == sizeof(long)` or stuffs bits into pointers.
- CHERI temporal safety still needs revocation: recompiling for CHERI gives you spatial safety "for free," but UAF protection requires the sweeping-revocation machinery, which has its own cost/tuning.
- The FFI boundary undoes guarantees. A Rust program calling C, or MTE-tagged code calling untagged libraries, has a seam where the guarantee lapses. Mixed-tag environments need care.
- Migration metric gaming. "% MSL code" can be inflated by trivial new files; pair it with CVE fraction and bug density by age so the metric reflects real risk reduction.
- Mistaking mitigation stacking for safety. Piling ASLR+canaries+CFI feels safe but each is independently bypassable; they don't compose into a guarantee.
Common Mistakes
- Claiming MTE makes a system "memory-safe" — it's probabilistic (1/16 miss). It makes exploitation unreliable and surfaces most bugs; that's mitigation, not a guarantee.
- Treating a Rust migration as all-or-nothing and concluding "we can't afford it." The data says rewrite new and high-risk first; that's affordable and captures most benefit.
- Disabling bounds checks for "performance" without profiling — usually a self-inflicted vulnerability for negligible gain, since BCE already removed the cheap ones.
- Forgetting the FFI seam when introducing a safe language, re-importing every classic bug at the boundary.
- Reporting "% safe code written" without "% memory-safety CVEs" — measuring activity, not outcome.
- Assuming CHERI gives temporal safety automatically on recompile (it gives spatial; temporal needs revocation).
Tricky Points
- Why is MTE "good enough" at only 4 bits? Because exploitation usually requires reliable, repeated control. A 15/16 per-access catch rate makes both bug discovery (across a fleet) and exploitation (which needs many successful accesses) overwhelmingly likely to trip a fault — even though a single random access has a 1/16 escape. Probabilistic-but-pervasive beats deterministic-but-unaffordable for fleet defense.
- Why can CHERI enforce temporal safety when raw pointers can't, even with the same revocation idea? Because capabilities are tagged and findable in memory — the runtime can sweep memory to locate and invalidate every capability into a freed region. With raw integer pointers you can't tell a pointer from an integer, so you can't find (let alone revoke) them. CHERI makes pointers first-class and discoverable, which is precisely what temporal revocation needs.
- Why does "safe by default for new code" work without rewriting the old code? Because vulnerability density is far higher in new/recently-modified code; old code that survived has been hardened by exposure. Stopping new memory bugs at the source removes most future CVEs even though the legacy lines remain.
- Why isn't a stack of mitigations equivalent to safety? Each mitigation is independently bypassable (info leak → ASLR; tag guess → MTE; gadget → CFI). Their product is not a guarantee because attackers chain bypasses. Safety (or CHERI determinism) removes the bug class, leaving nothing to chain.
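The arithmetic behind the first point is worth seeing once. The sketch below assumes each mismatched access independently meets a uniformly random tag, which is a simplification: real allocators may reserve or exclude tags, and an attacker who can leak tags is no longer guessing. `escape_probability` and `detection_probability` are illustrative names.

```rust
/// Probability that `accesses` independent bad accesses ALL go undetected,
/// given `tag_bits` bits of tag (so 2^tag_bits equally likely tags).
fn escape_probability(tag_bits: u32, accesses: u32) -> f64 {
    // One access escapes with probability 2^-tag_bits; independent
    // accesses multiply.
    let per_access = 0.5f64.powi(tag_bits as i32);
    per_access.powi(accesses as i32)
}

/// Probability that at least one of the bad accesses faults.
fn detection_probability(tag_bits: u32, accesses: u32) -> f64 {
    1.0 - escape_probability(tag_bits, accesses)
}
```

With 4-bit tags, one bad access escapes with probability 1/16, two with 1/256, and three with 1/4096. An exploit that needs several corrupting accesses in a row therefore fails almost every time under this model, which is why a small tag is still operationally useful.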
Test Yourself
- Explain MTE's lock-and-key mechanism and how it catches both OOB and use-after-free. Why is it probabilistic?
- What three rules does a CHERI capability enforce in hardware, and how do they give deterministic spatial safety?
- Why does temporal safety remain the expensive part even in hardware (for both MTE and CHERI)?
- Contrast MTE and CHERI across guarantee, cost, pointer size, and deployment status.
- What does bounds-check elimination do, and why is "safe languages are slow because of bounds checks" largely false today?
- State the bug-age finding and how it justifies "safe by default for new code."
- Cite the Android migration numbers and what they demonstrate about strategy.
- Articulate the difference between an exploit mitigation and memory safety, with examples.
Cheat Sheet
```text
HARDWARE SAFETY
  MTE   : 16B granule + 4-bit tag (lock) vs pointer tag (key); retag on free
          -> catches OOB + UAF, PROBABILISTIC (1/16 miss), ~single-digit % cost
          -> SHIPS TODAY (ARMv8.5+, Pixel 8+); async(prod)/sync(debug) modes
  CHERI : 128-bit capability = addr + bounds + perms + validity tag
          -> DETERMINISTIC spatial; unforgeable; authority only narrows
          -> temporal needs revocation sweep; ~2x pointer size; Morello (early)
  spatial = local/cheap (tags, bounds)   temporal = global/costly (retag, revoke)

BOUNDS-CHECK ELIMINATION
  compiler proves index in range -> removes check; "safe = slow" mostly obsolete
  real cost of safety today = GC pauses/footprint + Rust authoring, not bounds checks

MIGRATION (the DATA)
  ~70% severe CVEs = memory safety (Microsoft, Chromium)
  bugs concentrate in NEW/recent code -> "safe by default for new code"
  Android: memory-safety CVEs ~76% (2019) -> ~24% (2024) as new code went safe

PLAYBOOK
  1. new + high-risk code -> memory-safe language (merge gate)
  2. harden legacy: _FORTIFY_SOURCE, hardened alloc, canaries, CFI, MTE
  3. rewrite highest-exposure legacy in Rust; audit the FFI seam
  4. MEASURE: %MSL new code, CVE fraction, bug density by age

MITIGATION != SAFETY
  ASLR/DEP/canary/CFI/MTE = harder/probabilistic (bypassable)
  MSL / CHERI             = bug class removed (nothing to bypass)
```
Summary
At industry scale, software discipline and sanitizers don't close the ~70%-of-severe-CVEs memory-safety gap, so the professional answer is two simultaneous prongs. Hardware enforcement moves the per-access check into silicon: ARM MTE tags every 16-byte granule and every pointer with a 4-bit value and faults on mismatch — catching both out-of-bounds (neighbors get different tags) and use-after-free (regions are re-tagged on free), at production-affordable single-digit-% overhead, but probabilistically (1/16 miss). CHERI makes pointers into unforgeable 128-bit capabilities carrying hardware-checked bounds and permissions that can only narrow — giving deterministic spatial safety to recompiled C/C++, with temporal safety via a capability-revocation sweep (possible only because capabilities are findable in memory). The recurring truth holds in hardware too: spatial safety is local and cheap; temporal safety is global and costly. Meanwhile bounds-check elimination means the classic "safe languages are slow" objection is largely obsolete — optimizers remove the provable checks.
The migration strategy is data-driven: because ~70% of severe CVEs are memory-safety bugs and they concentrate in new/recently-changed code, "safe by default for new code" captures most of the benefit affordably — exactly what Android demonstrated, with its memory-safety CVE fraction falling from ~76% (2019) to ~24% (2024) as new code shifted to memory-safe languages, without rewriting the legacy. The professional leads both prongs at once — harden legacy (mitigations + MTE) and prevent new bugs (memory-safe languages, risk-ranked rewrites, audited FFI) — while keeping the crucial distinction sharp: exploit mitigations make bugs harder to exploit; memory safety removes the bug class. And the way you prove it worked is the trend line: memory-safety CVE fraction over time, down and to the right.
What You Can Build¶
- A hardened native build of a sample C service: _FORTIFY_SOURCE=3, stack protector, CFI, a hardened allocator, and MTE (where hardware permits), with a before/after on which classes of injected bugs now abort cleanly.
- A migration dashboard prototype that tracks % of new LOC in memory-safe languages, memory-safety CVE fraction over time, and bug density by code age (the three metrics that make the ROI case).
- A small Rust component replacing a high-exposure C parser, with a deliberately-narrow, sanitizer-and-Miri-audited FFI boundary, documented as the new critical surface.
Further Reading¶
- Armv8.5-A Memory Tagging Extension — Arm white paper. https://developer.arm.com/documentation/108035/latest/
- Android — "MTE comes to the Pixel 8" and the Android memory-safety blog series (the 76%→24% data). https://security.googleblog.com/
- CHERI: A Hybrid Capability-System Architecture — Watson, Neumann, Woodruff et al., University of Cambridge / SRI. https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/
- An Introduction to CHERI — Cambridge technical report (the capability model, provenance, monotonicity).
- Cornucopia / CHERIvoke — efficient temporal safety via capability revocation.
- Microsoft — "We need a safer systems programming language" and the ~70% data. https://msrc.microsoft.com/blog/
- CISA/NSA — "The Case for Memory Safe Roadmaps" and ONCD — "Back to the Building Blocks: A Path Toward Secure and Measurable Software."
- Google Security — "Eliminating Memory Safety Vulnerabilities at the Source" (the new-code / bug-age strategy).
Related Topics¶
This is the professional capstone of Memory-Safety Mechanisms; it builds on junior.md (the bug taxonomy and the ~70% statistic), middle.md (sanitizers, hardened allocators, guard pages), and senior.md (Rust/managed/ARC language designs). The interview.md file drills hardware mechanisms, the migration data, and the mitigation-vs-safety distinction; tasks.md provides hands-on reasoning exercises across all tiers. Adjacent roadmap areas — exploit mitigations and CPU security (ASLR/DEP/CFI, speculative-execution side channels), the compilation pipeline and undefined behavior, garbage-collection internals, and FFI/interop — are covered in their own folders within language-internals and security.
Diagrams & Visual Aids¶
MTE: Lock-and-Key Tagging¶
ALLOC A (tag 5)    ALLOC B (tag 9)
┌────────────────┐ ┌────────────────┐
│ granules: 5 5 5│ │ granules: 9 9 9│ <- tag memory ("locks")
└────────────────┘ └────────────────┘
ptr_A carries key=5 ptr_B carries key=9
overflow A->B: load via ptr_A(key 5) hits granule(tag 9) -> MISMATCH -> fault
free(A) re-tags A's granules to 12; old ptr_A(key 5) now mismatches -> UAF caught
(1/16 chance the new tag collides with the key -> missed: PROBABILISTIC)
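The lock-and-key behaviour in the diagram can be imitated in plain software. The sketch below is a toy model only: it uses no real MTE instructions or intrinsics, and the names `TaggedHeap`, `TaggedPtr`, `tag_region`, `retag`, and `load` are hypothetical, chosen for illustration. It assumes one 4-bit lock per 16-byte granule and a key carried alongside the address, as described above.

```rust
/// Toy software model of MTE-style lock-and-key tagging.
/// Assumption: this imitates the idea only; real MTE checks happen in hardware.
const GRANULE: usize = 16; // bytes covered by one tag ("lock")

/// A pointer plus the 4-bit key it carries (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TaggedPtr {
    addr: usize,
    key: u8,
}

#[derive(Debug, PartialEq, Eq)]
enum Access {
    Ok(u8),
    TagMismatch { key: u8, lock: u8 },
    OutOfRange,
}

struct TaggedHeap {
    data: Vec<u8>,
    locks: Vec<u8>, // one 4-bit tag per granule
}

impl TaggedHeap {
    fn new(granules: usize) -> Self {
        TaggedHeap {
            data: vec![0; granules * GRANULE],
            locks: vec![0; granules],
        }
    }

    /// Tag `granules` granules starting at granule index `first`
    /// and return a pointer whose key matches the new locks.
    fn tag_region(&mut self, first: usize, granules: usize, tag: u8) -> TaggedPtr {
        let tag = tag & 0x0F; // keep only 4 bits
        for lock in &mut self.locks[first..first + granules] {
            *lock = tag;
        }
        TaggedPtr { addr: first * GRANULE, key: tag }
    }

    /// Model of free(): change the locks so stale keys stop matching.
    fn retag(&mut self, first: usize, granules: usize, new_tag: u8) {
        let _ = self.tag_region(first, granules, new_tag);
    }

    /// Every load compares the pointer's key with the granule's lock.
    fn load(&self, ptr: TaggedPtr, offset: usize) -> Access {
        let addr = ptr.addr + offset;
        if addr >= self.data.len() {
            return Access::OutOfRange;
        }
        let lock = self.locks[addr / GRANULE];
        if lock == ptr.key {
            Access::Ok(self.data[addr])
        } else {
            Access::TagMismatch { key: ptr.key, lock }
        }
    }
}
```

With A tagged 5 and B tagged 9, a load through A's pointer that runs into B's granules returns `TagMismatch`, and so does a load through the old pointer after `retag` changes A's locks to 12. Re-tagging A back to 5 makes the stale pointer match again, which is the collision case that makes the scheme probabilistic.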
CHERI Capability (hardware fat pointer)¶
128-bit capability + 1 out-of-band validity tag (in tagged memory)
┌───────────────────────────────────────────────────────┐ ┌───┐
│ address │ base │ length │ permissions (r/w/x) │ │ 1 │ valid?
└───────────────────────────────────────────────────────┘ └───┘
RULES (hardware-enforced):
deref outside [base, base+length) -> FAULT (deterministic spatial safety)
integer-forge a capability -> validity tag cleared (unforgeable)
derive child -> authority only <= parent (narrow-only)
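The three rules above can be expressed as a small software model. This is a sketch under stated assumptions, not CHERI's actual encoding or instruction set: the `Capability`, `Perms`, and `Fault` types and the `root`, `narrow`, `check_load`, and `forged_from_bits` functions are hypothetical names, bounds are stored uncompressed, and an attempt to widen authority is modelled as clearing the validity tag (real implementations may trap instead).

```rust
/// Toy model of a CHERI-style capability (hypothetical types, uncompressed bounds).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Perms {
    read: bool,
    write: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Capability {
    addr: u64,
    base: u64,
    len: u64,
    perms: Perms,
    valid: bool, // stands in for the out-of-band validity tag
}

#[derive(Debug, PartialEq, Eq)]
enum Fault {
    InvalidTag,
    OutOfBounds,
    PermissionDenied,
}

impl Capability {
    /// A root capability with full read/write authority over [base, base+len).
    fn root(base: u64, len: u64) -> Self {
        Capability { addr: base, base, len, perms: Perms { read: true, write: true }, valid: true }
    }

    fn end(&self) -> u64 {
        self.base.saturating_add(self.len)
    }

    /// Derive a child. Authority may only shrink; a request that would
    /// widen bounds or add permissions yields an invalid capability.
    fn narrow(&self, base: u64, len: u64, perms: Perms) -> Capability {
        let within = base >= self.base
            && base.checked_add(len).map_or(false, |end| end <= self.end());
        let perms_ok = (!perms.read || self.perms.read) && (!perms.write || self.perms.write);
        Capability { addr: base, base, len, perms, valid: self.valid && within && perms_ok }
    }

    /// Model of building a "capability" from raw integers: the tag is clear.
    fn forged_from_bits(addr: u64, base: u64, len: u64) -> Capability {
        Capability { addr, base, len, perms: Perms { read: true, write: true }, valid: false }
    }

    /// The check performed on every load: tag, permission, then bounds.
    fn check_load(&self, offset: u64) -> Result<u64, Fault> {
        if !self.valid {
            return Err(Fault::InvalidTag);
        }
        if !self.perms.read {
            return Err(Fault::PermissionDenied);
        }
        let target = self.addr.checked_add(offset).ok_or(Fault::OutOfBounds)?;
        if target < self.base || target >= self.end() {
            return Err(Fault::OutOfBounds);
        }
        Ok(target)
    }
}
```

In this model a child narrowed to 16 bytes faults on offset 16 regardless of what lies beyond it, an attempt to derive the parent's full range back from the child yields an invalid capability, and a forged value fails the tag check before its bounds are even consulted.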
Spatial vs Temporal Cost — Software to Silicon¶
            SPATIAL (local, cheap)          TEMPORAL (global, costly)
software    bounds check (BCE-elided)       GC / quarantine
hardware    MTE neighbor tags               MTE retag (1/16 gap) /
            CHERI per-ptr bounds (det.)     CHERI revocation sweep
───────────────────────────────────────────────────────────────────
pattern: heavy machinery ALWAYS buys temporal safety; spatial is cheap everywhere
The Two-Prong Strategy¶
LEGACY C/C++ ESTATE                    NEW CODE
(can't rewrite all)                    (where bugs are born)
──────────────────                     ────────────────────
harden in place:                       safe by default:
_FORTIFY_SOURCE, canaries,             Rust / Go / Java / Swift...
hardened alloc, CFI, MTE               risk-ranked rewrites of hot legacy
        │                                          │
        └──────────────┬───────────────────────────┘
                       ▼
              MEASURE the trend:
   memory-safety CVE fraction ↓ (Android: 76% -> 24%)
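The "measure the trend" step reduces to a simple calculation. The sketch below is one possible way to compute it, not a prescribed method: `YearCounts`, `memory_safety_fraction`, and `is_down_and_to_the_right` are hypothetical names, and the series is assumed to be ordered by year. Any counts fed into it would come from your own vulnerability tracker.

```rust
/// Per-year vulnerability counts (hypothetical record type).
#[derive(Debug, Clone, Copy)]
struct YearCounts {
    year: u16,
    memory_safety: u32, // vulnerabilities classified as memory-safety bugs
    total: u32,         // all vulnerabilities that year
}

/// Fraction of vulnerabilities that are memory-safety bugs.
/// Returns None when there is nothing to divide by.
fn memory_safety_fraction(c: &YearCounts) -> Option<f64> {
    if c.total == 0 {
        None
    } else {
        Some(f64::from(c.memory_safety) / f64::from(c.total))
    }
}

/// True when the series has at least two years in ascending order,
/// every year has data, and the fraction strictly falls year over year.
fn is_down_and_to_the_right(series: &[YearCounts]) -> bool {
    let years_ascending = series.windows(2).all(|w| w[0].year < w[1].year);
    let fractions: Vec<f64> = series.iter().filter_map(memory_safety_fraction).collect();
    years_ascending
        && series.len() >= 2
        && fractions.len() == series.len()
        && fractions.windows(2).all(|w| w[1] < w[0])
}
```

As an illustration with made-up counts, a series of 76 out of 100 in 2019 and 24 out of 100 in 2024 reproduces the ~76% and ~24% fractions cited for Android, and the trend check returns true. The counts themselves are placeholders, not Android's actual figures.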