Post-Mortem Analysis Roadmap
"The dead process can still tell you what killed it — if you know how to listen."
This roadmap covers analysing a process after it has died (or been deliberately frozen for inspection), using the artefacts it leaves behind: core dumps, heap dumps, thread dumps, JFR recordings, and eBPF traces saved for offline analysis. It is the "open the body and look" complement to live debugging.
Looking for live introspection of a running process (/debug/pprof, JMX, py-spy)? See Diagnostic Endpoints.
Looking for interactive debugging with a debugger attached? See Debugging.
Looking for the human post-mortem (incident write-up, blameless culture)? That belongs in the Soft-Skills / SRE tracks, not this one — this is the technical artefact side.
Why a Dedicated Roadmap
Live debugging fails when:
- The crash happens once a week at 3 a.m. — you can't repro it interactively
- The bug only reproduces with production traffic
- The process is gone by the time you log in
- The state at crash time is gone the moment the process restarts
The skill is collecting the corpse cheaply at the time of death, then doing the heavy analysis offline. Every language and runtime has its own format and tooling — and the cost/quality trade-offs differ wildly.
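As one illustration of "collect cheaply, analyse later", here is a minimal Python sketch built on the standard-library `faulthandler` module. It asks the interpreter to write every thread's traceback to a file when a fatal signal such as SIGSEGV arrives, and it can also write the same dump on demand. The function names `install_crash_traceback` and `dump_now` and the file path are hypothetical, chosen only for this example.

```python
import faulthandler


def install_crash_traceback(path):
    """Ask faulthandler to write all thread tracebacks to `path` on a fatal signal.

    Returns the open file object. It must stay open for the life of the
    process, because faulthandler keeps only its file descriptor.
    """
    crash_file = open(path, "w")
    faulthandler.enable(file=crash_file, all_threads=True)
    return crash_file


def dump_now(crash_file):
    """Write the current tracebacks immediately (the 'frozen for inspection' case)."""
    faulthandler.dump_traceback(file=crash_file, all_threads=True)
    crash_file.flush()
```

A service would typically call `install_crash_traceback("/var/tmp/myservice-crash.log")` (an assumed path) once at startup. The resulting file is tiny compared with a core dump, which is the trade-off: it records where each thread was, not the contents of memory.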
| Roadmap | Question it answers |
|---|---|
| Debugging | What's the live state? |
| Diagnostic Endpoints | What's the running state, exposed via API? |
| Post-Mortem Analysis (this) | What was the state when it died — and can I reconstruct it later? |
Sections
| # | Topic | Focus |
|---|---|---|
| 01 | When Post-Mortem Beats Live | The diagnostic costs of repro; production-only bugs |
| 02 | Core Dumps | What's in one, generating them (ulimit -c, /proc/sys/kernel/core_pattern), reading with gdb |
| 03 | Heap Dumps | Java .hprof, Python gc.get_objects + heapy, Go runtime.WriteHeapDump, .NET .dmp |
| 04 | Thread / Goroutine Dumps | jstack, SIGQUIT, runtime.Stack; reading them, finding the deadlock |
| 05 | Java Flight Recorder | JFR recordings, what they capture, opening with Mission Control |
| 06 | eBPF Captures | perf record, bpftrace outputs, off-CPU profiling saved offline |
| 07 | Crash Dumps on Mobile | iOS .ips, Android tombstone, symbol files |
| 08 | Analysis Tools | gdb, Eclipse MAT, VisualVM, dlv core, pyflame --dump, pprof reading |
| 09 | Symbolication | Why a dump without symbols is half-useless; dSYM, pdb, build-id matching |
| 10 | Offline Reproduction | When a dump is the bug report; replaying a request from captured state |
| 11 | Cost & Storage | Dumps are big; what to keep, how long, how to triage |
| 12 | Anti-patterns | Never raising ulimit -c, no symbol upload, throwing away dumps before triage, dump-on-every-error |
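The core-dump settings named in the table (ulimit -c and /proc/sys/kernel/core_pattern) can be checked from inside a process before anything goes wrong. The sketch below is Linux-only and uses Python's standard `resource` module; the function names `core_dump_readiness` and `raise_core_limit` are hypothetical, and the report fields are just one possible shape.

```python
import resource
from pathlib import Path

# Linux-specific location of the kernel's core file naming pattern.
CORE_PATTERN = Path("/proc/sys/kernel/core_pattern")


def core_dump_readiness():
    """Report whether this process could currently produce a core dump."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        pattern = CORE_PATTERN.read_text().strip()
    except OSError:
        pattern = None  # not Linux, or /proc is not readable
    return {
        "soft_limit": soft,  # what `ulimit -c` reports (in bytes here)
        "hard_limit": hard,
        "enabled": soft != 0,  # a soft limit of 0 means no core file is written
        "core_pattern": pattern,
        # A leading "|" means the kernel pipes the dump to a helper program.
        "piped_to_handler": bool(pattern) and pattern.startswith("|"),
    }


def raise_core_limit():
    """Raise the soft core limit to the hard limit; return the new soft limit."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)[0]
```

Logging the result of `core_dump_readiness()` at startup is a cheap way to find out that dumps are disabled before the 3 a.m. crash rather than after it. Raising the soft limit only helps when the hard limit is non-zero; a hard limit of 0 has to be changed by whatever launches the process.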
Languages
Examples in Java (heap dumps, jcmd, JFR, MAT), Go (runtime.WriteHeapDump, dlv core), Python (faulthandler, heapy), C / C++ (core dumps + gdb), mobile (iOS / Android crash artefacts).
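For the Python heap-dump entry, a rough idea of what a `gc.get_objects`-based snapshot might look like is sketched below. It counts live objects by type and writes the result to a JSON file that can be compared offline against a later snapshot. The names `heap_census` and `write_census` and the snapshot fields are hypothetical. Note that `gc.get_objects()` returns only objects tracked by the garbage collector (mostly containers and class instances), so plain integers and strings are not counted.

```python
import gc
import json
import os
import time
from collections import Counter


def heap_census(top=10):
    """Count live, gc-tracked objects by type name; return the `top` most common.

    Passing top=None returns every type.
    """
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(top)


def write_census(path, top=50):
    """Save a census to `path` as JSON so it can be inspected or diffed later."""
    snapshot = {
        "pid": os.getpid(),
        "taken_at": time.time(),  # seconds since the epoch
        "top_types": heap_census(top),
    }
    with open(path, "w") as fh:
        json.dump(snapshot, fh, indent=2)
    return snapshot
```

This is a census rather than a true heap dump: it shows which types are accumulating but not who holds the references, which is the kind of question tools such as heapy or Eclipse MAT (for Java) are meant to answer.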
Status
⏳ Structure defined; content pending.
References
- Java Performance: The Definitive Guide — Scott Oaks (heap analysis chapters)
- The Linux Programming Interface — Michael Kerrisk (signals, core dumps)
- Systems Performance — Brendan Gregg
- Debugging with GDB — official manual
- Eclipse MAT documentation
Project Context
Part of the Senior Project — a personal effort to consolidate the essential knowledge of software engineering in one place.