An incident ends when the pager is quiet; the learning begins when someone writes the thing that makes the next person safer. Postmortems that heal share three properties: they describe reality plainly, they name contributing factors without courtroom language, and they commit to a specific change with an owner and a date.
Structure that scales:
- Summary (three sentences): what users saw, what caused it, how long it lasted.
- Timeline: timestamps with the actors and the state. Screenshots welcome. No narrative flourishes.
- Contributing factors: list format, each item framed as "this made the blast radius larger," not "Alice forgot X."
- What we changed: small, testable commits with owners and due dates.
- What we're watching: the two metrics that will tell us if the fix is real.
Stories:
We shipped a patch at 5 p.m. Friday (rookie move) that fixed the intended bug and reopened a retired code path. Saturday was a museum of 500s. The healing came not from the apology but from the new rule: "no releases after 2 p.m. Friday without a rollback plan and a buddy." Velocity increased because fewer weekends were on fire.
Another time, an alert storm hid the real signal: a quiet memory leak that only manifested when a partner's batch job aligned with our own. The postmortem led to an alert review where we removed 40% of noisy alerts and added one guardrail on heap growth. Engineers slept better and caught the next leak in hours instead of weeks.
Kindness is a policy here: we do not put names next to errors; we do put names next to fixes. People learn more when they are not performing innocence.
Case notes
The classic SRE literature nails this: blamelessness is not vibes"”it's the precondition for true root cause analysis.