Postmortems That Heal the Org

Blameless, brief, and specific: how to turn incidents into institutional memory without scar tissue.

An incident ends when the pager is quiet; the learning begins when someone writes the thing that makes the next person safer. Postmortems that heal share three properties: they describe reality plainly, they name contributing factors without courtroom language, and they commit to a specific change with an owner and a date.

Structure that scales:

Summary (three sentences): what users saw, what caused it, how long it lasted.
Timeline: timestamps with the actors and the state. Screenshots welcome. No narrative flourishes.
Contributing factors: list format, each item framed as "this made the blast radius larger," not "Alice forgot X."
What we changed: small, testable commits with owners and due dates.
What we're watching: the two metrics that will tell us if the fix is real.

Stories:

We shipped a patch at 5 p.m. Friday (rookie move) that fixed the intended bug and reopened a retired code path. Saturday was a museum of 500s. The healing came not from the apology but from the new rule: "no releases after 2 p.m. Friday without a rollback plan and a buddy." Velocity increased because fewer weekends were on fire.

Another time, an alert storm hid the real signal: a quiet memory leak that only manifested when a partner's batch job aligned with our own. The postmortem led to an alert review where we removed 40% of noisy alerts and added one guardrail on heap growth. Engineers slept better and caught the next leak in hours instead of weeks.

Kindness is a policy here: we do not put names next to errors; we do put names next to fixes. People learn more when they are not performing innocence.

Case notes

The classic SRE literature nails this: blamelessness is not vibes"”it's the precondition for true root cause analysis.

Postmortems That Heal the Org

Case notes

Related