Postmortems That Heal the Org

Blameless, brief, and specific: how to turn incidents into institutional memory without scar tissue.

Series: AI, Safety & Systems
Article · 2025 · 1 min read · Reliability, Culture, Engineering

An incident ends when the pager is quiet; the learning begins when someone writes the thing that makes the next person safer. Postmortems that heal share three properties: they describe reality plainly, they name contributing factors without courtroom language, and they commit to a specific change with an owner and a date.

Structure that scales:

  1. Summary (three sentences): what users saw, what caused it, how long it lasted.
  2. Timeline: timestamped entries recording who acted and what state the system was in. Screenshots welcome. No narrative flourishes.
  3. Contributing factors: list format, each item framed as "this made the blast radius larger," not "Alice forgot X."
  4. What we changed: small, testable commits with owners and due dates.
  5. What we're watching: the two metrics that will tell us if the fix is real.
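
One way to keep that structure honest is to make the template checkable. A minimal sketch in Python, assuming a hypothetical `Postmortem` record (the class, field, and method names are illustrative, not taken from any real tool). It flags a draft whose summary is not three sentences, whose sections are empty, whose actions lack an owner, or whose watch list is not exactly two metrics.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Action:
    """One entry under "What we changed" (hypothetical shape)."""
    description: str
    owner: str
    due: date  # required field, so an action cannot be created without a date


@dataclass
class Postmortem:
    """Hypothetical record mirroring the five sections of the structure."""
    summary: str
    timeline: List[str]
    contributing_factors: List[str]
    actions: List[Action]
    watch_metrics: List[str]

    def problems(self) -> List[str]:
        """Return structural problems; an empty list means the draft passes."""
        found = []
        # Crude sentence split: break after '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", self.summary.strip())
        sentences = [s for s in parts if s]
        if len(sentences) != 3:
            found.append(f"summary has {len(sentences)} sentences, expected 3")
        if not self.timeline:
            found.append("timeline is empty")
        if not self.contributing_factors:
            found.append("no contributing factors listed")
        if not self.actions:
            found.append("no changes committed")
        for action in self.actions:
            if not action.owner.strip():
                found.append(f"action without an owner: {action.description}")
        if len(self.watch_metrics) != 2:
            found.append(f"watching {len(self.watch_metrics)} metrics, expected 2")
        return found
```

Calling `problems()` on a draft returns an empty list when all five sections are in shape. The check is purely structural: it cannot tell whether a contributing factor is phrased without blame, which still takes a human reviewer.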

Stories:

We shipped a patch at 5 p.m. Friday (rookie move) that fixed the intended bug and reopened a retired code path. Saturday was a museum of 500s. The healing came not from the apology but from the new rule: "no releases after 2 p.m. Friday without a rollback plan and a buddy." Velocity increased because fewer weekends were on fire.
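
A rule like that is simple enough to encode as a release gate. A sketch, assuming a hypothetical `release_allowed` helper that deploy tooling could call before shipping; the function name, parameters, and constants are illustrative, and it covers only the Friday rule as stated.

```python
from datetime import datetime, time
from typing import Optional

FRIDAY = 4            # datetime.weekday(): Monday is 0, so Friday is 4
CUTOFF = time(14, 0)  # 2 p.m., in whatever timezone `now` is expressed in


def release_allowed(now: datetime, has_rollback_plan: bool, buddy: Optional[str]) -> bool:
    """Hypothetical gate for the rule: no releases after 2 p.m. Friday
    without a rollback plan and a buddy."""
    late_friday = now.weekday() == FRIDAY and now.time() > CUTOFF
    if not late_friday:
        return True
    # Late on Friday: both a rollback plan and a named buddy are required.
    return has_rollback_plan and bool(buddy)
```

The gate does not forbid late Friday releases outright; it only makes the two conditions explicit, which matches the rule's wording.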

Another time, an alert storm hid the real signal: a quiet memory leak that only manifested when a partner's batch job aligned with our own. The postmortem led to an alert review where we removed 40% of noisy alerts and added one guardrail on heap growth. Engineers slept better and caught the next leak in hours instead of weeks.
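
A guardrail on heap growth can be as small as a slope check over recent samples. A sketch under stated assumptions: the `(seconds, heap_bytes)` sample format, the function names, and the threshold are all hypothetical, and a real setup would likely express this in the monitoring system's own rule language rather than in application code.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, heap size in bytes)


def heap_growth_rate(samples: List[Sample]) -> float:
    """Least-squares slope of heap size over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    # Numerator and denominator of the ordinary least-squares slope.
    num = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def heap_is_leaking(samples: List[Sample], max_bytes_per_second: float) -> bool:
    """Flag sustained growth above a hypothetical threshold."""
    return heap_growth_rate(samples) > max_bytes_per_second
```

Fitting a slope over a window, rather than alerting on a single high reading, lets a slow leak show up without adding another noisy alert.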

Kindness is a policy here: we do not put names next to errors; we do put names next to fixes. People learn more when they are not performing innocence.

Case notes

The classic SRE literature nails this: blamelessness is not vibes; it's the precondition for true root cause analysis.

Author

Terry Chen

Technology executive and builder focused on AI safety, cybersecurity, and decision-support systems.
