Post-Mortem: [Incident Title]¶

Summary¶

Author: [Name] Incident Commander: [Name] Reviewers: [Names]

One-Line Summary¶

[Brief description of what happened and impact]

Impact¶

User Impact¶

Metric	Value
Users Affected	[Number or %]
Duration of Impact	[Time]
Error Rate Peak	[%]
Support Tickets	[Number]

Business Impact¶

Metric	Value
Revenue Impact	[$X or N/A]
SLA Breach	[Yes/No]
Customer Notifications	[Number]

Systems Affected¶

[System/Service 1]
[System/Service 2]

Timeline¶

All times in UTC.

Time	Event
HH:MM	[First alert/detection]
HH:MM	[Incident declared, severity assigned]
HH:MM	[Investigation started]
HH:MM	[Root cause identified]
HH:MM	[Mitigation applied]
HH:MM	[Service restored]
HH:MM	[Incident closed]

Root Cause¶

What Happened¶

[Detailed technical explanation of the failure chain]

Why It Happened¶

[Analysis of underlying causes - use 5 Whys if helpful]

Why? [First level cause]
Why? [Second level cause]
Why? [Third level cause]
Why? [Fourth level cause]
Why? [Root cause]

Trigger¶

[What specific event or change triggered the incident?]

Contributing Factors¶

Factors that made the incident possible or worse:

[ ] Detection Gap: [Monitoring didn't catch it]
[ ] Process Gap: [Missing runbook/procedure]
[ ] Testing Gap: [Untested scenario]
[ ] Documentation Gap: [Missing/outdated docs]
[ ] Capacity Issue: [Resource constraints]
[ ] Dependency Failure: [External service]
[ ] Configuration Error: [Misconfiguration]
[ ] Code Defect: [Bug in code]
[ ] Human Error: [Manual mistake]

Details: - [Factor 1]: [Explanation] - [Factor 2]: [Explanation]

Detection¶

How Was It Detected?¶

[ ] Automated monitoring/alerting
[ ] Customer report
[ ] Internal user report
[ ] Scheduled check
[ ] Other: [specify]

Detection Delay¶

Metric	Value	Notes
Time to Detection (TTD)	[X minutes]	First alert or report
Time to Acknowledgment (TTA)	[X minutes]	Investigation started
Time to Mitigation (TTM)	[X minutes]	Bleeding stopped
Time to Resolution (TTR/MTTR)	[X minutes]	Fully resolved

Detection Gaps¶

[What should have alerted us but didn't?]

Response¶

What Went Well¶

[Positive 1: e.g., "Quick escalation to on-call"]
[Positive 2: e.g., "Clear communication in incident channel"]
[Positive 3: e.g., "Runbook was helpful"]

What Didn't Go Well¶

[Issue 1: e.g., "Took too long to identify root cause"]
[Issue 2: e.g., "Missing access to logs"]
[Issue 3: e.g., "Unclear ownership"]

Where We Got Lucky¶

[Lucky break 1: e.g., "Engineer happened to be online"]
[Lucky break 2: e.g., "Impact was during low-traffic period"]

Resolution¶

Immediate Fix¶

[What was done to stop the bleeding?]

[Commands/steps taken]

Verification¶

[How did we confirm the fix worked?]

Action Items¶

Immediate (Within 1 Week)¶

Action	Owner	Due Date	Status	Beads Issue
[Action 1]	[Name]	[Date]	Open	[bd://xxx]
[Action 2]	[Name]	[Date]	Open	[bd://xxx]

Short-Term (Within 1 Month)¶

Action	Owner	Due Date	Status	Beads Issue
[Action 1]	[Name]	[Date]	Open	[bd://xxx]

Long-Term (Within 1 Quarter)¶

Action	Owner	Due Date	Status	Beads Issue
[Action 1]	[Name]	[Date]	Open	[bd://xxx]

Prevention¶

How Do We Prevent Recurrence?¶

[Specific technical and process changes]

How Do We Detect Faster?¶

[New alerts, dashboards, or checks to add]

How Do We Recover Faster?¶

[Runbook updates, automation, or process improvements]

Lessons Learned¶

Key Takeaways¶

[Lesson 1]
[Lesson 2]
[Lesson 3]

Process Improvements¶

[Improvement 1]
[Improvement 2]

Appendix¶

[Link to similar past incidents]

Relevant Logs/Dashboards¶

[Link to logs]
[Link to dashboard]
[Link to traces]

External References¶

[Vendor post-mortem if applicable]
[Related documentation]

Sign-off¶

Role	Name	Date
Author
Incident Commander
Engineering Lead
Product Owner