The Zero-prep Postmortem: How to run your first incident postmortem with no preparation
March 29, 2021
Something has gone terribly wrong. You and all the king’s men have risen to the occasion, and put Humpty Dumpty together again.
You’ve heard of postmortems, and you think doing one could help you, but where to begin? This guide is for you!
With literally no preparation, you should be able to run a successful postmortem. Even if you found this via Google after assembling in the conference room!
Who to invite (Optional)
I say this is optional, because a postmortem can be valuable regardless of who attends.
But if you have the chance to control who comes, here are some things to keep in mind to make your zero-preparation postmortem as successful as possible.
- Invite anyone who was involved in the discovery of the incident (e.g. customer service agents)
- Invite anyone who helped diagnose or respond to the incident (e.g. software engineers, operations staff)
- Invite anyone whose business interests have been impacted by the incident (e.g. product managers)
- Invite anyone who may be involved in follow-up actions (this can be harder to predict; just do your best)
- The best postmortems are often open meetings. That is to say, if you have enough meeting space, invite anyone who is interested to attend as an observer.
Conducting the postmortem meeting
0. Designate a note-taker
This doesn’t need to be fancy. It can be you. Or anyone else with a pencil and paper, or a laptop to take notes. It’s often easier if the note-taker is not leading the meeting, but this is not a hard and fast rule.
At the end of the meeting, the notes should be made available to all attendees, and to anyone interested who did not attend. So if the notes are taken on physical paper, be prepared to transcribe them into an electronic form (or, in the worst case, take a digital photo of the paper and share it).
Ask: Who can take notes?
If nobody volunteers, volunteer yourself.
Have the note-taker put today’s date and their name at the top of the notes. Then have them record the answer to every Ask: prompt below.
1. Summarize the incident
Describe in a sentence or two what happened. Often this requires the input of more than one person. This description should not be long or detailed. The primary goal is to ensure that everyone in the meeting is on the same page, and talking about the same thing. If more than one failure occurred recently, it’s common for people to be confused about which one is being discussed.
Ask: Who can describe the incident in a sentence or two?
If anyone in the meeting disagrees, tweak the description until general consensus is achieved.
2. Determine the duration and impact of the incident
Ask: When was the incident discovered?
Ask: When did the incident begin?
This can be tricky or impossible to answer in some situations. As a rule of thumb, an incident began when the first customer was affected. A long-brewing problem isn’t an incident unless it has an impact. If you don’t know the answer, you can assume it began when it was discovered.
Ask: When was the incident resolved?
Another potentially tricky one. The incident should be at least temporarily mitigated before you take time out to do a postmortem, but there are likely ongoing tasks to be done to finalize the resolution. As a rule of thumb, consider the incident resolved when customers no longer saw an impact, or were able to work around the problem.
Ask: Which customers, or other stakeholders, were affected?
Ask: What was the estimated monetary (or other relevant) loss?
No need to go crazy here, especially if you don’t have these statistics readily available. Estimates are okay. The goal is to give everyone a concept of the severity of the incident.
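If the note-taker is on a laptop, the three timestamps from this step can be turned into a duration with a few lines of code. The following is a minimal sketch, not a prescribed tool: the `IncidentWindow` class and the example times are assumptions made up for illustration. It applies the rule of thumb above, falling back to the discovery time when the start is unknown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentWindow:
    """The three timestamps asked for in step 2 (hypothetical helper)."""
    discovered: datetime
    resolved: datetime
    began: Optional[datetime] = None  # leave as None if nobody knows

    @property
    def start(self) -> datetime:
        # If the start is unknown, assume it began when it was discovered.
        return self.began if self.began is not None else self.discovered

    @property
    def time_to_detect(self) -> timedelta:
        return self.discovered - self.start

    @property
    def duration(self) -> timedelta:
        return self.resolved - self.start


# Example times are invented for illustration.
window = IncidentWindow(
    began=datetime(2020, 12, 15, 14, 30),
    discovered=datetime(2020, 12, 15, 15, 0),
    resolved=datetime(2020, 12, 15, 16, 45),
)
print("Time to detect:", window.time_to_detect)
print("Total duration:", window.duration)
```

Rough numbers are fine here too. The point is only to give the room a shared sense of how long customers were affected.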
3. Document how the incident unfolded in a timeline
For this part, it is best to have everyone who participated in the incident in the meeting. If this is not possible, do the best you can based on the information at hand.
Ask: How was the incident discovered and resolved?
For every relevant action that was taken, record the following pieces of information:
- The date and time the action took place. Working from memory is fine. If possible, confirm the timeline using the timestamps on relevant emails, chat messages, etc.
- Who or what took the action. This is usually a person, but it could be a server or third-party vendor.
- Briefly describe the action taken.
2020-12-15 15:00, Bob discovered that the company web site was down. He called Alice to investigate.
2020-12-15 15:20, Alice determined that the web server’s disk had filled up with too many temporary files…
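People rarely recall events in order, so it can help to record entries as they come up and sort them afterwards. Here is a small sketch of that idea, assuming a hypothetical `TimelineEntry` record holding the three pieces of information listed above. The entries reuse the example above, shortened.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime   # the date and time the action took place
    actor: str       # who or what took the action
    action: str      # brief description of the action

    def render(self) -> str:
        return f"{self.when:%Y-%m-%d %H:%M}, {self.actor} {self.action}"


def build_timeline(entries):
    """Return the entries in chronological order."""
    return sorted(entries, key=lambda entry: entry.when)


# Entries are recorded in the order people remember them.
entries = [
    TimelineEntry(datetime(2020, 12, 15, 15, 20), "Alice",
                  "determined that the web server's disk had filled up."),
    TimelineEntry(datetime(2020, 12, 15, 15, 0), "Bob",
                  "discovered that the company web site was down."),
]

timeline = build_timeline(entries)
for entry in timeline:
    print(entry.render())
```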
4. Root-cause analysis
Now, starting with the observed problem, do a short brainstorming session to discuss the possible root causes, and record every possible root cause identified.
Ask: What were the cause(s) of the observed problem(s)?
If there was more than one simultaneous problem, repeat this exercise for each problem—but they should be related. If you have two unrelated problems simultaneously, they should generally be treated as two incidents, and thus have two independent postmortems.
It’s okay to propose ideas here that are not verified. Don’t spend time doing an actual investigation. Instead, come up with all plausible root causes. Investigation and remediation happen after the postmortem.
For each cause uncovered here, repeat this step, until no more underlying causes can be easily determined.
- Full disk led to the web site going down.
- The disk was filled due to too many debug logs.
- We had too many debug logs because we were logging at the wrong severity level.
- … etc
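The chain of causes above is really a small tree: each cause may have one or more underlying causes of its own. If you want to capture it as data rather than as nested bullets, a sketch along these lines may be enough. The `Cause` class and its `because` method are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    description: str
    underlying: List["Cause"] = field(default_factory=list)

    def because(self, description: str) -> "Cause":
        """Record an underlying cause and return it, so chains can continue."""
        child = Cause(description)
        self.underlying.append(child)
        return child


def outline(cause: Cause, depth: int = 0) -> List[str]:
    """Render the tree as indented bullet lines."""
    lines = ["  " * depth + "- " + cause.description]
    for child in cause.underlying:
        lines.extend(outline(child, depth + 1))
    return lines


def deepest(cause: Cause) -> List[str]:
    """Collect the causes with nothing recorded beneath them."""
    if not cause.underlying:
        return [cause.description]
    found: List[str] = []
    for child in cause.underlying:
        found.extend(deepest(child))
    return found


problem = Cause("Web site went down")
full_disk = problem.because("Disk was full")
debug_logs = full_disk.because("Too many debug logs")
debug_logs.because("Logging at the wrong severity level")

print("\n".join(outline(problem)))
print("Deepest causes so far:", deepest(problem))
```

Every node in the tree, not only the deepest ones, is a candidate for a mitigation in the next step.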
5. Prevention and mitigation
For every cause on your list from step 4, do a short brainstorming session to come up with one or more possible mitigation steps that can be taken.
Ask: How can we prevent or mitigate each of these root causes?
Don’t worry about the feasibility of these suggestions at this stage. Simply record every possible solution imagined for every one of the causes identified above.
For example, for the problem of a full disk:
- Buy bigger disks
- Use disk compression
- Delete old log files more aggressively
- Generate an alert when the disk is nearing full
- Rewrite the web server to continue operating when the disk is full
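To show how small some of these mitigations can be, here is a rough sketch of the disk alert idea using Python's standard `shutil.disk_usage`. The 90% threshold, the function names, and printing instead of paging someone are all assumptions for illustration. A real setup would more likely rely on your existing monitoring system.

```python
import shutil


def disk_used_fraction(path: str) -> float:
    """Return the fraction of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    # Guard against unusual filesystems that report a total of zero.
    return usage.used / usage.total if usage.total else 0.0


def should_alert(used_fraction: float, threshold: float = 0.9) -> bool:
    """Alert once usage reaches the threshold (0.9 is an arbitrary choice)."""
    return used_fraction >= threshold


used = disk_used_fraction(".")
if should_alert(used):
    print(f"ALERT: disk is {used:.0%} full")
else:
    print(f"OK: disk is {used:.0%} full")
```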
6. Examine the problem resolution
Repeat steps 4 and 5 for the problem resolution itself.
Ask: What factors or problems made the problem detection and resolution take longer than necessary?
Ask: What were the underlying causes of these problems?
Ask: How can we prevent or mitigate each of these root causes?
That is, brainstorm problems that made the problem detection and resolution more difficult or take longer than necessary, and brainstorm about root causes for each.
7. Assign ownership
Of the mitigation steps identified in steps 5 and 6, choose 5 or so, and assign each of them to a person in the postmortem meeting.
Ask: Which 5 of these mitigation steps can we commit to?
Ask: Who can be responsible for seeing these steps completed?
You may wish to select those with the highest impact and lowest effort.
It’s also important to assign these tasks to someone in the meeting, as a task assigned to someone who is not present is unlikely to be followed through. The assignee doesn’t necessarily have to complete the work themselves; they may delegate it to someone else.
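One simple way to pick "highest impact and lowest effort" is to have the room give each idea a rough 1 to 5 score for both, then sort by the ratio. The sketch below assumes exactly that. The scores, the owners, and the `Mitigation` class are invented for illustration, and the last item on the list is a hypothetical mitigation for the logging cause from step 4.

```python
from dataclasses import dataclass


@dataclass
class Mitigation:
    description: str
    impact: int   # 1 (low) to 5 (high), a rough guess from the room
    effort: int   # 1 (low) to 5 (high), a rough guess from the room
    owner: str = ""


def pick(mitigations, limit=5):
    """Return up to `limit` mitigations, best impact-to-effort ratio first."""
    ranked = sorted(
        mitigations,
        key=lambda m: (-(m.impact / m.effort), m.effort),
    )
    return ranked[:limit]


ideas = [
    Mitigation("Buy bigger disks", impact=2, effort=2),
    Mitigation("Use disk compression", impact=2, effort=3),
    Mitigation("Delete old log files more aggressively", impact=4, effort=1),
    Mitigation("Generate an alert when the disk is nearing full", impact=4, effort=2),
    Mitigation("Rewrite the web server to continue operating when the disk is full",
               impact=3, effort=5),
    Mitigation("Log at the correct severity level", impact=5, effort=1),
]

chosen = pick(ideas, limit=3)
# Owners must be people who are in the meeting.
for mitigation, owner in zip(chosen, ["Alice", "Bob", "Alice"]):
    mitigation.owner = owner

for mitigation in chosen:
    print(f"{mitigation.owner}: {mitigation.description}")
```

The scores are only a conversation aid. If the room disagrees with the ranking, trust the room.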
8. Schedule a follow-up meeting
The final stage of a postmortem is to schedule a follow-up meeting, in which the tasks assigned in step 7 are discussed. This should be no more than 2 weeks in the future.
During the follow-up meeting, discuss the progress on the assigned tasks, and discuss any next steps, if appropriate.