Over the last decade, the concept of a “blameless” post-mortem has become a software industry standard. According to ChatGPT, blameless was introduced to the software world by a 2012 blog post:
Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of [what happened]...and that they can give this detailed account without fear of punishment or retribution.
Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.
The intent of this philosophy is good - let’s avoid fear of unfair punishment so we can learn from incidents.
The problem is that there is a metric ton of nuance in these statements. Instead of finding a middle ground of accountability and empathy, many companies ran with these principles into a no-man’s land of non-accountability.
Let’s talk about why. But first, some background.
The History Of Blameless
The idea of blameless post-mortems came from aviation and healthcare. I’m not an expert, but I think that the situations that prompted these industries to conduct blameless post-mortems had the following qualities:
- Severity: Something really really bad happened. A plane crashed. A person died in the operating room.
- Frequency: Incidents are generally infrequent. Most people are rarely involved in a post-mortem.
- Punishment: The punishment for negligence can be extraordinarily high, e.g. criminal charges. You need to reassure people that they aren’t at risk of criminal punishment for well-intentioned mistakes.
- Recurrence: Given the severity of these incidents, you need to be absolutely sure you are doing every possible thing to prevent recurrence.
- Follow-through: Following an incident, official governing bodies change official rules or laws.
This all makes sense. If a plane crashes at the Toledo airport once in 50 years, we gotta make sure the person who tightened the lug nuts isn’t afraid to tell us that their wrench did seem a little loose that day.
Blameless in Software
Post-mortems in many software companies happen regularly. Many companies do weekly incident reviews. The properties of software post-mortems look like:
- Severity: Comparatively, most software incidents aren’t that bad. Your average incident might be something like an API going down for 5 minutes. You gotta fix it, but they’re not hauling bodies out of the bay.
- Frequency: Frequent. Often weekly.
- Punishment: Listen, if you cause a 5-minute API outage and we’re holding you accountable, you’re probably not even getting rated Below Expectations at most companies. Do it twice and you probably are. Do it three times and you have to go find another job that might pay you 10% more. You’re not going to jail.
- Recurrence: We do want to avoid recurrences of software incidents. It’s very important.
- Follow-through: Follow-ups are typically owned by teams, often with medium-level priority, sometimes forgotten.
Software vs Aviation/Health
In summary, software post-mortems are much lower severity and much more frequent than aviation or healthcare post-mortems. In fact, they’re so common that they’re a critical part of regular accountability and learning in software organizations.
As a result, software culture becoming too blameless is just as bad as being too blameful:
- Individuals and teams miss the opportunity to learn. Without actually saying whose fault something is, people can end up living in a world where they never hear the thing that they need to hear most: this one was on you and you must change what you’re doing.
- Because follow-ups are not as fanatically and centrally implemented, the key driver of quality is accountability, not new rules. When you obscure accountability in software incidents, you create sustained mediocrity.
Said another way, the severity of an average software incident is not so bad that it’s worth the non-accountability of a blameless approach. As a result, it’s critical that software post-mortems tweak these practices to more effectively serve their needs.
Blameful Postmortems - A Goldilocks Solution
There must be a middle ground where we can achieve all of our goals. Blameful Postmortems should have the following properties:
It’s Somebody’s Fault
An absolutely critical part of every post-mortem is that it’s somebody’s fault. Every issue is either:
- Someone’s job and thus their fault when it fails
- Nobody’s job and the fault of the leader who has an ambiguous organizational design
Note: it must be primarily one person’s or one team’s fault. If multiple teams are allegedly at fault, it’s the same as no team being at fault. Driving to the core of who should have prevented an incident is often one of the most fruitful exercises in refining and clarifying responsibilities.
There are no exceptions. Common objections are:
- “It’s a third-party provider, how can that be my fault?” You picked the third-party provider, so you need to own their outcomes.
- “I inherited this thing and never worked on it, how can that be my fault?” You might not be in a ton of trouble for this failure, but it’s still your responsibility.
Note: the word fault here is admittedly a bit harsh-sounding, but it’s used intentionally. Other words start you on the slippery slope to blameless avoidance. Every incident should have had someone whose job it was to own the prevention or risk mitigation.
Accountability Is Fair
A key failure with most blameless cultures is that people believe it means you don’t have consequences when things break. That’s a non-accountable culture. That’s nonsense. A good post-mortem and engineering culture promises that people will be held accountable in a fair and balanced way.
For example, if there was really no way that you could have been expected to prevent an incident, it might still be your fault, but you might face zero repercussions. We might walk away agreeing that the next one is the one you’re expected to prevent.
On the flip side, if you really messed up, you might get fired. If we said we’re in a code freeze and you YOLOed a release to try to push out a project to game the performance assessment round and you took out prod for 2 days, you will be blamed and you will be fired.
Most repercussions are a middle ground. Good culture doesn’t mean people face all-or-nothing repercussions - it means they face the appropriate accountability.
The Right Incentives
As a quick aside - I absolutely hate how most people treat incentives. So many leaders act like incentives are the only thing you can expect people to follow.
“Hey, my top sales guy sold a $1M deal but did it at -90% margin, but there was no rule against it, so what do you expect?”
“Hey, you created this process of always holding people accountable for incidents, don’t be surprised when people hide stuff, right?”
Horse apples. Nonsense.
There is one main incentive that all employees have - act with high integrity or get fired. Stop excusing people for bad behavior because your little point systems don’t cover every case of common sense.
So, as it applies to post-mortems - be clear with your team:
- Everything is someone’s fault
- Something being your fault doesn’t mean it’s a huge deal. Most things are not.
- If you game the system, hide information, or otherwise prioritize your own rewards over the health of our company, you will be subject to disciplinary action up to and including termination
Blameful Postmortems - Final Thoughts
Software isn’t aviation or healthcare, so let’s stop acting like it. Post-mortems are good. Focus on non-recurrence of incidents is good. Let’s keep doing those.
But not having accountability for failures is bad. Let’s stop doing that.
Finally, leaders make or break processes like this. The worst leaders are overly blameful and punish people unfairly, often to cover their own tracks. As we make it back to a healthier culture of appropriate blame, let’s make sure that leaders are held accountable as well.