Blameful Post-Mortems

Stay SaaSy

Mar 13

Fault isn't a four-letter word.

6 Comments

Adler Hsieh

Mar 15 (Edited)

Thanks for sharing. It’s a very interesting point of view.

The purpose of a post-mortem is to prevent similar issues from happening again. If leaders are able to improve processes so incidents don’t happen again, I think blameless is a better way to promote a healthy culture.

But sure, holding someone accountable is a good solution if incidents usually end without any improvements or action items.

Aaron Erickson

Mar 14

Thanks for writing this. Blameless always felt like a sort of reverse-participation trophy kind of system that allowed you to hide behind the concept to dodge accountability. I've been in too many of these things where the reality was people just didn't want to follow established guidelines and rules and broke stuff out of laziness or lack of giving a shit. And yes, there should be consequences for that.

That doesn't mean you should start doing witch hunts, but it does mean we should drop the pretense: repeatedly breaking rules and subverting the processes put in place to avoid breakages should absolutely be called out, and yes, people who repeatedly break those norms should be exited with haste.

Aleksey Shevchuk

Mar 13

Overall, I like and support your point, but I want to challenge the idea that it is always someone's fault. Many times, I have seen incidents whose cause was outside realistically achievable control. For example, one of my DBs started to struggle with corrupt files, and ops blamed me until I investigated and presented them with evidence that the problems happened only on exactly one model of Barracuda SSD! That model just happened to misrepresent its health stats. The only possible blame could be that ops might have built their tracking systems to make it evident that a bunch of incidents shared a single inventory name... but even for a huge ops team like ours, was that realistic in 2014 or something?

Stay SaaSy

Mar 13

I don't know all the details, but ultimately someone picked the hardware to run on, someone built the tracking system, and someone is using the DB. While some incidents are outside "realistically achievable control", that doesn't make them not someone's responsibility.

For example, if I pick a piece of hardware with a known failure rate of 1/100000 and we have that failure, that's my fault. Now, it might be fine; we might just replace that hardware and move on, and it's an acceptable failure, but it's still my responsibility. And if that piece of hardware starts failing at a 1/100 rate, it's definitely on me to fix it.

Aleksey Shevchuk

Mar 13

We had our own test facility and had been testing prototypes from Intel and other prominent manufacturers. In another incident, we got hundreds of Intel CPUs with a bug that caused Java locks to malfunction under pressure. Both incidents were of the once-in-a-lifetime kind. Many small adjustments have been made to pragmatically reduce the chance of similar troubles happening again or to enable early mitigation, but no one has been blamed. Blaming people for such mishaps would result in a far more harmful slowdown in operations, because people would try to overprotect themselves.

Stay SaaSy

Mar 13

It can be your fault and not be a big deal.

But sometimes it is.

You had two once-in-a-lifetime incidents happen. It’s funny: I’ve been a part of about 30 different once-in-a-lifetime incidents. People tell themselves that these incidents are so, so out of left field, but the reality is different. Catastrophic failures can and do happen, and it should be someone’s job to prevent them, and to mitigate and manage them when they do happen.
