Blameless culture
Blameless culture is a concept that is simple to state yet challenging to accept and support. Stay calm and keep moving forward!
Regardless of the environment we find ourselves in, blame is a negative behavior that starts in childhood and continues throughout our lives. Have you noticed blame in your own behavior or in the behavior of others? We can’t completely avoid blaming, but we should do our best to minimize it!
Business, especially software engineering, involves taking risks daily. Every decision, change, or new product launch carries the potential for unexpected outcomes. In rare instances these risks lead to unexpectedly positive changes, but that is not the typical outcome: business decisions often mean incurring losses until the correct decision is identified.
There are two natural responses to a mistake:
Identify the cause, find a solution, and learn from the experience to move forward more efficiently.
Identify the responsible party, hold them accountable (perhaps publicly), and force them to solve the issue or compensate for the damage.
The second response is the root of the problem. Whether it stems from culture or pure emotion, we should pause and reconsider before making it our next move.
One command away
Working with sensitive information and infrastructure is risky in any profession; an error committed by an accountant can result in substantial financial losses and inquiries from the IRS. Since I belong to the software engineering department, however, we will focus primarily on mistakes made in that field.
A single terminal command can be the difference between a regular working day and a fatal error that causes significant headaches for a team, an entire company, or sometimes much of the world.
Many people assume that one-command mistakes can only happen in certain situations:
In startups, due to a lack of budget or processes;
In governments or big corporations, where the only conceivable explanation is a hackers' attack or a conspiracy.
Reality is far removed from such complex explanations. In most cases, it's simply a single error made by an individual at a specific time and place. It could be due to a lack of experience, a lack of patience, or just a rush to complete the task, but these mistakes happen to everyone.
Three outages from the past few years are worth mentioning on this topic. They are sorted chronologically.
The Amazon Simple Storage Service incident of 2017 with unpredictable effects
Few people outside the software development bubble were aware of Amazon Web Services (AWS) and its Simple Storage Service (S3) until a catastrophic event occurred on February 28, 2017¹. The official postmortem explains:
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
This led to a major outage in S3. While such incidents are not uncommon, this one was especially concerning because of the widespread reliance on S3 by numerous large companies.
When the outage occurred, it felt like half of the internet was down. It genuinely resembled an apocalyptic movie scenario, only this one originated from the internet itself. Panic spread among users, and companies were left blocked from their own tools. Technically, however, it was a relatively minor outage affecting a single service. On that day we gained a profound understanding of our society's dependence on AWS and its services, and of how swiftly a monopoly can disrupt our lives.
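In its postmortem, AWS wrote that it modified the tool to remove capacity more slowly and added safeguards to prevent capacity from being removed below a subsystem's minimum required level. Below is a minimal sketch of that kind of guardrail, assuming hypothetical names (Fleet, remove_capacity) rather than AWS's actual tooling:

```python
# A minimal sketch of the kind of guardrail AWS described adding after the
# incident: remove capacity slowly, and refuse to go below a subsystem's
# minimum. The names (Fleet, remove_capacity) are hypothetical, not AWS tooling.

class CapacityError(Exception):
    pass

class Fleet:
    def __init__(self, total_servers: int, min_required: int):
        self.total = total_servers
        self.min_required = min_required

    def remove_capacity(self, count: int, batch_size: int = 1) -> None:
        # Safeguard 1: never drop below the minimum required capacity.
        if self.total - count < self.min_required:
            raise CapacityError(
                f"Refusing to remove {count} servers: at least "
                f"{self.min_required} must remain available."
            )
        # Safeguard 2: remove capacity slowly, in small batches, so an
        # operator has a chance to notice a bad command and abort.
        while count > 0:
            step = min(batch_size, count)
            self.total -= step
            count -= step
            print(f"Removed {step} server(s); {self.total} remaining.")

fleet = Fleet(total_servers=100, min_required=80)
fleet.remove_capacity(5)      # succeeds, one server at a time
# fleet.remove_capacity(50)   # raises CapacityError instead of causing an outage
```

The specific code matters less than the design choice: destructive operations should be rate-limited and validated against invariants before they execute.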
A Cloudflare backbone configuration change affected a significant number of users in 2020
Cloudflare, a highly esteemed company, offers a comprehensive suite of services, including DNS, CDN, and security solutions, tailored to domains and businesses. Its generous free-to-use plan has garnered significant attention, particularly among small and medium-sized websites. These websites leverage Cloudflare’s services to safeguard themselves against various security breaches and effectively manage traffic through its CDN.
Since Cloudflare is such a popular service, any outage or issue has an impact comparable to AWS's problems. We experienced this firsthand on July 17, 2020².
As there was backbone congestion in Atlanta, the team had decided to remove some of Atlanta’s backbone traffic. But instead of removing the Atlanta routes from the backbone, a one line change started leaking all BGP routes into the backbone.
Cloudflare experienced a significant traffic drop of 50% across its entire network, affecting 20 regions in the United States and Europe. It's worth noting that the outage was not the result of a hacker attack or a breach; the company maintains strong trust with its customers. As the postmortem put it:
For the avoidance of doubt: this was not caused by an attack or breach of any kind.
Facebook vanished from the internet in 2021
On October 4, 2021³, all of Meta's products (Facebook, Instagram, WhatsApp) mysteriously disappeared from the internet; there is a Wikipedia article with more details. From the users' perspective, the services were simply down. From a technical point of view, however, the IP address prefixes containing Meta's DNS servers were withdrawn from BGP, so people were unable to reach the network at all. Meta's postmortem describes it:
During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.
As usual, some people suspected a hacker attack or government influence. However, the harsh reality struck the software engineering industry with the simplest possible explanation: a single command error. One person executed a command that cost the company nearly $60 million in revenue and wiped roughly $6 billion off Mark Zuckerberg's stock value. In Meta's own words:
This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.
While it may appear to have been a localized issue affecting only Meta, that was not the case. Numerous influencers had their content available exclusively on one of these services. Many small businesses relied on them without any viable alternative, such as shops that used Instagram as their sole sales channel. The most intriguing aspect, however, is that "like" buttons from the most popular services were embedded into numerous websites. Due to an implementation issue (in the author's opinion), these buttons blocked page loads on many sites, affecting popular media platforms, small blogs, and even e-commerce websites.
Postmortems
All the incidents reviewed above were highly critical. Most companies will never face outages of that scale, but smaller incidents are just as critical for a company or a team; even losing a thousand dollars can be critical for a small startup.
Incident resolution is a process, not a form of punishment.
When a problem is identified, and throughout the resolution process, it is crucial to refrain from naming names. We should push back against any attempts to single out individuals and keep our attention solely on resolution. The company's objective is to create a comfortable work environment where employees feel confident in their positions and have the opportunity to grow.
By fostering trust and responsibility, we can empower individuals to contribute their best efforts without fear of reprisal.
Every incident is a loss for a company. It could be a loss in reputation, stock price, income, or all of those. However, it also provides an opportunity for the leadership and team to learn how to prevent similar issues in the future and identify and resolve them promptly.
To capture this learning and communicate with customers, postmortem documents were invented. These documents are straightforward: a few crucial sections accompanied by some free-flowing text.
Questions for postmortem
The following questions can help shape the structure of the document. The list is comprehensive; in your own postmortem, some questions and points can be simplified or omitted.
There are numerous postmortem templates available online, even pre-built into documentation systems such as Notion or Confluence. Feel free to use any template that suits you; a minimal structured sketch also follows the question list below.
Summary
What happened?
When did it occur?
How long did the incident last?
What was the impact on users, customers, or services?
Timeline of Events
What were the key events leading up to the incident?
When was the issue detected?
What actions were taken during the response?
When was the incident resolved?
Root Cause Analysis
What was the underlying cause of the incident?
Was it a technical issue, a human error, or a process failure?
Were there any early warning signs that were missed?
Impact Assessment
How many users or customers were affected?
What specific services or features were impacted?
Were there any financial, reputational, or operational damages?
Response and Mitigation
What was done to contain and resolve the incident?
What worked well during the response?
What could have been handled better?
Lessons Learned
What key takeaways can be applied to prevent similar incidents?
Were there any blind spots in monitoring or alerting?
Did the team follow the correct incident response process?
Preventive Actions & Follow-ups
What changes need to be made to systems or processes?
Are there any automation improvements to prevent recurrence?
What additional training or documentation is needed?
Who is responsible for implementing these changes?
Supporting Data & Attachments
Logs, metrics, graphs, or screenshots showing the incident details.
Communication records (Slack messages, emails, status page updates).
Any related documents or references.
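For illustration, here is a minimal sketch of how the sections above could be captured as a structured record. It is a hypothetical Python representation rather than a standard; the field names simply mirror the headings.

```python
# A hypothetical structured form of the postmortem template above.
# Field names mirror the section headings; adapt them freely.
from dataclasses import dataclass, field

@dataclass
class Postmortem:
    # Summary
    what_happened: str
    occurred_at: str                # e.g. "2021-10-04 15:39 UTC"
    duration: str                   # e.g. "about 6 hours"
    impact_summary: str
    # Timeline of events
    timeline: list[str] = field(default_factory=list)
    # Root cause analysis
    root_cause: str = ""
    missed_warnings: list[str] = field(default_factory=list)
    # Impact assessment
    users_affected: int = 0
    damages: str = ""
    # Response and mitigation
    mitigation_steps: list[str] = field(default_factory=list)
    # Lessons learned
    lessons: list[str] = field(default_factory=list)
    # Preventive actions & follow-ups (action -> owner)
    follow_ups: dict[str, str] = field(default_factory=dict)
    # Supporting data & attachments (links to logs, graphs, chat records)
    attachments: list[str] = field(default_factory=list)
```

A structured record like this helps ensure that every postmortem answers the same questions, while the free-flowing narrative lives in the string fields.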
Learning from incidents
The primary purpose of conducting a postmortem should be to learn from the incident. It must not be an uncomfortable process, and it should not be perceived as a form of punishment. The sole objective of this document is to facilitate learning and the development of more robust processes and solutions.
Why do people feel punished when it’s not a punishment?
The resolution process is crucial. The most important aspect is how we manage issue recovery and resolution.
Do not ask:
Who triggered the issue
Why the issue was triggered
Who identified the issue
Which team or group triggered the issue
Ask:
What happened
How it was triggered
What's affected
What options exist to minimize the impact
How to solve it for customers as soon as possible
How to properly recover
What the timeline is
The questions are essentially the same, but the framing, the focus, and the stress involved vary significantly, and so do the resulting outcomes.
After conducting a postmortem review, we gain insight into how we handled the situation, identify the causes of the issue, and analyze the ways we responded. This review builds confidence in the company's future growth and in its incident resolution capabilities.
We must remember that there are two types of postmortems: internal and public.
Internal postmortems should include technical details, services, IP addresses, infrastructure information, and other relevant data.
Public postmortems are shared openly and should not contain any internal details; instead, they provide a high-level overview of the incident.
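As a hypothetical illustration of that split, a public postmortem can be derived from the internal one by dropping the fields that expose infrastructure details. The field names below are illustrative, not a standard:

```python
# A hypothetical sketch: derive the public postmortem from the internal one
# by removing internal-only fields before publication.
INTERNAL_ONLY = {"ip_addresses", "service_topology", "runbook_links"}

def to_public(internal: dict) -> dict:
    """Return a copy of the postmortem without internal-only fields."""
    return {key: value for key, value in internal.items()
            if key not in INTERNAL_ONLY}

internal = {
    "summary": "A configuration change disconnected our data centers.",
    "timeline": ["15:39 UTC detected", "21:20 UTC resolved"],
    "ip_addresses": ["10.0.0.12", "10.0.0.13"],      # internal only
    "service_topology": "backbone -> edge -> DNS",   # internal only
}
public = to_public(internal)  # keeps only the high-level summary and timeline
```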
Personal growth
To conclude what has been discussed: a company makes personal growth effective by giving individuals opportunities to learn to take calculated risks, while strictly prohibiting unnecessary ones. In addition, it establishes robust processes and instills self-responsibility in employees to minimize the risk of negligence.
By refraining from naming names in every issue and throughout the process, we avoid pinning personal blame for past incidents; instead, we thoroughly review our development and incident response processes.
Healthy management, even though it entails taking some risks, leads teams to success.
Even in this positive post, I must address some situations that may arise and require our attention. It's important to remember that irresponsible individuals may make it through the hiring process. While we hope that interviews filter out most such candidates, it's inevitable that some will slip through, so we need to be prepared to deal with these situations when they occur.
Team managers are responsible for identifying the individuals behind incidents that occur at the risky edge of development or stem from a lack of responsibility. The most effective way to instill personal responsibility is to assign manageable responsibilities that can be supervised by someone else, together with a clear step-by-step guide that makes the teaching effective.
In the unfortunate scenario where an individual refuses to embrace this level of responsibility, there are still two viable options:
Shadowing a team member to manage tasks that don’t directly impact business risk.
Highlighting the issue during the yearly review and collaborating with the other party to find the best solution for both sides.
Regardless of the approach, it is essential for the company to encourage personal growth for each individual based on their aspirations and life goals.
¹ Official AWS postmortem: https://aws.amazon.com/message/41926/
² Official Cloudflare postmortem: https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/
³ Official Meta postmortem: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/