A Look at Atlassian's April 2022 Jira Outage
May 4, 2022
What lessons can you take from this incident for your own organization?
It’s been the talk of the town for the last few weeks: Atlassian suffered a 2-week outage for a small percentage of its customers. Last week they published their Post-Incident Review on the topic. And today, I want to respond to it.
It’s no secret that I’m not a fan of Jira, but that’s not what I’m talking about today. I want to examine this incident from the perspective of what we can learn about handling our own incidents, regardless of which product it relates to.
While I do have some complaints about this Post-Incident Review, I don’t want to start on a negative note, so I’ll first dive into the things I like about this review, and how the incident was reportedly handled. Then at the end, I’ll talk about some of the negatives.
I also won’t be summarizing the incident here; follow the link above if you’re interested in those details.
1. Public Incident Review
Not only that, but Atlassian has a publicly available status page for each of its major products, which reports current uptime as well as ongoing and historical incidents. You can still read the original incident report for the 2-week outage.
2. Identification of Multiple Causes
Communication gap. There was a communication gap between the team that requested the deletion and the team that ran the deletion. Instead of providing the IDs of the intended app being marked for deletion, the team provided the IDs of the entire cloud site where the apps were to be deleted.
Insufficient system warnings. The API used to perform the deletion accepted both site and app identifiers and assumed the input was correct – this meant that if a site ID is passed, a site would be deleted; if an app ID was passed, an app would be deleted. There was no warning signal to confirm the type of deletion (site or app) being requested.
I’ll have a bit more to say on this in a moment.
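To make the "insufficient system warnings" point concrete, here is a minimal sketch of a deletion entry point that refuses to guess what kind of thing it was handed. This is not Atlassian's API. `TargetKind`, `Target`, `request_deletion`, and the in-memory `registry` are all hypothetical names invented for illustration. The idea is simply that the caller must state whether they intend to delete an app or a site, and the system checks that claim against what the ID actually refers to.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class TargetKind(Enum):
    APP = "app"
    SITE = "site"


@dataclass(frozen=True)
class Target:
    id: str
    kind: TargetKind


class KindMismatchError(Exception):
    """The caller's stated intent does not match what the ID refers to."""


def request_deletion(
    target_id: str,
    expected_kind: TargetKind,
    registry: dict[str, Target],
) -> Target:
    """Return the target to delete, but only if it is the kind the caller expects.

    `registry` is a hypothetical stand-in for whatever system of record
    maps an ID to the thing it identifies.
    """
    target = registry[target_id]  # raises KeyError for an unknown ID
    if target.kind is not expected_kind:
        raise KindMismatchError(
            f"{target_id} is a {target.kind.value}, "
            f"but the request said {expected_kind.value}"
        )
    return target
```

With a check like this, a request that says "delete these apps" but carries site IDs fails loudly at the first ID instead of quietly deleting sites.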
3. A Clear Incident Response Procedure
Every company, no matter how large or small, ought to have an incident response procedure in place. It doesn’t need to be as exhaustive as Atlassian’s, but the last thing you want is panic and second-guessing in the middle of an emergency.
4. Several Action Points
The report goes into much more detail, and they did several other things right, but I leave it as an exercise for the reader to go through the full report if interested.
So what did I not like about this report? There are a few things…
1. The Preamble Sounds Like Bullshit
Rest assured, Atlassian’s cloud platform allows us to meet the diverse needs of our over 200,000 cloud customers of every size and across every industry. Prior to this incident, our cloud has consistently delivered 99.9% uptime and exceeded uptime SLAs. We’ve made long-term investments in our platform and in a number of centralized platform capabilities, with a scalable infrastructure and a steady cadence of security enhancements.
I’m reminded of a news story on my local TV station many years ago. There had been a shooting at a convenience store, across the street from a police station. The news reporter was interviewing a police officer and asked “Should citizens be worried for their safety here?” and his reply was “No, of course not. It’s very safe. There’s a police station across the street.” 🤦
2. No Apology
3. “Communications Gap”
Instead of providing the IDs of the intended app being marked for deletion, the team provided the IDs of the entire cloud site where the apps were to be deleted.
Atlassian touts a no-blame postmortem culture, and without being involved in the incident, I have no evidence that suggests they don’t actually operate this way. So I only offer this as a word of caution.
“Human error” is never the cause of failure in a complex system. The real failure is that the system allowed a human to do the wrong thing. And to Atlassian’s credit, they did address this! Their prescribed solutions did not involve “educate people to use the correct IDs next time”. Instead, they involved “modify the system to warn about unusual things” and “modify the system to allow easy undeletion in case of a future problem like this.”
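The "easy undeletion" idea is usually called a soft delete: deleting moves the record aside instead of destroying it, and the real purge only happens after a waiting period. Below is a rough sketch of that pattern, not Atlassian's implementation. The `SoftDeleteStore` class, its method names, and the 30-day `RETENTION` window are all assumptions made up for this example.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# Assumed retention window; pick whatever suits your own recovery needs.
RETENTION = timedelta(days=30)


class SoftDeleteStore:
    """Toy in-memory store where delete is reversible until purged."""

    def __init__(self) -> None:
        self._live: dict[str, object] = {}
        # key -> (value, time of deletion)
        self._trash: dict[str, tuple[object, datetime]] = {}

    def put(self, key: str, value: object) -> None:
        self._live[key] = value

    def get(self, key: str) -> object | None:
        return self._live.get(key)

    def delete(self, key: str, now: datetime | None = None) -> None:
        """Move the record to the trash instead of destroying it."""
        now = now or datetime.now(timezone.utc)
        self._trash[key] = (self._live.pop(key), now)

    def restore(self, key: str) -> None:
        """Undo a delete; raises KeyError if the record was already purged."""
        value, _deleted_at = self._trash.pop(key)
        self._live[key] = value

    def purge_expired(self, now: datetime | None = None) -> list[str]:
        """Permanently drop records deleted at least RETENTION ago."""
        now = now or datetime.now(timezone.utc)
        expired = [
            key
            for key, (_value, deleted_at) in self._trash.items()
            if now - deleted_at >= RETENTION
        ]
        for key in expired:
            del self._trash[key]
        return expired
```

The point of the design is that a mistaken `delete` costs one `restore` call rather than a multi-week rebuild, as long as someone notices within the retention window.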
4. Poor Communication During (and After?)
The report goes into much more detail on this, and admits that they could have been more proactive in contacting the affected customers.
So in short, Atlassian suffered a rather complicated incident. They identified several legitimate technical actions to be taken to help eliminate or reduce the impact of such incidents in the future. However, their communication during, and even after, the incident leaves a lot to be desired.
What lessons can you take from this incident for your own organization? Don’t wait until you have your own outage. What small steps can you take today to help prevent such a disaster in your own future?