A Look at Atlassian's April 2022 Jira Outage

May 4, 2022
What lessons can you take from this incident for your own organization?

It’s been the talk of the town for the last few weeks: Atlassian suffered a 2-week outage for a small percentage of its customers. Last week they published their Post-Incident Review on the topic. And today, I want to respond to it.

It’s no secret that I’m not a fan of Jira, but that’s not what I’m talking about today. I want to examine this incident from the perspective of what we can learn about handling our own incidents, regardless of which product it relates to.

While I do have some complaints about this Post-Incident Review, I don’t want to start on a negative note, so I’ll first dive into the things I like about this review, and how the incident was reportedly handled. Then at the end, I’ll talk about some of the negatives.

I also won’t be summarizing the incident here; follow the link above if you’re interested in those details.

1. Public Incident Review

So the first thing I love about this incident review is, well... that I can read it! And so can you! Any company willing to publish incident reviews publicly deserves kudos, in my opinion!

Not only that, but Atlassian has a publicly available status page for each of its major products, each of which reports current uptime as well as ongoing and historical incidents. You can still read the original incident report for the 2-week outage.

2. Identification of Multiple Causes

Atlassian identified two causes of the outage:

> Communication gap. There was a communication gap between the team that requested the deletion and the team that ran the deletion. Instead of providing the IDs of the intended app being marked for deletion, the team provided the IDs of the entire cloud site where the apps were to be deleted.
>
> Insufficient system warnings. The API used to perform the deletion accepted both site and app identifiers and assumed the input was correct – this meant that if a site ID is passed, a site would be deleted; if an app ID was passed, an app would be deleted. There was no warning signal to confirm the type of deletion (site or app) being requested.
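
To make that second cause concrete, here is a minimal Python sketch of one way a deletion API could refuse ambiguous input. Everything in it (the `ResourceKind` enum, `DeletionRequest`, `delete_resource`, and the in-memory `REGISTRY`) is hypothetical and invented for illustration; it is not Atlassian's API. The idea is simply that the caller states what kind of thing they expect to delete, and the system checks that claim against what the ID actually refers to.

```python
from dataclasses import dataclass
from enum import Enum


class ResourceKind(Enum):
    APP = "app"
    SITE = "site"


class KindMismatchError(Exception):
    """The ID refers to a different kind of resource than the caller expected."""


# Hypothetical in-memory registry mapping IDs to what they actually are.
REGISTRY = {
    "app-123": ResourceKind.APP,
    "site-456": ResourceKind.SITE,
}


@dataclass(frozen=True)
class DeletionRequest:
    resource_id: str
    expected_kind: ResourceKind  # the caller must say what they think they are deleting


def delete_resource(request: DeletionRequest) -> str:
    actual_kind = REGISTRY.get(request.resource_id)
    if actual_kind is None:
        raise KeyError(f"unknown resource: {request.resource_id}")
    # Refuse to proceed when the ID is not the kind of thing the caller expected.
    if actual_kind is not request.expected_kind:
        raise KindMismatchError(
            f"{request.resource_id} is a {actual_kind.value}, "
            f"but the request expected a {request.expected_kind.value}"
        )
    del REGISTRY[request.resource_id]
    return f"deleted {actual_kind.value} {request.resource_id}"
```

With a check like this, a request built as `DeletionRequest("site-456", ResourceKind.APP)` raises `KindMismatchError` and leaves the site untouched, instead of silently deleting it.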

I’ll have a bit more to say on this in a moment.

3. A Clear Incident Response Procedure

> Once the incident was confirmed … we triggered our major incident management process and formed a cross-functional incident management team. The global incident response team worked 24/7 for the duration of the incident until all sites were restored, validated, and returned to customers. In addition, incident management leaders met every three hours to coordinate the workstreams.

Every company, no matter how large or small, ought to have an incident response procedure in place. It doesn’t need to be as exhaustive as Atlassian’s, but the last thing you want happening is panic and second guessing in the middle of an emergency.

4. Several Action Points

The goal of every incident response plan, in my view, should be to [solve the problem in multiple ways](/posts/solve-every-problem-twice/). Atlassian did this, with four high-level actions, ranging from implementing global soft-deletes, which would have made recovery from this type of error incidental, to better approaches to handling ongoing incidents like this one.
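
If soft-deletes are new to you, the pattern is roughly this: instead of destroying a record, you mark it as deleted and only purge it after a retention window has passed. Below is a minimal Python sketch. The `Site` class, the helper functions, and the 30-day `RETENTION` value are all assumptions of mine for illustration, not details from Atlassian's report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=30)  # assumed retention window, purely illustrative


@dataclass
class Site:
    site_id: str
    deleted_at: Optional[datetime] = None  # None means the site is live

    @property
    def is_deleted(self) -> bool:
        return self.deleted_at is not None


def soft_delete(site: Site, now: datetime) -> None:
    """Mark the site as deleted without destroying any data."""
    site.deleted_at = now


def restore(site: Site) -> None:
    """Undo a soft delete by clearing the marker."""
    site.deleted_at = None


def can_purge(site: Site, now: datetime) -> bool:
    """Permanent removal is only allowed once the retention window has passed."""
    return site.is_deleted and now - site.deleted_at >= RETENTION


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    site = Site("site-456")
    soft_delete(site, now)
    print(can_purge(site, now + timedelta(days=1)))  # still inside the window
    restore(site)
    print(site.is_deleted)
```

The point of the design is that a mistaken deletion becomes a one-line `restore` rather than a restore-from-backup project.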

The report goes into much more detail, and they did several other things right, but I leave it as an exercise to the reader to go through the full report if interested.

So what did I not like about this report? There are a few things…

1. The Preamble Sounds Like Bullshit

The PIR starts with a short letter from Atlassian co-CEOs Mike Cannon-Brookes and Scott Farquhar. In the letter, they explain that “one of our core values is ‘Open company, no bullshit’.” Then they go on to explain why Atlassian is so reliable and trustworthy, practically ignoring the incident that had just occurred:

> Rest assured, Atlassian’s cloud platform allows us to meet the diverse needs of our over 200,000 cloud customers of every size and across every industry. Prior to this incident, our cloud has consistently delivered 99.9% uptime and exceeded uptime SLAs. We’ve made long-term investments in our platform and in a number of centralized platform capabilities, with a scalable infrastructure and a steady cadence of security enhancements.

I’m reminded of a news story on my local TV station many years ago. There had been a shooting at a convenience store, across the street from a police station. The news reporter was interviewing a police officer and asked “Should citizens be worried for their safety here?” and his reply was “No, of course not. It’s very safe. There’s a police station across the street.” 🤦

2. No Apology

I would expect the letter from the CEOs to include an apology. The closest they come, however, is to say “For those customers affected, we are working to regain your trust.”

3. “Communications Gap”

One of the two "problems" Atlassian determined led to the failure was a "Communications Gap." Based on their description, this does not appear to be at all inaccurate. However, a word of warning: This comes dangerously close to putting the blame on "human error." In fact, the very specific humans responsible for this error are identified:

> Instead of providing the IDs of the intended app being marked for deletion, the team provided the IDs of the entire cloud site where the apps were to be deleted.

Atlassian touts a no-blame postmortem culture, and without being involved in the incident, I have no evidence that suggests they don’t actually operate this way. So I only offer this as a word of caution.

“Human error” is never the cause of failure in a complex system. The real failure is that the system allowed a human to do the wrong thing. And to Atlassian’s credit, they did treat it that way! Their prescribed solutions did not involve “educate people to use the correct IDs next time.” Instead, they involved “modify the system to warn about unusual things” and “modify the system to allow easy undeletion in case of a future problem like this.”
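
What might "warn about unusual things" look like in practice? One common approach is a blast-radius guard: a bulk operation that touches more than some threshold of records refuses to run until the operator explicitly confirms the exact count. Here is a minimal Python sketch of that idea. The function name, the exception, and the threshold of 10 are my own assumptions for illustration, and I am not claiming this is what Atlassian built.

```python
from typing import Iterable, List, Optional

MAX_UNCONFIRMED = 10  # assumed threshold; tune to what counts as "unusual" for you


class ConfirmationRequired(Exception):
    """The operation is large enough that it needs an explicit confirmation."""


def plan_bulk_delete(ids: Iterable[str], confirmed_count: Optional[int] = None) -> List[str]:
    """Return the IDs to delete, or raise if a large batch has not been confirmed.

    For batches above the threshold, the operator must pass the exact number
    of records they expect to delete, which forces them to look at the scope.
    """
    targets = list(ids)
    if len(targets) > MAX_UNCONFIRMED and confirmed_count != len(targets):
        raise ConfirmationRequired(
            f"this would delete {len(targets)} records; "
            f"re-run with confirmed_count={len(targets)} to proceed"
        )
    return targets
```

A small, routine deletion passes straight through, while an unexpectedly large one stops and tells the operator how big it really is. The human can still make the mistake, but the system no longer makes it easy.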

4. Poor Communication During (and After?)

One issue that has gotten a lot of "press" (or at least social media ranting) has been the lack of communication from Atlassian during the incident, both publicly and directly to the affected customers. As explained in the report, this was made worse by the fact that Atlassian had actually _deleted their contact details_ for several of the affected customers. This meant they could not even notify these customers until they had restored their data (or at least some of it). And what's worse, since these customers were deleted from the system, they didn't even have access to open a support request.

The report goes into much more detail on this, and admits that they could have been more proactive in contacting the affected customers.

So in short, Atlassian suffered a rather complicated incident. They identified several legitimate technical actions to be taken to help eliminate or reduce the impact of such incidents in the future. However, their communication during, and even after, the incident leaves a lot to be desired.

What lessons can you take from this incident for your own organization? Don’t wait until you have your own outage. What small steps can you take today to help prevent such a disaster in your own future?
