Facebook’s famous motto, “Move fast and break things,” officially left the building in 2014, but that hasn’t stopped a large number of people on social media from criticizing Facebook for this old motto in light of recent major outages.
Since Facebook no longer lives by this motto, it’s a bit of a straw-man argument to begin with. But I want to defend this straw man anyway, with a bit of a contrarian view.
The assumption being made by those criticizing the “move fast and break things” motto is that the cost of a 6-hour outage is too high—that too many things broke.
Is this justified?
I don’t know. And neither do you, unless perhaps you work closely with Mark Zuckerberg.
You see, this type of judgement is the result of an ad-hoc ROI calculation. The problem with this armchair quarterbacking is that the public is only privy to one piece of data necessary for that calculation: We know (or can estimate) the cost of a single failure.
In ROI terms, that is part of the investment variable. Let’s simplify with round numbers, since we’re defending a straw man anyway, and say that a single outage costs US$1 billion.
“Oh my stars! That’s too expensive! It’s obviously a bad idea!” some might say (are saying).
If it’s not clear yet, the problem with this is that we have no idea of the return.
If Facebook can lose $1B in a single outage, it stands to reason that their earnings potential is also astronomical. What if the fast moving that caused the broken things also earned $50B that would not have been earned by acting more cautiously?
In this light, a $1B “investment” (in the form of an outage) to earn $50B seems like a bargain.
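To make the arithmetic concrete, here is a minimal sketch of that back-of-the-envelope ROI calculation in Python. The `roi` helper is a hypothetical name of my own, and the $1B and $50B figures are the made-up round numbers from above, not real Facebook financials.

```python
def roi(gain: float, cost: float) -> float:
    """Return on investment as a ratio: net gain divided by cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


# Straw-man round numbers from the text, not real data.
outage_cost = 1e9       # the "investment": one outage at US$1 billion
extra_earnings = 50e9   # the hypothetical return from moving fast

ratio = roi(extra_earnings, outage_cost)
print(f"ROI: {ratio:.0%}")
```

The point is not the exact ratio but that it cannot be computed at all without `extra_earnings`, which is precisely the number the public does not have.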
In mature businesses, outages are expected. There’s a failure or outage budget. Management works to keep reliability at expected levels. Not too unreliable, because then business suffers, but, importantly, also not too reliable, because reliability is expensive.
John Allspaw and Paul Hammond make this point in their famous presentation 10+ Deploys Per Day by referencing World of Warcraft’s at-the-time dismal uptime:
If the business requires that the site go down every 2 weeks, even though you’re the largest online gaming platform and you have millions of paying customers, those paying customers might be quite fine for you to have availability of 97%.
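To put that 97% figure in perspective, here is a rough sketch that converts an availability target into an outage budget. The `allowed_downtime_hours` function is a hypothetical helper for illustration, and the two-week window is simply taken from the quote.

```python
def allowed_downtime_hours(availability: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given availability (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0.0 and 1.0")
    return (1.0 - availability) * period_hours


two_weeks = 14 * 24  # hours in the two-week window from the quote
budget = allowed_downtime_hours(0.97, two_weeks)
print(f"Outage budget at 97% over two weeks: {budget:.1f} hours")
```

At 97%, that works out to roughly ten hours of permitted downtime every two weeks, which is more than a single 6-hour outage.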