Tiny DevOps episode #16 Lukas Vermeer — Can small companies do effective A/B testing?
October 28, 2021
In this episode I speak with Lukas Vermeer, former head of experimentation at Booking.com, and currently working with Vista. He answers the question of whether A/B testing makes sense in small companies and startups, and with small numbers of customers. We also discuss the broader topic of experimentation in general, and applying the scientific method to business development.
Resources
Dutch TV interview with Edsger Dijkstra in which he expounds his theory on software versions
Edmond Halley on Wikipedia
Book: Experimentation Works: The Surprising Power of Business Experiments by Stefan H. Thomke
Book: Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing by Ron Kohavi, Diane Tang & Ya Xu
Book: Field Experiments: Design, Analysis, and Interpretation by Alan S. Gerber & Donald P. Green
Book: The Structure of Scientific Revolutions by Thomas S. Kuhn
Guest
Lukas Vermeer
https://lukasvermeer.nl/
lukas@lukasvermeer.nl
Transcript
Jonathan: Hello, welcome to Tiny DevOps. Quick, before we get started, tell me your opinion. Do you prefer A,
Speaker 2: Ladies and gentlemen, the Tiny DevOps guy.
Jonathan: Or B.
Speaker 2: Ladies and gentlemen, the Tiny DevOps guy.
Jonathan: Today, we're talking about A/B testing.
[music]
Hello, and welcome to another episode of Tiny DevOps, where we believe you don't need a thousand engineers to do amazing DevOps. I'm your host, Jonathan Hall, and today we're going to talk about A/B testing on small teams. My guest today used to run A/B testing at Booking.com and he's now working with Vista. Lukas, welcome to the show. Would you introduce yourself briefly?
Lukas Vermeer: I don't know how much introduction you want, but I'm Lukas. I was responsible for A/B testing, or experimentation in general, at Booking for eight years. Now, I've joined a company called Vista, which is also an online presence, but they have physical factories where they print stuff like t-shirts and caps and business cards and paper. It's a whole new world for me because it suddenly opens up this question of how you do experimentation on the factory floor, which is fascinating.
Jonathan: This may be a little bit of an unusual topic, considering that your expertise, or at least the part of your expertise I want to talk to you about, is A/B testing for big companies. I think there's an overlap here, and it comes from the last two companies I've worked with. The very last one was a startup with zero clients at the beginning, zero customers, but they were already asking to do A/B testing. They tasked my team with, "Give us the infrastructure to manage the data to run our A/B tests."
Of course, my thought is, "You have zero customers. What are you going to be testing?"
Then we had a similar situation at a company before that, where they didn't have many clients. They were selling e-commerce stuff, and we had a few thousand visitors per day. I was hoping you could provide some insight for anybody who might be in a similar situation: they're on a small team, or maybe it's not the size of the team so much that matters here as how big the customer base is or how much traffic they're getting. Is A/B testing valuable in that situation?
Lukas: This is one of my favorite topics. The first thing I'd like to point out is that people talk about Booking.com as an example of a company that is very large and can therefore do experiments, but the reality is that Booking.com started experimentation in 2005, when they had a total of five developers in the entire company. The whole idea that you can only afford to experiment when you're as big as Booking is just a fallacy, and the counterexample is Booking.com itself: they started running experiments when they were really, really, really small.
I think there is an argument to be made that one of the reasons Booking got so big is the fact that it started experimenting small. I think we should stop using Booking and some of those other companies as examples of "experimentation only works when you're big." The other thing I'd like to say is that when you're as big as Booking.com, you get a lot of traffic. In fact, from a business point of view, you actually do need that much traffic in order to pick up effects that are interesting. The reason I point that out is that in a business context, we are usually interested in the absolute magnitude of the effect.
How many additional dollars are we making per day, or how many customers are we helping? The weird thing is that the way you would pick this up in an experiment is governed by a concept called statistical power: the detectable effect size shrinks with the square root of your traffic, which means that if you want to pick up an effect that is half the size, you need to quadruple your traffic. Now, from a business point of view, if you quadruple your traffic, then the impact of a half-size effect is still double what you could measure if you had less traffic.
From a giant business point of view, this is actually working against them. The more traffic you have, the more money there is on the line, and the more difficult it becomes to actually pick up something that is meaningful. I actually think if you reason this through, that smaller companies actually have the advantage because yes, a smaller company cannot pick up a 0.1% difference in the conversion rate or revenue or whatever they are interested in, but they also, from a business point of view, should not be interested in a 0.1% difference because that's not going to make or break their business.
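As a rough sketch of the square-root relationship Lukas describes (not from the episode; the baseline rate and lifts are invented numbers), here is a standard two-proportion sample-size approximation in Python showing that halving the effect you want to detect roughly quadruples the traffic you need:

```python
# Standard sample-size approximation for comparing two conversion rates.
# The 5% baseline and the 10% / 5% relative lifts are illustrative numbers only.
from scipy.stats import norm

def visitors_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed in each group to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_power = norm.ppf(power)          # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2

big_effect = visitors_per_variant(0.05, 0.10)    # 10% relative lift on 5% conversion
small_effect = visitors_per_variant(0.05, 0.05)  # half the effect size
print(round(big_effect), round(small_effect))
print(round(small_effect / big_effect))          # the ratio comes out close to 4
```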
A startup should be growing double digits. They should be looking at those things that are impacting the business by 10% or 20%. Those impacts are very, very easy to measure using a controlled experiment. I would say from that point of view, not only is it a fallacy to point to Booking.com and say, "Oh, yes, they're big, they can run experiments," because Booking started when they were smaller, it's actually easier when you're smaller. I would say especially small companies should be running experiments. Now, where's the overlap with DevOps?
I'll get to that after this little rant. The way that most of these companies make experimentation at scale work is that they decouple code release from feature release. In a traditional IT setting, the moment you deploy the code is also the moment that the feature goes live. This is problematic for many reasons, especially when you scale out. I think one of the genius things about the DevOps movement is that you're making individual product development teams own the feature when it goes into production. The way that you usually do that is you decouple code release from feature deployment through some toggling mechanism. That could be something that you built in-house or some SaaS solution that you use; anything that allows you to put code in production without making the features go live.
Then at any convenient point, a point that is convenient to the dev team, you say, "Actually now, I want the feature to be shown to customers." That's when you toggle it on.
Now, that toggling mechanism is also the foundation of A/B testing because when we're putting these A/B testing platforms into production, we need the ability to control treatment assignment. That is we need the ability to decide in the moment for an individual user, are we going to show them the new version or the old version, the A, or the B? To be able to do that, the code actually needs to be able to do exactly what you need for a DevOps movement, which is, it needs to be able to run both the old and the new version of any change and any combination of changes that have been entered into the code.
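As a minimal sketch of the toggling and treatment-assignment mechanism Lukas describes (not Booking's or Vista's actual tooling; the experiment name and 50/50 split are hypothetical), both code paths ship to production and a deterministic hash of the visitor ID decides which one each visitor sees:

```python
# Minimal feature-toggle / A/B assignment sketch: code release is decoupled from
# feature release, and the same hash doubles as treatment assignment.
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into 'A' (control) or 'B' (treatment)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100      # stable 0-99 bucket per user and experiment
    return "B" if bucket < 50 else "A"  # 50/50 split; adjust the threshold to ramp up

# The same visitor always lands in the same group, so the new code can sit dark in
# production and be switched on, or A/B tested, independently of the deploy.
for visitor in ["visitor-1", "visitor-2", "visitor-3"]:
    print(visitor, assign_variant(visitor, "new-search-engine"))
```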
From that point of view, there's a very, very close technical tie, I think, between the DevOps movement and the experimentation infrastructure. The other one, I think, is that the DevOps movement, and I'm an outsider, my reading of it after reading, I don't know, The Phoenix Project and Lean Startup and all these books, is that it puts a lot of emphasis on providing development teams with direct feedback on how their features are performing in production, not throwing it over the wall and saying, "Oh, the operations team will figure out how to make this scale" or "It does exactly what it says in the specifications. My work is done." Actually getting these development teams to care about and to monitor and to check: when we put this in production, does it actually do what we want it to do?
That is one of the things experimentation is also very much interested in, because the whole point of doing experimentation in a business setting is that when we're making these changes to products, we often don't know how they will behave in production. Because, and this is a little pun, please excuse me, we often talk about unit testing and then integration testing and systems testing. The idea behind the integration test is you get all the systems that are involved in the entire operation of the product and you test everything at once. Most consumer-facing and even B2B-facing products have a user that is supposed to be doing something with the software.
I would argue that unless you have run a test that included that user in its natural habitat, you have not actually done an integration test. You've done a very extensive unit test where you tested the software component of your application, but you have not yet tested end to end what actually happens when I put a human brain, one that I do not control, not a QA engineer that's following a script, but an actual customer or an actual business user, in front of the software, let them use it without additional instruction, and then see whether they actually interpret the software the same way that we thought they were going to. You'll find that in many cases they don't. It's very difficult to predict how users will behave.
Jonathan: That's a great rant or introduction. I love that. I love that.
Lukas: The thing is, this is a spiel. I have done this 1,000 times: I consult for clients, I do conferences, I run my own side business as an independent contractor, and all of them need to hear this again and again. That's also why I'm here, because I think there's too much, dare I say, Dijkstraism in software development. Dijkstra famously said he didn't understand why software needs versions, because you write the spec, then you write the proof, then you write the code, and then it's done. We find that laughable now. We say, "Well, obviously that's not how you actually develop software," but I ask you, "Why? Why do we not develop software that way?"
If it was true that the world was predictable and we could predict how the software was going to be used, that's exactly what we would do, it's very efficient. You write the proof for your software, you put in production, done. The reality is that we're interacting with these human systems, the human customers or human users, that we do not fully understand. It's a fallacy to think that we can write the full proof if we don't understand the black box that is the human brain.
Jonathan: Yes. It's a fallacy that we need millions of visitors per day to start doing A/B testing. Is there a floor, is there a minimum that we need?
Lukas: Kind of. This is a trick question, Jonathan, you're setting me up for something.
Jonathan: I know.
Lukas: Let's go back a little bit. The methods that we're using for A/B testing, at least when we're looking at frequentist statistics, were designed in large part by a man whose name we didn't know for a long time, I think we know his name now, because he published under the pseudonym "Student." He was the man who invented Student's t-test. He invented that t-test while he was running experiments on yeast. He was working for Guinness, and Guinness didn't want him to publish under the name Guinness; that's why he was using a pseudonym.
He was running experiments where the number of participants, or Petri dishes, was on the order of 20. If you look at the psychology community right now, where 20 is still the norm for a lot of experiments, there are discussions on whether that should be raised to 50 participants. You have to keep in mind that, in those experiments, we are looking at very immediate things that we can measure and very large effects. The smaller the effects, or the longer the delay between treatment and effect, the more participants we will need to effectively measure something.
Like I said earlier, if you are a business like booking.com or Vista, and you want to measure impact on lifetime value of a customer, this is such a long and delayed metric. You want to measure 1% or half a percent of difference. This is such a small effect, that those things combined means that you really need a lot of traffic. If you're a much smaller business and you say, "Look, we are changing rapidly. We are making lots of changes. We want to make sure that our website isn't drastically breaking business performance. By drastically, we mean we don't want to lose more than 5% of our business. By business performance, we mean we don't want to lose money the same day."
I'm now severely constraining both the length of time it takes me to measure, that is, the distance between treatment and effect, and the size of the effects that I'm interested in detecting.
In those cases you can get away with hundreds, maybe thousands of participants for each trial. That, for a lot of startups, should already be attainable. Unless you're at some sort of B2B startup where you have five customers, which happens, in which case I would just call them up, if there's only five of them. If you're in the hundreds or even in the thousands as a startup, you can already start experimenting. You have to keep in mind that you have set out on this path saying that 5% is the meaningful difference that you're interested in, or maybe even 10.
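To make that concrete (the visitor counts and conversion rates below are invented, not figures from the episode), a large, same-day effect can be picked up by a plain two-group test with only a few hundred visitors per variant:

```python
# A big, immediate effect is detectable with small traffic: roughly 30% vs 40%
# same-day conversion, 300 visitors per group. All numbers are made up.
from scipy.stats import fisher_exact

converted_a, visitors_a = 90, 300    # control: 30% convert
converted_b, visitors_b = 120, 300   # treatment: 40% convert

table = [[converted_a, visitors_a - converted_a],
         [converted_b, visitors_b - converted_b]]
_, p_value = fisher_exact(table)
print(f"p-value = {p_value:.3f}")    # comfortably below 0.05 despite only 600 visitors
```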
Anything smaller than that, I cannot hope to pick up. That's fine, from a business point of view that's not so interesting but it means I can't start changing the color of a button and hoping to find a 0.5% increase in conversion. Much rather, you would be looking at, "Hey, we're thinking about replacing this backend system by a new search engine and we're worried that it might actually delay performance or that it would degrade the customer experience." In those cases, you have reasons outside of the experiment to want to make this change.
You're not looking for a lift per se, but you might decide not to proceed if it actually does cost you 5% of your business. Within Booking.com, we changed our tooling so that we were able to run a type of test that's called a non-inferiority trial, which is also used in medical trials when they want to replace an existing type of medication with a generic, saying, "This generic should do exactly the same thing, it's just a hundred times cheaper." They run a trial where the intent is not at all to improve the efficacy of the drug. The intent of the trial is to show that it's not meaningfully reducing patients' response to the drug. You almost run an opposite experiment. The point of the experiment is to show that there is not a 1% drop, or a 5% drop, or whatever you're interested in.
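Here is a simplified sketch of that non-inferiority style of analysis (all counts and the margin are invented): instead of hunting for a lift, you check whether the confidence interval around the observed difference rules out a drop larger than a pre-registered margin.

```python
# Simplified non-inferiority check for swapping in a new backend: can we rule
# out a drop bigger than the margin we agreed on up front? Numbers are made up.
import math

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% confidence interval for the difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

margin = -0.05  # pre-registered: we accept at most a 5-percentage-point drop
low, high = diff_confidence_interval(conv_a=820, n_a=2000, conv_b=810, n_b=2000)
if low > margin:
    print(f"CI ({low:.3f}, {high:.3f}) stays above the margin: non-inferior, proceed")
else:
    print(f"CI ({low:.3f}, {high:.3f}) cannot rule out a {abs(margin):.0%} drop: hold off")
```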
Jonathan: Fascinating. If we're trying to measure something-- You already started to answer this but suppose you only have five customers, or you haven't launched yet and you hope to have 1,000 in six months but you want to start making data-driven decisions. Are there alternatives to consider, other than just calling all your customers? [laughs]
Lukas: It's funny, I was talking about this this morning. I think we have to distinguish between the statistical and the technical apparatus that is A/B testing. The statistical being: an A/B test is another word for a randomized controlled trial, where we flip a coin to decide who gets what, then we treat them in some way, then we compare those two groups, A and B, using some type of statistical test. We check whether we see a difference that is unexpected. That's the statistical mechanism behind randomized controlled trials.
The technical aspect being: we need the code to be able to do both A and B, we need to be able to identify customers and assign them and measure. Those are the statistical and the technical apparatus. At the meta level, on top of this, there is the scientific method, which is the idea that I have an idea, I come up with a way that I can test this idea, and I write that down so that I can show that this was my idea and this is what I was going to do; it's the methods section. Then I run the test, I check the results, and I check them against what I had written up front.
No cheating; I test against what I wrote down beforehand. Then I document and share my results and show them to other people. This is the scientific method, and it is agnostic to what type of test is under the hood. One of my favorite examples is Edmond Halley, of Halley's Comet. He started thinking about why comets seem to appear at a regular interval. He was looking at the table data of when these comets appeared and he kept seeing these repeating patterns. He said, "What if they are actually on a loop, like the earth is moving around the sun, but the loop is just much, much larger?"
He found this one comet that seemed to be coming back roughly every 76 years. He said, "Well, if my theory is correct, that this is on a loop, then this comet should reappear on that particular day, in that particular position." He made a very precise prediction of when the comet would appear. Sadly, he passed away before it returned, because 76 years is a long, long time.
Other astronomers were still watching the skies by the time his prediction came around, and it turned out that he had predicted, I believe, that the comet would appear in August of a particular year, and nothing happened in August, September, October, November, or the beginning of December.
Astronomers said, "Well, you know what, this Edmond Halley guy, he's full of crap. He predicted that this comet was appearing and it didn't appear." Until, on the 23rd of December, of that year, the comet finally did appear. In hindsight, we know this is because the comet passed one of the outer planets. The gravity of one of the outer planets slowed down the trajectory of the comet so much, that it arrived four or five months later. It did appear and that's when the astronomers say, "Actually, you know what, this Halley guy, he might be on to something." His prediction for the comet was so precise, as in, it was precisely this year, and also, the way that the comet looks is exactly what Halley had described. Therefore, there might be merit to his theory.
I like that example because it shows that the scientific method of, "I have a hypothesis, I make a prediction, then I test that prediction and I validate against what I predicted", that this works, even if you are dealing with n=1. One comet, in an uncontrolled setting, and my dataset is limited to one timestamp. That was all that Edmond Halley predicted. The reason it works, in this case, is because his prediction was very precise, he predicted the particular year, and month, and location, but also because it's very unlikely that comets appear.
If Edmond Halley had said, "Well, you know what, somewhere in the next 200 years, I don't know which year precisely, a comet's going to appear somewhere in the northern hemisphere," that would not have been precise enough for us to use it as evidence for the theory. If, on the other hand, comets were a daily occurrence and they showed up every day, then it also wouldn't work; we couldn't distinguish the special day on which the comet does appear. The combination of something that is very unlikely to occur and something that is very precisely predicted is what makes the method work, and it is agnostic to how you actually validate that prediction.
In the A/B testing case, the nice thing about A/B testing is that we can pick up much smaller effects with far fewer assumptions. We don't have to be as precise or as unlikely as a comet in order to use this method to make predictions and test them. In principle, these things still work.
If you're a small startup, and you say, I don't know, "I think we should add insurance to our product. I think our customers will love that. We have only five, but you know what, we can run an experiment. We're going to call all five of them, and we're going to ask them a very specific question, 'Would you be interested in purchasing insurance from us at this particular price point?' If more than two of them say, "Yes", then we'll take that as an indication that there is interest. If this doesn't happen, then probably there's no interest." That's an experiment. You have just designed a-- It's a pretty rough experiment, but the mechanism still works.
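For what it's worth, even that five-customer experiment can be written down like a real one. Here is a hedged sketch (the 10% "no real interest" yes-rate is an illustrative assumption of mine, not from the episode) of pre-registering the decision rule and checking how surprising the result would be if interest were actually low:

```python
# Five-customer insurance experiment: pre-register the rule, then ask how likely
# three or more yeses would be by chance if real interest were low. The assumed
# 10% baseline yes-rate is an illustrative number, not from the episode.
from scipy.stats import binom

customers = 5
threshold = 3          # pre-registered: three or more yeses counts as "interest"
null_yes_rate = 0.10   # assumed yes-rate if customers don't really want insurance

p_surprise = binom.sf(threshold - 1, customers, null_yes_rate)  # P(3 or more yeses)
print(f"Chance of {threshold}+ yeses if interest were only {null_yes_rate:.0%}: {p_surprise:.3f}")
```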
Jonathan Hall: We're getting close to the end of our time. What resources can you recommend for anybody who's interested in learning more about A/B testing or the scientific method as it applies to business development?
Lukas: Oh, boy. To start with the last one, there is this book, Experimentation Works, by Stefan Thomke of Harvard Business School. Disclaimer: there's an entire chapter about Booking [unintelligible 00:21:09], and my name is in it. I think it's a good business summary of how experimentation can drive a company forward and which companies are applying it and how. If you're interested in the technical aspects of running experiments in an online setting, online A/B tests, then this book, Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang, and Ya Xu, is a very, very practical technical guide on how you would implement A/B testing.
Assuming that you've decided that this is what you want to do, how would you actually build the machinery to do this at scale? It's a good book. If you're more interested in the philosophical or statistical aspects of experimentation, so throw away all the business side and throw away all the technical side, just why is randomization important? What does this actually do for my results? How do I control for things? What are the different types of clustering, for example? Then Field Experiments by Gerber and Green is a classic. I think they're political scientists, actually. Yes, they're political scientists.
A lot of the examples in these books are about, I don't know, people going door to door trying to convince you to vote Democrat or Republican. Those are the experiments that they're interested in, but the theory still very much applies. Those are my favorite books. Last one: if you're interested more at a meta level, in the philosophical foundations of how the scientific method actually works, then The Structure of Scientific Revolutions by Thomas Kuhn is a classic. It's a bit heavier than the rest, a bit dated, and not an easy read, but it's a classic in terms of getting a sense of: science isn't objective, man, it's someone's opinion, well documented and vetted by other people. I think Kuhn was a big force in that.
Jonathan Hall: Great resources. I'll put those all in the show notes. Anyone who's interested can follow up. Before we sign off, if people are interested in getting in contact with you, they want to hire you, you said you do some consulting, how can we get in touch?
Lukas: My name dot NL, NL for the Netherlands. There should be a contact form on my website. Otherwise, it's just my first name and last name, dot NL.
Jonathan Hall: Great. Is there anything you'd like to add before we go?
Lukas: No, actually, I want to ask you a question. Maybe this is one for your audience as well. I'd love to get feedback on this. One of the things I'm struggling with is: can you do experimentation at scale without DevOps, and can you do DevOps without experimentation at scale?
Jonathan Hall: That's a great question.
Lukas: I don't know. I'd love to see it.
Jonathan Hall: The way I define DevOps is that you don't have to throw anything over the wall of confusion. The developers and operations, and ideally other departments, QA and whoever is contributing to the product, are working together, and there's never this idea that, "I'm going to do my part, you go do your part now." Can you do experimentation with the wall of confusion? In theory, I suppose you probably could. In that sense, you probably can do experimentation, A/B tests; just imagine if NASA wanted to do an A/B test, that would probably be an A/B test without DevOps. It's probably possible. It would probably be very difficult and painful, and I wouldn't want to do it, but it's probably possible, at least on paper. If you have enough bureaucrats pushing paper around and shoving things through, you can get it done.
Lukas: It would be difficult to get those teams to care, though.
Jonathan Hall: Definitely. Then the other question was, can you do DevOps without experimentation? I think the theoretical answer is probably yes, but why would you want to?
[laughter]
Jonathan Hall: Supposing you converted your organization to DevOps, and just stopped and said, "We're never going to change anything else again." Are you still doing DevOps 10 years later? Maybe, but there's no joy in it. The whole reason that DevOps is fun is because it allows you to react to change quickly. That's all about experimentation.
Lukas: I think the moment you go into DevOps, the developers will care, they will want to know why does it not work, or what happened here? The moment you give them visibility on how their work is actually performing in production, you almost automatically get experimentation.
Jonathan Hall: The whole DevOps philosophy, if you've seen the little figure eight, the infinity sign, it's all about feedback loops and shortening feedback loops. Oh, you have it there. If you take experimentation out-- Of course, I'm not talking A/B testing, I'm talking any form of experimentation. Hey, what if we did this other thing this way? What if we split into groups of three instead of six, or anything like that? If you're not doing that, you're maybe, by some definition, doing DevOps, but God, go find another job, that's a terrible place to work. [laughs]
Lukas: Ultimately it's about feedback. It's about trying things and getting feedback on your work in a real setting. I love the NASA example, by the way. I'm grinning because that is used so often when people want to ask me a loaded question. They ask me, "Yes, but NASA doesn't run A/B tests." I'm like, "Do you have any idea how much NASA is experimenting?" They build at least [unintelligible 00:27:09] little rockets. They shoot them up in the air. They measure everything, and they try to figure out what balance works best. They're experimenting the crap out of this thing, because they need to make sure that when the proper thing goes up in the air, it actually does go up in the air. They're experimenting a lot. No, they're not running controlled experiments. They probably can't.
Jonathan Hall: That's why you see the picture behind me; that's a SpaceX launch, because that's my go-to analogy for "can we do agile?" People say, "Well, we can't, because we're doing rocket science or something big." SpaceX does agile all the time. They send up six rockets a week or whatever, and they see what's wrong and they change it.
Lukas: Thank you, Jonathan. This is a blast.
Jonathan Hall: Thanks a lot. Have a great day. Bye.
Lukas: Bye.
Jonathan Hall: Find Lukas Vermeer online at lukasvermeer.nl, LUKASVERMEER.NL.
[music]
Jonathan Hall: This episode is copyright 2021 by Jonathan Hall. All rights reserved. Find me online at jhall.io. Theme music is performed by [unintelligible 00:28:14]
[00:28:17] [END OF AUDIO]