Tiny DevOps episode #36 Joy Ebertz — All About Feature Flags

March 15, 2022
Joy Ebertz is a Principal Software Engineer at Split. She focuses on the technical vision for the backend team, and she joins me today to talk about some of the obvious, as well as not-so-obvious, ways in which feature flags can be used on projects of any size.

In this episode

  • When does it make sense to start using a Feature Flagging library or service?
  • Should you build your own Feature Flagging service?
  • Using Feature Flags to test in production
  • Using Feature Flags for large features to allow Continuous Integration
  • Enabling feature packs or service tiers with Feature Flags
  • Feature Flags for circuit-breaking
  • How to use Feature Flags for infrastructure migrations
  • What is feature parity checking, and how to do it with Feature Flags
  • Some common gotchas with Feature Flags
  • How do A/B tests relate to Feature Flags?
  • Differences on mobile apps when using Feature Flags

Resources
Split.io
Blog: 7 Ways We Use Feature Flags Every Day at Split

Guest
Joy Ebertz
Blog: https://jkebertz.medium.com/
Twitter: @jkebertz
LinkedIn: joyebertz


Transcript

Jonathan: Ladies and gentlemen, The Tiny DevOps Guy.

[music]

Jonathan: Everybody, welcome to another episode of the Tiny DevOps Podcast. I'm your host Jonathan Hall, and on this show, we like to solve big technical problems with small teams. Today, I'm joined by Joy Ebertz who is a member of the Split team, which I think I'll let you, Joy, explain because I'm sure you can do a better job than I can. Welcome to the show. Would you tell us a little bit about who you are and what you do?

Joy: Sure. I'm a Principal Software Engineer at Split. Split does feature flagging and experimentation software. Specifically what I do is I'm a backend engineer for Split and I focus on basically putting together the technical vision for our backend team as a whole and trying to make sure that we are generally aligned, and trying to make sure larger technical projects can get prioritized on the various teams' roadmaps. I've been at Split for about two and a half years now and we have, I don't know, maybe 50 engineers at this point.

Jonathan: Nice. That's a nice-sized group, not too big, not too small.

Joy: That's what I was going for when I joined [chuckles].

Jonathan: Good. Is it growing quickly?

Joy: Yes, we are. We've probably doubled our product and engineering over the past year.

Jonathan: I met you on a Slack group where we were talking with some other people about feature flagging versus A/B testing and different ways of thinking about this stuff. In that conversation, you shared a blog post, which of course I'll put in the show notes for anybody to read. I thought it was a really interesting post because I think it was seven ways to use feature flags at Split. A couple of them jumped out at me, but maybe we could just breeze through those.

Because, as I mentioned, many people listening, of course, are familiar with feature flags, but for others that might be a new concept, and some of these use cases especially might be new. Do you want to just take us through those reasons or ways to use feature flags? The first one here was testing in production. That's probably something that sounds scary to a lot of people. Maybe you want to explain what the concept is and how feature flags are helpful [chuckles].

Joy: Yes, that's probably one of our most common use cases actually. When you release a feature, or I guess when you release in general, the main use case of feature flags is to be able to separate release from deployment. The idea there is that you can actually deploy all of your code to your production servers but not have any of your users see it initially. Then at a later point, you can actually turn on the feature so that they'll start getting it. A big part of the reason to do that is deployments are risky and releasing features is risky.

Being able to decouple those allows you to understand where the problem is actually happening more easily if a problem does happen. The additional piece there is, because you're doing this release separately, we allow a lot of fine-tuning around how you actually target who's getting the new feature. You don't necessarily have to just do it as an on-off switch. Instead, you could do it as percentages or even target specific groups. For example, you could say, "While this code is in production, we want it off for all of our normal users, but anyone who's an employee at Split, for example, maybe can see this new feature."

That's a way for us to start testing our stuff before it's released to the broader audience. Likewise, then maybe we think it's great and so we want to start rolling it out, but we could first pick a set of beta customers or maybe-- When I was at Box, we liked to use our free customers for this sort of thing. Being able to pick a set of customers that start seeing the new feature and then get feedback or see if there are problems and then adjust accordingly based on that.
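
For anyone who wants to see what that looks like in code, here is a minimal sketch of decoupling deploy from release. The FlagClient, flag name, and rule here are invented for illustration; with a hosted service like Split the targeting rules (employees only, a beta segment, a percentage) would live in the flag dashboard rather than in the code.

```python
def old_search(query):
    return f"old results for {query!r}"

def new_search(query):
    return f"new results for {query!r}"

class FlagClient:
    """Stand-in for a hosted flag service SDK."""
    def __init__(self, rules):
        self.rules = rules  # flag name -> predicate over user attributes

    def is_enabled(self, flag, user):
        rule = self.rules.get(flag)
        return bool(rule and rule(user))

# The code below is deployed to everyone, but the feature is only "released"
# to employees. Widening the rule (beta customers, 5%, everyone) is a flag
# change, not a deploy.
flags = FlagClient({"new-search": lambda user: user.get("is_employee", False)})

def search(user, query):
    if flags.is_enabled("new-search", user):
        return new_search(query)
    return old_search(query)

print(search({"is_employee": True}, "feature flags"))   # new code path
print(search({"is_employee": False}, "feature flags"))  # old code path
```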

Jonathan: Now the description in the blog article talks also about maybe using a feature flag before a feature is complete. Do you consider that part of the same use case, or is that separate? Maybe it takes three months to build the complete feature, but you don't want to wait three months to merge into the master branch or main branch, so you do bits at a time, but you don't want to turn it on. Is that a separate use case in your mind?

Joy: I see that as very related, but maybe slightly separate. Yes, the use case is pretty much precisely what you outlined there.

Jonathan: The second point in the post talks about entitlement, which is a title that didn't make a lot of sense to me at first when I read it. Do you want to explain what that means?

Joy: Yes, sure. This is also sometimes referred to internally as feature packs or service packs, something along those lines. The idea is these would be long-lived feature flags. Unlike that first group, which you put into the code, let run for a while, and then take back out, this instead is meant to stay there over a long period of time or probably forever. We have permanent feature flags as a type of feature flag.

The idea here is that you might have users paying for a premium tier and then you could basically control that by a feature flag. You can say if they're in this particular split, I guess we call them splits internally. If you're in this particular split then you're going to get the additional features. This is a way to easily add features to a particular pack of users, or payment pack, or to allow customer success or somebody else outside of engineering to control who has access to features.

Jonathan: I imagine a typical use case there is you offer maybe a freemium tier plus a basic tier, and a professional tier or something like that, and then it turns on the different types of features.

Joy: Exactly.
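
As a rough illustration of the entitlement idea, here is a hedged sketch of a long-lived flag mapping a plan to a feature set. The plan and feature names are hypothetical and this is not Split's API, just the shape of the pattern.

```python
# Long-lived "entitlement" flags: the flag maps a customer's plan to features.
PLAN_FEATURES = {
    "free":         {"basic-reports"},
    "basic":        {"basic-reports", "api-access"},
    "professional": {"basic-reports", "api-access", "sso", "advanced-reports"},
}

def has_feature(account, feature):
    # In a hosted flag service, customer success could move an account between
    # plans (or grant a one-off feature) from the dashboard, with no deploy.
    return feature in PLAN_FEATURES.get(account["plan"], set())

account = {"id": "acme-co", "plan": "basic"}
print(has_feature(account, "api-access"))        # True
print(has_feature(account, "advanced-reports"))  # False
```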

Jonathan: The third one is compliance. I suppose I can imagine what that means, but maybe you can elaborate on that a little bit. How are feature flags useful in a compliance scenario?

Joy: Probably a lot of ways we're not thinking of. The one I know we use is around having additional, I guess, internal admin-type features that we only allow certain people to access. We can say a few of our support people have access to our customers' accounts, but nobody else within our organization does. That's controlled via a feature flag. Because it's done that way, and because we also have audit logs for all the changes to the feature flags, we can see if somebody was accidentally put into one of those accounts or taken out of one of those accounts or things like that. We can actually see who has access to some of those support features.

Jonathan: The next one I'm going to skip because I want to get back to that one. That's the one I really wanted to spend time on. The next one down after that one is feature flags as circuit breakers. Do you want to explain that one?

Joy: Yes. This one is like an emergency out situation. We have certain code paths that get hit very, very heavily. In certain cases, we might suspect that things could go terribly wrong, or maybe they went terribly wrong in the past in that particular area. We can put in a feature flag so that, essentially, when we notice things are going badly, we can just flip it off, and then that immediately would stop the flow.

A circuit breaker tends to work that way, right? When things start to go badly you can turn something off. This is the same idea. If you start to notice things going badly you can turn it off. We use it a lot with logs. We use it in reverse with logs occasionally too, where instead we might have logging wrapped in a feature flag for an area that we suspect might have issues, but we don't want to be logging all the time. If somebody reports a problem, we can turn the logging on, collect some logs, and then turn it back off so we're not paying for all of the logs for a long time.
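
Here is a minimal sketch of both patterns, using an invented flag_on helper in place of a real SDK call: a kill switch guarding a hot code path, plus a verbose-logging toggle that is normally off. In practice the flag values would come from the flag service's UI rather than a dictionary.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders")

# Stand-in for a real SDK lookup; in practice these values come from the flag
# service and can be flipped instantly from its UI.
FLAGS = {"order-enrichment": True, "order-debug-logging": False}

def flag_on(name):
    return FLAGS.get(name, False)

def enrich(order):
    return {**order, "enriched": True}

def handle_order(order):
    # Circuit breaker: if this heavily-hit path starts misbehaving, flipping
    # the flag off immediately stops calling it, with no deploy.
    if flag_on("order-enrichment"):
        order = enrich(order)

    # The reverse use: verbose logging stays off to keep log costs down and is
    # only switched on while investigating a reported problem.
    if flag_on("order-debug-logging"):
        log.info("order payload: %r", order)
    return order

print(handle_order({"id": 1}))
```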

Jonathan: That makes sense. That's a good one. Let's jump back up to the one that I skipped because this is the one that jumped out at me. It's maybe the longest one on this article and it's maybe the least-- At least for me, the least intuitive or obvious. That is infrastructure migration. The article says that feature flags can be used to help, for example, when migrating from a monolith to microservices. Like I said, it's not something that's obvious to me. Would you explain how this is useful and how somebody or a team that's considering that sort of migration could use feature flags to their benefit?

Joy: We have a lot of different use cases related to this. We are actually in the process of migrating our monolith to microservices, and we've actually used a bunch of this. I guess a small one to start with: this was actually something we did at Box a lot. A coworker of mine there, Schnepper, had this tool called Tombstone, where the idea is that you can basically mark all of your code and figure out which sections of the code are dead and no longer accessed. That's basically done through log statements, but the problem with just doing straight-up log statements is that you can again get spammed with logs pretty easily.

If you then mix in a feature flag as well, it gets pretty easy to slowly check various parts of the code and slowly turn it on and gain confidence that some of this code is no longer hit. I bring this up because it's nice to limit the amount of code you're trying to move or separate before you actually even try to do the migration. That's one use case. In terms of the actual separation, there's a lot that are entangled a bit, so let me just break this apart individually. I guess parity testing first. The idea here is you want to check-- Ideally, you're replacing an old system with a new system and you don't want the end-users to be able to see a difference at all.

You're hoping that this is going to be seamless and there's no changes in the UI. In this case, ideally, you would stick a little interface in front of each that's the same, and then you have calls come into your old system and you would initially just respond with the old system because, again, you don't want to interrupt your users. The idea then is that you start asynchronously sending these calls to your new system as well. If you want to be doing parity testing, you would typically send this along with the expected response. Then you check the new system and see whether it matches the old system, and at that point, if the two values do not match, then you would log that case with as much information as you have at that point.

Part of how a feature flag is useful here is that if you do find that you have a problem, you can easily stop logging and even stop sending traffic to the new system so it's not getting hammered for no reason while you fix the problem and then you can start doing this again. That's parity testing. It allows you to basically see does the new system actually work the same way that the old system does?

When I was at Box, we replaced our authorization system with a fully new service. We did this there and found an entire use case that we had totally forgotten about, which would have caused major problems if we had released this to anybody, but because we caught it with parity testing, we were able to fix it before we released any of this feature at all. That was really useful. Parity testing.
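
Here is a hedged sketch of the parity-testing pattern, with hypothetical old_authz and new_authz functions standing in for real systems: the old system keeps answering users while a flag-guarded shadow call compares the new system's response and logs any mismatch. The flag_on helper is a placeholder for a real SDK call.

```python
import logging
import threading

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("parity")

def flag_on(name):
    # Stand-in for a flag check; flipping it off stops the shadow traffic
    # instantly if the new system needs a breather.
    return True

def old_authz(request):
    return {"allowed": request["user"] != "blocked"}

def new_authz(request):
    return {"allowed": True}  # deliberately wrong, to show what a mismatch looks like

def check_parity(request, expected):
    actual = new_authz(request)
    if actual != expected:
        log.warning("parity mismatch for %r: old=%r new=%r", request, expected, actual)

def authorize(request):
    response = old_authz(request)  # users still get the old system's answer
    if flag_on("authz-parity-check"):
        # Asynchronously replay the call against the new system and compare.
        threading.Thread(target=check_parity, args=(request, response)).start()
    return response

print(authorize({"user": "blocked"}))  # serves the old answer; logs the mismatch
```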

Then the next one is more around mirroring. With mirroring, it's again asynchronously sending the calls, but without actually checking if things match. The reason you might care about mirroring, the first use case, is around checking costs. Not too long ago, we did a total overhaul of one of our backend data pipeline systems. We were hoping to make it a lot cheaper, but we wanted to make sure this was actually the case before we fully rolled things over and deleted the old system. As a part of that, once we had it mostly built-- Although if we had done this better maybe we could have done it earlier, but once we had it mostly built, we turned mirroring on for a set period of time that we thought was indicative of our overall usage pattern. I don't remember what it was, but a week or something, and then you can extrapolate from there.

We saw what the cost we accrued during that time was, and then we were able to calculate from there what we think the cost is going to be over time. Are there any surprises? Does it make sense? In our case, it actually did make sense with what we were expecting, but it's nice to be able to verify that and not have any surprises at the end. Cost, that's one use for mirroring. Another one is more around load testing and stress testing, like I mentioned. When you're building a fully new service, it's easy to forget little things. It's easy to not have your connection pool tuned correctly, or small things here and there that are going to cause problems when you actually get your expected load. The idea is, again, mirroring the traffic, and that's going to help you see whether you can actually handle the level of traffic that you're expecting.

Another nice thing here, again, with the feature flags, instead of just putting in asynchronous calls all the time, is that you can actually start with a lower load. This is especially useful if the full load just immediately makes it fall over. It's sometimes really hard to see why it fell over, but starting with a lower load sometimes you can see, "This call is getting slower. This is getting weird," and so it's easier to debug. Then you can ramp up after you fix some of the low-hanging fruit and the obvious problems until you can get to full load, potentially even higher than full load. That gets more into the stress testing.

There's a lot of stress testing tools, things like Gatling or JMeter, where you write tests and you just hammer your system until it falls over. I'm not saying not to use those, those are also really useful, but the problem with those is they don't have a realistic traffic pattern usually, because it's just whatever tests you happen to put into the system. By mirroring your traffic, or you can even do something like double mirroring, like send a request and then wait five seconds and send the same request again, you get much more realistic traffic patterns in terms of what you think you might see, so that can be really helpful.
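
A minimal sketch of mirroring with a ramp, assuming an invented pair of pipelines: real traffic keeps going to the old one, and a configurable percentage is replayed asynchronously against the new one. In a real setup the mirror percentage would live in the flag service so it can be dialed up, or killed, without a deploy.

```python
import random
import threading

MIRROR_PERCENT = 10  # start low, raise toward 100 as the new pipeline holds up

def old_pipeline(event):
    return f"old pipeline handled {event}"

def new_pipeline(event):
    return f"new pipeline handled {event}"

def handle(event):
    result = old_pipeline(event)  # the old system keeps serving real traffic
    if random.uniform(0, 100) < MIRROR_PERCENT:
        # Asynchronously replay a slice of real traffic against the new system
        # to observe cost and load with a realistic traffic pattern.
        threading.Thread(target=new_pipeline, args=(event,)).start()
    return result

for i in range(5):
    print(handle(f"event-{i}"))
```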

Jonathan: I've learned a lot already [chuckles]. That's really good. We've made it through that article. I want to step back a little bit and talk more generally about how and when to use feature flags. I like to say the poor man's feature flag is just an if false. You could use that anywhere in your code. If you don't have a feature flag library or service that you're using and you really need to push something out to production that's not running yet, just wrap it in an if false type of statement. Obviously, that doesn't scale very well. Your customer service department can't control that; only somebody with Git access can do that.
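
For concreteness, the poor man's version Jonathan describes might look like this; the names are made up:

```python
# A hardcoded boolean (or one read from a config file) does work as a flag,
# but only someone with Git or deploy access can flip it, and there is no
# targeting, ramping, or audit trail.
NEW_CHECKOUT_ENABLED = False  # flip to True and redeploy to release

def checkout(cart):
    if NEW_CHECKOUT_ENABLED:
        return f"new checkout flow for {cart}"
    return f"old checkout flow for {cart}"

print(checkout(["book", "mug"]))
```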

When does it make sense, in your opinion, to start looking for a more proper feature flag solution than you can get from if false or even just reading a [unintelligible 00:15:28] file with booleans in it or something like that? When does it make sense to start looking at a full-fledged feature flag service like Split or one of your competitors?

Joy: I'm probably pretty biased here, but I would say very, very early. We have a free offering, and I'm sure some of our competitors do too. There are definitely some open source options out there, so it's not like you have to pay money from the start. Our free offering is meant for small companies and whatnot. If you are small, then it makes sense to start with one of these. Well, you mentioned a lot of the pros to that. There's things like being able to have somebody else turn things on or off for you. There are things like being able to do those slow ramps like I mentioned, or target particular people with the turn-on instead of just a straight on and off.

The other nice thing is if there is a problem, we have a big red kill button, and I'm sure basically any framework out there has something equivalent where it's just like, things are going badly, just turn it off immediately. It flips it to the off state, and this is a lot faster, at least for us, our pipeline's kind of slow. This is a lot faster than pushing any kind of config file or actually pushing code to turn something back off if things are going badly.

Jonathan: I know some of the answers to this, but it would be good to hear it from an expert. Actually, maybe we should address it a little bit differently. The question is why not just build my own feature flag tool or framework? I've seen homegrown feature flag frameworks everywhere, and it seems like the kind of thing where I'm just checking a boolean value or maybe, worst case, I'm checking 106 values. Surely it's not that difficult to build my own. Engineers are always trying to build their own things. Why shouldn't we build our own feature flag tool? [laughs]

Joy: Honestly, so many people have done that. I'm not going to say you can't do that; you obviously can. That said, I guess, first of all, a lot of our current customers are actually people who have a homegrown solution and are migrating from that. Obviously, there's something out there that people see the value in and are actually paying for. To elaborate on that a little more, I think the thing is that it's fairly easy to build the bare minimum, but it's going to be a slow bleed over time. People are going to start saying, "I want more fancy targeting. I want to be able to do the percentage rollout. I want to be able to do the targeted rollout," like I mentioned. "I want to be able to mix and match those. I want to be able to say those three users over there get it, as well as 5% randomly, or whatever it happens to be, or maybe everybody in Boston gets it."

You're slowly building these features to make it more and more complex. It's going to be a slow bleed over time. There's also things like the audit logs I mentioned, being able to see who changed what. We have approval flow processes. I can make a change to a feature flag and then have somebody else on my team review that change before it actually takes effect, things along those lines. Honestly, this is like a lot of these tools. Like authentication, for example, I know we currently have a homegrown authentication solution. We're looking at moving to something more like a paid product. A lot of that for us is, sure, our current needs are met fine and we can build the next feature we're getting asked for probably quicker than replacing the whole system, but there's the next feature and the next feature and the next feature. Do we really have the time to keep building those and supporting all of those? At some point, it's not really worth it.

Jonathan: You've already touched on some of the answer to this question, but I think it would be good to address it directly. I'm curious to hear about the dimensions that make sense to do feature flagging on. You talked about geography, are you in Boston, or percentages, but what are some other things? Because if you're trying to build your own, you're probably just thinking on/off, and then you eventually start to think of these other options. Help us brainstorm what those options are that we might care about in the future.

Joy: I feel like that's a lot of them. One of the big tripping points that I've seen with homegrown solutions is the percentage, just taking a shortcut with that and doing something like modulo on the user ID or something, which in theory sounds like a great idea, but then the problem is anybody with a one in the last column gets all of the new features, and then it all stacks on top of each other and causes problems. Something to keep in mind.
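
A small sketch of that pitfall and one common fix. The hashing scheme here is illustrative, not how Split assigns buckets:

```python
import hashlib

def naive_bucket(user_id, percent):
    # The shortcut Joy warns about: the flag name isn't part of the bucketing,
    # so the same users land in the ramp for *every* flag, and all the risky
    # features stack on the same accounts.
    return user_id % 100 < percent

def hashed_bucket(flag_name, user_id, percent):
    # Hashing flag name + user id gives a stable assignment per user that is
    # still independent across flags.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

user = 12301
print(naive_bucket(user, 10))                 # True, and True for every flag
print(hashed_bucket("new-search", user, 10))  # independent answer per flag
print(hashed_bucket("new-billing", user, 10))
```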

Jonathan: That's a nice one. Maybe a related question is how can you turn on these things? Obviously, you can have a customer service person do it, assuming you've set that up in your situation. Can you also do this through a URL parameter, if you want to just quickly, for a demo, say, let's turn on feature X? What are the different ways that you might want to enable these features?

Joy: I think you're actually referring to a cookie flip type thing, which is another thing that we've done at Box. Basically, the URL is how we set that: you put in a URL parameter and then it saves it in your cookies locally, and then every time you fetch, it'll basically send the value that you want for the feature flag. That's another good option. It's a little less secure, obviously, so it's more of a security consideration. The way we handle it, too, is that we actually allow fairly complex logic in the targeting rules and then that's pushed down to the local system. There's two ways to do feature flags: either the client sends all of the data to the feature flag server, so to speak, and says, "What value should I get for this?"

Or you can send the targeting rules down to the client and then they can look locally to see, do my parameters match this targeting rule? Split does the latter, and the value there is that we're not actually getting any of our customers' data, so they can do targeting based on PII and we'll never actually get that PII, which is great for us and them, to be quite honest. In that sense, they can set up: if this user has this parameter on their profile, then they should get this treatment, or that location-based one that I was talking about, or you can even do something like if feature flag A is turned on, they should also get feature flag B. Things along those lines.
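
Here is a hedged sketch of the client-side evaluation model Joy describes, with an invented rule format; Split's actual targeting rules are richer than this, but the point is that the attributes (which may be PII) are evaluated locally and never leave your systems.

```python
RULES = {
    "new-dashboard": [
        {"attribute": "plan", "equals": "professional", "treatment": "on"},
        {"attribute": "city", "equals": "Boston", "treatment": "on"},
    ],
}
DEFAULT_TREATMENT = "off"

def get_treatment(flag, attributes):
    # Rules were pushed down from the flag service; evaluation happens locally
    # against the caller's attributes.
    for rule in RULES.get(flag, []):
        if attributes.get(rule["attribute"]) == rule["equals"]:
            return rule["treatment"]
    return DEFAULT_TREATMENT

print(get_treatment("new-dashboard", {"plan": "free", "city": "Boston"}))  # on
print(get_treatment("new-dashboard", {"plan": "free", "city": "Lisbon"}))  # off
```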

Jonathan: I suppose you could target it on anything you want at that point, right?

Joy: Yes.

Jonathan: You could target it on their age or other demographics, or if they're using dark mode or light mode or whatever you want. Right?

Joy: Yes. Precisely.

Jonathan: Which gets us to the next topic I'm interested in discussing, which is feature flags versus A/B testing, which goes back to the original conversation we had a few weeks ago. I know Split offers both, but maybe before we talk about that, what is the difference between feature flagging and A/B testing in your mind?

Joy: This is probably a weird answer because I work at Split. I would say A/B testing is probably largely an extension of feature flags in a lot of ways, assuming you have an advanced enough feature flag framework.

Jonathan: Okay.

Joy: This wouldn't quite work as well, if it's just straight on and off, like we were talking about before.

Jonathan: Yes.

Joy: For us, in the back end, we treat them as exactly the same object. Everything is an experiment, everything is a feature flag. They are the same thing, and then basically what gives you the experiment side of things is you can create these metrics, you can define what you're looking for. Then we start calculating those metrics based on the feature flag states. If you have 10% on and 90% off, then we start calculating the metrics that you've set up, like which one is doing better, on versus off.

It gets a little weird because in experimentation, people tend to talk about A/B, whereas with feature flags people tend to do on-off. There's obviously other states, but yes, it works basically the same. A lot of our customers use this, and in fact, internally, a lot of our experiments are actually what we call do-no-harm experiments. It's more like we're rolling out a new feature. We're using it just as a feature flag, but we want to make sure that in the process we don't mess up our performance and we don't mess up the standard clickthrough on our most important things, or whatever it happens to be. We want to make sure that new features don't cause problems there. That's a lot of the experiments that we run, actually.

Jonathan: On that note, do you run all new features through a feature flag/do-no-harm experiment, or are you selective?

Joy: I would say all big features have a feature flag. We're not always as great about actually setting up the experiment side of things. I'm sure we're still calculating for most of them, but we don't always pay attention to it, but I'm sure somebody would tell us if things were going terribly badly.

Jonathan: We hope so.

Joy: We do always use the feature flag.

Jonathan: I suppose you're in a biased position here too. The question is, would you advise other people to do that? I suppose you're biased because the more they do that, the more money your company makes, but aside from that, whether they're using you or an open source project or whatever, what's the case for that? Why does it make sense to use feature flags for essentially everything?

Joy: Honestly, even when I was at Box, we used feature flags for basically everything. I would say the big argument there is, as with a lot of the stuff we've been talking about, being able to separate deploy from release. It's being able to do smaller commits and start merging those early, and not have your broken feature, because you haven't finished it yet, affect anyone while you're still working on it. By being able to merge the code early and often, you tend to have a lot fewer issues later in the process, and any small side effects that you weren't expecting you can usually catch a lot earlier on. It does allow you to do testing. It allows you to do things like have a different feature flag setup in staging than production.

You can start saying, I think we're ready, let's turn it on in staging, and you can start testing in your staging environment before you bother with production. I know we talked about testing in production before, but if you're worried about that, usually we actually start with staging and then we start doing the live production testing before we actually roll it out. It gives you a lot of these safety things. Then having that automatic off switch if something goes wrong is very nice in a lot of cases too.

In fact, at both Box and Split, in addition to every decent-sized feature having a feature flag, we also use them for high-risk bug fixes. If we were doing a bug fix that we were a little worried about, like side effects we weren't sure about, or something like that, we would wrap it in a feature flag. In that use case you turn it on almost immediately, but it just allows you that instant off switch if something does go wrong.

Jonathan: I'm interested to hear a little bit. I want to address anybody listening who says, "Yes, this all sounds like a good idea, but it's not for me or it's too complicated or I don't have the time to learn how to do this." What's involved in setting up a feature flag if we use Split or I imagine it's similar with any framework, but from a coding standpoint, I'm a developer, I'm adding my first feature flag, what does that look like?

Joy: Split, and I'm sure most of our competitors if not all of them, have a bunch of SDKs for a bunch of different languages. The idea there is you install the SDK and that handles most of the complicated part. Basically, all you need to do then is define the flags. We have a constants file somewhere with the current feature flags in the system, and then from the various other files it's basically a line of code, something like: if the treatment for this flag, with these settings or these parameters, equals on, and then that would respond with the particular setting for that flag. I know we have a lot of tutorials, and most other people probably also have a lot of tutorials and examples.

Jonathan: It's basically an if statement. You have to pass in some session parameters or whatever it is that your thing might be based on, and then you get an answer.

Joy: I guess if you have more than two states, it could be a switch statement.
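
As a rough illustration of what that call site tends to look like, here is a sketch with hypothetical, simplified names rather than the exact Split SDK API:

```python
NEW_SEARCH = "new-search"  # flag names typically live together in one constants module

def get_treatment(user_id, flag, attributes=None):
    # A real SDK evaluates the targeting rules here; hardcoded for the sketch.
    return {"new-search": "on", "search-ranking": "ranker-b"}.get(flag, "control")

# Two treatments: a plain if statement.
if get_treatment("user-42", NEW_SEARCH) == "on":
    print("new search path")
else:
    print("old search path")

# More than two treatments: effectively a switch statement.
treatment = get_treatment("user-42", "search-ranking")
if treatment == "ranker-a":
    print("ranker A")
elif treatment == "ranker-b":
    print("ranker B")
else:
    print("control ranker")
```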

Jonathan: I think it's pretty clear, at least in my mind, how this would work for a typical modern web app, but what if I want to do feature flags on a mobile app, or something that my server isn't necessarily controlling? What does that look like and what are the differences or concerns there?

Joy: We also have a mobile SDK so you can use that as well. It basically works kind of the same way. You would put this into your code, and it would get shipped with the code to your customers. There tends to be slightly different usage patterns between server and mobile. Let's see, what do we do? We do push for server and we do pull for mobile, that's our typical use case. The idea is, with server, we notify them when there are changes. We're like, "Things have changed in your splits and segments. Here's the notification." In fact, we do this a little bit for mobile as well, but with mobile the case is much more like: I open the app, and at that point I'm going to go and fetch all the current state, and then I will use it for a little while, and then I'm going to close it again. Then the next time I open it, I'm going to refetch all of the current state. Whereas server is more always-on, so it's more just push the updates; they're very, very rarely fetching full state, they're just getting the updates.

The big thing to keep in mind, which I'm sure if you're a mobile developer you already know, is that your clients are not going to upgrade very often. Getting those feature flags into there in the first place might be a little bit more challenging. We have a JavaScript SDK for client-side JavaScript, and so that's the same idea too, it's written the same way as the mobile stuff.
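
A minimal sketch of the pull pattern for mobile, with invented names; a server-side SDK would instead stay connected and receive pushed updates:

```python
import time

class MobileFlagCache:
    """Pull model: fetch full flag state on app open, cache it for the session."""
    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn
        self.state = {}
        self.fetched_at = None

    def on_app_open(self):
        # One full fetch per app open, rather than a long-lived subscription
        # receiving pushed updates the way a server-side SDK would.
        self.state = self.fetch_fn()
        self.fetched_at = time.time()

    def is_enabled(self, flag):
        return self.state.get(flag, False)

def fetch_flags_from_service():
    return {"new-onboarding": True}

cache = MobileFlagCache(fetch_flags_from_service)
cache.on_app_open()
print(cache.is_enabled("new-onboarding"))  # True for the rest of this session
```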

Jonathan: Let's share some contact info then. Obviously, Split, the company you work for, is it Split.io?

Joy: Yes.

Jonathan: If you're looking for a feature flag framework, that's one to consider. If people are interested in connecting with you, are you available for contact on social media or anywhere else?

Joy: Sure, on Twitter or LinkedIn, you can find me in those places. I'm not super active for better or for worse.

Jonathan: You have a blog as well, if I'm not mistaken.

Joy: Yes, I have a medium blog.

Jonathan: Well, Joy, thank you so much for coming on and educating me and hopefully, a few other people about feature flags and A/B testing. This is a topic that's been quite fascinating for me for many years, so it's really fun to talk about this. Thanks for coming.

Joy: Thanks for having me.

[music]

[00:32:00] [END OF AUDIO]
