Tiny DevOps episode #6: Olaf Molenveld — Getting started with Progressive Delivery
June 15, 2021
Olaf Molenveld, former CTO of Vamp (now part of CircleCI), joins me to explain the concept of Progressive Delivery, when it makes sense, and what homework every team should do before getting started with canary deployments, blue/green deployments, and other progressive strategies.
Resources:
Article: "Towards Progressive Delivery" by James Governor, RedMonk
"Application deployment and testing strategies" from Google
Guest:
Olaf Molenveld, former CTO of Vamp, now part of CircleCI
LinkedIn
Email: olaf@circleci.com
Transcript
Speaker 1: Ladies and gentlemen, the Tiny DevOps guy.
[music]
Jonathan: Hello everyone, and welcome to another episode of Tiny DevOps. I am your host, Jonathan Hall, and today I have with me Olaf Molenveld, who is the CTO and co-founder of Vamp, which has just been acquired by CircleCI, so we'll talk about that in just a minute. Olaf, welcome. Why don't you tell us a little bit more about yourself and what you do?
Olaf: Thank you, Jonathan. Great pronunciation, by the way. Yes, I have a technical background. I did my share of programming in the past and then went more into an architecture role, a design role, technical consultancy. I've always been working on the boundary between technology and what value we can get out of this new technology. Also, the other way around: what problems does the business actually have that we're trying to solve? Because often people are asking for solutions instead of describing their real challenges. I try to do some matchmaking there.
Jonathan: Nice. The famous XY problem, as we call it sometimes. Nice one. Great. Today we're going to talk about-- I met you on LinkedIn, I saw a post that you had shared about Progressive Delivery, and I wanted to talk to you about that. That's why I invited you on the show today. Of course, I'm sure everybody listening is familiar with the concepts of continuous delivery and continuous deployment, because those are buzzwords around the entire industry, and especially if you're listening to a DevOps podcast you ought to know what they mean. How is progressive delivery the same? How is it different? What is it? Maybe you can just start with that.
Olaf: I think it's an umbrella term for a lot of methods and technologies to control, I would say, the rollout or the go-live of software, because "releasing" and "delivery" and "deployment" are a little bit tainted. There's a lot of semantics there. But it's a collection of things like A/B testing, canary releasing, blue-green, feature flagging, all these things that give you the safety nets and controls to reduce the blast radius and segment your software rollouts.
Jonathan: Are there ever situations where you would say that progressive delivery is absolutely the wrong approach or the wrong tool?
Olaf: Yes, in regulated industries, where there are a lot of manual steps and checks and balances and sign-offs. Of course, you can introduce manual gates, steps in the policy, but then it becomes a little bit harder, especially to stretch it out into production. Maybe you can do it in pre-prod or staging. In those regulated industries, it becomes a little bit tricky.
On the other hand, privacy makes this an interesting case, because often, with testing or the shift-left movement, you need to replicate the production environment, and that means all the data. That's not an option anymore when you are a privacy-aware organization. You cannot shift production data into test environments. In that sense, it maybe becomes an advantage to start doing it in production. It's a little bit of a double-edged sword, I think.
Jonathan: It sounds like there are always trade-offs, as there are with practically any tool we choose to use. In fact, if you only have a hammer, everything looks like a nail, and you're going to make the wrong decision, so apply judgment.
Olaf: Yes, that's the thing.
Jonathan: Supposing that somebody wants to start doing any of these progressive delivery approaches, is there anything really easy they can do? Does GitHub Actions make it possible, or is there something off-the-shelf, or can they just do this with Kubernetes, or do they really need a tool to do this?
Olaf: In the Kubernetes ecosystem, there are open-source tools to play around with, because, effectively, what we do is configure Ingress in a dynamic way, and we use layer 7 HTTP filtering. In the Ingress space, there are tools that you can use, and you can basically create these configurations by hand. That would be my first suggestion: play around with what a typical Ingress controller like Contour, or Envoy, or maybe a service mesh offers you, to get a feeling for how you can direct traffic and filter traffic, because, in essence, it's automating that kind of thing.
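For a concrete picture of the kind of hand-built traffic split Olaf describes, here is a minimal sketch using the ingress-nginx controller's canary annotations. The host and the Service name (myapp-v2) are hypothetical; the main Ingress for the stable version keeps serving the remaining traffic:

```yaml
# A second Ingress, marked as a canary of the main one, sends a small
# weighted slice of traffic for the same host to the new version.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"      # treat this as a canary Ingress
    nginx.ingress.kubernetes.io/canary-weight: "5"  # route ~5% of requests here
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-v2   # hypothetical Service fronting the new release
                port:
                  number: 80
```

Raising the weight step by step, then removing the canary Ingress once the new version takes all the traffic, is exactly the "directing traffic" exercise he suggests playing with.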
Then, of course, there's the other aspect, which is the conditions, the checking and verification, which is more like, "Okay, you get into OpenTelemetry," that kind of OpenTracing area, where you say, "Okay, what am I actually observing? What happens if I send 5% of visitors to a new version? What am I looking at? What do I want to see?" All those things are available to you in the cloud-native landscape, so you can play around with them. Just pick one. There's no perfect solution. Never. It's just getting your hands dirty, starting to play around, and getting a feeling for what is possible.
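One open-source option from that landscape that automates this observe-and-decide loop is Flagger. Here is a minimal sketch of its Canary resource, assuming a hypothetical Deployment named myapp and a mesh or ingress provider that Flagger can pull traffic metrics from:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:                    # the Deployment whose new versions get canaried
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    interval: 1m                # evaluate the metrics every minute
    threshold: 5                # roll back after 5 failed checks
    stepWeight: 5               # start by sending 5% of visitors to the new version
    maxWeight: 50               # promote once the canary handles 50% cleanly
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99               # require at least 99% non-error responses
        interval: 1m
```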
Jonathan: What size of companies do you usually see having the most success with this? Is it across the board, or does it require a certain level of complexity in the organization that maybe you need to be of a certain size before this really starts to be beneficial?
Olaf: That's a great question. I think it makes sense all across the board. Obviously, when you're a smaller setup, you have more control. There's an integrated team, or a few teams, that have visibility on the services, on the infrastructure, and on the application performance metrics. That makes it easier, because it's about coordinating these things, like observing the application performance, but also observing technical metrics. It makes it a little bit easier to set it up and start doing these things, because a lot of the time it takes is not about the technical things. It's about, "What kind of metrics do we actually observe?"
Then it's much more than only Kubernetes health or some kind of HTTP status. It's also about: maybe you update service A, and you want to look at an API endpoint somewhere else to see that this thing doesn't become really slow or break down. You need to start thinking about how you can assess the performance of your application landscape as an end user would perceive it. That takes a little bit of time, because you move out of technical things that are directly in the control of the people, and you approach it a little more holistically. On the other hand, if you do this, it's automation. You make data that's already there actionable in an automated release pipeline. If you have this in place, then scaling up is much easier: adding more services through that pipeline, applying it to multiple teams, multiple environments.
For bigger organizations, it's also very valuable, because it gives you separation of concerns. You can bring in an intern or a new developer, and he or she doesn't need to know about this, what kind of conditions to check, or what kind of segmentation. You push your code through the pipeline, and this process is applied automatically. In that sense, scaling it up into a larger organization also makes sense.
Jonathan: For just a moment, would you address the listener who's maybe interested in implementing canary deployments, or blue-green, or any of these methodologies you've talked about? What homework do they need to do? What metrics should they consider before they just run out and start implementing this?
Olaf: That's a good question. Typically, I would say, start with a simple, low-risk service, like a stateless RESTful service, because then you don't have the data discussion. Just use the metrics that are there out of the box. Health, restarts, HTTP status codes, we collect those out of the box anyway. Go with that, see how you can extend your pipeline from CI to deployment to releasing, and just observe how well it goes in production. Then you can start adding more metrics to observe.
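Those out-of-the-box signals are largely what Kubernetes already reports once a service declares its probes. A minimal sketch, with hypothetical names and paths:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0.0  # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:            # failing this takes the pod out of rotation
            httpGet:
              path: /healthz         # hypothetical health endpoint
              port: 8080
          livenessProbe:             # failing this restarts the container, which
            httpGet:                 # shows up in the restart counts Olaf mentions
              path: /healthz
              port: 8080
```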
Jonathan: It sounds like what you're saying is that the technical part is easy, which I think is maybe true for most of DevOps, right?
Olaf: Relatively easy.
Jonathan: Relatively easy. You set up a CI pipeline, and the first time you do it, it's exciting. Ten years ago, I set up my first CI pipeline and looked at it like, "Whoa, you can do that?" Nowadays, it's become second nature. We can all set up a CI pipeline. The hard part is convincing your developers to use CI, to write tests, to actually run the tests before you merge, and all these things. It sounds like the human factor is usually-- what you're saying is, the human factor is the bigger part: deciding what's important. I worked with a team last year, for example, where somebody asked for canary deployments, but they didn't have any concept of, "What are we waiting for? When will we decide to release to the rest of the customers?" They just wanted canary deployments because it made them feel better.
Olaf: Yes. That's the thing we were discussing in the beginning, asking for solutions, because canarying comes across a little bit like a magic silver bullet, a snake-oil kind of thing. Then you need to ask, "Why? What are you actually trying to achieve here?" In the end, this is all about process, because with going live, we make this distinction between the technical deployment, which is what your CI pipeline builds, and then you do your Helm or your kubectl or whatever deployment mechanism you're using, which is putting that artifact on some infrastructure and running it.
That's only one thing. This thing is running, but the go-live is a process. It's like, "Who is the first segment of visitors that we're going to expose this thing to? Why this segment, and why not another one? Are those beta users, or are those low-risk users? Can we get enough data out of this segment? What is this data? What is the baseline? What are the thresholds? What do we do, what's the mitigation strategy, if it's not good? Do we stop, do we roll back? What is a rollback? Is it a traffic rebalancing? Is it a stop and a redeploy?" You need to think about the process. Typically, we ask, "How do you do it right now? What kind of metrics do you observe? What manual steps do you take?" Then the first thing is, "Let's automate some of these things."
Jonathan: I want to play the devil's advocate just a little bit here. One advantage I always tell teams about when I coach them on continuous delivery or continuous deployment is the psychological one: developers feel a new responsibility when they know hitting that merge button means their code is going to be in front of customers in 10 minutes. If they go to any of these approaches, feature flags, or blue-green, or canary deployments, it moves the developer a little bit away from that pressure that their code is going to be in front of customers, because there's still another chance to check this. How do you address that?
Olaf: I think nothing changes. It still gets in front of customers, only it doesn't get in front of all your customers. You start with your low-risk segments. It's like [unintelligible 00:13:21]. You create [unintelligible 00:13:23] segments. "Okay, what's my most obvious user base that I want to test this release against in a production setting, without affecting my entire population?" Maybe there are these high-paying clients with a very low risk appetite, and there's always a few people, like your friends-and-family segment, where you say, "Okay, I did a release. I did an update. Can you take a look, check whether I missed anything?" Actually, that's what it is. You take a little piece, and it's still exposed. That gives you basically the freedom to try it out in a production environment, with production data, with real visitors and users, and observe how it's working before you go to the next stage.
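One way to carve out such a friends-and-family segment, sticking with the ingress-nginx canary annotations from the earlier sketch, is to route by a request header that only that segment sends. The header name and value here are hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-friends-and-family
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # Only requests carrying this header reach the new version;
    # everyone else keeps hitting the stable release.
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary-Group"
    nginx.ingress.kubernetes.io/canary-by-header-value: "friends-and-family"
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-v2   # hypothetical Service for the new release
                port:
                  number: 80
```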
Jonathan: What else would you like to say about progressive delivery, maybe to somebody considering trying it out for the first time? What have I failed to ask about?
Olaf: I think the most important takeaway is that it is about the process. In the cloud-native space, we tend to focus really on the technological part of things, and self-service, and developers should be able to do everything, but I don't believe that. I think it's a shared effort between Ops and Dev, and also the business side of things, where they think of new features, and the analytics side, where people observe how the platform is actually being used. For me, it's about how you start working together across those different responsibilities, even though maybe in a smaller setup these responsibilities are shared within single roles.
Think about the process, and then pick the technology that fits, instead of, like you say, picking some technology, cargo-culting, and then trying to apply it to something that maybe doesn't really fit your way of working. That would be my main thing. Try to think about how you want to work, and try to also involve your non-technical co-workers. What do they need to do their jobs? What kind of dashboards, controls, visibility do they need? How much of your tasks can you offload to your colleagues? Because maybe it's not a technical thing. Maybe it's something for somebody on the business side of things, or in analytics, to control a release or configure a release. You need to collaborate on these things and be more like an enabler, facilitating these things.
Jonathan: I was recently looking at the Vamp website and it says, "Vamp is a cloud-native AIOps platform." Can you explain AIOps a little bit?
Olaf: Yes. Then just talk to our marketing department.
[laughter]
Olaf: AIOps, of course, is a little bit of a buzzword, but the thing is, we collect a lot of data, and there is already a lot of data. If you want to automate, you need to apply machine learning to it, because you want to do anomaly detection and you want to aggregate things. There are different types of metrics, like histograms and counters and all these things. The AIOps part is more like, "How can we apply intelligence and machine learning to all that data to make sense of it?"
Jonathan: Then let's talk about cloud-native. That's a pretty widely-used term also. If people are not doing a cloud-native application, is Vamp still going to be valuable to them?
Olaf: We require Kubernetes, so that's a hard requirement, currently.
Jonathan: Let's talk about this merge or this acquisition that was announced. What do you see coming up in the next maybe 6 or 12 months, now that you're part of CircleCI?
Olaf: A lot of exciting things, of course, but obviously, the first task at hand is to integrate the features and the technology of Vamp into the wider CircleCI platform. Then, of course, also integrate the vision that we have on releasing, which is not deploying, like I said, into the mindset and the roadmap of where we go, because it's stretching and extending the use cases of the platform into other roles, and more towards business release management if you're a bigger organization. That's basically what we will be working on.
Jonathan: Are you sticking around or are you [crosstalk] and going to the Bahamas? [laughs]
Olaf: No, no, no, we're sticking around. Actually, the entire team is sticking around, and we're super excited to do this. We're still in Amsterdam. Obviously, it is a distributed global organization. Locally, we stay in Amsterdam. We are also distributed over Europe, but that doesn't really change.
Jonathan: For people interested in learning more about progressive delivery, what resources can you recommend?
Olaf: The original term was coined by James Governor from RedMonk, and I guess the link will show up somewhere later on, so you can Google for it. I think there are some interesting pieces by him where you can start reading on the history of it. Basically, all over the internet, there's a lot of interesting content. Of course, on our Vamp.io website, we have all kinds of white papers and blog posts. I think also on Amazon AWS and on the Google website there are some very interesting articles on what kind of patterns to apply, and when and when not: blue-green, A/B, canarying. We will share those links, and people can read up. There's plenty of interesting stuff to be read.
Jonathan: If people are interested, of course, in hiring, essentially, Vamp to do the dirty work for them, they can go to Vamp.io and look at the products and pricing. If people do want to get a hold of you, what's the best place to contact you?
Olaf: That will be either through LinkedIn, or just send me an email which is olaf@circleci.com
Jonathan: If you enjoyed this content but you don't want to wait a week for the next episode, subscribe to my daily mailing list at jhall.io/daily. Thanks to Riley Day for the Tiny DevOps theme music.
[00:20:39] [END OF AUDIO]