How to handle long-running batch jobs during an upgrade
Two approaches to managing hours-long jobs with continuous deployment.
A handful of questions always seem to come up whenever a team starts down the path of continuous deployment. One that I hear a lot is: “How should we handle long-running batch processes?”
Let me explain the scenario first, then provide two approaches to addressing it.
Say you’re changing vendors for an important API. And as a result, you need to replace an old integer id from the legacy API vendor with a new UUID-based id for the new vendor… On 10 million database rows.
You can run a database migration to add the new UUID column, then query the API 10 million times to populate the new column, then drop the old column. But even on the fastest of APIs, this will probably take a few hours.
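As a rough check on that estimate, here is a back-of-the-envelope calculation. The 1 ms per call figure is purely an assumption for illustration, as is the idea that the calls run one after another with no batching or parallelism.

```python
def estimated_hours(rows, seconds_per_call):
    """Total wall-clock hours if every row needs one sequential API call."""
    return rows * seconds_per_call / 3600


ROWS = 10_000_000
SECONDS_PER_CALL = 0.001  # assumed: 1 ms per call, which is optimistic

print(f"{estimated_hours(ROWS, SECONDS_PER_CALL):.1f} hours")  # prints "2.8 hours"
```

A slower API, rate limiting, or retries would only push that number higher.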
And what happens if your service is deployed again during that few-hour window?
Thus the conundrum.
And now my two approaches:
- Manage these long-running tasks completely outside of the normal service deployment procedure.
In this example, that would probably mean using your normal deployment procedure to run a database migration to add the new column, while the process of calling the API 10 million times runs separately, probably manually. Because the update runs outside the service, any deployments that happen in the meantime shouldn’t affect it. Once the update is complete, you can deploy another migration to drop the old id column.
- Don’t have any long-running tasks!
Okay, that might sound a bit dismissive. What I really mean, of course, is to break the long-running task into many short-running tasks. The key is to make the task safe to interrupt, and possible to resume. In our example, those features are probably built in, because each row can be updated independently of the others. In some cases it’s not nearly so straightforward.
In practice, this would mean we run the database migration to add the new UUID column, then in the background the main service could be chugging away at calling the new API to populate the new column in small batches. The key is to handle each update (or batch of updates) atomically so that if, at any time, the service goes down, say for another upgrade, it won’t leave any of the data in an inconsistent or partially-updated state.
At some point, when there are no more rows without the new UUID, the code to call the new API and perform the update can be removed, and a final database migration to drop the old integer id column can be performed.
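The second approach might look something like the sketch below. It is only an illustration under several assumptions: the `accounts` table, its `legacy_id` and `new_uuid` columns, the batch size, and the `fetch_new_ids` stand-in for the new vendor’s API are all hypothetical, and SQLite is used only so the example runs on its own.

```python
import sqlite3
import uuid

BATCH_SIZE = 100  # hypothetical; tune to the API's limits and your database


def fetch_new_ids(legacy_ids):
    """Hypothetical stand-in for the new vendor's API (legacy id -> UUID)."""
    return {i: str(uuid.uuid5(uuid.NAMESPACE_OID, str(i))) for i in legacy_ids}


def backfill_one_batch(conn, batch_size=BATCH_SIZE):
    """Populate new_uuid for up to batch_size rows; return how many were done."""
    # Progress lives in the data itself: any row still missing a UUID is
    # pending, so there is no separate checkpoint to lose on a restart.
    rows = conn.execute(
        "SELECT id, legacy_id FROM accounts "
        "WHERE new_uuid IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    mapping = fetch_new_ids([legacy_id for _, legacy_id in rows])
    # One transaction per batch: it commits as a whole on success and rolls
    # back as a whole if an exception is raised while writing.
    with conn:
        conn.executemany(
            "UPDATE accounts SET new_uuid = ? WHERE id = ? AND new_uuid IS NULL",
            [(mapping[legacy_id], row_id) for row_id, legacy_id in rows],
        )
    return len(rows)


def count_remaining(conn):
    """Number of rows that still need a UUID."""
    return conn.execute(
        "SELECT COUNT(*) FROM accounts WHERE new_uuid IS NULL"
    ).fetchone()[0]


def make_demo_db(n_rows):
    """Build a small in-memory table in the state left by the first migration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, legacy_id INTEGER NOT NULL, new_uuid TEXT)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts (legacy_id) VALUES (?)",
            [(1000 + n,) for n in range(n_rows)],
        )
    return conn


if __name__ == "__main__":
    conn = make_demo_db(250)
    backfill_one_batch(conn)  # one batch completes, then the service is redeployed
    # After the restart the loop simply picks up the rows that are still NULL.
    while backfill_one_batch(conn):
        pass
    print("rows still missing a UUID:", count_remaining(conn))
```

Once `count_remaining` reports zero in production, the backfill code has nothing left to do and can be deleted ahead of the final migration. The same loop would also work under the first approach if it were run from a standalone script, since it resumes from wherever it stopped.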
Both approaches can work. I tend to prefer the second approach for two main reasons:
- It’s safe against any restart, not just upgrades. Unexpected hardware or software failure can leave long-running batch processes in limbo, too.
- Running a manual process on production data is dangerous. There’s often no audit log or other indication of what happened.