Is it time to retrain your machine learning model?
Even though data science is all about… data, the answer to this question is surprisingly often based on a gut feeling.
Some retrain the models overnight—because it is convenient. Others do it every month—it seemed about right, and someone had to pick the schedule. Or even "when users come complaining"—ouch!
Can we do better?
To answer the Retraining Question more precisely, we can convert it into three.
First, how often should we usually retrain a given model? We can get a ballpark of our retraining needs in advance by looking at the past speed of drift.
Second, should we retrain the model now? How is our model doing in the present, and has anything meaningfully changed?
Third, a bit more nuanced. Should we retrain, or should we update the model? We can simply feed the new data in the old training pipeline. Or review everything, from feature engineering to the entire architecture.
Let us dive in.
To start, why do we talk about changing the models? We surely did our best when we built them.
The most obvious answer: machine learning models grow old. Even if nothing drastic happens, small changes accumulate. We can experience data drift, concept drift, or both.
To stay up to date, the models should re-learn the patterns. They need to look at the most recent data that better reflects reality.
That is what we call "retraining": adding new data to the old training pipelines and running them once again.
Depending on the domain, the model decay can be fast or slow.
For example, manufacturing processes tend to be stable. A quality prediction model can last a whole year. Or until the first external change, such as a new raw materials supplier.
Consumer demand can be more volatile. Each new week's data could bring some novelty you had better factor in.
Of course, nothing is set in stone. We've seen manufacturing models that need updates after every process batch and sales that remain boringly stable.
In cases like fraud detection, we have to consider the adversaries. They might adapt quickly. New types of fraud—new concepts for our model—will appear now and then.
Take search engine optimization. Bad actors might try to game the system. In this case, model retraining can have a dual reason. First, we do that to maintain the ranking quality. Second, to address this unwanted behavior.
Updates can make sense even if the ranking quality remains the same! The goal is to make the adaptation more difficult by changing the way the system operates.
In real-time systems, we might care a lot about the most recent feedback.
Consider music recommendations. If a user downvotes a few songs, the recommendation engine should react to this in a matter of seconds.
This does not mean the model is no longer valid. We don't always have to retrain it completely! But we need to factor in the most recent tastes.
For example, we can rebalance the weights of the user's preferences. We can lower the importance of specific tastes and promote others. Such adjustment is computationally fast, and we can do it on the fly.
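As a rough illustration of such an on-the-fly adjustment, here is a minimal sketch. The taste names, the `penalty` factor, and the `rebalance` helper are all hypothetical; a real recommender would keep its preference weights in whatever form its serving layer uses.

```python
def rebalance(preferences, downvoted, penalty=0.5):
    """Lower the weight of downvoted tastes, then renormalize so weights sum to 1."""
    adjusted = {
        taste: weight * (penalty if taste in downvoted else 1.0)
        for taste, weight in preferences.items()
    }
    total = sum(adjusted.values())
    return {taste: weight / total for taste, weight in adjusted.items()}


# Hypothetical user profile: the user just downvoted a few rock songs.
prefs = {"rock": 0.5, "jazz": 0.3, "pop": 0.2}
updated = rebalance(prefs, downvoted={"rock"})
```

No model is retrained here: only the per-user weights change, which is why the update can happen within seconds.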
These are only a few examples. Different models have different maintenance needs. They vary from minor online calibration to a complete offline update or a combination of both. Some models drift quickly, some not that much. We can run our training pipelines daily or do that once per month.
We cannot define a universal schedule for all model types and applications. But at least when it comes to the regular model "aging," we can address it in a systematic way.
We'll describe the approach below.
In more complex cases, you might need to adapt it—use these checks as an inspiration!
Here comes our question number one. How often should we retrain the models, and is it even needed at all?
As usual in this line of business, we can learn from the data!
To come up with a schedule, we can run a few checks. Depending on how critical your model is, you might go through some or all of them.
Yes, the learning curves.
Our goal is to see how much data we need to train the model in the first place. We might have too much, or we might have too little.
That is something to check before any retraining discussion: are we done with training yet?
This check comes with nice side effects. We might reduce the training data volume. And, we can get a sense of how much signal each chunk of our data contains.
Where do we start?
If you have training data for a long period, you can check if you need it all.
Say we have 5 years of historical data. We can see if the model trained on the complete dataset does better than the model trained only on the most recent periods.
Maybe just one or two years of data will be enough to get the same results?
If you deal with time-series data and known seasonal patterns, you might need a custom test design. But in many cases of abundant data, this rough check helps conveniently reduce the training sample size.
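Here is one way this rough check could look in code. It is only a sketch on synthetic data: the yearly buckets, the held-out final year, and the plain least-squares model are assumptions made for illustration, so swap in your own data and training pipeline.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mae(coef, X, y):
    """Mean absolute error of the linear model on (X, y)."""
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean(np.abs(A @ coef - y)))


# Synthetic stand-in: years 0..4 are history, year 5 is the hold-out period.
rng = np.random.default_rng(0)
year = np.repeat(np.arange(6), 200)
X = rng.normal(size=(len(year), 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=len(year))

test = year == 5
errors = {}
for n_years in range(1, 6):
    # Keep only the most recent n_years of history for training.
    train = (year >= 5 - n_years) & (year < 5)
    errors[n_years] = mae(fit_linear(X[train], y[train]), X[test], y[test])
```

If `errors` is roughly flat across the window lengths, as it is for this stable synthetic process, the older years add little and can likely be dropped.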
You can also head straight to performing a more granular test.
This one makes sense when there is no obvious excess of data. Or when the data is not sequential: we might simply have a group of labeled objects, like images or text.
What should we do then?
This check is similar to how we define the number of training iterations in gradient methods. Or select the number of trees in the ensemble during model validation—in the random forest algorithm, for example.
Here is a refresher: to arrive at an optimal number, we evaluate the alternatives. We pick the minimum and the maximum number of trees. We define the step. Then, we check the model quality for all the options. Minding the cross-validation, of course!
Next, we plot how the model quality depends on the number of trees. At some point, it arrives at a plateau: adding more trees does not improve the performance.
We can adapt the approach to check if adding more data improves the model quality.
In this case, there is no temporal dimension. We treat each observation in our training set as an individual entry. Think one labeled image, one credit applicant, one user session, etc.
How to proceed?
We can use a random split method and perform the cross-validation with the changing train size. The idea is to evaluate the impact of the data volume on the model performance.
At some point, we can reach a plateau and call it a day. Adding more data stops improving the outcome.
We can look back and decide on the optimal sample size required to train a good enough model. If it is smaller than the training set we have, we might be glad to drop the "extra" data. This makes the training more lightweight.
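The random-split check with a changing train size can be sketched as follows. The data is synthetic, the model is plain least squares, and the 5% tolerance in `plateau_size` is an arbitrary assumption; the point is only the shape of the procedure.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 2000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=1.0, size=n)


def holdout_mse(train_idx, test_idx):
    """Fit least squares on the train rows and return MSE on the test rows."""
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    return float(np.mean((X[test_idx] @ coef - y[test_idx]) ** 2))


train_sizes = [30, 60, 120, 250, 500, 1000]
curve = {}
for size in train_sizes:
    scores = []
    for _ in range(20):  # repeated random splits, a simple form of cross-validation
        perm = rng.permutation(n)
        scores.append(holdout_mse(perm[:size], perm[size:size + 500]))
    curve[size] = float(np.mean(scores))


def plateau_size(curve, tol=0.05):
    """Smallest train size after which the next step improves the error by less than tol."""
    sizes = sorted(curve)
    for small, large in zip(sizes, sizes[1:]):
        if curve[small] - curve[large] < tol * curve[small]:
            return small
    return None  # no plateau yet: the model is still hungry for data


enough = plateau_size(curve)
```

If you work with scikit-learn estimators, its `learning_curve` helper in `sklearn.model_selection` runs the same kind of experiment for you.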
Or we might not have reached the plateau yet! If the quality keeps going up and up, our model is hungry for more data.
Good to know that in advance!
There is no point in checking the speed of the model decay yet. We are still in the improvement phase. Assuming that getting a better model makes business sense, we should work on that.
Maybe, we can get more labeled data? Otherwise, get ready for continuous improvement and frequent retraining—until we get to that peak performance.
By extrapolating the results, we can also roughly estimate how many instances we need to reach target quality. Here is an example of an image classification case from a Keras tutorial.
As a side effect, this exercise gives us a sense of "scale." How quickly does the model quality change? Does it take 10 observations or at least 100? Would it take days or months to collect that amount of new data?
It will help us pick a reasonable "step" when we run more analyses on the decay and retraining.
This check by itself does not tell us yet how often to retrain the model. Why? Not all data is equally useful. More "valuable" chunks of data might arrive, say, only on the weekends. When we mixed the data, we downplayed this factor.
There are some more checks to run.
Moving on to the next one!
Here is another helpful heuristic.
Let's assume that there is some average, expected speed of drift. This can be the rate at which user tastes evolve or the equipment wears out.
Then, we can look back on our data—and estimate it!
We can train our model on the older data and consecutively apply it to the later periods.
We will then calculate how well our model performs as it gets "older." The goal is to detect when the model performance meaningfully declines.
Such a single-point estimate can already decrease the uncertainty. We can roughly estimate how fast our model ages: is it one week or one month? Or maybe much more?
If we have enough data, we can repeat this check several times, shifting the initial training set. We can then average the results (accounting for outliers, if any!).
This way, we get a more precise assessment.
Of course, the model training approach should remain the same during the exercise. Otherwise, model parameters might affect the decay more than the changes in the environment.
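A minimal sketch of this backtest is shown below. Everything in it is an assumption for illustration: the weekly buckets, the synthetic data with a slowly drifting slope, the one-parameter model, and the "50% worse than a fresh model" rule for calling the decay meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)
n_weeks, per_week = 30, 100
week = np.repeat(np.arange(n_weeks), per_week)
x = rng.normal(size=len(week))
slope = 1.0 + 0.05 * week  # the true relationship drifts week by week
y = slope * x + rng.normal(scale=0.2, size=len(week))


def decay_profile(start, train_weeks=8, horizon=10):
    """Train on `train_weeks` weeks from `start`, then score each later week."""
    train = (week >= start) & (week < start + train_weeks)
    k = np.sum(x[train] * y[train]) / np.sum(x[train] ** 2)  # 1-D least squares
    errors = []
    for age in range(1, horizon + 1):
        mask = week == start + train_weeks + age - 1
        errors.append(float(np.mean(np.abs(k * x[mask] - y[mask]))))
    return errors


# Repeat the check with shifted training windows and average the results.
profiles = np.array([decay_profile(start) for start in range(0, 12, 3)])
mean_error_by_age = profiles.mean(axis=0)

# Assumed rule: decay is "meaningful" once the error is 50% above a fresh model's.
baseline = mean_error_by_age[0]
weeks_until_decay = next(
    (age for age, err in enumerate(mean_error_by_age, start=1) if err > 1.5 * baseline),
    None,
)
```

The same training recipe is used for every window, so any change in `mean_error_by_age` comes from the data getting older, not from a different model setup.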
Word of caution: if you have some critical subpopulations in your data, you might want to check the model quality for those as a separate metric!
The outcome of the test can vary. Sometimes, you'd learn your model is very stable: even a 6-month old model performs as well as a new one!
In other cases, you'd see that model decay is swift.
If that is hard to handle, you might decide to reconsider your training approach.
Some options to consider:
If we have settled on the results, let's write down the number! This expected speed of decay is a key variable in our retraining equation.
Not always.
The assumption behind it is the gradual speed of the change in the world.
If you are creating an image model to detect road signs, the problem is of a different kind. You have to watch out for the "long tail" of dark and noisy images, but the actual signs barely change.
The same is often true for text models, like sentiment analysis or entity extraction. They still need updates but rarely follow a schedule. Instead of checking the speed of decay, it makes more sense to search for underperforming segments and monitor the changes.
Other cases are more predictably dynamic. Think consumer behavior, mobility patterns, or time series problems like electricity consumption.
Assuming a certain pace of change often makes sense. So does looking at the past drift to estimate the rate of model degradation.
This data frequently comes time-stamped. You can easily hop on the "time machine" to perform the calculation.
We got a ballpark of our retraining needs. But there is a rub. Can we retrain it?
To run the retraining pipeline, we need to know the target values or have the newly labeled data.
If the feedback is quick, this is not a concern.
But sometimes, there is a delay. We might need to involve the expert labelers. Or, simply wait!
It makes sense to check what comes first: the new data or the model decay.
This is primarily a business process question: how quickly the data is generated, and when it lands in your data warehouse.
Say, we learned that historically our sales model degrades in two weeks. But new data on actual retail sales is collected only once per month. That brings a limit to our retraining abilities.
Can we make up for it? Some options to consider:
What does it mean for our retraining schedule?
At this point, we can get into one of the two situations:
If we have this luxury of schedule choice, we finally get to the question.
By now, we know a thing or two about our model profile.
Let's consider a specific example.
So when, exactly, should we plan to retrain it? Weekly? Monthly? Every 60 days?
You can, of course, just pick a number. Excessive retraining has downsides (more on this later!) but should not make the model worse.
That said, we can make a more analytical choice.
We can follow an approach similar to the one before. In this case, we pick our test set from beyond the point of decay. Then, we run our experiment within the period of stable performance.
We start adding more data in smaller increments: week by week, in our example.
Our goal is to detect when the quality on the test set starts improving.
What we might learn is that adding data in small buckets does not always change the outcome.
For example, retraining the model with daily data has no impact on test performance. We need a couple of weeks to see the results.
There is no point in retraining the model more often!
That is often due to the same phenomenon of "data usefulness." There might be seasonal patterns or a critical mass of rare events that accumulate over time. The model needs more data to be able to capture the changes.
The realistic range of retraining intervals might be narrower than it seems.
To make the specific choice, you can consider how much improvement each new bucket brings. On the other half of the equation, consider the costs and hassle of frequent retraining.
You can also repeat the check several times to see when you get a meaningful quality gain on average.
Or yes—just pick a number inside this more narrow corridor.
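The incremental experiment described above could be sketched like this. The synthetic drifting data, the weekly buckets, the choice of test weeks, and the 10% `min_gain` cut-off are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
week = np.repeat(np.arange(20), 100)
x = rng.normal(size=len(week))
y = (1.0 + 0.05 * week) * x + rng.normal(scale=0.2, size=len(week))


def fit_and_score(train_mask, test_mask):
    """Fit a one-parameter least-squares model and return MAE on the test rows."""
    k = np.sum(x[train_mask] * y[train_mask]) / np.sum(x[train_mask] ** 2)
    return float(np.mean(np.abs(k * x[test_mask] - y[test_mask])))


base_end = 10      # the original model saw weeks 0..9
test = week >= 16  # test set taken from beyond the assumed point of decay

errors = {}
for extra_weeks in range(0, 6):
    train = week < base_end + extra_weeks  # add one more week of data at a time
    errors[extra_weeks] = fit_and_score(train, test)

# Assumed rule: a bucket count is "useful" once it cuts the error by at least 10%.
min_gain = 0.10
first_useful = next(
    (w for w in sorted(errors) if w > 0 and errors[w] <= (1 - min_gain) * errors[0]),
    None,
)
```

If `first_useful` comes out at several weeks, retraining more often than that buys little on this test set.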
One could say: in principle, it should not hurt.
Why do we bother? What is wrong with updating the model more often than needed?
The reason is that it adds complexity and costs, and it is often error-prone.
Don't fix what's not broken: this motto also applies to machine learning.
Updates come at a price.
There are direct computing costs involved. Sometimes, the labeling ones. Then, the required team involvement!
Yes, you might trigger the updates automatically. But imagine something like a new model failing during validation due to data issues or minor quality decay. The team would rush to solve this!
That is quite an unwanted distraction—when the model did not need retraining to begin with.
It is not always just the data science team involved.
A new model might require sign-off from other stakeholders. It might mean a whole validation process! That adds a fair share of organizational and compliance costs.
Lastly, we need to justify the architectural decisions.
If we are yet to build our retraining pipelines, we need to decide how complex a system we need.
If we only retrain our models once per quarter, we can get away with a simpler service architecture. If we need daily updates, we should invest in a more complex toolset upfront. We'd better do that knowing the expected quality uplift!
Of course, if you already have an advanced MLOps platform, the computing costs are marginal, and the use case is not critical, you can probably set all these concerns aside.
Otherwise, it pays off to be a bit more precise with the model maintenance!
Here is a bonus one.
Let's assume we defined a reasonable schedule for our model retraining. How exactly should we perform it? Should we take the new data and drop the old one? Should we keep the "tail"?
That is something we can think through as well.
Say our model profile looks like this:
Let's keep the retraining plan fixed. Two buckets it is. We know they bring the quality uplift we need!
Then, we can experiment with excluding old data bucket by bucket. Our goal is to evaluate the impact of this "old" data on performance.
It is similar to the first check when we considered excluding old data from the initial training. But now, we use a defined test set (with the known decay) and likely a more precise step (think dropping months, not years of data).
Which result can we get?
Dropping the old data makes the model worse. Okay, let's keep it all for now!
Dropping the old data does not affect the quality. Something to consider: we can make the model updates more lightweight. For example, we can exclude a bucket of older data every time we add a newer one.
Dropping the old data improves the quality! These things happen. Our model forgets the outdated patterns and becomes more relevant. That is a good thing to know: we operate in a fast-changing world, and the model should not be too "conservative"!
We might investigate more complex combinations:
It depends.
Planning your retraining definitely makes sense. How precise do you want to be? Your use case will define it more than anything else.
Some models are lightweight, use little data, and don't pose significant risks in production. Nothing to overthink here! Just one or two sanity checks will do.
Others are highly critical and need extensive governance and a detailed maintenance plan. For those, it is probably a good idea to study the model behavior in extra detail.
None of these estimates is an exact recipe. But all are helpful heuristics in the practitioner's cookbook.
As always, analytics is best served with common sense and appropriate assumptions!
Here comes the second part.
Until now, we've been studying the past. Once the model is live, we move into the wild!
The schedule is helpful but not bulletproof. Things might change in the course of model operations.
How do we know if it is time to update the model that is in production?
Monitoring! The reality check on top of the proper planning.
We need to evaluate the performance on the live data and compare it to our benchmark. This is usually the training data or some past period.
This way, we would know whether to intervene earlier than planned. Or, on the contrary, whether to skip an update, all things considered.
What exactly to look for?
If the ground truth is known quickly, one can calculate the actual performance metrics.
This can be the mean absolute error for a regression model or the precision of a probabilistic classification model. Ideally, one should also add a business metric to track the model's impact directly. If you have specific important segments, keep an eye on those as well.
You can set up a dashboard and monitoring thresholds to alert you when things go wrong or set triggers for automated retraining.
Setting a threshold is often unique to the use case. Sometimes, even a minor decline leads to business losses: you literally know how much each percentage point of accuracy costs you. In other cases, fluctuations do not matter that much.
It makes sense to set these thresholds during the initial performance assessment. Which quality drop is severe enough? Knowing the business case, you can define what exactly warrants the attention—and the retraining effort.
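A threshold check of this kind can be very small. The sketch below is hypothetical throughout: the 5% tolerance, the segment names, and the metric values are made up, and the right numbers depend on your business case.

```python
def relative_degradation(live, reference, higher_is_better=True):
    """Fraction by which the live metric is worse than the reference (negative if better)."""
    change = (reference - live) if higher_is_better else (live - reference)
    return change / abs(reference)


def needs_attention(live, reference, max_degradation=0.05, higher_is_better=True):
    """True when the live metric is worse than the reference by more than the tolerance."""
    return relative_degradation(live, reference, higher_is_better) > max_degradation


# Precision for the whole population and two hypothetical segments.
reference = {"all": 0.90, "new_users": 0.85, "returning": 0.92}
live = {"all": 0.89, "new_users": 0.70, "returning": 0.92}
alerts = [seg for seg in reference if needs_attention(live[seg], reference[seg])]
```

Here the overall metric looks fine while one segment has clearly degraded, which is why it pays to track important segments separately. The resulting `alerts` list could feed a dashboard, a notification, or a retraining trigger.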
At this point, you can also re-run some of the earlier checks using the newly logged data.
There is always a chance your offline training data was not fully representative of the real world. You can verify how well your assessment of the model decay matches the actual state of affairs.
If you use a golden set for your model evaluation, you might also want to review and update it. Live data can bring new corner cases or segments that you want your model to handle well.
And what if you do not get the ground truth immediately?
We might again get trapped in the waiting room: expecting the actual values or new labels. When this happens, we are left with the first half of the story: the data!
We can look at the data drift and prediction drift as a proxy to evaluate the likely model decay.
Knowing the shape of the input data and the model response, we can evaluate how different both look compared to training.
Imagine you are running a model monthly to choose the right marketing offer for your clients to pass them to the call center team. You can start with checking the input data for statistical drift.
If the distributions of your input data remain stable, your model should be able to handle it. In most cases, there is no need even to update the model. It ain't broken!
If the drift is detected, it comes as an early warning. You might rethink how to act on predictions. For example, you might decide to set a different decision threshold in your probabilistic classification or exclude some segments.
And if you can get new labeled data and retrain—time to do so!
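As one possible way to check a numeric input feature for drift, here is a sketch based on the Population Stability Index (PSI). PSI is a stand-in chosen for illustration, not necessarily the test you should use, and the 0.25 alert level is a commonly quoted rule of thumb rather than a universal constant. The data is synthetic.

```python
import numpy as np


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    # Interior bin edges come from the reference quantiles; the outer bins are open-ended.
    inner = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_share = np.bincount(np.searchsorted(inner, reference), minlength=bins) / len(reference)
    cur_share = np.bincount(np.searchsorted(inner, current), minlength=bins) / len(current)
    ref_share = np.clip(ref_share, eps, None)  # avoid log(0) for empty bins
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, size=5000)
stable_month = rng.normal(0.0, 1.0, size=5000)   # same distribution as training
shifted_month = rng.normal(1.0, 1.0, size=5000)  # the mean moved by one standard deviation

drift_detected = {
    "stable_month": psi(training_feature, stable_month) > 0.25,
    "shifted_month": psi(training_feature, shifted_month) > 0.25,
}
```

The same function can be applied to the model's predicted scores to watch for prediction drift.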
Drift in the key features often precedes the visible model decay. In this tutorial, we illustrated the example. Key features shifted a couple of weeks before the model performance went haywire.
That is why you might want to keep an eye on the data drift even if the new labels come quickly.
For highly critical models, that is a handy advance notice.
Let's say we did our best to prepare for model decay. We got our estimates right to define the retraining schedule. We built alerts to detect real-time changes.
But when they happen, we are left with the last question. How exactly to act?
The default idea is to keep your model fixed and feed it with new data. That is what we assumed when checking for optimal training windows.
Drop some old data, add some new data, repeat the same training pipeline. This is a reasonable step in case of any performance decline.
But it might not always solve it!
If we face a significant drift, we might need to rethink our model. Imagine a major external change, such as facing a completely new customer segment.
Maybe, we'll need to tune the model parameters or change its architecture? Review pre- or post-processing? Reweigh the data to give priority to the most recent examples? Build an ensemble to account for new segments?
Here, it becomes more of an art than science. The solution depends on the use case, the severity of changes, and the data scientist's judgment.
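To make one of these options concrete, here is a sketch of reweighing the data toward recent examples. The exponential decay, the 30-day half-life, and the one-parameter model are assumptions for illustration; the data is synthetic.

```python
import numpy as np


def recency_weights(age_in_days, half_life_days=30.0):
    """Exponential decay: an example `half_life_days` old counts half as much as a fresh one."""
    return 0.5 ** (np.asarray(age_in_days, dtype=float) / half_life_days)


example_weights = recency_weights([0, 30, 60])  # weights 1.0, 0.5 and 0.25

# Synthetic data where the relationship has drifted: newer examples have a steeper slope.
rng = np.random.default_rng(5)
age = rng.integers(0, 180, size=2000)  # days since each example was observed
x = rng.normal(size=2000)
y = (2.0 - 0.005 * age) * x + rng.normal(scale=0.1, size=2000)

w = recency_weights(age)
k_plain = np.sum(x * y) / np.sum(x * x)            # every example counts equally
k_recent = np.sum(w * x * y) / np.sum(w * x * x)   # weighted least squares
```

The weighted estimate `k_recent` lands closer to the current slope of 2.0 than `k_plain`. Many training libraries accept per-example weights, for instance through a `sample_weight` argument in scikit-learn's `fit` methods, so the same weights can often be passed to an existing pipeline.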
There is another side to the coin: we might be able to improve the model! We could have started with a simpler, limited model. As more data is collected, we might be able to rebuild it and capture more complex patterns. Maybe also add new features from other sources?
Word of caution: this new model might have a different decay profile!
To prepare for both, we need three things.
First, keep the update option in mind.
If naive retraining is the only option considered, we might completely miss out on other ways to maintain the models. It makes sense to schedule some regular time to review existing models for the improvement potential.
Second, build up a model analytics practice.
What do we mean by this? Get more visibility into your model behavior!
It is not just a single performance metric like ROC AUC. We can search for underperforming segments, changes in the data, feature correlations, patterns in the model errors, and more.
To get these insights, you can add more views to your monitoring dashboards. Or, run the deep-dives in batches during a scheduled model checkup.
Having a clear context helps prioritize the experiments and know how to act.
Third, design the communication channels.
Data does not always change "by itself."
Say your business stakeholders are planning an update to a process your model is involved in. You should not learn about a new product line as a "sudden data drift" in your monitoring! It is something you can be informed about to prepare the model for a "colder" start.
Model updates and retraining might rely on such external triggers: business decisions, direct requests, changes in the data storage process. There should be a way to streamline those!
So when should we retrain our models?
To figure out an answer for a specific use case, you might consider a few things: