To retrain, or not to retrain? Let's get analytical about ML model updates

Last updated:

February 18, 2025

Published:

June 21, 2021

Is it time to retrain your machine learning model?

Even though data science is all about… data, the answer to this question is surprisingly often based on a gut feeling.

Some retrain the models overnight—because it is convenient. Others do it every month—it seemed about right, and someone had to pick the schedule. Or even "when users come complaining"—ouch!

Can we do better?

To answer the Retraining Question more precisely, we can convert it into three.

First, how often should we usually retrain a given model? We can get a ballpark of our retraining needs in advance by looking at the past speed of drift.

Second, should we retrain the model now? How is our model doing in the present, and has anything meaningfully changed?

Third, a bit more nuanced. Should we retrain, or should we update the model? We can simply feed the new data into the old training pipeline. Or we can review everything, from feature engineering to the entire architecture.

Let us dive in.

The Why

To start, why do we talk about changing the models? We surely did our best when we built them.

‍The most obvious answer: machine learning models grow old. Even if nothing drastic happens, small changes accumulate. We can experience data drift, concept drift, or both.

To stay up to date, the models should re-learn the patterns. They need to look at the most recent data that better reflects reality.

That is what we call "retraining": adding new data to the old training pipelines and running them once again.

‍Depending on the domain, the model decay can be fast or slow.

Industrial model: decay in 1 year, retail model: decay in 1 month — *The speed of decay varies for different models.*

For example, manufacturing processes tend to be stable. A quality prediction model can last a whole year. Or until the first external change, such as a new raw materials supplier.

Consumer demand can be more volatile. Each new week's data could bring some novelty you had better factor in.

Of course, nothing is set in stone. We've seen manufacturing models that need updates after every process batch and sales that remain boringly stable.

In cases like fraud detection, we have to consider the adversaries. They might adapt quickly. New types of fraud—new concepts for our model—will appear now and then.

Fraud classification model — *If the model remains the same, adversaries can adapt to it with time.*

Take search engine optimization. Bad actors might try to game the system. In this case, model retraining can have a dual reason. First, we do that to maintain the ranking quality. Second, to address this unwanted behavior.

Updates can make sense even if the ranking quality remains the same! The goal is to make the adaptation more difficult by changing the way the system operates.

‍In real-time systems, we might care a lot about the most recent feedback.

Downvoting a song — *The recommendation model needs to respond to the recent user feedback.*

Consider music recommendations. If a user downvotes a few songs, the recommendation engine should react to this in a matter of seconds.

This does not mean the model is no longer valid. We don't always have to retrain it completely! But we need to factor in the most recent tastes.

For example, we can rebalance the weights of the user's preferences. We can lower the importance of specific tastes and promote others. Such adjustment is computationally fast, and we can do it on the fly.
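To make the idea concrete, here is a minimal sketch of such an on-the-fly adjustment. It assumes user tastes are stored as a simple genre-to-weight mapping; the `rebalance` function, the `penalty` factor, and the genre names are all hypothetical and only illustrate the principle, not how any particular recommender works.

```python
def rebalance(preferences, downvoted_genre, penalty=0.5):
    """Lower one genre's weight, then renormalize so the weights sum to 1."""
    updated = dict(preferences)
    updated[downvoted_genre] = updated.get(downvoted_genre, 0.0) * penalty
    total = sum(updated.values())
    if total == 0:
        return updated
    return {genre: weight / total for genre, weight in updated.items()}


# Hypothetical user profile: the user has just downvoted a few rock songs.
prefs = {"rock": 0.5, "jazz": 0.3, "pop": 0.2}
new_prefs = rebalance(prefs, "rock")
```

The underlying model stays untouched here: only the lightweight per-user weights change, which is why this kind of update can happen in seconds.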

These are only a few examples. Different models have different maintenance needs. They vary from minor online calibration to a complete offline update or a combination of both. Some models drift quickly, some not that much. We can run our training pipelines daily or do that once per month.

We cannot define a universal schedule for all model types and applications. But at least when it comes to the regular model "aging," we can address it in a systematic way.

We'll describe the approach below.

In more complex cases, you might need to adapt it—use these checks as an inspiration!

Part 1. Defining the retraining strategy in advance

Here comes our question number one. How often should we retrain the models, and is it even needed at all?

As usual in this line of business, we can learn from the data!

To come up with a schedule, we can run a few checks. Depending on how critical your model is, you might go through some or all of them.

Check #1. How much data is needed?

Relation between model quality and train size

Yes, the learning curves.

Our goal is to see how much data we need to train the model in the first place. We might have too much, or we might have too little.

‍That is something to check before any retraining discussion: are we done with training yet?

‍This check comes with nice side effects. We might reduce the training data volume. And, we can get a sense of how much signal each chunk of our data contains.

Where do we start?

‍If you have training data for a long period, you can check if you need it all.‍

Say we have 5 years of historical data. We can see if the model trained on the complete dataset does better than the model trained only on the most recent periods.

Maybe, just one or two years of data will be enough to get the same results?

Model quality with different training periods. — *In this example, we may decide to drop 40% of the older data.*

If you deal with time-series data and known seasonal patterns, you might need a custom test design. But in many cases of abundant data, this rough check helps conveniently reduce the training sample size.

You can also head straight to performing a more granular test.

This one makes sense when there is no obvious excess of data. Or when the data is not sequential: we might simply have a group of labeled objects, like images or text.

What should we do then?

This check is similar to how we define the number of training iterations in gradient methods. Or select the number of trees in the ensemble during model validation—in the random forest algorithm, for example.

Here is a refresher: to arrive at an optimal number, we evaluate the alternatives. We pick the minimum and the maximum number of trees. We define the step. Then, we check the model quality for all the options. Minding the cross-validation, of course!

Next, we plot how the model quality depends on the number of trees. At some point, it arrives at a plateau: adding more trees does not improve the performance.

We can adapt the approach to check if adding more data improves the model quality.

In this case, there is no temporal dimension. We treat each observation in our training set as an individual entry. Think one labeled image, one credit applicant, one user session, etc.

How to proceed?

Take the available training data
Fix the test set
Choose the initial training data size—maybe half, maybe 10%
Define a step—the volume of training data to add
And start pouring it!

We can use a random split method and perform the cross-validation with the changing train size. The idea is to evaluate the impact of the data volume on the model performance.

‍At some point, we can reach a plateau and call it a day. Adding more data stops improving the outcome.

model quality at different train sizes, flattens out at a certain point

We can look back and decide on the optimal sample size required to train a good enough model. If it is smaller than the training set we have, we might be glad to drop the "extra" data. This makes the training more lightweight.‍
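The procedure above can be sketched in a few lines. This is an illustration under loud assumptions: the data is synthetic, the "model" is a one-feature least-squares line, a single fixed split stands in for proper cross-validation, and the `find_plateau` tolerance is an arbitrary choice. With a real model, a library helper such as scikit-learn's `learning_curve` would typically do the heavy lifting.

```python
import random


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, points):
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


def learning_curve(train, test, sizes):
    """Error on the fixed test set for each training-set size."""
    return [(size, mean_abs_error(fit_line(train[:size]), test)) for size in sizes]


def find_plateau(curve, tolerance):
    """Smallest train size whose error is within `tolerance` of the best error."""
    best = min(error for _, error in curve)
    for size, error in curve:
        if error <= best + tolerance:
            return size


# Synthetic, already shuffled data: y = 2x + 1 plus noise (demo assumption).
rng = random.Random(0)
data = []
for _ in range(600):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)))

test, train = data[:100], data[100:]       # fix the test set
sizes = [5, 10, 20, 50, 100, 200, 500]     # initial size, then growing steps
curve = learning_curve(train, test, sizes)
enough = find_plateau(curve, tolerance=0.02)
```

If `enough` turns out to be the largest size we tried, that is a hint we have not reached the plateau yet.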

Or, we might not make it to a peak yet! If the quality keeps going up and up, our model is hungry for more data.

model quality at different train size, continues to grow

Good to know that in advance!

There is no point in checking the speed of the model decay yet. We are still in the improvement phase. Assuming that getting a better model makes business sense, we should work on that.

Maybe, we can get more labeled data? Otherwise, get ready for continuous improvement and frequent retraining—until we get to that peak performance.

By extrapolating the results, we can also roughly estimate how many instances we need to reach target quality. Here is an example of an image classification case from a Keras tutorial.

As a side effect, this exercise gives us a sense of "scale." How quickly does the model quality change? Does it take 10 observations or at least 100? Would it translate into days or months to get that amount of new data?

It will help us pick a reasonable "step" when we run more analyses on the decay and retraining.

This check by itself does not tell us yet how often to retrain the model. Why? Not all data is equally useful. More "valuable" chunks of data might arrive, say, only on the weekends. When we mixed the data, we downplayed this factor.

There are some more checks to run.

Moving on to the next one!

Check #2. How quickly will the quality drop?

Here is another helpful heuristic.

Let's assume that there is some average, expected speed of drift. This can be the rate at which user tastes evolve or the equipment wears out.

Then, we can look back on our data—and estimate it!

We can train our model on the older data and consecutively apply it to the later periods.

We will then calculate how well our model performs as it gets "older." The goal is to detect when the model performance meaningfully declines.

‍Such a single-point estimate can already decrease the uncertainty. We can roughly estimate how fast our model ages: is it one week or one month? Or maybe much more?

model quality at different test sets removed in time — *In this example, we observe a performance decline only after 3 intervals.*

If we have enough data, we can repeat this check several times, shifting the initial training set. We can then average the results (accounting for outliers, if any!)

This way, we get a more precise assessment.
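Here is a rough sketch of this backtest. Everything in it is an assumption made for illustration: the data is synthetic with a slope that drifts steadily from period to period, the model is a plain least-squares line, and `decay_profile` is a hypothetical helper name. It averages with a simple mean; a median would be one way to be more robust to outliers.

```python
import random


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, points):
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


def decay_profile(periods, train_len, horizon):
    """Average error by model age, over every possible training start."""
    errors_by_age = [[] for _ in range(horizon)]
    last_start = len(periods) - train_len - horizon
    for start in range(last_start + 1):
        train = [p for period in periods[start:start + train_len] for p in period]
        model = fit_line(train)  # same training recipe for every shift
        for age in range(horizon):
            test_period = periods[start + train_len + age]
            errors_by_age[age].append(mean_abs_error(model, test_period))
    return [sum(errors) / len(errors) for errors in errors_by_age]


# Synthetic history: 12 periods, the true slope grows by 0.2 each period.
rng = random.Random(1)
xs = [i * 0.5 for i in range(20)]
periods = [
    [(x, (2.0 + 0.2 * t) * x + 1.0 + rng.gauss(0, 0.2)) for x in xs]
    for t in range(12)
]

# profile[0] is the error on the first period after training, and so on.
profile = decay_profile(periods, train_len=3, horizon=4)
```

Reading `profile` from left to right shows how the error grows as the model gets "older," which is the number we are after.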

multiple checks for model quality decay on different periods

Of course, the model training approach should remain the same during the exercise. Otherwise, model parameters might affect the decay more than the changes in the environment.

Word of caution: if you have some critical subpopulations in your data, you might want to check the model quality for those as a separate metric!

‍The outcome of the test can vary. Sometimes, you'd learn your model is very stable: even a 6-month old model performs as well as a new one!

In other cases, you'd see that model decay is swift.

declining model quality when it is 1 day old, 1 week old, 1 month old — *Some models might become outdated quickly.*

If that is hard to handle, you might decide to reconsider your training approach.

Some options to consider:

Make a model a bit "heavier" but more stable. For example, add more data sources or perform more complex feature engineering. The model can be less performant on the surface but more stable in the long run.
Make a model more dynamic. Train a model on a shorter period but calibrate it often (once per hour?) or consider active learning.
Address the model pitfalls separately. The model might be degrading quickly only on specific segments. We can apply the model selectively to exclude those (check this tutorial where we explain the idea!) and use some non-ML methods. Maybe, rely on good old rules or human-in-the-loop?

If we have settled on the results, let's write down the number! This expected speed of decay is a key variable in our retraining equation.

Does estimating past drift always make sense?

Not always.

The assumption behind it is the gradual speed of the change in the world.

use cases on a scale: from mostly static to predictably dynamic — *Some problems are more dynamic in nature.*

If you are creating an image model to detect road signs, the problem is of a different kind. You have to watch out for the "long tail" of dark and noisy images, but the actual signs barely change.

The same is often true for text models, like sentiment analysis or entity extraction. They still need updates but rarely follow a schedule. Instead of checking the speed of decay, it makes more sense to search for underperforming segments and monitor the changes.

Other cases are more predictably dynamic. Think consumer behavior, mobility patterns, or time series problems like electricity consumption.

Assuming a certain pace of change often makes sense. So does looking at the past drift to estimate the rate of model degradation.

This data frequently comes time-stamped. You can easily hop on the "time machine" to perform the calculation.

Check #3. When do you get the new data?

model quality declines but the data is available later

We got a ballpark of our retraining needs. But there is a rub. Can we retrain it?

To run the retraining pipeline, we need to know the target values or have the newly labeled data.

If the feedback is quick, this is not a concern.

But sometimes, there is a delay. We might need to involve the expert labelers. Or, simply wait!

‍It makes sense to check what comes first: the new data or the model decay.

what we wanted: state-of-the-art daily forecasting, what we got: manual data entry once per month

This is primarily a business process question: how quickly the data is generated, and when it lands in your data warehouse.

Say, we learned that historically our sales model degrades in two weeks. But new data on actual retail sales is collected only once per month. That brings a limit to our retraining abilities.

Can we make up for it? Some options to consider:

A cascade of models. Is some data available earlier? For example, some points of sale might deliver the data before the end of the month. We can then create several models with different update schedules.
A hybrid model ensemble. We can combine models of different types to better combat the decay. For example, use sales statistics and business rules as a baseline and then add machine learning on top to correct the prediction. Rough rules might perform better towards the end of the period, which would help maintain the overall performance.
Rebuild the model to make it more stable. Back to training! We might be able to trade some digits of the test performance for slower degradation.
Adjust the expectations. How critical is the model? We might accept reality and just go with it. Just don't forget to communicate the now-expected performance metrics! The model will not live up to its all-star performance in the first test week.

scale: speed of decay vs data availability

What does it mean for our retraining schedule?

At this point, we can get into one of the two situations:

We are chasing the model quality. It degrades faster than new data comes in. We don't have many choices here. We should probably retrain the model every time a large enough chunk of data arrives.
We have a queue of new data points. We get new data quickly, but our "old" model is still doing great! We need to decide whether to retrain it more often or less.
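As a small sketch, the comparison boils down to two numbers we estimated earlier. The function name and the return labels below are hypothetical; the inputs reuse the figures from this article (a sales model that degrades in two weeks with monthly data, and a model that lasts about 60 days with weekly labels).

```python
def retraining_situation(days_to_decay, days_between_labels):
    """Compare the speed of decay with how often new labeled data arrives."""
    batches_before_decay = days_to_decay // days_between_labels
    if batches_before_decay < 1:
        # The model degrades before the next batch of labels lands.
        return "chasing quality", batches_before_decay
    # Several batches arrive while the model is still fine: we can choose.
    return "schedule choice", batches_before_decay


sales_model = retraining_situation(days_to_decay=14, days_between_labels=30)
stable_model = retraining_situation(days_to_decay=60, days_between_labels=7)
```

In the second case, about eight weekly batches arrive before the expected decay, so there is real room to choose a schedule.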

If we have this luxury of schedule choice, we finally get to the question.

Check #4. How often should you retrain?

retrain as soon as we get the data or wait till the model decay

By now, we know a thing or two about our model profile.

Let's consider a specific example.

Our model is good enough. No need to update it all the time to get a bit better.
It usually takes about 60 days to see the performance go down.
The new labeled data comes at the end of every week.

So when, exactly, should we plan to retrain it? Weekly? Monthly? Every 60 days?

You can, of course, just pick a number. Excessive retraining has downsides (more on this later!) but should not make the model worse.

That said, we can make a more analytical choice.

We can follow a similar approach as before. In this case, we pick our test set from beyond the point of decay. Then, we run our experiment within the period of stable performance.

We start adding more data in smaller increments: week by week, in our example.

Our goal is to detect when the quality on the test set starts improving.

model quality with different training sets

What we might learn is that adding data in small buckets does not always change the outcome.

‍For example, retraining the model with daily data has no impact on test performance. We need a couple of weeks to see the results.

There is no point in retraining the model more often!

‍That is often due to the same phenomenon of "data usefulness." There might be seasonal patterns or a critical mass of rare events that accumulate over time. The model needs more data to be able to capture the changes.

The realistic choice of retraining window might be narrower than it seems.

3 periods: too early to retrain, retraining window, no data available

To make the specific choice, you can consider how much improvement each new bucket brings. On the other half of the equation, consider the costs and hassle of frequent retraining.

You can also repeat the check several times to see when you get a meaningful quality gain on average.

Or yes—just pick a number inside this more narrow corridor.
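Once the experiment is done, picking the step can be as simple as the sketch below. The error values are made up purely for illustration, and both `first_useful_step` and the 5% minimum gain are assumptions; the right threshold depends on what a quality gain is worth in your use case.

```python
def first_useful_step(errors_by_buckets_added, min_relative_gain):
    """Smallest number of added buckets that improves the test error enough.

    errors_by_buckets_added[k] is the test error after adding k new buckets;
    index 0 is the current model with no new data.
    """
    baseline = errors_by_buckets_added[0]
    for added, error in enumerate(errors_by_buckets_added):
        if added and (baseline - error) / baseline >= min_relative_gain:
            return added
    return None  # no tested amount of new data helped enough


# Illustrative (invented) test errors after adding 0, 1, 2, 3, 4 weekly buckets.
weekly_errors = [10.0, 9.9, 9.8, 8.9, 8.5]
step = first_useful_step(weekly_errors, min_relative_gain=0.05)
```

With these numbers, one or two weeks of new data barely move the error, and the first meaningful gain shows up after three weeks.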

What is wrong with frequent retraining?

One could say: in principle, it should not hurt.

Why do we bother? What is wrong with updating the model more often than needed?

The reason is that it adds complexity and costs, and it is often error-prone.

Don't fix what's not broken: this motto also applies to machine learning.

rocket launch with a new version — *Not every model update is a rocket launch, but they don't come free either.*

Updates come at a price.

There are direct computing costs involved. Sometimes, the labeling ones. Then, the required team involvement!

Yes, you might trigger the updates automatically. But imagine something like a new model failing during validation due to data issues or minor quality decay. The team would rush to solve this!

That is quite an unwanted distraction—when the model did not need retraining to begin with.

It is not always just the data science team involved.

A new model might require sign-off from other stakeholders. It might mean a whole validation process! That adds a fair share of organizational and compliance costs.

Lastly, we need to justify the architectural decisions.

If we are yet to build our retraining pipelines, we need to decide how complex a system we need.

If we only retrain our models once per quarter, we can get away with a simpler service architecture. If we need daily updates, we should invest in a more complex toolset upfront. We'd better do that knowing the expected quality uplift!

Of course, if you already have an advanced MLOps platform, the computing costs are marginal, and the use case is not critical, you can probably dismiss all these concerns.

Otherwise, it pays off to be a bit more precise with the model maintenance!

Check #5. Should you drop the old data?

Here is a bonus one.

Let's assume we defined a reasonable schedule for our model retraining. How exactly should we perform it? Should we take the new data and drop the old one? Should we keep the "tail"?

That is something we can think through as well.

Say our model profile looks like this:

We used 6 data "buckets" in training.
The model performs nicely during the next 3.
Then, the quality goes down during the 4th one.
We decided on the retraining cadence: after every 2 new "buckets" of data.

planned model retraining to improve the model quality

Let's keep the retraining plan fixed. Two buckets it is. We know they bring the quality uplift we need!‍

Then, we can experiment with excluding old data bucket by bucket. Our goal is to evaluate the impact of this "old" data on performance.

It is similar to the first check when we considered excluding old data from the initial training. But now, we use a defined test set (with the known decay) and likely a more precise step (think dropping months, not years of data).

model quality with gradual exclusion of older periods

Which result can we get?

Dropping the old data makes the model worse. Okay, let's keep it all for now!

Dropping the old data does not affect the quality. Something to consider: we can make the model updates more lightweight. For example, we can exclude a bucket of older data every time we add a newer one.

Dropping the old data improves the quality! These things happen. Our model forgets the outdated patterns and becomes more relevant. That is a good thing to know: we operate in a fast-changing world, and the model should not be too "conservative"!
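A minimal sketch of this experiment follows. The assumptions are stated upfront: the buckets are synthetic with a steadily drifting slope (so old data is, by construction, less relevant), the model is a one-feature least-squares line, and `errors_when_dropping` is a hypothetical helper. On real data, any of the three outcomes above is possible.

```python
import random


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, points):
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


def errors_when_dropping(train_buckets, test_bucket, max_drop):
    """Test error after excluding the 0, 1, ..., max_drop oldest buckets."""
    results = []
    for dropped in range(max_drop + 1):
        kept = [p for bucket in train_buckets[dropped:] for p in bucket]
        results.append((dropped, mean_abs_error(fit_line(kept), test_bucket)))
    return results


# Synthetic data: 8 training buckets (6 original + 2 new) and 1 test bucket.
rng = random.Random(2)
xs = [i * 0.5 for i in range(20)]
buckets = [
    [(x, (2.0 + 0.2 * t) * x + 1.0 + rng.gauss(0, 0.2)) for x in xs]
    for t in range(9)
]
train_buckets, test_bucket = buckets[:8], buckets[8]

results = errors_when_dropping(train_buckets, test_bucket, max_drop=5)
```

Each entry in `results` pairs the number of dropped buckets with the resulting test error, which maps directly onto the three outcomes described above.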

We might investigate more complex combinations:

Keep the data for less-represented populations and minor classes. Dropping past data might disproportionately affect the performance of less popular classes. This is an important thing to control for. You might decide to drop the old data selectively: remove what's frequent, keep what's rare.
Assign higher weights to the newer data. If the old data makes the model worse, you might decide to downgrade its importance but not exclude it entirely. If you are feeling creative, you can do that at a different speed for different classes!
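One possible way to express "newer data matters more" is an exponential decay with a half-life, sketched below. This is an assumption, not the only option: the half-life values, the class names, and the `row_weight` helper are hypothetical. Many training APIs accept per-row weights, so the output could be passed to the training step as sample weights.

```python
def recency_weights(bucket_ages, half_life):
    """Weight 1.0 for the newest bucket, halving every `half_life` buckets."""
    return [0.5 ** (age / half_life) for age in bucket_ages]


def row_weight(age, label, half_life_by_class, default_half_life):
    """Let rare classes fade more slowly by giving them a longer half-life."""
    half_life = half_life_by_class.get(label, default_half_life)
    return 0.5 ** (age / half_life)


# Ages are measured in buckets: 0 is the newest bucket, 4 is four buckets old.
weights = recency_weights([0, 1, 2, 4], half_life=2)

# Hypothetical setup: the rare "fraud" class decays half as fast as the rest.
rare = row_weight(4, "fraud", {"fraud": 4}, default_half_life=2)
common = row_weight(4, "ok", {"fraud": 4}, default_half_life=2)
```

The same old bucket ends up with a higher weight for the rare class than for the frequent one, which keeps scarce examples in play for longer.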

Sounds complicated! Do I need this all?

It depends.

Planning your retraining definitely makes sense. How precise do you want to be? Your use case will define it more than anything else.

Some models are lightweight, use little data, and don't pose significant risks in production. Nothing to overthink here! Just one or two sanity checks will do.

Others are highly critical and need extensive governance and a detailed maintenance plan. It is probably a good idea to study your model behavior in extra detail.

None of these estimates is an exact recipe. But all are helpful heuristics in the practitioner's cookbook.

As always, analytics is best served with common sense and appropriate assumptions!

Part 2. Monitoring the model performance

Here comes the second part.

Until now, we've been studying the past. Once the model is live, we move into the wild!

The schedule is helpful but not bulletproof. Things might change in the course of model operations.

How do we know if it is time to update the model that is in production?

Monitoring! The reality check on top of the proper planning.

We need to evaluate the performance on the live data and compare it to our benchmark. This is usually the training data or some past period.

This way, we would know whether to intervene earlier than planned. Or, on the contrary, skip an update, all things considered.

What exactly to look for?

Check #1. Monitor the performance changes

model monitoring, model 1 is up to date, model 2 fails on a new segment

If the ground truth is known quickly, one can calculate the actual performance metrics.

This can be the mean error rate for a regression model or the precision of a probabilistic classification model. Ideally, one should add a business metric to directly track the model's impact on it. If you have specific important segments, keep an eye on those as well.

You can set up a dashboard and monitoring thresholds to alert you when things go wrong or set triggers for automated retraining.
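A threshold check of this kind can be very small. The sketch below is an assumption about how one might wire it, not a description of any specific tool: `check_performance` is a hypothetical function, the error metric is whatever you track (lower is better here), and the 20% tolerance is an arbitrary placeholder.

```python
def check_performance(baseline_error, live_error, max_relative_increase=0.2):
    """Flag the model when the live error grows too much versus the benchmark."""
    increase = (live_error - baseline_error) / baseline_error
    return {
        "relative_increase": increase,
        "alert": increase > max_relative_increase,
    }


# Benchmark error from a reference period versus two later live periods.
quiet_week = check_performance(baseline_error=10.0, live_error=11.0)
bad_week = check_performance(baseline_error=10.0, live_error=13.0)
```

The `alert` flag can feed a notification or act as the trigger for an automated retraining job.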

monitoring dashboards — *Evidently dashboard: error monitoring for the demand forecasting model.*

Setting a threshold is often unique to the use case. In some cases, even a minor decline leads to business losses, and you know exactly how much each percentage point of accuracy costs you. In others, fluctuations do not matter that much.

It makes sense to set these thresholds during the initial performance assessment. Which quality drop is severe enough? Knowing the business case, you can define what exactly warrants the attention—and the retraining effort.

At this point, you can also re-run some of the earlier checks using the newly logged data.

There is always a chance your offline training data was not fully representative of the real world. You can verify how well your assessment of the model decay matches the actual state of affairs.

If you use a golden set for your model evaluation, you might also want to review and update it. Live data can bring new corner cases or segments that you want your model to handle well.

Check #2. Monitor the shifts in data

And what if you do not get the ground truth immediately?

We might again get trapped in the waiting room: expecting the actual values or new labels. When this happens, we are left with the first half of the story: the data!

We can look at the data drift and prediction drift as a proxy to evaluate the likely model decay.

Knowing the shape of the input data and the model response, we can evaluate how different both look compared to training.

Imagine you are running a model monthly to choose the right marketing offer for your clients to pass them to the call center team. You can start with checking the input data for statistical drift.

If the distributions of your input data remain stable, your model should be able to handle it. In most cases, there is no need even to update the model. It ain't broken!
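For a single numerical feature, a drift check might look like the sketch below. It uses the Population Stability Index, which is one common drift score picked here as an assumption; this article does not prescribe a specific statistical test. The bin count and the often-quoted 0.2 alert level are conventions, not rules.

```python
import math


def psi(reference, current, n_bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    low, high = min(reference), max(reference)
    width = (high - low) / n_bins or 1.0  # guard against a constant feature

    def shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1  # clip out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref_shares, cur_shares = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Toy example: the same feature in training data and in two later batches.
training_values = list(range(100))
stable_batch = list(range(100))
shifted_batch = [value + 30 for value in training_values]

stable_score = psi(training_values, stable_batch)
shifted_score = psi(training_values, shifted_batch)
```

The stable batch scores zero, while the shifted batch scores far above the usual alert level. In practice, the same check would run per feature and on the model predictions.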

data drift dashboard — *Evidently dashboard: no drift is detected in the feature distributions.*

If the drift is detected, it comes as an early warning. You might rethink how to act on predictions. For example, you might decide to set a different decision threshold in your probabilistic classification or exclude some segments.

And if you can get new labeled data and retrain—time to do so!

Drift in the key features often precedes the visible model decay. In this tutorial, we illustrated the example. Key features shifted a couple of weeks before the model performance went haywire.

That is why you might want to keep an eye on the data drift even if the new labels come quickly.

For highly critical models, that is a handy advance notice.

Part 3. Retraining vs Updates

Let's say we did our best to prepare for model decay. We got our estimates right to define the retraining schedule. We built alerts to detect real-time changes.

But when they happen, we are left with the last question. How exactly to act?

The default idea is to keep your model fixed and feed it with new data. That is what we assumed when checking for optimal training windows.

Drop some old data, add some new data, repeat the same training pipeline. This is a reasonable step in case of any performance decline.

But it might not always solve it!

model retraining: new data, model updates: new experiments

If we face a significant drift, we might need to rethink our model. Imagine a major external change, such as facing a completely new customer segment.

Maybe, we'll need to tune the model parameters or change its architecture? Review pre- or post-processing? Reweigh the data to give priority to the most recent examples? Build an ensemble to account for new segments?

Here, it becomes more of an art than science. The solution depends on the use case, the severity of changes, and the data scientist's judgment.‍

There is another side of the coin: we might be able to improve the model! We could have started with a simpler, limited model. As more data is collected, we might be able to rebuild it and capture more complex patterns. Maybe, also add new features from other sources?

Word of caution: this new model might have a different decay profile!

To prepare for both, we need three things.

‍First, keep the update option in mind.

‍If naive retraining is the only option considered, we might completely miss out on other ways to maintain the models. It makes sense to schedule some regular time to review existing models for the improvement potential.

‍Second, build up a model analytics practice.

‍What do we mean by this? Get more visibility into your model behavior!

It is not just about a single performance metric like ROC AUC. We can search for underperforming segments, changes in the data, feature correlations, patterns in the model errors, and more.
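As a starting point, here is a sketch of a per-segment error breakdown. The segment names and numbers are invented for illustration, and `error_by_segment` is a hypothetical helper; the idea is only that an aggregate metric can hide one badly served segment.

```python
from collections import defaultdict


def error_by_segment(rows):
    """Mean absolute error per segment; rows are (segment, actual, predicted)."""
    grouped = defaultdict(list)
    for segment, actual, predicted in rows:
        grouped[segment].append(abs(actual - predicted))
    return {segment: sum(errs) / len(errs) for segment, errs in grouped.items()}


# Invented logged predictions for two customer segments.
rows = [
    ("new_customers", 100, 60),
    ("new_customers", 80, 50),
    ("loyal", 100, 95),
    ("loyal", 90, 92),
]
errors = error_by_segment(rows)
worst_segment = max(errors, key=errors.get)
```

Here the overall error looks moderate, but almost all of it comes from one segment, which tells us where to dig first.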

dashboards: data drift, new segments, error patterns

To get these insights, you can add more views to your monitoring dashboards. Or, run the deep-dives in batches during a scheduled model checkup.

Having a clear context helps prioritize the experiments and know how to act.

‍Third, design the communication channels.‍

Data does not always change "by itself."

Say your business stakeholders are planning an update to a process your model is involved in. You should not learn about a new product line as a "sudden data drift" in your monitoring! It is something you can be informed about to prepare the model for a "colder" start.

Model updates and retraining might rely on such external triggers: business decisions, direct requests, changes in the data storage process. There should be a way to streamline those!

Summing up

So when should we retrain our models?

To figure out an answer for a specific use case, you might consider a few things:

‍Plan for periodic retraining. Instead of making an arbitrary schedule, we can measure how often it is necessary and possible and choose an optimal retraining strategy.‍
Monitor for actual performance. We should keep an eye on the production model. We can detect the decay to intervene on time or have informed peace of mind otherwise.‍
Analyze. Straightforward retraining is not the only option. By getting detailed visibility into the model performance, we can decide how exactly to respond to changes.