Machine learning models do not last forever. When you deploy a model in production, things change, and the patterns the model learned may become irrelevant. This phenomenon is often called “concept drift.”
In this guide, we explore the concept of concept drift in detail (pun intended!). We will look at its different types, how to know when it’s happening, and what to do about it.
We will also introduce how to evaluate concept drift using the open-source Evidently Python library.
Want to keep tabs on your production ML models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.
TL;DR. Concept drift is a change in the input-output relationships an ML model has learned.
Concept drift is a change in the relationship between the input data and the model target. It reflects the evolution of the underlying problem statement or process over time. Simply put, whatever you are trying to predict – it’s changing.
If the patterns and relationships in the data are no longer the same, the model predictions might become less accurate. Or, if the shift is drastic, they may go completely off course.
The underlying reason is that ML models usually learn from a fixed set of training data (unless you use techniques like active learning, which have their challenges). Models don’t automatically notice new patterns or react to them. So, they will keep making predictions using what they previously learned until you explicitly retrain or update the model.
Let’s take a simple example.
Imagine you have an ML model predicting whether emails are spam. You train it using a massive dataset of emails, and it becomes pretty good at it. But as time passes, how people and services compose and send email changes. On top of it, spammers adapt and become better at mimicking legitimate emails. Now, what used to be obvious spam suddenly looks like regular email, and vice versa. Your once-accurate model starts making mistakes. This change is an example of concept drift.
Some related terms include data drift, model drift, and target drift. Let’s explore what they mean!
A word of warning – these terms are not set in stone! It is useful to understand different issues that affect production ML models. But in practice, the semantic distinctions don’t matter as much. Multiple shifts can co-occur, and practitioners often use terms like data and concept drift interchangeably.
TL;DR. Model drift is the decrease in model quality without a specified cause. Concept drift implies a change in the learned relationships. Model drift is often caused by concept drift.
Model drift refers to the decay of the ML model quality over time. Simply put, it is a way of saying “the model quality got worse” or “the model no longer serves its purpose.” Model drift doesn't pinpoint a specific cause; it's just an observation that the model no longer works as well as it used to. The model decay might happen due to various reasons, including data drift, data quality issues, or concept drift.
The difference: concept drift explicitly refers to the change in the relationships the model learned during training. Model drift, on the other hand, is about the model's quality declining without specifying why.
The similarity: model drift is often caused by concept drift. These terms are often used interchangeably to describe a situation where the model quality drops due to the change in the underlying patterns.
TL;DR. Data drift is the change in the input data distributions. Concept drift is the change in relations between model inputs and outputs. However, they frequently coincide.
Data drift, or feature drift, refers to the change in the distributions of the incoming data. This means that some or all model features look different compared to the training set or earlier prediction period.
Let’s take a simple example.
Suppose you’re developing a spam detection feature for an email app. In the past, most emails were sent from web platforms, but now, there’s a growing trend of emails being sent from mobile devices. Your spam detection model hasn’t seen as many mobile-based examples and is struggling to differentiate between spam and legitimate emails on mobile, which have different characteristics, like being shorter.
With the increase in mobile-sent emails, you’ll notice data drift: a shift in the distribution of the input features, such as “device type” or “email length.” As a result, your model’s performance may decline. However, the fundamental concept of what constitutes spam or its characteristics hasn’t changed in this scenario. The quality drop is due to the shift in the prevalence of mobile-based emails.
Conversely, an example of concept drift could be a new phishing technique that causes a sudden increase in mobile-based spam emails. This surge is independent of the overall distribution of device types, showcasing concept drift without data drift.
The difference: data drift refers to the change in the distributions of the input features, while concept drift refers to the change in the relationship between model inputs and outputs.
In the extreme case, you can observe concept drift without data drift. For example, in credit scoring, the distribution and characteristics of the credit applicants remain the same. Still, due to the changes in the macro environment, the risk of default (model target) is now different.
The similarity: both data and concept drift might lead to model quality decay. On top of this, data and concept drift often occur at the same time.
Concept drift might manifest itself through changes in the distribution of specific features. In the credit example above, you may observe signs of concept drift through shifts in the distribution of environmental features, such as interest rates or spending patterns.
Want to read more about data drift? Head to the deep dive guide.
TL;DR. Target drift is often synonymous with concept drift. It can also refer to the shift in the observed labels or actual values without implying a change in the relationships.
Target drift occurs when the goal or target of your prediction changes.
Broadly, it is synonymous with concept drift: it refers to the target function that maps input data to an output. The drift in the target function means a change in the relationships.
In a more narrow definition, it refers to the changes in the distribution of the labels or target values without implying a change in the underlying relationships or patterns.
For example, label drift might occur when the frequency of the predicted event changes. Imagine pneumonia being detected more often on X-rays due to a seasonal infection surge. You might observe this scenario without concept drift. In this case, both model inputs and outputs would shift, but the relationship (“what is pneumonia”) stays the same.
It could also change – if you face a new type of illness with a different pattern, like COVID-19. In this case, you’d talk about concept drift since the concept of pneumonia evolves.
You can analyze target drift (change in the distribution of the model target values) as one of the techniques in production ML model monitoring. In this case, target drift is just one of the signals you can interpret together with other inputs to understand better what is going on with the model and how to react to it.
For example, you can look at target distribution when exploring the reasons for observed model quality decay. Or, if the model quality is stable, you can run a test for target distribution change. If you do not detect it, you can skip retraining altogether – since the model has nothing new to learn.
The difference: narrowly, target drift refers to the distribution shift in the true labels or actual values. It may or may not coincide with the change in relationships between the inputs and the target (but very often does).
The similarity: both concept and target drift refer to the changes in the model target function. Broadly, they are synonyms.
There are different types of concept drift patterns one can observe in production. Let’s dive into those and explore some examples.
When referring to “drift,” we usually expect a gradual change. After all, this is what the word itself means – a slow change or movement.
Gradual drift is indeed the most frequent type: it happens when the underlying data patterns change slowly over time.
For example, consider a model that predicts user preferences for movies. As people’s tastes evolve and new movies are released, the model’s suggestions will become less and less relevant – unless you update and retrain it. In cases like fraud detection, you have to account for the bad actors adapting and coming up with new fraud attacks over time.
This gradual drift is almost a built-in property of a machine learning model. When you create the model, you know it will not keep performing at its training-time quality indefinitely. The world will change sooner or later.
In production, you can often observe a smooth decay in the core model quality metric over time. The exact speed of this decay varies and heavily depends on the modeled process and the rate of change in the environment. It can be hours or days for fast-changing user preferences or months and years for stable processes like manufacturing.
Example: Spotify frequently retrains a model that recommends new podcasts on the home screen – so that the model can recommend newly published shows. However, the model that suggests the best shortcuts to the user does not require frequent updates. Source: The Rise (and Lessons Learned) of ML Models to Personalize Content on Home.
To prepare for gradual concept drift, you can evaluate the speed of the model decay and environmental changes in advance and plan for periodic model retraining. For example, you can update your models daily, weekly, or monthly.
Sudden concept drift is the opposite: an abrupt and unexpected change in the model environment.
Imagine you're predicting product sales, and a new competitor enters the market with a heavily discounted product, completely changing customer behavior. This sudden shift can catch models off guard.
Another example could be a complete change to the modeled process. Imagine that your model processes input data about user interactions in the app to predict the likelihood of conversions and recommend specific content. If you update the application design and add new screens and flows, the model trained on the events collected in the older version will become obsolete.
If you are working on revenue projections or credit scoring models, any significant macroeconomic changes, such as a change in the interest rate, might make your previous model outdated.
COVID-19 was a perfect example of a drastic change that affected all ML models across industries.
Example: Instacart has models to predict the availability of grocery items. From the onset of COVID-19, they faced challenges due to the evolving shopping patterns and disruptions caused by the pandemic. They had to update the architecture of their models, for example, by implementing dynamic thresholds. Source: How Instacart's Item Availability Evolved Over the Pandemic.
Many such drastic changes are hard to miss. However, the possibility of facing an unannounced change in production – even on a smaller scale – is one of the reasons why you might want to set up ML model monitoring even if you retrain the models regularly.
Want a deeper dive? Read an in-depth guide to ML model monitoring.
Sometimes, practitioners refer to "recurring" concept drift, meaning pattern changes that happen repeatedly or follow a cycle.
For instance, in a sales model, you might notice sales going up during holidays, discount periods, or Black Friday. Ice cream sales differ by season, and weekends often have different patterns than business days.
It's important to know that many of these recurring changes are systematic, and you can account for them in modeling. You can pick model architectures that can handle seasonality well and may adapt to these patterns, ensuring accurate predictions even when concept drift happens periodically. You can also build ensemble models or switch between models to account for cyclic changes and special events in your system design.
However, you might also encounter recurring drops in model performance in production that you can't address directly. For instance, many users may sign up for a service during specific periods, but they leave shortly afterward. This can lead to a decline in the quality of your marketing and upsell models.
Understanding the nature of these events can be helpful for monitoring. Sometimes, you might avoid reacting to the drop in model quality as long as it still performs well enough for your core user segment. Additionally, it's essential to ensure that you don't retrain the model using the data that doesn't represent the usual patterns.
Detecting concept drift is the first step in effectively handling it. To spot it in time, you need to set up ML model monitoring.
Machine learning model monitoring helps track how well your machine learning model is doing over time. You have different ways to do this, from simple reports and one-off checks to real-time monitoring dashboards.
For example, if your model processes data in batches, you can run reports on model quality whenever you get the new labeled data. If your model works in real time, you can gather available metrics directly from the machine learning service and display them on a live dashboard. You can also set up alerts to warn you if something goes wrong. For instance, if the model's performance drops below a certain level, you'll get a signal to retrain the model or dig deeper into the issue.
Want a deeper dive? Read an in-depth guide to ML model monitoring.
Which metrics can you track to detect concept drift? There are several groups.
Model quality metrics are the most direct and reliable reflection of concept drift. For example, for classification problems, you can track accuracy, precision, recall, or F1-score. A significant drop in these metrics over time can indicate the presence of concept drift.
The good news is that sometimes you can calculate these metrics during production use. For example, in scenarios like spam detection, you gather user feedback, such as moving emails to spam or marking them as "not spam."
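When labels are available, such a check can be as simple as computing the metrics on the latest labeled batch and comparing them to a baseline. Here is a minimal sketch using scikit-learn, with hypothetical toy dataframes and column names ("target", "prediction") standing in for your logged data:

```python
# A minimal sketch: compare quality metrics on the latest labeled batch vs. a baseline.
# The dataframes and column names ("target", "prediction") are hypothetical examples.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

reference_df = pd.DataFrame({"target": [0, 1, 1, 0, 1, 0], "prediction": [0, 1, 1, 0, 1, 0]})
current_df = pd.DataFrame({"target": [0, 1, 1, 0, 1, 1], "prediction": [0, 1, 0, 0, 0, 1]})

def quality_report(df: pd.DataFrame) -> dict:
    y_true, y_pred = df["target"], df["prediction"]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

baseline = quality_report(reference_df)  # e.g., a hold-out set from training time
current = quality_report(current_df)     # the latest labeled production batch

# Flag possible concept drift if F1 drops more than 10% below the baseline
if current["f1"] < 0.9 * baseline["f1"]:
    print("Model quality dropped, investigate possible concept drift")
```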
However, you can't get the labels in many other situations as easily. For instance, if you predict future sales, you can only know the forecast error once that period is over.
In these scenarios, you can utilize proxy metrics and heuristics as early warning signs of concept drift. While they do not directly reflect a drop in the model quality, they can tell you about a change in the environment or model behavior that might precede it.
You can develop heuristics that reflect the model quality or correlate with it.
Suppose you have a recommendation system that displays products users might be interested in. You might decide to track an average share of clicks on a recommendation block as an “aggregated” reflection of the recommendation system quality. If fewer users start clicking on the block, this might mean that the model no longer shows relevant suggestions.
Another example could be tracking the appearance of new categories in certain input features. For instance, if your model is responsible for forecasting future sales and suddenly starts encountering entirely new product categories that were not present in the training data, it could be a sign of concept drift.
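A check like this can be a few lines of code. The sketch below uses a made-up "product_category" column and hypothetical category values:

```python
# A sketch for flagging categories that never appeared in the training data.
# The "product_category" column and the category values are hypothetical.
import pandas as pd

training_categories = {"electronics", "clothing", "groceries"}
current_batch = pd.DataFrame(
    {"product_category": ["electronics", "smart home", "groceries", "smart home"]}
)

unseen = set(current_batch["product_category"].unique()) - training_categories
if unseen:
    print(f"New categories not seen in training: {unseen}")
```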
It makes sense to leverage input from domain experts to develop meaningful heuristics.
Prediction drift occurs when the model’s predictions for new data differ significantly from earlier periods. If the model’s outputs deviate, it’s a sign that concept drift might be happening.
For example, if your model starts predicting spam much more often (or more rarely), it might signal that it’s worth looking at. To detect this drift in predicted classes, you can compare the distributions of the model predictions over a specific period, for example, by comparing today’s data to the previous day.
With probabilistic classification, you can also look at the change in the distribution of the predicted probabilities. Say your model initially assigned high probabilities to positive outcomes but now consistently assigns lower scores to them. This change could indicate concept drift, especially if your model is well-calibrated.
You can use different distribution drift detection techniques to detect shifts in the predicted classes, probabilities, or target values. They include statistical tests (such as the Kolmogorov-Smirnov test for numerical outputs or the chi-squared test for categorical ones) and distance metrics (such as Wasserstein distance, Population Stability Index, or Jensen-Shannon divergence).
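For instance, here is a sketch of comparing the distributions of predicted probabilities from two periods with a two-sample Kolmogorov-Smirnov test from SciPy. The scores are synthetic; in practice, you would use the logged model outputs, and the 0.05 threshold is just one possible choice:

```python
# A sketch of prediction drift detection on predicted probabilities
# using a two-sample Kolmogorov-Smirnov test. The scores are synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference_scores = rng.beta(2, 5, size=1000)  # e.g., last week's predicted probabilities
current_scores = rng.beta(5, 2, size=1000)    # today's predicted probabilities

statistic, p_value = ks_2samp(reference_scores, current_scores)
if p_value < 0.05:
    print(f"Prediction distribution shifted (KS statistic={statistic:.3f}, p-value={p_value:.4f})")
```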
If your model output is unstructured, such as text or embeddings, you can employ other distribution drift detection techniques, such as model-based drift detection, embedding drift detection, or tracking interpretable text descriptors.
Want a deeper dive on drift detection in embeddings? Read the research where we compared multiple drift detection methods.
While there's a distinction between pure "data drift" (changes in input features) and "concept drift" (changes in the relationships), they often go hand in hand. To spot concept drift, you can keep an eye on how the properties of your input data evolve because they often mirror the environment in which your model operates.
For instance, consider a spam email classification model. Shifts in email characteristics, such as the average length, languages used, or delivery time, could signal a significant change in the environment or the emergence of a new attack strategy.
While data drift doesn't guarantee a drop in model quality (your spam classification model might handle it well), it's still worth tracking. It also gives you the option to react preventively, say, by adding new examples to the training dataset to adapt to these evolving patterns.
To detect input distribution drift, you can employ similar methods as when evaluating output drift, such as statistical tests and distance metrics. However, one key difference is that you may have many input features. Monitoring drift in each column might be impractical, and you can simplify this by tracking the overall percentage of drifted features. Monitoring a single metric provides a manageable way to keep track of data changes.
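As an illustration, the sketch below runs a per-column Kolmogorov-Smirnov test on synthetic numerical features and reports the share of drifted columns. The feature names are hypothetical, and the 0.05 significance level is an illustrative choice:

```python
# A sketch of tracking the overall share of drifted input features.
# Synthetic data, hypothetical feature names, illustrative thresholds.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = pd.DataFrame({
    "email_length": rng.normal(120, 30, 1000),
    "links_count": rng.poisson(2, 1000),
})
current = pd.DataFrame({
    "email_length": rng.normal(80, 25, 1000),  # emails got shorter
    "links_count": rng.poisson(2, 1000),       # unchanged
})

drifted = [col for col in reference.columns
           if ks_2samp(reference[col], current[col]).pvalue < 0.05]
share_drifted = len(drifted) / len(reference.columns)
print(f"{share_drifted:.0%} of features drifted: {drifted}")
```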
Want a deeper dive on data drift? Read the dedicated guide.
Monitoring changes in correlations is another way to spot concept drift. You can look at the correlations between model features and predictions, as well as pairwise feature correlations. If there is a significant change in how they relate to each other, this might signal a meaningful pattern shift.
To evaluate the correlation strength, you can use correlation coefficients like Pearson's or Spearman's and visualize the relationships on a heatmap.
This method works best when features are interpretable, there are known strong correlations, and you deal with smaller datasets, as is often the case in healthcare. However, it can be too noisy in other scenarios. When tracking individual feature correlations is impractical, you can run occasional checks to surface the most significant shifts or look only at the correlations between a few strong features.
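Here is one possible sketch: compute Spearman correlation matrices for a reference and a current period on synthetic data, and flag feature pairs whose correlation changed by more than a chosen margin (0.3 here, purely illustrative):

```python
# A sketch of monitoring pairwise correlation changes between two periods.
# Synthetic data, hypothetical feature names, illustrative 0.3 threshold.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
reference = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["income", "debt", "spending"])
reference["spending"] += 0.8 * reference["income"]  # strong correlation in the reference period

current = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["income", "debt", "spending"])

corr_diff = (reference.corr(method="spearman") - current.corr(method="spearman")).abs()
changed_pairs = corr_diff[corr_diff > 0.3].stack()  # pairs with a notable correlation shift
print(changed_pairs)
```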
Summing up. You should track ongoing model quality using metrics like accuracy or error rates to detect concept drift in production ML models. If you cannot directly measure the model quality, you can use proxy metrics that may correlate with the model quality or reflect critical environmental changes. Such metrics include model prediction drift, input data drift, correlation changes, and heuristics that depend on the use case.
Detecting concept drift through established model monitoring is crucial, but knowing how to address it when it occurs is equally important. Here are some strategies and techniques to help you manage concept drift effectively.
The most straightforward way to address concept drift and the related model quality decay is to retrain the model using the most recent data. Retraining helps your model adapt to changing patterns: you can add the new data collected during the model operations to the training dataset and re-run the model training process.
You might proactively schedule periodic model retraining to address gradual concept drift. Regularly updating the model lowers the risk of it becoming obsolete.
However, retraining and rolling out a new model might be costly or require a major approval process. In these cases, you should closely monitor the ongoing model performance to initiate retraining when the model performance starts to deteriorate. Model monitoring also helps detect drastic changes that occur in between scheduled updates.
Get labeled data. Model retraining has a significant limitation: you must have the newly labeled ground truth data to run it.
You can sometimes collect this data (immediately or with a delay) during model operations. For example, after you send out a marketing campaign to the users, you can track the purchases that resulted from it and use the new conversion data to update your model.
In other scenarios, you might need a labeling process to acquire the training data. For example, in cases like image classification, you might need to initiate a new labeling round where experts generate labels for newly discovered concepts, such as edge cases or specific scenarios the model struggles to classify correctly.
You must ensure that your training dataset is sufficiently large for the model to grasp the new patterns.
Choose the model retraining strategy. When retraining on new data, you can also consider different approaches. Depending on the properties of your process and how drastic the change is, you might retrain the model on a combination of old and new data, on the most recent data only, or with higher weights assigned to recent observations.
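As one illustration, the sketch below retrains a simple classifier on old and new data combined, with higher sample weights for the recent observations. The data is synthetic, and the 3x weight is an arbitrary choice:

```python
# A sketch of one retraining strategy: reuse old data, but weight recent samples higher.
# Synthetic data; the 3x weight for new observations is an arbitrary illustrative choice.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X_old, y_old = rng.normal(size=(1000, 5)), rng.integers(0, 2, 1000)
X_new, y_new = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)

X = np.vstack([X_old, X_new])
y = np.concatenate([y_old, y_new])
weights = np.concatenate([np.ones(len(y_old)), 3 * np.ones(len(y_new))])

model = LogisticRegression().fit(X, y, sample_weight=weights)
```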
Consider training a new model. If the concept drift is severe and simple retraining does not help, you might consider a complete model revamp. Instead of feeding the new data into the existing model architecture, you might run a new set of experiments with different model types and parameters. For example, consider ensemble techniques.
You can also reconsider your model's features based on the changed context. For example, you can augment your dataset with new features and try new feature engineering techniques to capture the evolving patterns in the data. Conversely, you may exclude less informative and more volatile features to improve model stability and robustness.
Make sure you have a robust testing and roll-out process. If you retrain your models frequently to address or prevent concept drift, it is essential to have a thorough evaluation process.
You must ensure that your new model performs better than the old one, for example, by testing its performance on a curated evaluation dataset that also includes known corner cases your model should handle well.
Monitor the ongoing model performance. Every new model release carries risks, from data bugs to accidentally retraining on low-quality data or data from non-representative periods, such as a seasonal spike.
You must clearly define the model performance indicators and track how a new model performs in production after an update. You may also track the model performance separately on specific important segments, such as premium users.
Logging and monitoring help trace the evolution of your model and its responses to drift and, if necessary, override its outputs and intervene again.
Want a deeper dive? Read an in-depth guide to ML model monitoring.
Sometimes, model retraining under concept drift is not possible. For example, the newly labeled data might not be available yet, there might be too little new data to learn from, or retraining and redeploying the model might be too costly or require a lengthy approval process.
In these scenarios, you can consider other interventions.
Business rules and policies. For example, you can modify the decision thresholds for classification models. By changing the cut-off point for what constitutes a positive or negative prediction, you can adjust the model sensitivity to changes in the data distribution.
For instance, instead of assigning the “fraud” label for predicted probabilities over 50%, you might only assign the “fraud” label when it is over 80%, thus reducing the number of false positives. You can also make dynamic thresholds for different categories or data inputs.
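A threshold change like this is a small adjustment on top of the model output. A tiny sketch with made-up probability values:

```python
# A sketch of raising the decision threshold instead of retraining the model.
# The predicted probabilities are made up for illustration.
import numpy as np

predicted_proba = np.array([0.55, 0.62, 0.85, 0.30, 0.91])

flagged_before = predicted_proba > 0.5  # original cut-off
flagged_after = predicted_proba > 0.8   # stricter cut-off to reduce false positives

print("Flagged as fraud before:", int(flagged_before.sum()), "| after:", int(flagged_after.sum()))
```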
You can also consider applying other heuristics on top of the model output. For example, add correctional rules over the predictions made by the dynamic pricing model.
Example. DoorDash relies on combining expert judgment with machine learning in demand forecasting, to account for variance in model training data. Source: Why Good Forecasts Treat Human Input as Part of the Model.
Human-in-the-loop. You can consider alternative decision-making scenarios if you observe concept drift and can no longer rely on the model outputs. For example, you can send some or all of the data for manual decision-making.
In many scenarios, you can return to the “classic” decision-making process, for example, to manually review insurance claims or possible fraud cases. You can do this for a subset of the data, such as unusual inputs detected through outlier detection algorithms or predictions with probabilities below a certain threshold. The idea is to "catch" the most unusual or unreliable inputs that the model probably won't be able to handle.
This method requires setting up a separate workflow, which is only justified in some scenarios. However, a manual decision review sometimes already exists as a fallback for critical models. In this case, you can set up a new process for selecting the cases and assign more predictions to go through this flow.
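A simple routing rule might look like the sketch below, where confident predictions are handled automatically and the uncertain middle band goes to a reviewer. The function name and threshold values are hypothetical:

```python
# A sketch of a human-in-the-loop fallback: route uncertain predictions to manual review.
# The function name and threshold values are hypothetical.
def route_prediction(probability: float, lower: float = 0.3, upper: float = 0.7) -> str:
    if probability >= upper:
        return "auto_approve"
    if probability <= lower:
        return "auto_reject"
    return "manual_review"  # uncertain cases go to a human reviewer

for p in [0.95, 0.50, 0.10]:
    print(p, "->", route_prediction(p))
```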
Alternative models. Human decision-making is not the only alternative decision-making strategy. You can consider heuristics or other model types. For example, you can use first-principle physical models in manufacturing process control, rule-based systems for prioritizing leads, or set the recommendation block to show “top-10 most popular” items.
You could also switch between different models based on the identified environmental scenarios. For example, you may continue using an existing machine learning model when you know it performs well but override its predictions in more volatile situations.
Take food delivery time forecasting. If heavy rain conditions make it harder to predict the delivery time correctly, you might choose only to use your model during good weather. However, if it starts raining, you'd apply a correction to the model predictions to improve user experience.
Pause or stop the model. If the model quality is unsatisfactory, you might consider turning it off.
Example. When X (formerly Twitter) figured out their image cropping algorithm might be biased, they gave the control back to the user instead of using an ML system to suggest the optimal way to crop an image. Source: Sharing learnings about our image cropping algorithm.
Do nothing. Lastly, you can choose to do nothing – this is always an option. Depending on the scenario, some models might be less critical, and you can accept a diminished performance, for example, until you can collect the newly labeled data or if you expect a volatile period to be over soon.
Evidently is an open-source Python library that helps implement testing and monitoring for production machine learning models. Evidently helps run various checks for your datasets and get interactive visual reports to analyze the results.
You can choose from 100+ pre-built checks and metrics to evaluate concept drift. For example, you can test for prediction drift and target drift, track data drift in the input features, and monitor model quality metrics once the labels arrive.
Finally, you can use Evidently to deploy a live monitoring dashboard to track how metrics change over time and build a continuous monitoring process.
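As a minimal sketch, a drift check can look like this. It follows the Report and metric preset API used in many Evidently versions and runs on toy dataframes; the exact imports may differ in the version you install, so check the Getting Started tutorial for the current syntax:

```python
# A minimal sketch of a drift check with Evidently, using toy dataframes.
# Based on the Report / metric preset API; the library's API may evolve between versions.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.DataFrame({"email_length": [120, 95, 130, 110, 105, 90],
                          "prediction": [0, 1, 0, 0, 1, 0]})
current = pd.DataFrame({"email_length": [60, 75, 55, 70, 65, 80],
                        "prediction": [1, 1, 0, 1, 1, 0]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # interactive visual report
```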
Would you like to learn more? Check out the open-source Getting Started tutorial.
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring all the team to a single workspace to collaborate on AI quality.