TL;DR: Monitoring embedding drift is relevant for the production use of LLM and NLP models. We ran experiments to compare different drift detection methods. We implemented them in an open-source library and recommend model-based drift detection as a good default.
When ML systems are in production, you might need to keep tabs on the data drift. This is a component of machine learning model monitoring.
Getting the ground truth can be a challenge. An ML model predicts something, but you do not immediately know how well it works. Sometimes you simply wait. Say, you predict food delivery time, and half an hour later, you know the model error. In cases like credit scoring, the wait is much longer. In scenarios like text classification, you might need to label the data to evaluate the model quality. Otherwise, you are flying blind.
Tracking drift in the model inputs and outputs can be a useful proxy. The assumption is that if the input data shifts compared to a reference period, the model might perform worse since the environment is no longer familiar. If the model outputs (= what the model predicts) change, this is also a good reason to investigate.
When you work with texts, images, or other unstructured data, it is common to use embeddings. In this case, you create a numerical representation of the input data.
For example, you can take pre-trained models like BERT to convert your raw texts into vectors. This process maps each object in the dataset (say, a text that the model should classify) to a multi-dimensional space and represents it as a set of numbers.
Monitoring data drift is just as relevant when you work with embeddings. When an NLP or LLM-based application is in production, you can detect when the data becomes too different or unusual and react before it affects the model quality.
What can data drift in texts look like? Here are a few examples:
Drift detection helps notice when something changes enough to warrant an investigation. You can then retrain or tune the model or send the data for labeling.
But how exactly do you detect drift? How do you measure the “change”?
In tabular data, drift detection is somewhat easier since features are often interpretable. You know what it means if, say, the average value of the "age" column changes. There are also ways to detect drift in the raw text data. However, if you use embeddings, you need a way to catch drift in the numerical representation of words instead!
In this blog, we will explore how.
How we approached the comparison:
A few disclaimers:
To keep things simple, we focused on text data. While the findings might apply to other types of embeddings (e.g., images), this was out of the experiment's scope.
This is not academic research. Our goal is to build intuition on the behavior of the different methods and help develop heuristics for ML practitioners.
The definition of drift. We imitated drift by increasing the presence of one of the classes. The learnings might not precisely translate to differently-shaped changes.
The research is fully reproducible if you want to challenge some of our findings!
We worked with 3 datasets:
Sources:
- Wikipedia comments. Source on Kaggle.
- News categories. Source on Kaggle. Citation: Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
- Food reviews. Source on Kaggle. Citation: J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.
We pre-processed datasets to have two classes each time. Wikipedia comments were already split into “toxic” and “not.” For news, we contrasted the “food and drink” category against all others. With food reviews, we picked “neutral” comments (rated “3”) against “positive” (rated “4” or “5”).
We used two methods to convert raw texts into embeddings: BERT and FastText. These are relatively different pre-trained models that help obtain vector representations of input texts.
Let's take a look at the resulting embedding distributions. We sampled the data and projected them into a two-dimensional space. The two classes in each dataset are color-coded.
When using BERT:
When using FastText:
The results are visibly different.
The choice of vectorization method matters. When creating the model, a data scientist selects one of many pre-trained embedding models and fine-tunes it (or not). This affects how well the classes separate in the dataset – meaning, how far the objects of different classes are from each other.
In our case, the objects in the food reviews separate worse than others: there is no clear visual grouping.
We split each dataset into two parts: “reference” and “current.” We treat “reference” as a representative dataset that shows what to expect from the data. In practice, it could be a validation dataset or a curated golden set. “Current” is the production data we test for drift.
Then, we injected artificial drift into the current dataset. We gradually added a bit (and then a lot!) of the objects of a specific class.
Let’s illustrate the approach on the Wikipedia comments dataset:
We did the same for other datasets: injected more “food and drink” news and more “neutral” food reviews, respectively.
These are not drastic changes: we add examples of a known class, not a totally new one. However, even with the same approach, the results vary. The shift might be less vivid when a category is not easy to identify in the first place (like “neutral food reviews”) and vice versa.
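To make the drift injection concrete, here is a minimal sketch on toy data. It assumes drift is injected by replacing a share of the “current” rows with rows of one class, so that the sample size stays fixed. The function name `inject_drift`, the replacement strategy, and the random toy embeddings are illustrative assumptions, not the exact code from our notebooks.

```python
import numpy as np

def inject_drift(current, pool, drift_share, rng):
    """Replace a share of `current` rows with rows sampled from `pool`.

    current: (n, d) embeddings of the production-like sample
    pool: (m, d) embeddings of the class whose presence we increase
    drift_share: fraction of rows to replace, between 0 and 1
    """
    n_replace = int(round(drift_share * len(current)))
    # Keep a random subset of the original rows...
    keep_idx = rng.choice(len(current), size=len(current) - n_replace, replace=False)
    # ...and fill the rest with rows of the over-represented class.
    add_idx = rng.choice(len(pool), size=n_replace, replace=True)
    return np.vstack([current[keep_idx], pool[add_idx]])

rng = np.random.default_rng(0)
current = rng.normal(0.0, 1.0, size=(1000, 8))    # toy stand-in for mixed-class embeddings
class_pool = rng.normal(1.0, 1.0, size=(300, 8))  # toy stand-in for one class (e.g. "toxic")

# Replace 20% of the current data with objects of the chosen class.
drifted = inject_drift(current, class_pool, drift_share=0.2, rng=rng)
```

Increasing `drift_share` step by step gives the “a bit, and then a lot” progression of drift sizes.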
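We picked the following drift detection methods: Euclidean distance, Cosine distance, a domain classifier, the share of drifted embedding components, and Maximum Mean Discrepancy (MMD).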
We tested each method with and without dimensionality reduction. In the former case, we used PCA (Principal Component Analysis), reducing the embeddings to 30 dimensions.
The idea behind PCA is to further “compress” information in each embedding. If you map it to a smaller number of dimensions, you can also speed up the calculations in some cases.
Our goal was to test whether the results of each drift detection method are different with PCA while the initial “drift” remains the same.
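As an illustration of the dimensionality reduction step, here is a small sketch that reduces embeddings to 30 components with PCA. It uses a plain NumPy SVD instead of a library PCA class, and it assumes the projection is fitted on the reference data and then applied to both datasets. That fitting choice, the helper names, and the toy data are assumptions made for the example.

```python
import numpy as np

def fit_pca(reference, n_components=30):
    """Return the mean and the top principal directions of the reference embeddings."""
    mean = reference.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(embeddings, mean, components):
    """Project embeddings onto the principal directions learned from the reference."""
    return (embeddings - mean) @ components.T

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 300))  # toy stand-in for 300-dimensional embeddings
current = rng.normal(size=(400, 300))

mean, components = fit_pca(reference, n_components=30)
reference_30 = apply_pca(reference, mean, components)
current_30 = apply_pca(current, mean, components)
```

Any of the drift detection methods can then run on `reference_30` and `current_30` instead of the full embeddings.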
How do the chosen drift detection methods compare? We sum up the results below.
We’ll use the following evaluation criteria:
Experiment code. We share the Colab notebooks with the results for each method. Here is the summary notebook. There are also 6 notebooks, one for each combination of dataset and embedding model: food reviews (with BERT and FastText), Wikipedia comments (with BERT and FastText), and news categories (with BERT and FastText).
How it works. We average all embeddings in the “current” and “reference” data to get a single representative embedding for each dataset. Then, we measure the distance between the embeddings to evaluate how far they are from each other.
In this experiment, we use Euclidean distance. This is probably the most straightforward distance metric. If you have two vectors, it measures the length of the line that connects them.
There are other distance metrics. For example, Cosine, Cityblock (Manhattan), or Chebyshev distance. While each metric measures “how far” the vectors are, they work differently. It’s best to choose a familiar metric and run a few experiments to set the thresholds. In the Evidently Python library, we implemented all the listed methods.
Drift score. The drift score is the value of the Euclidean distance. It can take any value from 0 to infinity. Identical datasets will have a distance of 0. Intuitively, a smaller distance indicates that the two vectors are similar, while a larger distance shows they are further apart.
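The calculation can be sketched in a few lines of NumPy. This is a minimal illustration on toy data, not the library implementation. The function name `euclidean_drift_score` and the random embeddings are assumptions made for the example.

```python
import numpy as np

def euclidean_drift_score(reference, current):
    """Euclidean distance between the mean embeddings of two datasets."""
    ref_mean = reference.mean(axis=0)  # one representative embedding per dataset
    cur_mean = current.mean(axis=0)
    return float(np.linalg.norm(ref_mean - cur_mean))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(2000, 16))
same = rng.normal(0.0, 1.0, size=(2000, 16))     # same distribution as the reference
shifted = rng.normal(0.5, 1.0, size=(2000, 16))  # every component moved by 0.5

score_same = euclidean_drift_score(reference, same)
score_shifted = euclidean_drift_score(reference, shifted)
```

Here `score_same` stays close to 0 while `score_shifted` is much larger. Since the value is absolute and depends on the embedding scale, the alert threshold has to be tuned per dataset.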
🔬 Statistical hypothesis testing. With large datasets, you can directly use the Euclidean distance to measure drift. With smaller samples (e.g., <1000), you can apply statistical testing. For example, compare the measured distance to the possible distance values for the reference data at a set percentile (we picked 95% as the default). This helps avoid false positive drift detection. You can do the same for other methods, such as Cosine distance, MMD, or model-based drift detection.
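One possible way to obtain such a percentile-based threshold is to look at the distances that appear between random halves of the reference data alone. The sketch below assumes this resampling scheme. The helper names, the number of rounds, and the toy data are illustrative choices, and an actual implementation may differ in the details.

```python
import numpy as np

def mean_distance(a, b):
    """Euclidean distance between the mean embeddings of two samples."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def reference_threshold(reference, n_rounds=200, percentile=95, seed=0):
    """Percentile of the distances observed between random halves of the reference."""
    rng = np.random.default_rng(seed)
    half = len(reference) // 2
    distances = []
    for _ in range(n_rounds):
        idx = rng.permutation(len(reference))
        distances.append(mean_distance(reference[idx[:half]], reference[idx[half:]]))
    return float(np.percentile(distances, percentile))

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=(600, 16))
current = rng.normal(0.3, 1.0, size=(600, 16))  # toy drift: each component moved by 0.3

threshold = reference_threshold(reference)  # what "no drift" distances look like
score = mean_distance(reference, current)
drift_detected = score > threshold
```

The same pattern works with any other drift score: replace `mean_distance` with the score of your choice.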
Experiment results.
For each method, we plot the drift score against the “size” of the introduced drift. We show four plots: for BERT and FastText embeddings, with and without PCA.
We color-coded our datasets on all plots:
Here is how Euclidean distance detects drift in our datasets.
The drift on the food reviews dataset is more subtle, and Euclidean distance appears less sensitive.
Pros and Cons.
Impact of sample size. The experimental notebooks include the code to compare the drift detection results for samples of different sizes. If you want to play around more on your own, check the section for Euclidean distance in any of the notebooks to get the sample code. You can replicate it for other methods (just mind the computation time!)
How the method works. Once again, we average all embeddings in the “current” and “reference” data to get a single representative embedding for each dataset.
But instead of measuring the “length,” we look at the Cosine similarity between the embeddings. It measures the cosine of the angle between the two vectors in a multi-dimensional space. To get the Cosine distance, subtract the Cosine similarity from 1.
Cosine distance = 1 - Cosine similarity
If two vectors are identical, the Cosine similarity is 1, and the Cosine distance is 0.
This is a popular measure in machine learning. For example, it is used for similarity search in information retrieval or recommendation systems.
Drift score. The drift score, in this case, is the Cosine distance value. It can take values from 0 to 2.
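Here is a minimal NumPy sketch of the same calculation. The function name and the tiny hand-made arrays are assumptions for illustration. The sketch also assumes that neither mean embedding is the zero vector, since the Cosine similarity is undefined in that case.

```python
import numpy as np

def cosine_drift_score(reference, current):
    """Cosine distance (1 - Cosine similarity) between the mean embeddings."""
    ref_mean = reference.mean(axis=0)
    cur_mean = current.mean(axis=0)
    similarity = ref_mean @ cur_mean / (np.linalg.norm(ref_mean) * np.linalg.norm(cur_mean))
    return float(1.0 - similarity)

reference = np.array([[1.0, 0.0], [3.0, 0.0]])   # mean embedding points along the first axis
orthogonal = np.array([[0.0, 2.0], [0.0, 4.0]])  # mean embedding points along the second axis
opposite = -reference                            # mean embedding points the other way

score_identical = cosine_drift_score(reference, reference)
score_orthogonal = cosine_drift_score(reference, orthogonal)
score_opposite = cosine_drift_score(reference, opposite)
```

The three scores cover the range of the metric: 0 for identical directions, 1 for orthogonal ones, and 2 for opposite ones. Unlike Euclidean distance, the score ignores the length of the mean embeddings and reacts only to the change in direction.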
🔬 For smaller datasets, you can apply statistical hypothesis testing.
Experiment result. Here is how drift detection using Cosine distance works on our datasets.
Once again, the drift on the food reviews dataset is harder to detect.
Pros and Cons.
We also refer to this method as “domain classifier” or “model-based drift detection.”
How it works. We train a binary classification model to discriminate between data from reference and current distributions. If the model can confidently identify which embeddings refer to the “current” and “reference,” you can consider the two datasets significantly different.
You can read more about the domain classifier approach in the paper “Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift.”
Drift score. The drift score is the ROC AUC score of the domain classifier computed on a validation dataset. It can take values from 0 to 1. Everything over 0.5 is better than random. 1 is an “absolute drift” when the model can perfectly identify where the data belongs.
For example, you can set the threshold to 0.55.
🔬 For smaller datasets, you can apply statistical hypothesis testing. In this case, you’d contrast the obtained ROC AUC against the quality of the “best random model.”
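The idea can be sketched without any ML library. The example below uses a plain NumPy logistic regression as a stand-in for the binary classifier and a rank-based ROC AUC computed on a held-out validation split. The model choice, the 70/30 split, the training settings, and the toy data are all assumptions for illustration, and any binary classifier could take the place of the logistic regression.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via the rank-sum formula; assumes both classes are present and scores are untied."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def domain_classifier_score(reference, current, seed=0, epochs=300, lr=0.1):
    """Train a classifier to tell reference (0) from current (1); return validation ROC AUC."""
    rng = np.random.default_rng(seed)
    x = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])

    # Hold out 30% of the rows to evaluate the classifier.
    idx = rng.permutation(len(x))
    split = int(0.7 * len(x))
    train, valid = idx[:split], idx[split:]

    # Standardize with training statistics so gradient descent behaves on any scale.
    mean, std = x[train].mean(axis=0), x[train].std(axis=0) + 1e-12
    x = (x - mean) / std

    # Logistic regression trained with batch gradient descent.
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x[train] @ w + b)))
        error = p - y[train]
        w -= lr * (x[train].T @ error) / len(train)
        b -= lr * error.mean()

    return roc_auc(y[valid], x[valid] @ w + b)

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=(1000, 8))
same = rng.normal(0.0, 1.0, size=(1000, 8))
shifted = rng.normal(0.5, 1.0, size=(1000, 8))

auc_same = domain_classifier_score(reference, same)
auc_shifted = domain_classifier_score(reference, shifted)
drift_detected = auc_shifted > 0.55  # the example threshold from the text
```

With no drift, the classifier cannot do better than guessing, so `auc_same` typically lands near 0.5. With the shifted data, the ROC AUC moves well above the 0.55 threshold.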
Experiment result. Here is how model-based drift detection works on our datasets.
Pros and Cons.
All in all, this is an excellent default approach: easy to use and interpret, with reasonable compute speed, and with consistent results for different embeddings with and without PCA.
We also refer to this method as "ratio" or "numerical drift detection."
How it works. With embeddings, you get a numerical representation of the input objects. Say, you convert each text into 300 numbers. Effectively, you get a structured tabular dataset. Rows are individual texts represented as embeddings, and columns are components of each embedding.
You can treat each component as a numerical “feature” and check for the drift in its distribution between reference and current datasets. If many embedding components drift, you can consider that there is a meaningful change in the data.
What is the intuition behind the "ratio" method? You can treat each embedding as a set of coordinates of a vector in a multidimensional space. While you cannot interpret their meaning, you can evaluate the share of coordinates that have changed. If many coordinates drift, it is likely that the dataset now looks different.
We use Wasserstein (Earth-Mover) distance with the 0.1 threshold to detect drift in individual components. The metric has an intuitive reading: with the threshold set to 0.1, you flag a component when its values shift by roughly 0.1 standard deviations.
📊 There are other numerical drift detection methods. For small datasets, you can use statistical tests like Kolmogorov-Smirnov. For larger datasets, you can use distance and divergence metrics like Jensen-Shannon or Kullback-Leibler divergence. We gave an overview of methods in the blog “Which test is the best?”.
💡 Should you correct for multiple hypothesis testing? We believe it is not necessary. In this scenario, we do not need to be certain about the drift in specific embedding components. Instead, we look for significant shifts in a large dataset. A type I error (false positive drift detection) occurs when trying to detect small changes in insufficient data. In this case, the likelihood of this error is small. Even if we falsely detect some components as drifting, this won’t be critical: what we want to evaluate is the overall share of drifting elements.
Drift score. After you test individual columns for drift, you can estimate the overall share of drifted embedding components. For example, you can set a threshold to 20%.
This means that if each object is represented by 300 numbers, and 60 of them drift across all objects, you consider data drift to be detected.
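The sketch below shows how the share of drifted components could be computed with NumPy. It approximates the one-dimensional Wasserstein distance on a quantile grid and divides it by the standard deviation of the reference column, which is our reading of the “0.1 standard deviations” interpretation. The helper names, the normalization, the grid size, and the toy data are assumptions made for the example.

```python
import numpy as np

def normalized_wasserstein(ref_col, cur_col, n_quantiles=100):
    """Approximate 1-D Wasserstein distance, scaled by the reference standard deviation."""
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles  # midpoints of equal-probability bins
    distance = np.mean(np.abs(np.quantile(ref_col, q) - np.quantile(cur_col, q)))
    return distance / max(float(np.std(ref_col)), 1e-12)

def drifted_share(reference, current, component_threshold=0.1):
    """Share of embedding components whose distribution drifted."""
    scores = np.array([
        normalized_wasserstein(reference[:, j], current[:, j])
        for j in range(reference.shape[1])
    ])
    return float(np.mean(scores > component_threshold))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(5000, 300))
current = rng.normal(0.0, 1.0, size=(5000, 300))
current[:, :90] += 1.0  # shift 90 of the 300 components by one standard deviation

share = drifted_share(reference, current)
drift_detected = share >= 0.2  # the 20% example threshold from the text
```

In this toy setup, about 30% of the components are flagged, so the dataset-level drift is detected. Both thresholds (per component and for the overall share) are knobs you can tune to your data.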
Experiment result. Here is how the method works on our datasets.
Interestingly, the method does not detect drift on the food reviews dataset: the obtained scores are very low. It again appears not very sensitive, given that the initial dataset was more “noisy.” You might want to tune the numerical drift detection thresholds (e.g., Wasserstein distance in this case).
Pros and Cons.
All in all, this is a useful approach: fairly interpretable and with consistent results for different embeddings. It can be helpful if you are already familiar with numerical drift detection.
How it works. Maximum Mean Discrepancy (MMD) measures the distance between the means of the vectors. This multi-dimensional distance is calculated using a kernel function defined in terms of a Hilbert space. You can read more in the paper “A Kernel Method for the Two-Sample Problem.”
The goal is to distinguish between two probability distributions p and q based on the mean embeddings µp and µq of the distributions in the reproducing kernel Hilbert space F. Formally:
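One common way to write this definition, following the cited paper (the notation below is a standard form and an assumption about the exact symbols used here):

```latex
\mathrm{MMD}^2(\mathcal{F}, p, q) = \left\lVert \mu_p - \mu_q \right\rVert_{\mathcal{F}}^2
```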
This looks a bit confusing, doesn't it?
Here is another way to represent MMD:
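In terms of the kernel K, the same quantity is usually expanded as follows (again a standard form, with x, x' drawn from p and y, y' drawn from q):

```latex
\mathrm{MMD}^2(p, q) =
  \mathbb{E}_{x, x' \sim p}\left[K(x, x')\right]
  - 2\,\mathbb{E}_{x \sim p,\; y \sim q}\left[K(x, y)\right]
  + \mathbb{E}_{y, y' \sim q}\left[K(y, y')\right]
```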
Where K is the kernel function of the reproducing kernel Hilbert space. To simplify, you can think about it as some measure of closeness between two objects. The more similar the objects, the larger this value.
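This becomes more intuitive. If the two distributions are the same, objects are on average as close across the two datasets as they are within each dataset, and MMD should be 0.
If the two distributions are different, objects are on average closer within each dataset than across the two datasets, and MMD will be larger than 0.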
Drift score. The drift score, in this case, is the obtained distance measure (MMD). It can take values above zero. MMD is 0 for identical distributions.
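To see these properties on data, here is a minimal NumPy sketch of a (biased) squared MMD estimate. It assumes an RBF kernel with `gamma` set to 1 divided by the embedding dimension. The kernel, the bandwidth, the function names, and the toy data are illustrative assumptions, not the settings of any specific implementation.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for every pair of rows in a and b."""
    sq_dists = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negatives from rounding

def mmd2(reference, current, gamma=None):
    """Biased estimate of the squared MMD between two samples."""
    if gamma is None:
        gamma = 1.0 / reference.shape[1]
    k_rr = rbf_kernel(reference, reference, gamma).mean()  # closeness within reference
    k_cc = rbf_kernel(current, current, gamma).mean()      # closeness within current
    k_rc = rbf_kernel(reference, current, gamma).mean()    # closeness across datasets
    return float(k_rr + k_cc - 2.0 * k_rc)

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(300, 8))
same = rng.normal(0.0, 1.0, size=(300, 8))
shifted = rng.normal(0.5, 1.0, size=(300, 8))

mmd_identical = mmd2(reference, reference)
mmd_same = mmd2(reference, same)
mmd_shifted = mmd2(reference, shifted)
```

The score is 0 for identical data, small for two samples of the same distribution, and clearly larger for the shifted sample. Note that the kernel matrices grow quadratically with the sample size, which is one reason the method is slower than the others.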
🔬 Statistical hypothesis testing. Some MMD implementations include statistical hypothesis testing. For example, one can compare the obtained MMD values in the current dataset against possible MMD values in reference at a set percentile. This method is useful on smaller datasets (under 1000 objects) but might not be necessary with large datasets. In this case, you can set the MMD threshold directly. This way, you also significantly increase the computation speed, which still remains lower than with other methods.
Experiment result. Here is how drift detection using MMD works on our datasets.
MMD is also not very sensitive to drift on the food reviews dataset.
Pros and Cons.
To evaluate the computation speed for each method, we ran a small experiment using the same instance of Google Colab. We compared the computation for smaller (1000 objects) and larger (15000 objects) datasets with and without PCA.
Let’s sum up the findings!
Model-based drift detection is a good default. You can intuitively tune the ROC AUC threshold to react only to the confident drift.
Evaluating the share of drifted embedding components comes a close second. You can tweak the thresholds to your data and reuse the methods from tabular drift detection. Just make sure to account for the impact of dimensionality reduction!
To track the “size” of drift in time, you can pick metrics like Euclidean distance. However, you might need a few experiments to tune the alert thresholds since the values are absolute.
A few more thoughts on applying drift detection in practice.
Data drift detection is a heuristic. It helps notice changes of a certain magnitude. You need to make assumptions about what you want to detect.
There is no universal way to define data changes that strictly correlate to model quality. Even if the data shifts, some models generalize well. You might also have different error tolerance. Sometimes you want to react to major changes, and sometimes to the smallest shift.
To avoid false alarms, you should tune your drift detection method and thresholds to the particular use case and data. There is no silver bullet here!
Data drift detection can be separate from interpretation. You might use embeddings to detect drift (meaning, get an alert that something changed). However, after drift is detected, it makes sense to analyze raw data to locate and interpret the specific changes, such as new topics appearing in the dataset.
We implemented all the mentioned drift detection methods in Evidently, an open-source Python library to evaluate, test and monitor ML models.
You need to pass a DataFrame, select which columns contain embeddings, and choose the method (or go with defaults!).
The visualization is based on UMAP. It projects the embeddings to a two-dimensional space. Instead of plotting individual points, it summarizes them as contours. This is easier to perceive for a large number of data points.
What else is cool about it?
You can check the Getting Started tutorial to understand the Evidently capabilities or jump directly to a code example on embedding drift detection.