Once you deploy an NLP or LLM-based solution, you need a way to keep tabs on it. But how do you monitor unstructured data to make sense of the pile of texts?
There are a few approaches here, from detecting drift in raw text data and embedding drift to using regular expressions to run rule-based checks.
In this tutorial, we’ll dive into one particular approach – tracking interpretable text descriptors that help assign specific properties to every text.
First, we’ll cover some theory:
Next, we'll get to the code! You will work with e-commerce review data and go through the following steps:
We will use the Evidently open-source Python library to generate text descriptors and evaluate changes in the data.
Code example: If you prefer to go straight to the code, here is the example notebook.
A text descriptor is any feature or property that describes objects in the text dataset. For example, the length of texts or the number of symbols in them.
You might already have helpful metadata to accompany your texts that will serve as descriptors. For example, e-commerce user reviews might come with user-assigned ratings or topic labels.
Otherwise, you can generate your own descriptors! You do this by adding “virtual features” to your text data. Each helps describe or classify your texts using some meaningful criteria.
By creating these descriptors, you basically come up with your own simple “embedding” and map each text to several interpretable dimensions. This helps make sense of the otherwise unstructured data.
You can then use these text descriptors:
Here are a few text descriptors we consider good defaults:
An excellent place to start is simple text statistics. For example, you can look at the length of texts measured in words, symbols, or sentences. You can evaluate average and min-max length and look at distributions.
You can set expectations based on your use case. Say, product reviews tend to be between 5 and 100 words. If they are shorter or longer, this might signal a change in context. If there is a spike in fixed-length reviews, this might signal a spam attack. If you know that negative reviews are often longer, you can track the share of reviews above a certain length.
There are also quick sanity checks: if you run a chatbot, you might expect non-zero responses or that there is some minimum length for the meaningful output.
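As a rough illustration of the length statistics described above, here is a minimal sketch in plain Python. The function name, the sample reviews, and the naive sentence splitting are assumptions made for this example, not the way any particular library computes these values.

```python
import re

def text_length_descriptors(text):
    """Return simple length statistics for a single text."""
    words = text.split()
    # Naive sentence split: runs of ., ! or ? followed by whitespace or end of text
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    return {
        "char_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
    }

# Hypothetical sample of reviews
reviews = [
    "Great dress. Fits perfectly!",
    "ok",
    "The fabric is thin and the color faded after one wash. I returned it.",
]

descriptors = [text_length_descriptors(r) for r in reviews]
word_counts = [d["word_count"] for d in descriptors]

# Share of reviews outside the expected range of 5 to 100 words
out_of_range_share = sum(not (5 <= n <= 100) for n in word_counts) / len(word_counts)
print(word_counts, out_of_range_share)
```

Once every text has a word count, expectations like "between 5 and 100 words" become simple numeric checks that you can track over time.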
Evaluating the share of words outside the defined vocabulary is a good “crude” measure of data quality. Did your users start writing reviews in a new language? Are users talking to your chatbot in Python, not English? Are users filling the responses with “ggg” instead of actual words?
This is a single practical measure to detect all sorts of changes. Once you catch a shift, you can then debug deeper.
You can shape expectations about the share of OOV words based on the examples from “good” production data accumulated over time. For example, if you look at the corpus of previous product reviews, you might expect OOV to be under 10% and monitor if the value goes above this threshold.
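To make the idea concrete, here is a minimal sketch of an OOV share computed against a toy vocabulary. The vocabulary, the tokenization rule, and the sample texts are assumptions for illustration; a real setup would rely on a proper dictionary or a vocabulary built from past "good" data.

```python
import re

def oov_share(text, vocabulary):
    """Share of word tokens in `text` that are missing from `vocabulary` (0.0 to 1.0)."""
    # Tokens are runs of letters; digits and punctuation act as separators
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    unknown = [t for t in tokens if t not in vocabulary]
    return len(unknown) / len(tokens)

# Toy vocabulary (hypothetical)
vocabulary = {"the", "dress", "is", "lovely", "and", "fits", "well", "i", "love", "it"}

reviews = [
    "The dress is lovely and fits well",
    "ggg ggg dress",
    "print('hello')",
]

scores = [oov_share(r, vocabulary) for r in reviews]

# Flag texts above the 10% threshold used as an example in the text
flagged = [r for r, s in zip(reviews, scores) if s > 0.10]
print(scores, flagged)
```

Note that this simple tokenizer splits contractions such as "don't" into two tokens and returns 0.0 for texts without any letters, so you may want to handle those cases separately.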
Related, but with a twist: this descriptor will count all sorts of special symbols that are not letters or numbers, including commas, brackets, hashes, etc.
Sometimes you expect a fair share of special symbols: your texts might contain code or be structured as JSON. Sometimes, you only expect punctuation marks in human-readable text.
Detecting a shift in non-letter characters can expose data quality issues, like HTML codes leaking into the texts of the reviews, spam attacks, unexpected use cases, etc.
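A minimal sketch of this descriptor could look as follows. Whether whitespace counts as a "special symbol" is a design choice; here it is excluded, which is an assumption of this example.

```python
def non_letter_share(text):
    """Share of characters that are neither letters, digits, nor whitespace."""
    if not text:
        return 0.0
    special = [ch for ch in text if not (ch.isalnum() or ch.isspace())]
    return len(special) / len(text)

plain_review = "Nice dress!"
leaked_html = "<p>Nice</p>"

print(non_letter_share(plain_review))  # one "!" out of 11 characters
print(non_letter_share(leaked_html))   # five tag characters out of 11
```

A jump in this value between the reference and current data is a hint to inspect raw examples for leaked markup or spam.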
Text sentiment is another indicator. It is helpful in various scenarios: from chatbot conversations to user reviews and writing marketing copy. You can typically set an expectation about the sentiment of the texts you deal with.
Even if the sentiment “does not apply,” this might translate to the expectation of a primarily neutral tone. The potential appearance of either a negative or positive tone is worth tracking and looking into. It might indicate unexpected usage scenarios: is the user using your virtual mortgage advisor as a complaint channel?
You might also expect a certain balance: for example, there is always a share of conversations or reviews with a negative tone, but you’d expect it not to exceed a certain threshold or the overall distribution of review sentiment to remain stable.
You can also check whether the texts contain words from a specific list or lists and treat this as a binary feature.
This is a powerful way to encode multiple expectations about your texts. You need some effort to curate lists manually, but you can design many handy checks this way. For example, you can create lists of trigger words like:
You can curate (and continuously extend) lists like this that are specific to your use case.
For example, if an advisor chatbot helps choose between products offered by the company, you might expect most of the responses to contain the names of one of the products from the list.
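The product-name check above can be sketched as a binary feature like this. The word list and the chatbot responses are hypothetical, and the matching is exact and case-insensitive, so "dresses" would not match "dress" unless you add lemmatization or stemming.

```python
import re

def contains_trigger_words(text, words):
    """Return True if `text` contains at least one word from `words` (case-insensitive)."""
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    return any(w.lower() in tokens for w in words)

# Hypothetical list of product names
product_names = ["dress", "gown"]

responses = [
    "I recommend the blue Dress.",
    "Try our new gown!",
    "Sorry, I cannot help with that.",
]

flags = [contains_trigger_words(r, product_names) for r in responses]

# Share of responses that mention at least one product
share_with_product = sum(flags) / len(flags)
print(flags, share_with_product)
```

You can then track the share of responses that mention a product and react when it drops below what you usually see.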
The inclusion of specific words from the list is one example of a pattern you can formulate as a regular expression. You can come up with others: do you expect your texts to start with “hello” and end with “thank you”? Include emails? Contain known named elements?
If you expect the model inputs or outputs to match a specific format, you can use regular expression match as another descriptor.
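Here is a minimal sketch of two regex-based descriptors using Python's built-in `re` module. Both patterns are simplified assumptions for illustration (the email pattern in particular is far from a full validator).

```python
import re

# Simplified patterns for illustration only
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
HELLO_TO_THANKS_PATTERN = re.compile(
    r"^\s*hello\b.*\bthank you[.!]?\s*$", re.IGNORECASE | re.DOTALL
)

def regex_descriptors(text):
    """Map a text to binary features based on regular expression matches."""
    return {
        "has_email": bool(EMAIL_PATTERN.search(text)),
        "hello_to_thanks": bool(HELLO_TO_THANKS_PATTERN.match(text)),
    }

print(regex_descriptors("Hello! Please email me at jane@example.com. Thank you!"))
print(regex_descriptors("Where is my order?"))
```

Each pattern becomes a binary column, so you can monitor the share of matching texts the same way as any other categorical descriptor.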
You can extend this idea further. For example:
Here are a few things to keep in mind when designing descriptors to monitor:
Mind the computation cost. Using external models to score your texts by every possible dimension is tempting, but this comes at a cost. Consider it when working with larger datasets: every external classifier is an extra model to run. You can often get away with fewer or simpler checks.
To illustrate the idea, let's walk through the following scenario: you are building a classifier model to score reviews that users leave on an e-commerce website and tag them by topic. Once it is in production, you want to detect changes in the data and model environment, but you do not have the true labels. You need to run a separate labeling process to get them.
How can you keep tabs on the changes without the labels?
Let's take an example dataset and go through the following steps:
Code example: head to the example notebook to follow all the steps.
First, install Evidently. Use the Python package manager to install it in your environment. If you are working in Colab, run !pip install. In the Jupyter Notebook, you should also install nbextension. Check out the instructions for your environment.
You will also need to import a few other libraries like pandas and specific Evidently components. Follow the instructions in the notebook.
Once you have it all set, let’s look at the data! You will work with an open dataset from e-commerce reviews.
Here is how the dataset looks:
We’ll focus on the “Review_Text” column for demo purposes. In production, we want to monitor changes in the texts of the reviews.
You will need to specify the column that contains texts using column mapping:
You should also split the data into two: reference and current. Imagine that "reference" data is the data for some representative past period (e.g., previous month) and "current" is the current production data (e.g., this month). These are the two datasets that you will compare using descriptors.
Note: it's important to establish a suitable historical baseline. Pick the period that reflects your expectations about how the data should look in the future.
We selected 5000 examples for each sample. To make things interesting, we introduced an artificial shift by selecting the negative reviews for our current dataset.
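The split can be sketched in plain Python as shown below. This is an illustration of the idea rather than the notebook code: the "Rating" field name, the cut-off at 3 for the current sample, and the tiny synthetic dataset are all assumptions of this example.

```python
import random

def split_reference_current(rows, n, seed=0):
    """Sample `n` higher-rated rows as reference and `n` lower-rated rows as current."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    positive = [r for r in rows if r["Rating"] > 3]
    negative = [r for r in rows if r["Rating"] <= 3]
    reference = rng.sample(positive, min(n, len(positive)))
    current = rng.sample(negative, min(n, len(negative)))
    return reference, current

# Synthetic stand-in for the review dataset (hypothetical rows)
rows = [{"Review_Text": f"review {i}", "Rating": 1 + i % 5} for i in range(50)]

reference, current = split_reference_current(rows, n=10)
print(len(reference), len(current))
```

In a real monitoring setup, you would split by time period instead. The rating-based split here only simulates a shift.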
To better understand the data, you can generate a visual report using Evidently. There is a pre-built Text Overview Preset that helps quickly compare two text datasets. It combines various descriptive checks and evaluates overall data drift (in this case, using a model-based drift detection method).
This report also includes a few standard descriptors and allows you to add descriptors using lists of Trigger Words. We’ll look at the following descriptors as part of the report:
Check out the Evidently docs on Descriptors for details.
Here is the code you need to run this report. You can assign custom names to each descriptor.
Running a report like this helps explore patterns and shape your expectations about particular properties, such as text length distribution.
The distribution of the “sentiment” descriptor quickly exposes the trick we did when splitting the data. We put reviews with a rating above 3 in the “reference” dataset and more negative reviews in the “current” one. The results are visible:
The default report is very comprehensive and helps look at many text properties at once, up to exploring correlations between descriptors and other columns in the dataset.
You can use it during the exploratory phase, but this is probably not something you’d need to go through all the time.
Luckily, it’s easy to customize.
Evidently Presets and Metrics. Evidently has report presets that quickly generate the reports out of the box. However, there are a lot of individual metrics to choose from! You can combine them to create a custom report. Browse the presets and metrics to understand what’s there.
Let’s say that based on exploratory analysis and your understanding of the business problem, you decide only to track a small number of properties:
You want to notice when there is a statistical change: the distributions of these properties differ from the reference period. To detect it, you can use drift detection methods implemented in Evidently. For example, for numerical features like “sentiment,” it will, by default, monitor the shift using Wasserstein distance. You can also choose a different method.
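To show what a distance-based drift check does for a numerical descriptor, here is a minimal pure-Python sketch of the Wasserstein distance between two samples. Dividing by the reference standard deviation and comparing against a 0.1 threshold are assumptions made for this illustration, and the sentiment scores are made up.

```python
from bisect import bisect_right
from statistics import pstdev

def wasserstein_1d(u, v):
    """First Wasserstein distance between two 1-D empirical samples."""
    u_sorted, v_sorted = sorted(u), sorted(v)
    all_values = sorted(u_sorted + v_sorted)
    distance = 0.0
    for left, right in zip(all_values, all_values[1:]):
        # Gap between the two empirical CDFs on the interval [left, right)
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance

# Hypothetical sentiment scores in the range [-1, 1]
reference_sentiment = [0.6, 0.8, 0.7, 0.9, 0.5]
current_sentiment = [-0.4, -0.2, 0.1, -0.6, 0.0]

distance = wasserstein_1d(reference_sentiment, current_sentiment)

# Normalize by the reference spread so the threshold does not depend on scale
normed_distance = distance / max(pstdev(reference_sentiment), 0.001)
drift_detected = normed_distance >= 0.1  # assumed threshold
print(distance, normed_distance, drift_detected)
```

The distance grows as the current distribution moves away from the reference one, which makes it a convenient single number to track for each descriptor.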
Here is how you can create a simple drift report to track changes in the three descriptors.
Once you run the report, you will get combined visualizations for all chosen descriptors. Here is one:
The dark green line is the mean sentiment in the reference dataset. The green area covers one standard deviation from the mean. You can notice that the current distribution (in red) is visibly more negative.
Note: In this scenario, it also makes sense to monitor output drift by tracking shifts in the predicted classes. You can use categorical data drift detection methods, like JS divergence. We do not cover this in the tutorial, as we focus only on inputs and do not generate predictions. In practice, prediction drift is often the first signal to react to.
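For categorical values such as predicted classes, a drift score compares the share of each category in the two datasets. Below is a minimal sketch of the Jensen-Shannon divergence with log base 2, which keeps the value between 0 and 1. The topic labels are hypothetical, and libraries may report the square root of this value (the JS distance) instead.

```python
import math
from collections import Counter

def category_shares(labels, categories):
    """Relative frequency of each category, in the order given by `categories`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical predicted topics
reference_topics = ["dresses"] * 6 + ["tops"] * 4
current_topics = ["dresses"] * 2 + ["tops"] * 8

categories = sorted(set(reference_topics) | set(current_topics))
p = category_shares(reference_topics, categories)
q = category_shares(current_topics, categories)

print(js_divergence(p, q))
```

Identical distributions score 0, and distributions with no overlapping categories score 1.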
Let's say you decided to track one more meaningful property: the emotion expressed in the review. The overall sentiment is one thing, but it also helps distinguish between "sad" and "angry" reviews, for example.
Let's add this custom descriptor! You can find an appropriate external open-source model to score your dataset. Then, you will work with this property as an additional column.
We will take the DistilBERT model from Hugging Face, which classifies the text by five emotions.
You can consider using any other model for your use case, such as named entity recognition, language detection, toxicity detection, etc.
You must install transformers to be able to run the model. Check the instructions for more details. Then, apply it to the review dataset:
Note: this step will score the dataset using the external model. It will take some time to execute, depending on your environment. To understand the principle without waiting, refer to the "Simple Example" section in the example notebook.
After you add the new column "emotion" to the dataset, you must reflect this in Column Mapping. You should specify that it is a new categorical variable in the dataset.
Now, you can add the “emotion” distribution drift monitoring to the Report.
Here is what you get!
You can see a significant increase in "sad" reviews and a decrease in "joy."
Does it appear helpful to track over time? You can continue running this check by scoring new data as it comes.
To perform regular analysis of your data inputs, it makes sense to package the evaluations as tests. You get a clear "pass" or "fail" result in this scenario. You probably do not need to look at the plots if all tests pass. You're only interested when things change!
Evidently has an alternative interface called Test Suite that works this way.
Here is how you create a Test Suite to check for statistical distribution drift in the same four descriptors:
Note: we go with defaults, but you can also set custom drift methods and conditions.
Here is the result. The output is neatly structured so you can see which descriptors have drifted.
Detecting statistical distribution drift is one of the ways to monitor changes in the text property. There are others! Sometimes, it is convenient to run rule-based expectations on the descriptor's min, max, or mean values.
Let's say you want to check that all review texts are at least two words long. If even one review is shorter than two words, you want the test to fail and see the number of short texts in the response.
Here is how you do that! You can pick a TestNumberOfOutRangeValues() check. This time, you should set a custom boundary: the “left” side of the expected range is two words. You must also set a test condition: eq=0. This means you expect the number of objects outside this range to be 0. If it is higher, you want the test to return a fail.
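The logic behind such a check can be sketched in plain Python as follows. This is an illustration of the principle, not the library's implementation: the function name and the returned structure are hypothetical, and treating the boundaries as inclusive is an assumption.

```python
def check_out_of_range_count(values, left=None, right=None, eq=0):
    """Count values outside [left, right] and compare the count with the expected `eq`."""
    def out_of_range(value):
        below = left is not None and value < left
        above = right is not None and value > right
        return below or above

    count = sum(out_of_range(v) for v in values)
    return {
        "status": "SUCCESS" if count == eq else "FAIL",
        "out_of_range_count": count,
        "expected_count": eq,
    }

# Hypothetical reviews and their word counts
reviews = ["Love it", "Nope", "Runs small but looks great"]
word_counts = [len(r.split()) for r in reviews]

result = check_out_of_range_count(word_counts, left=2, eq=0)
print(result)
```

Here the one-word review falls below the boundary, so the check fails and reports how many texts violated the expectation.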
Here is the result. You can also see the test details that show the defined expectation.
You can follow this principle to design other checks.
Text descriptors map text data to interpretable dimensions you can express as a numerical or a categorical attribute. They help describe, evaluate, and monitor unstructured data.
In this tutorial, you learned how to monitor text data using descriptors.
You can use this approach to monitor the behavior of NLP and LLM-powered models in production. You can customize and combine your descriptors with other methods, such as monitoring embedding drift.
Are there other descriptors you consider universally useful? Let us know! Join our Discord community to share your thoughts.