This blog is a part of the Machine Learning Monitoring series. In our previous posts, we discussed Why Model Monitoring Matters, Who Should Care About ML Monitoring and What Can Go Wrong With Your Data.
In our previous blog, we listed a bunch of things that can go wrong with the data that goes into a machine learning model.
To recap, these include data processing issues, changing data schemas, lost data, and broken upstream models, at the very least.
We want these things never to happen, but let's be realistic. So, our goal is to catch them in time instead.
Usually, data reliability and consistency fall under data engineering. You might even have some checks or monitoring systems at the database level. Is there anything else to keep an eye on?
The thing is, with machine learning systems, we do not care about overall data quality. We want to track the particular data subset consumed by a given model, sometimes exclusively. It does not matter that 99% of the data in the warehouse is correct; we want to check our piece.
Feature processing code is also a separate moving piece to monitor. This requires a custom set-up.
So, on the data quality and integrity side, MLOps meets DataOps. We'd better double-check.
There are a few data-related things to look at:
The first question to answer is whether the model even works. For that, look at the number of model responses. It is a basic but useful check to add on top of software monitoring.
Why? The service itself might be operational, but not the model. Or, you might rely on a fallback mechanism, like a business rule, more often than planned.
If your model is used only occasionally, this check is less useful. But if there is a "normal" usage pattern, it is a great sanity check. For example, your model is deployed on an e-commerce website or is fed new sales data every day. You then know what consumption to expect.
Looking at the number of model calls is an easy way to catch when something is very wrong.
Depending on the model environment, you might want to check requests and responses separately. Was the model not asked (e.g., because a recommendation widget crashed), or did it fail to answer (e.g., the model timed out and we had to use a static recommendation instead)? The answer points to where you should start debugging.
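Here is a minimal sketch of such a check. The counter names, the expected daily volume, and the fallback threshold are all hypothetical; in practice, they would come from your own service metrics.

```python
from dataclasses import dataclass

@dataclass
class ModelCallStats:
    requests: int    # how many times the model was asked
    responses: int   # how many times it returned a prediction
    fallbacks: int   # how many times a fallback rule answered instead

def check_call_volume(stats: ModelCallStats,
                      expected_requests: int,
                      tolerance: float = 0.5,
                      max_fallback_share: float = 0.1) -> list[str]:
    """Return a list of human-readable alerts; an empty list means all good."""
    alerts = []
    # The service may be up while the model is silent: too few requests
    # hints at a broken caller (e.g., the recommendation widget crashed).
    if stats.requests < expected_requests * (1 - tolerance):
        alerts.append(f"Only {stats.requests} requests, expected ~{expected_requests}")
    # Requests arrived, but the model did not answer them all.
    if stats.responses < stats.requests:
        missing = stats.requests - stats.responses
        alerts.append(f"{missing} requests got no model response")
    # The fallback (static recommendation, business rule) fired too often.
    if stats.requests and stats.fallbacks / stats.requests > max_fallback_share:
        alerts.append(f"Fallback used for {stats.fallbacks}/{stats.requests} requests")
    return alerts

# Example: yesterday's counters vs. the usual daily volume.
print(check_call_volume(ModelCallStats(requests=420, responses=380, fallbacks=60),
                        expected_requests=1000))
```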
Now, let's look at the data.
In the previous blog, we described how data schemas might change. Be it due to bad practices or best intentions, we want to detect it.
Our goal is to learn when the features get dropped, added, or changed.
The straightforward way is to perform a feature-by-feature check and investigate:
1/ If the feature set remains the same. In the case of tabular data: how many columns are there? Is anything missing, or anything new?
2/ If the feature data types match. Did we get categorical values instead of numerical ones somewhere? For example, a column used to hold numerical values ranging from 1 to 10. Now, when we query it, we see values like "low," "medium," and "high." We should be able to catch this.
In the end, you want a quick summary view showing that the incoming dataset is shaped as expected.
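Here is a rough sketch of such a check, assuming both the reference (training) data and the incoming batch are pandas DataFrames. The column names are made up for illustration.

```python
import pandas as pd

def check_schema(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
    """Compare the incoming batch against the schema the model was trained on."""
    ref_cols, cur_cols = set(reference.columns), set(current.columns)
    report = {
        "missing_columns": sorted(ref_cols - cur_cols),  # features that were dropped
        "new_columns": sorted(cur_cols - ref_cols),      # features that appeared
        "type_mismatches": {},
    }
    # For the columns present in both, check that the data types still match,
    # e.g. a numeric score that suddenly arrives as "low"/"medium"/"high".
    for col in ref_cols & cur_cols:
        if reference[col].dtype != current[col].dtype:
            report["type_mismatches"][col] = (str(reference[col].dtype),
                                              str(current[col].dtype))
    return report

# Example: a dropped column, a new column, and a type change in "age".
reference = pd.DataFrame({"age": [25, 40], "plan": ["basic", "premium"]})
current = pd.DataFrame({"age": ["low", "high"], "channel": ["web", "app"]})
print(check_schema(reference, current))
```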
We also want to detect any missing data.
Often, there is some acceptable share of missing values. We do not want to react to every empty entry. But we do want to check that the level of missing data stays within the "normal" range, both for the whole dataset and for individual features. Are any critical features lost?
It is important to keep in mind that missing values come in many flavors. Sometimes they are empty, and sometimes they are "unknown" or "999"s. If you only check for truly absent entries, you might miss the disguised ones. It is best to also scan for standard expressions of missing data, such as "N/A," "NaN," "undefined," etc. Having an occasional audit with your own eyes is not a bad idea, either.
If you have a limited number of features, you can visualize them all in a single plot, color-coding each by its share of missing values.
You can also set a data validation threshold that defines when to pause the model or switch to a fallback, for example, if too many features are missing. The definition of "too many," of course, depends on your use case and the model's cost of error.
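Here is a possible sketch of both ideas. The list of "disguised" missing values, the feature names, and the 20% threshold are assumptions you would adjust to your own data.

```python
import pandas as pd

# Values that often mean "missing" without being a real null.
DISGUISED_NULLS = {"", "N/A", "NA", "NaN", "undefined", "unknown", "999", 999, -1}

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Share of missing values per feature, counting disguised placeholders."""
    return (df.isna() | df.isin(DISGUISED_NULLS)).mean()

def should_pause_model(df: pd.DataFrame,
                       critical_features: list[str],
                       max_missing_share: float = 0.2) -> bool:
    """Pause the model (or switch to a fallback) if any critical feature is too sparse."""
    share = missing_share(df)
    return bool((share[critical_features] > max_missing_share).any())

batch = pd.DataFrame({
    "tenure_months": [12, None, 999, 7],           # 999 acts as a disguised null here
    "region": ["north", "unknown", "south", "N/A"],
})
print(missing_share(batch))
print(should_pause_model(batch, critical_features=["tenure_months"]))
```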
One useful tip is to single out your key driving factors. You can do that based on model feature importance or SHAP values, or combine either method with your domain knowledge of what matters.
The idea is to set up different monitoring policies. You always need your critical features to run the model. With auxiliary ones, absence is not a show-stopper: you just make a note and investigate it later with the data owner.
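One way to derive such a split is sketched below with a toy scikit-learn model, purely for illustration. In practice, you would use your production model or SHAP values, add domain knowledge, and pick a cutoff that fits your case.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data and model, only to get feature importances for the example.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
feature_names = [f"feature_{i}" for i in range(6)]
model = RandomForestClassifier(random_state=42).fit(X, y)

importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)

# Strict policy: the model does not run without these features.
critical = list(importance.head(3).index)
# Soft policy: log a warning and follow up with the data owner.
auxiliary = list(importance.tail(3).index)

print("critical:", critical)
print("auxiliary:", auxiliary)
```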
Just because the data is there, it does not mean it is correct.
For example, a sensor gets stuck and keeps reporting the same value, a numerical feature falls far outside its usual range, or a categorical column starts showing values you have never seen before. In all these cases, the model works and the data is available, but it is corrupted.
To detect this, you want to monitor the feature statistics and distribution.
1/ Feature value range. For numerical features, check if the values stay within a reasonable range. For categorical attributes, define a list of possible values and keep an eye out for new ones.
How to do this? You can define the expected ranges and allowed categories manually, based on domain knowledge, or derive them from the training data. It also helps to explicitly state when nulls are allowed. A code sketch of such checks follows this list.
2/ The key feature statistics. For numerical features, you can track the mean, median, min-max range, and quantiles.
The latter would help catch cases like a broken sensor: formally, the values stay within the expected range, but the measurement is completely static.
For categorical inputs, you can check their frequencies. If you work with text, the equivalent might be the share of vocabulary words, for example.
The goal is to monitor the live dataset for compliance with these expectations and validate the incoming data. This way, you can catch range violations, unusual values, or a shift in statistics.
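Here is a rough sketch of both checks: validating declared ranges and allowed categories, and comparing basic statistics against the training data. The expectations and feature names are hypothetical; you would define your own, manually or by profiling the training set.

```python
import pandas as pd

# Hypothetical expectations, defined manually or profiled from the training data.
EXPECTATIONS = {
    "customer_age": {"min": 18, "max": 100},
    "satisfaction_score": {"min": 1, "max": 10},
    "plan_type": {"allowed": {"basic", "standard", "premium"}},
}

def range_violations(df: pd.DataFrame) -> dict:
    """Count values per feature that break the declared range or category list."""
    violations = {}
    for col, rule in EXPECTATIONS.items():
        series = df[col]
        bad = pd.Series(False, index=series.index)
        if "min" in rule:      # numerical feature: out-of-range values
            bad |= series.lt(rule["min"]) | series.gt(rule["max"])
        if "allowed" in rule:  # categorical feature: novel categories
            bad |= ~series.isin(rule["allowed"]) & series.notna()
        violations[col] = int(bad.sum())
    return violations

def statistics_shift(reference: pd.Series, current: pd.Series) -> dict:
    """Compare basic statistics of a numerical feature between training and live data."""
    def describe(s: pd.Series) -> dict:
        return {"mean": s.mean(), "median": s.median(),
                "q05": s.quantile(0.05), "q95": s.quantile(0.95), "std": s.std()}
    ref, cur = describe(reference), describe(current)
    return {
        "reference": ref,
        "current": cur,
        # Near-zero variance on live data is the "broken sensor" pattern:
        # the values stay within range but never change.
        "suspiciously_static": bool(cur["std"] < 1e-6),
    }

# Example: one batch with a range violation and a novel category...
batch = pd.DataFrame({
    "customer_age": [34, 250, 41],
    "satisfaction_score": [7, 3, 9],
    "plan_type": ["basic", "gold", "premium"],
})
print(range_violations(batch))

# ...and a sensor-like feature that got stuck at a single value.
training_values = pd.Series([20.1, 22.4, 19.8, 23.0, 21.5])
live_values = pd.Series([21.0, 21.0, 21.0, 21.0, 21.0])
print(statistics_shift(training_values, live_values))
```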
One more aspect to consider is where to run your data validation checks.
When the data is wrong, the first question is why. Ideally, we want to locate the error as soon as we catch it. Broken joins or feature transformation code can be the reason: the source data is just fine, but something goes wrong during its transformation into model features.
Sometimes it makes sense to validate the inputs and outputs separately for each step in the pipeline. This way, we can locate the problem and debug it faster.
For example, say you predict customer churn for a mobile operator. Marketing data comes from one source. Purchase logs are joined with ever-changing product plans. Then usage logs are merged with external data on technical connection quality. Feature transformation takes several steps.
Of course, you can simply validate the output of your last calculation. But if you then notice that some features make no sense, you'd have to retrace each step. If pipelines are complex, separate checks might save you some detective work.
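Here is a simplified sketch of per-step validation for such a pipeline. The step names, columns, and checks are invented for illustration; the point is to fail with the step name attached, so you know where to start debugging.

```python
import pandas as pd

def validate(df: pd.DataFrame, step: str, required_columns: list[str]) -> pd.DataFrame:
    """Fail fast with the step name attached, so you know where the pipeline broke."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"[{step}] missing columns: {missing}")
    if df.empty:
        raise ValueError(f"[{step}] produced an empty dataframe, check the join keys")
    return df

# Hypothetical churn pipeline: validate after each transformation,
# not only at the end, so a broken join is caught where it happens.
marketing = pd.DataFrame({"customer_id": [1, 2], "segment": ["a", "b"]})
purchases = pd.DataFrame({"customer_id": [1, 2], "plan": ["basic", "premium"]})

joined = validate(marketing.merge(purchases, on="customer_id", how="left"),
                  step="join_marketing_purchases",
                  required_columns=["customer_id", "segment", "plan"])

features = validate(joined.assign(is_premium=joined["plan"].eq("premium")),
                    step="feature_transformation",
                    required_columns=["is_premium"])
print(features)
```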
Data quality monitoring is the first line of defense for production machine learning systems. By looking at the data, you can catch many issues before they hit the actual model performance.
You can, and should, do this for every model. It is a basic health check, similar to latency or memory monitoring. It is essential for both human- and machine-generated inputs; each has its own types of errors. Data monitoring also helps reveal abandoned or unreliable data sources.
Of course, the data issues do not stop just here. In the next post, we'll dig deeper into data and concept drift.
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring the whole team into a single workspace to collaborate on AI quality.