This blog is a part of the Machine Learning Monitoring series. In our previous posts, we explored Why Model Monitoring Matters and Who Should Care About ML Monitoring.
Now, let's get into more detail on what exactly to monitor.
As the saying goes: garbage in, garbage out. Input data quality is the most crucial component of a machine learning system. Whether or not you have an immediate feedback loop, your monitoring always starts here.
There are two types of data issues one encounters. Put simply:
1) something goes wrong with the data itself; or
2) the data changes because the environment does.
Let us start with the first category. It alone has plenty.
A machine learning application usually relies on upstream systems to provide inputs. The most trivial, but frequent, case is when the production model does not receive the data at all. Or it receives corrupted or incomplete data because of pipeline issues.
Let's take a marketing example.
The data science team in a bank developed a mighty machine learning system to personalize promo offers sent to clients each month.
This system uses data from an internal customer database, clickstream logs from the internet banking and mobile app, and call center logs. Also, the marketing team manually maintains a spreadsheet where they add this month's promo options.
All the data streams are merged and stored in a data warehouse. When the model is run, it calculates the necessary features on top of the joint table. The model then ranks the offers for each client based on the likelihood of acceptance and outputs the result.
This pipeline uses multiple data sources, and a different functional owner maintains each of them. Plenty of opportunity for something to break!
Here is an incomplete list of the nasty things to happen:
When data processing goes bad, the model code can simply crash. At least you'll learn about the issue fast. But if your Python code has broad try/except clauses that swallow errors, it might run on incorrect or incomplete input. The consequences are all yours.
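To make the try/except risk concrete, here is a minimal sketch. The function names and the record layout are hypothetical and invented for illustration; they are not taken from any particular pipeline.

```python
def parse_amount_silently(record):
    """Anti-pattern: swallow every error and substitute a default."""
    try:
        return float(record["amount"])
    except Exception:
        # The model now receives a plausible-looking but invented value.
        return 0.0


def parse_amount_strictly(record):
    """Fail loudly so the broken input is noticed before scoring."""
    try:
        return float(record["amount"])
    except (KeyError, TypeError, ValueError) as err:
        raise ValueError(f"bad 'amount' in record {record!r}") from err


# Hypothetical corrupted record coming from an upstream system.
broken = {"amount": "N/A"}
silent_value = parse_amount_silently(broken)
```

The silent version keeps the pipeline running and hands the model a made-up zero. The strict version stops the run, which is usually the cheaper failure in a batch setup where you can repeat the job.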
The promo example we looked at has batch inference. It is less dramatic. You have some room for error. If you catch the pipeline issue on time, you can simply repeat the model run.
In high-load streaming models, the data processing problems multiply (think e-commerce, gaming, or bank transactions).
In other cases, data processing works just fine. But then a valid change happens at the data source. Whatever the reason, new data formats, types, and schemas are rarely good news to the model.
On top of this, the author of the change is often unaware of the impact, or even that a model exists downstream.
Let's go back to the promo example.
One day, the call center's operational team decides to tidy up the CRM and enrich the information they collect after each customer call.
They might introduce better, more granular categories to classify calls by the type of issue. They would also ask each client about their preferred communication channel, and start to log this in a new field. And while they are at it: let's rename and change the order of fields to make it more intuitive for new users.
Now, that looks neat!
But not so to the model.
In technical terms, this all translates to lost signal.
Unless explicitly told so, the model will not match new categories with the old ones or process extra features. If there is no data completeness check, it will generate the response based on partial input it knows how to handle.
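A data completeness check of this kind can be very simple. The sketch below assumes each batch arrives as a list of dictionaries; the column names and the set of known call types are hypothetical stand-ins for whatever the model was trained on.

```python
# Assumed training-time schema; these names are illustrative only.
EXPECTED_COLUMNS = {"client_id", "call_type", "call_duration_sec"}
KNOWN_CALL_TYPES = {"card", "loan", "other"}


def check_batch(rows):
    """Return a list of human-readable problems found in a batch of dict rows."""
    problems = []
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - set(row)
        extra = set(row) - EXPECTED_COLUMNS
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        call_type = row.get("call_type")
        if call_type is not None and call_type not in KNOWN_CALL_TYPES:
            problems.append(f"row {i}: unknown call_type {call_type!r}")
    return problems


rows = [
    {"client_id": 1, "call_type": "card", "call_duration_sec": 30},
    # After the CRM clean-up: renamed field plus a brand-new one.
    {"client_id": 2, "issue_category": "card/blocked",
     "preferred_channel": "email", "call_duration_sec": 10},
    # Same schema, but a category the model has never seen.
    {"client_id": 3, "call_type": "mortgage", "call_duration_sec": 5},
]
problems = check_batch(rows)
```

Running the check before scoring turns a silent loss of signal into an explicit list of schema and category mismatches that someone can act on.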
This pain is well known to anyone who deals with catalogs, for example in demand forecasting or e-commerce recommendations. Often, you would have some complex features based on category type. Say, "laptop" or "mobile phone" is in "electronics." That is expensive. Let's make it a feature. "Phone case" is in "accessories." That is sort of "cheap." We'll use that too.
Then, someone reorganizes the catalog. Now, "mobile phone" and "phone case" are both under "mobile." A whole different category, with a different interpretation. The model will need to learn it all over again or wait until someone explains what happened.
No magic here. If catalog updates occur often, you'd better factor it into the model design. Otherwise, educate the business users and keep track of sudden changes.
Some more examples:
The irony is, domain experts can perceive the change as an operational improvement. For example, a new sensor allows you to capture high-granularity data at a millisecond rate. Much better! But the model was trained on aggregates and expects to calculate them the usual way.
Lack of clear data ownership and documentation makes it harder. There might be no easy way to trace or know whom to inform about an upcoming data update inside an organization. Data quality monitoring becomes the only way to capture the change.
The data not only changes. It can also be lost, due to some failure at the very source.
For example, you lose the application clickstream data due to a bug in logging. A physical sensor breaks, and the temperature is no longer known. An external API becomes unavailable, and so on. We want to catch these issues early, since they often mean the irreversible loss of future retraining data.
Such outages may affect only a subset of data. For instance, users in one geography or on a specific operating system. This makes detection harder. Unless another (properly monitored!) system relies on the same data source, the failure can go unnoticed.
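One way to surface a partial outage is to compare row counts per segment against a reference period instead of looking only at the total. This is a minimal sketch: the function name, the `os` key, and the 50% threshold are assumptions chosen for illustration.

```python
from collections import Counter


def segments_with_volume_drop(baseline_rows, current_rows, key, max_drop=0.5):
    """Return segments whose row count fell by more than max_drop vs. baseline."""
    baseline = Counter(row[key] for row in baseline_rows)
    current = Counter(row[key] for row in current_rows)
    flagged = {}
    for segment, expected in baseline.items():
        observed = current.get(segment, 0)
        drop = 1 - observed / expected
        if drop > max_drop:
            flagged[segment] = (expected, observed)
    return flagged


# Hypothetical clickstream volumes: total traffic falls by only about 10%,
# but one operating system has almost disappeared.
baseline_rows = [{"os": "ios"}] * 900 + [{"os": "android"}] * 100
current_rows = [{"os": "ios"}] * 900 + [{"os": "android"}] * 5
flagged = segments_with_volume_drop(baseline_rows, current_rows, key="os")
```

In this toy data, a check on overall volume alone would likely pass, while the per-segment view flags the affected group.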
Even worse, a corrupted source might still provide the data. For example, a broken temperature sensor will return the last measurement as a constant value. That is hard to spot unless you keep track of "unusual" numbers and patterns.
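A stuck sensor can be approximated as an unusually long run of identical consecutive readings. The sketch below is one simple heuristic, not a complete anomaly detector; the function names and the `max_run` threshold are hypothetical and would need tuning to how often the real signal legitimately repeats.

```python
def longest_constant_run(values):
    """Length of the longest run of identical consecutive readings."""
    longest = current = 0
    previous = object()  # sentinel that equals no real reading
    for value in values:
        current = current + 1 if value == previous else 1
        longest = max(longest, current)
        previous = value
    return longest


def looks_stuck(values, max_run=5):
    """Flag a series whose longest constant run exceeds max_run readings."""
    return longest_constant_run(values) > max_run


# Hypothetical temperature readings, for illustration only.
healthy = [21.0, 21.3, 21.1, 20.9, 21.4, 21.2, 21.0, 20.8]
stuck = [21.0, 21.3, 21.1, 21.1, 21.1, 21.1, 21.1, 21.1, 21.1]
```

Here `looks_stuck(stuck)` is true because the last reading repeats seven times, while the healthy series never repeats a value twice in a row.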
As with physical failures, we can't always resolve the issue immediately. But catching it on time helps quickly assess the damage. If needed, we can update, replace, or pause the model.
In more complex setups, you have several models that depend on each other. One model's output is another model's input.
This also means: one model's broken prediction is another model's corrupted feature.
Take a content or product recommendation engine.
It might first predict the popularity of a given product or item. Then, it makes recommendations to different users, taking into account the estimated popularity.
These would be separate models, basically looped into each other. Once the item is recommended to the user, it is more likely to be clicked on, and thus more likely to be seen as "popular" by the first model.
A more tech-y example: a car route navigation system.
First, your system constructs possible routes. Then, a model predicts the expected time of arrival for each of them. Next, another model ranks the options and decides on the optimal route. Which, sort of, influences the actual traffic jams. Once cars follow the suggested routes, this creates a new road situation.
Other models in logistics, routing, and delivery often face the same issue.
These linked systems bear an obvious risk: if something is wrong with one of the models, you get an interconnected loop of problems.
All these varying issues call for a number of input data quality checks. Some of these errors are trivial, but they are also the most painful to miss.
How to track them? In the next blog, we will go into detail on what exactly to monitor.