August 24, 2020 · Last updated: August 17, 2023

Machine Learning Monitoring, Part 1: What It Is and How It Differs


This blog is a part of the Machine Learning Monitoring series.

Is there life after deployment?

Congratulations! Your machine learning model is now live. Many models never make it that far: some claim that as many as 87% are never deployed. Given how hard it is to get from concept to a working application, the celebration is well deserved.

It might feel like a final step.

Indeed, even the design of machine learning courses and the landscape of machine learning tools add to this perception. They extensively address data preparation, iterative model building, and (most recently) the deployment phase.

Still, both in tutorials and practice, what happens after the model goes into production is often left up to chance.

Machine learning model lifecycle: data preparation, feature engineering, model training, model evaluation, model deployment — ?

The simple reason for this neglect is a lack of maturity.

Aside from a few tech giants that live and breathe machine learning, most industries are only getting started. There is limited experience with real-life machine learning applications. Companies are overwhelmed with sorting things out for the first time while rushing to deploy. Data scientists do everything from data cleaning to A/B test setup. Model operations, maintenance, and support are often an afterthought.

One of the critical but often overlooked components of this machine learning afterlife is monitoring.

Why monitoring matters

With the learning techniques we use these days, a model is never final. In training, it studies past examples. Once released into the wild, it works with new data: user clickstreams, product sales, or credit applications. With time, this data deviates from what the model saw in training. Sooner or later, even the most accurate and carefully tested solution starts to degrade.

The recent pandemic illustrated this all too vividly.

Some cases even made the headlines:

  • Instacart's model for predicting in-store item availability dropped from 93% to 61% accuracy due to a drastic shift in shopping habits.
  • Bankers question whether credit models trained on good times can adapt to stress scenarios.
  • Trading algorithms misfired in response to market volatility. Some funds fell by 21%.
  • Image classification models had to learn the new normal: a family at home in front of laptops can now mean "work," not "leisure."
  • Even weather forecasts became less accurate after valuable data disappeared with the reduction in commercial flights.
A new concept of "office work" your image classification model might need to learn in 2020. (Image by Ketut Subiyanto, Pexels)

On top of this, all sorts of issues occur with live data.

There are input errors and database outages. Data pipelines break. User demographics change. If a model receives wrong or unusual input, it will make an unreliable prediction. Or many, many of those.
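To make this concrete, here is a minimal sketch of the kind of input checks a monitoring job could run on every incoming batch before the model scores it. The column names, value ranges, and thresholds are hypothetical placeholders, not part of any particular library.

```python
import pandas as pd

# Hypothetical expectations derived from the training data.
EXPECTED_COLUMNS = ["age", "income", "n_orders"]
VALUE_RANGES = {"age": (18, 100), "income": (0, 1_000_000), "n_orders": (0, 500)}

def validate_batch(batch: pd.DataFrame) -> list[str]:
    """Return human-readable warnings for one batch of live model inputs."""
    warnings = []
    missing_cols = set(EXPECTED_COLUMNS) - set(batch.columns)
    if missing_cols:
        warnings.append(f"missing columns: {sorted(missing_cols)}")
    for col, (low, high) in VALUE_RANGES.items():
        if col not in batch.columns:
            continue
        share_null = batch[col].isna().mean()
        if share_null > 0.05:
            warnings.append(f"{col}: {share_null:.0%} of values are missing")
        share_out = ((batch[col] < low) | (batch[col] > high)).mean()
        if share_out > 0.01:
            warnings.append(f"{col}: {share_out:.1%} of values fall outside [{low}, {high}]")
    return warnings
```

Checks like these will not catch every failure, but they flag broken pipelines and odd inputs before the predictions reach anyone downstream.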

Model failures and untreated decay cause damage.

Sometimes this is just a minor inconvenience, like a silly product recommendation or a wrongly labeled photo. The effects go much further in high-stakes domains, such as hiring, grading, or credit decisions.

Even in otherwise "low-risk" areas like marketing or supply chain, underperforming models can severely hit the bottom line when they operate at scale. Companies waste money on the wrong advertising channels, display incorrect prices, understock items, or harm the user experience.

Here comes monitoring.

We don't just deploy our models once. We already know that they will break and degrade. To operate them successfully, we need a real-time view of their performance. Do they work as expected? What is causing the change? Is it time to intervene?

This sort of visibility is not a nice-to-have but a critical part of the loop. Monitoring is baked into the model development lifecycle, connecting production back to modeling. If we detect a quality drop, we can trigger retraining or step back into the research phase and rebuild the model.

Machine learning model lifecycle: after model deployment comes model serving and performance monitoring.
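As a rough illustration of this feedback loop, the sketch below compares performance on recently labeled data against the training-time baseline and decides whether to keep serving, raise an alert, or kick off retraining. The thresholds and the three-way decision are assumptions made for the example, not a recommendation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative tolerances; real values depend on the use case and the cost of errors.
ALERT_DROP = 0.05    # alert if accuracy falls 5 points below the baseline
RETRAIN_DROP = 0.10  # retrain if it falls 10 points below

def check_quality(baseline_accuracy: float,
                  y_true: np.ndarray,
                  y_pred: np.ndarray) -> str:
    """Compare recent labeled performance to the training-time baseline."""
    current = accuracy_score(y_true, y_pred)
    if current >= baseline_accuracy - ALERT_DROP:
        return "ok: keep serving"
    if current >= baseline_accuracy - RETRAIN_DROP:
        return "alert: investigate"      # perhaps step back into the research phase
    return "trigger retraining"
```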
What is machine learning model monitoring?

Machine learning model monitoring is the practice of tracking and analyzing production model performance to ensure acceptable quality as defined by the use case. It provides early warnings on performance issues and helps diagnose their root cause so they can be debugged and resolved.

How machine learning monitoring is different

One might think: we have been deploying software for ages, and monitoring is nothing new. Just do the same with your machine learning stuff. Why all the fuss?

There is some truth to it. A deployed model is a software service, and we need to track the usual health metrics such as latency, memory utilization, and uptime. But in addition to that, a machine learning system has its unique issues to look after.

Monitoring Iceberg. Above water: service health. Below water: data and model health (data drift, model accuracy, concept drift, model bias, underperforming segments).

First of all, data adds an extra layer of complexity.

It is not just the code we should worry about, but also data quality and its dependencies. More moving pieces mean more potential failure modes! Often, these data sources lie completely outside our control.

And even if the pipelines are perfectly maintained, environmental change creeps in and leads to a performance drop.

Is the world changing too fast? In machine learning monitoring, this abstract question becomes an applied one. We watch for data shifts and quantify the degree of change. Quite a different task from, say, checking server load.
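For a single numerical feature, quantifying the degree of change can be as simple as running a two-sample statistical test between the training-time (reference) distribution and a recent serving window. Below is a minimal, library-agnostic sketch using the Kolmogorov-Smirnov test from SciPy; the synthetic data and the 0.05 significance level are only for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_check(reference: np.ndarray, current: np.ndarray, alpha: float = 0.05) -> dict:
    """Compare a live feature to its training-time distribution with a two-sample KS test."""
    statistic, p_value = ks_2samp(reference, current)
    return {
        "ks_statistic": statistic,        # 0 = identical distributions, 1 = fully separated
        "p_value": p_value,
        "drift_detected": p_value < alpha,
    }

# Example: training-time feature values vs. the most recent serving window.
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
current = rng.normal(loc=0.4, scale=1.2, size=1_000)   # the world has shifted
print(drift_check(reference, current))
```

In practice, this kind of test runs across many features at once, which is exactly the sort of routine work monitoring tools such as Evidently automate.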

To make things worse, models often fail silently.

There are no "bad gateway" errors or "404"s. Even if the input data is odd, the system will likely still return a response. An individual prediction might seemingly make sense while being harmful, biased, or wrong.

Imagine we rely on machine learning to predict customer churn, and the model falls short. It might take weeks to learn the facts (such as whether an at-risk client eventually left) or to notice the impact on a business KPI (such as a drop in quarterly renewals). Only then would we suspect the system needs a health check! You would hardly miss a software outage for that long. In the land of unmonitored models, this invisible downtime is an alarming norm.

To save the day, you have to react early. This means assessing just the data that went in and how the model responded: a peculiar type of half-blind monitoring.

Half-blind model monitoring: input data → model response → (you are here) → ground truth → business KPI
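In the churn example, this means watching what we do have: the health of the incoming features and the distribution of the model's own predictions. One common way to track the latter is the Population Stability Index (PSI); the sketch below is a simple hand-rolled version, with made-up scores and a commonly cited rule-of-thumb threshold.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between reference and current prediction scores."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                 # cover the full value range
    ref_share = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_share = np.histogram(current, bins=edges)[0] / len(current)
    ref_share = np.clip(ref_share, 1e-6, None)            # avoid division by zero
    cur_share = np.clip(cur_share, 1e-6, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

# Churn scores the model produced in training vs. this week's serving traffic.
rng = np.random.default_rng(0)
train_scores = rng.beta(2, 5, size=10_000)
live_scores = rng.beta(2, 3, size=2_000)   # the model now predicts higher churn risk
print(f"PSI: {psi(train_scores, live_scores):.3f}")
# A common rule of thumb: PSI below 0.1 is stable, above 0.2 signals a major shift.
```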

The distinction between "good" and "bad" is not clear-cut.

One accidental outlier does not mean the model went rogue and needs an urgent update. At the same time, stable accuracy can also be misleading. Hiding behind an aggregate number, a model can quietly fail on some critical data region.

Importance of context in machine learning monitoring. One model with 99% accuracy: doing great! Another model with 99% accuracy: a complete disaster!
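A quick way to catch this is to slice the same metric by the segments that matter. In the invented example below, overall accuracy looks healthy at 93%, while the "premium" segment sits at 40%; the data and segment names are hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Invented scored data: the aggregate number hides a failing segment.
scored = pd.DataFrame({
    "segment": ["standard"] * 900 + ["premium"] * 100,
    "y_true":  [1] * 890 + [0] * 10 + [1] * 40 + [0] * 60,
    "y_pred":  [1] * 1000,   # the model always predicts the majority class
})

print("overall accuracy:", accuracy_score(scored["y_true"], scored["y_pred"]))  # 0.93

per_segment = scored.groupby("segment")[["y_true", "y_pred"]].apply(
    lambda g: accuracy_score(g["y_true"], g["y_pred"])
)
print(per_segment)  # standard ≈ 0.99, premium = 0.40
```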

Metrics are useless without context.

Acceptable performance, model risks, and costs of errors vary across use cases. In lending models, we care about fair outcomes. In fraud detection, we barely tolerate false negatives. With stock replenishment, ordering too much might be better than ordering too little. In marketing models, we want to keep tabs on performance in the premium segment.

All these nuances inform our monitoring needs, specific metrics to keep an eye on, and the way we'll interpret them.

With this, machine learning monitoring falls somewhere between traditional software monitoring and product analytics. We still look at "technical" performance metrics, such as accuracy or mean absolute error. But what we primarily aim to check is the quality of the decision-making that machine learning enables: whether it is satisfactory, unbiased, and serves our business goal.

In a nutshell

Looking only at software metrics is too little. Looking at the downstream product or business KPIs is too late. Machine learning monitoring is a distinct domain, and it requires appropriate practices, strategies, and tools.

In the next post, we'll explore who should care about monitoring.

Elena Samuylova, Co-founder and CEO, Evidently AI
https://www.linkedin.com/in/elenasamuylova/
