When one mentions "ML monitoring," this can mean many things. Are you tracking service latency? Model accuracy? Data quality? The share of visitors that click on the recommendation block?
ML monitoring can be all of the above or none.
This blog organizes all metrics into a single framework. It is a high-level but, we hope, comprehensive overview. Read on if you are new to ML monitoring and want a quick grasp of it.
Along the way, we'll link to articles in which companies like DoorDash, Nubank, Booking.com, and LinkedIn describe how they approach ML monitoring.
For starters, why even talk about monitoring?
When you deploy an ML system in production, it integrates into the business. It's ROI time! You expect it to deliver value. For example, a recommendation system should improve user experience and increase revenue.
But these ML systems can fail. Some of the failures are obvious and trivial, like the service going down. Others are silent and particular to machine learning, such as data and concept drift. You might also face critical second-order effects. For example, a credit scoring system can show bias towards certain customer groups.
To control these risks, you must monitor the production ML system.
Here is how DoorDash describes its motivation for building ML monitoring:
"In the past, we've seen instances where our models became out-of-date and began making incorrect predictions. These problems impacted the business and customer experience negatively and forced the engineering team to spend a lot of effort investigating and fixing them. Finding this kind of model drift took a long time because we did not have a way to monitor for it".
Source: "Maintaining Machine Learning Model Accuracy Through Monitoring," DoorDash Engineering Blog.
We once made this iceberg image at the header of the blog. It was pretty well-received. Indeed, it does make a point. Monitoring ML models in production means more than tracking software performance. There are a bunch of other things!
But this iceberg is a binary classifier. We compare software-related aspects to the unseen "everything else" that makes ML monitoring different.
Let's now try to organize the rest. What should you look at?
Here is a way to structure all components of the ML system monitoring:
Let's quickly grasp what is there at each layer, bottom-up.
First, you still have the software backend. Yep, you cannot ignore this: let's place it at the base of the pyramid.
To generate the predictions, you need to invoke the ML model somehow. The simpler example is batch inference. You can run it daily, hourly, or on demand and use a workflow manager to orchestrate the process. The job would access the data source, run the model, and write the predictions to a database. Online inference is a bit more complex. You might wrap the model as a service and expose a REST API to serve predictions on request. There are more moving pieces to track.
At the ground level, you still need to monitor how this software component works. Did the prediction job execute successfully? Did the service respond? Is it working fast enough?
Second, you have the data. The production ML models take new data as input, and this data changes over time. There are also many issues with data quality and integrity that might occur at the source or during transformation.
This data represents the model's reality, and you must monitor this crucial component. Is the data OK? Can you use it to generate predictions? Can you retrain the model using this data?
Third, you have the hero: the ML model itself. Finally!
No model is perfect, and no model lasts forever. Still, some of them are useful and relevant for the given task. Once the model is in production, you must ensure its quality remains satisfactory.
This model-focused component of ML monitoring is the most specific one. Is the model still fit for the task? Are the predictions accurate? Can you trust them?
Lastly, there is the business or product KPI. No one uses ML to get "90% accuracy" in something. There is a business need behind it, such as converting users into buyers, making them click on something, getting better forecasts, decreasing delivery costs, etc. You need a dollar value assigned to the model, a measurable product metric, or the best proxy you can get.
That is the ultimate reason you have the ML system in place, and the tip of the monitoring pyramid. Does the model bring value to the business? Are the product metrics affected by the model OK?
An ML system has these four components: the software piece, the flowing data, the machine learning model, and the business reason for its existence.
Since an ML system is all of the above, ML monitoring has to be too.
Here is how Booking.com looks at the quality of the ML system as a whole:
"Each model is part of a whole machine learning system where data is being preprocessed, predictions are being made, and finally, these predictions are used to influence the day-to-day operations of our business. Obviously, we want these models to be of good quality, but not only the models: the entire system needs to be of good quality to ensure long-lasting business impact".
Source: "A Quality Model for Machine Learning Systems," Booking.com Data Science Blog.
This pyramid structure describes the hierarchy of ML monitoring. But it does not answer the "how." Which metrics should you calculate? Do you need a lot of them?
Let's now review each level of the pyramid in more detail. We will consider:
We'll not go into details of the logging and monitoring architecture but will primarily focus on the contents.
Let's dive in!
Operational metrics help evaluate the software system's health. After all, it does not matter how great your ML model is if the system is down.
What can go wrong with the software component in ML? Pretty much anything that can go wrong with any other production system. Bugs in the code, human errors, infrastructure issues, you name it.
What is the goal of ML software monitoring? First, to know that the system is up and running and to intervene immediately if it fails. Second, to understand more fine-grained performance characteristics and ensure you comply with service level objectives. If necessary, you can make changes, like spinning up new instances as the service usage grows.
How specific is this software monitoring to ML? Business as usual. There is no difference between ML and non-ML software in this context. You can borrow the monitoring practices from the traditional software monitoring stack.
Who is usually on call? A backend engineer or a site reliability engineer.
What metrics can you monitor? What impacts the choice?
The monitoring setup depends on the model deployment architecture, including:
These factors affect not only the metrics choice but the overall monitoring system design.
In the simplest form, an ML system can look like a set of infrequent batch jobs. In this case, you can remain in the data engineering realm and treat it, well, like any other data job. You can monitor execution time and job completion and set up a notification if they fail.
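As a rough illustration of the batch case, here is a minimal sketch of a wrapper that records execution time and job completion and sends a notification on failure or slowness. The names `run_monitored_job`, `JobResult`, and `notify` are hypothetical and made up for this example. In practice, a workflow manager usually provides this kind of tracking out of the box.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobResult:
    name: str
    succeeded: bool
    duration_seconds: float
    error: Optional[str] = None


def run_monitored_job(
    name: str,
    job: Callable[[], None],
    notify: Callable[[str], None],
    max_duration_seconds: float,
) -> JobResult:
    """Run a batch job, measure its duration, and notify on failure or slowness."""
    start = time.monotonic()
    error = None
    try:
        job()
    except Exception as exc:  # catch everything so the failure is always reported
        error = repr(exc)
    duration = time.monotonic() - start

    result = JobResult(name, error is None, duration, error)
    if not result.succeeded:
        notify(f"{name} failed: {error}")
    elif duration > max_duration_seconds:
        notify(f"{name} took {duration:.1f}s (limit {max_duration_seconds}s)")
    return result
```

Here `notify` can be anything that accepts a string, such as a function that posts to a chat channel or sends an email.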
If you wrap the model as an API to serve predictions on request, you'd need more. You will need to instrument your service to collect event-based metrics. You can then track various operational metrics of software and infrastructure health in real-time, including:
SRE terminology often refers to these operational metrics as SLIs (service level indicators). The idea is to carefully pick a few measures that quantify different aspects of service performance.
Here is how LinkedIn monitors latency as part of its Model Health Assurance platform:
"Model inference latency is an important metric for the application owners because this tells the overall time the model took in serving a particular scoring request. We typically monitor the mean, 50th, 75th, 90th, and 99th percentile latency. These quantiles for latency can be used in multiple ways, such as helping isolate the offending piece of a model within the entire lifecycle of a request."
Source: "Model Health Assurance platform at LinkedIn," LinkedIn Blog.
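To make the quantiles from the quote concrete, here is a small sketch that summarizes a window of request latencies as the mean plus the 50th, 75th, 90th, and 99th percentiles. It uses the nearest-rank percentile definition, which is one of several common conventions and an assumption on our side. The function names are illustrative and do not come from any of the cited systems.

```python
import math
from statistics import mean


def percentile(sorted_values, q):
    """Nearest-rank percentile: the smallest value with at least q% of samples at or below it."""
    if not sorted_values:
        raise ValueError("need at least one latency sample")
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms):
    """Summarize a window of request latencies (in milliseconds)."""
    ordered = sorted(latencies_ms)
    summary = {"mean": mean(ordered)}
    for q in (50, 75, 90, 99):
        summary[f"p{q}"] = percentile(ordered, q)
    return summary
```

A monitoring service would typically compute such a summary per time window (say, every minute) and alert when a high percentile exceeds the latency objective.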
Let's jump to the data level.
Say the ML service is up and running, and all jobs complete smoothly. But what about the data that is flowing through? Data quality is the usual culprit of ML model failures. And the next big thing to monitor!
What can go wrong with data for an ML model? A lot! Here is an incomplete list of potential data quality issues:
What is the goal of data quality monitoring in ML? To know that you can trust the data to generate the predictions and react if not. There is no use in the model if the data is broken or absent. You'd want to stop and use a fallback until you restore the data quality. In less extreme cases, you can proceed with the predictions but use the monitoring signal to investigate and resolve the issue.
How specific is this data quality monitoring to ML? Somewhat! Of course, you also need to monitor the data for other analytical use cases. You can re-use some of the existing approaches and tools. But this traditional data monitoring is often performed at a macro level, for example, when you monitor all data assets and flows in the warehouse.
In contrast, ML data monitoring is granular. You need to ensure that particular model inputs comply with expectations. You can still rely on existing upstream data quality monitoring in some cases, for example, if the model re-uses shared data tables that are already under guard. But often, you'll need to introduce additional checks to control for feature transformation steps, the quality of real-time model input, or an external data source that the model uses.
In this sense, ML data quality monitoring is closer to data testing and validation that might exist in other data pipelines. It is often performed as checks on data ingestion before you serve the model.
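Here is a minimal sketch of what such an ingestion check might look like: before scoring a batch, verify that each expected feature is mostly present and falls within an expected range. The function `check_batch`, the `expectations` format, and the 5% missing-value tolerance are all assumptions made for this example.

```python
def check_batch(rows, expectations, max_missing_share=0.05):
    """Return a list of human-readable problems found in a batch of model inputs.

    rows: list of dicts, one per input record.
    expectations: dict mapping feature name -> (min_allowed, max_allowed).
    """
    if not rows:
        return ["batch is empty"]

    problems = []
    for column, (low, high) in expectations.items():
        values = [row.get(column) for row in rows]

        # Share of missing values in this feature.
        missing = sum(v is None for v in values)
        if missing / len(rows) > max_missing_share:
            problems.append(f"{column}: {missing}/{len(rows)} values missing")

        # Values that are present but outside the expected range.
        out_of_range = sum(v is not None and not (low <= v <= high) for v in values)
        if out_of_range:
            problems.append(f"{column}: {out_of_range} values outside [{low}, {high}]")
    return problems
```

If the returned list is not empty, the pipeline can stop and switch to a fallback, or proceed and log the problems for investigation, depending on how severe they are.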
Who is usually on call? A data engineer (if the issue is with infrastructure), or a data analyst or data scientist (if the issue is with the data "contents").
What metrics can you monitor? What impacts the choice?
The exact data monitoring setup again depends on:
We can roughly split the types of data metrics and checks into several groups.
How Google designed a data validation system:
In the paper "Data Validation for Machine Learning," the Google team presents how they designed a data validation system to evaluate and test the data fed into machine learning pipelines. They suggest a data-centric approach to ML, treating the data as an important production asset, together with the algorithm and infrastructure.
What is tricky with ML data quality monitoring?
Even if the software system works fine and the data quality is as expected, does this mean you are covered? Nope! Welcome to the land of ML model issues.
What can go wrong with ML models in production? They drift!
Models can break abruptly in case of sudden change or start gradually performing worse. We can broadly split the causes into two:
Here are some of the things that might cause model drift:
When the model drifts, you'd usually see an increase in the model error or the number of incorrect predictions. In the case of radical drifts, the model can become inadequate overnight.
What is the goal of ML model quality monitoring? To give you peace of mind that you can trust the model and continue using it, and to alert you if something is wrong. A good monitoring setup should provide enough context to troubleshoot the model decay efficiently. You need to evaluate the root cause and address the drift, for example, by triggering retraining, rebuilding the model, or using a fallback strategy.
How specific is this to ML monitoring? Entirely! This piece is pretty unique to ML systems. You can adapt some of the model monitoring practices from other industries, such as validation and governance of credit scoring models in finance. Otherwise, it is ML monitoring as you know it.
Who is usually on call? A data scientist or a machine learning engineer. Whoever built the model and knows "what the feature X is about," or whom to ask about it.
What metrics can you monitor? What impacts the choice?
The ML monitoring setup can vary. Here are some of the things that affect it:
There are probably hundreds of different metrics you can calculate! Let's try to group them for a quick outlook.
Model quality metrics. This group of metrics evaluates the true quality of the model predictions. You can calculate them once you have the ground truth or feedback (e.g., data on clicks, purchases, delivery time, etc.). Here are some examples:
Model quality by segment. Aggregate metrics are essential but are often not enough. You might have 90% overall accuracy but only 60% in some important subpopulations like new users. To detect such discrepancies, you can track the model quality for the known segments in data (for example, accuracy for different geographical locations) or proactively search for underperforming segments.
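The gap between aggregate and per-segment quality is easy to sketch. The example below computes accuracy for each known segment from labeled prediction records. The record format `(segment, y_true, y_pred)` and the function name are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Compute accuracy per segment.

    records: iterable of (segment, y_true, y_pred) tuples.
    Returns a dict mapping segment -> share of correct predictions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, y_true, y_pred in records:
        total[segment] += 1
        correct[segment] += int(y_true == y_pred)
    return {segment: correct[segment] / total[segment] for segment in total}
```

With 10 records for returning users (9 correct) and 5 for new users (3 correct), the overall accuracy is 80%, but the per-segment view shows 90% against 60%.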
Prediction drift. This is the first type of proxy quality monitoring. If you don't know how good your model is, you can at least keep tabs on how different its predictions are. Imagine that a spam detection model suddenly starts assigning the "spam" label in every second prediction. You can raise alarms even before you get the true labels. To evaluate prediction drift, you can use different drift detection approaches:
Input data drift. In addition to the prediction drift, you can monitor the shifts in the input data and interpret them together. The goal is to detect situations when the model operates in an unfamiliar environment, as seen from the data. The detection approach is similar to the prediction drift. You can monitor descriptive stats for the individual features (such as frequencies of categories), run statistical tests or use distance metrics to detect distribution shifts. You can also track specific patterns, such as changes in linear correlations between features and predictions.
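As one example of a distance metric, here is a sketch of the Population Stability Index (PSI) for a single numerical feature or for the model's predicted scores. PSI is our choice for illustration and only one of many possible options. The bin count and the small `eps` used to avoid division by zero are assumptions.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one numerical variable.

    Bins are equal-width and derived from the reference sample.
    Values outside the reference range fall into the first or last bin.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Replace empty bins with eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))
```

A commonly cited rule of thumb treats a PSI above 0.25 as a significant shift, but the right threshold depends on the use case and is best tuned on your own historical data.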
Outliers. You can detect individual cases that appear unusual and where the model might not work as expected. This is different from data drift, where the goal is to detect the overall distribution shift. You can, of course, still use the rate of outliers as a metric to plot and alert on. But the goal of outlier detection is usually to identify individual anomalous inputs and act on them, for example, flag them for expert review. You can use different statistical methods, such as isolation forests or distance metrics, to detect them.
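For a single numerical feature, a very simple distance-based check is to flag inputs that lie too many standard deviations away from the reference mean. This is a deliberately basic sketch and an assumption on our side: it is not an isolation forest, and the threshold of 3 is only a conventional starting point.

```python
from statistics import mean, stdev


def fit_reference(values):
    """Estimate the mean and standard deviation of a feature from reference data."""
    return mean(values), stdev(values)


def flag_outliers(values, ref_mean, ref_std, threshold=3.0):
    """Return the indices of values farther than threshold * ref_std from ref_mean."""
    return [
        index
        for index, value in enumerate(values)
        if abs(value - ref_mean) > threshold * ref_std
    ]
```

The flagged indices can be used to route individual requests to expert review, while the share of flagged inputs per batch can serve as a metric to plot.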
Fairness. This is a specific dimension of the model quality, dictated by the use case importance and risks. If ML decisions have serious implications, as they often do in finance, healthcare, and education use cases, you might need to ensure that the model performs equally well for different demographic groups. There are different metrics to evaluate model bias, such as demographic parity or equalized odds. It is particularly important to track these metrics if you have automated model retraining, since the model's behavior can deviate over time.
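Demographic parity, for instance, compares the rate of positive predictions across groups. The sketch below computes the largest gap in positive prediction rates between any two groups. The function names are hypothetical, and the acceptable size of the gap is a decision for your use case, not something this code can tell you.

```python
from collections import defaultdict


def positive_rate_by_group(groups, predictions):
    """Share of positive (1) predictions for each group."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for group, prediction in zip(groups, predictions):
        totals[group] += 1
        positives[group] += int(prediction == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(groups, predictions):
    """Largest gap in positive prediction rates between any two groups."""
    rates = positive_rate_by_group(groups, predictions)
    return max(rates.values()) - min(rates.values())
```

A value of 0 means all groups receive positive predictions at the same rate. Note that this metric looks only at predictions and does not use the true labels, unlike equalized odds.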
You may notice that some of the metrics, such as feature stats and outliers, appear in both the data quality and model quality contexts. ML and data monitoring often come hand in hand as you look at the data anyway. However, the ML-focused part of the monitoring looks at the data to evaluate model relevance. In contrast, the data quality-focused part of the monitoring looks for corruption and errors in the data itself.
What is tricky with ML monitoring?
There are hardly any blueprints!
How to implement ML monitoring:
We work to implement some of the best practices in ML monitoring in Evidently, an open-source ML monitoring toolset we created. If that is something you are looking to solve, jump on our Discord community to chat and share or test out the tool on GitHub!
The business value, at last!
To judge the performance of the ML model, you ultimately need to tie it to the business KPI. Is the ML model doing the job it is built for?
What is the goal of business metrics monitoring? To estimate the business value of the ML system and adjust if things go off track. There is always a risk of a mismatch between the ML model quality and the business value. This can happen due to changing reality, or if the model is not used in the way it was designed (or is not used at all!).
How specific is this to ML monitoring? This part is strictly business-specific. The metrics and the way you measure them vary widely. They can range from tracking engagement metrics in a web app to evaluating savings of raw materials in an industrial plant. It all boils down to the business use case.
Who is usually on call? A product manager, a business owner, or an operational team, together with the data scientists, to bridge the gap.
What can you monitor?
Some advice from the Nubank ML team on monitoring the policy layer:
"Monitor the decisions made using the model. For example: how many people got loans approved by the risk model on each day? How many people had their accounts blocked by the fraud model on each day? It's often useful to monitor both absolute and relative values here".
Source: "ML Model Monitoring – 9 Tips From the Trenches," Nubank Blog.
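Following this advice, a decision monitor can be as simple as a daily count. The sketch below aggregates approval decisions per day and reports both the absolute count and the relative rate. The input format `(day, approved)` and the function name are assumptions for this example and are not taken from the Nubank post.

```python
from collections import Counter


def daily_decision_stats(decisions):
    """Aggregate decisions per day.

    decisions: iterable of (day, approved) pairs, where approved is a bool.
    Returns a dict mapping day -> absolute and relative approval numbers.
    """
    total = Counter()
    approved = Counter()
    for day, was_approved in decisions:
        total[day] += 1
        approved[day] += int(was_approved)
    return {
        day: {
            "approved": approved[day],
            "total": total[day],
            "approval_rate": approved[day] / total[day],
        }
        for day in sorted(total)
    }
```

Looking at both values helps separate a change in traffic (the absolute count moves, the rate stays) from a change in model or policy behavior (the rate moves).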
What is tricky with monitoring business KPIs?
Bottom line: track them if you can, but track other things, too.
We know, that was a lot!
The goal of this overview was to introduce all the different aspects of the production ML system. Once the ML application is deployed, it is no longer just a model but a complex system made of data, code, infrastructure, and the surrounding environment. You need to monitor it as a whole.
This does not mean that you should look at dozens of metrics and plots all the time.
First of all, there might be several dashboards or views used by different people on the team.
You can perform operational monitoring in the existing backend monitoring tool. You can visualize product metrics in the BI dashboard your business stakeholders already use. Depending on the setup, you can add ML monitoring metrics to the same Grafana dashboard, check them through a pipeline test orchestrated by Airflow, or spin up a standalone dashboard used by the ML team. Different aspects of ML monitoring have different internal users (and problem-solvers). It's fine to have multiple dashboards.
Here is the case from Monzo:
"Understanding the live performance of a model is a critical part of the model development process. For this stage, we lean on our reuse over rebuild principle and have adopted tools that are used across the company. We wanted our monitoring tools to be available to everyone, including people outside of machine learning".
Source: "Monzo's machine learning stack," Monzo Blog.
Second, you should distinguish between monitoring and debugging. You might proactively monitor and introduce alerts only on a handful of metrics most indicative of the service performance. Your goal is to be informed about a potential problem. The rest of the metrics and plots would be helpful during debugging as they provide the necessary context. But you won't actively set alerts or define specific thresholds for them.
For example, if you get your model feedback fast, you might skip alerting on the feature drift. You can evaluate the model quality itself, after all. But if you notice a performance drop, you would need to identify the reasons and decide how to handle the drift. It might make sense to pre-build the distribution visualizations for the important features or have an easy way to spin them up on demand. In other cases, you might prefer to monitor for data drift, even if you get the labels. It depends!
To sum up, the goal of monitoring is to give confidence that the system is running well and alert if not. In the event of failure, you'd need the necessary context to diagnose and solve the problem, and that's where the extra metrics come in handy, but you don't need to look at them all the time.
ML monitoring means monitoring an ML system. To observe and evaluate its performance, you usually need a bunch of metrics that describe the system state. There are several facets of the system to look at.
We can group them into:
You probably don't need to look in detail at every pyramid layer. The layers might have different internal consumers, from backend engineers and data engineers to the ML team and business stakeholders.
The exact monitoring strategy will also depend on whether the model is batch or real-time, how quickly you get the ground truth labels, how critical the model is, and the associated risks. You'd probably use some metrics for the actual monitoring (set an alert on them) while making others available for reporting and debugging purposes (for example, pre-compute and store them somewhere).
Here is a summary with some examples of metrics:
We'll continue our deep dive into the ML monitoring theory and practice. In upcoming blogs, we'll cover: