Machine learning models have a lifespan. They decay over time and can make mistakes or produce predictions that don't make sense. They can also encounter unexpected or corrupted data that affects the quality of the outputs. To address this, you need ML model monitoring.
This guide explores the topic of ML model monitoring in detail. We'll cover what it is, why it matters, which metrics to monitor, and how to design an ML monitoring strategy, considering aspects like ML monitoring architecture or the type of data you deal with.
We will also introduce Evidently, an open-source Python library for ML monitoring.
Want to keep tabs on your production ML models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.
Model monitoring is the ongoing process of tracking, analyzing, and evaluating the performance and behavior of machine learning models in real-world, production environments. It involves measuring various data and model metrics to help detect issues and anomalies and ensure that models remain accurate, reliable, and effective over time.
Want to study the topic in every detail? We created a free course on ML observability and monitoring. All videos, code examples, and course notes are publicly available. Sign up here.
Building a machine learning model is just the beginning. Once you deploy that model into the real world, it faces many challenges that can affect its performance and require continuous monitoring.
Here are some examples of issues that can affect production ML models.
Gradual concept drift. Gradual concept drift refers to the ongoing changes in the relationships between variables or patterns in the data over time. They may ultimately lead to a degradation in the model quality. Consider a product recommendation system: as user preferences evolve, what was relevant last month might not be today, impacting the quality of the model's suggestions.
Sudden concept drift. Sudden concept drift, in contrast, involves abrupt and unexpected changes to the model environment that can significantly impact its performance. This might be, for instance, an external change such as the outbreak of COVID-19 or an unexpected update to a third-party application that disrupts data logging and, in turn, makes your model obsolete.
Want to read more about concept drift? Head to the deep dive guide.
Data drift. Data distribution drift occurs when the statistical properties of the input data change. An example could be a change in customer demographics, resulting in the model underperforming on a previously unseen customer segment.
Want to read more about data drift? Head to the deep dive guide.
Data quality issues. Data quality issues encompass a range of problems related to the input data's accuracy, completeness, and reliability. Examples include missing values, duplicate records, or shifts in the feature range: imagine milliseconds being replaced by seconds. If a model receives unreliable inputs, it will likely produce unreliable predictions.
Data pipeline bugs. Many errors occur within the data processing pipeline. These bugs can lead to data delays or data that doesn't match the expected format, causing issues in the model performance. For instance, a bug in data preprocessing may result in features having the wrong type or not matching the input data schema.
Adversarial adaptation. External parties might deliberately target and manipulate the model's performance. For example, spammers may adapt and find ways to overcome spam detection filters. With LLMs, malicious actors intentionally provide input data to manipulate the model outputs, using techniques such as prompt injection.
Broken upstream models. Often, there is a chain of machine learning models operating in production. If one model gives wrong outputs, these errors can propagate downstream and degrade the quality of the dependent models.
If these issues occur in production, your model could produce inaccurate results. Depending on the use case, getting predictions wrong can lead to measurable negative business impact.
The risks vary from lost revenue and customer dissatisfaction to reputational damage and operational disruption. The more crucial a model is to a company's success, the greater the need for robust monitoring.
A robust model monitoring system not only helps mitigate the risks discussed in the previous section, but also offers additional benefits. Let’s take an overview of what you can expect from ML monitoring.
Issue detection and alerting. ML monitoring is the first line of defense that helps identify when something goes wrong with the production ML model. You can alert on various symptoms, from direct drops in model accuracy to proxy metrics like increased share of missing data or data distribution drift.
Root cause analysis. Once alerted, well-designed monitoring helps pinpoint the root causes of problems. For example, it can help identify specific low-performing segments the model struggles with or help locate corrupted features.
ML model behavior analysis. Monitoring helps get insights into how users interact with the model and whether there are shifts in the model's operational environment. This way, you can adapt to changing circumstances or find ways to improve the model performance and user experience.
Action triggers. You can also use the signals a model monitoring system supplies to trigger specific actions. For example, if the performance goes below a certain threshold, you can switch to a fallback system or previous model version or initiate retraining or data labeling.
Performance visibility. A robust model logging and monitoring system enables recording the ongoing model performance for future analysis or audits. Additionally, having a clear view of the model operations helps communicate the model value to stakeholders.
You might wonder: there is an established practice of tracking software health and product performance; how is ML model monitoring different? Is it possible to use the same methods?
While this is partially true – you still need to monitor the software system health – model monitoring addresses a particular set of challenges, which makes it a separate field. Firstly, you focus on different groups of metrics, such as model and data quality metrics. Secondly, how you compute these metrics and design your model monitoring is also different.
Let’s explore some of these challenges.
Silent failures. Software errors are usually visible: if things don't work, you will get an error message. With machine learning, you may encounter a different type of error: a model returning an unreliable or biased prediction. Such errors are “silent”: a model will typically respond as long as it can process the incoming data inputs. Even if the input data is incorrect or significantly different, the model will make a potentially low-quality prediction without raising an alarm. To detect such "non-obvious" errors, you must evaluate model reliability using proxy signals and design use-case-specific validations.
Lack of ground truth. In a production ML environment, it is typical for feedback on model performance to have a delay. Because of this, you cannot measure the true model quality in real time. For example, if you forecast sales for the next week, you can only estimate the actual model performance once this time passes and you know the sales numbers. To evaluate the model quality indirectly, you need to monitor the model inputs and outputs. You also often need two monitoring loops: the real-time one that uses proxy metrics and the delayed one that runs once the labels are available.
Relative definition of quality. What's considered good or bad model performance depends on the specific problem. For instance, a 90% accuracy rate might be an excellent result for one model, a symptom of a huge quality issue for another, or simply a wrong choice of metrics for the third one. On top of this, there is inherent variability in the model performance. This makes it challenging to set clear, universal metrics and alerting thresholds. You will need to adjust the approach depending on the use case, cost of error, and business impact.
Complex data testing. Data-related metrics are often sophisticated and computationally intensive. For example, you could compare input distributions by running statistical tests. This requires collecting a batch of data of significant size and passing a reference dataset. The architecture of such an implementation significantly differs from traditional software monitoring, where you expect a system to emit metrics like latency continuously.
To better understand the concept of production model monitoring, let’s explore and contrast a few other related terms, such as model experiment management or software monitoring.
TL;DR: ML model monitoring focuses on detecting known issues. ML model observability provides root cause analysis and comprehensive visibility into the system performance.
ML model monitoring primarily involves tracking a predefined set of metrics to detect issues and answer questions like "What happened?" and "Is the system working as expected?" It is more reactive and is instrumental in identifying "known unknowns."
On the other hand, ML observability provides a deeper level of insight into the system's behavior, helping understand and analyze the root causes of issues, addressing questions like "Why did it happen?" and "Where exactly did it go wrong?" ML observability is a proactive approach to uncover "unknown unknowns."
ML monitoring is a subset of ML observability. In practice, both terms are often used interchangeably.
TL;DR: Production ML monitoring tracks the quality of the model performance on live data. Experiment tracking helps compare different models and parameters on an offline test set.
Sometimes, practitioners use the term “model monitoring” to describe the idea of tracking the quality of different models during the model training phase. A more common name is experiment tracking.
Experiment tracking helps record different iterations and configurations of models during the development phase. It ensures that you can reproduce, compare, and document various experiments – such as recording a specific output and how you arrived at it. While it also involves visualizations of different model metrics, it concerns model performance on the offline training set. The goal is typically to help compare models to choose the best one you’ll deploy in production.
Model monitoring, on the other hand, focuses on the models that are already in production. It helps track how they perform in real-world scenarios as you generate predictions in an ongoing manner for real-time data.
TL;DR: Software health monitoring is focused on monitoring the application layer. Model monitoring is focused on the quality of the data and model outputs within this application.
Model monitoring occasionally comes up in the context of traditional software and application performance monitoring (APM).
For example, when you deploy a model as a REST API, you must monitor its service health, such as its uptime and prediction latency. This software-level monitoring is crucial for ensuring the reliability of the overall ML system, and should always be implemented. However, it is not specific to ML: it works just like software monitoring for other production applications and can reuse the same approaches.
With ML model monitoring, in contrast, you specifically look at the behavior and performance of machine learning models within the software. Model monitoring focuses primarily on monitoring the data and ML model quality. This requires distinct metrics and approaches: think tracking the share of missing values in the incoming data and predictive model accuracy (model monitoring) versus measuring the utilization of the disk space (software system monitoring).
TL;DR: Data monitoring tracks the overall health of data assets. Model monitoring tracks the quality of individual ML models, which may involve checking their input data quality.
There is some overlap between data and model quality monitoring – especially at the implementation level when it comes to specific tests and metrics you can run, such as tracking missing data. However, each practice has its application focus.
Data monitoring involves continuous oversight of the organizational data sources to ensure their integrity, quality, security, and overall health. This encompasses all data assets, whether used by ML models or not. Typically, the central data team handles data monitoring at the organizational level.
In contrast, ML model monitoring is the responsibility of the ML platform team and ML engineers and data scientists who develop and operate specific ML models. While data monitoring oversees various data sources, ML model monitoring focuses on the specific ML models in production and their input data.
In addition, data quality monitoring is only a subset of ML monitoring checks. Model monitoring covers various metrics and aspects of ML model quality on top of the quality of the input data – from model accuracy to prediction bias.
TL;DR. Model governance sets standards for responsible ML model development across the entire lifecycle. Model monitoring helps continuously track model performance in production.
Model governance refers to practices and policies for managing machine learning models throughout their lifecycle. They help ensure that ML models are developed, deployed, and maintained responsibly and compliantly. ML model governance programs may include components related to model development standards, privacy and diversity of training data, model documentation, testing, audits, and ethical and regulatory alignment.
While model governance covers the entire model lifecycle, model monitoring is specific to the post-deployment phase.
Model monitoring is a subset of model governance that explicitly covers tracking ongoing model performance in production. While model governance sets rules and guidelines for responsible machine learning model development, ML monitoring helps continuously observe the deployed models to ensure their real-world performance and reliability.
Both ML governance and ML monitoring involve various stakeholders. However, data and ML engineers and ML operations teams are typically the ones to implement model monitoring. In contrast, AI governance, risk, and compliance teams often lead model governance programs.
Now that we've got the basics, let's dive into the specific metrics that might need attention as you implement model monitoring.
Since an ML-based service goes beyond just the ML model, the ML system quality has several facets: software, data, model, and business KPIs. Each involves monitoring different groups of metrics.
Software system health. Regardless of the model's quality, you must ensure the reliability of the entire prediction service first. This includes tracking standard software performance metrics such as latency, error rates, memory, or disk usage. Software operations teams can perform this monitoring similarly to how they monitor other software applications.
Data quality. Many model issues can be rooted in problems with the input data. You can track data quality and integrity using metrics like the percentage of missing values, type mismatches, or range violations in critical features to ensure the health of data pipelines.
ML model quality and relevance. To ensure that ML models perform well, you must continuously assess their quality. This involves tracking performance metrics like precision and recall for classification, MAE or RMSE for regression, or top-k accuracy for ranking. If you do not get the true labels quickly, you might use use-case-specific heuristics or proxy metrics.
Business Key Performance Indicators (KPIs). The ultimate measure of a model's quality is its impact on the business. You may monitor metrics such as clicks, purchases, loan approval rates, or cost savings. Defining these business KPIs is custom to the use case and may involve collaboration with business teams to ensure alignment with the organization's goals.
Monitoring data and model quality is typically the primary concern of ML model monitoring. Let’s look deeper into the metrics that fall in this category.
These metrics focus on the predictive quality of a machine learning model. They help understand how well the model performs in production and whether it's still accurate.
Monitoring model quality metrics is typically the best way to detect any production issues.
Direct model quality metrics. You can assess the model performance using standard ML evaluation metrics, such as accuracy, mean error, etc. The choice of metrics depends on the type of model you're working with. These metrics usually match those used to evaluate the model performance during training.
Examples: accuracy, precision, recall, and ROC AUC for classification; MAE, MAPE, or RMSE for regression; NDCG, MAP, or hit rate for ranking and recommendations.
Performance by segment. It often makes sense to examine the model quality across various cohorts and prediction slices. This approach can reveal variations you might miss when looking at the aggregated metrics that account for the entire dataset. For example, you can evaluate the model quality for specific customer groups, locations, devices, etc.
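To illustrate, here is a minimal sketch of segment-level evaluation using pandas and scikit-learn; the `region`, `target`, and `prediction` columns are hypothetical placeholders for your own prediction logs.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical prediction log: one row per prediction, with the true label,
# the model output, and a segment attribute (here, user region).
logs = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "target": [1, 0, 1, 1, 0],
    "prediction": [1, 0, 0, 1, 1],
})

# Compute the same metrics per segment to spot underperforming cohorts.
by_segment = logs.groupby("region").apply(
    lambda g: pd.Series({
        "accuracy": accuracy_score(g["target"], g["prediction"]),
        "precision": precision_score(g["target"], g["prediction"], zero_division=0),
        "count": len(g),
    })
)
print(by_segment)
```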
However, while model performance metrics are usually the best measure of the actual model quality in production, the caveat is that you need newly labeled data to compute them. In practice, this is often only possible after some time.
Heuristics. When ground truth data is unavailable, you can look at proxy metrics that reflect model quality or provide a signal when something goes wrong. For example, if you have a recommendation system, you can track the share of recommendation blocks displayed that do not earn any clicks, or the share of products excluded from model recommendations – and react if it goes significantly above the baseline.
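As a simple illustration, the following sketch computes such a proxy metric from a hypothetical log of displayed recommendation blocks; the column names and the baseline value are assumptions.

```python
import pandas as pd

# Hypothetical log of displayed recommendation blocks and received clicks.
blocks = pd.DataFrame({
    "block_id": [1, 2, 3, 4, 5],
    "clicks": [3, 0, 0, 1, 0],
})

# Proxy signal: share of displayed blocks that earned no clicks.
share_no_clicks = (blocks["clicks"] == 0).mean()

# Assumed baseline from a past stable period; react if the metric deviates significantly.
BASELINE = 0.35
if share_no_clicks > BASELINE * 1.5:
    print(f"Alert: {share_no_clicks:.0%} of recommendation blocks earned no clicks")
```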
When labels come with a delay, you can also monitor data and prediction drift. These metrics help assess if the model still operates in a familiar setting, and serve as proxy indicators of the potential model quality issues. Additionally, drift analysis helps debug and troubleshoot the root cause of model quality drops.
With this type of early monitoring, you can look at shifts in both model inputs and outputs.
Output drift. You can look at the distribution of predicted scores, classes, or values. Does your model predict higher prices than usual? More fraud than on average? A significant shift from the past period might indicate a model performance or environment change.
Input drift. You can also track changes in the model features. If the distribution of the key model variables remains stable, you can expect model performance to be reasonably consistent. However, as you look at feature distributions, you might also detect meaningful shifts. For instance, if your model was trained on data from one location, you might want to learn in advance when it starts making predictions for users from a different area.
To detect data drift, you typically need a reference dataset as a baseline for comparison. For example, you can use data from a past stable production period large enough to account for seasonal variations. You can then compare the current batch of data against the reference and evaluate if there is a meaningful shift.
There are different methods for evaluating data distribution shift, including statistical tests (such as Kolmogorov-Smirnov for numerical features or chi-squared for categorical ones), distance and divergence metrics (such as Wasserstein distance, Population Stability Index, or Jensen-Shannon divergence), and rule-based checks on summary statistics like means or quantiles.
Ultimately, drift detection is a heuristic: you can tweak the methods depending on the context, data size, the scale of change you consider acceptable, the model’s importance and known ability to generalize, and environment volatility.
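For example, here is a minimal sketch of one such method, a two-sample Kolmogorov-Smirnov test on a single numerical feature using SciPy; the data and the significance threshold are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholders for a real numerical feature (e.g., transaction amount):
# reference data from a past stable period and the current production batch.
reference = rng.normal(loc=100, scale=15, size=5_000)
current = rng.normal(loc=110, scale=15, size=1_000)

# Two-sample Kolmogorov-Smirnov test compares the two distributions.
result = stats.ks_2samp(reference, current)

# The threshold is a heuristic and should be tuned per use case and data size.
if result.pvalue < 0.05:
    print(f"Possible drift (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
else:
    print("No significant shift detected")
```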
Want a deeper dive into data drift? Check out this introductory guide.
Data quality metrics focus on the integrity and reliability of the incoming data. Monitoring data quality ensures that your model makes predictions based on high-quality features.
To safeguard against corruption in the input data, you can consider running the following validations or monitoring aggregate metrics: the share of missing values, the number of duplicate rows, feature types matching the expected schema, values staying within expected ranges, and categorical features taking only known values.
Evaluating data quality is critical to ML monitoring since many production issues stem from corrupted inputs and pipeline bugs. This is also highly relevant when you use data from multiple sources, especially data supplied by external providers, which might introduce unexpected changes to the data format.
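Here is a minimal sketch of such validations for a hypothetical batch of tabular features using pandas; the column names, expected ranges, and known categories are assumptions you would replace with your own schema.

```python
import pandas as pd

# Hypothetical batch of incoming features; values are illustrative.
batch = pd.DataFrame({
    "age": [34, 29, None, 41, 250],
    "country": ["US", "DE", "DE", "??", "US"],
})

checks = {
    # Share of missing values per column.
    "missing_share": batch.isna().mean().to_dict(),
    # Number of duplicate rows in the batch.
    "duplicate_rows": int(batch.duplicated().sum()),
    # Non-missing age values outside a plausible range.
    "age_out_of_range": int((~batch["age"].between(0, 120) & batch["age"].notna()).sum()),
    # Categories not seen in the expected list.
    "unknown_country": int((~batch["country"].isin(["US", "DE", "FR"])).sum()),
}
print(checks)
```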
Bias and fairness metrics help ensure that machine learning models don't discriminate against specific groups or individuals based on certain characteristics, such as race, gender, or age.
This type of monitoring is especially relevant for particular domains, such as healthcare or education, where biased model behavior can have far-reaching consequences.
For example, if you have a classifier model, you can pick metrics like equal opportunity (whether true positive rates are similar across groups), equalized odds (whether both true positive and false positive rates are similar across groups), or predictive parity (whether precision is similar across groups).
There are other related metrics, such as disparate impact, statistical parity, and so on. The choice of fairness metrics should involve domain experts to ensure they align with the specific goals, context, and potential impact of biases in a given application.
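As an illustration, the sketch below compares true positive rates across groups (an equal opportunity check) on a hypothetical prediction log; the `group` column is a placeholder for a protected attribute.

```python
import pandas as pd

# Hypothetical predictions with a protected attribute.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "target": [1, 1, 0, 1, 1, 0],
    "prediction": [1, 0, 0, 1, 1, 1],
})

def true_positive_rate(g: pd.DataFrame) -> float:
    # Share of actual positives the model correctly identified within the group.
    positives = g[g["target"] == 1]
    return (positives["prediction"] == 1).mean()

tpr_by_group = df.groupby("group").apply(true_positive_rate)
print(tpr_by_group)
print("Max TPR gap between groups:", tpr_by_group.max() - tpr_by_group.min())
```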
A model monitoring strategy is a systematic plan that outlines how you will track, assess, and maintain the performance of your machine learning models once they are deployed.
Both the monitoring architecture and the composition of the monitoring metrics vary based on the specific ML application. Here are some considerations that might affect the setup.
The goals of monitoring. The monitoring depth, granularity, and format can vary depending on the intended audience and purpose.
Sometimes, the primary objective is to detect technical issues or anomalies and communicate them to ML teams. In that case, the monitoring system might be minimalistic and focused on alerting a few key performance indicators. If you expect debugging capabilities, the setup might become more comprehensive and focused on ease of data exploration.
When the monitoring reports also aim at product managers, business stakeholders, or even end-users, the focus often shifts towards providing comprehensible visualizations. They should enable a clear understanding of how the model performs and its impact on key business metrics.
The data and model types affect the primary choice of monitoring metrics.
Different machine learning tasks, such as classification, regression, or ranking, require distinct performance metrics tailored to their objectives. Generative models have other monitoring criteria compared to predictive models.
The data structure also matters, with monitoring approaches differing significantly between tabular data and unstructured data, like text, images, or embeddings. While some components of model monitoring, like data quality checks, can be reusable across various model types, the ultimate design of the monitoring dashboard varies per model.
Label availability. The feedback loop affects whether you can access the ground truth data to evaluate the actual model performance. This affects the contents of monitoring.
If you can easily access or acquire labels, you can monitor the true model quality and design monitoring around direct metrics like model accuracy. If you obtain ground truth with a significant delay, you might need to rely on proxies or substitute metrics.
Model criticality and importance. Models used for financial decisions, medical diagnosis, or various business-critical systems require rigorous monitoring due to the high risks and costs associated with model underperformance. High-stakes applications demand proactive issue detection and rapid response mechanisms.
On the other hand, less critical models, such as those used for non-essential recommendations, may have lower monitoring requirements. These models may not directly impact the core operations of a business and might allow for some margin of error without severe consequences. This affects the monitoring granularity and frequency.
Model deployment architecture. You can deploy and use ML models differently: e.g., you can run some of them as weekly batch jobs while others make real-time decisions. This affects the architectural implementation of the monitoring system.
Online models, such as recommendation or fraud detection systems, require immediate monitoring to detect and respond to issues as they arise. This requires the monitoring system to handle sizable incoming data volume and run computations in near real-time.
Batch models, often used in offline data processing, may need periodic checks, but those do not necessarily require a comprehensive setup, especially if you use the model infrequently. You can design simpler batch monitoring jobs or validations executed on demand.
Speed of environmental changes. The rate at which the data changes also affects the design of the ML monitoring. Rapidly shifting environments might require more frequent monitoring or greater attention to the distribution of input features. If the model operates in a more stable environment, you might design monitoring around a couple of key performance indicators.
Want a comprehensive deep dive? Join an open course on ML observability and monitoring. All content and videos are publicly available. Sign up here.
To establish an effective model monitoring strategy for a particular model, you can go through the following steps.
Step 1. Define objectives. It's essential to start with a clear understanding of who will be using the monitoring results. Do you want to help data engineers detect missing data? Is it for data scientists to evaluate the changes in the key features? Is it to provide insight for product managers? Will you use the monitoring signals to trigger retraining? You should also consider the specific risks associated with the model usage you want to protect against. This sets the stage for the entire process.
Step 2. Choose the visualization layer. You must then decide how to deliver the monitoring results to your audience. You might have no shared interface – and only send alerts through preferred channels when some checks or validation fail. If you operate at a higher scale and want a visual solution, it can vary from simple reports to a live monitoring dashboard accessible to all stakeholders.
Step 3. Select relevant metrics. Next, you must define the monitoring contents: the right metrics, tests, and statistics to track. A good rule of thumb is to monitor direct model performance metrics first. If they are unavailable or delayed, or you deal with critical use cases, you can come up with proxy metrics like prediction drift. Additionally, you can track input feature summaries and data quality indicators to troubleshoot effectively.
Step 4. Choose the reference dataset. Some metrics require a reference dataset, for example, to serve as a baseline for data drift detection. You must pick a representative dataset that reflects expected patterns, such as the data from hold-out model testing or earlier production operations. You may also consider having a moving reference or several reference datasets.
Step 5. Define the monitoring architecture. Decide whether you'll monitor your model in real-time or through periodic batch checks – hourly, daily, or weekly. The choice depends on the model deployment format, risks, and existing infrastructure. A good rule of thumb is to consider batch monitoring unless you expect to encounter near real-time issues. You can also compute some metrics on a different cadence: for example, evaluate model quality monthly when the true labels arrive.
Step 6. Design the alerting. You can typically choose a small number of key performance metrics to alert on so that you know when the model behavior significantly deviates from expected values. You'd also need to define specific conditions or thresholds and alerting mechanisms. For example, you can send email notifications or integrate your model monitoring system with incident management tools to immediately inform you when issues arise. You can also combine issue-focused alerting with reporting – such as scheduled weekly emails on model performance that include a more extensive set of metrics for manual analysis.
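A minimal sketch of such threshold-based alerting is shown below; the metric value, threshold, and webhook endpoint are placeholders for whatever your monitoring jobs and notification channel provide.

```python
import requests

# Assumed values: in practice, the metric comes from your monitoring jobs,
# and the threshold reflects agreed expectations for this model.
current_accuracy = 0.81
ACCURACY_THRESHOLD = 0.85
WEBHOOK_URL = "https://example.com/hooks/ml-alerts"  # placeholder endpoint

def send_alert(message: str) -> None:
    # Post the alert to a chat or incident-management webhook.
    requests.post(WEBHOOK_URL, json={"text": message}, timeout=10)

if current_accuracy < ACCURACY_THRESHOLD:
    send_alert(
        f"Model accuracy dropped to {current_accuracy:.2f} "
        f"(threshold {ACCURACY_THRESHOLD:.2f})"
    )
```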
Many ML applications involve unstructured data such as text or images. Use cases span sentiment analysis, chatbots, face recognition, content recommendation, text generation using large language models, and more.
Monitoring such ML applications presents specific challenges: there are no ready-made structured features to track, so standard tabular checks do not apply directly. Several ML monitoring approaches help tackle this.
Analyzing raw data. Some methods work with unstructured data directly. For example, you can train domain classifiers to detect data drift, allowing you to evaluate changes in text datasets in an interpretable way.
Monitoring text descriptors. You can generate features on top of text data that help evaluate specific properties of text, such as its length or sentiment. You can then monitor text descriptors to spot changes in data patterns and quality.
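For example, here is a minimal sketch of computing a few simple descriptors over a batch of raw texts with pandas; the specific descriptors are illustrative, and you would pick the ones relevant to your use case.

```python
import pandas as pd

# Hypothetical batch of raw text inputs (e.g., user reviews or chat messages).
texts = pd.Series([
    "Great product, works as expected!",
    "",
    "THIS IS UNACCEPTABLE!!! REFUND NOW",
])

# Simple descriptors computed on top of raw text.
descriptors = pd.DataFrame({
    "length": texts.str.len(),
    "word_count": texts.str.split().str.len(),
    "share_uppercase": texts.apply(
        lambda t: sum(c.isupper() for c in t) / len(t) if t else 0.0
    ),
    "is_empty": texts.str.strip() == "",
})

# Track summary statistics of the descriptors over time, like any tabular feature.
print(descriptors.describe())
```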
Monitoring embeddings. Some methods allow monitoring embeddings – vector representations of unstructured data – to detect shifts in the model inputs or outputs. For example, you can apply distance metrics like Euclidean or cosine distance, or model-based drift detection.
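Here is a minimal sketch of one such heuristic, the cosine distance between the mean embeddings of a reference and a current batch; the embedding arrays are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder embeddings: each row is a vector for one reference or current sample.
reference_emb = rng.normal(size=(1_000, 384))
current_emb = rng.normal(loc=0.1, size=(200, 384))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Simple heuristic: distance between the mean embedding of each batch.
drift_score = cosine_distance(reference_emb.mean(axis=0), current_emb.mean(axis=0))
print(f"Cosine distance between mean embeddings: {drift_score:.3f}")
```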
Want a deeper dive? Check out this blog on monitoring embeddings and monitoring text data with descriptors, or a complete section on NLP and LLM monitoring in our course.
Batch and near real-time ML monitoring are two alternative architectures to power the backend of ML monitoring.
Batch monitoring involves executing evaluation jobs on a cadence or responding to a trigger. For example, you can run daily monitoring jobs by querying model prediction logs and computing the model quality or data quality metrics. This method is versatile and suitable for batch data pipelines and online ML services.
Running monitoring jobs is generally easier than maintaining a continuous ML monitoring service. Another benefit of designing ML monitoring as a set of batch jobs is that it allows combining immediate (done at serving time) and delayed (done when the labels arrive) monitoring within the same architecture. You can also run data validation jobs alongside monitoring.
However, batch monitoring does introduce some delay in metric computation and requires expertise in workflow orchestrators and data engineering resources.
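As an illustration, here is a minimal sketch of a daily batch monitoring job; `load_prediction_logs` and `publish_metrics` are hypothetical helpers standing in for your warehouse query and metric store.

```python
from datetime import date, timedelta

import pandas as pd

def load_prediction_logs(day: date) -> pd.DataFrame:
    # Hypothetical stand-in for querying the prediction log store or warehouse.
    return pd.DataFrame({"prediction": [0.2, 0.7, None, 0.9], "feature": [1, 2, 3, 4]})

def publish_metrics(day: date, metrics: dict) -> None:
    # Hypothetical stand-in for writing to a metric store or dashboard backend.
    print(day, metrics)

def daily_monitoring_job() -> None:
    # Runs once a day, e.g., scheduled by cron or a workflow orchestrator.
    day = date.today() - timedelta(days=1)
    logs = load_prediction_logs(day)
    metrics = {
        "row_count": int(len(logs)),
        "missing_share": float(logs.isna().mean().mean()),
        "prediction_mean": float(logs["prediction"].mean()),
    }
    publish_metrics(day, metrics)

if __name__ == "__main__":
    daily_monitoring_job()
```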
Real-time (streaming) model monitoring requires sending data directly from your ML service to a monitoring service that continuously computes and publishes ML quality metrics. This architecture is suitable for online ML prediction services and allows detecting issues like missing data in near real-time.
The downsides are that the real-time ML monitoring architecture might be more costly to operate from the engineering resources standpoint, and you may still need batch monitoring pipelines for delayed ground truth.
Ultimately, the choice between these approaches depends on your specific requirements, resources, model deployment formats, and the necessity for near real-time issue detection.
Evidently is an open-source Python library for ML model monitoring.
It helps implement testing and monitoring for production machine learning models, including different model types (classification, regression, ranking) and data types (texts, embeddings, tabular data).
You can use Evidently to evaluate data drift, data quality, and model performance, run structured test suites with pass/fail checks, and build a live monitoring dashboard on top of these evaluations.
To get started with Evidently open-source, check out this Quickstart tutorial.
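As a quick illustration, here is a minimal sketch of running a data drift evaluation with Evidently on two pandas DataFrames; the exact imports and method names depend on the installed library version, so follow the Quickstart tutorial for the current API.

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Placeholder data: reference comes from a stable past period,
# current is the latest production batch.
reference = pd.DataFrame({"feature": [1, 2, 3, 4, 5]})
current = pd.DataFrame({"feature": [2, 3, 5, 8, 13]})

# Build a report with the data drift preset and render it to HTML.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")
```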
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring all the team to a single workspace to collaborate on AI quality.
Sign up free ⟶
Or try open source ⟶