
Model monitoring for ML in production: a comprehensive guide

Last updated: January 15, 2025

Machine learning models have a lifespan. They decay over time and can make mistakes or predictions that don't make sense. They can also encounter unexpected or corrupted data that affects the quality of the outputs. To address this, you need ML model monitoring.

This guide explores the topic of ML model monitoring in detail. We'll cover what it is, why it matters, which metrics to monitor, and how to design an ML monitoring strategy, considering aspects like ML monitoring architecture or the type of data you deal with.

We will also introduce Evidently, an open-source Python library for ML monitoring.

TL;DR

  • Model monitoring means continuous tracking of the ML model quality in production. It helps detect and debug issues and understand and document model behavior.  
  • Model monitoring is different from software health monitoring and is focused on the behavior of the model within the software.
  • Model monitoring includes tracking metrics related to model quality (e.g., accuracy, precision, etc.), data and prediction drift, data quality, model bias, and fairness.


What is ML model monitoring?

Model monitoring is the ongoing process of tracking, analyzing, and evaluating the performance and behavior of machine learning models in real-world, production environments. It involves measuring various data and model metrics to help detect issues and anomalies and ensure that models remain accurate, reliable, and effective over time.

Evidently ML model monitoring dashboard
An example model monitoring dashboard.

Why you need ML monitoring 

Building a machine learning model is just the beginning. Once you deploy that model into the real world, it faces many challenges that can affect its performance and require continuous monitoring.

ML monitoring in the model lifecycle
ML monitoring in the ML model lifecycle.

Here are some examples of issues that can affect production ML models.

Gradual concept drift. Gradual concept drift refers to the ongoing changes in the relationships between variables or patterns in the data over time. They may ultimately lead to a degradation in the model quality. Consider a product recommendation system: as user preferences evolve, what was relevant last month might not be today, impacting the quality of the model's suggestions.

Concept drift
Example of a sudden concept drift during the pandemic.

Sudden concept drift. Sudden concept drift, in contrast, involves abrupt and unexpected changes to the model environment that can significantly impact its performance. This might be, for instance, an external change such as the outbreak of COVID-19 or an unexpected update to a third-party application that disrupts data logging and, in turn, makes your model obsolete. 

Want to read more about concept drift? Head to the deep dive guide.

Data drift. Data distribution drift occurs when the statistical properties of the input data change. An example could be a change in customer demographics, resulting in the model underperforming on a previously unseen customer segment. 

Data distribution drift
A shift in the sales channel distribution is an example of data drift.
Want to read more about data drift? Head to the deep dive guide.

Data quality issues. Data quality issues encompass a range of problems related to the input data's accuracy, completeness, and reliability. Examples include missing values, duplicate records, or shifts in the feature range: imagine values recorded in milliseconds instead of seconds. If a model receives unreliable inputs, it will likely produce unreliable predictions.

Data pipeline bugs. Many errors occur within the data processing pipeline. These bugs can lead to delayed data or data that doesn't match the expected format, causing issues in the model performance. For instance, a bug in data preprocessing may result in features having a wrong type or not matching the input data schema.

Adversarial adaptation. External parties might deliberately target and manipulate the model's performance. For example, spammers may adapt and find ways to overcome spam detection filters. With LLMs, malicious actors may intentionally craft inputs to manipulate the model outputs, using techniques such as prompt injection.

Adversarial adaptation
If the ML model remains the same, adversaries can adapt to it with time.

Broken upstream models. Often, there is a chain of machine learning models operating in production. If one model gives wrong outputs, the errors can propagate downstream and lead to a quality drop in the dependent models.

If these issues occur in production, your model could produce inaccurate results. Depending on the use case, getting predictions wrong can lead to measurable negative business impact. 

The risks vary from lost revenue and customer dissatisfaction to reputational damage and operational disruption. The more crucial a model is to a company's success, the greater the need for robust monitoring.

Model monitoring goals

A robust model monitoring system not only helps mitigate the risks discussed in the previous section but also offers additional benefits. Here is an overview of what you can expect from ML monitoring.

Issue detection and alerting. ML monitoring is the first line of defense that helps identify when something goes wrong with the production ML model. You can alert on various symptoms, from direct drops in model accuracy to proxy metrics like increased share of missing data or data distribution drift.

Evidently Test Suites

Root cause analysis. Once alerted, well-designed monitoring helps pinpoint the root causes of problems. For example, it can help identify specific low-performing segments the model struggles with or help locate corrupted features. 

ML model behavior analysis. Monitoring helps get insights into how users interact with the model and whether there are shifts in the model's operational environment. This way, you can adapt to changing circumstances or find ways to improve the model performance and user experience.

Action triggers. You can also use the signals a model monitoring system supplies to trigger specific actions. For example, if the performance goes below a certain threshold, you can switch to a fallback system or previous model version or initiate retraining or data labeling.
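
As a rough illustration, a trigger can be a small function that maps a monitored metric to an action. This is only a sketch: the threshold, the margin, and the action names below are hypothetical, and the right policy depends on your use case.

```python
def choose_action(current_accuracy, threshold=0.8, retrain_margin=0.05):
    """Map a monitored metric value to an action (all names are illustrative)."""
    if current_accuracy >= threshold:
        return "keep_serving"
    # Slightly below the threshold: keep serving but refresh the model.
    if current_accuracy >= threshold - retrain_margin:
        return "trigger_retraining"
    # Far below the threshold: stop relying on the model for now.
    return "switch_to_fallback"


for accuracy in (0.91, 0.78, 0.60):
    print(accuracy, "->", choose_action(accuracy))
```

In practice, the returned action would be wired to your own orchestration or deployment tooling.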

Performance visibility. A robust model logging and monitoring system enables recording the ongoing model performance for future analysis or audits. Additionally, having a clear view of the model operations helps communicate the model value to stakeholders. 

Stakeholders of ML model monitoring
Model monitoring systems might have multiple stakeholders.

Why ML monitoring is hard

You might wonder: there is an established practice of tracking software health and product performance; how is ML model monitoring different? Is it possible to use the same methods? 

While it is partially true – you still need to monitor the software system health – model monitoring addresses a particular set of challenges, which makes it a separate field. Firstly, you focus on different groups of metrics – such as model and data quality metrics. Secondly, how you compute these metrics and design your model monitoring is also different. 

Let’s explore some of these challenges.

Silent failures. Software errors are usually visible: if things don’t work, you will get an error message. With machine learning, you may encounter a different type of error: a model returning an unreliable or biased prediction. Such errors are “silent”: a model will typically respond as long as it can process the incoming data inputs. Even if the input data is incorrect or significantly different, the model will make a potentially low-quality prediction without raising an alarm. To detect such "non-obvious" errors, you must evaluate model reliability using proxy signals and design use-case-specific validations.

Lack of ground truth. In a production ML environment, it is typical for feedback on model performance to have a delay. Because of this, you cannot measure the true model quality in real time. For example, if you forecast sales for the next week, you can only estimate the actual model performance once this time passes and you know the sales numbers. To evaluate the model quality indirectly, you need to monitor the model inputs and outputs. You also often need two monitoring loops: the real-time one that uses proxy metrics and the delayed one that runs once the labels are available.

Ground truth is not immediately available
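
The delayed loop can be sketched as a join between logged predictions and the labels that arrive later. The example below is a minimal illustration with made-up data: the prediction IDs, the log structure, and the `delayed_mae` helper are assumptions, not part of any specific tool.

```python
# Logged at prediction time: prediction id -> predicted value.
prediction_log = {"p1": 120.0, "p2": 95.0, "p3": 110.0}
# Arrives later: prediction id -> actual value ("p3" is still unknown).
delayed_labels = {"p1": 100.0, "p2": 100.0}


def delayed_mae(predictions, labels):
    """Mean absolute error over the predictions that already have labels."""
    matched = [(predictions[k], labels[k]) for k in predictions if k in labels]
    if not matched:
        return None, 0.0
    mae = sum(abs(p - y) for p, y in matched) / len(matched)
    coverage = len(matched) / len(predictions)  # share of predictions with labels
    return mae, coverage


mae, coverage = delayed_mae(prediction_log, delayed_labels)
print(f"MAE on labeled predictions: {mae}, label coverage: {coverage:.2f}")
```

Tracking the label coverage next to the metric helps you judge how much of the production traffic the delayed quality estimate actually reflects.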

Relative definition of quality. What's considered good or bad model performance depends on the specific problem. For instance, a 90% accuracy rate might be an excellent result for one model, a symptom of a huge quality issue for another, or simply a wrong choice of metrics for the third one. On top of this, there is inherent variability in the model performance. This makes it challenging to set clear, universal metrics and alerting thresholds. You will need to adjust the approach depending on the use case, cost of error, and business impact. 

Relative definition of quality

Complex data testing. Data-related metrics are often sophisticated and computationally intensive. For example, you could compare input distributions by running statistical tests. This requires collecting a batch of data of significant size as well as passing a reference dataset. The architecture of such an implementation differs significantly from traditional software monitoring, where you expect a system to emit metrics like latency continuously.

Model monitoring vs. others

To better understand the concept of production model monitoring, let’s explore and contrast a few other related terms, such as model experiment management or software monitoring. 

Model observability

TL;DR: ML model monitoring focuses on detecting known issues. ML model observability provides root cause analysis and comprehensive visibility into the system performance.

ML model monitoring primarily involves tracking a predefined set of metrics to detect issues and answer questions like "What happened?" and "Is the system working as expected?" It is more reactive and is instrumental in identifying "known unknowns."

On the other hand, ML observability provides a deeper level of insight into the system's behavior, helping understand and analyze the root causes of issues, addressing questions like "Why did it happen?" and "Where exactly did it go wrong?" ML observability is a proactive approach to uncover "unknown unknowns." 

ML monitoring vs. ML observability

ML monitoring is a subset of ML observability. In practice, both terms are often used interchangeably.

Experiment tracking

TL;DR: Production ML monitoring tracks the quality of the model performance on live data. Experiment tracking helps compare different models and parameters on an offline test set.

Sometimes, practitioners use the term “model monitoring” to describe the idea of tracking the quality of different models during the model training phase. A more common name is experiment tracking.

Experiment tracking helps record different iterations and configurations of models during the development phase. It ensures that you can reproduce, compare, and document various experiments – such as recording a specific output and how you arrived at it. While it also involves visualizations of different model metrics, it concerns model performance on offline data, such as a held-out test set. The goal is typically to help compare models and choose the best one to deploy in production.

Experiment tracking vs. Production monitoring

Model monitoring, on the other hand, focuses on the models that are already in production. It helps track how they perform in real-world scenarios as you generate predictions in an ongoing manner for real-time data.

Software monitoring 

TL;DR: Software health monitoring is focused on monitoring the application layer. Model monitoring is focused on the quality of the data and model outputs within this application.

Model monitoring occasionally comes up in the context of traditional software and application performance monitoring (APM).

For example, when you deploy a model as a REST API, you must monitor its service health, such as its uptime and prediction latency. This software-level monitoring is crucial for ensuring the reliability of the overall ML system, and should always be implemented. However, it is not specific to ML: it works just like software monitoring for other production applications and can reuse the same approaches.

Facets of ML system monitoring

With ML model monitoring, in contrast, you specifically look at the behavior and performance of machine learning models within the software. Model monitoring focuses primarily on monitoring the data and ML model quality. This requires distinct metrics and approaches: think tracking the share of missing values in the incoming data and predictive model accuracy (model monitoring) versus measuring the utilization of the disk space (software system monitoring).

Data monitoring

TL;DR: Data monitoring tracks the overall health of data assets. Model monitoring tracks the quality of individual ML models, which may involve checking their input data quality.

There is some overlap between data and model quality monitoring – especially at the implementation level when it comes to specific tests and metrics you can run, such as tracking missing data. However, each practice has its application focus.

Data monitoring involves continuous oversight of the organizational data sources to ensure their integrity, quality, security, and overall health. This encompasses all data assets, whether used by ML models or not. Typically, the central data team handles data monitoring at the organizational level.

In contrast, ML model monitoring is the responsibility of the ML platform team and the ML engineers and data scientists who develop and operate specific ML models. While data monitoring oversees various data sources, ML model monitoring focuses on the specific ML models in production and their input data.

In addition, data quality monitoring is only a subset of ML monitoring checks. Model monitoring covers various metrics and aspects of ML model quality on top of the quality of the input data – from model accuracy to prediction bias. 

Model governance

TL;DR: Model governance sets standards for responsible ML model development across the entire lifecycle. Model monitoring helps continuously track model performance in production.

Model governance refers to practices and policies for managing machine learning models throughout their lifecycle. They help ensure that ML models are developed, deployed, and maintained responsibly and compliantly. ML model governance programs may include components related to model development standards, privacy and diversity of training data, model documentation, testing, audits, and ethical and regulatory alignment.

While model governance covers the entire model lifecycle, model monitoring is specific to the post-deployment phase. 

Model monitoring is a subset of model governance that explicitly covers tracking ongoing model performance in production. While model governance sets rules and guidelines for responsible machine learning model development, ML monitoring helps continuously observe the deployed models to ensure their real-world performance and reliability.

Both ML governance and ML monitoring involve various stakeholders. However, data and ML engineers and ML operations teams are typically the ones to implement model monitoring. In contrast, AI governance, risk, and compliance teams often lead model governance programs. 

Model monitoring metrics

Now that we've got the basics, let's dive into the specific metrics that might need attention as you implement model monitoring. 

Metric overview

Since an ML-based service goes beyond just the ML model, the ML system quality has several facets: software, data, model, and business KPIs. Each involves monitoring different groups of metrics. 

Model monitoring metrics pyramid

Software system health. Regardless of the model's quality, you must ensure the reliability of the entire prediction service first. This includes tracking standard software performance metrics such as latency, error rates, memory, or disk usage. Software operations teams can perform this monitoring similarly to how they monitor other software applications.

Data quality. Many model issues can be rooted in problems with the input data. You can track data quality and integrity using metrics like the percentage of missing values, type mismatches, or range violations in critical features to ensure the health of data pipelines. 

ML model quality and relevance. To ensure that ML models perform well, you must continuously assess their quality. This involves tracking performance metrics like precision and recall for classification, MAE or RMSE for regression, or top-k accuracy for ranking. If you do not get the true labels quickly, you might use use-case-specific heuristics or proxy metrics.

Business Key Performance Indicators (KPIs). The ultimate measure of a model's quality is its impact on the business. You may monitor metrics such as clicks, purchases, loan approval rates, or cost savings. Defining these business KPIs is custom to the use case and may involve collaboration with business teams to ensure alignment with the organization's goals.

Data and model quality monitoring

Monitoring data and model quality is typically the primary concern of ML model monitoring. Let’s look deeper into the metrics that fall into this category.

Model quality metrics

These metrics focus on the predictive quality of a machine learning model. They help understand how well the model performs in production and whether it's still accurate.

ML model quality metrics

Monitoring model quality metrics is typically the best way to detect any production issues.

Direct model quality metrics. You can assess the model performance using standard ML evaluation metrics, such as accuracy, mean error, etc. The choice of metrics depends on the type of model you're working with. These metrics usually match those used to evaluate the model performance during training.

Examples:

  • Classification: model accuracy, precision, recall, F1-score.
  • Regression: mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), etc.
  • Ranking and recommendations: normalized discounted cumulative gain (NDCG), precision at K, mean average precision (MAP), etc.
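
To make the classification and regression examples above concrete, here is a minimal sketch that computes them from logged predictions and labels in plain Python. The helper names and toy data are illustrative. In a real project, you would more likely rely on an established metrics library.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: labels that arrived for a batch of past predictions.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(labels, predictions))
print(mean_absolute_error([10, 20, 30], [12, 18, 33]))
```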

Performance by segment. It often makes sense to examine the model quality across various cohorts and prediction slices. This approach can reveal variations you might miss when looking at the aggregated metrics that account for the entire dataset. For example, you can evaluate the model quality for specific customer groups, locations, devices, etc. 
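
A small sketch of this idea: the same accuracy metric, computed per segment. The `device` field and the toy records are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_segment(records, segment_key):
    """Compute accuracy separately for each value of `segment_key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        segment = record[segment_key]
        totals[segment] += 1
        correct[segment] += int(record["label"] == record["prediction"])
    return {segment: correct[segment] / totals[segment] for segment in totals}


records = (
    [{"device": "desktop", "label": 1, "prediction": 1}] * 4
    + [{"device": "mobile", "label": 1, "prediction": 1}] * 2
    + [{"device": "mobile", "label": 1, "prediction": 0}] * 2
)
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
by_device = accuracy_by_segment(records, "device")
print("overall:", overall)
print("by device:", by_device)
```

Here the aggregate accuracy looks acceptable, while the per-segment view shows that all errors come from the mobile segment.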

However, while model performance metrics are usually the best measure of the actual model quality in production, the caveat is that you need newly labeled data to compute them. In practice, this is often only possible after some time. 

Heuristics. When ground truth data is unavailable, you can look at proxy metrics that reflect model quality or can provide a signal when something goes wrong. For example, if you have a recommendation system, you can track the share of displayed recommendation blocks that do not earn any clicks or the share of products excluded from model recommendations – and react if it goes significantly above the baseline.
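
For instance, the no-click heuristic could be sketched as follows. The log format, the baseline value, and the tolerance are hypothetical; you would derive them from your own historical data.

```python
def no_click_share(blocks):
    """Share of displayed recommendation blocks that earned zero clicks."""
    return sum(1 for block in blocks if block["clicks"] == 0) / len(blocks)


def exceeds_baseline(current, baseline, tolerance=0.10):
    """True if the current share is more than `tolerance` (absolute) above baseline."""
    return current > baseline + tolerance


# Toy log: 10 displayed blocks, 6 of them without clicks.
displayed_blocks = [{"clicks": 0}] * 6 + [{"clicks": 2}] * 4
share = no_click_share(displayed_blocks)
print("no-click share:", share)
print("needs attention:", exceeds_baseline(share, baseline=0.4))
```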

Data and prediction drift

When labels come with a delay, you can also monitor data and prediction drift. These metrics help assess if the model still operates in a familiar setting, and serve as proxy indicators of the potential model quality issues. Additionally, drift analysis helps debug and troubleshoot the root cause of model quality drops. 

Data drift metrics

With this type of early monitoring, you can look at shifts in both model inputs and outputs. 

Output drift. You can look at the distribution of predicted scores, classes, or values. Does your model predict higher prices than usual? More fraud than on average? A significant shift from the past period might indicate a model performance or environment change.

Input drift. You can also track changes in the model features. If the distribution of the key model variables remains stable, you can expect model performance to be reasonably consistent. However, as you look at feature distributions, you might also detect meaningful shifts. For instance, if your model was trained on data from one location, you might want to learn in advance when it starts making predictions for users from a different area.

To detect data drift, you typically need a reference dataset as a baseline for comparison. For example, you can use data from a past stable production period large enough to account for seasonal variations. You can then compare the current batch of data against reference and evaluate if there is a meaningful shift. 

There are different methods for evaluating data distribution shift, including:

  • Summary statistics. You can compare mean, median, variance, or quantile values between individual features in reference and current datasets. For instance, you can react if the mean value of a numerical variable shifts beyond two standard deviations.
  • Statistical tests. You can run hypothesis testing to assess whether the dataset differences are statistically significant. You can use tests like Kolmogorov-Smirnov for numerical features and Chi-square for categorical ones and treat the p-value as a drift score. 
  • Distance-based methods. You can also use metrics such as Wasserstein distance or Jensen-Shannon divergence to evaluate the extent of drift. The resulting score quantifies the distance between distributions. You can track it over time to determine how “far” the feature distributions drift apart.
  • Rule-based checks. You can also set up simple rules, like alerting when the minimum value of a specific feature goes above a threshold or when new categorical values appear. These checks do not “measure” drift but can help detect meaningful changes for further investigation.
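
The methods above can be sketched in plain Python as follows. This is a simplified illustration on toy data, not a production implementation: it computes only the Kolmogorov-Smirnov statistic (no p-value) and evaluates Jensen-Shannon divergence on category shares. In practice, you would typically use a statistics library such as SciPy or a dedicated monitoring tool for these computations. All function names and thresholds here are assumptions.

```python
import math
from bisect import bisect_right
from collections import Counter
from statistics import mean, stdev


def mean_shift_detected(reference, current, n_std=2.0):
    """Summary-statistic rule: the mean moved by more than n_std reference std devs."""
    return abs(mean(current) - mean(reference)) > n_std * stdev(reference)


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def js_divergence(reference, current):
    """Jensen-Shannon divergence (log base 2, range 0..1) between category shares."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    score = 0.0
    for category in set(ref_counts) | set(cur_counts):
        p = ref_counts[category] / len(reference)
        q = cur_counts[category] / len(current)
        m = (p + q) / 2
        if p > 0:
            score += 0.5 * p * math.log2(p / m)
        if q > 0:
            score += 0.5 * q * math.log2(q / m)
    return score


def new_categories(reference, current):
    """Rule-based check: categories that never appeared in the reference data."""
    return set(current) - set(reference)


ref_amounts = [10, 12, 11, 13, 12, 11, 10, 12]
cur_amounts = [20, 22, 21, 23, 22, 21, 20, 22]
ref_channels = ["web"] * 8 + ["store"] * 2
cur_channels = ["web"] * 5 + ["store"] * 3 + ["app"] * 2

print("mean shift:", mean_shift_detected(ref_amounts, cur_amounts))
print("KS statistic:", ks_statistic(ref_amounts, cur_amounts))
print("JS divergence:", round(js_divergence(ref_channels, cur_channels), 3))
print("new categories:", new_categories(ref_channels, cur_channels))
```

A distance score like this is easy to track over time, while a statistical test additionally gives you a significance level. Which one is more useful depends on the data volume and how sensitive you want the check to be.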

Ultimately, drift detection is a heuristic: you can tweak the methods depending on the context, data size, the scale of change you consider acceptable, the model’s importance and known ability to generalize, and environment volatility.

Want a deeper dive into data drift? Check out this introductory guide. 

Data quality metrics

Data quality metrics focus on the integrity and reliability of the incoming data. Monitoring data quality ensures that your model makes predictions based on high-quality features. 

Data quality metrics

To safeguard against corruption in the input data, you can consider running the following validations or monitor aggregate metrics:

  • Missing data. You can check for the share of missing values in particular features or the dataset overall. 
  • Data schema checks. You can validate that the input data structure matches the expected format, including column names and types.
  • Feature range and list constraints. You can establish what constitutes "normal" feature values, whether it's limits like "sales should be non-negative" or domain-specific ranges like "sensor values are between 10 and 20" or feature lists for categorical features. You can then track deviations from these constraints.
  • Monitoring feature statistics. You can track statistics like mean values, min-max ranges, or percentile distribution of specific features to detect abnormal inputs.
  • Outlier detection. You can set up monitoring to detect unusual data points that differ significantly from the rest using anomaly and outlier detection techniques. Monitoring can focus on finding individual outliers (for example, in order to process them differently than the rest of the inputs) or tracking their overall frequency (as a measure of changes in the dataset).
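
Several of these validations can be expressed as simple functions over a batch of input rows. The sketch below uses hypothetical column names and constraints that mirror the examples above ("sales should be non-negative", "sensor values are between 10 and 20"). It is meant as an illustration rather than a complete validation framework.

```python
def missing_share(rows, column):
    """Share of rows where the column is absent or None."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def schema_violations(rows, expected_types):
    """Return (row_index, column) pairs with an absent column or a wrong type."""
    problems = []
    for index, row in enumerate(rows):
        for column, expected in expected_types.items():
            if column not in row:
                problems.append((index, column))
            elif row[column] is not None and not isinstance(row[column], expected):
                problems.append((index, column))
    return problems


def range_violations(rows, column, low, high):
    """Indices of rows whose numeric value falls outside [low, high]."""
    return [
        index
        for index, row in enumerate(rows)
        if isinstance(row.get(column), (int, float)) and not low <= row[column] <= high
    ]


batch = [
    {"sales": 120.0, "sensor": 12.5, "channel": "web"},
    {"sales": None, "sensor": 15.0, "channel": "store"},
    {"sales": -5.0, "sensor": 48.0, "channel": "web"},
    {"sales": 80.0, "sensor": "17", "channel": "web"},  # wrong type: string
]
expected = {"sales": float, "sensor": float, "channel": str}

print("missing sales share:", missing_share(batch, "sales"))
print("schema violations:", schema_violations(batch, expected))
print("sensor out of range:", range_violations(batch, "sensor", 10, 20))
print("negative sales:", range_violations(batch, "sales", 0, float("inf")))
```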

Evaluating data quality is critical to ML monitoring since many production issues stem from corrupted inputs and pipeline bugs. This is also highly relevant when you use data from multiple sources, especially supplied by external providers, which might introduce unexpected changes to the data format. 

Bias and fairness

Bias and fairness metrics help ensure that machine learning models don't discriminate against specific groups or individuals based on certain characteristics, such as race, gender, or age.

This type of monitoring is especially relevant for particular domains, such as healthcare or education, where biased model behavior can have far-reaching consequences.

For example, if you have a classifier model, you can pick metrics like: 

  • Predictive parity. This metric assesses whether the model's positive predictions are equally reliable across different groups. It measures whether precision (the share of positive predictions that turn out to be correct, e.g., confirmed disease diagnoses) is equal among selected groups. If predictive parity is not achieved, the model might favor one group over another, potentially leading to unjust outcomes.
  • Equalized odds. This metric goes a step further by evaluating both false positives and false negatives. It checks if the model's error rates are comparable among different groups. If the model significantly favors one group in false positives or negatives, it could lead to unfair treatment.
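
As an illustration, you can compute per-group rates from labeled predictions and compare them across groups. The sketch below uses the common definitions: predictive parity compares precision, and equalized odds compares both the true positive rate and the false positive rate. The group names and toy data are made up.

```python
def group_rates(records):
    """Per-group precision, true positive rate (TPR), and false positive rate (FPR).

    Each record is a (group, label, prediction) tuple with binary labels.
    """
    counts = {}
    for group, label, prediction in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
        if prediction == 1:
            c["tp" if label == 1 else "fp"] += 1
        else:
            c["fn" if label == 1 else "tn"] += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        group: {
            "precision": ratio(c["tp"], c["tp"] + c["fp"]),
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),
        }
        for group, c in counts.items()
    }


def max_gap(rates, metric):
    """Largest difference in a metric between any two groups."""
    values = [group[metric] for group in rates.values() if group[metric] is not None]
    return max(values) - min(values)


records = (
    [("A", 1, 1)] * 3 + [("A", 0, 1)] * 1 + [("A", 1, 0)] * 1 + [("A", 0, 0)] * 3
    + [("B", 1, 1)] * 2 + [("B", 0, 1)] * 2 + [("B", 1, 0)] * 2 + [("B", 0, 0)] * 2
)
rates = group_rates(records)
print(rates)
print("precision gap (predictive parity):", max_gap(rates, "precision"))
print("TPR gap and FPR gap (equalized odds):", max_gap(rates, "tpr"), max_gap(rates, "fpr"))
```

What size of gap counts as acceptable is a judgment call that depends on the application and should be agreed upon with domain experts.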

There are other related metrics, such as disparate impact, statistical parity, and so on. The choice of fairness metrics should involve domain experts to ensure they align with the specific goals, context, and potential impact of biases in a given application.

Model monitoring strategy 

A model monitoring strategy is a systematic plan that outlines how you will track, assess, and maintain the performance of your machine learning models once they are deployed. 

ML monitoring strategy

What affects the monitoring strategy

Both the monitoring architecture and the composition of the monitoring metrics vary based on the specific ML application. Here are some considerations that might affect the setup.

The goals of monitoring. The monitoring depth, granularity, and format can vary depending on the intended audience and purpose. 

Sometimes, the primary objective is to detect technical issues or anomalies and communicate them to ML teams. In that case, the monitoring system might be minimalistic and focused on alerting on a few key performance indicators. If you expect debugging capabilities, the setup might become more comprehensive and focused on ease of data exploration.

When the monitoring reports also aim at product managers, business stakeholders, or even end-users, the focus often shifts towards providing comprehensible visualizations. They should enable a clear understanding of how the model performs and its impact on key business metrics. 

Why monitor ML models?

The data and model types affect the primary choice of monitoring metrics. 

Different machine learning tasks, such as classification, regression, or ranking, require distinct performance metrics tailored to their objectives. Generative models have other monitoring criteria compared to predictive models. 

The data structure also matters, with monitoring approaches differing significantly between tabular data and unstructured data, like text, images, or embeddings. While some components of model monitoring, like data quality checks, can be reusable across various model types, the ultimate design of the monitoring dashboard varies per model. 

Label availability. The feedback loop affects whether you can access the ground truth data to evaluate the actual model performance. This affects the contents of monitoring.

If you can easily access or acquire labels, you can monitor the true model quality and design monitoring around direct metrics like model accuracy. If you obtain ground truth with a significant delay, you might need to rely on proxies or substitute metrics. 

Model criticality and importance. Models used for financial decisions, medical diagnosis, or various business-critical systems require rigorous monitoring due to the high risks and costs associated with model underperformance. High-stakes applications demand proactive issue detection and rapid response mechanisms. 

On the other hand, less critical models, such as those used for non-essential recommendations, may have lower monitoring requirements. These models may not directly impact the core operations of a business and might allow for some margin of error without severe consequences. This affects the monitoring granularity and frequency.

Model deployment architecture. You can deploy and use ML models differently: e.g., you can run some of them as weekly batch jobs while others make real-time decisions. This affects the architectural implementation of the monitoring system.

Online models, such as recommendation or fraud detection systems, require immediate monitoring to detect and respond to issues as they arise. This requires the monitoring system to handle sizable incoming data volume and run computations in near real-time. 

Batch models, often used in offline data processing, may need periodic checks, but those do not necessarily require a comprehensive setup, especially if you use the model infrequently. You can design simpler batch monitoring jobs or validations executed on demand. 

The speed of model decay and data changes can vary a lot between models.

Speed of environmental changes. The rate at which the data changes also affects the design of the ML monitoring. Rapidly shifting environments might require more frequent monitoring or greater attention to the distribution of input features. If the model operates in a more stable environment, you might design monitoring around a couple of key performance indicators.  


Establishing the monitoring strategy

To establish an effective model monitoring strategy for a particular model, you can go through the following steps.

Step 1. Define objectives. It's essential to start with a clear understanding of who will be using the monitoring results. Do you want to help data engineers detect missing data? Is it for data scientists to evaluate the changes in the key features? Is it to provide insight for product managers? Will you use the monitoring signals to trigger retraining? You should also consider the specific risks associated with the model usage you want to protect against. This sets the stage for the entire process.

Step 2. Choose the visualization layer. You must then decide how to deliver the monitoring results to your audience. You might have no shared interface – and only send alerts through preferred channels when some checks or validation fail. If you operate at a higher scale and want a visual solution, it can vary from simple reports to a live monitoring dashboard accessible to all stakeholders. 

Step 3. Select relevant metrics. Next, you must define the monitoring contents: the right metrics, tests, and statistics to track. A good rule of thumb is to monitor direct model performance metrics first. If they are unavailable or delayed, or you deal with critical use cases, you can come up with proxy metrics like prediction drift. Additionally, you can track input feature summaries and data quality indicators to troubleshoot effectively. 
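As an illustration of a proxy metric, here is a minimal sketch of a prediction drift check. It uses the Population Stability Index (PSI), which is one common choice and not something this guide prescribes. The function name, the equal-width binning, and the `eps` smoothing value are assumptions made for this example, and it expects non-empty lists of scores between 0 and 1.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two non-empty samples of model scores in [0, 1].

    Hypothetical helper for illustration: equal-width bins, with eps
    used as a floor so that empty bins do not cause log(0).
    """
    def bin_shares(scores):
        counts = [0] * n_bins
        for score in scores:
            # Clamp the index so that 1.0 (or a stray value) lands in a valid bin.
            idx = min(max(int(score * n_bins), 0), n_bins - 1)
            counts[idx] += 1
        return [max(count / len(scores), eps) for count in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    # Each term is non-negative, so PSI grows as the distributions diverge.
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


reference_scores = [i / 100 for i in range(100)]      # scores seen during validation
current_scores = [0.5 + i / 200 for i in range(100)]  # production scores, shifted upward
print(population_stability_index(reference_scores, current_scores))
```

A value of zero means the binned distributions match. Which value should count as drift depends on the model and is best tuned on historical data.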

Evidently ML monitoring dashboard

Step 4. Choose the reference dataset. Some metrics require a reference dataset, for example, to serve as a baseline for data drift detection. You must pick a representative dataset that reflects expected patterns, such as the data from hold-out model testing or earlier production operations. You may also consider having a moving reference or several reference datasets.
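The difference between a fixed and a moving reference can be sketched in a few lines. The record layout (a list of dicts with a `date` key), the function names, and the 28-day window are assumptions chosen for this example.

```python
from datetime import date, timedelta


def fixed_reference(records, end):
    """Fixed baseline: all records up to and including a chosen end date."""
    return [r for r in records if r["date"] <= end]


def moving_reference(records, current_start, window_days=28):
    """Moving baseline: the window_days immediately before the current period."""
    window_start = current_start - timedelta(days=window_days)
    return [r for r in records if window_start <= r["date"] < current_start]


# Hypothetical prediction log: one record per day for 60 days.
records = [
    {"date": date(2025, 1, 1) + timedelta(days=i), "prediction": i % 2}
    for i in range(60)
]

baseline = fixed_reference(records, end=date(2025, 1, 10))
recent = moving_reference(records, current_start=date(2025, 2, 15))
print(len(baseline), len(recent))
```

A fixed reference keeps the comparison stable over time, while a moving one adapts to gradual, expected changes at the cost of possibly hiding slow drift.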

Step 5. Define the monitoring architecture. Decide whether you'll monitor your model in real-time or through periodic batch checks – hourly, daily, or weekly. The choice depends on the model deployment format, risks, and existing infrastructure. A good rule of thumb is to consider batch monitoring unless you expect to encounter near real-time issues. You can also compute some metrics on a different cadence: for example, evaluate model quality monthly when the true labels arrive. 

Batch model monitoring architecture
Example of the batch model monitoring architecture. Source.

Step 6. Design alerting. Choose a small number of key performance metrics to alert on so that you know when the model behavior significantly deviates from expected values. You also need to define specific conditions or thresholds and the alerting mechanisms. For example, you can send email notifications or integrate your model monitoring system with incident management tools so that you hear about issues as soon as they arise. You can also combine issue-focused alerting with reporting, such as scheduled weekly emails on the model performance that include a more extensive set of metrics for manual analysis.
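A threshold-based alert check could look like the sketch below. The `AlertRule` class, the metric names, and the threshold values are hypothetical. Delivering the messages by email or to an incident management tool is left out.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    threshold: float
    direction: str  # "below" or "above": which side of the threshold is a problem


def evaluate_alerts(metrics, rules):
    """Return a human-readable message for every rule that is breached."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            # A metric that was not computed is itself worth flagging.
            fired.append(f"{rule.metric}: metric missing")
            continue
        if rule.direction == "below":
            breached = value < rule.threshold
        else:
            breached = value > rule.threshold
        if breached:
            fired.append(
                f"{rule.metric}={value:.3f} is {rule.direction} threshold {rule.threshold:.3f}"
            )
    return fired


rules = [
    AlertRule("accuracy", 0.85, "below"),
    AlertRule("share_missing", 0.05, "above"),
]
latest_metrics = {"accuracy": 0.82, "share_missing": 0.01}
for message in evaluate_alerts(latest_metrics, rules):
    print(message)
```

Keeping the rules as data makes it easy to review them and to alert on only a few metrics while still logging the rest for reports.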

Monitoring unstructured data

Many ML applications involve unstructured data such as text or images. Use cases span sentiment analysis, chatbots, face recognition, content recommendation, text generation using large language models, and more.

Monitoring such ML applications presents specific challenges.  

  • Unstructured data, such as text or images, is typically more complex to analyze. Unlike tabular data, it lacks a predefined structure. You cannot simply apply standard exploratory data analysis techniques like distribution visualization. 
  • Additionally, it is often difficult to evaluate the quality of the models built with unstructured data due to the lack of labels. You typically need a separate labeling process to collect them.
  • Evaluating the output of generative models is especially tricky. If you deal with use cases like composing emails or chatbot conversations, you typically do not have a single correct answer. But there are many acceptable ones! This differs from evaluating error or accuracy in predictive applications, where you have a specific ground truth to compare against.  

Several ML monitoring approaches help tackle these challenges.

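Analyzing raw data. Some methods work with unstructured data directly. For example, you can train a domain classifier to detect data drift: a model that learns to tell reference texts from current ones. If it separates them well, the data has changed, and the words it relies on give an interpretable view of what changed.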

Monitoring text descriptors. You can generate features on top of text data that help evaluate specific properties of text, such as its length or sentiment. You can then monitor text descriptors to spot changes in data patterns and quality.

Monitoring text data with descriptors
You can monitor text descriptors – features that describe text data.
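Here is a minimal sketch of computing a few text descriptors with plain Python. The descriptor names and the choice of descriptors are assumptions for this example. Sentiment would need a separate model, so it is omitted.

```python
def text_descriptors(text):
    """Compute simple, model-free descriptors for a single text."""
    words = text.split()
    non_letters = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return {
        "char_length": len(text),
        "word_count": len(words),
        # Share of characters that are neither letters nor whitespace.
        "share_non_letters": non_letters / len(text) if text else 0.0,
    }


def mean_descriptors(texts):
    """Average each descriptor over a non-empty batch of texts."""
    rows = [text_descriptors(t) for t in texts]
    return {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}


reference_batch = ["Where is my order?", "How do I reset my password?"]
current_batch = ["???", "HELP!!! 12345", ""]
print(mean_descriptors(reference_batch))
print(mean_descriptors(current_batch))
```

Once the texts are reduced to numbers like these, you can track them over time in the same way as tabular features.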

Monitoring embeddings. Some methods allow monitoring embeddings – vector representations of unstructured data – to detect shifts in the model inputs or outputs. For example, you can apply distance metrics like Euclidean or cosine distance or model-based drift detection.
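One simple way to apply a distance metric is to compare the average embedding of a reference batch with the average embedding of a current batch. The sketch below does this in plain Python. Comparing batch means is an assumption made for this example and only one of several possible approaches. The vectors are toy values, and the mean vectors are assumed to be non-zero.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(reference, current):
    """Distances between the mean reference and mean current embeddings."""
    ref_mean = mean_vector(reference)
    cur_mean = mean_vector(current)
    return {
        "cosine": cosine_distance(ref_mean, cur_mean),
        "euclidean": math.dist(ref_mean, cur_mean),  # available in Python 3.8+
    }


reference_embeddings = [[1.0, 0.0], [0.9, 0.1]]
current_embeddings = [[0.1, 0.9], [0.0, 1.0]]
print(embedding_drift(reference_embeddings, current_embeddings))
```

Both distances are zero when the batch means coincide. A drift threshold has to be calibrated for the specific embedding model, since typical distances vary between models.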

Want a deeper dive? Check out this blog on monitoring embeddings and monitoring text data with descriptors, or a complete section on NLP and LLM monitoring in our course.

Model monitoring architectures

Batch and near real-time ML monitoring are two alternative architectures to power the backend of ML monitoring.

Batch monitoring involves executing evaluation jobs on a cadence or responding to a trigger. For example, you can run daily monitoring jobs by querying model prediction logs and computing the model quality or data quality metrics. This method is versatile and suitable for batch data pipelines and online ML services. 

Running monitoring jobs is generally easier than maintaining a continuous ML monitoring service. Another benefit of designing ML monitoring as a set of batch jobs is that it allows combining immediate (done at serving time) and delayed (done when the labels arrive) monitoring within the same architecture. You can also run data validation jobs alongside monitoring. 
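The sketch below shows how a single batch job might combine immediate and delayed checks. The log layout (dicts with `id`, `features`, and `prediction`), the `labels_by_id` mapping, and the metric names are assumptions for this example. In practice, the job would query a prediction store and be scheduled by a workflow orchestrator.

```python
def run_batch_monitoring(prediction_log, labels_by_id):
    """Compute immediate data quality checks and, where labels exist, accuracy."""
    n_rows = len(prediction_log)

    # Immediate: computable at serving time, no labels needed.
    rows_with_missing = sum(
        1 for row in prediction_log
        if any(value is None for value in row["features"].values())
    )
    result = {
        "row_count": n_rows,
        "share_rows_with_missing": rows_with_missing / n_rows if n_rows else 0.0,
    }

    # Delayed: only for the rows whose true labels have already arrived.
    labeled = [
        (row["prediction"], labels_by_id[row["id"]])
        for row in prediction_log
        if row["id"] in labels_by_id
    ]
    result["labeled_count"] = len(labeled)
    result["accuracy"] = (
        sum(pred == label for pred, label in labeled) / len(labeled) if labeled else None
    )
    return result


prediction_log = [
    {"id": 1, "features": {"age": 34, "plan": "pro"}, "prediction": 1},
    {"id": 2, "features": {"age": None, "plan": "free"}, "prediction": 0},
    {"id": 3, "features": {"age": 51, "plan": "free"}, "prediction": 0},
    {"id": 4, "features": {"age": 27, "plan": "pro"}, "prediction": 1},
]
labels_by_id = {1: 1, 2: 1}  # labels for rows 3 and 4 have not arrived yet
print(run_batch_monitoring(prediction_log, labels_by_id))
```

Re-running the same job later, once more labels arrive, fills in the quality metrics without changing the architecture.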

However, batch monitoring introduces some delay in metric computation. It also requires data engineering resources and expertise in workflow orchestrators.

Near real-time model monitoring architecture
Example of a near real-time model monitoring architecture.

Real-time (streaming) model monitoring expects that you send the data directly from your ML service to the monitoring service and maintain an ML monitoring service that continuously computes and publishes ML quality metrics. This architecture is suitable for online ML prediction services and allows for detecting issues like missing data close to real-time. 
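To illustrate the continuous computation, here is a minimal sketch of a sliding-window monitor that tracks the share of recent requests with missing feature values. The class name, the window size, and the feature layout are hypothetical. A real monitoring service would also receive events over the network and publish the metric to a dashboard or alerting system.

```python
from collections import deque


class SlidingWindowMissingMonitor:
    """Tracks the share of the most recent requests that had a missing feature."""

    def __init__(self, window_size=100):
        # deque with maxlen drops the oldest entry automatically.
        self.window = deque(maxlen=window_size)

    def update(self, features):
        """Record one incoming request and return the current missing share."""
        has_missing = any(value is None for value in features.values())
        self.window.append(has_missing)
        return sum(self.window) / len(self.window)


monitor = SlidingWindowMissingMonitor(window_size=4)
incoming_requests = [
    {"age": None, "plan": "pro"},
    {"age": 40, "plan": "free"},
    {"age": 22, "plan": "free"},
    {"age": 35, "plan": "pro"},
    {"age": 58, "plan": "pro"},
]
for features in incoming_requests:
    print(monitor.update(features))
```

The metric updates with every request, which is what allows near real-time detection. The trade-off is that the service holding this state has to run and be maintained continuously.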

The downsides are that the real-time ML monitoring architecture might be more costly to operate in terms of engineering resources, and you may still need batch monitoring pipelines for delayed ground truth.

Ultimately, the choice between these approaches depends on your specific requirements, resources, model deployment formats, and the necessity for near real-time issue detection.

Model monitoring with Evidently

Evidently is an open-source Python library for ML model monitoring. 

It helps implement testing and monitoring for production machine learning models, including different model types (classification, regression, ranking) and data types (texts, embeddings, tabular data).

Evidently ML monitoring dashboard
Example monitoring dashboards to display failed data tests over time.

You can use Evidently to:

  • Generate Reports to explore and debug ML model quality or data. 
  • Implement checks as part of your prediction pipelines using Test Suites.
  • Deploy a live monitoring dashboard to track how metrics change over time and build a continuous monitoring process for batch or real-time ML models.

To get started with Evidently open-source, check out this Quickstart tutorial. 

