How to set up ML Monitoring with Evidently. A tutorial from CS 329S: Machine Learning Systems Design.

Last updated: April 9, 2025
Published: June 3, 2022

Our CTO Emeli Dral gave a tutorial on how to use Evidently at the Stanford Winter 2022 course CS 329S on Machine Learning Systems Design. In this blog, we sum up the tutorial and walk you through an example of how to set up batch ML monitoring using open-source tools. Here are a few useful links:

Code example from the Evidently tutorial
Video recording of the Evidently tutorial
Complete course syllabus, by Chip Huyen

Fun fact: the tutorial was recorded at 3am local time due to the time difference! Kudos to Emeli for managing to walk you through model monitoring basics even in the middle of the night.

What is this tutorial about?

This entry-level tutorial introduces you to the basics of ML monitoring. It requires some knowledge of Python and experience in training ML models.

During this tutorial, you will learn:

Which factors to consider when setting up ML monitoring
How to generate ML model performance dashboards with Evidently
How to investigate the reasons for the model quality drop
How to customize the ML monitoring dashboard to your needs
[briefly] How to automate ML performance checks with MLflow or Airflow

By the end of this tutorial, you will know how to set up ML model monitoring using Evidently for a single ML model that you use in batch mode.

Let's proceed!

You can also watch a video version of this tutorial.

The code is slightly different in the video (the blog and example were updated following the release of newer versions of Evidently), but the workflow and principles are just the same.

What affects the ML monitoring setup?

There is no one-size-fits-all ML monitoring setup. Here are some of the factors that have an impact on it.

ML service deployment. One can implement the machine learning service in many ways, including:

A simple Python script or a Jupyter notebook. You can use the model this way when experimenting or running your model in shadow mode.
A batch inference pipeline that includes multiple steps. You can orchestrate them through a workflow manager like Airflow.
A real-time production service. You can expose the model as an API and serve it, including under high load.

ML feedback loop. An ML service might have an immediate feedback loop or a delayed one. There is an excellent example of this difference in Chip's lecture notes on data drift. If you predict the arrival time on Google Maps, you will soon know how long the route took. If you have a translation system like Google Translate, your feedback comes much later.

Model environment. An ML model can operate in very different environments. It can vary from predictable and stable, like in a manufacturing process, to somewhat volatile, e.g., in the case of user behavior.

Service criticality. Finally, ML use cases have different importance and risks. The cost of each prediction error may differ from almost zero (think content recommendation) to very high (e.g., healthcare applications).

All these factors influence the design of ML monitoring. It should match the way you deploy the model, the environment it operates in, and its importance.

What makes an ML monitoring approach a good one? Here are some ideas.

It matches the ML service complexity. ML monitoring setup and maintenance should NOT be more complicated than the ML service itself. Keep it as simple as you can.
It reflects the risks. ML monitoring should cover the known risks and model failure modes. This can be highly context-specific.
It helps avoid alert fatigue. ML monitoring should not overload you with false alerts or track hundreds of metrics at once.
It is easy to configure. A data scientist who built the model should be able to configure the ML-specific part of the monitoring as they know what "acceptable model quality" is.
It can be expanded or replaced. Many ML-based services become more complex with time. It is useful when monitoring is flexible, and you can extend or replace it as needs change.

A practical example

We will create a toy ML model and walk through how to validate it and define the monitoring approach.

Here is the use case. Imagine that you have:

A forecasting task. You want to build a service that predicts demand for the city bike rentals depending on the weather conditions and season. It is a regression model. The goal is to optimize the number of bikes at the pickup location.
A batch model. Before you deploy the model into wide use and implement a stable service, you want to test if it serves the purpose. You can use the model in batch mode, making daily predictions.
A delayed feedback loop. The ground truth is delayed. You only receive feedback by the end of the week.

How can you evaluate this model and design the basics of ML monitoring for the initial trial run? We will work with this example notebook to explore it.

⚠️ Disclaimer:
This example uses the Evidently API as available in version 0.6.7 or lower. Please ensure you are using the correct version when running this example. For updated and new examples, visit our documentation.

Data prep and model training

First, let's quickly build the model and evaluate its quality. You can't jump to monitoring before looking at the test model performance! Model evaluation is an essential step, as it helps shape the expectations on model quality in production.

You can follow along with the code using the example notebook. In this blog, we'll only highlight selected steps.

You will use a publicly available dataset on bike rentals from the UCI machine learning repository. Here are some transformations to kick off the analysis:

Add DateTime to the index. It will be easier to select data batches for a particular period when imitating the model application.
Split the dataset into "reference" and "current." The reference data is used to train the model, and the current data is used to imitate the model application.
Specify which features are numerical and categorical. That will be important when you look for feature drift.
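
The three preparation steps above could look roughly like the sketch below. It uses only the standard library so that it runs without extra dependencies; the example notebook works with pandas instead. The column names (`dteday`, `temp`, `cnt`), the synthetic values, and the exact split dates are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical hourly records standing in for the bike rental data.
start = datetime(2011, 1, 1)
records = [
    {"dteday": start + timedelta(hours=h), "temp": 0.2, "cnt": 10 + h % 24}
    for h in range(24 * 59)  # January and February 2011
]

# Step 1: index the rows by timestamp so a period can be selected by date.
by_time = {row["dteday"]: row for row in records}


def select_period(data, first, last):
    """Return rows with first <= timestamp <= last, in time order."""
    return [data[ts] for ts in sorted(data) if first <= ts <= last]


# Step 2: "reference" is used for training, "current" imitates production.
reference = select_period(by_time, datetime(2011, 1, 1), datetime(2011, 1, 28, 23))
current = select_period(by_time, datetime(2011, 1, 29), datetime(2011, 2, 28, 23))

# Step 3: declare feature types explicitly; drift checks depend on them.
numerical_features = ["temp"]
categorical_features = []

print(len(reference), len(current))
```

Keeping the timestamp as the lookup key is what makes it cheap to cut out one "production" week at a time later on.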

You will then train the model on January data using a standard random forest model. Remember that it is a toy application! In real life, you'd probably want to take a longer training period and be more intentional about feature selection and choice of algorithm.

Model validation

After you have trained the model and generated the prediction for the last week of January (the validation set), you should create a joint dataframe that contains both features and predictions.

If you are working with a real use case, you might already have model prediction logs available in CSV or as a pandas.DataFrame. That is the input required for the next step of the analysis.

Let's generate the performance report!

from evidently.report import Report
from evidently.metric_preset import RegressionPreset

# Compare the training data (reference) with the validation data (current).
regression_performance_report = Report(metrics=[
    RegressionPreset(),
])

regression_performance_report.run(reference_data=X_train.sort_index(), current_data=X_test.sort_index(),
                                  column_mapping=column_mapping)
regression_performance_report

Evidently can spin up a visual report that evaluates the regression model performance. (There is a classification counterpart, too! See the documentation.) By default, this report shows a rich set of metrics and visualizations for two datasets: the train and test datasets in this case.

You can get the output directly in the notebook cell or export it as an HTML file. You can then save it in your file system and open it in the browser. It contains a lot of widgets that detail model performance, starting with quality metrics.

The mean error on both the training and test data is not too bad!

You can visually confirm it. The predicted versus actual values line up diagonally for training and test data.

The model error is symmetric.

There are more plots to explore, including error normality and under- and overestimation segments. You can dig in if you want to understand the model quality better!
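
To make the headline numbers in the report less of a black box, here is a minimal sketch of how such regression metrics can be computed by hand. It is not Evidently's implementation: the function name and the sample values are hypothetical, and the error is assumed to be defined as predicted minus actual.

```python
from statistics import mean


def regression_quality(actual, predicted):
    """Compute basic error metrics; error is defined as predicted - actual."""
    errors = [p - a for a, p in zip(actual, predicted)]
    return {
        # Signed mean error: a value far from zero indicates bias.
        "mean_error": mean(errors),
        # Mean absolute error: the typical size of a miss.
        "mean_abs_error": mean([abs(e) for e in errors]),
        # Share of predictions below the actual value.
        "share_underestimated": sum(e < 0 for e in errors) / len(errors),
    }


# Hypothetical values for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
metrics = regression_quality(actual, predicted)
print(metrics)
```

A mean error close to zero together with a roughly even split between under- and overestimation is what a symmetric error looks like in numbers.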

Production model training

Now you know what to expect in terms of model quality. You can train the model on the whole "reference" period, to make use of all the data we have.

After you train the model, you can repeat the model evaluation, at least to ensure that you haven't made any technical errors.

You can generate the Evidently report once again.

regression_performance_report = Report(metrics=[
    RegressionPreset(),
])

regression_performance_report.run(reference_data=None, current_data=reference,
                                  column_mapping=column_mapping)
regression_performance_report

This sanity check confirms that the model is good to go!

Monitoring ML model in production

Let's go to the "production" phase. In this demo, you remain in the notebook environment. In reality, you might deploy the model as a batch pipeline using a tool like Airflow. It would generate new predictions every time the new daily batch of data arrives and write it to a database. Then, you might have a weekly DAG for when you receive the feedback data and evaluate the model quality as a part of the process.
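
The delayed feedback loop described above can be sketched as a small join step. This is only an illustration under stated assumptions: the in-memory dictionaries stand in for a database, and the function name and values are hypothetical.

```python
from datetime import date

# Predictions are written once per daily batch (hypothetical log).
prediction_log = {
    date(2011, 2, 7): 120.0,
    date(2011, 2, 8): 135.0,
    date(2011, 2, 9): 128.0,
}

# Ground truth arrives later, for example at the end of the week.
weekly_feedback = {
    date(2011, 2, 7): 131.0,
    date(2011, 2, 8): 150.0,
}


def join_feedback(predictions, actuals):
    """Pair each prediction with its actual value once feedback exists."""
    scored, pending = [], []
    for day in sorted(predictions):
        if day in actuals:
            scored.append((day, predictions[day], actuals[day]))
        else:
            pending.append(day)
    return scored, pending


scored, pending = join_feedback(prediction_log, weekly_feedback)
print(len(scored), "scored,", len(pending), "still waiting for feedback")
```

Only the scored rows can feed model quality metrics; the pending rows are where input data checks have to fill the gap.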

Of course, you'd want to automate the process (we'll cover it later) and not run the analysis manually and look at the dashboards all the time. However, visually exploring the model's performance can make sense when you have just deployed it. You might want to closely understand what goes right and wrong and calibrate your expectations around metrics, important segments, etc.

Let's watch the model closely for a few weeks to ensure it doesn't do anything too crazy.

Week 1. Here is the model performance for the first week of the "production." The mean error is a bit worse but probably still in a normal range.

Week 2. By this moment, you might know which exact metrics and plots you want to look at every time, and you can better customize your Evidently report.

You can select only the components you like and make your performance report:

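from evidently.metrics import RegressionQualityMetric, RegressionErrorPlot, RegressionErrorDistribution

regression_performance_report = Report(metrics=[
    RegressionQualityMetric(),
    RegressionErrorPlot(),
    RegressionErrorDistribution(),
])

# Week 2 of "production": compare it against the reference period.
regression_performance_report.run(
    reference_data=reference,
    current_data=current.loc['2011-02-07 00:00:00':'2011-02-14 23:00:00'],
    column_mapping=column_mapping,
)
regression_performance_report.show()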

Looking at this new simple report, you can also notice that the model performance got worse.

Week 3. Now, the model is truly broken. The error is also skewed, as the model underestimates the demand. The error distribution shifted left.
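
The week-by-week comparison above can also be expressed as a simple numeric check. The sketch below is an assumption about how one might do it, not something taken from the notebook: it measures how far the weekly mean error moved from the reference mean error, in units of the reference standard deviation. The `tolerance` value and the sample errors are hypothetical.

```python
from statistics import mean, stdev


def error_shift(reference_errors, current_errors, tolerance=1.0):
    """Return the standardized shift of the mean error and a flag.

    Errors are predicted - actual, so a negative shift means the model
    started to underestimate.
    """
    ref_mean = mean(reference_errors)
    ref_std = stdev(reference_errors)
    shift = (mean(current_errors) - ref_mean) / ref_std
    return shift, abs(shift) > tolerance


# Hypothetical errors for illustration only.
reference_errors = [-2, 1, 0, 2, -1]
week_1_errors = [-1, 1, -2, 0, 1]
week_3_errors = [-9, -7, -8, -6, -10]

shift_1, flagged_1 = error_shift(reference_errors, week_1_errors)
shift_3, flagged_3 = error_shift(reference_errors, week_3_errors)
print(round(shift_1, 2), flagged_1)
print(round(shift_3, 2), flagged_3)
```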

What has happened? You can already suspect the culprit: the use case has a clear seasonality. Let's explore how you can do the root-cause analysis to confirm.

Why did the ML model quality drop?

When the ML model performance goes down, the data is often to blame (if you face a data quality issue). Otherwise, it can provide the necessary context to interpret the reasons (for example, if you face data or concept drift). Looking at the data is always the first step!

To find the root cause of the decay, you can generate the Evidently Data Drift report.

For simplicity, you can do it only for the numerical features. Categorical features in this dataset are not so informative in terms of drift as they basically tell you the day of the week and season.

Here is the data drift evaluation result if you compare the previous week to the training data. Five out of seven features shifted.

You can interpret what's happening: the weather has changed. It literally got warmer.

Here is an example of the temperature distribution.

You can perform a similar analysis for other features.

Indeed, the reason for the model quality decay is seasonality. The model was trained using data only for a short period and doesn't handle the seasonality well.
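
To illustrate what "a feature shifted" means, here is a dependency-free sketch of a per-feature drift check based on the two-sample Kolmogorov-Smirnov statistic and its standard large-sample critical value. It is a simplified stand-in, not the logic of the Evidently Data Drift report; the feature names and synthetic values are assumptions.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref, cur = sorted(reference), sorted(current)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


def is_drifted(reference, current, alpha=0.05):
    """Large-sample KS decision: drift if D exceeds the critical value."""
    n, m = len(reference), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


def share_of_drifted_features(reference, current, features, alpha=0.05):
    """Return the share of drifted features and their names."""
    drifted = [f for f in features if is_drifted(reference[f], current[f], alpha)]
    return len(drifted) / len(features), drifted


# Synthetic data: "temp" moves up by 0.5, "windspeed" barely changes.
reference_data = {
    "temp": [i / 100 for i in range(100)],
    "windspeed": [i / 100 for i in range(100)],
}
current_data = {
    "temp": [0.5 + i / 100 for i in range(100)],
    "windspeed": [(i + 0.5) / 100 for i in range(100)],
}

share, drifted = share_of_drifted_features(
    reference_data, current_data, ["temp", "windspeed"]
)
print(share, drifted)
```

The share of drifted features is a convenient single number to track when the ground truth is not yet available.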

How to define the ML monitoring approach

Now that you have explored the real-world behavior of the model, you can define what exactly to monitor. Assuming you want to continue using this model, you will probably care about two things:

Model quality metrics, as compared against the training or earlier performance.
Data drift to detect the model performance decay before getting the ground truth values. This can also help with debugging.

It is a reasonably straightforward approach. You deal with a not-so-critical use case. There are also no particular segments in the data. You'd probably use the model monitoring to make a few decisions:

Is the model working? To have peace of mind and keep the model running.
Can I trust the model? Should I stop it? You can decide whether to trust the model or maybe better not use it at all. For example, if things change drastically, you can resort to some simpler rules and statistics instead or work to rebuild the model.
Should I retrain the model? You can use monitoring to decide whether it's time to retrain the model and whether the new data is good enough to use in retraining.

You can also generate a dashboard to report the model performance to your business stakeholders or other team members.

Having this goal in mind, you can set up a customized ML monitoring dashboard, including only the widgets and metrics you need.

How to customize the ML monitoring dashboard

Evidently is a flexible tool, and you can customize pretty much everything, from the statistical tests you use to the widgets you look at.

Evidently has the default drift detection logic. It automatically chooses one of several statistical tests based on the feature type, number of observations, and number of unique values. You can override this logic and make your own choice. For example, if you know that a different test would better fit the distribution of a particular feature.
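
The idea of "automatic choice plus manual override" can be sketched as follows. The rules and cut-offs below are illustrative assumptions only and should not be read as Evidently's actual defaults; check the documentation for those. Both function names are hypothetical.

```python
def choose_stattest(feature_type, n_observations, n_unique):
    """Pick a drift test name from simple, hypothetical rules."""
    if n_unique <= 2:
        return "z-test for proportions"
    if feature_type == "cat" or n_unique <= 5:
        return "chi-squared"
    if n_observations <= 1000:
        return "Kolmogorov-Smirnov"
    return "Wasserstein distance"


def resolve_stattest(feature, defaults, overrides):
    """A per-feature override wins over the automatically chosen test."""
    return overrides.get(feature, defaults[feature])


# Hypothetical feature summaries: (type, number of rows, unique values).
features = {
    "temp": ("num", 672, 300),
    "season": ("cat", 672, 4),
}
defaults = {name: choose_stattest(*summary) for name, summary in features.items()}
overrides = {"temp": "Anderson-Darling"}

for name in features:
    print(name, "->", resolve_stattest(name, defaults, overrides))
```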

Here are the changes made in the example notebook:

The Anderson-Darling test is used to detect data drift for all features.
The threshold for the test is set at 0.9: a feature is flagged as drifted when the p-value falls below it.

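import numpy as np
import pandas as pd
from scipy.stats import anderson_ksamp

from evidently.calculations.stattests import StatTest

def _anderson_stat_test(reference_data: pd.Series, current_data: pd.Series, feature_type: str, threshold: float):
    # Pass the two samples as a list of arrays: they may differ in length.
    # Note: by default, SciPy caps this p-value to the range [0.001, 0.25].
    p_value = anderson_ksamp([np.asarray(reference_data), np.asarray(current_data)])[2]
    # Drift is flagged when the p-value falls below the threshold.
    return p_value, p_value < threshold

anderson_stat_test = StatTest(
    name="anderson",
    display_name="Anderson test (p_value)",
    func=_anderson_stat_test,
    allowed_feature_types=["num"]
)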

# options = DataDriftOptions(feature_stattest_func=anderson_stat_test, all_features_threshold=0.9, nbinsx=20)

Next, you can define the composition of the report you'd regularly use. You can unite the Regression Performance and Data Drift parts of the report, keeping only the components you like and using the custom statistical test selected earlier.

the_report = Report(metrics=[
    RegressionQualityMetric(),
    RegressionErrorPlot(),
    RegressionErrorDistribution(),
    DataDriftPreset(stattest=anderson_stat_test, stattest_threshold=0.9),
])


the_report.run(
    reference_data=reference,
    current_data=current.loc['2011-02-14 00:00:00':'2011-02-21 23:00:00'], 
    column_mapping=column_mapping_drift
)
the_report

Here is the new customized dashboard that you can re-use for the regular model monitoring.

How to automate batch ML monitoring

You can, of course, generate the visual dashboards on-demand or, for example, schedule them for every week. But in many cases, you'd want to limit visual exploration only to the instances when something is wrong, and you need to debug the issue.

There are a few ways to automate the checks and integrate them with other open-source tools you use in the ML workflow.

MLflow. In the example notebook, you can see how to log the metrics generated by Evidently with MLflow. In this case, you can get all the same checks as you did in the visual dashboard but export them as JSON and log them to MLflow.

For example, you can log mean error and share of drifted features to MLflow. Here is how it looks in the MLflow interface if you generate the same metrics for each week in the cycle.
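
A minimal sketch of that logging loop is shown below. The shape of the report dictionary and both function names are assumptions for illustration; the real keys depend on the Evidently version and the metrics in the report. The logger is passed in as a plain callable so that the sketch runs without MLflow installed.

```python
def extract_metrics(report_dict):
    """Pull the two numbers to track from a (hypothetical) report dictionary."""
    return {
        "mean_error": round(report_dict["regression"]["mean_error"], 3),
        "share_drifted_features": round(
            report_dict["drift"]["share_of_drifted_columns"], 3
        ),
    }


def log_weekly_metrics(weekly_reports, log_metric):
    """Log each week's metrics, using the week number as the step."""
    for step, report_dict in enumerate(weekly_reports, start=1):
        for name, value in extract_metrics(report_dict).items():
            log_metric(name, value, step)


# Hypothetical weekly results for illustration only.
weekly_reports = [
    {"regression": {"mean_error": -1.2345}, "drift": {"share_of_drifted_columns": 0.4286}},
    {"regression": {"mean_error": -8.7654}, "drift": {"share_of_drifted_columns": 0.7143}},
]

logged = []
log_weekly_metrics(
    weekly_reports,
    lambda name, value, step: logged.append((name, value, step)),
)
print(logged)
```

With MLflow installed, `mlflow.log_metric` accepts a key, a value, and a step, so it could be passed as `log_metric` inside an active run.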

If you want to re-create it, follow the code example. You will also need to install MLflow.

When you run it on the same example data, you can spot that the data was in fact drifting from the beginning. Data drift was detected in the very first week.

Airflow. If you have a batch ML pipeline, you can integrate the data and model quality check as part of the pipeline. You can also add a condition, for example, and get an alert and generate a visual report if the check condition is satisfied (e.g., drift is detected).
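
The conditional step could be sketched as a small decision function. The task names and the threshold are hypothetical. In Airflow, a function of this kind could serve as the `python_callable` of a `BranchPythonOperator`, which runs the task whose id is returned.

```python
def choose_next_task(share_drifted, threshold=0.5):
    """Return the id of the task to run next, based on the drift check."""
    if share_drifted >= threshold:
        # Condition satisfied: build the visual report and notify the team.
        return "generate_report_and_alert"
    # Nothing unusual: finish the pipeline run quietly.
    return "skip_report"


# Hypothetical weekly drift shares for illustration only.
for week, share in enumerate([0.29, 0.43, 0.71], start=1):
    print(f"week {week}: {choose_next_task(share)}")
```

Generating the visual report only on this branch keeps the regular runs cheap and reserves visual debugging for the weeks when something looks wrong.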

Grafana. If you have a real-time prediction service, you can spin up a monitoring service next to it that would collect metrics and quickly push them to a dashboard. Here is a code example of how to do this with Evidently, Prometheus, and Grafana.

Let's recap

If you are thinking about setting up ML monitoring for your model, here are some tips to keep in mind.

Start with model evaluation. Monitoring is, in essence, continuous model evaluation. You should review your test model performance to define expectations for your model quality. You should also consider the model environment and risks to identify potential failure modes.

You can set up monitoring early. Even if you run the model in shadow mode, you can set up regular model performance checks instead of only evaluating the model at the end of the test.

Define a set of metrics/plots you want to see. You can start with a richer set of metrics and then choose the ones you find valuable and actionable. In the minimalist example, you can stick to the same metrics you evaluated in training. However, if you have delayed ground truth, you will probably need to add the extra ones related to the input data (and prediction) distributions.

Standardization. If you are using visual reports or dashboards to evaluate your model performance, it is best to standardize your approach and keep the structure of reports, plots, and naming consistent. This will make it easier to compare the results and share them with the other team members.

Start manual, then automate. It often makes sense to evaluate model performance during the first runs visually. Later, you can adapt your monitoring approach and automate the batch model checks using tools like Airflow.