Our CTO Emeli Dral gave a tutorial on how to use Evidently at the Stanford Winter 2022 course CS 329S on Machine Learning Systems Design. In this blog, we sum up the tutorial and walk you through an example of how to set up batch ML monitoring using open-source tools. Here are a few useful links:
Fun fact: the tutorial was recorded at 3am local time due to the time difference! Kudos to Emeli for managing to walk you through model monitoring basics even in the middle of the night.
This entry-level tutorial introduces you to the basics of ML monitoring. It requires some knowledge of Python and experience in training ML models.
During this tutorial, you will learn:
By the end of this tutorial, you will know how to set up ML model monitoring using Evidently for a single ML model that you use in batch mode.
Let's proceed!
You can also watch a video version of this tutorial.
The code is slightly different in the video (the blog and example were updated following the release of newer versions of Evidently), but the workflow and principles are just the same.
There is no one-size-fits-all ML monitoring setup. Here are some of the factors that have an impact on it.
ML service deployment. One can implement the machine learning service in many ways, including:
ML feedback loop. An ML service might have an immediate feedback loop or a delayed one. There was an excellent example of this difference in Chip's lecture notes on data drift. If you predict the arrival time on Google Maps, you will soon know how long the route took. If you have a translation system like Google Translate, your feedback comes much later.
Model environment. An ML model can operate in very different environments. It can vary from predictable and stable, like in a manufacturing process, to somewhat volatile, e.g., in the case of user behavior.
Service criticality. Finally, ML use cases have different importance and risks. The cost of each prediction error may differ from almost zero (think content recommendation) to very high (e.g., healthcare applications).
All these factors influence the design of ML monitoring. It should match the way you deploy the model, the environment it operates in, and its importance.
What makes an ML monitoring approach a good one? Here are some ideas.
We will create a toy ML model and walk through how to validate it and define the monitoring approach. Here is the use case. Imagine that you have:
How can you evaluate this model and design the basics of ML monitoring for the initial trial run? We will work with this example notebook to explore it.
First, let's quickly build the model and evaluate its quality. You can't jump to monitoring before looking at the test model performance! Model evaluation is an essential step, as it helps shape the expectations on model quality in production.
You can follow along with the code using the example notebook. In this blog, we'll only highlight selected steps.
You will use a publicly available dataset on bike rentals from the UCI machine learning repository. Here are some transformations to kick off the analysis:
You will then train the model on January data using a standard random forest model. Remember that it is a toy application! In real life, you'd probably want to take a longer training period and be more intentional about feature selection and choice of algorithm.
After you have trained the model and generated predictions for the last week of January (the validation set), you should create a joint dataframe that contains both features and predictions.
If you are working with a real use case, you might already have model prediction logs available in CSV or as a pandas.DataFrame. That is the input required for the next step of the analysis.
Let's generate the performance report!
Evidently can spin up a visual report that evaluates the regression model performance. (There is a classification counterpart, too!) By default, this report shows a rich set of metrics and visualizations for two datasets: the train and test datasets in this case.
You can get the output directly in the notebook cell or export it as an HTML file. You can then save it in your file system and open it in the browser. It contains a lot of widgets that detail model performance, starting with quality metrics.
The mean error on both the training and test data is not too bad!
You can visually confirm it. The predicted versus actual values line up diagonally for training and test data.
The model error is symmetric.
There are more plots to explore, including error normality and under- and overestimation segments. You can dig in if you want to understand the model quality better! Here is the complete description of what's included.
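The headline numbers in the report are simple to reproduce by hand, which can help when reading the widgets. Here is a minimal sketch, not Evidently's own code. It assumes the error is defined as prediction minus actual, so a negative mean error means the model underestimates. The function name `regression_summary` and the sample values are made up for illustration.

```python
from statistics import mean


def regression_summary(actual, predicted):
    """Return mean error, mean absolute error, and the share of underestimates."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    # Assumed error convention: prediction minus actual,
    # so negative errors mean the model underestimates.
    errors = [p - a for a, p in zip(actual, predicted)]
    return {
        "mean_error": mean(errors),
        "mean_abs_error": mean(abs(e) for e in errors),
        "share_underestimated": sum(e < 0 for e in errors) / len(errors),
    }


# Made-up hourly rental counts, for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
summary = regression_summary(actual, predicted)
print(summary)
```

A mean error close to zero together with a roughly even split between under- and overestimates is what a symmetric error looks like in numbers.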
Now you know what to expect in terms of model quality. You can train the model on the whole "reference" period, to make use of all the data we have.
After you train the model, you can repeat the model evaluation, at least to ensure that you haven't made any technical errors.
You can generate the Evidently report once again.
This sanity check confirms that the model is good to go!
Let's go to the "production" phase. In this demo, you remain in the notebook environment. In reality, you might deploy the model as a batch pipeline using a tool like Airflow. It would generate new predictions every time the new daily batch of data arrives and write it to a database. Then, you might have a weekly DAG for when you receive the feedback data and evaluate the model quality as a part of the process.
Of course, you'd want to automate the process (we'll cover it later) and not run the analysis manually and look at the dashboards all the time. However, visually exploring the model's performance can make sense when you have just deployed it. You might want to closely understand what goes right and wrong and calibrate your expectations around metrics, important segments, etc.
Let's watch the model closely for a few weeks to ensure it doesn't do anything too crazy.
Week 1. Here is the model performance for the first week of the "production." The mean error is a bit worse but probably still in a normal range.
Week 2. By this moment, you might know which exact metrics and plots you want to look at every time, and you can better customize your Evidently report.
You can select only the components you like and make your performance report:
Looking at this new simple report, you can also notice that the model performance got worse.
Week 3. Now, the model is truly broken. The error is also skewed, as the model underestimates the demand. The error distribution shifted left.
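If you want to track the same trend outside the visual report, you can group the prediction log by week and compare the mean error across batches. The sketch below is one way to do it with the standard library only. It assumes the log is a list of `(day, actual, predicted)` tuples and that the error is prediction minus actual; the helper name and all values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_error_by_week(records):
    """Group (day, actual, predicted) records by ISO week and average the error."""
    errors = defaultdict(list)
    for day, actual, predicted in records:
        year, week, _ = day.isocalendar()
        # Assumed convention: negative error means underestimation.
        errors[(year, week)].append(predicted - actual)
    return {key: mean(values) for key, values in sorted(errors.items())}


# Made-up daily totals, for illustration only.
records = [
    (date(2011, 1, 31), 100, 98),
    (date(2011, 2, 1), 110, 106),
    (date(2011, 2, 7), 150, 130),
    (date(2011, 2, 8), 160, 130),
]
weekly = mean_error_by_week(records)
for (year, week), value in weekly.items():
    print(f"{year}-W{week:02d}: mean error {value}")
```

A mean error that keeps moving further below zero from one week to the next is the numeric counterpart of the error distribution shifting left.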
What has happened? You can already suspect the culprit: the use case has a clear seasonality. Let's explore how you can do the root-cause analysis to confirm.
When ML model performance goes down, the data is often to blame, for example, when you face a data quality issue. In other cases, such as data or concept drift, the data provides the context you need to interpret what is going on. Looking at the data is always the first step!
To find the root cause of the decay, you can generate the Evidently Data Drift report.
For simplicity, you can do it only for the numerical features. Categorical features in this dataset are not so informative in terms of drift as they basically tell you the day of the week and season.
Here is the data drift evaluation result if you compare the previous week to the training data. Five out of seven features shifted.
You can interpret what's happening: the weather has changed. It literally got warmer.
Here is an example of the temperature distribution.
You can perform a similar analysis for other features.
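To make the idea of "share of drifted features" concrete, here is a rough, library-free sketch. It uses a two-sample Kolmogorov-Smirnov test with an asymptotic p-value as a stand-in; this is an assumption for illustration and not necessarily the test Evidently picks for a given feature. The function names, the 0.05 threshold, and the data are all made up.

```python
import math


def ks_two_sample(reference, current):
    """Return the KS statistic and an approximate p-value for two samples."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    statistic = 0.0
    # Walk both sorted samples and track the largest gap between their
    # empirical cumulative distribution functions.
    while i < n and j < m:
        x = min(ref[i], cur[j])
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        statistic = max(statistic, abs(i / n - j / m))
    # Asymptotic p-value from the Kolmogorov distribution.
    scale = math.sqrt(n * m / (n + m))
    lam = (scale + 0.12 + 0.11 / scale) * statistic
    if lam < 0.3:
        return statistic, 1.0  # the series below is numerically 1 here
    p_value = 2 * sum(
        (-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2) for k in range(1, 101)
    )
    return statistic, min(max(p_value, 0.0), 1.0)


def drift_summary(reference, current, alpha=0.05):
    """Return the drifted feature names and their share among all features."""
    drifted = [
        name
        for name in reference
        if ks_two_sample(reference[name], current[name])[1] < alpha
    ]
    return drifted, len(drifted) / len(reference)


# Made-up feature values: "temp" shifts upward, "hum" stays the same.
reference = {"temp": list(range(100)), "hum": list(range(100))}
current = {"temp": list(range(50, 150)), "hum": list(range(100))}
drifted, share = drift_summary(reference, current)
print(drifted, share)
```

In practice you would let the library run the tests; the point of the sketch is only that each feature gets its own comparison against the reference data, and the share is the fraction of features flagged.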
Indeed, the reason for the model quality decay is seasonality. The model was trained using data only for a short period and doesn't handle the seasonality well.
Now that you have explored the real-world behavior of the model, you can define what exactly to monitor. Assuming you want to continue using this model, you will probably care about two things:
It is a reasonably straightforward approach. You deal with a not-so-critical use case. There are also no particular segments in the data. You'd probably use the model monitoring to make a few decisions:
You can also generate a dashboard to report the model performance to your business stakeholders or other team members.
Having this goal in mind, you can set up a customized ML monitoring dashboard, including only the widgets and metrics you need.
Evidently is a flexible tool, and you can customize pretty much everything, from the statistical tests you use to the widgets you look at.
Evidently has default drift detection logic. It automatically chooses one of several statistical tests based on the feature type, number of observations, and number of unique values. You can override this logic and make your own choice, for example, if you know that a different test would better fit the distribution of a particular feature.
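To picture what "choose a test automatically, unless overridden" can look like, here is a hypothetical dispatch function. The thresholds (`large_sample=1000`, `few_unique=5`) and the returned labels are illustrative assumptions, not Evidently's actual rules or identifiers; check the Evidently documentation for the real defaults.

```python
def choose_drift_test(values, is_categorical=False, override=None,
                      large_sample=1000, few_unique=5):
    """Pick a drift test label for one feature (illustrative rules only)."""
    # An explicit per-feature choice always wins over the heuristic.
    if override is not None:
        return override
    n_unique = len(set(values))
    # Assumption: on large samples, use a distance measure instead of a
    # significance test, which tends to flag even tiny shifts.
    if len(values) > large_sample:
        return "Jensen-Shannon divergence" if is_categorical else "Wasserstein distance"
    if n_unique <= 2:
        return "proportions z-test"
    if is_categorical or n_unique <= few_unique:
        return "chi-squared test"
    return "Kolmogorov-Smirnov test"


# Made-up feature values, for illustration only.
print(choose_drift_test(list(range(50))))
print(choose_drift_test([0, 1] * 20))
print(choose_drift_test(list(range(50)), override="Anderson-Darling test"))
```

The override argument mirrors the idea in the text: the default heuristic is a starting point, and you replace it per feature when you know the data better.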
Here are the changes made in the example notebook:
Next, you can define the composition of the report you'd regularly use. You can unite the Regression Performance and Data Drift parts of the report, keeping only the components you like and using the custom statistical test selected earlier.
Here is the new customized dashboard that you can re-use for the regular model monitoring.
You can, of course, generate the visual dashboards on-demand or, for example, schedule them for every week. But in many cases, you'd want to limit visual exploration only to the instances when something is wrong, and you need to debug the issue.
There are a few ways to automate the checks and integrate them with other open-source tools you use in the ML workflow.
MLflow. In the example notebook, you can see how to log the metrics generated by Evidently with MLflow. In this case, you can get all the same checks as you did in the visual dashboard but export them as JSON and log them to MLflow.
For example, you can log mean error and share of drifted features to MLflow. Here is how it looks in the MLflow interface if you generate the same metrics for each week in the cycle.
If you want to re-create it, follow the code example. You will also need to install MLflow.
When you run it on the same example data, you can spot that the data was in fact drifting from the beginning. Data drift was detected in the very first week.
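The logging step itself can be very small. The sketch below is not the code from the example notebook: it assumes you have already extracted the mean error and the number of drifted features for each week, and it uses made-up values, a hypothetical experiment name, and hypothetical helper names. The MLflow import is deferred so that the rest of the snippet runs even if MLflow is not installed.

```python
def weekly_payload(week_label, mean_error, drifted_features, total_features):
    """Bundle one week's monitoring metrics under a run name."""
    return {
        "run_name": week_label,
        "metrics": {
            "mean_error": float(mean_error),
            "drift_share": drifted_features / total_features,
        },
    }


def log_to_mlflow(payloads, experiment_name="bike-demand-monitoring"):
    """Log each weekly payload as a separate MLflow run."""
    import mlflow  # deferred: requires `pip install mlflow`

    mlflow.set_experiment(experiment_name)
    for payload in payloads:
        with mlflow.start_run(run_name=payload["run_name"]):
            for name, value in payload["metrics"].items():
                mlflow.log_metric(name, value)


# Made-up weekly values, for illustration only.
payloads = [
    weekly_payload("week_1", -2.0, 3, 7),
    weekly_payload("week_2", -9.5, 4, 7),
    weekly_payload("week_3", -25.0, 5, 7),
]
# log_to_mlflow(payloads)  # uncomment once MLflow is installed
```

Logging one run per week keeps the metrics comparable side by side in the MLflow interface.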
Airflow. If you have a batch ML pipeline, you can integrate the data and model quality check as part of the pipeline. You can also add a condition, for example, to send an alert and generate a visual report only when the check fires (e.g., drift is detected). This part is not covered in the video, but you can reproduce the example from the documentation. You will also need to install Airflow.
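The condition itself can be a plain Python function that a pipeline task calls after each batch. Here is a minimal sketch with hypothetical thresholds and names; it is not taken from the Evidently or Airflow documentation, and the right limits depend on what you observed during model evaluation.

```python
def check_batch(metrics, max_abs_mean_error=10.0, max_drift_share=0.5):
    """Return the list of reasons to alert; an empty list means all clear."""
    reasons = []
    # Hypothetical limits: tune them to your own baseline.
    if abs(metrics["mean_error"]) > max_abs_mean_error:
        reasons.append("mean error out of range")
    if metrics["drift_share"] > max_drift_share:
        reasons.append("data drift detected")
    return reasons


# Made-up metrics for a healthy and a degraded batch.
healthy = check_batch({"mean_error": 1.0, "drift_share": 0.1})
degraded = check_batch({"mean_error": -25.0, "drift_share": 5 / 7})
print(healthy, degraded)
```

A downstream task can then send the alert and build the visual report only when the returned list is not empty.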
Grafana. If you have a real-time prediction service, you can spin up a monitoring service next to it that would collect metrics and quickly push them to a dashboard. Here is a code example of how to do this with Evidently, Prometheus, and Grafana.
If you are thinking about setting up ML monitoring for your model, here are some tips to keep in mind.
Start with model evaluation. Monitoring is, in essence, continuous model evaluation. You should review your test model performance to define expectations for your model quality. You should also consider the model environment and risks to identify potential failure modes.
You can set up monitoring early. Even if you run the model in shadow mode, you can set up regular model performance checks instead of only evaluating the model at the end of the test.
Define a set of metrics/plots you want to see. You can start with a richer set of metrics and then choose the ones you find valuable and actionable. In the minimalist example, you can stick to the same metrics you evaluated in training. However, if you have delayed ground truth, you will probably need to add the extra ones related to the input data (and prediction) distributions.
Standardization. If you are using visual reports or dashboards to evaluate your model performance, it is best to standardize your approach and keep the structure of reports, plots, and naming consistent. This will make it easier to compare the results and share them with the other team members.
Start manual, then automate. It often makes sense to evaluate model performance during the first runs visually. Later, you can adapt your monitoring approach and automate the batch model checks using tools like Airflow.