In this code tutorial, you will learn how to run ad hoc evaluations of your ML model data and set up a live ML monitoring dashboard using open-source tools.
To complete the tutorial, you must have basic Python knowledge and be comfortable using the terminal. You will go through the end-to-end process of creating an ML monitoring dashboard for a toy model and view it in your browser.
We’ll use Evidently, an open-source Python library for ML model monitoring.
Want to go straight to code? Here is a Python script we will use to create a live ML monitoring dashboard.
Building an ML model is not a one-and-done process. Many things can go wrong once you deploy a model to the real world: the input data can change, upstream pipelines can break, or real-world patterns can shift away from what the model learned during training.
To address all this, you need ML monitoring – a way to oversee and evaluate ML models in production. Monitoring is an essential component of MLOps. Once you deploy the models, you must keep tabs on them!
By implementing ML monitoring, you can spot data quality issues, data drift, and model quality decay as they happen. Whether it's incorrect inputs or shifting trends, monitoring acts as an early warning system.
It’s easy to talk about the benefits of monitoring, but, to be fair, this is often the least loved task. Building models is much more fun than babysitting them!
Building a complete monitoring system also sounds like a lot of work. You need both metric computation and a visualization layer, which might require stitching together different tools.
However, there are easier ways to start.
In this tutorial, we’ll work with Evidently – an open-source MLOps tool that helps evaluate, test, and monitor ML models. We aim to cover the core ML monitoring workflow and how to implement it in practice with the least effort.
To start, you can generate monitoring reports ad hoc. A more "manual" approach can make sense if you have only just deployed your ML model. It also helps shape expectations about the model and data quality before you automate the metric tracking.
You can query your model logs and generate a report with metrics you care about. You can explore them in the Jupyter notebook or save them as HTML files to share with others.
Let’s take a look at how it can work! First, install Evidently. If you work in Google Colab, run:
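```
!pip install evidently
```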
If you work in a Jupyter notebook, you must install nbextension. Check out the detailed installation instructions.
Next, import a few libraries and components required to run an example.
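Here is a minimal set of imports for this example; the Evidently module paths below follow the 0.4.x releases and may look different in later versions:

```python
from sklearn import datasets

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
```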
We will import a toy dataset as a demo. We’ll use the "adult" dataset from OpenML.
In practice, you should use the model prediction logs. They can include input data, model predictions, and true labels or actuals, if available.
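For instance, you can pull the toy dataset from OpenML with scikit-learn and work with it as a pandas DataFrame:

```python
# Fetch the "adult" dataset from OpenML; .frame gives a pandas DataFrame
adult_data = datasets.fetch_openml(name="adult", version=2, as_frame="auto")
adult = adult_data.frame
```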
Now, let’s split our toy data into two. This way, we will create a baseline dataset to compare against: we call it "reference." The second dataset is the current production data. To make things more interesting, we also introduce some changes to the data by filtering our selection using one of the columns. This is a quick way to add some artificial drift.
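One way to do it is to filter by the education column so that the "current" data only contains a subset of education levels (the column name comes from the OpenML "adult" dataset):

```python
# Baseline ("reference") data: education levels NOT in the selected list
adult_ref = adult[~adult.education.isin(["Some-college", "HS-grad", "Bachelors"])]

# "Current" data: only the selected education levels, which adds artificial drift
adult_cur = adult[adult.education.isin(["Some-college", "HS-grad", "Bachelors"])]
```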
In practice, you can pick comparison windows based on your assumptions about the data stability: for example, you can compare last month to training.
Now, let's get to the data drift report. You must create a Report object, specify the analytical preset you want to include (in this case, the "Data Drift Preset"), and pass the two datasets you compare as "reference" and "current."
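A minimal sketch:

```python
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=adult_ref, current_data=adult_cur)
report.show(mode="inline")  # renders the report inside the notebook cell
```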
Once you run the code, you will see the Report directly in your notebook. Evidently will check for distribution drift in all the individual features and sum it up in a report.
If you click on individual features, it will show additional plots to explore the specific distributions.
In this example, we detect distribution drift in many of the features – since we artificially selected only a subset of the data with a particular education level.
If you want to share the Report, you can also export it as a standalone HTML file.
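For example:

```python
report.save_html("data_drift_report.html")
```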
Evidently has other pre-built Reports and Tests Suites. For example, related to data or model quality. You can also design a custom Report or a Test Suite by picking individual checks. Check out the complete introductory tutorial for more details.
This tutorial will focus on Reports, but Test Suites work similarly. The difference is that Tests allow you to verify a condition explicitly: "Is my feature within a specified min-max range?" On the other hand, reports compute and visualize metrics without expectations: "Here is the min, max, mean, and number of constant values, and this is how the distribution looks."
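For illustration, a pre-built Test Suite for data drift looks roughly like this (module paths again follow the 0.4.x API):

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

tests = TestSuite(tests=[DataDriftTestPreset()])
tests.run(reference_data=adult_ref, current_data=adult_cur)
tests.show(mode="inline")  # each test comes back with an explicit pass/fail status
```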
You can go pretty far with report-based monitoring, especially if you work with batch models. For example, you can generate weekly reports and log them to MLflow. You can run Reports on a schedule using an orchestrator tool like Airflow and even build a conditional workflow – for example, generate a notification if drift is detected in your dataset.
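As an illustration of such a conditional step, you can export the computed Report as a Python dictionary and branch on the result. The exact key path below is an assumption to verify against your Evidently version, and notify_team is a hypothetical helper:

```python
result = report.as_dict()

# With DataDriftPreset, the dataset-level drift summary is typically the first metric;
# check the structure for your Evidently version before relying on it.
if result["metrics"][0]["result"]["dataset_drift"]:
    notify_team("Data drift detected in the latest batch")  # hypothetical notification helper
```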
However, this approach has its limitations, especially as you scale. Organizing and navigating multiple reports may be inconvenient. When you compute individual Reports for specific periods, there is also no easy way to track the trends – for example, see how the share of drifting features changes over time.
Here is the next thing to add: a live dashboard!
Evidently has a user interface component that helps track the metrics over time.
Here is how it looks:
It conveniently sits on top of the Reports like the one we just generated, so it’s easy to progress from ad hoc checks to a complete monitoring setup.
Here is the principle behind it: you regularly capture metrics as Evidently Reports or Test Suites, save them as JSON snapshots in a workspace folder, and the dashboard service reads these snapshots to visualize how the metrics evolve over time.
You can run the dashboard locally to take a quick look. Let’s do this right now!
Want to get a web dashboard instead? Sign up for Evidently Cloud.
To start, let’s launch a demo project to see an example monitoring dashboard. We’ll now head from the notebook environment to the Terminal: Evidently will then run as a web application in a browser.
1. Create a virtual environment
This is an optional but highly recommended step. Create a virtual environment and activate it.
Run the following command in the Terminal:
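```
python -m venv venv
source venv/bin/activate
```

On Windows, activate the environment with venv\Scripts\activate instead.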
2. Install Evidently
Now, install Evidently in your environment:
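```
pip install evidently
```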
3. Run the demo project
To launch the Evidently service with the demo project, run:
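```
evidently ui --demo-projects all
```

The demo project flag has changed between releases (earlier versions used a --demo-project flag), so check evidently ui --help if the command errors out.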
To view the Evidently interface, go to http://localhost:8000 in your web browser.
You'll find a ready-made project that displays the performance of a simple model across 20 days. You can switch between tabs, for example, to look at individual Reports. You can even download them! They will be the same as the Data Drift report we generated in the notebook.
Each Report or Test Suite corresponds to a single day of the model’s performance. The monitoring dashboard takes the data from these Reports and presents how metrics evolve.
Do you want to run a dashboard like this for your model?
Let's now walk through an end-to-end example to connect the dots and understand how you generate multiple Reports and run a dashboard on top of them.
Here is what you’ll learn to do now: create a workspace and a project, log Evidently snapshots, design the monitoring panels, and launch the dashboard.
Code example. We wrote a Python script that implements the process end-to-end. You can access it here. You can simply run the script to get a new dashboard to look at.
To better understand what’s going on, we will go through the script step by step. You can open the file and follow the explanation.
Here is what the script does: it imports the required components, loads a toy dataset, creates a workspace and a project, defines the dashboard panels, and computes and saves snapshots that imitate several batches of production data.
Let’s now go through each of the steps.
First, import the required components.
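Here is a sketch of the imports the script relies on; module paths follow the Evidently 0.4.x API and may differ in later releases:

```python
import datetime

from sklearn import datasets

from evidently.metrics import ColumnDriftMetric, DatasetDriftMetric, DatasetMissingValuesMetric
from evidently.report import Report
from evidently.ui.dashboards import CounterAgg, DashboardPanelCounter, PanelValue, ReportFilter
from evidently.ui.workspace import Workspace
```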
Next, import the data and create a pandas.DataFrame using the OpenML adult dataset.
We separate a part of the dataset as "reference" and call it adult_ref. We will later use it as a baseline for drift detection. We use the second part, adult_cur ("current"), to imitate batch inference.
Now, let’s name the workspace and project. A project will typically correspond to an ML model you monitor. You will see this name and description in the user interface.
A workspace defines the folder where Evidently will log data to. It will be created in the directory where you launch the script from. This helps organize different logs that relate to one model over time.
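In the script, these can be simple module-level constants; the names and descriptions below are placeholders, so pick your own:

```python
WORKSPACE = "workspace"  # folder where Evidently will store the snapshots

PROJECT_NAME = "My model monitoring"  # shown in the user interface
PROJECT_DESCRIPTION = "Toy example on the OpenML adult dataset"
```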
You can decide what to log – statistical data summaries, test results, or specific metrics. For example, you can capture the same data drift report shared above and log a new one daily. This way, you can later visualize the share of drifted features or particular drift scores over time – and browse individual Reports from the interface.
You can also capture data quality metrics, such as the share of missing values, the number of constant columns, min-max values, etc. You can also compute model quality summaries if you have true labels available.
It’s entirely up to you – you can log whatever you like! You can check the list of Evidently presets, metrics and tests.
To define the monitoring setup, you must create a Report, just like the drift report above, and include the Metrics or Presets you wish to capture.
Here is what we do in our example script:
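(The snippet below is illustrative rather than a verbatim copy of the published script; the metric list and data slicing may differ slightly.)

```python
def create_report(i: int) -> Report:
    report = Report(
        metrics=[
            DatasetDriftMetric(),
            DatasetMissingValuesMetric(),
            ColumnDriftMetric(column_name="age", stattest="wasserstein"),
        ],
        # Shift the timestamp so each snapshot lands on its own "day"
        timestamp=datetime.datetime.now() + datetime.timedelta(days=i),
    )
    # Imitate a daily batch by scoring a new slice of the "current" data
    report.run(
        reference_data=adult_ref,
        current_data=adult_cur.iloc[100 * i : 100 * (i + 1), :],
    )
    return report
```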
There is one slight difference compared to the earlier ad hoc workflow. Instead of rendering a Report in the notebook, we now save it as a "snapshot." A snapshot is a JSON "version" of the Evidently Report or Test Suite. It contains all the information required to recreate the visual HTML report.
You must store these snapshots in a directory that the Evidently UI service can access. The monitoring service will parse the data from snapshots and visualize metrics over time.
When we generate the Reports inside a workspace (you will see it later in this script), Evidently will automatically generate them as snapshots. As simple as that!
You must add monitoring panels to define what you will see on the dashboard for this particular model. You can choose between different panel types: for example, add a simple counter to show the number of model predictions made, a time series line plot to display the share of drifting features over time, a bar chart, and so on.
Here is how we do this in the script.
First, create a new project in the workspace:
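```python
ws = Workspace.create(WORKSPACE)

project = ws.create_project(PROJECT_NAME)
project.description = PROJECT_DESCRIPTION
```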
Next, add panels to the project. Here is an example of adding a counter metric to show the share of drifted features.
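A sketch, assuming the snapshots include DatasetDriftMetric so that its share_of_drifted_columns field is available to plot:

```python
project.dashboard.add_panel(
    DashboardPanelCounter(
        title="Share of drifted features",
        filter=ReportFilter(metadata_values={}, tag_values=[]),
        value=PanelValue(
            metric_id="DatasetDriftMetric",
            field_path="share_of_drifted_columns",
            legend="share",
        ),
        text="share",
        agg=CounterAgg.LAST,
    )
)
```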
You can check out the complete script to see how the other panels and metrics are implemented.
After you define the design of your monitoring panels, you must save the project.
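```python
project.save()
```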
Note: these panels relate specifically to the monitoring dashboard. Since Evidently captures multiple metrics in Reports and Test Suites, you must pick which ones to plot. However, many visuals are available out of the box: the Evidently Reports logged for each period already contain detailed plots you can browse from the interface.
Finally, we need to create the workspace and the project, and generate the snapshots. When you execute the script, Evidently will compute and write the snapshots with the selected metrics to the defined workspace folder, as if you captured data for 5 days. It will also create the dashboard panels as defined above.
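In the script, this boils down to a short loop; each Report added to the workspace is stored as a snapshot:

```python
for i in range(0, 5):
    report = create_report(i=i)
    ws.add_report(project.id, report)
```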
Now, once we’ve walked through the complete script, let’s execute it!
Run the command to generate a new example project using the script:
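```
# use the filename you saved the example script under
python get_started_monitoring.py
```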
Then, launch the user interface to see it! Run:
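```
evidently ui
```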
To view the service, head to localhost:8000.
You will be able to see a new project in the interface:
After you click on "new project", you can see the monitoring dashboard you just created.
To go through the steps in more detail, refer to the complete Monitoring User Guide.
To start monitoring an existing ML model, you must build a workflow to collect the data from your production pipelines or services. You can also run monitoring jobs over production logs stored in a data warehouse. The exact integration scenario depends on the model deployment type and infrastructure.
Here is one possible approach. You can implement it using a workflow manager like Airflow to compute Evidently snapshots on a regular cadence.
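For illustration only, a scheduled monitoring job might boil down to a function like the one below; the orchestrator (Airflow, cron, or similar) would fetch the latest batch of prediction logs and call it once per period. The function and parameter names are hypothetical:

```python
import datetime

import pandas as pd

from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.report import Report
from evidently.ui.workspace import Workspace


def run_monitoring_job(
    workspace_path: str,
    project_id,                   # ID of an existing Evidently project in that workspace
    reference: pd.DataFrame,      # baseline data, e.g. a sample of the training set
    current: pd.DataFrame,        # the latest batch of inputs and predictions from your logs
    day: datetime.date,
) -> None:
    report = Report(
        metrics=[DataDriftPreset(), DataQualityPreset()],
        timestamp=datetime.datetime.combine(day, datetime.time.min),
    )
    report.run(reference_data=reference, current_data=current)

    # Write the snapshot to the workspace folder that the Evidently UI reads from
    ws = Workspace.create(workspace_path)
    ws.add_report(project_id, report)
```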
Try our open-source library with over 20 million downloads, or sign up for Evidently Cloud to run no-code checks and bring the whole team into a single workspace to collaborate on AI quality.
Sign up free ⟶
Or try open source ⟶
ML monitoring is a necessary component of production ML deployment. It involves tracking data inputs, predictions, and outcomes to ensure that models remain accurate and reliable. ML monitoring gives you visibility into how well the model functions and helps you detect and resolve issues.
It is possible to implement the complete ML monitoring workflow using open-source tools. With Evidently, you can start with simple ad hoc checks with Reports or Test Suites and then add a live monitoring dashboard as you scale. This way, you can start small and introduce complexity progressively.