In this code tutorial, you will learn how to run ad hoc evaluations of your ML model data and set up a live ML monitoring dashboard using open-source tools.
To complete the tutorial, you must have basic Python knowledge and be comfortable using the terminal. You will go through the end-to-end process of creating an ML monitoring dashboard for a toy model and view it in your browser.
We’ll use Evidently, an open-source Python library for ML model monitoring.
Want to go straight to code? Here is a Python script we will use to create a live ML monitoring dashboard.
Building an ML model is not a one-and-done process. Many things can go wrong once you deploy a model to the real world: the input data can change, upstream pipelines can break, or real-world patterns can shift away from what the model learned during training.
To address all this, you need ML monitoring – a way to oversee and evaluate ML models in production. Monitoring is an essential component of MLOps. Once you deploy the models, you must keep tabs on them!
By implementing ML monitoring, you can spot data quality issues, data drift, and model quality decay as they happen. Whether it's incorrect inputs or shifting trends, monitoring acts as an early warning system.
It’s easy to talk about the benefits of monitoring, but, to be fair, this is often the least loved task. Building models is much more fun than babysitting them!
Building a complete monitoring system also sounds like a lot of work. You need both metric computation and a visualization layer, which might require stitching together different tools.
However, there are easier ways to start.
In this tutorial, we’ll work with Evidently – an open-source MLOps tool that helps evaluate, test, and monitor ML models. We aim to cover the core ML monitoring workflow and how to implement it in practice with the least effort.
To start, you can generate monitoring reports ad hoc. A more "manual" approach can make sense if you have only just deployed your ML model. It also helps shape expectations about the model and data quality before you automate the metric tracking.
You can query your model logs and generate a report with metrics you care about. You can explore them in the Jupyter notebook or save them as HTML files to share with others.
Let’s take a look at how it can work! First, install Evidently. If you work in Google Colab, run:
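```
!pip install evidently
```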
If you work in a Jupyter notebook, you must install nbextension. Check out the detailed installation instructions.
Next, import a few libraries and components required to run an example.
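Here is a minimal set of imports for this example; the Evidently module paths below follow the 0.4.x releases and may look different in later versions:

```python
from sklearn import datasets

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
```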
We will import a toy dataset as a demo. We’ll use the "adult" dataset from OpenML.
In practice, you should use the model prediction logs. They can include input data, model predictions, and true labels or actuals, if available.
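For instance, you can pull the toy dataset from OpenML with scikit-learn and work with it as a pandas DataFrame:

```python
# Fetch the "adult" dataset from OpenML; .frame gives a pandas DataFrame
adult_data = datasets.fetch_openml(name="adult", version=2, as_frame="auto")
adult = adult_data.frame
```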
Now, let’s split our toy data into two. This way, we will create a baseline dataset to compare against: we call it "reference." The second dataset is the current production data. To make things more interesting, we also introduce some changes to the data by filtering our selection using one of the columns. This is a quick way to add some artificial drift.
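One way to do it is to filter by the education column so that the "current" data only contains a subset of education levels (the column name comes from the OpenML "adult" dataset):

```python
# Baseline ("reference") data: education levels NOT in the selected list
adult_ref = adult[~adult.education.isin(["Some-college", "HS-grad", "Bachelors"])]

# "Current" data: only the selected education levels, which adds artificial drift
adult_cur = adult[adult.education.isin(["Some-college", "HS-grad", "Bachelors"])]
```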
In practice, you can pick comparison windows based on your assumptions about the data stability: for example, you can compare last month to training.
Now, let's get to the data drift report. You must create a Report object, specify the analytical preset you want to include (in this case, the "Data Drift Preset"), and pass the two datasets you compare as "reference" and "current."
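A minimal sketch:

```python
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=adult_ref, current_data=adult_cur)
report.show(mode="inline")  # renders the report inside the notebook cell
```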
Once you run the code, you will see the Report directly in your notebook. Evidently will check for distribution drift in all the individual features and sum it up in a report.
If you click on individual features, it will show additional plots to explore the specific distributions.
In this example, we detect distribution drift in many of the features – since we artificially selected only a subset of the data with a particular education level.
If you want to share the Report, you can also export it as a standalone HTML file.
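For example:

```python
report.save_html("data_drift_report.html")
```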
Evidently has other pre-built Reports and Tests Suites. For example, related to data or model quality. You can also design a custom Report or a Test Suite by picking individual checks. Check out the complete introductory tutorial for more details.
This tutorial will focus on Reports, but Test Suites work similarly. The difference is that Tests allow you to verify a condition explicitly: "Is my feature within a specified min-max range?" On the other hand, reports compute and visualize metrics without expectations: "Here is the min, max, mean, and number of constant values, and this is how the distribution looks."
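For illustration, a pre-built Test Suite for data drift looks roughly like this (module paths again follow the 0.4.x API):

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

tests = TestSuite(tests=[DataDriftTestPreset()])
tests.run(reference_data=adult_ref, current_data=adult_cur)
tests.show(mode="inline")  # each test comes back with an explicit pass/fail status
```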
You can go pretty far with report-based monitoring, especially if you work with batch models. For example, you can generate weekly reports and log them to MLflow. You can run Reports on a schedule using an orchestrator tool like Airflow and even build a conditional workflow – for example, generate a notification if drift is detected in your dataset.
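As an illustration of such a conditional step, you can export the computed Report as a Python dictionary and branch on the result. The exact key path below is an assumption to verify against your Evidently version, and notify_team is a hypothetical helper:

```python
result = report.as_dict()

# With DataDriftPreset, the dataset-level drift summary is typically the first metric;
# check the structure for your Evidently version before relying on it.
if result["metrics"][0]["result"]["dataset_drift"]:
    notify_team("Data drift detected in the latest batch")  # hypothetical notification helper
```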
However, this approach has its limitations, especially as you scale. Organizing and navigating multiple reports may be inconvenient. When you compute individual Reports for specific periods, there is also no easy way to track the trends – for example, see how the share of drifting features changes over time.
Here is the next thing to add: a live dashboard!
Evidently has a user interface component that helps track the metrics over time.
Here is how it looks:
It conveniently sits on top of the Reports like the one we just generated, so it’s easy to progress from ad hoc checks to a complete monitoring setup.
Here is the principle behind it: you regularly capture metrics as Evidently Reports or Test Suites, save them as JSON snapshots in a workspace folder, and the dashboard service reads these snapshots to visualize how the metrics evolve over time.
You can run the dashboard locally to take a quick look. Let’s do this right now!
Want to get a web dashboard instead? Sign up for Evidently Cloud.
To start, let’s launch a demo project to see an example monitoring dashboard. We’ll now head from the notebook environment to the Terminal: Evidently will then run as a web application in a browser.
1. Create a virtual environment
This is an optional but highly recommended step. Create a virtual environment and activate it.
Run the following command in the Terminal:
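```
python -m venv venv
source venv/bin/activate
```

On Windows, activate the environment with venv\Scripts\activate instead.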
2. Install Evidently
Now, install Evidently in your environment:
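```
pip install evidently
```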
3. Run the demo project
To launch the Evidently service with the demo project, run:
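```
evidently ui --demo-projects all
```

The demo project flag has changed between releases (earlier versions used a --demo-project flag), so check evidently ui --help if the command errors out.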
To view the Evidently interface, go to http://localhost:8000 in your web browser.
You'll find a ready-made project that displays the performance of a simple model across 20 days. You can switch between tabs, for example, to look at individual Reports. You can even download them! They will be the same as the Data Drift report we generated in the notebook.
Each Report or Test Suite corresponds to a single day of the model’s performance. The monitoring dashboard takes the data from these Reports and presents how metrics evolve.
Do you want to run a dashboard like this for your model?
Let's now walk through an end-to-end example to connect the dots and understand how you generate multiple Reports and run a dashboard on top of them.
Here is what you’ll learn to do now: create a workspace and a project, log Evidently snapshots, design the monitoring panels, and launch the dashboard.
Code example. We wrote a Python script that implements the process end-to-end. You can access it here. You can simply run the script to get a new dashboard to look at.
To better understand what’s going on, we will go through the script step by step. You can open the file and follow the explanation.
Here is what the script does: it imports the required components, loads a toy dataset, creates a workspace and a project, defines the dashboard panels, and computes and saves snapshots that imitate several batches of production data.
Let’s now go through each of the steps.
First, import the required components.
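Here is a sketch of the imports the script relies on; module paths follow the Evidently 0.4.x API and may differ in later releases:

```python
import datetime

from sklearn import datasets

from evidently.metrics import ColumnDriftMetric, DatasetDriftMetric, DatasetMissingValuesMetric
from evidently.report import Report
from evidently.ui.dashboards import CounterAgg, DashboardPanelCounter, PanelValue, ReportFilter
from evidently.ui.workspace import Workspace
```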
Next, import the data and create a pandas.DataFrame using the OpenML adult dataset.
We separate a part of the dataset as "reference" and call it adult_ref. We will later use it as a baseline for drift detection. We use the second part, adult_cur ("current"), to imitate batch inference.
Now, let’s name the workspace and project. A project will typically correspond to an ML model you monitor. You will see this name and description in the user interface.
A workspace defines the folder where Evidently will log data to. It will be created in the directory where you launch the script from. This helps organize different logs that relate to one model over time.
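In the script, these can be simple module-level constants; the names and descriptions below are placeholders, so pick your own:

```python
WORKSPACE = "workspace"  # folder where Evidently will store the snapshots

PROJECT_NAME = "My model monitoring"  # shown in the user interface
PROJECT_DESCRIPTION = "Toy example on the OpenML adult dataset"
```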
You can decide what to log – statistical data summaries, test results, or specific metrics. For example, you can capture the same data drift report shared above and log a new one daily. This way, you can later visualize the share of drifted features or particular drift scores over time – and browse individual Reports from the interface.
You can also capture data quality metrics, such as the share of missing values, the number of constant columns, min-max values, etc. You can also compute model quality summaries if you have true labels available.
It’s entirely up to you – you can log whatever you like! You can check the list of Evidently presets, metrics and tests.
To define the monitoring setup, you must create a Report, just like the drift report above, and include the Metrics or Presets you wish to capture.
Here is what we do in our example script:
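(The snippet below is illustrative rather than a verbatim copy of the published script; the metric list and data slicing may differ slightly.)

```python
def create_report(i: int) -> Report:
    report = Report(
        metrics=[
            DatasetDriftMetric(),
            DatasetMissingValuesMetric(),
            ColumnDriftMetric(column_name="age", stattest="wasserstein"),
        ],
        # Shift the timestamp so each snapshot lands on its own "day"
        timestamp=datetime.datetime.now() + datetime.timedelta(days=i),
    )
    # Imitate a daily batch by scoring a new slice of the "current" data
    report.run(
        reference_data=adult_ref,
        current_data=adult_cur.iloc[100 * i : 100 * (i + 1), :],
    )
    return report
```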
There is one slight difference compared to the earlier ad hoc workflow. Instead of rendering a Report in the notebook, we now save it as a "snapshot." A snapshot is a JSON "version" of the Evidently Report or Test Suite. It contains all the information required to recreate the visual HTML report.
You must store these snapshots in a directory that the Evidently UI service can access. The monitoring service will parse the data from snapshots and visualize metrics over time.
When we generate the Reports inside a workspace (you will see it later in this script), Evidently will automatically generate them as snapshots. As simple as that!
You must add monitoring panels to define what you will see on the dashboard for this particular model. You can choose between different panel types: for example, add a simple counter to show the number of model predictions made, a time series line plot to display the share of drifting features over time, a bar chart, and so on.
Here is how we do this in the script.
First, create a new project in the workspace:
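```python
ws = Workspace.create(WORKSPACE)

project = ws.create_project(PROJECT_NAME)
project.description = PROJECT_DESCRIPTION
```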
Next, add panels to the project. Here is an example of adding a counter metric to show the share of drifted features.
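A sketch, assuming the snapshots include DatasetDriftMetric so that its share_of_drifted_columns field is available to plot:

```python
project.dashboard.add_panel(
    DashboardPanelCounter(
        title="Share of drifted features",
        filter=ReportFilter(metadata_values={}, tag_values=[]),
        value=PanelValue(
            metric_id="DatasetDriftMetric",
            field_path="share_of_drifted_columns",
            legend="share",
        ),
        text="share",
        agg=CounterAgg.LAST,
    )
)
```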
You can check out the complete script to see how the other panels and metrics are implemented.
After you define the design of your monitoring panels, you must save the project.
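```python
project.save()
```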
Note: these panels relate specifically to the monitoring dashboard. Since Evidently captures multiple metrics in Reports and Test Suites, you must pick which ones to plot. However, many visuals are available out of the box: the Evidently Reports logged for each period already contain detailed plots you can browse from the interface.
Finally, we need to create the workspace and the project, and generate the snapshots. When you execute the script, Evidently will compute and write the snapshots with the selected metrics to the defined workspace folder, as if you captured data for 5 days. It will also create the dashboard panels as defined above.
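In the script, this boils down to a short loop; each Report added to the workspace is stored as a snapshot:

```python
for i in range(0, 5):
    report = create_report(i=i)
    ws.add_report(project.id, report)
```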
Now, once we’ve walked through the complete script, let’s execute it!
Run the command to generate a new example project using the script:
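```
# use the filename you saved the example script under
python get_started_monitoring.py
```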
Then, launch the user interface to see it! Run:
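```
evidently ui
```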
To view the service, head to localhost:8000.
You will be able to see a new project in the interface:
After you click on "new project", you can see the monitoring dashboard you just created.
To go through the steps in more detail, refer to the complete Monitoring User Guide.
To start monitoring an existing ML model, you must build a workflow to collect the data from your production pipelines or services. You can also run monitoring jobs over production logs stored in a data warehouse. The exact integration scenario depends on the model deployment type and infrastructure.
Here is one possible approach. You can implement it using a workflow manager like Airflow to compute Evidently snapshots on a regular cadence.
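For illustration only, a scheduled monitoring job might boil down to a function like the one below; the orchestrator (Airflow, cron, or similar) would fetch the latest batch of prediction logs and call it once per period. The function and parameter names are hypothetical:

```python
import datetime

import pandas as pd

from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.report import Report
from evidently.ui.workspace import Workspace


def run_monitoring_job(
    workspace_path: str,
    project_id,                   # ID of an existing Evidently project in that workspace
    reference: pd.DataFrame,      # baseline data, e.g. a sample of the training set
    current: pd.DataFrame,        # the latest batch of inputs and predictions from your logs
    day: datetime.date,
) -> None:
    report = Report(
        metrics=[DataDriftPreset(), DataQualityPreset()],
        timestamp=datetime.datetime.combine(day, datetime.time.min),
    )
    report.run(reference_data=reference, current_data=current)

    # Write the snapshot to the workspace folder that the Evidently UI reads from
    ws = Workspace.create(workspace_path)
    ws.add_report(project_id, report)
```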
Try our open-source library with over 20 million downloads, or sign up for Evidently Cloud to run no-code checks and bring the whole team into a single workspace to collaborate on AI quality.
Sign up free ⟶
Or try open source ⟶
ML monitoring is a necessary component of production ML deployment. It involves tracking data inputs, predictions, and outcomes to ensure that models remain accurate and reliable. ML monitoring gives you visibility into how well the model functions and helps you detect and resolve issues.
It is possible to implement the complete ML monitoring workflow using open-source tools. With Evidently, you can start with simple ad hoc checks with Reports or Test Suites and then add a live monitoring dashboard as you scale. This way, you can start small and introduce complexity progressively.