In this code tutorial, you will learn how to run batch ML model inference, collect data and ML model quality monitoring metrics, and visualize them on a live dashboard.
This is a blueprint for an end-to-end batch ML monitoring workflow using open-source tools. You can copy the repository and adapt this reference architecture to your use case.
Code example: if you prefer to head straight to the code, open this example folder on GitHub.
When an ML model is in production, you need to keep tabs on the ML-related quality metrics in addition to traditional software service monitoring. This typically includes:
In this tutorial, we introduce a possible implementation architecture for ML monitoring as a set of monitoring jobs.
You can adapt this approach to your batch ML pipelines. You can also use it to monitor online ML services when you do not need to compute metrics every second and can instead read freshly logged data from the database at an interval, say, every minute or once per hour.
In this tutorial, you will learn how to build a batch ML monitoring workflow using Evidently, Prefect, PostgreSQL, and Grafana.
The tutorial includes all the necessary steps to imitate the batch model inference and subsequent data joins for model quality evaluation.
By the end of this tutorial, you will learn how to implement an ML monitoring architecture using:
You will run your monitoring solution in a Docker environment for easy deployment and scalability.
We expect that you:
You also need the following tools installed on your local machine:
Note: we tested this example on macOS/Linux.
First, install the pre-built example. Check the README file for more detailed instructions.
Clone the Evidently GitHub repository with the example code. This repository provides the necessary files and scripts to set up the integration between Evidently, Prefect, PostgreSQL, and Grafana.
Create a Python virtual environment to isolate the dependencies for this project. Then, install the required Python libraries from the requirements.txt file:
Set up the monitoring infrastructure using Docker Compose. It will launch a cluster with the required services such as PostgreSQL and Grafana. This cluster is responsible for storing the monitoring metrics and visualizing them.
To store the ML monitoring metrics in the PostgreSQL database, you must create the necessary tables. Run the Python script below to set up the database structure for the metrics generated by Evidently.
This example is based on the NYC Taxi dataset. The data and the model training are out of the scope of this tutorial. We prepared a few scripts to download data, pre-process it and train a simple machine learning model.
To generate monitoring reports with Evidently, we usually need two datasets:
In this example, we take data from January 2021 as a reference. We use this data as a baseline and generate monitoring reports for consecutive periods.
Do you always need the reference dataset? It depends. A reference dataset is required to compute data distribution drift. You can also choose to work with a reference dataset to quickly generate expectations about the data, such as data schema, feature min-max ranges, baseline model quality, etc. This is useful if you want to run Test Suites with auto-generated parameters. However, you can calculate most metrics (e.g., the share of nulls, feature min/max/mean, model quality, etc.) without the reference dataset.
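To make the last point concrete, here is a minimal plain-Python sketch of reference-free metrics computed on a single batch. It is not how Evidently implements them; the `profile_column` helper and the sample `trip_distance` values are hypothetical and only illustrate that these statistics need the current data alone.

```python
from statistics import mean


def profile_column(values):
    """Summarize one numerical column of the current batch (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "share_of_nulls": (len(values) - len(present)) / len(values) if values else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present else None,
    }


# Hypothetical trip distances for one batch; no reference dataset is involved.
trip_distance = [1.2, None, 3.4, 0.8, None, 2.6]
summary = profile_column(trip_distance)
print(summary)
```

Drift checks, in contrast, always need a second dataset to compare against.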
After completing the installation, you have a working Evidently integration with Prefect, PostgreSQL, and Grafana.
Follow the steps below to launch the example. You will run a set of inference and ML monitoring pipelines to generate the metrics, store them in the database and explore them on a Grafana dashboard.
Set the Prefect API URL environment variable to enable communication between the Prefect server and the Python scripts. Then, execute the scheduler script to automatically run the Prefect flows for inference and monitoring:
The scheduler.py script runs the following pipelines:
For simplicity, the scheduler.py script uses the following hardcoded parameters to schedule the pipelines.
By fixing the parameters, we ensure reproducibility. When you run the tutorial, you should get the same visuals. We will further discuss how to customize this example.
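As a rough illustration of how fixed scheduling parameters translate into pipeline runs, the sketch below expands a start time, an end time, and an interval into the list of batch timestamps. The `batch_timestamps` helper and the values of `START`, `END`, and `INTERVAL_MINUTES` are assumptions made for this example; check scheduler.py for the values the repository actually uses.

```python
from datetime import datetime, timedelta


def batch_timestamps(start, end, interval_minutes):
    """Return the run timestamps from start (inclusive) to end (exclusive)."""
    step = timedelta(minutes=interval_minutes)
    stamps = []
    ts = start
    while ts < end:
        stamps.append(ts)
        ts += step
    return stamps


# Hypothetical fixed parameters (not taken from the repository).
START = datetime(2021, 2, 1, 2, 0)
END = datetime(2021, 2, 1, 6, 0)
INTERVAL_MINUTES = 60

schedule = batch_timestamps(START, END, INTERVAL_MINUTES)
for ts in schedule:
    print(ts.isoformat())
```

Because the inputs are constants, every run of the script produces the same sequence of batches, which is what keeps the dashboards reproducible.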
Instead of running the scheduler script, you can execute each pipeline individually. This way, you will have more control over the specific execution time and interval of each pipeline.
Access the Prefect UI by navigating to http://localhost:4200 in a web browser. The Prefect UI shows the executed pipelines and their current status.
Open the Grafana monitoring dashboards by visiting http://localhost:3000 in a web browser. The example contains pre-built Grafana dashboards showing data quality, target drift, and model performance metrics.
You can navigate and see the metrics as they appear on the Grafana dashboard.
Now, let’s explore each component of the ML model monitoring architecture.
In this section, we will explain the design of the three Prefect pipelines to monitor input data quality, model predictions, and model performance. You will understand how they work and how to modify them.
First, let’s visualize the pipeline order and dependencies.
You execute three pipelines at different time intervals (T-1, T, and T+1). For each period, you make new predictions, run input data checks, and monitor model performance.
The pipelines perform the following tasks:
In Prefect, tasks are the fundamental building blocks of workflows. Tasks represent individual operations, such as reading data, preprocessing data, training a model, or evaluating a model.
Let’s consider a simple example below:
To define a task in Prefect, one can use the @task decorator. This decorator turns any Python function into a Prefect task. This example demonstrates a simple flow containing two tasks:
Flows are the backbone of Prefect workflows. They represent the relationships between tasks and define the order of execution. To create a flow, use the @flow decorator. The help_world flow first calls the say_hello task. The output of this task (the greeting message) is then passed as an argument to the do_good_open_source task. The resulting list of messages from do_good_open_source is printed using a list comprehension.
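A minimal sketch of such a flow could look as follows. The function names come from the description above, but the function bodies and message texts are assumptions. The `try`/`except` fallback is also our addition: it swaps in no-op decorators so the snippet still runs as plain Python when Prefect is not installed.

```python
try:
    from prefect import flow, task
except ImportError:
    # Fallback for illustration only: without Prefect, run as plain functions.
    def task(fn):
        return fn

    flow = task


@task
def say_hello(name: str) -> str:
    """Build the greeting message."""
    return f"Hello, {name}!"


@task
def do_good_open_source(greeting: str) -> list:
    """Return a list of messages that starts with the greeting."""
    return [greeting, "Let's do some good open source!"]


@flow
def help_world(name: str = "world") -> list:
    """Call the two tasks in order and print the resulting messages."""
    greeting = say_hello(name)
    messages = do_good_open_source(greeting)
    [print(message) for message in messages]
    return messages


if __name__ == "__main__":
    help_world("Evidently")
```

When Prefect is installed, its own log lines for the flow and task runs appear around the printed messages.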
Running this Python module produces output like this:
Prefect can automatically log the details of the running flow and visualize them in the Prefect UI.
This pipeline makes predictions using a pre-trained ML model. The predict function is a Prefect flow that generates predictions for a new batch of data within a specified interval.
The predict flow orchestrates the entire prediction process. It takes a timestamp and an interval (in minutes) as input arguments. The flow consists of the following steps:
By defining predict as a Prefect flow, you create a reusable and modular pipeline for generating predictions on new data batches.
💡 Note: The predict flow is decorated with the @flow decorator, which includes the flow_run_name parameter that gives a unique name for each flow run based on the timestamp (ts).
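The sketch below shows the core idea in plain Python: select the rows that fall into the requested time window, score them, and derive a run name from the timestamp. It is not the repository code. The `select_batch` helper, the window direction `[ts - interval, ts)`, the `pickup_time` and `distance` fields, the stand-in model, and the `predict-on-{ts}` template are all assumptions. The `str.format` call only mimics how Prefect fills a `flow_run_name` template from flow parameters.

```python
from datetime import datetime, timedelta


def select_batch(rows, ts, interval_minutes):
    """Keep rows whose pickup time falls in [ts - interval, ts) (assumed window)."""
    window_start = ts - timedelta(minutes=interval_minutes)
    return [row for row in rows if window_start <= row["pickup_time"] < ts]


def predict(rows, ts, interval_minutes, model):
    """Score the batch for the window ending at ts and attach the predictions."""
    batch = select_batch(rows, ts, interval_minutes)
    return [{**row, "prediction": model(row)} for row in batch]


# Hypothetical data and a stand-in "model" (3 minutes per distance unit).
rows = [
    {"pickup_time": datetime(2021, 2, 1, 1, 30), "distance": 4.0},
    {"pickup_time": datetime(2021, 2, 1, 2, 15), "distance": 2.0},
    {"pickup_time": datetime(2021, 2, 1, 2, 45), "distance": 1.0},
    {"pickup_time": datetime(2021, 2, 1, 3, 10), "distance": 5.0},
]
ts = datetime(2021, 2, 1, 3, 0)

scored = predict(rows, ts, interval_minutes=60, model=lambda row: 3 * row["distance"])
run_name = "predict-on-{ts}".format(ts=ts)  # mimics a flow_run_name template

print(run_name, [row["prediction"] for row in scored])
```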
The data quality monitoring pipeline tracks the quality and stability of the input data. We'll use Evidently to perform data profiling and generate a data quality report.
💡 For ease of demonstration, we check both the input data drift and the prediction distribution drift as part of the data quality pipeline (both are included in the Data Drift Preset). You may want to split these tasks into separate pipelines in your projects.
The code snippet below from the src/pipelines/monitor_data.py shows how to create a Prefect flow to monitor data quality and data drift in a machine learning pipeline.
The monitor_data flow orchestrates the data monitoring process. It takes a timestamp ts and an interval (in minutes) as input arguments. The flow consists of the following steps:
Let’s dive deeper into the generate_reports task!
This task generates a set of Evidently metrics related to the data quality and data drift. It takes the current data, reference data, numerical features, categorical features, and the prediction column as input arguments and computes two reports.
The Data Quality report includes the DatasetSummaryMetric. It profiles the input dataset by computing metrics like the number of observations, missing values, empty and almost empty values, etc.
🚦 Conditional data validation. In this example, we compute and log the model quality metrics. As an alternative, you can directly check if the input data complies with certain conditions (for example, if there are features out of range, schema violations, etc.) and log the pass/fail test result in addition to metric values. In this case, use Evidently Test Suites.
The Data Drift report includes the DataDriftPreset. It compares the distributions of the features and predictions between the current and reference datasets. We do not pass any custom parameters, so it uses the default Evidently drift detection algorithm.
In this case, we do not generate visual reports with Evidently. Instead, we get the metrics as a Python dictionary using the Evidently .as_dict() output format. This output includes the metric values, relevant metadata (such as the applied drift detection method and threshold), and even optional visualization information, such as histogram bins.
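To show what working with such a dictionary can look like, here is a small sketch that pulls one metric's result out of a nested structure and flattens it into a row for storage. The `report_dict` below is a hand-written stand-in, not real Evidently output: its layout and field names are simplified assumptions and may differ from what your Evidently version returns, so inspect the actual output before relying on any key.

```python
def extract_metric(report_dict, metric_name):
    """Return the result payload of the first metric entry with the given name."""
    for entry in report_dict["metrics"]:
        if entry["metric"] == metric_name:
            return entry["result"]
    raise KeyError(metric_name)


# Hand-written stand-in for a report dictionary (illustrative keys only).
report_dict = {
    "metrics": [
        {
            "metric": "DatasetSummaryMetric",
            "result": {"number_of_rows": 1200, "number_of_missing_values": 36},
        },
        {
            "metric": "DatasetDriftMetric",
            "result": {"share_of_drifted_columns": 0.25, "dataset_drift": False},
        },
    ]
}

drift = extract_metric(report_dict, "DatasetDriftMetric")
# Flatten the values you want to keep into one row for the metrics table.
row = {
    "share_of_drifted_columns": drift["share_of_drifted_columns"],
    "dataset_drift": drift["dataset_drift"],
}
print(row)
```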
This task then commits the computed metrics to the database for future analysis and visualization.
💡 Customizing the Metrics. In this example, we use only a couple of metrics available in Evidently. You can browse other metrics in these sample notebooks or explore the list of 100+ Metrics and Tests to choose those suitable for your use case.
The model performance monitoring pipeline tracks the model quality over time. It uses Evidently to generate model quality metrics and compare the distribution of the model target against the reference period (evaluate target drift).
The code snippet below from the src/pipelines/monitor_model.py demonstrates how to create a Prefect flow for model monitoring:
The monitor_model Prefect flow generates the relevant metrics and commits them to the database. It consists of the following steps:
The generate_reports task generates model performance and target drift reports. It utilizes the following Evidently metrics:
Now, let’s look at what happens with the computed metrics.
In this example, we use SQLAlchemy, a popular Python SQL toolkit and Object-Relational Mapper (ORM), to interact with the PostgreSQL database. With SQLAlchemy, you can define the table schema using Python classes and easily manage the database tables using Python code.
We prepared a Python script named create_db.py that creates the database tables required for storing monitoring metrics.
python src/scripts/create_db.py
The table models are defined in the src/utils/models.py module.
For example, the TargetDriftTable class represents a table schema for storing target drift metrics in the PostgreSQL database.
This DB table contains the following columns for monitoring purposes:
It works similarly for other tables in the database.
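The write-then-read pattern can be sketched with Python's built-in sqlite3 module. This is a deliberate stand-in so the snippet runs without a database server: the tutorial itself uses PostgreSQL through SQLAlchemy models. The table layout, the column names, and the `commit_metrics` helper are hypothetical and do not reproduce the TargetDriftTable schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the PostgreSQL monitoring database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE target_drift (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        ts TEXT NOT NULL,
        stattest_name TEXT,
        stattest_threshold REAL,
        drift_score REAL,
        drift_detected INTEGER
    )
    """
)


def commit_metrics(conn, ts, metrics):
    """Insert one row of drift metrics for the batch identified by ts."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO target_drift "
            "(ts, stattest_name, stattest_threshold, drift_score, drift_detected) "
            "VALUES (?, ?, ?, ?, ?)",
            (
                ts,
                metrics["stattest_name"],
                metrics["stattest_threshold"],
                metrics["drift_score"],
                int(metrics["drift_detected"]),
            ),
        )


# Hypothetical metric values for two consecutive batches.
commit_metrics(conn, "2021-02-01T02:00:00", {
    "stattest_name": "wasserstein", "stattest_threshold": 0.1,
    "drift_score": 0.04, "drift_detected": False,
})
commit_metrics(conn, "2021-02-01T03:00:00", {
    "stattest_name": "wasserstein", "stattest_threshold": 0.1,
    "drift_score": 0.17, "drift_detected": True,
})

# A dashboard panel would run a time-ordered query similar to this one.
history = conn.execute(
    "SELECT ts, drift_score, drift_detected FROM target_drift ORDER BY ts"
).fetchall()
print(history)
```

Storing one row per batch with an explicit timestamp is what lets a dashboard plot the metric history later, even when the metrics were computed with a delay.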
📊 How do the data drift checks work? You can explore the "Which test is the best" research blog to understand the behavior of different drift detection methods. To understand the parameters of the Evidently drift checks, check the documentation.
💡 Why not Prometheus? A popular combination is to use Grafana together with Prometheus. In this case, we opt for a SQL database. The reason is that Prometheus is well-suited to store time series metrics in near real-time. However, it is not convenient for writing delayed data. In our case, we compute metrics on a schedule (which can be as infrequent as once per day or week) and compute model quality metrics with a delay. Using Prometheus also adds one more service to run (and monitor!).
We prepared Grafana dashboard configurations to visualize the collected metrics and data source configurations to connect it to the PostgreSQL database. You can find them in the grafana/ directory.
After you launch the monitoring cluster, Grafana will connect to the monitoring database and create dashboards based on the templates.
As an example, let's explore the Evidently Numerical Target Drift dashboard, which provides insights into the distribution drift of the model target.
The top widgets show the name of the drift detection method (in this case, Wasserstein distance) and the threshold values.
The middle and bottom widgets display the history of drift checks for each period. This helps identify specific time points when drift occurred and the corresponding drift scores. You can understand the severity of drift and decide whether you want to intervene.
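To give a feel for how a drift score and a threshold turn into a drift flag, here is a simplified, standard-library sketch of a Wasserstein-based check. It is not Evidently's implementation. It assumes equal-sized samples, normalizes the distance by the standard deviation of the reference (an assumption on our side), and uses an illustrative threshold of 0.1.

```python
from statistics import pstdev


def wasserstein_1d(reference, current):
    """Wasserstein-1 distance between two equal-sized one-dimensional samples."""
    if len(reference) != len(current):
        raise ValueError("this simplified version needs samples of equal size")
    pairs = zip(sorted(reference), sorted(current))
    return sum(abs(r - c) for r, c in pairs) / len(reference)


def drift_check(reference, current, threshold=0.1):
    """Compare the normalized distance with the threshold (illustrative default)."""
    distance = wasserstein_1d(reference, current)
    scale = pstdev(reference) or 1.0  # avoid division by zero for constant data
    score = distance / scale
    return {
        "drift_score": score,
        "threshold": threshold,
        "drift_detected": score >= threshold,
    }


# Hypothetical target values: the current batch is shifted by one unit.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
stable = drift_check(reference, [1.0, 2.0, 3.0, 4.0, 5.0])
shifted = drift_check(reference, [2.0, 3.0, 4.0, 5.0, 6.0])
print(stable)
print(shifted)
```

Each batch yields one score and one flag, which is the kind of value the history widgets plot over time.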
You can easily customize the dashboard widgets and scripts used to build them.
Alerts. You can also use Grafana to define alert conditions to inform when metrics are outside the expected range.
To adapt this example for your machine learning projects, follow these guidelines:
Data inputs. Modify the load_data task to load your dataset from the relevant data source (e.g., CSV, parquet file, or database).
Model inference. Replace the existing model with your trained machine learning model. You may need to adjust the get_predictions task to ensure compatibility with your chosen algorithm and data format.
Monitoring metrics. Customize the monitoring tasks to include metrics relevant to your project. Consider including data quality, data drift, target drift, and model performance checks. Update the generate_reports task with the appropriate Evidently metrics or tests.
💡 Evidently Metrics and Tests. You can browse other metrics in these sample notebooks or explore the list of 100+ Metrics and Tests to choose those suitable for your use case.
Database setup. Modify the database configuration and the table models in src/utils/models.py to store the monitoring metrics relevant to your project.
Reference dataset. Define a representative reference dataset and period suitable for your use case. This should be a stable data snapshot that captures the typical distribution and characteristics of the features and target variable. The period should be long enough to reflect the variations in the data. Consider the specific scenario and seasonality: sometimes, you might use a moving reference, for example, by comparing each week to the previous.
Store the reference dataset or generate it on the fly. Consider the size of the data and resources available for processing. Storing the reference dataset may be preferable for larger or more complex datasets, while generating it on the fly could be more suitable for smaller or highly dynamic datasets.
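If you go with a moving reference, the windows can be generated on the fly. The sketch below pairs each week with the previous one. The `moving_reference_windows` helper, the start date, and the one-week window size are assumptions for illustration; each window is a `(start, end)` tuple with an exclusive end date.

```python
from datetime import date, timedelta


def moving_reference_windows(first_day, n_weeks):
    """Return (reference, current) window pairs, comparing each week to the previous one."""
    week = timedelta(days=7)
    pairs = []
    for i in range(1, n_weeks):
        ref_start = first_day + (i - 1) * week
        cur_start = first_day + i * week
        pairs.append(((ref_start, cur_start), (cur_start, cur_start + week)))
    return pairs


# Hypothetical setup: four weeks of data starting on a Monday.
windows = moving_reference_windows(date(2021, 1, 4), n_weeks=4)
for reference_window, current_window in windows:
    print("reference:", reference_window, "current:", current_window)
```

With a fixed reference, by contrast, the first element of every pair would stay the same while only the current window moves.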
Grafana dashboards. Customize the Grafana dashboards to visualize the specific monitoring metrics relevant to your project. This may involve updating the SQL queries and creating additional visualizations to display the results of your custom metrics.
Applicable for both batch and real-time. This monitoring architecture can be used for both batch and real-time ML systems. With batch inference, you can directly follow this example, adding data validation or monitoring steps to your existing pipelines. For real-time systems, you can log model inputs and predictions to the database and then read the prediction logs from the database at a defined interval.
Async metrics computation. Metrics computation is separate from model serving. This way, it does not affect the model serving latency in cases when this is relevant.
Adaptable. You can replace specific components with those you already use. For example, you can use the same database you use to store model predictions or use a different workflow orchestrator, such as Airflow. (Here is an example of how to integrate Evidently with Airflow). You can also replace Grafana with a different BI dashboard and even make use of the additional visualization information available in the Evidently JSON/Python dictionary output to recreate some of the Evidently visualizations faster.
Might be too “heavy.” We recommend using this or similar architecture when you already use one or two of the mentioned tools as part of your workflow. For example, you already use Grafana for software monitoring or Prefect to orchestrate dataflows.
However, it might be suboptimal to introduce several complex new services to monitor a few ML models, especially if this is infrequent batch inference.
You can use Evidently to compute HTML reports and store them in any object storage. In this case, you will implement a complete “offline” monitoring setup as a set of monitoring jobs. You will also make use of the rich pre-built Evidently visuals for debugging.
Here is an example tutorial of using Evidently with Streamlit to host HTML reports.
This tutorial demonstrated how to integrate Evidently into production pipelines using Prefect, PostgreSQL, and Grafana.
You can further work with this example: