TL;DR: Meet the new feature in the Evidently open-source Python library! You can easily integrate data and model checks into your ML pipeline with a clear success/fail result. It comes with presets and defaults to make the configuration painless. There is also an upcoming API update with breaking changes.
There is new functionality in the Evidently library called Tests. It is a better, more structured way to check your data and ML model performance as a part of a production pipeline.
Tests help explicitly define the expectations from your data and model. You can declare them by setting parameters. Once you execute a test, you receive a success/fail/warning result. A test can be as simple as "check that mean error is less than X" or as complex as "test the dataset for input data drift."
You can group the Tests into Test Suites to run several checks at once.
We created several Test Suite presets. Each preset is a template that combines tests that go well together. For example, there is a preset to check for Data Quality and another for Data Stability.
Tests fit perfectly into a batch ML prediction pipeline. For example, you can orchestrate your ML pipeline with a tool like Airflow and run the Evidently tests as part of a DAG.
You can then define actions such as alerting or switching to a fallback based on the test result.
If you prefer to look straight at the code, here are the example notebooks.
You can do the installation, data prep, and column mapping as usual in Evidently: select two datasets you want to compare as "reference" and "current." In some cases, you can proceed with a single dataset. But instead of generating a visual Dashboard or a JSON profile, you can now import and call Tests.
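For example, a minimal setup could look like the sketch below. The file paths and column names are placeholders, and the `ColumnMapping` import path may differ slightly between Evidently versions (older releases expose it as `evidently.pipeline.column_mapping`).

```python
import pandas as pd

from evidently import ColumnMapping

# "Reference" data: e.g., the training set or a trusted past period.
# "Current" data: e.g., the latest production batch. File names are placeholders.
reference = pd.read_csv("reference.csv")
current = pd.read_csv("current.csv")

# Tell Evidently how to treat each column. All column names are placeholders.
column_mapping = ColumnMapping(
    target=None,                # no ground truth labels yet
    prediction="prediction",    # name of the column with model predictions
    numerical_features=["age", "hours_per_week"],
    categorical_features=["education", "country"],
)
```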
To start, you can choose one of the existing presets. For example, "NoTargetPerformance" consists of several tests to monitor a model without ground truth labels.
Here is how you call this preset and pass the list of the most important features:
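A sketch of that call is shown below, reusing the `reference` and `current` dataframes from the earlier snippet. The exact preset class and the name of the feature-list parameter have changed between Evidently versions (recent releases use `NoTargetPerformanceTestPreset(columns=...)`), so check the API reference for the version you have installed; the feature names are placeholders.

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import NoTargetPerformanceTestPreset

# Build a test suite from the preset and pass the most important features.
no_target_performance = TestSuite(tests=[
    NoTargetPerformanceTestPreset(columns=["education", "hours_per_week"]),
])

# Compare the current batch against the reference data.
no_target_performance.run(
    reference_data=reference,
    current_data=current,
    column_mapping=column_mapping,
)
```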
Once you run the check, you can get an output as a JSON file to see a summary of the results.
You can also get a visual HTML report by calling an object inside a Jupyter notebook or Colab.
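Assuming the suite object from the sketch above, getting both outputs could look like this:

```python
# Summary of the results as a JSON string; write it to a file if needed.
summary_json = no_target_performance.json()
with open("test_results.json", "w") as f:
    f.write(summary_json)

# Visual HTML report: display it in a Jupyter/Colab cell or save it to a file.
no_target_performance.show()
no_target_performance.save_html("test_results.html")
```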
The report shows which tests passed and failed. If you click on details, it shows visuals to help debug a particular issue.
If you are an existing user, you might wonder: Evidently already does things like Data Drift detection. What is the difference?
We have now made the test-based ML monitoring workflow a first-class citizen.
This is an improvement over the existing JSON profile functionality. JSON profiles return a summary of data and model metrics, and many users add them as a step in a prediction pipeline to run a conditional check. (Here is an example of how it currently works with Airflow.) Yes, you could already run pipeline tests this way, but it was not entirely convenient.
First, you needed an extra layer of Python code to define the test condition. JSON profiles provide the metrics, but you have to add your own function to say, "if this metric is over X, return a fail result." It is not hard, but it is not much fun when you want to define many conditions!
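For illustration, a conditional check on top of a JSON profile looks roughly like the sketch below (using the pre-Tests `Profile` API and the dataframes and column mapping from the earlier snippet); the exact key path into the profile output depends on the section and version, so treat it as a placeholder.

```python
import json

from evidently.model_profile import Profile
from evidently.model_profile.sections import DataDriftProfileSection

# Compute drift metrics as a JSON profile (the pre-Tests workflow).
profile = Profile(sections=[DataDriftProfileSection()])
profile.calculate(reference, current, column_mapping=column_mapping)
metrics = json.loads(profile.json())

# The conditional logic lives in your own code: dig out a metric and
# compare it to a threshold. The key path below is illustrative only.
if metrics["data_drift"]["data"]["metrics"]["dataset_drift"]:
    raise ValueError("Data drift detected, failing the pipeline step")
```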
Second, a visual dashboard exists separately from the conditional check. Except for Data Drift, the dashboards don't provide much context about the condition you tested for. Visuals are great for exploration, but in a testing workflow, you first want to see clearly what went wrong.
We have fixed that now.
ML-specific tests and presets.
There are dozens of tests in the library, and we will be adding more.
All checks are tailored to the ML use cases. For example, when you get a new batch of data, you might want to look for constant or almost constant features or evaluate the most important features for statistical drift. We also grouped the individual tests together to make useful templates.
The tests are easy to configure. You can simply select a test and define thresholds or ranges. We also provide multiple ways of performing a similar check, since one might fit your use case better than another.
For example, if you want to detect a change in mean values, you can manually set a margin in absolute values or percentages for each feature. But it might be hard to do that if you have many of them. Instead, you can define the expected range of mean values in standard deviations for all numerical features at once. It is readily available in the DataStability preset.
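As an illustration, the two approaches could look like the sketch below, reusing the dataframes from the earlier snippet. The test class names and condition parameters follow recent Evidently releases and may differ in older ones; the column name is a placeholder.

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMean, TestMeanInNSigmas

stability_checks = TestSuite(tests=[
    # Option 1: set an explicit expected range for one feature by hand.
    TestColumnValueMean(column_name="hours_per_week", gte=35, lte=45),

    # Option 2: let Evidently derive the range from the reference data,
    # e.g., the mean should stay within +/- 2 standard deviations.
    TestMeanInNSigmas(column_name="hours_per_week", n_sigmas=2),
])

stability_checks.run(reference_data=reference, current_data=current)
```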
We also added useful defaults for each test. You can configure everything, but you don't have to!
The checks will run independently as long as you provide the reference dataset (for example, the data used in training). This makes it easy to start monitoring and adjust the parameters as you go.
Some checks will run even without the reference dataset. For example, if you do not provide the reference model quality, Evidently will compare the current model quality against the quality of a dummy model that predicts the most popular class. There are similar algorithmic defaults for other metrics, too!
User-friendly visual reports.
Evidently is known for the informative and helpful visualizations in its dashboards. They are still here: we reworked them into HTML reports for the test suites. When you run the tests, you get a new report format that clearly shows the failed checks and helps with debugging.
To start, we included the checks that make sense when you do not have true labels or ground truth yet.
Here are the presets that come with the first release:
Data Quality. This preset focuses on data quality issues like duplicate rows or null values. It helps detect bad or corrupted data.
Data Stability. This test suite identifies changes in the data or differences between batches. For example, it detects the appearance of new columns or values, a different number of entries, or features far out of range.
Data Drift. While the previous preset relies on descriptive statistics, this one compares feature distributions using statistical tests and distance metrics. By default, it uses the built-in Evidently drift detection logic, which selects the detection method based on data volume and type. You can also swap it for a test or metric of your choice (see the sketch after this list).
NoTargetPerformance. It combines several checks you might want to run when you generate the model predictions, but you do not yet have the actuals or ground truth labels. This includes checking for prediction drift and some of the data quality and stability checks.
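A sketch of that kind of customization is below, assuming the preset and test names from recent Evidently releases (the drift preset class and the `stattest` parameter may be named differently in older versions); the column name and the choice of PSI are illustrative, and the dataframes come from the earlier snippet.

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
from evidently.tests import TestColumnDrift

drift_suite = TestSuite(tests=[
    # Default behavior: Evidently picks a drift detection method per column.
    DataDriftTestPreset(),

    # Or pin a specific method for a specific column, e.g., PSI.
    TestColumnDrift(column_name="age", stattest="psi"),
])

drift_suite.run(reference_data=reference, current_data=current)
```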
We will continue expanding the list of presets and welcome community contributions! Checks focused on Model Performance will come in the next update.
Here are some ideas.
Test-based monitoring for batch ML pipelines.
In essence, tests are batch checks. They are perfectly suited for batch prediction pipelines when you generate your predictions every hour, day, or week. You might check on the production model every time you get the data, generate the predictions, or get the true labels.
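For example, a pipeline step can fail loudly when any test in the suite does not pass, and the orchestrator can then trigger an alert or a fallback. The sketch below is one way to do that; the output structure it relies on (a `tests` list with a `status` field) and the preset name follow recent Evidently releases and may differ in yours.

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import DataStabilityTestPreset


def run_data_checks(reference, current):
    """A pipeline step (e.g., an Airflow task callable) that gates on test results."""
    suite = TestSuite(tests=[DataStabilityTestPreset()])
    suite.run(reference_data=reference, current_data=current)

    results = suite.as_dict()
    failed = [t["name"] for t in results["tests"] if t["status"] == "FAIL"]
    if failed:
        # Raising makes the orchestrator mark the step as failed, so you can
        # alert or switch to a fallback model downstream.
        raise RuntimeError(f"Evidently tests failed: {failed}")
```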
Test-based monitoring for real-time ML applications.
Even if you have a real-time ML service, you might prefer to perform checks in a batch manner. For example, you might generate predictions on demand but only receive the true labels at a defined interval. Checks like data drift also often do not require event-based calculations: instead of running a statistical test after every single new prediction, you can perform it in batches, e.g., every day.
Data and model tests during model development or retraining.
The test suites are very flexible. They also help you to be consistent with your evaluation. For example, you can test each new model version in training against the previous version, be it during the experiment phase or the ongoing model updates.
Evidently provides an interface to define test suites, and you are not limited in how you can use them.
We do hope you will like the testing functionality!
In about a month, there will be a new release that introduces breaking changes. Of course, it comes with some goodies, such as faster visual reports and a cleaner API.
The goal is to separate the four different workflows we already see users performing with Evidently and make each of them more convenient to implement.
The API for the newly introduced Tests will remain fixed. The same goes for the existing minimal version of Monitors for real-time monitoring. But there will be breaking changes to the API of Dashboards and JSON profiles, along with some internal refactoring.
If you are using JSON profiles, we suggest you evaluate if the new testing functionality fits your needs and migrate to it.
If you are using Dashboards, you will need to update a few lines of code after the new release is out. To prepare, you can explicitly pin the current Evidently version (for example, `pip install evidently==0.1.52`) so that your existing code won't break the day the new version ships.
If you want to get an update about the next release, subscribe to the Newsletter and join our Discord.
Do you want to share your opinion on this new release? Does it make your work better, or is something else missing? Please do!
You can join our Discord and post your thoughts and questions there.
Here is how you can try the new test-based monitoring feature:
If you like it, give us a star on GitHub!
Sign up to the User newsletter to get updates on new features, integrations and code tutorials. No spam, just good old release notes.
For any questions, contact us via hello@evidentlyai.com. This is an early release, so let us know of any bugs! You can also open an issue on GitHub.