TL;DR: Meet the new Data Quality report in the Evidently open-source Python library! You can use it to explore your dataset and track feature statistics and behavior changes. It is available either as a visual HTML report or as a JSON profile.
We are happy to announce a new addition to the Evidently open-source Python library: an interactive report on Data Quality.
The Data Quality report helps you explore the dataset and feature behavior, and track and debug data quality once the model is in production.
You can generate the report for a single dataset, for example, during exploratory data analysis.
It will quickly answer questions like:
You can also generate the report for two datasets and contrast the properties of each feature, and of the data as a whole, side by side.
It will then help you answer the comparison questions:
You can use the comparison feature to understand different segments in your data. For example, you can contrast data from one geographic region against another. You can also use it to compare older and newer data batches: for example, when evaluating different model runs.
The report is available in two formats:
You are reading a blog about an early Evidently release. This functionality has since been improved and simplified. You can read more about the newly available reports and additional features in the documentation.
One might ask, how is it different from the Data Drift report in Evidently?
The data drift report performs statistical tests to detect changes in the feature distributions between the two datasets. It helps visualize distributions but does not go into further detail on feature behavior.
The data quality report looks at the descriptive statistics and helps visualize relationships in the data. Unlike the data drift report, the data quality report can also work for a single dataset.
If you are looking to evaluate the data changes for your production model, you might use both reports together as they complement each other.
To generate the report, Evidently needs one or two datasets. If you are working in a notebook, prepare them as pandas DataFrames. If you are using the command-line interface, prepare them as .csv files.
If you use two datasets, the first dataset is "Reference," and the second is "Current." You can also prepare a single dataset and then explicitly state where the rows belong to perform the comparison.
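If you start from a single dataset, the split into "Reference" and "Current" can be done in pandas before passing the data on. Here is a minimal sketch, assuming a date-based cutoff; the toy data, the column names, and the cutoff date are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one DataFrame that holds both periods.
data = pd.DataFrame(
    {
        "date": pd.date_range("2022-01-01", periods=6, freq="D"),
        "temperature": [3.1, 2.8, 4.0, 5.2, 6.1, 5.9],
        "region": ["north", "north", "south", "south", "north", "south"],
    }
)

# State explicitly which rows belong to which dataset.
# The cutoff date is an assumption chosen for this example.
cutoff = pd.Timestamp("2022-01-04")
reference = data[data["date"] < cutoff]
current = data[data["date"] >= cutoff]

print(len(reference), len(current))
```

Any other rule that assigns each row to exactly one of the two frames (a region, a model version, a batch ID) works the same way.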
Once you import Evidently and its components, you can spin up your report with just a couple of lines of code:
You might need to specify column mapping to ensure all features are processed correctly. Otherwise, Evidently derives the feature type automatically from the pandas data type.
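To see why an explicit mapping can matter, here is a rough sketch of what type inference from pandas dtypes might look like. This is not Evidently's actual logic: the `infer_feature_types` helper and the toy columns are hypothetical and only illustrate the idea.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def infer_feature_types(df):
    """Group column names by a rough type guess based on the pandas dtype."""
    types = {"numerical": [], "categorical": [], "datetime": []}
    for name in df.columns:
        if is_datetime64_any_dtype(df[name]):
            types["datetime"].append(name)
        elif is_numeric_dtype(df[name]):
            types["numerical"].append(name)
        else:
            types["categorical"].append(name)
    return types


# Hypothetical toy data.
df = pd.DataFrame(
    {
        "age": [25, 31, 47],
        "city": ["Oslo", "Lima", "Pune"],
        "zip_code": [10001, 94105, 60601],  # integer dtype, but really a category
        "signup": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
    }
)

inferred = infer_feature_types(df)
print(inferred)
```

Note how `zip_code` lands among the numerical features because of its integer dtype. Cases like this are where you would want to state the feature type explicitly.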
Pro tip: if you have a lot of data, you might want to apply some sampling strategy or generate the report only for some of the features first.
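Both ideas from the tip can be done in pandas before you generate the report. A minimal sketch, assuming simple random sampling is good enough for your data; the frame, the sample size, and the feature names are placeholders.

```python
import pandas as pd

# Hypothetical dataset with three features.
df = pd.DataFrame(
    {
        "feature_a": range(1000),
        "feature_b": [i % 7 for i in range(1000)],
        "feature_c": [i * 0.5 for i in range(1000)],
    }
)

# Random row sample; a fixed seed keeps the sample reproducible.
sampled = df.sample(n=100, random_state=42)

# Keep only the features you want to look at first.
subset = sampled[["feature_a", "feature_b"]]

print(subset.shape)
```

If the data has rare segments you care about, a stratified sample may be a better fit than a uniform one.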
Let's have a look at what's inside!
The first table gives a quick overview of the complete dataset (or two!). You can immediately spot things like a high share of missing values or constant features.
What's cool here: note "almost missing" and "almost constant" rows. It is often relevant to detect such issues for real-world datasets to sort out features that would be hard to rely on.
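To make the idea concrete, here is a small sketch of how such flags could be computed by hand. The 95% threshold and the `flag_columns` helper are assumptions made for this example, not a description of how the report computes them.

```python
import pandas as pd

THRESHOLD = 0.95  # assumed cutoff for "almost", chosen for illustration


def flag_columns(df, threshold=THRESHOLD):
    """Return per-column shares of missing and most-common values, with flags."""
    rows = []
    for name in df.columns:
        col = df[name]
        missing_share = float(col.isna().mean())
        counts = col.value_counts(dropna=True)
        # Share of rows taken by the single most common non-missing value.
        top_share = float(counts.iloc[0] / len(col)) if len(counts) else 0.0
        rows.append(
            {
                "column": name,
                "missing_share": missing_share,
                "top_value_share": top_share,
                "almost_missing": missing_share >= threshold,
                "almost_constant": top_share >= threshold,
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Hypothetical toy data with 100 rows.
df = pd.DataFrame(
    {
        "sensor": [1.0, 2.0, 3.0] + [None] * 97,  # 97% missing
        "country": ["US"] * 98 + ["CA"] * 2,      # 98% one value
        "amount": list(range(100)),               # healthy feature
    }
)

flags = flag_columns(df)
print(flags)
```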
Next, you will see a statistical overview and a set of visualizations for each feature. They include descriptive statistics, feature distribution visuals, distribution of the feature in time, and distribution of the feature in relation to the target.
What's cool here:
What's more, each plot is interactive! Evidently uses Plotly on the back end, and you can zoom in and out as needed, or switch between logarithmic and linear scale for a feature distribution, for example.
For example, here is how the summary widget for a numerical feature might look:
Here is the numerical feature distribution in time that highlights the values that belong to the reference and current distribution:
The feature by target functionality helps explore the relationship between the feature and the target and its changes between the two datasets. Here is an example of a categorical feature:
The report also generates a table summary of pairwise feature correlations and correlation heat maps.
What's cool here:
This way, you can quickly grasp the properties of your dataset and select the features that need a closer look (or should be excluded from the modeling).
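If you want to reproduce a similar pairwise summary outside the report, plain pandas gets you most of the way. The sketch below uses Pearson correlation (the pandas default) on made-up data; the report's own choice of correlation measures may differ.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [2, 4, 6, 8, 10],  # perfectly correlated with x
        "z": [5, 3, 4, 1, 2],
    }
)

corr = df.corr()  # Pearson correlation matrix

# Flatten the upper triangle into a table of feature pairs.
cols = list(corr.columns)
pairs = []
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        value = float(corr.loc[a, b])
        pairs.append(
            {
                "feature_a": a,
                "feature_b": b,
                "correlation": value,
                "abs_correlation": abs(value),
            }
        )

# Strongest relationships first, regardless of sign.
top_pairs = pd.DataFrame(pairs).sort_values("abs_correlation", ascending=False)
print(top_pairs)
```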
And of course, the visuals:
You can check out the complete documentation for more details and examples.
Of course! All Evidently reports can be customized.
You can mix and match the existing widgets however you like or even add a custom widget. Here is the detailed documentation on customization options.
Business as usual! You can get the report output as JSON or as a Python dictionary.
You can use it however you like. For example, you can generate and log the data quality snapshot for each model run and save it for future evaluation. You can also build a conditional workflow around it: maybe generate an alert or a visual report, for example, if you get a high number of new categorical values for a given feature.
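As an illustration of such a conditional workflow, here is a hand-rolled sketch that checks for new categorical values and builds a small JSON snapshot. It does not use the Evidently profile or its schema: the `new_categories` helper, the `MAX_NEW_VALUES` limit, and the snapshot fields are all hypothetical.

```python
import json

import pandas as pd

# Hypothetical reference and current batches.
reference = pd.DataFrame({"payment_type": ["card", "cash", "card", "cash"]})
current = pd.DataFrame({"payment_type": ["card", "crypto", "voucher", "cash"]})

MAX_NEW_VALUES = 1  # assumed alerting limit, chosen for illustration


def new_categories(reference, current, column):
    """Return the values seen in `current` but never in `reference`."""
    seen = set(reference[column].dropna().unique())
    observed = set(current[column].dropna().unique())
    return sorted(observed - seen)


new_values = new_categories(reference, current, "payment_type")

# A small snapshot you could log for each model run.
snapshot = {
    "feature": "payment_type",
    "new_values": new_values,
    "alert": len(new_values) > MAX_NEW_VALUES,
}
snapshot_json = json.dumps(snapshot)

if snapshot["alert"]:
    print("Too many new categories:", new_values)
```

In a real pipeline, the alert branch is where you might send a notification or trigger the visual report for a closer look.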
Here are a few ideas on how to use the Data Quality report:
Go to GitHub, pip install evidently, and give it a spin! Here are the example notebooks you can start with.
If you need any help or want to share your feedback, join our Discord community!
For any questions, contact us via hello@evidentlyai.com. This is an early release, so let us know about any bugs! You can also open an issue on GitHub.