All blogs about #ml-testing

Community

How companies evaluate LLM systems: 7 examples from Asana, GitHub, and more

We put together 7 examples of how top companies like Asana and GitHub run LLM evaluations. They share how they approach the task, what methods and metrics they use, what they test for, and their learnings along the way.

Community

10 LLM safety and bias benchmarks

LLM safety benchmarks help to ensure models are robust and reliable. In this blog, we highlight 10 key safety and bias LLM benchmarks that help assess and improve LLM reliability.

Evidently

Evidently 0.6.3: Open-source RAG evaluation and testing

Evidently open-source now has more tools for evaluating RAG. You can score context relevance, evaluate generation quality, and use different LLMs as evaluators.

LLM Evals

10 RAG examples and use cases from real companies

RAG helps make LLM systems more accurate and reliable. We compiled 10 real-world examples of how companies use RAG to improve customer experience, automate routine tasks, and improve productivity.

Evidently

Upcoming Evidently API Changes

The Evidently API is evolving — and it’s getting better! We are updating the open-source Evidently API to make it simpler, more flexible, and easier to use. Explore the new features.

Community

AI regulations: EU AI Act, AI Bill of Rights, and more

In this guide, we’ll discuss key AI regulations, such as the EU AI Act and the Blueprint for AI Bill of Rights, and explain what they mean for teams building AI-powered products.

Community

LLM hallucinations and failures: lessons from 4 examples

Real-world examples of LLM hallucinations and other failures that can occur in LLM-powered products in the wild, such as prompt injection and out-of-scope usage scenarios.

Community

When AI goes wrong: 10 examples of AI mistakes and failures

From being biased to making things up, there are numerous instances where we’ve seen AI go wrong. In this post, we’ll explore ten notable AI failures where the technology didn’t perform as expected.

Tutorials

Wrong but useful: an LLM-as-a-judge tutorial

This tutorial shows how to create, tune, and use LLM judges. We'll make a toy dataset and assess correctness and verbosity. You can apply the same workflow for other criteria.

Evidently

Meet Evidently Cloud for AI Product Teams

We are launching Evidently Cloud, a collaborative AI observability platform built for teams developing products with LLMs. It includes tracing, datasets, evals, and a no-code workflow. Check it out!

Community

45 real-world LLM applications and use cases from top companies

How do companies use LLMs in production? We compiled 45 real-world LLM applications from companies that share their learnings from building LLM systems.

Tutorials

LLM regression testing workflow step by step: code tutorial

In this tutorial, we introduce the end-to-end workflow of LLM regression testing. You will learn how to run regression testing as a process and build a dashboard to monitor the test results.

Tutorials

Watch the language: A tutorial on regression testing for LLMs

In this tutorial, you will learn how to systematically check the quality of LLM outputs. You will work with issues like changes in answer content, length, or tone, and see which methods can detect them.

Community

MLOps Zoomcamp recap: how to monitor ML models in production?

Our CTO Emeli Dral was an instructor for the ML Monitoring module of MLOps Zoomcamp 2024, a free MLOps course. We summarized the ML monitoring course notes and linked to all the practical videos.

Community

AI, Machine Learning, and Data Science conferences to attend in 2025

We put together the most interesting conferences on AI, Machine Learning, and Data Science in 2025. And the best part? Some of them are free to attend or publish the content after the event.

Evidently

Evidently 0.4.25: An open-source tool to evaluate, test and monitor your LLM-powered apps

The Evidently open-source Python library now supports evaluations for LLM-based applications, including RAGs and chatbots. You can compare, test, and monitor your LLM system quality from development to production.

Evidently

7 new features at Evidently: ranking metrics, data drift on Spark, and more

Did you miss some of the latest updates to the Evidently open-source Python library? We summed up a few recently shipped features in one blog.

Tutorials

Batch inference and ML monitoring with Evidently and Prefect

In this tutorial, you will learn how to run batch ML model inference and deploy a model monitoring dashboard for production ML models using open-source tools.

Community

MLOps courses to take in 2023

Looking for MLOps courses to attend in 2023? We put together five great online MLOps courses for data scientists and ML engineers. They are free to join or publish their content for everyone to access without a fee.

ML Monitoring

An MLOps story: how DeepL monitors ML models in production

How do different companies start and scale their MLOps practices? In this blog, we share a story of how DeepL monitors ML models in production using open-source tools.

Tutorials

How to stop worrying and start monitoring your ML models: a step-by-step guide

A beginner-friendly MLOps tutorial on how to evaluate ML data quality, data drift, model performance in production, and track them all over time using open-source tools.

Evidently

Evidently 0.4: an open-source ML monitoring dashboard to track all your models

Evidently 0.4 is here! Meet a new feature: Evidently user interface for ML monitoring. You can now track how your ML models perform over time and bring all your checks to one central dashboard.

Tutorials

Monitoring unstructured data for LLM and NLP with text descriptors

How do you monitor unstructured text data? In this code tutorial, we’ll explore how to track interpretable text descriptors that help assign specific properties to every text.

Tutorials

A simple way to create ML Model Cards in Python

In this code tutorial, you will learn how to create interactive visual ML model cards to document your models and data using Evidently, an open-source Python library.

Tutorials

ML serving and monitoring with FastAPI and Evidently

In this code tutorial, you will learn how to set up an ML monitoring system for models deployed with FastAPI. This is a complete deployment blueprint for ML serving and monitoring using open-source tools.

ML Monitoring

Shift happens: we compared 5 methods to detect drift in ML embeddings

Monitoring embedding drift is relevant for the production use of LLM and NLP models. We ran experiments to compare 5 drift detection methods. Here is what we found.

Community

AMA with Lina Weichbrodt: ML monitoring, LLMs, and freelance ML engineering

In this blog, we recap the Ask-Me-Anything session with Lina Weichbrodt. We chatted about ML monitoring and debugging, adopting LLMs, and the challenges of being a freelance ML engineer.

Tutorials

Batch ML monitoring blueprint: Evidently, Prefect, PostgreSQL, and Grafana

In this code tutorial, you will learn how to run batch ML model inference, collect data and ML model quality monitoring metrics, and visualize them on a live dashboard.

Tutorials

How to set up ML monitoring with email alerts using Evidently and AWS SES

In this tutorial, you will learn how to implement Evidently checks as part of an ML pipeline and send email notifications based on a defined condition.

ML Monitoring

An MLOps story: how Wayflyer creates ML model cards

How do different companies start and scale their MLOps practices? In this blog, we share a story of how Wayflyer creates ML model cards using open-source tools.

Tutorials

A tutorial on building ML and data monitoring dashboards with Evidently and Streamlit

In this tutorial, you will learn how to create a data quality and ML model monitoring dashboard using the two open-source libraries: Evidently and Streamlit.

Community

AMA with Stefan Krawczyk: from building ML platforms at Stitch Fix to an open-source startup on top of the Hamilton framework

In this blog, we recap the Ask-Me-Anything session with Stefan Krawczyk. We chatted about how to build an ML platform and what data science teams do wrong about ML dataflows.

ML Monitoring

How to build an ML platform? Lessons from 10 tech companies

How to approach building an internal ML platform if you’re not Google? We put together stories from 10 companies that shared their platforms’ design and learnings along the way.

Tutorials

Monitoring NLP models in production: a tutorial on detecting drift in text data

In this tutorial, we will explore issues affecting the performance of NLP models in production, imitate them on an example toy dataset, and show how to monitor and debug them.

Community

AMA with Neal Lathia: data science career tracks, shipping ML models to production, and Monzo ML stack

In this blog, we recap the Ask-Me-Anything session with Neal Lathia. We chatted about career paths of an ML Engineer, building and expanding ML teams, Monzo’s ML stack, and 2023 ML trends.

Evidently

Evidently 0.2.2: Data quality monitoring and drift detection for text data

Meet the new feature: data quality monitoring and drift detection for text data! You can now use the Evidently open-source Python library to evaluate, test, and monitor text data.

Community

50 best machine learning blogs from engineering teams

Want to know how companies with top engineering teams do machine learning? We put together a list of the best machine learning blogs from companies that share specific ML use cases, lessons learned from building ML platforms, and insights into the tech they use.

Community

AMA with Ben Wilson: planning ML projects, AutoML, and deploying at scale

In this blog, we recap the Ask-Me-Anything session with Ben Wilson. We chatted about AutoML use cases, deploying ML models to production, and how one can learn about ML engineering.

Evidently

Meet Evidently 0.2, the open-source ML monitoring tool to continuously check on your models and data

We are thrilled to announce our latest and largest release: Evidently 0.2. In this blog, we give an overview of what Evidently is now.

Evidently

Evidently feature spotlight: NoTargetPerformance test preset

In this series of blogs, we are showcasing specific features of the Evidently open-source ML monitoring library. Meet NoTargetPerformance test preset!

Community

AMA with Rick Lamers: the evolution of data orchestration tools and the perks of open source

In this blog, we recap the Ask-Me-Anything session with Rick Lamers, where we chatted about the evolution of orchestration tools, their place within the MLOps landscape, the future of data pipelines, and building an open-source project amidst the economic crisis.

Community

How to contribute to open source as a Data Scientist, and Hacktoberfest 2022 recap

Now that Hacktoberfest 2022 is over, it’s time to celebrate our contributors, look back at what we’ve achieved together, and share what we’ve learned during this month of giving back to the community through contributing to open source.

Evidently

Evidently 0.1.59: Migrating from Dashboards and JSON profiles to Reports

In Evidently v0.1.59, we moved the existing dashboard functionality to the new API. Here is a quick guide on migrating from the old to the new API. In short, it is very, very easy.

ML Monitoring

ML model maintenance. “Should I throw away the drifting features”?

Imagine you have a machine learning model in production, and some features are very volatile. Their distributions are not stable. What should you do with those? Should you just throw them away?

Community

AMA with Jacopo Tagliabue: reasonable scale ML, testing recommendation systems, and hot DataOps

In this blog, we recap the Ask-Me-Anything session with Jacopo Tagliabue, where we chatted about ML at a reasonable scale, testing RecSys, MLOps anti-patterns, what’s hot in DataOps, fundamentals in MLOps, and more.

Community

AMA with Bo Yu and Sean Sheng: why ML deployment is hard

In this blog, we recap the Ask-Me-Anything session with Bozhao Yu and Sean Sheng, where we chatted about why deploying a model is hard, beginner mistakes and how to avoid them, the challenges of building an open-source product, and BentoML’s roadmap.

ML Monitoring

Pragmatic ML monitoring for your first model. How to prioritize metrics?

There is an overwhelming set of potential metrics to monitor. In this blog, we'll try to introduce a reasonable hierarchy.

Community

AMA with Doris Xin: AutoML, modern data stack, and reunifying the tools

In this blog, we recap the Ask-Me-Anything session with Doris Xin, which covered the roles of Data Scientists and Data Engineers in an ML cycle, automation, MLOps tooling, bridging the gap between development and production, and more.

Community

AMA with Fabiana Clemente: synthetic data, data-centric approach, and rookie mistakes to avoid

We recap the Ask-Me-Anything session with Fabiana Clemente, which covered synthetic data, its quality, beginner mistakes in data generation, the data-centric approach, and how well companies are doing in getting there.

ML Monitoring

Monitoring ML systems in production. Which metrics should you track?

When one mentions "ML monitoring," this can mean many things. Are you tracking service latency? Model accuracy? Data quality? This blog organizes everything one can look at in a single framework.

Evidently

Evidently 0.1.52: Test-based ML monitoring with smart defaults

Meet the new feature in the Evidently open-source Python library! You can easily integrate data and model checks into your ML pipeline with a clear success/fail result. It comes with presets and defaults to make the configuration painless.

ML Monitoring

Which test is the best? We compared 5 methods to detect data drift on large datasets

We ran an experiment to help build an intuition on how popular drift detection methods behave. In this blog, we share the key takeaways and the code to run the tests on your data.

Community

AMA with Matt Squire: what makes a good open-source tool great, and the future of MLOps

In this blog, we recap the Ask-Me-Anything session with Matt Squire, which covered MLOps maturity and future, how MLOps fits in data-centric AI, and why open source wins.

Tutorials

How to set up ML Monitoring with Evidently. A tutorial from CS 329S: Machine Learning Systems Design.

Our CTO Emeli Dral gave a tutorial on how to use Evidently at the Stanford Winter 2022 course CS 329S on Machine Learning Systems Design. Here is the written version of the tutorial and a code example.

Community

AMA with Hamza Tahir: MLOps trends, tools, and building an open-source startup

In this blog, we recap the Ask-Me-Anything session with Hamza Tahir, which covered MLOps trends and tools, the future of real-time ML, and building an open-source startup.

Community

Evidently Community Call #2 Recap: custom text comments, color schemes and a library of statistical tests

In this blog, we recap the second Evidently Community Call, which covered the recent feature updates in our open-source ML monitoring tool.

Community

AMA with Alexey Grigorev: MLOps tools, best practices for ML projects, and tips for community builders

In this blog, we recap the Ask-Me-Anything session with Alexey Grigorev, which covered all things production machine learning, from tools to workflow, and even a bit on community building.

ML Monitoring

Q&A: ML drift that matters. "How to interpret data and prediction drift together?"

Data and prediction drift often need contextual interpretation. In this blog, we walk you through possible scenarios for when you detect these types of drift together or independently.

Evidently

Evidently 0.1.46: Evaluating and monitoring data quality for ML models.

Meet the new Data Quality report in the Evidently open-source Python library! You can use it to explore your dataset and track feature statistics and behavior changes.

Evidently

7 highlights of 2021: A year in review for Evidently AI

We are building an open-source tool to evaluate, monitor, and debug machine learning models in production. Here is a look back at what has happened at Evidently AI in 2021.

Evidently

Evidently 0.1.35: Customize it! Choose the statistical tests, metrics, and plots to evaluate data drift and ML performance.

Now, you can easily customize the pre-built Evidently reports to add your metrics, statistical tests or change the look of the dashboards with a bit of Python code.

ML Monitoring

Q&A: Do I need to monitor data drift if I can measure the ML model quality?

Even if you can calculate the model quality metric, monitoring data and prediction drift can often be useful. Let’s consider a few examples where it makes sense to track the distributions of the model inputs and outputs.

ML Monitoring

"My data drifted. What's next?" How to handle ML model drift in production.

What can you do once you detect data drift for a production ML model? Here is an introductory overview of the possible steps.

Evidently

Evidently 0.1.30: Data drift and model performance evaluation in Google Colab, Kaggle Kernel, and Deepnote

Now, you can use Evidently to display dashboards not only in Jupyter notebooks but also in Colab, Kaggle, and Deepnote.

ML Monitoring

Q&A: What is the difference between outlier detection and data drift detection?

When monitoring ML models in production, we can apply different techniques. Data drift and outlier detection are among those. What is the difference? Here is a visual explanation.

Evidently

Real-time ML monitoring: building live dashboards with Evidently and Grafana

You can use Evidently together with Prometheus and Grafana to set up live monitoring dashboards. We created an integration example for Data Drift monitoring. You can easily configure it for use with your existing ML service.

Tutorials

How to detect, evaluate and visualize historical drifts in the data

You can look at historical drift in data to understand how your data changes and choose the monitoring thresholds. Here is an example with Evidently, Plotly, MLflow, and some Python code.

ML Monitoring

To retrain, or not to retrain? Let's get analytical about ML model updates

Is it time to retrain your machine learning model? Even though data science is all about… data, the answer to this question is surprisingly often based on a gut feeling. Can we do better?

Evidently

Evidently 0.1.17: Meet JSON Profiles, an easy way to integrate Evidently in your prediction pipelines

Now, you can use Evidently to generate JSON profiles. It makes it easy to send metrics and test results elsewhere.

ML Monitoring

Can you build a machine learning model to monitor another model?

Can you train a machine learning model to predict your model’s mistakes? Nothing stops you from trying. But chances are, you are better off without it.

Tutorials

What Is Your Model Hiding? A Tutorial on Evaluating ML Models

There is more to performance than accuracy. In this tutorial, we explore how to evaluate the behavior of a classification model before production use.

Evidently

Evidently 0.1.8: Machine Learning Performance Reports for Classification Models

You can now use Evidently to analyze the performance of classification models in production and explore the errors they make.

Tutorials

How to break a model in 20 days. A tutorial on production model analytics

What can go wrong with an ML model in production? Here is a story of how we trained a model, simulated deployment, and analyzed its gradual decay.

Evidently

Evidently 0.1.6: How To Analyze The Performance of Regression Models in Production?

You can now use Evidently to analyze the performance of production ML models and explore their weak spots.

Evidently

Evidently 0.1.4: Analyze Target and Prediction Drift in Machine Learning Models

Our second report is released! Now, you can use Evidently to explore the changes in your target function and model predictions.

Evidently

Introducing Evidently 0.0.1 Release: Open-Source Tool To Analyze Data Drift

We are excited to announce our first release. You can now use the Evidently open-source Python package to estimate and explore data drift for machine learning models.

ML Monitoring

Machine Learning Monitoring, Part 5: Why You Should Care About Data and Concept Drift

No model lasts forever. While the data quality can be fine, the model itself can start degrading. A few terms are used in this context. Let’s dive in.

ML Monitoring

Machine Learning Monitoring, Part 4: How To Track Data Quality and Data Integrity

A bunch of things can go wrong with the data that goes into a machine learning model. Our goal is to catch them on time.

ML Monitoring

Machine Learning Monitoring, Part 3: What Can Go Wrong With Your Data?

Garbage in is garbage out. Input data is a crucial component of a machine learning system. Whether or not you have immediate feedback, your monitoring starts here.

ML Monitoring

Machine Learning Monitoring, Part 2: Who Should Care, and What We Are Missing

Who should care about machine learning monitoring? The short answer: everyone who cares about the model's impact on business.

ML Monitoring

Machine Learning Monitoring, Part 1: What It Is and How It Differs

Congratulations! Your machine learning model is now live. Many models never make it that far. Some claim that as many as 87% are never deployed.
