The Evidently open-source Python library now includes more tools for evaluating RAG systems.
With the latest update, you can evaluate both the retrieval and generation sides of your RAG system, with or without ground truth.
These evaluations integrate with the Evidently framework, so you can generate Reports, set up Test Suites, and build live monitoring Dashboards.
Want to try it now? Check out the code example.
For details, keep reading.
RAG (Retrieval-Augmented Generation) combines search with generative models. It's used in chatbots, Q&A systems, and other AI products to fetch relevant data before generating a response instead of relying only on what the LLM learned from its training data.
Like any AI system, RAG needs quality evaluation, both during development and in production.
This evaluation has two parts: retrieval and generation.
Evaluating these separately helps diagnose specific problems. Sometimes, RAG retrieves the right data but doesn't use it effectively. Other times, it retrieves the wrong data, leading the LLM to "hallucinate" to fill in the gaps. By running retrieval and generation checks separately, you can pinpoint where things go wrong.
Of course, you can also run both sets of checks at once. But first, let's break down how each of them works.
Retrieval is a well-known problem in search and recommendation systems. Many established ranking metrics exist, so it makes sense to reuse them for RAG.
Traditional retrieval evaluation relies on ground truth. You first prepare a set of test queries and, for each one, the documents that are known to be relevant.
Then, you check if the system can find and rank these known documents correctly. Having a labeled dataset lets you run evaluations repeatedly as you refine your system. This approach works for RAG, too. If you can, use it: label a high-quality dataset with queries and the relevant items for each, and apply ranking metrics for evaluation.
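To make the classic setup concrete, here is a minimal sketch of computing ranking metrics against such a labeled set. The queries, document IDs, and retrieved rankings are made up for illustration.

```python
# Classic ground-truth retrieval evaluation: known relevant documents per query.
# All data here is illustrative; in practice the labels come from your own annotation.

labeled_set = {
    "how do I install the tool?": {"doc_12"},
    "can I export results to JSON?": {"doc_7", "doc_9"},
}

# What the retriever actually returned, in ranked order.
retrieved = {
    "how do I install the tool?": ["doc_3", "doc_12", "doc_8"],
    "can I export results to JSON?": ["doc_9", "doc_1", "doc_4"],
}

def recall_at_k(relevant: set, ranked: list, k: int = 3) -> float:
    return len(relevant & set(ranked[:k])) / len(relevant)

def reciprocal_rank(relevant: set, ranked: list) -> float:
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

for query, relevant in labeled_set.items():
    ranked = retrieved[query]
    print(query, recall_at_k(relevant, ranked), reciprocal_rank(relevant, ranked))
```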
But in modern RAG applications, ground truth is often the bottleneck.
You are not always working with a document index. Instead, you deal with chunks: small text segments pulled from different sources. Deciding how to create these chunks is actually one of the problems you solve during experimentation: how to split the source documents, how large to make each chunk, whether chunks should overlap, and so on.
Each decision may impact retrieval quality. That's why you need evaluations: to quantify whether changes make things better or worse.
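As a toy illustration of one such decision, here is a naive fixed-size chunker with overlap. The size and overlap values are arbitrary assumptions; real pipelines often split by sentences, headings, or tokens instead.

```python
# A naive fixed-size chunker with overlap: one of many possible chunking choices.
# Chunk size and overlap are arbitrary; changing them changes what retrieval can find.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

document = "Evidently is an open-source Python library for evaluating AI systems. " * 25
print(f"{len(chunk_text(document))} chunks of up to 500 characters")
```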
Since chunking strategies may change, what counts as a "correct source to find" also shifts. That makes it hard to fix a ground truth dataset once and for all. And it's generally quite tedious to label small text chunks as relevant or not. One way to solve this problem is using LLMs to evaluate relevance. This approach has already been tested in search, where LLMs help predict search preferences (Thomas et al., 2023).
We implemented this same idea in Evidently.
You start with a test set of valid questions: queries that should be answerable using your data. You can create these manually or generate them synthetically.
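One way to generate questions synthetically is to ask an LLM to write questions that a given passage can answer. Here is a rough sketch using the OpenAI client; this is not Evidently's built-in generator, and the prompt, passage, and model name are illustrative assumptions.

```python
# A rough sketch of synthetic question generation with an LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

source_passage = (
    "Evidently is an open-source Python library for evaluating, testing, "
    "and monitoring ML and LLM systems."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": (
            "Write 3 questions a user could ask that are answerable "
            f"from the following documentation passage:\n\n{source_passage}"
        ),
    }],
)
print(response.choices[0].message.content)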
For example, if we were to test a chatbot that answers queries about Evidently using our documentation, we could include different questions about the tool:
Next, you need to run these questions through your RAG system and capture the retrieved context for each query.
Once you have this data, you can use the Evidently Python library to assess retrieval quality.
Overall context quality. If you retrieve a single chunk or merge all chunks into one context, you can directly check its validity: "Does this context contain enough information to answer the question?" We provide a prebuilt LLM judge evaluator that gives a label and explanation.
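Conceptually, such a judge is a structured prompt plus label parsing. Here is a rough sketch of the idea, not Evidently's prebuilt evaluator; the prompt wording, labels, and model name are assumptions.

```python
# A rough sketch of a context-quality judge: does the context answer the question?
from openai import OpenAI

client = OpenAI()

question = "How do I run a Report in Evidently?"
context = "Reports are created by passing a dataset and a list of metrics ..."

judge_prompt = (
    "Does the CONTEXT contain enough information to answer the QUESTION?\n"
    "Answer with one label, VALID or INVALID, followed by a one-sentence explanation.\n\n"
    f"QUESTION: {question}\nCONTEXT: {context}"
)

verdict = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": judge_prompt}],
)
print(verdict.choices[0].message.content)  # e.g. "VALID - the context describes ..."
```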
Here is a simple toy example:
Per-chunk relevance. But often RAG retrieves multiple chunks, either from different documents or from different parts of the same document. In this case, Evidently lets you score each retrieved text individually to confirm that it contains useful information for answering the question. You can use a built-in relevance judge or check semantic similarity to the query.
Once all chunks are scored, you can aggregate the result for a specific query. The built-in approach we recommend is to check whether at least one retrieved chunk is relevant, similar to how Hit Rate is calculated.
For example, here we retrieve three chunks for each query. (All are listed inside the "context" column.) We score them independently. If at least one of them helps answer the question, we count it as a Hit.
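In code, this aggregation is straightforward. A minimal sketch with made-up per-chunk relevance labels:

```python
# Aggregating per-chunk relevance into a Hit for each query.
# The labels below are illustrative; in practice they come from an LLM judge
# or a semantic similarity threshold.

per_chunk_relevance = {
    "how do I install the tool?": [False, True, False],    # 3 retrieved chunks
    "can I deploy it on-premises?": [False, False, False],
}

hits = {query: any(labels) for query, labels in per_chunk_relevance.items()}
hit_rate = sum(hits.values()) / len(hits)

print(hits)      # per query: did at least one relevant chunk come back?
print(hit_rate)  # share of queries with at least one relevant chunk
```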
If you prefer, you can aggregate relevance scores across all chunks:
You can also export results to compute dataset ranking metrics like NDCG or MRR.
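For instance, with exported per-chunk relevance labels and the retriever's ranking scores, a sketch using scikit-learn might look like this; all numbers are illustrative.

```python
# Computing dataset-level ranking metrics from exported per-chunk results.
import numpy as np
from sklearn.metrics import ndcg_score

# Relevance of each retrieved chunk (e.g. judge labels mapped to 0/1), in ranked order.
true_relevance = np.array([
    [0, 1, 0],
    [1, 1, 0],
])
# The scores the retriever used for ranking (e.g. vector similarity).
retrieval_scores = np.array([
    [0.9, 0.7, 0.4],
    [0.8, 0.6, 0.5],
])

print("NDCG@3:", ndcg_score(true_relevance, retrieval_scores, k=3))

def mrr(relevance_rows: np.ndarray) -> float:
    # Reciprocal rank of the first relevant chunk per query, averaged.
    ranks = []
    for row in relevance_rows:
        hit_positions = np.flatnonzero(row)
        ranks.append(1.0 / (hit_positions[0] + 1) if hit_positions.size else 0.0)
    return float(np.mean(ranks))

print("MRR:", mrr(true_relevance))
```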
With this setup, you can continuously evaluate retrieval quality and re-run tests whenever you update your RAG system, such as when trying different vector databases.
Since you don't need ground truth, you can also use it for production monitoring. For example, you can track average relevance scores over time. If certain groups of queries score lower, it probably means your RAG database doesn't have enough useful data in that area.
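In practice, this can be as simple as aggregating logged relevance scores by date and by query topic. A minimal pandas sketch; the column names and topic labels are assumptions based on whatever metadata you log.

```python
# Tracking average relevance over time and by query topic.
import pandas as pd

logs = pd.DataFrame({
    "date": ["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-02"],
    "topic": ["installation", "billing", "installation", "billing"],
    "relevance": [0.92, 0.41, 0.88, 0.37],
})

print(logs.groupby("date")["relevance"].mean())   # trend over time
print(logs.groupby("topic")["relevance"].mean())  # weak spots in the knowledge base
```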
Generation evaluation answers a simple question: are the responses good? There are two ways to assess this: one requires ground truth, one doesn't.
During testing, you can collect ground truth data and compare new outputs against it. In this case, you need a set of example questions and an approved target answer for each.
You can create this dataset using LLMs, too. In this case, you'd first pick individual passages from your source texts and then generate questions answerable from them.
A good evaluation dataset makes a big difference. If you generate it synthetically, it is useful to let domain experts review it and edit some questions and answers as needed. This helps make your evals trustworthy and representative.
Then, you need to run the questions through your RAG system and compare the new responses against the ground truth answers.
Evidently provides multiple ways to run this comparison, including LLM-based matching, semantic similarity, and BERTScore.
Here is how this matching can look in a toy example:
In this case, we don't look at the context: we evaluate only the final response as it is.
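As an illustration of the semantic similarity route, here is a rough sketch using sentence-transformers; the model choice and example texts are assumptions, and Evidently's own implementation may differ.

```python
# Comparing a new response against a ground truth answer via semantic similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

reference_answers = ["You can export a Report as HTML or JSON."]
new_responses = ["Reports can be saved to HTML or exported as a JSON file."]

ref_emb = model.encode(reference_answers, convert_to_tensor=True)
new_emb = model.encode(new_responses, convert_to_tensor=True)

similarity = util.cos_sim(new_emb, ref_emb)
print(float(similarity[0][0]))  # close to 1.0 means the response matches the reference
```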
Once your system is live, you won't have ground truth for every answer. Instead, you need to rely on reference-free LLM evaluation methods.
You can also use this approach in testing, for example, after generating a diverse set of plausible questions without predefined answers.
This approach helps augment your test cases by creating questions that users might ask but that are not directly answerable from your knowledge source, or that have answers scattered across several texts. This makes the tests more realistic! And in production, users will generate these questions for you.
To evaluate the results, you can use reference-free LLM judges. Two of the most important criteria we suggest looking at are:
For example, here is how the faithfulness evaluation looks for our toy example:
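In essence, a faithfulness judge checks whether the response is supported by the retrieved context. Here is a rough conceptual sketch, not Evidently's prebuilt judge; the prompt wording, labels, and model name are assumptions.

```python
# A rough sketch of a reference-free faithfulness check:
# is the response supported by the retrieved context?
from openai import OpenAI

client = OpenAI()

context = "Evidently Reports can be exported as HTML or JSON."
response_text = "You can export Reports as HTML, JSON, or PDF."

judge_prompt = (
    "Is the RESPONSE fully supported by the CONTEXT? "
    "Answer FAITHFUL or UNFAITHFUL, then explain briefly.\n\n"
    f"CONTEXT: {context}\nRESPONSE: {response_text}"
)

verdict = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": judge_prompt}],
)
print(verdict.choices[0].message.content)  # e.g. "UNFAITHFUL - PDF export is not mentioned ..."
```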
You can add other useful checks over the final response, such as:
You have all these options in Evidently.
You can combine all the evaluations you picked and get a summary report with score distributions.
This gives you a well-rounded evaluation of both generation and retrieval quality.
For example, here's what you might see:
In this scenario, the system retrieves the right data but doesnât use it effectively. The next step would be improving your generation prompt to ensure responses stay true to the provided information.
You can also set up explicit pass/fail tests based on expected score distributions.
In this case, we expect all retrieved contexts to be valid and all responses to be faithful. But you can adjust these conditions, for example, by allowing a certain percentage of responses to fail.
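Spelled out in plain code, such a condition reduces to a threshold over label shares. A minimal sketch with made-up results and column names; Evidently's testing interface packages checks like these for you.

```python
# Pass/fail conditions over evaluation results; thresholds are illustrative.
import pandas as pd

results = pd.DataFrame({
    "context_quality": ["VALID", "VALID", "INVALID", "VALID"],
    "faithfulness": ["FAITHFUL", "FAITHFUL", "FAITHFUL", "UNFAITHFUL"],
})

valid_share = (results["context_quality"] == "VALID").mean()
faithful_share = (results["faithfulness"] == "FAITHFUL").mean()

# Strict condition: everything must pass. Relax the thresholds if some failures are acceptable.
print("Context test:", "PASS" if valid_share == 1.0 else f"FAIL ({valid_share:.0%} valid)")
print("Response test:", "PASS" if faithful_share == 1.0 else f"FAIL ({faithful_share:.0%} faithful)")
```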
There are a few other tools for RAG evaluation, but here's why you might want to try Evidently.
It's open-source: all metrics, the Report interface, and the testing API are free to use.
We take a practical approach to RAG evaluation, keeping things as simple as possible while providing a useful signal. Our goal is to create metrics that you can actually interpret and compare against your labels. We also added per-chunk relevance assessments, which can be fed into ranking metrics like Hit Rate; we haven't seen this implemented elsewhere.
We try to minimize API calls to keep evaluation costs low when you use LLM judges. Our zero-shot prompts work well with affordable models like GPT-4o mini. You can also swap them for other models or for non-LLM alternatives like semantic similarity checks.
Evidently includes 100+ evaluation metrics: ranking, classification, embedding-based similarity, regex, LLM-based scoring, and more. Having everything in one place makes it easier to manage evaluations without switching tools.
If you need custom metrics, we've got templates for both LLM and deterministic checks. You can easily tweak things like the definition of "correctness" to match your specific needs.
Finally, it's not just about metrics. You get a test interface for structured checks, visual reports for easy debugging, and a self-hostable dashboard to track results over time.
What's next? We're working on further improvements, including testing prompts across different models and adding more parameters.
What do you think? We'd love your feedback. Join the conversation in our Discord!
Support Evidently: if you like this release, give us a star on GitHub! ⭐
______________
If you're running complex RAG or AI agent evaluations, check out Evidently Cloud. It helps you generate synthetic test data, set up and run LLM judges with no code, track evaluation results, and collaborate with your team, all in a single platform.
Sign up for free or schedule a demo to see Evidently Cloud in action.