The Evidently open-source Python library now supports evaluations for LLM-based applications, including RAG systems and chatbots. With the new functionality, you can design task-specific evaluations and run them at scale, from regex checks and text statistics to model-based scoring and LLM-as-a-judge.
Want to try it right away? Here is the Getting Started tutorial.
For more details, read on.
LLMs now handle many use cases previously solved with classic NLP models – and new generative applications, from writing product summaries to customer support chatbots.
As LLMs become more widespread, we need reliable methods to evaluate their performance on specific tasks. The goal isn't to determine "How good is the newly released LLM in general?" (that's where LLM benchmarks are useful) but rather, "How well do we solve our specific problem using an LLM on the backend?"
You can't improve what you can't measure, so you need an evaluation framework from the initial design stage to production.
Developers often start with "vibe checks" by looking at individual outputs to decide if they are good enough. It is essential to know your data, but assessing the responses one by one doesn't scale. Eventually, you need a way to automate your checks.
With this open-source Evidently release, we aim to give anyone working on LLM applications a simple way to design task-specific evaluations and run them at scale.
Evaluating LLMs is ultimately task-specific: what matters for a support chatbot is not the same as for a creative writing assistant. Hallucinations are a problem for one and a feature for another. However, you can always pick from a set of shared evaluation approaches.
Depending on when you run your evaluations, you might also have different data available.
Inputs and/or Outputs. With live production traffic, you often deal with open-ended text data. You know what the user asked and how the model responded, possibly with an occasional upvote or downvote. But there is nothing else to rely on.
You can then focus on directly assessing the properties of the questions and responses. For example, check the text length or readability score if your LLM generates social media posts. For chatbots, track toxicity, tone, or whether the LLM denied the user an answer.
Outputs and Context. With RAG architecture, you typically have both the LLM response and the retrieved context chunk used to generate it. You can then run evaluations that take both as input, for example, to check if the context backs up the response. If the generated response includes facts not present in the context, this might indicate hallucination.
Model Outputs and Ground Truth. During iterations, you might collect examples of high-quality completions, either human-written or approved LLM responses. For classification use cases, you might have a labeled dataset. You may also record the corrected data if users edit the LLM output. All these pairs of "good examples" can become your test cases.
When you make significant updates to your application, you can use these examples to catch regressions. You can generate new responses for the same inputs and compare them against the "golden set" using metrics like semantic similarity – or good old precision and recall for classification.
These examples already mention several different evaluation methods for LLM-based apps: checking properties of the texts directly, comparing responses against the retrieved context, and comparing them against reference answers.
What can you do with the new Evidently release? All of the above. Let’s take a look.
Regular expressions are a simple and reliable way to track patterns in text data.
Checking regex matches is often more helpful than it seems at first glance. It's an inexpensive way to catch mentions of competitors, classify responses by topic based on word lists, or detect when your LLM uses expressions typical of low-quality responses (how many canned social media posts that "delve into" topics do we see these days?).
With Evidently, you can define custom regex patterns or use pre-built ones like IncludesWords, BeginsWith, Contains, etc. As you run the evaluation, you will see how many responses match the pattern. You can also publish the results to a table to explore individual outcomes.
Here is a simple example from our Get Started tutorial: you can see how many conversations mention salary and when these conversations occurred over a day.
Here is what it takes to run this regular expression check for the "Response" column in the "assistant_logs" dataset and get a visual Report:
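Below is a minimal sketch; descriptor names and arguments follow the Evidently documentation and may vary slightly between versions, and assistant_logs here is a small toy stand-in for the tutorial dataset.

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import IncludesWords

# Toy stand-in for the assistant_logs dataset from the tutorial
assistant_logs = pd.DataFrame({
    "Question": ["How do I check my salary?", "Where is the office?"],
    "Response": ["You can view your salary in the employee portal.",
                 "The office address is listed on the intranet."],
})

# Flag each response that mentions the word "salary"
report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[
        IncludesWords(words_list=["salary"], display_name="Mentions salary"),
    ])
])
report.run(reference_data=None, current_data=assistant_logs)
report  # renders in a notebook; use report.save_html("report.html") to export
```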
At Evidently, we call these text-level evaluations descriptors. Each text in the dataset receives a numerical score or a categorical label, such as "True" or "False" for a regex match.
You can generate various statistics about your texts, such as sentence count, word length, or the share of out-of-vocabulary words.
Evidently provides several built-in descriptors. While they might not capture complex qualities, they can identify issues like a sudden rise in fixed-length questions (indicating a spam attack) or an increase in out-of-vocabulary words (suggesting a new foreign audience). Such checks also help compare the outputs of two LLMs or prompts, for example, to see which one tends to give wordier responses.
Here’s how to perform these checks in Evidently:
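A sketch with several built-in descriptors, applied to the same assistant_logs DataFrame as above; descriptor names are per the documentation and may differ slightly in your version.

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import (
    TextLength,
    SentenceCount,
    OOV,
    NonLetterCharacterPercentage,
)

# Summary statistics for the "Response" column
report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[
        TextLength(),                    # length in characters
        SentenceCount(),                 # number of sentences
        OOV(),                           # share of out-of-vocabulary words
        NonLetterCharacterPercentage(),  # share of non-letter characters
    ])
])
report.run(reference_data=None, current_data=assistant_logs)
report
```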
Semantic similarity (using metrics like Cosine Similarity) reflects the degree of likeness in meaning between two pieces of text, even when the wording does not match exactly.
You can use it to compare input-output pairs, output-context pairs, or new responses against an ideal answer.
Semantic similarity isn't flawless and may not capture all nuances in meaning or context, particularly in complex or lengthy texts. However, it is valuable for dynamic tracking and identifying individual outliers.
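As a sketch, here is how such a pairwise check could look, assuming the dataset has matching "Question" and "Response" columns; the SemanticSimilarity descriptor and its with_column argument follow the documented API and may differ in your version.

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import SemanticSimilarity

# Score how close each "Response" is in meaning to the paired "Question"
report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[
        SemanticSimilarity(with_column="Question"),
    ])
])
report.run(reference_data=None, current_data=assistant_logs)
report
```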
You can use pre-trained ML models to assess text data across various dimensions, including sentiment, tone, topic, emotion, or the presence of personally identifiable information (PII). You can use open-source models or even fine-tune your own.
Using focused ML models (as opposed to LLMs) is often cheaper, faster, and more predictable — and it does not require passing your data to an external LLM provider.
Evidently offers built-in model-based descriptors like Sentiment, and wrappers to call external Python functions or models published on HuggingFace.
Here is an example of using an external classifier model that returns a Toxicity score ranging from 0 to 1 for each text.
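A possible sketch using the built-in Hugging Face wrapper descriptor; the wrapper name is taken from the Evidently docs and may differ across versions, and the underlying model is downloaded and scored locally.

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import HuggingFaceToxicityModel  # wrapper name per the docs

# Score each "Response" with an external toxicity classifier
# (scores close to 0 mean non-toxic, close to 1 mean toxic)
report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[
        HuggingFaceToxicityModel(),
    ])
])
report.run(reference_data=None, current_data=assistant_logs)
report
```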
In our example, everything appears to be fine, as indicated by the very low score on the "toxicity" label. Typically, a score of 0.5 would be the threshold for determining toxicity.
For more complex and nuanced evaluations, you can use LLMs to assess texts based on specific criteria.
This approach helps assess both single-text responses (e.g., "Is this tone professional?") and context-response pairs (e.g., "How well is this response grounded in the context?").
To learn more, check out this guide on LLM judges and a hands-on tutorial.
LLM-as-a-judge is meant to scale human evaluation: by tapping into the ability of LLMs to work with text, you get almost unlimited possibilities of what to evaluate. However, it is critical to write good prompts and be specific in your criteria.
To illustrate, let us write a simple prompt to detect personally identifiable information.
Here is the code to call this eval on your data:
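A sketch using the LLMEval descriptor with a custom binary classification template; class and parameter names follow the Evidently LLM-judge docs and may differ by version, and the provider and judge model here are just example choices (an OpenAI API key is expected in the environment).

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

# A binary LLM judge: does the text contain personally identifiable information?
pii_judge = LLMEval(
    subcolumn="category",  # return the label; a numeric score is also available
    template=BinaryClassificationPromptTemplate(
        criteria="""Personally identifiable information (PII) is any information that can
identify a specific individual: names, addresses, phone numbers, emails, ID numbers, etc.""",
        target_category="PII",
        non_target_category="OK",
        uncertainty="non_target",  # if there is not enough information, return "OK"
        include_reasoning=True,    # ask the judge to explain its label
    ),
    provider="openai",     # example provider; requires OPENAI_API_KEY
    model="gpt-4o-mini",   # example judge model
    display_name="PII check",
)

report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[pii_judge]),
])
report.run(reference_data=None, current_data=assistant_logs)
report
```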
Working with LLM judges helps encode nuances. For instance, our prompt asks to return 0 if "there is no sufficient information to decide." This should help increase precision. However, you might sometimes prefer to return "TRUE" (or another label) even when the judgment is unclear, so that you catch every suspicious case and get higher recall. You can adjust the prompt accordingly.
For dynamic monitoring, you can also use drift detection. This lets you compare text distributions from different periods, like today and yesterday. You can evaluate both raw text data and descriptors over time to spot shifts, like changes in text topics.
For raw text data, Evidently trains a classifier to differentiate between the two datasets and returns the ROC AUC of the resulting model. Values over 0.5 indicate some predictive power and potential drift. You can also review characteristic texts from both datasets to see what exactly changed.
Here's an example of content drift detection. You can notice an increase in HR-related questions about the "employee portal" in the current dataset.
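As a sketch, here is how such a comparison could be set up; reference_logs and current_logs are placeholder DataFrames for the two periods, and marking the column as a text feature in ColumnMapping makes Evidently apply the domain classifier method described above.

```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric

# Mark "Question" as a raw text column so Evidently uses the
# domain-classifier drift method and reports its ROC AUC
column_mapping = ColumnMapping(text_features=["Question"])

report = Report(metrics=[
    ColumnDriftMetric(column_name="Question"),
])
report.run(
    reference_data=reference_logs,  # e.g., yesterday's traffic
    current_data=current_logs,      # e.g., today's traffic
    column_mapping=column_mapping,
)
report
```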
Evidently provides a library of evaluation methods: you can pick the combination that fits your use case and scenario.
You always have options – even for solving the same task. Say you want to detect texts with negative sentiment. You can call a pre-trained ML model, write a prompt for LLM-as-a-judge, or even use regex for negative patterns or swear words. Which option to choose? A nuanced LLM eval can make sense if the correct response tone is a core product quality. In many other scenarios, an off-the-shelf ML model will do just fine.
With Evidently, you can freely mix and match evals or even use a few methods alongside each other. And, if wrappers and built-in checks are not enough, you can always add your custom evaluation functions.
Importantly, when you run any LLM evals in Evidently, they fit into all existing Evidently interfaces: you can get a Report, run a Test Suite, and track results on a Dashboard over time.
Reports. You can get visual Reports to summarize the evaluation results and compare two datasets. You can view them in Python or export them to HTML or JSON.
Test Suites. To automate the checks, you can set specific conditions for your text descriptors: for example, that their values should always stay below or equal to a given threshold, or that they should satisfy a condition in at least 90% of cases.
You can use the ready-made Test interface for this. Once you run the Test Suite on your data, you will clearly see which Tests passed and which failed.
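For illustration, here is a sketch of a Test Suite with conditions on two descriptors; the test and descriptor names follow the documentation, and the thresholds are arbitrary examples.

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMin, TestColumnValueMax
from evidently.descriptors import Sentiment, TextLength

# Conditions on descriptors: sentiment must stay positive,
# and responses must not exceed 500 characters
test_suite = TestSuite(tests=[
    TestColumnValueMin(column_name=Sentiment().on("Response"), gt=0),
    TestColumnValueMax(column_name=TextLength().on("Response"), lte=500),
])

test_suite.run(reference_data=None, current_data=assistant_logs)
test_suite  # renders pass/fail; test_suite.as_dict() for programmatic access
```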
ML monitoring dashboard. Finally, you can log the results of your evaluation over time and build a live monitoring dashboard with alerts.
Here, you can see a monitoring Panel with the history of Test runs. You can notice that we consistently fail the test that checks that the text sentiment is positive (the "gt=0" condition stands for "greater than 0").
Alternatively, you can create Panels to see individual metric values you want to keep an eye on, flexibly choosing from available plots – just like in the first image in this blog.
Want to try it for yourself? Check the tutorial.
This is the first major release of Evidently functionality for LLMs. The framework now allows anyone to define custom task-specific LLM evals.
As the next step, we’ll continue with more presets and examples for specific use cases. Which features should we add, and which examples should we prioritize? Jump on the Discord community to share or open a GitHub issue.
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring the whole team into a single workspace to collaborate on AI quality.
Sign up free ⟶
Or try open source ⟶