The Evidently open-source Python library now supports evaluations for LLM-based applications, including RAG systems and chatbots. With the new functionality, you can design task-specific evaluations and run them at scale, from regex checks and text statistics to model-based scoring and LLM-as-a-judge.
Want to try it right away? Here is the Getting Started tutorial.
For more details, read on.
LLMs now handle many use cases previously solved with classic NLP models – and new generative applications, from writing product summaries to customer support chatbots.
As LLMs become more widespread, we need reliable methods to evaluate their performance on specific tasks. The goal isn't to determine "How good is the newly released LLM in general?" (that's where LLM benchmarks are useful) but rather, "How well do we solve our specific problem using an LLM on the backend?"
You can't improve what you can't measure, so you need an evaluation framework from the initial design stage to production.
Developers often start with "vibe checks" by looking at individual outputs to decide if they are good enough. It is essential to know your data, but assessing the responses one by one doesn't scale. Eventually, you need a way to automate your checks.
With this open-source Evidently release, we aim to give anyone working on LLM applications a simple way to design task-specific evaluations and run them at scale.
Evaluating LLMs is ultimately task-specific: what matters for a support chatbot is not the same as for a creative writing assistant. Hallucinations are a problem for one and a feature for another. However, you can always pick from a set of shared evaluation approaches.
Depending on when you run your evaluations, you might also have different data available.
Inputs and/or Outputs. With live production traffic, you often deal with open-ended text data. You know what the user asked and how the model responded, possibly with an occasional upvote or downvote. But there is nothing else to rely on.
You can then focus on directly assessing the properties of the questions and responses. For example, check the text length or readability score if your LLM generates social media posts. For chatbots, track toxicity, tone, or whether the LLM denied the user an answer.
Outputs and Context. With RAG architecture, you typically have both the LLM response and the retrieved context chunk used to generate it. You can then run evaluations that take both as input, for example, to check if the context backs up the response. If the generated response includes facts not present in the context, this might indicate hallucination.
Model Outputs and Ground Truth. During iterations, you might collect examples of high-quality completions, either human-written or approved LLM responses. For classification use cases, you might have a labeled dataset. You may also record the corrected data if users edit the LLM output. All these pairs of "good examples" can become your test cases.
When you make significant updates to your application, you can use these examples to catch regressions. You can generate new responses for the same inputs and compare them against the "golden set" using metrics like semantic similarity – or good old precision and recall for classification.
These examples already mention several different evaluation methods for LLM-based apps: checking properties of the texts directly, comparing responses against the retrieved context, and comparing them against reference answers.
What can you do with the new Evidently release? All of the above. Let’s take a look.
Regular expressions are a simple and reliable way to track patterns in text data.
Checking regex matches is often more helpful than it seems at first glance. It's an inexpensive way to catch mentions of competitors, classify responses by topic based on word lists, or detect when your LLM uses expressions typical of low-quality responses (how many canned social media posts that "delve into" topics do we see these days?).
With Evidently, you can define custom regex patterns or use pre-built ones like IncludesWords, BeginsWith, Contains, etc. As you run the evaluation, you will see how many responses match the pattern. You can also publish the results to a table to explore individual outcomes.
Here is a simple example from our Get Started tutorial: you can see how many conversations mention salary and when these conversations occurred over a day.
Here is what it takes to run this regular expression check for the "Response" column in the "assistant_logs" dataset and get a visual Report:
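Below is a minimal sketch; descriptor names and arguments follow the Evidently documentation and may vary slightly between versions, and assistant_logs here is a small toy stand-in for the tutorial dataset.

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import IncludesWords

# Toy stand-in for the assistant_logs dataset from the tutorial
assistant_logs = pd.DataFrame({
    "Question": ["How do I check my salary?", "Where is the office?"],
    "Response": ["You can view your salary in the employee portal.",
                 "The office address is listed on the intranet."],
})

# Flag each response that mentions the word "salary"
report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[
        IncludesWords(words_list=["salary"], display_name="Mentions salary"),
    ])
])
report.run(reference_data=None, current_data=assistant_logs)
report  # renders in a notebook; use report.save_html("report.html") to export
```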
At Evidently, we call these text-level evaluations descriptors. Each text in the dataset receives a numerical score or a categorical label, such as "True" or "False" for a regex match.
You can generate various statistics about your texts, such as sentence count, word length, or the share of out-of-vocabulary words.
Evidently provides several built-in descriptors. While they might not capture complex qualities, they can identify issues like a sudden rise in fixed-length questions (indicating a spam attack) or an increase in out-of-vocabulary words (suggesting a new foreign audience). Such checks also help compare the outputs of two LLMs or prompts, for example, to see which one tends to give wordier responses.
Here’s how to perform these checks in Evidently:
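A sketch with several built-in descriptors, applied to the same assistant_logs DataFrame as above; descriptor names are per the documentation and may differ slightly in your version.

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import (
    TextLength,
    SentenceCount,
    OOV,
    NonLetterCharacterPercentage,
)

# Summary statistics for the "Response" column
report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[
        TextLength(),                    # length in characters
        SentenceCount(),                 # number of sentences
        OOV(),                           # share of out-of-vocabulary words
        NonLetterCharacterPercentage(),  # share of non-letter characters
    ])
])
report.run(reference_data=None, current_data=assistant_logs)
report
```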
Semantic similarity (using metrics like Cosine Similarity) reflects the degree of likeness in meaning between two pieces of text, even when the wording does not match exactly.
You can use it to compare input-output pairs, output-context pairs, or new responses against an ideal answer.
Semantic similarity isn't flawless and may not capture all nuances in meaning or context, particularly in complex or lengthy texts. However, it is valuable for dynamic tracking and identifying individual outliers.
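As a sketch, here is how such a pairwise check could look, assuming the dataset has matching "Question" and "Response" columns; the SemanticSimilarity descriptor and its with_column argument follow the documented API and may differ in your version.

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import SemanticSimilarity

# Score how close each "Response" is in meaning to the paired "Question"
report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[
        SemanticSimilarity(with_column="Question"),
    ])
])
report.run(reference_data=None, current_data=assistant_logs)
report
```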
You can use pre-trained ML models to assess text data across various dimensions, including sentiment, tone, topic, emotion, or the presence of personally identifiable information (PII). You can use open-source models or even fine-tune your own.
Using focused ML models (as opposed to LLMs) is often cheaper, faster, and more predictable — and it does not require passing your data to an external LLM provider.
Evidently offers built-in model-based descriptors like Sentiment, and wrappers to call external Python functions or models published on HuggingFace.
Here is an example of using an external classifier model that returns a Toxicity score ranging from 0 to 1 for each text.
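A possible sketch using the built-in Hugging Face wrapper descriptor; the wrapper name is taken from the Evidently docs and may differ across versions, and the underlying model is downloaded and scored locally.

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import HuggingFaceToxicityModel  # wrapper name per the docs

# Score each "Response" with an external toxicity classifier
# (scores close to 0 mean non-toxic, close to 1 mean toxic)
report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[
        HuggingFaceToxicityModel(),
    ])
])
report.run(reference_data=None, current_data=assistant_logs)
report
```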
In our example, everything appears to be fine, as indicated by the very low score on the "toxicity" label. Typically, a score of 0.5 would be the threshold for determining toxicity.
For more complex and nuanced evaluations, you can use LLMs to assess texts based on specific criteria.
This approach helps assess both single-text responses (e.g., "Is this tone professional?") and context-response pairs (e.g., "How well is this response grounded in the context?").
To learn more, check out this guide on LLM judges and a hands-on tutorial.
LLM-as-a-judge is meant to scale human evaluation: by tapping into the ability of LLMs to work with text, you get almost unlimited possibilities of what to evaluate. However, it is critical to write good prompts and be specific in your criteria.
To illustrate, let us write a simple prompt to detect personally identifiable information.
Here is the code to call this eval on your data:
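A sketch using the LLMEval descriptor with a custom binary classification template; class and parameter names follow the Evidently LLM-judge docs and may differ by version, and the provider and judge model here are just example choices (an OpenAI API key is expected in the environment).

```python
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

# A binary LLM judge: does the text contain personally identifiable information?
pii_judge = LLMEval(
    subcolumn="category",  # return the label; a numeric score is also available
    template=BinaryClassificationPromptTemplate(
        criteria="""Personally identifiable information (PII) is any information that can
identify a specific individual: names, addresses, phone numbers, emails, ID numbers, etc.""",
        target_category="PII",
        non_target_category="OK",
        uncertainty="non_target",  # if there is not enough information, return "OK"
        include_reasoning=True,    # ask the judge to explain its label
    ),
    provider="openai",     # example provider; requires OPENAI_API_KEY
    model="gpt-4o-mini",   # example judge model
    display_name="PII check",
)

report = Report(metrics=[
    TextEvals(column_name="Response", descriptors=[pii_judge]),
])
report.run(reference_data=None, current_data=assistant_logs)
report
```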
Working with LLM judges helps encode nuances. For instance, our prompt asks to return 0 if "there is no sufficient information to decide." This should help increase precision. However, you might sometimes prefer to return "TRUE" (or another label) even when the judgment is unclear, so that you catch every suspicious case and get higher recall. You can adjust the prompt accordingly.
For dynamic monitoring, you can also use drift detection. This lets you compare text distributions from different periods, like today and yesterday. You can evaluate both raw text data and descriptors over time to spot shifts, like changes in text topics.
For raw text data, Evidently trains a classifier to differentiate between the two datasets and returns the ROC AUC of the resulting model. Values over 0.5 indicate some predictive power and potential drift. You can also review characteristic texts from both datasets to see what exactly changed.
Here's an example of content drift detection. You can notice an increase in HR-related questions about the "employee portal" in the current dataset.
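As a sketch, here is how such a comparison could be set up; reference_logs and current_logs are placeholder DataFrames for the two periods, and marking the column as a text feature in ColumnMapping makes Evidently apply the domain classifier method described above.

```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric

# Mark "Question" as a raw text column so Evidently uses the
# domain-classifier drift method and reports its ROC AUC
column_mapping = ColumnMapping(text_features=["Question"])

report = Report(metrics=[
    ColumnDriftMetric(column_name="Question"),
])
report.run(
    reference_data=reference_logs,  # e.g., yesterday's traffic
    current_data=current_logs,      # e.g., today's traffic
    column_mapping=column_mapping,
)
report
```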
Evidently provides a library of evaluation methods: you can pick the combination that fits your use case and scenario.
You always have options – even for solving the same task. Say you want to detect texts with negative sentiment. You can call a pre-trained ML model, write a prompt for LLM-as-a-judge, or even use regex for negative patterns or swear words. Which option to choose? A nuanced LLM eval can make sense if the correct response tone is a core product quality. In many other scenarios, an off-the-shelf ML model will do just fine.
With Evidently, you can freely mix and match evals or even use a few methods alongside each other. And, if wrappers and built-in checks are not enough, you can always add your custom evaluation functions.
Importantly, when you run any LLM evals in Evidently, they fit into all existing Evidently interfaces: you can get a Report, run a Test Suite, and track results on a Dashboard over time.
Reports. You can get visual Reports to summarize the evaluation results and compare two datasets. You can view them in Python or export them to HTML or JSON.
Test Suites. To automate the checks, you can set specific conditions for your text descriptors: for example, that their values should always stay below or equal to a given threshold, or that they should satisfy a condition in at least 90% of cases.
You can use the ready-made Test interface for this. Once you run the Test Suite on your data, you will clearly see which Tests passed and which failed.
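For illustration, here is a sketch of a Test Suite with conditions on two descriptors; the test and descriptor names follow the documentation, and the thresholds are arbitrary examples.

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMin, TestColumnValueMax
from evidently.descriptors import Sentiment, TextLength

# Conditions on descriptors: sentiment must stay positive,
# and responses must not exceed 500 characters
test_suite = TestSuite(tests=[
    TestColumnValueMin(column_name=Sentiment().on("Response"), gt=0),
    TestColumnValueMax(column_name=TextLength().on("Response"), lte=500),
])

test_suite.run(reference_data=None, current_data=assistant_logs)
test_suite  # renders pass/fail; test_suite.as_dict() for programmatic access
```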
ML monitoring dashboard. Finally, you can log the results of your evaluation over time and build a live monitoring dashboard with alerts.
Here, you can see a monitoring Panel with the history of Test runs. You can notice that we consistently fail the test that checks that the text sentiment is positive (the "gt=0" condition stands for "greater than 0").
Alternatively, you can create Panels to see individual metric values you want to keep an eye on, flexibly choosing from available plots – just like in the first image in this blog.
Want to try it for yourself? Check the tutorial.
This is the first major release of Evidently functionality for LLMs. The framework now allows anyone to define custom task-specific LLM evals.
As the next step, we’ll continue with more presets and examples for specific use cases. Which features should we add, and which examples should we prioritize? Jump on the Discord community to share or open a GitHub issue.
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring the whole team into a single workspace to collaborate on AI quality.
Sign up free ⟶
Or try open source ⟶