LLM evaluations are quality checks that help make sure your AI product does what it's supposed to, whether that's writing code or handling support tickets.
They are different from benchmarking: you’re testing your system on your tasks, not just comparing models. You need these checks throughout the AI product lifecycle, from testing to live monitoring.
This guide focuses on automated LLM evaluations. We’ll introduce methods that apply across use cases, from summarization to chatbots. The goal is to give you an overview so you can easily pick the right method for any LLM task you come across.
For hands-on code examples, check the open-source Evidently library and docs Quickstart.
Test fast, ship faster. Evidently Cloud gives you reliable, repeatable evaluations for complex systems like RAG and agents — so you can iterate quickly and ship with confidence.
If you're new to LLM evaluations, check out this introduction guide. It covers why evals are important and breaks down the core workflows. Here is a quick recap:
Automated evaluation workflows have two parts. First, you need data: this could be synthetic examples, curated test cases, or real logs from your LLM app. Second, you need a scoring method. This might return a pass/fail, a label, or a numerical score that tells you if the output is sound. In the end you always get some measure, but there are many ways to arrive at it.
This evaluation process looks simple on the surface, but it can get messy fast. There are different tasks, data structures, and definitions of what “good” is for each use case. And how exactly do you perform this automated scoring?
To pick the right method, you first need to identify the evaluation scenario. They fall into two main buckets:
Reference-based evaluations compare outputs to known answers, often referred to as "ground truth." Once you create this test dataset, your job is to measure how closely the new LLM outputs match these known responses. Techniques range from an exact match to semantic comparisons.
While it may seem strange to test the system on something you already know, this gives a clear target to aim for. As you iterate on your LLM app, you can track how much better (or worse) it is getting.
Reference-free evaluations don't require predefined answers. Instead, they assess outputs based on proxy metrics or specific qualities like tone, structure, or safety.
This applies to tasks where creating ground truth data is difficult or impractical, like chatbot conversations or creative writing. It’s also ideal for live monitoring, where you score responses in real-time. A common method here is LLM-as-a-judge, where you prompt an LLM to evaluate outputs along various dimensions, like asking, "Is this response helpful?"
Some methods, like LLM judges, work well in both cases. However, many metrics are limited to reference-based scenarios, making this distinction useful.
Even if you have reference examples, an important question is whether each input has a single correct answer. If it does, you can use straightforward, deterministic checks. It's easy to compare if you get what you expect.
When LLMs handle predictive tasks, you can also rely on established machine learning metrics. This comes up more often than it may seem! Even if your system isn’t predictive, parts of it could be. For example, a classifier might detect user intent in a chatbot, or you may solve a ranking problem for your RAG system. Both tasks have multiple metrics to choose from.
However, you don't always have a single "perfect" answer. In translation, content generation, or summarization, there are multiple valid outputs and exact matching won’t work. Instead, you need comparison methods like semantic similarity that can handle variations.
A similar thing applies to reference-free evals. Sometimes, you can use objective checks like running generated code to see if it works or verifying JSON keys. But in most cases, you’d deal with open-ended evaluations. So you need to find ways to quantify subjective qualities or come up with proxy metrics.
Here is a decision tree to navigate different methods. We’ll explain each option below.
Let’s talk about another distinction: dataset-level vs. input-level evaluations.
Some metrics naturally work on entire datasets. For example, classification metrics like precision or F1-score aggregate the results across all predictions and give you a single quality measure. You can pick different metrics depending on what’s most important — like focusing on minimizing certain types of errors.
In contrast, LLM-specific evaluation methods typically assess each response individually. For example, you might use LLM judges or calculate semantic similarity for each output. But there’s no built-in mechanism to combine these scores into a single performance metric. It’s up to you how to aggregate results across your test set or live responses.
Sometimes it’s simple: average scores or count how many outputs got a "good" label. But in other cases, you might need extra steps. For instance, numerical averages don’t tell you how many individual bad outputs you have. To account for this, you could set a threshold first, like marking any response with a semantic similarity below 0.85 as "incorrect".
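For instance, here is a minimal Python sketch of how you might turn per-response similarity scores into dataset-level numbers; the scores and the 0.85 cut-off are purely illustrative:

```python
# A minimal sketch: aggregate per-response similarity scores into dataset-level measures.
# The scores and the 0.85 threshold are illustrative, not a universal recommendation.
similarity_scores = [0.92, 0.81, 0.97, 0.88, 0.64]  # e.g., one score per response

THRESHOLD = 0.85
labels = ["correct" if s >= THRESHOLD else "incorrect" for s in similarity_scores]

average_score = sum(similarity_scores) / len(similarity_scores)
share_correct = labels.count("correct") / len(labels)

print(f"Average similarity: {average_score:.2f}")
print(f"Share of correct responses: {share_correct:.0%}")
```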
You can also run compound checks that evaluate multiple criteria per response, such as assessing its tone, length, and relevance. Each input can then get multiple descriptors.
To avoid drowning in metrics, it helps to define clear test conditions. For example, you might set quality thresholds like:
By breaking evaluations into these structured tests, you can better understand the results of each evaluation run. If something goes wrong, you’ll know exactly where.
Finally, let’s move on to methods! We’ll break them down by whether or not you need ground truth data and go through methods one by one. Each section is self-contained, so you can skip to what interests you most.
Here are all the methods at a glance:
Reference-based evals are great for experiments. When you’re testing different prompts, models, or configurations, you need a way to track progress. Otherwise, you’re just blindly trying ideas. Running repeated checks on your test set helps you see if you’re getting more things right over time.
In regression testing, the same approach helps you confirm that updates don’t break what’s already working or introduce new bugs.
The evaluation process itself is simple:
But here’s the catch: your evaluation is only as good as the test dataset. You’ve got to put in the work — label some data, create synthetic examples, or pull from production logs. The dataset needs to be diverse and kept up to date as you find new user scenarios or issues. If it’s too small and simple, the evaluation won’t tell you much.
TL;DR: These metrics help quantify performance across a dataset of examples for tasks like binary and multi-class classification.
Classification is about predicting a discrete label for each input. Such tasks often appear as components in larger workflows, but LLMs can also handle them directly. Some examples:
Each task has predefined categories, and the system needs to assign each input to one of them — or sometimes several, if multiple tags are allowed. To evaluate performance, you check whether the system picks the right classes across a diverse set of inputs.
For example, say you’re testing a chatbot intent detector. You start by preparing a test set with user questions and their correct categories. Then, you run those questions through your AI app and compare its predictions to the actual labels.
An intuitive metric here is accuracy, which tells you how many classifications are correct. However, it’s not always the best measure.
Imagine you’re classifying queries as "safe" or "unsafe" to avoid risky situations like giving personal financial advice. In this case, you’d focus more on:
Both metrics give a balanced view. Say, high recall shows that you caught nearly all unsafe queries. But if precision is low, you flag too many harmless queries incorrectly. That's not a good user experience!
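As a minimal sketch, here is how you could compute these metrics with scikit-learn, treating "unsafe" as the positive class; the labels and predictions below are made-up examples:

```python
# A minimal sketch of dataset-level classification metrics for a "safe"/"unsafe" classifier.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["unsafe", "safe", "unsafe", "safe", "safe", "unsafe"]    # ground truth labels
y_pred = ["unsafe", "safe", "safe", "safe", "unsafe", "unsafe"]    # model predictions

# Treat "unsafe" as the positive class: recall shows how many unsafe queries you caught,
# precision shows how many flagged queries were truly unsafe.
precision = precision_score(y_true, y_pred, pos_label="unsafe")
recall = recall_score(y_true, y_pred, pos_label="unsafe")
f1 = f1_score(y_true, y_pred, pos_label="unsafe")

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```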
You might also need per-class metrics. For instance, in content moderation, your system might perfectly detect "offensive language" but miss a lot of "spam." Tracking only overall accuracy could hide this imbalance when you have multiple categories.
Here’s a quick summary of key classification metrics:
These metrics are core to traditional machine learning evaluation. There’s plenty of material out there on their pros and cons. For example, check our guides about precision-recall tradeoff and classification metrics by class.
TL;DR: These metrics measure performance for tasks like retrieval (including RAG) and recommendations by evaluating how well systems rank relevant results.
When we talk about ranking tasks, we usually mean either search or recommendations, both of which are important in LLM applications.
In both cases, each item (document, product, chunk etc.) can be labeled as relevant or not, creating the ground truth for performance evaluation.
Interestingly, LLMs themselves can help generate these relevance labels by assigning scores to query-item pairs (Thomas et al., 2024). They can also help generate ground truth question-response pairs for RAG evaluations.
Once you have ground truth data, you can evaluate the system performance. Ranking metrics fall into two main types:
For example, a metric like NDCG evaluates both relevance and ranking order, assigning more weight to items near the top of the list. In contrast, Hit Rate checks if at least one correct answer was found, even if it’s ranked last.
Evaluations often target top-K results (e.g., top-5 or top-10), since systems usually retrieve many items but display or use only a few. For example, Hit Rate@5 measures how often at least one relevant result shows up in the top 5.
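As an illustration, here is a small self-contained sketch of Hit Rate@K and MRR over made-up relevance labels; no specific retrieval library is assumed:

```python
# A minimal sketch of top-K ranking checks. Each inner list holds relevance labels
# (1 = relevant, 0 = not) for retrieved items, already ordered by rank.
retrieved = [
    [0, 1, 0, 0, 1],  # query 1: first relevant item at rank 2
    [0, 0, 0, 0, 0],  # query 2: nothing relevant retrieved
    [1, 0, 0, 1, 0],  # query 3: first relevant item at rank 1
]

def hit_rate_at_k(results, k):
    # Share of queries with at least one relevant item in the top K.
    return sum(1 for r in results if any(r[:k])) / len(results)

def mrr(results):
    # Mean Reciprocal Rank: average of 1 / rank of the first relevant item.
    reciprocal_ranks = []
    for r in results:
        rank = next((i + 1 for i, rel in enumerate(r) if rel), None)
        reciprocal_ranks.append(1 / rank if rank else 0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(f"Hit Rate@5: {hit_rate_at_k(retrieved, 5):.2f}")  # 2 of 3 queries have a hit
print(f"MRR: {mrr(retrieved):.2f}")                      # (1/2 + 0 + 1) / 3 = 0.5
```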
Here are some commonly used ranking metrics:
In recommendation systems, additional metrics like diversity, novelty, and serendipity evaluate user experience. These metrics check whether suggestions are varied, fresh, and non-repetitive.
Ranking metrics explainers. Check this entry-level guide for a deeper dive.
TL;DR: Whenever you can write code to verify outputs against correct responses.
Classification and ranking are examples of narrow tasks with their own well-defined metrics. But there are other cases where you can have one right answer – or something close to it. Examples are:
In these cases, you can perform deterministic matching to check outputs programmatically, similar to software unit tests. But with LLMs, getting one answer right doesn’t mean others will be correct, so you need a variety of test inputs to ensure good coverage.
Imagine testing a system that extracts information from job ads, like identifying job titles. (OLX does something similar). You can perform an exact match between the output and the expected result or fuzzy matching to handle minor variations, like formatting or capitalization.
If the output is structured as JSON, like { "job role": "AI engineer", "min_experience_yrs": "3" }, you can also match the JSON key-value pairs.
If you deal with narrow Q&A, you can build a ground truth dataset with expected keywords. For example, the answer to "What’s the capital of France?" should always include "Paris." You can then check if each response contains the right words without full-text matching.
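Here is a minimal Python sketch of such deterministic checks (exact match, fuzzy match, JSON field match, and keyword presence) using only the standard library; the field names, texts, and 0.9 threshold are illustrative:

```python
# A minimal sketch of deterministic checks against expected values.
import json
from difflib import SequenceMatcher

expected = {"job role": "AI engineer", "min_experience_yrs": "3"}
output = '{"job role": "AI Engineer", "min_experience_yrs": "3"}'
parsed = json.loads(output)

# Exact match: strict, fails on a capitalization difference
exact_ok = parsed["job role"] == expected["job role"]

# Fuzzy match: tolerates minor formatting variations
fuzzy_score = SequenceMatcher(None, parsed["job role"].lower(), expected["job role"].lower()).ratio()
fuzzy_ok = fuzzy_score >= 0.9

# JSON key-value match for another field
years_ok = parsed.get("min_experience_yrs") == expected["min_experience_yrs"]

# Keyword check for a narrow Q&A case
answer = "The capital of France is Paris."
keyword_ok = "paris" in answer.lower()

print(exact_ok, round(fuzzy_score, 2), fuzzy_ok, years_ok, keyword_ok)
```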
In stress-testing, you might compare all outputs to a single list of expected words or phrases. For instance, if you want a chatbot to avoid mentioning competitors, you could create challenging prompts and test if the responses contain the expected refusal words.
You can apply similar non-ambiguous checks in other scenarios. For example, you might verify that an AI agent calls correct tools or that a scripted interaction leads to a known outcome, such as retrieving a specific database entry.
For coding tasks, you won’t always expect an exact match to reference, but you can find other ways to verify correctness. For example, prepare literal unit tests. GitLab uses this when evaluating their Copilot system: they introduce failures on purpose and then check if the model can generate code that fixes the errors and passes the tests.
To aggregate the results of deterministic checks, you can track the overall pass rate or organize test cases by scenarios for more detailed insights.
TL;DR: Comparing the responses using word, item, or character overlap.
Until now, we mostly looked at constrained tasks where there is a single right outcome. But many LLM applications generate free-form texts. In these cases, you may have an example response but won’t expect to match it precisely.
For instance, when summarizing financial reports, you might compare each new summary to a human-written “ideal” one. But there are many ways to convey the same information, so summaries with only partial word overlap can still be just as good.
To address this, the machine learning research community came up with overlap-based metrics. They reflect how many shared symbols, words, or word sequences there are between the reference and the generated response.
Here are some examples:
It's common to use averages to summarize the scores across multiple examples, but variations are possible, like computing BLEU or ROUGE at the corpus level over the entire dataset.
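As a sketch, here is how you could compute these scores with the open-source nltk and rouge-score packages; the texts are illustrative, and other implementations exist:

```python
# A minimal sketch of overlap-based scoring (pip install nltk rouge-score).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The company reported a 10% increase in quarterly revenue."
candidate = "Quarterly revenue grew by 10%, the company reported."

# BLEU: n-gram precision of the candidate against the reference
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: n-gram and longest-common-subsequence overlap, recall-oriented
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.2f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.2f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
```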
But here is the rub: while overlap-based metrics have been foundational in NLP research, they often don't correlate well with human judgements, and aren’t suitable for very open-ended tasks. Modern alternatives, such as embedding- or LLM-based evaluations, offer more context-aware assessments.
TL;DR: Using pre-trained embedding models for semantic matching.
Both exact match and overlap-based metrics compare words and elements of responses, but they don’t consider their meaning. But that's what we often care about!
Consider these two chatbot responses:
Most human readers would agree these are roughly equivalent. However, overlap-based metrics wouldn’t, since the two responses share few words.
Semantic similarity methods help compare meaning instead of words. They use pre-trained models like BERT, many of which are open-source. These models turn text into vectors that capture the context and relationships between words. By comparing the distance between these vectors, you can get a sense of how similar the texts really are.
Some popular approaches include:
You can also implement semantic similarity directly: pick an embedding model (e.g., one that works at the sentence level), turn the texts into vectors, and measure cosine similarity.
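Here is a minimal sketch of this approach with the sentence-transformers package; the model name is one common open-source choice, and the texts and threshold are illustrative:

```python
# A minimal sketch of semantic similarity (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # one popular sentence-level embedding model

reference = "You can return the item within 30 days for a full refund."
response = "Refunds are available if you send the product back within a month."

embeddings = model.encode([reference, response])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Cosine similarity: {similarity:.2f}")
# You could then apply a cut-off, e.g., treat scores below 0.85 as mismatches.
```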
To aggregate the results, you can look at averages or define a cut-off threshold and calculate the share of matches with sufficiently high similarity. If you have meaningful metadata (like topics in a Q&A dataset), you can also compute category-specific scores.
The limitation of these methods is that they depend entirely on the embedding model you use. It may not capture the nuances of meaning well, and false matches are possible. Focusing only on token- or sentence-level comparisons can also miss the broader context of the passage.
TL;DR: Prompting an LLM to compare responses or choose the best one.
While making a semantic match ("Do these two responses mean the same thing?") is often the goal, embedding-based similarity isn’t always the most precise option. Embeddings can capture general ideas but miss important details. For example:
Though the order changes the meaning, embedding vectors for these sentences might still appear similar. Or consider:
In this case, the correct naming of specific menu items can be critical to determining accuracy, but embedding-based checks will not penalize this enough. To get better results, you can use an LLM as a judge for similarity matching.
The process is close to how you use LLMs in your product — except here, the task is a narrow classification. For instance, you could pass the reference and the new response to an LLM and ask: "Does the RESPONSE convey the same meaning as the REFERENCE? Yes or no."
This method is very adaptable. You can specify what to prioritize in matching — such as precise terminology, omissions, or style consistency. You can also ask the LLM to explain its reasoning. This makes the results easier to parse and improves the quality of the evals.
For example, you can separately assess if the two responses match in fact and style.
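As an illustration, here is a minimal sketch of such a correctness judge built on the OpenAI Python SDK; the model name and prompt wording are assumptions you would adapt and test against your own labels:

```python
# A minimal sketch of an LLM judge for reference matching.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing a new RESPONSE to a REFERENCE answer.
Does the RESPONSE convey the same meaning as the REFERENCE?
First explain your reasoning in one sentence, then answer with a single word: YES or NO.

REFERENCE: {reference}
RESPONSE: {response}"""

def judge_match(reference: str, response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reference=reference, response=response)}],
        temperature=0,
    )
    return completion.choices[0].message.content

print(judge_match(
    "You can return the item within 30 days.",
    "Returns are accepted for a month after purchase.",
))
```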
This method is widely used in practice. For instance, Segment describes how they use LLM-based matching to compare generated queries. Prompt-based evals have also proven themselves for translation (see Kocmi et al., 2023).
Another way to use LLMs is through pairwise comparison, where you give the LLM two responses and ask it to determine which is better. This is where the nickname "LLM as a Judge" comes from (see Zheng et al., 2023).
Both methods might require tuning to match your preferences: an LLM judge requires its own evaluation! Pairwise comparisons, in particular, can also be biased — LLMs may favor longer or more polished responses or outputs similar to their own.
If matching two answers is not an option, use reference-free evaluations. This applies to:
In these cases, you can measure custom qualities or use proxy metrics — quantifiable properties that say something useful about the output, starting with simple text length.
It's worth noting that while we often refer to checks like "average helpfulness" as metrics, they don’t have a strict mathematical definition like NDCG or precision. You tailor their implementation to your use case, and that’s where most of the work goes.
What's consistent are the evaluation methods you can apply. Let’s break them down!
TL;DR: Tracking the frequency of words and patterns, like topical or risky keywords.
Regular expressions are a simple way to check for specific keywords, phrases, or structures in text. While they might seem basic, they can be surprisingly useful.
For example, you can use regex to track mentions of competitors or products. These are usually limited lists of entities, so it’s easy to define patterns and flag how often they come up.
You can also track topical keywords. Say, if you have a travel chatbot, you can look for words like "cancel," "refund," or "change booking" to check how often it has to deal with cancellations. To simplify setting up such evals, you can use LLMs to brainstorm word lists, and use libraries that automatically handle variations of the same root word.
You can also flag conversations by checking for high-risk keywords. For example:
Regex can also catch repetitive error patterns. Some LLMs generate canned responses like "As an AI language model" or frequently start with "Certainly!" You can set up regex checks to catch these during regression testing. Similarly, refusal patterns (e.g., "I'm sorry" or "I can't") are easy to track since models often reuse the same wording.
Regex helps with structure checks, too. If your output needs specific elements (e.g., headers or disclaimers in the generated report), string matching can confirm that all sections are present.
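Here is a minimal Python sketch of such regex and string checks; the patterns, competitor names, and required sections are illustrative:

```python
# A minimal sketch of regex-based checks: refusal phrases, competitor mentions,
# and a required-section check on a generated report.
import re

response = "I'm sorry, but I can't help with that request."

refusal_pattern = re.compile(r"\b(i'm sorry|i can't|as an ai language model)\b", re.IGNORECASE)
competitor_pattern = re.compile(r"\b(acme corp|globex)\b", re.IGNORECASE)  # hypothetical names

is_refusal = bool(refusal_pattern.search(response))
mentions_competitor = bool(competitor_pattern.search(response))

# Structure check: confirm the generated report contains all required sections
report = "## Summary\n...\n## Risks\n...\n## Disclaimer\n..."
required_sections = ["## Summary", "## Risks", "## Disclaimer"]
has_all_sections = all(section in report for section in required_sections)

print(is_refusal, mentions_competitor, has_all_sections)
```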
Once you define your patterns, regex can efficiently scan large datasets. It’s fast, reliable, and doesn’t require external API calls. Plus, it adds a welcome touch of certainty in an otherwise unpredictable field — if the pattern exists, regex will detect it.
Here are examples of metrics you can implement:
That said, regex has its limitations. It’s strict by nature, meaning it won't catch typos or variations unless you explicitly define them. You'll need to carefully design and maintain your pattern definitions to cover the full range of possible outputs.
TL;DR: Tracking statistics like word or sentence count.
You can often tie quality to measurable text features. For instance, text length matters for tasks like generating social media posts, help articles, or summaries — outputs should be concise but still meaningful. Sometimes, length requirements are directly baked into the prompt, like: “Give a one-sentence takeaway”.
Checks like word and sentence count are quick, cheap, and easy to implement, making them useful at every stage. In regression testing: are all outputs within expected limits? In monitoring: what’s the average, minimum, or maximum length? Individual outliers can often signal problems, like longer prompts containing jailbreaks.
You can also explore metrics like readability scores. These provide a rough estimate of the education level someone would need to understand the text. This is useful if you're generating content that should be accessible to younger audiences or non-native speakers.
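As a sketch, here are a few such checks in Python; the textstat package is one option for readability scores, and the word limit is an arbitrary example:

```python
# A minimal sketch of descriptive text statistics (pip install textstat).
import textstat

output = "Our refund policy is simple. Send the item back within 30 days and we will refund you in full."

word_count = len(output.split())
sentence_count = output.count(".") + output.count("!") + output.count("?")  # rough count
readability = textstat.flesch_reading_ease(output)  # higher = easier to read

print(f"Words: {word_count}, Sentences: {sentence_count}, Flesch reading ease: {readability:.0f}")

# In regression testing, you might assert that outputs stay within expected limits:
assert word_count <= 50, "Output is longer than expected"
```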
There are other stats to consider, like non-letter character counts. They are not always useful as is, but can give a good signal if things shift. For example, a sudden increase in non-vocabulary words might indicate spam, model behavior changes, or automated attacks that need investigation.
TL;DR: Validating format and structure for tasks like code or JSON generation.
There are other programmatic checks beyond regex and text stats. If your LLM generates structured content like SQL, code, JSON, or interacts with APIs and databases, you can often write code to verify correctness at least partially.
These checks focus on format, structure, or functionality. A few examples:
Test code execution. If the model returns code, you can test that it's syntactically correct and, ideally, executable. You can run it in a sandbox or against your own test suite to verify that it performs as expected.
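Here is a minimal sketch of such format checks using only the Python standard library; the expected keys and the code snippet are illustrative:

```python
# A minimal sketch of programmatic format checks: JSON validity, required keys,
# and a Python syntax check on generated code.
import ast
import json

llm_output = '{"job role": "AI engineer", "min_experience_yrs": "3"}'

# Is it valid JSON with the expected keys?
try:
    parsed = json.loads(llm_output)
    json_valid = True
    has_required_keys = {"job role", "min_experience_yrs"} <= parsed.keys()
except json.JSONDecodeError:
    json_valid, has_required_keys = False, False

# Is generated Python code at least syntactically correct?
generated_code = "def add(a, b):\n    return a + b"
try:
    ast.parse(generated_code)
    code_parses = True
except SyntaxError:
    code_parses = False

print(json_valid, has_required_keys, code_parses)
```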
TL;DR: Checking if responses align with inputs, context, or known patterns.
Semantic similarity isn’t just for reference matching. It’s also useful when you don’t have a ground truth response. Here are a few ways it can help:
Input-output similarity. For example, if you’re turning bullet points into a full email, you can measure how closely the generated email matches the original points. This helps check whether it stays true to the source. Similarly, in Q&A tasks, a large semantic gap between the question and the response might indicate the answer isn’t relevant.
Response-context similarity. In RAG tasks, you can compare the generated answers to the retrieved context. High similarity would reflect that the model is properly using the information. Low similarity could suggest hallucination — when the model fabricates unsupported details.
For example, DoorDash monitors chatbot responses this way, flagging outputs with low similarity to relevant knowledge base articles.
Pattern similarity. For example, you can compare a model’s response to a set of denial templates. Even if the wording varies, high similarity can indicate that the response is a refusal. You can do the same for other patterns, including jailbreaks.
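As a sketch, here is how these reference-free similarity checks could look with sentence-transformers; the model name, texts, and thresholds are illustrative assumptions:

```python
# A minimal sketch of reference-free similarity checks: response vs. retrieved context,
# and response vs. known refusal templates (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

context = "Orders can be cancelled within 24 hours of purchase for a full refund."
response = "You can cancel your order within one day and get your money back."
refusal_templates = ["I'm sorry, I can't help with that.", "I am unable to answer this question."]

context_similarity = util.cos_sim(model.encode(response), model.encode(context)).item()
refusal_similarity = util.cos_sim(model.encode(response), model.encode(refusal_templates)).max().item()

possible_hallucination = context_similarity < 0.5  # weak grounding in the retrieved context
looks_like_refusal = refusal_similarity > 0.8      # close to a known denial pattern

print(f"Context similarity: {context_similarity:.2f}, refusal similarity: {refusal_similarity:.2f}")
```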
TL;DR: Prompting LLMs to evaluate outputs against custom criteria.
We already looked at reference-based LLM judges, but you can do the same for any custom property that you can't define through rules. Just explain your criteria in natural language, put them in a prompt, and let the LLM judge the output and return a score or a label.
Such LLM judges help scale human labeling efforts and are one of the most universal evaluation methods. Here’s how they can work.
Direct evaluation. You can judge any standalone text qualities. For example, check if a chatbot response has the right tone or if user queries are appropriate.
Context-based evaluation. You can also assess outputs alongside inputs or retrieved data. In this case, you pass two pieces of text to the evaluator. For instance, you can check whether RAG search finds the context that can answer the user question:
Or, you can look for “hallucinations” to detect if any responses give unsupported information. Pass both the context and the answer:
Conversation-level evaluation. You can also assess entire user sessions by asking the LLM to review transcripts. This is particularly useful for AI agents and chatbots. The goal could be to check if the LLM completed the user's task or maintained a consistent tone.
To create an LLM judge, you can start with direct prompting by asking something like: "Is this response polite?" Add details on what you mean by politeness and request a binary (yes/no) or multi-level rating (e.g., 1-5 stars).
You can enhance this with Chain of Thought (CoT), asking the model to reason step-by-step and optionally include examples (see Wang et al., 2023 and Zheng et al., 2023).
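For illustration, here is a minimal sketch of a direct, reference-free judge with a step-by-step instruction, again using the OpenAI SDK; the politeness criteria and model name are assumptions to refine against your own assessments:

```python
# A minimal sketch of a reference-free LLM judge with a simple chain-of-thought instruction.
from openai import OpenAI

client = OpenAI()

POLITENESS_JUDGE = """You will be shown a chatbot RESPONSE.
Politeness means the response is courteous, avoids blunt or dismissive wording,
and does not blame the user.
Think step by step, then return a verdict on the last line: POLITE or IMPOLITE.

RESPONSE: {response}"""

def judge_politeness(response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": POLITENESS_JUDGE.format(response=response)}],
        temperature=0,
    )
    # Keep only the final verdict line
    return completion.choices[0].message.content.strip().splitlines()[-1]

print(judge_politeness("That's not my problem. Read the manual."))
```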
Several more complex approaches build on this idea. For example:
Navigating these nicknames can be confusing: terms like G-Eval or GPTScore may sound like specific metrics, but they're actually prompting techniques for LLM evaluations. There are also fine-tuned evaluation models like Prometheus (Li et al., 2024). You can use them or prompt any LLM yourself to evaluate specific qualities like “faithfulness” or "politeness".
Studies also show that complex techniques don't always yield better results. For example, simply asking the LLM to think step by step can outperform G-Eval (Chiang et al., 2023). It's always best to test different approaches against your own assessments!
Here are a few common metrics implemented through LLM judges:
While LLM judges are powerful, their quality depends entirely on the prompt and underlying LLM. Just like system prompts, the evaluation prompts need to be refined and tested. Also, be cautious with domains like medicine or finance — generic LLMs may not provide reliable evaluations for specialized topics.
TL;DR: Using pre-trained ML models to score your texts.
To use machine learning for LLM evaluations, you don’t always need a large, general-purpose LLM. Smaller models can often perform just as well for specific tasks.
These models are trained on labeled data to predict or detect specific features. For example:
Many of these qualities apply broadly across use cases. Thanks to open-source communities, you can access publicly available pre-trained models. You can run them locally without incurring API costs or sending sensitive data externally.
Additionally, if you have labeled data (or collected enough assessments from your LLM judges), you can train your own specialized models. By fine-tuning a pre-trained model with your examples, you can create a highly tailored evaluator that is cheap and fast to run.
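As a sketch, here is how you could run a small pre-trained classifier locally with the Hugging Face transformers package; the default sentiment model is used for illustration, and you could swap in a toxicity or topic classifier from the Hub:

```python
# A minimal sketch of scoring texts with a small pre-trained model (pip install transformers).
from transformers import pipeline

# Uses the default sentiment-analysis checkpoint; swap in another model for toxicity, topics, etc.
sentiment = pipeline("sentiment-analysis")

outputs = [
    "Thanks for reaching out! Happy to help.",
    "This is useless, nothing you suggested works.",
]

for text in outputs:
    result = sentiment(text)[0]
    print(f"{result['label']} ({result['score']:.2f}): {text}")
```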
Of course, there's a catch: you need to know what each ML model does and test it against your criteria. The quality will vary if your data is very different from what the model was trained on.
There are tons of evaluation methods and metrics out there, but you won’t need all of them for every app. Focus on what matters most based on your use case and the errors you’re running into.
For example, for a RAG-based chatbot, you might check retrieval quality, answer accuracy, and over-refusals. A few ranking metrics and custom LLM judges could do the trick.
Here are some key tips to keep in mind:
First, always look at your data. Nothing builds intuition better than looking at real-world examples. This helps you spot patterns, shape your quality criteria, and understand failures. Bring your whole team in early — especially domain experts — and curate test data.
Next, define what "quality" means for your app. Consider the basics: response structure, length, and language. What are your positive indicators, like tone or helpfulness? What risks do you want to avoid, like toxic responses, denials, or irrelevant content?
Finally, choose your methods wisely. Don’t just go for a complex, obscure LLM metric you can’t explain or match to your labels. Start by making your own qualitative assessments. Then, think back to metrics and decide how to implement them. For instance, you could measure toxicity with LLM judges, ML models, or regex filters. The best option will depend on factors like cost, privacy, and accuracy.
When in doubt, start with LLM judges. They’re widely used in practice and offer a flexible alternative to human review, especially for open-ended or session-level evaluations. Plus, they’re easy to get started with — just write a custom prompt and refine it as needed.
LLM evals can get tricky: we built Evidently to make this process easier. Our open-source library (with over 25 million downloads!) supports a variety of evaluation methods. For teams, we provide Evidently Cloud — a no-code workspace to collaborate on AI quality, testing, and monitoring and run complex evaluation workflows.
We specialize in helping you assess complex systems like RAGs, AI agents and mission-critical apps where you need an extra layer of safety and security. Evidently also helps you generate test scenarios and agent simulations using synthetic data. If this is what you’re working on, reach out — we’d love to help!