LLM evaluations are quality checks that help make sure your AI product does what it's supposed to, whether that's writing code or handling support tickets.
They are different from benchmarking: you’re testing your system on your tasks, not just comparing models. You need these checks throughout the AI product lifecycle, from testing to live monitoring.
This guide focuses on automated LLM evaluations. We’ll introduce methods that apply across use cases, from summarization to chatbots. The goal is to give you an overview so you can easily pick the right method for any LLM task you come across.
For hands-on code examples, check the open-source Evidently library and docs Quickstart.
Test fast, ship faster. Evidently Cloud gives you reliable, repeatable evaluations for complex systems like RAG and agents — so you can iterate quickly and ship with confidence.
If you're new to LLM evaluations, check out this introduction guide. It covers why evals are important and breaks down the core workflows. Here is a quick recap:
Automated evaluation workflows have two parts. First, you need data: this could be synthetic examples, curated test cases, or real logs from your LLM app. Second, you need a scoring method. This might return a pass/fail, a label, or a numerical score that tells you if the output is sound. In the end you always get some measure, but there are many ways to arrive at it.
This evaluation process looks simple on the surface, but it can get messy fast. There are different tasks, data structures, and definitions of what “good” is for each use case. And how exactly do you perform this automated scoring?
To pick the right method, you first need to identify the evaluation scenario. They fall into two main buckets:
Reference-based evaluations compare outputs to known answers, often referred to as "ground truth." Once you create this test dataset, your job is to measure how closely the new LLM outputs match these known responses. Techniques range from an exact match to semantic comparisons.
While it may seem strange to test the system on something you already know, this gives a clear target to aim for. As you iterate on your LLM app, you can track how much better (or worse) it is getting.
Reference-free evaluations don't require predefined answers. Instead, they assess outputs based on proxy metrics or specific qualities like tone, structure, or safety.
This applies to tasks where creating ground truth data is difficult or impractical, like chatbot conversations or creative writing. It’s also ideal for live monitoring, where you score responses in real-time. A common method here is LLM-as-a-judge, where you prompt an LLM to evaluate outputs along various dimensions, like asking, "Is this response helpful?"
Some methods, like LLM judges, work well in both cases. However, many metrics are limited to reference-based scenarios, making this distinction useful.
Even if you have reference examples, an important question is whether each input has a single correct answer. If it does, you can use straightforward, deterministic checks. It's easy to compare if you get what you expect.
When LLMs handle predictive tasks, you can also rely on established machine learning metrics. This comes up more often than it may seem! Even if your system isn’t predictive, parts of it could be. For example, a classifier might detect user intent in a chatbot, or you may solve a ranking problem for your RAG system. Both tasks have multiple metrics to choose from.
However, you don't always have a single "perfect" answer. In translation, content generation, or summarization, there are multiple valid outputs and exact matching won’t work. Instead, you need comparison methods like semantic similarity that can handle variations.
A similar thing applies to reference-free evals. Sometimes, you can use objective checks like running generated code to see if it works or verifying JSON keys. But in most cases, you’d deal with open-ended evaluations. So you need to find ways to quantify subjective qualities or come up with proxy metrics.
Here is a decision tree to navigate different methods. We’ll explain each option below.
Let’s talk about another distinction: dataset-level vs. input-level evaluations.
Some metrics naturally work on entire datasets. For example, classification metrics like precision or F1-score aggregate the results across all predictions and give you a single quality measure. You can pick different metrics depending on what’s most important — like focusing on minimizing certain types of errors.
In contrast, LLM-specific evaluation methods typically assess each response individually. For example, you might use LLM judges or calculate semantic similarity for each output. But there’s no built-in mechanism to combine these scores into a single performance metric. It’s up to you how to aggregate results across your test set or live responses.
Sometimes it’s simple: average scores or count how many outputs got a "good" label. But in other cases, you might need extra steps. For instance, numerical averages don’t tell you how many individual bad outputs you have. To account for this, you could set a threshold first, like marking any response with a semantic similarity below 0.85 as "incorrect".
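For instance, here is a minimal Python sketch of how you might turn per-response similarity scores into dataset-level numbers; the scores and the 0.85 cut-off are purely illustrative:

```python
# A minimal sketch: aggregate per-response similarity scores into dataset-level measures.
# The scores and the 0.85 threshold are illustrative, not a universal recommendation.
similarity_scores = [0.92, 0.81, 0.97, 0.88, 0.64]  # e.g., one score per response

THRESHOLD = 0.85
labels = ["correct" if s >= THRESHOLD else "incorrect" for s in similarity_scores]

average_score = sum(similarity_scores) / len(similarity_scores)
share_correct = labels.count("correct") / len(labels)

print(f"Average similarity: {average_score:.2f}")
print(f"Share of correct responses: {share_correct:.0%}")
```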
You can also run compound checks that evaluate multiple criteria per response, such as assessing its tone, length, and relevance. Each input can then get multiple descriptors.
To avoid drowning in metrics, it helps to define clear test conditions. For example, you might set quality thresholds like:
By breaking evaluations into these structured tests, you can better understand the results of each evaluation run. If something goes wrong, you’ll know exactly where.
Finally, let’s move on to methods! We’ll break them down by whether or not you need ground truth data and go through methods one by one. Each section is self-contained, so you can skip to what interests you most.
Here are all the methods at a glance:
Reference-based evals are great for experiments. When you’re testing different prompts, models, or configurations, you need a way to track progress. Otherwise, you’re just blindly trying ideas. Running repeated checks on your test set helps you see if you’re getting more things right over time.
In regression testing, the same approach helps you confirm that updates don’t break what’s already working or introduce new bugs.
The evaluation process itself is simple:
But here’s the catch: your evaluation is only as good as the test dataset. You’ve got to put in the work — label some data, create synthetic examples, or pull from production logs. The dataset needs to be diverse and kept up to date as you find new user scenarios or issues. If it’s too small and simple, the evaluation won’t tell you much.
TL;DR: These metrics help quantify performance across a dataset of examples for tasks like binary and multi-class classification.
Classification is about predicting a discrete label for each input. Such tasks often appear as components in larger workflows, but LLMs can also handle them directly. Some examples:
Each task has predefined categories, and the system needs to assign each input to one of them — or sometimes several, if multiple tags are allowed. To evaluate performance, you check whether the system picks the right classes across a diverse set of inputs.
For example, say you’re testing a chatbot intent detector. You start by preparing a test set with user questions and their correct categories. Then, you run those questions through your AI app and compare its predictions to the actual labels.
An intuitive metric here is accuracy, which tells you how many classifications are correct. However, it’s not always the best measure.
Imagine you’re classifying queries as "safe" or "unsafe" to avoid risky situations like giving personal financial advice. In this case, you’d focus more on:
Both metrics give a balanced view. Say, high recall shows that you caught nearly all unsafe queries. But if precision is low, you flag too many harmless queries incorrectly. That's not a good user experience!
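As a minimal sketch, here is how you could compute these metrics with scikit-learn, treating "unsafe" as the positive class; the labels and predictions below are made-up examples:

```python
# A minimal sketch of dataset-level classification metrics for a "safe"/"unsafe" classifier.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["unsafe", "safe", "unsafe", "safe", "safe", "unsafe"]    # ground truth labels
y_pred = ["unsafe", "safe", "safe", "safe", "unsafe", "unsafe"]    # model predictions

# Treat "unsafe" as the positive class: recall shows how many unsafe queries you caught,
# precision shows how many flagged queries were truly unsafe.
precision = precision_score(y_true, y_pred, pos_label="unsafe")
recall = recall_score(y_true, y_pred, pos_label="unsafe")
f1 = f1_score(y_true, y_pred, pos_label="unsafe")

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```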
You might also need per-class metrics. For instance, in content moderation, your system might perfectly detect "offensive language" but miss a lot of "spam." Tracking only overall accuracy could hide this imbalance when you have multiple categories.
Here’s a quick summary of key classification metrics:
These metrics are core to traditional machine learning evaluation. There’s plenty of material out there on their pros and cons. For example, check our guides about precision-recall tradeoff and classification metrics by class.
TL;DR: These metrics measure performance for tasks like retrieval (including RAG) and recommendations by evaluating how well systems rank relevant results.
When we talk about ranking tasks, we usually mean either search or recommendations, both of which are important in LLM applications.
In both cases, each item (document, product, chunk etc.) can be labeled as relevant or not, creating the ground truth for performance evaluation.
Interestingly, LLMs themselves can help generate these relevance labels by assigning scores to query-item pairs (Thomas et al., 2024). They can also help generate ground truth question-response pairs for RAG evaluations.
Once you have ground truth data, you can evaluate the system performance. Ranking metrics fall into two main types:
For example, a metric like NDCG evaluates both relevance and ranking order, assigning more weight to items near the top of the list. In contrast, Hit Rate checks if at least one correct answer was found, even if it’s ranked last.
Evaluations often target top-K results (e.g., top-5 or top-10), since systems usually retrieve many items but display or use only a few. For example, Hit Rate@5 measures how often at least one relevant result shows up in the top 5.
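As an illustration, here is a small self-contained sketch of Hit Rate@K and MRR over made-up relevance labels; no specific retrieval library is assumed:

```python
# A minimal sketch of top-K ranking checks. Each inner list holds relevance labels
# (1 = relevant, 0 = not) for retrieved items, already ordered by rank.
retrieved = [
    [0, 1, 0, 0, 1],  # query 1: first relevant item at rank 2
    [0, 0, 0, 0, 0],  # query 2: nothing relevant retrieved
    [1, 0, 0, 1, 0],  # query 3: first relevant item at rank 1
]

def hit_rate_at_k(results, k):
    # Share of queries with at least one relevant item in the top K.
    return sum(1 for r in results if any(r[:k])) / len(results)

def mrr(results):
    # Mean Reciprocal Rank: average of 1 / rank of the first relevant item.
    reciprocal_ranks = []
    for r in results:
        rank = next((i + 1 for i, rel in enumerate(r) if rel), None)
        reciprocal_ranks.append(1 / rank if rank else 0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(f"Hit Rate@5: {hit_rate_at_k(retrieved, 5):.2f}")  # 2 of 3 queries have a hit
print(f"MRR: {mrr(retrieved):.2f}")                      # (1/2 + 0 + 1) / 3 = 0.5
```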
Here are some commonly used ranking metrics:
In recommendation systems, additional metrics like diversity, novelty, and serendipity evaluate user experience. These metrics check whether suggestions are varied, fresh, and non-repetitive.
Ranking metrics explainers. Check this entry-level guide for a deeper dive.
TL;DR: Whenever you can write code to verify outputs against correct responses.
Classification and ranking are examples of narrow tasks with their own well-defined metrics. But there are other cases where you can have one right answer – or something close to it. Examples are:
In these cases, you can perform deterministic matching to check outputs programmatically, similar to software unit tests. But with LLMs, getting one answer right doesn’t mean others will be correct, so you need a variety of test inputs to ensure good coverage.
Imagine testing a system that extracts information from job ads, like identifying job titles. (OLX does something similar). You can perform an exact match between the output and the expected result or fuzzy matching to handle minor variations, like formatting or capitalization.
If the output is structured as JSON, like { "job role": "AI engineer", "min_experience_yrs": "3" }, you can also match the JSON key-value pairs.
If you deal with narrow Q&A, you can build a ground truth dataset with expected keywords. For example, the answer to "What’s the capital of France?" should always include "Paris." You can then check if each response contains the right words without full-text matching.
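Here is a minimal Python sketch of such deterministic checks (exact match, fuzzy match, JSON field match, and keyword presence) using only the standard library; the field names, texts, and 0.9 threshold are illustrative:

```python
# A minimal sketch of deterministic checks against expected values.
import json
from difflib import SequenceMatcher

expected = {"job role": "AI engineer", "min_experience_yrs": "3"}
output = '{"job role": "AI Engineer", "min_experience_yrs": "3"}'
parsed = json.loads(output)

# Exact match: strict, fails on a capitalization difference
exact_ok = parsed["job role"] == expected["job role"]

# Fuzzy match: tolerates minor formatting variations
fuzzy_score = SequenceMatcher(None, parsed["job role"].lower(), expected["job role"].lower()).ratio()
fuzzy_ok = fuzzy_score >= 0.9

# JSON key-value match for another field
years_ok = parsed.get("min_experience_yrs") == expected["min_experience_yrs"]

# Keyword check for a narrow Q&A case
answer = "The capital of France is Paris."
keyword_ok = "paris" in answer.lower()

print(exact_ok, round(fuzzy_score, 2), fuzzy_ok, years_ok, keyword_ok)
```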
In stress-testing, you might compare all outputs to a single list of expected words or phrases. For instance, if you want a chatbot to avoid mentioning competitors, you could create challenging prompts and test if the responses contain the expected refusal words.
You can apply similar non-ambiguous checks in other scenarios. For example, you might verify that an AI agent calls correct tools or that a scripted interaction leads to a known outcome, such as retrieving a specific database entry.
For coding tasks, you won’t always expect an exact match to reference, but you can find other ways to verify correctness. For example, prepare literal unit tests. GitLab uses this when evaluating their Copilot system: they introduce failures on purpose and then check if the model can generate code that fixes the errors and passes the tests.
To aggregate the results of deterministic checks, you can track the overall pass rate or organize test cases by scenarios for more detailed insights.
TL;DR: Comparing the responses using word, item, or character overlap.
Until now, we mostly looked at constrained tasks where there is a single right outcome. But many LLM applications generate free-form texts. In these cases, you may have an example response but won’t expect to match it precisely.
For instance, when summarizing financial reports, you might compare each new summary to a human-written “ideal” one. But there are many ways to convey the same information, so summaries with only partial word overlap can still be just as good.
To address this, the machine learning research community came up with overlap-based metrics. They reflect how many shared symbols, words, or word sequences there are between the reference and the generated response.
Here are some examples:
It's common to use averages to summarize the scores across multiple examples, but variations are possible, like computing BLEU or ROUGE at the corpus level over the entire dataset.
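As a sketch, here is how you could compute these scores with the open-source nltk and rouge-score packages; the texts are illustrative, and other implementations exist:

```python
# A minimal sketch of overlap-based scoring (pip install nltk rouge-score).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The company reported a 10% increase in quarterly revenue."
candidate = "Quarterly revenue grew by 10%, the company reported."

# BLEU: n-gram precision of the candidate against the reference
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: n-gram and longest-common-subsequence overlap, recall-oriented
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.2f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.2f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
```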
But here is the rub: while overlap-based metrics have been foundational in NLP research, they often don't correlate well with human judgements, and aren’t suitable for very open-ended tasks. Modern alternatives, such as embedding- or LLM-based evaluations, offer more context-aware assessments.
TL;DR: Using pre-trained embedding models for semantic matching.
Both exact match and overlap-based metrics compare words and elements of responses, but they don’t consider their meaning. But that's what we often care about!
Consider these two chatbot responses:
Most human readers would agree these are roughly equivalent. However, overlap-based metrics wouldn’t, since the two responses share few words.
Semantic similarity methods help compare meaning instead of words. They use pre-trained models like BERT, many of which are open-source. These models turn text into vectors that capture the context and relationships between words. By comparing the distance between these vectors, you can get a sense of how similar the texts really are.
Some popular approaches include:
You can also implement semantic similarity directly: pick an embedding model (e.g., one that works at the sentence level), turn the texts into vectors, and measure cosine similarity.
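Here is a minimal sketch of this approach with the sentence-transformers package; the model name is one common open-source choice, and the texts and threshold are illustrative:

```python
# A minimal sketch of semantic similarity (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # one popular sentence-level embedding model

reference = "You can return the item within 30 days for a full refund."
response = "Refunds are available if you send the product back within a month."

embeddings = model.encode([reference, response])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Cosine similarity: {similarity:.2f}")
# You could then apply a cut-off, e.g., treat scores below 0.85 as mismatches.
```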
To aggregate the results, you can look at averages or define a cut-off threshold and calculate the share of matches with sufficiently high similarity. If you have meaningful metadata (like topics in a Q&A dataset), you can also compute category-specific scores.
The limitation of these methods is that they depend entirely on the embedding model you use. It may not capture the nuances of meaning well, and false matches are possible. Focusing only on token- or sentence-level comparisons can also miss the broader context of the passage.
TL;DR: Prompting an LLM to compare responses or choose the best one.
While making a semantic match ("Do these two responses mean the same thing?") is often the goal, embedding-based similarity isn’t always the most precise option. Embeddings can capture general ideas but miss important details. For example:
Though the order changes the meaning, embedding vectors for these sentences might still appear similar. Or consider:
In this case, the correct naming of specific menu items can be critical to determining accuracy, but embedding-based checks will not penalize this enough. To get better results, you can use an LLM as a judge for similarity matching.
The process is close to how you use LLMs in your product — except here, the task is a narrow classification. For instance, you could pass the reference and the new response to an LLM and ask: "Does the RESPONSE convey the same meaning as the REFERENCE? Yes or no."
This method is very adaptable. You can specify what to prioritize in matching — such as precise terminology, omissions, or style consistency. You can also ask the LLM to explain its reasoning. This makes the results easier to parse and improves the quality of the evals.
For example, you can separately assess if the two responses match in fact and style.
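As an illustration, here is a minimal sketch of such a correctness judge built on the OpenAI Python SDK; the model name and prompt wording are assumptions you would adapt and test against your own labels:

```python
# A minimal sketch of an LLM judge for reference matching.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing a new RESPONSE to a REFERENCE answer.
Does the RESPONSE convey the same meaning as the REFERENCE?
First explain your reasoning in one sentence, then answer with a single word: YES or NO.

REFERENCE: {reference}
RESPONSE: {response}"""

def judge_match(reference: str, response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reference=reference, response=response)}],
        temperature=0,
    )
    return completion.choices[0].message.content

print(judge_match(
    "You can return the item within 30 days.",
    "Returns are accepted for a month after purchase.",
))
```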
This method is widely used in practice. For instance, Segment describes how they use LLM-based matching to compare generated queries. Prompt-based evals have also proven themselves for translation (see Kocmi et al., 2023).
Another way to use LLMs is through pairwise comparison, where you give the LLM two responses and ask it to determine which is better. This is where the nickname "LLM as a Judge" comes from (see Zheng et al., 2023).
Both methods might require tuning to match your preferences: an LLM judge requires its own evaluation! Pairwise comparisons, in particular, can also be biased — LLMs may favor longer or more polished responses or outputs similar to their own.
If matching two answers is not an option, use reference-free evaluations. This applies to:
In these cases, you can measure custom qualities or use proxy metrics — quantifiable properties that say something useful about the output, starting with simple text length.
It's worth noting that while we often refer to checks like "average helpfulness" as metrics, they don’t have a strict mathematical definition like NDCG or precision. You tailor their implementation to your use case, and that’s where most of the work goes.
What's consistent are the evaluation methods you can apply. Let’s break them down!
TL;DR: Tracking the frequency of words and patterns, like topical or risky keywords.
Regular expressions are a simple way to check for specific keywords, phrases, or structures in text. While they might seem basic, they can be surprisingly useful.
For example, you can use regex to track mentions of competitors or products. These are usually limited lists of entities, so it’s easy to define patterns and flag how often they come up.
You can also track topical keywords. Say, if you have a travel chatbot, you can look for words like "cancel," "refund," or "change booking" to check how often it has to deal with cancellations. To simplify setting up such evals, you can use LLMs to brainstorm word lists, and use libraries that automatically handle variations of the same root word.
You can also flag conversations by checking for high-risk keywords. For example:
Regex can also catch repetitive error patterns. Some LLMs generate canned responses like "As an AI language model" or frequently start with "Certainly!" You can set up regex checks to catch these during regression testing. Similarly, refusal patterns (e.g., "I'm sorry" or "I can't") are easy to track since models often reuse the same wording.
Regex helps with structure checks, too. If your output needs specific elements (e.g., headers or disclaimers in the generated report), string matching can confirm that all sections are present.
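Here is a minimal Python sketch of such regex and string checks; the patterns, competitor names, and required sections are illustrative:

```python
# A minimal sketch of regex-based checks: refusal phrases, competitor mentions,
# and a required-section check on a generated report.
import re

response = "I'm sorry, but I can't help with that request."

refusal_pattern = re.compile(r"\b(i'm sorry|i can't|as an ai language model)\b", re.IGNORECASE)
competitor_pattern = re.compile(r"\b(acme corp|globex)\b", re.IGNORECASE)  # hypothetical names

is_refusal = bool(refusal_pattern.search(response))
mentions_competitor = bool(competitor_pattern.search(response))

# Structure check: confirm the generated report contains all required sections
report = "## Summary\n...\n## Risks\n...\n## Disclaimer\n..."
required_sections = ["## Summary", "## Risks", "## Disclaimer"]
has_all_sections = all(section in report for section in required_sections)

print(is_refusal, mentions_competitor, has_all_sections)
```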
Once you define your patterns, regex can efficiently scan large datasets. It’s fast, reliable, and doesn’t require external API calls. Plus, it adds a welcome touch of certainty in an otherwise unpredictable field — if the pattern exists, regex will detect it.
Here are examples of metrics you can implement:
That said, regex has its limitations. It’s strict by nature, meaning it won't catch typos or variations unless you explicitly define them. You'll need to carefully design and maintain your pattern definitions to cover the full range of possible outputs.
TL;DR: Tracking statistics like word or sentence count.
You can often tie quality to measurable text features. For instance, text length matters for tasks like generating social media posts, help articles, or summaries — outputs should be concise but still meaningful. Sometimes, length requirements are directly baked into the prompt, like: “Give a one-sentence takeaway”.
Checks like word and sentence count are quick, cheap, and easy to implement, making them useful at every stage. In regression testing: are all outputs within expected limits? In monitoring: what’s the average, minimum, or maximum length? Individual outliers can often signal problems, like longer prompts containing jailbreaks.
You can also explore metrics like readability scores. These provide a rough estimate of the education level someone would need to understand the text. This is useful if you're generating content that should be accessible to younger audiences or non-native speakers.
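As a sketch, here are a few such checks in Python; the textstat package is one option for readability scores, and the word limit is an arbitrary example:

```python
# A minimal sketch of descriptive text statistics (pip install textstat).
import textstat

output = "Our refund policy is simple. Send the item back within 30 days and we will refund you in full."

word_count = len(output.split())
sentence_count = output.count(".") + output.count("!") + output.count("?")  # rough count
readability = textstat.flesch_reading_ease(output)  # higher = easier to read

print(f"Words: {word_count}, Sentences: {sentence_count}, Flesch reading ease: {readability:.0f}")

# In regression testing, you might assert that outputs stay within expected limits:
assert word_count <= 50, "Output is longer than expected"
```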
There are other stats to consider, like non-letter character counts. They are not always useful as is, but can give a good signal if things shift. For example, a sudden increase in non-vocabulary words might indicate spam, model behavior changes, or automated attacks that need investigation.
TL;DR: Validating format and structure for tasks like code or JSON generation.
There are other programmatic checks beyond regex and text stats. If your LLM generates structured content like SQL, code, JSON, or interacts with APIs and databases, you can often write code to verify correctness at least partially.
These checks focus on format, structure, or functionality. A few examples:
Test code execution. If the model returns code, you can test that it's syntactically correct and, ideally, executable. You can run it in a sandbox or against your own test suite to verify that it performs as expected.
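Here is a minimal sketch of such format checks using only the Python standard library; the expected keys and the code snippet are illustrative:

```python
# A minimal sketch of programmatic format checks: JSON validity, required keys,
# and a Python syntax check on generated code.
import ast
import json

llm_output = '{"job role": "AI engineer", "min_experience_yrs": "3"}'

# Is it valid JSON with the expected keys?
try:
    parsed = json.loads(llm_output)
    json_valid = True
    has_required_keys = {"job role", "min_experience_yrs"} <= parsed.keys()
except json.JSONDecodeError:
    json_valid, has_required_keys = False, False

# Is generated Python code at least syntactically correct?
generated_code = "def add(a, b):\n    return a + b"
try:
    ast.parse(generated_code)
    code_parses = True
except SyntaxError:
    code_parses = False

print(json_valid, has_required_keys, code_parses)
```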
TL;DR: Checking if responses align with inputs, context, or known patterns.
Semantic similarity isn’t just for reference matching. It’s also useful when you don’t have a ground truth response. Here are a few ways it can help:
Input-output similarity. For example, if you’re turning bullet points into a full email, you can measure how closely the generated email matches the original points. This helps check whether it stays true to the source. Similarly, in Q&A tasks, a large semantic gap between the question and the response might indicate the answer isn’t relevant.
Response-context similarity. In RAG tasks, you can compare the generated answers to the retrieved context. High similarity would reflect that the model is properly using the information. Low similarity could suggest hallucination — when the model fabricates unsupported details.
For example, DoorDash monitors chatbot responses this way, flagging outputs with low similarity to relevant knowledge base articles.
Pattern similarity. For example, you can compare a model’s response to a set of denial templates. Even if the wording varies, high similarity can indicate that the response is a refusal. You can do the same for other patterns, including jailbreaks.
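As a sketch, here is how these reference-free similarity checks could look with sentence-transformers; the model name, texts, and thresholds are illustrative assumptions:

```python
# A minimal sketch of reference-free similarity checks: response vs. retrieved context,
# and response vs. known refusal templates (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

context = "Orders can be cancelled within 24 hours of purchase for a full refund."
response = "You can cancel your order within one day and get your money back."
refusal_templates = ["I'm sorry, I can't help with that.", "I am unable to answer this question."]

context_similarity = util.cos_sim(model.encode(response), model.encode(context)).item()
refusal_similarity = util.cos_sim(model.encode(response), model.encode(refusal_templates)).max().item()

possible_hallucination = context_similarity < 0.5  # weak grounding in the retrieved context
looks_like_refusal = refusal_similarity > 0.8      # close to a known denial pattern

print(f"Context similarity: {context_similarity:.2f}, refusal similarity: {refusal_similarity:.2f}")
```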
TL;DR: Prompting LLMs to evaluate outputs against custom criteria.
We already looked at reference-based LLM judges, but you can do the same for any custom property that you can't define through rules. Just explain your criteria in natural language, put them in a prompt, and let the LLM judge the output and return a score or a label.
Such LLM judges help scale human labeling efforts and are one of the most universal evaluation methods. Here’s how they can work.
Direct evaluation. You can judge any standalone text qualities. For example, check if a chatbot response has the right tone or if user queries are appropriate.
Context-based evaluation. You can also assess outputs alongside inputs or retrieved data. In this case, you pass two pieces of text to the evaluator. For instance, you can check whether RAG search finds the context that can answer the user question:
Or, you can look for “hallucinations” to detect if any responses give unsupported information. Pass both the context and the answer:
Conversation-level evaluation. You can also assess entire user sessions by asking the LLM to review transcripts. This is particularly useful for AI agents and chatbots. The goal could be to check if the LLM completed the user's task or maintained a consistent tone.
To create an LLM judge, you can start with direct prompting by asking something like: "Is this response polite?" Add details on what you mean by politeness and request a binary (yes/no) or multi-level rating (e.g., 1-5 stars).
You can enhance this with Chain of Thought (CoT), asking the model to reason step-by-step and optionally include examples (see Wang et al., 2023 and Zheng et al., 2023).
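For illustration, here is a minimal sketch of a direct, reference-free judge with a step-by-step instruction, again using the OpenAI SDK; the politeness criteria and model name are assumptions to refine against your own assessments:

```python
# A minimal sketch of a reference-free LLM judge with a simple chain-of-thought instruction.
from openai import OpenAI

client = OpenAI()

POLITENESS_JUDGE = """You will be shown a chatbot RESPONSE.
Politeness means the response is courteous, avoids blunt or dismissive wording,
and does not blame the user.
Think step by step, then return a verdict on the last line: POLITE or IMPOLITE.

RESPONSE: {response}"""

def judge_politeness(response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": POLITENESS_JUDGE.format(response=response)}],
        temperature=0,
    )
    # Keep only the final verdict line
    return completion.choices[0].message.content.strip().splitlines()[-1]

print(judge_politeness("That's not my problem. Read the manual."))
```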
Several more complex approaches build on this idea. For example:
Navigating these nicknames can be confusing: terms like G-Eval or GPTScore may sound like specific metrics, but they're actually prompting techniques for LLM evaluations. There are also fine-tuned evaluation models like Prometheus (Li et al., 2024). You can use them or prompt any LLM yourself to evaluate specific qualities like “faithfulness” or "politeness".
Studies also show that complex techniques don't always yield better results. For example, simply asking the LLM to think step by step can outperform G-Eval (Chiang et al., 2023). It's always best to test different approaches against your own assessments!
Here are a few common metrics implemented through LLM judges:
While LLM judges are powerful, their quality depends entirely on the prompt and underlying LLM. Just like system prompts, the evaluation prompts need to be refined and tested. Also, be cautious with domains like medicine or finance — generic LLMs may not provide reliable evaluations for specialized topics.
TL;DR: Using pre-trained ML models to score your texts.
To use machine learning for LLM evaluations, you don’t always need a large, general-purpose LLM. Smaller models can often perform just as well for specific tasks.
These models are trained on labeled data to predict or detect specific features. For example:
Many of these qualities apply broadly across use cases. Thanks to open-source communities, you can access publicly available pre-trained models. You can run them locally without incurring API costs or sending sensitive data externally.
Additionally, if you have labeled data (or collected enough assessments from your LLM judges), you can train your own specialized models. By fine-tuning a pre-trained model with your examples, you can create a highly tailored evaluator that is cheap and fast to run.
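As a sketch, here is how you could run a small pre-trained classifier locally with the Hugging Face transformers package; the default sentiment model is used for illustration, and you could swap in a toxicity or topic classifier from the Hub:

```python
# A minimal sketch of scoring texts with a small pre-trained model (pip install transformers).
from transformers import pipeline

# Uses the default sentiment-analysis checkpoint; swap in another model for toxicity, topics, etc.
sentiment = pipeline("sentiment-analysis")

outputs = [
    "Thanks for reaching out! Happy to help.",
    "This is useless, nothing you suggested works.",
]

for text in outputs:
    result = sentiment(text)[0]
    print(f"{result['label']} ({result['score']:.2f}): {text}")
```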
Of course, there's a catch: you need to know what each ML model does and test it against your criteria. The quality will vary if your data is very different from what the model was trained on.
There are tons of evaluation methods and metrics out there, but you won’t need all of them for every app. Focus on what matters most based on your use case and the errors you’re running into.
For example, for a RAG-based chatbot, you might check retrieval quality, answer accuracy, and over-refusals. A few ranking metrics and custom LLM judges could do the trick.
Here are some key tips to keep in mind:
First, always look at your data. Nothing builds intuition better than looking at real-world examples. This helps you spot patterns, shape your quality criteria, and understand failures. Bring your whole team in early — especially domain experts — and curate test data.
Next, define what "quality" means for your app. Consider the basics: response structure, length, and language. What are your positive indicators, like tone or helpfulness? What risks do you want to avoid, like toxic responses, denials, or irrelevant content?
Finally, choose your methods wisely. Don’t just go for a complex, obscure LLM metric you can’t explain or match to your labels. Start by making your own qualitative assessments. Then, think back to metrics and decide how to implement them. For instance, you could measure toxicity with LLM judges, ML models, or regex filters. The best option will depend on factors like cost, privacy, and accuracy.
When in doubt, start with LLM judges. They’re widely used in practice and offer a flexible alternative to human review, especially for open-ended or session-level evaluations. Plus, they’re easy to get started with — just write a custom prompt and refine it as needed.
LLM evals can get tricky: we built Evidently to make this process easier. Our open-source library (with over 25 million downloads!) supports a variety of evaluation methods. For teams, we provide Evidently Cloud — a no-code workspace to collaborate on AI quality, testing, and monitoring and run complex evaluation workflows.
We specialize in helping you assess complex systems like RAGs, AI agents and mission-critical apps where you need an extra layer of safety and security. Evidently also helps you generate test scenarios and agent simulations using synthetic data. If this is what you’re working on, reach out — we’d love to help!