This guide is for anyone working on LLM-powered systems — from engineers to product managers — looking for an “introduction to evals”.
We'll cover the basics of evaluating LLM-powered applications without getting too technical. As long as you understand how to use LLMs in your product, you’re good to go!
This guide will focus on the core evaluation principles and workflows.
Want a deep-dive on methods? Explore the LLM evaluation metrics guide.
Prefer code examples? Check the Evidently library and docs Quickstart.
Or videos? Here is a YouTube playlist with bite-sized introductions to evals.
Test fast, ship faster. Evidently Cloud gives you reliable, repeatable evaluations for complex systems like RAG and agents — so you can iterate quickly and ship with confidence.
Before we talk about evaluations, let's establish what we are evaluating.
An LLM-powered product uses Large Language Models (LLMs) as part of its functionality.
These products can include user-facing features like support chatbots and internal tools like marketing copy generators.
You can add LLM features to existing software, such as allowing users to query data using natural language. Or, you can create entirely new LLM applications, like conversational assistants with an LLM at the very core.
All these apps run on LLMs: readily available text models trained on vast amounts of data. LLMs can tackle a wide range of tasks by following an instruction prompt: writing text, extracting information, generating code, translating, or holding entire conversations.
Some tasks, like generating product descriptions, might only need a single prompt. However, most LLM-powered products grow a bit — or a lot — more complex. For example, you may chain multiple prompts together, like splitting content generation and style alignment in a copywriting assistant.
One popular type of LLM-powered app is a RAG system, which stands for Retrieval Augmented Generation. Despite the intimidating name, it’s a simple concept: you combine an LLM with search. When a user asks a question, the system looks for relevant documents and sends both the question and the discovered context to the LLM to generate an answer. For example, a RAG-based support chatbot could pull information from a help center database and include links to relevant articles in its responses.
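To make this concrete, here is a minimal sketch of a RAG flow in Python. The `search` and `call_llm` functions are just placeholders standing in for your own retriever and model client:

```python
# Minimal RAG sketch. `search` and `call_llm` are placeholders for
# your own retriever and model client.
def search(query: str, top_k: int = 3) -> list[str]:
    # In a real system: query a vector store or keyword index.
    knowledge_base = [
        "You can return items within 30 days of purchase.",
        "Refunds are issued to the original payment method.",
    ]
    return knowledge_base[:top_k]

def call_llm(prompt: str) -> str:
    # In a real system: call your LLM provider here.
    return "You can return items within 30 days."

def answer(question: str) -> str:
    # Combine the retrieved context with the user question in one prompt.
    context = "\n".join(search(question))
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer("What is your return policy?"))
```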
You can also create LLM-powered agents to automate complex workflows that need sequential reasoning: from correcting code to planning and booking trips. Agents go beyond just writing texts — they can use tools you give them. For example, query databases or send calendar invitations. Agentic systems can grow very complex, with multi-step planning, dozens of prompts, and “memory” to keep track of the progress.
Want more examples? Check out this collection of real-world LLM applications.
At some point when creating an LLM-powered system, you'll ask: How well is it working? Does it handle all the scenarios I need? Can I improve it?
To answer these questions, you need evaluations.
LLM evaluations, or "evals" for short, help assess the performance of a large language model to ensure outputs are accurate, safe, and aligned with user needs.
The term applies in two key contexts: evaluating the LLM model itself and evaluating the LLM-powered product built on top of it.
This is an important distinction to make: while some assessment methods may overlap, these evaluations are quite different in spirit.
When evaluating an LLM directly, the focus is on its "raw" abilities — like coding, translating text, or solving math problems. Researchers use standardized LLM benchmarks for this purpose.
There are hundreds of LLM benchmarks, each with its own set of test cases. Most include questions with known correct answers, and the evaluation process checks how closely the model’s responses match these. Some benchmarks use more complex methods, like crowd-sourcing response rankings.
100+ LLM evaluation benchmarks. Check out this collection for more examples.
Benchmarks let you directly compare models. Public leaderboards (like this one) help track how each LLM performs across benchmarks and answer questions like “Which open-source LLM is good at coding?"
However, while these benchmarks are great for choosing models and tracking industry progress, they’re not very useful for evaluating real-world products. They test broad capabilities, not the specific inputs your system might handle. They also focus on the LLM alone, but your product will involve other components.
LLM product evaluation assesses the full system's performance for its specific task.
This includes not just the LLM, but everything else: the prompts, the logic connecting them, the knowledge databases used to augment the answers, etc. You also run these tests on data that fits the use case, like real customer support queries.
Such application-level evals usually address two big aspects: the quality of the outputs (does the system do its job well?) and their safety (does it avoid causing harm?).
However, the specific quality criteria will vary. What defines “good” and what can go “wrong” depends on the use case. For example, if you’re working on a question-answering system, you might evaluate whether the answers are correct, complete, and relevant to the question.
On the safety side, you may want to test that your Q&A system does not produce biased or toxic outputs, or reveal sensitive data even if provoked.
When designing evaluations, you need to narrow down criteria based on your app’s purpose, risks, and the types of errors you observe. Your evals should actually help make decisions. Does the new prompt work better? Is the app ready to go live?
Take criteria like “fluency” or “coherence”. Most modern LLMs are already great at producing natural, logical text. A perfect fluency score may look good on paper but won’t offer any useful information. But if, say, you're working with a smaller, less capable local LLM, testing for fluency can make sense.
Even positive criteria are not always universal. For most cases, factual accuracy is a big deal. But if you build a tool to generate ideas for marketing campaigns, making stuff up might be the whole point. In that case, you’d care more about diversity and creativity than hallucinations.
This is the core distinction between LLM product evals and benchmarks. Benchmarks are like school exams — they measure general skills. LLM product evals are more like job performance reviews. They check if the system excels in the specific task it was "hired" for, and that depends on the job and tools you’re working with.
Key takeaway: Each LLM-powered product requires a custom evaluation framework. The criteria should be both useful — focusing on what truly matters — and discriminative, meaning they can effectively highlight differences in performance across iterations.
These LLM evals are far from straightforward. It’s not just that quality criteria are custom: the approach itself differs from both traditional software testing and predictive machine learning.
Non-deterministic behavior. LLMs produce probabilistic outputs, meaning they may generate different responses for identical inputs.
While this allows for creative and varied answers, it complicates testing: you must check whether a range of outputs aligns with your expectations.
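For example, you could sample the same input several times and check every output against your expectations. Here is a rough sketch, where `call_llm` is a stand-in for your model client:

```python
# Sketch: run the same input several times and check every sample,
# since identical inputs can produce different outputs.
# `call_llm` is a placeholder for your model client.
import random

def call_llm(prompt: str) -> str:
    return random.choice([
        "You can return items within 30 days.",
        "Items can be returned within a 30-day window.",
    ])

def meets_expectation(response: str) -> bool:
    # A simplistic expectation check for the example.
    return "30" in response and "return" in response.lower()

samples = [call_llm("What is your return policy?") for _ in range(5)]
pass_rate = sum(meets_expectation(s) for s in samples) / len(samples)
print(f"Pass rate across samples: {pass_rate:.0%}")
```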
No single correct answer. Traditional machine learning systems, like classifiers or recommenders, deal with predefined outputs. For instance, an email is either spam or not, and each input has one ground truth answer. To test the model, you can create a dataset with known labels and check how well the model predicts them.
But LLMs often handle open-ended tasks like writing emails or holding conversations where multiple valid answers exist. For instance, there are countless ways to write a good email. This means you can’t rely on exact matches to a reference answer. Instead, you need to assess fuzzy similarities or subjective qualities like style, tone, and safety.
Wide range of possible inputs. LLM products often handle diverse use cases. For example, a support chatbot might answer queries about products, returns, or help with account troubleshooting. You need scenario-based tests that cover the full range of expected inputs. Creating a good evaluation dataset is a separate problem to solve!
On top of that, what works in testing doesn’t always hold up in the wild. Real-world users may throw unexpected inputs at your system — pushing it beyond what you planned for. To detect this, you need ways to observe and evaluate the online quality.
Unique risks. Working with probabilistic systems trained to follow natural language instructions brings new types of vulnerabilities, such as hallucinations, prompt injection, and leakage of sensitive data.
You need the right evaluation workflows to address all these: stress-test the system, uncover its weaknesses, and monitor performance in the wild. How do you do this?
Let’s take a look at the possible approaches!
Evaluations generally occur in two key phases: before deployment and after launch.
During the development phase, you need to check if the app is good enough as you iterate on building it. Once it’s live, you’re monitoring that things work well. No matter the phase, every evaluation starts with data. You first need something to evaluate.
Both in testing and production, you can choose between manual and automatic evaluations.
Initially, you can conduct simple “vibe checks” by asking: “Do these responses feel right?”
After creating the first prompt version or RAG setup, you can run a few sample inputs through the LLM app and eyeball the responses. If they are way off, you tweak the prompt or adjust your approach.
Even at this informal stage, you need test cases. For a support bot, you could prepare a few sample questions with known answers. Each time you change something, you’ll assess how well the LLM handles them.
While not systematic, vibe checks help you see if things are working, spot issues, and come up with new prompt ideas. However, this approach isn’t reliable or repeatable. As you move forward, you’ll need more structure — with consistent grading and detailed records of results.
A more rigorous way to leverage human expertise is a labeling or annotation process: you can create a formal workflow where reviewers evaluate responses using set instructions.
They can give binary labels, such as “pass” or “fail,” or evaluate specific qualities, like whether the pulled context is “relevant,” or if the answer is “safe.” You can also ask reviewers to provide a brief explanation of their decisions.
To make the labeling process efficient and consistent, you must provide clear instructions, like asking to look for specific error types. You can also have multiple reviewers evaluate the same inputs to surface conflicting opinions.
These manual evaluations are the most reliable way to determine if your LLM app does its job well. As the product builder, you are best equipped to define what “success” means for your use case. In highly nuanced and specialized fields like healthcare, you may need to bring in subject matter experts to help judge this.
Example: Asana shares how they test AI-powered features with a mix of automated unit tests and manual grading done by the product manager. This hands-on process helped uncover many issues with quality and formatting.
While incredibly valuable, manual labels are expensive to obtain. You can’t review thousands of outputs every time you edit a prompt. You need automation to scale.
With some upfront effort, you can set up automated evaluations. They fall into two types: reference-based evaluations, which compare outputs to known correct answers, and reference-free evaluations, which score outputs on their own.
These evaluations rely on predefined correct answers — commonly called “reference,” “ground truth,” “golden,” or “target” responses.
For example, in a customer support system, the target response for “What is your return policy?” might be “You can return items within 30 days.” You can compare the chatbot’s outputs against such known answers to assess the overall correctness of responses.
These evaluations are inherently offline. You run tests while iterating on your app or before deploying changes to production.
To use this approach, you first need an evaluation dataset: a collection of sample inputs paired with their approved outputs. You can generate such a dataset or curate it from historical logs, like using past responses from human support agents. The closer these cases reflect real-world scenarios, the more reliable your evaluations will be.
Example: GitLab shares how they build Duo, their suite of AI-powered features. They created an evaluation framework with thousands of ground truth answers, which they test daily. They also have smaller proxy datasets for quick iterations.
Once your dataset is ready, automated evaluations work like this: you run the test inputs through your app, compare each new response to the reference answer, and summarize the results with quality metrics.
The tricky part is comparing responses to the ground truth. How do you decide if the new response is correct?
An exact match is an obvious idea: see if the new response is identical to the target one.
But exact matches are often too rigid — in open-ended scenarios different wording can convey the same meaning. To address this, you can use alternative methods, such as quantifying word overlap between the two responses, comparing semantic meaning using embeddings, or even asking LLMs to do the matching.
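Here is a simplified sketch of a few of these matching methods in plain Python. Embedding- or LLM-based matching would follow the same pattern, just with a different scoring function:

```python
# Sketch of simple matching methods: exact match, word overlap,
# and a fuzzy string ratio. Embedding- or LLM-based matching would
# follow the same pattern with a different scoring function.
from difflib import SequenceMatcher

reference = "You can return items within 30 days."
response = "Items can be returned within 30 days of purchase."

# 1. Exact match (case-insensitive).
exact_match = response.strip().lower() == reference.strip().lower()

# 2. Word overlap (Jaccard similarity over word sets).
ref_words, resp_words = set(reference.lower().split()), set(response.lower().split())
word_overlap = len(ref_words & resp_words) / len(ref_words | resp_words)

# 3. Fuzzy character-level similarity.
fuzzy_ratio = SequenceMatcher(None, reference.lower(), response.lower()).ratio()

print(exact_match, round(word_overlap, 2), round(fuzzy_ratio, 2))
```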
After matching the correctness of individual responses, you can analyze the overall performance of your system on the test dataset.
For binary True/False matching, accuracy (percentage of correct responses) is an intuitive metric. For numerical scores like semantic similarity, averages are common but might not tell the full story. Instead, you may look at the share of responses below a set similarity threshold, or test for the lowest score across all examples.
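For illustration, here is how you might aggregate hypothetical per-response scores into dataset-level metrics:

```python
# Sketch: aggregating per-response results into dataset-level metrics.
# The scores below are hypothetical outputs of a matching step.
binary_matches = [True, True, False, True]       # e.g., exact or LLM-judged matches
similarity_scores = [0.92, 0.88, 0.41, 0.95]     # e.g., semantic similarity

accuracy = sum(binary_matches) / len(binary_matches)
threshold = 0.8
share_below = sum(s < threshold for s in similarity_scores) / len(similarity_scores)
worst_case = min(similarity_scores)

print(f"Accuracy: {accuracy:.0%}")
print(f"Share below {threshold}: {share_below:.0%}")
print(f"Lowest similarity: {worst_case}")
```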
If you’re using LLMs for predictive tasks, which is often a component of larger LLM solutions, you can use classic ML quality metrics.
The principle of matching new responses against reference responses is universal, but the details vary by use case.
Once your dataset and evaluators are set up, you can run the evals whenever you need. For example, re-run tests after tweaking a prompt to see if things are getting better or worse.
Want to understand this better? Check the guide on LLM evaluation methods.
However, obtaining ground truth answers isn’t always practical. For complex, open-ended tasks or multi-turn chats, it’s hard to define a single “right” response. And in production, there are no perfect references: you’re evaluating outputs as they come in.
Instead of comparing outputs to a fixed answer, you can run reference-free evaluations. These let you assess specific qualities of the output, like structure, tone, or meaning.
One popular evaluation method is using LLM-as-a-judge, where you use a language model to grade outputs based on a set rubric. For instance, an LLM judge might evaluate whether a chatbot response fully answers the question or whether the output maintains a consistent tone.
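Here is a minimal sketch of what an LLM judge could look like. The `call_llm` function is a placeholder for your model client, and the rubric wording is only an example, not a fixed recipe:

```python
# Minimal LLM-as-a-judge sketch. `call_llm` stands in for your model
# client; the rubric wording is illustrative, not a fixed recipe.
def call_llm(prompt: str) -> str:
    return "PASS"  # placeholder response

JUDGE_PROMPT = """You are evaluating a support chatbot.
Question: {question}
Answer: {answer}

Does the answer fully address the question and maintain a polite,
professional tone? Reply with PASS or FAIL and a one-sentence reason."""

def judge(question: str, answer: str) -> str:
    # Fill the rubric with the specific interaction and ask the judge model.
    return call_llm(JUDGE_PROMPT.format(question=question, answer=answer))

print(judge("What is your return policy?", "You can return items within 30 days."))
```

In practice, you would iterate on the judge prompt and compare its labels against human reviews before trusting it at scale.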
But it’s not the only option: reference-free checks also include regular expressions, text statistics, and smaller ML models trained for tasks like sentiment or toxicity detection.
These reference-free evaluations can work both during iterative development (like when you refine outputs for tone or format) and for monitoring production performance.
While you don’t need to design and label a ground truth dataset in this case, you still need to put in some upfront work. This time, your focus is on defining the assessment criteria and building the evaluators that apply them.
It takes some thought to narrow down and express assessment criteria. Once you set those, you may need to work on evaluators like LLM judges to align them with your expectations.
To sum up, all evaluations follow the same structure.
You can combine both manual and automated methods.
Example: Webflow uses this hybrid approach effectively. They rely on automated scores for day-to-day LLM validation and conduct weekly manual reviews.
While all evaluations rely on the same elements (data, criteria, scoring method), you run them for different reasons. Here’s a look at common scenarios across the LLM product lifecycle.
Choosing the best model, prompt, or configuration for your AI product.
When starting out, your first step is often making comparisons.
You might begin by selecting a model. You check leaderboards, pick a few candidate LLMs, and test them on your task. For example, do OpenAI models outperform Anthropic’s? Or, if you switch to a cheaper or open-source model, how much quality do you lose?
Another comparative task is finding the best prompt.
Let’s say you’re building a summarization tool. Would “Explain this in simple terms” perform better than “Write a TLDR”? What if you break the task into steps or use a chain of prompts? Or maybe add examples of the desired style? This process, called prompt engineering, takes some trial and error. Small tweaks often make a big difference, so testing each version systematically on your dataset is key.
Depending on the use case, you can also try things like chunking strategies for RAG, temperature settings, or different retrieval methods.
Each change is a new experiment, and you need evaluations to compare their results. This means having curated test datasets and automated ways to measure performance. You can use both ground-truth methods (like comparing to ideal summaries) and reference-free methods (like checking whether all generated summaries follow a set structure, maintain the right tone, and don't contradict the source).
You can also try using LLM judges for pairwise comparisons — showing two outputs and asking the model to pick the better one. To make this work, you'd need to invest in calibrating your eval prompts and watch for biases, like a tendency to favor outputs that appear first or last.
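Here is a rough sketch of such a pairwise judge, with the order of the candidates randomized to reduce position bias. Again, `call_llm` is just a placeholder for your model client:

```python
# Sketch: pairwise comparison with randomized order to reduce position bias.
# `call_llm` is a placeholder for your model client.
import random

def call_llm(prompt: str) -> str:
    return "A"  # placeholder verdict

def pick_better(question: str, output_1: str, output_2: str) -> str:
    candidates = [("A", output_1), ("B", output_2)]
    random.shuffle(candidates)  # randomize which answer appears first
    prompt = (
        f"Question: {question}\n"
        + "\n".join(f"Answer {label}: {text}" for label, text in candidates)
        + "\nWhich answer is better? Reply with A or B only."
    )
    verdict = call_llm(prompt).strip().upper()
    winner_text = dict(candidates).get(verdict)
    if winner_text is None:
        return "invalid verdict"
    return "output_1" if winner_text == output_1 else "output_2"

print(pick_better("Summarize our refund terms.", "Refunds within 30 days.", "TLDR: 30-day refunds."))
```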
To experiment effectively, it's useful to establish a baseline. Try simple approaches before more sophisticated setups — this gives you a clear benchmark to measure progress.
Setting aside some test cases while you iterate is also a smart move. In machine learning, this is called a held-out dataset. Without it, you risk overfitting, where the app performs well on tested examples but struggles with new, unseen data. To avoid this, keep a portion of examples separate and only test them once you’re happy with the initial results.
While the details of experiments might change, the goal stays the same: figure out what works best and deliver a great product. A solid evaluation system helps move faster and make data-driven decisions. For example, instead of just saying one prompt seems better, you can quantify how well it performs on your test dataset.
Checking if your product is ready for real-world use by evaluating it across diverse scenarios, including edge cases.
As you run experiments, you’ll naturally want to expand your test coverage. Your initial example set may be small. But once you’ve picked a model, solved basic issues like output formatting, and settled on the prompt strategy, it’s time to test more thoroughly. Your system might work well on a dozen inputs — but what about a few hundred?
This means adding more examples — both to cover more common use cases and to see how the system handles tougher scenarios.
For example, in a support Q&A system, you might start with simple queries like “What’s your return policy?” Then, you’d expand to other topics like billing and add more complex questions, like “Why was I charged twice?” or “Can I exchange an item bought last year?”
You can also test robustness and sample multiple responses to the same questions to see how variable they are.
The next step is looking at edge cases — realistic but tricky scenarios that need special handling. Like, what happens if the input is a single word? Or if it’s way too long? What if it’s in another language or full of typos? How does the system handle sensitive topics it shouldn’t address, like questions about competitors?
Designing these takes some thought. You must understand how users interact with your product to create realistic test cases. Synthetic data can be super helpful here — it lets you quickly create variations of common questions or come up with more unusual examples.
Ultimately, you want to get to a point where you have a set of evaluation datasets for each topic or scenario, paired with methods to test them — such as expected answers to match or measurable criteria for automatic assessment.
For example, if you expect your system to refuse competitor questions, you could prepare a set of test questions that mention competitors and define the expected behavior: a polite refusal.
To verify that the LLM correctly refuses to answer, you could look for the presence of specific words, check for semantic similarity to known denials, or use an LLM judge.
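For instance, a simple refusal check might combine a keyword lookup with fuzzy similarity to known denial phrases. This is just an illustrative sketch; an LLM judge could replace or supplement either step:

```python
# Sketch: checking whether a response is a refusal. Combines a keyword
# check with fuzzy similarity to known denial phrases; an LLM judge
# could replace either step.
from difflib import SequenceMatcher

REFUSAL_KEYWORDS = ["can't help", "cannot help", "not able to", "i'm sorry"]
KNOWN_DENIALS = [
    "I'm sorry, I can't help with that.",
    "I can only answer questions about our own products.",
]

def looks_like_refusal(response: str, threshold: float = 0.6) -> bool:
    text = response.lower()
    # Fast path: look for typical refusal wording.
    if any(keyword in text for keyword in REFUSAL_KEYWORDS):
        return True
    # Fallback: fuzzy similarity to known denial phrases.
    return any(
        SequenceMatcher(None, text, denial.lower()).ratio() >= threshold
        for denial in KNOWN_DENIALS
    )

print(looks_like_refusal("I'm sorry, I can't help with questions about other companies."))
```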
Technically, stress-testing isn’t much different from experimental evaluations. The difference is the focus: instead of exploring options (like which prompt works best), you’re checking if the product is ready for real-world use. The question shifts to: can our product, with its current prompt and design, handle everything users throw at it?
Ideally, you’d take your existing LLM setup — no changes to prompts or architecture — and run it through these extra scenarios to confirm it responds correctly to all the challenges. Check!
In reality, though, you’ll likely spot issues. Once you address them, you re-run the evaluations to ensure they’re resolved. Fixes might involve refining prompts, tweaking the system’s non-LLM logic, or adding safeguards — like blocking specific types of requests.
That said, there’s one more thing to test: adversarial inputs.
Testing how your system responds to adversarial behavior or malicious use.
Red-teaming is a testing technique where you simulate attacks or feed adversarial inputs to uncover vulnerabilities in the system. This is a crucial step in evaluating AI system safety for high-risk applications.
While stress-testing focuses on challenging but plausible scenarios — like complex queries a regular user might ask — red-teaming targets misuse. It looks for ways bad actors might exploit the system, pushing it into unsafe or unintended behavior, like giving harmful advice.
The line between edge cases and adversarial inputs can sometimes be thin. For example, a healthcare chatbot must safely handle medical questions as part of its core functionality. Testing this falls within its normal scope. But for a general support Q&A system, medical, financial, or legal questions are outside its intended use and treated as adversarial.
Red-teaming can also test for risks like generating explicit or violent content, promoting hate speech, enabling illegal activities, violating privacy, or showing bias.
This can involve both hands-on testing and more scalable methods. For example, you can manually try to trick the AI system to agree to harmful requests or leak sensitive information. To scale the process, you can run automated red-teaming, using techniques like synthetic data and targeted prompts to simulate a wide range of risks.
For example, to test for bias, you might generate synthetic prompts that touch on protected attributes and automatically score the responses for discriminatory or stereotyped content.
Like other evaluations, red-teaming can be tailored to the specifics of your app.
Testing for generic harmful behavior — like asking blunt or provocative questions — is important, but context-specific tests can be even more valuable. For example, you could check if the app gives different advice when you change the age or gender of the person asking the same question.
Understanding live performance of your system to detect and resolve issues.
Offline quality evaluations can only take you so far. At some point, you’ll put your product in front of real users to see how it performs in the wild. If your use case doesn’t involve significant risks, you might launch your beta early to start gathering real-world feedback.
This brings us to the next evaluation scenario: monitoring.
Once your product is live, you’ll want to track its performance. Are users having a good experience? Are the responses accurate, safe, and helpful?
You can start with tracking user behavior, such as capturing clicks or engagement signals, or gather explicit user feedback, like asking users to upvote or downvote a response. However, these product metrics only give you top-level insights (do users seem to like it?); they don’t reveal the actual content of interactions or where things go right or wrong.
To get deeper insights, you need to track what users ask and how your system responds. This starts with collecting traces: detailed records of all the interactions.
Having these traces will let you evaluate quality in production by running online evaluations. You can automatically process each new output (or a portion) to see how they score against your criteria, using reference-free methods like LLM judges, ML models, or regular expressions.
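As an illustration, here is a tiny reference-free check that scans collected traces with regular expressions. The patterns and trace format are made up for the example:

```python
# Sketch: scoring production traces with simple reference-free checks.
# The regex patterns and trace format are illustrative.
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
COMPETITOR_PATTERN = re.compile(r"\b(acme|globex)\b", re.IGNORECASE)  # hypothetical names

traces = [
    {"input": "What's your return policy?", "output": "You can return items within 30 days."},
    {"input": "Who should I email?", "output": "Reach us at support@example.com."},
]

for trace in traces:
    # Attach check results to each trace so you can filter and review them later.
    trace["contains_email"] = bool(EMAIL_PATTERN.search(trace["output"]))
    trace["mentions_competitor"] = bool(COMPETITOR_PATTERN.search(trace["output"]))

flagged = [t for t in traces if t["contains_email"] or t["mentions_competitor"]]
print(f"Flagged {len(flagged)} of {len(traces)} traces")
```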
Online observability also helps you learn more about your users. Which requests are most popular? Which ones are unexpected? It’s like product analytics but focused on analyzing text interactions. For example, you may cluster user requests to spot common topics and decide which improvement to prioritize.
You can also test changes through A/B testing. For example, you might deploy a new prompt to 10% of users and compare performance metrics to see if it improves quality.
If something looks off — like a spike in negative feedback or a drop in sentiment scores — you can dig into logs to troubleshoot. This means reviewing specific interactions to identify what went wrong. Your LLM observability setup should make it easy to analyze individual responses for debugging.
Manual reviews are still very valuable. While you can’t check every response in production, reviewing a smaller sample regularly — either random examples or those flagged by automated checks — can be incredibly useful. It helps you build intuition about what’s working, curate new test cases, and refine your evaluation criteria as your product evolves.
Testing if new changes improve the system without breaking what used to work.
Even when your product is live, you need offline evaluations to run regression tests. They let you verify that the changes you make don’t introduce new (or old) issues.
Quality iterations rarely stop. As you learn more about how users interact with your app or uncover specific failures, you’ll naturally want to make updates — like tweaking a prompt. But every change comes with a risk: what if fixing one thing messes up something else?
For example, if you slightly adjust a prompt, how many previous outputs will change? And are those changes actually better — or worse? To stay ahead of this, you need a way to test updates in bulk by re-running your test cases to confirm that the new behavior is an improvement and that previously correct responses haven’t degraded.
If tests pass, you can safely publish your updates to production.
Your test datasets must also evolve to reflect actual user behavior. The good news is you can pull new examples directly from the logs and turn them into test cases.
For larger updates, you can treat all recent production data as a test set. For example, take all inputs and outputs from last week, push them through a new LLM app version, regenerate responses, and check which ones shifted and how. This helps you spot any unintended side effects.
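Here is a rough sketch of such a regression check. The `run_new_version` function and the similarity threshold are placeholders for your own app and criteria:

```python
# Sketch: regression check over recent production data. `run_new_version`
# is a placeholder for calling the updated app; the threshold is arbitrary.
from difflib import SequenceMatcher

def run_new_version(user_input: str) -> str:
    return "You can return items within 30 days."  # placeholder

production_log = [
    {"input": "What's your return policy?", "output": "Items can be returned within 30 days."},
    {"input": "Do you ship abroad?", "output": "Yes, we ship to most countries."},
]

changed = []
for record in production_log:
    new_output = run_new_version(record["input"])
    similarity = SequenceMatcher(None, record["output"], new_output).ratio()
    if similarity < 0.8:  # the output shifted noticeably, flag it for review
        changed.append({"input": record["input"], "old": record["output"], "new": new_output})

print(f"{len(changed)} of {len(production_log)} responses changed noticeably")
```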
Systematic regression testing helps you safely build on top of the existing system. You can make changes while making sure you’re not creating new problems along the way.
Real-time checks that detect quality issues in LLM inputs or outputs.
Sometimes, you want to catch issues immediately. Unlike evaluating past or test interactions, this means detecting problematic inputs or outputs on the fly. These validations are called guardrails, acting as a safety net between the system’s response and the user.
Inside, these checks are the same reference-free evals but built directly into your app and applied in real time. For example, you can look for toxic or inappropriate language, leaked private data, off-topic or competitor mentions, or broken output formats.
You can define an appropriate action for when an issue is detected. The system might simply block the response and show a fallback message like, “I’m sorry, I can’t help with that.” Alternatively, it could apply a mitigation, such as removing private data or inappropriate language or retrying the response with a different prompt.
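As a sketch, a simple output guardrail could look like this. The `generate_response` function and the blocked-terms pattern are placeholders; real checks might rely on classifiers or LLM judges instead:

```python
# Sketch: a simple output guardrail. `generate_response` and the blocked
# terms are placeholders; real checks might use classifiers or LLM judges.
import re

BLOCKED_PATTERN = re.compile(r"\b(password|credit card number)\b", re.IGNORECASE)
FALLBACK = "I'm sorry, I can't help with that."

def generate_response(user_input: str) -> str:
    return "Sure, your credit card number is on file."  # placeholder

def guarded_reply(user_input: str) -> str:
    response = generate_response(user_input)
    if BLOCKED_PATTERN.search(response):
        return FALLBACK  # block the response and show a safe fallback instead
    return response

print(guarded_reply("Can you remind me of my payment details?"))
```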
While guardrails are valuable, they come with trade-offs. Additional processing can introduce delays. Some checks, like regular expressions to spot specific words, are fast. But more complex checks, like those calling other LLMs, can take longer and slow down your app responses. They may also be impractical for streaming outputs, where responses are shown to users in real-time as they’re generated.
Because of these limitations, guardrails are often reserved for the most critical risks, such as blocking toxic content or identifying PII. When used, they act as real-time, automated evaluations to keep your system safe and compliant.
Good news: AI isn’t taking over everything just yet. Even with LLM-based products, you still need human input to manage quality. If not to actually review all the outputs, then to design and maintain automated evaluation systems.
Bad news: Evaluations aren’t simple. Each app needs a custom approach based on its use case, risks, and potential failure modes. For instance, a consumer-facing chatbot has much higher stakes for safety and accuracy than an internal tool where users can intervene.
You need evals at every stage, from your first product prompt to production. And these workflows aren’t isolated — each step builds on the previous one:
Automated and manual evaluations work hand-in-hand. While human labels can give the clearest signal, automated evals help replicate and scale these insights.
And all these evaluations aren’t just about crunching metrics. They help you:
A solid evaluation process has another bonus: it naturally leads to collecting high-quality labeled data. You can later use it to refine your system — replacing LLM judges with smaller models, optimizing production prompts, or even fine-tuning your main model.
LLM evals can be complex, especially for advanced systems like RAGs, AI agents, and mission-critical apps. That’s why we built Evidently, an open-source framework with over 25 million downloads. It simplifies evaluation workflows, offering 100+ built-in checks and easy configuration of custom LLM judges to match your needs.
For teams, Evidently Cloud provides a collaborative, no-code platform to test and evaluate AI quality. You can generate synthetic data to run scenario tests and AI agent simulations, manage datasets, trace interactions, and run evaluations right from the interface.
Ready to optimize your LLM evaluations? Sign up for free, or schedule a demo to see Evidently Cloud in action. We’d love to help you build with confidence!