
How to evaluate an LLM-powered product? A beginner’s guide.

Last updated:
February 12, 2025

This guide is for anyone working on LLM-powered systems — from engineers to product managers — looking for an “introduction to evals”.

We'll cover the basics of evaluating LLM-powered applications without getting too technical. As long as you understand how to use LLMs in your product, you’re good to go!

We’ll cover:

  • The difference between evaluating LLMs and LLM-powered products.
  • Evaluation approaches, from human labeling to automated evals.
  • When you need evaluations, from experiments to ongoing monitoring.

This guide will focus on the core evaluation principles and workflows.

Want a deep-dive on methods? Explore the LLM evaluation metrics guide. 
Prefer code examples?
Check the Evidently library and docs Quickstart. 
Or videos?
Here is a YouTube playlist with a bite-sized introduction to evals.

TL;DR

  • LLM evaluations ("evals") assess a model’s performance to ensure outputs are accurate, safe, and aligned with user needs.
  • LLM model evaluations focus on raw abilities like coding, translating, and solving math problems, often measured with standardized benchmarks.
  • LLM product evaluations assess how well an entire LLM-powered system performs the task it was built for, using both manual and automated methods.
  • Manual evaluations involve domain experts or human reviewers annotating and checking output accuracy and quality.
  • Automated evals can be reference-based: they compare outputs to a known ground truth and are used during experimentation, regression testing, and stress-testing.
  • Reference-free automated evals assess outputs directly and are commonly used in production monitoring, guardrails, and complex conversational scenarios.
  • Evaluation methods and metrics vary, with LLM judges being one of the most popular approaches in practice.

Before we talk about evaluations, let's establish what we are evaluating.

What is an LLM product?


An LLM-powered product uses Large Language Models (LLMs) as part of its functionality. 

These products can include user-facing features like support chatbots and internal tools like marketing copy generators.

You can add LLM features to existing software, such as allowing users to query data using natural language. Or, you can create entirely new LLM applications like conversational assistants with LLM at the very core. 

Here are some examples: 

All these apps run on LLMs: readily available text models trained on vast amounts of data. LLMs can tackle a wide range of tasks by following an instruction prompt: writing text, extracting information, generating code, translating, or holding entire conversations. 

LLM instruction prompt

Some tasks, like generating product descriptions, might only need a single prompt. However, most LLM-powered products grow a bit — or a lot — more complex. For example, you may chain multiple prompts together, like splitting content generation and style alignment in a copywriting assistant.

One popular type of LLM-powered app is a RAG system, which stands for Retrieval Augmented Generation. Despite the intimidating name, it’s a simple concept: you combine an LLM with search. When a user asks a question, the system looks for relevant documents and sends both the question and the discovered context to the LLM to generate an answer. For example, a RAG-based support chatbot could pull information from a help center database and include links to relevant articles in its responses.
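To make the flow concrete, here is a minimal sketch of that request path in Python. The `search_help_center` and `call_llm` functions are hypothetical stand-ins for your own retrieval and model-calling code, not a specific library API:

```python
def answer_with_rag(question: str, top_k: int = 3) -> str:
    # 1. Retrieve: find the most relevant help-center articles (hypothetical function).
    documents = search_help_center(question, top_k=top_k)

    # 2. Augment: put the retrieved context into the prompt.
    context = "\n\n".join(doc["text"] for doc in documents)
    prompt = (
        "Answer the customer's question using only the context below. "
        "Include links to the source articles.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate: call the LLM with the augmented prompt (hypothetical function).
    return call_llm(prompt)
```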

Retrieval-Augmented Generation (RAG)
A RAG system combines search with generation.

You can also create LLM-powered agents to automate complex workflows that need sequential reasoning: from correcting code to planning and booking trips. Agents go beyond just writing texts — they can use tools you give them. For example, query databases or send calendar invitations. Agentic systems can grow very complex, with multi-step planning, dozens of prompts, and “memory” to keep track of the progress.

Want more examples? Check out this collection of real-world LLM applications. 

At some point when creating an LLM-powered system you'll ask: How well is it working? Does it handle all the scenarios I need? Can I improve it? 

To answer these questions, you need evaluations.

What are LLM evaluations?


LLM evaluations, or "evals" for short, help assess the performance of a large language model to ensure outputs are accurate, safe, and aligned with user needs.

The term applies in two key contexts:

  • Evaluation of the LLM itself.
  • Evaluation of systems built with LLMs.

This is an important distinction to make: while some assessment methods may overlap, these evaluations are quite different in spirit. 

LLM model evals

When evaluating an LLM directly, the focus is on its "raw" abilities — like coding, translating text, or solving math problems. Researchers use standardized LLM benchmarks for this purpose. For example, they may evaluate:

  • How well the model "knows" historical facts.
  • How well it can make logical inferences.
  • How it responds to unsafe or adversarial inputs.

There are hundreds of LLM benchmarks, each with its own set of test cases. Most include questions with known correct answers, and the evaluation process checks how closely the model’s responses match these. Some benchmarks use more complex methods, like crowd-sourcing response rankings.

MMLU benchmark example questions
Example questions from the MMLU benchmark. Credit: MMLU paper
100+ LLM evaluation benchmarks. Check out this collection for more examples.

Benchmarks let you directly compare models. Public leaderboards (like this one) help track how each LLM performs across benchmarks and answer questions like “Which open-source LLM is good at coding?”

However, while these benchmarks are great for choosing models and tracking industry progress, they’re not very useful for evaluating real-world products. They test broad capabilities, not the specific inputs your system might handle. They also focus on the LLM alone, but your product will involve other components.

LLM product evals

LLM product evaluation assesses the full system's performance for its specific task.

This includes not just the LLM, but everything else: the prompts, the logic connecting them, the knowledge databases used to augment the answers, etc. You also run these tests on data that fits the use case, like real customer support queries.

LLM model evals vs. LLM product evals

Such application-level evals usually address two big aspects:

  • Capabilities. Can the LLM product do what it’s supposed to do well?
  • Risks. Can its outputs potentially cause harm?

However, the specific quality criteria will vary. What defines “good” and what can go “wrong” depends on the use case. For example, if you’re working on a Question-Answering System, you might want to evaluate:

  • Correctness. Does the LLM provide fact-based answers without making things up?
  • Helpfulness. Do the answers fully address what the user is asking? 
  • Text style. Is the tone clear, professional and close to the existing brand style?
  • Format. The responses may need to fit the length limit or always link to the source.

On the safety side, you may want to test that your Q&A system does not produce biased or toxic outputs, or reveal sensitive data even if provoked.

Example criteria for evaluating an LLM-based Q&A system

When designing evaluations, you need to narrow down criteria based on your app’s purpose, risks, and the types of errors you observe. Your evals should actually help make decisions. Does the new prompt work better? Is the app ready to go live?

Take criteria like “fluency” or “coherence”. Most modern LLMs are already great at producing natural, logical text. A perfect fluency score may look good on paper but won’t offer any useful information. But if, say, you’re working with a smaller, less capable local LLM, testing for fluency can make sense.

Even positive criteria are not always universal. For most cases, factual accuracy is a big deal. But if you build a tool to generate ideas for marketing campaigns, making stuff up might be the whole point. In that case, you’d care more about diversity and creativity than hallucinations.

This is the core distinction between LLM product evals and benchmarks. Benchmarks are like school exams — they measure general skills. LLM product evals are more like job performance reviews. They check if the system excels in the specific task it was "hired" for, and that depends on the job and tools you’re working with.

| | LLM Product Evaluation | LLM Model Evaluation |
|---|---|---|
| Focus | Ensuring accurate, safe outputs for a specific task. | Comparing capabilities of different LLMs. |
| Scope | Testing the LLM-powered system (with prompt chains, guardrails, integrations, etc.). | Testing the LLM itself through direct prompts. |
| Evaluation data | Custom, scenario-based datasets. | General benchmark datasets. |
| Evaluation scenario | Iterative testing from development to production. | Periodic checks with new model releases. |
| Example task | Assessing a support chatbot’s accuracy. | Choosing the best LLM for math problems. |

Key takeaway: Each LLM-powered product requires a custom evaluation framework. The criteria should be both useful — focusing on what truly matters — and discriminative, meaning they can effectively highlight differences in performance across iterations.

Why LLM evals are hard 

These LLM evals are far from straightforward. It’s not just that quality criteria are custom: the approach itself differs from both traditional software testing and predictive machine learning.

Non-deterministic behavior. LLMs produce probabilistic outputs, meaning they may generate different responses for identical inputs.


While this allows for creative and varied answers, it complicates testing: you must check whether a range of outputs aligns with your expectations.

No single correct answer. Traditional machine learning systems, like classifiers or recommenders, deal with predefined outputs. For instance, an email is either spam or not, and each input has one ground truth answer. To test the model, you can create a dataset with known labels and check how well the model predicts them.

But LLMs often handle open-ended tasks like writing emails or holding conversations where multiple valid answers exist. For instance, there are countless ways to write a good email. This means you can’t rely on exact matches to a reference answer. Instead, you need to assess fuzzy similarities or subjective qualities like style, tone, and safety.

Wide range of possible inputs. LLM products often handle diverse use cases. For example, a support chatbot might answer queries about products, returns, or help with account troubleshooting. You need scenario-based tests that cover the full range of expected inputs. Creating a good evaluation dataset is a separate problem to solve!

What’s more, what works in testing doesn’t always hold up in the wild. Real-world users may throw unexpected inputs at your system — pushing it beyond what you planned for. To detect this, you need ways to observe and evaluate the online quality.


Unique risks. Working with probabilistic systems trained to follow natural language instructions brings new types of vulnerabilities, including:

  • Hallucinations. The system may generate false or misleading facts, like inventing a nonexistent product or giving incorrect advice.
  • Jailbreaks. Malicious users may try bypassing safety measures to provoke harmful or inappropriate responses.
  • Data leaks. An LLM might inadvertently reveal sensitive or private information from its training data or connected systems.

You need the right evaluation workflows to address all of these: stress-testing the system, uncovering its weaknesses, and monitoring performance in the wild. How do you do this?

Let’s take a look at the possible approaches! 

LLM evaluation methods

Evaluations generally occur in two key phases: before deployment and after launch.

During the development phase, you need to check if the app is good enough as you iterate on building it. Once it’s live, you’re monitoring that things work well. No matter the phase, every evaluation starts with data. You first need something to evaluate.

  • Test data helps you run experiments. These are example inputs that mimic the scenarios your LLM can encounter. You can write test cases manually, generate them synthetically, or source them from beta users. Once you have these inputs, you can test how your LLM app responds to them and assess the outputs against your success criteria.
  • Production data. Once the app is live, it’s all about seeing how it performs with real users. You will need to capture both the system inputs and responses and run continuous quality evaluation on the live data to catch any issues. 

Both in testing and production, you can choose between manual and automatic evaluations.

Manual evaluations 

Initially, you can conduct simple “vibe checks” by asking: “Do these responses feel right?”

After creating the first prompt version or RAG setup, you can run a few sample inputs through the LLM app and eyeball the responses. If they are way off, you tweak the prompt or adjust your approach. 

Even at this informal stage, you need test cases. For a support bot, you could prepare a few sample questions with known answers. Each time you change something, you’ll assess how well the LLM handles them.  

While not systematic, vibe checks help you see if things are working, spot issues, and come up with new prompt ideas. However, this approach isn’t reliable or repeatable. As you move forward, you’ll need more structure — with consistent grading and detailed records of results.

A more rigorous way to leverage human expertise is a labeling or annotation process: you can create a formal workflow where reviewers evaluate responses using set instructions.  

Manual labeling for LLM evaluations
Manual evaluations of the output quality.

They can give binary labels, such as “pass” or “fail,” or evaluate specific qualities, like whether the pulled context is “relevant,” or if the answer is “safe.” You can also ask reviewers to provide a brief explanation of their decisions.

To make the labeling process efficient and consistent, you must provide clear instructions, like asking to look for specific error types. You can also have multiple reviewers evaluate the same inputs to surface conflicting opinions. 
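As a sketch, each annotation can be stored as a simple record, which also makes it easy to surface where reviewers disagree. The field names below are purely illustrative:

```python
from collections import defaultdict

# Hypothetical annotation records: one row per (response, reviewer).
annotations = [
    {"response_id": "r1", "reviewer": "alice", "label": "pass", "comment": "Accurate, cites the right article."},
    {"response_id": "r1", "reviewer": "bob",   "label": "fail", "comment": "Tone is off-brand."},
    {"response_id": "r2", "reviewer": "alice", "label": "pass", "comment": ""},
]

# Group labels by response and flag disagreements for a follow-up discussion.
labels_by_response = defaultdict(set)
for a in annotations:
    labels_by_response[a["response_id"]].add(a["label"])

conflicts = [rid for rid, labels in labels_by_response.items() if len(labels) > 1]
print("Responses needing adjudication:", conflicts)  # ['r1']
```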

These manual evaluations are the most reliable way to determine if your LLM app does its job well. As the product builder, you are best equipped to define what “success” means for your use case. In highly nuanced and specialized fields like healthcare, you may need to bring in subject matter experts to help judge this.

Example: Asana shares how they test AI-powered features with a mix of automated unit tests and manual grading done by the product manager. This hands-on process helped uncover many issues with quality and formatting.

While incredibly valuable, manual labels are expensive to obtain. You can’t review thousands of outputs every time you edit a prompt. You need automation to scale.

Automated evaluations 

With some upfront effort, you can set up automated evaluations. They fall into two types:

  • With ground truth: compare the LLM’s outputs to target reference answers. 
  • Without ground truth: directly assign quantitative scores or labels to the responses.

With ground truth

These evaluations rely on predefined correct answers — commonly called “reference,” “ground truth,” “golden,” or “target” responses.

For example, in a customer support system, the target response for “What is your return policy?” might be “You can return items within 30 days.” You can compare the chatbot’s outputs against such known answers to assess the overall correctness of responses.

Reference-based LLM evals
Reference-based evals: compare new results to the expected ones.

These evaluations are inherently offline. You run tests while iterating on your app or before deploying changes to production.

To use this approach, you first need an evaluation dataset: a collection of sample inputs paired with their approved outputs. You can generate such a dataset or curate it from historical logs, like using past responses from human support agents. The closer these cases reflect real-world scenarios, the more reliable your evaluations will be.

Example: GitLab shares how they build Duo, their suite of AI-powered features. They created an evaluation framework with thousands of ground truth answers, which they test daily. They also have smaller proxy datasets for quick iterations. 

Once your dataset is ready, here’s how automated evaluations work:

  • Feed the test inputs. 
  • Generate responses from your system.
  • Compare the new responses to the reference answers.
  • Calculate an overall quality score. 
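In code, this loop can be as simple as the sketch below. Here, `generate_response` stands in for your app under test and `matches_reference` for whichever matching method you choose (both are hypothetical names, not a specific API):

```python
test_cases = [
    {"input": "What is your return policy?",
     "reference": "You can return items within 30 days."},
    # ... more input/reference pairs
]

results = []
for case in test_cases:
    # 1-2. Feed the test input and generate a response from your system (hypothetical function).
    response = generate_response(case["input"])
    # 3. Compare the new response to the reference answer (hypothetical matcher).
    correct = matches_reference(response, case["reference"])
    results.append(correct)

# 4. Calculate an overall quality score.
accuracy = sum(results) / len(results)
print(f"Correct on {accuracy:.0%} of test cases")
```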

The tricky part is comparing responses to the ground truth. How do you decide if the new response is correct?

An exact match is an obvious idea: see if the new response is identical to the target one.

Exact match LLM evals
Exact match: check if the new response is identical to what's expected.

But exact matches are often too rigid — in open-ended scenarios different wording can convey the same meaning. To address this, you can use alternative methods, such as quantifying word overlap between the two responses, comparing semantic meaning using embeddings, or even asking LLMs to do the matching. 

Semantic similarity for LLM evals
Semantic match: check if the new response conveys the same meaning.

Here’s a quick breakdown of common matching methods:

| Method | Description | Example |
|---|---|---|
| Exact Match | Check if the response exactly matches the expected output (True/False). | Confirm a certain text is correctly classified as “spam”. |
| Word or Item Match | Check if the response includes specific words or items, regardless of full phrasing (True/False). | Verify that “Paris” appears in answers about France’s capital. |
| JSON match | Match key-value pairs in structured JSON outputs, ignoring order (True/False). | Verify that all ingredients extracted from a recipe match a known list. |
| Semantic Similarity | Measure similarity using embeddings to compare meanings (e.g., cosine similarity). | Match “reject” and “decline” as similar responses. |
| N-gram overlap | Measure overlap between generated and reference text (e.g., BLEU, ROUGE, METEOR scores). | Compare word sequence overlap between two sets of translations or summaries. |
| LLM-as-a-judge | Prompt an LLM to evaluate correctness (returns a label or score). | Check that the response maintains a certain style and level of detail. |
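As a rough sketch, a few of these matchers can be written in plain Python. The `embed` function for semantic similarity is a hypothetical stand-in for whatever embedding model you use:

```python
import math

def exact_match(response: str, reference: str) -> bool:
    return response.strip().lower() == reference.strip().lower()

def item_match(response: str, required_items: list[str]) -> bool:
    # True if every required word/item appears somewhere in the response.
    return all(item.lower() in response.lower() for item in required_items)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_match(response: str, reference: str, threshold: float = 0.85) -> bool:
    # embed() is a hypothetical function returning an embedding vector.
    return cosine_similarity(embed(response), embed(reference)) >= threshold

item_match("The capital of France is Paris.", ["Paris"])  # True
```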

After matching the correctness of individual responses, you can analyze the overall performance of your system on the test dataset.

For binary True/False matching, accuracy (percentage of correct responses) is an intuitive metric. For numerical scores like semantic similarity, averages are common but might not tell the full story. Instead, you may look at the share of responses below a set similarity threshold, or test for the lowest score across all examples. 
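For instance, given a list of per-response similarity scores, you might report all three views, as in this minimal sketch:

```python
scores = [0.93, 0.88, 0.97, 0.41, 0.90]  # e.g., semantic similarity per response
threshold = 0.8

average = sum(scores) / len(scores)
share_below = sum(s < threshold for s in scores) / len(scores)
worst = min(scores)

print(f"avg={average:.2f}, below threshold={share_below:.0%}, worst={worst:.2f}")
# A single low outlier still leaves a decent-looking average (0.82),
# but it shows up clearly in the threshold and minimum-score checks.
```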

LLM evals testing
You can run tests to see if you have any responses with low similarity.

If you’re using LLMs for predictive tasks, which is often a component of larger LLM solutions, you can use classic ML quality metrics.

| Task | Example metrics | Example use case |
|---|---|---|
| Classification | Accuracy, precision, recall, F1-score. | Spam detection: recall helps quantify whether all spam cases are caught. |
| Ranking | NDCG, Precision at K, Hit Rate, etc. | Retrieval in RAG: Hit Rate checks if at least one relevant result is retrieved for the query. |
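For example, Hit Rate for RAG retrieval can be computed directly from your logs. A sketch, with an illustrative data format:

```python
# Each entry: the documents retrieved for a query and the ids known to contain the answer.
retrieval_log = [
    {"retrieved": ["doc_4", "doc_7", "doc_2"], "relevant": {"doc_7"}},
    {"retrieved": ["doc_1", "doc_3", "doc_9"], "relevant": {"doc_5"}},
]

hits = sum(
    any(doc in entry["relevant"] for doc in entry["retrieved"])
    for entry in retrieval_log
)
hit_rate = hits / len(retrieval_log)
print(f"Hit Rate: {hit_rate:.0%}")  # 50%: the second query retrieved no relevant document
```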

The principle of matching new responses against reference responses is universal, but details vary by use case. Here are some examples:

  • Summarization. Evaluate new summaries by comparing them with reference human-written examples. Use methods like embedding-based semantic similarity, LLM-judged similarity, or word overlap.
  • Structured information extraction. For tasks like extracting information from notes, compare JSON outputs generated by your system with the reference JSONs. This can be done programmatically.
  • Retrieval. Compare the retrieved documents for a given query against a set of known documents that contain the correct answer using ranking metrics.

Once your dataset and evaluators are set up, you can run the evals whenever you need. For example, re-run tests after tweaking a prompt to see if things are getting better or worse. 

Want to understand this better? Check the guide on LLM evaluation methods.

Without ground truth

Open-ended LLM evals
Reference-free evaluations: directly score the responses by chosen criteria.

However, obtaining ground truth answers isn’t always practical. For complex, open-ended tasks or multi-turn chats, it’s hard to define a single “right” response. And in production, there are no perfect references: you’re evaluating outputs as they come in. 

Instead of comparing outputs to a fixed answer, you can run reference-free evaluations. These let you assess specific qualities of the output, like structure, tone, or meaning. 

One popular evaluation method is using LLM-as-a-judge, where you use a language model to grade outputs based on a set rubric. For instance, an LLM judge might evaluate whether a chatbot response fully answers the question or whether the output maintains a consistent tone.
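Here is a minimal LLM-judge sketch using the OpenAI Python client; the rubric, model name, and output format are assumptions you would tune for your own criteria:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a support chatbot.
Question: {question}
Answer: {answer}

Does the answer fully address the question? Reply with exactly one word:
PASS or FAIL, followed by a one-sentence reason."""

def judge_completeness(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable model can act as the judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge_completeness("What is your return policy?", "We accept returns."))
```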

But it’s not the only option. Here’s a quick overview:

| Method | Description | Example |
|---|---|---|
| LLM-as-a-Judge | Use an LLM with an evaluation prompt to assess custom properties. | Check if the response fully answers the question and does not contradict the retrieved context. |
| ML models | Use specialized ML models to score input/output texts. | Verify that text is non-toxic and has a neutral or positive sentiment. |
| Semantic similarity | Measure text similarity using embeddings. | Track how similar the response is to the question as a proxy for relevance. |
| Regular expressions | Check for specific words, phrases, or patterns. | Monitor for mentions of competitor names or banned terms. |
| Format match | Validate structured formats like JSON, SQL, and XML. | Confirm the output is valid JSON and includes all required keys. |
| Text statistics | Measure properties like word count or symbols. | Ensure all generated summaries are single sentences. |
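The non-LLM checks in this table are often one-liners. A sketch of a regex check, a format check, and a simple text statistic:

```python
import json
import re

def mentions_banned_terms(text: str, banned: list[str]) -> bool:
    pattern = r"\b(" + "|".join(map(re.escape, banned)) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def is_valid_json_with_keys(text: str, required_keys: set[str]) -> bool:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())

def is_single_sentence(text: str) -> bool:
    # Crude heuristic: at most one sentence-ending punctuation mark.
    return len(re.findall(r"[.!?]", text.strip())) <= 1

mentions_banned_terms("Try AcmeCorp instead!", ["AcmeCorp"])            # True
is_valid_json_with_keys('{"summary": "...", "tags": []}', {"summary"})  # True
```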

These reference-free evaluations can work both during iterative development (like when you refine outputs for tone or format) and for monitoring production performance.

While you don’t need to design and label a ground truth dataset in this case, you still need to put in some upfront work. This time, your focus is on:

  • curating a diverse set of test inputs and
  • fine-tuning the evaluators.

It takes some thought to narrow down and express your assessment criteria. Once you set those, you may also need to align evaluators like LLM judges with your expectations.

Evaluation scenarios

To sum up, all evaluations follow the same structure. 

  • You start with an evaluation dataset, which includes test or production data. For testing, it may also contain ground truth.
  • You decide on the evaluation method: manual review or automated scoring.
  • You assess the outputs based on specific criteria, whether it’s correctness against a reference response or qualities like tone and structure.

You can combine both manual and automated methods.

Example: Webflow uses this hybrid approach effectively. They rely on automated scores for day-to-day LLM validation and conduct weekly manual reviews.

While all evaluations rely on the same elements (data, criteria, scoring method), you run them for different reasons. Here’s a look at common scenarios across the LLM product lifecycle.

Comparative evals

Choosing the best model, prompt or configuration for your AI product.

When starting out, your first step is often making comparisons.

You might begin by selecting a model. You check leaderboards, pick a few candidate LLMs, and test them on your task. For example, do OpenAI models outperform Anthropic’s? Or, if you switch to a cheaper or open-source model, how much quality do you lose?

Another comparative task is finding the best prompt.

Let’s say you’re building a summarization tool. Would “Explain this in simple terms” perform better than “Write a TLDR”? What if you break the task into steps or use a chain of prompts? Or maybe add examples of the desired style? This process, called prompt engineering, takes some trial and error. Small tweaks often make a big difference, so testing each version systematically on your dataset is key.

Experimental comparison LLM evals
Comparative evals help you see your progress over time.

Depending on the use case, you can also try things like chunking strategies for RAG, temperature settings, or different retrieval methods. 

Each change is a new experiment, and you need evaluations to compare their results. This means having curated test datasets and automated ways to measure performance. You can use both ground-truth methods (like comparing to ideal summaries) and reference-free methods (like checking whether all generated summaries follow a set structure, maintain the right tone, and don't contradict the source). 

You can also try using LLM judges for pairwise comparisons — showing two outputs and asking the model to pick the better one. To make this work, you'd need to invest in calibrating your eval prompts and watch for biases, like a tendency to favor outputs that appear first or last.
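One simple mitigation for position bias is randomizing which output appears first, as in this sketch (where `ask_judge` is a hypothetical call to your LLM of choice):

```python
import random

PAIRWISE_PROMPT = """Which summary better matches the source text?
Reply with only "A" or "B".

Source: {source}
Summary A: {a}
Summary B: {b}"""

def pick_better(source: str, output_1: str, output_2: str) -> str:
    # Randomize the order so position bias does not always favor the same candidate.
    swapped = random.random() < 0.5
    a, b = (output_2, output_1) if swapped else (output_1, output_2)
    verdict = ask_judge(PAIRWISE_PROMPT.format(source=source, a=a, b=b))  # hypothetical LLM call
    winner_is_first = verdict.strip().upper().startswith("A") != swapped
    return "output_1" if winner_is_first else "output_2"
```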

To experiment effectively, it's useful to establish a baseline. Try simple approaches before more sophisticated setups — this gives you a clear benchmark to measure progress.

Setting aside some test cases while you iterate is also a smart move. In machine learning, this is called a held-out dataset. Without it, you risk overfitting, where the app performs well on tested examples but struggles with new, unseen data. To avoid this, keep a portion of examples separate and only test them once you’re happy with the initial results.

While the details of experiments might change, the goal stays the same: figure out what works best and deliver a great product. A solid evaluation system helps move faster and make data-driven decisions. For example, instead of just saying one prompt seems better, you can quantify how well it performs on your test dataset.

Scenario and stress-testing 

Checking if your product is ready for real-world use by evaluating it across diverse scenarios, including edge cases.

As you run experiments, you’ll naturally want to expand your test coverage. Your initial example set may be small. But once you’ve picked a model, solved basic issues like output formatting, and settled on the prompt strategy, it’s time to test more thoroughly. Your system might work well on a dozen inputs — but what about a few hundred?

This means adding more examples — both to cover more common use cases and to see how the system handles tougher scenarios.

For example, in a support Q&A system, you might start with simple queries like “What’s your return policy?” Then, you’d expand to other topics like billing and add more complex questions, like “Why was I charged twice?” or “Can I exchange an item bought last year?” 

You can also test robustness and sample multiple responses to the same questions to see how variable they are.

The next step is looking at edge cases — realistic but tricky scenarios that need special handling. Like, what happens if the input is a single word? Or if it’s way too long? What if it’s in another language or full of typos? How does the system handle sensitive topics it shouldn’t address, like questions about competitors?

Scenario and stress-testing for LLM evals

Designing these takes some thought. You must understand how users interact with your product to create realistic test cases. Synthetic data can be super helpful here — it lets you quickly create variations of common questions or come up with more unusual examples.

Ultimately, you want to get to a point where you have a set of evaluation datasets for each topic or scenario, paired with methods to test them — such as expected answers to match or measurable criteria for automatic assessment. 

For example, if you expect your system to refuse competitor questions, you could:

  • Create a test dataset with competitor-related questions.
  • Generate responses for these inputs.
  • Check if all the responses properly deny the request.

To verify that the LLM correctly refuses to answer, you could look for the presence of specific words, check for semantic similarity to known denials, or use an LLM judge.
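A sketch of the simplest variant, a keyword-based refusal check; the phrase list is an illustration you would adapt, and `generate_response` and `competitor_questions` are hypothetical placeholders for your app and test set:

```python
REFUSAL_PHRASES = [
    "i can't help with that",
    "i cannot help with that",
    "i'm not able to comment on other companies",
    "i can only answer questions about our own products",
]

def looks_like_refusal(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

responses = [generate_response(q) for q in competitor_questions]
refusal_rate = sum(looks_like_refusal(r) for r in responses) / len(responses)
print(f"{refusal_rate:.0%} of competitor questions were properly refused")
```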

Technically, stress-testing isn’t much different from experimental evaluations. The difference is the focus: instead of exploring options (like which prompt works best), you’re checking if the product is ready for real-world use. The question shifts to: “Can our product, with its current prompt and design, handle everything users throw at it?”

Ideally, you’d take your existing LLM setup — no changes to prompts or architecture — and run it through these extra scenarios to confirm it responds correctly to all the challenges. Check!

In reality, though, you’ll likely spot issues. Once you address them, you re-run the evaluations to ensure they’re resolved. Fixes might involve refining prompts, tweaking the system’s non-LLM logic, or adding safeguards — like blocking specific types of requests.

That said, there’s one more thing to test: adversarial inputs.

Red-teaming

Testing how your system responds to adversarial behavior or malicious use.

Red-teaming is a testing technique where you simulate attacks or feed adversarial inputs to uncover vulnerabilities in the system. This is a crucial step in evaluating AI system safety for high-risk applications. 

While stress-testing focuses on challenging but plausible scenarios — like complex queries a regular user might ask — red-teaming targets misuse. It looks for ways bad actors might exploit the system, pushing it into unsafe or unintended behavior, like giving harmful advice.

The line between edge cases and adversarial inputs can sometimes be thin. For example, a healthcare chatbot must safely handle medical questions as part of its core functionality. Testing this falls within its normal scope. But for a general support Q&A system, medical, financial, or legal questions are outside its intended use and treated as adversarial.

Red-teaming can also test for risks like generating explicit or violent content, promoting hate speech, enabling illegal activities, violating privacy, or showing bias. 

Adversarial inputs example

This can involve both hands-on testing and more scalable methods. For example, you can manually try to trick the AI system to agree to harmful requests or leak sensitive information. To scale the process, you can run automated red-teaming, using techniques like synthetic data and targeted prompts to simulate a wide range of risks.

For example, to test for bias, you might:

  • Create adversarial inputs using synthetic data or ethical benchmarks to provoke sensitive or inappropriate responses.
  • Run these inputs through the system.
  • Check whether all outputs deny unsafe requests or avoid problematic statements.

Like other evaluations, red-teaming can be tailored to the specifics of your app.

Testing for generic harmful behavior — like asking blunt or provocative questions — is important, but context-specific tests can be even more valuable. For example, you could check if the app gives different advice when you change the age or gender of the person asking the same question.

Production observability

Understanding live performance of your system to detect and resolve issues. 

Offline quality evaluations can only take you so far. At some point, you’ll put your product in front of real users to see how it performs in the wild. If your use case doesn’t involve significant risks, you might launch your beta early to start gathering real-world feedback. 

This brings us to the next evaluation scenario: monitoring. 

Once your product is live, you’ll want to track its performance. Are users having a good experience? Are the responses accurate, safe, and helpful?

You can start with tracking user behavior, such as capturing clicks or engagement signals, or gather explicit user feedback, like asking users to upvote or downvote the response. However, these product metrics only give you top-level insights (do users seem to like it?); they don’t reveal the actual content of interactions or where things go right or wrong.

To get deeper insights, you need to track what users ask and how your system responds. This starts with collecting traces: detailed records of all the interactions.

LLM observability in production

Having these traces will let you evaluate quality in production by running online evaluations. You can automatically process each new output (or a portion) to see how they score against your criteria, using reference-free methods like LLM judges, ML models, or regular expressions.
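Conceptually, an online evaluation job is just a loop over recent traces that applies reference-free checks and stores the scores. The sketch below reuses the illustrative check functions from earlier; the trace format, `load_traces` loader, and `BANNED_TERMS` list are hypothetical:

```python
def score_trace(trace: dict) -> dict:
    # Apply reference-free checks to one logged interaction (hypothetical check functions).
    return {
        "trace_id": trace["id"],
        "is_refusal": looks_like_refusal(trace["output"]),
        "mentions_banned_terms": mentions_banned_terms(trace["output"], BANNED_TERMS),
        "answer_completeness": judge_completeness(trace["input"], trace["output"]),
    }

# Score a sample of recent production traces, e.g., every 10th interaction.
recent_traces = load_traces(since="2025-02-11")  # hypothetical loader
scores = [score_trace(t) for t in recent_traces[::10]]
```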


Online observability also helps you learn more about your users. Which requests are most popular? Which ones are unexpected? It’s like product analytics but focused on analyzing text interactions. For example, you may cluster user requests to spot common topics and decide which improvement to prioritize.

You can also test changes through A/B testing. For example, you might deploy a new prompt to 10% of users and compare performance metrics to see if it improves quality.
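A common way to split traffic is deterministic hashing on the user id, so each user consistently sees the same variant. A minimal sketch:

```python
import hashlib

def assign_variant(user_id: str, rollout_share: float = 0.10) -> str:
    # Hash the user id into [0, 1) so the assignment is stable across sessions.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    return "new_prompt" if bucket < rollout_share else "current_prompt"

assign_variant("user_42")  # always returns the same variant for this user
```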

If something looks off — like a spike in negative feedback or a drop in sentiment scores — you can dig into logs to troubleshoot. This means reviewing specific interactions to identify what went wrong. Your LLM observability setup should make it easy to analyze individual responses for debugging.

Manual reviews are still very valuable. While you can’t check every response in production, reviewing a smaller sample regularly — either random examples or those flagged by automated checks — can be incredibly useful. It helps you build intuition about what’s working, curate new test cases, and refine your evaluation criteria as your product evolves.

Regression testing 

Testing if new changes improve the system without breaking what used to work.

Even when your product is live, you need offline evaluations to run regression tests. They let you verify that the changes you make don’t introduce new (or old) issues.


Quality iterations rarely stop. As you learn more about how users interact with your app or uncover specific failures, you’ll naturally want to make updates — like tweaking a prompt. But every change comes with a risk: what if fixing one thing messes up something else?

For example, if you slightly adjust a prompt, how many previous outputs will change? And are those changes actually better — or worse? To stay ahead of this, you need a way to test updates in bulk by re-running your test cases to confirm that:

  • Correct outputs from before still work.
  • Your changes fix the issue you were targeting or improve overall quality.
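In practice, this often means diffing the new run against the last accepted one, as in this sketch (the per-test-case pass/fail format is illustrative):

```python
def regression_report(previous_run: dict, new_run: dict) -> dict:
    """Compare per-test-case pass/fail results from two evaluation runs."""
    fixed = [tid for tid, ok in new_run.items() if ok and not previous_run.get(tid, False)]
    broken = [tid for tid, ok in new_run.items() if not ok and previous_run.get(tid, False)]
    return {"fixed": fixed, "broken": broken, "safe_to_ship": not broken}

previous_run = {"t1": True, "t2": False, "t3": True}
new_run = {"t1": True, "t2": True, "t3": False}
print(regression_report(previous_run, new_run))
# {'fixed': ['t2'], 'broken': ['t3'], 'safe_to_ship': False}
```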

If tests pass, you can safely publish your updates to production.

Your test datasets must also evolve to reflect actual user behavior. The good news is you can pull new examples directly from the logs and turn them into test cases.

For larger updates, you can treat all recent production data as a test set. For example, take all inputs and outputs from last week, push them through a new LLM app version, regenerate responses, and check which ones shifted and how. This helps you spot any unintended side effects.

Regression testing for LLM evaluations

Systematic regression testing helps you safely build on top of an existing system. You can make changes while making sure you’re not creating new problems along the way.

Guardrails

Real-time checks that detect quality issues in LLM inputs or outputs.

Sometimes, you want to catch issues immediately. Unlike evaluating past or test interactions, this means detecting problematic inputs or outputs on the fly. These validations are called guardrails, acting as a safety net between the system’s response and the user.

Inside, these checks are the same reference-free evals but built directly into your app and applied in real time. For example, you can look at:

  • Inputs: Detect problematic queries, like questions about forbidden topics or containing toxic language.
  • Outputs: Detect if responses contain personally identifiable information (PII) or resemble legal advice.
LLM guardrails

You can define an appropriate action for when an issue is detected. The system might simply block the response and show a fallback message like, “I’m sorry, I can’t help with that.” Alternatively, it could apply a mitigation, such as removing private data or inappropriate language or retrying the response with a different prompt.
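Put together, a guardrail is just a check that runs before the response reaches the user, with a fallback action when it fails. A sketch, reusing the hypothetical checks from earlier (`mentions_banned_terms`, `generate_response`) plus an assumed `contains_pii` detector and `FORBIDDEN_TOPICS` list:

```python
FALLBACK_MESSAGE = "I'm sorry, I can't help with that."

def guarded_answer(user_input: str) -> str:
    # Input guardrail: block forbidden topics before calling the LLM (hypothetical check).
    if mentions_banned_terms(user_input, FORBIDDEN_TOPICS):
        return FALLBACK_MESSAGE

    response = generate_response(user_input)  # hypothetical app call

    # Output guardrail: block responses that leak personally identifiable information.
    if contains_pii(response):  # hypothetical PII detector
        return FALLBACK_MESSAGE
    return response
```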

While guardrails are valuable, they come with trade-offs. Additional processing can introduce delays. Some checks, like regular expressions to spot specific words, are fast. But more complex checks, like those calling other LLMs, can take longer and slow down your app responses. They may also be impractical for streaming outputs, where responses are shown to users in real-time as they’re generated.

Because of these limitations, guardrails are often reserved for the most critical risks, such as blocking toxic content or identifying PII. When used, they act as real-time, automated evaluations to keep your system safe and compliant.

Recap

LLM evaluations
Source: https://x.com/gdb/status/1733553161884127435

Good news: AI isn’t taking over everything just yet. Even with LLM-based products, you still need human input to manage quality. If not to actually review all the outputs, then to design and maintain automated evaluation systems.

Bad news: Evaluations aren’t simple. Each app needs a custom approach based on its use case, risks, and potential failure modes. For instance, a consumer-facing chatbot has much higher stakes for safety and accuracy than an internal tool where users can intervene.

You need evals at every stage, from your first product prompt to production. And these workflows aren’t isolated — each step builds on the previous one:

  • You often begin with comparative experiments to see what works best. A key prerequisite here is a good test dataset. You need to invest time in curating one and keep updating it as you gather new insights.
  • Before launch, you can run stress-testing and red-teaming to prepare for tricky cases.
  • When your app goes live, guardrails can help catch and prevent major issues.
  • Once your product is out in the world, AI-powered systems require ongoing monitoring. This isn’t a "set it and forget it" deal. You need production observability to see how well the system handles real-time data through online evaluations.
  • If something breaks, you fix it, run regression tests, and roll out the update.
LLM evaluation system

Automated and manual evaluations work hand-in-hand. While human labels can give the clearest signal, automated evals help replicate and scale these insights.

And all these evaluations aren’t just about crunching metrics. They help you:

  • Build better AI products. They help create reliable apps ready for real-world users.
  • Prevent failures. You can catch problems early, from edge cases to production bugs.
  • Move faster. Without evaluations, changes are slow and risky — you won’t know what you’ve broken or fixed. Automated evals help run more experiments and ship updates faster, whether it’s a bug fix or switching to a new, more cost-effective LLM.

A solid evaluation process has another bonus: it naturally leads to collecting high-quality labeled data. You can later use it to refine your system — replacing LLM judges with smaller models, optimizing production prompts, or even fine-tuning your main model.

Get started with LLM evals

LLM evals can be complex, especially for advanced systems like RAGs, AI agents, and mission-critical apps. That’s why we built Evidently, an open-source framework with over 25 million downloads. It simplifies evaluation workflows, offering 100+ built-in checks and easy configuration of custom LLM judges to match your needs.

For teams, Evidently Cloud provides a collaborative, no-code platform to test and evaluate AI quality. You can generate synthetic data to run scenario tests and AI agent simulations, manage datasets, trace interactions, and run evaluations right from the interface. 

Evidently Cloud LLM evaluations

Ready to optimize your LLM evaluations? Sign up for free, or schedule a demo to see Evidently Cloud in action. We’d love to help you build with confidence!
