How do you know your LLM-powered product – a customer support chatbot, coding assistant, or booking AI agent – actually works?
To ensure an LLM app does its job well and its outputs don’t cause harm, you need LLM evaluations. They help you build high-quality apps your users can trust, prevent failures by testing for edge cases, and speed up your AI product development cycle.
You need LLM evals at every stage of AI product development, from building and testing to production monitoring.
As Greg Brockman, a co-founder of OpenAI, put it, “Evals are surprisingly often all you need.” However, LLM evaluations are not that simple. You need to choose relevant evaluation methods, design a reliable test dataset, and build a custom LLM evaluation system based on the use case your app solves, its risks, and potential failure modes.
We put together seven examples of how companies run LLM evals to inspire you. They share how they approach the task, what methods and metrics they use, what they test for, and their learnings along the way.
Let’s dive in!
At Asana, LLM-powered features are part of different functional teams' daily routines, from task prioritization to debugging to customer feedback analysis. To ensure these features work reliably, the company built a high-quality, high-throughput QA process that reflects their real-world use cases. The testing process combines automated and manual evaluation methods.
An in-house LLM unit testing framework allows engineers at Asana to test LLM responses during development, much like traditional unit tests: each test makes assertions about the model's response to a given input.
They use the LLM-as-a-judge method to verify whether the unit test assertions hold. To make results more reliable, the team runs each test multiple times (e.g., best-of-3). Fast execution allows developers to iterate efficiently in sandboxes.
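For illustration, here is a minimal sketch of what such a test could look like: an assertion checked by an LLM judge with a best-of-3 vote. The judge prompt, the model name, and the generate_task_summary() function are assumptions for the example, not Asana's actual framework.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def generate_task_summary(task_id: str) -> str:
    """Stand-in for the LLM feature under test (hypothetical)."""
    ...

def llm_judge(response: str, assertion: str) -> bool:
    """Ask a judge model whether the assertion holds for the response."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Response:\n{response}\n\n"
                f"Assertion: {assertion}\n"
                "Does the assertion hold? Answer strictly YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

def test_summary_mentions_due_date():
    response = generate_task_summary(task_id="T-123")
    assertion = "The summary mentions the task's due date."
    # Best-of-3: run the judge three times and take the majority vote
    # to reduce the flakiness of a single LLM verdict.
    votes = [llm_judge(response, assertion) for _ in range(3)]
    assert sum(votes) >= 2
```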
Asana also runs integration tests to validate multi-prompt chains before release. These tests help ensure the chains retrieve the necessary data and generate accurate user-facing responses.
For end-to-end testing, Asana uses realistic data in sandboxed instances to reflect real-world customer interaction scenarios. These tests are graded manually by product managers. While this process takes longer, manual review allows for assessing harder-to-quantify aspects like tone and style and catching unexpected quality issues before production.
Blog: Asana's LLM testing playbook
Webflow measures its LLM system's performance along several axes. They combine a multi-point human rating system with heuristic evaluations that check whether a specific true/false condition is met.
As human evaluations are time-consuming and usually don’t scale well, Webflow automates them using an LLM judge that assesses the LLM system’s responses against predefined criteria. To learn more about LLM judges and how they work, check out this in-depth guide.
To check how well LLM-based evals match human evaluations, Webflow runs them side-by-side. In practice, they combine both methods. For example, they rely on automated scores for day-to-day validation and do weekly manual scoring to validate all significant changes. Webflow also relies on automated scores to detect unexpected regressions in quality.
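As a rough illustration of that side-by-side check, the sketch below computes how often an LLM judge agrees with human raters on the same responses. The column names and labels are invented for the example.

```python
import pandas as pd

# Each row: one evaluated response with a human rating and an LLM-judge rating.
scores = pd.DataFrame({
    "response_id": [1, 2, 3, 4, 5],
    "human_label": ["pass", "pass", "fail", "pass", "fail"],
    "judge_label": ["pass", "fail", "fail", "pass", "fail"],
})

# Share of responses where the automated judge matches the human verdict.
agreement = (scores["human_label"] == scores["judge_label"]).mean()
print(f"Judge-human agreement: {agreement:.0%}")

# Disagreements are worth a manual look: they often point to unclear criteria
# or a judge prompt that needs tightening.
print(scores[scores["human_label"] != scores["judge_label"]])
```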
Blog: Mastering AI quality: How we use language model evaluations to improve large language model output quality
GitLab Duo is a suite of AI-powered features designed to accelerate software development, from planning to deployment. To evaluate it, GitLab designed a Centralized Framework that supports the entire end-to-end process of LLM feature creation, from selecting the appropriate model for a use case to assessing the features’ output.
LLM testing at GitLab is based on four main steps.
Blog: Developing GitLab Duo: How we validate and test AI models at scale
Wix customizes an LLM for its use cases, which requires dedicated evaluation benchmarks and an extensive evaluation process. While open LLM benchmarks are useful for evaluating general-purpose capabilities, custom benchmarks are needed to estimate domain knowledge and the ability to solve domain-specific tasks.
To estimate the LLM's domain knowledge, Wix built a custom evaluation dataset from existing customer service live chats and FAQs. To assess the quality of responses, they built an LLM judge that compares LLM-suggested answers to the ground truth.
To estimate task capabilities, Wix uses domain-specific, text-based learning tasks. These include customer intent classification, customer segmentation, custom domain summarization, and sentiment analysis.
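As a simplified illustration of such a domain-task benchmark, the sketch below scores a model on customer intent classification against a handful of labeled examples. The dataset, intent labels, and the stand-in model call are assumptions, not Wix's actual setup.

```python
from openai import OpenAI

client = OpenAI()

# A handful of labeled examples standing in for a domain benchmark.
labeled_examples = [
    {"text": "How do I connect my own domain?", "intent": "domain_setup"},
    {"text": "I was charged twice this month.", "intent": "billing"},
    {"text": "The site editor won't load in my browser.", "intent": "technical_issue"},
]

def classify_intent(text: str) -> str:
    """Stand-in for the customized LLM being evaluated."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Classify the customer message into one of: "
                "domain_setup, billing, technical_issue. Reply with the label only.\n\n"
                + text
            ),
        }],
    )
    return completion.choices[0].message.content.strip()

correct = sum(classify_intent(ex["text"]) == ex["intent"] for ex in labeled_examples)
print(f"Intent classification accuracy: {correct / len(labeled_examples):.0%}")
```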
Blog: Customizing LLMs for Enterprise Data Using Domain Adaptation: The Wix Journey
Segment built an LLM-powered Audience Builder that helps users express query logic. For example, it allows searching for “all users who have added a product to the cart but not checked out in the last 7 days” without code. Behind the scenes, the query is expressed as an AST (abstract syntax tree).
However, evaluating the quality of generated queries is not so straightforward. For complex queries, there are multiple correct ways to express an audience. This means that you can’t simply compare the answer you get against a single ground truth answer.
To determine whether a generated representation is correct, Segment uses LLM-as-a-judge. The judge compares newly generated outputs against the “ground truth” – examples of queries previously built by users – to assess correctness.
The team also had to create a test dataset. They built an LLM Question Generator Agent that takes a ground-truth AST as input and generates a possible input prompt. The synthetic prompts are then passed to the AST Generator, and the LLM Judge evaluates the new output.
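The sketch below illustrates one round of that loop under simplifying assumptions: the prompts, the placeholder model name, the AST structure, and the generate_ast() stub are illustrative, not Segment's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_question(ground_truth_ast: dict) -> str:
    """Question Generator Agent: invent a user request this AST could answer."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": "Write a natural-language audience request that this query AST expresses:\n"
                       + json.dumps(ground_truth_ast),
        }],
    )
    return out.choices[0].message.content

def generate_ast(prompt: str) -> dict:
    """Stub for the AST Generator under evaluation (the production system)."""
    ...

def judge_equivalent(ground_truth_ast: dict, candidate_ast: dict) -> bool:
    """LLM judge: do the two ASTs select the same audience?"""
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": "Do these two query ASTs select the same audience? Answer YES or NO.\n"
                       f"A: {json.dumps(ground_truth_ast)}\nB: {json.dumps(candidate_ast)}",
        }],
    )
    return out.choices[0].message.content.strip().upper().startswith("YES")

# One evaluation round for a single ground-truth example.
ground_truth = {"and": [{"event": "Product Added"},
                        {"not": {"event": "Order Completed", "within_days": 7}}]}
synthetic_prompt = generate_question(ground_truth)
candidate = generate_ast(synthetic_prompt)
print("correct" if judge_equivalent(ground_truth, candidate) else "incorrect")
```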
Blog: LLM-as-Judge: Evaluating and Improving Language Model Performance in Production
GitHub shares how they run offline evaluations for their coding assistant GitHub Copilot.
The team runs thousands of tests to ensure performance, quality, and safety before making any change to the production environment. These evaluations encompass both automated and manual testing methods.
For code completions, the team measures the percentage of unit tests that pass. They have a collection of around 100 containerized repositories with corresponding tests. These repositories are modified so the tests fail, and the model's task is to modify the codebase so the failing tests pass again. To gauge the quality of the code suggestions, the team also measures the similarity of the result to the original, known-passing state.
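A heavily simplified version of such a pass-rate harness might look like the sketch below. The repository paths and the apply_model_fix() stub are hypothetical, and GitHub's real setup runs inside containers rather than local checkouts.

```python
import subprocess
from pathlib import Path

# Hypothetical local checkouts of the evaluation repositories.
repos = [Path("evals/repo_a"), Path("evals/repo_b")]

def apply_model_fix(repo: Path) -> None:
    """Stub: ask the model to modify the codebase so the failing tests pass."""
    ...

def tests_pass(repo: Path) -> bool:
    """Run the repository's test suite and report whether it passes."""
    result = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
    return result.returncode == 0

passed = 0
for repo in repos:
    apply_model_fix(repo)
    if tests_pass(repo):
        passed += 1

print(f"Unit test pass rate: {passed}/{len(repos)} = {passed / len(repos):.0%}")
```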
For Copilot Chat capabilities, GitHub calculates the percentage of questions answered correctly. Automatic evaluations are used for simple true-or-false questions. For more complex ones, an LLM judge is used to evaluate the chatbot's answers. GitHub routinely audits these evaluation outputs to ensure the LLM judge aligns with human reviewers and performs consistently.
Blog: How we evaluate AI models and LLMs for GitHub Copilot
DoorDash, a food delivery company, built a RAG-based support chatbot to provide timely and accurate responses to Dashers, the independent contractors who make deliveries through DoorDash.
To monitor quality over time, DoorDash uses a system that assesses the chatbot's performance across five LLM evaluation metrics: retrieval correctness, response accuracy, grammar and language accuracy, coherence to context, and relevance to the Dasher's request.
They originally performed manual evaluations of conversation transcripts, which helped them narrow down the quality criteria. They then implemented monitors that use either an LLM-as-a-judge approach or a metric based on regular expressions.
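As an illustration of the regular-expression side, a monitor could flag responses that match known failure patterns. The patterns below are invented for the example; DoorDash's actual checks are not public.

```python
import re

# Invented examples of patterns a support response should never contain.
FORBIDDEN_PATTERNS = [
    re.compile(r"\{\{.*?\}\}"),                              # unresolved template placeholders
    re.compile(r"\$\d+(\.\d{2})?\s+refund", re.IGNORECASE),  # concrete refund promises
]

def violates_policy(response: str) -> bool:
    """Flag a response if it matches any forbidden pattern."""
    return any(pattern.search(response) for pattern in FORBIDDEN_PATTERNS)

print(violates_policy("Your {{order_id}} is on its way."))               # True
print(violates_policy("Your order is on its way, thanks for waiting!"))  # False
```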
The quality of each aspect is determined by prompting the judge with open-ended questions. Answers to these questions are then summarized into common issues for further analysis. DoorDash has a dedicated human team that reviews random samples of transcripts to calibrate the LLM judge's evaluations.
To maintain the high quality of the system’s responses, DoorDash also implemented the LLM Guardrail system, an online monitoring tool that evaluates each LLM-generated response for accuracy and compliance. It helps prevent hallucinations and filter out responses that violate company policies.
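A guardrail of this kind might look roughly like the sketch below, which checks each draft response for groundedness against the retrieved context before sending it. The judge prompt, model name, and fallback message are assumptions for illustration, not DoorDash's implementation.

```python
from openai import OpenAI

client = OpenAI()

FALLBACK = "Let me connect you with a support specialist who can help."

def is_grounded(response: str, context: str) -> bool:
    """LLM judge: is every factual claim in the response supported by the context?"""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Context:\n{context}\n\nResponse:\n{response}\n\n"
                "Is every factual claim in the response supported by the context? "
                "Answer YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

def guarded_reply(draft_response: str, retrieved_context: str) -> str:
    # Block ungrounded (potentially hallucinated) answers before they reach the Dasher.
    return draft_response if is_grounded(draft_response, retrieved_context) else FALLBACK
```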
Blog: Path to high-quality LLM-based Dasher support automation
If you’re building an LLM-powered system, you need evaluations to test it during development and to monitor it in production. That’s why we built Evidently. Our open-source library, with over 25 million downloads, makes it easy to test and evaluate LLM-powered applications, from chatbots to RAG. It simplifies evaluation workflows, offering 100+ built-in checks and easy configuration of custom LLM judges for every use case.
We also provide Evidently Cloud, a no-code workspace for teams to collaborate on AI quality, testing, and monitoring, and to run complex evaluation workflows.
Ready to evaluate your LLM app? Sign up for free or schedule a demo to see Evidently Cloud in action. We're here to help you build with confidence!