How do you know your LLM-powered product – a customer support chatbot, coding assistant, or booking AI agent – actually works?
To ensure an LLM app does its job well and its outputs don’t cause harm, you need LLM evaluations. They help you build high-quality apps your users can trust, prevent failures by testing for edge cases, and speed up your AI product development cycle.
You need LLM evals at every stage of AI product development, from building and testing to production monitoring.
As Greg Brockman, a co-founder of OpenAI, put it, “Evals are surprisingly often all you need.” However, LLM evaluations are not that simple. You need to choose relevant evaluation methods, design a reliable test dataset, and build a custom LLM evaluation system based on the use case your app solves, its risks, and potential failure modes.
We put together seven examples of how companies run LLM evals to inspire you. They share how they approach the task, what methods and metrics they use, what they test for, and their learnings along the way.
Let’s dive in!
At Asana, LLM-powered features are part of different functional teams' daily routines, from task prioritization to debugging to customer feedback analysis. To ensure these features work reliably, the company built a high-quality, high-throughput QA process that reflects their real-world use cases. The testing process combines automated and manual evaluation methods.
An in-house LLM unit testing framework allows engineers at Asana to test LLM responses during development, much like traditional unit tests: each test makes assertions about the model's response to a given input.
They use the LLM-as-a-judge method to verify whether the unit test assertions hold. To make results more reliable, the team runs each test multiple times (e.g., best-of-3). Fast execution allows developers to iterate efficiently in sandboxes.
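For illustration, here is a minimal sketch of what such a test could look like: an assertion checked by an LLM judge with a best-of-3 vote. The judge prompt, the model name, and the generate_task_summary() function are assumptions for the example, not Asana's actual framework.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def generate_task_summary(task_id: str) -> str:
    """Stand-in for the LLM feature under test (hypothetical)."""
    ...

def llm_judge(response: str, assertion: str) -> bool:
    """Ask a judge model whether the assertion holds for the response."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Response:\n{response}\n\n"
                f"Assertion: {assertion}\n"
                "Does the assertion hold? Answer strictly YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

def test_summary_mentions_due_date():
    response = generate_task_summary(task_id="T-123")
    assertion = "The summary mentions the task's due date."
    # Best-of-3: run the judge three times and take the majority vote
    # to reduce the flakiness of a single LLM verdict.
    votes = [llm_judge(response, assertion) for _ in range(3)]
    assert sum(votes) >= 2
```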
Asana also runs integration tests to validate multi-prompt chains before release. These tests help ensure the chains retrieve the necessary data and generate accurate user-facing responses.
For end-to-end testing, Asana uses realistic data in sandboxed instances to reflect real-world customer interaction scenarios. These tests are graded manually by product managers. While this process takes longer, manual review allows for assessing harder-to-quantify aspects like tone and style and catching unexpected quality issues before production.
Blog: Asana's LLM testing playbook
Webflow measures its LLM system's performance along several axes. They combine a multi-point human rating system with heuristic evaluations that check whether a specific true/false condition is met.
As human evaluations are time-consuming and usually don’t scale well, Webflow automates them using an LLM judge that assesses the LLM system’s responses against predefined criteria. To learn more about LLM judges and how they work, check out this in-depth guide.
To check how well LLM-based evals match human evaluations, Webflow runs them side-by-side. In practice, they combine both methods. For example, they rely on automated scores for day-to-day validation and do weekly manual scoring to validate all significant changes. Webflow also relies on automated scores to detect unexpected regressions in quality.
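As a rough illustration of that side-by-side check, the sketch below computes how often an LLM judge agrees with human raters on the same responses. The column names and labels are invented for the example.

```python
import pandas as pd

# Each row: one evaluated response with a human rating and an LLM-judge rating.
scores = pd.DataFrame({
    "response_id": [1, 2, 3, 4, 5],
    "human_label": ["pass", "pass", "fail", "pass", "fail"],
    "judge_label": ["pass", "fail", "fail", "pass", "fail"],
})

# Share of responses where the automated judge matches the human verdict.
agreement = (scores["human_label"] == scores["judge_label"]).mean()
print(f"Judge-human agreement: {agreement:.0%}")

# Disagreements are worth a manual look: they often point to unclear criteria
# or a judge prompt that needs tightening.
print(scores[scores["human_label"] != scores["judge_label"]])
```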
Blog: Mastering AI quality: How we use language model evaluations to improve large language model output quality
GitLab Duo is a suite of AI-powered features designed to accelerate software development, from planning to deployment. To evaluate it, GitLab designed a Centralized Framework that supports the entire end-to-end process of LLM feature creation, from selecting the appropriate model for a use case to assessing the features’ output.
LLM testing at GitLab is based on four main steps.
Blog: Developing GitLab Duo: How we validate and test AI models at scale
Wix customizes an LLM for its use cases, which requires dedicated evaluation benchmarks and an extensive evaluation process. While open LLM benchmarks are useful for evaluating general-purpose capabilities, custom benchmarks are needed to estimate domain knowledge and the ability to solve domain-specific tasks.
To estimate the LLM's domain knowledge, Wix built a custom evaluation dataset from existing customer service live chats and FAQs. To assess the quality of responses, they built an LLM judge that compares LLM-suggested answers to the ground truth.
To estimate task capabilities, Wix uses domain-specific, text-based learning tasks. These include customer intent classification, customer segmentation, custom domain summarization, and sentiment analysis.
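As a simplified illustration of such a domain-task benchmark, the sketch below scores a model on customer intent classification against a handful of labeled examples. The dataset, intent labels, and the stand-in model call are assumptions, not Wix's actual setup.

```python
from openai import OpenAI

client = OpenAI()

# A handful of labeled examples standing in for a domain benchmark.
labeled_examples = [
    {"text": "How do I connect my own domain?", "intent": "domain_setup"},
    {"text": "I was charged twice this month.", "intent": "billing"},
    {"text": "The site editor won't load in my browser.", "intent": "technical_issue"},
]

def classify_intent(text: str) -> str:
    """Stand-in for the customized LLM being evaluated."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Classify the customer message into one of: "
                "domain_setup, billing, technical_issue. Reply with the label only.\n\n"
                + text
            ),
        }],
    )
    return completion.choices[0].message.content.strip()

correct = sum(classify_intent(ex["text"]) == ex["intent"] for ex in labeled_examples)
print(f"Intent classification accuracy: {correct / len(labeled_examples):.0%}")
```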
Blog: Customizing LLMs for Enterprise Data Using Domain Adaptation: The Wix Journey
Segment built an LLM-powered Audience Builder that helps users express query logic. For example, it allows searching for “all users who have added a product to the cart but not checked out in the last 7 days” without code. Behind the scenes, the query is expressed as an AST (abstract syntax tree).
However, evaluating the quality of generated queries is not so straightforward. For complex queries, there are multiple correct ways to express an audience. This means that you can’t simply compare the answer you get against a single ground truth answer.
To determine whether a generated representation is correct, Segment uses LLM-as-a-judge. The judge compares newly generated outputs against the “ground truth” – examples of queries previously built by users – to assess correctness.
The team also had to create a test dataset. They built an LLM Question Generator Agent that takes a ground-truth AST as input and generates a possible input prompt. The synthetic prompts are then passed to the AST Generator, and the LLM Judge evaluates the new output.
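The sketch below illustrates one round of that loop under simplifying assumptions: the prompts, the placeholder model name, the AST structure, and the generate_ast() stub are illustrative, not Segment's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_question(ground_truth_ast: dict) -> str:
    """Question Generator Agent: invent a user request this AST could answer."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": "Write a natural-language audience request that this query AST expresses:\n"
                       + json.dumps(ground_truth_ast),
        }],
    )
    return out.choices[0].message.content

def generate_ast(prompt: str) -> dict:
    """Stub for the AST Generator under evaluation (the production system)."""
    ...

def judge_equivalent(ground_truth_ast: dict, candidate_ast: dict) -> bool:
    """LLM judge: do the two ASTs select the same audience?"""
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": "Do these two query ASTs select the same audience? Answer YES or NO.\n"
                       f"A: {json.dumps(ground_truth_ast)}\nB: {json.dumps(candidate_ast)}",
        }],
    )
    return out.choices[0].message.content.strip().upper().startswith("YES")

# One evaluation round for a single ground-truth example.
ground_truth = {"and": [{"event": "Product Added"},
                        {"not": {"event": "Order Completed", "within_days": 7}}]}
synthetic_prompt = generate_question(ground_truth)
candidate = generate_ast(synthetic_prompt)
print("correct" if judge_equivalent(ground_truth, candidate) else "incorrect")
```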
Blog: LLM-as-Judge: Evaluating and Improving Language Model Performance in Production
GitHub shares how they run offline evaluations for their coding assistant GitHub Copilot.
The team runs thousands of tests to ensure performance, quality, and safety before making any change to the production environment. These evaluations encompass both automated and manual testing methods.
For code completions, the team measures the percentage of unit tests that pass. They have a collection of around 100 containerized repositories with corresponding tests. These repositories are modified so the tests fail, and the model's task is to modify the codebase so the failing tests pass again. To gauge the quality of the code suggestions, the team also measures the similarity of the result to the original, known-passing state.
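A heavily simplified version of such a pass-rate harness might look like the sketch below. The repository paths and the apply_model_fix() stub are hypothetical, and GitHub's real setup runs inside containers rather than local checkouts.

```python
import subprocess
from pathlib import Path

# Hypothetical local checkouts of the evaluation repositories.
repos = [Path("evals/repo_a"), Path("evals/repo_b")]

def apply_model_fix(repo: Path) -> None:
    """Stub: ask the model to modify the codebase so the failing tests pass."""
    ...

def tests_pass(repo: Path) -> bool:
    """Run the repository's test suite and report whether it passes."""
    result = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
    return result.returncode == 0

passed = 0
for repo in repos:
    apply_model_fix(repo)
    if tests_pass(repo):
        passed += 1

print(f"Unit test pass rate: {passed}/{len(repos)} = {passed / len(repos):.0%}")
```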
For Copilot Chat capabilities, GitHub calculates the percentage of questions answered correctly. Automatic evaluations are used for simple true-or-false questions. For more complex ones, an LLM judge is used to evaluate the chatbot's answers. GitHub routinely audits these evaluation outputs to ensure the LLM judge aligns with human reviewers and performs consistently.
Blog: How we evaluate AI models and LLMs for GitHub Copilot
DoorDash, a food delivery company, built a RAG-based support chatbot to provide timely and accurate responses to Dashers, the independent contractors who make deliveries through DoorDash.
To monitor quality over time, DoorDash uses a system that assesses the chatbot's performance across five LLM evaluation metrics: retrieval correctness, response accuracy, grammar and language accuracy, coherence to context, and relevance to the Dasher's request.
They originally performed manual evaluations of conversation transcripts, which helped them narrow down the quality criteria. They then implemented monitors that use either an LLM-as-a-judge approach or a metric based on regular expressions.
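As an illustration of the regular-expression side, a monitor could flag responses that match known failure patterns. The patterns below are invented for the example; DoorDash's actual checks are not public.

```python
import re

# Invented examples of patterns a support response should never contain.
FORBIDDEN_PATTERNS = [
    re.compile(r"\{\{.*?\}\}"),                              # unresolved template placeholders
    re.compile(r"\$\d+(\.\d{2})?\s+refund", re.IGNORECASE),  # concrete refund promises
]

def violates_policy(response: str) -> bool:
    """Flag a response if it matches any forbidden pattern."""
    return any(pattern.search(response) for pattern in FORBIDDEN_PATTERNS)

print(violates_policy("Your {{order_id}} is on its way."))               # True
print(violates_policy("Your order is on its way, thanks for waiting!"))  # False
```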
The quality of each aspect is determined by prompting the judge with open-ended questions. Answers to these questions are then summarized into common issues for further analysis. DoorDash has a dedicated human team that reviews random samples of transcripts to calibrate the LLM judge's evaluations.
To maintain the high quality of the system’s responses, DoorDash also implemented the LLM Guardrail system, an online monitoring tool that evaluates each LLM-generated response for accuracy and compliance. It helps prevent hallucinations and filter out responses that violate company policies.
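A guardrail of this kind might look roughly like the sketch below, which checks each draft response for groundedness against the retrieved context before sending it. The judge prompt, model name, and fallback message are assumptions for illustration, not DoorDash's implementation.

```python
from openai import OpenAI

client = OpenAI()

FALLBACK = "Let me connect you with a support specialist who can help."

def is_grounded(response: str, context: str) -> bool:
    """LLM judge: is every factual claim in the response supported by the context?"""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Context:\n{context}\n\nResponse:\n{response}\n\n"
                "Is every factual claim in the response supported by the context? "
                "Answer YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

def guarded_reply(draft_response: str, retrieved_context: str) -> str:
    # Block ungrounded (potentially hallucinated) answers before they reach the Dasher.
    return draft_response if is_grounded(draft_response, retrieved_context) else FALLBACK
```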
Blog: Path to high-quality LLM-based Dasher support automation
If you’re building an LLM-powered system, you need evaluations to test it during development and to monitor it in production. That’s why we built Evidently. Our open-source library, with over 25 million downloads, makes it easy to test and evaluate LLM-powered applications, from chatbots to RAG. It simplifies evaluation workflows, offering 100+ built-in checks and easy configuration of custom LLM judges for every use case.
We also provide Evidently Cloud, a no-code workspace for teams to collaborate on AI quality, testing, and monitoring, and to run complex evaluation workflows.
Ready to evaluate your LLM app? Sign up for free or schedule a demo to see Evidently Cloud in action. We're here to help you build with confidence!