
LLM regression testing workflow step by step: code tutorial

July 1, 2024

In a previous tutorial, we introduced the concept of regression testing for LLM outputs and various evaluation methods. This time, we'll focus on the end-to-end workflow. You will learn how to run regression testing as a process and build a dashboard to monitor the test results.

You will use the open-source Evidently Python library to run evaluations and the Evidently Cloud platform for monitoring.

Code example: to follow along, run the example in a Jupyter notebook or in Colab.

How the testing works  

Let's start with a quick recap.

Regression testing helps ensure that new changes don't cause problems. You can run tests whenever you modify any part of your LLM system, such as trying a new retrieval strategy, model version, or prompt. The goal is to check that updates don't make the quality of generative outputs worse or introduce new errors.

Regression testing is not so much about nuanced experiments (there's a time and place for that!) as it is a first line of defense and a "sanity check." 

You run your tests against a set of defined test cases. The aim is to verify that everything stays the same and you still generate correct and high-quality outputs in all your test scenarios. If all tests pass, you are good to go and publish the changes. If not, you need to quickly surface issues that came up without needing to review responses one by one. 

LLM regression testing process

Let's look at an example. Imagine you discover a scenario where your LLM doesn't perform as you want it to. Say, when a user greets your chatbot, it fails to greet the user back. You decide to address this by adding a prompt line: "If the user starts with a greeting, respond with 'Hello!' before answering the question."

You then try a few sample queries to see that the bot indeed starts adding "Hello" as expected. All is set—or is it?

Even though the prompt edit seems harmless, you can't be sure that it does not disrupt other expected behaviors until you test it. Before releasing the prompt fix, you need to make sure that solving this one issue does not create others. 

Here is how the workflow can look.

You start by putting together a test set. This dataset is often called a golden dataset: it contains input questions and approved responses. You can collect the examples from production data or generate them synthetically. It's good to have a mix, so your tests cover different scenarios. But don't overthink it to start: a dozen examples are better than none!

Next up, pick your evaluation methods. There are a few ways to measure whether your new answers are good enough. A solid start is comparing how similar new responses are to the ones in your test set using techniques like semantic similarity. Check out the previous tutorial for an introduction. Generally, you want checks that are quick and cheap to run.
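To make the idea concrete: semantic similarity is typically computed as the cosine similarity between text embeddings. Evidently ships this as a built-in descriptor (you will use it later in this tutorial), so the snippet below is only an illustration of the underlying idea; it assumes the sentence-transformers package and the all-MiniLM-L6-v2 model.

from sentence_transformers import SentenceTransformer, util

# Embed both texts and compare them with cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "You can reset your password from the account settings page."
new_response = "To reset your password, go to the account settings page."

embeddings = model.encode([reference, new_response])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {score:.2f}")  # close to 1.0 means the meaning is preserved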

LLM regression testing methods

Run tests on every update. Whenever you make changes, it's testing time! You must first generate new responses for the same test questions using your updated prompt or new system parameters. Then, you evaluate their quality against your set criteria or examples. The goal is to get a sign-off or spot any deviations.

Record the results. To monitor test outcomes, you can set up a dashboard. This helps track which tests passed, failed, or triggered warnings and allows you to see specific metrics for each run, like average semantic similarity.

Here is an example of a dashboard that shows the test results over time and the distributions of specific quality metrics you care about.

Evidently LLM regression testing dashboard

Let's see how you can get there.

Code tutorial 

In this tutorial, you will:

  • Create a golden reference dataset.
  • Design a test suite to check text similarity, sentiment, and length.
  • Simulate changes with five different prompts.
  • Run the tests against new responses.
  • Build a dashboard to track test results over time.

As an example, you will work with a toy chatbot use case and create a few synthetic datasets.

You will use the open-source Evidently Python library to run the checks and Evidently Cloud to track and monitor the test results and metrics. 

You need basic Python knowledge to run the tutorial.

1. Create a Project

Start by creating a free account on Evidently Cloud. After signing up, create an Organization and a Team to manage your Projects.  

Next, configure your Python environment to run the checks. 

Install Evidently:

!pip install evidently[llm]

Import the required components:

import pandas as pd

from evidently.test_suite import TestSuite
from evidently.descriptors import *
from evidently.tests import *

from evidently.ui.workspace.cloud import CloudWorkspace

You can also add optional imports to define the dashboard as code. Check the complete example for details.

Next, connect to the cloud workspace:

ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")

Include the API token, which you can find on the Token page in the left menu.

Now, let's create your first Project. You can do this through the UI or using the Python API. 

Let's do this in Python:

project = ws.create_project("Regression testing example", team_id="YOUR_TEAM_ID")
project.description = "My project description"
project.save()

Make sure to include the Team ID (you can find it on the Teams page in the left menu).

This setup will allow you to send and view test results for a specific Project within Evidently Cloud and share them with others.

2. Prepare the golden dataset 

Create a reference dataset with approved answers to compare against. Here's a simple synthetic dataset to start with:

Golden dataset

Let's call this dataset ref_data. Note that the dataset has a placeholder column for new responses. For now, the column contains duplicates of the reference answers.

This will keep the data structure constant and let you use a nice trick: using the reference dataset to generate test conditions automatically.
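For illustration, here is a minimal sketch of how such a dataset could be assembled in pandas. The "response" and "reference_response" column names match those used in the tests below; the "question" column name and the example rows are made up for this sketch.

import pandas as pd

# Toy golden dataset: input questions and approved reference answers
ref_data = pd.DataFrame(
    {
        "question": [
            "How do I open a new account?",
            "Can I send money abroad?",
        ],
        "reference_response": [
            "You can open a new account in the app: go to Accounts and choose 'Open account'.",
            "Yes, you can send international transfers from the Payments tab.",
        ],
    }
)

# Placeholder column for new responses: for now, duplicate the approved answers
ref_data["response"] = ref_data["reference_response"]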

3. Define the Test Suite

Next, you must define the criteria to check. To do this, create a TestSuite object and list the tests with their respective conditions:

test_suite = TestSuite(tests=[...])

In the earlier tutorial, we explored different methods in depth. You can often start with just one test: semantic similarity between new and old responses.

To make the example more interesting, let's include a few more things. 

First, set two separate checks for semantic similarity with different thresholds:

  • If the similarity between new and old responses is < 0.9, the test returns a warning.
  • If the similarity between new and old responses is < 0.8, the test fails.

By using the is_critical parameter, you specify whether to give a warning or a fail.

Here is the code excerpt where you define these two tests:

    TestColumnValueMin(
        column_name=SemanticSimilarity(
        display_name="Response Similarity").
        on(["response", "reference_response"]),
        gte=0.9,
        is_critical=False),
    TestColumnValueMin(
        column_name=SemanticSimilarity(
        display_name="Response Similarity").
        on(["response", "reference_response"]),
        gte=0.8),

Next, let's check the sentiment of the text. When generating new responses, you expect them to have a similar sentiment. If they become more negative, it's worth a look!

But how do you set a condition for this sentiment test? 

Here is a helpful trick: you can derive the thresholds from the reference data. Instead of defining a specific boundary, pass the example dataset, and Evidently will do it for you. For any column value test (like mean, max, or range), it will calculate the reference statistic and set the test condition within +/- 10%.

For example, if you use TestColumnValueMin for sentiment, Evidently will find the lowest sentiment score in the reference responses and set the test condition to be within +/- 10% of the observed minimal value.

This test will show if any of your new responses have a sentiment score that is much lower (or higher) than previously approved examples. This can often be a better approach than expecting sentiment to always be above zero. Your reference examples might naturally include some negativity, such as when the system correctly denies something to a user. 

An alternative could be looking for a mean sentiment score. It's up to you!

    TestColumnValueMin(
        column_name=Sentiment(
        display_name="Response Sentiment").
        on("response")),

Let's also look at the length of the text. 

One test will check that the mean text length is within +/-10% of what it was before. It isn't a critical test, but you might want a warning if the average response length is changing. 

The second check includes a hard requirement. You want to ensure that all texts fall between 0 and 250 symbols. For instance, you may expect all responses to fit within the space of a chat window.

    TestColumnValueMean(
        column_name=TextLength(
        display_name="Text Length").
        on("response"),
        is_critical=False),
    TestShareOfOutRangeValues(
        column_name=TextLength(
        display_name="Text Length").
        on("response"),
        left=0, right=250,
        eq=0),

Finally, let's add a competitor check. Here, the condition is simple: you want the number of responses that mention two named competitors to be zero.

    TestCategoryCount(
        column_name=Contains(
            items=["AnotherFinCo", "YetAnotherFinCo"],
            display_name="Competitor Mentions").
        on("response"),
        category=True,
        eq=0),

These are just examples: you can combine other checks, such as using LLM-as-a-judge, model-based evaluations, or rule-based validations like regular expression matches.
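For reference, here is how the excerpts above fit together into a single Test Suite. It simply combines the six tests already shown, in one TestSuite object:

test_suite = TestSuite(tests=[
    TestColumnValueMin(
        column_name=SemanticSimilarity(
        display_name="Response Similarity").
        on(["response", "reference_response"]),
        gte=0.9,
        is_critical=False),
    TestColumnValueMin(
        column_name=SemanticSimilarity(
        display_name="Response Similarity").
        on(["response", "reference_response"]),
        gte=0.8),
    TestColumnValueMin(
        column_name=Sentiment(
        display_name="Response Sentiment").
        on("response")),
    TestColumnValueMean(
        column_name=TextLength(
        display_name="Text Length").
        on("response"),
        is_critical=False),
    TestShareOfOutRangeValues(
        column_name=TextLength(
        display_name="Text Length").
        on("response"),
        left=0, right=250,
        eq=0),
    TestCategoryCount(
        column_name=Contains(
            items=["AnotherFinCo", "YetAnotherFinCo"],
            display_name="Competitor Mentions").
        on("response"),
        category=True,
        eq=0),
])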

Now that you've defined your Test Suite, you can run it every time there's a prompt change.

4. Get new responses

Let's simulate just this. Imagine you've created a new prompt to slightly change the style of responses, aiming for a more conversational tone.

You can add the new answers to the “response” column and create a new dataframe. Let's call it cur_data_1.

New responses LLM dataset

To run the Test Suite, you will use both the new current dataset and the reference dataset. This allows you to use the trick of auto-generating test conditions.

Note: You can also run checks with a single dataset. In this case, you'd need to define all the conditions manually, such as setting a specific sentiment or length threshold directly.
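As a sketch, assuming new_responses is a hypothetical list holding the answers produced by the updated prompt (in the same order as the questions in ref_data), the new dataframe could be built like this:

# Hypothetical answers generated with the updated prompt, one per question in ref_data
new_responses = [
    "Hello! You can open a new account right in the app: just head to Accounts and tap 'Open account'.",
    "Hello! Sure, you can send money abroad from the Payments tab.",
]

# Keep the structure of the reference dataset, swap in the new responses
cur_data_1 = ref_data.copy()
cur_data_1["response"] = new_responses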

5. Run the Test Suite

Run the tests to evaluate the cur_data_1 dataset. You will pass the curated examples (ref_data) as a reference and send the results to Evidently Cloud:

test_suite.run(reference_data=ref_data, current_data=cur_data_1)
ws.add_test_suite(project.id, test_suite)

You will repeat this step each time you conduct the check. Since the Test Suite is already defined, you only need to provide the new current dataset each time.
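If you expect to repeat this often, you could wrap the two lines above in a small helper (a sketch, not part of the original tutorial code) and call it with each new batch of responses:

# Convenience wrapper: the Test Suite and the reference dataset stay the same,
# only the current dataset changes between runs
def run_regression_test(current_data):
    test_suite.run(reference_data=ref_data, current_data=current_data)
    ws.add_test_suite(project.id, test_suite)

# For example, after the next prompt change produces cur_data_2:
# run_regression_test(cur_data_2)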

Go to your Project in Evidently Cloud and open the Tests section in the left menu to see your Test Suite:

Evidently Test Suite

Great news: all tests passed. Nothing to worry about!

6. Add monitoring panel

With a single test run, you can view results directly in Python. However, as you make repeated test runs, you might want to capture their history and track results on a dashboard.

Let's make this dashboard happen! 

You can build fully custom dashboards in Evidently to suit your needs. Here are two types of monitoring panels you might be interested in:

  • Test Panels. These panels show test passes and failures. You can plot the summary outcomes, break them down into detailed views, or separate different tests by groups.
  • Metric Panels. These panels show specific computed values from your tests, such as text length or sentiment scores. Pick a counter, line, bar, or scatter plot as you wish.

You can combine both types of panels and organize them by tabs. You can design your dashboard directly in the user interface (click the "edit" button in the top right corner) or define dashboards as code. 

Here's an example of adding a panel to display detailed test results via Python API:

project.dashboard.add_panel(
    DashboardPanelTestSuite(
        title="Test results",
        # Show results from all Test Suites sent to this Project
        filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
        size=WidgetSize.FULL,
        # DETAILED shows the outcome of each individual test in every run
        panel_type=TestSuitePanelType.DETAILED,
        # Group runs that fall within the same 1-minute interval
        time_agg="1min",
    ),
    tab="Tests"
)

Check the complete code example to see how you add the metric panels. 

Here is what you get: the first panels show the test outcomes. Since you ran the tests only once, you see the summary for this single run. This green wall shows that all six tests passed.

Evidently LLM testing

Below are the plots showing the minimum, maximum, and mean semantic similarity and sentiment values for this run.

Evidently regression testing

To add more data points, you need to run new tests.

7. Continue testing

From here, the process repeats: each time you change a prompt, you get a new set of responses, add them to a new dataset, and pass it to the same Test Suite as the new "current" data.

Here's what happens during subsequent runs.

Second run. Let's say you modify the prompt to make the responses more formal in style. As a result, you get two warnings.

Evidently Test Suite

First, the responses have become longer. Here is what you see as you open the test details:

Evidently Text Length testing

The mean length, which used to be 86.4 symbols, has more than doubled to 190.

Second, one response has a semantic similarity of 0.87, which is below the warning threshold.

Evidently response similarity testing

You can investigate what's happening by looking at specific examples. Texts are indeed longer: 

Text dataset

It appears that asking for more formal responses has naturally increased their length. You can decide whether this is a desirable change. Still, all texts are within the strict limits; otherwise, you'd see a failed test on the text length range.

What about the semantic similarity check? You can look specifically for the example with a similarity score below the threshold. Here is the culprit:

Low text similarity example

The meaning of the response changed slightly due to added details about real-time updates. Is it a big deal? It depends: if no such information is present in the context the chatbot uses to respond, this could be a hallucination.

The third run has a similar outcome. This time, the prompt was modified to ask for more details in the responses where appropriate.

Once again, this made the texts longer on average and produced two answers with semantic similarity below the warning threshold. Here they are:

Text similarity example

Having a warning set at this semantic similarity threshold allowed you to notice such minor changes. It's up to you whether you want to keep tracking such observations so you have the chance to double-check each of them. The good thing is that you don't have to go through all the responses one by one; you only need to review the ones that warrant attention.

Similar things happen in the following runs, but there are still no failed tests. Great!

As you repeatedly run your checks, the dashboard automatically updates with data from each new run. You can see test results over time and share them with others:

Evidently regression testing for LLM

You can see the warnings generated on specific runs: if you hover over a section, you can see which particular test triggered it.

Let's take a look at the metric panels, too!

You can notice something interesting: the sentiment of the responses fluctuated across the test runs, getting more positive at first and spreading out more in the end. There was no explicit test for average sentiment, but having this plot in front of you can provide some additional food for thought. You can review the specific examples and decide if you want to add another test.

Evidently regression testing for LLM

The last prompt edit was to make the responses as short as possible. This is likely what brought the sentiment down since brief and on-point answers are mostly neutral.

Note: by default, the dashboard organizes panels by the time of each test run, keeping the relative gaps between timestamps. You can toggle the "Show in order" option to view the values sequentially without gaps. This is particularly useful for ad hoc test checks.

You can also configure alerts to receive notifications via email or Slack for critical test failures.

Takeaways

In this example, you went through a simple workflow of running a set of regression tests against new responses as you modified your prompts. You can adapt this workflow to fit your specific use case.

Build your golden dataset. To start, you need a set of basic input-output examples for testing. Even a handful of examples are better than none.

Test for semantic similarity. You can begin with simple checks like semantic similarity and text length. With regression tests in place, you can have peace of mind that even though the exact words change as you change the prompt, the response meaning stays the same. As you learn more, you can refine the testing strategy to add more examples and scenarios. 

Monitor the test outcomes. By tracking each test outcome, you can monitor trends, spot recurring issues, and maintain a history of the checks if you need to go back.

The key is to start. It's better to run tests with just a few examples than to skip them entirely. Setting up the process is what matters most: you need a place for test datasets and a workflow to run the checks. You can always iterate to make it better, but it's much harder to rethink the whole approach later on. Don't make testing an afterthought!
