
Meet Evidently Cloud for AI Product Teams

September 5, 2024
TL;DR. Meet the new Evidently Cloud, built for teams developing products with LLMs. It includes tracing, datasets, evals, and a no-code workflow. Sign up for free today, or schedule a demo to see it in action.

We are excited to share some big news. After months of hard work and listening closely to user feedback, we are launching Evidently Cloud — a collaborative AI observability platform for teams working on LLM-powered products, from chatbots to RAG.

Our mission has always been to help teams working on AI systems run them with certainty. We built a set of tools to evaluate, test, and monitor them. But as AI evolves, particularly with the rise of LLMs, the tools must adapt, too. That's why we've added new LLM observability features and made our platform more accessible to non-technical users.

It’s built on top of our open-source tools, with a few extra features AI product teams need.

Why we built this

Since launching Evidently AI, we have worked with thousands of users deploying ML products. We’ve seen it all: from fraud detection to recommendation systems to natural language processing. Along the way, we've also seen the challenges of deploying non-deterministic AI in the real world.

Things are constantly in flux: environments change, models drift, and data quality is an ongoing battle. The need for thorough testing and monitoring never goes away. You must keep tabs on your systems to keep things running smoothly.

With LLMs, the stakes are even higher. While deploying an LLM app is deceptively easy, you can't just ship it and hope for the best. When outputs are open-ended, like text generation, a lot more can go wrong. You must test, track, and quality-control your app to meet your users' needs and prevent issues, from inaccurate responses to biased outcomes.

Machine learning workflows are evolving, too. Training models from scratch is no longer necessary, but new challenges have emerged. Now, you need to trace and evaluate complex chains of interactions, not just simple input-output pairs.

There’s also a whole new audience now. You don’t have to be an ML expert to use the tech anymore. AI engineers work with ready-made LLMs, and product managers support AI system design. They need AI quality tools that are easy and effective to use.

We are addressing that with the launch of the new Evidently Cloud.

The AI quality workflow

Anyone who's deployed AI systems in production knows why evals matter. Machine learning systems can be unpredictable, so it takes rigorous testing to release your product confidently. You don’t want your chatbot selling a car for a dollar!

Evals aren't just about metrics and methods, though that's a big part. They're really about creating a workflow that ensures your AI systems work as they should, from start to finish.

  • Building. When testing prompts and models, you need to objectively compare outputs across scenarios to ensure your AI is on track and to know whether things improve. 
  • Regression testing. Anytime you make changes, you’ve got to run regression tests. This means replaying past requests with your new prompts to ensure nothing breaks.
  • Online evals. Once your app is live, you need to check if outputs are safe and accurate and understand user behavior. No amount of testing will fully prepare you for the real world, so you need to bake observability into your app from the start.

And the cycle continues. As new issues pop up in production, you’ll need to add those scenarios to your testing workflow. Whenever user behavior shifts, you introduce new prompts, or a new model version is released, you’ll need to test each change in your specific context.

Evidently Cloud LLM regression testing

What's important is that with LLMs, you deal with interpretable text data. This means you need detailed tracing from the start, so that you can see every call and chain step. As you debug and iterate, you will often work with individual completions, not just "datasets."

Finally, who’s running these AI quality workflows? Traditionally, data scientists or ML engineers handled evals, focusing on cross-validation, A/B testing, or data drift detection. Collaboration was always important, but domain experts rarely got into the nitty-gritty of model design. Now, everyone can write prompts!

Non-technical experts, from product leads to governance teams, are now much more involved. Their input is key for quality control, curating test cases, and setting criteria. Deciding what your AI should answer on a sensitive topic or ensuring it matches your brand's tone isn’t something an engineer typically handles alone.

Given all these changes, it’s clear we need a new approach to AI quality workflows:

  • Data-centric with powerful tracing. Workflows must center around raw data, with the capability to trace and analyze every step.
  • Collaborative by design. The process should bring together engineers and non-technical users to handle evals and quality control.
  • Easily tailored. Unlike traditional ML metrics, LLM evals are usually specific to the use case. You need tools to easily run custom, domain-specific evals.

Features 

With this launch, we're rolling out new features designed for LLM product workflows. Whether you're building chatbots, AI assistants, or predictive models, Evidently Cloud has all you need to test and trust your AI systems.

We’ve built on the solid foundation of Evidently for ML observability and expanded it to cover AI quality at every stage of your LLM-powered product’s lifecycle. 

Here’s a quick look at what’s inside.

Tracing 

We added tracing to help you track and understand how your LLM-powered apps work. With Tracely, our open-source Python library built on OpenTelemetry, you can capture detailed logs of your system's behavior, from inputs and outputs to crucial intermediate steps like function calls.

Evidently also automatically turns these traces into easy-to-view tabular datasets, ready for you to explore and evaluate.

Evidently Cloud LLM tracing
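To make this concrete, here is a minimal sketch of what instrumenting an app with Tracely can look like. The init_tracing parameter names and the placeholder function are illustrative assumptions, so check the Tracely docs for the exact current API:

```python
from tracely import init_tracing, trace_event

# Point Tracely at your Evidently Cloud project.
# Parameter names are illustrative assumptions; see the Tracely docs.
init_tracing(
    address="https://app.evidently.cloud",
    api_key="YOUR_API_TOKEN",
    project_id="YOUR_PROJECT_ID",
    export_name="assistant_traces",
)

# Each call to the decorated function is captured as a trace:
# inputs, outputs, and timing become rows you can explore on the platform.
@trace_event()
def answer_question(question: str) -> str:
    # Call your LLM or RAG chain here; a stub keeps the sketch runnable.
    return "Placeholder answer to: " + question

answer_question("How do I reset my password?")
```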

Datasets 

Datasets are key to evaluating AI products, and now Evidently makes it super simple to store and organize your raw data. 

You can create datasets from your application traces, like above, or curate and upload test datasets. You can import existing data via CSV files or the Python API.

Store and organize raw data at Evidently Cloud
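For example, uploading a curated test set from Python might look roughly like this. It is a sketch using the CloudWorkspace client from the Python library; the add_dataset arguments are assumptions, so refer to the current API docs:

```python
import pandas as pd
from evidently.ui.workspace.cloud import CloudWorkspace

# Connect to Evidently Cloud with an API token
ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")

# Load a curated test set from a local CSV file (hypothetical file name)
test_set = pd.read_csv("golden_questions.csv")

# Upload it as a named dataset; method and argument names are assumptions
ws.add_dataset(
    test_set,
    name="golden_questions",
    project_id="YOUR_PROJECT_ID",
    description="Curated Q&A pairs for regression testing",
)
```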

And if you’re storing your logs elsewhere, no worries—you can still run evaluations locally without needing to store the full datasets on the platform.

Run LLM evaluations locally with Evidently Cloud

LLM evals

We’ve introduced many evaluation methods for LLM workflows, from text statistics, pattern checks, and model-based evaluations (think semantic similarity and sentiment analysis) to LLM-as-a-judge. You can run multiple evals at once and add custom ones. 

For every metric, you get powerful visuals that go beyond showing averages. You'll get detailed distributions, stats, and a view of how values change over time.

For example, here is a distribution of the sentiment scores. You can quickly run such an evaluation to detect texts with negative sentiment, using a built-in machine learning model.

Evidently Cloud LLM evals
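As an illustration, here is roughly how a couple of these built-in checks can be run with the open-source Python library; the dataset and column name are made up for the example:

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import Sentiment, TextLength

# A tiny example dataset of model responses (made up for illustration)
data = pd.DataFrame({"response": [
    "Happy to help! Your order is on its way.",
    "Unfortunately, we cannot process this request.",
]})

# Score each response with a sentiment model and a simple length statistic
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        Sentiment(),
        TextLength(),
    ])
])
report.run(reference_data=None, current_data=data)
report.save_html("text_evals.html")  # or view inline / upload to the platform
```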

The LLM-as-a-Judge feature is particularly exciting. It allows you to run and scale custom evals specific to the task and criteria at hand, by using an external LLM to judge your texts.

For example, you can use an LLM to evaluate if the new answer is correct compared to the reference and provide its reasoning.

Create custom LLM judges with Evidently Cloud

You can do the same to test factuality based on context, relevance to the question, tone, style, or literally anything else.
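For illustration, here is a sketch of a custom judge built with the LLMEval descriptor and a binary classification template from the Python library. The criteria text, labels, and model choice are examples to adapt, not a prescribed setup:

```python
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

# A custom judge that labels each response as matching the desired tone or not.
tone_judge = LLMEval(
    subcolumn="category",
    template=BinaryClassificationPromptTemplate(
        criteria="""A response is GOOD if it is polite, concise, and stays on
the topic of the user's question. Otherwise it is BAD.""",
        target_category="BAD",
        non_target_category="GOOD",
        include_reasoning=True,  # ask the judge to explain its label
    ),
    provider="openai",
    model="gpt-4o-mini",
    display_name="Brand tone",
)

# The judge plugs into the same flow as other descriptors, e.g.:
# TextEvals(column_name="response", descriptors=[tone_judge])
```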

And these are not just metrics: we also have a testing interface. 

It is great for detecting regressions: you explicitly define what you expect from your evaluation outcomes. For example, you can test that all responses have the style you want (as labeled by an LLM judge), or that the text length stays within bounds. You can also set nuanced conditions, like expecting a certain share of tests to fail.

Evidently Cloud LLM regression testing
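A simple regression check of this kind can be expressed as a test suite. The sketch below, assuming the descriptor-based tests from the Python library, fails the run if any replayed response is empty or longer than 2,000 characters:

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMin, TestColumnValueMax
from evidently.descriptors import TextLength

# Replayed responses to check (made up for illustration)
data = pd.DataFrame({"response": [
    "All set! Your refund was issued.",
    "Please contact support for further help.",
]})

# Define explicit expectations on text length
length_checks = TestSuite(tests=[
    TestColumnValueMin(column_name=TextLength().on("response"), gt=0),
    TestColumnValueMax(column_name=TextLength().on("response"), lte=2000),
])

length_checks.run(reference_data=None, current_data=data)
length_checks.as_dict()  # pass/fail summary, handy in CI pipelines
```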

We haven’t forgotten about the classic ML evaluations. You still have access to all metrics for classification, ranking, and so on. They have their place in the LLM world: you need ranking metrics to test your RAG retrieval, or classification metrics when you use LLMs for predictive tasks like user intent detection or entity extraction.

No-code workflow

Not an engineer? No problem! Our platform now includes no-code tools that let you run evaluations on your LLM outputs without writing a single line of code. 

You can drag and drop files, create datasets from logged traces, and run evaluations directly from the user interface. 

You can select and configure your metrics one by one:

Run LLM evaluations with no code at Evidently Cloud

You can also design and write custom LLM judges directly in the user interface. 

Here, we create an evaluator that checks whether the text is cheerful, using a built-in Evidently template that automatically creates a complete prompt:

Create LLM judges with no code at Evidently Cloud

You can then immediately run it on your data and dive into the results, row by row.

You can continue to run evaluations as new data comes in to monitor it over time.

Powerful dashboard

The Evidently platform comes with a flexible dashboard to track your AI system’s quality and performance.

Evidently Cloud interactive dashboard

If you’re looking for a quick start, we’ve got templates ready. But most AI products are unique, and so are your evaluation workflows. You can set up your dashboard to match your specific needs—whether it’s for regression testing, monitoring, or creating different views for your team. You can choose which metrics to plot and how to display them. 

And yes, alerting is built-in, so you won’t miss a thing.

FAQ

So, what changed? Didn’t you already do monitoring?

A lot! Half of the features we mentioned are brand new. Now, you can run tracing, store and manage raw datasets (which is a big deal!), and perform evaluations directly on the platform with no-code tools. Plus, you can now take advantage of our new LLM-focused evaluations.

I am doing “classic ML”. Can I still use Evidently Cloud?

Absolutely! You can continue using Evidently just like before to test and track classification, regression, and ranking models. There’s no requirement to send your data to the platform unless you want to. But with the new features, you’ve got more flexibility—and you can use the no-code evals for data quality and predictive ML, too.

How can I deploy it? 

You can sign up for an Evidently Cloud account and start using all the features right away. For larger companies with strict security needs, we offer deployment in your chosen cloud region or a completely self-hosted option for full control over your infrastructure. This includes role-based access control, dedicated support, onboarding, and more.

Why choose Evidently?

Evidently streamlines AI quality management from development to production. It’s a flexible, modular platform that adapts to your needs, whether you’re focused on evals, traces, experimental testing, or production monitoring, and it works for both “classic” ML and complex AI agents.

With a huge library of built-in metrics and tests, Evidently takes care of the evaluation work that other tools often leave you to figure out. Plus, it’s backed by an open-source community with over 20 million downloads!

How is Evidently Cloud different from open-source? 

The core Evidently Python library is and will be open-source. It’s perfect for individual data scientists and AI/ML engineers who need to run evaluations and inspect AI system quality.

The Evidently Platform also has an open-source version, but advanced features like scaling, datasets, collaboration, and no-code tools are exclusive to the commercial edition. The commercial version also includes support and hosting. You can focus on your AI systems without ever knowing how hard it is to manage an observability backend at scale.

How can I try it?

The new LLM features on Evidently Cloud are live, and we’re excited for you to try them out.

There’s a generous free tier, so you can start exploring everything we offer with no commitment. For questions or support, join our Discord community. 

If you’re a startup, reach out to us—we’re eager to support you as much as we can.

For an enterprise version, request a demo.
