📚 LLM-as-a-Judge: a Complete Guide on Using LLMs for Evaluations. Get your copy

pricing

Free to start. Scales with you.

No credit card required at sign up.

Monthly Billing

Yearly Billing

Developer

For hobby projects and experiments

Free

Get started

All core evaluation features

100+ built-in metrics

10,000 data rows / month

30-day retention

Community support

Pro

For refining and monitoring AI systems

$50/month

Get started

All in Free, plus:

Alerting

100,000 data rows / month

12-month retention

5 seats, 10 projects

Email support

Expert

For advanced testing of AI agents and complex apps

from $399/month

Request trial

All in Pro, plus:

Synthetic data

Adversarial testing

200,000 data rows / month

10 seats, unlimited projects

Dedicated support

Enterprise

For companies managing AI at scale

Custom

Talk to us

All features with custom limits

On-prem or private cloud deployment options

Premium support and SLA

Startups

Ideal for early-stage startups building with AI

Special offer

Talk to us

All core features

Free credits towards Pro tier

Premium support

Developer

For hobby projects and experiments

Free

Get started

All core features

10,000 rows / month

1GB snapshots

3 Projects

2 Seats

Community support

Pro

For teams that run production AI systems

$80/month

Get started

All core features

100,000 rows / month

100 GB snapshots

10 Projects

5 Seats

Email support

Additional:
$10 per 10,000 rows/month
$1 per GB snapshots stored

Enterprise

For companies managing AI at scale

Custom

Get started

Custom limits

Custom SSO

Custom roles

Audit logs

Private cloud

Premium support

comparison

See which is the best price plan for you

Plan

Developer

Pro

Expert

Enterprise

Deployment

Evidently Cloud

Private Cloud Option

Usage limits

Number of rows

Tabular data rows or trace spans you can send.

10,000 / month

100,000 / month

Additional: $10 per 10,000 rows/month

200,000 / month

Additional: $10 per 10,000 rows/month

Custom

Retention

Access to historical evaluation data.

30 days

12 months

24 months

Custom

Synthetic data

Generating inputs for quality and safety testing.

Trial access (2 datasets)

Unlimited for quality testing

Unlimited

Evaluations

Run evaluations on your data or traces.

Unlimited

Storage

Evaluation artifacts and reports.

1 GB

100 GB

Additional storage: $1/GB per month.

200 GB

Additional storage: $1/GB per month.

Custom

Core features

Tracing

Collect AI system inputs and outputs.

Datasets

Organize and manage evaluation datasets.

Evaluations

100+ built-in metrics.

LLM-as-a-judge

Prompt external LLMs for evaluations.

Test suites

Group and run conditional evaluations.

Custom metrics

Add deterministic or prompt-based metrics.

Dashboard

Design monitoring panels in UI or as code.

Data export

Download traces or evaluation results.

Alerting

Send alerts on failed tests or metric values.

Advanced evals

Synthetic data

Generate diverse inputs for scenario testing.

Adversarial testing

Test AI safety and edge case performance.

RAG testing

Test quality of RAG retrieval and generation.

Workspace

Projects

2 projects

10 projects

Unlimited

Users

2 users

5 users

10 users

Unlimited

Role-based access support

Standard roles

Custom roles

Support

Channels

Community

Email, Shared Slack

Premium support

Custom support SLA

Billing

Monthly, self-serve

Monthly or Annual

Annual contract

Custom contract

Team training

Add-on

faq

Frequently Asked Questions

Evaluations

What can you evaluate?

Evidently supports text, tabular, and embeddings data, using 100+ built-in metrics from the open-source Evidently Python library. You can evaluate:
- Generative AI tasks. Any LLM-powered app from summarization to RAG systems and AI agents.
- Predictive AI tasks. Evaluate classification, regression, ranking, and recommendation systems.
- Data quality and data drift. Identify data distribution shifts and validate input data quality. ‍

What types of LLM judges do you support?

Evidently offers built-in LLM-based metrics for popular cases like assessing RAG context quality. You can also customize evaluation criteria using LLM judge templates. These templates implement known best practices like chain-of-thought prompting so you only need to add your plain text criteria. You can also choose different LLMs for evaluation.

Do I need ML metrics if I work on LLMs?

Yes! Even if you’re focused on LLMs, traditional ML metrics can still matter. Real-world AI systems often blend both types of workflows. For example:
- An LLM chatbot might perform classification to detect user intent.
- Building with RAG involves ranking, which ML metrics like Hit Rate can measure.

Evidently handles both structured tabular data (and dataset-level evaluations) and complex nested workflows (with conversation-level evaluation), so you can adapt it to any use case.

How do I run evals?

You can run evaluations either in the Evidently Cloud or locally.
- In the cloud, you first need to upload traces or raw data directly to the platform. Then you can trigger the evaluations directly on the platform.
- For local evaluations, you can use the Evidently Python library to run tests on your infrastructure. For instance, you might run regression tests in your CI/CD pipelines. Then,you upload the results to Evidently Cloud for visualization and alerting.

Local runs don’t count toward your data row limit unless you upload raw data to the platform.

Advanced AI testing

What is synthetic data generation?

AI systems need to handle diverse scenarios and edge cases. Evidently helps by generating test inputs based on task descriptions or provided source context. You can review, edit, and manage these synthetic inputs through the platform’s UI. You can then use these test inputs to run against your LLM application and assess the quality of the result.

What is RAG testing?

RAG (retrieval-augmented generation) testing involves two key steps:
- Generating synthetic data. This includes creating input questions or both input-output pairs using specific strategies to get a golden reference dataset against which to test performance.
- Evaluating results. You assess how well the system performs by measuring both the quality of retrieval and generation. This includes identifying issues like hallucinations or incorrect outputs.

Evidently supports both workflows. The platform's interface also allows product experts to review, edit, and manage generated test data.

How does AI agent testing work?

Testing AI agents involves more than just evaluating single input-output pairs. You often need to assess multi-step interactions, such as to test correct tool choice, conversation tone, and whether the agent achieves its intended goals. Evidently allows you to:
- Replay synthetic agent interactions. You can define behavioral scenarios and run simulated tests where an Evidently-configured agent interacts with your system.
- Run session-level evaluations. You automatically judge the interaction outcomes or specific criteria using configurable session-level evaluations.

You can re-run such tests easily on any change to test for regressions or improvements.

‍Agent simulation is currently in beta. Please reach out if you want to be an early adopter.

What is adversarial testing?

LLM apps face risks like leaking private data or generating harmful content. Evidently lets you:
- Generate adversarial inputs, such as inappropriate prompts and jailbreak scenarios.
- Evaluate the safety of outcomes to test how your system handles these challenges.

You can choose and configure from categories of risks relevant to your application.

How do I try advanced features?

Some advanced features, like basic RAG testing and synthetic input generation, are available in preview mode on the Developer and Pro plans. For full access, please contact us to set up a trial.

Usage limits

What does GB storage refer to?

You can use Evidently Cloud in two ways.
- Upload raw data or traces directly to the platform. This is particularly relevant for LLM apps where you use raw logs for debugging. Here, we charge based on the number of rows you upload.
- Alternatively, you can run evaluations locally using our Python SDK and upload only the aggregated reports with data summaries. You can view results and track metrics without storing raw data. This is helpful for tabular ML use cases when you store logs elsewhere.

These summary reports are saved as JSON files. They don't count toward your row limit but do consume storage. Most plans offer generous storage allowance, which is rarely exceeded under normal use. However, if you upload Reports frequently, you may eventually reach your limit. If that happens, you can purchase additional GB of storage or delete older reports.

Will I be charged if I go over the limit?

If you’re on the free plan and exceed your data row or storage limits, you’ll need to upgrade or delete existing data to continue using the platform.

On paid plans, you won’t be automatically charged unless you enable that option. Otherwise, you’ll receive a billing notice and can choose how to proceed.

Deployment

Is Evidently open-source?

Yes! The core Evidently Python library is open-source under the Apache 2.0 license. It’s ideal for small teams running evaluations independently. To get started, check out our documentation.

Evidently Cloud builds on the open-source version, offering a full web service for testing and evaluation. Features include a no-code interface, alerting, and user management. Read the comparison in the documentation.

Can I get a trial of the self-hosted enterprise version?

Yes, reach out to learn more about enterprise trials.

Start testing your AI systems today

Book a personalized 1:1 demo with our team or sign up for a free account.

Get demo

No credit card required