pricing

Free to start. Scales with you.

Building an AI-powered startup? Apply to get up to 90% off.
Developer
For hobby projects and experiments
Free
Get started
- All core evaluation features
- 100+ built-in metrics
- 10,000 data rows / month
- 30-day retention
- Community support
Pro
For refining and monitoring AI systems
$50/month
Get started
All in Free, plus:
- Alerting
- 100,000 data rows / month
- 12-month retention
- 5 seats, 10 projects
- Email support
Expert
For advanced testing of AI agents and complex apps
from $399/month
Request trial
All in Pro, plus:
- Synthetic data
- Adversarial testing
- Agent simulations
- 200,000 data rows / month
- 10 seats, unlimited projects
- Dedicated support
Enterprise
For companies managing AI at scale
Custom
Talk to us
- All features with custom limits
- On-prem or private cloud deployment options
- Premium support and SLA
Startups
Ideal for early-stage startups building with AI
Special offer
Talk to us
- All core features
- Free credits towards Pro tier
- Premium support
Developer
For hobby projects and experiments
Free
Get started
- All core features
- 10,000 rows / month
- 1 GB snapshots
- 3 projects
- 2 seats
- Community support
Pro
For teams that run production AI systems
$80/month
Get started
- All core features
- 100,000 rows / month
- 100 GB snapshots
- 10 projects
- 5 seats
- Email support
Additional usage: $10 per 10,000 rows / month; $1 per GB of stored snapshots.
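For example, under these add-on rates, sending 130,000 rows in a month on Pro would cost $80 + 3 × $10 = $110 (assuming overage is billed in full 10,000-row increments).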
Enterprise
For companies managing AI at scale
Custom
Get started
- Custom limits
- Custom SSO
- Custom roles
- Audit logs
- Private cloud
- Premium support
comparison

See which price plan is best for you

Plan: Developer | Pro | Expert | Enterprise
(✓ = included, — = not included)

Deployment: Evidently Cloud | Evidently Cloud | Evidently Cloud | Private Cloud Option

Usage limits
- Number of rows (tabular data rows or trace spans you can send): 10,000 / month | 100,000 / month, then $10 per 10,000 rows / month | 200,000 / month, then $10 per 10,000 rows / month | Custom
- Retention (access to historical evaluation data): 30 days | 12 months | 24 months | Custom
- Agent simulation (simulated interactions with your AI system): — | — | 10,000 runs / month, then $10 per 1,000 runs | Custom
- Synthetic data (generating inputs for quality and safety testing): Trial access (2 datasets) | Trial access (2 datasets) | Unlimited | Unlimited
- Evaluations (run evaluations on your data or traces): Unlimited | Unlimited | Unlimited | Unlimited
- Storage (evaluation artifacts and reports): 1 GB | 100 GB, then $1/GB per month | 200 GB, then $1/GB per month | Custom

Core features
- Tracing (collect AI system inputs and outputs): ✓ | ✓ | ✓ | ✓
- Datasets (organize and manage evaluation datasets): ✓ | ✓ | ✓ | ✓
- Evaluations (100+ built-in metrics): ✓ | ✓ | ✓ | ✓
- LLM-as-a-judge (prompt external LLMs for evaluations): ✓ | ✓ | ✓ | ✓
- Test suites (group and run conditional evaluations): ✓ | ✓ | ✓ | ✓
- Custom metrics (add deterministic or prompt-based metrics): ✓ | ✓ | ✓ | ✓
- Dashboard (design monitoring panels in the UI or as code): ✓ | ✓ | ✓ | ✓
- Data export (download traces or evaluation results): ✓ | ✓ | ✓ | ✓
- Alerting (send alerts on failed tests or metric values): — | ✓ | ✓ | ✓

Advanced evals
- Synthetic data (generate diverse inputs for scenario testing): Preview | Preview | ✓ | ✓
- Adversarial testing (test AI safety and edge-case performance): — | — | ✓ | ✓
- RAG testing (test quality of RAG retrieval and generation): Preview | Preview | ✓ | ✓
- Agent simulation (an Evidently AI agent plays the role of a user): — | — | ✓ | ✓

Workspace
- Projects: 2 | 10 | Unlimited | Unlimited
- Users: 2 | 5 | 10 | Unlimited
- Role-based access: — | Standard roles | Standard roles | Custom roles

Support
- Channels: Community | Email | Email, Shared Slack | Premium support
- Custom support SLA: — | — | — | ✓
- Billing: Monthly, self-serve | Monthly or Annual | Annual contract | Custom contract
- Team training: — | — | Add-on | Add-on
faq

Frequently Asked Questions

Evaluations

What can you evaluate?
Evidently supports text, tabular, and embeddings data, using 100+ built-in metrics from the open-source Evidently Python library. You can evaluate:
- Generative AI tasks. Any LLM-powered app, from summarization to RAG systems and AI agents.
- Predictive AI tasks. Evaluate classification, regression, ranking, and recommendation systems.
- Data quality and data drift. Identify data distribution shifts and validate input data quality.
What types of LLM judges do you support?
Evidently offers built-in LLM-based metrics for popular cases like assessing RAG context quality. You can also customize evaluation criteria using LLM judge templates. These templates implement known best practices like chain-of-thought prompting, so you only need to add your criteria in plain text. You can choose which LLM to use as the judge.
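For illustration, a custom judge built with the open-source Evidently library might look roughly like this. This is a minimal sketch using the 0.4.x-style API; the criteria text, column name, and judge model are made-up placeholders, and exact class or parameter names may differ between versions:

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

# Hypothetical data: a column of LLM answers to evaluate.
df = pd.DataFrame({"answer": [
    "Paris is the capital of France.",
    "Well, that is a great question, let me think about it at length...",
]})

# The template wraps plain-text criteria in a chain-of-thought judge prompt,
# so only the criteria and category labels need to be written.
conciseness = LLMEval(
    template=BinaryClassificationPromptTemplate(
        criteria="A CONCISE answer directly addresses the question without filler.",
        target_category="CONCISE",
        non_target_category="VERBOSE",
        include_reasoning=True,
    ),
    provider="openai",    # the judge LLM is configurable
    model="gpt-4o-mini",  # requires OPENAI_API_KEY in the environment
    display_name="Conciseness",
)

report = Report(metrics=[TextEvals(column_name="answer", descriptors=[conciseness])])
report.run(reference_data=None, current_data=df)
report.save_html("judge_report.html")
```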
Do I need ML metrics if I work on LLMs?
Yes! Even if you’re focused on LLMs, traditional ML metrics can still matter. Real-world AI systems often blend both types of workflows. For example:
- An LLM chatbot might perform classification to detect user intent.
- Building with RAG involves ranking, which ML metrics like Hit Rate can measure.

Evidently handles both structured tabular data (and dataset-level evaluations) and complex nested workflows (with conversation-level evaluation), so you can adapt it to any use case.
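As a concrete sketch of the classification case: hypothetical intent-detection logs from a chatbot, scored with standard ML metrics via the open-source library (0.4.x-style API; the data and column names are invented):

```python
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

# Hypothetical chatbot logs: true vs. predicted user intent.
df = pd.DataFrame({
    "target":     ["billing", "support", "billing", "cancel"],
    "prediction": ["billing", "support", "cancel",  "cancel"],
})

report = Report(metrics=[ClassificationPreset()])
report.run(
    reference_data=None,
    current_data=df,
    column_mapping=ColumnMapping(target="target", prediction="prediction"),
)
# The report includes accuracy, precision/recall, and a confusion matrix.
report.save_html("intent_quality.html")
```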
How do I run evals?
You can run evaluations either in Evidently Cloud or locally.
- In the cloud, you first upload traces or raw data to the platform, then trigger evaluations there.
- For local evaluations, you use the Evidently Python library to run tests on your own infrastructure, for instance as regression tests in your CI/CD pipelines. You then upload the results to Evidently Cloud for visualization and alerting (see the sketch below).

Local runs don’t count toward your data row limit unless you upload raw data to the platform.
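For instance, a CI/CD regression check with the open-source library might look roughly like this. It is a sketch using the 0.4.x-style API; the preset choice, file names, token, and project ID are placeholders, and the exact cloud-upload call may differ between versions, so check the documentation:

```python
import pandas as pd

from evidently.test_suite import TestSuite
from evidently.test_preset import DataStabilityTestPreset

# Hypothetical data: a stored baseline and the latest batch to validate.
ref_df = pd.read_csv("reference.csv")
cur_df = pd.read_csv("current.csv")

tests = TestSuite(tests=[DataStabilityTestPreset()])
tests.run(reference_data=ref_df, current_data=cur_df)

# Fail the CI job if any test in the suite fails.
if not tests.as_dict()["summary"]["all_passed"]:
    raise SystemExit("Evidently regression tests failed")

# Optionally upload only the aggregated results (not the raw data)
# to Evidently Cloud for visualization and alerting.
from evidently.ui.workspace.cloud import CloudWorkspace

ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")
ws.add_test_suite("YOUR_PROJECT_ID", tests)  # method name may vary by version
```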

Advanced AI testing

What is synthetic data generation?
AI systems need to handle diverse scenarios and edge cases. Evidently helps by generating test inputs based on task descriptions or provided source context. You can review, edit, and manage these synthetic inputs through the platform’s UI, then run them against your LLM application and assess the quality of the results.
What is RAG testing?
RAG (retrieval-augmented generation) testing involves two key steps:
- Generating synthetic data. This means creating input questions, or full input-output pairs, using specific strategies to build a golden reference dataset to test performance against.
- Evaluating results. You assess how well the system performs by measuring both the quality of retrieval and generation. This includes identifying issues like hallucinations or incorrect outputs.

Evidently supports both workflows. The platform's interface also allows product experts to review, edit, and manage generated test data.
How does AI agent testing work?
Testing AI agents involves more than evaluating single input-output pairs. You often need to assess multi-step interactions, such as testing correct tool choice, conversation tone, and whether the agent achieves its intended goals. Evidently allows you to:
- Replay synthetic agent interactions. You can define behavioral scenarios and run simulated tests where an Evidently-configured agent interacts with your system. 
- Run session-level evaluations. You automatically judge interaction outcomes against specific criteria using configurable session-level evaluations.

You can easily re-run such tests after any change to check for regressions or improvements.

Agent simulation is currently in beta. Please reach out if you want to be an early adopter.
What is adversarial testing?
LLM apps face risks like leaking private data or generating harmful content. Evidently lets you:
- Generate adversarial inputs, such as inappropriate prompts and jailbreak scenarios.
- Evaluate the safety of outcomes to test how your system handles these challenges.

You can choose and configure the risk categories relevant to your application.
How do I try advanced features?
Some advanced features, like basic RAG testing and synthetic input generation, are available in preview mode on the Developer and Pro plans. For full access, please contact us to set up a trial.

Usage limits

What does GB storage refer to?
You can use Evidently Cloud in two ways. 
- Upload raw data or traces directly to the platform. This is particularly relevant for LLM apps where you use raw logs for debugging. Here, we charge based on the number of rows you upload.
- Alternatively, you can run evaluations locally using our Python SDK and upload only the aggregated reports with data summaries. You can view results and track metrics without storing raw data. This is helpful for tabular ML use cases when you store logs elsewhere.

These summary reports are saved as JSON files. They don't count toward your row limit but do consume storage. Most plans offer a generous storage allowance that is rarely exceeded under normal use. However, if you upload reports frequently, you may eventually reach your limit. If that happens, you can purchase additional storage or delete older reports.
Will I be charged if I go over the limit?
If you’re on the free plan and exceed your data row or storage limits, you’ll need to upgrade or delete existing data to continue using the platform. 

On paid plans, you won’t be automatically charged unless you enable that option. Otherwise, you’ll receive a billing notice and can choose how to proceed.

Deployment

Is Evidently open-source?
Yes! The core Evidently Python library is open-source under the Apache 2.0 license. It’s ideal for small teams running evaluations independently. To get started, check out our documentation.

Evidently Cloud builds on the open-source version, offering a full web service for testing and evaluation. Features include a no-code interface, alerting, and user management. Read the comparison in the documentation.
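As a minimal open-source quick-start sketch (0.4.x-style API; the dataset and the reference/current split are arbitrary toy choices):

```python
from sklearn import datasets

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Toy example: treat the first half of a public dataset as the "reference"
# and the second half as "current" production data.
iris = datasets.load_iris(as_frame=True).frame
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=iris.iloc[:75], current_data=iris.iloc[75:])
report.save_html("drift_report.html")  # open in a browser to inspect
```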
Can I get a trial of the self-hosted enterprise version?
Yes, reach out to learn more about enterprise trials.

Start testing your AI systems today

Book a personalized 1:1 demo with our team or sign up for a free account.
No credit card required