pricing

From open-source to enterprise

Everything you need to evaluate, test, and monitor AI –
self-serve or fully managed.
Monthly Billing
Yearly Billing
Open-source
For individual developers, experiments, and lightweight evaluation needs.
Free
Get started
Core AI evaluation and testing features
100+ evaluation metrics
Local self-hosted dashboards
API and CLI access
Community support on Discord and docs
Enterprise
For team collaboration, enterprise scale and advanced workflows.
Custom
Talk to us
Managed Cloud or private deployment
No-code UI
Alerts & scheduled tasks
Role-based access control
Premium support and onboarding
Startups
Ideal for early-stage startups building with AI
Special offer
Talk to us
All core features
Free credits towards Pro tier
Premium support
Developer
For hobby projects and experiments
Free
Get started
All core features
10,000 rows / month
1 GB snapshots
3 Projects
2 Seats
Community support
Pro
For teams that run production AI systems
$80/month
Get started
All core features
100,000 rows / month
100 GB snapshots
10 Projects
5 Seats
Email support
Additional usage:
$10 per 10,000 rows/month
$1 per GB of snapshot storage
Enterprise
For companies managing AI at scale
Custom
Get started
Custom limits
Custom SSO
Custom roles
Audit logs
Private cloud
Premium support
comparison

See which price plan is best for you

Plan | Open-source | Enterprise
Deployment | Self-hosted | Cloud or Self-hosted

Core features
Tracing: collect AI system inputs and outputs. | ✓ | ✓
Datasets: organize and manage evaluation datasets. | ✓ | ✓
Evaluations: 100+ built-in metrics. | ✓ | ✓
LLM-as-a-judge: prompt external LLMs for evaluations. | API only | UI and API
Test suites: group and run conditional evaluations. | ✓ | ✓
Custom metrics: add deterministic or prompt-based metrics. | ✓ | ✓
Dashboard: design monitoring panels in UI or as code. | ✓ | ✓
Data export: download traces or evaluation results. | ✓ | ✓
Alerting: send alerts on failed tests or metric values. | ✗ | ✓

Advanced evals
Synthetic data: generate diverse inputs for scenario testing. | API only | UI and API
Adversarial testing: test AI safety and edge case performance. | ✗ | UI and API

Workspace
Projects | Unlimited | Unlimited
Users | ✗ | Unlimited
Authentication | ✗ | ✓
Role-based access | ✗ | Custom roles

Support
Support channels | Community | Premium support
Custom SLA | ✗ | ✓
Team training | ✗ | Add-on
faq

Frequently Asked Questions

Evaluations

What can you evaluate?
Evidently supports text, tabular, and embeddings data, using 100+ built-in metrics from the open-source Evidently Python library. You can evaluate:
- Generative AI tasks. Any LLM-powered app from summarization to RAG systems and AI agents. 
- Predictive AI tasks. Evaluate classification, regression, ranking, and recommendation systems.
- Data quality and data drift. Identify data distribution shifts and validate input data quality.
What types of LLM judges do you support?
Evidently offers built-in LLM-based metrics for popular cases like assessing RAG context quality. You can also customize evaluation criteria using LLM judge templates. These templates implement known best practices like chain-of-thought prompting so you only need to add your plain text criteria. You can also choose different LLMs for evaluation. 
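To make the idea concrete, here is a minimal sketch of a chain-of-thought judge prompt template. The template wording, variable names, and function are hypothetical illustrations of the pattern, not Evidently's actual templates or API:

```python
# Minimal sketch of a chain-of-thought LLM-judge prompt template.
# The wording and names are illustrative, not Evidently's implementation.
JUDGE_TEMPLATE = """You are an impartial evaluator.

Criteria: {criteria}

Input: {input}
Output: {output}

First, reason step by step about whether the output meets the criteria.
Then give a single verdict on the last line: PASS or FAIL."""


def build_judge_prompt(criteria: str, input_text: str, output_text: str) -> str:
    """Fill the template; the caller sends the result to their chosen LLM."""
    return JUDGE_TEMPLATE.format(
        criteria=criteria, input=input_text, output=output_text
    )


prompt = build_judge_prompt(
    criteria="The answer must be polite and grounded in the source text.",
    input_text="What is the refund policy?",
    output_text="Refunds are available within 30 days of purchase.",
)
print(prompt.splitlines()[0])  # You are an impartial evaluator.
```

The plain-text criteria are the only part you supply; the chain-of-thought scaffolding around them stays fixed.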
Do I need ML metrics if I work on LLMs?
Yes! Even if you’re focused on LLMs, traditional ML metrics can still matter. Real-world AI systems often blend both types of workflows. For example:
- An LLM chatbot might perform classification to detect user intent.
- Building with RAG involves ranking, which ML metrics like Hit Rate can measure.

Evidently handles both structured tabular data (and dataset-level evaluations) and complex nested workflows (with conversation-level evaluation), so you can adapt it to any use case.
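As an illustration, Hit Rate checks whether at least one relevant document appears in the top-k retrieved results for each query. A minimal reference implementation of the metric itself (not Evidently's API):

```python
def hit_rate(retrieved_ids, relevant_ids, k=5):
    """Fraction of queries where at least one relevant document
    appears in the top-k retrieved results."""
    hits = sum(
        1 for ret, rel in zip(retrieved_ids, relevant_ids)
        if any(doc in rel for doc in ret[:k])
    )
    return hits / len(retrieved_ids)


# Example: 2 of 3 queries retrieve a relevant document in the top 2.
retrieved = [["a", "b"], ["c", "d"], ["e", "f"]]
relevant = [{"b"}, {"x"}, {"e"}]
score = hit_rate(retrieved, relevant, k=2)  # 2/3
```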
How do I run evals?
You can run evaluations either in the Evidently Cloud or locally.
- In the cloud, you first upload traces or raw data to the platform, then trigger the evaluations there.
- For local evaluations, you can use the Evidently Python library to run tests on your infrastructure. For instance, you might run regression tests in your CI/CD pipelines. Then, you upload the results to Evidently Cloud for visualization and alerting.

Local runs don’t count toward your data row limit unless you upload raw data to the platform.
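As a sketch of what a CI/CD regression gate might look like, here is a plain-Python check that fails the pipeline when an evaluation score drops below its stored baseline. The metric names and tolerance are hypothetical, and this is independent of Evidently's actual API:

```python
# Illustrative CI regression gate: fail the build if any evaluation
# score drops below its stored baseline by more than the tolerance.
# Metric names and thresholds are hypothetical examples.
def check_regression(current_scores, baseline_scores, tolerance=0.05):
    """Return the list of metrics that regressed beyond the tolerance."""
    return [
        name for name, score in current_scores.items()
        if score < baseline_scores.get(name, 0.0) - tolerance
    ]


baseline = {"faithfulness": 0.90, "relevance": 0.85}
current = {"faithfulness": 0.91, "relevance": 0.78}
failed = check_regression(current, baseline)
print(failed)  # ['relevance']
```

In a pipeline, a non-empty `failed` list would exit with a non-zero status to block the deploy.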

Advanced AI testing

What is synthetic data generation?
AI systems need to handle diverse scenarios and edge cases. Evidently helps by generating test inputs based on task descriptions or provided source context. You can review, edit, and manage these synthetic inputs through the platform’s UI, then run them against your LLM application and assess the quality of the results.
What is RAG testing?
RAG (retrieval-augmented generation) testing involves two key steps:
- Generating synthetic data. This includes creating input questions, or full input-output pairs, using specific strategies to build a golden reference dataset against which to test performance.
- Evaluating results. You assess how well the system performs by measuring both the quality of retrieval and generation. This includes identifying issues like hallucinations or incorrect outputs.

Evidently supports both workflows. The platform's interface also allows product experts to review, edit, and manage generated test data.
What is adversarial testing?
LLM apps face risks like leaking private data or generating harmful content. Evidently lets you:
- Generate adversarial inputs, such as inappropriate prompts and jailbreak scenarios.
- Evaluate the safety of outcomes to test how your system handles these challenges.

You can choose and configure the risk categories relevant to your application.

Usage limits

What does GB storage refer to?
You can use Evidently Cloud in two ways. 
- Upload raw data or traces directly to the platform. This is particularly relevant for LLM apps where you use raw logs for debugging. Here, we charge based on the number of rows you upload.
- Alternatively, you can run evaluations locally using our Python SDK and upload only the aggregated reports with data summaries. You can view results and track metrics without storing raw data. This is helpful for tabular ML use cases when you store logs elsewhere.

These summary reports are saved as JSON files. They don't count toward your row limit but do consume storage. Most plans offer a generous storage allowance that is rarely exceeded under normal use. However, if you upload Reports frequently, you may eventually reach your limit. If that happens, you can purchase additional storage or delete older reports.

Deployment

Is Evidently open-source?
Yes! The core Evidently Python library is open-source under the Apache 2.0 license. It’s ideal for small teams running evaluations independently. To get started, check out our documentation.

Evidently Cloud builds on the open-source version, offering a full web service for testing and evaluation. Features include a no-code interface, alerting, and user management. Read the comparison in the documentation.
Can I get a trial of the self-hosted enterprise version?
Yes, reach out to learn more about enterprise trials.

Start testing your AI systems today

Book a personalized 1:1 demo with our team or try open source.