Evidently supports text, tabular, and embeddings data, using 100+ built-in metrics from the open-source Evidently Python library. You can evaluate:
- Generative AI tasks. Any LLM-powered app, from summarization to RAG systems and AI agents.
- Predictive AI tasks. Classification, regression, ranking, and recommendation systems.
- Data quality and data drift. Identify data distribution shifts and validate input data quality.
What types of LLM judges do you support?
Evidently offers built-in LLM-based metrics for popular cases like assessing RAG context quality. You can also customize evaluation criteria using LLM judge templates. These templates implement known best practices like chain-of-thought prompting, so you only need to add your plain-text criteria. You can also choose different LLMs for evaluation.
Do I need ML metrics if I work on LLMs?
Yes! Even if you’re focused on LLMs, traditional ML metrics can still matter. Real-world AI systems often blend both types of workflows. For example:
- An LLM chatbot might perform classification to detect user intent.
- Building with RAG involves ranking, which ML metrics like Hit Rate can measure.
Evidently handles both structured tabular data (and dataset-level evaluations) and complex nested workflows (with conversation-level evaluation), so you can adapt it to any use case.
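Hit Rate, mentioned above, has a simple definition. Here is a plain-Python sketch (not Evidently's implementation; the library ships its own ranking metrics):

```python
def hit_rate(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries where at least one relevant document
    appears in the top-k retrieved results."""
    hits = 0
    for docs, gold in zip(retrieved, relevant):
        if any(doc in gold for doc in docs[:k]):
            hits += 1
    return hits / len(retrieved)

# Two queries: the first retrieves a relevant doc in its top 3, the second does not.
score = hit_rate(
    retrieved=[["doc_a", "doc_b", "doc_c"], ["doc_x", "doc_y"]],
    relevant=[{"doc_b"}, {"doc_z"}],
    k=3,
)
# score == 0.5
```

A higher k makes the metric more forgiving; production RAG evaluations typically report it at the same k used for retrieval.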
How do I run evals?
You can run evaluations either in Evidently Cloud or locally.
- In the cloud, you first upload traces or raw data directly to the platform, then trigger evaluations there.
- For local evaluations, you can use the Evidently Python library to run tests on your own infrastructure. For instance, you might run regression tests in your CI/CD pipelines. Then you upload the results to Evidently Cloud for visualization and alerting.
Local runs don’t count toward your data row limit unless you upload raw data to the platform.
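To illustrate the CI/CD regression-test idea, here is a hypothetical gate, independent of Evidently's actual API, that fails a pipeline when evaluation scores drop below agreed thresholds (the metric names and thresholds are made up):

```python
def check_regressions(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall below their minimum threshold."""
    return [name for name, minimum in thresholds.items()
            if metrics.get(name, 0.0) < minimum]

# Hypothetical metric values produced by a local evaluation run.
failures = check_regressions(
    metrics={"correctness": 0.92, "context_relevance": 0.71},
    thresholds={"correctness": 0.90, "context_relevance": 0.80},
)
if failures:
    # In a real CI job you would exit non-zero here to fail the pipeline.
    print(f"Regression detected: {failures}")
```

The same pattern applies whatever tool computes the metrics: evaluate on a fixed test dataset, compare against thresholds, and block the deploy on failure.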
Advanced AI testing
What is synthetic data generation?
AI systems need to handle diverse scenarios and edge cases. Evidently helps by generating test inputs based on task descriptions or provided source context. You can review, edit, and manage these synthetic inputs through the platform’s UI, then run them against your LLM application and assess the quality of the results.
What is RAG testing?
RAG (retrieval-augmented generation) testing involves two key steps:
- Generating synthetic data. This means creating input questions, or input-output pairs, using specific strategies to build a golden reference dataset against which to test performance.
- Evaluating results. You assess how well the system performs by measuring the quality of both retrieval and generation, including identifying issues like hallucinations or incorrect outputs.
Evidently supports both workflows. The platform's interface also allows product experts to review, edit, and manage generated test data.
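To make the "evaluating results" step concrete, here is a deliberately naive token-overlap check. It is a toy stand-in for the grounding signals that real RAG evaluators (including LLM judges) compute far more robustly, and not an Evidently metric:

```python
def grounding_score(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context.
    A crude proxy: low scores can flag answers that drift from their sources."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

context = "the eiffel tower is 330 metres tall"
print(grounding_score("the tower is 330 metres tall", context))  # fully grounded: 1.0
print(grounding_score("it was painted green in 1889", context))  # ungrounded: 0.0
```

Keyword overlap misses paraphrases and rewards copied filler words, which is exactly why production hallucination checks rely on semantic or LLM-based judgments instead.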
How does AI agent testing work?
Testing AI agents involves more than evaluating single input-output pairs. You often need to assess multi-step interactions: correct tool choice, conversation tone, and whether the agent achieves its intended goals. Evidently allows you to:
- Replay synthetic agent interactions. You can define behavioral scenarios and run simulated tests where an Evidently-configured agent interacts with your system.
- Run session-level evaluations. You can automatically judge interaction outcomes or specific criteria using configurable session-level evaluations.
You can easily re-run such tests on any change to check for regressions or improvements.
Agent simulation is currently in beta. Please reach out if you want to be an early adopter.
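The session-level idea can be sketched in plain Python. This is an illustrative simplification, not Evidently's API: it checks a recorded agent session for correct tool choices and goal completion.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str      # tool the agent actually called at this step
    expected: str  # tool the behavioral scenario says it should call

def session_passes(steps: list[Step], goal_reached: bool) -> bool:
    """Session-level check: every tool choice matches the scenario
    and the agent reached its intended goal."""
    return goal_reached and all(step.tool == step.expected for step in steps)

# Hypothetical two-step support-agent session.
session = [Step("search_orders", "search_orders"), Step("refund", "refund")]
print(session_passes(session, goal_reached=True))  # True
```

Real session-level evaluations also score softer criteria, such as tone, which is where LLM judges come in.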
What is adversarial testing?
LLM apps face risks like leaking private data or generating harmful content. Evidently lets you:
- Generate adversarial inputs, such as inappropriate prompts and jailbreak scenarios.
- Evaluate the safety of outcomes to test how your system handles these challenges.
You can choose and configure the risk categories relevant to your application.
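As a toy illustration of scoring adversarial outcomes: the sketch below uses keyword matching, whereas real safety evaluators typically rely on LLM judges, and all names here are made up.

```python
# Illustrative refusal phrases; a real evaluator would not use a fixed list.
REFUSAL_MARKERS = ("i can't help with that", "i cannot assist")

def refused(response: str) -> bool:
    """Naive check that a response to an adversarial prompt is a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def safety_pass_rate(responses: list[str]) -> float:
    """Share of adversarial prompts the system refused."""
    return sum(refused(r) for r in responses) / len(responses)
```

You would run a batch of generated jailbreak prompts through your app, collect the responses, and track the pass rate over time to catch safety regressions.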
How do I try advanced features?
Some advanced features, like basic RAG testing and synthetic input generation, are available in preview mode on the Developer and Pro plans. For full access, please contact us to set up a trial.
Usage limits
What does GB storage refer to?
You can use Evidently Cloud in two ways.
- Upload raw data or traces directly to the platform. This is particularly relevant for LLM apps, where you use raw logs for debugging. Here, we charge based on the number of rows you upload.
- Run evaluations locally using our Python SDK and upload only the aggregated reports with data summaries. You can view results and track metrics without storing raw data. This is helpful for tabular ML use cases where you store logs elsewhere.
These summary reports are saved as JSON files. They don't count toward your row limit but do consume storage. Most plans offer a generous storage allowance that is rarely exceeded under normal use. However, if you upload Reports frequently, you may eventually reach your limit. If that happens, you can purchase additional storage or delete older reports.
Will I be charged if I go over the limit?
If you’re on the free plan and exceed your data row or storage limits, you’ll need to upgrade or delete existing data to continue using the platform.
On paid plans, you won’t be automatically charged unless you enable that option. Otherwise, you’ll receive a billing notice and can choose how to proceed.
Deployment
Is Evidently open-source?
Yes! The core Evidently Python library is open-source under the Apache 2.0 license. It’s ideal for small teams running evaluations independently. To get started, check out our documentation.
Evidently Cloud builds on the open-source version, offering a full web service for testing and evaluation. Features include a no-code interface, alerting, and user management. Read the comparison in the documentation.
Can I get a trial of the self-hosted enterprise version?
Yes, reach out to learn more about enterprise trials.
Start testing your AI systems today
Book a personalized 1:1 demo with our team or sign up for a free account.