
100+ LLM benchmarks and evaluation datasets

A database of LLM benchmarks and datasets to evaluate the performance of language models.
First published: December 10, 2024.
How can you evaluate different LLMs? LLM benchmarks are standardized tests designed to measure and compare the abilities of different language models. We put together a database of 100+ LLM benchmarks and publicly available datasets you can use to evaluate LLM capabilities in various domains, including reasoning, language understanding, math, question answering, coding, tool use, and more.
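To make this concrete, here is a minimal sketch of what running a benchmark typically looks like. It is not any specific harness or the Evidently API: it assumes the Hugging Face `datasets` library and a hypothetical `ask_llm()` helper that calls the model under test, pulls a small slice of the GSM8K math benchmark, and scores answers by exact match on the final number.

```python
# A minimal benchmark-evaluation sketch (assumption: Hugging Face `datasets`
# is installed; ask_llm() is a hypothetical stand-in for your model call).
from datasets import load_dataset


def ask_llm(question: str) -> str:
    """Placeholder: call your LLM provider here and return its raw answer text."""
    raise NotImplementedError


def gsm8k_exact_match(sample_size: int = 100) -> float:
    """Score a model on a small slice of the GSM8K test split."""
    data = load_dataset("gsm8k", "main", split="test").select(range(sample_size))
    correct = 0
    for row in data:
        # GSM8K references end with "#### <number>"; compare final answers only.
        reference = row["answer"].split("####")[-1].strip()
        prediction = ask_llm(row["question"]).split("####")[-1].strip()
        correct += int(prediction == reference)
    return correct / sample_size
```

Real evaluation harnesses add prompt templates, answer parsing, and retries, but the core loop is the same: take a benchmark dataset, collect model answers, and compute a metric over the pairs.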

Maintained by the team behind Evidently, an open-source tool for ML and LLM evaluation.

LLM benchmarks and datasets

All the content belongs to the respective parties; we simply put the links together.

Did we miss a great LLM benchmark or dataset? Let us know! Our Discord community of 2,500+ ML practitioners and AI engineers is the best place to share feedback.
