Navigation tips. You can browse the database by filtering benchmarks by the LLM abilities they evaluate. We added tags based on common capabilities, such as reasoning, coding, solving math problems, conversational abilities, tool calling, and agent evals. The division is not perfect or mutually exclusive, but you can use the tags to quickly find:
- Knowledge, language, and reasoning LLM benchmarks. They test the ability of LLMs to handle tasks like comprehension, inference, logical reasoning, and factual knowledge retrieval.
- Chatbot and conversation LLM benchmarks. They test how well a model can generate coherent, contextually appropriate, engaging, and accurate responses in a conversation.
- Coding LLM benchmarks. They test LLMs on programming-related tasks like code generation and debugging.
- Safety LLM benchmarks. They test how well LLMs handle adversarial inputs, mitigate bias, and avoid generating toxic or harmful content.
- Multimodal LLM benchmarks. They cover models that handle various data types, including images, video, audio, and structured data.
We also added a “Cited by” column that shows how many times the benchmark paper has been cited. You can use it as a rough proxy for how widely a benchmark is used in your domain. Papers published in 2024 or later are labeled “New.”
Bookmark the list! And if you find the database helpful, spread the word.
Building an LLM-powered product? While benchmarks help compare models, your AI product needs custom evals. Evidently Cloud helps you generate test data, run evaluations, and ship AI products you can trust.
Talk to us ⟶