Navigation tips. You can browse the database by filtering benchmarks by the LLM abilities they evaluate. We added tags based on common capabilities, such as reasoning, coding, solving math problems, conversational abilities, tool calling, and agent evals. The division is not perfect or mutually exclusive, but you can use the tags to quickly find:
- Knowledge, language, and reasoning LLM benchmarks. They test the ability of LLMs to handle tasks like comprehension, inference, logical reasoning, and factual knowledge retrieval.
- Chatbot and conversation LLM benchmarks. They test how well a model can generate coherent, contextually appropriate, engaging, and accurate responses in a conversation.
- Coding LLM benchmarks. They test LLMs on programming-related tasks like code generation and debugging.
- Safety LLM benchmarks. They test how well LLMs handle adversarial inputs, mitigate bias, and avoid generating toxic or harmful content.
- Multimodal LLM benchmarks. They cover models that handle various data types, including images, video, audio, and structured data.
We also added a “Cited by” column that shows how many times the benchmark paper has been cited. You can use it as a rough proxy for how widely a benchmark is used in your domain. Papers published in 2024 or later are labeled “New.”
Bookmark the list! And if you find the database helpful, spread the word.
Building an LLM-powered product? While benchmarks help compare models, your AI product needs custom evals. Evidently Cloud helps you generate test data, run evaluations, and ship AI products you can trust.
Talk to us ⟶