contents‍
Large Language Models (LLMs) are powerful, but as more LLM-powered systems are put into production, ensuring their safety becomes paramount.Â
To address this, the AI community has developed specialized LLM safety benchmarks that test models across dimensions like truthfulness, toxicity, bias, and robustness. These benchmarks serve as essential tools for LLM evaluation, helping the community identify and mitigate risks during development and production use of language models. In this blog, we highlight 10 key safety and bias benchmarks that help assess and improve LLM reliability.
Want more examples of LLM benchmarks? We put together a database of 100+ LLM benchmarks and datasets you can use to evaluate the performance of language models.Â
Bookmark the list ⟶
TruthfulQA evaluates how well LLMs generate truthful responses to questions. Specifically, it measures a model's tendency to generate false or misleading information, particularly in scenarios where humans might have common misconceptions.
The dataset contains 817 questions across 38 categories, such as health, law, finance, and politics. The benchmark includes human-annotated reference answers – both true and false versions – and uses human evaluation to judge model outputs for truthfulness and informativeness. As human evaluation is costly, the authors of the benchmark also tested how well automated metrics serve as a proxy. They created two LLM judges: one classifies answers as true or false, and the other evaluates the informativeness.Â
Example questions:Â
Paper: TruthfulQA: Measuring How Models Mimic Human Falsehoods by Lin et al. (2021)
Dataset: TruthfulQA dataset
ToxiGen evaluates how well LLMs distinguish between toxic and benign statements and detect hate speech. The benchmark is designed to train classifiers that can identify nuanced hate speech containing neither slurs nor profanity. This helps models recognize implicit toxicity that might otherwise evade detection systems.Â
The dataset consists of 274,000 machine-generated toxic and benign statements about 13 minority groups. To generate a challenging subset, the benchmark uses an adversarial classifier-in-the-loop decoding algorithm. It allows ToxiGen to cover implicitly toxic text at a larger scale, and about more minority groups.
Example statements:Â
Paper: ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection by Hartvigsen et al. (2022)
Dataset: ToxiGen dataset
HHH (Helpfulness, Honesty, Harmlessness) tests how well LLMs align with ethical values like helpfulness, honesty, and harmlessness. It evaluates the model’s responses in different interaction scenarios and how well it adheres to these key principles when generating text or responding to prompts. The HHH dataset consists of pairs of model outputs where human evaluators have expressed preferences between responses.
Paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback by Bai et al. (2022)
Dataset: HHH dataset
ForbiddenQuestions tests model responses to harmful or unethical scenarios. It assesses whether the model follows predefined ethical guidelines and refuses to generate unsafe content. The dataset contains 107,250 samples across 13 behavioral scenarios that OpenAI deems disallowed. Forbidden scenarios include illegal activity, hate speech, fraud, privacy violation, financial advice, and health consultation. To pass the test, the model must refuse to answer harmful questions.Â
Example Questions:
Paper: "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models by Shen et al. (2023)
Dataset: ForbiddenQuestions dataset
DecodingTrust evaluates the trustworthiness of LLMs across eight perspectives: toxicity, stereotypes, privacy, machine ethics, fairness, adversarial robustness, out-of-distribution robustness, and robustness to adversarial demonstrations. It incorporates test sets from multiple benchmarks, including those evaluating bias, toxicity, and ethical risks. DecodingTrust offers a broad safety evaluation framework, making it one of the most holistic benchmarks available.
Example questions:
Paper: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models by Wang et al. (2023)
Dataset: DecodingTrust dataset
AdvBench tests how resistant LLMs are to adversarial inputs. Specifically, it checks models for jailbreaking, a prompt engineering attempt designed to circumvent safety mechanisms and elicit harmful outputs. The dataset combines 500 harmful strings and 500 harmful instructions, related to misinformation, hate speech, and self-harm. The authors also introduce a new attack method by adding an adversarial suffix onto the prompt, that induces the model to respond to the user’s original, potentially harmful, request.Â
Paper: Universal and Transferable Adversarial Attacks on Aligned Language Models by Zou et al. (2023)
Dataset: AdvBench dataset
AnthropicRedTeam uses human-generated and annotated adversarial dialogues to assess LLM safety. It tests how a model responds to extended adversarial conversations rather than just single prompts. The dataset consists of 38,961 human red team attacks designed to test model boundaries in areas like content policy violations, bias exploitation, and manipulation techniques. The red team members were instructed to have open-ended conversations with an AI assistant in order to “make the AI behave badly, to get it to say obnoxious, offensive, and harmful things”. The adversarial dialogues were generated by over 300 crowdworkers.Â
Paper: Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned by Ganguli et al. (2022)
Dataset: AnthropicRedTeam dataset
HELM Safety aims to introduce a standardized approach to LLM safety evaluation. It is a collection of 5 safety benchmarks spanning 6 risk categories: violence, fraud, discrimination, sexual content, harassment, and deception. It incorporates BBQ to evaluate risks of social discrimination, SimpleSafetyTest to test universally harmful requests containing sexual content and violence, HarmBench to measure vulnerability to various jailbreaking methods, XSTest to check for edge case and refusals, and AnthropicRedTeam to asses how models respond to red team arracks.Â
Article: HELM Safety: Towards Standardized Safety Evaluations of Language Models
RealToxicityPrompt assesses how LLMs respond to naturally occurring text prompts that might lead to toxic outputs without explicitly requesting harmful content. The dataset contains over 100,000 prompts scraped from outbound URLs from Reddit and selected to represent real-world text that models might encounter.
Example prompts:
Paper: RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models by Gehman et al. (2020)
Dataset: RealToxicityPrompt dataset
DoNotAnswer evaluates safeguards in LLMs. It tests a model's ability to recognize and appropriately refuse to respond to harmful, unethical, or dangerous requests that should not be answered. The dataset consists of over 900 prompts across 12 harm types, including sensitive data leakage, illegal activities, unfair discrimination, misinformation, and human-chatbot interaction harms. Each query is annotated with an explanation of why it should be refused, allowing for a nuanced evaluation of model refusal capabilities.
Example questions:
Paper: Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs by Wang et al. (2023)
Dataset: DoNotAnswer dataset
LLM safety benchmarks are essential for testing models, but when building an AI product – like a chatbot or a virtual assistant – you need custom test datasets that reflect your use case. These should cover key scenarios and edge cases specific to your application. You'll also need task-specific evaluations like LLM judges tuned to your custom criteria and preferences.
That’s why we built Evidently. Our open-source library (trusted with over 25 million downloads!) offers a range of evaluation metrics.Â
For teams working on complex, mission-critical AI systems, Evidently Cloud provides a platform to collaboratively test and monitor AI quality. You can generate synthetic data, create evaluation scenarios (including AI agent simulations), run tests, and track performance — all in one place.
Ready to design your custom AI test dataset? Sign up for free or schedule a demo to see Evidently Cloud in action. We're here to help you build with confidence!