Hallucination Benchmark Leaderboard
Past benchmarks are saturated, focus on single-turn scenarios, and rely on judges of limited capability. HalluHard addresses these limitations by providing a challenging multi-turn benchmark with rigorous verification that can read and parse full-text sources, including PDFs.
| Rank | Model | Hallucination Rate |
|---|---|---|
Our benchmark remains hard for frontier proprietary models, even with web search enabled.
Designing such a benchmark is hard: it requires both curating difficult domains and building a rigorous evaluation pipeline.
Models hallucinate more in later turns for citation-grounded tasks because they condition on their own earlier mistakes (3-20% of incorrect references reappear), though coding shows a downward trend as tasks narrow from broad to focused.
More capable models consistently demonstrate lower hallucination rates, with larger models (GPT-5-nano → GPT-5-mini → GPT-5) and newer flagship models (GPT-5.2, Claude-Opus) showing substantial improvements across all domains.
Effective thinking helps with hallucination mitigation in GPT-family models, but the effect is model-dependent (DeepSeek-Reasoner shows no improvement), and stronger reasoning can paradoxically increase hallucination risk by producing longer responses with more claims.
Content-grounding failures are far more common than reference-grounding failures, and while web search reduces reference errors, ensuring generated content is actually supported by cited sources remains difficult, especially for PDF-based research papers.
Models struggle with niche facts (which leave some traces in the training data) but abstain on completely fabricated items. This creates a "dangerous middle zone" where a question feels answerable to the model, which then fills in missing specifics with "most likely" details, producing hallucinations.
Even the strongest model configurations (Claude-Opus-4.5 and GPT-5.2 with web search) maintain substantial hallucination rates (~30%), underscoring the need for better uncertainty awareness and verification when handling niche or long-tail knowledge.
HalluHard is a challenging multi-turn hallucination benchmark with 950 seed questions across four domains: legal cases (250), research questions (250), medical guidelines (250), and coding (200). A user LLM generates engaging follow-up questions, and we measure 3 rounds of conversation (initial question plus 2 follow-ups).
HalluHard elicits open-ended responses while requiring models to ground factual claims in cited sources. This design ensures that the benchmark focuses specifically on hallucination (ungrounded factual mistakes), not on other aspects of the response.
For legal, research, and medical domains, we sample 5 claims per response and judge claim-wise; for coding, we judge response-wise. Our verification pipeline extracts claims, retrieves evidence via web search, and fetches full-text sources (including PDF parsing) to verify whether cited material supports the generated content.
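For illustration, the claim-sampling and aggregation step might look like the sketch below. The function names and the fraction-of-hallucinated-claims metric are assumptions for this example, not the benchmark's exact implementation:

```python
import random


def sample_claims(claims, k=5, seed=0):
    """Sample up to k claims from a response for claim-wise judging."""
    rng = random.Random(seed)
    return claims if len(claims) <= k else rng.sample(claims, k)


def hallucination_rate(verdicts):
    """Fraction of judged claims marked as hallucinated.

    Illustrative aggregation over boolean per-claim verdicts; the
    benchmark's exact metric may differ.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)
```

For coding responses, the same aggregation would apply at the response level (one verdict per response) rather than per claim.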
Two-stage loop: first initialize history with a seed query, then iteratively generate new queries using a user LLM conditioned on the previous history, query the target LLM, and append the new (query, response) pair to history.
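The two-stage loop above can be sketched as follows, assuming `user_llm` and `target_llm` are placeholder callables standing in for the two models:

```python
def run_conversation(seed_query, user_llm, target_llm, num_rounds=3):
    """Two-stage multi-turn loop: seed the history, then iterate.

    `user_llm` maps a history (list of (query, response) pairs) to the
    next query; `target_llm` maps (history, query) to a response. Both
    are placeholders for actual model calls.
    """
    # Stage 1: initialize history with the seed query.
    history = [(seed_query, target_llm([], seed_query))]

    # Stage 2: iteratively generate follow-ups conditioned on history.
    for _ in range(num_rounds - 1):
        follow_up = user_llm(history)               # user LLM writes the next query
        response = target_llm(history, follow_up)   # target LLM answers it
        history.append((follow_up, response))
    return history
```

With `num_rounds=3` this yields the initial question plus two follow-ups, matching the measured setup.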
Our multi-turn response generation pipeline uses a user LLM to generate engaging follow-up questions based on conversation history, creating natural multi-turn dialogues.
Each extracted claim must have both a valid reference and valid content. We use the Serper API for web search.
Note: Our verification pipeline performs two-stage verification: reference grounding checks whether the cited source exists and is correctly cited, while content grounding verifies whether the cited source actually supports the claim. A claim is marked as a hallucination if either verification fails. The pipeline parses full-text sources, including PDFs, to perform accurate content verification.
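A minimal sketch of the two-stage check, where `search_fn` (e.g. a wrapper around a web-search API), `fetch_fn` (HTML/PDF fetching and parsing), and `judge_fn` (an LLM judge) are all hypothetical callables, not the pipeline's actual interfaces:

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str       # the generated factual statement
    reference: str  # the cited source, e.g. a case citation, DOI, or URL


def verify_claim(claim, search_fn, fetch_fn, judge_fn):
    """Two-stage verification sketch.

    Stage 1 (reference grounding): does the cited source exist and is it
    correctly cited? Stage 2 (content grounding): does the source's full
    text actually support the claim? A claim is marked as a
    hallucination if either stage fails.
    """
    # Stage 1: reference grounding via web search.
    hits = search_fn(claim.reference)
    if not hits:
        return {"hallucination": True, "failure": "reference"}

    # Stage 2: content grounding against the full-text source
    # (fetch_fn would parse HTML or PDF into plain text).
    source_text = fetch_fn(hits[0])
    if not judge_fn(claim.text, source_text):
        return {"hallucination": True, "failure": "content"}

    return {"hallucination": False, "failure": None}
```

Separating the two failure modes is what makes the reference-grounding vs. content-grounding breakdown above possible.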
EPFL
EPFL
EPFL
ELLIS Institute Tübingen
Max Planck Institute for Intelligent Systems
Tübingen AI Center
* Denotes equal contribution.
If you use HalluHard in your research, please cite:
@misc{fan2026halluhardhardmultiturnhallucination,
  title={HalluHard: A Hard Multi-Turn Hallucination Benchmark},
  author={Dongyang Fan and Sebastien Delsad and Nicolas Flammarion and Maksym Andriushchenko},
  year={2026},
  eprint={2602.01031},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.01031},
}