Hallucination Benchmark Leaderboard
Existing hallucination benchmarks are largely saturated, focus on single-turn scenarios, and rely on judges with limited verification capability. HalluHard addresses these limitations with a challenging multi-turn benchmark and a rigorous verification pipeline that can read and parse full-text sources, including PDFs.
| Rank | Model | Hallucination Rate |
|---|---|---|
| *Loading data...* | | |
Models hallucinate more in later turns for citation-grounded tasks because they condition on their own earlier mistakes (3-20% of incorrect references reappear), though coding shows a downward trend as tasks narrow from broad to focused.
More capable models consistently demonstrate lower hallucination rates, with larger models (GPT-5-nano → GPT-5-mini → GPT-5) and newer flagship models (GPT-5.2, Claude-Opus) showing substantial improvements across all domains.
Effective thinking helps mitigate hallucination in GPT-family models, but the effect is model-dependent (DeepSeek-Reasoner shows no improvement), and stronger reasoning can paradoxically increase hallucination risk by producing longer responses that make more claims.
Content-grounding failures are far more common than reference-grounding failures, and while web search reduces reference errors, ensuring generated content is actually supported by cited sources remains difficult, especially for PDF-based research papers.
Models struggle with niche facts (which leave some traces in the training data) but abstain on completely fabricated items, creating a "dangerous middle zone" where questions feel answerable to the model, which then fills in missing specifics with "most likely" details, leading to hallucinations.
Even the strongest model configurations (Claude-Opus-4.5 and GPT-5.2 with web search) maintain substantial hallucination rates (~30%), underscoring the need for better uncertainty awareness and verification when handling niche or long-tail knowledge.
HalluHard is a challenging multi-turn hallucination benchmark with 950 seed questions across four domains: legal cases (250), research questions (250), medical guidelines (250), and coding (200). A user LLM generates engaging follow-up questions, and we measure 3 rounds of conversation (initial question plus 2 follow-ups).
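For reference, the composition described above can be summarized as a small configuration sketch. The field names here are illustrative rather than taken from the HalluHard codebase; only the counts and round structure come from the description above.

```python
# Illustrative summary of the benchmark composition (field names hypothetical).
HALLUHARD_CONFIG = {
    "seed_questions": {
        "legal_cases": 250,
        "research_questions": 250,
        "medical_guidelines": 250,
        "coding": 200,
    },  # 950 seed questions in total
    "conversation_rounds": 3,          # initial question plus 2 follow-ups
    "follow_up_generator": "user LLM", # generates engaging follow-up questions
}
```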
HalluHard elicits open-ended responses while requiring models to ground factual claims in cited sources. This design ensures that the benchmark focuses specifically on hallucination (ungrounded factual mistakes), not on other aspects of the response.
For the legal, research, and medical domains, we sample 5 claims per response and judge claim-wise; for coding, we judge response-wise.
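As a rough illustration of the two judging modes, here is a minimal scoring sketch. `extract_claims`, `judge_claim`, and `judge_response` are hypothetical stand-ins for the benchmark's claim extractor and evidence-based judge, not its actual API.

```python
import random

def hallucination_rate_claimwise(responses, extract_claims, judge_claim, n_claims=5):
    """Sample up to n_claims per response and count unsupported claims."""
    judged, hallucinated = 0, 0
    for response in responses:
        claims = extract_claims(response)
        for claim in random.sample(claims, min(n_claims, len(claims))):
            judged += 1
            if not judge_claim(claim):  # claim not grounded in its cited sources
                hallucinated += 1
    return hallucinated / max(judged, 1)

def hallucination_rate_responsewise(responses, judge_response):
    """Coding domain: one verdict per full response."""
    flagged = sum(1 for r in responses if not judge_response(r))
    return flagged / max(len(responses), 1)
```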
Two-stage loop: first initialize history with a seed query, then iteratively generate new queries using a user LLM conditioned on the previous history, query the target LLM, and append the new (query, response) pair to history.
Our multi-turn response generation pipeline uses a user LLM to generate engaging follow-up questions based on conversation history, creating natural multi-turn dialogues.
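A minimal sketch of this two-stage loop, assuming hypothetical `target_llm` and `user_llm` callables that map a conversation history (and, for the target, the current query) to text:

```python
def run_conversation(seed_query, target_llm, user_llm, n_rounds=3):
    """Two-stage multi-turn loop sketched from the description above."""
    # Stage 1: the history starts from the seed query.
    history = []
    query = seed_query
    for turn in range(n_rounds):
        # Stage 2: query the target LLM and append the new (query, response) pair.
        response = target_llm(history, query)
        history.append((query, response))
        if turn < n_rounds - 1:
            # The user LLM, conditioned on the history so far, proposes a follow-up.
            query = user_llm(history)
    return history
```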
Our verification pipeline extracts claims, retrieves evidence via web search, and fetches full-text sources (including PDF parsing) to verify 1) whether the cited material exists and 2) whether the cited material supports the generated content.
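The same flow can be sketched at a high level; all helpers (`extract_claims`, `web_search`, `fetch_fulltext`, `judge`) are hypothetical placeholders rather than HalluHard's actual interfaces.

```python
def verify_response(response, extract_claims, web_search, fetch_fulltext, judge):
    """Sketch of the two-step verification flow described above."""
    verdicts = []
    for claim, citation in extract_claims(response):
        sources = web_search(citation)
        if not sources:
            # Step 1: the cited material does not appear to exist.
            verdicts.append((claim, citation, "reference_not_found"))
            continue
        # Step 2: does any retrieved full text actually support the claim?
        supported = any(judge(claim, fetch_fulltext(url)) for url in sources)
        verdicts.append((claim, citation, "supported" if supported else "unsupported"))
    return verdicts
```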
The benchmark evaluates models on inherently difficult tasks that require precise knowledge, accurate citation, and careful content grounding across multiple domains.
Our evaluation judge can read full PDFs to verify detailed content grounding, going beyond simple web search snippets. This enables more accurate and thorough verification of claims, making the benchmark significantly more challenging to pass.
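As one possible way to give a judge full-text PDF access (HalluHard's actual parsing stack may differ), a cited PDF can be downloaded and converted to text, for example with `requests` and `pypdf`:

```python
import io

import requests
from pypdf import PdfReader

def fetch_pdf_text(url, timeout=30):
    """Download a PDF and return its concatenated page text for the judge."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    reader = PdfReader(io.BytesIO(resp.content))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```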
If you use HalluHard in your research, please cite:
```bibtex
@misc{fan2026halluhard,
  title={HalluHard: A Hard Multi-Turn Hallucination Benchmark},
  author={Fan, Dongyang and Delsad, Sebastien and Flammarion, Nicolas and Andriushchenko, Maksym},
  year={2026}
}
```