Hallucination Benchmark Leaderboard
Past benchmarks are saturated, focus on single-turn scenarios, and rely on judges of limited capability. HalluHard addresses these limitations by providing a challenging multi-turn benchmark with rigorous verification that can read and parse full-text sources, including PDFs.
| Rank | Model | Hallucination Rate |
|---|---|---|
Our benchmark remains hard for frontier proprietary models, even with web search enabled.
Designing such a benchmark is hard: it requires both curating difficult domains and building a rigorous evaluation pipeline.
Models hallucinate more in later turns for citation-grounded tasks because they condition on their own earlier mistakes (3-20% of incorrect references reappear), though coding shows a downward trend as tasks narrow from broad to focused.
More capable models consistently demonstrate lower hallucination rates, with larger models (GPT-5-nano → GPT-5-mini → GPT-5) and newer flagship models (GPT-5.2, Claude-Opus) showing substantial improvements across all domains.
Effective thinking helps with hallucination mitigation in GPT-family models, but the effect is model-dependent (DeepSeek-Reasoner shows no improvement), and stronger reasoning can paradoxically increase hallucination risk by producing longer responses with more claims.
Content-grounding failures are far more common than reference-grounding failures, and while web search reduces reference errors, ensuring generated content is actually supported by cited sources remains difficult, especially for PDF-based research papers.
Models struggle with niche facts (which leave some traces in the training data) but abstain on completely fabricated items. This creates a "dangerous middle zone" where a question feels answerable to the model, which then fills in missing specifics with "most likely" details, producing hallucinations.
Even the strongest model configurations (Claude-Opus-4.5 and GPT-5.2 with web search) maintain substantial hallucination rates (~30%), underscoring the need for better uncertainty awareness and verification when handling niche or long-tail knowledge.
HalluHard is a challenging multi-turn hallucination benchmark with 950 seed questions across four domains: legal cases (250), research questions (250), medical guidelines (250), and coding (200). A user LLM generates engaging follow-up questions, and we measure 3 rounds of conversation (initial question plus 2 follow-ups).
HalluHard elicits open-ended responses while requiring models to ground factual claims in cited sources. This design ensures that the benchmark focuses specifically on hallucination (ungrounded factual mistakes), not on other aspects of the response.
For legal, research, and medical domains, we sample 5 claims per response and judge claim-wise; for coding, we judge response-wise. Our verification pipeline extracts claims, retrieves evidence via web search, and fetches full-text sources (including PDF parsing) to verify whether cited material supports the generated content.
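For illustration, the claim-sampling and aggregation step might look like the sketch below. The function names and the fraction-of-hallucinated-claims metric are assumptions for this example, not the benchmark's exact implementation:

```python
import random


def sample_claims(claims, k=5, seed=0):
    """Sample up to k claims from a response for claim-wise judging."""
    rng = random.Random(seed)
    return claims if len(claims) <= k else rng.sample(claims, k)


def hallucination_rate(verdicts):
    """Fraction of judged claims marked as hallucinated.

    Illustrative aggregation over boolean per-claim verdicts; the
    benchmark's exact metric may differ.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)
```

For coding responses, the same aggregation would apply at the response level (one verdict per response) rather than per claim.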
Two-stage loop: first initialize history with a seed query, then iteratively generate new queries using a user LLM conditioned on the previous history, query the target LLM, and append the new (query, response) pair to history.
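The two-stage loop above can be sketched as follows, assuming `user_llm` and `target_llm` are placeholder callables standing in for the two models:

```python
def run_conversation(seed_query, user_llm, target_llm, num_rounds=3):
    """Two-stage multi-turn loop: seed the history, then iterate.

    `user_llm` maps a history (list of (query, response) pairs) to the
    next query; `target_llm` maps (history, query) to a response. Both
    are placeholders for actual model calls.
    """
    # Stage 1: initialize history with the seed query.
    history = [(seed_query, target_llm([], seed_query))]

    # Stage 2: iteratively generate follow-ups conditioned on history.
    for _ in range(num_rounds - 1):
        follow_up = user_llm(history)               # user LLM writes the next query
        response = target_llm(history, follow_up)   # target LLM answers it
        history.append((follow_up, response))
    return history
```

With `num_rounds=3` this yields the initial question plus two follow-ups, matching the measured setup.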
Our multi-turn response generation pipeline uses a user LLM to generate engaging follow-up questions based on conversation history, creating natural multi-turn dialogues.
Each extracted claim must have both a valid reference and valid content. We use the Serper API for web search.
Note: Our verification pipeline performs two-stage verification: reference grounding checks whether the cited source exists and is correctly cited, while content grounding verifies whether the cited source actually supports the claim. A claim is marked as a hallucination if either verification fails. The pipeline parses full-text sources, including PDFs, to perform accurate content verification.
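A minimal sketch of the two-stage check, where `search_fn` (e.g. a wrapper around a web-search API), `fetch_fn` (HTML/PDF fetching and parsing), and `judge_fn` (an LLM judge) are all hypothetical callables, not the pipeline's actual interfaces:

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str       # the generated factual statement
    reference: str  # the cited source, e.g. a case citation, DOI, or URL


def verify_claim(claim, search_fn, fetch_fn, judge_fn):
    """Two-stage verification sketch.

    Stage 1 (reference grounding): does the cited source exist and is it
    correctly cited? Stage 2 (content grounding): does the source's full
    text actually support the claim? A claim is marked as a
    hallucination if either stage fails.
    """
    # Stage 1: reference grounding via web search.
    hits = search_fn(claim.reference)
    if not hits:
        return {"hallucination": True, "failure": "reference"}

    # Stage 2: content grounding against the full-text source
    # (fetch_fn would parse HTML or PDF into plain text).
    source_text = fetch_fn(hits[0])
    if not judge_fn(claim.text, source_text):
        return {"hallucination": True, "failure": "content"}

    return {"hallucination": False, "failure": None}
```

Separating the two failure modes is what makes the reference-grounding vs. content-grounding breakdown above possible.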
EPFL
EPFL
EPFL
ELLIS Institute Tübingen
Max Planck Institute for Intelligent Systems
Tübingen AI Center
* Denotes equal contribution.
If you use HalluHard in your research, please cite:
@misc{fan2026halluhardhardmultiturnhallucination,
  title={HalluHard: A Hard Multi-Turn Hallucination Benchmark},
  author={Dongyang Fan and Sebastien Delsad and Nicolas Flammarion and Maksym Andriushchenko},
  year={2026},
  eprint={2602.01031},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.01031},
}