Benchmarking LLM’s Capability in Reasoning over Conflicting Web References

Yizhen Yuan; Rui Kong; Dongze Li; Yuanchun Li; Yunxin Liu

Abstract

Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have become a dominant framework for building intelligent assistants. In real-world applications such as ChatGPT with web search, the retrieved documents often come from diverse, potentially unreliable sources and may contain inconsistent claims. Unlike traditional search engines, which rely on users to compare information manually, LLM-based systems typically feed all retrieved content into the model’s context, requiring the LLM to autonomously identify, differentiate, and reason over conflicting viewpoints. Mainstream LLM evaluation tasks such as math and code generation focus primarily on reasoning with factual context; question answering with multi-source references, by contrast, requires fundamentally different capabilities to identify and reason over knowledge contradictions. In this paper, we introduce ConfRAG, a benchmark for evaluating LLMs’ reasoning capability over real-world conflicting documents retrieved from the web. It consists of 1,814 real-world questions, each paired with an average of 9.58 retrieved paragraphs from heterogeneous online sources. A total of 57.2% of the questions exhibit explicit contradictions. We further propose three structured evaluation tasks (answer clustering, answer coverage, and reason coverage) to quantify a model’s ability to organize and explain contradictory content. Experiments with state-of-the-art models such as GPT-4.1 and Claude-3-7-Sonnet reveal substantial performance gaps, highlighting the need for more targeted research in contradiction-aware question answering. To the best of our knowledge, ConfRAG is the first benchmark specifically designed to evaluate contradiction-aware reasoning on real-world long web documents.

Anthology ID: 2026.acl-long.11
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 303–322
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.11/
Cite (ACL): Yizhen Yuan, Rui Kong, Dongze Li, Yuanchun Li, and Yunxin Liu. 2026. Benchmarking LLM’s Capability in Reasoning over Conflicting Web References. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 303–322, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Benchmarking LLM’s Capability in Reasoning over Conflicting Web References (Yuan et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.11.pdf
Checklist: 2026.acl-long.11.checklist.pdf
