Breaking the Evaluation Paradox: Evaluating High-Entropy Search with Computationally Irreducible Constraints

Juntao Wu; Wei Wen; Xianting Huang; Shuai Pang; Ruizhi Qiao; Xing Sun; Ke Wang

Abstract

Evaluating the exhaustive search capabilities of large language models (LLMs) is plagued by a fundamental paradox: verifying completeness requires complete ground truth, yet high-entropy enumeration tasks make such ground truth impossible for humans to create. This causes benchmarks to systematically penalize models for outperforming their human annotators. Despite rapid progress in web-search and deep research agents, which now issue hundreds of queries, traverse diverse sites, and synthesize long reports, evaluation still largely relies on partially annotated answer sets, LLM-based judges, or single-answer questions that avoid genuinely exhaustive search scenarios. We break this paradox by shifting the evaluation paradigm from simulating a messy reality to constructing computationally pure challenges. We introduce VERITAS (Verifiable Traversal Assessment for Search), a framework built on the principle of computationally irreducible constraints. By introducing novel, non-optimizable constraints, we create verifiable, sparse-answer search tasks that are computationally equivalent to exhaustive enumeration. These constraints are easy to verify but impossible for LLMs or search engines to optimize, forcing agents to genuinely traverse the entire search space. VERITAS can automatically generate a virtually infinite number of test cases with perfect ground truth and precise difficulty control, with marginal instance cost dominated by hash computations. This provides not only a robust benchmark for evaluating systematic exploration under uncertainty but also a scalable method for generating training data to improve these crucial, yet underdeveloped, capabilities.
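The abstract does not spell out what the constraints look like, so the following is only an illustrative sketch of the general idea, not the VERITAS construction itself. It assumes a salted SHA-256 filter: an item satisfies the constraint when its hash is divisible by a chosen modulus. Such a predicate is cheap to verify for any single item, gives no structure that a search engine could index, and yields a sparse answer set whose expected density is roughly 1/modulus. The names `satisfies`, `ground_truth`, `score`, `salt`, and `modulus` are all hypothetical.

```python
import hashlib


def satisfies(item: str, salt: str, modulus: int) -> bool:
    """Hypothetical constraint: SHA-256(salt:item), read as an integer, is divisible by modulus."""
    digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % modulus == 0


def ground_truth(candidates, salt: str, modulus: int):
    """Exact answer set, obtained by checking every candidate once."""
    return sorted(c for c in candidates if satisfies(c, salt, modulus))


def score(predicted, candidates, salt: str, modulus: int):
    """Precision and recall of an agent's answer against the exact answer set."""
    truth = set(ground_truth(candidates, salt, modulus))
    pred = set(predicted)
    true_positives = len(truth & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall
```

Under these assumptions, a larger `modulus` makes answers sparser, and a fresh `salt` produces a new instance over the same candidate pool, which is one plausible reading of how instance cost could be dominated by hash computations.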

Anthology ID: 2026.findings-acl.1406
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 28209–28218
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1406/
Cite (ACL): Juntao Wu, Wei Wen, Xianting Huang, Shuai Pang, Ruizhi Qiao, Xing Sun, and Ke Wang. 2026. Breaking the Evaluation Paradox: Evaluating High-Entropy Search with Computationally Irreducible Constraints. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28209–28218, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Breaking the Evaluation Paradox: Evaluating High-Entropy Search with Computationally Irreducible Constraints (Wu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1406.pdf
Checklist: 2026.findings-acl.1406.checklist.pdf
