DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama


Abstract
Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers usually target general-domain atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark for DRR fact-checkers is itself difficult because it requires expert judgments over cognitively demanding, domain-specific claims. In a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on hidden known-answer claims. We therefore propose evolving benchmarking via **Audit-then-Score** (**AtS**), in which labels and rationales remain revisable: when a verifier disagrees with the current benchmark, it submits evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before scoring. After three additional **AtS** rounds, expert accuracy rises to 90.9%, showing that experts are better auditors than one-shot labelers. We instantiate **AtS** as **DeepFactBench**, a versioned DRR factuality benchmark with auditable rationales, and introduce **DeepFactEval**, a claim-level verifier. On the frozen **DeepFactBench** release, **DeepFactEval** achieves 83.4% accuracy, outperforming the best prior deep-research and traditional fact-checkers by 14.3 and 24.9 points, respectively, and transferring well to external factuality datasets.
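The Audit-then-Score loop described in the abstract (disagree, submit evidence, adjudicate, revise, then score) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Entry`, `Dispute`, `audit_then_score`, and the toy auditor rule are all hypothetical, and the real auditor is a human expert or agent rather than a one-line function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Entry:
    """A benchmark item: a revisable label plus its rationale (hypothetical schema)."""
    label: bool
    rationale: str


@dataclass
class Dispute:
    """A verifier's challenge to the current benchmark label (hypothetical schema)."""
    claim_id: str
    proposed_label: bool
    evidence: str


# An auditor sees the current entry and the dispute, and returns True to accept the revision.
Auditor = Callable[[Entry, Dispute], bool]


def audit_then_score(
    benchmark: Dict[str, Entry],
    predictions: Dict[str, bool],
    evidence: Dict[str, str],
    auditor: Auditor,
) -> Tuple[Dict[str, Entry], float]:
    """Run one AtS-style round: audit disagreements, revise labels, then score."""
    revised = dict(benchmark)  # shallow copy, so the previous version stays intact
    for claim_id, predicted in predictions.items():
        entry = revised[claim_id]
        if predicted == entry.label:
            continue  # no disagreement, nothing to audit
        dispute = Dispute(claim_id, predicted, evidence.get(claim_id, ""))
        if auditor(entry, dispute):
            # Accepted revision: the benchmark is updated before scoring.
            revised[claim_id] = Entry(predicted, dispute.evidence)
    if not predictions:
        return revised, 0.0
    correct = sum(predictions[c] == revised[c].label for c in predictions)
    return revised, correct / len(predictions)


# Toy data (invented for illustration only).
benchmark = {
    "c1": Entry(True, "initial annotation"),
    "c2": Entry(False, "initial annotation"),
    "c3": Entry(True, "initial annotation"),
}
predictions = {"c1": True, "c2": True, "c3": False}
evidence = {"c2": "primary source contradicts the current label"}


def toy_auditor(entry: Entry, dispute: Dispute) -> bool:
    """Stand-in rule: accept any dispute that comes with non-empty evidence."""
    return bool(dispute.evidence)


revised, accuracy = audit_then_score(benchmark, predictions, evidence, toy_auditor)
```

In this toy run the dispute on `c2` carries evidence and is accepted, while the dispute on `c3` has none and is rejected, so the verifier is scored against the revised labels rather than the original ones. Returning a new dictionary instead of mutating the input loosely mirrors the idea of a versioned benchmark.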
Anthology ID:
2026.acl-long.1586
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
34356–34386
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1586/
Cite (ACL):
Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, and Venkatesh Saligrama. 2026. DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34356–34386, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality (Huang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1586.pdf
Checklist:
 2026.acl-long.1586.checklist.pdf