DREAM: Deep Research Evaluation with Agentic Metrics

Elad Ben Avraham; ChangHao Li; Ron Dorfman; Roy Ganz; Oren Nuriel; Amir Dudai; Aviad Aberdam; Noah Flynn; Elman Mansimov; Aditya Kalyanpur; Ron Litman

Abstract

Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, but they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose **DREAM** (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol that combines query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate that DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.

Anthology ID: 2026.acl-long.448
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 9879–9904
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.448/
Cite (ACL): Elad Ben Avraham, ChangHao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Aditya Kalyanpur, and Ron Litman. 2026. DREAM: Deep Research Evaluation with Agentic Metrics. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9879–9904, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): DREAM: Deep Research Evaluation with Agentic Metrics (Avraham et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.448.pdf
Checklist: 2026.acl-long.448.checklist.pdf