A Novel Multi-Document Retrieval Benchmark: Journalist Source-Selection in Newswriting

Alexander Spangher, Tenghao Huang, Yiqin Huang, Lucas Spangher, Sewon Min, Mark Dredze


Abstract
Multi-document retrieval approaches often overlook the ways different retrievals complement each other when addressing complex queries. In this work, we study journalist source selection in news article writing and examine the discourse roles that different sources serve when paired together, finding that discourse function (not simply informational content) is an important component of source usage. Then, we introduce a novel IR task to benchmark how well language models can reason about this narrative process. We extract a journalist’s initial query and the sources they used from news articles and aim to recover the sources that support this query. We demonstrate that large language models (LLMs) can be employed in multi-step query planning, identifying informational gaps and enhancing retrieval performance, but current approaches to interleave queries fall short. By training auxiliary discourse planners and incorporating this information into LLMs, we enhance query planning, achieving a significant 5% improvement in precision and a 2% increase in F1 score over the previous SOTA, all while maintaining recall.
Anthology ID:
2025.knowledgenlp-1.18
Volume:
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico, USA
Editors:
Weijia Shi, Wenhao Yu, Akari Asai, Meng Jiang, Greg Durrett, Hannaneh Hajishirzi, Luke Zettlemoyer
Venues:
KnowledgeNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
180–204
URL:
https://preview.aclanthology.org/landing_page/2025.knowledgenlp-1.18/
Cite (ACL):
Alexander Spangher, Tenghao Huang, Yiqin Huang, Lucas Spangher, Sewon Min, and Mark Dredze. 2025. A Novel Multi-Document Retrieval Benchmark: Journalist Source-Selection in Newswriting. In Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 180–204, Albuquerque, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
A Novel Multi-Document Retrieval Benchmark: Journalist Source-Selection in Newswriting (Spangher et al., KnowledgeNLP 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.knowledgenlp-1.18.pdf