Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models

Junjie Wu, Gefei Gu, Yanan Zheng, Dit-Yan Yeung, Arman Cohan


Abstract
Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing—a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data—remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on the task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results on 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models like GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights. Our data and code will be publicly released.
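To make the task design concrete, below is a minimal, hypothetical sketch of a Ref-Long-style instance and scorer: several documents are concatenated into a long prompt, a subset of them reference a target key, and the model must return the indexes of exactly those documents. The helper names (build_instance, score), the prompt wording, and the set-level F1 metric are illustrative assumptions, not the paper's actual data construction or evaluation protocol.

```python
# Illustrative sketch of a long-context referencing instance (assumed format,
# not the official Ref-Long data pipeline).
import random
import re


def build_instance(num_docs: int = 8, key: str = "X-17", seed: int = 0):
    """Build a toy example: num_docs documents, three of which mention the key.
    The gold answer is the set of 1-based indexes of those documents."""
    rng = random.Random(seed)
    gold = sorted(rng.sample(range(1, num_docs + 1), k=3))
    docs = []
    for i in range(1, num_docs + 1):
        body = f"Document {i}: filler text about an unrelated topic."
        if i in gold:
            body += f" This document references key {key}."
        docs.append(body)
    prompt = (
        "\n\n".join(docs)
        + f"\n\nQuestion: List the indexes of all documents that reference the key {key}."
    )
    return prompt, set(gold)


def score(prediction: str, gold: set) -> float:
    """Set-level F1 between predicted and gold document indexes (assumed metric)."""
    pred = {int(n) for n in re.findall(r"\d+", prediction)}
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    prompt, gold = build_instance()
    print(prompt[:200], "...")
    print("Gold indexes:", gold)
    print("F1 for a sample answer '2, 5, 7':", score("2, 5, 7", gold))
```

The intuition this sketch captures is that the model must relate the key to every document in context rather than retrieve a single passage, which is why referencing can fail even when retrieval-style needle-in-a-haystack tests succeed.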
Anthology ID:
2025.acl-long.1162
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
23861–23880
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1162/
Cite (ACL):
Junjie Wu, Gefei Gu, Yanan Zheng, Dit-Yan Yeung, and Arman Cohan. 2025. Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23861–23880, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (Wu et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1162.pdf