100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Van Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han


Abstract
Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM should let its users effortlessly handle tasks that would otherwise be exhausting, e.g., digesting a long-form document to find an answer versus simply asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have a few major shortcomings. For instance, some Needle-in-a-Haystack-like benchmarks are too synthetic and therefore do not represent real-world usage of LLMs. While real-task-based benchmarks such as LongBench avoid this problem, they are typically constructed so that each data sample has a fixed sequence length, which not only makes them suitable only for models with a certain range of context windows, but also offers no proxy for the length at which a model or method of interest would fail. Finally, most benchmarks tend not to provide metrics that separate long-context performance from a model's baseline ability, so when conducting cross-model or cross-recipe comparisons, this conflation prevents users from understanding how exactly one model or recipe excels at long-context tasks relative to its baseline ability. To address these issues, we introduce a length-controllable benchmark that reflects real-life usage, together with a novel metric that disentangles baseline knowledge from long-context capability. Experiments demonstrate the superiority of our datasets in effectively evaluating LLMs. All assets are available at https://github.com/uservan/100-LongBench.git.
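
To make the abstract's idea of "disentangling baseline knowledge from long-context capability" concrete, below is a minimal Python sketch of one plausible way such a metric could work: scoring each task at several controlled context lengths and normalizing by the model's score on a short baseline version of the same task. This is only an illustrative assumption for readers; the function name, the retention-ratio formulation, and the example numbers are hypothetical and not the paper's actual metric, which is defined in the full text.

    # Illustrative sketch only: a hypothetical retention-ratio metric, NOT the
    # paper's actual formulation. Assumes per-sample scores are available at a
    # short "baseline" length and at several longer, controlled lengths.
    from typing import Dict, List

    def length_controlled_profile(
        baseline_scores: List[float],               # scores with the short baseline context
        scores_by_length: Dict[int, List[float]],   # context length -> scores at that length
    ) -> Dict[int, float]:
        """Return mean(score at length L) / mean(baseline score) per length.

        A value near 1.0 means long-context performance keeps up with the model's
        baseline ability; a sharp drop hints at the length where the model fails."""
        baseline = sum(baseline_scores) / len(baseline_scores)
        profile = {}
        for length, scores in sorted(scores_by_length.items()):
            mean_score = sum(scores) / len(scores)
            profile[length] = mean_score / baseline if baseline > 0 else 0.0
        return profile

    # Example usage with made-up numbers:
    profile = length_controlled_profile(
        baseline_scores=[0.82, 0.78, 0.85],
        scores_by_length={8_000: [0.80, 0.75], 32_000: [0.66, 0.61], 100_000: [0.42, 0.37]},
    )
    print(profile)  # e.g. {8000: ~0.95, 32000: ~0.78, 100000: ~0.48}

A length-indexed profile like this, rather than a single aggregate score, is one way a length-controllable benchmark could report where a model or method starts to degrade; again, it is a sketch under the stated assumptions rather than the authors' definition.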
Anthology ID:
2025.findings-acl.903
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
17560–17576
URL:
https://preview.aclanthology.org/transition-to-people-yaml/2025.findings-acl.903/
DOI:
10.18653/v1/2025.findings-acl.903
Cite (ACL):
Van Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, and Xiaotian Han. 2025. 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17560–17576, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? (Yang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/transition-to-people-yaml/2025.findings-acl.903.pdf