LongLeader: A Comprehensive Leaderboard for Large Language Models in Long-context Scenarios

Pei Chen, Hongye Jin, Cheng-Che Lee, Rulin Shao, Jingfeng Yang, Mingyu Zhao, Zhaoyu Zhang, Qin Lu, Kaiwen Men, Ning Xie, Huasheng Li, Bing Yin, Han Li, Lingyun Wang


Abstract
Large Language Models (LLMs), exemplified by Claude and LLaMA, have exhibited impressive proficiency in tackling a myriad of Natural Language Processing (NLP) tasks. Yet, in pursuit of the ambitious goal of Artificial General Intelligence (AGI), there remains ample room for enhancing LLM capabilities, chief among them long-context comprehension. Numerous real-world scenarios, such as multi-turn dialogues or agent workflows, require LLMs to reason adeptly over extended contexts. Accordingly, recent advancements have pushed the upper bounds of long-context comprehension, with models like Claude 3 accommodating up to 200k tokens through a variety of techniques. Aligned with this progression, we propose LongLeader, a leaderboard that comprehensively assesses the long-context comprehension abilities of diverse LLMs and context-length extension strategies across meticulously selected benchmarks. Specifically, we aim to address the following questions: 1) Do LLMs genuinely deliver the long-context proficiency they purport? 2) Which benchmarks offer reliable metrics for evaluating long-context comprehension? 3) Which technical strategies prove effective for extending the understanding of longer contexts? We streamline the evaluation process for LLMs on the benchmarks, offer open-source access to the benchmarks, and maintain a dedicated website for the leaderboard. We will continuously curate new datasets and add new models to the leaderboard.
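As a rough illustration of what a streamlined long-context evaluation loop looks like, the sketch below scores a model's answers against short reference answers by lenient exact match. It is not the LongLeader codebase: the example format, the `query_model` callable, and the metric are assumptions made purely for illustration.

```python
import re
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so that exact match is lenient."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def evaluate(examples: Iterable[dict], query_model: Callable[[str], str]) -> float:
    """Return exact-match accuracy of `query_model` over the benchmark examples."""
    correct = total = 0
    for ex in examples:
        # Prepend the long document to the question, as most long-context
        # question-answering benchmarks do.
        prompt = f"{ex['context']}\n\nQuestion: {ex['question']}\nAnswer:"
        prediction = query_model(prompt)
        correct += int(normalize(prediction) == normalize(ex["answer"]))
        total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a benchmark file of long-context examples.
    sample = [{
        "context": "(imagine a 100k-token document here) ... The meeting was moved to Tuesday.",
        "question": "When was the meeting moved to?",
        "answer": "Tuesday",
    }]
    # The lambda is a placeholder; replace it with a real model or API call.
    print(evaluate(sample, query_model=lambda prompt: "Tuesday."))
```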
Anthology ID: 2025.naacl-long.439
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 8734–8750
URL: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.439/
Cite (ACL): Pei Chen, Hongye Jin, Cheng-Che Lee, Rulin Shao, Jingfeng Yang, Mingyu Zhao, Zhaoyu Zhang, Qin Lu, Kaiwen Men, Ning Xie, Huasheng Li, Bing Yin, Han Li, and Lingyun Wang. 2025. LongLeader: A Comprehensive Leaderboard for Large Language Models in Long-context Scenarios. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8734–8750, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): LongLeader: A Comprehensive Leaderboard for Large Language Models in Long-context Scenarios (Chen et al., NAACL 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.439.pdf