LOFT: Scalable and More Realistic Long-Context Evaluation
Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Séb Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu
Abstract
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs’ ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens, designed to evaluate LCLMs’ performance on in-context retrieval and reasoning. Our findings reveal LCLMs’ surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like the compositional reasoning required by SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
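To make the corpus-in-context paradigm the abstract describes concrete, here is a minimal, hypothetical sketch (not from the paper) of prompting an LCLM to act as its own retriever: the entire corpus is serialized into the prompt with passage IDs, and the model is asked to cite the IDs supporting its answer. The `call_lclm` callable is a placeholder for any long-context model API, not a real library function.

```python
# Hypothetical sketch of corpus-in-context retrieval with an LCLM.
# `call_lclm` stands in for any long-context model API; it is an assumption,
# not an interface defined by the LOFT paper.
from typing import Callable


def build_corpus_prompt(corpus: dict[str, str], query: str) -> str:
    """Serialize the whole corpus into one prompt, one ID-tagged passage per line."""
    passages = "\n".join(f"[{doc_id}] {text}" for doc_id, text in corpus.items())
    return (
        "You are given a corpus of passages, each tagged with an ID.\n"
        f"{passages}\n\n"
        f"Question: {query}\n"
        "Answer with the IDs of the passages that support your answer."
    )


def in_context_retrieve(
    corpus: dict[str, str], query: str, call_lclm: Callable[[str], str]
) -> str:
    # No retriever, index, or RAG pipeline: the model sees the whole corpus at once.
    return call_lclm(build_corpus_prompt(corpus, query))
```

In LOFT's framing, it is the size of `corpus` (scaled toward millions of tokens) that stresses the model, since the prompt itself replaces the retrieval index.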
- Anthology ID: 2025.findings-naacl.374
- Volume: Findings of the Association for Computational Linguistics: NAACL 2025
- Month: April
- Year: 2025
- Address: Albuquerque, New Mexico
- Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 6698–6723
- URL: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.374/
- Cite (ACL): Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Séb Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, and Kelvin Guu. 2025. LOFT: Scalable and More Realistic Long-Context Evaluation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6698–6723, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal): LOFT: Scalable and More Realistic Long-Context Evaluation (Lee et al., Findings 2025)
- PDF: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.374.pdf