LongAttn: Selecting Long-context Training Data via Token-level Attention

Longyun Wu; Dawei Zhu; Guangxiang Zhao; Zhuocheng Yu; Junfeng Ran; Xiangyu Wong; Lin Sun; Sujian Li (李素建)

Abstract

As large language models (LLMs) develop, there is a growing need for better handling of long contexts. To enhance long-context capabilities, constructing high-quality training data with **long-range dependencies** is crucial. Existing methods for selecting long-context data often rely on sentence-level analysis, which leaves considerable room for improvement in both performance and efficiency. In this paper, we propose a novel token-level framework, **LongAttn**, which leverages the self-attention mechanism of LLMs to measure the **long-range dependencies** in the data. By calculating token-level dependency strength and the distribution uniformity of token scores, LongAttn effectively quantifies **long-range dependencies**, enabling more accurate and efficient data selection. We filter **LongABC-32K** from open-source long-context datasets (ArXiv, Book, and Code). In comprehensive experiments, LongAttn demonstrates its **effectiveness**, **scalability**, and **efficiency**. We will release our code and the high-quality long-context dataset **LongABC-32K** in the future.
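
The two quantities named in the abstract, token-level dependency strength and the uniformity of token scores, can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the function names, the `min_distance` cutoff, the use of a mean for strength, and the use of normalised entropy for uniformity are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def long_range_scores(attn: List[List[float]], min_distance: int) -> List[float]:
    """Per-token attention mass placed on keys at least `min_distance` back.

    Assumes `attn` is a causal, row-normalised attention matrix
    (attn[i][j] == 0 for j > i). Only tokens that have at least one
    sufficiently distant key (i >= min_distance) receive a score.
    """
    return [
        sum(row[: i - min_distance + 1])
        for i, row in enumerate(attn)
        if i >= min_distance
    ]


def dependency_strength(scores: List[float]) -> float:
    """Hypothetical strength measure: mean long-range attention per token."""
    return sum(scores) / len(scores) if scores else 0.0


def uniformity(scores: List[float]) -> float:
    """Hypothetical uniformity measure: normalised entropy of the scores.

    Close to 1 when long-range attention is spread evenly over tokens,
    close to 0 when a few tokens account for nearly all of it.
    """
    total = sum(scores)
    if total == 0 or len(scores) < 2:
        return 0.0
    probs = [s / total for s in scores if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(scores))


# Toy 4-token example: every token attends equally to itself and its past.
uniform_attn = [[1 / (i + 1) if j <= i else 0.0 for j in range(4)] for i in range(4)]
# Toy 4-token example: every token attends only to itself (purely local).
local_attn = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]

uniform_scores = long_range_scores(uniform_attn, min_distance=2)
local_scores = long_range_scores(local_attn, min_distance=2)
```

Under these assumptions, a document whose attention looks like `uniform_attn` scores high on both measures, while one resembling `local_attn` scores zero on both, which is the kind of contrast a selection rule could threshold on.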

Anthology ID: 2025.findings-acl.991
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19367–19380
URL: https://preview.aclanthology.org/landing_page/2025.findings-acl.991/
Cite (ACL): Longyun Wu, Dawei Zhu, Guangxiang Zhao, Zhuocheng Yu, Junfeng Ran, Xiangyu Wong, Lin Sun, and Sujian Li. 2025. LongAttn: Selecting Long-context Training Data via Token-level Attention. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19367–19380, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): LongAttn: Selecting Long-context Training Data via Token-level Attention (Wu et al., Findings 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.findings-acl.991.pdf
