Re3Syn: A Dependency-Based Data Synthesis Framework for Long-Context Post-training
Zhiyang Zhang, Ziqiang Liu, Huiming Wang, Renke Shan, Li Kuang, Lu Wang, De Wen Soh
Abstract
An important trend in the realm of large language models (LLMs) is the development of longer context windows. However, training LLMs with long context windows to acquire the capability of effectively modeling lengthy inputs is often hindered by the scarcity of naturally long-context data. Existing methods for constructing long-context data by concatenating short documents have overlooked a crucial characteristic of long-context data quality, namely semantic dependency. In this paper, we propose a novel framework called Retrieval, Dependency Recognition, and Reorder for data synthesis (Re3Syn), which leverages semantic similarity to retrieve relevant documents and form several batches. Within each batch, the framework comprehensively recognizes dependency and utilizes them, along with a reorder algorithm, to organize the short documents into coherent long-context data. Comprehensive experiment on multiple benchmarks indicate that the data generated by the Re3Syn has longer dependencies and significantly enhances the model’s long-context capabilities. For reproducibility, we will release our codebase upon acceptance.- Anthology ID:
- 2025.acl-long.1518
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 31468–31480
- Language:
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1518/
- DOI:
- Cite (ACL):
- Zhiyang Zhang, Ziqiang Liu, Huiming Wang, Renke Shan, Li Kuang, Lu Wang, and De Wen Soh. 2025. Re3Syn: A Dependency-Based Data Synthesis Framework for Long-Context Post-training. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31468–31480, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Re3Syn: A Dependency-Based Data Synthesis Framework for Long-Context Post-training (Zhang et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1518.pdf