WangYan WangYan
2025
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Zhiyuan Hu | Yuliang Liu | Jinman Zhao | Suyuchen Wang | WangYan WangYan | Wei Shen | Qing Gu | Anh Tuan Luu | See-Kiong Ng | Zhiwei Jiang | Bryan Hooi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window of LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, comprising impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model’s understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full-sequence training. Furthermore, LongRecipe also preserves the original LLM’s capabilities on general tasks. Ultimately, we extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training on a single GPU with 80G memory. Our code is released at https://github.com/zhiyuanhubj/LongRecipe.
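A minimal sketch of the position-index-transformation idea described in the abstract: train on a short sequence but expose the model to position ids spanning a much longer target window, so long-range positional distances are seen at a fraction of the compute. The sampling scheme and names (`train_len`, `target_len`) are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch only: remap a short training sequence onto position ids
# drawn from a much larger target window (not the authors' exact procedure).
import torch

def simulate_long_positions(train_len: int, target_len: int, seed: int = 0) -> torch.Tensor:
    """Map `train_len` training positions onto sorted indices from a `target_len` window."""
    assert train_len <= target_len
    gen = torch.Generator().manual_seed(seed)
    # Sample a sorted subset of the target window so relative distances cover the
    # full long-context range while only `train_len` tokens are actually processed.
    return torch.randperm(target_len, generator=gen)[:train_len].sort().values

# Example: a 24k-token batch receives position ids spread across a 128k window;
# most HF-style causal LMs accept these via the `position_ids` argument.
pos = simulate_long_positions(train_len=24_576, target_len=131_072)
print(pos[:5], pos[-5:])
```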
Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding
Zikai Xiao | Ziyang Wang | Wen Ma | Yan Zhang | Wei Shen | WangYan WangYan | Luqi Gong | Zuozhu Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
While Large Language Models (LLMs) support long contexts, they struggle with performance degradation within the context window. Current solutions incur prohibitive training costs, leaving statistical behaviors and cost-effective approaches underexplored. From the decoding perspective, we identify the Posterior Salience Attenuation (PSA) phenomenon, where the salience ratio correlates with long-text performance degradation. Notably, despite the attenuation, gold tokens still occupy high-ranking positions in the decoding space. Motivated by it, we propose the training-free Positional Contrastive Decoding (PCD) that contrasts the logits derived from long-aware attention with those from designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. Through the analysis of long-term decay simulation, we demonstrate that PCD effectively alleviates attention score degradation. Experimental results show that PCD achieves state-of-the-art performance on long-context benchmarks.
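A minimal sketch of the contrastive-decoding idea behind PCD, under stated assumptions: take next-token logits produced under full long-context attention, subtract logits produced under a restricted local-attention view, and decode from the contrasted distribution. The `alpha` weight, the contrast formula, and the way the local view is built are illustrative assumptions, not the paper's exact method.

```python
# Illustrative sketch only: contrast long-aware and local-aware next-token logits.
import torch

def positional_contrastive_logits(logits_long: torch.Tensor,
                                  logits_local: torch.Tensor,
                                  alpha: float = 1.0) -> torch.Tensor:
    """Amplify what full long-context attention adds over a local-attention view."""
    return (1 + alpha) * logits_long - alpha * logits_local

# Usage pattern with any causal LM that accepts a custom attention mask (hypothetical names):
# logits_long  = model(input_ids).logits[:, -1, :]                               # full attention
# logits_local = model(input_ids, attention_mask=local_window_mask).logits[:, -1, :]  # window-limited
# Toy demo with random logits just to show shapes:
long_l, local_l = torch.randn(1, 32_000), torch.randn(1, 32_000)
next_token = torch.argmax(positional_contrastive_logits(long_l, local_l, alpha=0.5), dim=-1)
print(next_token)
```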
Co-authors
- Wei Shen 2
- Luqi Gong 1
- Qing Gu 1
- Bryan Hooi 1
- Zhiyuan Hu 1
Venues
- ACL