Overlapping Context with Variable-Length Stride Increases Diversity when Training Large Language Model for Code
Geonmo Gu | Jaeho Kwak | Haksoo Moon | Hyun Seung Shim | Yu Jin Kim | Byoungjip Kim | Moontae Lee | Hyejeong Jeon
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), 2025
The pretraining of code LLMs typically begins with general data and progresses to domain-specific data through sequential stages. In the latter stages, a challenging issue is that the data of a target domain can be limited in size, and the conventional approach of increasing the number of epochs does not lead to a performance gain. In this paper, we propose a novel packing method that extracts overlapping contexts from the training data using a variable-length stride. Our method mitigates the data-scarcity issue by providing more diverse and abundant examples of next-token prediction than non-overlapping contexts. While the training time of our approach increases in proportion to the number of augmented examples, we present space-efficient implementations for storing the overlapping contexts. Extensive experiments on real datasets show that our approach outperforms the conventional approach of controlling the number of epochs in terms of the pass@k rate.
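The abstract does not include an implementation, but the core packing idea can be illustrated with a minimal sketch: instead of cutting a tokenized corpus into non-overlapping windows (stride equal to the context length), consecutive windows start at offsets drawn from a stride range, so neighboring contexts overlap. The function name, parameter names, and the uniform stride distribution below are illustrative assumptions, not the paper's exact settings.

```python
import random

def pack_overlapping_contexts(token_ids, context_length=2048,
                              min_stride=512, max_stride=2048, seed=0):
    """Extract training contexts from a flat list of token ids.

    A stride sampled from [min_stride, max_stride] separates the start
    positions of consecutive windows; any stride smaller than
    context_length makes adjacent contexts overlap, so the same tokens
    reappear at different positions across examples.
    """
    rng = random.Random(seed)
    contexts = []
    start = 0
    while start + context_length <= len(token_ids):
        contexts.append(token_ids[start:start + context_length])
        start += rng.randint(min_stride, max_stride)  # variable-length stride
    return contexts

# Toy usage: a "corpus" of 10,000 token ids.
corpus = list(range(10_000))
windows = pack_overlapping_contexts(corpus, context_length=1024,
                                     min_stride=256, max_stride=1024)
print(len(windows))  # more windows than the ~9 non-overlapping chunks
```

A space-efficient variant, in the spirit of the implementations the abstract mentions, could store only the sampled start offsets and slice the shared token buffer lazily at training time rather than materializing every overlapping window.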