Xu Miao
2025
YuLan-Mini: Pushing the Limits of Open Data-efficient Language Model
Yiwen Hu | Huatong Song | Jie Chen | Jia Deng | Jiapeng Wang | Kun Zhou | Yutao Zhu | Jinhao Jiang | Zican Dong | Yang Lu | Xu Miao | Xin Zhao | Ji-Rong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Due to the immense resource demands and the complex techniques involved, it remains challenging to successfully pre-train a large language model (LLM) with state-of-the-art performance. In this paper, we explore the key bottlenecks and design choices in pre-training, and make the following contributions: (1) a comprehensive investigation into the factors contributing to training instability; (2) a robust optimization approach designed to mitigate training instability effectively; (3) an elaborate data pipeline that integrates data synthesis, data curriculum, and data selection. By integrating these techniques, we establish a relatively low-cost training recipe and use it to pre-train YuLan-Mini, a fully open base model with 2.4B parameters trained on 1.08T tokens. Remarkably, YuLan-Mini achieves top-tier performance among models of similar parameter scale, comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the training recipe and data composition. Project details can be accessed at the following link: https://anonymous.4open.science/r/YuLan-Mini/README.md.