YuLan-Mini: Pushing the Limits of Open Data-efficient Language Model

Hu Yiwen, Huatong Song, Jie Chen, Jia Deng, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Yang Lu, Xu Miao, Xin Zhao, Ji-Rong Wen


Abstract
Due to the immense resource demands and the complex techniques involved, it remains challenging to successfully pre-train a large language model (LLM) with state-of-the-art performance. In this paper, we explore the key bottlenecks and design choices in pre-training, and make the following contributions: (1) a comprehensive investigation into the factors contributing to training instability; (2) a robust optimization approach designed to mitigate training instability effectively; (3) an elaborate data pipeline that integrates data synthesis, data curriculum, and data selection. By integrating the above techniques, we create a relatively low-cost training recipe and use it to pre-train YuLan-Mini, a fully open base model with 2.4B parameters trained on 1.08T tokens. Remarkably, YuLan-Mini achieves top-tier performance among models of similar parameter scale, performing comparably to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the training recipe and data composition. Project details can be accessed at the following link: https://anonymous.4open.science/r/YuLan-Mini/README.md.
Anthology ID:
2025.acl-long.268
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5374–5400
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.268/
Cite (ACL):
Hu Yiwen, Huatong Song, Jie Chen, Jia Deng, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Yang Lu, Xu Miao, Xin Zhao, and Ji-Rong Wen. 2025. YuLan-Mini: Pushing the Limits of Open Data-efficient Language Model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5374–5400, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
YuLan-Mini: Pushing the Limits of Open Data-efficient Language Model (Yiwen et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.268.pdf