Houyi Li
2026
Scaling Laws for Code: A More Data-Hungry Regime
Xianzhen Luo | Wenzhen Zheng | Qingfu Zhu | Rongyi Zhang | Houyi Li | Siming Huang | YuanTao Fan | Wanxiang Che
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code Large Language Models (LLMs) are revolutionizing software engineering. However, the scaling laws that guide efficient training have been analyzed predominantly on Natural Language (NL). Given fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws apply directly to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
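As a rough illustration of what a "data-to-parameter ratio" means under a Chinchilla-style law, the sketch below uses the standard parametric form L(N, D) = E + A/N^alpha + B/D^beta together with the common approximation C = 6ND for training compute. The function names and all coefficient values are hypothetical placeholders chosen for illustration; they are not the fitted values from the paper, and the Farseer law is not shown.

```python
def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla-style parametric loss: L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta


def compute_optimal(compute_flops, A, B, alpha, beta):
    """Minimize A / N^alpha + B / D^beta subject to 6 * N * D = compute_flops.

    Setting the derivative with respect to N to zero gives
    N_opt = (alpha * A / (beta * B))^(1 / (alpha + beta)) * (C / 6)^(beta / (alpha + beta)).
    """
    budget = compute_flops / 6.0  # equals N * D under the C = 6ND approximation
    scale = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = scale * budget ** (beta / (alpha + beta))
    d_opt = budget / n_opt
    return n_opt, d_opt


if __name__ == "__main__":
    # Hypothetical coefficients, for illustration only.
    coeffs = dict(A=400.0, B=1200.0, alpha=0.34, beta=0.28)
    n_opt, d_opt = compute_optimal(1e21, **coeffs)
    print(f"N_opt = {n_opt:.3e} params, D_opt = {d_opt:.3e} tokens")
    print(f"tokens per parameter = {d_opt / n_opt:.1f}")
    print(f"predicted loss = {chinchilla_loss(n_opt, d_opt, E=1.0, **coeffs):.4f}")
```

Under this form, a larger optimal `d_opt / n_opt` at a fixed compute budget is what a "more data-hungry" regime would look like; which coefficients produce that for code is an empirical question the sketch does not answer.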
2025
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Siming Huang | Tianhao Cheng | Jason Klein Liu | Weidi Xu | Jiaran Hao | Liuyihan Song | Yang Xu | Jian Yang | Jiaheng Liu | Chenchen Zhang | Linzheng Chai | Ruifeng Yuan | Xianzhen Luo | Qiufeng Wang | YuanTao Fan | Qingfu Zhu | Zhaoxiang Zhang | Yang Gao | Jie Fu | Qian Liu | Houyi Li | Ge Zhang | Yuan Qi | Xu Yinghui | Wei Chu | Zili Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code LLMs have been widely used in various domains, including code generation, logical reasoning, and agent systems. However, most open-access code LLMs release only weights, lacking key features such as reproducible data pipelines and transparent training protocols, which are crucial for deeper, more reliable investigations. To address this gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an “open cookbook” for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Our work identifies the key ingredients for building a top-tier code LLM: optimized heuristic rules for data cleaning and deduplication, effective recall of code-related text corpus, and high-quality synthetic data for both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research and enable reproducible advancements in code intelligence. The released resource is available at https://opencoder-llm.github.io.
Multi-matrix Factorization Attention
Jingcheng Hu | Houyi Li | Yinmin Zhang | Zili Wang | Shuigeng Zhou | Xiangyu Zhang | Heung-Yeung Shum
Findings of the Association for Computational Linguistics: ACL 2025
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain comparably strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as value through value projection re-parameterization. MFA’s design enables strong model capacity under a tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with a minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
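To make the KV cache budget concrete, the sketch below does generic per-token cache accounting for a decoder: one key vector and one value vector per layer per cached head. It is not the MFA or MFA-KR implementation. The function names and the example configurations are hypothetical, and the `store_values=False` switch only models the idea of keeping keys and deriving values from them, without reproducing the paper's re-parameterization or its reported numbers.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim,
                             bytes_per_elem=2, store_values=True):
    """Bytes of KV cache needed per generated token.

    Each layer caches one key vector per KV head, plus one value vector
    per KV head unless values are derived from the cached keys.
    """
    tensors_per_head = 2 if store_values else 1
    return n_layers * n_kv_heads * head_dim * bytes_per_elem * tensors_per_head


def cache_reduction(baseline_bytes, candidate_bytes):
    """Fraction of the baseline cache saved by the candidate layout."""
    return 1.0 - candidate_bytes / baseline_bytes


if __name__ == "__main__":
    # Hypothetical configurations, for illustration only.
    baseline = kv_cache_bytes_per_token(n_layers=24, n_kv_heads=16, head_dim=64)
    fewer_heads = kv_cache_bytes_per_token(n_layers=24, n_kv_heads=4, head_dim=64)
    keys_only = kv_cache_bytes_per_token(n_layers=24, n_kv_heads=4, head_dim=64,
                                         store_values=False)
    print(f"baseline: {baseline} bytes per token")
    print(f"fewer KV heads: {cache_reduction(baseline, fewer_heads):.1%} saved")
    print(f"keys only: {cache_reduction(baseline, keys_only):.1%} saved")
```

The accounting shows why cache size depends only on the number and width of cached heads, which leaves room to grow the query side of the attention computation without growing the cache.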
Co-authors
- Yuantao Fan 2
- Siming Huang 2
- Xianzhen Luo 2
- Zili Wang 2
- Qingfu Zhu 2
- Linzheng Chai 1
- Wanxiang Che (车万翔) 1
- Tianhao Cheng 1
- Wei Chu 1
- Jie Fu 1
- Yang Gao 1
- Jiaran Hao 1
- Jingcheng Hu 1
- Jason Klein Liu 1
- Jiaheng Liu 1
- Qian Liu 1
- Yuan Qi 1
- Heung Yeung Shum 1
- Liuyihan Song 1
- Qiufeng Wang 1
- Weidi Xu 1
- Yang Xu 1
- Jian Yang 1
- Xu Yinghui 1
- Ruifeng Yuan 1
- Chenchen Zhang 1
- Zhaoxiang Zhang 1
- Ge Zhang 1
- Rongyi Zhang 1
- Yinmin Zhang 1
- Xiangyu Zhang 1
- Wenzhen Zheng 1
- Shuigeng Zhou 1