2025
A Survey of Post-Training Scaling in Large Language Models
Hanyu Lai | Xiao Liu | Junjie Gao | Jiale Cheng | Zehan Qi | Yifan Xu | Shuntian Yao | Dan Zhang | Jinhua Du | Zhenyu Hou | Xin Lv | Minlie Huang | Yuxiao Dong | Jie Tang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved remarkable proficiency in understanding and generating human language, largely owing to the “scaling law” that characterizes the relationship among language modeling loss, model parameters, and the number of pre-training tokens. However, with the exhaustion of high-quality internet corpora and growing computational demands, the sustainability of pre-training scaling needs to be addressed. This paper presents a comprehensive survey of post-training scaling, an emergent paradigm that aims to relieve the limitations of traditional pre-training by scaling the alignment phase, which has traditionally accounted for only a minor fraction of the total training computation. Our survey categorizes post-training scaling into three key methodologies: Supervised Fine-tuning (SFT), Reinforcement Learning from Feedback (RLxF), and Test-time Compute (TTC). We provide an in-depth analysis of the motivation behind post-training scaling, the scalable variants of these methodologies, and a comparative discussion against traditional approaches. By examining the latest advancements, identifying promising application scenarios, and highlighting unresolved issues, we aim to provide a coherent understanding of post-training scaling for LLMs and to map future research trajectories in this landscape.
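To make the test-time compute (TTC) methodology concrete, below is a minimal best-of-N sampling sketch in Python; the `sample_fn` and `score_fn` callables are generic placeholders, not any specific method from the survey.

```python
# Minimal best-of-N sketch (illustrative assumption): spend more inference compute
# by sampling N candidate answers and returning the one a scorer prefers.

def best_of_n(sample_fn, score_fn, prompt: str, n: int = 8) -> str:
    """Sample n candidates and return the highest-scoring one."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_fn(prompt, c))
```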
LongReward: Improving Long-context Large Language Models with AI Feedback
Jiajie Zhang | Zhongni Hou | Xin Lv | Shulin Cao | Zhenyu Hou | Yilin Niu | Lei Hou | Yuxiao Dong | Ling Feng | Juanzi Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often degrades the long-context performance of SFT models and imposes inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models’ capacities. However, how to obtain reliable rewards in long-context scenarios remains largely unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses along four human-valued dimensions (helpfulness, logicality, faithfulness, and completeness), each with a carefully designed assessment pipeline. By combining LongReward with the offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models’ long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one’s performance.
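As an illustration of how such an LLM-provided reward can be turned into DPO training data, here is a minimal Python sketch; the judge prompt, scoring scale, and function names (`score_response`, `build_dpo_pair`) are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch of a LongReward-style reward (assumptions noted above): an off-the-shelf LLM
# judge scores a long-context response on four dimensions, the scores are averaged,
# and the best/worst sampled responses form a (chosen, rejected) pair for DPO.

DIMENSIONS = ["helpfulness", "logicality", "faithfulness", "completeness"]

def score_response(judge_llm, context: str, question: str, response: str) -> float:
    """Average the judge's 0-10 rating over the four dimensions."""
    scores = []
    for dim in DIMENSIONS:
        prompt = (
            f"Rate the {dim} of the answer on a 0-10 scale.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer: {response}\nScore:"
        )
        scores.append(float(judge_llm(prompt).strip()))
    return sum(scores) / len(scores)

def build_dpo_pair(judge_llm, context, question, sampled_responses):
    """Use the best- and worst-scored samples as the (chosen, rejected) pair."""
    ranked = sorted(sampled_responses,
                    key=lambda r: score_response(judge_llm, context, question, r))
    return {"prompt": question, "chosen": ranked[-1], "rejected": ranked[0]}
```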
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Zhenyu Hou | Ziniu Hu | Yujiang Li | Rui Lu | Jie Tang | Yuxiao Dong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training, yet it remains under-explored in on-policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search into RL training. Unlike existing approaches, which typically train a separate process reward model and can therefore suffer from distribution mismatch and reward hacking, TreeRL provides intermediate supervision without requiring any separate reward model training. We also introduce a cost-effective tree search strategy that achieves higher search efficiency under the same generation token budget by branching from high-uncertainty intermediate steps rather than branching at random. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLMs. TreeRL is open-sourced at https://github.com/THUDM/TreeRL.
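As a rough illustration of branching from high-uncertainty intermediate steps, the Python sketch below approximates a step's uncertainty by its mean token entropy and spawns extra rollouts from the most uncertain step; the step representation and helper names are assumptions, not the released TreeRL implementation.

```python
# Uncertainty-guided branching sketch (assumptions noted above): pick the reasoning
# step with the highest mean token entropy and branch new rollouts from its prefix.
import math

def step_entropy(token_probs):
    """Mean per-token entropy of a step; token_probs is a list of {token: prob} dicts."""
    ents = [-sum(p * math.log(p) for p in dist.values() if p > 0)
            for dist in token_probs]
    return sum(ents) / max(len(ents), 1)

def pick_branch_step(chain):
    """Index of the most uncertain step in a chain of (step_text, token_probs) pairs."""
    scores = [step_entropy(probs) for _, probs in chain]
    return max(range(len(chain)), key=lambda i: scores[i])

def expand_tree(policy_sample, chain, n_branches=2):
    """Spawn extra rollouts that reuse the prefix up to the chosen high-entropy step."""
    i = pick_branch_step(chain)
    prefix = "".join(text for text, _ in chain[: i + 1])
    return [policy_sample(prefix) for _ in range(n_branches)]
```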
SceneGenAgent: Precise Industrial Scene Generation with Coding Agent
Xiao Xia | Dan Zhang | Zibo Liao | Zhenyu Hou | Tianrui Sun | Jing Li | Ling Fu | Yuxiao Dong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to the demand for precise measurements and positioning, which requires complex planning of spatial arrangements. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experimental results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching a success rate of up to 81.0% on real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and dataset are available at https://github.com/THUDM/SceneGenAgent.
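The plan-verify-refine loop can be sketched roughly as below in Python; the layout schema, constraint checks, and helper names (`verify_layout`, `generate_scene`) are illustrative assumptions rather than SceneGenAgent's actual interface.

```python
# Plan-verify-refine sketch (assumptions noted above): draft a structured layout,
# check quantitative constraints, ask the LLM to repair violations, then emit scene code.
import json

def verify_layout(layout, constraints):
    """Return a list of human-readable violations (empty if the layout is valid)."""
    violations = []
    for obj in layout["objects"]:
        x, y = obj["position"]
        if not (0 <= x <= constraints["width"] and 0 <= y <= constraints["depth"]):
            violations.append(f"{obj['name']} at ({x}, {y}) lies outside the floor area")
    return violations

def generate_scene(llm, description, constraints, max_rounds=3):
    layout = json.loads(llm(f"Plan a machine layout as JSON for: {description}"))
    for _ in range(max_rounds):
        issues = verify_layout(layout, constraints)
        if not issues:
            break
        layout = json.loads(llm(f"Fix these layout issues and return the JSON again: {issues}"))
    # Final step: translate the verified layout into scene-building code
    return llm(f"Emit C# scene code for this layout: {json.dumps(layout)}")
```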
SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
Haoran Wang | Zhenyu Hou | Yao Wei | Jie Tang | Yuxiao Dong
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, have offered end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address these challenges, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for building SWE-Dev. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models achieve top performance among all open SWE agents. Specifically, the success rates of the SWE-Dev 7B and 32B parameter models reach 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.
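A rough Python sketch of evaluating a candidate patch against synthesized test cases, and of filtering agent trajectories for training data, is shown below; the shell commands, file layout, and helper names are assumptions rather than SWE-Dev's exact pipeline.

```python
# Patch-evaluation sketch (assumptions noted above): apply a candidate patch to the
# repository and let the synthesized test cases decide whether the trajectory succeeds.
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_files: list[str]) -> bool:
    """Apply the patch, run the synthesized tests, and report pass/fail."""
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False
    result = subprocess.run(["python", "-m", "pytest", "-q", *test_files], cwd=repo_dir)
    return result.returncode == 0

def filter_trajectories(repo_dir, trajectories):
    """Keep agent trajectories whose final patch passes the tests (for training data)."""
    return [t for t in trajectories
            if evaluate_patch(repo_dir, t["patch_file"], t["test_files"])]
```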
2024
ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline
Yifan Xu | Xiao Liu | Xinghan Liu | Zhenyu Hou | Yueyan Li | Xiaohan Zhang | Zihan Wang | Aohan Zeng | Zhengxiao Du | Zhao Wenyi | Jie Tang | Yuxiao Dong
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) have shown excellent mastery of human language but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets have been developed to enhance LLMs’ mathematical abilities, it remains a challenge to simultaneously maintain and improve both the language and mathematical capabilities of deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses this challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization on the LLM’s own generations for data collection. Based on ChatGLM3-32B, we conduct experiments on both academic benchmarks and our newly created challenging dataset, MathUserEval. The results show that our pipeline significantly enhances the LLM’s mathematical problem-solving while still improving its language ability, outperforming LLMs up to twice its size. Related techniques have been deployed in ChatGLM, an online serving LLM. The related evaluation datasets and scripts are released at
https://github.com/THUDM/ChatGLM-Math.
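A minimal Python sketch of rejective fine-tuning driven by a critique model is shown below; the scoring prompt, acceptance threshold, and function names are illustrative assumptions, not the paper's exact Self-Critique implementation.

```python
# Rejective fine-tuning sketch (assumptions noted above): sample several solutions per
# question, score them with the critique model, keep high-scoring ones as SFT data,
# and pair the best and worst samples for a later preference-optimization stage.

def critique_score(critique_llm, question: str, solution: str) -> float:
    prompt = f"Question: {question}\nSolution: {solution}\nRate the solution from 1 to 10:"
    return float(critique_llm(prompt).strip())

def rejective_finetune_data(critique_llm, question, samples, threshold=8.0):
    """Return accepted (question, solution) pairs plus a (chosen, rejected) DPO pair."""
    scored = sorted(((critique_score(critique_llm, question, s), s) for s in samples),
                    reverse=True)
    accepted = [(question, s) for score, s in scored if score >= threshold]
    dpo_pair = {"prompt": question, "chosen": scored[0][1], "rejected": scored[-1][1]}
    return accepted, dpo_pair
```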