Hejia Geng
2026
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Zelin Tan | Hejia Geng | Xiaohang Yu | Mulei Zhang | Guancheng Wan | Yifan Zhou | Qiang He | Xiangyuan Xue | Heng Zhou | Yutao Fan | Zhong-Zhi Li | Zaibin Zhang | Guibin Zhang | Chen Zhang | Zhenfei Yin | Philip Torr | Lei Bai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper investigates the scaling behavior of LLM RL post-training, focusing on mathematical reasoning. Through experiments across the Qwen2.5 series (0.5B to 72B), we characterize how model scale, data, and compute interact. Our analysis yields four key findings: 1. Larger models consistently demonstrate superior compute and data efficiency. 2. The relationship between model performance and training resources follows a **predictive power law** across both base and instruction-tuned models. 3. RL learning efficiency exhibits a latent **saturation trend** with increasing model scale. 4. In data-constrained regimes, performance is primarily driven by the **total volume of training data** rather than sample uniqueness. These results offer practical guidelines for scaling reasoning capabilities through RL post-training.
2025
ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks
Heng Zhou | Hejia Geng | Xiangyuan Xue | Li Kang | Yiran Qin | Zhiyong Wang | Zhenfei Yin | Lei Bai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multi-agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving. However, current MAS frameworks are limited by poor flexibility and scalability, and their optimization strategies remain underdeveloped. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process. The core of ReSo is the proposed Collaborative Reward Model, which provides fine-grained reward signals for optimizing MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without human annotations. Experimentally, ReSo matches or outperforms existing methods. ReSo achieves 33.7% and 32.3% accuracy on Math-MAS and SciBench-MAS, respectively, while other methods completely fail. The code and data are available at [ReSo](https://github.com/hengzzzhou/ReSo).