Yifan Zhou
2026
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Zelin Tan | Hejia Geng | Xiaohang Yu | Mulei Zhang | Guancheng Wan | Yifan Zhou | Qiang He | Xiangyuan Xue | Heng Zhou | Yutao Fan | Zhong-Zhi Li | Zaibin Zhang | Guibin Zhang | Chen Zhang | Zhenfei Yin | Philip Torr | Lei Bai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper investigates the scaling behavior of RL post-training for LLMs, focusing on mathematical reasoning. Through experiments across the Qwen2.5 series (0.5B to 72B), we characterize how model scale, data, and compute interact. Our analysis yields four key findings: 1. Larger models consistently demonstrate superior compute and data efficiency. 2. The relationship between model performance and training resources follows a **predictive power law** for both base and instruction-tuned models. 3. RL learning efficiency exhibits a latent **saturation trend** with increasing model scale. 4. In data-constrained regimes, performance is primarily driven by the **total volume of training data** rather than sample uniqueness. These results offer practical guidelines for scaling reasoning capabilities through RL post-training.
2025
The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters
Chulun Zhou | Qiujing Wang | Mo Yu | Xiaoqian Yue | Rui Lu | Jiangnan Li | Yifan Zhou | Shunchi Zhang | Jie Zhou | Wai Lam
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Theory-of-Mind (ToM) is a fundamental psychological capability that allows humans to understand and interpret the mental states of others. Humans infer others’ thoughts by integrating causal cues and indirect clues from broad contextual information, often derived from past interactions. In other words, human ToM relies heavily on an understanding of the backgrounds and life stories of others. Unfortunately, this aspect is largely overlooked in existing benchmarks for evaluating machines’ ToM capabilities, because they use short narratives without global context, especially the personal backgrounds of characters. In this paper, we verify the importance of comprehensive contextual understanding of personal backgrounds in ToM and assess the performance of LLMs in such complex scenarios. To achieve this, we introduce the CharToM-QA benchmark, comprising 1,035 ToM questions based on characters from classic novels. Our human study reveals a significant disparity in performance: the same group of educated participants performs dramatically better when they have read the novels than when they have not. In parallel, our experiments on state-of-the-art LLMs, including the very recent o1 and DeepSeek-R1 models, show that LLMs still perform notably worse than humans, even though they have seen these stories during pre-training. This highlights the limitations of current LLMs in capturing the nuanced contextual information required for ToM reasoning.