Zhenfei Yin
2026
From Word to World: Can Large Language Models be Implicit Text-based World Models?
Yixia Li | Hongru Wang | Jiahao Qiu | Zhenfei Yin | Dongdong Zhang | Cheng Qian | Zeping Li | Xiaoteng Ma | Guanhua Chen | Heng Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Agentic learning increasingly hinges on interaction, yet real-world experience is expensive, limited, and often irreversible at inference time. World models promise to mitigate these limitations, but it remains unclear whether large language models can actually serve as reliable world models and deliver concrete benefits to downstream agents. We investigate these questions in text-based environments, a controlled testbed that reframes language modeling as next-state prediction under interaction. We propose a three-level framework to evaluate LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we show that sufficiently trained world models capture coherent environment dynamics, scale predictably with data and model capacity, and unlock tangible agent improvements. For example, action verification boosts GPT-4o by 5.5% on WebShop, and warm-started RL achieves a 15% gain on SciWorld. Crucially, these benefits hinge on behavioral coverage and environment complexity, sharply characterizing when world modeling meaningfully advances agent learning.
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Zelin Tan | Hejia Geng | Xiaohang Yu | Mulei Zhang | Guancheng Wan | Yifan Zhou | Qiang He | Xiangyuan Xue | Heng Zhou | Yutao Fan | Zhong-Zhi Li | Zaibin Zhang | Guibin Zhang | Chen Zhang | Zhenfei Yin | Philip Torr | Lei Bai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper investigates the scaling behavior of LLM RL post-training, focusing on mathematical reasoning. Through experiments across the Qwen2.5 series (0.5B to 72B), we characterize how model scale, data, and compute interact. Our analysis yields four key findings: 1. Larger models consistently demonstrate superior compute and data efficiency. 2. The relationship between model performance and training resources follows a **predictive power-law** across both base and instruction-tuned models. 3. RL learning efficiency exhibits a latent **saturation trend** with increasing model scale. 4. In data-constrained regimes, performance is primarily driven by the **total volume of training data** rather than sample uniqueness. These results offer practical guidelines for scaling reasoning capabilities through RL post-training.
Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents
Zeping Li | Hongru Wang | Yiwen Zhao | Guanhua Chen | Yixia Li | Keyang Chen | Yixin Cao | Guangnan Ye | Hongfeng Chai | Zhenfei Yin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. However, in long trajectories, agents often trigger excessive and low-quality tool calls, which increases latency, degrades inference performance, and makes tool-use behavior difficult to manage. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% compared to the average of baselines, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.
2025
ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks
Heng Zhou | Hejia Geng | Xiangyuan Xue | Li Kang | Yiran Qin | Zhiyong Wang | Zhenfei Yin | Lei Bai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multi-agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving. However, current MAS frameworks are limited by poor flexibility and scalability, with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process. The core of ReSo is the proposed Collaborative Reward Model, which provides fine-grained reward signals for optimizing MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without human annotations. Experimentally, ReSo matches or outperforms existing methods. ReSo achieves 33.7% and 32.3% accuracy on Math-MAS and SciBench-MAS, while other methods completely fail. The code and data are available at [ReSo](https://github.com/hengzzzhou/ReSo).
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System
Haoyang Su | Renqi Chen | Shixiang Tang | Zhenfei Yin | Xinzhe Zheng | Jinzhe Li | Biqing Qi | Qi Wu | Hui Li | Wanli Ouyang | Philip Torr | Bowen Zhou | Nanqing Dong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid advancement of scientific progress requires innovative tools that can accelerate knowledge discovery. Although recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short of replicating the collaborative nature of real-world scientific practices, where diverse experts work together in teams to tackle complex problems. To address these limitations, we propose an LLM-based multi-agent system, i.e., Virtual Scientists (VIRSCI), designed to mimic the teamwork inherent in scientific research. VIRSCI organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas. We further investigate the collaboration mechanisms that contribute to its tendency to produce ideas with higher novelty, offering valuable insights to guide future research and illuminating pathways toward building a robust system for autonomous scientific discovery. The code is available at https://github.com/open-sciencelab/Virtual-Scientists.
2024
Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
Chen Qian | Jie Zhang | Wei Yao | Dongrui Liu | Zhenfei Yin | Yu Qiao | Yong Liu | Jing Shao
Findings of the Association for Computational Linguistics: ACL 2024
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs’ trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs’ trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM’s pre-training checkpoints to enhance the LLM’s trustworthiness. Finally, inspired by the theoretical result that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field.
Co-authors
- Lei Bai 2
- Guanhua Chen 2
- Hejia Geng 2
- Yixia Li 2
- Zeping Li 2
- Philip Torr 2
- Hongru Wang 2
- Xiangyuan Xue 2
- Heng Zhou 2
- Yixin Cao 1
- Hongfeng Chai (柴洪峰) 1
- Keyang Chen 1
- Renqi Chen 1
- Nanqing Dong 1
- Yutao Fan 1
- Qiang He 1
- Heng Ji 1
- Li Kang 1
- Hui Li 1
- Jinzhe Li 1
- Zhong-Zhi Li 1
- Dongrui Liu 1
- Yong Liu 1
- Xiaoteng Ma 1
- Wanli Ouyang 1
- Biqing Qi 1
- Chen Qian 1
- Cheng Qian 1
- Yu Qiao 1
- Yiran Qin 1
- Jiahao Qiu 1
- Jing Shao 1
- Haoyang Su 1
- Zelin Tan 1
- Shixiang Tang 1
- Guancheng Wan 1
- Zhiyong Wang 1
- Qi Wu 1
- Wei Yao 1
- Guangnan Ye (叶广楠) 1
- Xiaohang Yu 1
- Chen Zhang 1
- Dongdong Zhang 1
- Guibin Zhang 1
- Jie Zhang 1
- Mulei Zhang 1
- Zaibin Zhang 1
- Yiwen Zhao 1
- Xinzhe Zheng 1
- Bowen Zhou 1
- Yifan Zhou 1