Guancheng Wan
2026
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Zelin Tan | Hejia Geng | Xiaohang Yu | Mulei Zhang | Guancheng Wan | Yifan Zhou | Qiang He | Xiangyuan Xue | Heng Zhou | Yutao Fan | Zhong-Zhi Li | Zaibin Zhang | Guibin Zhang | Chen Zhang | Zhenfei Yin | Philip Torr | Lei Bai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper investigates the scaling behavior of LLM RL post-training, focusing on mathematical reasoning. Through experiments across the Qwen2.5 series (0.5B to 72B), we characterize how model scale, data, and compute interact. Our analysis yields four key findings: 1. Larger models consistently demonstrate superior compute and data efficiency. 2. The relationship between model performance and training resources follows a **predictive power law** across both base and instruction-tuned models. 3. RL learning efficiency exhibits a latent **saturation trend** with increasing model scale. 4. In data-constrained regimes, performance is primarily driven by the **total volume of training data** rather than sample uniqueness. These results offer practical guidelines for scaling reasoning capabilities through reinforcement learning post-training.
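As a rough illustration of what a power-law relationship between a performance metric and a training resource means in practice, the sketch below fits y = a * x^(-b) by ordinary least squares in log-log space. The functional form, the function name `fit_power_law`, and the synthetic numbers are assumptions made for illustration; the abstract does not specify the paper's fitting procedure or report these values.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y).

    Hypothetical helper: a power law is a straight line in log-log
    space, so the slope gives -b and the intercept gives log(a).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    a = math.exp(my - slope * mx)
    return a, -slope

# Synthetic data (NOT from the paper): an error-like metric that
# decays as a power of the compute budget.
compute = [1e0, 1e1, 1e2, 1e3]
test_loss = [2.0 * c ** -0.3 for c in compute]

a, b = fit_power_law(compute, test_loss)
```

A fit of this kind is "predictive" in the sense that the recovered `a` and `b` can be used to extrapolate the metric to resource budgets that were not measured.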
Behavioral Consistency Validation for LLM Agents: An Analysis of Trading-Style Switching through Stock-Market Simulation
Zeping Li | Guancheng Wan | Keyang Chen | Yu Chen | Yiwen Zhao | Philip Torr | Guangnan Ye | Zhenfei Yin | Hongfeng Chai
Findings of the Association for Computational Linguistics: ACL 2026
Recent works have increasingly applied Large Language Models (LLMs) as agents in financial stock market simulations to test if micro-level behaviors aggregate into macro-level phenomena. However, a crucial question arises: Do LLM agents’ behaviors align with real market participants? This alignment is key to the validity of simulation results. To explore this, we select a financial stock market scenario to test behavioral consistency. Investors are typically classified as fundamental or technical traders, but most simulations fix strategies at initialization, failing to reflect real-world trading dynamics. In this work, we assess whether agents’ strategy switching aligns with financial theory, providing a framework for this evaluation. We operationalize four behavioral-finance drivers—loss aversion, herding, wealth differentiation, and price misalignment—as personality traits set via prompting and stored long-term. In year-long simulations, agents process daily price-volume data, trade under a designated style, and reassess their strategy every 10 trading days. We introduce four alignment metrics and use Mann–Whitney U tests to compare agents’ style-switching behavior with financial theory. Our results show that recent LLMs’ switching behavior is only partially consistent with behavioral-finance theories, highlighting the need for further refinement in aligning agent behavior with financial theory.
Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy
Eric Hanchen Jiang | Weixuan Ou | Run Liu | Shengyuan Pang | Guancheng Wan | Ranjie Duan | Wei Dong | Kai-Wei Chang | XiaoFeng Wang | Ying Nian Wu | Xinfeng Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize mitigating responses to harmful prompts at the expense of overcautious behavior, leading models to incorrectly refuse benign requests. A key goal of safety alignment is therefore to improve safety while simultaneously minimizing false refusals. In this work, we introduce Energy Landscape Steering (ELS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable (false refusal or jailbreak) states and low energy to desirable (helpful response or safe reject) ones. During inference, the EBM maps the LLM’s internal activations to an energy landscape, and we use the gradient of the energy function to steer the hidden states toward low-energy regions in real time. This dynamically guides the model toward desirable behavior without modifying its parameters. By decoupling behavioral control from the model’s core knowledge, ELS provides a flexible and computationally efficient solution. Extensive experiments across diverse models demonstrate its effectiveness: ELS raises compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining baseline safety performance. Our work establishes a promising paradigm for building LLMs that simultaneously achieve high safety and low false refusal rates.
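The gradient-based steering idea described above can be illustrated with a toy example. This is only a sketch under strong assumptions: the paper's EBM is a learned model over LLM activations, whereas here a hand-written quadratic energy stands in for it, and the names `energy`, `energy_grad`, `steer`, `step_size`, and `n_steps` are hypothetical.

```python
def energy(h, target):
    """Toy energy: half the squared distance to a 'desirable' point.

    Assumption: stands in for a learned EBM, which is not modeled here.
    """
    return 0.5 * sum((hi - ti) ** 2 for hi, ti in zip(h, target))

def energy_grad(h, target):
    """Gradient of the toy energy with respect to the hidden state h."""
    return [hi - ti for hi, ti in zip(h, target)]

def steer(h, target, step_size=0.1, n_steps=5):
    """Move h toward lower energy with plain gradient-descent steps.

    Only the hidden state changes; no model parameter is updated.
    """
    for _ in range(n_steps):
        g = energy_grad(h, target)
        h = [hi - step_size * gi for hi, gi in zip(h, g)]
    return h

hidden = [1.0, -2.0, 0.5]      # hypothetical activation vector
desirable = [0.0, 0.0, 0.0]    # hypothetical low-energy region
steered = steer(hidden, desirable)
```

In this toy setting each step shrinks the distance to the low-energy point by a factor of `1 - step_size`, so the energy of `steered` is strictly lower than that of `hidden`.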
2025
Protein Large Language Models: A Comprehensive Survey
Yijia Xiao | Wanjia Zhao | Junkai Zhang | Yiqiao Jin | Han Zhang | Zhicheng Ren | Renliang Sun | Haixin Wang | Guancheng Wan | Pan Lu | Xiao Luo | Yu Zhang | James Zou | Yizhou Sun | Wei Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Protein-specific large language models (ProteinLLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of ProteinLLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art ProteinLLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning ProteinLLMs as essential tools for scientific discovery in protein science. Resources are maintained at https://github.com/Yijia-Xiao/Protein-LLM-Survey.
MasRouter: Learning to Route LLMs for Multi-Agent Systems
Yanwei Yue | Guibin Zhang | Boyang Liu | Guancheng Wan | Kun Wang | Dawei Cheng | Yiyan Qi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynamic LLM selection. Current LLM routing methods effectively reduce overhead in single-agent scenarios by customizing LLM selection for each query, but they overlook the critical decisions regarding collaboration modes and agent roles in MAS. In response to this challenge, we first introduce the problem of Multi-Agent System Routing (MASR), which integrates all components of MAS into a unified routing framework. Toward this goal, we propose MasRouter, the first high-performing, cost-effective, and inductive MASR solution. MasRouter performs collaboration mode determination, role allocation, and LLM routing through a cascaded controller network, progressively constructing a MAS that balances effectiveness and efficiency. Extensive experiments demonstrate that MasRouter is (1) high-performing, achieving a 1.8% improvement over the state-of-the-art method on MBPP; (2) economical, reducing overhead by up to 52.07% compared to SOTA methods on HumanEval; and (3) plug-and-play, seamlessly integrating with mainstream MAS frameworks, reducing overhead by 17.21% via customized routing.
G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems
Shilong Wang | Guibin Zhang | Miao Yu | Guancheng Wan | Fanci Meng | Chongye Guo | Kun Wang | Yang Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors has raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employs topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can be seamlessly combined with mainstream MAS with security guarantees.
Co-authors
- Guibin Zhang 3
- Philip Torr 2
- Kun Wang 2
- Zhenfei Yin 2
- Lei Bai 1
- Hongfeng Chai (柴洪峰) 1
- Kai-Wei Chang 1
- Keyang Chen 1
- Yu Chen 1
- Dawei Cheng 1
- Wei Dong 1
- Ranjie Duan 1
- Yutao Fan 1
- Hejia Geng 1
- Chongye Guo 1
- Qiang He 1
- Eric Hanchen Jiang 1
- Yiqiao Jin 1
- Zhong-Zhi Li 1
- Zeping Li 1
- Xinfeng Li 1
- Boyang Liu 1
- Run Liu 1
- Pan Lu 1
- Xiao Luo 1
- Fanci Meng 1
- Weixuan Ou 1
- Shengyuan Pang 1
- Yiyan Qi 1
- Zhicheng Ren 1
- Renliang Sun 1
- Yizhou Sun 1
- Zelin Tan 1
- Haixin Wang 1
- Wei Wang 1
- Shilong Wang 1
- Yang Wang 1
- XiaoFeng Wang 1
- Ying Nian Wu 1
- Yijia Xiao 1
- Xiangyuan Xue 1
- Guangnan Ye (叶广楠) 1
- Miao Yu 1
- Xiaohang Yu 1
- Yanwei Yue 1
- Junkai Zhang 1
- Han Zhang 1
- Yu Zhang 1
- Mulei Zhang 1
- Zaibin Zhang 1
- Chen Zhang 1
- Wanjia Zhao 1
- Yiwen Zhao 1
- Yifan Zhou 1
- Heng Zhou 1
- James Zou 1