Zhixun Li
2026
Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
Xinyu Tang | Yuliang Zhan | Zhixun Li | Xin Zhao | Zhenduo Zhang | Zujie Wen | Zhiqiang Zhang | Jun Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable rewards (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct ***sample polarities***. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the polarity level and the token level affects RLVR training. Based on these insights, we propose an **A**daptive and **A**symmetric token-level **A**dvantage shaping method for **P**olicy **O**ptimization, namely **A3PO**, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.
Exploring Reasoning Reward Model for Agents
Kaixuan Fan | Kaituo Feng | Manyuan Zhang | Tianshuo Peng | Zhixun Li | Yilei Jiang | Shuang Chen | Xiangyu Yue
Findings of the Association for Computational Linguistics: ACL 2026
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets will be released to facilitate future research.
RealChart2Code: Bridging the Gap in Real-World Chart-to-Code Generation via Multi-Task Evaluation
Jiajun Zhang | Yuying Li | Zhixun Li | Xingyu Guo | Jingzhuo Wu | Leqi Zheng | Yiran Yang | Jianke Zhang | Qingbin Li | Shannan Yan | Changguo Jia | Junfei Wu | Zilei Wang | Qiang Liu | Liang Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
AdaTooler-V: Adaptive Tool-Use for Images and Videos
Chaoyang Wang | Kaituo Feng | Dongyang Chen | Zhongyu Wang | Zhixun Li | Sicheng Gao | Meng Meng | Xu Zhou | Manyuan Zhang | Yuzhang Shang | Xiangyu Yue
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, which outperforms existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro.
Co-authors
- Kaituo Feng 2
- Xiangyu Yue 2
- Manyuan Zhang 2
- Shuang Chen 1
- Dongyang Chen 1
- Kaixuan Fan 1
- Sicheng Gao 1
- Xingyu Guo 1
- Changguo Jia 1
- Yilei Jiang 1
- Yuying Li 1
- Qingbin Li 1
- Qiang Liu 1
- Meng Meng 1
- Tianshuo Peng 1
- Yuzhang Shang 1
- Xinyu Tang 1
- Zilei Wang 1
- Liang Wang 1
- Chaoyang Wang 1
- Zhongyu Wang 1
- Zujie Wen 1
- Jingzhuo Wu 1
- Junfei Wu 1
- Shannan Yan 1
- Yiran Yang 1
- Yuliang Zhan 1
- Zhenduo Zhang 1
- Zhiqiang Zhang 1
- Jiajun Zhang 1
- Jianke Zhang 1
- Wayne Xin Zhao 1
- Leqi Zheng 1
- Jun Zhou 1
- Xu Zhou 1