Bo Zhou
2026
Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees
Kun Li | Zenan Xu | Junan Li | Zengrui Jin | Jinghao Deng | Zexuan Qiu | Bo Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model’s intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without additional human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors during training. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
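The tree-based crediting idea in this abstract can be illustrated with a minimal sketch. The abstract does not give DART's actual estimator, so everything here is an assumption made for illustration: the `Node` class, the use of mean leaf reward as a node's value, and the definition of a branch's advantage as its value minus its parent's value.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One segment of a rollout (hypothetical); only leaves carry a reward."""
    name: str
    reward: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf below (or at) this node."""
    if not node.children:
        return [float(node.reward)]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Assumed value estimate: mean outcome reward over the node's leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def branch_advantages(root: Node) -> Dict[str, float]:
    """Assumed advantage of each branch: its value minus its parent's value."""
    advantages: Dict[str, float] = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            advantages[child.name] = value(child) - value(parent)
            stack.append(child)
    return advantages


# Toy tree: the tool-using branch reaches correct answers, the other does not.
tree = Node("root", children=[
    Node("no_tool", children=[Node("a", 0.0), Node("b", 0.0)]),
    Node("tool_call", children=[Node("c", 1.0), Node("d", 1.0)]),
])
advantages = branch_advantages(tree)
```

In this toy tree the branch that invokes the tool receives a positive advantage and its sibling a negative one, which is the kind of sub-trajectory signal the abstract describes.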
Rhombus: Incentivizing Coordination in Parallel Thinking through Reinforcement Learning
Ziyuan Nan | Qi Yi | Di Huang | Yutong Wu | Guanhua Huang | Xue Gong | Kejiao Li | Yuhao Jiang | Chenchen Zhang | Zenan Xu | Xing Hu | Bo Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Parallel thinking offers a promising avenue for scaling test-time compute in Large Language Models (LLMs), enabling them to explore diverse solution paths simultaneously before aggregating them into a final answer. However, coordinating the exploration and aggregation stages remains challenging, as simple aggregation techniques often incur information loss, failing to preserve the subtle, decision-relevant signals generated during exploration. To overcome this, we propose Rhombus, a parallel thinking framework that explicitly incentivizes coordination between components via end-to-end reinforcement learning. Rhombus employs multiple parallel Proposers to generate compact, decision-focused reasoning cues and a central Synthesizer to integrate them into final predictions, utilizing co-training under a shared task reward to align their interaction. Across challenging mathematical reasoning benchmarks, Rhombus improves accuracy by 6.0% over long chain-of-thought baselines while reducing wall-clock latency by 39.4% under matched token budgets. Our work demonstrates that explicit communication optimization is essential for realizing the accuracy and efficiency gains of parallel reasoning.
AT²PO: Agentic Turn-based Policy Optimization via Tree Search
Zefang Zong | Dingwei Chen | Yang Li | Qi Yi | Bo Zhou | Chengming Li | BO Qian | Peng Chen | Jie Jiang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT²PO (**A**gentic **T**urn-based **P**olicy **O**ptimization via **T**ree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT²PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component.
Reinforcement Learning on Pre-Training Data
Siheng Li | Kejiao Li | Zenan Xu | Guanhua Huang | Kun Li | Haoyuan Wu | Wujiajia | Zihao Zheng | Chenchen Zhang | Kun Shi | Xue Gong | Qi Yi | Ruibin Xiong | Tingqiang Xu | Yuhao Jiang | Jianfeng Yan | Yuyuan Zeng | Guanghui Xu | Jinbao Xue | Zhijiang xu | Zheng Fang | Shuai LI | Qibin Liu | Xiaoxue Li | Zhuoyu Li | Yangyu Tao | Fei Gao | Cheng Jiang | Bochao Wang | Kai Liu | Jianchen Zhu | Wai Lam | Bo Zhou | Di Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) is largely driven by scaling training compute through either pre-training with next-token prediction (NTP) or post-training with reinforcement learning (RL). The former contributes to learning broad knowledge and skills from general data, while struggling with data inefficiency and catastrophic forgetting in continual learning settings. The latter incentivizes reasoning capabilities with strong generalization, but is constrained by limited data availability due to its reliance on human annotation. To alleviate these issues, we propose Reinforcement Learning on Pre-Training data (RLPT), which combines the advantages of learning from general data and RL. In particular, RLPT derives reward signals directly from general text data through a next-segment reasoning objective, rewarding the policy for correctly predicting next text segments conditioned on the prefix text. Experiments across multiple benchmarks and models demonstrate the effectiveness of RLPT. For example, RLPT yields substantial improvements in continual pre-training (+4.6%) and provides a strong foundation for post-training (+3.4%) on Qwen3-8B-Base.
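The next-segment objective can be pictured with a small sketch. The abstract does not say how text is segmented or how a prediction is judged correct, so both choices below are assumptions for illustration only: sentence-level splitting in `split_segments` and a normalized exact-match check in `segment_reward`. All function names are hypothetical.

```python
import re
from typing import List, Tuple


def split_segments(text: str) -> List[str]:
    """Split text into sentence-like segments (a simplifying assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_examples(text: str) -> List[Tuple[str, str]]:
    """Pair each prefix with the segment that follows it in the raw text."""
    segments = split_segments(text)
    return [(" ".join(segments[:i]), segments[i]) for i in range(1, len(segments))]


def normalize(segment: str) -> str:
    """Lowercase and collapse whitespace before comparing segments."""
    return " ".join(segment.lower().split())


def segment_reward(predicted: str, reference: str) -> float:
    """Assumed reward: 1.0 when the prediction matches the true next segment."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


examples = make_examples("A b. C d. E f.")
prefix, target = examples[0]
```

Each `(prefix, target)` pair comes from unannotated text, which is the point the abstract makes about avoiding human labels. A real system would likely need a softer correctness check than exact match.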
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Guanhua Huang | Tingqiang Xu | Mingze Wang | Qi Yi | Xue Gong | Siheng Li | Ruibin Xiong | Kejiao Li | Yuhao Jiang | Bo Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. While previous methods attempt to maintain high entropy, we argue that unselective entropy maximization risks amplifying irrelevant noise rather than fostering meaningful exploration. In this paper, we identify a deeper issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks, driven by RLVR over-penalization. To address this, we introduce Low-probability Regularization (Lp-Reg). Leveraging the statistical distinction where reasoning sparks exhibit higher probabilities than noise, Lp-Reg filters out the extremely low-probability noise tokens and prevents the suppression of potentially valuable low-probability candidates. Experiments demonstrate that Lp-Reg enables stable on-policy training for over 3,000 steps (81,204 GPU-hours), sustaining exploration in regimes where baselines typically collapse. Validated across extensive evaluations totaling over 300,000 cumulative GPU-hours, Lp-Reg demonstrates highly competitive performance in off-policy settings and consistently achieves state-of-the-art results in on-policy training across diverse model families, sizes, and domains, with relative accuracy improvements ranging from 3.06% to 7.98%.
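The filtering step mentioned in this abstract can be sketched as follows. This shows only one plausible reading: tokens below a probability threshold are treated as noise and removed, and the remaining mass is renormalized. The threshold `tau`, the renormalization, and the function name are assumptions, and the way Lp-Reg uses the filtered distribution in its regularizer is not described in the abstract and is not modeled here.

```python
from typing import List


def filter_low_prob(probs: List[float], tau: float) -> List[float]:
    """Zero out tokens with probability below tau, then renormalize.

    tau is a hypothetical threshold separating presumed noise tokens from
    low-probability candidates that should be kept.
    """
    kept = [p if p >= tau else 0.0 for p in probs]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("tau removed every token; choose a smaller threshold")
    return [p / total for p in kept]


# The 0.01 token is dropped as noise; the 0.04 token survives and gains mass.
filtered = filter_low_prob([0.7, 0.25, 0.04, 0.01], tau=0.02)
```

The low-probability token that clears the threshold ends up with slightly more mass than before, while the token below it is removed entirely.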
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
Yuhang Li | Chenchen Zhang | Ruilin Lv | Ao Liu | Ken Deng | Yuanxing Zhang | Jiaheng Liu | Bo Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate–diagnose–refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent employs an MLLM-in-the-loop to serve as a visual critic, evaluating code via screenshots and providing actionable feedback. Crucially, we enforce a strict zero-reward policy for invalid renders to guarantee renderability and mitigate reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training–inference decoupling.
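The two rules named in this abstract, zero reward for invalid renders and acceptance of improving revisions only, can be combined in a short sketch. The scoring scale, the `(score, renders)` tuple layout, and the function names are assumptions for illustration; in the paper the score would come from the MLLM critic.

```python
from typing import List, Tuple


def reward(score: float, renders: bool) -> float:
    """Zero-reward rule: a revision that fails to render earns nothing."""
    return score if renders else 0.0


def forced_optimization(revisions: List[Tuple[float, bool]]) -> List[float]:
    """Keep only revisions whose reward strictly exceeds the best so far."""
    accepted: List[float] = []
    best = float("-inf")
    for score, renders in revisions:
        r = reward(score, renders)
        if r > best:
            accepted.append(r)
            best = r
    return accepted


# (critic score, did the page render?) for four successive revisions
history = [(0.4, True), (0.9, False), (0.3, True), (0.6, True)]
accepted = forced_optimization(history)
```

The second revision has the highest raw score but fails to render, so it is rejected along with the regressing third revision, and the accepted rewards increase monotonically.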
Co-authors
- Qi Yi 4
- Xue Gong 3
- Guanhua Huang 3
- Yuhao Jiang 3
- Kejiao Li 3
- Zenan Xu 3
- Chenchen Zhang 3
- Kun Li 2
- Siheng Li 2
- Ruibin Xiong 2
- Tingqiang Xu 2
- Dingwei Chen 1
- Peng Chen 1
- Jinghao Deng 1
- Ken Deng 1
- Zheng Fang 1
- Fei Gao 1
- Xing Hu 1
- Di Huang 1
- Jie Jiang 1
- Cheng Jiang 1
- Zengrui Jin 1
- Shuai LI 1
- Wai Lam 1
- Junan Li 1
- Yang Li 1
- Chengming Li 1
- Xiaoxue Li 1
- Zhuoyu Li 1
- Yuhang Li 1
- Qibin Liu 1
- Kai Liu 1
- Ao Liu 1
- Jiaheng Liu 1
- Ruilin Lv 1
- Ziyuan Nan 1
- BO Qian 1
- Zexuan Qiu 1
- Kun Shi 1
- Yangyu Tao 1
- Bochao Wang 1
- Di Wang 1
- Mingze Wang 1
- Yutong Wu 1
- Haoyuan Wu 1
- Wujiajia 1
- Guanghui Xu 1
- Jinbao Xue 1
- Jianfeng Yan 1
- Yuyuan Zeng 1
- Yuanxing Zhang 1
- Zihao Zheng 1
- Jianchen Zhu 1
- Zefang Zong 1
- Zhijiang xu 1