Yuqiang Li
2026
WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement
Fangyuan Li | Pengfei Li | Shijie Wang | Junqi Gao | Jianxing Liu | Biqing Qi | Yuqiang Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improving language models, but existing methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present WIST, a Web-grounded Iterative Self-play Tree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree to structure exploration, then retrieves and cleans path-consistent web evidence to construct a controllable training environment. It then performs Challenger-Solver self-play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with Overall gains reaching +9.8 (Qwen3-4B-Base) and +9.7 (OctoThinker-8B-Hybrid-Base). WIST is also domain-steerable: it improves Qwen3-8B-Base by +14.79 in medicine and Qwen3-4B-Base by +5.28 on PhyBench. Ablations further confirm the importance of WIST’s key components for stable open-web learning. Our code is available at https://github.com/lfy-123/WIST.
MARS2: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
Pengfei Li | Shijie Wang | Fangyuan Li | Yikun Fu | Kaifeng Liu | Kaiyan Zhang | Dazhi Zhang | Yuqiang Li | Biqing Qi | Bowen Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, but it remains constrained by single-agent policy priors. Meanwhile, leveraging multiple interacting policies can yield more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS2 (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently optimized agents collaborate within a shared tree-structured search environment. MARS2 models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS2 consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Yujie Liu | Zonglin Yang | Tong Xie | Jinjie Ni | Ben Gao | Yuqiang Li | Shixiang Tang | Wanli Ouyang | Erik Cambria | Dongzhan Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have shown potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs on a sufficient set of scientific discovery sub-tasks—inspiration retrieval, hypothesis composition, and hypothesis ranking—where sufficient means that perfectly solving these sub-tasks perfectly solves the overall discovery task. We develop an automated LLM-based framework that extracts critical components—research questions, background surveys, inspirations, and hypotheses—from papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on publications from 2024 onward, ensuring minimal overlap with LLM pretraining data; our automated framework further enables automatic extraction of even more recent papers as LLM pretraining cutoffs advance, supporting scalable and contamination-free automatic renewal of this discovery benchmark. Our evaluation shows that, across disciplines, LLMs excel at inspiration retrieval—an out-of-distribution task—suggesting their ability to surface novel knowledge associations.
2025
LLaMA-Berry: Pairwise Optimization for Olympiad-level Mathematical Reasoning via O1-like Monte Carlo Tree Search
Di Zhang | Jianbo Wu | Jingdi Lei | Tong Che | Jiatong Li | Tong Xie | Xiaoshui Huang | Shufei Zhang | Marco Pavone | Yuqiang Li | Wanli Ouyang | Dongzhan Zhou
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
This paper presents LLaMA-Berry, an advanced mathematical reasoning framework to enhance the problem-solving ability of large language models (LLMs). The framework combines Monte Carlo Tree Search with Self-Refine (SR-MCTS) to optimize the reasoning paths and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, our SR-MCTS overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms, enabling a more efficient exploration of solution spaces. To guide the search process, we propose the Pairwise Preference Reward Model (PPRM), which predicts pairwise preferences between solutions through instruction-following capabilities trained by Reinforcement Learning from Human Feedback (RLHF). Finally, the Enhanced Borda Count (EBC) method is adopted to synthesize pairwise preferences into global quantile scores for evaluations. This approach mitigates the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior search efficiency and performance compared to existing open-source and closed-source methods, particularly in complex Olympiad-level benchmarks, including AIME24 and AMC23.
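The idea of turning pairwise preferences into a global score can be illustrated with a plain Borda count. This is a minimal sketch of the classical method only, not the paper's Enhanced Borda Count, whose details are not given in the abstract. The function name `borda_scores` and the input format (a 0/1 preference matrix where `prefs[i][j] == 1` means solution `i` is preferred to solution `j`) are assumptions made for illustration.

```python
from typing import List


def borda_scores(prefs: List[List[int]]) -> List[float]:
    """Plain Borda count over a pairwise preference matrix (illustrative only).

    prefs[i][j] == 1 means solution i is preferred to solution j.
    Diagonal entries are ignored. Each score is the fraction of the other
    solutions that a solution beats, so it lies in [0, 1].
    """
    n = len(prefs)
    if n == 0:
        return []
    if n == 1:
        # A single candidate has no opponents; give it the top score.
        return [1.0]
    # Count pairwise wins for each solution, skipping self-comparisons.
    wins = [sum(prefs[i][j] for j in range(n) if j != i) for i in range(n)]
    # Normalize by the number of opponents.
    return [w / (n - 1) for w in wins]


# Hypothetical example: solution 0 beats 1 and 2, solution 1 beats 2.
example = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
scores = borda_scores(example)
```

When the preferences form a consistent total order, as in the example, the normalized win fraction coincides with each solution's quantile among the candidates.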
Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
Haonan He | Yuchen Ren | Yining Tang | Ziyang Xu | Junxian Li | Minghao Yang | Di Zhang | Yuan Dong | Tao Chen | Shufei Zhang | Yuqiang Li | Nanqing Dong | Wanli Ouyang | Dongzhan Zhou | Peng Ye
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is publicly available at https://github.com/hhnqqq/Biology-Instructions.
Co-authors
- Wanli Ouyang 3
- Dongzhan Zhou 3
- Fangyuan Li 2
- Pengfei Li 2
- Biqing Qi 2
- Shijie Wang 2
- Tong Xie 2
- Di Zhang 2
- Shufei Zhang 2
- Erik Cambria 1
- Tong Che 1
- Tao Chen 1
- Yuan Dong 1
- Nanqing Dong 1
- Yikun Fu 1
- Junqi Gao 1
- Ben Gao 1
- Haonan He 1
- Xiaoshui Huang 1
- Jingdi Lei 1
- Jiatong Li 1
- Junxian Li 1
- Jianxing Liu 1
- Kaifeng Liu 1
- Yujie Liu 1
- Jinjie Ni 1
- Marco Pavone 1
- Yuchen Ren 1
- Yining Tang 1
- Shixiang Tang 1
- Jianbo Wu 1
- Ziyang Xu 1
- Minghao Yang 1
- Zonglin Yang 1
- Peng Ye 1
- Kaiyan Zhang 1
- Dazhi Zhang 1
- Bowen Zhou 1