Ling Yang
2026
TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning
Jinyang Wu | Chonghua Liao | Mingkuan Feng | Shuai Zhang | Zhengqi Wen | Haoran Luo | Ling Yang | Huazhe Xu | Jianhua Tao
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO often rely on unstructured self-sampling to fit scalar rewards, producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address these limitations, we propose **TemplateRL**, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic patterns, stabilizing training dynamics and enhancing RL sampling efficiency. Notably, the explicit template library is interpretable, editable, and supports online updates, enabling continuous refinement during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization, highlighting its potential for broader tasks.
Beyond Examples: Towards Automated Thought-level In-Context Reasoning for Large Language Models
Jinyang Wu | Mingkuan Feng | Shuai Zhang | Feihu Che | Zhengqi Wen | Chonghua Liao | Ling Yang | Haoran Luo | Zheng Lian | Jianhua Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In-context learning (ICL) leverages demonstrations to enhance the performance of large language models (LLMs). However, traditional ICL struggles with complex reasoning mainly due to superficial, example-level implicit imitation. To address these limitations, we introduce **ThoughtICR**, an automated **Thought**-level **I**n-**C**ontext **R**easoning paradigm that shifts from surface-level examples to more guidance-oriented thought patterns. Specifically, we first define atomic reasoning actions and construct thought patterns on small-scale seed data using Monte Carlo Tree Search (MCTS). During inference, we dynamically select appropriate thought patterns based on target problem attributes, providing explicit guidance for model reasoning. Thanks to its automated and strategic design, our method enables seamless plug-and-play integration with various post-training techniques. Experimental results demonstrate that our method improves performance across different model sizes and generalizes effectively across reasoning domains. Using only small-scale seed data, we achieve 80.6% accuracy on MATH and 62.5% on AMC, surpassing GPT-4o’s 77.2% and 57.5%, respectively. Moreover, compared to test-time scaling methods, our approach reduces computational costs by over 10. Our code is available at https://github.com/jinyangwu/ThoughtICR.
2025
Temporal Consistency for LLM Reasoning Process Error Identification
Jiacheng Guo | Yue Wu | Jiahao Qiu | Kaixuan Huang | Xinzhe Juan | Ling Yang | Mengdi Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek-R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to DeepSeek-R1.
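The iterative refinement described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `verify` callback, the integer judgment (for example, the index of the first erroneous step), and the `patience` stopping rule are all assumptions introduced here for illustration.

```python
from typing import Callable, Optional


def verify_until_stable(
    solution: str,
    verify: Callable[[str, Optional[int]], int],
    patience: int = 2,
    max_rounds: int = 10,
) -> int:
    """Re-run a verifier, feeding back its previous judgment, until it stops changing.

    `verify` is a hypothetical callback: it receives the solution and the
    previous judgment (None on the first round) and returns a new judgment.
    The loop stops once the judgment has repeated `patience` times in a row
    (an assumed stopping rule), or after `max_rounds` rounds.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    previous: Optional[int] = None
    stable_rounds = 0
    for _ in range(max_rounds):
        current = verify(solution, previous)
        # Count how many consecutive rounds agreed with the round before.
        stable_rounds = stable_rounds + 1 if current == previous else 0
        if stable_rounds >= patience:
            return current
        previous = current

    # No stable judgment within the budget: fall back to the last one.
    return previous
```

In practice `verify` would wrap an LLM call whose prompt includes the earlier assessment; here it is left abstract so the control flow stays visible.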
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Jiahao Qiu | Yifu Lu | Yifan Zeng | Jiacheng Guo | Jiayi Geng | Chenhao Zhu | Xinzhe Juan | Ling Yang | Huazheng Wang | Kaixuan Huang | Yue Wu | Mengdi Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning, but it must balance computational efficiency against output quality. Best-of-N (BoN) sampling, a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but at a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into BoN sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, UltraFeedback, GSM8K, HH-RLHF, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves a 65% win rate at maximum lengths of 192 and 384 tokens, outperforming standard BoN with the same computational cost. Furthermore, TreeBoN achieves around a 60% win rate across longer responses, showcasing its scalability and alignment efficacy.
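For readers unfamiliar with the baseline that this abstract builds on, a minimal sketch of plain Best-of-N sampling follows. It shows only standard BoN, not TreeBoN's tree search or DPO-based token rewards; the `generate` and `reward` callbacks are hypothetical placeholders for a language model and a reward model.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidate responses and return the highest-scoring one.

    `generate` (prompt -> response) and `reward` ((prompt, response) -> score)
    are assumed callbacks supplied by the caller.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Draw n independent samples, then score each complete response.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))
```

The cost grows linearly with `n` because every candidate is generated in full before scoring, which is the overhead the abstract says TreeBoN aims to reduce by pruning partial responses early.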
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration
Tianyi Bai | Ling Yang | Zhen Hao Wong | Fupeng Sun | Xinlin Zhuang | Jiahui Peng | Chi Zhang | Lijun Wu | Qiu Jiantao | Wentao Zhang | Binhang Yuan | Conghui He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Efficient data selection is crucial to accelerate the pretraining of language models (LMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LM pretraining. To tackle this problem, we propose a multi-actor collaborative data selection mechanism. Each data selection method independently prioritizes data based on its specific criterion and updates its prioritization rules using the current state of the model, functioning as an independent actor for data selection. Additionally, a console is designed to adjust the impacts of different actors at various stages and dynamically integrate information from all actors throughout the LM pretraining process. We conduct extensive empirical studies to evaluate our multi-actor framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LM pretraining, and achieves an average relative performance gain of up to 10.5% across multiple language model benchmarks compared to state-of-the-art methods.
EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
Jiahao Qiu | Yinghui He | Xinzhe Juan | Yimin Wang | Yuhan Liu | Zixin Yao | Yue Wu | Xun Jiang | Ling Yang | Mengdi Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components. **EmoEval** simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLMs. **EmoGuard** serves as an intermediary, monitoring users’ mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted on popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions.
Co-authors
- Xinzhe Juan 3
- Jiahao Qiu 3
- Mengdi Wang 3
- Yue Wu 3
- Mingkuan Feng 2
- Jiacheng Guo 2
- Kaixuan Huang 2
- Chonghua Liao 2
- Haoran Luo 2
- Jianhua Tao 2
- Zhengqi Wen 2
- Jinyang Wu 2
- Shuai Zhang 2
- Tianyi Bai 1
- Feihu Che 1
- Jiayi Geng 1
- Conghui He 1
- Yinghui He 1
- Xun Jiang 1
- Qiu Jiantao 1
- Zheng Lian 1
- Yuhan Liu 1
- Yifu Lu 1
- Jiahui Peng 1
- Fupeng Sun 1
- Huazheng Wang 1
- Yimin Wang 1
- Zhen Hao Wong 1
- Lijun Wu 1
- Huazhe Xu 1
- Zixin Yao 1
- Binhang Yuan 1
- Yifan Zeng 1
- Chi Zhang 1
- Wentao Zhang 1
- Chenhao Zhu 1
- Xinlin Zhuang 1