Don’t Tell the Answer, Truly Guide the Reasoning During RL Rollouts

Xinyi Wang; Jinyi Han; Zishang Jiang; Tingyun li; Jiaqing Liang; Sihang Jiang; Zhaoqian Dai; Ma Shuguang; Fei Yu; Yanghua Xiao

Abstract

Reinforcement learning (RL) has emerged as a key approach for improving long chain-of-thought (CoT) reasoning in large language models (LLMs). However, existing methods such as GRPO often break down when task difficulty exceeds the model’s capacity, resulting in sparse rewards and inefficient training. While prior work attempts to address this issue using off-policy data, it frequently introduces distributional mismatch, leading to unstable policy updates. In this work, we identify a fundamental issue underlying these limitations, which we term *low training affinity*, and propose **Affinity**, the first quantitative metric for measuring the compatibility between external guidance and a model’s intrinsic policy. Based on this insight, we introduce **HINT**, an adaptive framework designed to enhance reasoning performance while explicitly preserving high Affinity. HINT consists of two key components. First, instead of providing partial answers, it introduces **Meta-Hints**, which serve as abstract cognitive scaffolding that guides the model to independently construct solutions. Second, we propose **Affinity-Aware Policy Optimization (AAPO)**, which dynamically adjusts the learning objective based on the Affinity signal to ensure stable training. Extensive experiments across diverse benchmarks demonstrate that HINT consistently outperforms strong baselines, while achieving improved training stability and robust generalization to out-of-distribution tasks. Code is available at: https://github.com/ViviqwerAsd/HINT

Anthology ID: 2026.findings-acl.169
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3437–3455
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.169/
Cite (ACL): Xinyi Wang, Jinyi Han, Zishang Jiang, Tingyun li, Jiaqing Liang, Sihang Jiang, Zhaoqian Dai, Ma Shuguang, Fei Yu, and Yanghua Xiao. 2026. Don’t Tell the Answer, Truly Guide the Reasoning During RL Rollouts. In Findings of the Association for Computational Linguistics: ACL 2026, pages 3437–3455, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Don’t Tell the Answer, Truly Guide the Reasoning During RL Rollouts (Wang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.169.pdf
Checklist: 2026.findings-acl.169.checklist.pdf
