Jinyi Han
2026
Don’t Tell the Answer, Truly Guide the Reasoning During RL Rollouts
Xinyi Wang | Jinyi Han | Zishang Jiang | Tingyun li | Jiaqing Liang | Sihang Jiang | Zhaoqian Dai | Ma Shuguang | Fei Yu | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning (RL) has emerged as a key approach for improving long chain-of-thought (CoT) reasoning in large language models (LLMs). However, existing methods such as GRPO often break down when task difficulty exceeds the model’s capacity, resulting in sparse rewards and inefficient training. While prior work attempts to address this issue using off-policy data, it frequently introduces distributional mismatch, leading to unstable policy updates. In this work, we identify a fundamental issue underlying these limitations, which we term *low training affinity*, and propose **Affinity**, the first quantitative metric for measuring the compatibility between external guidance and a model’s intrinsic policy. Based on this insight, we introduce **HINT**, an adaptive framework designed to enhance reasoning performance while explicitly preserving high Affinity. HINT consists of two key components. First, instead of providing partial answers, it introduces **Meta-Hints**, which serve as abstract cognitive scaffolding that guides the model to independently construct solutions. Second, we propose **Affinity-Aware Policy Optimization (AAPO)**, which dynamically adjusts the learning objective based on the Affinity signal to ensure stable training. Extensive experiments across diverse benchmarks demonstrate that HINT consistently outperforms strong baselines, while achieving improved training stability and robust generalization to out-of-distribution tasks. Code is available at: https://github.com/ViviqwerAsd/HINT
ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models
Tingyun li | Zishang Jiang | Jinyi Han | Xinyi Wang | Sihang Jiang | Han Xia | Zhaoqian Dai | Ma Shuguang | Fei Yu | Jiaqing Liang | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency–performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency–performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
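The two mechanisms named in the ADaPT abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-way `fast`/`slow` mode vocabulary, the additive logit bias, and the choice to give every token the same correctness reward are all assumptions made for illustration. The sketch only shows (a) an efficiency bonus attached to the mode-selection token alone, and (b) how shifting that token's logit changes the probability of choosing slow reasoning.

```python
import math


def token_level_rewards(num_tokens, mode_index, correctness, efficiency_bonus):
    """Hypothetical reward assignment for one sampled sequence.

    Assumption: every token shares the sequence's correctness reward, and
    the efficiency bonus is added only at the mode-selection token.
    """
    rewards = [correctness] * num_tokens
    rewards[mode_index] += efficiency_bonus
    return rewards


def mode_probabilities(logits, slow_bias):
    """Softmax over the two mode-token logits after biasing the 'slow' logit.

    Assumption: inference-time control is modeled as an additive logit bias;
    the abstract only says the generation probability is adjusted.
    """
    shifted = {"fast": logits["fast"], "slow": logits["slow"] + slow_bias}
    peak = max(shifted.values())  # subtract the max for numerical stability
    exps = {mode: math.exp(value - peak) for mode, value in shifted.items()}
    total = sum(exps.values())
    return {mode: value / total for mode, value in exps.items()}


if __name__ == "__main__":
    # The efficiency bonus touches only position 0, the mode-selection token.
    print(token_level_rewards(4, 0, correctness=1.0, efficiency_bonus=0.5))
    # A positive bias raises the chance of slow reasoning, a negative one lowers it.
    for bias in (-2.0, 0.0, 2.0):
        print(bias, mode_probabilities({"fast": 0.0, "slow": 0.0}, bias))
```

Under these assumptions, a long but correct reasoning trace receives no length-based penalty on its reasoning tokens, and sweeping `slow_bias` moves a single model between cheaper and more thorough behavior.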
2025
CDS: Data Synthesis Method Guided by Cognitive Diagnosis Theory
Haokun Zhao | Jinyi Han | Jiaqing Liang | Yanghua Xiao | Xiaojun Meng | Jiansheng Wei
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have achieved significant advancements, but the increasing complexity of tasks and higher performance demands highlight the need for continuous improvement. Some approaches utilize synthetic data generated by advanced LLMs based on evaluation results to train models. However, conventional evaluation methods fail to provide detailed, fine-grained profiles of LLMs, limiting their guidance for data synthesis. In this paper, we introduce the **Cognitive Diagnostic Synthesis** (CDS) method, which incorporates a diagnostic process inspired by **Cognitive Diagnosis Theory** (CDT) to refine evaluation results and characterize model profiles at the knowledge component level. Based on these diagnostics, we propose two diagnosis-synthesis strategies for weakness-targeted data synthesis. Additionally, we present an enhanced data augmentation and selection pipeline to improve the quality and diversity of synthesized data. Our experiments with several open-source models show significant improvements across multiple benchmarks, achieving up to 6.00% improvement in code generation, 13.10% in mathematical reasoning, and 5.43% in academic exams. Code and data are available at: https://anonymous.4open.science/r/cds-04D1