Jinyi Han

2026

Reinforcement learning (RL) has emerged as a key approach for improving long chain-of-thought (CoT) reasoning in large language models (LLMs). However, existing methods such as GRPO often break down when task difficulty exceeds the model’s capacity, resulting in sparse rewards and inefficient training. While prior work attempts to address this issue using off-policy data, it frequently introduces distributional mismatch, leading to unstable policy updates. In this work, we identify a fundamental issue underlying these limitations, which we term *low training affinity*, and propose **Affinity**, the first quantitative metric for measuring the compatibility between external guidance and a model’s intrinsic policy. Based on this insight, we introduce **HINT**, an adaptive framework designed to enhance reasoning performance while explicitly preserving high Affinity. HINT consists of two key components. First, instead of providing partial answers, it introduces **Meta-Hints**, which serve as abstract cognitive scaffolding that guides the model to independently construct solutions. Second, we propose **Affinity-Aware Policy Optimization (AAPO)**, which dynamically adjusts the learning objective based on the Affinity signal to ensure stable training. Extensive experiments across diverse benchmarks demonstrate that HINT consistently outperforms strong baselines, while achieving improved training stability and robust generalization to out-of-distribution tasks. Code is available at: https://github.com/ViviqwerAsd/HINT

Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency–performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency–performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
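
The abstract says the trade-off is steered by adjusting the generation probability of the mode-selection token, but does not say how the adjustment is made. One common way to do this, assumed here purely for illustration, is to add a bias to that token's logit before the softmax. The function name, the two-token toy vocabulary, and the `<fast>`/`<slow>` labels below are hypothetical and are not taken from the paper.

```python
import math


def shift_mode_probability(logits, mode_token_id, bias):
    """Return next-token probabilities after adding `bias` to one token's logit.

    Illustrative assumption: a logit bias is one possible way to change the
    generation probability of a mode-selection token; the paper's actual
    mechanism is not specified in the abstract.
    """
    adjusted = list(logits)
    adjusted[mode_token_id] += bias
    # Numerically stable softmax over the adjusted logits.
    peak = max(adjusted)
    exps = [math.exp(x - peak) for x in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy vocabulary: index 0 = hypothetical <fast>, index 1 = hypothetical <slow>.
    logits = [0.0, 0.0]
    for bias in (-2.0, 0.0, 2.0):
        probs = shift_mode_probability(logits, mode_token_id=1, bias=bias)
        print(f"bias={bias:+.1f}  P(<slow>)={probs[1]:.3f}")
```

Under this assumption, sweeping `bias` from negative to positive moves the probability of selecting slow reasoning monotonically from near 0 to near 1, which is one way a single model could be moved along an efficiency–performance curve at inference time.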

2025

Large Language Models (LLMs) have achieved significant advancements, but the increasing complexity of tasks and higher performance demands highlight the need for continuous improvement. Some approaches utilize synthetic data generated by advanced LLMs based on evaluation results to train models. However, conventional evaluation methods fail to provide detailed, fine-grained profiles of LLMs, limiting their guidance for data synthesis. In this paper, we introduce the **Cognitive Diagnostic Synthesis** (CDS) method, which incorporates a diagnostic process inspired by **Cognitive Diagnosis Theory** (CDT) to refine evaluation results and characterize model profiles at the knowledge component level. Based on these diagnostics, we propose two diagnosis-synthesis strategies for weakness-targeted data synthesis. Additionally, we present an enhanced data augmentation and selection pipeline to improve the quality and diversity of synthesized data. Our experiments with several open-source models show significant improvements across multiple benchmarks, achieving up to 6.00% improvement in code generation, 13.10% in mathematical reasoning, and 5.43% in academic exams. Code and data are available at https://anonymous.4open.science/r/cds-04D1.

Co-authors

Fei Yu 2

Han Xia 1

Venues

Findings 3
