Qianyu Wang

2026

PARIF: Pushing the Pareto Frontier of Instruction Following and Reasoning with Curriculum Reinforcement Learning
Rongchuan Mu | Zexin Wang | Qianyu Wang | MingHua Ma | Zekun Wang | Ming Liu | Bing Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Reasoning Models (LRMs) excel at complex problem-solving but frequently overlook specific instruction constraints. Existing alignment methods struggle to balance general reasoning with instruction-following (IF), hindered by dependency on teacher models, reward hacking, and reasoning-answer inconsistencies. We propose PARIF, a two-stage curriculum learning framework based on Reinforcement Learning from Verifiable Rewards (RLVR) to enhance both IF and general reasoning capabilities. The framework employs a correctness proxy across different stages to mitigate reward hacking. Stage I employs a dynamic weighting strategy simultaneously to optimize the model’s reasoning paradigm regarding constraints. Stage II introduces Decoupled-GRPO, which builds upon the first stage to enhance the logical consistency between the reasoning process and the final answer, enabling the model to better leverage its optimized reasoning paradigm. To support the framework, we curate 26,000 high-quality instructions featuring diverse constraints. Extensive experiments demonstrate PARIF’s effectiveness: our 7B model achieves a remarkable 21.25% relative average improvement over the original model across six representative IF tasks, while our 8B model outperforms leading models like DeepSeek-V3 on these IF tasks, effectively pushing the Pareto frontier of instruction following and reasoning for models of comparable scale. We open-source our code and models to facilitate future research.

2025

Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the absence of intermediate guidance often causes inaccurate retrieval and intermediate reasoning errors, resulting in incorrect reasoning. To address these issues, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition, while also being able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by 8.6%. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available at https://github.com/zchuz/SiGIR-MHQA.

Entity alignment (EA), critical for knowledge graph (KG) integration, identifies equivalent entities across different KGs. Traditional methods often face challenges in semantic understanding and scalability. The rise of language models (LMs), particularly large language models (LLMs), has provided powerful new strategies. This paper systematically reviews LM-driven EA methods, proposing a novel taxonomy that categorizes methods by three key stages: data preparation, feature embedding, and alignment. We further summarize key benchmarks and evaluation metrics, and discuss future directions. This paper aims to provide researchers and practitioners with a clear and comprehensive understanding of how language models reshape the field of entity alignment.

Co-authors

Zerui Chen 1

Zheng Chu 1

Tao He 1

Hao Li 1

Ze Li 1
