Yundi Qian
2026
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Yun He | Wenzhe Li | Hejia Zhang | Songlin Li | Karishma Mandyam | Sopan Khosla | Yuanhao Xiong | Nanshu Wang | Xiaoliang Peng | Beibin Li | Shengjie Bi | Shishir G Patil | Qi Qi | Shengyu Feng | Julian Katz-Samuels | Richard Yuanzhe Pang | Sujan Kumar Gonugondla | Hunter Lang | Yue Yu | Yundi Qian | Maryam Fazel-Zarandi | Licheng Yu | Amine Benhalloum | Hany Hassan Awadalla | Manaal Faruqui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)—especially for complex, multi-turn, and system-prompted instructions—remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF, a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs’ ability to follow complex, multi-turn, and system-level instructions. We also open-source the evaluation script of AdvancedIF. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
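As a rough illustration of how a rubric can be turned into a reward signal, the sketch below scores a response against a list of criteria. It is not the RIFL implementation: the abstract does not specify how verifier verdicts are aggregated, so the `Criterion` class, the `rubric_reward` function, the simple lambda checks standing in for the finetuned rubric verifier, and both aggregation modes (all-or-nothing and fraction satisfied) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item. `check` is a hypothetical stand-in for a learned verifier."""
    description: str
    check: Callable[[str], bool]


def rubric_reward(response: str, rubric: List[Criterion],
                  all_or_nothing: bool = True) -> float:
    """Aggregate per-criterion verdicts into a scalar reward (assumed scheme)."""
    verdicts = [criterion.check(response) for criterion in rubric]
    if not verdicts:
        return 0.0
    if all_or_nothing:
        # Reward 1.0 only when every criterion is satisfied.
        return 1.0 if all(verdicts) else 0.0
    # Otherwise return the fraction of criteria satisfied.
    return sum(verdicts) / len(verdicts)


# Toy rubric with hand-written checks (illustrative only).
rubric = [
    Criterion("Response is at most 20 words", lambda r: len(r.split()) <= 20),
    Criterion("Response mentions Paris", lambda r: "paris" in r.lower()),
    Criterion("Response avoids exclamation marks", lambda r: "!" not in r),
]

response = "The capital of France is Paris!"
strict = rubric_reward(response, rubric)
partial = rubric_reward(response, rubric, all_or_nothing=False)
```

Here the response satisfies two of the three criteria, so the strict reward is zero while the fractional reward is two thirds. Which aggregation works better for reinforcement learning is an empirical question that this sketch does not answer.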
2025
Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu | Zhengxing Chen | Aston Zhang | Liang Tan | Chenguang Zhu | Richard Yuanzhe Pang | Yundi Qian | Xuewei Wang | Suchin Gururangan | Chao Zhang | Melanie Kambadur | Dhruv Mahajan | Rui Hou
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of the generated critiques.
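The joint fine-tuning on reward prediction and critique generation described above can be pictured as a weighted sum of two losses. The sketch below is one plausible form under stated assumptions, not the Critic-RM objective itself: the abstract does not give the loss, so the Bradley-Terry pairwise loss (a common choice in reward modeling), the mean token negative log-likelihood for the critique, the `critique_weight` parameter, and all function names are hypothetical.

```python
import math
from typing import Sequence


def pairwise_reward_loss(chosen_score: float, rejected_score: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(chosen - rejected)), computed stably."""
    margin = chosen_score - rejected_score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def critique_lm_loss(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the critique tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def joint_loss(chosen_score: float, rejected_score: float,
               token_log_probs: Sequence[float],
               critique_weight: float = 0.5) -> float:
    """Assumed combination: reward loss plus a weighted critique loss."""
    reward_term = pairwise_reward_loss(chosen_score, rejected_score)
    critique_term = critique_lm_loss(token_log_probs)
    return reward_term + critique_weight * critique_term


# Equal scores give a reward loss of log(2); the critique term adds 0.5 * 2.0.
example = joint_loss(0.0, 0.0, [-1.0, -3.0], critique_weight=0.5)
```

In a real training setup the scores and token log-probabilities would come from the model being fine-tuned, and the weight would be tuned or scheduled; the fixed value here is only a placeholder.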
Co-authors
- Richard Yuanzhe Pang 2
- Yue Yu 2
- Amine Benhalloum 1
- Shengjie Bi 1
- Zhengxing Chen 1
- Manaal Faruqui 1
- Maryam Fazel-Zarandi 1
- Shengyu Feng 1
- Sujan Kumar Gonugondla 1
- Suchin Gururangan 1
- Hany Hassan Awadalla 1
- Yun He 1
- Rui Hou 1
- Melanie Kambadur 1
- Julian Katz-Samuels 1
- Sopan Khosla 1
- Hunter Lang 1
- Beibin Li 1
- Songlin Li 1
- Wenzhe Li 1
- Dhruv Mahajan 1
- Karishma Mandyam 1
- Shishir G Patil 1
- Xiaoliang Peng 1
- Qi Qi 1
- Liang Tan 1
- Nanshu Wang 1
- Xuewei Wang 1
- Yuanhao Xiong 1
- Licheng Yu 1
- Aston Zhang 1
- Chao Zhang 1
- Hejia Zhang 1
- Chenguang Zhu 1