Nanshu Wang
2026
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Yun He | Wenzhe Li | Hejia Zhang | Songlin Li | Karishma Mandyam | Sopan Khosla | Yuanhao Xiong | Nanshu Wang | Xiaoliang Peng | Beibin Li | Shengjie Bi | Shishir G Patil | Qi Qi | Shengyu Feng | Julian Katz-Samuels | Richard Yuanzhe Pang | Sujan Kumar Gonugondla | Hunter Lang | Yue Yu | Yundi Qian | Maryam Fazel-Zarandi | Licheng Yu | Amine Benhalloum | Hany Hassan Awadalla | Manaal Faruqui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)—especially for complex, multi-turn, and system-prompted instructions—remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF, a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs’ ability to follow complex, multi-turn, and system-level instructions. We also open-source the evaluation script of AdvancedIF. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
2025
Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation
Chengwei Qin | Wenxuan Zhou | Karthik Abinav Sankararaman | Nanshu Wang | Tengyu Xu | Alexander Radovic | Eryk Helenowski | Arya Talebzadeh | Aditya Tayade | Sinong Wang | Shafiq Joty | Han Fang | Hao Ma
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Hallucination, the generation of factually incorrect information, remains a significant challenge for large language models (LLMs), especially in open-domain long-form generation. Existing approaches for detecting hallucination in long-form tasks either focus on limited domains or rely heavily on external fact-checking tools, which may not always be available. In this work, we systematically investigate reference-free hallucination detection in open-domain long-form responses. Our findings reveal that internal states (e.g., the model’s output probability and entropy) alone are insufficient for reliably (i.e., better than random guessing) distinguishing between factual and hallucinated content. To enhance detection, we explore various existing approaches, including prompting-based methods, probing, and fine-tuning, with fine-tuning proving the most effective. To further improve the accuracy, we introduce a new paradigm, named RATE-FT, that augments fine-tuning with an auxiliary task for the model to jointly learn with the main task of hallucination detection. With extensive experiments and analysis using a variety of model families and datasets, we demonstrate the effectiveness and generalizability of our method, e.g., +3% over general fine-tuning methods on LongFact.
Co-authors
- Amine Benhalloum 1
- Shengjie Bi 1
- Han Fang 1
- Manaal Faruqui 1
- Maryam Fazel-Zarandi 1
- Shengyu Feng 1
- Sujan Kumar Gonugondla 1
- Hany Hassan Awadalla 1
- Yun He 1
- Eryk Helenowski 1
- Shafiq Joty 1
- Julian Katz-Samuels 1
- Sopan Khosla 1
- Hunter Lang 1
- Beibin Li 1
- Songlin Li 1
- Wenzhe Li 1
- Hao Ma 1
- Karishma Mandyam 1
- Richard Yuanzhe Pang 1
- Shishir G Patil 1
- Xiaoliang Peng 1
- Qi Qi 1
- Yundi Qian 1
- Chengwei Qin 1
- Alexander Radovic 1
- Karthik Abinav Sankararaman 1
- Arya Talebzadeh 1
- Aditya Tayade 1
- Sinong Wang 1
- Yuanhao Xiong 1
- Tengyu Xu 1
- Licheng Yu 1
- Yue Yu 1
- Hejia Zhang 1
- Wenxuan Zhou 1
Venues
- ACL 2