Hejia Zhang
2026
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Yun He | Wenzhe Li | Hejia Zhang | Songlin Li | Karishma Mandyam | Sopan Khosla | Yuanhao Xiong | Nanshu Wang | Xiaoliang Peng | Beibin Li | Shengjie Bi | Shishir G Patil | Qi Qi | Shengyu Feng | Julian Katz-Samuels | Richard Yuanzhe Pang | Sujan Kumar Gonugondla | Hunter Lang | Yue Yu | Yundi Qian | Maryam Fazel-Zarandi | Licheng Yu | Amine Benhalloum | Hany Hassan Awadalla | Manaal Faruqui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)—especially for complex, multi-turn, and system-prompted instructions—remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF, a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs’ ability to follow complex, multi-turn, and system-level instructions. We also open-source the evaluation script of AdvancedIF. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
2025
Improving Model Factuality with Fine-grained Critique-based Evaluator
Yiqing Xie | Wenxuan Zhou | Pradyot Prakash | Di Jin | Yuning Mao | Quintin Fettes | Arya Talebzadeh | Sinong Wang | Han Fang | Carolyn Rose | Daniel Fried | Hejia Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. In particular, we train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgment based on diverse source documents obtained by various tools, via data augmentation on a combination of public judgment datasets. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, ask FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator’s accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama2-7B-chat/Llama3-8B-chat’s factuality rate by 16.86%/14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 8.83%/6.96%.
2024
Effective Long-Context Scaling of Foundation Models
Wenhan Xiong | Jingyu Liu | Igor Molybog | Hejia Zhang | Prajjwal Bhargava | Rui Hou | Louis Martin | Rashi Rungta | Karthik Abinav Sankararaman | Barlas Oguz | Madian Khabsa | Han Fang | Yashar Mehdad | Sharan Narang | Kshitiz Malik | Angela Fan | Shruti Bhosale | Sergey Edunov | Mike Lewis | Sinong Wang | Hao Ma
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We present an effective recipe to train strong long-context LLMs that are capable of utilizing massive context windows of up to 32,000 tokens. Our models are built through continual pretraining from Llama 2 checkpoints with longer text sequences and on a dataset where long texts are upsampled. We perform extensive evaluation using language modeling, synthetic context probing tasks, and a wide range of downstream benchmarks. Across all evaluations, our models achieve consistent improvements on most regular-context tasks and significant improvements on long-context tasks over Llama 2. Moreover, with a cost-effective instruction tuning procedure that is free of expensive annotation, the presented models can already surpass gpt-3.5-turbo-16k’s overall performance on long-context benchmarks. Alongside these results, we provide an in-depth analysis on each individual component of our method. We delve into Llama’s position encodings and discuss its key limitation in modeling long data. We examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths – ablation results suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.
Co-authors
- Han Fang 2
- Sinong Wang 2
- Amine Benhalloum 1
- Prajjwal Bhargava 1
- Shruti Bhosale 1
- Shengjie Bi 1
- Sergey Edunov 1
- Angela Fan 1
- Manaal Faruqui 1
- Maryam Fazel-Zarandi 1
- Shengyu Feng 1
- Quintin Fettes 1
- Daniel Fried 1
- Sujan Kumar Gonugondla 1
- Hany Hassan Awadalla 1
- Yun He 1
- Rui Hou 1
- Di Jin 1
- Julian Katz-Samuels 1
- Madian Khabsa 1
- Sopan Khosla 1
- Hunter Lang 1
- Mike Lewis 1
- Beibin Li 1
- Songlin Li 1
- Wenzhe Li 1
- Jingyu Liu 1
- Hao Ma 1
- Kshitiz Malik 1
- Karishma Mandyam 1
- Yuning Mao 1
- Louis Martin 1
- Yashar Mehdad 1
- Igor Molybog 1
- Sharan Narang 1
- Barlas Oguz 1
- Richard Yuanzhe Pang 1
- Shishir G Patil 1
- Xiaoliang Peng 1
- Pradyot Prakash 1
- Qi Qi 1
- Yundi Qian 1
- Carolyn Rose 1
- Rashi Rungta 1
- Karthik Abinav Sankararaman 1
- Arya Talebzadeh 1
- Nanshu Wang 1
- Yiqing Xie 1
- Wenhan Xiong 1
- Yuanhao Xiong 1
- Licheng Yu 1
- Yue Yu 1
- Wenxuan Zhou 1