Bosi Wen
2026
IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
Bosi Wen | Yilin Niu | Cunxiang Wang | Pei Ke | Xiaoying Ling | Ying Zhang | Aohan Zeng | Hongning Wang | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Instruction-following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction-following still possess many deficiencies, such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic for fine-grained, efficient, and reliable instruction-following evaluation. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments show that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including o4-mini and Gemini-3-Pro. With the reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization under lower computational overhead compared to strong LLM critic baselines. Our code and model are available at https://github.com/thu-coai/IF-CRITIC.
IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
Bosi Wen | Yilin Niu | Cunxiang Wang | Xiaoying Ling | Ying Zhang | Pei Ke | Hongning Wang | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
2025
Training Language Model to Critique for Better Refinement
Tianshu Yu | Chao Xiang | Mingchuan Yang | Pei Ke | Bosi Wen | Cunxiang Wang | Jiale Cheng | Li Zhang | Xinyu Mu | Chuxiong Sun | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks—dialog generation, summarization, question answering, mathematical reasoning, and code generation—and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method’s effectiveness in enhancing LLM critique-refinement loops. Code and data will be publicly available upon acceptance of this paper.
HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Bosi Wen | Pei Ke | Yufei Sun | Cunxiang Wang | Xiaotao Gu | Jinfeng Zhou | Jie Tang | Hongning Wang | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025
Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods. Our code is available at https://github.com/thu-coai/HPSS.
2024
CharacterGLM: Customizing Social Characters with Large Language Models
Jinfeng Zhou | Zhuang Chen | Dazhen Wan | Bosi Wen | Yi Song | Jifan Yu | Yongkang Huang | Pei Ke | Guanqun Bi | Libiao Peng | JiaMing Yang | Xiyao Xiao | Sahand Sabour | Xiaohan Zhang | Wenjing Hou | Yijia Zhang | Yuxiao Dong | Hongning Wang | Jie Tang | Minlie Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Character-based dialogue (CharacterDial) has become essential in the industry (e.g., Character.AI), enabling users to freely customize social characters for social interactions. However, the generalizability and adaptability across various conversational scenarios inherent in customizing social characters still lack public industrial solutions. To address these challenges, by dissecting well-rounded social characters composed of both inherent social profiles and external social behaviors, we manually collect a large-scale Chinese corpus featuring characters with diverse categories and behaviors, and develop CharacterGLM models alongside well-designed refinement methods. Extensive experiments show that CharacterGLM outperforms most popular open- and closed-source LLMs and performs comparably to GPT-4. We will release our data and models for local development and deployment.
ToMBench: Benchmarking Theory of Mind in Large Language Models
Zhuang Chen | Jincenzi Wu | Jinfeng Zhou | Bosi Wen | Guanqun Bi | Gongyao Jiang | Yaru Cao | Mengting Hu | Yunghwei Lai | Zexuan Xiong | Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10 percentage points, indicating that LLMs have not achieved a human-level theory of mind yet. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs’ ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
Pei Ke | Bosi Wen | Andrew Feng | Xiao Liu | Xuanyu Lei | Jiale Cheng | Shengyuan Wang | Aohan Zeng | Yuxiao Dong | Hongning Wang | Jie Tang | Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4’s direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
AlignBench: Benchmarking Chinese Alignment of Large Language Models
Xiao Liu | Xuanyu Lei | Shengyuan Wang | Yue Huang | Andrew Feng | Bosi Wen | Jiale Cheng | Pei Ke | Yifan Xu | Weng Lam Tam | Xiaohan Zhang | Lichao Sun | Xiaotao Gu | Hongning Wang | Jing Zhang | Minlie Huang | Yuxiao Dong | Jie Tang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs’ alignment in Chinese. We tailor a human-in-the-loop data curation pipeline, containing 8 main categories, 683 real-scenario rooted queries and corresponding human verified references. To ensure references’ correctness, each knowledge-intensive query is accompanied by evidence collected from reliable webpages (including the URL and quotation) by our annotators. For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. All evaluation codes and data are publicly available at https://github.com/THUDM/AlignBench.
Co-authors
- Minlie Huang 8
- Pei Ke 7
- Hongning Wang 6
- Jie Tang 4
- Cunxiang Wang 4
- Jiale Cheng 3
- Yuxiao Dong 3
- Jinfeng Zhou 3
- Guanqun Bi 2
- Zhuang Chen 2
- Andrew Feng 2
- Xiaotao Gu 2
- Xuanyu Lei 2
- Xiaoying Ling 2
- Xiao Liu 2
- Yilin Niu 2
- Shengyuan Wang 2
- Aohan Zeng 2
- Xiaohan Zhang 2
- Ying Zhang 2
- Yaru Cao 1
- Wenjing Hou 1
- Mengting Hu 1
- Yongkang Huang 1
- Yue Huang 1
- Gongyao Jiang 1
- Yunghwei Lai 1
- Xinyu Mu 1
- Libiao Peng 1
- Sahand Sabour 1
- Yi Song 1
- Chuxiong Sun 1
- Lichao Sun 1
- Yufei Sun 1
- Weng Lam Tam 1
- Dazhen Wan 1
- Jincenzi Wu 1
- Chao Xiang 1
- Xiyao Xiao 1
- Zexuan Xiong 1
- Yifan Xu 1
- JiaMing Yang 1
- Mingchuan Yang 1
- Jifan Yu 1
- Tianshu Yu 1
- Jing Zhang 1
- Li Zhang 1
- Yijia Zhang (张益嘉) 1