2025
pdf
bib
abs
HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Bosi Wen
|
Pei Ke
|
Yufei Sun
|
Cunxiang Wang
|
Xiaotao Gu
|
Jinfeng Zhou
|
Jie Tang
|
Hongning Wang
|
Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025
Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods. Our code is available athttps://github.com/thu-coai/HPSS.
pdf
bib
abs
Training Language Model to Critique for Better Refinement
Tianshu Yu
|
Chao Xiang
|
Mingchuan Yang
|
Pei Ke
|
Bosi Wen
|
Cunxiang Wang
|
Jiale Cheng
|
Li Zhang
|
Xinyu Mu
|
Chuxiong Sun
|
Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks—dialog generation, summarization, question answering, mathematical reasoning, and code generation—and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method’s effectiveness in enhancing LLM critique-refinement loops. Code and data will be publicly available upon acceptance of this paper.
2024
pdf
bib
abs
AlignBench: Benchmarking Chinese Alignment of Large Language Models
Xiao Liu
|
Xuanyu Lei
|
Shengyuan Wang
|
Yue Huang
|
Andrew Feng
|
Bosi Wen
|
Jiale Cheng
|
Pei Ke
|
Yifan Xu
|
Weng Lam Tam
|
Xiaohan Zhang
|
Lichao Sun
|
Xiaotao Gu
|
Hongning Wang
|
Jing Zhang
|
Minlie Huang
|
Yuxiao Dong
|
Jie Tang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs’ alignment in Chinese. We tailor a human-in-the-loop data curation pipeline, containing 8 main categories, 683 real-scenario rooted queries and corresponding human verified references.To ensure references’ correctness, each knowledge-intensive query is accompanied with evidences collected from reliable webpages (including the url and quotation) by our annotators.For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (CITATION) with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability.All evaluation codes and data are publicly available at
https://github.com/THUDM/AlignBenchpdf
bib
abs
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
Pei Ke
|
Bosi Wen
|
Andrew Feng
|
Xiao Liu
|
Xuanyu Lei
|
Jiale Cheng
|
Shengyuan Wang
|
Aohan Zeng
|
Yuxiao Dong
|
Hongning Wang
|
Jie Tang
|
Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4’s direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
pdf
bib
abs
ToMBench: Benchmarking Theory of Mind in Large Language Models
Zhuang Chen
|
Jincenzi Wu
|
Jinfeng Zhou
|
Bosi Wen
|
Guanqun Bi
|
Gongyao Jiang
|
Yaru Cao
|
Mengting Hu
|
Yunghwei Lai
|
Zexuan Xiong
|
Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10% points, indicating that LLMs have not achieved a human-level theory of mind yet. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs’ ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.
pdf
bib
abs
CharacterGLM: Customizing Social Characters with Large Language Models
Jinfeng Zhou
|
Zhuang Chen
|
Dazhen Wan
|
Bosi Wen
|
Yi Song
|
Jifan Yu
|
Yongkang Huang
|
Pei Ke
|
Guanqun Bi
|
Libiao Peng
|
JiaMing Yang
|
Xiyao Xiao
|
Sahand Sabour
|
Xiaohan Zhang
|
Wenjing Hou
|
Yijia Zhang
|
Yuxiao Dong
|
Hongning Wang
|
Jie Tang
|
Minlie Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Character-based dialogue (CharacterDial) has become essential in the industry (e.g., Character.AI), enabling users to freely customize social characters for social interactions. However, the generalizability and adaptability across various conversational scenarios inherent in customizing social characters still lack public industrial solutions. To address these challenges, by dissecting well-rounded social characters composed of both inherent social profiles and external social behaviors, we manually collect a large-scale Chinese corpus featuring characters with diverse categories and behaviors, and develop CharacterGLM models alongside well-designed refinement methods. Extensive experiments show that CharacterGLM outperforms most popular open- and closed-source LLMs and performs comparably to GPT-4. We will release our data and models for local development and deployment.