Chenjun Xu
2026
Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation
Katelyn X. Mei | Yi-Li Hsu | Minjoon Choi | Zongwan Cao | Chenjun Xu | Bingbing Wen | Su Lin Blodgett | Lucy Lu Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols—details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023–2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research.
2025
Know Your Limits: A Survey of Abstention in Large Language Models
Bingbing Wen | Jihan Yao | Shangbin Feng | Chenjun Xu | Yulia Tsvetkov | Bill Howe | Lucy Lu Wang
Transactions of the Association for Computational Linguistics, Volume 13
Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in LLM systems. In this survey, we introduce a framework to examine abstention from three perspectives: the query, the model, and human values. We organize the literature on abstention methods, benchmarks, and evaluation metrics using this framework, and discuss merits and limitations of prior work. We further identify and motivate areas for future research, such as whether abstention can be achieved as a meta-capability that transcends specific tasks or domains, and opportunities to optimize abstention abilities in specific contexts. In doing so, we aim to broaden the scope and impact of abstention methodologies in AI systems.
Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Chenjun Xu | Bingbing Wen | Bin Han | Robert Wolfe | Lucy Lu Wang | Bill Howe
Findings of the Association for Computational Linguistics: ACL 2025
Psychology research has shown that humans are poor at estimating their performance on tasks, tending towards underconfidence on easy tasks and overconfidence on difficult tasks. We examine three LLMs, Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o, on a range of QA tasks of varying difficulty, and show that models exhibit subtle differences from human patterns of overconfidence: they are less sensitive to task difficulty, and when prompted to answer based on different personas (e.g., expert vs. layman, or different races, genders, and ages), they respond with stereotypically biased confidence estimations even though their underlying answer accuracy remains the same. Based on these observations, we propose Answer-Free Confidence Estimation (AFCE) to improve confidence calibration and LLM interpretability in these settings. AFCE is a self-assessment method that employs two stages of prompting, first eliciting only confidence scores on questions, then asking separately for the answer. Experiments on the MMLU and GPQA datasets spanning subjects and difficulty show that this separation of tasks significantly reduces overconfidence and delivers more human-like sensitivity to task difficulty.
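The two-stage prompting described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the per-question granularity, the `ask` callable (a stand-in for any LLM call), the `fake_model` stub, and the `overconfidence` helper (mean stated confidence minus observed accuracy) are all assumptions made here for illustration.

```python
import re
from typing import Callable, List, Tuple


def parse_confidence(reply: str) -> float:
    """Extract the first number in the reply and map it from 0-100 to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no confidence value in reply: {reply!r}")
    return min(max(float(match.group()), 0.0), 100.0) / 100.0


def afce(question: str, ask: Callable[[str], str]) -> Tuple[float, str]:
    """Two separate calls: confidence first (no answer requested), then the answer."""
    # Stage 1 (hypothetical prompt wording): elicit only a confidence score.
    confidence_prompt = (
        "Do not answer the question below. Reply only with a number from 0 to 100 "
        "giving how confident you are that you could answer it correctly.\n\n"
        + question
    )
    confidence = parse_confidence(ask(confidence_prompt))

    # Stage 2 (hypothetical prompt wording): ask for the answer in a separate call.
    answer_prompt = "Answer the question below. Reply only with the answer.\n\n" + question
    answer = ask(answer_prompt).strip()
    return confidence, answer


def overconfidence(confidences: List[float], correct: List[bool]) -> float:
    """Mean stated confidence minus observed accuracy; positive means overconfident."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)


def fake_model(prompt: str) -> str:
    """Stub standing in for a real LLM call, used only to exercise the sketch."""
    return "80" if prompt.startswith("Do not answer") else " B "


if __name__ == "__main__":
    print(afce("Which option is correct? (A) ... (B) ...", fake_model))
```

In practice `ask` would wrap a real model client, and the gap returned by `overconfidence` could be compared between this two-stage setup and a single prompt that requests the answer and confidence together.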