Xiaoning Wang
2026
Prior Beliefs Prejudice LLM-as-Judge: Evidence from Persuasion Evaluation
Pardis Sadat Zahraei | Xiaoning Wang | Nimet Beyza Bozdag | Gokhan Tur | Dilek Hakkani-Tür
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly used as judges to evaluate text quality, moderate content, and assess arguments. We investigate whether alignment-instilled prior beliefs bias LLM judgments, using persuasion evaluation as a representative task. We find a systematic failure: models conflate their trained beliefs with rhetorical quality, rating identical claims differently based on belief alignment rather than argumentative merit. A bare assertion aligned with the model's training receives higher scores than a well-crafted counter-argument, even when the judge is explicitly instructed to evaluate rhetoric alone. We introduce ConvinceQA, a dataset of 27,756 persuasive arguments with controlled stance variation across subjective, harmful, and misinformation domains, and demonstrate this prior prejudice across models. We then exploit the failure through persuasion-based probing: asking a model to rate minimal pairs that differ only in the subject token bypasses learned refusals and reveals hidden biases. Our analysis identifies three failure modes, with belief-conditioned rating inflation accounting for 88% of cases. Cross-task validation on essay quality assessment and debate judging confirms that this is a pervasive limitation.
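The minimal-pair probing idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`make_minimal_pair`, `mean_rating_gap`), the template, and the stand-in `toy_judge` are all hypothetical, and a real study would replace `toy_judge` with a call to an LLM judge that returns a persuasiveness score.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple


def make_minimal_pair(template: str, subject_a: str, subject_b: str) -> Tuple[str, str]:
    """Fill one template with two subjects so the texts differ only in that slot."""
    return template.format(subject=subject_a), template.format(subject=subject_b)


def mean_rating_gap(
    judge: Callable[[str], float],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Average score difference (first minus second) across minimal pairs.

    Because the wording is otherwise identical, a gap far from zero suggests
    the judge is reacting to the subject rather than to the rhetoric.
    """
    return mean(judge(a) - judge(b) for a, b in pairs)


def toy_judge(text: str) -> float:
    """Hypothetical stand-in for an LLM judge; it deliberately favors one subject."""
    return 4.0 if "Chess" in text else 3.0


# Hypothetical template and subjects, chosen only for illustration.
pair = make_minimal_pair("{subject} should be taught in every school.", "Chess", "Coding")
gap = mean_rating_gap(toy_judge, [pair])
```

With an unbiased judge the gap would be close to zero; the toy judge above is built to show a nonzero gap.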