LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts
Zongxia Li, Xiyang Wu, Ishani Mondal, Alexa Siu, Jordan Lee Boyd-Graber, Ani Nenkova
Abstract
Large language models (LLMs) such as GPT-4, Claude, and LLaMA are routinely used to evaluate long-form text generated by language models. We study the ability of these models to identify low-quality texts, an increasingly rare subset of output that is of great interest to pinpoint during development. We present experiments with a panel of LLM judges and crowd-sourced approximations of reference judgments. Pinpointing sub-par outputs is a difficult task for both models and crowdworkers, with models doing better overall. Moreover, unlike findings in prior work on factoid question answering, panels of cheaper models do not agree as well with high-quality developer judgments of low quality as panels of frontier models do. We present qualitative and quantitative analysis of the relative strengths of models in the panel, gleaning insights into why they yield better results than a single model.
- Anthology ID:
- 2025.newsum-main.1
- Volume:
- Proceedings of The 5th New Frontiers in Summarization Workshop
- Month:
- November
- Year:
- 2025
- Address:
- Hybrid
- Editors:
- Yue Dong, Wen Xiao, Haopeng Zhang, Rui Zhang, Ori Ernst, Lu Wang, Fei Liu
- Venues:
- NewSum | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1–16
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.newsum-main.1/
- Cite (ACL):
- Zongxia Li, Xiyang Wu, Ishani Mondal, Alexa Siu, Jordan Lee Boyd-Graber, and Ani Nenkova. 2025. LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts. In Proceedings of The 5th New Frontiers in Summarization Workshop, pages 1–16, Hybrid. Association for Computational Linguistics.
- Cite (Informal):
- LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts (Li et al., NewSum 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.newsum-main.1.pdf