Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan
Abstract
Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and that marginalizing over this nuisance variation can improve the reliability of LLM evaluation.
- Anthology ID:
- 2026.gem-main.58
- Volume:
- Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
- Venues:
- GEM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 595–603
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.58/
- Cite (ACL):
- Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, and Elsa Fan. 2026. Permutation-Consensus Listwise Judging for Robust Factuality Evaluation. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 595–603, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- Permutation-Consensus Listwise Judging for Robust Factuality Evaluation (Huang et al., GEM 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.58.pdf
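
The permutation-consensus idea summarized in the abstract (rerun one listwise judge over several orderings of the same candidates, then aggregate) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `permutation_consensus`, `biased_judge`, and `QUALITY` are hypothetical. The toy judge with a fixed first-slot bonus stands in for a real LLM call, and aggregating by mean score with mean rank as a tie-break is an assumption. The uncertainty signals mentioned in the abstract are omitted.

```python
from itertools import permutations
from statistics import mean
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption, not from the paper): takes the
# candidates in presentation order and returns one factuality score per
# presented position, where higher means more factual.
Judge = Callable[[Sequence[str]], list[float]]


def permutation_consensus(candidates, judge, orders):
    """Aggregate judge scores and ranks over several presentation orders."""
    n = len(candidates)
    scores = [[] for _ in range(n)]  # per-candidate scores across orderings
    ranks = [[] for _ in range(n)]   # per-candidate ranks across orderings
    for order in orders:
        shown = [candidates[i] for i in order]
        out = judge(shown)
        # Rank presented positions by score, best first (rank 1 = best).
        by_score = sorted(range(n), key=lambda pos: -out[pos])
        for rank, pos in enumerate(by_score, start=1):
            original = order[pos]  # map position back to the candidate index
            scores[original].append(out[pos])
            ranks[original].append(rank)
    mean_scores = [mean(s) for s in scores]
    mean_ranks = [mean(r) for r in ranks]
    # Assumed aggregation rule: highest mean score, then lowest mean rank.
    winner = max(range(n), key=lambda i: (mean_scores[i], -mean_ranks[i]))
    return winner, mean_scores, mean_ranks


# Toy stand-in for an LLM judge: true quality plus an artificial bonus for
# whichever candidate is shown first (a simulated order effect).
QUALITY = {"answer A": 0.6, "answer B": 0.7, "answer C": 0.5}


def biased_judge(shown):
    return [QUALITY[c] + (0.2 if pos == 0 else 0.0) for pos, c in enumerate(shown)]


candidates = list(QUALITY)
identity = tuple(range(len(candidates)))
# Seven orderings (K=7): the original order plus all six permutations.
orders = [identity] + list(permutations(identity))

single_winner, _, _ = permutation_consensus(candidates, biased_judge, [identity])
consensus_winner, mean_scores, mean_ranks = permutation_consensus(
    candidates, biased_judge, orders
)
print(candidates[single_winner], candidates[consensus_winner])
# prints "answer A answer B"
```

In this toy setup a single pass in the original order picks "answer A" because of the first-slot bonus, while averaging over the seven orderings dilutes that bonus and recovers "answer B", the candidate with the highest underlying quality.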