Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Ameya Godbole, Robin Jia


Abstract
Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism with regard to factuality evaluation. We re-evaluate five state-of-the-art factuality metrics on a collection of 11 datasets for summarization, retrieval-augmented generation, and question answering. We find that these evaluators are inconsistent with each other and often misestimate the factual accuracy of NLG systems, both of which can lead to a variety of pitfalls. We further show that these metrics exhibit biases against highly paraphrased outputs and against outputs that draw upon faraway parts of the source documents. We urge users of factuality metrics to proceed with caution and to manually validate the reliability of these metrics in their domain of interest.
Anthology ID:
2025.findings-acl.1175
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
22889–22912
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.1175/
Cite (ACL):
Ameya Godbole and Robin Jia. 2025. Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22889–22912, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics (Godbole & Jia, Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.1175.pdf