A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation

Yang Zhong, Diane Litman


Abstract
Ensuring factual consistency in summarization remains a challenge, especially for long-document evaluation. While automated, reference-free evaluation models are essential given the impracticality of large-scale human assessment for lengthy texts, challenges persist in how evaluation systems handle different summary granularities and evolving model generations. In this work, we conduct a systematic study of diverse factual-consistency evaluation systems across four long-document datasets, encompassing summaries generated by models ranging from non-LLMs to proprietary LLMs. Our analysis reveals that fine-grained continuous scores can provide more reliable assessments of evaluation systems’ capabilities than binary classification. We also examine the relationship between sentence-level and summary-level model performance, highlighting its dependency on dataset characteristics. Moreover, our study reveals that advanced systems achieve higher recall in error detection for older summaries, yet struggle with false positives and fine-grained error detection. Our analysis and case studies provide further insights into designing robust factuality evaluation systems, which are increasingly in demand as generative models advance rapidly.
Anthology ID:
2025.findings-acl.648
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12511–12532
URL:
https://preview.aclanthology.org/transition-to-people-yaml/2025.findings-acl.648/
DOI:
10.18653/v1/2025.findings-acl.648
Cite (ACL):
Yang Zhong and Diane Litman. 2025. A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12511–12532, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation (Zhong & Litman, Findings 2025)
PDF:
https://preview.aclanthology.org/transition-to-people-yaml/2025.findings-acl.648.pdf