Revisiting Faithfulness Annotations for Long-form Summaries

Yang Zhong, Yang Janet Liu, Diane Litman


Abstract
Benchmarks for long-form summaries (four or more sentences) generated by language models increasingly serve as gold-standard references for developing, evaluating, and comparing faithfulness-checking systems. As their influence grows, understanding the challenges of annotating faithfulness errors within long, discourse-rich summaries becomes critical. We revisit three benchmarks spanning diverse text types and contrasting annotation designs. Using a discourse-aware evaluation framework together with human auditing, we identify cases where benchmark labels may be unreliable. Manual verification shows that 3.4%-5.4% of sentence-level labels warrant revision due to discourse-level inconsistencies that standard annotation procedures overlook. We introduce a taxonomy of five recurring annotation error types, propose revised labels, and show that correcting these cases leads to meaningful shifts in system rankings. We conclude with recommendations for future annotation practices.
Anthology ID:
2026.law-main.12
Volume:
Proceedings of the 20th Linguistic Annotation Workshop (LAW XX)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Yang Janet Liu, Luke Gessler
Venues:
LAW | WS
Publisher:
Association for Computational Linguistics
Pages:
158–172
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.law-main.12/
Cite (ACL):
Yang Zhong, Yang Janet Liu, and Diane Litman. 2026. Revisiting Faithfulness Annotations for Long-form Summaries. In Proceedings of the 20th Linguistic Annotation Workshop (LAW XX), pages 158–172, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Revisiting Faithfulness Annotations for Long-form Summaries (Zhong et al., LAW 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.law-main.12.pdf