Measuring Cross-Text Cohesion for Segmentation Similarity Scoring

Gerardo Ocampo Diaz, Jessica Ouyang


Abstract
Text segmentation is the task of dividing a sequence of text elements (eg. words, sentences, or paragraphs) into meaningful chunks. Although exciting advances are being made in modern segmentation-based tasks, such as automatically generating podcast chapters, current segmentation similarity metrics share a critical weakness: they are content-agnostic. In this paper, we present a word-embedding-based metric of cross-textual cohesion based on the formal linguistic definition of cohesion and incorporate it into a new segmentation similarity metric, SED. Our similarity metric, SED, is capable of providing fine-grained segmentation similarity scoring for the 3 basic segmentation errors: transposition, insertion, and deletion, as well as mixtures of them, avoiding the limitations of traditional metrics. We discuss the benefits of SED and evaluate its alignment with human judgement for each of the 3 basic error types. We show that our metric aligns with human evaluations significantly more than traditional metrics. We briefly discuss future work, such as the integration of anaphora resolution into our cohesion-based metric, and make our code publicly available.
Anthology ID:
2024.lrec-main.971
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
11138–11147
Language:
URL:
https://aclanthology.org/2024.lrec-main.971
DOI:
Bibkey:
Cite (ACL):
Gerardo Ocampo Diaz and Jessica Ouyang. 2024. Measuring Cross-Text Cohesion for Segmentation Similarity Scoring. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11138–11147, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Measuring Cross-Text Cohesion for Segmentation Similarity Scoring (Ocampo Diaz & Ouyang, LREC-COLING 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.971.pdf