Is Character Trigram Overlapping Ratio Still the Best Similarity Measure for Aligning Sentences in a Paraphrased Corpus?
Aleksandra Smolka, Hsin-Min Wang, Jason S. Chang, Keh-Yih Su
Abstract
Sentence alignment is an essential step in studying the mapping among different language expressions, and the character trigram overlapping ratio was reported to be the most effective similarity measure in aligning sentences in the text simplification dataset. However, the appropriateness of each similarity measure depends on the characteristics of the corpus to be aligned. This paper examines whether the character trigram is still a suitable similarity measure for the task of aligning sentences in a paragraph paraphrasing corpus. We compare several embedding-based and non-embedding, model-agnostic similarity measures, including some that have not been studied previously. The evaluation is conducted on parallel paragraphs sampled from the Webis-CPC-11 corpus, which is a paragraph paraphrasing dataset. Our results show that modern BERT-based measures such as Sentence-BERT or BERTScore can lead to significant improvement in this task.
- Anthology ID:
- 2022.rocling-1.7
- Volume:
- Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)
- Month:
- November
- Year:
- 2022
- Address:
- Taipei, Taiwan
- Editors:
- Yung-Chun Chang, Yi-Chin Huang
- Venue:
- ROCLING
- Publisher:
- The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
- Pages:
- 49–60
- URL:
- https://aclanthology.org/2022.rocling-1.7
- Cite (ACL):
- Aleksandra Smolka, Hsin-Min Wang, Jason S. Chang, and Keh-Yih Su. 2022. Is Character Trigram Overlapping Ratio Still the Best Similarity Measure for Aligning Sentences in a Paraphrased Corpus? In Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022), pages 49–60, Taipei, Taiwan. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP).
- Cite (Informal):
- Is Character Trigram Overlapping Ratio Still the Best Similarity Measure for Aligning Sentences in a Paraphrased Corpus? (Smolka et al., ROCLING 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2022.rocling-1.7.pdf
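For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of a character trigram overlapping ratio used for sentence alignment. It is not the authors' implementation: the Dice-style formula, the lowercasing and whitespace normalization, the greedy best-match alignment, and the function names `char_trigrams`, `trigram_overlap_ratio`, and `align` are all assumptions made for illustration, and the paper may define the ratio differently.

```python
from __future__ import annotations


def char_trigrams(text: str) -> set[str]:
    """Set of character trigrams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    # Strings shorter than three characters yield an empty set.
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_overlap_ratio(a: str, b: str) -> float:
    """Assumed Dice-style ratio: 2 * |A & B| / (|A| + |B|)."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def align(source: list[str], target: list[str]) -> list[tuple[int, int, float]]:
    """Greedy alignment (an assumption): pair each source sentence with the
    target sentence that has the highest trigram overlap ratio."""
    if not target:
        return []
    pairs = []
    for i, sentence in enumerate(source):
        scores = [trigram_overlap_ratio(sentence, t) for t in target]
        j = max(range(len(target)), key=scores.__getitem__)
        pairs.append((i, j, scores[j]))
    return pairs
```

Calling `align(original_sentences, paraphrased_sentences)` returns one `(source_index, target_index, score)` triple per source sentence. An embedding-based measure such as those compared in the paper could be swapped in by replacing `trigram_overlap_ratio` with another function of the same signature.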