Similarity-Based Alignment of Monolingual Corpora for Text Simplification Purposes

Sarah Albertsson, Evelina Rennes, Arne Jönsson


Abstract
Comparable or parallel corpora are beneficial for many NLP tasks. The automatic collection of corpora enables large-scale resources, even for less-resourced languages, which in turn can be useful for deducing rules and patterns for text rewriting algorithms, a subtask of automatic text simplification. We present two methods for the alignment of Swedish easy-to-read text segments to text segments from a reference corpus. The first method (M1) was originally developed for the task of text reuse detection, measuring sentence similarity by a modified version of a TF-IDF vector space model. A second method (M2), also accounting for part-of-speech tags, was developed, and the methods were compared. For evaluation, a crowdsourcing platform was built for human judgement data collection, and preliminary results showed that cosine similarity relates better to human ranks than the Dice coefficient. We also saw a tendency that adding syntactic context to the TF-IDF vector space model is beneficial for this kind of paraphrase alignment task.
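To make the two similarity measures named in the abstract concrete, the following is a minimal sketch of cosine similarity over TF-IDF vectors and of the Dice coefficient over token sets. It is not the paper's M1 or M2: the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions chosen for illustration, and part-of-speech information is left out entirely.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return sentence.lower().split()


def idf_weights(sentences):
    # Smoothed inverse document frequency, treating each sentence as a document.
    # The smoothing is an illustrative choice, not taken from the paper.
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(sentence, idf):
    # Sparse TF-IDF vector as a dict; terms unseen in the corpus get weight 0.
    tf = Counter(tokenize(sentence))
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}


def cosine_similarity(u, v):
    # Dot product of two sparse vectors divided by the product of their norms.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dice_coefficient(a, b):
    # Dice coefficient on token sets: 2 * |A and B| / (|A| + |B|).
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def best_match(simple_sentence, reference_sentences, idf):
    # Align one easy-to-read sentence to its most similar reference sentence.
    query = tfidf_vector(simple_sentence, idf)
    scored = [
        (cosine_similarity(query, tfidf_vector(ref, idf)), ref)
        for ref in reference_sentences
    ]
    return max(scored, key=lambda pair: pair[0])
```

Under these assumptions, cosine similarity rewards overlap in rare, highly weighted terms, whereas the Dice coefficient counts every shared token equally regardless of how informative it is.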
Anthology ID:
W16-4118
Volume:
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Month:
December
Year:
2016
Address:
Osaka, Japan
Venue:
CL4LC
Publisher:
The COLING 2016 Organizing Committee
Pages:
154–163
URL:
https://aclanthology.org/W16-4118
Cite (ACL):
Sarah Albertsson, Evelina Rennes, and Arne Jönsson. 2016. Similarity-Based Alignment of Monolingual Corpora for Text Simplification Purposes. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 154–163, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Similarity-Based Alignment of Monolingual Corpora for Text Simplification Purposes (Albertsson et al., CL4LC 2016)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W16-4118.pdf