Abstract
Comparable or parallel corpora are beneficial for many NLP tasks. The automatic collection of corpora enables large-scale resources, even for less-resourced languages, which in turn can be useful for deducing rules and patterns for text rewriting algorithms, a subtask of automatic text simplification. We present two methods for the alignment of Swedish easy-to-read text segments to text segments from a reference corpus. The first method (M1) was originally developed for the task of text reuse detection, measuring sentence similarity by a modified version of a TF-IDF vector space model. A second method (M2), which also accounts for part-of-speech tags, was developed, and the two methods were compared. For evaluation, a crowdsourcing platform was built to collect human judgement data, and preliminary results showed that cosine similarity relates better to human ranks than the Dice coefficient. We also saw a tendency for the inclusion of syntactic context in the TF-IDF vector space model to be beneficial for this kind of paraphrase alignment task.
- Anthology ID:
- W16-4118
- Volume:
- Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
- Month:
- December
- Year:
- 2016
- Address:
- Osaka, Japan
- Venue:
- CL4LC
- Publisher:
- The COLING 2016 Organizing Committee
- Pages:
- 154–163
- URL:
- https://aclanthology.org/W16-4118
- Cite (ACL):
- Sarah Albertsson, Evelina Rennes, and Arne Jönsson. 2016. Similarity-Based Alignment of Monolingual Corpora for Text Simplification Purposes. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 154–163, Osaka, Japan. The COLING 2016 Organizing Committee.
- Cite (Informal):
- Similarity-Based Alignment of Monolingual Corpora for Text Simplification Purposes (Albertsson et al., CL4LC 2016)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/W16-4118.pdf
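
The abstract contrasts cosine similarity over a TF-IDF vector space with the Dice coefficient as measures for aligning an easy-to-read segment to reference segments. The sketch below illustrates that general idea only. It is not the authors' M1 or M2: the paper's modifications to TF-IDF and its use of part-of-speech tags are not described here, so the whitespace tokenizer, the smoothed IDF formula, and the function names (`tfidf_vectors`, `cosine`, `dice`, `best_alignment`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def tfidf_vectors(segments):
    # Build one sparse TF-IDF vector (term -> weight) per segment.
    docs = [Counter(tokenize(s)) for s in segments]
    n = len(docs)
    # Document frequency: each term is counted once per segment it occurs in.
    df = Counter(term for doc in docs for term in doc)
    # Assumption: smoothed IDF, so terms found in every segment keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dice(a, b):
    # Dice coefficient over the sets of token types in two segments.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def best_alignment(easy_segment, reference_segments):
    # Return (index, score) of the reference segment most similar to the
    # easy-to-read segment under TF-IDF cosine similarity.
    vectors = tfidf_vectors([easy_segment] + reference_segments)
    query, candidates = vectors[0], vectors[1:]
    scores = [cosine(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Calling `best_alignment(easy, references)` returns the position of the highest-scoring reference segment together with its cosine score, and `dice(easy, references[i])` gives the set-overlap score for the same pair, so the two measures can be compared against human rankings in the way the abstract describes.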