Abstract
Text simplification is a growing field with many potentially useful applications. Training text simplification algorithms generally requires a large amount of annotated data; however, few corpora are suitable for this task. We propose a new unsupervised method for aligning text, based on Doc2Vec embeddings and a new alignment algorithm, that is capable of aligning texts at different levels. Initial evaluation shows promising results for the new approach. We used the newly developed approach to create a new monolingual parallel corpus composed of the works of English early modern philosophers and their corresponding simplified versions.
- Anthology ID:
- 2021.naacl-srw.6
- Volume:
- Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
- Month:
- June
- Year:
- 2021
- Address:
- Online
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 40–46
- URL:
- https://aclanthology.org/2021.naacl-srw.6
- DOI:
- 10.18653/v1/2021.naacl-srw.6
- Cite (ACL):
- Stefan Paun. 2021. Parallel Text Alignment and Monolingual Parallel Corpus Creation from Philosophical Texts for Text Simplification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 40–46, Online. Association for Computational Linguistics.
- Cite (Informal):
- Parallel Text Alignment and Monolingual Parallel Corpus Creation from Philosophical Texts for Text Simplification (Paun, NAACL 2021)
- PDF:
- https://preview.aclanthology.org/nodalida-main-page/2021.naacl-srw.6.pdf
- Code:
- stefanpaun/massalign
- Data:
- Newsela