Abstract
This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL. We evaluate translation tasks between English and each of Nepali, Sinhala, and Kazakh, and predict that using morphologically-based segmentation methods would lead to better performance in this setting. However, comparing to BPE, we find that no consistent and reliable differences emerge between the segmentation methods. While morphologically-based methods outperform BPE in a few cases, what performs best tends to vary across tasks, and the performance of segmentation methods is often statistically indistinguishable.
- Anthology ID: 2021.eacl-srw.22
- Volume: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
- Month: April
- Year: 2021
- Address: Online
- Venue: EACL
- Publisher: Association for Computational Linguistics
- Pages: 164–174
- URL: https://aclanthology.org/2021.eacl-srw.22
- DOI: 10.18653/v1/2021.eacl-srw.22
- Cite (ACL): Jonne Saleva and Constantine Lignos. 2021. The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 164–174, Online. Association for Computational Linguistics.
- Cite (Informal): The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation (Saleva & Lignos, EACL 2021)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2021.eacl-srw.22.pdf
- Data: FLoRes