An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages
Aquia Richburg, Ramy Eskander, Smaranda Muresan, Marine Carpuat
Abstract
Byte-Pair Encoding (BPE) (Sennrich et al., 2016) has become a standard pre-processing step when building neural machine translation systems. However, it is not clear whether this is an optimal strategy in all settings. We conduct a controlled comparison of subword segmentation strategies for translating two low-resource morphologically rich languages (Swahili and Turkish) into English. We show that segmentations based on a unigram language model (Kudo, 2018) yield comparable BLEU and better recall for translating rare source words than BPE.
- Anthology ID: 2020.winlp-1.40
- Volume: Proceedings of the Fourth Widening Natural Language Processing Workshop
- Month: July
- Year: 2020
- Address: Seattle, USA
- Editors: Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
- Venue: WiNLP
- Publisher: Association for Computational Linguistics
- Pages: 151–155
- URL: https://aclanthology.org/2020.winlp-1.40
- DOI: 10.18653/v1/2020.winlp-1.40
- Cite (ACL): Aquia Richburg, Ramy Eskander, Smaranda Muresan, and Marine Carpuat. 2020. An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 151–155, Seattle, USA. Association for Computational Linguistics.
- Cite (Informal): An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages (Richburg et al., WiNLP 2020)