When is Char Better Than Subword: A Systematic Study of Segmentation Algorithms for Neural Machine Translation

Jiahuan Li, Yutong Shen, Shujian Huang, Xinyu Dai, Jiajun Chen


Abstract
Subword segmentation algorithms have been a de facto choice when building neural machine translation systems. However, most of them need to learn a segmentation model based on some heuristics, which may produce sub-optimal segmentation. This can be problematic when the target language has rich morphological changes or when there is not enough data for learning compact composition rules. Translating at the fully character level has the potential to alleviate the issue, but the empirical performance of character-based models has not been fully explored. In this paper, we present an in-depth comparison between character-based and subword-based NMT systems under three settings: translating to typologically diverse languages, training with low resources, and adapting to unseen domains. Experimental results show strong competitiveness of character-based models. Further analyses show that, compared to subword-based models, character-based models are better at handling morphological phenomena and at generating rare and unknown words, and are more suitable for transferring to unseen domains.
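To make the contrast in the abstract concrete, the sketch below segments text at the character level and with a toy byte-pair-encoding (BPE) learner standing in for a learned subword model. It is an illustration only, not the authors' implementation: the function names (char_segment, learn_bpe, bpe_segment, merge_pair), the word-boundary marker, and the tiny corpus are all assumptions made for this example.

```python
from collections import Counter


def char_segment(sentence, space="\u2581"):
    """Character-level segmentation: one token per character.

    Spaces are replaced by a visible boundary marker (an assumption of
    this sketch) so the original sentence can be recovered.
    """
    return list(sentence.strip().replace(" ", space))


def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def bpe_segment(word, merges):
    """Apply learned merges, in order, to a (possibly unseen) word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# The subword segmentation depends on statistics of the training data,
# whereas the character segmentation needs no learned model at all.
merges = learn_bpe(["ab"] * 3 + ["abc"], num_merges=1)
print(bpe_segment("abd", merges))   # ['ab', 'd']
print(char_segment("ab d"))
```

The learned merge table is the "segmentation model" the abstract refers to: with little or mismatched training data, its merges may fit new words poorly, while the character-level split is the same for any input.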
Anthology ID:
2021.acl-short.69
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Month:
August
Year:
2021
Address:
Online
Editors:
Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
543–549
URL:
https://aclanthology.org/2021.acl-short.69
DOI:
10.18653/v1/2021.acl-short.69
Cite (ACL):
Jiahuan Li, Yutong Shen, Shujian Huang, Xinyu Dai, and Jiajun Chen. 2021. When is Char Better Than Subword: A Systematic Study of Segmentation Algorithms for Neural Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 543–549, Online. Association for Computational Linguistics.
Cite (Informal):
When is Char Better Than Subword: A Systematic Study of Segmentation Algorithms for Neural Machine Translation (Li et al., ACL-IJCNLP 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2021.acl-short.69.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-2/2021.acl-short.69.mp4