Inference-only sub-character decomposition improves translation of unseen logographic characters

Danielle Saunders, Weston Feely, Bill Byrne


Abstract
Neural Machine Translation (NMT) on logographic source languages struggles when translating ‘unseen’ characters, which never appear in the training data. One possible approach to this problem applies sub-character decomposition to both training and test sentences. However, this approach requires complete retraining, and its effectiveness for unseen character translation into non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, for both high-resource and low-resource domains. For each language pair and domain we construct a test set where all source sentences contain at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation and gives inconsistent results overall. We offer a simple alternative based on decomposition before inference for unseen characters only. Our approach can be applied flexibly, improves translation adequacy, and requires no additional models or training.
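The inference-only idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `decompose_unseen`, the `seen_chars` set, and the toy `decomposition_table` are all assumptions introduced here, and a real system would load a full ideograph decomposition resource and the character inventory of its own training data.

```python
def decompose_unseen(sentence, seen_chars, decomposition_table):
    """Replace only unseen characters with their sub-character components.

    sentence: source-language string about to be translated.
    seen_chars: set of characters observed in the training data (assumed given).
    decomposition_table: dict mapping a character to a string of components
        (hypothetical toy data in this sketch).
    """
    pieces = []
    for ch in sentence:
        if ch in seen_chars or ch not in decomposition_table:
            # Seen characters, and characters with no known decomposition,
            # pass through unchanged.
            pieces.append(ch)
        else:
            # Unseen character: substitute its components before inference.
            pieces.append(decomposition_table[ch])
    return "".join(pieces)


# Toy data for illustration only (assumed, not taken from the paper).
seen_chars = set("魚青を食べる")
decomposition_table = {"鯖": "魚青"}

preprocessed = decompose_unseen("鯖を食べる", seen_chars, decomposition_table)
print(preprocessed)
```

Because the substitution happens only on the input at inference time, the trained model and its vocabulary are left untouched. Whether to keep ideographic description characters in the decomposed output, and how to handle components that are themselves unseen, are design choices this sketch leaves open.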
Anthology ID:
2020.wat-1.21
Volume:
Proceedings of the 7th Workshop on Asian Translation
Month:
December
Year:
2020
Address:
Suzhou, China
Editors:
Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Win Pa Pa, Ondřej Bojar, Shantipriya Parida, Isao Goto, Hidaya Mino, Hiroshi Manabe, Katsuhito Sudoh, Sadao Kurohashi, Pushpak Bhattacharyya
Venue:
WAT
Publisher:
Association for Computational Linguistics
Pages:
170–177
URL:
https://aclanthology.org/2020.wat-1.21
Cite (ACL):
Danielle Saunders, Weston Feely, and Bill Byrne. 2020. Inference-only sub-character decomposition improves translation of unseen logographic characters. In Proceedings of the 7th Workshop on Asian Translation, pages 170–177, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Inference-only sub-character decomposition improves translation of unseen logographic characters (Saunders et al., WAT 2020)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2020.wat-1.21.pdf
Data
ASPEC