Meaningless yet meaningful: Morphology grounded subword-level NMT

Tamali Banerjee, Pushpak Bhattacharyya


Abstract
We explore the use of two independent subsystems, Byte Pair Encoding (BPE) and Morfessor, as the source of basic units for subword-level neural machine translation (NMT). We show that, for linguistically distant language pairs, Morfessor-based segmentation produces significantly better-quality translations than BPE. However, for closely related language pairs, BPE-based subword NMT may translate better than Morfessor-based subword NMT. We propose Morfessor-BPE (M-BPE), a combination of these two segmentation algorithms, which outperforms both baseline systems in terms of BLEU score. Our results are supported by experiments on three language pairs: English-Hindi, Bengali-Hindi, and English-Bengali.
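The abstract does not spell out how the two segmenters are combined in M-BPE, so the following is only a minimal Python sketch of the two baseline segmentation steps it compares, using the open-source subword-nmt and morfessor packages. The file names, the 10k merge budget, and the simple "Morfessor morphs first, then BPE on top" combination shown at the end are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch of the two subword segmenters compared in the paper, using the
# open-source `subword-nmt` and `morfessor` packages. File names, the 10k
# merge budget, and the M-BPE combination order are illustrative assumptions.
import codecs

import morfessor
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# --- BPE baseline: learn merge operations on the training corpus, then segment ---
with codecs.open("train.txt", encoding="utf-8") as fin, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, num_symbols=10000)

with codecs.open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

def bpe_segment(line):
    # e.g. "unhappiness" -> "un@@ happi@@ ness" (actual merges depend on the corpus)
    return bpe.process_line(line)

# --- Morfessor baseline: unsupervised morphological segmentation ---
io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_data(list(io.read_corpus_file("train.txt")))
model.train_batch()

def morfessor_segment(word):
    segments, _cost = model.viterbi_segment(word)
    return segments

# --- M-BPE (illustrative only): use Morfessor morphs as the starting units and
# --- apply BPE on top of them; in a real pipeline the BPE merges would
# --- typically be learned on the Morfessor-segmented text.
def m_bpe_segment(line):
    morphs = " ".join(m for w in line.split() for m in morfessor_segment(w))
    return bpe.process_line(morphs)
```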
Anthology ID:
W18-1207
Volume:
Proceedings of the Second Workshop on Subword/Character LEvel Models
Month:
June
Year:
2018
Address:
New Orleans
Editors:
Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, Yadollah Yaghoobzadeh
Venue:
SCLeM
Publisher:
Association for Computational Linguistics
Pages:
55–60
URL:
https://aclanthology.org/W18-1207
DOI:
10.18653/v1/W18-1207
Cite (ACL):
Tamali Banerjee and Pushpak Bhattacharyya. 2018. Meaningless yet meaningful: Morphology grounded subword-level NMT. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 55–60, New Orleans. Association for Computational Linguistics.
Cite (Informal):
Meaningless yet meaningful: Morphology grounded subword-level NMT (Banerjee & Bhattacharyya, SCLeM 2018)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/W18-1207.pdf