Abstract
We explore the use of two independent segmentation systems, Byte Pair Encoding (BPE) and Morfessor, to produce the basic units for subword-level neural machine translation (NMT). We show that, for linguistically distant language pairs, Morfessor-based segmentation produces significantly better-quality translations than BPE. However, for close language pairs, BPE-based subword NMT may translate better than Morfessor-based subword NMT. We propose a combined approach of these two segmentation algorithms, Morfessor-BPE (M-BPE), which outperforms both baseline systems in terms of BLEU score. Our results are supported by experiments on three language pairs: English-Hindi, Bengali-Hindi and English-Bengali.
- Anthology ID:
- W18-1207
- Volume:
- Proceedings of the Second Workshop on Subword/Character LEvel Models
- Month:
- June
- Year:
- 2018
- Address:
- New Orleans
- Venue:
- SCLeM
- Publisher:
- Association for Computational Linguistics
- Pages:
- 55–60
- URL:
- https://aclanthology.org/W18-1207
- DOI:
- 10.18653/v1/W18-1207
- Cite (ACL):
- Tamali Banerjee and Pushpak Bhattacharyya. 2018. Meaningless yet meaningful: Morphology grounded subword-level NMT. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 55–60, New Orleans. Association for Computational Linguistics.
- Cite (Informal):
- Meaningless yet meaningful: Morphology grounded subword-level NMT (Banerjee & Bhattacharyya, SCLeM 2018)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/W18-1207.pdf
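The abstract contrasts BPE, which greedily merges the most frequent adjacent symbol pairs regardless of morphology, with Morfessor's unsupervised morphological segmentation. As a rough illustration of the BPE side only (a minimal sketch on toy data, not the authors' implementation, and with Morfessor and the M-BPE combination omitted), the merge-learning step can be written as:

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the word-frequency vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Apply one merge: join every standalone occurrence of the symbol pair."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges=10):
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # Start from character-level symbols with an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


if __name__ == "__main__":
    # Toy word frequencies; real BPE is learned over the full training vocabulary.
    toy = {"lower": 5, "low": 7, "newest": 6, "widest": 3}
    for merge in learn_bpe(toy, num_merges=8):
        print(merge)
```

The learned merges here reflect pure frequency (e.g. "es"+"t" may merge before any linguistically meaningful boundary), which is exactly the behaviour the paper compares against morphology-grounded segmentation.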