Incorporating Word and Subword Units in Unsupervised Machine Translation Using Language Model Rescoring

Zihan Liu, Yan Xu, Genta Indra Winata, Pascale Fung


Abstract
This paper describes CAiRE’s submission to the unsupervised machine translation track of the WMT’19 news shared task from German to Czech. We leverage a phrase-based statistical machine translation (PBSMT) model and a pre-trained language model to combine word-level neural machine translation (NMT) and subword-level NMT models without using any parallel data. To address the morphological richness of these languages, we train byte-pair encoding (BPE) embeddings for German and Czech separately and align them using MUSE (Conneau et al., 2018). To ensure the fluency and consistency of translations, we propose a rescoring mechanism that reuses the pre-trained language model to select among the translation candidates generated through beam search. Moreover, a series of pre-processing and post-processing steps is applied to improve the quality of the final translations.
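The rescoring step described in the abstract can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the `lm_score` callable, the interpolation weight `lm_weight`, and the toy usage example are all assumptions introduced here for clarity.

```python
from typing import Callable, List, Tuple


def rescore_candidates(
    candidates: List[Tuple[str, float]],   # (translation, translation-model log-prob)
    lm_score: Callable[[str], float],      # pre-trained LM log-prob of a sentence
    lm_weight: float = 0.5,                # interpolation weight (assumed value)
) -> str:
    """Pick the beam-search candidate with the best combined score.

    The translation model's log-probability is interpolated with the
    pre-trained language model's score, and the highest-scoring
    candidate is returned.
    """
    def combined(item: Tuple[str, float]) -> float:
        sentence, tm_logprob = item
        return (1.0 - lm_weight) * tm_logprob + lm_weight * lm_score(sentence)

    return max(candidates, key=combined)[0]


if __name__ == "__main__":
    # Toy LM that simply prefers shorter candidates (illustration only).
    toy_lm = lambda s: -0.1 * len(s.split())
    beams = [("candidate translation one", -3.2), ("candidate two", -3.5)]
    print(rescore_candidates(beams, toy_lm))
```

In the paper's setup the candidates would come from beam search over the word-level and subword-level NMT systems, and the scoring model would be the pre-trained language model; the sketch above only shows the selection logic.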
Anthology ID:
W19-5327
Volume:
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
Month:
August
Year:
2019
Address:
Florence, Italy
Venues:
ACL | WMT | WS
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
275–282
URL:
https://aclanthology.org/W19-5327
DOI:
10.18653/v1/W19-5327
Cite (ACL):
Zihan Liu, Yan Xu, Genta Indra Winata, and Pascale Fung. 2019. Incorporating Word and Subword Units in Unsupervised Machine Translation Using Language Model Rescoring. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 275–282, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Incorporating Word and Subword Units in Unsupervised Machine Translation Using Language Model Rescoring (Liu et al., 2019)
PDF:
https://preview.aclanthology.org/update-css-js/W19-5327.pdf