Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

Lukas Edman, Antonio Toral, Gertjan van Noord


Abstract
Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we offer a potential solution: dependency-based word embeddings. These embeddings result in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
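The dependency-based embeddings the abstract refers to follow the scheme of Levy and Goldberg (2014): instead of the linear window contexts of standard word2vec, each word's contexts are its syntactic head and dependents, labelled with the dependency relation. The sketch below shows only this context-extraction step, using a simplified `(token, head index, relation)` parse format as a stand-in for real parser output; the function name and toy parse are illustrative, not the paper's actual pipeline.

```python
def dependency_contexts(parse):
    """Yield (word, context) pairs from a dependency parse.

    Following the dependency-based embedding scheme, each token
    takes its head as context "rel_head", and the head in turn
    takes the inverse context "rel^-1_child".
    """
    pairs = []
    for tok, head, rel in parse:
        if head < 0:  # the root token has no head
            continue
        head_tok = parse[head][0]
        pairs.append((tok, f"{rel}_{head_tok}"))       # child -> head context
        pairs.append((head_tok, f"{rel}^-1_{tok}"))    # head -> child context
    return pairs

# Toy parse of "scientist discovers star"; head indices refer to
# positions in the list, -1 marks the root.
parse = [
    ("scientist", 1, "nsubj"),
    ("discovers", -1, "root"),
    ("star", 1, "dobj"),
]

print(dependency_contexts(parse))
# [('scientist', 'nsubj_discovers'), ('discovers', 'nsubj^-1_scientist'),
#  ('star', 'dobj_discovers'), ('discovers', 'dobj^-1_star')]
```

These (word, context) pairs are then fed to a skip-gram style trainer in place of window-based pairs, which is what makes the resulting representation complementary to standard word2vec.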
Anthology ID:
2020.eamt-1.10
Volume:
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
Month:
November
Year:
2020
Address:
Lisboa, Portugal
Venue:
EAMT
Publisher:
European Association for Machine Translation
Pages:
81–90
URL:
https://aclanthology.org/2020.eamt-1.10
Cite (ACL):
Lukas Edman, Antonio Toral, and Gertjan van Noord. 2020. Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 81–90, Lisboa, Portugal. European Association for Machine Translation.
Cite (Informal):
Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution (Edman et al., EAMT 2020)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.eamt-1.10.pdf
Code
 leukas/lrumt