@article{keung-etal-2020-unsupervised,
title = "Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings",
author = "Keung, Phillip and
Salazar, Julian and
Lu, Yichao and
Smith, Noah A.",
editor = "Johnson, Mark and
Roark, Brian and
Nenkova, Ani",
journal = "Transactions of the Association for Computational Linguistics",
volume = "8",
year = "2020",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://preview.aclanthology.org/jlcl-multiple-ingestion/2020.tacl-1.53/",
doi = "10.1162/tacl_a_00348",
pages = "828--841",
abstract = "We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT`14 French-English and WMT`16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT`15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings."
}
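The abstract's core mining step — embedding source and target sentences with multilingual BERT and keeping nearest-neighbor matches as pseudo-parallel pairs — can be illustrated with a minimal sketch. This uses the Hugging Face `transformers` library, mean pooling, and mutual nearest neighbors as illustrative assumptions; the paper's actual pooling, scoring criterion, and self-training loop may differ.

```python
# Minimal sketch of nearest-neighbor bitext mining with multilingual BERT.
# Assumptions (not from the paper): mean pooling over the final hidden
# states, cosine similarity, and mutual-nearest-neighbor filtering.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def embed(sentences):
    """Mean-pool the final hidden states into one unit-norm vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean over tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

src = ["The cat sat on the mat.", "Economic growth slowed last year."]
tgt = ["Das Wirtschaftswachstum verlangsamte sich im letzten Jahr.",
       "Die Katze saß auf der Matte."]

sim = embed(src) @ embed(tgt).T   # cosine similarities (rows are unit norm)
fwd = sim.argmax(dim=1)           # best target for each source sentence
bwd = sim.argmax(dim=0)           # best source for each target sentence
# Keep mutual nearest neighbors as candidate pseudo-parallel pairs.
pairs = [(i, j.item()) for i, j in enumerate(fwd) if bwd[j].item() == i]
for i, j in pairs:
    print(f"{src[i]}  <->  {tgt[j]}  (cos={sim[i, j]:.2f})")
```

In the paper's pipeline, pairs mined this way would then be used as pseudo-labels to self-train the encoder and to augment the MT training corpus; this sketch covers only the retrieval step.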
Markdown (Informal)
[Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings](https://aclanthology.org/2020.tacl-1.53/) (Keung et al., TACL 2020)