@inproceedings{heffernan-etal-2022-bitext,
title = "Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages",
author = "Heffernan, Kevin and
{\c{C}}elebi, Onur and
Schwenk, Holger",
editor = "Goldberg, Yoav and
Kozareva, Zornitsa and
Zhang, Yue",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.154/",
doi = "10.18653/v1/2022.findings-emnlp.154",
pages = "2101--2112",
    abstract = "Scaling multilingual representation learning beyond the hundred most frequent languages is challenging, in particular to cover the long tail of low-resource languages. We move away from the popular one-for-all multilingual models and focus on training multiple language (family) specific representations, but most prominently enable all languages to still be encoded in the same representational space. We focus on teacher-student training, allowing all encoders to be mutually compatible for bitext mining, and enabling fast learning of new languages. We also combine supervised and self-supervised training, allowing encoders to take advantage of monolingual training data. Our approach significantly outperforms the original LASER encoder. We study very low-resource languages and handle 44 African languages, many of which are not covered by any other model. For these languages, we train sentence encoders and mine bitexts. Adding these mined bitexts yielded an improvement of 3.8 BLEU for NMT into English."
}
Markdown (Informal)
[Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages](https://aclanthology.org/2022.findings-emnlp.154/) (Heffernan et al., Findings 2022)
ACL
Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. [Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages](https://aclanthology.org/2022.findings-emnlp.154/). In *Findings of the Association for Computational Linguistics: EMNLP 2022*, pages 2101–2112, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.