Abstract
Scaling multilingual representation learning beyond the hundred most frequent languages is challenging, in particular to cover the long tail of low-resource languages. We move away from the popular one-for-all multilingual models and focus on training multiple language (family) specific representations, but most prominently enable all languages to still be encoded in the same representational space. We focus on teacher-student training, allowing all encoders to be mutually compatible for bitext mining, and enabling fast learning of new languages. We also combine supervised and self-supervised training, allowing encoders to take advantage of monolingual training data.Our approach significantly outperforms the original LASER encoder. We study very low-resource languages and handle 44 African languages, many of which are not covered by any other model. For these languages, we train sentence encoders and mine bitexts. Adding these mined bitexts yielded an improvement of 3.8 BLEU for NMT into English.- Anthology ID:
- 2022.findings-emnlp.154
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2022
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 2101–2112
- Language:
- URL:
- https://aclanthology.org/2022.findings-emnlp.154
- DOI:
- Cite (ACL):
- Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2101–2112, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages (Heffernan et al., Findings 2022)
- PDF:
- https://preview.aclanthology.org/nodalida-main-page/2022.findings-emnlp.154.pdf