Abstract
Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large models incur high inference latency and computational overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models, LEALLA, on TensorFlow Hub.
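As a concrete illustration of the released models, the sketch below shows how the LEALLA encoders published on TensorFlow Hub might be used to embed sentences from different languages and score them for parallel sentence alignment via cosine similarity. The module handle follows the TensorFlow Hub model card, and the snippet assumes the published module accepts raw text strings; treat this as a minimal usage sketch, not the paper's evaluation pipeline.

```python
import tensorflow as tf
import tensorflow_hub as hub

# LEALLA-small produces low-dimensional (128-d) embeddings;
# LEALLA-base and LEALLA-large variants are also published on TF Hub.
encoder = hub.KerasLayer("https://tfhub.dev/google/LEALLA/LEALLA-small/1")

sentences = tf.constant([
    "Hello, world!",        # English
    "Bonjour le monde !",   # French
    "こんにちは、世界。",      # Japanese
])

# L2-normalize so that dot products equal cosine similarities.
embeddings = tf.math.l2_normalize(encoder(sentences), axis=-1)

# Pairwise cosine similarity matrix; mutual translations should score highest.
similarity = tf.matmul(embeddings, embeddings, transpose_b=True)
print(similarity.numpy())
```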
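The abstract does not spell out the distillation objective, so the following is only an illustrative sketch of one common formulation: matching the student's low-dimensional embeddings to a higher-dimensional teacher (e.g., LaBSE) through a trainable projection with a mean-squared-error loss. The dimensions and the `project` layer are hypothetical assumptions, not taken from the paper.

```python
import tensorflow as tf

# Hypothetical dimensions: a 128-d student (LEALLA-small-sized) and a
# 768-d teacher (LaBSE-sized); both sizes are illustrative assumptions.
STUDENT_DIM, TEACHER_DIM = 128, 768

# Trainable linear map bridging the dimension gap between student and
# teacher embeddings; the exact objective is not stated in the abstract.
project = tf.keras.layers.Dense(TEACHER_DIM, use_bias=False)

def feature_distillation_loss(student_emb, teacher_emb):
    """MSE between projected, L2-normalized student embeddings and
    (frozen) teacher embeddings of the same sentences."""
    s = tf.math.l2_normalize(project(student_emb), axis=-1)
    t = tf.math.l2_normalize(teacher_emb, axis=-1)
    return tf.reduce_mean(tf.reduce_sum(tf.square(s - t), axis=-1))

# Toy usage with random tensors standing in for real encoder outputs.
student = tf.random.normal([4, STUDENT_DIM])
teacher = tf.random.normal([4, TEACHER_DIM])
loss = feature_distillation_loss(student, teacher)
```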
- Anthology ID: 2023.eacl-main.138
- Volume: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
- Month: May
- Year: 2023
- Address: Dubrovnik, Croatia
- Venue: EACL
- Publisher: Association for Computational Linguistics
- Pages: 1886–1894
- URL: https://aclanthology.org/2023.eacl-main.138
- Cite (ACL): Zhuoyuan Mao and Tetsuji Nakagawa. 2023. LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1886–1894, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal): LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation (Mao & Nakagawa, EACL 2023)
- PDF: https://preview.aclanthology.org/paclic-22-ingestion/2023.eacl-main.138.pdf