Abstract
We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI data set), cross-lingual document classification (MLDoc data set), and parallel corpus mining (BUCC data set) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder, and the multilingual test set are available at https://github.com/facebookresearch/LASER.
 - Anthology ID:
 - Q19-1038
 - Volume:
 - Transactions of the Association for Computational Linguistics, Volume 7
 - Year:
 - 2019
 - Address:
 - Cambridge, MA
 - Editors:
 - Lillian Lee, Mark Johnson, Brian Roark, Ani Nenkova
 - Venue:
 - TACL
 - Publisher:
 - MIT Press
 - Pages:
 - 597–610
 - URL:
 - https://aclanthology.org/Q19-1038
 - DOI:
 - 10.1162/tacl_a_00288
 - Cite (ACL):
 - Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
 - Cite (Informal):
 - Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond (Artetxe & Schwenk, TACL 2019)
 - PDF:
 - https://preview.aclanthology.org/ingest-acl-2023-videos/Q19-1038.pdf
 - Code
 - facebookresearch/LASER
 - Data
 - Tatoeba, BUCC, MLDoc, XNLI
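
The multilingual similarity search mentioned in the abstract can be illustrated with a minimal sketch. It assumes that sentences are already encoded as fixed-size vectors, that cosine similarity is the ranking metric, and that quality is reported as the fraction of source sentences whose nearest target is not the aligned translation. The function names and the toy 3-dimensional vectors are hypothetical and are not taken from the LASER code base.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(query, candidates):
    """Return the index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


def search_error_rate(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target is not the aligned one.

    Assumes src_embs[i] and tgt_embs[i] embed a sentence and its translation.
    """
    errors = sum(
        1 for i, query in enumerate(src_embs)
        if nearest_neighbor(query, tgt_embs) != i
    )
    return errors / len(src_embs)


# Made-up vectors standing in for real sentence embeddings (illustration only).
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
tgt = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 0.9]]
error_rate = search_error_rate(src, tgt)
```

In a real setting the vectors would come from the pre-trained encoder, and an approximate nearest-neighbor index would likely replace the exhaustive loop for large candidate sets.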