Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau


Abstract
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. By sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well in high-resource conditions, achieving state-of-the-art performance on a German-English word alignment task.
Anthology ID:
2021.mrl-1.2
Volume:
Proceedings of the 1st Workshop on Multilingual Representation Learning
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venues:
EMNLP | MRL
Publisher:
Association for Computational Linguistics
Pages:
16–31
URL:
https://aclanthology.org/2021.mrl-1.2
DOI:
10.18653/v1/2021.mrl-1.2
Cite (ACL):
Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, and Jey Han Lau. 2021. Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 16–31, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora (Wada et al., MRL 2021)
PDF:
https://preview.aclanthology.org/update-css-js/2021.mrl-1.2.pdf
Code:
twadada/multilingual-nlm