Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau
Abstract
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. By sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well in high-resource conditions, achieving state-of-the-art performance on a German-English word-alignment task.
- Anthology ID: 2021.mrl-1.2
- Volume: Proceedings of the 1st Workshop on Multilingual Representation Learning
- Month: November
- Year: 2021
- Address: Punta Cana, Dominican Republic
- Editors: Duygu Ataman, Alexandra Birch, Alexis Conneau, Orhan Firat, Sebastian Ruder, Gozde Gul Sahin
- Venue: MRL
- Publisher: Association for Computational Linguistics
- Pages: 16–31
- URL: https://aclanthology.org/2021.mrl-1.2
- DOI: 10.18653/v1/2021.mrl-1.2
- Cite (ACL): Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, and Jey Han Lau. 2021. Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 16–31, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal): Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora (Wada et al., MRL 2021)
- PDF: https://preview.aclanthology.org/improve-issue-templates/2021.mrl-1.2.pdf
- Code: twadada/multilingual-nlm
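The abstract mentions combining word and subword embeddings to exploit orthographic similarities across languages. As a conceptual illustration only (not the paper's actual implementation, which is in the repository above), here is a minimal sketch assuming a fastText-style scheme in which a word's vector is summed with the mean of its character n-gram vectors; the function names and the sum-plus-mean combination are assumptions for illustration:

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams with boundary markers (fastText-style)."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]


def combined_embedding(word, word_vecs, subword_vecs, dim):
    """Sum a word's vector with the mean of its known subword vectors.

    Orthographically similar words in related languages share character
    n-grams, so their combined embeddings are pulled closer together in
    the shared cross-lingual space.
    """
    vec = list(word_vecs.get(word, [0.0] * dim))
    grams = [g for g in char_ngrams(word) if g in subword_vecs]
    if not grams:
        return vec
    mean = [sum(subword_vecs[g][d] for g in grams) / len(grams)
            for d in range(dim)]
    return [v + m for v, m in zip(vec, mean)]
```

For example, with toy 2-dimensional vectors, `combined_embedding("cat", {"cat": [1.0, 0.0]}, {"<ca": [0.5, 0.5]}, 2)` adds the single matching n-gram vector to the word vector, returning `[1.5, 0.5]`.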