Cross-Lingual Word Representations: Induction and Evaluation

Manaal Faruqui, Anders Søgaard, Ivan Vulić


Abstract
In the recent past, NLP as a field has seen tremendous utility of distributional word vector representations as features in downstream tasks. The fact that these word vectors can be trained on unlabeled monolingual corpora of a language makes them an inexpensive resource in NLP. With the increasing use of monolingual word vectors, there is a need for word vectors that can be used as efficiently across multiple languages as monolingually. Therefore, learning bilingual and multilingual word embeddings/vectors is currently an important research topic. These vectors offer an elegant and language-pair independent way to represent content across different languages. This tutorial aims to bring NLP researchers up to speed with the current techniques in cross-lingual word representation learning. We will first discuss how to induce cross-lingual word representations (covering both bilingual and multilingual ones) from various data types and resources (e.g., parallel data, comparable data, non-aligned monolingual data in different languages, dictionaries and thesauri, or even images and eye-tracking data). We will then discuss how to evaluate such representations, intrinsically and extrinsically. We will introduce researchers to state-of-the-art methods for constructing cross-lingual word representations and discuss their applicability in a broad range of downstream NLP applications. We will deliver a detailed survey of the current methods, discuss best training and evaluation practices and use-cases, and provide links to publicly available implementations, datasets, and pre-trained models.
Anthology ID:
D17-3007
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Alexandra Birch, Nathan Schneider
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/D17-3007
Cite (ACL):
Manaal Faruqui, Anders Søgaard, and Ivan Vulić. 2017. Cross-Lingual Word Representations: Induction and Evaluation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Cross-Lingual Word Representations: Induction and Evaluation (Faruqui et al., EMNLP 2017)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/D17-3007.pdf