Cross-Lingual Word Representations: Induction and Evaluation

Manaal Faruqui; Anders Søgaard; Ivan Vulić

Abstract

In the recent past, NLP has seen distributional word vector representations prove tremendously useful as features in downstream tasks. Because these word vectors can be trained on unlabeled monolingual corpora, they are an inexpensive resource. With the increasing use of monolingual word vectors, there is a need for word vectors that can be used as efficiently across multiple languages as they are monolingually. Learning bilingual and multilingual word embeddings is therefore currently an important research topic. These vectors offer an elegant and language-pair independent way to represent content across different languages.

This tutorial aims to bring NLP researchers up to speed with the current techniques in cross-lingual word representation learning. We will first discuss how to induce cross-lingual word representations (covering both bilingual and multilingual ones) from various data types and resources (e.g., parallel data, comparable data, non-aligned monolingual data in different languages, dictionaries and thesauri, or even images and eye-tracking data). We will then discuss how to evaluate such representations, intrinsically and extrinsically. We will introduce researchers to state-of-the-art methods for constructing cross-lingual word representations and discuss their applicability in a broad range of downstream NLP applications.

We will deliver a detailed survey of the current methods, discuss best training and evaluation practices and use cases, and provide links to publicly available implementations, datasets, and pre-trained models.

Anthology ID:: D17-3007
Volume:: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
Month:: September
Year:: 2017
Address:: Copenhagen, Denmark
Editors:: Alexandra Birch, Nathan Schneider
Venue:: EMNLP
SIG:: SIGDAT
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/D17-3007
Cite (ACL):: Manaal Faruqui, Anders Søgaard, and Ivan Vulić. 2017. Cross-Lingual Word Representations: Induction and Evaluation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):: Cross-Lingual Word Representations: Induction and Evaluation (Faruqui et al., EMNLP 2017)
PDF:: https://preview.aclanthology.org/nschneid-patch-5/D17-3007.pdf
