Abstract
Multilingual contextual embeddings, such as multilingual BERT and XLM-RoBERTa, have proved useful for many multilingual tasks. Previous work probed the cross-linguality of the representations indirectly, using zero-shot transfer learning on morphological and syntactic tasks. We instead investigate the language neutrality of multilingual contextual embeddings directly, with respect to lexical semantics. Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings, which are explicitly trained for language neutrality. Contextual embeddings are still only moderately language-neutral by default, so we propose two simple methods for achieving stronger language neutrality: first, unsupervised centering of the representations for each language, and second, fitting an explicit projection on small parallel data. In addition, we show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences without using parallel data.
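Both proposed fixes are straightforward to prototype. Below is a minimal NumPy sketch, assuming sentence representations are stored per language as (n_sentences, dim) arrays of mean-pooled contextual states; the function names, the input layout, and the plain least-squares fit are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
import numpy as np

def center_per_language(embs_by_lang):
    """Unsupervised centering: subtract each language's mean vector so
    that all languages share a common origin in the embedding space."""
    return {lang: e - e.mean(axis=0, keepdims=True)
            for lang, e in embs_by_lang.items()}

def fit_projection(src_embs, tgt_embs):
    """Fit a linear map W minimizing ||src @ W - tgt||^2 over a small
    set of parallel sentence pairs (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(src_embs, tgt_embs, rcond=None)
    return W

# Hypothetical usage: `embs` maps a language code to a
# (n_sentences, dim) array of mean-pooled contextual states.
# centered = center_per_language(embs)
# W = fit_projection(centered["de"], centered["en"])
# projected_de = centered["de"] @ W
```

With centered or projected representations, cross-lingual comparisons can then use plain cosine similarity.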
- Anthology ID: 2020.findings-emnlp.150
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
- Month: November
- Year: 2020
- Address: Online
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 1663–1674
- URL: https://aclanthology.org/2020.findings-emnlp.150
- DOI: 10.18653/v1/2020.findings-emnlp.150
- Cite (ACL): Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2020. On the Language Neutrality of Pre-trained Multilingual Representations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1663–1674, Online. Association for Computational Linguistics.
- Cite (Informal): On the Language Neutrality of Pre-trained Multilingual Representations (Libovický et al., Findings 2020)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2020.findings-emnlp.150.pdf
- Code: jlibovicky/assess-multilingual-bert
- Data: WMT 2014, XNLI