Abstract
In this work, we analyze the performance and properties of cross-lingual word embedding models created by mapping-based alignment methods. We use several measures of corpus and embedding similarity to predict the BLI scores of cross-lingual embedding mappings over three types of corpora, three embedding methods, and 55 language pairs. Our experimental results corroborate that the amount of common content in the training corpora, rather than mere size, is essential. This phenomenon manifests in two ways: i) despite the smaller corpus sizes, training the monolingual embedding spaces to be mapped on only the comparable parts of Wikipedia is often more effective than relying on all the contents of Wikipedia, and ii) the smaller, and in return less diversified, Spanish Wikipedia almost always works much better as a training corpus for bilingual mappings than the ubiquitously used English Wikipedia.
- Anthology ID:
- 2021.mrl-1.9
- Volume:
- Proceedings of the 1st Workshop on Multilingual Representation Learning
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Duygu Ataman, Alexandra Birch, Alexis Conneau, Orhan Firat, Sebastian Ruder, Gozde Gul Sahin
- Venue:
- MRL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 96–106
- URL:
- https://aclanthology.org/2021.mrl-1.9
- DOI:
- 10.18653/v1/2021.mrl-1.9
- Cite (ACL):
- Réka Cserháti and Gábor Berend. 2021. Identifying the Importance of Content Overlap for Better Cross-lingual Embedding Mappings. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 96–106, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Identifying the Importance of Content Overlap for Better Cross-lingual Embedding Mappings (Cserháti & Berend, MRL 2021)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/2021.mrl-1.9.pdf
- Data
- WikiMatrix