Colex2Lang: Language Embeddings from Semantic Typology

Yiyi Chen, Russa Biswas, Johannes Bjerva

Abstract
In semantic typology, colexification refers to words with multiple meanings, either related (polysemy) or unrelated (homophony). Studies of cross-linguistic colexification have yielded insights into, e.g., psychology, historical linguistics and cognitive science (Xu et al., 2020; Brochhagen and Boleda, 2022; Schapper and Koptjevskaja-Tamm, 2022). While NLP research up until now has mainly focused on integrating syntactic typology (Naseem et al., 2012; Ponti et al., 2019; Chaudhary et al., 2019; Üstün et al., 2020; Ansell et al., 2021; Oncevay et al., 2022), we here investigate the potential of incorporating semantic typology, of which colexification is an example. We propose a framework for constructing a large-scale synset graph and learning language representations with node embedding algorithms. We demonstrate that cross-lingual colexification patterns provide a distinct signal for modelling language similarity and predicting typological features. Our representations achieve a 9.97% performance gain in predicting lexico-semantic typological features and, as expected, contain a weaker syntactic signal. This study is the first attempt to learn language representations and model language similarities using semantic typology at a large scale, setting a new direction for multilingual NLP, especially for low-resource languages.
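To make the notion of colexification in the abstract concrete, the following is a minimal sketch rather than the authors' method. It assumes a hypothetical toy lexicon (`TOY_LEXICON`, with invented language and word names), extracts the concept pairs that each language colexifies, and compares languages by the Jaccard overlap of those pairs. The paper instead builds a large-scale synset graph and learns representations with node embedding algorithms; the set overlap here is a deliberately simpler stand-in for illustration only.

```python
from itertools import combinations

# Hypothetical toy lexicon: language -> word -> concepts that word expresses.
# All names and entries are invented for illustration, not taken from the paper.
TOY_LEXICON = {
    "lang_a": {"w1": {"TREE", "WOOD"}, "w2": {"HAND", "ARM"}, "w3": {"MOON"}},
    "lang_b": {"w1": {"TREE", "WOOD"}, "w2": {"HAND"}, "w3": {"MOON", "MONTH"}},
    "lang_c": {"w1": {"HAND", "ARM"}, "w2": {"MOON", "MONTH"}, "w3": {"TREE"}},
}


def colexified_pairs(words):
    """Return the set of concept pairs expressed by a single word."""
    pairs = set()
    for concepts in words.values():
        # Sorting gives each unordered pair one canonical form.
        for first, second in combinations(sorted(concepts), 2):
            pairs.add((first, second))
    return pairs


def jaccard(x, y):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


# Colexification pattern of each language in the toy lexicon.
PATTERNS = {lang: colexified_pairs(words) for lang, words in TOY_LEXICON.items()}


def language_similarity(lang_1, lang_2):
    """Similarity of two languages based on shared colexification pairs."""
    return jaccard(PATTERNS[lang_1], PATTERNS[lang_2])
```

Calling `language_similarity("lang_a", "lang_b")` compares the two languages by the colexified concept pairs they share; in this toy data both colexify TREE and WOOD, so their similarity is nonzero.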
Anthology ID:
2023.nodalida-1.67
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
673–684
URL:
https://aclanthology.org/2023.nodalida-1.67
Cite (ACL):
Yiyi Chen, Russa Biswas, and Johannes Bjerva. 2023. Colex2Lang: Language Embeddings from Semantic Typology. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 673–684, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Colex2Lang: Language Embeddings from Semantic Typology (Chen et al., NoDaLiDa 2023)
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/2023.nodalida-1.67.pdf