Token-level semantic typology without a massively parallel corpus

Barend Beekhuizen


Abstract
This paper presents a computational method for token-level lexical semantic comparative research in an original text setting, as opposed to the more common massively parallel setting. Given a set of (non-massively parallel) bitexts, the method consists of leveraging pre-trained contextual vectors in a reference language to induce, for a token in one target language, the lexical items that all other target languages would have used, thus simulating a massively parallel set-up. The method is evaluated on its extraction and induction quality, and the use of the method for lexical semantic typological research is demonstrated.
Anthology ID:
2025.sigtyp-1.16
Volume:
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Michael Hahn, Priya Rani, Ritesh Kumar, Andreas Shcherbakov, Alexey Sorokin, Oleg Serikov, Ryan Cotterell, Ekaterina Vylomova
Venues:
SIGTYP | WS
Publisher:
Association for Computational Linguistics
Pages:
165–176
URL:
https://preview.aclanthology.org/landing_page/2025.sigtyp-1.16/
Cite (ACL):
Barend Beekhuizen. 2025. Token-level semantic typology without a massively parallel corpus. In Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 165–176, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Token-level semantic typology without a massively parallel corpus (Beekhuizen, SIGTYP 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.sigtyp-1.16.pdf