Token-level semantic typology without a massively parallel corpus

Barend Beekhuizen

Abstract

This paper presents a computational method for token-level lexical semantic comparative research in an original text setting, as opposed to the more common massively parallel setting. Given a set of (non-massively parallel) bitexts, the method consists of leveraging pre-trained contextual vectors in a reference language to induce, for a token in one target language, the lexical items that all other target languages would have used, thus simulating a massively parallel set-up. The method is evaluated on its extraction and induction quality, and the use of the method for lexical semantic typological research is demonstrated.

Anthology ID: 2025.sigtyp-1.16
Volume: Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Month: August
Year: 2025
Address: Vienna, Austria
Editors: Michael Hahn, Priya Rani, Ritesh Kumar, Andreas Shcherbakov, Alexey Sorokin, Oleg Serikov, Ryan Cotterell, Ekaterina Vylomova
Venues: SIGTYP | WS
Publisher: Association for Computational Linguistics
Pages: 165–176
URL: https://preview.aclanthology.org/landing_page/2025.sigtyp-1.16/
Cite (ACL): Barend Beekhuizen. 2025. Token-level semantic typology without a massively parallel corpus. In Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 165–176, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Token-level semantic typology without a massively parallel corpus (Beekhuizen, SIGTYP 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.sigtyp-1.16.pdf
