Resource Acquisition for Understudied Languages: Extracting Wordlists from Dictionaries for Computer-assisted Language Comparison

Frederic Blum, Johannes Englisch, Alba Hermida Rodriguez, Rik van Gijn, Johann-Mattis List


Abstract
Comparative wordlists play a crucial role for historical language comparison. They are regularly used for the identification of related words and languages, or for the reconstruction of language phylogenies and proto-languages. While automated solutions exist for the majority of methods used for this purpose, no standardized computational or computer-assisted approaches for the compilation of comparative wordlists have been proposed so far. Up to today, scholars compile wordlists by sifting manually through dictionaries or similar language resources and typing them into spreadsheets. In this study we present a semi-automatic approach to extract wordlists from machine-readable dictionaries. The transparent workflow allows to build user-defined wordlists for individual languages in a standardized format. By automating the search for translation equivalents in dictionaries, our approach greatly facilitates the aggregation of individual resources into multilingual comparative wordlists that can be used for a variety of purposes.
Anthology ID:
2024.sigul-1.36
Volume:
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venues:
SIGUL | WS
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
300–306
Language:
URL:
https://aclanthology.org/2024.sigul-1.36
DOI:
Bibkey:
Cite (ACL):
Frederic Blum, Johannes Englisch, Alba Hermida Rodriguez, Rik van Gijn, and Johann-Mattis List. 2024. Resource Acquisition for Understudied Languages: Extracting Wordlists from Dictionaries for Computer-assisted Language Comparison. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 300–306, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Resource Acquisition for Understudied Languages: Extracting Wordlists from Dictionaries for Computer-assisted Language Comparison (Blum et al., SIGUL-WS 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2024.sigul-1.36.pdf