Jonathan Janetzki


2024

pdf bib
GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages
Jonathan Janetzki | Gerard De Melo | Joshua Nemecek | Daniel Whitenack
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Over 7,000 of the world’s 7,168 living languages are still low-resourced. This paper aims to narrow the language documentation gap by creating multiparallel dictionaries, clustered by SIL’s semantic domains. This task is new for machine learning and has previously been done manually by native speakers. We propose GUIDE, a language-agnostic tool that uses a GNN to create and populate semantic domain dictionaries, using seed dictionaries and Bible translations as a parallel text corpus. Our work sets a new benchmark, achieving an exemplary average precision of 60% in eight zero-shot evaluation languages and predicting an average of 2,400 dictionary entries. We share the code, model, multilingual evaluation data, and new dictionaries with the research community: https://github.com/janetzki/GUIDE