GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages

Jonathan Janetzki, Gerard De Melo, Joshua Nemecek, Daniel Whitenack


Abstract
Over 7,000 of the world’s 7,168 living languages are still low-resourced. This paper aims to narrow the language documentation gap by creating multiparallel dictionaries, clustered by SIL’s semantic domains. This task is new for machine learning and has previously been done manually by native speakers. We propose GUIDE, a language-agnostic tool that uses a graph neural network (GNN) to create and populate semantic domain dictionaries from seed dictionaries and Bible translations, which serve as a parallel text corpus. Our work sets a new benchmark, achieving an exemplary average precision of 60% across eight zero-shot evaluation languages and predicting an average of 2,400 dictionary entries. We share the code, model, multilingual evaluation data, and new dictionaries with the research community: https://github.com/janetzki/GUIDE
Anthology ID:
2024.sigtyp-1.2
Volume:
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Month:
March
Year:
2024
Address:
St. Julian's, Malta
Editors:
Michael Hahn, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Yulia Otmakhova, Jinrui Yang, Oleg Serikov, Priya Rani, Edoardo M. Ponti, Saliha Muradoğlu, Rena Gao, Ryan Cotterell, Ekaterina Vylomova
Venues:
SIGTYP | WS
Publisher:
Association for Computational Linguistics
Pages:
10–24
URL:
https://aclanthology.org/2024.sigtyp-1.2
Cite (ACL):
Jonathan Janetzki, Gerard De Melo, Joshua Nemecek, and Daniel Whitenack. 2024. GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages. In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 10–24, St. Julian's, Malta. Association for Computational Linguistics.
Cite (Informal):
GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages (Janetzki et al., SIGTYP-WS 2024)
PDF:
https://preview.aclanthology.org/naacl24-info/2024.sigtyp-1.2.pdf