GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages
Jonathan Janetzki, Gerard De Melo, Joshua Nemecek, Daniel Whitenack
Abstract
Over 7,000 of the world’s 7,168 living languages are still low-resourced. This paper aims to narrow the language documentation gap by creating multiparallel dictionaries, clustered by SIL’s semantic domains. This task is new for machine learning and has previously been done manually by native speakers. We propose GUIDE, a language-agnostic tool that uses a GNN to create and populate semantic domain dictionaries, using seed dictionaries and Bible translations as a parallel text corpus. Our work sets a new benchmark, achieving an exemplary average precision of 60% in eight zero-shot evaluation languages and predicting an average of 2,400 dictionary entries. We share the code, model, multilingual evaluation data, and new dictionaries with the research community: https://github.com/janetzki/GUIDE
- Anthology ID:
- 2024.sigtyp-1.2
- Volume:
- Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
- Month:
- March
- Year:
- 2024
- Address:
- St. Julian's, Malta
- Editors:
- Michael Hahn, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Yulia Otmakhova, Jinrui Yang, Oleg Serikov, Priya Rani, Edoardo M. Ponti, Saliha Muradoğlu, Rena Gao, Ryan Cotterell, Ekaterina Vylomova
- Venues:
- SIGTYP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 10–24
- URL:
- https://aclanthology.org/2024.sigtyp-1.2
- Cite (ACL):
- Jonathan Janetzki, Gerard De Melo, Joshua Nemecek, and Daniel Whitenack. 2024. GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages. In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 10–24, St. Julian's, Malta. Association for Computational Linguistics.
- Cite (Informal):
- GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages (Janetzki et al., SIGTYP-WS 2024)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2024.sigtyp-1.2.pdf