Abstract
In this paper, we address the data scarcity problem in automatic data-driven glossing for low-resource languages by coordinating multiple sources of linguistic expertise. We enhance our models by incorporating token-level and sentence-level translations, leveraging the extensive linguistic capabilities of modern LLMs, and drawing on available dictionary resources. These enhancements lead to an average absolute improvement of 5 percentage points in word-level accuracy over the previous state of the art on a typologically diverse dataset spanning six low-resource languages. The gains are particularly pronounced for the lowest-resourced language, Gitksan, where we achieve a 10-percentage-point improvement. Furthermore, in a simulated ultra-low-resource setting for the same six languages, training on fewer than 100 glossed sentences, we obtain an average 10-percentage-point improvement in word-level accuracy over the previous state-of-the-art system.
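As a rough illustration of the multi-source idea sketched in the abstract (not the authors' actual architecture), the snippet below shows one plausible way a glossing model's input could be augmented with a sentence-level translation and token-level dictionary lookups; the function name, tag format, and toy data are all hypothetical.

```python
# Hypothetical sketch: combine the source sentence, a sentence-level
# translation, and token-level dictionary lookups into one tagged input
# string for a glossing model. Names and format are illustrative only,
# not the system described in the paper.

from typing import Dict, List


def build_augmented_input(
    source_tokens: List[str],
    sentence_translation: str,
    dictionary: Dict[str, str],
) -> str:
    """Concatenate the three knowledge sources into a single input string."""
    # Token-level translations: fall back to an <unk> placeholder when the
    # dictionary has no entry for a surface form.
    token_glosses = [dictionary.get(tok.lower(), "<unk>") for tok in source_tokens]

    return " ".join(
        [
            "<src>", *source_tokens,
            "<tok_trans>", *token_glosses,
            "<sent_trans>", sentence_translation,
        ]
    )


if __name__ == "__main__":
    # Toy placeholder data (invented, not from the paper's dataset).
    tokens = ["wordA", "wordB", "wordC"]
    translation = "a short example sentence"
    lexicon = {"wordb": "small", "wordc": "woman"}
    print(build_augmented_input(tokens, translation, lexicon))
```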
- Anthology ID: 2024.emnlp-main.261
- Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 4537–4552
- URL: https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.261/
- DOI: 10.18653/v1/2024.emnlp-main.261
- Cite (ACL): Changbing Yang, Garrett Nicolai, and Miikka Silfverberg. 2024. Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4537–4552, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing (Yang et al., EMNLP 2024)
- PDF: https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.261.pdf