Abstract
In this paper, we address the data scarcity problem in automatic data-driven glossing for low-resource languages by coordinating multiple sources of linguistic expertise. We enhance our models by incorporating token-level and sentence-level translations, leveraging the extensive linguistic capabilities of modern LLMs, and drawing on available dictionary resources. These enhancements lead to an average absolute improvement of 5 percentage points in word-level accuracy over the previous state of the art on a typologically diverse dataset spanning six low-resource languages. The gains are particularly pronounced for the lowest-resourced language, Gitksan, where we achieve a 10-percentage-point improvement. Furthermore, in a simulated ultra-low-resource setting for the same six languages, training on fewer than 100 glossed sentences, we obtain an average 10-percentage-point improvement in word-level accuracy over the previous state-of-the-art system.
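As a rough illustration of the multi-source idea sketched in the abstract (not the authors' actual architecture), the snippet below shows one plausible way a glossing model's input could be augmented with a sentence-level translation and token-level dictionary lookups; the function name, tag format, and toy data are all hypothetical.

```python
# Hypothetical sketch: combine the source sentence, a sentence-level
# translation, and token-level dictionary lookups into one tagged input
# string for a glossing model. Names and format are illustrative only,
# not the system described in the paper.

from typing import Dict, List


def build_augmented_input(
    source_tokens: List[str],
    sentence_translation: str,
    dictionary: Dict[str, str],
) -> str:
    """Concatenate the three knowledge sources into a single input string."""
    # Token-level translations: fall back to an <unk> placeholder when the
    # dictionary has no entry for a surface form.
    token_glosses = [dictionary.get(tok.lower(), "<unk>") for tok in source_tokens]

    return " ".join(
        [
            "<src>", *source_tokens,
            "<tok_trans>", *token_glosses,
            "<sent_trans>", sentence_translation,
        ]
    )


if __name__ == "__main__":
    # Toy placeholder data (invented, not from the paper's dataset).
    tokens = ["wordA", "wordB", "wordC"]
    translation = "a short example sentence"
    lexicon = {"wordb": "small", "wordc": "woman"}
    print(build_augmented_input(tokens, translation, lexicon))
```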
- Anthology ID: 2024.emnlp-main.261
- Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 4537–4552
- URL: https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.261/
- DOI: 10.18653/v1/2024.emnlp-main.261
- Cite (ACL): Changbing Yang, Garrett Nicolai, and Miikka Silfverberg. 2024. Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4537–4552, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing (Yang et al., EMNLP 2024)
- PDF: https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.261.pdf