Can we teach language models to gloss endangered languages?

Michael Ginn, Mans Hulden, Alexis Palmer


Abstract
Interlinear glossed text (IGT) is a popular format in language documentation projects, where each morpheme is labeled with a descriptive annotation. Automating the creation of interlinear glossed text would be desirable to reduce annotator effort and maintain consistency across annotated corpora. Prior research has explored a number of statistical and neural methods for automatically producing IGT. As large language models (LLMs) have shown promising results across multilingual tasks, even for rare, endangered languages, it is natural to wonder whether they can be utilized for the task of generating IGT. We explore whether LLMs can be effective at the task of interlinear glossing with in-context learning, without any traditional training. We propose new approaches for selecting examples to provide in context, observing that targeted selection can significantly improve performance. We find that LLM-based methods beat standard transformer baselines, despite requiring no training at all. These approaches still underperform state-of-the-art supervised systems for the task, but are highly practical for researchers outside of the NLP community, requiring minimal effort to use.
Anthology ID:
2024.findings-emnlp.337
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5861–5876
URL:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.337/
DOI:
10.18653/v1/2024.findings-emnlp.337
Cite (ACL):
Michael Ginn, Mans Hulden, and Alexis Palmer. 2024. Can we teach language models to gloss endangered languages? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5861–5876, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Can we teach language models to gloss endangered languages? (Ginn et al., Findings 2024)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.337.pdf