LLMs for Extremely Low-Resource Finno-Ugric Languages

Taido Purason, Hele-Andra Kuulmets, Mark Fishel
Abstract
The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.
Anthology ID:
2025.findings-naacl.373
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6677–6697
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.373/
Cite (ACL):
Taido Purason, Hele-Andra Kuulmets, and Mark Fishel. 2025. LLMs for Extremely Low-Resource Finno-Ugric Languages. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6677–6697, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
LLMs for Extremely Low-Resource Finno-Ugric Languages (Purason et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.373.pdf