Combining Automated and Manual Data for Effective Downstream Fine-Tuning of Transformers for Low-Resource Language Applications

Ulyana Isaeva, Danil Astafurov, Nikita Martynov


Abstract
This paper addresses the constraints on downstream applications of pre-trained language models (PLMs) for low-resource languages. These constraints are a shortage of pre-training data, which prevents a low-resource language from being well represented in a PLM, and the inaccessibility of high-quality task-specific annotation, which limits task learning. We propose combining automatically labeled texts with manually annotated data in a two-stage task fine-tuning approach. Our experiments show that this methodology, combined with vocabulary adaptation, can compensate for the absence of a targeted PLM or for the scarcity of manually annotated data. The methodology is validated on the morphological tagging task for the Udmurt language. We publish our best model, which achieves 93.25% token accuracy, on the HuggingFace Hub along with the training code.
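The two-stage recipe described in the abstract (fine-tune first on a large automatically tagged corpus, then on a small manually annotated set) can be illustrated with a rough HuggingFace Transformers sketch for token classification. This is not the authors' published training code: the base checkpoint, dataset files, and field names ("tokens", "tags") are assumptions for illustration, and the vocabulary-adaptation step is omitted.

# Minimal sketch of two-stage fine-tuning for morphological tagging.
# Stage 1 uses automatically labeled text; stage 2 continues on manual annotation.
# Checkpoint name and data paths below are placeholders, not the paper's artifacts.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)
from datasets import load_dataset

base_checkpoint = "xlm-roberta-base"  # assumed multilingual PLM
auto_labeled = load_dataset("json", data_files="udmurt_auto_tagged.jsonl")["train"]      # hypothetical file
manual_labeled = load_dataset("json", data_files="udmurt_manual_tagged.jsonl")["train"]  # hypothetical file

# Build the tag inventory from both datasets.
label_list = sorted({t for ds in (auto_labeled, manual_labeled) for ex in ds for t in ex["tags"]})
label2id = {l: i for i, l in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    base_checkpoint, num_labels=len(label_list),
    id2label={i: l for l, i in label2id.items()}, label2id=label2id)

def encode(example):
    # Align word-level morphological tags with subword tokens; label only the first subword.
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, max_length=256)
    labels, prev = [], None
    for wid in enc.word_ids():
        if wid is None or wid == prev:
            labels.append(-100)  # ignored by the loss
        else:
            labels.append(label2id[example["tags"][wid]])
        prev = wid
    enc["labels"] = labels
    return enc

collator = DataCollatorForTokenClassification(tokenizer)

def run_stage(dataset, output_dir, epochs):
    # One fine-tuning stage; the same model object is reused, so stage 2 continues from stage 1.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=epochs,
                               per_device_train_batch_size=32, learning_rate=3e-5),
        train_dataset=dataset.map(encode, remove_columns=dataset.column_names),
        data_collator=collator,
    )
    trainer.train()

run_stage(auto_labeled, "stage1_auto", epochs=3)     # stage 1: large, automatically tagged corpus
run_stage(manual_labeled, "stage2_manual", epochs=10)  # stage 2: small, manually annotated set

The hyperparameters (epochs, batch size, learning rate) are illustrative defaults, not values reported in the paper.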
Anthology ID:
2025.xllm-1.9
Volume:
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Hao Fei, Kewei Tu, Yuhui Zhang, Xiang Hu, Wenjuan Han, Zixia Jia, Zilong Zheng, Yixin Cao, Meishan Zhang, Wei Lu, N. Siddharth, Lilja Øvrelid, Nianwen Xue, Yue Zhang
Venues:
XLLM | WS
Publisher:
Association for Computational Linguistics
Pages:
86–90
URL:
https://preview.aclanthology.org/landing_page/2025.xllm-1.9/
Cite (ACL):
Ulyana Isaeva, Danil Astafurov, and Nikita Martynov. 2025. Combining Automated and Manual Data for Effective Downstream Fine-Tuning of Transformers for Low-Resource Language Applications. In Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025), pages 86–90, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Combining Automated and Manual Data for Effective Downstream Fine-Tuning of Transformers for Low-Resource Language Applications (Isaeva et al., XLLM 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.xllm-1.9.pdf