Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

Mohammad Amin Ghanizadeh; Mohammad Javad Dousti

Abstract

Data quality and effective data selection are fundamental to building robust, reliable machine translation models. This paper presents a data selection methodology for fine-tuning machine translation systems that leverages a learner model together with a pre-trained reference model to improve training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, so that only the most relevant and impactful examples contribute to fine-tuning. Furthermore, our method employs a batch selection strategy that considers interdependencies among data points, improving training efficiency while maintaining a focus on data relevance. Experiments on English ↔ Persian and several other language pairs, using an mBART model fine-tuned on the CCMatrix dataset, demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an i.i.d. baseline. Experimental results also indicate that our approach improves computational efficiency by 24 when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to random selection.
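
The abstract does not define the learnability score, so the following is only an illustrative sketch under a stated assumption: that an example's score is the learner model's loss minus the reference model's loss, a formulation common in the data selection literature. The function names and the toy loss values are hypothetical and are not taken from the paper. The sketch also ranks examples independently, which is simpler than the joint batch selection the abstract describes.

```python
from typing import List, Sequence


def learnability_scores(
    learner_losses: Sequence[float],
    reference_losses: Sequence[float],
) -> List[float]:
    """Assumed score: learner loss minus reference loss, per example."""
    if len(learner_losses) != len(reference_losses):
        raise ValueError("loss sequences must have the same length")
    return [l - r for l, r in zip(learner_losses, reference_losses)]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring examples, best first."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy per-example losses (made-up numbers, for illustration only).
learner = [2.0, 0.5, 3.0, 1.0]
reference = [0.5, 0.4, 2.9, 0.2]

scores = learnability_scores(learner, reference)
chosen = select_top_k(scores, k=2)
print(chosen)
```

Under this assumption, an example scores highly when the learner still finds it hard but the reference model finds it easy, which is one way to favor data that is both unlearned and learnable.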

Anthology ID: 2025.emnlp-main.1352
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 26616–26624
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1352/
Cite (ACL): Mohammad Amin Ghanizadeh and Mohammad Javad Dousti. 2025. Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26616–26624, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning (Ghanizadeh & Dousti, EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1352.pdf
Checklist: 2025.emnlp-main.1352.checklist.pdf
