A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Peiqin Lin, André Martins, Hinrich Schütze


Abstract
Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus with just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.
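To make the abstract's recipe concrete, below is a minimal sketch (not the authors' code) of its two central steps: filtering noisy sentence pairs by an estimated translation-quality score, then formatting the survivors as machine-translation training examples for continued pretraining. The quality field, the 0.7 threshold, and the prompt template are illustrative assumptions, not values from the paper.

# Minimal sketch of the recipe: quality filtering + MT-objective formatting.
# All names, thresholds, and templates here are hypothetical.
from dataclasses import dataclass

@dataclass
class SentencePair:
    src: str          # source-language sentence
    tgt: str          # target-language sentence
    quality: float    # e.g., a COMET-style quality estimate in [0, 1]

def filter_noisy(pairs: list[SentencePair], threshold: float = 0.7) -> list[SentencePair]:
    """Drop pairs whose estimated translation quality falls below the threshold."""
    return [p for p in pairs if p.quality >= threshold]

def to_mt_example(pair: SentencePair, src_lang: str, tgt_lang: str) -> str:
    """Render one pair as a machine-translation training prompt (hypothetical template)."""
    return f"Translate from {src_lang} to {tgt_lang}:\n{pair.src}\n{pair.tgt}"

if __name__ == "__main__":
    corpus = [
        SentencePair("Guten Morgen.", "Good morning.", quality=0.92),
        SentencePair("Guten Morgen.", "The stock market fell.", quality=0.11),  # misaligned pair
    ]
    # Per the abstract, even ~10K clean pairs can suffice; here we just filter and format.
    for p in filter_noisy(corpus):
        print(to_mt_example(p, "German", "English"))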
Anthology ID:
2025.findings-naacl.225
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4038–4050
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.225/
Cite (ACL):
Peiqin Lin, André Martins, and Hinrich Schütze. 2025. A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4038–4050, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models (Lin et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.225.pdf