Assessing the Role of Data Quality in Training Bilingual Language Models

Skyler Seto, Maartje Ter Hoeve, Maureen de Seyssel, David Grangier


Abstract
Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies widely between languages: prior work shows that adding more languages can degrade performance for some languages (such as English) while improving it for others (typically more data-constrained languages). In this work, we investigate the causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy that selects higher-quality bilingual training data using only high-quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2–4% and reduces bilingual model performance gaps to 1%. These results highlight the overlooked importance of data quality in multilingual pretraining and offer a practical recipe for balancing performance.
Anthology ID:
2025.findings-emnlp.1236
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
22694–22720
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1236/
DOI:
10.18653/v1/2025.findings-emnlp.1236
Cite (ACL):
Skyler Seto, Maartje Ter Hoeve, Maureen de Seyssel, and David Grangier. 2025. Assessing the Role of Data Quality in Training Bilingual Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22694–22720, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Assessing the Role of Data Quality in Training Bilingual Language Models (Seto et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1236.pdf
Checklist:
 2025.findings-emnlp.1236.checklist.pdf