Improving the Quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

Surangika Ranathunga, Aloka Fernando, Menan Velayuthan, Charitha Rathnayaka, Nisansa de Silva


Abstract
Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. The most common PDC technique is to rank sentence pairs by similarity scores computed on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs). However, previous research has shown that the choice of multiPLM significantly impacts the quality of the filtered parallel corpus, and that Neural Machine Translation (NMT) models trained on such data show a disparity across multiPLMs. This paper shows that this disparity arises because different multiPLMs are biased towards certain types of sentence pairs, which are treated as noise from an NMT point of view. We show that such noisy parallel sentences can be removed to a certain extent by employing a series of heuristics. NMT models trained on the curated corpus produce better results while minimizing the disparity across multiPLMs. We publicly release the source code and the curated datasets.
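The embedding-based ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: it assumes cosine similarity as the similarity score, uses small hand-written vectors in place of real multiPLM sentence embeddings, and the names `cosine_similarity` and `rank_pairs` are hypothetical. The debiasing heuristics themselves are not specified in the abstract and are not shown here.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def rank_pairs(
    src_embs: Sequence[Sequence[float]],
    tgt_embs: Sequence[Sequence[float]],
    top_k: int,
) -> List[int]:
    """Return indices of the top_k sentence pairs, highest similarity first."""
    scored = [
        (cosine_similarity(s, t), i)
        for i, (s, t) in enumerate(zip(src_embs, tgt_embs))
    ]
    # Sort by score only, so ties keep their original corpus order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [i for _, i in scored[:top_k]]


# Toy vectors standing in for source/target sentence embeddings.
src = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
tgt = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
kept = rank_pairs(src, tgt, top_k=2)
```

In this toy setup the second pair has orthogonal embeddings and is filtered out. In practice the embeddings would come from a multiPLM, and the abstract's point is that which pairs score highly depends on which model is used.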
Anthology ID:
2025.emnlp-main.1435
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
28252–28269
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1435/
Cite (ACL):
Surangika Ranathunga, Aloka Fernando, Menan Velayuthan, Charitha Rathnayaka, and Nisansa de Silva. 2025. Improving the Quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28252–28269, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Improving the Quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics (Ranathunga et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1435.pdf
Checklist:
 2025.emnlp-main.1435.checklist.pdf