ParaCLEAN: Improving Translation Quality through Systematic Parallel Data Cleaning

Audrey Mash, Ella Paulina Bohman, Maite Melero


Abstract
Parallel corpora often contain significant noise, particularly in low-resource settings where both collected and synthetic data are combined. We present ParaCLEAN, a modular pipeline for cleaning parallel data that integrates embedding-based filtering, language identification, deduplication, and normalisation. Experiments on Catalan–Japanese translation demonstrate that ParaCLEAN improves data quality and downstream MT performance. Ablation studies highlight the contribution of each step. ParaCLEAN is lightweight, reproducible, and extensible to diverse language pairs.
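The abstract describes a modular pipeline of cleaning steps (normalisation, deduplication, and rule-based filters). A minimal sketch of such a design, with hypothetical filter names and thresholds that are illustrative assumptions rather than the authors' implementation, could look like this:

```python
# Hypothetical sketch of a modular parallel-data cleaning pipeline in the
# spirit of the abstract (normalisation, deduplication, filtering).
# Function names and thresholds are illustrative assumptions only.
import unicodedata

def normalise(pair):
    """Unicode-normalise and trim both sides of a sentence pair."""
    src, tgt = pair
    return (unicodedata.normalize("NFKC", src).strip(),
            unicodedata.normalize("NFKC", tgt).strip())

def keep_nonempty(pair):
    return all(pair)

def keep_length_ratio(pair, max_ratio=3.0):
    # Crude character-length ratio check; a rough heuristic across
    # scripts as different as Latin and Japanese.
    longer = max(len(pair[0]), len(pair[1]))
    shorter = min(len(pair[0]), len(pair[1]))
    return shorter > 0 and longer / shorter <= max_ratio

def keep_not_identical(pair):
    # Drop pairs where the source was copied verbatim into the target.
    return pair[0] != pair[1]

def clean(pairs, filters=(keep_nonempty, keep_not_identical, keep_length_ratio)):
    seen = set()
    for pair in map(normalise, pairs):
        if pair in seen:          # exact deduplication
            continue
        seen.add(pair)
        if all(f(pair) for f in filters):
            yield pair

raw = [
    ("Bon dia", "おはようございます"),
    ("Bon dia", "おはようございます"),   # duplicate
    ("Hola", "Hola"),                     # untranslated copy
    ("", "空"),                           # empty source
]
print(list(clean(raw)))  # → [('Bon dia', 'おはようございます')]
```

Each filter is an independent predicate, so steps can be added, removed, or reordered for ablation studies without touching the rest of the pipeline; embedding-based filtering or language identification would slot in as further predicates.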
Anthology ID:
2026.lrec-main.527
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
6630–6640
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.527/
Cite (ACL):
Audrey Mash, Ella Paulina Bohman, and Maite Melero. 2026. ParaCLEAN: Improving Translation Quality through Systematic Parallel Data Cleaning. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 6630–6640, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
ParaCLEAN: Improving Translation Quality through Systematic Parallel Data Cleaning (Mash et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.527.pdf