Ella Paulina Bohman

2026

ParaCLEAN: Improving Translation Quality through Systematic Parallel Data Cleaning
Audrey Mash | Ella Paulina Bohman | Maite Melero
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Parallel corpora often contain significant noise, particularly in low-resource settings where collected and synthetic data are combined. We present ParaCLEAN, a modular pipeline for cleaning parallel data that integrates embedding-based filtering, language identification, deduplication, and normalisation. Experiments on Catalan-to-Japanese translation demonstrate that ParaCLEAN improves data quality and downstream MT performance. Ablation studies highlight the contribution of each step. ParaCLEAN is lightweight, reproducible, and extensible to diverse language pairs.

Co-authors

Audrey Mash 1
Maite Melero 1

Venues

LREC 1
