How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation

Yan Meng; Di Wu; Christof Monz

Abstract

Web-mined parallel corpora are large but often contain substantial noise. Semantic misalignment, the primary source of this noise, poses a challenge for training machine translation systems. In this paper, we first introduce a process for simulating misalignment controlled by semantic similarity, producing noise that closely resembles misaligned sentences in real-world web-crawled corpora. Under these simulated noise settings, we quantitatively analyze the impact of misalignment on machine translation and demonstrate the limited effectiveness of widely used pre-filters for noise detection. This underscores the need for more fine-grained ways to handle hard-to-detect misalignment noise. By analyzing the reliability of the model’s self-knowledge for distinguishing misaligned and clean data at the token level, we propose self-correction, an approach that gradually increases trust in the model’s self-knowledge to correct the supervision signal during training. Comprehensive experiments show that our method significantly improves translation performance across a range of translation tasks, both in the presence of simulated misalignment noise and when applied to real-world, noisy web-mined datasets.
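The abstract does not spell out how the supervision signal is corrected, so the following is only an illustrative sketch of the general idea, not the paper's formulation. It assumes a linear trust schedule and a simple interpolation between the reference one-hot target and the model's own predicted distribution at a single token position. The names `trust_weight`, `corrected_target`, and `max_trust` are hypothetical.

```python
import math


def trust_weight(step, total_steps, max_trust=0.5):
    """Assumed schedule: ramp trust in the model linearly from 0 to max_trust."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max_trust * frac


def corrected_target(one_hot, model_probs, trust):
    """Mix the reference one-hot target with the model's own distribution.

    With trust = 0 this is the original target; larger trust shifts
    probability mass toward what the model itself predicts.
    """
    return [(1.0 - trust) * y + trust * p for y, p in zip(one_hot, model_probs)]


def cross_entropy(target, model_probs, eps=1e-12):
    """Cross-entropy between a (soft) target and the model's distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, model_probs))


# Toy token position over a 3-word vocabulary: the reference says word 1,
# but the model prefers word 0 (as it might for a misaligned target token).
one_hot = [0.0, 1.0, 0.0]
model_probs = [0.7, 0.2, 0.1]

for step in (0, 500, 1000):
    lam = trust_weight(step, total_steps=1000)
    target = corrected_target(one_hot, model_probs, lam)
    loss = cross_entropy(target, model_probs)
    print(step, round(lam, 2), [round(t, 3) for t in target], round(loss, 3))
```

In this toy setting, the loss contribution of the possibly misaligned reference token shrinks as trust grows, which is the intuition behind gradually relying more on the model's self-knowledge later in training.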

Anthology ID: 2025.findings-naacl.416
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7451–7467
URL: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.416/
Cite (ACL): Yan Meng, Di Wu, and Christof Monz. 2025. How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7451–7467, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation (Meng et al., Findings 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.416.pdf
