Abstract
Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered/corrected corpora actually improves MT performance.
- Anthology ID:
- D18-1328
- Volume:
- Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
- Month:
- October-November
- Year:
- 2018
- Address:
- Brussels, Belgium
- Editors:
- Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2967–2973
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/D18-1328/
- DOI:
- 10.18653/v1/D18-1328
- Cite (ACL):
- MinhQuang Pham, Josep Crego, Jean Senellart, and François Yvon. 2018. Fixing Translation Divergences in Parallel Corpora for Neural MT. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2967–2973, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- Fixing Translation Divergences in Parallel Corpora for Neural MT (Pham et al., EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/D18-1328.pdf
- Code:
- jmcrego/similarity
- Data:
- OpenSubtitles
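The filtering step described in the abstract (score each sentence pair for cross-lingual similarity, then drop pairs that score too low) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `filter_divergent` and `toy_length_ratio_score` and the threshold value are hypothetical, and the toy scorer is only a stand-in for the neural similarity model the paper relies on.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]


def filter_divergent(
    pairs: Iterable[SentencePair],
    score: Callable[[str, str], float],
    threshold: float,
) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Split pairs into (kept, dropped) according to a similarity threshold.

    `score` is any function returning a higher value for more similar pairs.
    In the paper this role is played by a neural network; here it is a
    caller-supplied placeholder (an assumption of this sketch).
    """
    kept: List[SentencePair] = []
    dropped: List[SentencePair] = []
    for src, tgt in pairs:
        if score(src, tgt) >= threshold:
            kept.append((src, tgt))
        else:
            dropped.append((src, tgt))
    return kept, dropped


def toy_length_ratio_score(src: str, tgt: str) -> float:
    """Hypothetical stand-in scorer: ratio of token counts, in [0, 1].

    This is NOT a cross-lingual similarity model; it only makes the
    example runnable without a trained network.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if max(n_src, n_tgt) == 0:
        return 0.0
    return min(n_src, n_tgt) / max(n_src, n_tgt)


if __name__ == "__main__":
    corpus = [
        ("the cat sleeps", "le chat dort"),
        ("hello", "bonjour à tous mes chers amis"),
    ]
    kept, dropped = filter_divergent(corpus, toy_length_ratio_score, threshold=0.5)
    print("kept:", kept)
    print("dropped:", dropped)
```

In practice the scorer would be replaced by a trained similarity model, and the threshold would be tuned on held-out data rather than fixed by hand.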