Bilingual Methods for Adaptive Training Data Selection for Machine Translation
Boxing Chen, Roland Kuhn, George Foster, Colin Cherry, Fei Huang
Abstract
In this paper, we propose a new data selection method which uses semi-supervised convolutional neural networks based on bitokens (Bi-SSCNNs) for training machine translation systems from a large bilingual corpus. In earlier work, we devised a data selection method based on semi-supervised convolutional neural networks (SSCNNs). The new method, Bi-SSCNN, is based on bitokens, which use bilingual information. When the new methods are tested on two translation tasks (Chinese-to-English and Arabic-to-English), they significantly outperform the other three data selection methods in the experiments. We also show that the Bi-SSCNN method is much more effective than other methods in preventing noisy sentence pairs from being chosen for training. More interestingly, this method only needs a tiny amount of in-domain data to train the selection model, which makes fine-grained topic-dependent translation adaptation possible. In the follow-up experiments, we find that neural machine translation (NMT) is more sensitive to noisy data than statistical machine translation (SMT). Therefore, Bi-SSCNN, which can effectively screen out noisy sentence pairs, can benefit NMT much more than SMT. We observed a BLEU improvement of over 3 points on an English-to-French WMT task when Bi-SSCNNs were used.
- Anthology ID:
- 2016.amta-researchers.8
- Volume:
- Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track
- Month:
- October 28 - November 1
- Year:
- 2016
- Address:
- Austin, TX, USA
- Venue:
- AMTA
- Publisher:
- The Association for Machine Translation in the Americas
- Pages:
- 93–106
- URL:
- https://aclanthology.org/2016.amta-researchers.8
- Cite (ACL):
- Boxing Chen, Roland Kuhn, George Foster, Colin Cherry, and Fei Huang. 2016. Bilingual Methods for Adaptive Training Data Selection for Machine Translation. In Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track, pages 93–106, Austin, TX, USA. The Association for Machine Translation in the Americas.
- Cite (Informal):
- Bilingual Methods for Adaptive Training Data Selection for Machine Translation (Chen et al., AMTA 2016)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2016.amta-researchers.8.pdf