Abstract
In this paper, we investigate large-scale lightly-supervised training with a pivot language: we augment a baseline statistical machine translation (SMT) system that has been trained on human-generated parallel corpora with large amounts of additional unsupervised parallel data. Instead of creating this synthetic data from monolingual source language data with the baseline system itself, or from target language data with a reverse system, we employ a parallel corpus of target language data and data in a pivot language. The pivot language data is automatically translated into the source language, resulting in a trilingual corpus with an unsupervised source language side. We augment our baseline system with the unsupervised source-target parallel data. Experiments are conducted for the German-French language pair using the standard WMT newstest sets for development and testing. We obtain the unsupervised data by translating the English side of the English-French 10⁹ corpus into German. With careful system design, we are able to achieve improvements of up to +0.4 points BLEU / -0.7 points TER over the baseline.
- Anthology ID:
- 2012.amta-papers.8
- Volume:
- Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers
- Month:
- October 28-November 1
- Year:
- 2012
- Address:
- San Diego, California, USA
- Venue:
- AMTA
- Publisher:
- Association for Machine Translation in the Americas
- URL:
- https://aclanthology.org/2012.amta-papers.8
- Cite (ACL):
- Matthias Huck and Hermann Ney. 2012. Pivot Lightly-Supervised Training for Statistical Machine Translation. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers, San Diego, California, USA. Association for Machine Translation in the Americas.
- Cite (Informal):
- Pivot Lightly-Supervised Training for Statistical Machine Translation (Huck & Ney, AMTA 2012)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2012.amta-papers.8.pdf