A General Framework to Weight Heterogeneous Parallel Data for Model Adaptation in Statistical MT

Kashif Shah, Loïc Barrault, Holger Schwenk


Abstract
The standard procedure for training the translation model of a phrase-based SMT system is to concatenate all available parallel data, perform word alignment, extract phrase pairs, and calculate translation probabilities by simple relative frequency. However, in many practical applications the parallel data is quite inhomogeneous with respect to factors such as data source, alignment quality, and appropriateness to the task. We propose a general framework that takes these factors into account when calculating the phrase table, e.g., by better distributing the probability mass of the individual phrase pairs. No additional feature functions are needed. We report results on two well-known tasks, the IWSLT’11 and WMT’11 evaluations, in both cases translating from English to French. We give detailed results for different functions to weight the bitexts. Our best systems improve a strong baseline by up to one BLEU point without any impact on the computational complexity during training or decoding.
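As an illustration of the idea of redistributing probability mass, the following is a minimal sketch of a relative-frequency estimate in which each bitext's phrase-pair counts are scaled by a corpus weight before normalization. It is not taken from the paper: the function name `weighted_phrase_probs`, the data layout, the toy counts, and the choice of a single scalar weight per corpus are all assumptions made for this example, and the paper's actual weighting functions may differ.

```python
from collections import defaultdict


def weighted_phrase_probs(counts_by_corpus, weights):
    """Estimate P(target | source) from per-corpus phrase-pair counts.

    Hypothetical helper, not the authors' implementation.
    counts_by_corpus: {corpus_name: {(source, target): count}}
    weights:          {corpus_name: non-negative weight}
    """
    pair_mass = defaultdict(float)    # weighted count of each (source, target)
    source_mass = defaultdict(float)  # weighted count of each source phrase

    for corpus, counts in counts_by_corpus.items():
        w = weights[corpus]
        for (src, tgt), c in counts.items():
            pair_mass[(src, tgt)] += w * c
            source_mass[src] += w * c

    # Normalize per source phrase; skip sources whose total mass is zero.
    return {
        (src, tgt): mass / source_mass[src]
        for (src, tgt), mass in pair_mass.items()
        if source_mass[src] > 0
    }


# Toy counts (invented for illustration only).
counts = {
    "in_domain": {("house", "maison"): 3, ("house", "chambre"): 1},
    "out_of_domain": {("house", "maison"): 2, ("house", "chambre"): 6},
}

# Uniform weights reproduce plain relative frequency on the concatenation.
uniform = weighted_phrase_probs(counts, {"in_domain": 1.0, "out_of_domain": 1.0})

# Up-weighting the in-domain bitext shifts mass toward its preferred translation.
adapted = weighted_phrase_probs(counts, {"in_domain": 3.0, "out_of_domain": 1.0})
```

With uniform weights, this reduces to relative frequency over the concatenated data; with the in-domain corpus weighted more heavily, the estimate for ("house", "maison") rises from 5/12 to 11/20. Because the weights only change how the table is estimated, a sketch like this adds no new feature functions to the decoder.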
Anthology ID:
2012.amta-papers.21
Volume:
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers
Month:
October 28-November 1
Year:
2012
Address:
San Diego, California, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
URL:
https://preview.aclanthology.org/icon-24-ingestion/2012.amta-papers.21/
Cite (ACL):
Kashif Shah, Loïc Barrault, and Holger Schwenk. 2012. A General Framework to Weight Heterogeneous Parallel Data for Model Adaptation in Statistical MT. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers, San Diego, California, USA. Association for Machine Translation in the Americas.
Cite (Informal):
A General Framework to Weight Heterogeneous Parallel Data for Model Adaptation in Statistical MT (Shah et al., AMTA 2012)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2012.amta-papers.21.pdf