Sample Selection for Large-scale MT Discriminative Training

Yuan Cao, Sanjeev Khudanpur


Abstract
Discriminative training for MT usually involves numerous features and requires a large-scale training set to reach reliable parameter estimates. As an alternative to expensive human-labeled parallel corpora, semi-supervised methods have been proposed that generate huge amounts of “hallucinated” data, which relieves the data sparsity problem. However, such a large training set contains both good samples that are suitable for training and bad ones that are harmful to it. How training samples are selected from this vast amount of data can greatly affect training performance. In this paper we propose a method for selecting the samples that are most suitable for discriminative training according to a criterion measuring dataset quality. Our experimental results show that by adding samples to the training set selectively, we are able to exceed the performance of a system trained with the same number of samples selected randomly.
Anthology ID:
2012.amta-papers.3
Volume:
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers
Month:
October 28-November 1
Year:
2012
Address:
San Diego, California, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
URL:
https://aclanthology.org/2012.amta-papers.3
Cite (ACL):
Yuan Cao and Sanjeev Khudanpur. 2012. Sample Selection for Large-scale MT Discriminative Training. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers, San Diego, California, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Sample Selection for Large-scale MT Discriminative Training (Cao & Khudanpur, AMTA 2012)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2012.amta-papers.3.pdf