Expanding machine translation training data with an out-of-domain corpus using language modeling based vocabulary saturation

Burak Aydın, Arzucan Özgür


Abstract
The training data size is of utmost importance for statistical machine translation (SMT), since it affects the training time, model size, and decoding speed, as well as the system’s overall success. One of the challenges in developing SMT systems for languages with fewer resources is the limited size of the available training data. In this paper, we propose an approach for expanding the training data by including parallel texts from an out-of-domain corpus. Selecting the best out-of-domain sentences for inclusion in the training set is important for the overall performance of the system. Our method first ranks the out-of-domain sentences using a language modeling approach and then adds the sentences to the training set using the vocabulary saturation filter technique. We evaluated our approach on the English-Turkish language pair and obtained promising results, with performance improvements of up to +0.8 BLEU points for the English-Turkish translation system. We also compared our results with translation model combination approaches and reported the improvements. Moreover, we implemented our system with dependency parse tree based language modeling in addition to n-gram based language modeling and reported comparable results.
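The two-step selection described in the abstract (rank by a language model, then apply a vocabulary saturation filter) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the add-one-smoothed unigram model stands in for the n-gram and dependency parse tree based language models used in the paper, the filter is applied to source-side unigrams only, and all function names and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def train_unigram_lm(in_domain_sentences):
    """Build an add-one-smoothed unigram model from in-domain text.

    Assumption: a unigram model is used only to keep the sketch short.
    """
    counts = Counter(tok for s in in_domain_sentences for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab_size))

    return logprob


def perplexity(sentence, logprob):
    """Per-token perplexity of a sentence under the model."""
    toks = sentence.split()
    if not toks:
        return float("inf")
    return math.exp(-sum(logprob(t) for t in toks) / len(toks))


def rank_by_lm(pairs, logprob):
    """Sort (source, target) pairs so the most in-domain-like come first."""
    return sorted(pairs, key=lambda p: perplexity(p[0], logprob))


def vocabulary_saturation_filter(ranked_pairs, threshold=1):
    """Keep a pair only if its source side has a token seen fewer than
    `threshold` times among the pairs kept so far (hypothetical parameter)."""
    seen = Counter()
    kept = []
    for src, tgt in ranked_pairs:
        toks = src.split()
        if any(seen[t] < threshold for t in toks):
            kept.append((src, tgt))
            seen.update(toks)
    return kept
```

Because the filter processes sentences in ranked order, the sentences closest to the in-domain data get the first chance to contribute each vocabulary item, and later sentences are kept only if they add something not yet saturated.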
Anthology ID:
2014.amta-researchers.14
Volume:
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
Month:
October 22-26
Year:
2014
Address:
Vancouver, Canada
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
180–192
URL:
https://aclanthology.org/2014.amta-researchers.14
Cite (ACL):
Burak Aydın and Arzucan Özgür. 2014. Expanding machine translation training data with an out-of-domain corpus using language modeling based vocabulary saturation. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, pages 180–192, Vancouver, Canada. Association for Machine Translation in the Americas.
Cite (Informal):
Expanding machine translation training data with an out-of-domain corpus using language modeling based vocabulary saturation (Aydın & Özgür, AMTA 2014)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2014.amta-researchers.14.pdf