Robust Tuning Datasets for Statistical Machine Translation

Preslav Nakov, Stephan Vogel


Abstract
We explore the idea of automatically crafting a tuning dataset for Statistical Machine Translation (SMT) that makes the hyper-parameters of the SMT system more robust with respect to specific deficiencies of the parameter tuning algorithms. This is an under-explored research direction, which can enable better parameter tuning. In this paper, we achieve this goal by selecting a subset of the available sentence pairs that is better suited to specific combinations of optimizers, objective functions, and evaluation measures. We demonstrate the potential of the idea with the pairwise ranking optimization (PRO) optimizer, which is known to yield translations that are too short. We show that this learning problem can be alleviated by tuning on a subset of the development set, selected based on sentence length. In particular, using the longest 50% of the tuning sentences, we achieve a two-fold tuning speed-up and improvements in BLEU score that rival those of alternatives that instead fix BLEU+1's smoothing.
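As a rough illustration only (not the paper's exact procedure, whose selection criterion and tooling are not given here), the length-based subset selection described in the abstract could be sketched as follows; the function name and the choice of source-side token count as the length measure are assumptions:

```python
def longest_half(tuning_pairs):
    """Select the longest 50% of tuning sentence pairs.

    tuning_pairs: list of (source, reference) sentence strings.
    Length is measured here as the source-side token count
    (an assumption for illustration). Ties keep input order,
    since Python's sort is stable.
    """
    ranked = sorted(tuning_pairs,
                    key=lambda pair: len(pair[0].split()),
                    reverse=True)
    return ranked[: len(ranked) // 2]
```

The resulting subset would then replace the full development set when running the PRO optimizer.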
Anthology ID:
R17-1071
Volume:
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
Month:
September
Year:
2017
Address:
Varna, Bulgaria
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
543–550
URL:
https://doi.org/10.26615/978-954-452-049-6_071
DOI:
10.26615/978-954-452-049-6_071
Cite (ACL):
Preslav Nakov and Stephan Vogel. 2017. Robust Tuning Datasets for Statistical Machine Translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 543–550, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Robust Tuning Datasets for Statistical Machine Translation (Nakov & Vogel, RANLP 2017)
PDF:
https://doi.org/10.26615/978-954-452-049-6_071