Data Selection with Cluster-Based Language Difference Models and Cynical Selection

Lucía Santamaría, Amittai Axelrod


Abstract
We present and apply two methods for selecting relevant training data from a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models [1], we first introduce a cluster-based method that uses Brown clusters to condense the vocabulary of the corpora. Second, we implement the cynical data selection method [2], which incrementally constructs a training corpus to efficiently model the task corpus. Both the cluster-based and the cynical data selection approaches are used for the first time within a machine translation system, and we perform a head-to-head comparison. Our intrinsic evaluations show that both new methods outperform the standard Moore-Lewis approach (cross-entropy difference), achieving better perplexity and OOV rates on in-domain data. The cynical approach converges much more quickly, covering nearly all of the in-domain vocabulary with 84% less data than the other methods. Furthermore, the new approaches can be used to select training data that yields better machine translation systems. Our results confirm that class-based selection using Brown clusters is a viable alternative to POS-based class-based methods and removes the reliance on a part-of-speech tagger. Additionally, we validate the recently proposed cynical data selection method, showing that its performance in SMT models surpasses that of traditional cross-entropy difference methods and that its selected data more closely matches the sentence length of the task corpus.
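For readers unfamiliar with the Moore-Lewis baseline named in the abstract, the following is a minimal sketch of cross-entropy difference scoring. It is not the authors' implementation: it assumes add-one smoothed unigram models in place of the n-gram language models normally used, and the function names and toy corpora are hypothetical. Each pool sentence is scored by its per-token cross-entropy under an in-domain model minus its cross-entropy under a general-domain model, and lower scores are treated as more in-domain.

```python
import math
from collections import Counter

def unigram_model(sentences):
    """Return (token counts, total token count) for tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())

def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model."""
    counts, total = model
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in sentence
    )
    return -log_prob / len(sentence)

def moore_lewis_rank(pool, in_domain, general):
    """Rank non-empty pool sentences by H_in(s) - H_gen(s), lowest first."""
    vocab = {tok for corpus in (pool, in_domain, general)
             for sent in corpus for tok in sent}
    v = len(vocab)
    in_model = unigram_model(in_domain)
    gen_model = unigram_model(general)
    scored = [
        (cross_entropy(s, in_model, v) - cross_entropy(s, gen_model, v), s)
        for s in pool if s  # skip empty sentences to avoid division by zero
    ]
    return sorted(scored, key=lambda pair: pair[0])

# Toy corpora (assumed for illustration only).
in_domain = [["translate", "the", "talk"], ["the", "talk", "was", "translated"]]
general = [["stock", "prices", "fell"], ["the", "market", "fell", "again"]]
pool = [["stock", "market", "prices"], ["the", "talk", "was", "good"]]

ranked = moore_lewis_rank(pool, in_domain, general)
for score, sentence in ranked:
    print(f"{score:+.3f}", " ".join(sentence))
```

In practice a threshold or a fixed number of top-ranked sentences would be kept as training data. The cluster-based variant described in the abstract would, under the same assumptions, apply this scoring after mapping each token to its Brown cluster label.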
Anthology ID:
2017.iwslt-1.19
Volume:
Proceedings of the 14th International Conference on Spoken Language Translation
Month:
December 14-15
Year:
2017
Address:
Tokyo, Japan
Venue:
IWSLT
SIG:
SIGSLT
Publisher:
International Workshop on Spoken Language Translation
Pages:
137–145
URL:
https://aclanthology.org/2017.iwslt-1.19
Cite (ACL):
Lucía Santamaría and Amittai Axelrod. 2017. Data Selection with Cluster-Based Language Difference Models and Cynical Selection. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 137–145, Tokyo, Japan. International Workshop on Spoken Language Translation.
Cite (Informal):
Data Selection with Cluster-Based Language Difference Models and Cynical Selection (Santamaría & Axelrod, IWSLT 2017)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2017.iwslt-1.19.pdf