Abstract
We present and apply two methods for selecting relevant training data from a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models [1], we first introduce a cluster-based method that uses Brown clusters to condense the vocabulary of the corpora. Second, we implement the cynical data selection method [2], which incrementally constructs a training corpus to efficiently model the task corpus. Both the cluster-based and the cynical data selection approaches are used for the first time within a machine translation system, and we perform a head-to-head comparison. Our intrinsic evaluations show that both new methods outperform the standard Moore-Lewis approach (cross-entropy difference), achieving better perplexity and OOV rates on in-domain data. The cynical approach converges much more quickly, covering nearly all of the in-domain vocabulary with 84% less data than the other methods. Furthermore, the new approaches can be used to select machine translation training data for training better systems. Our results confirm that class-based selection using Brown clusters is a viable alternative to POS-based class-based methods and removes the reliance on a part-of-speech tagger. Additionally, we validate the recently proposed cynical data selection method, showing that its performance in SMT models surpasses that of traditional cross-entropy difference methods and that the data it selects more closely matches the sentence length of the task corpus.
- Anthology ID:
- 2017.iwslt-1.19
- Volume:
- Proceedings of the 14th International Conference on Spoken Language Translation
- Month:
- December 14-15
- Year:
- 2017
- Address:
- Tokyo, Japan
- Venue:
- IWSLT
- SIG:
- SIGSLT
- Publisher:
- International Workshop on Spoken Language Translation
- Pages:
- 137–145
- URL:
- https://aclanthology.org/2017.iwslt-1.19
- Cite (ACL):
- Lucía Santamaría and Amittai Axelrod. 2017. Data Selection with Cluster-Based Language Difference Models and Cynical Selection. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 137–145, Tokyo, Japan. International Workshop on Spoken Language Translation.
- Cite (Informal):
- Data Selection with Cluster-Based Language Difference Models and Cynical Selection (Santamaría & Axelrod, IWSLT 2017)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2017.iwslt-1.19.pdf
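
The Moore-Lewis baseline named in the abstract ranks each pool sentence by the difference between its cross-entropy under a task (in-domain) language model and under a general-pool language model, keeping the lowest-scoring sentences. The following is a minimal sketch of that criterion only, not the paper's implementation: it assumes add-one smoothed unigram models in place of the n-gram models normally used, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter


def unigram_model(sentences):
    """Return a function giving add-one smoothed unigram log-probabilities.

    Assumption: a unigram model stands in for a real n-gram language model.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def cross_entropy(logprob, sentence):
    """Per-token cross-entropy (in nats) of a sentence under a model."""
    tokens = sentence.split()
    return -sum(logprob(t) for t in tokens) / max(len(tokens), 1)


def moore_lewis_rank(task, pool):
    """Sort pool sentences by H_task(s) - H_pool(s), most task-like first."""
    lp_task = unigram_model(task)
    lp_pool = unigram_model(pool)
    return sorted(
        pool,
        key=lambda s: cross_entropy(lp_task, s) - cross_entropy(lp_pool, s),
    )


# Hypothetical toy data: a tiny task corpus and a tiny general pool.
task = ["the talk was translated live", "the speaker gave a talk"]
pool = [
    "stock prices fell sharply today",
    "the speaker gave a long talk",
    "markets closed lower",
]
ranked = moore_lewis_rank(task, pool)
```

In this toy setup the pool sentence that shares vocabulary with the task corpus sorts first. A selection step would then keep the top-ranked sentences up to some chosen size or score threshold.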