Comparison of data selection techniques for the translation of video lectures
Joern Wuebker, Hermann Ney, Adrià Martínez-Villaronga, Adrià Giménez, Alfons Juan, Christophe Servan, Marc Dymetman, Shachar Mirkin
Abstract
For the task of online translation of scientific video lectures, using huge models is not possible. In order to get smaller and efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.
- Anthology ID: 2014.amta-researchers.15
- Volume: Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
- Month: October 22-26
- Year: 2014
- Address: Vancouver, Canada
- Editors: Yaser Al-Onaizan, Michel Simard
- Venue: AMTA
- Publisher: Association for Machine Translation in the Americas
- Pages: 193–207
- URL: https://aclanthology.org/2014.amta-researchers.15
- Cite (ACL): Joern Wuebker, Hermann Ney, Adrià Martínez-Villaronga, Adrià Giménez, Alfons Juan, Christophe Servan, Marc Dymetman, and Shachar Mirkin. 2014. Comparison of data selection techniques for the translation of video lectures. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, pages 193–207, Vancouver, Canada. Association for Machine Translation in the Americas.
- Cite (Informal): Comparison of data selection techniques for the translation of video lectures (Wuebker et al., AMTA 2014)
- PDF: https://preview.aclanthology.org/emnlp22-frontmatter/2014.amta-researchers.15.pdf