Abstract
This paper describes TenTrans's submissions to the WMT 2021 Efficiency Shared Task. We explore training a variety of smaller, compact Transformer models in a teacher-student setup. Our models are trained with TenTrans-Py, our self-developed open-source multilingual training platform. We also release an open-source, high-performance inference toolkit for Transformer models, written entirely in C++. All additional optimizations, including attention caching, kernel fusion, early stopping, and several others, are built on top of the inference engine. Among our submissions, the fastest system translates more than 22,000 tokens per second on a single Tesla P4 while maintaining 38.36 BLEU on En-De newstest2019. Our trained models and further details are available in the TenTrans-Decoding competition examples.
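Of the optimizations named above, attention caching is the simplest to sketch: at each decoding step, the new query attends over key/value vectors cached from earlier steps rather than recomputing them from scratch. The C++ sketch below is only a minimal illustration of that idea under simplifying assumptions (single attention head, no learned projections, no batching); all names are hypothetical, and it is not the TenTrans-Decoding implementation.

```cpp
// Minimal sketch of attention (KV) caching during incremental decoding.
// Illustrative only; names are hypothetical, NOT the TenTrans-Decoding code.
#include <cmath>
#include <cstdio>
#include <vector>

// Per-sequence cache: one key and one value vector per past decoding step.
struct KVCache {
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;
};

// One attention step: append this step's key/value to the cache, then let
// the new query attend over all cached entries. Earlier steps are never
// recomputed, so per-step cost is linear in the number of steps so far.
std::vector<float> attend(const std::vector<float>& query,
                          const std::vector<float>& key,
                          const std::vector<float>& value,
                          KVCache& cache) {
    cache.keys.push_back(key);
    cache.values.push_back(value);

    const size_t steps = cache.keys.size();
    const size_t dim = query.size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(dim));

    // Scaled dot-product scores against every cached key.
    std::vector<float> scores(steps);
    float maxScore = -1e30f;
    for (size_t t = 0; t < steps; ++t) {
        float dot = 0.0f;
        for (size_t d = 0; d < dim; ++d) dot += query[d] * cache.keys[t][d];
        scores[t] = dot * scale;
        if (scores[t] > maxScore) maxScore = scores[t];
    }

    // Numerically stable softmax over the scores.
    float sum = 0.0f;
    for (float& s : scores) { s = std::exp(s - maxScore); sum += s; }

    // Output is the softmax-weighted sum of cached values.
    std::vector<float> out(dim, 0.0f);
    for (size_t t = 0; t < steps; ++t) {
        const float w = scores[t] / sum;
        for (size_t d = 0; d < dim; ++d) out[d] += w * cache.values[t][d];
    }
    return out;
}

int main() {
    KVCache cache;
    // Three decoding steps; in a real decoder, q/k/v would come from
    // projections of the current token's hidden state.
    std::vector<std::vector<float>> qkv = {
        {0.1f, 0.2f}, {0.4f, 0.3f}, {0.2f, 0.5f}};
    for (const auto& x : qkv) {
        std::vector<float> out = attend(x, x, x, cache);
        std::printf("step output: %.4f %.4f\n", out[0], out[1]);
    }
    return 0;
}
```

The point of the cache is that decoder self-attention becomes a single append plus a linear scan per step, instead of recomputing keys and values for every previous position at every step.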
- Anthology ID: 2021.wmt-1.77
- Volume: Proceedings of the Sixth Conference on Machine Translation
- Month: November
- Year: 2021
- Address: Online
- Venue: WMT
- SIG: SIGMT
- Publisher: Association for Computational Linguistics
- Pages: 795–798
- URL: https://aclanthology.org/2021.wmt-1.77
- Cite (ACL): Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics.
- Cite (Informal): TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task (Wu et al., WMT 2021)
- PDF: https://aclanthology.org/2021.wmt-1.77.pdf