TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task

Kaixin Wu, Bojie Hu, Qi Ju


Abstract
This paper describes TenTrans’s submissions to the WMT 2021 Efficiency Shared Task. We explore training a variety of smaller, compact Transformer models using a teacher-student setup. Our models are trained with TenTrans-Py, our self-developed open-source multilingual training platform. We also release an open-source high-performance inference toolkit for Transformer models, written entirely in C++. All additional optimizations, including attention caching, kernel fusion, early stopping, and several others, are built on top of this inference engine. In our submissions, the fastest system translates more than 22,000 tokens per second on a single Tesla P4 while maintaining 38.36 BLEU on the En-De newstest2019 set. Our trained models and further details are available in the TenTrans-Decoding competition examples.
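Of the optimizations named above, attention caching is the easiest to illustrate. The following is a minimal C++ sketch, under assumed names (KVCache, append), of caching per-layer key/value projections across autoregressive decoding steps; it is not the TenTrans-Decoding implementation, only an illustration of the general technique.

#include <cstddef>
#include <vector>

// Hypothetical per-layer key/value cache for autoregressive decoding.
// At each step only the new token's projections are appended; earlier
// positions are reused instead of being recomputed.
struct KVCache {
    std::vector<std::vector<float>> keys;    // [step][hidden]
    std::vector<std::vector<float>> values;  // [step][hidden]

    void append(const std::vector<float>& k, const std::vector<float>& v) {
        keys.push_back(k);
        values.push_back(v);
    }
    std::size_t length() const { return keys.size(); }
};

int main() {
    const std::size_t hidden = 4;
    KVCache cache;
    // Simulate three decoding steps; attention for the current query
    // would read all cache.length() cached positions at each step.
    for (std::size_t step = 0; step < 3; ++step) {
        std::vector<float> k(hidden, 0.1f * static_cast<float>(step));
        std::vector<float> v(hidden, 0.2f * static_cast<float>(step));
        cache.append(k, v);
    }
    return cache.length() == 3 ? 0 : 1;
}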
Anthology ID: 2021.wmt-1.77
Volume: Proceedings of the Sixth Conference on Machine Translation
Month: November
Year: 2021
Address: Online
Venue: WMT
SIG: SIGMT
Publisher: Association for Computational Linguistics
Pages: 795–798
URL: https://aclanthology.org/2021.wmt-1.77
Cite (ACL):
Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics.
Cite (Informal):
TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task (Wu et al., WMT 2021)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/2021.wmt-1.77.pdf