Abstract
Despite the revolutionary advances brought by the Transformer in Neural Machine Translation (NMT), inference efficiency remains an obstacle due to the heavy use of attention operations in auto-regressive decoding. We therefore propose a lightweight attention structure, the Attention Refinement Network (ARN), to speed up the Transformer. Specifically, we design a weighted residual network that reconstructs the attention by reusing features across layers. To further improve efficiency, we merge the self-attention and cross-attention components for parallel computation. Extensive experiments on ten WMT machine translation tasks show that the proposed model is on average 1.35x faster than the state-of-the-art inference implementation, with almost no decrease in BLEU. Results on the widely used WMT14 En-De task further demonstrate that our model achieves a higher speed-up and highly competitive performance compared to AAN and SAN models, with fewer parameters.
- Anthology ID:
- 2022.coling-1.453
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Editors:
- Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 5109–5118
- URL:
- https://aclanthology.org/2022.coling-1.453
- Cite (ACL):
- Kaixin Wu, Yue Zhang, Bojie Hu, and Tong Zhang. 2022. Speeding up Transformer Decoding via an Attention Refinement Network. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5109–5118, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- Speeding up Transformer Decoding via an Attention Refinement Network (Wu et al., COLING 2022)
- PDF:
- https://preview.aclanthology.org/ingest-2024-clasp/2022.coling-1.453.pdf
- Code:
- kaixin-wu-for-open-source/arn
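The abstract describes two efficiency ideas: reconstructing attention through a weighted residual over features reused from earlier layers, and merging the self-attention and cross-attention branches so they can be computed in parallel. The sketch below illustrates one possible reading of those ideas in PyTorch; it is a minimal illustration under stated assumptions, not the authors' released implementation, and all class and parameter names (WeightedResidualAttention, ParallelAttentionBlock, layer_weights, merge) are hypothetical.

```python
# Minimal PyTorch sketch of the two ideas described in the abstract.
# All names here are hypothetical; this is not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedResidualAttention(nn.Module):
    """One reading of the 'weighted residual network': instead of recomputing
    attention at every decoder layer, reconstruct it as a learned weighted
    combination of attention weights reused from earlier layers."""

    def __init__(self, num_reused_layers: int):
        super().__init__()
        # One learnable mixing weight per reused layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_reused_layers))

    def forward(self, prev_attns):
        # prev_attns: list of (batch, heads, tgt_len, src_len) weight tensors
        stacked = torch.stack(prev_attns, dim=0)      # (L, B, H, T, S)
        mix = F.softmax(self.layer_weights, dim=0)    # (L,)
        return (mix.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)


class ParallelAttentionBlock(nn.Module):
    """One reading of merging self- and cross-attention: both branches read
    the same decoder states, so they can run in parallel and their outputs
    are merged by a single projection."""

    def __init__(self, d_model: int, nhead: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x, memory, tgt_mask=None):
        # Self-attention over decoder states and cross-attention over the
        # encoder memory share the same query input, so neither depends on
        # the other's output.
        s, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        c, _ = self.cross_attn(x, memory, memory)
        return self.merge(torch.cat([s, c], dim=-1))


if __name__ == "__main__":
    batch, heads, tgt_len, src_len, d_model = 2, 4, 5, 7, 64
    reuse = WeightedResidualAttention(num_reused_layers=3)
    prev = [torch.softmax(torch.randn(batch, heads, tgt_len, src_len), -1)
            for _ in range(3)]
    print(reuse(prev).shape)       # torch.Size([2, 4, 5, 7])

    block = ParallelAttentionBlock(d_model, heads)
    x = torch.randn(batch, tgt_len, d_model)
    memory = torch.randn(batch, src_len, d_model)
    print(block(x, memory).shape)  # torch.Size([2, 5, 64])
```

In this sketch, reusing earlier layers' weights avoids recomputing the query-key product at refined layers, and running the two attention branches on the same input removes their sequential dependency; how ARN actually realises these steps is detailed in the paper and the linked repository.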