Mixed Multi-Head Self-Attention for Neural Machine Translation
Hongyi Cui, Shohei Iida, Po-Hsuan Hung, Takehito Utsuro, Masaaki Nagata
Abstract
Recently, the Transformer has become a state-of-the-art architecture in the field of neural machine translation (NMT). A key to its high performance is multi-head self-attention, which is supposed to allow the model to independently attend to information from different representation subspaces. However, there is no explicit mechanism to ensure that different attention heads indeed capture different features, and in practice, redundancy occurs across multiple heads. In this paper, we argue that using the same global attention in multiple heads limits multi-head self-attention’s capacity for learning distinct features. To improve the expressiveness of multi-head self-attention, we propose a novel Mixed Multi-Head Self-Attention (MMA), which models not only global and local attention but also forward and backward attention in different attention heads. This enables the model to learn distinct representations explicitly among multiple heads. In our experiments on both the WAT17 English-Japanese and the IWSLT14 German-English translation tasks, we show that, without increasing the number of parameters, our models yield consistent and significant improvements (0.9 BLEU points on average) over the strong Transformer baseline.
- Anthology ID:
- D19-5622
- Volume:
- Proceedings of the 3rd Workshop on Neural Generation and Translation
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong
- Venue:
- NGT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 206–214
- URL:
- https://aclanthology.org/D19-5622
- DOI:
- 10.18653/v1/D19-5622
- Cite (ACL):
- Hongyi Cui, Shohei Iida, Po-Hsuan Hung, Takehito Utsuro, and Masaaki Nagata. 2019. Mixed Multi-Head Self-Attention for Neural Machine Translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 206–214, Hong Kong. Association for Computational Linguistics.
- Cite (Informal):
- Mixed Multi-Head Self-Attention for Neural Machine Translation (Cui et al., NGT 2019)
- PDF:
- https://preview.aclanthology.org/starsem-semeval-split/D19-5622.pdf
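
The four head types named in the abstract (global, local, forward, backward) can be illustrated as additive masks applied to the attention scores before the softmax. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the helper names `head_mask` and `masked_attention`, the `window` parameter, and the reading of "forward" as attending to the current and earlier positions (with "backward" as the current and later positions) are all hypothetical choices made for illustration, and the paper may define them differently. Because the masks are fixed rather than learned, a scheme like this adds no parameters, which is consistent with the abstract's claim.

```python
import numpy as np

def head_mask(kind, n, window=1):
    """Return an (n, n) additive mask: 0 where attention is allowed, -inf elsewhere.

    The mask definitions here are illustrative assumptions, not taken from the paper.
    """
    i = np.arange(n)[:, None]  # query positions (rows)
    j = np.arange(n)[None, :]  # key positions (columns)
    if kind == "global":
        allowed = np.ones((n, n), dtype=bool)   # every position visible
    elif kind == "local":
        allowed = np.abs(i - j) <= window       # assumed fixed-size window
    elif kind == "forward":
        allowed = j <= i                        # assumed: current and earlier keys
    elif kind == "backward":
        allowed = j >= i                        # assumed: current and later keys
    else:
        raise ValueError(f"unknown head kind: {kind}")
    return np.where(allowed, 0.0, -np.inf)

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention for one head with an additive mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                               # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage: one random sequence, one head per mask type.
rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.standard_normal((n, d))
outputs = {
    kind: masked_attention(x, x, x, head_mask(kind, n))
    for kind in ("global", "local", "forward", "backward")
}
```

In this sketch each head sees the same inputs but a different mask, so the heads are pushed toward different views of the sequence by construction rather than by chance.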