Abstract
Recent work has questioned the importance of the Transformer’s multi-headed attention for achieving high translation quality. We push further in this direction by developing a “hard-coded” attention variant without any learned parameters. Surprisingly, replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs. However, additionally hard-coding cross attention (which connects the decoder to the encoder) significantly lowers BLEU, suggesting that it is more important than self-attention. Much of this BLEU drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer. Taken as a whole, our results offer insight into which components of the Transformer are actually important, which we hope will guide future work into the development of simpler and more efficient attention-based models.
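As a rough illustration of the idea (a minimal sketch, not the authors' implementation; the head offsets and standard deviation below are assumptions chosen for clarity), each hard-coded head simply places a fixed Gaussian over token positions relative to the current one, with no queries or keys computed from the input:

```python
# Minimal sketch of input-agnostic, hard-coded Gaussian attention.
# The offset/sigma values are illustrative assumptions, not the paper's
# exact configuration.
import numpy as np

def hardcoded_gaussian_attention(values, offset=0, sigma=1.0):
    """Attend over `values` (seq_len x d_model) with fixed Gaussian weights.

    For query position i, the weight on key position j is proportional to
    exp(-(j - (i + offset))^2 / (2 * sigma^2)); nothing is learned and
    nothing depends on the token contents.
    """
    seq_len = values.shape[0]
    positions = np.arange(seq_len)
    centers = positions[:, None] + offset          # each row's Gaussian center
    scores = -((positions[None, :] - centers) ** 2) / (2 * sigma ** 2)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-normalize
    return weights @ values                         # same shape as `values`

# Example: three "heads" focusing on the previous, current, and next token.
x = np.random.randn(5, 8)
heads = [hardcoded_gaussian_attention(x, offset=o) for o in (-1, 0, 1)]
```

Because the weights depend only on position, such a head can be precomputed once per sequence length rather than recomputed for every input.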
- Anthology ID: 2020.acl-main.687
- Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
- Month: July
- Year: 2020
- Address: Online
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 7689–7700
- URL: https://aclanthology.org/2020.acl-main.687
- DOI: 10.18653/v1/2020.acl-main.687
- Cite (ACL): Weiqiu You, Simeng Sun, and Mohit Iyyer. 2020. Hard-Coded Gaussian Attention for Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7689–7700, Online. Association for Computational Linguistics.
- Cite (Informal): Hard-Coded Gaussian Attention for Neural Machine Translation (You et al., ACL 2020)
- PDF: https://preview.aclanthology.org/remove-xml-comments/2020.acl-main.687.pdf
- Code: fallcat/stupidNMT