Abstract
Self-Attention Networks (SANs) are an integral part of successful neural architectures such as Transformer (Vaswani et al., 2017), and thus of pretrained language models such as BERT (Devlin et al., 2019) or GPT-3 (Brown et al., 2020). Training SANs on a task or pretraining them on language modeling requires large amounts of data and compute resources. We are searching for modifications to SANs that enable faster learning, i.e., higher accuracies after fewer update steps. We investigate three modifications to SANs: direct position interactions, learnable temperature, and convoluted attention. When evaluating them on part-of-speech tagging, we find that direct position interactions are an alternative to position embeddings, and convoluted attention has the potential to speed up the learning process.
- Anthology ID:
- 2020.coling-main.324
- Volume:
- Proceedings of the 28th International Conference on Computational Linguistics
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Editors:
- Donia Scott, Nuria Bel, Chengqing Zong
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 3630–3636
- URL:
- https://aclanthology.org/2020.coling-main.324
- DOI:
- 10.18653/v1/2020.coling-main.324
- Cite (ACL):
- Philipp Dufter, Martin Schmitt, and Hinrich Schütze. 2020. Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3630–3636, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Cite (Informal):
- Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention (Dufter et al., COLING 2020)
- PDF:
- https://aclanthology.org/2020.coling-main.324.pdf
- Code:
- pdufter/convatt
- Data:
- Penn Treebank
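
The abstract names three modifications but does not spell out how they are implemented; the details are in the paper and the pdufter/convatt repository. As a purely illustrative sketch (not taken from that code), the PyTorch snippet below shows how one of them, a learnable temperature, could replace the fixed 1/√d scaling in standard scaled dot-product self-attention. The class name, parameterization, and initialization are assumptions made for this example.

```python
# Hypothetical sketch, not the authors' implementation: single-head
# self-attention where the usual fixed 1/sqrt(d) scaling is replaced
# by a learnable temperature parameter.
import torch
import torch.nn as nn


class LearnableTemperatureSelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learnable temperature, initialized to sqrt(d_model) so that
        # training starts from the standard Transformer scaling
        # (this initialization is an assumption for the sketch).
        self.temperature = nn.Parameter(torch.tensor(float(d_model) ** 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Attention scores divided by the learned temperature instead of sqrt(d)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.temperature
        weights = torch.softmax(scores, dim=-1)
        return torch.matmul(weights, v)


# Usage example
x = torch.randn(2, 10, 64)                      # (batch, seq_len, d_model)
attn = LearnableTemperatureSelfAttention(64)
out = attn(x)                                   # (2, 10, 64)
```

Because the temperature is a trainable scalar, the model can sharpen or flatten its attention distributions during training rather than being tied to the fixed √d denominator; how the paper actually parameterizes this, and how direct position interactions and convoluted attention are realized, should be checked against the paper and repository.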