Abstract
Self-Attention Networks (SANs) are an integral part of successful neural architectures such as Transformer (Vaswani et al., 2017), and thus of pretrained language models such as BERT (Devlin et al., 2019) or GPT-3 (Brown et al., 2020). Training SANs on a task or pretraining them on language modeling requires large amounts of data and compute resources. We are searching for modifications to SANs that enable faster learning, i.e., higher accuracies after fewer update steps. We investigate three modifications to SANs: direct position interactions, learnable temperature, and convoluted attention. When evaluating them on part-of-speech tagging, we find that direct position interactions are an alternative to position embeddings, and convoluted attention has the potential to speed up the learning process.
- Anthology ID:
- 2020.coling-main.324
- Volume:
- Proceedings of the 28th International Conference on Computational Linguistics
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Editors:
- Donia Scott, Nuria Bel, Chengqing Zong
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 3630–3636
- URL:
- https://aclanthology.org/2020.coling-main.324
- DOI:
- 10.18653/v1/2020.coling-main.324
- Cite (ACL):
- Philipp Dufter, Martin Schmitt, and Hinrich Schütze. 2020. Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3630–3636, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Cite (Informal):
- Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention (Dufter et al., COLING 2020)
- PDF:
- https://aclanthology.org/2020.coling-main.324.pdf
- Code:
- pdufter/convatt
- Data:
- Penn Treebank
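
The abstract names three modifications but does not spell out how they are implemented; the details are in the paper and the pdufter/convatt repository. As a purely illustrative sketch (not taken from that code), the PyTorch snippet below shows how one of them, a learnable temperature, could replace the fixed 1/√d scaling in standard scaled dot-product self-attention. The class name, parameterization, and initialization are assumptions made for this example.

```python
# Hypothetical sketch, not the authors' implementation: single-head
# self-attention where the usual fixed 1/sqrt(d) scaling is replaced
# by a learnable temperature parameter.
import torch
import torch.nn as nn


class LearnableTemperatureSelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learnable temperature, initialized to sqrt(d_model) so that
        # training starts from the standard Transformer scaling
        # (this initialization is an assumption for the sketch).
        self.temperature = nn.Parameter(torch.tensor(float(d_model) ** 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Attention scores divided by the learned temperature instead of sqrt(d)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.temperature
        weights = torch.softmax(scores, dim=-1)
        return torch.matmul(weights, v)


# Usage example
x = torch.randn(2, 10, 64)                      # (batch, seq_len, d_model)
attn = LearnableTemperatureSelfAttention(64)
out = attn(x)                                   # (2, 10, 64)
```

Because the temperature is a trainable scalar, the model can sharpen or flatten its attention distributions during training rather than being tied to the fixed √d denominator; how the paper actually parameterizes this, and how direct position interactions and convoluted attention are realized, should be checked against the paper and repository.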