Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together
Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang
Abstract
Neural networks equipped with self-attention have parallelizable computation, light-weight structure, and the ability to capture both long-range and local dependencies. Further, their expressive power and performance can be boosted by using a vector to measure pairwise dependency, but this requires expanding the alignment matrix into a tensor, which results in memory and computation bottlenecks. In this paper, we propose a novel attention mechanism called “Multi-mask Tensorized Self-Attention” (MTSA), which is as fast and as memory-efficient as a CNN, but significantly outperforms previous CNN-/RNN-/attention-based models. MTSA 1) captures both pairwise (token2token) and global (source2token) dependencies by a novel compatibility function composed of dot-product and additive attentions, 2) uses a tensor to represent the feature-wise alignment scores for better expressive power but only requires parallelizable matrix multiplications, and 3) combines multi-head with multi-dimensional attentions, and applies a distinct positional mask to each head (subspace), so the memory and computation can be distributed to multiple heads, each with sequential information encoded independently. The experiments show that a CNN/RNN-free model based on MTSA achieves state-of-the-art or competitive performance on nine NLP benchmarks with compelling memory- and time-efficiency.
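The abstract compresses several moving parts, so below is a minimal NumPy sketch of the idea it describes: scalar token2token (dot-product) scores are combined with feature-wise source2token (additive) scores into tensorized alignment scores, a head-specific positional mask is added, and a feature-wise softmax produces the output. All names here (`mtsa_head_sketch`, `Wf`, `fwd_mask`), the shapes, and the exact score combination are illustrative assumptions drawn from the abstract, not the authors' implementation; in particular, the paper distributes the tensor across heads and computes it with matrix multiplications rather than materializing a full (n, n, d_h) array as done here. See taoshen58/DiSAN for the official code.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mtsa_head_sketch(x, Wq, Wk, Wv, Wf, bf, mask):
    """One hypothetical attention head over inputs x of shape (n, d_in).

    mask: (n, n) additive positional mask (0 = attend, -1e9 = blocked);
    each head would receive a distinct mask.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # (n, d_h) each
    n, d_h = q.shape

    # token2token: scalar dot-product score for every token pair (i, j)
    s_pair = (q @ k.T) / np.sqrt(d_h)             # (n, n)

    # source2token: feature-wise additive score for every source token j
    s_glob = np.tanh(x @ Wf + bf)                 # (n, d_h)

    # tensorized alignment scores, broadcast to (n, n, d_h), plus the head's mask
    scores = s_pair[:, :, None] + s_glob[None, :, :] + mask[:, :, None]

    # feature-wise softmax over source positions j, then weighted sum of values
    alpha = softmax(scores, axis=1)               # (n, n, d_h)
    return (alpha * v[None, :, :]).sum(axis=1)    # (n, d_h)

# Toy usage: 5 tokens, input dim 8, head dim 4, a "forward" mask that lets each
# token attend only to itself and earlier positions.
rng = np.random.default_rng(0)
n, d_in, d_h = 5, 8, 4
x = rng.normal(size=(n, d_in))
Wq, Wk, Wv, Wf = (0.1 * rng.normal(size=(d_in, d_h)) for _ in range(4))
bf = np.zeros(d_h)
fwd_mask = np.where(np.tril(np.ones((n, n))) > 0, 0.0, -1e9)
out = mtsa_head_sketch(x, Wq, Wk, Wv, Wf, bf, fwd_mask)
print(out.shape)  # (5, 4)
```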
- Anthology ID: N19-1127
- Volume: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
- Month: June
- Year: 2019
- Address: Minneapolis, Minnesota
- Editors: Jill Burstein, Christy Doran, Thamar Solorio
- Venue: NAACL
- Publisher: Association for Computational Linguistics
- Pages: 1256–1266
- URL: https://aclanthology.org/N19-1127
- DOI: 10.18653/v1/N19-1127
- Cite (ACL): Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1256–1266, Minneapolis, Minnesota. Association for Computational Linguistics.
- Cite (Informal): Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together (Shen et al., NAACL 2019)
- PDF: https://preview.aclanthology.org/emnlp22-frontmatter/N19-1127.pdf
- Code: taoshen58/DiSAN + additional community code
- Data: MPQA Opinion Corpus, MultiNLI, SNLI, SST, SST-5