Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns
Zihan Wang, Jiuxiang Gu, Jason Kuen, Handong Zhao, Vlad Morariu, Ruiyi Zhang, Ani Nenkova, Tong Sun, Jingbo Shang
Abstract
We present a comprehensive study of sparse attention patterns in Transformer models. We first question the need for pre-training with sparse attention and present experiments showing that an efficient fine-tuning only approach yields a slightly worse but still competitive model. Then we compare the widely used local attention pattern and the less-well-studied global attention pattern, demonstrating that global patterns have several unique advantages. We also demonstrate that a flexible approach to attention, with different patterns across different layers of the model, is beneficial for some tasks. Drawing on this insight, we propose a novel Adaptive Axis Attention method, which learns, during fine-tuning, different attention patterns for each Transformer layer depending on the downstream task. Rather than choosing a fixed attention pattern, the adaptive axis attention method identifies important tokens, for each task and model layer, and focuses attention on those. It does not require pre-training to accommodate the sparse patterns and demonstrates competitive and sometimes better performance against fixed sparse attention patterns that require resource-intensive pre-training.
- Anthology ID: 2022.findings-acl.74
- Volume: Findings of the Association for Computational Linguistics: ACL 2022
- Month: May
- Year: 2022
- Address: Dublin, Ireland
- Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 916–925
- URL: https://aclanthology.org/2022.findings-acl.74
- DOI: 10.18653/v1/2022.findings-acl.74
- Cite (ACL): Zihan Wang, Jiuxiang Gu, Jason Kuen, Handong Zhao, Vlad Morariu, Ruiyi Zhang, Ani Nenkova, Tong Sun, and Jingbo Shang. 2022. Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns. In Findings of the Association for Computational Linguistics: ACL 2022, pages 916–925, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal): Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns (Wang et al., Findings 2022)
- PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/2022.findings-acl.74.pdf
- Data: GLUE, LRA, QNLI