Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns
Zihan Wang | Jiuxiang Gu | Jason Kuen | Handong Zhao | Vlad Morariu | Ruiyi Zhang | Ani Nenkova | Tong Sun | Jingbo Shang
Findings of the Association for Computational Linguistics: ACL 2022
We present a comprehensive study of sparse attention patterns in Transformer models. We first question the need for pre-training with sparse attention and show experimentally that an efficient fine-tuning-only approach yields a slightly worse but still competitive model. We then compare the widely used local attention pattern and the less well-studied global attention pattern, demonstrating that global patterns have several unique advantages. We also demonstrate that a flexible approach to attention, with different patterns across different layers of the model, is beneficial for some tasks. Drawing on this insight, we propose a novel Adaptive Axis Attention method, which learns, during fine-tuning, different attention patterns for each Transformer layer depending on the downstream task. Rather than choosing a fixed attention pattern, the Adaptive Axis Attention method identifies important tokens for each task and model layer and focuses attention on those. It does not require pre-training to accommodate the sparse patterns, and it is competitive with, and sometimes better than, fixed sparse attention patterns that require resource-intensive pre-training.
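The local and global patterns contrasted in the abstract can be pictured as boolean attention masks. The sketch below is only an illustration of the generic windowed (local) and selected-token (global) patterns discussed in the sparse-attention literature; the window size and global positions are arbitrary assumptions, and this is not the paper's Adaptive Axis Attention implementation.

```python
# Illustrative sketch, not the paper's method: boolean masks for local
# (windowed) and global (selected-token) sparse attention patterns.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where query i may attend to key j, i.e. |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window

def global_attention_mask(seq_len: int, global_positions: torch.Tensor) -> torch.Tensor:
    """True where either the query or the key is a designated global token."""
    is_global = torch.zeros(seq_len, dtype=torch.bool)
    is_global[global_positions] = True
    return is_global[:, None] | is_global[None, :]

seq_len = 8
local_mask = local_attention_mask(seq_len, window=2)          # window=2 is an arbitrary choice
global_mask = global_attention_mask(seq_len, torch.tensor([0, 4]))  # positions 0 and 4 are arbitrary
combined = local_mask | global_mask  # which pattern each layer uses is what the
                                     # paper proposes to learn during fine-tuning
print(combined.int())
```

In a sparse Transformer layer, such a mask would be applied to the attention logits (e.g., setting disallowed positions to a large negative value before the softmax), so each query attends only to the permitted keys.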