Michael Ng
2025
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation
Chuanyang Zheng
|
Yihang Gao
|
Han Shi
|
Jing Xiong
|
Jiankai Sun
|
Jingyao Li
|
Minbin Huang
|
Xiaozhe Ren
|
Michael Ng
|
Xin Jiang
|
Zhenguo Li
|
Yu Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens. In general, the attention scores are determined simply by the key-query products. However, this work’s occasional trial (combining DAPE and NoPE) of including additional MLPs on attention scores without position encoding indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, **the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem**, which is called Convolutional Data-Adaptive Position Encoding (CDAPE).The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.
Search
Fix author
Co-authors
- Yihang Gao 1
- Minbin Huang 1
- Xin Jiang 1
- Jingyao Li 1
- Zhenguo Li 1
- show all...
Venues
- acl1