Michael Ng


2025

pdf bib
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation
Chuanyang Zheng | Yihang Gao | Han Shi | Jing Xiong | Jiankai Sun | Jingyao Li | Minbin Huang | Xiaozhe Ren | Michael Ng | Xin Jiang | Zhenguo Li | Yu Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens. In general, the attention scores are determined simply by the key-query products. However, this work’s occasional trial (combining DAPE and NoPE) of including additional MLPs on attention scores without position encoding indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, **the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem**, which is called Convolutional Data-Adaptive Position Encoding (CDAPE).The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.