@inproceedings{kobyzev-etal-2025-integral,
title = "Integral Transformer: Denoising Attention, Not Too Much Not Too Little",
author = "Kobyzev, Ivan and
Ghaddar, Abbas and
Hu, Dingtao and
Chen, Boxing",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.118/",
pages = "2337--2354",
ISBN = "979-8-89176-332-6",
abstract = "Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as punctuation and special tokens, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. This approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on rigorous knowledge and reasoning benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer more effectively balances attention distributions and reduces rank collapse in upper layers."
}