LayerNorm Induces Recency Bias in Transformer Decoders

Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi


Abstract
Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.
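The prior-work observation cited in the abstract, that stacked causal self-attention alone skews toward earlier tokens, can be illustrated with a small numerical sketch. It is not the paper's analysis and does not model LayerNorm, residual connections, or embedding distributions. It assumes every layer attends uniformly over the visible prefix and measures cumulative influence by multiplying the per-layer attention matrices; the function names `uniform_causal_attention` and `stacked_influence` are hypothetical.

```python
import numpy as np


def uniform_causal_attention(n: int) -> np.ndarray:
    """Row-stochastic causal attention: token i attends equally to tokens 0..i.

    Uniform weights are an illustrative assumption, not learned attention.
    """
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)


def stacked_influence(n: int, num_layers: int) -> np.ndarray:
    """Product of `num_layers` identical attention matrices.

    Entry (i, j) is the share of input token j in the output at position i
    when only attention (no LayerNorm, no residuals) is applied.
    """
    attention = uniform_causal_attention(n)
    return np.linalg.matrix_power(attention, num_layers)


if __name__ == "__main__":
    seq_len = 8
    for depth in (1, 2, 4, 8):
        last_row = stacked_influence(seq_len, depth)[-1]
        # Compare the weight on the first token with the weight on the last.
        print(depth, round(float(last_row[0]), 4), round(float(last_row[-1]), 6))
```

Under these assumptions, the last position's weight on the first token grows with depth while its weight on itself shrinks, which is the earlier-token bias that the abstract contrasts with the recency bias seen in full decoders.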
Anthology ID:
2026.findings-acl.1430
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
28638–28652
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1430/
Cite (ACL):
Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, and Edward Choi. 2026. LayerNorm Induces Recency Bias in Transformer Decoders. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28638–28652, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
LayerNorm Induces Recency Bias in Transformer Decoders (Kim et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1430.pdf
Checklist:
 2026.findings-acl.1430.checklist.pdf