Does Self-Attention Need Separate Weights in Transformers?
Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Garibay, Niloofar Yousefi
Abstract
Self-attention has revolutionized natural language processing by capturing long-range dependencies and improving context understanding. However, it comes with high computational costs and struggles with the inherent directionality of sequential data. This paper presents a simplified approach called “shared weight self-attention,” in which a single weight matrix is used for Keys, Queries, and Values instead of a separate matrix for each. This approach cuts training parameters by more than half and significantly reduces training time. Our method not only improves efficiency but also achieves strong performance on tasks from the GLUE benchmark, even outperforming the standard BERT baseline in handling noisy and out-of-domain data. Experimental results show a 66.53% reduction in parameter size within the attention block and competitive accuracy improvements of 3.55% and 0.89% over symmetric and pairwise attention-based BERT models, respectively.
- Anthology ID: 2025.naacl-industry.44
- Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
- Month: April
- Year: 2025
- Address: Albuquerque, New Mexico
- Editors: Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
- Venue: NAACL
- Publisher: Association for Computational Linguistics
- Pages: 535–543
- URL: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.44/
- Cite (ACL): Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Garibay, and Niloofar Yousefi. 2025. Does Self-Attention Need Separate Weights in Transformers?. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 535–543, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal): Does Self-Attention Need Separate Weights in Transformers? (Kowsher et al., NAACL 2025)
- PDF: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.44.pdf
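
To make the core idea from the abstract concrete, below is a minimal, single-head PyTorch sketch of shared-weight self-attention: one projection matrix replaces the separate Query, Key, and Value matrices. The class and variable names are illustrative assumptions, and multi-head splitting, biases, and any additional components the authors may use are omitted; see the PDF above for the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedWeightSelfAttention(nn.Module):
    """Single-head sketch: one projection produces Q, K, and V.

    Illustrative only. It follows the abstract's description (a single
    weight matrix shared across Keys, Queries, and Values); the paper's
    full multi-head setup is not reproduced here.
    """

    def __init__(self, d_model: int):
        super().__init__()
        # One shared projection instead of separate W_q, W_k, W_v.
        self.w_shared = nn.Linear(d_model, d_model, bias=False)
        self.w_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        h = self.w_shared(x)          # shared representation for Q, K, V
        q = k = v = h
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return self.w_out(attn @ v)


# Quick usage check with random inputs.
layer = SharedWeightSelfAttention(d_model=768)
out = layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])

# Rough parameter comparison against standard attention with separate
# Q/K/V projections (biases and output projection ignored): one matrix
# instead of three, i.e. roughly a two-thirds reduction, consistent in
# spirit with the 66.53% figure reported in the abstract.
d = 768
print(f"shared/standard QKV parameters: {(d * d) / (3 * d * d):.2%}")
```

In this sketch the efficiency gain comes purely from sharing the projection, so Q, K, and V are identical tensors; the paper's reported GLUE and robustness results should be taken from the publication itself rather than inferred from this toy example.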