Simulating Hard Attention Using Soft Attention

Andy Yang, Lena Strobl, David Chiang, Dana Angluin


Abstract
We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several subclasses of languages recognized by hard-attention transformers, which can be defined in variants of linear temporal logic. We show how soft-attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate general hard-attention transformers, using a temperature that depends on the minimum gap between the maximum attention score and the other attention scores.
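To make the temperature-scaling idea concrete, here is a minimal Python sketch (illustrative only, not the paper's construction; the score vector and the gap value below are made-up assumptions). It shows that dividing attention scores by a temperature that is small relative to the gap between the maximum score and the runner-up drives the softmax weights toward averaging hard attention, i.e., uniform weight over the argmax positions.

import numpy as np

def soft_attention(scores, temperature=1.0):
    # Softmax over scaled scores; a lower temperature sharpens the distribution.
    z = scores / temperature
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def hard_attention(scores):
    # Averaging hard attention: uniform weight over the argmax positions.
    mask = (scores == scores.max()).astype(float)
    return mask / mask.sum()

scores = np.array([2.0, 1.5, 0.0, 2.0])  # hypothetical attention scores
gap = 0.5  # gap between the maximum score (2.0) and the runner-up (1.5)

# As the temperature shrinks relative to the gap, soft attention
# approaches the hard-attention distribution [0.5, 0, 0, 0.5].
for t in [1.0, gap / 4, gap / 40]:
    print(f"temperature {t:.4f}:", np.round(soft_attention(scores, t), 4))
print("hard attention:   ", hard_attention(scores))

At the smallest temperature the soft weights agree with the hard ones to several decimal places, in line with the abstract's claim that the required temperature is governed by the minimum gap between the maximum and the remaining attention scores.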
Anthology ID: 2026.tacl-1.8
Volume: Transactions of the Association for Computational Linguistics, Volume 14
Year: 2026
Address: Cambridge, MA
Venue: TACL
Publisher: MIT Press
Pages: 147–166
URL: https://preview.aclanthology.org/ingest-eacl/2026.tacl-1.8/
DOI: 10.1162/tacl.a.597
Cite (ACL): Andy Yang, Lena Strobl, David Chiang, and Dana Angluin. 2026. Simulating Hard Attention Using Soft Attention. Transactions of the Association for Computational Linguistics, 14:147–166.
Cite (Informal): Simulating Hard Attention Using Soft Attention (Yang et al., TACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.tacl-1.8.pdf