Differentiable Subset Pruning of Transformer Heads

Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan


Abstract
Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer’s multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than previous works while offering precise control of the sparsity level.
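
The abstract describes learning per-head importance variables with gradient descent while keeping exactly k heads unpruned. The sketch below is a minimal, hypothetical illustration of that idea (not the authors' exact estimator): it uses a straight-through top-k gate so the forward pass keeps exactly k heads while gradients still reach all importance logits. The function name `subset_gate`, the head count, and k are illustrative assumptions.

```python
import torch

# Hypothetical sketch of differentiable subset selection over attention heads.
# NOTE: this is a straight-through top-k relaxation for illustration only,
# not necessarily the estimator used in the paper.

def subset_gate(logits: torch.Tensor, k: int, temperature: float = 1.0) -> torch.Tensor:
    """Return a gate over H heads that is exactly k-hot in the forward pass
    but passes soft gradients to the importance logits in the backward pass."""
    soft = torch.softmax(logits / temperature, dim=-1)      # soft relaxation of head importance
    topk = torch.topk(soft, k).indices                      # indices of the k most important heads
    hard = torch.zeros_like(soft).scatter_(-1, topk, 1.0)   # hard 0/1 mask with exactly k ones
    return hard + soft - soft.detach()                      # straight-through estimator

# Toy usage: 12 heads, keep 4; gates scale each head's output before it is combined.
head_logits = torch.nn.Parameter(torch.zeros(12))           # learned per-head importance variables
gates = subset_gate(head_logits, k=4)
head_outputs = torch.randn(2, 12, 64)                       # (batch, heads, head_dim)
pruned = head_outputs * gates.view(1, 12, 1)                # pruned heads contribute nothing
pruned.sum().backward()                                     # gradients reach head_logits via the soft term
print(int(gates.sum().item()))                              # exactly k heads remain active
```

In this sketch the hard constraint on the number of unpruned heads is satisfied by construction, and the importance logits can be trained jointly with (or alongside) the task loss by ordinary stochastic gradient descent, mirroring the setup the abstract describes.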
Anthology ID:
2021.tacl-1.86
Volume:
Transactions of the Association for Computational Linguistics, Volume 9
Month:
Year:
2021
Address:
Cambridge, MA
Editors:
Brian Roark, Ani Nenkova
Venue:
TACL
Publisher:
MIT Press
Pages:
1442–1459
URL:
https://aclanthology.org/2021.tacl-1.86
DOI:
10.1162/tacl_a_00436
Cite (ACL):
Jiaoda Li, Ryan Cotterell, and Mrinmaya Sachan. 2021. Differentiable Subset Pruning of Transformer Heads. Transactions of the Association for Computational Linguistics, 9:1442–1459.
Cite (Informal):
Differentiable Subset Pruning of Transformer Heads (Li et al., TACL 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2021.tacl-1.86.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-4/2021.tacl-1.86.mp4