Not All Tokens Are Equal: Per-Dimension Top-K Pooling for Adversarially Robust BERT Classification

Manoranjan Dash, Shivam Anand Aralikatti, Shanay Sheth, Pranav Shinde


Abstract
Contextual text classification with BERT typically relies on the [CLS] token representation for downstream prediction. While effective under standard conditions, [CLS]-based pooling is brittle under adversarial perturbation, because its single-vector representation is influenced indiscriminately by injected adversarial tokens. We propose Per-Dimension Top-K Average Pooling, a pooling strategy that, for each hidden dimension, averages only the top-K token activations rather than the full sequence, thereby controlling which tokens contribute to the final representation. This token-level selectivity acts as a natural filter against adversarial injection: tokens that do not rank among the top-K for a given dimension are excluded from aggregation. We evaluate our approach against CLS, Global Average Pooling (GAP), Global Max Pooling (GMP), and Hybrid variants across three text classification domains: spam detection (Enron and LingSpam), automated essay scoring (ASAP), and hate speech classification. On the Enron spam dataset under adversarial attack, our best Hybrid (K=3) variant reduces the Attack Success Rate from 70.65% to 37.07% while maintaining clean accuracy above 99%, whereas CLS degrades to 63.64% adversarial accuracy. Representation-level analyses corroborate these findings: Top-K pooling variants exhibit substantially lower cosine similarity shift under attack, and adversarially injected tokens enter the top-K selection in far fewer dimensions compared to CLS. These results suggest that per-dimension token selectivity offers a principled and lightweight mechanism for adversarial robustness in BERT-based classifiers without any modification to the underlying model architecture.
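The pooling step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `topk_average_pool`, the attention-mask handling, and the fallback when a sequence has fewer than K real tokens are all assumptions made for illustration, and the Hybrid variants are not sketched because the abstract does not define them.

```python
import numpy as np

def topk_average_pool(hidden, mask, k=3):
    """Per-dimension top-K average pooling over one sequence (illustrative sketch).

    hidden: (seq_len, dim) array of token activations, e.g. BERT's last layer.
    mask:   (seq_len,) array, nonzero for real tokens and 0 for padding (assumed).
    k:      number of largest activations to average in each hidden dimension.
    """
    hidden = np.asarray(hidden, dtype=float)
    valid = hidden[np.asarray(mask, dtype=bool)]  # drop padded positions
    if valid.shape[0] == 0:
        raise ValueError("sequence has no unmasked tokens")
    # Assumption: if fewer than k real tokens exist, average all of them.
    k_eff = min(k, valid.shape[0])
    # Sort each column (hidden dimension) independently, ascending,
    # and keep the k_eff largest values of that column.
    top = np.sort(valid, axis=0)[-k_eff:]
    # Average the selected activations separately for every dimension.
    return top.mean(axis=0)

# Tiny example: 3 real tokens plus 1 padded token, 2 hidden dimensions.
hidden = np.array([[1.0, 10.0],
                   [2.0, 0.0],
                   [3.0, -5.0],
                   [100.0, 100.0]])  # last row is padding and is ignored
mask = np.array([1, 1, 1, 0])
pooled = topk_average_pool(hidden, mask, k=2)
```

Because selection happens independently in each dimension, different tokens can contribute to different coordinates of the pooled vector. In this sketch, K=1 reduces to max pooling over the sequence and K equal to the number of real tokens reduces to average pooling, which is one way to see how the method sits between the GMP and GAP baselines.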
Anthology ID:
2026.gem-main.29
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
285–295
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.29/
Cite (ACL):
Manoranjan Dash, Shivam Anand Aralikatti, Shanay Sheth, and Pranav Shinde. 2026. Not All Tokens Are Equal: Per-Dimension Top-K Pooling for Adversarially Robust BERT Classification. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 285–295, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Not All Tokens Are Equal: Per-Dimension Top-K Pooling for Adversarially Robust BERT Classification (Dash et al., GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.29.pdf