Shanay Sheth


2026

Contextual text classification with BERT typically relies on the [CLS] token representation for downstream prediction. While effective under standard conditions, [CLS]-based pooling is brittle under adversarial perturbation, because its single-vector representation is influenced indiscriminately by injected adversarial tokens. We propose Per-Dimension Top-K Average Pooling, a pooling strategy that, for each hidden dimension, aggregates only the top-K token activations rather than the full sequence, thereby controlling which tokens contribute to the final representation. This token-level selectivity acts as a natural filter against adversarial injection: tokens that do not rank among the top-K for a given dimension are excluded from aggregation in that dimension. We evaluate our approach against CLS, Global Average Pooling (GAP), Global Max Pooling (GMP), and Hybrid variants across three text classification domains: spam detection (Enron and LingSpam), automated essay scoring (ASAP), and hate speech classification. On the Enron spam dataset under adversarial attack, our best variant, Hybrid (K=3), reduces the Attack Success Rate from 70.65% to 37.07% while maintaining clean accuracy above 99%, whereas CLS degrades to 63.64% adversarial accuracy. Representation-level analyses corroborate these findings: Top-K pooling variants exhibit a substantially lower cosine similarity shift under attack, and adversarially injected tokens enter the top-K selection in far fewer dimensions compared with CLS. These results suggest that per-dimension token selectivity offers a principled and lightweight mechanism for adversarial robustness in BERT-based classifiers, without any modification to the underlying model architecture.
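The pooling step described above can be illustrated with a minimal sketch in plain Python. It rests on several assumptions not stated in the abstract: "top-K" is taken to mean the K largest raw activations (not absolute values), padding tokens are dropped via an optional mask, and K is clipped to the sequence length for short inputs. The function name `topk_average_pool` and its parameters are hypothetical, chosen for illustration only; a practical version would more likely use a tensor library's top-k operation along the sequence axis.

```python
from heapq import nlargest


def topk_average_pool(hidden, k, mask=None):
    """Per-dimension top-K average pooling over a single sequence.

    hidden: list of token vectors, shape [seq_len][hidden_dim]
    k:      number of largest activations to average in each dimension
    mask:   optional list of 0/1 flags; 0 marks a padding token to ignore
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if mask is not None:
        hidden = [vec for vec, keep in zip(hidden, mask) if keep]
    if not hidden:
        raise ValueError("no tokens left to pool")

    # Assumption: sequences shorter than k fall back to averaging all tokens.
    k = min(k, len(hidden))

    pooled = []
    for column in zip(*hidden):      # one column per hidden dimension
        top = nlargest(k, column)    # the k largest activations in this dimension
        pooled.append(sum(top) / k)  # average only the selected activations
    return pooled


# Toy example: 4 tokens, 2 hidden dimensions.
hidden = [
    [1.0, 0.0],
    [3.0, 2.0],
    [2.0, 8.0],
    [0.0, 4.0],
]
print(topk_average_pool(hidden, k=2))  # [2.5, 6.0]
```

Under these assumptions the two limiting cases recover the baselines named in the abstract: K=1 reduces to Global Max Pooling, and K equal to the sequence length reduces to Global Average Pooling. Because selection is made independently per dimension, a token can contribute in some dimensions and be excluded in others.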