Catherine Liu

2026

The Shape of Vulnerability: How Adversarial Perturbations Reshape the Topology of Language Model Latent Spaces
Angelina Tsai | Shreya Subramanian | Catherine Liu | Kimberly Lopez | Leif Zinn-Brooks | Alexia E. Schulz | Adaku Uchendu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Adversarial perturbations are subtle changes to input data (e.g., images or text) designed to alter the predictions or outputs of machine learning models, including large language models (LLMs). We introduce several novel visualizations based on topological data analysis (TDA), specifically persistent homology, to characterize how adversarial perturbations act on text inputs, and in particular how sandbagging and code-injection attacks alter the geometric structure of attention heads in transformer models. By computing persistent homology metrics from attention maps across different model architectures (such as BERT, RoBERTa, ELECTRA, and DistilGPT), we find that adversarial inputs alter higher-dimensional topological features (H₁ loops and H₂ voids) in ways that distinguish them from clean, non-adversarial inputs.
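To give a rough sense of what computing persistent homology from an attention map can involve, the following is a minimal sketch and not the authors' pipeline. It assumes that an attention matrix is symmetrized and turned into a distance matrix via d = 1 - max(a_ij, a_ji), which is one common choice and is not stated in the abstract. It computes only H₀ (connected-component) persistence with a union-find pass over sorted edges; the H₁ loops and H₂ voids discussed above would need a dedicated persistent homology library such as Ripser or GUDHI. The function names and the toy matrix are hypothetical.

```python
from itertools import combinations


def attention_to_distance(attn):
    """Symmetrize an attention matrix and convert weights to distances.

    Assumption (not from the paper): strong attention in either direction
    means two tokens are "close", so d = 1 - max(a_ij, a_ji).
    """
    n = len(attn)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = 1.0 - max(attn[i][j], attn[j][i])
        dist[i][j] = dist[j][i] = d
    return dist


def h0_persistence(dist):
    """Return (birth, death) pairs for H0 classes of the Rips filtration."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Process edges from shortest to longest, as the filtration scale grows.
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    pairs = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, d))  # two components merge: one dies at scale d
    pairs.append((0.0, float("inf")))  # the final component never dies
    return pairs


# Hypothetical 4-token attention map (each row sums to 1).
toy_attention = [
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.60, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.10, 0.10, 0.20, 0.60],
]

diagram = h0_persistence(attention_to_distance(toy_attention))
for birth, death in diagram:
    print(f"H0 class: birth={birth:.2f}, death={death:.2f}")
```

Comparing such diagrams between a clean input and a perturbed one (for example, by counting long-lived classes) is one way the kind of distinction described above could be measured, under the assumptions stated here.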

