Alexia E. Schulz

2026

Using Topological Data Analysis to Characterize the Layers of Language Models Before and After Word Substitution Attacks
Adam Tang | Catherine Liu | Kimberly Lopez | Shreya Subramanian | Leif Zinn-Brooks | Alexia E. Schulz | Adaku Uchendu
Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

Large language models are known to be vulnerable to adversarial perturbations such as synonym-based word substitutions. However, previous analyses of adversarial influence focus only on output behavior and provide limited insight into the propagation of substitution-based input perturbations through internal representations. In this work, we introduce a topological data analysis (TDA) framework to study the structural effects of adversarial attacks on attention maps across model layers. We evaluate small encoder-based architectures (BERT, RoBERTa, DistilBERT) fine-tuned to solve binary classification on the IMDb review dataset, which were attacked using TextFooler. We convert attention maps into distance matrices and apply TDA to extract topological features, which we then compare using Wasserstein distances between original and perturbed features. In parallel, we compute a non-TDA baseline on attention maps using per-head L₁ distances between original and perturbed attentions. In addition, we analyze these models on a layer-by-layer basis. We find that adversarial perturbations induce systematic and statistically significant topological changes across layers, with the largest deviations occurring in late layers and smaller but notable effects in early layers. These patterns are consistent across models and are validated using both non-parametric (Kruskal–Wallis, Dunn) and parametric (one-way ANOVA, Tukey) tests on log-transformed Wasserstein distances. Compared to our non-TDA baseline, our TDA framework shows more distinct layer-wise separation and provides a robust and interpretable way to evaluate how adversarial perturbations alter internal model structure. Our code is publicly available at: https://github.com/angelinatsai04/mitll_clinic/tree/adam_spring.


The Shape of Vulnerability: How Adversarial Perturbations Reshape the Topology of Language Model Latent Spaces
Angelina Tsai | Shreya Subramanian | Catherine Liu | Kimberly Lopez | Leif Zinn-Brooks | Alexia E. Schulz | Adaku Uchendu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Adversarial perturbations in the context of large language models (LLMs) are subtle changes added to input data (e.g., images or text) that are designed to alter predictions or outputs of machine learning models. We introduce several novel visualizations using topological data analysis (TDA), leveraging persistent homology, to characterize how adversarial perturbations act on text inputs, specifically how sandbagging and code-injection attacks alter the geometric structure of attention heads in transformer models. By computing persistent homology metrics from attention maps across different model architectures (such as BERT, RoBERTa, ELECTRA, and DistilGPT), we find that adversarial inputs alter higher-dimensional topological features (H₁ loops and H₂ voids) in ways that distinguish them from clean, non-adversarial inputs.

Co-authors

Adam Tang 1

Angelina Tsai 1
