Using Topological Data Analysis to Characterize the Layers of Language Models Before and After Word Substitution Attacks

Adam Tang; Catherine Liu; Kimberly Lopez; Shreya Subramanian; Leif Zinn-Brooks; Alexia E. Schulz; Adaku Uchendu

Abstract

Large language models are known to be vulnerable to adversarial perturbations such as synonym-based word substitutions. However, previous analyses of adversarial influence focus only on output behavior and provide limited insight into the propagation of substitution-based input perturbations through internal representations. In this work, we introduce a topological data analysis (TDA) framework to study the structural effects of adversarial attacks on attention maps across model layers. We evaluate small encoder-based architectures (BERT, RoBERTa, DistilBERT) fine-tuned to solve binary classification on the IMDb review dataset, which were attacked using TextFooler. We convert attention maps into distance matrices and apply TDA to extract topological features, which we then compare using Wasserstein distances between original and perturbed features. In parallel, we compute a non-TDA baseline on attention maps using per-head L₁ distances between original and perturbed attentions. In addition, we analyze these models on a layer-by-layer basis. We find that adversarial perturbations induce systematic and statistically significant topological changes across layers, with the largest deviations occurring in late layers and smaller but notable effects in early layers. These patterns are consistent across models and are validated using both non-parametric (Kruskal–Wallis, Dunn) and parametric (one-way ANOVA, Tukey) tests on log-transformed Wasserstein distances. Compared to our non-TDA baseline, our approach shows more distinct layer-wise separation and provides a robust and interpretable framework for evaluating how adversarial perturbations alter internal model structure. Our code is publicly available at: https://github.com/angelinatsai04/mitll_clinic/tree/adam_spring.
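The two comparisons named in the abstract, a topological summary of an attention map and a per-head L₁ baseline, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: the symmetrization rule `1 - max(a_ij, a_ji)`, the restriction to 0-dimensional features, the function names, and the toy 3×3 attention maps. The sketch also assumes the original and perturbed inputs have the same number of tokens, which a word substitution does not guarantee after tokenization.

```python
# Illustrative sketch only: the distance rule, names, and toy data are assumptions,
# not the authors' implementation.

def attention_to_distance(attn):
    """Turn a square attention map into a symmetric distance matrix.

    Assumed rule: d[i][j] = 1 - max(attn[i][j], attn[j][i]), with zeros on the diagonal.
    """
    n = len(attn)
    return [
        [0.0 if i == j else 1.0 - max(attn[i][j], attn[j][i]) for j in range(n)]
        for i in range(n)
    ]


def h0_death_times(dist):
    """Death times of 0-dimensional features in a Vietoris-Rips filtration.

    These coincide with the edge weights of a minimum spanning tree,
    computed here with Prim's algorithm.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge connecting each vertex to the tree
    best[0] = 0.0
    weights = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        weights.append(best[u])
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return sorted(weights[1:])  # drop the placeholder 0.0 used for the starting vertex


def l1_distance(attn_a, attn_b):
    """Entry-wise L1 distance between two same-shaped attention maps (one head)."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(attn_a, attn_b)
        for a, b in zip(row_a, row_b)
    )


# Toy attention maps for a single head (each row sums to 1).
original = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6]]
perturbed = [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6]]

deaths_original = h0_death_times(attention_to_distance(original))
deaths_perturbed = h0_death_times(attention_to_distance(perturbed))
baseline = l1_distance(original, perturbed)
```

In a fuller pipeline, the two lists of death times would be treated as persistence diagrams and compared with a Wasserstein distance from a TDA library, and the resulting per-layer distances would feed the statistical tests; those steps are left out here rather than guessed at.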

Anthology ID: 2026.customnlp4u-1.12
Volume: Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Sheshera Mysore, Sachin Kumar, Vidhisha Balachandran, Shirley Anugrah Hayati, Faeze Brahman, Hanane Nour Moussa, Alireza Salemi
Venues: CustomNLP4U | WS
Publisher: Association for Computational Linguistics
Pages: 131–148
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.customnlp4u-1.12/
Cite (ACL): Adam Tang, Catherine Liu, Kimberly Lopez, Shreya Subramanian, Leif Zinn-Brooks, Alexia E. Schulz, and Adaku Uchendu. 2026. Using Topological Data Analysis to Characterize the Layers of Language Models Before and After Word Substitution Attacks. In Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U), pages 131–148, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Using Topological Data Analysis to Characterize the Layers of Language Models Before and After Word Substitution Attacks (Tang et al., CustomNLP4U 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.customnlp4u-1.12.pdf
