Polina Druzhinina


2025

pdf bib
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Anton Razzhigaev | Matvey Mikhalchuk | Temurbek Rahmatullaev | Elizaveta Goncharova | Polina Druzhinina | Ivan Oseledets | Andrey Kuznetsov
Findings of the Association for Computational Linguistics: NAACL 2025

We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens — especially stopwords, articles, and commas — consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer’s embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of “filler” tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.

pdf bib
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kristian Kuznetsov | Laida Kushnareva | Anton Razzhigaev | Polina Druzhinina | Anastasia Voznyuk | Irina Piontkovskaya | Evgeny Burnaev | Serguei Barannikov
Findings of the Association for Computational Linguistics: ACL 2025

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from Gemma-2-2B’s residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation of obtained features. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts. The code for this paper is available at https://github.com/pyashy/SAE_ATD.