Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kristian Kuznetsov, Laida Kushnareva, Anton Razzhigaev, Polina Druzhinina, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
Abstract
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from Gemma-2-2B’s residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation of obtained features. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts. The code for this paper is available at https://github.com/pyashy/SAE_ATD.
- Anthology ID:
- 2025.findings-acl.1321
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2025
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venues:
- Findings | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 25727–25748
- URL:
- https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.1321/
- Cite (ACL):
- Kristian Kuznetsov, Laida Kushnareva, Anton Razzhigaev, Polina Druzhinina, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, and Serguei Barannikov. 2025. Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25727–25748, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders (Kuznetsov et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.1321.pdf