Oleg Rogov
2026
Feature Drift: How Fine-Tuning Repurposes Representations in LLMs
Andrey V. Galichin | Anton Korznikov | Alexey Dontsov | Oleg Rogov | Elena Tutubalina | Ivan Oseledets
Findings of the Association for Computational Linguistics: EACL 2026
Fine-tuning LLMs introduces many important behaviors, such as instruction-following and safety alignment. This makes it crucial to study how fine-tuning changes models’ internal mechanisms. Sparse Autoencoders (SAEs) offer a powerful tool for interpreting neural networks by extracting concepts (features) represented in their activations. Previous work observed that SAEs trained on base models transfer effectively to instruction-tuned (chat) models, attributed to activation similarity. In this work, we propose *feature drift* as an alternative explanation: the feature space remains relevant, but the distribution of feature activations changes. In other words, fine-tuning recombines existing concepts rather than learning new ones. We validate this by showing base SAEs reconstruct both base and chat activations comparably despite systematic differences, with individual features exhibiting clear drift patterns. In a refusal behavior case study, we identify base SAE features that drift to activate on harmful instructions in chat models. Causal interventions using these features confirm that they mediate refusal. Our findings suggest that monitoring how existing features drift, rather than searching for entirely new features, may provide a more complete explanation of how fine-tuning changes model capabilities.
2025
CLEAR: Character Unlearning in Textual and Visual Modalities
Alexey Dontsov | Dmitrii Korzh | Alexey Zhavoronkin | Boris Mikheev | Denis Bobkov | Aibek Alanov | Oleg Rogov | Ivan Oseledets | Elena Tutubalina
Findings of the Association for Computational Linguistics: ACL 2025
Machine Unlearning (MU) is critical for removing private or hazardous information from deep learning models. While MU has advanced significantly in unimodal (text or vision) settings, multimodal unlearning (MMU) remains underexplored due to the lack of open benchmarks for evaluating cross-modal data removal. To address this gap, we introduce CLEAR, the first open-source benchmark designed specifically for MMU. CLEAR contains 200 fictitious individuals and 3,700 images linked with corresponding question-answer pairs, enabling a thorough evaluation across modalities. We conduct a comprehensive analysis of 11 MU methods (e.g., SCRUB, gradient ascent, DPO) across four evaluation sets, demonstrating that jointly unlearning both modalities outperforms single-modality approaches. The dataset is available at [link](https://huggingface.co/datasets/therem/CLEAR).