Oleg Rogov
2026
Feature Drift: How Fine-Tuning Repurposes Representations in LLMs
Andrey V. Galichin | Anton Korznikov | Alexey Dontsov | Oleg Rogov | Elena Tutubalina | Ivan Oseledets
Findings of the Association for Computational Linguistics: EACL 2026
Fine-tuning LLMs introduces many important behaviors, such as instruction-following and safety alignment. This makes it crucial to study how fine-tuning changes models’ internal mechanisms. Sparse Autoencoders (SAEs) offer a powerful tool for interpreting neural networks by extracting concepts (features) represented in their activations. Previous work observed that SAEs trained on base models transfer effectively to instruction-tuned (chat) models, attributed to activation similarity. In this work, we propose *feature drift* as an alternative explanation: the feature space remains relevant, but the distribution of feature activations changes. In other words, fine-tuning recombines existing concepts rather than learning new ones. We validate this by showing base SAEs reconstruct both base and chat activations comparably despite systematic differences, with individual features exhibiting clear drift patterns. In a refusal behavior case study, we identify base SAE features that drift to activate on harmful instructions in chat models. Causal interventions using these features confirm that they mediate refusal. Our findings suggest that monitoring how existing features drift, rather than searching for entirely new features, may provide a more complete explanation of how fine-tuning changes model capabilities.
Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Nikita Afonin | Nikita Andriianov | Vahagn Hovhannisyan | Nikhil Bageshpura | Kyle Liu | Kevin Zhu | Sunishchal Dev | Ashwinee Panda | Oleg Rogov | Elena Tutubalina | Alexander Panchenko | Mikhail Seleznyov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, and EM appears with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. Next, we formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this, instructing models to prioritize safety reduces EM while prioritizing context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.
POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization
Usman Naseem | Robert Geislinger | Juan Ren | Sarah Kohail | Rudy Alexandro Garrido Veliz | P Sam Sahil | Yiran Zhang | Idris Abdulmumin | Marco Antonio Stranisci | Özge Alacam | Cengiz Acarturk | Aisha Jabr | Saba Anwar | Abinew Ali Ayele | Simona Frenda | Alessandra Teresa Cignarella | Elena Tutubalina | Oleg Rogov | Aung Kyaw Htet | Xintong Wang | Surendrabikram Thapa | Kritesh Rauniyar | Tanmoy Chakraborty | MD Arfeen Zeeshan | Dheeraj Kodati | Satya Keerthi | Sahar Moradizeyveh | Firoj Alam | Md Arid Hasan | Syed Ishtiaque Ahmed | Ye Kyaw Thu | Shantipriya Parida | Ihsan Ayyub Qazi | Lilian Diana Awuor Wanzare | Nelson Odhiambo Onyango | Clemencia Siro | Jane Wanjiru Kimani | Ibrahim Said Ahmad | Adem Chanie Ali | Martin Semmann | Chris Biemann | Shamsuddeen Hassan Muhammad | Seid Muhie Yimam
Findings of the Association for Computational Linguistics: ACL 2026
Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multi-event dataset with over 110K instances in 22 languages drawn from diverse online platforms and real-world events. Polarization is annotated along three axes, namely detection, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) fine-tuning six pretrained small language models; and (2) evaluating a range of open and closed large language models in few-shot and zero-shot settings. Results show that while most models perform well on binary polarization detection, they achieve substantially lower performance when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and underscore the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.
2025
CLEAR: Character Unlearning in Textual and Visual Modalities
Alexey Dontsov | Dmitrii Korzh | Alexey Zhavoronkin | Boris Mikheev | Denis Bobkov | Aibek Alanov | Oleg Rogov | Ivan Oseledets | Elena Tutubalina
Findings of the Association for Computational Linguistics: ACL 2025
Machine Unlearning (MU) is critical for removing private or hazardous information from deep learning models. While MU has advanced significantly in unimodal (text or vision) settings, multimodal unlearning (MMU) remains underexplored due to the lack of open benchmarks for evaluating cross-modal data removal. To address this gap, we introduce CLEAR, the first open-source benchmark designed specifically for MMU. CLEAR contains 200 fictitious individuals and 3,700 images linked with corresponding question-answer pairs, enabling a thorough evaluation across modalities. We conduct a comprehensive analysis of 11 MU methods (e.g., SCRUB, gradient ascent, DPO) across four evaluation sets, demonstrating that jointly unlearning both modalities outperforms single-modality approaches. The dataset is available at [link](https://huggingface.co/datasets/therem/CLEAR)
Co-authors
- Elena Tutubalina 4
- Alexey Dontsov 2
- Ivan Oseledets 2
- Idris Abdulmumin 1
- Cengiz Acarturk 1
- Nikita Afonin 1
- Ibrahim Said Ahmad 1
- Syed Ishtiaque Ahmed 1
- Özge Alacam 1
- Firoj Alam 1
- Aibek Alanov 1
- Adem Chanie Ali 1
- Nikita Andriianov 1
- Saba Anwar 1
- Abinew Ali Ayele 1
- Nikhil Bageshpura 1
- Chris Biemann 1
- Denis Bobkov 1
- Tanmoy Chakraborty 1
- Alessandra Teresa Cignarella 1
- Sunishchal Dev 1
- Simona Frenda 1
- Andrey V. Galichin 1
- Robert Geislinger 1
- Md. Arid Hasan 1
- Vahagn Hovhannisyan 1
- Aung Kyaw Htet 1
- Aisha Jabr 1
- Satya Keerthi 1
- Jane Wanjiru Kimani 1
- Dheeraj Kodati 1
- Sarah Kohail 1
- Dmitrii Korzh 1
- Anton Korznikov 1
- Kyle Liu 1
- Boris Mikheev 1
- Sahar Moradizeyveh 1
- Shamsuddeen Hassan Muhammad 1
- Usman Naseem 1
- Nelson Odhiambo Onyango 1
- Alexander Panchenko 1
- Ashwinee Panda 1
- Shantipriya Parida 1
- Ihsan Ayyub Qazi 1
- Kritesh Rauniyar 1
- Juan Ren 1
- P Sam Sahil 1
- Mikhail Seleznyov 1
- Martin Semmann 1
- Clemencia Siro 1
- Marco Antonio Stranisci 1
- Surendrabikram Thapa 1
- Ye Kyaw Thu 1
- Rudy Alexandro Garrido Veliz 1
- Xintong Wang 1
- Lilian Diana Awuor Wanzare 1
- Seid Muhie Yimam 1
- MD Arfeen Zeeshan 1
- Yiran Zhang 1
- Alexey Zhavoronkin 1
- Kevin Zhu 1