Ashwinee Panda
2026
Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Nikita Afonin | Nikita Andriianov | Vahagn Hovhannisyan | Nikhil Bageshpura | Kyle Liu | Kevin Zhu | Sunishchal Dev | Ashwinee Panda | Oleg Rogov | Elena Tutubalina | Alexander Panchenko | Mikhail Seleznyov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, and EM appears with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. Next, we formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this, instructing models to prioritize safety reduces EM while prioritizing context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.
2025
Beyond the Haystack: Sensitivity to Context in Legal Reference Recall
Eric Xia | Karthik Srikumar | Keshav Karthik | Advaith Renjith | Ashwinee Panda
Proceedings of the Natural Legal Language Processing Workshop 2025
Reference retrieval is critical for many applications in the legal domain, for instance in determining which case texts support a particular claim. However, existing benchmarking methods do not rigorously enable evaluation of recall capabilities in previously unseen contexts. We develop an evaluation framework from U.S. court opinions which ensures models have no prior knowledge of case results or context. Applying our framework, we identify a consistent gap across models and tasks between traditional needle-in-a-haystack retrieval and actual performance in legal recall. Our work shows that standard needle-in-a-haystack benchmarks consistently overestimate recall performance in the legal domain. By isolating the causes of performance degradation to contextual informativity rather than distributional differences, our findings highlight the need for specialized testing in reference-critical applications, and establish an evaluation framework for improving retrieval across informativity levels.