Prasoon Goyal
2026
From Narrow Unlearning to Emergent Misalignment in LLMs
Erum Mushtaq | Anil Ramakrishna | Satyapriya Krishna | Sattvik Sahai | Prasoon Goyal | Kai-Wei Chang | Tao Zhang | Rahul Gupta
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Erum Mushtaq | Anil Ramakrishna | Satyapriya Krishna | Sattvik Sahai | Prasoon Goyal | Kai-Wei Chang | Tao Zhang | Rahul Gupta
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Recent work has shown that fine-tuning on insecure code data can trigger an emergent misalignment (EMA) phenomenon, where models generate malicious responses even to prompts unrelated to the original insecure code-writing task. Such cross-domain generalization of harmful behavior underscores the need for a deeper understanding of the algorithms, tasks, and datasets that induce emergent misalignment. In this work, we extend this study by demonstrating that emergent misalignment can also arise from narrow refusal unlearning in specific domains. We perform refusal unlearning on Cybersecurity and Safety concept, and evaluate EMA by monitoring refusal scores across seven responsible AI (RAI) domains, Cybersecurity, Safety, Toxicity, Bias, Sensitive Content, Medical/Legal, and Privacy. Our work shows that narrow domain unlearning can yield compliance responses for the targeted concept, however, it may also propagate EMA to unrelated domains. Among the two intervened concepts, Cybersecurity and Safety, we find that the safety concept can have larger EMA impact, i.e, causing lower refusal scores, across other unrelated domains such as bias. We observe this effect consistently across two model families, Mistral-7b-0.3v, and Qwen-7b-2.5. Further, we show that refusal unlearning augmented with cross-entropy loss function on a small set of retain data from the affected domains can largely, if not fully, restore alignment across the impacted domains while having lower refusal rate on the concept we perform unlearning on. To investigate the underlying causes of EMA, we analyze concept entanglements at the representation level via concept vectors. Our analysis reveals that concepts with higher representation similarity in earlier layers are more susceptible to EMA after intervention when the refusal stream is altered through targeted refusal unlearning.
2024
MICo: Preventative Detoxification of Large Language Models through Inhibition Control
Roy Siegelmann | Ninareh Mehrabi | Palash Goyal | Prasoon Goyal | Lisa Bauer | Jwala Dhamala | Aram Galstyan | Rahul Gupta | Reza Ghanadan
Findings of the Association for Computational Linguistics: NAACL 2024
Roy Siegelmann | Ninareh Mehrabi | Palash Goyal | Prasoon Goyal | Lisa Bauer | Jwala Dhamala | Aram Galstyan | Rahul Gupta | Reza Ghanadan
Findings of the Association for Computational Linguistics: NAACL 2024
Large Language Models (LLMs) are powerful tools which have been both dominant and commonplace in the field of Artificial Intelligence. Yet, LLMs have a tendency to devolve into toxic degeneration, wherein otherwise safe and unproblematic models begin generating toxic content. For the sake of social responsibility and inspired by the biological mechanisms of inhibition control, we introduce the paradigm of Education for Societal Norms (ESN). By collecting and labeling examples as acceptable and unacceptable (in this case toxic and non-toxic), and including a corresponding acceptable rewrite with every unacceptable example, we introduce a new mechanism for LLM detoxification. We annotate a dataset of 2,850 entries and use it to fine-tune a model, which we call a Model with Inhibition Control (MICo). Evaluating this model on toxicity detection capability, rewrite detoxification, meaning preservation, and overall toxicity reduction, we discover significant improvements over the baseline model. In our experiments we show that overall toxicity of this model is more than 60% reduced, with over 75% reduction in severe toxicity.
Adapting LLM Predictions in In-Context Learning with Data Priors
Javier Chiyah-Garcia | Prasoon Goyal | Michael Johnston | Reza Ghanadan
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Javier Chiyah-Garcia | Prasoon Goyal | Michael Johnston | Reza Ghanadan
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
In-Context Learning (ICL) has enabled Large Language Models (LLMs) to excel as general-purpose models in zero and few-shot task settings. However, since LLMs are often not trained on the downstream tasks, they lack crucial contextual knowledge from the data distributions, which limits their task adaptability.This paper explores using data priors to automatically customize prompts in ICL. We extract these priors in a dataset-agnostic way basedon historical information, enabling LLMs to personalize their output towards users or tasks at inference time. We find that they improve LLM’s output by injecting latent dataset-specific information for the task of rating prediction. Throughout a series of experiments, we show replicable results across LLMs and datasets on what information and methods are most effective for adapting ICL outputs with priors. Our findings offer a systematic approach to customizing prompts with additional information in a privacy-friendly manner, requiring only aggregated data that is computationally efficient.
2017
Transliterated Mobile Keyboard Input via Weighted Finite-State Transducers
Lars Hellsten | Brian Roark | Prasoon Goyal | Cyril Allauzen | Françoise Beaufays | Tom Ouyang | Michael Riley | David Rybach
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)
Lars Hellsten | Brian Roark | Prasoon Goyal | Cyril Allauzen | Françoise Beaufays | Tom Ouyang | Michael Riley | David Rybach
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)