2025
pdf
bib
abs
AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails
Shaona Ghosh
|
Prasoon Varshney
|
Makesh Narsimhan Sreedhar
|
Aishwarya Padmakumar
|
Traian Rebedea
|
Jibin Rajan Varghese
|
Christopher Parisien
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
As Large Language Models (LLMs) and generative AI become increasingly widespread, concerns about content safety have grown in parallel. Currently, there is a clear lack of high-quality, human-annotated datasets that address the full spectrum of LLM-related safety risks and are usable for commercial applications. To bridge this gap, we propose a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories. This taxonomy is designed to meet the diverse requirements of downstream users, offering more granular and flexible tools for managing various risk types. Using a hybrid data generation pipeline that combines human annotations with a multi-LLM “jury” system to assess the safety of responses we obtain Aegis2.0, a carefully curated collection of 34,248 samples of human-LLM interactions, annotated according to our proposed taxonomy. To validate its effectiveness, we demonstrate that several lightweight models, trained using parameter-efficient techniques on Aegis2.0, achieve performance competitive with leading safety models fully fine-tuned on much larger, non-commercial datasets generated leveraging GPT-4. Additionally, we introduce a novel training blend that combines topic following data with safety data. This approach enhances the adaptability of guard models, enabling them to generalize to new risk categories defined during inference. We plan to open-source Aegis2.0 data and models to the research community to aid in safety guardrailing of LLMs.
2024
pdf
bib
abs
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Traian Rebedea
|
Makesh Sreedhar
|
Shaona Ghosh
|
Jiaqi Zeng
|
Christopher Parisien
Findings of the Association for Computational Linguistics: EMNLP 2024
Recent advancements in instruction-tuning datasets have predominantly focused on specific tasks like mathematical or logical reasoning. There has been a notable gap in data designed for aligning language models to maintain topic relevance in conversations - a critical aspect for deploying chatbots to production. We introduce the CantTalkAboutThis dataset to help language models remain focused on the subject at hand during task-oriented interactions. It consists of synthetic dialogues on a wide range of conversation topics from different domains. These dialogues are interspersed with distractor turns that intentionally divert the chatbot from the predefined topic. Fine-tuning language models on this dataset helps make them resilient to deviating from the assigned role and improves their ability to maintain topical coherence compared to general-purpose instruction-tuned LLMs like gpt-4-turbo and Mixtral-Instruct. Additionally, preliminary observations suggest that training models on this dataset also enhance their performance on fine-grained instruction following tasks, including safety alignment.
2017
pdf
bib
abs
Neuramanteau: A Neural Network Ensemble Model for Lexical Blends
Kollol Das
|
Shaona Ghosh
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
The problem of blend formation in generative linguistics is interesting in the context of neologism, their quick adoption in modern life and the creative generative process guiding their formation. Blend quality depends on multitude of factors with high degrees of uncertainty. In this work, we investigate if the modern neural network models can sufficiently capture and recognize the creative blend composition process. We propose recurrent neural network sequence-to-sequence models, that are evaluated on multiple blend datasets available in the literature. We propose an ensemble neural and hybrid model that outperforms most of the baselines and heuristic models upon evaluation on test data.