Swagatam Das

2026

ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
Sharanya Dasgupta | Arkaprabha Basu | Sujoy Nath | Swagatam Das
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, Large Language Models (LLMs) have demonstrated remarkable performance in a wide range of tasks. However, they still lack human cognition to balance factuality and safety. Bearing the resemblance, we argue that both factual and safety failures in LLMs arise from a common underlying issue, "representational misalignment" in their latent activation space. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the LLM’s parameters. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the Reinforcement Learning from Human Feedback (RLHF)-aligned models in generating soft refusals due to adversarial training. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.

2025

pdf bib abs

Assessing the Limits of In-Context Learning beyond Functions using Partially Ordered Relation
Debanjan Dutta | Faizanuddin Ansari | Swagatam Das
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Generating rational and generally accurate responses to tasks, often accompanied by example demonstrations, highlights Large Language Model’s (LLM’s) remarkable In-Context Learning (ICL) capabilities without requiring updates to the model’s parameter space. Despite having an ongoing exploration focused on the inference from a document-level concept, its behavior in learning well-defined functions or relations in context needs a careful investigation. In this article, we present the performance of ICL on partially ordered relation by introducing the notion of inductively increasing complexity in prompts. In most cases, the saturated performance of the chosen metric indicates that while ICL offers some benefits, its effectiveness remains constrained as we increase the complexity in the prompts even in presence of sufficient demonstrative examples. The behavior is evident from our empirical findings and has further been theoretically justified in term of its implicit optimization process.The code is available here.

Co-authors

Venues

Fix author