ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, Swagatam Das
Abstract
Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever subtle drifts lead to hallucinations or unsafe associations. In recent years, Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks; however, they still lack this human-like ability to balance factuality and safety. Drawing on this resemblance, we argue that both factual and safety failures in LLMs arise from a common underlying issue: "representational misalignment" in their latent activation space. We hypothesize that an external network, trained to recognize these fluctuations, can selectively intervene in the model to regulate falsehoods into truthful outputs and unsafe outputs into safe ones, without fine-tuning the LLM's parameters. Building on this hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, producing both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but, owing to its adversarial training, is also more versatile than Reinforcement Learning from Human Feedback (RLHF)-aligned models at generating soft refusals. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.
- Anthology ID:
- 2026.eacl-long.212
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4565–4584
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.212/
- Cite (ACL):
- Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, and Swagatam Das. 2026. ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4565–4584, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models (Dasgupta et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.212.pdf