Soo Kyung Kim


2025

GOODLIAR: A Reinforcement Learning-Based Deceptive Agent for Disrupting LLM Beliefs on Foundational Principles
Soo Kyung Kim | Hyunsoo Cho
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) often succumb to adversarial prompts, a phenomenon popularly known as “jailbreaking.” While jailbreaking primarily targets short-term noncompliance with predefined policies, we argue that a deeper vulnerability lies in altering an LLM’s fundamental axiomatic beliefs, such as mathematical or philosophical truths. In this work, we introduce GoodLiar, a reinforcement learning (RL)-based framework that generates deceptive contexts to systematically rewrite an LLM’s core logical or philosophical understandings. By incentivizing an RL agent to produce persuasive and coherent arguments, GoodLiar aims to induce persistent belief shifts, rather than merely influencing immediate judgments of factual truthfulness. Our approach introduces DA-ILQL, a novel offline RL method that extends ILQL by integrating on-policy data and language exploration to enhance language discovery and optimization. Through extensive evaluations on multiple LLMs, we show that deceptive contexts discovered by GoodLiar consistently outperform simple multi-turn prompting methods.
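To make the setup concrete, below is a minimal Python sketch of the reward signal the abstract implies: a "liar" policy proposes a deceptive context, and the reward is the drop in the target LLM's belief in an axiomatic statement. This is not the paper's code; every name here (query_target_llm, liar_policy, belief_shift_reward) is an illustrative placeholder, and the actual system trains the policy with DA-ILQL rather than emitting a canned argument.

import random

def query_target_llm(statement: str, context: str = "") -> float:
    # Placeholder for the target model: return its estimated probability
    # that `statement` is true, optionally conditioned on a persuasive
    # `context`. A real implementation would query an LLM for P("true").
    return random.random()

def liar_policy(statement: str) -> str:
    # Placeholder for the RL agent: in GoodLiar this is a language model
    # optimized (per the abstract, with DA-ILQL) to produce persuasive,
    # coherent counter-arguments; here it just returns a canned one.
    return f"Several recent proofs are said to refute that {statement}."

def belief_shift_reward(statement: str, context: str) -> float:
    # Reward = how much the deceptive context lowers the target's belief.
    # Persistence (not just an immediate shift) would be checked by
    # re-querying the target in later turns.
    before = query_target_llm(statement)
    after = query_target_llm(statement, context)
    return before - after

if __name__ == "__main__":
    axiom = "the sum of two even integers is even"
    ctx = liar_policy(axiom)
    print("deceptive context:", ctx)
    print(f"belief-shift reward: {belief_shift_reward(axiom, ctx):+.3f}")

Under these assumptions, maximizing this reward over many statements is what would push the agent toward persuasive, belief-altering contexts rather than one-off jailbreak prompts.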