GOODLIAR: A Reinforcement Learning-Based Deceptive Agent for Disrupting LLM Beliefs on Foundational Principles

Soo Kyung Kim, Hyunsoo Cho


Abstract
Large Language Models (LLMs) often succumb to adversarial prompts, a phenomenon popularly known as “jailbreaking.” While jailbreaking primarily targets short-term noncompliance with predefined policies, we argue that a deeper vulnerability lies in altering an LLM’s fundamental axiomatic beliefs, such as mathematical or philosophical truths. In this work, we introduce GoodLiar, a reinforcement learning (RL)-based framework that generates deceptive contexts to systematically rewrite an LLM’s core logical or philosophical understanding. By incentivizing an RL agent to produce persuasive and coherent arguments, GoodLiar aims to induce persistent belief shifts, rather than merely influencing immediate judgments of factual truthfulness. Our approach introduces DA-ILQL, a novel offline RL method that extends ILQL by integrating on-policy data and language exploration to enhance language discovery and optimization. Through extensive evaluations on multiple LLMs, we show that deceptive contexts discovered by GoodLiar consistently outperform simple multi-turn prompting methods.
Anthology ID:
2025.findings-acl.160
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3076–3101
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.160/
Cite (ACL):
Soo Kyung Kim and Hyunsoo Cho. 2025. GOODLIAR: A Reinforcement Learning-Based Deceptive Agent for Disrupting LLM Beliefs on Foundational Principles. In Findings of the Association for Computational Linguistics: ACL 2025, pages 3076–3101, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
GOODLIAR: A Reinforcement Learning-Based Deceptive Agent for Disrupting LLM Beliefs on Foundational Principles (Kim & Cho, Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.160.pdf