Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents

Jihye Kim

Abstract

K-Level reasoning—recursive modeling of opponent beliefs—improves LLM negotiation utility but frequently elicits coercive and toxic behaviors that undermine real-world deployability. We propose an Observer–Planner–Actor architecture with a Modular Appraisal Gate that (i) dynamically estimates the opponent’s cognitive level and (ii) filters hostile drafts via an LLM-as-a-judge. In randomized interventions on the CaSiNo dataset, our gated agent eliminates toxicity (0%) and reduces coercion from 35% to 6% compared to a strong static-K baseline, albeit with an alignment tax in utility. However, the gate does not reduce preference hallucinations—strategic misrepresentation of the agent’s own priorities. K-Level reasoning incidentally suppresses this behavior (from 35% in a vanilla baseline to 22%), but gating coercion releases the suppression, returning hallucination to vanilla-baseline levels (33–37%). We term this pattern a deceptive bypass: output-level filters address the form of hostility but leave surface-compliant manipulation channels intact, demonstrating that they alone are insufficient to align utility-driven strategic agents.

Anthology ID: 2026.trustnlp-main.17
Volume: Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month: July
Year: 2026
Address: San Diego, California
Editors: Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galstyan, Anoop Kumar, Rahul Gupta
Venues: TrustNLP | WS
Publisher: Association for Computational Linguistics
Pages: 287–294
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.17/
Cite (ACL): Jihye Kim. 2026. Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 287–294, San Diego, California. Association for Computational Linguistics.
Cite (Informal): Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents (Kim, TrustNLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.17.pdf
