Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents

Jihye Kim


Abstract
K-Level reasoning—recursive modeling of opponent beliefs—improves LLM negotiation utility but frequently elicits coercive and toxic behaviors that undermine real-world deployability. We propose an Observer–Planner–Actor architecture with a Modular Appraisal Gate that (i) dynamically estimates the opponent’s cognitive level and (ii) filters hostile drafts via an LLM-as-a-judge. In randomized interventions on the CaSiNo dataset, our gated agent eliminates toxicity (0%) and reduces coercion from 35% to 6% compared to a strong static-K baseline, albeit with an alignment tax in utility. However, the gate does not reduce preference hallucinations—strategic misrepresentation of the agent’s own priorities. K-Level reasoning incidentally suppresses this behavior (from 35% in a vanilla baseline to 22%), but gating coercion releases the suppression, returning hallucination to vanilla-baseline levels (33–37%). We term this pattern a deceptive bypass: output-level filters address the form of hostility but leave surface-compliant manipulation channels intact, demonstrating that they alone are insufficient to align utility-driven strategic agents.
Anthology ID:
2026.trustnlp-main.17
Volume:
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galystan, Anoop Kumar, Rahul Gupta
Venues:
TrustNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
287–294
Language:
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.17/
DOI:
Bibkey:
Cite (ACL):
Jihye Kim. 2026. Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 287–294, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents (Kim, TrustNLP 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.17.pdf