Jihye Kim

2026

Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents
Jihye Kim
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)

K-Level reasoning—recursive modeling of opponent beliefs—improves LLM negotiation utility but frequently elicits coercive and toxic behaviors that undermine real-world deployability. We propose an Observer–Planner–Actor architecture with a Modular Appraisal Gate that (i) dynamically estimates the opponent’s cognitive level and (ii) filters hostile drafts via an LLM-as-a-judge. In randomized interventions on the CaSiNo dataset, our gated agent eliminates toxicity (0%) and reduces coercion from 35% to 6% compared to a strong static-K baseline, albeit with an alignment tax in utility. However, the gate does not reduce preference hallucinations—strategic misrepresentation of the agent’s own priorities. K-Level reasoning incidentally suppresses this behavior (from 35% in a vanilla baseline to 22%), but gating coercion releases the suppression, returning hallucination to vanilla-baseline levels (33–37%). We term this pattern a deceptive bypass: output-level filters address the form of hostility but leave surface-compliant manipulation channels intact, demonstrating that they alone are insufficient to align utility-driven strategic agents.


SlugRAG at SemEval-2026 Task 8: Domain-Specific Fine-Tuning and Model Scaling for Multi-Turn RAG Retrieval
Pratibha Revankar | Jihye Kim | Umit Azirakhmet
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Multi-Turn Retrieval-Augmented Generation (MT-RAG) requires resolving context-dependent ambiguities across conversational turns. We present a systematic evaluation of dense retrieval optimization for the MTRAGEval benchmark (Task 8, Subtask A: Retrieval Only), investigating training-time strategies and inference-time query reformulation across four diverse English-language domains: CLAPNQ (legal/patent), FIQA (financial), GOVT (government documents), and CLOUD (cloud computing). Our experiments demonstrate that domain-specific fine-tuning yields the most substantial gains, with our best CLAPNQ model achieving Recall@10 of 0.6016 and nDCG@10 of 0.4981, representing 58.3% and 66.0% improvements over the pre-trained BGE baseline. Domain-specific models average 44.3% improvement in Recall@10 and 47.8% in nDCG@10 across all domains. Additionally, fine-tuning larger embedding models (BGE-large) achieves the best overall performance (nDCG@10: 0.5101, Recall@10: 0.6221), highlighting the complementary impact of model capacity and domain adaptation.
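
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Recall@k and nDCG@k are commonly computed for a single query. It assumes binary relevance judgments and is not taken from the paper or the task's official scorer; the function names and the document IDs in the usage lines are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # A hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # The ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking and relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=10))
print(ndcg_at_k(ranked, relevant, k=10))
```

Benchmark scores such as those quoted in the abstract are typically averages of these per-query values over all queries; graded relevance labels, if used, would change the gain term in the nDCG computation.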
