Measuring Large Language Models’ Adversarial Behavior in Social Deduction Games
Marissa Zhao Li, Esha Shivakumar, Peiran Wang, Ying Li, Yuan Tian
Abstract
As large language models (LLMs) are increasingly adopted and trusted in real-world applications, understanding their behavior beyond single-turn prompting has become critical. Existing safety evaluations primarily focus on refusal-based methods that test whether models avoid responding to inappropriate or violent requests, leaving open questions about how models behave in interactive social settings. In this paper, we observe the adversarial behavior of LLMs through a multi-agent simulation across five diverse conversational social deduction games, which act as testbeds that apply social pressure and survival stress through game design alone, without explicit prompt injections. From these interactions, we construct a closed behavioral taxonomy derived through open card sorting and apply it uniformly across models, using a meta-LLM for behavior labeling. This approach shows that models exhibit distinct behavioral profiles and that the different ways in which models conduct structured deliberation influence both social stability and competitive success.
- Anthology ID:
- 2026.findings-acl.2043
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 41099–41115
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.2043/
- Cite (ACL):
- Marissa Zhao Li, Esha Shivakumar, Peiran Wang, Ying Li, and Yuan Tian. 2026. Measuring Large Language Models’ Adversarial Behavior in Social Deduction Games. In Findings of the Association for Computational Linguistics: ACL 2026, pages 41099–41115, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Measuring Large Language Models’ Adversarial Behavior in Social Deduction Games (Li et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.2043.pdf