Hyojin Chin


2026

This research examines how large language models internalize social identities assigned through targeted prompts. Guided by social identity theory, we investigate whether and how these identity assignments cause AI systems to differentiate between “we” (the ingroup) and “they” (the outgroup). We demonstrate that self-categorization of social identity leads to both ingroup favoritism and outgroup bias, with the latter manifesting as strongly as the former. This finding is significant given the fundamental role of outgroup bias in driving intergroup prejudice and discrimination as documented in social psychology. We further propose a strategic intervention to mitigate such bias by guiding language models to adopt the identity of the initially disfavored group. This method, validated across both political and gender domains, exposes a critical dual function of group alignment: adopting one social identity inherently alters the model’s stance toward outgroups, effectively neutralizing pre-existing biases. Our work shows that understanding human-like AI behaviors is a critical prerequisite to building more balanced and socially responsible technology.

2024

While detecting offensive language in online spaces remains an important societal issue, there is still a significant gap in existing research and practical datasets specific to chatbots. Furthermore, many of the current efforts by service providers to automatically filter offensive language are vulnerable to users’ deliberate text manipulation tactics, such as misspelling words. In this study, we analyze offensive language patterns in real logs of 6,254,261 chat utterance pairs from the commercial chat service Simsimi, which cover a variety of conversation topics. Based on the observed patterns, we introduce a novel offensive language detection method—a contrastive learning model that embeds chat content with a random masking strategy. We show that this model outperforms existing models in detecting offensive language in open-domain chat conversations while also demonstrating robustness against such manipulation tactics. We release our curated chatbot dataset to foster research on offensive language detection in open-domain conversations and share lessons learned from mitigating offensive language on a live platform.
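The random-masking idea described above can be sketched in a few lines: each utterance is augmented by randomly replacing tokens with a mask symbol, and a contrastive (InfoNCE-style) objective pulls views of the same utterance together while pushing views of other utterances apart, which is what makes the learned embedding robust to character- or word-level manipulation. This is a minimal, dependency-free illustration; the mask token, masking probability, similarity function, and temperature are all assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of random-masking augmentation plus a contrastive
# (InfoNCE-style) loss; model details and hyperparameters are assumptions.
import math
import random

MASK = "[MASK]"  # assumed mask symbol

def random_mask(tokens, mask_prob=0.15, seed=None):
    """Return a masked copy of a token list: each token is independently
    replaced with MASK with probability mask_prob."""
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else t for t in tokens]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Contrastive loss: the anchor embedding should be closer to the
    positive (another masked view of the same utterance) than to the
    negatives (embeddings of other utterances)."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[0] / sum(exps))

# Usage: two masked views of one utterance form a positive pair.
tokens = "you are such a bot".split()
view_a = random_mask(tokens, seed=1)
view_b = random_mask(tokens, seed=2)
```

In a real training loop the masked token sequences would be encoded by the model before computing the loss; here the vectors passed to `info_nce` stand in for those encoder outputs.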