Yaru Cao
2025
SocialEval: Evaluating Social Intelligence of Large Language Models
Jinfeng Zhou | Yuxuan Chen | Yihan Shi | Xuanming Zhang | Leqi Lei | Yi Feng | Zexuan Xiong | Miao Yan | Xunzhi Wang | Yaru Cao | Jianing Yin | Shuai Wang | Quanyu Dai | Zhenhua Dong | Hongning Wang | Minlie Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs’ SI and their discrepancy with humans. SI equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. Each script is structured as a world tree that contains plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even if they lead to goal failure. Analysis of LLMs’ formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.
2024
ToMBench: Benchmarking Theory of Mind in Large Language Models
Zhuang Chen | Jincenzi Wu | Jinfeng Zhou | Bosi Wen | Guanqun Bi | Gongyao Jiang | Yaru Cao | Mengting Hu | Yunghwei Lai | Zexuan Xiong | Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a built-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10 percentage points, indicating that LLMs have not yet achieved a human-level theory of mind. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs’ ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.