SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

Fanqi Kong; Weiqin Zu; Xinyu Chen; Yaodong Yang (杨耀东); Song-Chun Zhu; Xue Feng

SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng

Abstract

Understanding social interaction, which encompasses perceiving numerous and subtle multimodal cues, inferring unobservable mental states and relations, and dynamically predicting others’ behavior, is the foundation for achieving human-machine interaction. Despite rapid advances in Multimodal Large Language Models (MLLMs), the rich and multifaceted nature of social interaction has hindered the development of benchmarks that holistically evaluate and guide their social interaction abilities. Based on social relation theory, which has been widely regarded as a foundational framework for understanding social behavior, we provide SIV-Bench, a novel video benchmark for systematically evaluating MLLMs’ capabilities across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 originally collected video clips and 5,455 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It covers 14 typical relationships, diverse video lengths, genres, presentation styles, and linguistic and cultural backgrounds. Our comprehensive experiments show that leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with the systematic confusion in relation inference as a key bottleneck. An in-depth analysis of the reasoning process attributes MLLMs’ suboptimal performance to misalignment with human thoughts and insufficient reasoning depth. Moreover, we find audio and subtitles aid in reasoning-intensive SSR and SDP. Together, SIV-Bench offers a unified testbed to measure progress, expose limitations, and guide future research toward more socially intelligent MLLMs.

Anthology ID:: 2026.findings-acl.1863
Volume:: Findings of the Association for Computational Linguistics: ACL 2026
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 37379–37403
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1863/
DOI:
Bibkey:
Cite (ACL):: Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, and Xue Feng. 2026. SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 37379–37403, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning (Kong et al., Findings 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1863.pdf
Checklist:: 2026.findings-acl.1863.checklist.pdf

PDF Cite Search Checklist Fix data