V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning
Zongyu Lin | Zhikun Xu | Xiaohan Song | Yixin Wan | Xingcheng Yao | Tsung-Han Lin | Selina Song | Pranav Subbaraman | Ben Zhou | Kai-Wei Chang | Yizhou Sun
Findings of the Association for Computational Linguistics: ACL 2025
Social commonsense reasoning naturally involves both the verbal and non-verbal cues of a social interaction. It is therefore important for Large Vision-Language Models (VLMs) to leverage both textual and visual information when performing tasks such as social understanding and reasoning. However, while current LLMs have shown strong social reasoning capabilities in purely textual contexts, whether they can effectively incorporate visual information into social comprehension remains under-explored. To narrow this gap, we first construct and propose a benchmark, V-Social, featuring well-aligned textual and visual content and tailored to assess visual social commonsense reasoning in multimodal foundation models. Through experiments on V-Social, we find that even the most advanced VLM, GPT-4o, often falls short in social commonsense reasoning, highlighting the critical need to improve the social grounding of VLMs. One major obstacle to such improvement is the lack of high-quality data with sound reasoning processes. To overcome this obstacle, we introduce V-AlphaSocial, a novel method that generates high-quality chain-of-thought reasoning paths from unlabeled data. We design a visual reasoning reward model to improve the VLM, and then iteratively refine both the VLM and the reward model. Our extensive analysis shows how our method enhances social commonsense reasoning, offering an effective approach that facilitates deeper exploration of this field.