Wangbo Zhao
2026
GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs
Weidong Tang | Jierui Li | Yueling Hou | Zihan Mei | Can Zhang | Xinyan Wan | Zhiyuan Liang | Pengfei Zhou | Yang You | Wangbo Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models (MLLMs) systematically fail at this group-level inference: collective behavior emerges nonlinearly from social tensions, conformity dynamics, and structural constraints, and cannot be recovered by summing individual intentions. We present ***GroupToM-Bench***, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal that frontier models perform significantly below human levels, exposing fundamental blind spots in modeling social structures and nonlinear collective behavior.
2025
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
Xu Zhao Pan | Pengfei Zhou | Jiaxin Ai | Wangbo Zhao | Kai Wang | Xiaojiang Peng | Wenqi Shao | Hongxun Yao | Kaipeng Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, and identifying process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training and guide LLMs toward correct steps during inference, thereby improving reasoning accuracy. However, existing benchmarks for PRMs are text-based and focus on error detection, neglecting other scenarios such as reasoning search. To address this gap, we introduce MPBench, a comprehensive, multi-task, multimodal benchmark designed to systematically assess the effectiveness of PRMs in diverse scenarios. MPBench employs three evaluation paradigms, each targeting a specific role of PRMs in the reasoning process: (1) Step Correctness, which assesses the correctness of each intermediate reasoning step; (2) Answers Aggregation, which aggregates multiple solutions and selects the best one; and (3) Reasoning Process Search, which guides the search for optimal reasoning steps during inference. Through these paradigms, MPBench enables comprehensive evaluation and provides insights into the development of multimodal PRMs.
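As a rough illustration of the first two roles named in the abstract (scoring intermediate steps and choosing among several candidate solutions), the following minimal Python sketch may help. It is not MPBench's evaluation protocol: the `StepScorer` type, `score_solution`, `best_of_n`, the min-over-steps aggregation rule, and the keyword-based `toy_prm` are all hypothetical choices made for illustration, and a real PRM would be a learned multimodal model.

```python
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the paper): a PRM takes the
# question and the reasoning steps so far, and returns a score in [0, 1] for
# the most recent step.
StepScorer = Callable[[str, Sequence[str]], float]


def score_solution(question: str, steps: Sequence[str], prm: StepScorer) -> float:
    """Score every step given its prefix, then aggregate with min.

    Min-aggregation is one common choice (a solution is only as good as its
    weakest step); it is assumed here for illustration.
    """
    step_scores = [prm(question, steps[: i + 1]) for i in range(len(steps))]
    return min(step_scores) if step_scores else 0.0


def best_of_n(question: str, candidates: List[Sequence[str]], prm: StepScorer) -> int:
    """Return the index of the candidate solution with the highest score."""
    scores = [score_solution(question, c, prm) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_prm(question: str, prefix: Sequence[str]) -> float:
    """Toy stand-in for a learned PRM: distrust steps that admit to guessing."""
    return 0.1 if "guess" in prefix[-1] else 0.9


if __name__ == "__main__":
    question = "What is 2 + 2?"
    candidates = [
        ["compute 2 + 2 = 4", "guess the answer is 5"],
        ["compute 2 + 2 = 4", "the answer is 4"],
    ]
    print(best_of_n(question, candidates, toy_prm))
```

Under the same assumptions, the third role (reasoning process search) would call the scorer on partial prefixes during generation and keep only the highest-scoring continuations, rather than ranking finished solutions.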