Yuxuan Zhang
2025
Synergizing Multimodal Temporal Knowledge Graphs and Large Language Models for Social Relation Recognition
Haorui Wang
|
Zheng Wang
|
Yuxuan Zhang
|
Bo Wang
|
Bin Wu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent years have witnessed remarkable advances in Large Language Models (LLMs). However, in the task of social relation recognition, Large Language Models (LLMs) encounter significant challenges due to their reliance on sequential training data, which inherently restricts their capacity to effectively model complex graph-structured relationships. To address this limitation, we propose a novel low-coupling method synergizing multimodal temporal Knowledge Graphs and Large Language Models (mtKG-LLM) for social relation reasoning. Specifically, we extract multimodal information from the videos and model the social networks as spatial Knowledge Graphs (KGs) for each scene. Temporal KGs are constructed based on spatial KGs and updated along the timeline for long-term reasoning. Subsequently, we retrieve multi-scale information from the graph-structured knowledge for LLMs to recognize the underlying social relation. Extensive experiments demonstrate that our method has achieved state-of-the-art performance in social relation recognition. Furthermore, our framework exhibits effectiveness in bridging the gap between KGs and LLMs. Our code will be released after acceptance.
Interesting Culture: Social Relation Recognition from Videos via Culture De-confounding
Yuxuan Zhang
|
Yangfu Zhu
|
Haorui Wang
|
Bin Wu
Findings of the Association for Computational Linguistics: EMNLP 2025
Social relationship recognition, as one of the fundamental tasks in video understanding, contributes to the construction and application of multi-modal knowledge graph. Previous works have mainly focused on two aspects: generating character graphs and multi-modal fusion. However, they often overlook the impact of cultural differences on relationship recognition. Specifically, relationship recognition models are susceptible to being misled by training data from a specific cultural context. This can result in the learning of culture-specific spurious correlations, ultimately restricting the ability to recognize social relationships in different cultures. Therefore, we employ a customized causal graph to analyze the confounding effects of culture in the relationship recognition task. We propose a Cultural Causal Intervention (CCI) model that mitigates the influence of culture as a confounding factor in the visual and textual modalities. Importantly, we also construct a novel video social relation recognition (CVSR) dataset to facilitate discussion and research on cultural factors in video tasks. Extensive experiments conducted on several datasets demonstrate that the proposed model surpasses state-of-the-art methods.