Yinghan Shen


2025

Differentiated Vision: Unveiling Entity-Specific Visual Modality Requirements for Multimodal Knowledge Graph
Minghang Liu | Yinghan Shen | Zihe Huang | Yuanzhuo Wang | Xuhui Jiang | Huawei Shen
Findings of the Association for Computational Linguistics: EMNLP 2025

Multimodal Knowledge Graphs (MMKGs) enhance knowledge representations by integrating structural and multimodal information of entities. Recently, MMKGs have proven effective in tasks such as information retrieval, knowledge discovery, and question answering. Current methods typically utilize pre-trained visual encoders to extract features from images associated with each entity, emphasizing complex cross-modal interactions. However, these approaches often overlook the varying relevance of visual information across entities. Specifically, not all entities benefit from visual data, and not all associated images are pertinent, with irrelevant images introducing noise and potentially degrading model performance. To address these issues, we propose the Differentiated Vision for Multimodal Knowledge Graphs (DVMKG) model. DVMKG evaluates the necessity of visual modality for each entity based on its intrinsic attributes and assesses image quality through representativeness and diversity. Leveraging these metrics, DVMKG dynamically adjusts the influence of visual data during feature integration, tailoring it to the specific needs of different entity types. Extensive experiments on multiple benchmark datasets confirm the effectiveness of DVMKG, demonstrating significant improvements over existing methods.
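The core mechanism described above is an entity-specific gate on the visual modality. Below is a minimal sketch of that idea, not the authors' implementation: it assumes precomputed structural and image embeddings, scores each entity's image set by representativeness (closeness to the set centroid) and diversity (pairwise dissimilarity), and scales the visual contribution accordingly. The function names, the equal weighting of the two quality terms, and the convex fusion are all hypothetical choices for illustration.

```python
# Hedged sketch, not the DVMKG code: gate visual features by an
# entity-specific necessity and an image-quality score.
import numpy as np

def image_quality(image_embs: np.ndarray) -> float:
    """Score a set of image embeddings by representativeness and diversity."""
    embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    centroid = embs.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    # representativeness: mean similarity of each image to the set centroid
    representativeness = float((embs @ centroid).mean())
    # diversity: one minus the mean pairwise similarity between images
    sims = embs @ embs.T
    n = len(embs)
    mean_off_diag = (sims.sum() - np.trace(sims)) / max(n * (n - 1), 1)
    diversity = 1.0 - mean_off_diag
    return float(np.clip(0.5 * representativeness + 0.5 * diversity, 0.0, 1.0))

def fuse_entity(struct_emb, image_embs, necessity):
    """Blend structural and visual features; visual weight = necessity * quality."""
    gate = necessity * image_quality(image_embs)   # necessity assumed in [0, 1]
    visual = image_embs.mean(axis=0)
    return (1.0 - gate) * struct_emb + gate * visual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    struct = rng.normal(size=64)
    images = rng.normal(size=(3, 64))
    print(fuse_entity(struct, images, necessity=0.7).shape)  # (64,)
```

With this gating, an entity whose images are noisy or redundant contributes less visual signal to the fused representation, which is the behaviour the abstract attributes to DVMKG.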

Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject
Zenghao Duan | Wenbin Duan | Zhiyi Yin | Yinghan Shen | Shaoling Jing | Jie Zhang | Huawei Shen | Xueqi Cheng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

2024

Unlocking the Power of Large Language Models for Entity Alignment
Xuhui Jiang | Yinghan Shen | Zhichao Shi | Chengjin Xu | Wei Li | Zixuan Li | Jian Guo | Huawei Shen | Yuanzhuo Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Entity Alignment (EA) is vital for integrating diverse knowledge graph (KG) data, playing a crucial role in data-driven AI applications. Traditional EA methods primarily rely on comparing entity embeddings, but their effectiveness is constrained by the limited input KG data and the capabilities of the representation learning techniques. Against this backdrop, we introduce ChatEA, an innovative framework that incorporates large language models (LLMs) to improve EA. To address the constraints of limited input KG data, ChatEA introduces a KG-code translation module that translates KG structures into a format understandable by LLMs, thereby allowing LLMs to utilize their extensive background knowledge to improve EA accuracy. To overcome the over-reliance on entity embedding comparisons, ChatEA implements a two-stage EA strategy that capitalizes on LLMs’ capability for multi-step reasoning in a dialogue format, thereby enhancing accuracy while preserving efficiency. Our experimental results affirm ChatEA’s superior performance, highlighting LLMs’ potential in facilitating EA tasks. The source code is available at https://anonymous.4open.science/r/ChatEA/.
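The two-stage strategy in the abstract can be pictured as a small pipeline. The sketch below is not the released ChatEA code: it assumes precomputed entity embeddings, renders a one-hop neighbourhood as plain text in the spirit of the KG-code translation module, shortlists candidates by embedding similarity, and builds a prompt for an LLM whose call is deliberately left out. The names describe_entity, top_k_candidates, and build_prompt are hypothetical.

```python
# Hedged sketch of a two-stage EA pipeline: embedding shortlist, then LLM prompt.
import numpy as np

def describe_entity(name, triples):
    """Render an entity's one-hop neighbourhood as text an LLM can read."""
    lines = [f"entity: {name}"]
    lines += [f"  ({name}, {rel}, {tail})" for rel, tail in triples]
    return "\n".join(lines)

def top_k_candidates(src_emb, tgt_embs, tgt_names, k=3):
    """Stage 1: shortlist target entities by cosine similarity of embeddings."""
    src = src_emb / np.linalg.norm(src_emb)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    scores = tgt @ src
    order = np.argsort(-scores)[:k]
    return [(tgt_names[i], float(scores[i])) for i in order]

def build_prompt(src_desc, cand_desc):
    """Stage 2: ask an LLM (call not shown here) whether the two entities match."""
    return ("Do the following two entities refer to the same real-world object?\n\n"
            f"{src_desc}\n\n{cand_desc}\n\nAnswer yes or no with a short reason.")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    names = ["Berlin", "Paris", "Rome"]
    tgt_embs = rng.normal(size=(3, 16))
    src_emb = tgt_embs[1] + 0.01 * rng.normal(size=16)   # toy source close to "Paris"
    best_name, _ = top_k_candidates(src_emb, tgt_embs, names, k=2)[0]
    src_desc = describe_entity("Paris (KG1)", [("capital_of", "France")])
    cand_desc = describe_entity(best_name, [("capital_of", "France")])
    print(build_prompt(src_desc, cand_desc))
```

Restricting the LLM's multi-step reasoning to a short embedding-derived shortlist is what lets this style of pipeline keep accuracy gains without paying an LLM call per candidate pair.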

MM-ChatAlign: A Novel Multimodal Reasoning Framework based on Large Language Models for Entity Alignment
Xuhui Jiang | Yinghan Shen | Zhichao Shi | Chengjin Xu | Wei Li | Huang Zihe | Jian Guo | Yuanzhuo Wang
Findings of the Association for Computational Linguistics: EMNLP 2024

Multimodal entity alignment (MMEA) integrates multi-source and cross-modal knowledge graphs, a crucial yet challenging task for data-centric applications. Traditional MMEA methods derive the visual embeddings of entities and combine them with other modal data for alignment by embedding similarity comparison. However, these methods are hampered by the limited comprehension of visual attributes and deficiencies in realizing and bridging the semantics of multimodal data. To address these challenges, we propose MM-ChatAlign, a novel framework that utilizes the visual reasoning abilities of MLLMs for MMEA. The framework features an embedding-based candidate collection module that adapts to various knowledge representation strategies, effectively filtering out irrelevant reasoning candidates. Additionally, a reasoning and rethinking module, powered by MLLMs, enhances alignment by efficiently utilizing multimodal information. Extensive experiments on four MMEA datasets demonstrate MM-ChatAlign’s superiority and underscore the significant potential of MLLMs in MMEA tasks. The source code is available at https://github.com/jxh4945777/MMEA/.
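The reason-and-rethink behaviour described above can be illustrated with a toy loop over an embedding-derived shortlist. This is not the released MM-ChatAlign code: mllm_score is a stand-in for the multimodal LLM judgment, the widening-shortlist rethink rule is an assumption, and all names and thresholds are hypothetical.

```python
# Hedged sketch of candidate collection plus a reason-and-rethink loop.
import numpy as np

def shortlist(src_emb, tgt_embs, k):
    """Candidate collection: rank targets by cosine similarity to the source."""
    src = src_emb / np.linalg.norm(src_emb)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    return list(np.argsort(-(tgt @ src)))[:k]

def mllm_score(src_id, cand_id):
    """Placeholder for an MLLM scoring a (text + image) entity pair in [0, 1]."""
    return 1.0 if src_id == cand_id else 0.3            # toy stand-in only

def reason_and_rethink(src_id, src_emb, tgt_embs, k=2, threshold=0.8, max_rounds=3):
    """Accept a candidate once the score is confident; otherwise widen the shortlist."""
    for round_ in range(1, max_rounds + 1):
        for cand in shortlist(src_emb, tgt_embs, k * round_):
            if mllm_score(src_id, cand) >= threshold:
                return cand                              # confident alignment found
    return None                                          # no alignment accepted

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    tgt_embs = rng.normal(size=(5, 8))
    src_id = 4
    src_emb = tgt_embs[src_id] + 0.5 * rng.normal(size=8)   # noisy view of entity 4
    print(reason_and_rethink(src_id, src_emb, tgt_embs))     # expected: 4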