Multimodal large language models (MLLMs) demonstrate strong capabilities in multimodal understanding, reasoning, and interaction, but still face the fundamental limitation of hallucinations, where they generate erroneous or fabricated information. To mitigate hallucinations, existing methods annotate pair-responses (one non-hallucinatory response vs. one hallucinatory response) manually or with GPT-4V, and train alignment algorithms to improve the correspondence between images and text. However, an image description often involves multiple dimensions (e.g., object attributes, posture, and spatial relationships), making it challenging for the model to comprehensively learn multidimensional information from pair-responses. To this end, we propose RRHF-V, the first method to use rank-responses (one non-hallucinatory response vs. multiple ranked hallucinatory responses) to mitigate multimodal hallucinations. Instead of training the model on pair-responses, RRHF-V expands the number of hallucinatory responses, so that the differently scored responses within a rank-response enable the model to learn rich semantic information across various dimensions of the image. Further, we propose a scene-graph-based approach to construct rank-responses automatically and cost-effectively. We also design a novel training objective based on rank loss and margin loss to balance the differences between hallucinatory responses within a rank-response, thereby improving the model's image comprehension. Experiments on two MLLMs of different sizes and four widely used benchmarks demonstrate that RRHF-V effectively mitigates hallucinations and outperforms the DPO method based on pair-responses.
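The abstract does not give the exact form of the rank-plus-margin objective; as a non-authoritative illustration, a pairwise ranking loss with a margin over length-normalized response log-probabilities could look like the following sketch (the function name, the margin value, and the ordering convention are assumptions, not RRHF-V's published formulation):

```python
import torch

def rank_margin_loss(logprobs, margin=0.1):
    """Hypothetical sketch of a rank + margin objective over one rank-response.

    logprobs: tensor of shape (k,) with length-normalized log-probabilities of
    the k responses under the policy model, ordered from best (the
    non-hallucinatory response) to worst (the most hallucinatory one).
    The actual RRHF-V objective may differ.
    """
    loss = logprobs.new_zeros(())
    k = logprobs.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            # Response i is ranked above response j, so its log-probability
            # should exceed response j's by at least the margin.
            loss = loss + torch.relu(logprobs[j] - logprobs[i] + margin)
    return loss

# Usage: scores for one ranked set of 4 responses (1 correct, 3 hallucinatory).
scores = torch.tensor([-1.2, -1.5, -1.9, -2.4], requires_grad=True)
print(rank_margin_loss(scores, margin=0.1))
```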
Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multi-modal knowledge graphs (MMKGs). However, the intrinsic noise within modalities, such as inconsistency in the visual modality and redundant attributes, has not been thoroughly investigated. Excessive noise not only weakens semantic representations but also increases the risk of overfitting in attention-based fusion methods. To address this, we propose LGEA, a novel LLM-guided MMEA framework that prioritizes noise reduction before fusion. Specifically, LGEA introduces two key strategies: (1) fine-grained visual filtering to remove irrelevant images at the semantic level, and (2) contextual summarization of attribute information to enhance entity semantics. To our knowledge, this is the first work to apply LLMs to both visual filtering and attribute-level semantic enhancement in MMEA. Experiments on multiple benchmarks, including the noisy FB-YG dataset, show that LGEA sets a new state of the art (SOTA) in robust multi-modal alignment, highlighting noise-aware strategies as a promising direction for future MMEA research.
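As a loose illustration of the two pre-fusion noise-reduction strategies, the following minimal sketch assumes a generic text-in/text-out LLM interface; the `llm_complete` callable and both prompts are hypothetical and do not reproduce LGEA's actual prompts:

```python
def filter_images_and_summarize(entity_name, image_captions, attributes, llm_complete):
    """Hypothetical sketch of LGEA-style noise reduction before fusion.

    llm_complete: any function mapping a prompt string to an LLM response
    string (the paper's actual prompts and LLM interface are not specified).
    Raw LLM outputs are returned; parsing is omitted for brevity.
    """
    # (1) Fine-grained visual filtering: ask the LLM which images (described
    # here by captions) are semantically relevant to the entity.
    filter_prompt = (
        f"Entity: {entity_name}\n"
        f"Candidate image descriptions: {image_captions}\n"
        "Return the indices of descriptions that actually depict this entity."
    )
    kept_indices = llm_complete(filter_prompt)

    # (2) Contextual attribute summarization: compress redundant attributes
    # into a short natural-language description of the entity.
    summary_prompt = (
        f"Entity: {entity_name}\nAttributes: {attributes}\n"
        "Summarize these attributes into one concise sentence."
    )
    summary = llm_complete(summary_prompt)
    return kept_indices, summary
```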
Entity alignment (EA) aims to identify entities in different knowledge graphs (KGs) that represent the same real-world objects. Traditional EA methods typically embed entity information into a vector space under the guidance of seed entity pairs and align entities by computing and comparing the similarity between entity embeddings. With the advent of large language models (LLMs), emerging methods increasingly integrate LLMs with traditional methods to leverage external knowledge and improve EA accuracy. However, this integration introduces additional computational complexity and operational overhead, and it still requires seed pairs, which are scarce and expensive to obtain. To address these challenges, we propose EasyEA, the first training-free, end-to-end EA framework based on LLMs. EasyEA consists of three main stages: (1) Information Summarization, (2) Embedding and Feature Fusion, and (3) Candidate Selection. By automating the EA process, EasyEA significantly reduces the reliance on seed entity pairs while demonstrating superior performance across various datasets, covering cross-lingual, sparse, large-scale, and heterogeneous scenarios. Extensive experimental results show that EasyEA not only simplifies the EA process but also achieves state-of-the-art (SOTA) performance on diverse datasets, providing a promising solution for advancing EA tasks.
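A rough, non-authoritative sketch of such a training-free, seed-free three-stage pipeline is shown below; the `summarize` and `embed` callables, the toy stand-ins in the usage example, and the cosine-similarity candidate selection are assumptions standing in for components the abstract does not detail:

```python
import numpy as np

def easyea_pipeline(entities_kg1, entities_kg2, summarize, embed, top_k=1):
    """Hypothetical sketch of a training-free, three-stage EA pipeline.

    summarize: LLM-based function mapping an entity's raw information to a
    textual summary; embed: function mapping text to a vector.
    """
    # Stage 1: Information Summarization.
    sums1 = [summarize(e) for e in entities_kg1]
    sums2 = [summarize(e) for e in entities_kg2]

    # Stage 2: Embedding (feature fusion would combine several such views).
    emb1 = np.stack([embed(s) for s in sums1]).astype(float)
    emb2 = np.stack([embed(s) for s in sums2]).astype(float)
    emb1 /= np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 /= np.linalg.norm(emb2, axis=1, keepdims=True)

    # Stage 3: Candidate Selection via cosine similarity, with no seed pairs.
    sims = emb1 @ emb2.T
    return np.argsort(-sims, axis=1)[:, :top_k]  # top-k candidates per entity

# Example with toy stand-ins for the LLM summarizer and the text encoder.
cands = easyea_pipeline(
    ["Paris (city in France)"], ["Paris, capital of France"],
    summarize=lambda e: e,
    embed=lambda s: np.array([len(s), s.count("a"), 1.0]),
)
```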
Multimodal entity alignment aims to identify equivalent entities in heterogeneous knowledge graphs by leveraging complementary information from multiple modalities. However, existing methods often overlook the quality of input modality embeddings during modality interaction (e.g., missing-modality generation, modal information transfer, and modality fusion), which may inadvertently amplify noise propagation while suppressing discriminative feature representations. To address these issues, we propose CLAMEA, a novel model for capturing latent modal associations for multimodal entity alignment. Specifically, we use a self-attention mechanism to enhance salient information while attenuating noise within individual modality embeddings. We design a dynamic modal attention flow fusion module to capture and balance latent intra- and inter-modal associations and to generate fused modality embeddings. Based on both the fused and the available modality embeddings, we adopt a variational autoencoder (VAE) to generate high-quality embeddings for missing modalities. A cross-modal association extraction module then extracts latent modal associations from the completed modality embeddings, further enhancing embedding quality. Experimental results on two real-world datasets demonstrate the effectiveness of our approach, which achieves an absolute 3.1% improvement in Hits@1 over the state-of-the-art (SOTA) method.
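As a loose illustration of attention-weighted fusion of modality embeddings (not CLAMEA's actual dynamic modal attention flow module, whose details are not given here), consider the following sketch; the class name, head count, and scoring layer are assumptions:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical sketch: attention over modality embeddings, then a
    dynamically weighted sum as the fused entity embedding."""

    def __init__(self, dim):
        super().__init__()
        self.denoise = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, modal_embs):
        # modal_embs: (batch, num_modalities, dim), one embedding per modality.
        # Attention across modalities to emphasize salient information.
        attended, _ = self.denoise(modal_embs, modal_embs, modal_embs)
        # Dynamic per-modality weights, then a weighted sum as the fused embedding.
        weights = torch.softmax(self.score(attended), dim=1)  # (batch, M, 1)
        return (weights * attended).sum(dim=1)                # (batch, dim)

# Usage: fuse 3 modality embeddings (e.g., graph, visual, attribute) of dim 128.
fused = AttentionFusion(dim=128)(torch.randn(2, 3, 128))
```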