Zhuo Liu

2025

Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting
Jiarui Wu | Zhuo Liu | Hangfeng He
Findings of the Association for Computational Linguistics: NAACL 2025

Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading them to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) a bidirectional constraint, which ensures consistency in pairwise object relations, and (2) a transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of various bidirectional relation analysis choices and transitivity reference selections highlights the broader potential of our methods for incorporating constraints to mitigate spatial relation hallucinations.

2024

Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP
Zeliang Zhang | Zhuo Liu | Mingqian Feng | Chenliang Xu
Findings of the Association for Computational Linguistics: EMNLP 2024

CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empirically investigate the quantity bias in CLIP. By carefully designing different experimental settings and datasets, we comprehensively evaluate CLIP’s understanding of quantity from text, image, and cross-modal perspectives. Our experimental results reveal a quantity bias in CLIP embeddings, impacting the reliability of downstream tasks.
