HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding

Fan Yuan, Chi Qin, Xiaogang Xu, Piji Li


Abstract
Large Vision-Language Models (LVLMs) have shown remarkable performance on many vision-language tasks. However, these models still suffer from multimodal hallucination, i.e., generating objects or content inconsistent with the input image. Much existing work detects hallucination by directly judging whether an object exists in the image, overlooking the association between the object and the surrounding semantics. To address this issue, we propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD). This framework incorporates hallucination feedback at both the object and sentence semantic levels. Remarkably, even with a marginal amount of training, this approach can alleviate over 15% of hallucinations. Simultaneously, HELPD penalizes the output logits according to the image attention window to avoid being overly influenced by the generated text. HELPD can be seamlessly integrated with any LVLM. Our experiments demonstrate that the proposed framework yields favorable results across multiple hallucination benchmarks, effectively mitigating hallucination for different LVLMs while concurrently improving their text generation quality.
Anthology ID:
2024.emnlp-main.105
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1768–1785
URL:
https://preview.aclanthology.org/fix-sig-urls/2024.emnlp-main.105/
DOI:
10.18653/v1/2024.emnlp-main.105
Cite (ACL):
Fan Yuan, Chi Qin, Xiaogang Xu, and Piji Li. 2024. HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1768–1785, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (Yuan et al., EMNLP 2024)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2024.emnlp-main.105.pdf