Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Bo Zheng, Yancheng He, Shilong Li


Abstract
Direct Preference Optimization (DPO) has proven highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite recent progress, existing methods suffer from two drawbacks: 1) a lack of scalable token-level rewards; and 2) neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visually correlated tokens without fine-grained annotations. Specifically, we introduce a token-level visual-anchored reward, defined as the difference between the logistic distributions of generated tokens conditioned on the raw image and on a corrupted one. In addition, to highlight informative visual-anchored tokens, a visual-aware training objective is proposed to enable more accurate token-level optimization. Extensive experiments demonstrate the state-of-the-art performance of the proposed TPO. For example, built on top of LLaVA and Qwen, TPO yields substantial absolute improvements on hallucination benchmarks.
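
To make the core idea concrete, the sketch below is one plausible reading of the token-level visual-anchored reward described in the abstract: each generated token is scored by how much its log-probability drops when the raw image is replaced with a corrupted copy. This is a minimal illustration, not the authors' implementation; the PyTorch interface, the [batch, seq_len, vocab] logit layout, and the function name visual_anchored_scores are assumptions.

    # Hypothetical sketch: per-token score as the gap between the model's
    # log-probabilities under the raw image and under a corrupted image.
    import torch
    import torch.nn.functional as F

    def visual_anchored_scores(logits_raw: torch.Tensor,
                               logits_corrupt: torch.Tensor,
                               token_ids: torch.Tensor) -> torch.Tensor:
        """logits_*: [batch, seq_len, vocab]; token_ids: [batch, seq_len]."""
        logp_raw = F.log_softmax(logits_raw, dim=-1)
        logp_corrupt = F.log_softmax(logits_corrupt, dim=-1)
        # Gather the log-probability assigned to each generated token.
        idx = token_ids.unsqueeze(-1)
        lp_raw = logp_raw.gather(-1, idx).squeeze(-1)        # [batch, seq_len]
        lp_corrupt = logp_corrupt.gather(-1, idx).squeeze(-1)
        # Tokens whose probability drops sharply without the visual evidence
        # receive a larger (more visual-anchored) score.
        return lp_raw - lp_corrupt
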
Anthology ID:
2025.findings-emnlp.1076
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
19754–19767
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1076/
DOI:
10.18653/v1/2025.findings-emnlp.1076
Cite (ACL):
Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Bo Zheng, Yancheng He, and Shilong Li. 2025. Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19754–19767, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation (Gu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1076.pdf
Checklist:
 2025.findings-emnlp.1076.checklist.pdf