Joint Multimodal Preference Optimization for Fine-Grained Visual-Textual Alignment

Jiwon Kim, Hyunsoo Yoon


Abstract
Recent research has focused on addressing multimodal hallucinations in Large Vision-Language Models (LVLMs) by extending Direct Preference Optimization (DPO) to incorporate visual preference supervision. However, these methods often lack fine-grained visual contrast mechanisms and rely on single-margin optimization. This in turn limits their ability to capture precise visual semantics and results in weak multimodal alignment. To address these issues, we propose Joint Multimodal Preference Optimization (JoMPO), a novel optimization framework that symmetrically integrates a text-conditioned preference loss with a visual ranking-based objective. JoMPO leverages semantically contrastive image–text pairs and listwise ranking over multiple visual contexts, enabling fine-grained visual grounding and more robust cross-modal alignment. To support this framework, we introduce the Visual–Textual Contrast (VTC) dataset, consisting of image pairs that are semantically similar but visually distinct, each paired with a contextually grounded textual response. When trained with only 5k contrastive pairs, JoMPO consistently demonstrates superior performance across diverse benchmarks, highlighting its effectiveness in mitigating hallucinations and improving image-text alignment in LVLMs.
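The page does not give the paper's actual loss, but the abstract's description — a text-conditioned preference loss combined symmetrically with a listwise visual ranking objective — could, as an illustrative sketch, take a form like the following. All notation here ($\lambda$, $K$, the Plackett–Luce ranking term, the per-image score $s_k$) is an assumption for exposition, not the authors' formulation; only the DPO term is the standard published loss.

```latex
% Standard DPO term over a chosen/rejected response pair (y_w, y_l),
% conditioned on prompt x and image v:
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}\!\left[\log \sigma\!\Big(
      \beta \log \tfrac{\pi_\theta(y_w \mid x, v)}{\pi_{\mathrm{ref}}(y_w \mid x, v)}
    - \beta \log \tfrac{\pi_\theta(y_l \mid x, v)}{\pi_{\mathrm{ref}}(y_l \mid x, v)}
    \Big)\right]

% Hypothetical listwise term over K ranked visual contexts
% v_1 \succ \dots \succ v_K (e.g., a Plackett--Luce likelihood),
% scoring each image by how strongly it supports the grounded response y:
s_k = \beta \log \tfrac{\pi_\theta(y \mid x, v_k)}{\pi_{\mathrm{ref}}(y \mid x, v_k)},
\qquad
\mathcal{L}_{\mathrm{rank}}
  = -\,\mathbb{E}\!\left[\sum_{k=1}^{K}
      \log \frac{\exp(s_k)}{\sum_{j=k}^{K} \exp(s_j)}\right]

% A joint objective would then weight the two terms:
\mathcal{L}_{\mathrm{joint}} = \mathcal{L}_{\mathrm{DPO}} + \lambda\,\mathcal{L}_{\mathrm{rank}}
```

Under this reading, the ranking term supplies the fine-grained visual contrast the abstract highlights: instead of a single margin between one preferred and one dispreferred sample, the model must order multiple semantically similar but visually distinct images by their consistency with the text.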
Anthology ID:
2026.findings-eacl.5
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
79–94
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.5/
Cite (ACL):
Jiwon Kim and Hyunsoo Yoon. 2026. Joint Multimodal Preference Optimization for Fine-Grained Visual-Textual Alignment. In Findings of the Association for Computational Linguistics: EACL 2026, pages 79–94, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Joint Multimodal Preference Optimization for Fine-Grained Visual-Textual Alignment (Kim & Yoon, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.5.pdf
Checklist:
 2026.findings-eacl.5.checklist.pdf