Token-level Inference-Time Alignment for Vision-Language Models

Kejia Chen; Junjun Zheng; Jiawen Zhang; Manxi Lin; Xiao Pan; Jiacong Hu; Jian Lou; Zunlei Feng; Mingli Song

Abstract

Vision-Language Models (VLMs) often prioritize linguistic fluency over visual fidelity, leading to hallucinations where generated text contradicts the image. Countering this bias typically requires resource-heavy fine-tuning or high-latency verification methods that provide feedback only after the full response is generated. To overcome these limitations, we present a framework for Token-level Inference-Time Alignment (TITA) that steers the decoding process without updating the base model parameters. By training a lightweight reward model to capture visual preferences, TITA extracts implicit guidance through log-probability ratios. This approach functions as an inference-time adaptation of Direct Preference Optimization (DPO), injecting dense feedback to correct the output distribution at every generation step. Across diverse architectures including LLaVA-1.5, Qwen3-VL, and InternVL3.5, TITA consistently improves performance on 13 benchmarks. For example, TITA boosts LLaVA-1.5-7B by 8.6% on MMVet and achieves a 74.0 MMStar score with Qwen3-VL-8B. Notably, these gains incur negligible overhead (~0.2s per query), offering a superior trade-off between alignment effectiveness and efficiency. Our code is available at: https://github.com/Thecommonirin/TITA.
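
The abstract does not state the exact decoding rule, so the following is only a minimal sketch of what token-level guidance via log-probability ratios could look like. It assumes the guidance is an additive, DPO-style log-ratio between a reward model and a reference model, scaled by a coefficient. The function name `guided_next_token_logprobs`, the `beta` parameter, and the toy logits are all hypothetical and are not taken from the paper or its repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_next_token_logprobs(base_logits, reward_logits, ref_logits, beta=1.0):
    """Hypothetical guidance rule (an assumption, not the paper's formula).

    Adds beta * (log p_reward - log p_ref) to the frozen base model's
    next-token log-probabilities, then renormalises over the vocabulary.
    """
    base = log_softmax(base_logits)
    reward = log_softmax(reward_logits)
    ref = log_softmax(ref_logits)
    # DPO-style implicit reward for each candidate token.
    combined = [b + beta * (r - f) for b, r, f in zip(base, reward, ref)]
    return log_softmax(combined)


# Toy three-token vocabulary: the base model prefers token 0,
# while the reward model prefers token 1 relative to a uniform reference.
base_logits = [2.0, 1.0, 0.0]
reward_logits = [0.0, 2.0, 0.0]
ref_logits = [0.0, 0.0, 0.0]

guided = guided_next_token_logprobs(base_logits, reward_logits, ref_logits, beta=1.0)
best_token = max(range(len(guided)), key=lambda i: guided[i])
```

In this toy setting the base model alone would pick token 0, while the guided distribution picks token 1. Setting `beta=0.0` recovers the base distribution unchanged, which is consistent with the abstract's point that the base model parameters are never updated.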

Anthology ID: 2026.findings-acl.1253
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 25012–25029
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1253/
Cite (ACL): Kejia Chen, Junjun Zheng, Jiawen Zhang, Manxi Lin, Xiao Pan, Jiacong Hu, Jian Lou, Zunlei Feng, and Mingli Song. 2026. Token-level Inference-Time Alignment for Vision-Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25012–25029, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Token-level Inference-Time Alignment for Vision-Language Models (Chen et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1253.pdf
Checklist: 2026.findings-acl.1253.checklist.pdf