Yu Zhang

Harbin

Other people with similar names: Yu Zhang (Southeast University), Yu Zhang (Zhejiang), Yu Zhang (Oxford), Yu Zhang (SJTU), Yu Zhang, Yu Zhang, Yu Zhang, Yu Zhang, Yu Zhang (UIUC, Texas A&M), Yu Zhang (Harbin), Yu Zhang, Yu Zhang, Yu Zhang, Yu Zhang

Unverified author pages with similar names: Yu Zhang

2026


Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.

2025


InImageTrans: Multimodal LLM-based Text Image Machine Translation
Fei Zuo | Kehai Chen | Yu Zhang | Zhengshan Xue | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2025

Multimodal large language models (MLLMs) have shown remarkable capabilities across various downstream tasks. However, when MLLMs are transferred to the text image machine translation (TiMT) task, preliminary experiments reveal that they suffer from serious repetition and omission hallucinations. To alleviate these issues, this paper first designs an efficient MLLM named InImageTrans for TiMT and then proposes a simple and effective method named multi-conditional direct preference optimization (mcDPO) to advance TiMT. In particular, the proposed mcDPO not only guides the MLLM to reject repetitive output by automatically creating text output preference pairs, but also guides it to pay more attention to text information in images by creating image input preference pairs. Furthermore, we build a high-quality benchmark called MCiT for comprehensively evaluating the TiMT capabilities of InImageTrans. Experimental results show that the proposed method significantly outperforms existing open-source MLLMs on MCiT.

Co-authors

Zhengshan Xue 1

Jun Yu 1

Min Zhang 1

Fei Zuo 1

Venues

Findings 2
