2025
AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
Zhibin Lan | Liqiang Niu | Fandong Meng | Wenbo Li | Jie Zhou | Jinsong Su
Findings of the Association for Computational Linguistics: ACL 2025
Recently, large multimodal models (LMMs) have achieved significant advancements. When dealing with high-resolution images, dominant LMMs typically divide them into multiple local images and a global image, leading to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. Specifically, we first apply multiple pooling layers to obtain visual tokens at different granularities. Then we propose a visual granularity router, consisting of a Transformer layer, an MLP layer, and a voter layer, which selects the appropriate visual granularity based on the image and instruction. Furthermore, we put forward RGLF, a novel training paradigm that aligns the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks while significantly reducing the number of visual tokens and speeding up inference (e.g., an 85.3% reduction in visual tokens and a 2.53× increase in inference speed on the AI2D benchmark).
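The abstract above describes a multi-granularity pooling stage followed by a router built from a Transformer layer, an MLP layer, and a voter layer. The following is a minimal PyTorch sketch of that structure only, not the authors' implementation: module names, dimensions, and the voting scheme (averaging per-token scores) are assumptions for illustration.

# Hypothetical sketch of multi-granularity pooling plus a granularity router,
# in the spirit of the AVG-LLaVA description; sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityRouter(nn.Module):
    def __init__(self, dim=1024, num_granularities=4, num_heads=8):
        super().__init__()
        # Pooling layers produce progressively coarser visual token sets.
        self.pools = nn.ModuleList(
            [nn.AvgPool1d(kernel_size=2 ** i, stride=2 ** i) for i in range(num_granularities)]
        )
        # Router: a Transformer layer over [visual tokens; instruction tokens],
        # an MLP scoring head, and a voter that aggregates token-level scores.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_granularities))

    def forward(self, visual_tokens, instruction_tokens):
        # visual_tokens: (B, N, D); instruction_tokens: (B, M, D)
        fused = self.encoder(torch.cat([visual_tokens, instruction_tokens], dim=1))
        logits = self.mlp(fused)            # (B, N + M, G) per-token granularity scores
        votes = logits.mean(dim=1)          # voter layer: aggregate token votes
        return F.softmax(votes, dim=-1)     # probability over granularities

    def pooled_tokens(self, visual_tokens):
        # Candidate token sets at each granularity (coarser = fewer tokens).
        x = visual_tokens.transpose(1, 2)   # (B, D, N) for 1-D pooling
        return [p(x).transpose(1, 2) for p in self.pools]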
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
Zhibin Lan | Liqiang Niu | Fandong Meng | Jie Zhou | Jinsong Su
Findings of the Association for Computational Linguistics: EMNLP 2025
Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap between the similarity distributions of positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To deal with this issue, we propose a simple yet effective framework that dynamically improves the embedding model’s representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
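As a rough illustration of the idea described above (weighting negative pairs by how hard they are to distinguish), here is a hedged PyTorch sketch of a hardness-weighted InfoNCE loss. The exponential weighting scheme, the beta hyperparameter, and the function name are assumptions, not the released LLaVE training objective.

# Hypothetical hardness-weighted InfoNCE: harder negatives (higher similarity
# to the query) receive larger weights; positives sit on the diagonal.
import torch
import torch.nn.functional as F

def hardness_weighted_infonce(query_emb, target_emb, temperature=0.05, beta=2.0):
    # query_emb, target_emb: (B, D) L2-normalized embeddings
    sim = query_emb @ target_emb.t() / temperature           # (B, B) similarity matrix
    pos_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

    # Hardness weights: up-weight negative pairs with high similarity to the query.
    with torch.no_grad():
        weights = torch.exp(beta * sim)
        weights = weights.masked_fill(pos_mask, 1.0)          # positives keep weight 1

    exp_sim = torch.exp(sim) * weights
    pos = exp_sim.diagonal()
    loss = -torch.log(pos / exp_sim.sum(dim=1))
    return loss.mean()

# Usage with random embeddings:
# q = F.normalize(torch.randn(32, 512), dim=-1)
# t = F.normalize(torch.randn(32, 512), dim=-1)
# print(hardness_weighted_infonce(q, t))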
TIU-Bench: A Benchmark for Evaluating Large Multimodal Models on Text-rich Image Understanding
Kun Zhang | Liqiang Niu | Zhen Cao | Fandong Meng | Jie Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Text-rich images are ubiquitous in real-world applications, serving as a critical medium for conveying complex information and facilitating accessibility. Despite recent advances driven by Multimodal Large Language Models (MLLMs), existing benchmarks suffer from limited scale, fragmented scenarios, and evaluation protocols that fail to fully capture holistic image understanding. To address these gaps, we present TIU-Bench, a large-scale, multilingual benchmark comprising over 100,000 full-image annotations and 22,000 rigorously validated question-answer (QA) pairs that span 18 subtasks across diverse real-world scenarios. TIU-Bench introduces a novel full-image structured output format that jointly models geometric, textual, and relational information, enabling fine-grained evaluation of perception and reasoning capabilities. Furthermore, we propose a two-stage understanding framework named T2TIU, which first generates a structured representation of the entire image and subsequently conducts reasoning on this representation to address complex visual-textual queries. Extensive experiments on 10 state-of-the-art generative models highlight the challenges and opportunities in advancing text-rich image understanding. Our benchmark and framework provide a comprehensive platform for developing and evaluating next-generation multimodal AI systems.
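As a purely illustrative sketch of what a full-image structured annotation jointly covering geometric, textual, and relational information could look like, the Python dataclasses below are assumptions: TIU-Bench's actual schema and field names are not specified in the abstract.

# Hypothetical structured annotation for one text-rich image: text elements with
# geometry, plus relations between elements (e.g., reading order, key-value links).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TextElement:
    element_id: str
    text: str                          # recognized text content
    bbox: Tuple[int, int, int, int]    # geometric info: x1, y1, x2, y2 in pixels
    category: str = "paragraph"        # e.g., title, table_cell, caption

@dataclass
class Relation:
    source_id: str
    target_id: str
    relation_type: str                 # e.g., "reading_order", "key_value", "belongs_to"

@dataclass
class FullImageAnnotation:
    image_id: str
    language: str
    elements: List[TextElement] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

In a two-stage setup like the T2TIU framework described above, a model would first emit a representation of this kind for the whole image and then answer questions by reasoning over it.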
2024
Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation
Zhibin Lan | Liqiang Niu | Fandong Meng | Jie Zhou | Min Zhang | Jinsong Su
Findings of the Association for Computational Linguistics: ACL 2024
UMTIT: Unifying Recognition, Translation, and Generation for Multimodal Text Image Translation
Liqiang Niu | Fandong Meng | Jie Zhou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Prior research in Image Machine Translation (IMT) has focused on translating the source image either solely into target-language text or exclusively into a target image. As a result, the former approach cannot generate target images, while the latter cannot produce target text. In this paper, we present a Unified Multimodal Text Image Translation (UMTIT) model that not only translates text images into the target language but also generates consistent target images. The UMTIT model consists of two image-text modality conversion steps: the first step converts images to text to recognize the source text and generate translations, while the second step transforms text to images to create target images based on the translations. Due to the limited availability of public datasets, we have constructed two multimodal image translation datasets. Experimental results show that our UMTIT model is versatile enough to handle tasks across multiple modalities and outperforms previous methods. Notably, UMTIT surpasses the state-of-the-art TrOCR in text recognition tasks, achieving a lower Character Error Rate (CER); it also outperforms cascading methods in text translation tasks, obtaining a higher BLEU score; and, most importantly, UMTIT can generate high-quality target text images.
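The two modality-conversion steps described above can be pictured as a simple pipeline. The Python sketch below is a hypothetical outline only; the class and method names (recognize_and_translate, generate) are assumptions and do not reflect the UMTIT codebase.

# Hypothetical two-step pipeline: (1) image -> text (recognition + translation),
# (2) text -> image (rendering the translation as a target image).
from dataclasses import dataclass
from typing import Any

@dataclass
class TranslationResult:
    source_text: str
    target_text: str
    target_image: Any   # e.g., a PIL.Image in a concrete implementation

class TextImageTranslationPipeline:
    def __init__(self, image_to_text_model, text_to_image_model):
        self.i2t = image_to_text_model   # recognizes source text and translates it
        self.t2i = text_to_image_model   # renders the translation as a target image

    def translate(self, source_image, target_lang="en"):
        # Step 1: image -> text (recognition and translation)
        source_text, target_text = self.i2t.recognize_and_translate(source_image, target_lang)
        # Step 2: text -> image (generate a target image consistent with the source layout)
        target_image = self.t2i.generate(target_text, reference=source_image)
        return TranslationResult(source_text, target_text, target_image)

In practice the two submodels would be loaded from trained checkpoints; the sketch only shows how the recognition/translation output of the first step feeds the image-generation step.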