UnCo: Uncertainty-Driven Collaborative Framework of Large and Small Models for Grounded Multimodal NER

Jielong Tang; Yang Yang (杨阳); Jianxing Yu; Zhen-Xing Wang; Haoyuan Liang; Liang Yao; Jian Yin

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) is a new information extraction task. It requires models to extract named entities and ground them to real-world visual objects. Previous methods, which rely on domain-specific fine-tuning, struggle with unseen multimodal entities because of their limited knowledge and generalization. Recently, multimodal large language models (MLLMs) have demonstrated strong open-set abilities. However, their performance is hindered by a lack of in-domain knowledge, because training them on GMNER datasets is costly. To address these limitations, we propose **UnCo**, a two-stage Uncertainty-driven Collaborative framework that leverages the complementary strengths of small fine-tuned models and MLLMs. Specifically, **in stage one**, we equip the small model with a unified uncertainty estimation (UE) method for multimodal entities. This enables the small model to express "I do not know" when it encounters unseen entities beyond its capabilities. Predictions with high uncertainty are then filtered out and delegated to the MLLM. **In stage two**, an Uncertainty-aware Hierarchical Correction mechanism guides the MLLM to refine uncertain predictions using its open-domain knowledge. Ultimately, UnCo retains the in-domain knowledge of small models while using the capabilities of MLLMs to handle unseen samples. Extensive experiments demonstrate UnCo’s effectiveness on two GMNER benchmarks.
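
The two-stage routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the unified uncertainty estimate is computed, so normalized predictive entropy is used here as a stand-in, and the names `normalized_entropy`, `collaborate`, `refine`, and the `threshold` value are all hypothetical.

```python
import math
from typing import Callable, Dict, List

# A prediction from the small model: an entity span plus class probabilities.
# This record layout is an assumption made for illustration only.
Prediction = Dict[str, object]


def normalized_entropy(probs: List[float]) -> float:
    """Shannon entropy of a distribution, scaled to [0, 1].

    Stand-in uncertainty score; the paper's unified UE may differ.
    """
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def collaborate(
    predictions: List[Prediction],
    refine: Callable[[Prediction], Prediction],
    threshold: float = 0.5,
) -> List[Prediction]:
    """Keep confident small-model predictions; delegate uncertain ones.

    `refine` is a placeholder for the stage-two MLLM correction step.
    Input order is preserved.
    """
    return [
        refine(p) if normalized_entropy(p["probs"]) > threshold else p
        for p in predictions
    ]


if __name__ == "__main__":
    preds = [
        {"span": "Paris", "probs": [0.97, 0.01, 0.01, 0.01]},
        {"span": "Zorblax", "probs": [0.25, 0.25, 0.25, 0.25]},
    ]
    # Hypothetical refiner that only marks the prediction as corrected.
    result = collaborate(preds, lambda p: {**p, "refined": True})
    for r in result:
        print(r["span"], r.get("refined", False))
```

In this sketch, the peaked distribution for "Paris" falls below the threshold and is kept as is, while the uniform distribution for "Zorblax" is passed to the refiner. In practice the threshold would be tuned on held-out data.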

Anthology ID: 2025.emnlp-main.388
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 7644–7662
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.388/
Cite (ACL): Jielong Tang, Yang Yang, Jianxing Yu, Zhen-Xing Wang, Haoyuan Liang, Liang Yao, and Jian Yin. 2025. UnCo: Uncertainty-Driven Collaborative Framework of Large and Small Models for Grounded Multimodal NER. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7644–7662, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): UnCo: Uncertainty-Driven Collaborative Framework of Large and Small Models for Grounded Multimodal NER (Tang et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.388.pdf
Checklist: 2025.emnlp-main.388.checklist.pdf
