KnowDR-REC: Auditing Knowledge-Conditioned Visual Grounding in Referring Expression Comprehension

Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Linyi Yang, Guoyang Liu


Abstract
While Multimodal Large Language Models (MLLMs) have demonstrated the capacity for multimodal reasoning, current Referring Expression Comprehension (REC) benchmarks lag behind: they rely predominantly on intra-image cues and neglect external world knowledge, which significantly impedes the evolution of REC toward real-world applications. This limitation obscures a model’s true capability to conduct textual reasoning (entity resolution), resolve spatial location (visual grounding), and verify reference validity (hallucination rejection). To address this, we introduce KnowDR-REC, a targeted audit benchmark comprising 1,042 positive triplets derived from real-world knowledge, along with rigorously matched negative samples. Unlike traditional datasets, KnowDR-REC implements a controllable counterfactual evaluation mechanism that subjects textual expressions to single-factor perturbations (entity, relation, or time) to test sensitivity to fine-grained factual changes. Extensive evaluation of 18 state-of-the-art MLLMs exposes a critical “binding hallucination,” revealing that current high performance is often built on fragile visual shortcuts rather than true understanding. KnowDR-REC thus serves as a pivotal diagnostic instrument, steering future research toward the genuine integration of perception and reasoning.
Anthology ID:
2026.findings-acl.1923
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
38607–38629
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1923/
Cite (ACL):
Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Linyi Yang, and Guoyang Liu. 2026. KnowDR-REC: Auditing Knowledge-Conditioned Visual Grounding in Referring Expression Comprehension. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38607–38629, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
KnowDR-REC: Auditing Knowledge-Conditioned Visual Grounding in Referring Expression Comprehension (Jin et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1923.pdf
Checklist:
 2026.findings-acl.1923.checklist.pdf