Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Hongcheng Liu; Yuhao Wang; Zhe Chen; Pingjie Wang; Zhiyuan Zhu; Yixuan Hou; Yanfeng Wang; Yu Wang

Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu, Yixuan Hou, Yanfeng Wang, Yu Wang

Abstract

Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.

Anthology ID:: 2026.acl-long.1217
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 26430–26453
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1217/
DOI:
Bibkey:
Cite (ACL):: Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu, Yixuan Hou, Yanfeng Wang, and Yu Wang. 2026. Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26430–26453, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs (Liu et al., ACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1217.pdf
Checklist:: 2026.acl-long.1217.checklist.pdf

PDF Cite Search Checklist Fix data