Zitong Yu
Also published as: Zitong YU
2026
Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification
Jiayu Zhang | Shuo Ye | Qilang Ye | Zihan Song | Jiajian Huang | Zitong YU
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent Audio-Visual Question Answering (AVQA) methods have advanced significantly. However, most AVQA methods lack effective mechanisms for handling missing modalities, suffering from severe performance degradation in real-world scenarios with data interruptions. Furthermore, prevailing methods for handling missing modalities predominantly rely on generative imputation to synthesize missing features. While partially effective, these methods tend to capture inter-modal commonalities but struggle to acquire unique, modality-specific knowledge within the missing data, leading to hallucinations and compromised reasoning accuracy. To tackle these challenges, we propose R2ScP, a novel framework that shifts the paradigm of missing modality handling from traditional generative imputation to retrieval-based recovery. Specifically, we leverage cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge. To maximize semantic restoration, we introduce a context-aware adaptive purification mechanism that eliminates latent semantic noise within the retrieved data. Additionally, we employ a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Extensive experiments demonstrate that R2ScP significantly improves AVQA performance and enhances robustness in modality-incomplete scenarios.
2025
Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units
Chao Hao | Zezheng Wang | Yanhua Huang | Ruiwen Xu | Wenzhe Niu | Xin Liu | Zitong Yu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper investigates the enhancement of reasoning capabilities in language models through token-level multi-model collaboration. Our approach selects the optimal tokens from the next-token distributions provided by multiple models to perform autoregressive reasoning. Contrary to the assumption that more models yield better results, we introduce a distribution-distance-based dynamic selection strategy (DDS) to optimize the multi-model collaboration process. To address the critical challenge of vocabulary misalignment in multi-model collaboration, we propose the concept of minimal complete semantic units (MCSU), which is simple yet enables multiple language models to achieve natural alignment within the linguistic space. Experimental results across various benchmarks demonstrate the superiority of our method. The code will be released soon.