Yuankai Qi
2026
The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
Mingkai Tian | Guorong Li | Yuankai Qi | Anton Van Den Hengel | Qingming Huang
Findings of the Association for Computational Linguistics: EACL 2026
Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract video-informed text prompts to guide language models in generating captions. However, by using representations at a single granularity (e.g., noun phrases or full sentences), these methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding specific topics, to promote prompt diversity while ensuring visual relevance. Extensive experiments in both in-domain and cross-domain settings demonstrate that the proposed method consistently outperforms state-of-the-art approaches.
2024
StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
Gaoxiang Cong | Yuankai Qi | Liang Li | Amin Beheshti | Zhedong Zhang | Anton Hengel | Ming-Hsuan Yang | Chenggang Yan | Qingming Huang
Findings of the Association for Computational Linguistics: ACL 2024
Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate speech that aligns well with the video in both time and emotion, based on the tone of a reference audio track. Existing state-of-the-art V2C models break the phonemes in the script according to the divisions between video frames, which solves the temporal alignment problem but leads to incomplete phoneme pronunciation and poor identity stability. To address this problem, we propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level. It contains three main components: (1) a multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync. Extensive experiments on two of the primary benchmarks, V2C and Grid, demonstrate the favorable performance of the proposed method as compared to the current state-of-the-art. The code will be made available at https://github.com/GalaxyCong/StyleDubber.
2022
Diagnosing Vision-and-Language Navigation: What Really Matters
Wanrong Zhu | Yuankai Qi | Pradyumna Narayana | Kazoo Sone | Sugato Basu | Xin Wang | Qi Wu | Miguel Eckstein | William Yang Wang
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments. Multiple setups have been proposed, and researchers apply new model architectures or training techniques to boost navigation performance. However, there still exist non-negligible gaps between machines’ performance and human benchmarks. Moreover, the agents’ inner mechanisms for navigation decisions remain unclear. To the best of our knowledge, how the agents perceive the multimodal input is under-studied and needs investigation. In this work, we conduct a series of diagnostic experiments to unveil agents’ focus during navigation. Results show that indoor navigation agents refer to both object and direction tokens when making decisions. In contrast, outdoor navigation agents rely heavily on direction tokens and poorly understand the object tokens. Transformer-based agents acquire a better cross-modal understanding of objects and display stronger numerical reasoning ability than non-Transformer-based agents. When it comes to vision-and-language alignments, many models claim that they can align object tokens with specific visual targets. We find unbalanced attention on the vision and text input and doubt the reliability of such cross-modal alignments.