Jinxing Zhou
2026
MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation
Yanghao Zhou | Haitian Li | Rexar Lin | Heyan Huang | Jinxing Zhou | Changsen Yuan | Tian Lan | Ziqin Zhou | Yudong Li | Jiajun Xu | Jingyun Liao | YiMing Cheng | Xuefeng Chen | Xian-Ling Mao | Yousheng Feng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yanghao Zhou | Haitian Li | Rexar Lin | Heyan Huang | Jinxing Zhou | Changsen Yuan | Tian Lan | Ziqin Zhou | Yudong Li | Jiajun Xu | Jingyun Liao | YiMing Cheng | Xuefeng Chen | Xian-Ling Mao | Yousheng Feng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize audio-visual videos with multi-participant dialogues. However, existing evaluation benchmarks remain largely designed for human-recorded videos or single-speaker settings. As a result, structural failures in generated multi-talker dialogue videos, such as identity drift, unnatural turn transitions, and audio-visual misalignment, cannot be effectively diagnosed. To address this issue, we introduce MTAVG-Bench, a failure-driven diagnostic benchmark for multi-talker dialogue-centric audio-video generation. MTAVG-Bench is built via a semi-automatic pipeline, where 1.8k videos are generated using mainstream T2AV models with carefully designed prompts, yielding 2.4k manually annotated QA pairs for fine-grained failure diagnosis. The benchmark evaluates multi-speaker dialogue generation at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. Built on a hierarchical failure taxonomy and a targeted QA protocol, MTAVG-Bench is primarily designed to evaluate whether proprietary and open-source omni-models can reliably identify failure modes in multi-speaker T2AV outputs. We benchmark 12 proprietary and open-source omni-models on MTAVG-Bench, with Gemini 3 Pro achieving the strongest overall performance, while leading open-source models remain competitive in signal fidelity and consistency. Overall, MTAVG-Bench enables fine-grained failure analysis for rigorous model comparison and targeted video generation refinement.
2025
MAviS: A Multimodal Conversational Assistant For Avian Species
Yevheniia Kryklyvets | Mohammed Irfan Kurpath | Sahal Shaji Mullappilly | Jinxing Zhou | Fahad Shahbaz Khan | Rao Muhammad Anwer | Salman Khan | Hisham Cholakkal
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Yevheniia Kryklyvets | Mohammed Irfan Kurpath | Sahal Shaji Mullappilly | Jinxing Zhou | Fahad Shahbaz Khan | Rao Muhammad Anwer | Salman Khan | Hisham Cholakkal
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Fine-grained understanding and species-specific, multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models (MM-LLMs) face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the **MAviS-Dataset**, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question–answer pairs. Building on the MAviS-Dataset, we introduce **MAviS-Chat**, a multimodal LLM that supports audio, vision, and text designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present **MAviS-Bench**, a benchmark of over 25,000 Q&A pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive MM-LLMs for ecological applications. Our code, training data, evaluation benchmark, and models are available at https://github.com/yevheniia-uv/MAviS.