Chenxing Li
2026
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
Jiliang Hu | Wenfu Wang | Zuchao Li | Chenxing Li | Yiyang Zhao | Hanzhao Li | Liqiang Zhang | Meng Yu | Dong Yu
Findings of the Association for Computational Linguistics: ACL 2026
While large audio language models (LALMs) have driven significant progress in multimodal conversational systems, current benchmarks suffer from critical limitations: they are largely English-centric, use synthetic speech, and fail to provide comprehensive, discriminative evaluation across key dimensions. To fill this gap, we present Voice Chat Bot Bench (VCB Bench), a novel, high-quality Chinese benchmark built exclusively on real human speech. VCB Bench assesses LALMs across three complementary axes: instruction following (including speech-level control beyond text commands), knowledge understanding (including general knowledge, reasoning, and daily dialogue), and robustness (evaluating stability under variations in content, environment, and speaker characteristics). Experiments conducted on representative LALMs reveal notable performance disparities and offer tangible insights for future improvements. VCB Bench serves as a reproducible and fine-grained framework, providing standardized evaluation and practical guidance for the development of Chinese voice conversational models.
2025
Enhancing Multimodal Continual Instruction Tuning with BranchLoRA
Duzhen Zhang | Yong Ren | Zhong-Zhi Li | Yahan Yu | Jiahua Dong | Chenxing Li | Zhilong Ji | Jinfeng Bai
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Continual Instruction Tuning (MCIT) aims to finetune Multimodal Large Language Models (MLLMs) to continually align with human intent across sequential tasks. Existing approaches often rely on the Mixture-of-Experts (MoE) LoRA framework to preserve previous instruction alignments. However, these methods are prone to Catastrophic Forgetting (CF), as they aggregate all LoRA blocks via simple summation, which compromises performance over time. In this paper, we identify a critical parameter inefficiency in the MoELoRA framework within the MCIT context. Based on this insight, we propose BranchLoRA, an asymmetric framework to enhance both efficiency and performance. To mitigate CF, we introduce a flexible tuning-freezing mechanism within BranchLoRA, enabling branches to specialize in intra-task knowledge while fostering inter-task collaboration. Moreover, we incrementally incorporate task-specific routers to ensure an optimal branch distribution over time, rather than favoring the most recent task. To streamline inference, we introduce a task selector that automatically routes test inputs to the appropriate router without requiring task identity. Extensive experiments on the latest MCIT benchmark demonstrate that BranchLoRA significantly outperforms MoELoRA and maintains its superiority across various MLLM sizes.
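The contrast the abstract draws between aggregating LoRA blocks by simple summation and steering them with a router can be illustrated with a toy sketch. This is not the BranchLoRA implementation: the function names, the softmax gating, and the tiny hand-written matrices are all assumptions made for illustration, and the abstract does not specify how the routers weight branches.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_delta(block, x):
    """Low-rank update B @ (A @ x) for one LoRA block given as (A, B)."""
    a, b = block
    return matvec(b, matvec(a, x))

def sum_aggregate(blocks, x):
    """Simple summation: every block contributes with weight 1."""
    deltas = [lora_delta(blk, x) for blk in blocks]
    return [sum(vals) for vals in zip(*deltas)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routed_aggregate(blocks, x, router_logits):
    """Hypothetical routed variant: blocks are weighted by softmax gates."""
    weights = softmax(router_logits)
    deltas = [lora_delta(blk, x) for blk in blocks]
    return [sum(w * d for w, d in zip(weights, vals)) for vals in zip(*deltas)]

# Two rank-1 blocks over a 2-dimensional input (illustrative values only).
blocks = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # passes the first input coordinate
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # passes the second input coordinate
]
x = [2.0, 3.0]

summed = sum_aggregate(blocks, x)
routed = routed_aggregate(blocks, x, [0.0, 0.0])
```

With summation, each block's update is always added in full, so a block trained on a later task alters the output for earlier tasks as well. In the routed variant the gate values decide how much each block contributes for a given input, which is the kind of per-task control the task-specific routers are described as providing.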
2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang | Yahan Yu | Jiahua Dong | Chenxing Li | Dan Su | Chenhui Chu | Dong Yu
Findings of the Association for Computational Linguistics: ACL 2024
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a [real-time tracking website](https://mm-llms.github.io/) for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.