Jiliang Hu
2026
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
Jiliang Hu | Wenfu Wang | Zuchao Li | Chenxing Li | Yiyang Zhao | Hanzhao Li | Liqiang Zhang | Meng Yu | Dong Yu
Findings of the Association for Computational Linguistics: ACL 2026
While large audio language models (LALMs) have driven significant progress in multimodal conversational systems, current benchmarks suffer from critical limitations: they are largely English-centric, use synthetic speech, and fail to provide comprehensive, discriminative evaluation across key dimensions. To fill this gap, we present Voice Chat Bot Bench (VCB Bench), a novel, high-quality Chinese benchmark built exclusively on real human speech. VCB Bench assesses LALMs across three complementary axes: instruction following (including speech-level control beyond text commands), knowledge understanding (including general knowledge, reasoning, and daily dialogue), and robustness (evaluating stability under variations in content, environment, and speaker characteristics). Experiments conducted on representative LALMs reveal notable performance disparities and offer tangible insights for future improvements. VCB Bench serves as a reproducible and fine-grained framework, providing standardized evaluation and practical guidance for the development of Chinese voice conversational models.
2024
VHASR: A Multimodal Speech Recognition System With Vision Hotwords
Jiliang Hu | Zuchao Li | Ping Wang | Haojun Ai | Lefei Zhang | Hai Zhao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Image-based multimodal automatic speech recognition (ASR) models enhance speech recognition performance by incorporating audio-related images. However, some works suggest that introducing image information into the model does not help improve ASR performance. In this paper, we propose a novel approach that effectively utilizes audio-related image information and build VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model’s speech recognition capability. Our system uses a dual-stream architecture, which first transcribes the text on the two streams separately and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model’s speech recognition ability. Its performance not only surpasses unimodal ASR but also achieves state-of-the-art (SOTA) results among existing image-based multimodal ASR systems.