Wenqian Cui

2025

Text-based Large Language Models (LLMs) have recently gained significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, highlighting the need for voice-based models. In this context, Speech Language Models (SpeechLMs)—foundation models designed to understand and generate speech—emerge as a promising solution for end-to-end speech interaction. This survey offers a comprehensive overview of recent approaches to building SpeechLMs, outlining their core architectural components, training methodologies, evaluation strategies, and the challenges and potential directions for future research in this rapidly advancing field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey

VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models
Wenqian Cui | Xiaoqi Jiao | Ziqiao Meng | Irwin King
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs’ knowledge understanding because they cannot support end-to-end speech evaluation or account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs’ knowledge understanding through pure speech interactions. Our benchmark uniquely maintains speech format for both inputs and outputs, evaluates model robustness across diverse input audio conditions, and pioneers the assessment of complex tasks like mathematical reasoning in spoken format. Through systematic evaluation, we demonstrate that current SLMs exhibit poor performance on VoxEval, show sensitivity to varying audio conditions, and possess limited reasoning capabilities, highlighting critical areas for future development. The VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval

Co-authors

Dianzhi Yu 1

Guangyan Zhang 1

Venues

ACL 2
