VocalNet: Speech LLMs with Multi-Token Prediction for Faster and High-Quality Generation

Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang


Abstract
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. In this work, we introduce VocalNet, a series of high-performance speech LLMs featuring a scalable and model-agnostic training framework as well as a novel multi-token prediction (MTP) paradigm for speech generation. We first propose an efficient two-stage training framework that enables LLMs to acquire real-time speech interaction capabilities. Through extensive experiments on various training configurations, we ensure both simplicity and effectiveness in the training strategy. Furthermore, inspired by advances in language modeling, we introduce MTP into the domain of speech LLMs—an alternative to traditional next-token prediction (NTP)—which enables the model to predict multiple future tokens at each step. Through systematic analysis and improved implementation, we show that MTP not only accelerates inference speed but also significantly enhances speech quality. Experimental results demonstrate that VocalNet achieves performance comparable to state-of-the-art Omni LLMs while outperforming existing open-source speech LLMs, despite using limited training data.
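The inference speedup from MTP comes from emitting several tokens per forward pass instead of one. The toy sketch below is not the authors' implementation; it is a hypothetical illustration in which a random linear map stands in for the model trunk and K prediction heads, and greedy decoding appends K tokens per pass.

```python
import numpy as np

VOCAB = 16          # toy vocabulary size (assumption)
K = 4               # number of MTP heads (assumption)

rng = np.random.default_rng(0)
# Hypothetical stand-in for a speech LLM trunk plus K prediction heads:
# a single forward pass yields K logit vectors, one per future position.
W = rng.standard_normal((K, VOCAB, VOCAB))

def mtp_step(last_token):
    """One forward pass: greedily predict the next K tokens at once."""
    onehot = np.zeros(VOCAB)
    onehot[last_token] = 1.0
    logits = W @ onehot                # shape (K, VOCAB)
    return list(logits.argmax(axis=1))

def generate(start_token, length):
    """Generate `length` tokens, counting forward passes."""
    tokens, passes = [start_token], 0
    while len(tokens) - 1 < length:
        tokens += mtp_step(tokens[-1])
        passes += 1
    return tokens[1:length + 1], passes

toks, passes = generate(0, 12)
print(passes)  # 3 passes for 12 tokens with K=4, versus 12 passes under NTP
```

With K heads, generating T tokens takes ceil(T/K) forward passes rather than T, which is the source of the latency reduction the abstract describes; the quality gains reported in the paper come from the training-side formulation, which this sketch does not model.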
Anthology ID:
2025.emnlp-main.989
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
19595–19612
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.989/
Cite (ACL):
Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, and Yu Wang. 2025. VocalNet: Speech LLMs with Multi-Token Prediction for Faster and High-Quality Generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19595–19612, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
VocalNet: Speech LLMs with Multi-Token Prediction for Faster and High-Quality Generation (Wang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.989.pdf
Checklist:
2025.emnlp-main.989.checklist.pdf