Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs

Dingdong Wang, Junan Li, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen M. Meng


Abstract
With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficient comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.
Anthology ID:
2025.emnlp-main.1266
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
24924–24935
Language:
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1266/
DOI:
Bibkey:
Cite (ACL):
Dingdong Wang, Junan Li, Mingyu Cui, Dongchao Yang, Xueyuan Chen, and Helen M. Meng. 2025. Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24924–24935, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs (Wang et al., EMNLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1266.pdf
Checklist:
 2025.emnlp-main.1266.checklist.pdf