When Large Language Models Meet Speech: A Survey on Integration Approaches

Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, Chenhui Chu


Abstract
Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably the speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for future research.
Anthology ID:
2025.findings-acl.1041
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
Publisher:
Association for Computational Linguistics
Pages:
20298–20315
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1041/
Cite (ACL):
Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, and Chenhui Chu. 2025. When Large Language Models Meet Speech: A Survey on Integration Approaches. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20298–20315, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
When Large Language Models Meet Speech: A Survey on Integration Approaches (Yang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1041.pdf