Soundwave: Less is More for Speech-Text Alignment in LLMs

Yuhao Zhang (张裕浩); Zhiheng Liu; Fan Bu; Ruiyu Zhang; Benyou Wang; Haizhou Li

Abstract

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms other advanced speech LLMs in speech translation and AIR-Bench speech tasks with only a fraction of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation.

Anthology ID: 2025.acl-long.917
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 18718–18738
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.917/
Cite (ACL): Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, and Haizhou Li. 2025. Soundwave: Less is More for Speech-Text Alignment in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18718–18738, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Soundwave: Less is More for Speech-Text Alignment in LLMs (Zhang et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.917.pdf
