Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs

Keqi Deng, Guangzhi Sun, Phil Woodland


Abstract
Wav2Prompt is proposed, which allows straightforward integration of spoken input with a text-based large language model (LLM). Wav2Prompt uses a simple training process that requires only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt maps speech into continuous representations that are used as LLM prompts. To avoid the task over-fitting issues found in prior work and to preserve the emergent abilities of LLMs, Wav2Prompt takes LLM token embeddings as the training targets and utilises a continuous integrate-and-fire mechanism for explicit speech-text alignment. Therefore, a Wav2Prompt-LLM combination can be applied to zero-shot spoken language tasks such as speech translation (ST), spoken language understanding (SLU), and spoken-query-based question answering (SQQA). It is shown that for these zero-shot tasks, Wav2Prompt performs similarly to an ASR-LLM cascade and better than recent prior work. If relatively small amounts of task-specific paired data are available, the Wav2Prompt-LLM combination can be end-to-end (E2E) fine-tuned, and then yields greatly improved results relative to an ASR-LLM cascade for the above tasks. For instance, for English-French ST, a Wav2Prompt-LLM combination gave an increase of 5 BLEU points over an ASR-LLM cascade.
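To make the training objective concrete, the following is a minimal sketch (not the authors' released code; all module and variable names are illustrative assumptions) of the two ideas the abstract describes: a continuous integrate-and-fire (CIF) step that compresses frame-level speech encoder outputs into token-level vectors, and an L2 loss that pulls those vectors towards the frozen LLM's token embeddings so that the learned prompt lives in the LLM input space.

import torch
import torch.nn.functional as F

def cif(encoder_out: torch.Tensor, alpha: torch.Tensor,
        threshold: float = 1.0) -> torch.Tensor:
    """Continuous integrate-and-fire over one utterance.

    encoder_out: (T, D) frame-level speech encoder outputs.
    alpha:       (T,)   per-frame weights in [0, 1], e.g. from a sigmoid head.
    Returns a (U, D) tensor with one vector per fired (token-level) unit.
    """
    fired = []
    acc = 0.0
    state = encoder_out.new_zeros(encoder_out.size(1))
    for t in range(encoder_out.size(0)):
        a = float(alpha[t])
        if acc + a < threshold:
            # Keep integrating: add this frame's weighted contribution.
            acc += a
            state = state + a * encoder_out[t]
        else:
            # Fire: use part of this frame's weight to complete the unit,
            # then carry the remainder over to start the next unit.
            used = threshold - acc
            fired.append(state + used * encoder_out[t])
            acc = a - used
            state = acc * encoder_out[t]
    if fired:
        return torch.stack(fired)
    return encoder_out.new_zeros(0, encoder_out.size(1))

def wav2prompt_loss(encoder_out, alpha, token_ids, llm_embedding):
    """L2 loss between fired CIF vectors and the (frozen) LLM token
    embeddings of the ASR transcript: explicit speech-text alignment."""
    prompts = cif(encoder_out, alpha)              # (U, D)
    targets = llm_embedding(token_ids).detach()    # (U', D), frozen table
    u = min(prompts.size(0), targets.size(0))      # guard a length mismatch
    return F.mse_loss(prompts[:u], targets[:u])

At inference, the fired vectors simply take the place of text token embeddings in the LLM prompt, which is why the combination can be used zero-shot or E2E fine-tuned on a downstream task.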
Anthology ID:
2025.naacl-long.354
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
6940–6956
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-long.354/
Cite (ACL):
Keqi Deng, Guangzhi Sun, and Phil Woodland. 2025. Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6940–6956, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Wav2Prompt: End-to-End Speech Prompt Learning and Task-based Fine-tuning for Text-based LLMs (Deng et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-long.354.pdf