Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?

Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli


Abstract
The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future work on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.
Anthology ID:
2024.acl-long.789
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
14760–14778
URL:
https://aclanthology.org/2024.acl-long.789
DOI:
10.18653/v1/2024.acl-long.789
Cite (ACL):
Marco Gaido, Sara Papi, Matteo Negri, and Luisa Bentivogli. 2024. Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14760–14778, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing? (Gaido et al., ACL 2024)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2024.acl-long.789.pdf
Video:
https://preview.aclanthology.org/add_acl24_videos/2024.acl-long.789.mp4