LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models
Zuhair Hasan Shaik, Pradyoth Hegde, Prashant Bannulmath, Deepak K T
Abstract
Integrating speech and text capabilities into large language models (LLMs) is a challenging task. We present Large Rank Adaptation (LaRA) for effective cross-modal integration of speech and text in the LLM framework. Unlike conventional LoRA, our method requires significantly larger ranks, comparable to the pretrained weight dimensions, to accommodate the complexities of speech-text cross-modal learning. The approach uses HuBERT to convert speech into discrete tokens and fine-tunes the pretrained LLM to adapt to cross-modal inputs and outputs. A HiFi-GAN vocoder synthesizes speech waveforms from the generated speech units. The initial studies use the LibriSpeech corpus to teach the model the relationships between speech and text, and DailyTalk, a corpus of dialog conversations, to adapt the model for interaction. The proposed work demonstrates adaptation for spoken and text conversations, and the framework can be easily extended to other cross-modal applications.
- Anthology ID:
- 2024.findings-emnlp.480
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8201–8211
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.480/
- DOI:
- 10.18653/v1/2024.findings-emnlp.480
- Cite (ACL):
- Zuhair Hasan Shaik, Pradyoth Hegde, Prashant Bannulmath, and Deepak K T. 2024. LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8201–8211, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models (Shaik et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.480.pdf
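The abstract's central idea is a LoRA-style low-rank update whose rank is chosen comparable to the pretrained weight dimensions rather than much smaller. A minimal sketch of that update is below; the dimensions, initialization, and scaling are illustrative assumptions (the paper's actual configuration is not reproduced here), and the code shows only the generic form W' = W + (alpha / r) * B A shared by LoRA-family methods.

```python
import numpy as np

def adapted_weight(W, A, B, alpha, r):
    """LoRA-style update: frozen weight W plus a scaled trainable
    low-rank (here, large-rank) correction B @ A."""
    return W + (alpha / r) * (B @ A)

d_out, d_in = 64, 64
r = 48  # "large" rank, comparable to the weight dimensions (hypothetical value)
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-initialized

W_adapted = adapted_weight(W, A, B, alpha=float(r), r=r)

# With B zero-initialized, the adapted weight starts equal to the original,
# so adaptation begins from the pretrained model's behavior.
assert np.allclose(W_adapted, W)
```

In conventional LoRA, r would be far smaller than d_in (e.g. 8 or 16); the paper's claim is that speech-text cross-modal learning needs r on the order of the weight dimensions, which this sketch parameterizes but does not validate.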