MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos


Abstract
Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones.
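The alignment idea in the abstract, a single trainable projection between a frozen music encoder and a frozen LLM, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy dimensions, the random initialisation, and the choice to prepend projected music frames to the prompt embeddings are all assumptions made for illustration, and the encoder and LLM are replaced by stand-in lists of numbers.

```python
import random


def make_projection(d_music, d_llm, seed=0):
    """Create weights and bias for one linear layer (the only trainable part)."""
    rng = random.Random(seed)
    scale = 1.0 / d_music ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(d_music)]
               for _ in range(d_llm)]
    bias = [0.0] * d_llm
    return weights, bias


def project(frames, weights, bias):
    """Map each music frame (length d_music) to an LLM-sized vector (length d_llm)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weights, bias)]
        for frame in frames
    ]


def build_llm_input(music_frames, prompt_embeddings, weights, bias):
    """Prepend projected music frames to the text prompt embeddings (assumed layout)."""
    return project(music_frames, weights, bias) + prompt_embeddings


# Hypothetical toy sizes; real encoder and LLM dimensions are much larger.
D_MUSIC, D_LLM = 4, 6
weights, bias = make_projection(D_MUSIC, D_LLM)

# Stand-in for the output of a frozen music encoder: 3 frames of size D_MUSIC.
music_frames = [[0.1 * (i + j) for j in range(D_MUSIC)] for i in range(3)]
# Stand-in for the frozen LLM's embeddings of a 2-token text prompt.
prompt_embeddings = [[0.0] * D_LLM for _ in range(2)]

llm_input = build_llm_input(music_frames, prompt_embeddings, weights, bias)
print(len(llm_input), len(llm_input[0]))  # prints "5 6"
```

In this sketch only `weights` and `bias` would be updated during training; the stand-ins for the encoder output and the LLM embeddings stay fixed, which mirrors the frozen components described above.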
Anthology ID:
2024.findings-naacl.231
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3643–3655
URL:
https://aclanthology.org/2024.findings-naacl.231
Cite (ACL):
Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, and Emmanouil Benetos. 2024. MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3643–3655, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response (Deng et al., Findings 2024)
PDF:
https://preview.aclanthology.org/naacl24-info/2024.findings-naacl.231.pdf
Copyright:
 2024.findings-naacl.231.copyright.pdf