Emmanouil Benetos


2024

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
Zihao Deng | Yinghao Ma | Yudong Liu | Rongchen Guo | Ge Zhang | Wenhu Chen | Wenhao Huang | Emmanouil Benetos
Findings of the Association for Computational Linguistics: NAACL 2024

Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate MusiLingo's competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones.

ChatMusician: Understanding and Generating Music Intrinsically with LLM
Ruibin Yuan | Hanfeng Lin | Yi Wang | Zeyue Tian | Shangda Wu | Tianhao Shen | Ge Zhang | Yuhang Wu | Cong Liu | Ziya Zhou | Liumeng Xue | Ziyang Ma | Qin Liu | Tianyu Zheng | Yizhi Li | Yinghao Ma | Yiming Liang | Xiaowei Chi | Ruibo Liu | Zili Wang | Chenghua Lin | Qifeng Liu | Tao Jiang | Wenhao Huang | Wenhu Chen | Jie Fu | Emmanouil Benetos | Gus Xia | Roger Dannenberg | Wei Xue | Shiyin Kang | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2024

While LLMs demonstrate impressive capabilities in musical knowledge, we find that music reasoning is still an unsolved task. We introduce ChatMusician, an open-source large language model (LLM) that integrates intrinsic musical abilities. It is built by continually pre-training and fine-tuning LLaMA2 on a text-compatible music representation, ABC notation, treating music as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without external multi-modal neural structures or tokenizers. Interestingly, endowing the model with musical abilities does not harm its language abilities; it even achieves a slightly higher MMLU score. ChatMusician is capable of composing well-structured, full-length music conditioned on texts, chords, melodies, motifs, musical forms, etc. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 by a noticeable margin. We show that ChatMusician preserves or even surpasses the original LLaMA2 7B’s language abilities by evaluating it on the MMLU benchmark. Our work reveals that LLMs can be an excellent compressor for music, which can be seen as humanity’s creative language, but there remains significant territory to be conquered. We release our 5B-token music-language corpus MusicPiles, the collected MusicTheoryBench, code, model and demo.