2025
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Shangda Wu | Yashan Wang | Ruibin Yuan | Guo Zhancheng | Xu Tan | Ge Zhang | Monan Zhou | Jing Chen | Xuefeng Mu | Yuejie Gao | Yuanliang Dong | Jiafeng Liu | Xiaobing Li | Feng Yu | Maosong Sun
Findings of the Association for Computational Linguistics: NAACL 2025
Current music information retrieval systems face challenges in managing linguistic diversity and integrating various musical modalities. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.
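The contrastive alignment between the two encoders can be pictured with a CLIP-style objective. Below is a minimal PyTorch sketch of such a loss; the embedding dimension, temperature, and the use of random tensors in place of real encoder outputs are illustrative assumptions, not CLaMP 2's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, music_emb, temperature=0.07):
    """CLIP-style InfoNCE loss over a batch of aligned text-music pairs.

    text_emb, music_emb: (batch, dim) outputs of the multilingual text
    encoder and the multimodal music encoder. Matching rows are positives;
    every other row in the batch serves as an in-batch negative.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    music_emb = F.normalize(music_emb, dim=-1)
    logits = text_emb @ music_emb.T / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0))         # i-th text matches i-th music
    # Symmetric loss covers both retrieval directions: text->music, music->text.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```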
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Yuelin Bai | Xeron Du | Yiming Liang | Leo Jin | Junting Zhou | Ziqiang Liu | Feiteng Fang | Mingshan Chang | Tianyu Zheng | Xincheng Zhang | Nuo Ma | Zekun Moore Wang | Ruibin Yuan | Haihong Wu | Hongquan Lin | Wenhao Huang | Jiajun Zhang | Chenghua Lin | Jie Fu | Min Yang | Shiwen Ni | Ge Zhang
Findings of the Association for Computational Linguistics: NAACL 2025
Remarkable progress on large language models (LLMs), particularly in English, has facilitated impressive capabilities in following human instructions. However, there remains a noticeable gap in instruction fine-tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world data resources and subjected to comprehensive human verification. We conduct extensive experiments on COIG-CQIA and compare models trained on it with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance on diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
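Since the dataset is hosted on the Hugging Face Hub, it can be pulled with the datasets library. A minimal sketch follows; the subset name "zhihu" and the Alpaca-style field names are assumptions to be checked against the dataset card.

```python
from datasets import load_dataset

# COIG-CQIA is organized into per-source subsets; "zhihu" is an assumed
# configuration name -- consult the dataset card for the actual list.
ds = load_dataset("m-a-p/COIG-CQIA", "zhihu", split="train")

# Field names below follow the common instruction-tuning schema and are
# likewise assumptions.
example = ds[0]
print(example["instruction"])
print(example["output"])
```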
2024
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Jun Zhan | Junqi Dai | Jiasheng Ye | Yunhua Zhou | Dong Zhang | Zhigeng Liu | Xin Zhang | Ruibin Yuan | Ge Zhang | Linyang Li | Hang Yan | Jie Fu | Tao Gui | Tianxiang Sun | Yu-Gang Jiang | Xipeng Qiu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown at https://junzhan2000.github.io/AnyGPT.github.io/.
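The data-level recipe described above (map every modality to discrete tokens, then let an unmodified LLM model the joint stream) can be sketched as follows. The boundary markers, codebook sizes, and base tokenizer are illustrative assumptions, not AnyGPT's actual vocabulary.

```python
from transformers import AutoTokenizer

# Start from an ordinary text tokenizer and extend its vocabulary with the
# discrete codes emitted by per-modality tokenizers (e.g. an image or speech
# codec). The LLM architecture itself stays untouched; only the vocabulary
# grows (a model would also call model.resize_token_embeddings(len(tok))).
tok = AutoTokenizer.from_pretrained("gpt2")

n_image, n_speech = 8192, 1024                     # assumed codebook sizes
markers = ["<img>", "</img>", "<sph>", "</sph>"]   # assumed boundary tokens
codes = [f"<img_{i}>" for i in range(n_image)] + \
        [f"<sph_{i}>" for i in range(n_speech)]
tok.add_tokens(markers + codes)

# A multimodal example then becomes one flat token sequence, e.g. text
# followed by the discrete codes of a speech clip.
seq = "Transcribe this audio: <sph> <sph_3> <sph_17> <sph_5> </sph>"
print(tok.tokenize(seq))
```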
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Ruibin Yuan | Hanfeng Lin | Yi Wang | Zeyue Tian | Shangda Wu | Tianhao Shen | Ge Zhang | Yuhang Wu | Cong Liu | Ziya Zhou | Liumeng Xue | Ziyang Ma | Qin Liu | Tianyu Zheng | Yizhi Li | Yinghao Ma | Yiming Liang | Xiaowei Chi | Ruibo Liu | Zili Wang | Chenghua Lin | Qifeng Liu | Tao Jiang | Wenhao Huang | Wenhu Chen | Jie Fu | Emmanouil Benetos | Gus Xia | Roger Dannenberg | Wei Xue | Shiyin Kang | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2024
While LLMs demonstrate impressive capabilities in musical knowledge, we find that music reasoning is still an unsolved task. We introduce ChatMusician, an open-source large language model (LLM) with intrinsic musical abilities. It is based on continual pre-training and finetuning of LLaMA2 on ABC notation, a text-compatible music representation, treating music as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without external multi-modal neural structures or tokenizers. Interestingly, endowing the model with musical abilities does not harm its language abilities; it even achieves a slightly higher MMLU score. ChatMusician is capable of composing well-structured, full-length music conditioned on texts, chords, melodies, motifs, musical forms, etc. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 by a noticeable margin. We show that ChatMusician preserves or even surpasses the original LLaMA2 7B's language abilities by evaluating on the MMLU benchmark. Our work reveals that LLMs can be an excellent compressor for music, which can be seen as humanity's creative language, but there remains significant territory to be conquered. We release our 5B-token music-language corpora MusicPiles, the collected MusicTheoryBench, code, model, and demo.
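Because ChatMusician treats ABC notation as ordinary text, generating music reduces to standard causal decoding. A minimal sketch under the assumption that the released checkpoint is published as m-a-p/ChatMusician on the Hugging Face Hub; verify the name and prompt format against the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "m-a-p/ChatMusician"  # assumed checkpoint name -- check the model card
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Music in, music out: both the conditioning motif and the expected
# continuation are plain ABC notation handled by the text tokenizer.
prompt = "Develop a melody from this motif:\nX:1\nL:1/8\nM:4/4\nK:C\nCDEF GABc |"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))
```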
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models
Yizhi Li | Ge Zhang | Xingwei Qu | Jiali Li | Zhaoqun Li | Noah Wang | Hao Li | Ruibin Yuan | Yinghao Ma | Kai Zhang | Wangchunshu Zhou | Yiming Liang | Lei Zhang | Lei Ma | Jiajun Zhang | Zuowen Li | Wenhao Huang | Chenghua Lin | Jie Fu
Findings of the Association for Computational Linguistics: ACL 2024
The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet, their effectiveness often diminishes in low-resource languages like Chinese, exacerbated by biased evaluations from data leakage, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (**CIF-Bench**), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance, totaling 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work not only uncovers the current limitations of LLMs in handling Chinese language tasks but also sets a new standard for future LLM generalizability research, pushing towards the development of more adaptable, culturally informed, and linguistically diverse models.
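The variance-reduction step (pairing each task with several instruction phrasings and averaging the resulting scores) can be sketched as below. The data layout and exact-match scoring are illustrative assumptions, not the benchmark's actual evaluation harness.

```python
from statistics import mean

def evaluate(model_fn, task):
    """Score one task under diversified instructions.

    model_fn: callable mapping a prompt string to a model response.
    task: dict holding instruction variants and input-output pairs
          (layout assumed for illustration).
    """
    scores = []
    for instruction in task["instructions"]:   # diversified phrasings
        for pair in task["pairs"]:
            response = model_fn(f"{instruction}\n{pair['input']}")
            scores.append(float(response.strip() == pair["output"]))
    return mean(scores)  # averaging over variants dampens prompt sensitivity

# Toy usage with a constant stand-in "model".
task = {"instructions": ["Translate to English:", "Render in English:"],
        "pairs": [{"input": "你好", "output": "Hello"}]}
print(evaluate(lambda prompt: "Hello", task))
```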
2020
Chinese Grammatical Error Correction Based on Hybrid Models with Data Augmentation
Yi Wang | Ruibin Yuan | Yan'gen Luo | Yufang Qin | NianYong Zhu | Peng Cheng | Lihuan Wang
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications
A better Chinese Grammatical Error Diagnosis (CGED) system for automatic Grammatical Error Correction (GEC) can benefit foreign Chinese learners and lower barriers to learning Chinese. In this paper, we describe in detail our solution to the Grammatical Error Correction track of the CGED2020 shared task, which aims to detect and correct grammatical errors in essays written by foreign learners of Chinese. Our solution combined data augmentation, spelling-check methods, and generative grammatical correction, and achieved the best recall score in the Top 1 Correction track. Our final result ranked fourth among the participants.
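The hybrid design (a spelling-check stage feeding a generative correction stage, trained on augmented data) can be pictured as a simple two-stage pipeline. The components below are stubs standing in for the paper's actual modules, for illustration only.

```python
def spell_check(sentence, confusion_set):
    """Stub spelling-check stage: replace known confusable characters
    before the generative model sees the sentence."""
    return "".join(confusion_set.get(ch, ch) for ch in sentence)

def generative_correct(sentence):
    """Stub generative stage; a real system would decode here with a
    trained sequence-to-sequence grammatical-correction model."""
    return sentence  # identity placeholder

def correct(sentence, confusion_set):
    # Stage 1 fixes spelling errors, stage 2 fixes grammatical errors.
    return generative_correct(spell_check(sentence, confusion_set))

# Toy usage with one assumed confusable-character pair.
print(correct("我门去学校", {"门": "们"}))  # -> 我们去学校
```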