Xiaobing Li
2026
Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores
Congren Dai | Yue Yang | Krinos Li | Huichi Zhou | Shijie Liang | Zhang Bo | Enyang Liu | Ge Jin | Hongran An | Haosen Zhang | Peiyuan Jing | KinHei Lee | Zhenxuan Zhang | Xiaobing Li | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision–Language Models to interpret full musical notation remains insufficiently examined. We introduce the Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question–answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.
2025
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages
Shangda Wu | Guo Zhancheng | Ruibin Yuan | Junyan Jiang | SeungHeon Doh | Gus Xia | Juhan Nam | Xiaobing Li | Feng Yu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities, including sheet music, performance signals, and audio recordings, with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Shangda Wu | Yashan Wang | Ruibin Yuan | Guo Zhancheng | Xu Tan | Ge Zhang | Monan Zhou | Jing Chen | Xuefeng Mu | Yuejie Gao | Yuanliang Dong | Jiafeng Liu | Xiaobing Li | Feng Yu | Maosong Sun
Findings of the Association for Computational Linguistics: NAACL 2025
Current music information retrieval systems struggle to manage linguistic diversity and to integrate multiple musical modalities. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.
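To make "aligned via contrastive learning" concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired text and music embeddings. It is not taken from the CLaMP 2 paper: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and the actual training objective may differ.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(text_embs, music_embs, temperature=0.07):
    """Hypothetical helper: average of text-to-music and music-to-text losses.

    Row i of text_embs is assumed to be paired with row i of music_embs.
    The temperature value is an assumption, not a value from the paper.
    """
    text = [normalize(v) for v in text_embs]
    music = [normalize(v) for v in music_embs]
    logits = [
        [sum(a * b for a, b in zip(t, m)) / temperature for m in music]
        for t in text
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

# Toy check: matched pairs should score a much lower loss than mismatched pairs.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this shape pulls each text embedding toward its paired music embedding and away from the other items in the batch, which is the general idea behind retrieval in a shared representation space.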
2023
Lingxi: A Diversity-aware Chinese Modern Poetry Generation System
Xinran Zhang | Maosong Sun | Jiafeng Liu | Xiaobing Li
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Chinese modern poetry generation has been a challenging task. One issue is Chinese word segmentation (CWS), which is critical to comprehending the Chinese language but is not always considered in common tokenization methods. Another is the decoding (sampling) method, which may induce repetition and boredom and severely lower the diversity of the generated poetry. To address these issues, we present Lingxi, a diversity-aware Chinese modern poetry generation system. For the CWS issue, we propose a novel framework that incorporates CWS in the tokenization process. The proposed method can achieve a high vocabulary coverage rate with a reasonable vocabulary size. For the decoding method and the diversity issue, we propose a novel sampling algorithm that flattens the high-likelihood part of the predicted distribution of the language model to emphasize the comparatively low-likelihood words and increase the diversity of generated poetry. Empirical results show that even when the top 60% of cumulative probability mass of the predicted distribution is flattened, our method achieves comparable or even better performance than baseline sampling methods. Our system is available at http://lingxi.website.
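The flattening idea can be illustrated with a small sketch. The abstract does not give the exact algorithm, so the version below rests on an assumption: the most likely tokens that together cover the chosen cumulative mass (0.6 here, echoing the 60% figure) each receive an equal share of that mass, while the remaining tokens keep their original probabilities. The function names and the toy distribution are hypothetical.

```python
import random

def flatten_head(probs, mass=0.6):
    """Equalize the probabilities of the top tokens that cover `mass`.

    Assumed interpretation of "flattening": the head keeps its total
    probability, but that total is split evenly among the head tokens.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    head, covered = [], 0.0
    for idx in order:
        head.append(idx)
        covered += probs[idx]
        if covered >= mass:
            break
    flattened = list(probs)
    share = covered / len(head)
    for idx in head:
        flattened[idx] = share
    return flattened

def sample_token(probs, rng=random):
    """Draw one token index from a (possibly flattened) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy next-token distribution over five tokens.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
flat = flatten_head(probs, mass=0.6)
token = sample_token(flat, random.Random(0))
```

In this toy case the two most likely tokens end up with the same probability, so the single dominant token is no longer favored over the runner-up, while the tail of the distribution is left untouched.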
Co-authors
- Maosong Sun (孙茂松) 4
- Jiafeng Liu 2
- Shangda Wu 2
- Feng Yu 2
- Ruibin Yuan 2
- Guo Zhancheng 2
- Hongran An 1
- Zhang Bo 1
- Jing Chen 1
- Congren Dai 1
- SeungHeon Doh 1
- Yuanliang Dong 1
- Yuejie Gao 1
- Junyan Jiang 1
- Ge Jin 1
- Peiyuan Jing 1
- KinHei Lee 1
- Krinos Li 1
- Shijie Liang 1
- Enyang Liu 1
- Xuefeng Mu 1
- Juhan Nam 1
- Xu Tan 1
- Yashan Wang 1
- Gus Xia 1
- Yue Yang 1
- Ge Zhang 1
- Xinran Zhang 1
- Haosen Zhang 1
- Zhenxuan Zhang 1
- Monan Zhou 1
- Huichi Zhou 1