2025
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages
Shangda Wu | Guo Zhancheng | Ruibin Yuan | Junyan Jiang | SeungHeon Doh | Gus Xia | Juhan Nam | Xiaobing Li | Feng Yu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities (including sheet music, performance signals, and audio recordings) with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
2024
Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)
Anna Kruspe | Sergio Oramas | Elena V. Epure | Mohamed Sordo | Benno Weck | SeungHeon Doh | Minz Won | Ilaria Manco | Gabriel Meseguer-Brocal
Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)
PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
Hayeon Bang | Eunjin Choi | Megan Finch | Seungheon Doh | Seolhee Lee | Gyeong-Hoon Lee | Juhan Nam
Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)
While piano music has become a significant area of study in Music Information Retrieval (MIR), there is a notable lack of datasets for piano solo music with text labels. To address this gap, we present PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset. Using a piano-specific taxonomy of semantic tags, we collected 9,673 tracks from YouTube and added human annotations by music experts for 2,023 tracks, resulting in two subsets: PIAST-YT and PIAST-AT. Both include audio, text, tag annotations, and MIDI transcribed with state-of-the-art piano transcription and beat tracking models. Among the many tasks the multimodal dataset supports, we conduct music tagging and retrieval using both audio and MIDI data and report baseline performances to demonstrate its potential as a valuable resource for MIR research.
2021
Music Playlist Title Generation: A Machine-Translation Approach
Seungheon Doh | Junwon Lee | Juhan Nam
Proceedings of the 2nd Workshop on NLP for Music and Spoken Audio (NLP4MusA)