Generative Music Models’ Alignment with Professional and Amateur Users’ Expectations

Zihao Wang, Jiaxing Yu, Haoxuan Liu, Zehui Zheng, Yuhang Jin, Shuyu Li, Shulei Ji, Kejun Zhang


Abstract
Recent years have witnessed rapid advancements in text-to-music generation using large language models, yielding notable outputs. A critical challenge is understanding users with diverse musical expertise and generating music that meets their expectations, an area that remains underexplored. To address this gap, we introduce the novel task of Professional and Amateur Description-to-Song Generation. This task focuses on aligning generated content with human expressions from varying musical proficiency levels, aiming to produce songs that accurately meet auditory expectations and adhere to musical structural conventions. We utilized the MuChin dataset, which contains annotations from both professionals and amateurs for identical songs, as the source for these distinct description types. We also collected a pre-training dataset of over 1.5 million songs; lyrics were included for some, while for others, lyrics were transcribed using Automatic Speech Recognition (ASR) models. Furthermore, we propose MuDiT/MuSiT, a single-stage framework designed to enhance human-machine alignment in song generation. This framework employs Chinese MuLan (ChinMu) for cross-modal comprehension between natural language descriptions and auditory musical attributes, thereby aligning generated songs with user-defined outcomes. Concurrently, a DiT/SiT model facilitates end-to-end generation of complete song audio, encompassing both vocals and instrumentation. We propose metrics to evaluate semantic and auditory discrepancies between generated content and target music. Experimental results demonstrate that MuDiT/MuSiT outperforms baseline models and exhibits superior alignment with both professional and amateur song descriptions.
Anthology ID:
2025.findings-acl.360
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6909–6920
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.360/
Cite (ACL):
Zihao Wang, Jiaxing Yu, Haoxuan Liu, Zehui Zheng, Yuhang Jin, Shuyu Li, Shulei Ji, and Kejun Zhang. 2025. Generative Music Models’ Alignment with Professional and Amateur Users’ Expectations. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6909–6920, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Generative Music Models’ Alignment with Professional and Amateur Users’ Expectations (Wang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.360.pdf