Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores

Congren Dai; Yue Yang; Krinos Li; Huichi Zhou; Shijie Liang; Zhang Bo; Enyang Liu; Ge Jin; Hongran An; Haosen Zhang; Peiyuan Jing; KinHei Lee; Zhenxuan Zhang; Xiaobing Li; Maosong Sun (孙茂松)

Abstract

Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision–Language Models to interpret full musical notation remains insufficiently examined. We introduce the Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question–answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.

Anthology ID: 2026.acl-long.493
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 10777–10799
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.493/
Cite (ACL): Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Zhang Bo, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, KinHei Lee, Zhenxuan Zhang, Xiaobing Li, and Maosong Sun. 2026. Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10777–10799, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores (Dai et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.493.pdf
Checklist: 2026.acl-long.493.checklist.pdf
