BoYaEval: Evaluating Multimodal Large Language Models on Understanding Ancient Chinese Musical Scores

Jiajia Li, Weizhi Xue, Yao Yao, Qiwei Li, Chenchong, Zuchao Li, Ping Wang, Hai Zhao

Abstract
Multimodal Large Language Models (MLLMs) excel at general tasks but struggle with specialized, structured cultural symbols. We introduce BoYaEval, the first comprehensive benchmark dedicated to deciphering ancient Chinese musical notation, covering five distinct notation systems. These systems use unique spatial layouts and specialized ideograms to encode pitch and intricate playing techniques. BoYaEval comprises 3,175 high-quality images across these notation styles and establishes a three-tier evaluation: Structural Parsing (symbol recognition), Instructional Translation (technique mapping), and Musical Reasoning (melody derivation). We evaluate 21 leading MLLMs. Results indicate that while models perform adequately on basic recognition, they fail at cross-system compositional logic, scoring only around 27% on reasoning tasks. BoYaEval highlights the limitations of current MLLMs in processing diverse spatial-symbolic dependencies and helps bridge the gap between ancient wisdom and modern AI for digitizing intangible cultural heritage. The BoYaEval benchmark is publicly available at https://huggingface.co/datasets/MYTH-Lab/BoYaEval.
Anthology ID:
2026.acl-long.997
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
21858–21873
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.997/
Cite (ACL):
Jiajia Li, Weizhi Xue, Yao Yao, Qiwei Li, Chenchong, Zuchao Li, Ping Wang, and Hai Zhao. 2026. BoYaEval: Evaluating Multimodal Large Language Models on Understanding Ancient Chinese Musical Scores. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21858–21873, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
BoYaEval: Evaluating Multimodal Large Language Models on Understanding Ancient Chinese Musical Scores (Li et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.997.pdf
Checklist:
2026.acl-long.997.checklist.pdf