MMSciCode: Real-world Evaluation of Multilingual Multi-Discipline Scientific Research Coding

Xue Xia; Zheyuan Yang; Arman Cohan; Yilun Zhao

Abstract

We introduce MMSciCode, a comprehensive expert-level, multilingual multi-discipline benchmark for evaluating foundation models in scientific code generation. It includes 624 expert-annotated research coding problems spanning six core scientific disciplines. Compared to prior benchmarks, MMSciCode features three key advancements. First, it challenges models to integrate domain-specific knowledge with algorithmic reasoning to implement core functions from research papers, moving beyond the isolated, general-purpose coding tasks typically assessed in current benchmarks. Second, each problem is meticulously annotated by domain experts through a rigorous paper-grounded process, with strict quality controls implemented to ensure dataset integrity and authenticity. Finally, each problem is equipped with comprehensive unit test suites and containerized environments, enabling reproducible and diagnostic evaluation of both functional correctness and domain validity. We conduct an extensive evaluation of 28 state-of-the-art foundation models and 2 agentic coding tools on MMSciCode. Our results reveal that even the best non-agentic model achieves only around 15% accuracy, while the top agentic coding tool reaches 32.2%, both still far below human expert performance of 68.8%. Through comprehensive error analyses and case studies, we identify substantial performance gaps between models and human experts, providing actionable insights for advancing expert-level scientific code generation.

Anthology ID: 2026.acl-long.1566
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 33981–33999
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1566/
Cite (ACL): Xue Xia, Zheyuan Yang, Arman Cohan, and Yilun Zhao. 2026. MMSciCode: Real-world Evaluation of Multilingual Multi-Discipline Scientific Research Coding. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33981–33999, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MMSciCode: Real-world Evaluation of Multilingual Multi-Discipline Scientific Research Coding (Xia et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1566.pdf
Checklist: 2026.acl-long.1566.checklist.pdf
