SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

Yiyang Gu; Junwei Yang; Junyu Luo; Ye Yuan; Bin Feng; Yingce Xia; Shufang Xie; Kaili Liu; Bohan Wu; Qi Shi; Haoran Li; Beier Xiao; Zhiping Xiao; Xiao Luo; Weizhi Zhang; Philip S. Yu; Zequn Liu; Ming Zhang

Abstract

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, which limits their scalability and their alignment with real scientific use cases. To address this problem, we propose SciCustom, a framework that constructs custom benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities of LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, it identifies the relevant knowledge units via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs.
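The abstract does not specify how the voting-based multi-model consensus is computed. As a rough illustration only, the sketch below assumes each model proposes a set of knowledge-unit names for a requirement and a unit is kept when at least `min_votes` models propose it. The function name `consensus_units`, the threshold parameter, and the example unit names are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Iterable


def consensus_units(votes_per_model: Iterable[set[str]], min_votes: int) -> list[str]:
    """Keep knowledge units proposed by at least `min_votes` models.

    Hypothetical helper: the paper's actual voting rule may differ.
    Results are ordered by descending vote count, then alphabetically.
    """
    counts: Counter[str] = Counter()
    for units in votes_per_model:
        # A model contributes at most one vote per unit.
        counts.update(set(units))
    kept = [unit for unit, n in counts.items() if n >= min_votes]
    return sorted(kept, key=lambda unit: (-counts[unit], unit))


# Made-up example: three models vote on units for a chemistry requirement.
votes = [
    {"organic_synthesis", "reaction_kinetics"},
    {"organic_synthesis", "spectroscopy"},
    {"organic_synthesis", "reaction_kinetics", "thermodynamics"},
]
selected = consensus_units(votes, min_votes=2)
print(selected)  # ['organic_synthesis', 'reaction_kinetics']
```

Raising `min_votes` trades recall for precision: with `min_votes=3` only units that every model in this example agrees on would survive.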

Anthology ID: 2026.acl-long.2117
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 45661–45678
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2117/
Cite (ACL): Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, and Ming Zhang. 2026. SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 45661–45678, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models (Gu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2117.pdf
Checklist: 2026.acl-long.2117.checklist.pdf
