Bohan Wu
2026
SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
Yiyang Gu | Junwei Yang | Junyu Luo | Ye Yuan | Bin Feng | Yingce Xia | Shufang Xie | Kaili Liu | Bohan Wu | Qi Shi | Haoran Li | Beier Xiao | Zhiping Xiao | Xiao Luo | Weizhi Zhang | Philip S. Yu | Zequn Liu | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs.
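The abstract mentions identifying relevant knowledge units via voting-based multi-model consensus but does not give the voting rule. The following is a minimal sketch under the assumption that a unit is kept when at least `min_votes` models propose it; the function name, the threshold, and the example unit names are all hypothetical and not taken from the paper.

```python
from collections import Counter


def consensus_units(votes_per_model, min_votes):
    """Return knowledge units proposed by at least `min_votes` models.

    Assumption: simple threshold voting; the paper's actual rule may differ.
    """
    counts = Counter()
    for units in votes_per_model:
        # Deduplicate so each model votes at most once per unit.
        counts.update(set(units))
    return sorted(unit for unit, n in counts.items() if n >= min_votes)


# Hypothetical proposals from three models for one custom requirement.
proposals = [
    ["organic_synthesis", "reaction_kinetics"],
    ["organic_synthesis", "spectroscopy"],
    ["organic_synthesis", "reaction_kinetics", "reaction_kinetics"],
]

selected = consensus_units(proposals, min_votes=2)
print(selected)
```

Raising `min_votes` to the number of models turns the rule into unanimous agreement, which trades recall of relevant units for precision.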
2025
A Survey on Efficient Large Language Model Training: From Data-centric Perspectives
Junyu Luo | Bohan Wu | Xiao Luo | Zhiping Xiao | Yiqiao Jin | Rong-Cheng Tu | Nan Yin | Yifan Wang | Jingyang Yuan | Wei Ju | Ming Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Jinsheng Huang | Liang Chen | Taian Guo | Fu Zeng | Yusheng Zhao | Bohan Wu | Ye Yuan | Haozhe Zhao | Zhihui Guo | Yichi Zhang | Jingyang Yuan | Wei Ju | Luchen Liu | Tianyu Liu | Baobao Chang | Ming Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is **more challenging** (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and **more trustworthy** (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
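The abstract describes question triplets (an original question plus one perception question and one knowledge anchor question) and "more rigorous metrics" without defining them. As a rough illustration, the sketch below assumes a strict metric that credits a triplet only when all three answers are correct and compares it with plain per-question accuracy. The function names and the sample results are hypothetical, and the paper's actual metric may be defined differently.

```python
def per_question_accuracy(triplets):
    """Fraction of individual questions answered correctly."""
    flat = [ok for triplet in triplets for ok in triplet]
    return sum(flat) / len(flat)


def strict_triplet_accuracy(triplets):
    """Fraction of triplets in which all three questions are correct.

    Assumption: this all-or-nothing rule stands in for the paper's
    "more rigorous metrics", which the abstract does not spell out.
    """
    return sum(all(triplet) for triplet in triplets) / len(triplets)


# Hypothetical results: (original, perception, knowledge) correctness.
results = [
    (True, True, True),
    (True, False, True),
    (True, True, False),
    (False, False, False),
]

pq_acc = per_question_accuracy(results)
strict_acc = strict_triplet_accuracy(results)
print(f"per-question: {pq_acc:.3f}, strict triplet: {strict_acc:.3f}")

# Size check from the abstract: 2,138 triplets of 3 questions each.
total_questions = 2138 * 3
```

A model that answers the original question correctly without the supporting perception or knowledge question earns no credit under the strict rule, which is why such a metric penalizes guessing more heavily than per-question accuracy.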
Co-authors
- Ming Zhang 3
- Wei Ju 2
- Junyu Luo 2
- Xiao Luo 2
- Zhiping Xiao 2
- Jingyang Yuan 2
- Ye Yuan 2
- Baobao Chang (常宝宝) 1
- Liang Chen 1
- Bin Feng 1
- Yiyang Gu 1
- Taian Guo 1
- Zhihui Guo 1
- Jinsheng Huang 1
- Yiqiao Jin 1
- Haoran Li 1
- Luchen Liu 1
- Tianyu Liu 1
- Kaili Liu 1
- Zequn Liu 1
- Qi Shi 1
- Rong-Cheng Tu 1
- Yifan Wang 1
- Yingce Xia 1
- Beier Xiao 1
- Shufang Xie 1
- Junwei Yang 1
- Nan Yin 1
- Philip S. Yu 1
- Fu Zeng 1
- Yichi Zhang 1
- Weizhi Zhang 1
- Yusheng Zhao 1
- Haozhe Zhao 1