Bin Feng
2026
SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
Yiyang Gu | Junwei Yang | Junyu Luo | Ye Yuan | Bin Feng | Yingce Xia | Shufang Xie | Kaili Liu | Bohan Wu | Qi Shi | Haoran Li | Beier Xiao | Zhiping Xiao | Xiao Luo | Weizhi Zhang | Philip S. Yu | Zequn Liu | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, which limits their scalability and their alignment with real scientific use cases. In this paper, we propose SciCustom, a framework that addresses this problem by enabling the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs.
2025
Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models
Weidi Luo | He Cao | Zijing Liu | Yu Wang | Aidan Wong | Bin Feng | Yuan Yao | Yu Li
Findings of the Association for Computational Linguistics: NAACL 2025
With the extensive deployment of Large Language Models (LLMs), ensuring their safety has become increasingly critical. However, existing defense methods often struggle with two key issues: (i) inadequate defense capabilities, particularly in domain-specific scenarios like chemistry, where a lack of specialized knowledge can lead to the generation of harmful responses to malicious queries; and (ii) over-defensiveness, which compromises the general utility and responsiveness of LLMs. To mitigate these issues, we introduce a multi-agent defense framework, Guide for Defense (G4D), which leverages accurate external information to provide an unbiased summary of user intentions and analytically grounded safety response guidance. Extensive experiments on popular jailbreak attacks and benign datasets show that G4D can enhance an LLM’s robustness against jailbreak attacks in both general and domain-specific scenarios without compromising the model’s general functionality.
Rethinking Text-based Protein Understanding: Retrieval or LLM?
Juntong Wu | Zijing Liu | He Cao | Li Hao | Bin Feng | Zishan Shu | Ke Yu | Li Yuan | Yu Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess model performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by these observations, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and is both accurate and efficient in training-free scenarios. Our code and data will be available.