Zeyuan Wang
2025
Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization
Yuhao Wang
|
Keyan Ding
|
Kehua Feng
|
Zeyuan Wang
|
Ming Qin
|
Xiaotong Li
|
Qiang Zhang
|
Huajun Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Protein language models have emerged as powerful tools for sequence generation, offering substantial advantages in functional optimization and *denovo* design. However, these models also present significant risks of generating harmful protein sequences, such as those that enhance viral transmissibility or evade immune responses. These concerns underscore critical biosafety and ethical challenges. To address these issues, we propose a Knowledge-guided Preference Optimization (KPO) framework that integrates prior knowledge via a Protein Safety Knowledge Graph. This framework utilizes an efficient graph pruning strategy to identify preferred sequences and employs reinforcement learning to minimize the risk of generating harmful proteins. Experimental results demonstrate that KPO effectively reduces the likelihood of producing hazardous sequences while maintaining high functionality, offering a robust safety assurance framework for applying generative models in biotechnology.
2024
InstructProtein: Aligning Human and Protein Language via Knowledge Instruction
Zeyuan Wang
|
Qiang Zhang
|
Keyan Ding
|
Ming Qin
|
Xiang Zhuang
|
Xiaotong Li
|
Huajun Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins. To address this challenge, we propose InstructProtein, an innovative LLM that possesses bidirectional generation capabilities in both human and protein languages: (i) taking a protein sequence as input to predict its textual function description and (ii) using natural language to prompt protein sequence generation. To achieve this, we first pre-train an LLM on both protein and natural language corpora, enabling it to comprehend individual languages. Then supervised instruction tuning is employed to facilitate the alignment of these two distinct languages. Herein, we introduce a knowledge graph-based instruction generation framework to construct a high-quality instruction dataset, addressing the annotation imbalance and the absence of instructional signals in the existing protein-text corpus. In particular, the instructions inherit the structural relations between proteins and function annotations in knowledge graphs, which empowers our model to engage in the causal modeling of protein functions, akin to the chain-of-thought processes in natural languages. Extensive experiments on bidirectional protein-text generation tasks show that InstructProtein outperforms state-of-the-art LLMs by a large margin.
Search
Fix author
Co-authors
- Huajun Chen 2
- Keyan Ding 2
- Xiaotong Li 2
- Ming Qin 2
- Qiang Zhang 2
- show all...
Venues
- acl2