LM2Protein: A Structure-to-Token Protein Large Language Model
Chang Zhou, Yuheng Shan, Pengan Chen, Xiangyu Shi, Zikang Wang, Yanting Li, Jiyue Jiang
Abstract
Proteins are critical for diverse molecular functions, which depend on their precise tertiary structures. The structure-sequence relationship is complex and degenerate: multiple sequences can fold into similar structures. The challenges of protein prediction, design, and modification grow with sequence complexity, while research on RNA-protein interactions, especially RNA-binding proteins (RBPs), is gaining importance. Large-scale pre-trained language models (LLMs) have shown promising results on biological sequences by treating them as natural language, but integrating spatial structure remains difficult because it typically requires specialized visual and 3D modeling approaches. We introduce a method that integrates protein 3D structural data into a sequence processing framework by converting 3D coordinates into discrete structure tokens using a VQ-VAE-like network. This simplifies the handling of 3D data, avoids complex pipelines, and enables a unified sequence-to-sequence processing model. Our approach demonstrates strong performance across a range of tasks, achieving high sequence recovery in inverse folding and in protein-conditioned RNA design. These results indicate significant potential for application in complex biological systems research.
- Anthology ID:
- 2025.findings-emnlp.369
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7023–7029
- URL:
- https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.369/
- DOI:
- 10.18653/v1/2025.findings-emnlp.369
- Cite (ACL):
- Chang Zhou, Yuheng Shan, Pengan Chen, Xiangyu Shi, Zikang Wang, Yanting Li, and Jiyue Jiang. 2025. LM2Protein: A Structure-to-Token Protein Large Language Model. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7023–7029, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- LM2Protein: A Structure-to-Token Protein Large Language Model (Zhou et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.369.pdf
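The abstract describes converting 3D coordinates into discrete structure tokens with a VQ-VAE-like network. The core quantization step of any VQ-VAE-style tokenizer can be sketched as a nearest-neighbor lookup into a learned codebook; the sketch below is a minimal illustration of that step only, not the paper's implementation, and the codebook size (512) and feature dimension (16) are illustrative assumptions.

```python
import numpy as np

# Illustrative VQ step of a VQ-VAE-like structure tokenizer: each residue's
# encoded feature vector is replaced by the index of its nearest codebook
# entry, yielding one discrete "structure token" per residue.
# Codebook size and feature dimension are hypothetical, not from the paper.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))   # 512 code vectors of dimension 16

def quantize(features: np.ndarray) -> np.ndarray:
    """Map each row of `features` (n_residues x 16) to a codebook index."""
    # Squared Euclidean distance from every feature to every code vector,
    # via broadcasting: (n, 1, 16) - (1, 512, 16) -> (n, 512) distances.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)             # nearest code index per residue

# Stand-in for encoder output over a 10-residue structure.
tokens = quantize(rng.normal(size=(10, 16)))
print(tokens.shape)  # (10,) -- one structure token per residue
```

Once structures are expressed as such token sequences, they can be handled by an ordinary sequence-to-sequence language model alongside amino-acid tokens, which is the simplification the abstract highlights.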