LM2Protein: A Structure-to-Token Protein Large Language Model
Chang Zhou, Yuheng Shan, Pengan Chen, Xiangyu Shi, Zikang Wang, Yanting Li, Jiyue Jiang
Abstract
Proteins are critical for diverse molecular functions, which depend on their precise tertiary structures. The structure-sequence relationship is complex and degenerate: multiple sequences can fold into similar structures. The challenges of protein prediction, design, and modification grow with sequence complexity, while research on RNA-protein interactions, especially RNA-binding proteins (RBPs), is gaining importance. Large-scale pre-trained language models (LLMs) have shown promising results on biological sequences by treating them as natural language, but integrating spatial structure remains difficult because it typically requires specialized visual and 3D modeling approaches. We introduce a method that integrates protein 3D structural data into a sequence processing framework by converting 3D coordinates into discrete structure tokens using a VQ-VAE-like network. This simplifies the handling of 3D data, avoids complex pipelines, and enables a unified sequence-to-sequence processing model. Our approach demonstrates strong performance across a range of tasks, achieving high sequence recovery in inverse folding and in protein-conditioned RNA design. These results indicate significant potential for application in complex biological systems research.
- Anthology ID:
- 2025.findings-emnlp.369
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7023–7029
- URL:
- https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.369/
- DOI:
- 10.18653/v1/2025.findings-emnlp.369
- Cite (ACL):
- Chang Zhou, Yuheng Shan, Pengan Chen, Xiangyu Shi, Zikang Wang, Yanting Li, and Jiyue Jiang. 2025. LM2Protein: A Structure-to-Token Protein Large Language Model. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7023–7029, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- LM2Protein: A Structure-to-Token Protein Large Language Model (Zhou et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.369.pdf
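The abstract describes converting 3D coordinates into discrete structure tokens with a VQ-VAE-like network. The core quantization step of any VQ-VAE-style tokenizer can be sketched as a nearest-neighbor lookup into a learned codebook; the sketch below is a minimal illustration of that step only, not the paper's implementation, and the codebook size (512) and feature dimension (16) are illustrative assumptions.

```python
import numpy as np

# Illustrative VQ step of a VQ-VAE-like structure tokenizer: each residue's
# encoded feature vector is replaced by the index of its nearest codebook
# entry, yielding one discrete "structure token" per residue.
# Codebook size and feature dimension are hypothetical, not from the paper.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))   # 512 code vectors of dimension 16

def quantize(features: np.ndarray) -> np.ndarray:
    """Map each row of `features` (n_residues x 16) to a codebook index."""
    # Squared Euclidean distance from every feature to every code vector,
    # via broadcasting: (n, 1, 16) - (1, 512, 16) -> (n, 512) distances.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)             # nearest code index per residue

# Stand-in for encoder output over a 10-residue structure.
tokens = quantize(rng.normal(size=(10, 16)))
print(tokens.shape)  # (10,) -- one structure token per residue
```

Once structures are expressed as such token sequences, they can be handled by an ordinary sequence-to-sequence language model alongside amino-acid tokens, which is the simplification the abstract highlights.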