Rethinking Text-based Protein Understanding: Retrieval or LLM?

Juntong Wu, Zijing Liu, He Cao, Li Hao, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li


Abstract
In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to assess the model’s performance in this domain accurately. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data will be available.
Anthology ID:
2025.emnlp-main.1211
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
23737–23757
Language:
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1211/
DOI:
Bibkey:
Cite (ACL):
Juntong Wu, Zijing Liu, He Cao, Li Hao, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, and Yu Li. 2025. Rethinking Text-based Protein Understanding: Retrieval or LLM?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23737–23757, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Rethinking Text-based Protein Understanding: Retrieval or LLM? (Wu et al., EMNLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1211.pdf
Checklist:
 2025.emnlp-main.1211.checklist.pdf