Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text

Ala Jararweh, Oladimeji Macaulay, David Arredondo, Yue Hu, Luis E Tafoya, Kushal Virupakshappa, Avinash Sahu


Abstract
Proteins play critical roles in biological systems, yet 99.7% of over 227 million known protein sequences remain uncharacterized due to the limitations of experimental methods. To assist experimentalists in narrowing down hypotheses and accelerating protein characterization, we present Protein2Text, a multimodal large language model that interprets protein sequences and generates informative text to address open-ended questions about protein functions and attributes. By integrating a resampling mechanism within an adapted LLaVA framework, our model effectively maps protein sequences into a language-compatible space, enhancing its capability to handle diverse and complex queries. Trained on a newly curated dataset derived from PubMed articles and rigorously evaluated using four comprehensive benchmarks—including in-domain and cross-domain evaluations—Protein2Text outperforms several existing models in open-ended question-answering tasks. Our work also highlights the limitations of current evaluation metrics applied to template-based approaches, which may lead to misleading results, emphasizing the need for unbiased assessment methods. Our model weights, evaluation datasets, and evaluation scripts are publicly available at https://github.com/alaaj27/Protein2Text.git.
Anthology ID:
2025.naacl-industry.68
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
918–937
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.68/
DOI:
Bibkey:
Cite (ACL):
Ala Jararweh, Oladimeji Macaulay, David Arredondo, Yue Hu, Luis E Tafoya, Kushal Virupakshappa, and Avinash Sahu. 2025. Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 918–937, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text (Jararweh et al., NAACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.68.pdf