Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text
Ala Jararweh, Oladimeji Macaulay, David Arredondo, Yue Hu, Luis E Tafoya, Kushal Virupakshappa, Avinash Sahu
Abstract
Proteins play critical roles in biological systems, yet 99.7% of over 227 million known protein sequences remain uncharacterized due to the limitations of experimental methods. To assist experimentalists in narrowing down hypotheses and accelerating protein characterization, we present Protein2Text, a multimodal large language model that interprets protein sequences and generates informative text to address open-ended questions about protein functions and attributes. By integrating a resampling mechanism within an adapted LLaVA framework, our model effectively maps protein sequences into a language-compatible space, enhancing its capability to handle diverse and complex queries. Trained on a newly curated dataset derived from PubMed articles and rigorously evaluated using four comprehensive benchmarks—including in-domain and cross-domain evaluations—Protein2Text outperforms several existing models in open-ended question-answering tasks. Our work also highlights the limitations of current evaluation metrics applied to template-based approaches, which may lead to misleading results, emphasizing the need for unbiased assessment methods. Our model weights, evaluation datasets, and evaluation scripts are publicly available at https://github.com/alaaj27/Protein2Text.git.- Anthology ID:
- 2025.naacl-industry.68
- Volume:
- Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
- Month:
- April
- Year:
- 2025
- Address:
- Albuquerque, New Mexico
- Editors:
- Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
- Venue:
- NAACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 918–937
- Language:
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.68/
- DOI:
- Cite (ACL):
- Ala Jararweh, Oladimeji Macaulay, David Arredondo, Yue Hu, Luis E Tafoya, Kushal Virupakshappa, and Avinash Sahu. 2025. Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 918–937, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text (Jararweh et al., NAACL 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.68.pdf