Protein-STORY: Semantic Text-Oriented Representation Yields biologically meaningful Protein embeddings

Nabil Ibtehaz, Daisuke Kihara


Abstract
Unsupervised representation learning using masked language modeling on the language of life has transformed protein research, enabling the analysis of a protein universe that is expanding at an exponential pace. However, most current models rely solely on sequence data, overlooking decades of expert-curated biological knowledge stored in natural language. While recent multimodal and knowledge-graph-based approaches attempt to bridge this gap, they often rely on shallow functional labels that lack the contextual depth of full textual narratives. We present Protein-STORY, a general pipeline that synthesizes protein embeddings from diverse, multi-source text descriptions. At the core of our approach is a novel network architecture designed for the semantic compression of multi-document embeddings, which integrates high-fidelity functional and structural insights into a unified representation. Our experiments demonstrate that Protein-STORY produces biologically meaningful embeddings (r ≈ 0.75) that outperform existing models on diverse downstream tasks (+2 pts F1 in function prediction). Furthermore, by projecting the story of a protein into a natural language semantic space, our model enables effective zero-shot text-prompted protein search.
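The zero-shot text-prompted search mentioned in the abstract can be pictured as nearest-neighbor retrieval in a shared embedding space. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the protein names, the 3-dimensional vectors, and the helpers `cosine` and `search` are hypothetical, and in practice the vectors would come from a text encoder and from the Protein-STORY pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_vec, protein_vecs, top_k=2):
    """Return the names of the top_k proteins most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in protein_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Toy, made-up embeddings (assumption): real ones would be high-dimensional
# vectors produced by the model, not hand-written numbers.
protein_vecs = {
    "kinase_A": [0.9, 0.1, 0.0],
    "transporter_B": [0.0, 0.2, 0.9],
    "kinase_C": [0.7, 0.3, 0.1],
}

# Stand-in for the embedding of a text prompt such as "protein kinase".
query_vec = [1.0, 0.0, 0.0]

print(search(query_vec, protein_vecs))
```

Because both the prompt and the proteins are assumed to live in the same semantic space, no task-specific training is needed at query time; ranking by similarity is enough for this toy case.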
Anthology ID:
2026.acl-short.73
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
883–897
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-short.73/
Cite (ACL):
Nabil Ibtehaz and Daisuke Kihara. 2026. Protein-STORY: Semantic Text-Oriented Representation Yields biologically meaningful Protein embeddings. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 883–897, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Protein-STORY: Semantic Text-Oriented Representation Yields biologically meaningful Protein embeddings (Ibtehaz & Kihara, ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-short.73.pdf
Checklist:
 2026.acl-short.73.checklist.pdf