Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment

Hyuntae Park, Yeachan Kim, SangKeun Lee


Abstract
Molecule and text representation learning has gained increasing interest due to its potential for enhancing the understanding of chemical information. However, existing models often struggle to capture subtle differences between molecules and their descriptions, as they lack the ability to learn fine-grained alignments between molecular substructures and chemical phrases. To address this limitation, we introduce MolBridge, a novel molecule–text learning framework based on substructure-aware alignments. Specifically, we augment the original molecule–description pairs with additional alignment signals derived from molecular substructures and chemical phrases. To effectively learn from these enriched alignments, MolBridge employs substructure-aware contrastive learning, coupled with a self-refinement mechanism that filters out noisy alignment signals. Experimental results show that MolBridge effectively captures fine-grained correspondences and outperforms state-of-the-art baselines on a wide range of molecular benchmarks, underscoring the importance of substructure-aware alignment in molecule-text learning.
Anthology ID:
2025.emnlp-main.1197
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
23470–23490
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1197/
Cite (ACL):
Hyuntae Park, Yeachan Kim, and SangKeun Lee. 2025. Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23470–23490, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment (Park et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1197.pdf
Checklist:
 2025.emnlp-main.1197.checklist.pdf