Mol2Lang-VLM: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion
Duong Tran, Nhat Truong Pham, Nguyen Nguyen, Balachandran Manavalan
Abstract
This paper introduces Mol2Lang-VLM, an enhanced method for refining generative pre-trained language models for molecule captioning using multimodal features. Our approach leverages the encoder and decoder blocks of the Transformer-based architecture by introducing a third sub-layer into each. Specifically, the added sub-layer in the encoder fuses features from SELFIES strings and molecular images, while the one in the decoder fuses features from SMILES strings and their corresponding descriptions. Moreover, cross multi-head attention is employed instead of standard multi-head attention, enabling the decoder to attend to the encoder’s output and thereby integrate the encoded contextual information for better and more accurate caption generation. Performance evaluation on the ChEBI-20 and L+M-24 benchmark datasets demonstrates Mol2Lang-VLM’s superiority, achieving higher accuracy and quality in caption generation compared to existing methods. Our code and pre-processed data are available at https://github.com/nhattruongpham/mol-lang-bridge/tree/mol2lang/.
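To make the cross multi-head attention mechanism described in the abstract concrete, the PyTorch sketch below shows a decoder-side block whose queries come from decoder states and whose keys and values come from encoder outputs. This is a minimal illustrative sketch, not the authors' implementation: the module name, dimensions, and hyperparameters are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross multi-head attention block (hypothetical, not
    the paper's code): decoder states act as queries, encoder outputs
    act as keys/values, so generation can attend to encoded context."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        # d_model and n_heads are assumed values, not the paper's settings.
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, decoder_states: torch.Tensor,
                encoder_outputs: torch.Tensor) -> torch.Tensor:
        # Queries from the decoder; keys and values from the encoder.
        fused, _ = self.attn(query=decoder_states,
                             key=encoder_outputs,
                             value=encoder_outputs)
        # Residual connection plus layer normalization, as in a standard
        # Transformer sub-layer.
        return self.norm(decoder_states + fused)

# Usage with dummy tensors (batch of 2; sequence lengths are arbitrary):
enc = torch.randn(2, 64, 768)  # e.g., fused SELFIES + image features
dec = torch.randn(2, 32, 768)  # e.g., SMILES/description features
out = CrossAttentionFusion()(dec, enc)  # shape: (2, 32, 768)
```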
- Anthology ID:
- 2024.langmol-1.12
- Volume:
- Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Carl Edwards, Qingyun Wang, Manling Li, Lawrence Zhao, Tom Hope, Heng Ji
- Venues:
- LangMol | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 97–102
- URL:
- https://aclanthology.org/2024.langmol-1.12
- Cite (ACL):
- Duong Tran, Nhat Truong Pham, Nguyen Nguyen, and Balachandran Manavalan. 2024. Mol2Lang-VLM: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), pages 97–102, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Mol2Lang-VLM: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion (Tran et al., LangMol-WS 2024)
- PDF:
- https://aclanthology.org/2024.langmol-1.12.pdf