Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Representation

Nguyen Nguyen, Nhat Truong Pham, Duong Tran, Balachandran Manavalan


Abstract
Generating de novo molecules from textual descriptions is challenging due to potential issues with molecule validity in SMILES representation and limitations of autoregressive models. This work introduces Lang2Mol-Diff, a diffusion-based language-to-molecule generative model using the SELFIES representation. Specifically, Lang2Mol-Diff leverages the strengths of two state-of-the-art molecular generative models: BioT5 and TGM-DLM. By employing BioT5 to tokenize the SELFIES representation, Lang2Mol-Diff addresses the validity issues associated with SMILES strings. Additionally, it incorporates a text diffusion mechanism from TGM-DLM to overcome the limitations of autoregressive models in this domain. To the best of our knowledge, this is the first study to leverage the diffusion mechanism for text-based de novo molecule generation using the SELFIES molecular string representation. Performance evaluation on the L+M-24 benchmark dataset shows that Lang2Mol-Diff outperforms all existing methods for molecule generation in terms of validity. Our code and pre-processed data are available at https://github.com/nhattruongpham/mol-lang-bridge/tree/lang2mol/.
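The abstract relies on SELFIES being tokenized before the diffusion model sees it. A SELFIES string is a sequence of bracketed symbols, and any sequence of such symbols decodes to a valid molecule, which is the property that sidesteps the SMILES validity problem. The sketch below shows symbol-level tokenization under stated assumptions: it uses only the Python standard library, it is not BioT5's tokenizer, and the helper names `split_selfies` and `build_vocab` and the `[PAD]` entry are hypothetical choices made for illustration.

```python
import re

# Each SELFIES symbol is a bracketed token such as [C], [=C], or [Ring1];
# a bare "." separates disconnected fragments.
TOKEN_PATTERN = re.compile(r"\[[^\[\]]+\]|\.")


def split_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its symbols (illustrative helper)."""
    tokens = TOKEN_PATTERN.findall(selfies)
    # Reject input that contains characters outside any recognized symbol.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer id to every symbol seen in the corpus (assumed scheme)."""
    vocab = {"[PAD]": 0}  # hypothetical padding entry, not taken from the paper
    for selfies in corpus:
        for token in split_selfies(selfies):
            vocab.setdefault(token, len(vocab))
    return vocab


# Benzene written in SELFIES.
benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
vocab = build_vocab([benzene])
token_ids = [vocab[t] for t in split_selfies(benzene)]
```

In a pipeline like the one the abstract describes, integer ids of this kind would be mapped to embeddings that the diffusion process then noises and denoises. A SMILES string such as `C1=CC=CC=C1` is rejected by `split_selfies` because it contains no bracketed symbols.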
Anthology ID:
2024.langmol-1.15
Volume:
Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Carl Edwards, Qingyun Wang, Manling Li, Lawrence Zhao, Tom Hope, Heng Ji
Venues:
LangMol | WS
Publisher:
Association for Computational Linguistics
Pages:
128–134
URL:
https://aclanthology.org/2024.langmol-1.15
DOI:
10.18653/v1/2024.langmol-1.15
Cite (ACL):
Nguyen Nguyen, Nhat Truong Pham, Duong Tran, and Balachandran Manavalan. 2024. Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Representation. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), pages 128–134, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Representation (Nguyen et al., LangMol-WS 2024)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.langmol-1.15.pdf