Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Representation
Nguyen Nguyen, Nhat Truong Pham, Duong Tran, Balachandran Manavalan
Abstract
Generating de novo molecules from textual descriptions is challenging due to potential issues with molecule validity in SMILES representation and limitations of autoregressive models. This work introduces Lang2Mol-Diff, a diffusion-based language-to-molecule generative model using the SELFIES representation. Specifically, Lang2Mol-Diff leverages the strengths of two state-of-the-art molecular generative models: BioT5 and TGM-DLM. By employing BioT5 to tokenize the SELFIES representation, Lang2Mol-Diff addresses the validity issues associated with SMILES strings. Additionally, it incorporates a text diffusion mechanism from TGM-DLM to overcome the limitations of autoregressive models in this domain. To the best of our knowledge, this is the first study to leverage the diffusion mechanism for text-based de novo molecule generation using the SELFIES molecular string representation. Performance evaluation on the L+M-24 benchmark dataset shows that Lang2Mol-Diff outperforms all existing methods for molecule generation in terms of validity. Our code and pre-processed data are available at https://github.com/nhattruongpham/mol-lang-bridge/tree/lang2mol/.
- Anthology ID:
- 2024.langmol-1.15
- Volume:
- Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Carl Edwards, Qingyun Wang, Manling Li, Lawrence Zhao, Tom Hope, Heng Ji
- Venues:
- LangMol | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 128–134
- URL:
- https://aclanthology.org/2024.langmol-1.15
- DOI:
- 10.18653/v1/2024.langmol-1.15
- Cite (ACL):
- Nguyen Nguyen, Nhat Truong Pham, Duong Tran, and Balachandran Manavalan. 2024. Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Representation. In Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), pages 128–134, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Representation (Nguyen et al., LangMol-WS 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.langmol-1.15.pdf
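
The abstract refers to a text diffusion mechanism applied to tokenized SELFIES strings. As a rough illustration of what the forward (noising) half of such a process can look like, the sketch below applies the generic Gaussian diffusion step x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε to a toy matrix of token embeddings. This is not the authors' implementation: the linear beta schedule, the embedding values, and the names `alpha_bar_schedule` and `add_noise` are all assumptions chosen for illustration, and the schedule actually used by Lang2Mol-Diff or TGM-DLM may differ.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I)."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in x0]


rng = random.Random(0)

# Hypothetical stand-in for embedded SELFIES tokens: 3 tokens, 4 dimensions each.
x0 = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

alpha_bars = alpha_bar_schedule(num_steps=1000)
x_early = add_noise(x0, alpha_bars[0], rng)   # almost identical to x0
x_late = add_noise(x0, alpha_bars[-1], rng)   # close to pure Gaussian noise
```

In a full model, a denoising network conditioned on the text description would be trained to reverse these steps, and the recovered embeddings would be mapped back to SELFIES tokens; that part is omitted here.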