Multilingual Molecular Representation Learning via Contrastive Pre-training

Zhihui Guo, Pramod Sharma, Andy Martinez, Liang Du, Robin Abraham


Abstract
Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have gained popularity as an alternative to traditional expert-designed features for encoding molecules. However, these approaches utilize only a single molecular language for representation learning. Motivated by the fact that a given molecule can be described in different languages, such as the Simplified Molecular-Input Line-Entry System (SMILES), International Union of Pure and Applied Chemistry (IUPAC) nomenclature, and the IUPAC International Chemical Identifier (InChI), we propose a multilingual molecular embedding generation approach called MM-Deacon (multilingual molecular domain embedding analysis via contrastive learning). MM-Deacon is pre-trained on large-scale molecule data using SMILES and IUPAC as two different languages. We evaluated the robustness of our method on seven molecular property prediction tasks from the MoleculeNet benchmark, zero-shot cross-lingual retrieval, and a drug-drug interaction prediction task.
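
The page does not spell out the training objective, but as a rough illustration of the kind of cross-lingual contrastive pre-training the abstract describes, below is a minimal PyTorch sketch of a symmetric InfoNCE loss over paired SMILES/IUPAC embeddings produced by two separate encoders. The function name, embedding dimension, batch size, and temperature are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def symmetric_infonce(smiles_emb, iupac_emb, temperature=0.07):
        """Symmetric InfoNCE (CLIP-style) contrastive loss for a batch of
        paired SMILES/IUPAC embeddings. Matched pairs are positives; all
        other in-batch pairings serve as negatives. Hyperparameters here
        are illustrative, not the authors' settings."""
        # L2-normalize so the dot product below is cosine similarity
        z_s = F.normalize(smiles_emb, dim=-1)
        z_i = F.normalize(iupac_emb, dim=-1)
        logits = z_s @ z_i.t() / temperature  # (B, B) similarity matrix
        # The i-th SMILES embedding should match the i-th IUPAC embedding
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_s2i = F.cross_entropy(logits, targets)      # SMILES -> IUPAC
        loss_i2s = F.cross_entropy(logits.t(), targets)  # IUPAC -> SMILES
        return 0.5 * (loss_s2i + loss_i2s)

    if __name__ == "__main__":
        # Toy usage: pooled embeddings from two hypothetical language
        # encoders for a batch of 8 paired molecules, 512-dim each.
        smiles_emb = torch.randn(8, 512)
        iupac_emb = torch.randn(8, 512)
        print(symmetric_infonce(smiles_emb, iupac_emb).item())

The symmetric form penalizes retrieval errors in both directions, which is what makes zero-shot cross-lingual retrieval (SMILES-to-IUPAC and vice versa) possible with the same embedding space.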
Anthology ID: 2022.acl-long.242
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 3441–3453
URL: https://aclanthology.org/2022.acl-long.242
DOI: 10.18653/v1/2022.acl-long.242
Cite (ACL): Zhihui Guo, Pramod Sharma, Andy Martinez, Liang Du, and Robin Abraham. 2022. Multilingual Molecular Representation Learning via Contrastive Pre-training. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3441–3453, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Multilingual Molecular Representation Learning via Contrastive Pre-training (Guo et al., ACL 2022)
PDF: https://preview.aclanthology.org/ingestion-script-update/2022.acl-long.242.pdf
Data: MoleculeNet