Transformer-based Approach for Predicting Chemical Compound Structures

Yutaro Omote, Kyoumoto Matsushita, Tomoya Iwakura, Akihiro Tamura, Takashi Ninomiya


Abstract
By predicting chemical compound structures from their names, we can better comprehend chemical compounds written in text and identify the same chemical compound given different notations for database creation. Previous methods have predicted chemical compound structures from their names and represented them as Simplified Molecular Input Line Entry System (SMILES) strings. However, these methods mainly apply handcrafted rules and cannot predict the structures of chemical compound names not covered by the rules. Instead of handcrafted rules, we propose Transformer-based models that predict SMILES strings from chemical compound names. We improve the conventional Transformer-based model by introducing two features: (1) a loss function that constrains the number of atoms of each element in the structure, and (2) a multi-task learning approach that predicts both SMILES strings and InChI strings (another string representation of chemical compound structures). In evaluation experiments, our methods achieved higher F-measures than previous rule-based approaches (Open Parser for Systematic IUPAC Nomenclature and two commercially used products) and the conventional Transformer-based model. We release the dataset used in this paper as a benchmark for future research.
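To make the atom-count constraint in (1) concrete, the following Python sketch counts the atoms of each element in a SMILES string and measures how far a predicted string is from a reference. It is only an illustration of the quantity being constrained, not the loss function used in the paper, which is not specified in this abstract. The names `count_atoms` and `atom_count_mismatch` are hypothetical, the tokenizer is a simplified regular expression rather than a full SMILES parser, and hydrogens are counted only when written as a bare `[H]` atom (implicit hydrogens and hydrogen counts inside bracket atoms such as `[CH3]` are ignored).

```python
import re
from collections import Counter

# Simplified SMILES atom tokenizer (an assumption, not a full parser):
#   group 1: symbol inside a bracket atom, e.g. [Na+], [13C@H], [nH]
#   group 2: organic-subset atoms written without brackets
#   group 3: aromatic atoms written in lowercase
_ATOM = re.compile(
    r"\[(?:\d+)?([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(Cl|Br|[BCNOPSFI])"
    r"|([bcnops])"
)


def count_atoms(smiles: str) -> Counter:
    """Count atoms per element; bonds, ring digits, and branches are skipped."""
    counts = Counter()
    for match in _ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2) or match.group(3)
        # Aromatic symbols are lowercase in SMILES; normalize "c" to "C".
        counts[symbol.capitalize()] += 1
    return counts


def atom_count_mismatch(predicted: str, reference: str) -> int:
    """Sum of absolute per-element count differences (hypothetical penalty)."""
    pred, ref = count_atoms(predicted), count_atoms(reference)
    return sum(abs(pred[e] - ref[e]) for e in set(pred) | set(ref))


if __name__ == "__main__":
    print(count_atoms("CC(=O)[O-].[Na+]"))    # sodium acetate
    print(atom_count_mismatch("CCO", "CCN"))  # one O extra, one N missing
```

A mismatch of zero does not imply that two structures are identical (for example, `CCO` and `COC` contain the same atoms), so a count like this can only complement, never replace, a comparison of the full predicted string.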
Anthology ID:
2020.aacl-main.19
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Editors:
Kam-Fai Wong, Kevin Knight, Hua Wu
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
154–162
URL:
https://aclanthology.org/2020.aacl-main.19
Cite (ACL):
Yutaro Omote, Kyoumoto Matsushita, Tomoya Iwakura, Akihiro Tamura, and Takashi Ninomiya. 2020. Transformer-based Approach for Predicting Chemical Compound Structures. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 154–162, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Transformer-based Approach for Predicting Chemical Compound Structures (Omote et al., AACL 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.aacl-main.19.pdf